Document: STXT (Semantic Text) — Documents

	Metadata:
		Author: Joan Costa Mombiela
		Last modif: 2026-01-03
		
	Header: @STXT@ Documents
	
	Subheader: 1. Introduction
	
	Content >>
		This document defines the specification of the **@STXT@ (Semantic Text)** language.
		
		@STXT@ is a **Human-First** language: its natural form is readable, clear, and comfortable
		for people, while its structure remains precise and easy for machines to process.
		
		@STXT@ is a hierarchical and semantic textual format aimed at:
		
		* Representing documents and data clearly.
		* Being extremely simple to read and write.
		* Being trivial to parse in any language.
		* Allowing both structured content and free text.
		* Extending its semantics via `@stxt.schema` or `@stxt.template`.
		* Facilitating the creation of parsers while minimizing the risk of security errors.
		
		This document describes the **base syntax** of the language.

	Subheader: 2. Terminology

	Content >>
		The key words **"MUST"**, **"MUST NOT"**, **"SHOULD"**, **"SHOULD NOT"**, and **"MAY"** are to be interpreted as described in **RFC 2119**.
		
	Subheader: 3. Document Encoding

	Content >>
		An @STXT@ document **SHOULD** be encoded in **UTF-8 without BOM**.
		
		A parser:
		
		* **SHOULD** accept documents that begin with a BOM.
		* **MAY** emit a warning for documents that begin with a BOM.
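	Content >>
		The BOM rule above can be sketched as follows. This is a minimal illustration, not a reference
		implementation: the function name `decode_stxt` is hypothetical, and using Python's `warnings`
		module to report the BOM is an assumption of this sketch.

	Code >>
		```python
		import warnings

		UTF8_BOM = b"\xef\xbb\xbf"

		def decode_stxt(raw: bytes) -> str:
		    """Decode raw bytes as UTF-8, accepting (and warning about) a leading BOM."""
		    if raw.startswith(UTF8_BOM):
		        # The warning is optional (MAY); accepting the document is recommended (SHOULD).
		        warnings.warn("STXT document begins with a BOM", stacklevel=2)
		        raw = raw[len(UTF8_BOM):]
		    return raw.decode("utf-8")
		```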
		
	Subheader: 4. Syntactic Unit: Node
		
	Content >>
		Each non-empty line of the document that is neither a comment nor part of a `>>` block defines a **node**.
		
		There are two forms of node:
		
		1. **Inline container node (INLINE text node)**: `Node name: Inline value`
		2. **Text block node (BLOCK text node)**: `Node name >>`
		
		The node name **cannot be empty**. A line with only `:` or `>>` is not valid.
		
		**Example with INLINE nodes:**
	
	Code >>
		***Node 1***: Inline value
			***Node 2 without value***:
			***Node 3 with another value***: this is the other value
		
	Content >>	
		**Example with a BLOCK node:**
	
	Code >>
		***Block node >>***
			This is the content
			of the text block:
			
			  - Leading spaces and line breaks are preserved
			  - Right trim is applied
			  - Left trim is NOT applied
		
	Content>>
		A node may optionally include a namespace:
		
	Code>>
		Name (namespace.normal):
		Name (@namespace.special):
		
	Subsubheader: 4.1 Node name normalization
		
	Content >>
		The node name is taken from the text between:
		
		- The first character not belonging to the indentation, and
		- The first occurrence of any of:
		  - The start of a namespace `(`,
		  - The character `:`,
		  - The operator `>>`.
		
		On that fragment, apply:
		
		- Removal of leading and trailing spaces and tabs (trim).
		- Compaction of consecutive spaces into a single space.
		
		The result of this normalization is the **node name**.
		
		A node whose logical name is the empty string (`""`) is invalid and **MUST** cause a parse error.
		
		Equivalent examples at the `Node name` level:

	Code>>
		Node name:
		Node name: value
		Node  name   : value
		Node  name (@a.special.namespace):
		Node name(a.normal.namespace):
		Node  name >>
		Node name>>
		
	Content>>
		The definition of a node must always include either `:` (INLINE container node) or `>>` (BLOCK text node),
		**always preceded by a non-empty name**.
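	Content >>
		The name extraction described in this section can be sketched as follows. The function name
		`extract_node_name` is hypothetical, and the sketch only compacts runs of spaces, as the text
		states; it does not check the character restrictions of section 4.2.

	Code >>
		```python
		import re

		def extract_node_name(line: str) -> str:
		    """Return the normalized node name of a node line (section 4.1)."""
		    body = line.lstrip(" \t")                    # drop the indentation
		    # The name ends at the first "(", ":" or ">>", whichever comes first.
		    match = re.search(r"\(|:|>>", body)
		    if match is None:
		        raise ValueError("node line has neither ':' nor '>>'")
		    raw = body[:match.start()]
		    name = re.sub(r" +", " ", raw.strip(" \t"))  # trim, then compact spaces
		    if not name:
		        raise ValueError("node name must not be empty")
		    return name
		```

	Content >>
		With this sketch, every line of the equivalence example above yields the name `Node name`,
		while a line containing only `:` or `>>` raises an error.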
		
	Subsubheader: 4.2 Node name restrictions
		
	Content >>
		A node name may contain only alphanumeric characters and the characters `-`, `_`, and space.
		Letters with diacritics, both uppercase and lowercase, are allowed.
		
	Subsubheader: 4.3 Canonical node name
		
	Content >>
		The canonical name is formed from the node name through the following process:
		
		* Unicode decomposition (NFKD).
		* Conversion to lowercase.
		* Removal of diacritics.
		* Space compaction (not necessary on an already normalized name).
		* Replacement of every character matching `[^a-z0-9]` with `-`. Runs of two or more consecutive hyphens are compacted into a single `-`.
		* Removal of hyphens (`-`) at the beginning and end, if present.
		
		The canonical name determines whether two nodes have the same name.
		All search and check operations also use it internally to decide whether two names refer to the same element.
		
		Transformation examples:
		
	Code >>
		A namé with äccent: a-name-with-accent
		A NAME with äccent: a-name-with-accent
		SIZe number 2__ and 3: size-number-2-and-3
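	Content >>
		A minimal sketch of the canonicalization process, using Python's standard `unicodedata` module.
		The function name `canonical_name` is hypothetical; "removal of diacritics" is interpreted here
		as dropping Unicode combining marks after NFKD decomposition, which is an assumption of this sketch.

	Code >>
		```python
		import re
		import unicodedata

		def canonical_name(name: str) -> str:
		    """Return the canonical form of an already normalized node name."""
		    decomposed = unicodedata.normalize("NFKD", name)
		    lowered = decomposed.lower()
		    # Drop combining marks (the accents split off by the NFKD step).
		    no_marks = "".join(ch for ch in lowered if not unicodedata.combining(ch))
		    # Each run of characters outside a-z0-9 becomes a single hyphen.
		    hyphenated = re.sub(r"[^a-z0-9]+", "-", no_marks)
		    return hyphenated.strip("-")
		```

	Content >>
		Two nodes then have the same name when `canonical_name` returns the same string for both.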

	Subsubheader: 4.4 Style rules
	Content >>
		The recommended style rules are as follows:
		
		* Separate the name from the definition of a namespace with a single space
		* Separate `:` from the value with a single space
		* `:` goes immediately after the name or the namespace if present
		* `>>` has no character after it
		* Separate the node name or the namespace with a space before `>>`
		* Do not use more than one space in names
		* Namespace without spaces in the definition `(namespace.def)`
		
		Examples of correct style:
		
	Code>>
		Name with value: The value
		Name without value:
		Name with namespace (the.namespace):
		Text node >>
		
	Subheader: 5. Nodes with `:` (container nodes, allow inline value)
		
	Content >>
		The form with `:` defines a node that:
		
		* May have a value (optional).
		* May have no value (empty node).
		* May have children (nested nodes).
		* Its structured content includes:
		  * The node line itself.
		  * Its descendants with greater indentation
		
		Examples:
		
	Code >>
		Title: Report
		Author: Joan
		Node:
		Node: Value
		Node:
		    SubNode 1: 123
		    Another subnode: 456

	Subsubheader: 5.1 Value normalization
	
	Content>>
		The inline value of a node **MUST** be trimmed on both sides (left and right).
		
		Example:
	Code>>
		Name: value 1
		Name:    value 1    
		# in both cases, the inline value of Name is "value 1", even though in the
		# second there are spaces before and after.		
	
	Content>>	
		Strong normalization applies only to structural identifiers.
		Values are literals, although a **simple normalization is applied: left and right trim**.

	Subheader: 6. Nodes with `>>` (text block)
		
	Content >>
		The form with `>>` defines a **literal text** block.
		
		Valid examples:

	Code >>		
		Description >>
		    Line 1
		    Line 2

	Code >>	
		Section>>
		    Accepts the operator without a space
		
	Subsubheader: 6.1 Formal rules
		
	Content >>
		* The `>>` node line **MUST NOT** contain meaningful content after `>>`, except optional spaces.
		* All lines with indentation **strictly greater** than that of the `>>` node belong to the **textual content of the block**.
		* Within the block:
		  * The parser **MUST NOT** interpret any line as a structured node, even if it contains `:` or other STXT syntax.
		  * The parser **MUST NOT** interpret lines that begin with `#` as comments; all lines are literal text.
		* The block ends when a **non-empty line that is not a comment** appears whose indentation is **less than or equal** to the indentation of the `>>` node.
		* Empty lines **within the block** are preserved and **MUST NOT** close the block, regardless of their indentation.
		
	Subsubheader: 6.2 Example
		
	Code >>
		Block >>
		    Text
		        Child: value YES allowed, it is text, it is not parsed
		        Another child: YES allowed
		    # This is also text
		Next Node: value

	Content >>		
		In this example:
		
		* Everything indented below `Block >>` is literal text.
		* `Child: value` and `Another child: YES allowed` are **not** nodes, but text.
		* `Next Node: value` is outside the `>>` block.
		
	Subheader: 7. Namespaces
		
	Content >>
		A namespace is optional and is specified like this:

	Code >>		
		Node ***(com.example.docs)***:
		Another node ***(another.namespace)***:
		More nodes ***(@a.special.name)***:
		
	Content >>
		Rules:
		
		* A namespace **MAY** start with `@`.
		* It **MUST** use hierarchical format (`a.b.c`), with at least 2 elements (`a.b`).
		* It is inherited by child nodes.
		* If no namespace is specified, by default it is the empty namespace `""`.
		* The empty namespace cannot be specified as `Node name ()`.
		* A node’s empty namespace is always overwritten with the parent’s namespace.
		* A child node may redefine its namespace by indicating `(another.namespace)`, in which case it uses that namespace instead of the inherited one.
		* Each element may contain only characters in the range `[a-z0-9]`. The namespace as a whole may be preceded by `@` to indicate that it is a special namespace.
		* A parser **MUST** lowercase a namespace. Thus it converts `Name (COM.DEMO.DOCS)` to `Name (com.demo.docs)` internally.
		* By style rules, a namespace should be written in lowercase.
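	Content >>
		The namespace rules above can be sketched as follows. The names `normalize_namespace` and
		`effective_namespace` are hypothetical. The sketch receives the text found between the
		parentheses and assumes that spaces inside them are not accepted.

	Code >>
		```python
		import re

		# Optional "@", then at least two dot-separated elements of [a-z0-9].
		_NAMESPACE_RE = re.compile(r"@?[a-z0-9]+(\.[a-z0-9]+)+")

		def normalize_namespace(raw: str) -> str:
		    """Lowercase a declared namespace and check its format."""
		    namespace = raw.lower()
		    if not _NAMESPACE_RE.fullmatch(namespace):
		        raise ValueError(f"invalid namespace: {raw!r}")
		    return namespace

		def effective_namespace(declared, inherited: str) -> str:
		    """Use the declared namespace if there is one, else the parent's."""
		    if declared is None:
		        return inherited
		    return normalize_namespace(declared)
		```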
		
	Subheader: 8. Indentation and Hierarchy
		
	Content >>
		Indentation defines the structured hierarchy of the document.
		
	Subsubheader: 8.1 Allowed Indentation
		
	Content >>
		An STXT document:
		
		* **MAY** use spaces or tabs for indentation.
		* **It is not recommended** to mix spaces and tabs on the same line. A parser **MAY** warn in that case.
		  If they are mixed, a run of fewer than 4 spaces is discarded when a tab follows it. Following the
		  **Human-First** principle, this rule ensures that a document that *looks* correct really is correct: it should not be
		  necessary to review lines character by character to know whether something is valid; it should be visible at a glance.
		* If it uses spaces:
		  * It **MUST** use **4 spaces** per level, so the indentation is always a multiple of 4.
		* If it uses tabs:
		  * Each tab represents exactly 1 level.
		* The possible ways of going up one level are:
		  * 0 spaces + 1 tab
		  * 1 space + 1 tab
		  * 2 spaces + 1 tab
		  * 3 spaces + 1 tab
		  * 4 spaces + 0 tabs
		* Once a level has been increased, the previous rules apply again. That is, the count is reset at each level increase.
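	Content >>
		One possible way to compute the level of a node line under these rules is sketched below.
		The function name `parse_indentation` and its `(level, offset)` return value are assumptions
		of this sketch. It is meant for node lines only: empty lines and lines inside a `>>` block
		need to be handled before calling it.

	Code >>
		```python
		def parse_indentation(line: str) -> tuple:
		    """Return (level, offset of the first non-indentation character)."""
		    level = 0
		    spaces = 0      # spaces seen since the last completed level
		    offset = 0
		    for ch in line:
		        if ch == " ":
		            spaces += 1
		            if spaces == 4:         # 4 spaces complete one level
		                level += 1
		                spaces = 0
		        elif ch == "\t":            # a tab completes one level and
		            level += 1              # discards up to 3 pending spaces
		            spaces = 0
		        else:
		            break
		        offset += 1
		    if spaces != 0:
		        raise ValueError("indentation does not reach a full level")
		    return level, offset
		```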
		  
	Subsubheader: 8.2 Special indentation examples
	
	Content >>
		In the following examples, `.` represents a space and `|-->` represents a tab.
		A tab is drawn only as wide as the columns remaining up to the next tab stop (every 4 columns), as a text editor would display it.

	Content >>
		Example with tabs:
		
	Code >>
		Node level 0: Value level 0
		|-->Node level 1:
		|-->Another node level 1:
		|-->|-->Level 2:
		|-->|-->Level 2:
		|-->Level 1:
		|-->Level 1:
	
	Content >>		
		Example with spaces:
		
	Code >>
		Node level 0: Value level 0
		....Node level 1:
		....Another node level 1:
		........Level 2:
		........Level 2:
		....Level 1:
		....Level 1:
	
	Content >>
		Example with a mix of spaces and tabs.
		
		Allowed, though not recommended by style.
		A parser **MAY** warn about mixing on the same line.
		This example has the same indentation as the two previous ones.
		
	Code >>
		Node level 0: Value level 0
		.|->Node level 1: ***Space + 1 TAB: level 1***
		..|>Another node level 1: ***2 Spaces + 1 TAB: level 1***
		...>..|>Level 2: ***3 Spaces + 1 TAB, 2 Spaces + 1 TAB: level 2***
		|-->....Level 2: ***1 TAB, 4 Spaces: level 2***
		..|>Level 1: ***2 Spaces + 1 TAB: level 1***
		.|->Level 1: ***1 Space + 1 TAB: level 1***
		
		
	Subsubheader: 8.3 Level errors
	
	Content >>
		A parser **MUST** raise a parse error in the following cases:
		
		* Non-consecutive levels:
		
	Code>>
		Level 0:
		....Level 1:
		............Level 3: ***ERROR, you cannot go from level 1 to level 3***
		
	Content >>
		* Not reaching a multiple of 4 when using spaces or mixing:
		
	Code >>
		Level 0:
		....Level 1:
		...Almost level 1: ***ERROR: 3 spaces (does not reach 4)***
		
		Level 0:
		....Level 1:
		.|->..Almost level 2: ***ERROR: 1 space + 1 TAB, 2 spaces***
		
		Level 0:
		....Level 1:
		..........More than level 2: ***ERROR: 4 spaces, 4 spaces, 2 spaces*** 
		
	Subsubheader: 8.4 Hierarchy
		
	Content >>
		* Indentation **MUST** increase consecutively (no jumps allowed).
		* Child nodes **MUST** have greater indentation than their parent.
		* Indentation within a `>>` block **does not affect the structural hierarchy**: it is simply text.
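	Content >>
		The "no jumps" rule can be checked on the sequence of node levels alone. The sketch below uses a
		hypothetical function `check_levels` that receives the levels of the node lines in document order
		(lines inside `>>` blocks, comments, and empty lines excluded). It assumes that the first node
		of a document is at level 0.

	Code >>
		```python
		def check_levels(levels) -> None:
		    """Raise ValueError if a node is more than one level deeper than the previous node."""
		    previous = -1                       # so that the first node must be at level 0
		    for line_no, level in enumerate(levels, start=1):
		        allowed = previous + 1          # a node may go at most one level deeper
		        if level > allowed:
		            raise ValueError(
		                f"node {line_no}: level {level} exceeds the maximum allowed level {allowed}"
		            )
		        previous = level
		```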
		
	Subheader: 9. Comments
		
	Content >>
		Outside `>>` blocks, a line is a comment if, after its indentation, the first character is `#`.
		
		Example:
	Code >>		
		# Root comment
		Node:
		    # Inner comment
		
	Subsubheader: 9.1 Comments inside `>>` blocks
		
	Content >>
		Inside a `>>` block:
		
		* Any line with indentation equal to or greater than the minimum indentation of the block
		  **MUST** be treated as literal text, even if it starts with `#`.
		* A line less indented than the block that is not a comment ends the block.
		
		Example:
	Code >>
		***# A normal comment (level 0)***
		Document:
			***# Another normal comment***
					***# This is also a comment! Outside >> block***
		    Text >>
		        # This is text
		        Normal line
		            # This is also normal text
		    ***# This is a comment***
		***# This is also a comment***
		    	Here the node text continues

	Subsubheader: 9.2 Comment style
	Content >>
		* It is recommended that a comment be placed at the same level as the node that follows it.
		  That is, a comment describes the next node.
		* Comments interleaved with a text block are not recommended, since they look
		  odd visually.

	Subheader: 10. Whitespace normalization
		
	Content >>
		This section defines how whitespace must be normalized to
		ensure that different implementations produce the same logical
		representation from the same STXT text.
		
	Subsubheader: 10.1 Inline values (`:`)
		
	Content >>
		When parsing a node with `:`:
		
		1. The parser takes all characters from immediately after `:` to the end of the line.
		2. The inline value **MUST** be normalized by applying:
		
		   * Removal of leading spaces and tabs (left trim).
		   * Removal of trailing spaces and tabs (right trim).
		
		This implies that the following lines are equivalent at the parsing level:
		
	Code >>
		Name: Joan
		Name:     Joan
		Name: Joan    
		Name:     Joan    

	Content >>		
		In all cases, the logical value of the `Name` node is `"Joan"`.
		
		If after `trim` the value is empty, the inline value is considered the empty string (`""`).
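	Content >>
		A minimal sketch of this rule follows. The function name `inline_value` is hypothetical. The sketch
		assumes that the value starts after the first `:` of the line, since neither a node name nor a
		namespace can contain `:`.

	Code >>
		```python
		def inline_value(node_line: str) -> str:
		    """Return the trimmed inline value of a ':' node line."""
		    _, separator, rest = node_line.partition(":")
		    if not separator:
		        raise ValueError("line is not a ':' node")
		    # Only spaces and tabs are trimmed; the rest of the value is literal.
		    return rest.strip(" \t")
		```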
		
	Subsubheader: 10.2 Lines inside `>>` blocks
		
	Content >>
		For each line that belongs to a `>>` block:
		
		1. The parser determines the content of the line from the text that follows the minimum indentation of the block
		   (i.e., it removes only the block indentation, but preserves any additional indentation as part of the text).
		2. On that content, the parser **MUST** remove all trailing spaces and tabs (right trim).
		3. Empty lines are preserved in all cases. Lines that are real comments (with indentation lower than the block) are not part of the block content.
		
		Example of line canonicalization:
		
	Code >>
		Block >>
		    Hello    
		        World        

	Content >>		
		Logical representation of the block content:
		
		* Line 1: `"Hello"`
		* Line 2: `"    World"`  (the 4 additional spaces after the minimum indentation are preserved, trailing spaces are removed)
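	Content >>
		The canonicalization of a block line can be sketched as follows. The function name `block_line_text`
		is hypothetical; `block_level` is the minimum indentation level of the block, that is, the level
		of the `>>` node plus one. Levels are counted as in section 8.1 (a tab or 4 spaces per level).

	Code >>
		```python
		def block_line_text(line: str, block_level: int) -> str:
		    """Remove block_level levels of indentation, then right-trim the line."""
		    level = spaces = offset = 0
		    while level < block_level and offset < len(line):
		        ch = line[offset]
		        if ch == " ":
		            spaces += 1
		            if spaces == 4:
		                level, spaces = level + 1, 0
		        elif ch == "\t":
		            level, spaces = level + 1, 0
		        else:
		            break
		        offset += 1
		    # Anything after the block indentation is text, including extra indentation.
		    return line[offset:].rstrip(" \t")
		```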
		
	Subsubheader: 10.3 Empty lines in `>>` blocks
		
	Content >>
		* Empty lines within the block (intermediate or final) **MUST** be preserved as empty lines (`""`) in the logical representation of the text.
		* Only right trim is applied to each individual line (remove spaces/tabs at end of line, as already done).
		* No empty line is removed, neither intermediate nor final.
		
		Example:
		
	Code >>
		Text >>
		    Line 1
		    
		    Line 2
		    
	Content >>		
		Logical content of the block:
		
		* Line 1: `"Line 1"`
		* Line 2: `""`
		* Line 3: `"Line 2"`
		* Line 4: `""`
		
	Subheader: 11. Error Rules
		
	Content >>
		A document is invalid if any of these conditions occur:
		
		1. Spaces that are not multiples of 4 (when spaces are used for indentation).
		2. Jumps in indentation levels.
		3. A `>>` node contains meaningful inline content on the same line as `>>`.
		4. A node contains neither `:` nor `>>`.
		
		A conforming parser **MUST** reject the document.
		
	Subheader: 12. Conformance
		
	Content >>
		An STXT implementation is conforming if:
		
		* It implements the syntax described in this document.
		* It applies the strict indentation and hierarchy rules.
		* It correctly interprets nodes with `:` and `>>` blocks.
		* It interprets comments outside `>>` blocks.
		* It treats **everything** inside `>>` blocks as literal text.
		* It applies the whitespace normalization rules of section 10.
		* It rejects invalid documents according to section 11.
		
	Subheader: 13. File Extension and Media Type
		
	Subsubheader: 13.1 File Extension
		
	Content >>
		STXT documents **SHOULD** use the extension: `.stxt`
		
	Subsubheader: 13.2 Media Type (MIME)
		
	Content >>
		* Official media type: `text/stxt`
		* Compatible alternative: `text/plain`

	Subheader: 14. Normative Examples
		
	Subsubheader: 14.1 Valid document
		
	Code >>
		Document (com.example.docs):
		    Author: Joan
		    Date: 2025/12/03
		    Summary >>
		        This is a text block.
		        With multiple lines.
		    Config:
		        Mode: Active
		
	Subsubheader: 14.2 Block with empty lines
		
	Code >>
		Text>>
		    
		    Line 2
	Content >>		
		Logical content of the block:
		
		1. `""`
		2. `"Line 2"`

	
	Subsubheader: 14.3 Comments inside and outside blocks
		
	Code >>
		Document:
		    Body >>
		        # This is text
		        More text
		    # This is a comment
		
	Subheader: 15. Security Considerations
	
	Content >>
		@STXT@ has been designed with parsing security as a fundamental priority,
		minimizing the attack surface compared to other structured textual formats.
		
		A conforming @STXT@ parser is inherently resistant to common classes of vulnerabilities:
		
		* **Immune to entity expansion attacks** (such as "billion laughs" or XXE):
		  the format defines no entities, external references, or inclusion of remote resources.
		* **Immune to arbitrary code execution**: there are no dynamic features, custom tags,
		  loaders, or object deserialization. The only resulting structure is a simple tree of nodes and textual values.
		* **Immune to injection inside literal blocks**: all content inside a `>>` node is treated
		  as literal text without any interpretation, even if it contains `:`, `>>`, `#`, or other STXT syntax.
		* **Low risk of denial of service**: strict consecutive indentation rules
		  and the absence of circular references or anchors limit structural complexity.
		  Implementations **SHOULD** impose a reasonable limit on nesting depth (recommended: ≤ 100 levels)
		  and total document size.
		* **Optional external schemas**: semantic validation is a separate layer.
		  A basic parser **MAY** operate without loading external schemas, eliminating risks associated with their resolution.
		
		Consequently, STXT is especially suitable for processing documents from untrusted sources
		(remote configurations, user input, data exchange) where parser security is critical.
		
		Implementations **MUST** reject invalid documents according to section 11
		and **MUST NOT** introduce extensions that allow external loading
		or dynamic evaluation without explicit security measures.
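	Content >>
		The recommended limits can be enforced with a check like the one sketched below. The names
		`check_limits`, `MAX_DEPTH`, and `MAX_BYTES` are hypothetical. The depth limit follows the
		recommendation above; the size limit is an arbitrary example value, since this document does not
		specify one.

	Code >>
		```python
		MAX_DEPTH = 100          # recommended nesting limit from the text above
		MAX_BYTES = 1_000_000    # example value only; not defined by this document

		def check_limits(raw: bytes, levels) -> None:
		    """Reject documents that are too large or too deeply nested."""
		    if len(raw) > MAX_BYTES:
		        raise ValueError("document exceeds the size limit")
		    for level in levels:
		        if level > MAX_DEPTH:
		            raise ValueError("document exceeds the nesting depth limit")
		```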
		
	Subheader: 16. Appendix A — Grammar (Informal)
		
	Code >>
		Document        = { Line }
		
		Line            = [Indentation] ( Comment | Node | BlockContinuation | EmptyLine )
		
		Node            = Name [Namespace] ( Inline | BlockStart )
		Inline          = ":" [Space] [InlineText]
		BlockStart      = [Space] ">>" [TrailingSpaces]
		
		Namespace       = "(" ["@"] Ident "." Ident { "." Ident } ")"   ; at least 2 elements (section 7)
		Ident           = [a-z0-9]+   ; lowercase and numbers only according to style and normalization rules
		
		Comment         = "#" { any character until end of line }   ; only outside ">>" blocks
		
		BlockContinuation = IndentationGreaterThanPreviousBlock { any text }   ; literal text
		
		Indentation     = Allowed mix of spaces and tabs according to section 8
		                  - Pure spaces: exact multiples of 4 per level
		                  - Pure tabs: 1 tab = 1 level
		                  - Mixed in line: tab wins, spaces <4 are ignored
		
		Name            = Normalized text (trim + space compaction) according to section 4.1
		
	Content >>	
		**Key notes for implementers:**
		
		* The parser must process the document line by line, maintaining state of:
		  - Current indentation level of the parent node.
		  - Base indentation and active `>>` block state (if any).
		  - Current inherited namespace.
		
		* **Basic parsing flow:**
		  1. Read line and compute its effective indentation (according to section 8 rules).
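		  2. If there is an active `>>` block:
		     - Empty line → add as an empty line of text; it never closes the block.
		     - If indentation ≥ minimum indentation of the block → add line as literal text (right trim).
		     - If indentation ≤ indentation of the `>>` node and the line is a comment → ignore it; the block stays open.
		     - If indentation ≤ indentation of the `>>` node and the line is non-empty and not a comment → close block and process as a new node.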
		  3. If there is no active block:
		     - Empty line → ignore (does not affect hierarchy).
		     - Starts with `#` → comment.
		     - Otherwise → new node (normalize name, detect namespace, type : or >>).
		
		* **Namespace inheritance:**
		  - The root node namespace is empty by default.
		  - Each node inherits its parent’s namespace.
		  - If a node defines its own namespace in `()`, it replaces the inherited one for it and all its descendants.
		
		* **Additional normalization:**
		  - Node names: according to sections 4.1–4.3.
		  - Namespaces: internally converted to lowercase (section 7).
		  - Inline values: left and right trim (section 10.1).
		  - Block lines: preserve relative indentation + right trim + preserve all empty lines (sections 10.2–10.3).
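	Content >>
		The flow above can be sketched as a small line-by-line parser. This is an illustrative sketch under
		stated assumptions, not a reference implementation: the names `Node`, `consume_indent`, and `parse`
		are hypothetical, errors are reported as `ValueError`, and the checks on name characters
		(section 4.2) and namespace format (section 7) are omitted for brevity.

	Code >>
		```python
		import re

		# name, optional "(namespace)", then ":" or ">>", then the rest of the line
		NODE_RE = re.compile(r"([^(:>]*)(?:\(([^)]*)\))?[ \t]*(:|>>)(.*)")

		class Node:
		    def __init__(self, name, namespace, value=None, is_block=False):
		        self.name = name
		        self.namespace = namespace
		        self.value = value          # inline value, None for >> nodes
		        self.is_block = is_block
		        self.lines = []             # literal text lines of a >> node
		        self.children = []

		def consume_indent(line, max_levels=None):
		    """Return (level, offset, leftover_spaces) for the leading indentation.

		    A tab or 4 spaces is one level; fewer than 4 spaces before a tab are dropped.
		    With max_levels, stop as soon as that many levels have been consumed.
		    """
		    level = spaces = offset = 0
		    for ch in line:
		        if max_levels is not None and level == max_levels:
		            break
		        if ch == " ":
		            spaces += 1
		            if spaces == 4:
		                level, spaces = level + 1, 0
		        elif ch == "\t":
		            level, spaces = level + 1, 0
		        else:
		            break
		        offset += 1
		    return level, offset, spaces

		def parse(text):
		    roots, stack, block = [], [], None      # stack[i]: open node at level i
		    for number, line in enumerate(text.splitlines(), start=1):
		        stripped = line.strip(" \t")
		        level, offset, leftover = consume_indent(line)
		        if block is not None:
		            block_level = len(stack) - 1    # level of the >> node itself
		            if stripped == "":
		                block.lines.append("")      # empty lines never close a block
		                continue
		            if level > block_level:
		                _, start, _ = consume_indent(line, block_level + 1)
		                block.lines.append(line[start:].rstrip(" \t"))
		                continue
		            if stripped.startswith("#"):
		                continue                    # shallower comment: block stays open
		            block = None                    # shallower node line closes the block
		        if stripped == "" or stripped.startswith("#"):
		            continue
		        if leftover:
		            raise ValueError(f"line {number}: indentation does not reach a full level")
		        if level > len(stack):
		            raise ValueError(f"line {number}: indentation level jump")
		        match = NODE_RE.fullmatch(line[offset:])
		        if match is None:
		            raise ValueError(f"line {number}: node has neither ':' nor '>>'")
		        raw_name, raw_ns, operator, rest = match.groups()
		        name = re.sub(r" +", " ", raw_name.strip(" \t"))
		        if not name:
		            raise ValueError(f"line {number}: empty node name")
		        parent = stack[level - 1] if level else None
		        if raw_ns is not None:
		            namespace = raw_ns.lower()      # declared namespace replaces the inherited one
		        else:
		            namespace = parent.namespace if parent else ""
		        if operator == ">>":
		            if rest.strip(" \t"):
		                raise ValueError(f"line {number}: content after '>>'")
		            node = Node(name, namespace, is_block=True)
		        else:
		            node = Node(name, namespace, value=rest.strip(" \t"))
		        (parent.children if parent else roots).append(node)
		        del stack[level:]                   # close nodes at this level and deeper
		        stack.append(node)
		        block = node if node.is_block else None
		    return roots
		```

	Content >>
		Calling `parse` on the valid document of section 14.1 returns a single root node `Document` with
		four children, where `Summary` holds two text lines and every node inherits the namespace
		`com.example.docs`.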

	Subheader: 17. Appendix B — Interaction with `@stxt.schema`
		
	Content >>
		The schema system allows adding semantic validation to STXT documents **without modifying the base syntax** of the language.
		
		The STXT core does not define how an implementation should react: the behavior belongs exclusively to the schema system (*STXT-SCHEMA-SPEC*).
		
		A schema is an STXT document whose namespace is: `@stxt.schema`
		
		and whose goal is to define the structural rules, value types, and cardinalities of nodes belonging to a specific namespace.
		
		The STXT core **does not interpret** these rules; it only defines how they are expressed and how they are combined via namespaces.
		
	Subsubheader: 17.1. Associating a schema with a namespace
		
	Content >>
		To associate a schema with the namespace `com.example.docs`, write a document:
		
	Code>>
		Schema (@stxt.schema): com.example.docs
			Node: Email
				Children:
					Child: From
					Child: To
					Child: Cc
					Child: Bcc
					Child: Title
						Max: 1
					Child: Body Content
						Min: 1
						Max: 1
					Child: Metadata (com.google)
						Max: 1
			Node: From
			Node: To
			Node: Cc
			Node: Bcc
			Node: Title
			Node: Body Content
				Type: TEXT
		
	Subsubheader: 17.2. Application to STXT documents
		
	Content >>
		A document that declares the same namespace:
		
	Code >>
		Document (com.example.docs):
		    Field1: value
		    Text: one
		    Text: two
	
	Content >>
		
		can be validated by an implementation that supports STXT schemas:
		
		* Validating the presence of nodes according to `Node` in the schema.
		* Validating value types (`TEXT`, `DATE`, `NUMBER`, etc.).
		* Validating cardinalities defined in `Child`.
		
	Subsubheader: 17.3. Core independence
		
	Content >>
		STXT **MUST NOT** impose semantic rules coming from schemas.
		The schema system is a separate and optional component that operates **on** the already-parsed STXT.
		
		It also **MAY** act as part of the parsing process. In that case it **SHOULD** be weakly coupled with it.
		This would allow detecting errors without having to wait until the end of parsing.
		
	Subheader: 18. Appendix C — Interaction with `@stxt.template`

	Content>>	
		The template system allows adding semantic validation to STXT documents **without modifying the base syntax** of the language.
		
		The STXT core does not define how an implementation should react: the behavior belongs exclusively to the template system (*STXT-TEMPLATE-SPEC*).
		
		A template is an STXT document whose namespace is: `@stxt.template`
		
		and whose goal is to define the structural rules, value types, and cardinalities of nodes belonging to a specific namespace.
		
		The Template system is analogous to schemas, but with a simplified syntax, oriented toward rapid prototypes.
		Even so, it is a perfectly valid system for all kinds of documents. It could be considered *syntactic sugar*, since internally it can use
		the same representation as a schema.
		
		The template system **MAY** coexist alongside a schema system, since in the end a Template defines the same information as a schema.
	
	Subsubheader: 18.1. Associating a template with a namespace
		
	Content >>
		To associate a template with the namespace `com.example.docs`, write a document:
		
	Code>>
		Template (@stxt.template): com.example.docs
			Structure >>
				Email:
					From:
					To:
					Cc:
					Bcc:
					Title: (?)
					Body Content: (1) TEXT
					Metadata (com.google): (?)
					
	Content>>
		Once declared, templates fulfill the same function as schemas.
		A standard validator **SHOULD** prioritize a schema over a template.  
	
	Subheader: 19. End of Document