STxT: Semantic TexT
One Language, Infinite Possibilities

STxT Parsing

Generic Process

Line-by-line Parsing

Parsing can be done line by line, so the overall process has this general shape:

while not end of file
	read line
	process line
end while

During parsing, it is useful to keep a list of the last node found at each level, because the correct processing of each new line depends on it.
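The loop and the per-level list described above can be sketched as follows. This is only an illustration under stated assumptions: the level of a line is taken to be its number of leading tabs, nodes are plain dictionaries, and the names `level_of` and `parse_lines` are hypothetical. Normalization, comments, and multitext handling are left out.

```python
def level_of(line: str) -> int:
    """Assumed level rule for this sketch: one level per leading tab."""
    return len(line) - len(line.lstrip("\t"))


def parse_lines(lines):
    """Build a tree of dict nodes, tracking the last node seen at each level."""
    root = {"text": None, "children": []}
    # Index 0 holds a synthetic root; index n + 1 holds the last node at level n.
    last_at_level = [root]
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank lines are skipped in this simplified sketch
        level = level_of(line)
        if level + 1 > len(last_at_level):
            raise ValueError(f"line skips a level: {line!r}")
        node = {"text": line.strip(), "children": []}
        # Nodes at this level or deeper are no longer the "last" ones.
        del last_at_level[level + 1:]
        # After truncation, the final entry is the parent of the new node.
        last_at_level[level]["children"].append(node)
        last_at_level.append(node)
    return root
```

Because only the list of last nodes is kept as state, each line is handled in time proportional to its own length, which is what makes the line-by-line approach efficient.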

Line Processing

The first step in processing a line is to normalize it. A line is normalized when it is in compact (or semi-compact) form, so the parser must check whether the line is already in that form and transform it if it is not. Normalization also removes comment lines.

Normalization must take multitext nodes into account. If the previous node was a multitext node, a line that exceeds a certain level is part of that same node, that is, it is a continuation of its text. A line that does not reach that level is also part of the node if it is completely blank, in which case it is treated as text with a line break.
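A minimal sketch of this decision, assuming that levels are counted in leading tabs and that the "certain level" is the level of the multitext node itself, so anything deeper is text. The function name `continuation_text` is hypothetical.

```python
def continuation_text(line: str, multitext_level: int):
    """Return the text this line adds to an open multitext node,
    or None if the line starts a new node instead.

    Assumptions of this sketch: level = number of leading tabs, and a
    line belongs to the node when its level is greater than the node's.
    """
    if line.strip() == "":
        # A completely blank line always belongs to the node, whatever
        # its level; it contributes an empty line, i.e. a line break.
        return ""
    level = len(line) - len(line.lstrip("\t"))
    if level > multitext_level:
        # Drop only the indentation that places the line inside the node,
        # so deeper indentation survives as part of the text.
        return line[multitext_level + 1:]
    return None
```

The caller would append each non-None result to the node's text and join the pieces with line breaks; a None result means the multitext node has ended and the line goes through normal processing.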

Once we have compacted the line, processing continues independently, and all that remains is to obtain the level of the new line and distinguish between a few cases:

In each case, the goal is to update the state of our variables and continue with the process.

Note: The key point is that parsing can be done line by line, and the decisions to be made at each line are relatively simple. This makes a very efficient parser possible, which can also act as a validator of the grammar and the nodes.

Node validations with namespace

Validations are done at several points during parsing:

When do we consider a node closed? Two circumstances close a node. One is when another node appears at a level equal to or lower than this node's level. The other is when the entire file has been processed and no more lines remain. At these points, the node is considered closed and its validations can begin.
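The two closing rules can be illustrated with a stack of open nodes. In this sketch the input is just the sequence of node levels in file order, and the generator name `close_events` is hypothetical; a real parser would run its validations at the point where each node is yielded.

```python
def close_events(levels):
    """Yield (index, level) pairs in the order the nodes are closed.

    `levels` is the level of each node in file order.
    """
    open_nodes = []  # stack of (index, level), deepest node last
    for index, level in enumerate(levels):
        # Rule 1: a new node closes every open node whose level is
        # equal to or greater than its own.
        while open_nodes and open_nodes[-1][1] >= level:
            yield open_nodes.pop()
        open_nodes.append((index, level))
    # Rule 2: at end of file, everything still open is closed.
    while open_nodes:
        yield open_nodes.pop()
```

For the level sequence 0, 1, 1, 0 the nodes close in the order second, third, first, fourth: children are always closed, and so can be validated, before their parent.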

Language nodes

As stated in the language description, data types have no limits and are not tied to any programming language, so values should be validated only through regular expressions or other methods that preserve this property.

We have the following types of nodes:

The regular expressions we could use to validate nodes are:

BINARY       = ^(0|1|\s)+$
BOOLEAN      = ^(0|1)$
HEXADECIMAL  = ^([a-f0-9]|\s)+$
INTEGER      = ^(\-|\+)?\d+$
NATURAL      = ^\d+$
NUMBER       = ^(\-|\+)?\d+\.\d+(e(\-|\+)?\d+)?$
RATIONAL     = ^(\-|\+)?\d+\/\d+$
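As an illustration, the expressions above could be applied as in the following sketch. The table name `PATTERNS` and the function `is_valid` are hypothetical. BOOLEAN is written with the alternation grouped, `^(0|1)$`, so that both anchors apply to both alternatives. The `re.ASCII` flag is a choice made here so that `\d` and `\s` match only ASCII characters.

```python
import re

# Patterns from the list above, one per node type.
PATTERNS = {
    "BINARY": r"^(0|1|\s)+$",
    "BOOLEAN": r"^(0|1)$",
    "HEXADECIMAL": r"^([a-f0-9]|\s)+$",
    "INTEGER": r"^(\-|\+)?\d+$",
    "NATURAL": r"^\d+$",
    "NUMBER": r"^(\-|\+)?\d+\.\d+(e(\-|\+)?\d+)?$",
    "RATIONAL": r"^(\-|\+)?\d+\/\d+$",
}

COMPILED = {name: re.compile(p, re.ASCII) for name, p in PATTERNS.items()}


def is_valid(node_type: str, value: str) -> bool:
    """Check a node value against the pattern for its type."""
    # fullmatch requires the whole value to match, so a trailing
    # newline is not silently accepted by `$`.
    return COMPILED[node_type].fullmatch(value) is not None
```

Since the checks operate on the text of the value, no number is ever converted to a fixed-size machine type, which keeps the validation independent of any language's numeric limits. Note that NUMBER, as written, requires a fractional part, so a value such as 3 is an INTEGER but not a NUMBER.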

Namespaces

Storage

Namespaces must be obtained independently. One strategy is to have a namespace repository on disk, and always look them up there.
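A disk repository of this kind might look like the sketch below. Everything about the layout is an assumption made for illustration: one file per namespace, named after the namespace with a `.stxt` extension, inside a single directory. The class `NamespaceRepository` and its methods are hypothetical.

```python
from pathlib import Path


class NamespaceRepository:
    """Looks up namespace definitions in a directory on disk."""

    def __init__(self, root):
        self.root = Path(root)

    def path_for(self, namespace: str) -> Path:
        # Assumed layout: <root>/<namespace>.stxt
        return self.root / f"{namespace}.stxt"

    def load(self, namespace: str) -> str:
        """Return the raw text of a namespace definition."""
        path = self.path_for(namespace)
        if not path.is_file():
            raise KeyError(f"unknown namespace: {namespace}")
        # Read from disk on every call, so the repository on disk
        # remains the single source of truth.
        return path.read_text(encoding="utf-8")
```

A parser would call `load` the first time a line refers to a namespace and then parse the returned text into the node definitions it validates against. A real implementation would also need to sanitize namespace names before turning them into file paths.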

Details to keep in mind

There are some details to keep in mind during parsing: