STxT: Semantic TexT
One Language, Infinite Possibilities

STxT Parsing

Parsing an STxT file is much easier than parsing files from other technologies. It may seem paradoxical, since it is actually a very powerful language, but at the same time it is based on very simple principles.

I will explain my way of parsing a file. It may not be the best, nor the most optimal, but it is one way to do it. In fact, if you want to see the implementation I have made, it is available on the Internet:

This implementation has been done in Java, since it is the language I know best.

I hope STxT is successful, and that other implementations will appear very soon.

I am not going to go into all the details, but I would like to explain some points that require more attention.

If you do not intend to implement a parser, do not continue reading. The next chapter is much more interesting ;-)

Generic process

Line-by-line parsing

The parsing process can be done line by line, so we can generally say that we have:

while not end of file
	read line
	process line
end while

During the process it is appropriate to have a list of the last nodes we have found according to the level, as the correct processing depends on this.

Line processing

The first step in processing the line is the normalization of the line. A line is normalized when it is in compact (or semi-compact) form, so you have to check if it is, and if not, transform it. Normalization also removes comment lines.

You have to keep in mind during normalization that if the previous node was a text node, when exceeding a certain level it will be part of that same node. That is, it will be text that follows. It will also be part of it if it does not reach the level but the line is completely blank, in which case it will be translated as text with a line break ([[chapter_06.html#index_9|See advanced tutorial]]).

Once we have compacted the line, the processing continues independently, and all that remains is to obtain the level of the new line and distinguish between a few cases:

In each case, the idea is to update the state of our variables and continue with the process.

Note: The most important thing here is to see that it is a process that can be done line by line, and the decisions to be made are relatively simple. This allows us to have a very efficient parser, which in turn can act as a validator of the grammar and nodes.

Validations

Validations are done at several points during parsing:

When do we consider a node closed? This point is interesting, since there are two circumstances that cause a node to be considered closed. One of them is when another node appears with a level equal to or lower than this node. The other is when the entire file has been processed and there are no more nodes to validate. At these points, the node is considered closed and validations can begin.

Language nodes

In the language description we had said that data types have no limitation nor are they tied to a language, so validations should only be checked by regular expressions or methods that ensure this fact.

We have the following types of nodes:

For example, the regular expressions we could use to validate nodes are:

BINARY       = ^(0|1|\s)+$
BOOLEAN      = ^0|1$
HEXADECIMAL  = ^([a-f0-9]|\s)+$
INTEGER      = ^(\-|\+)?\d+$
NATURAL      = ^\d+$
NUMBER       = ^(\-|\+)?\d+\.\d+(e(\-|\+)?\d+)?$
RATIONAL     = ^(\-|\+)?\d+\/\d+$

Grammars

Storage

Grammars are obtained from the Internet, but it is neither practical nor efficient to always fetch the definitions remotely. The most efficient strategy is to have a kind of grammar repository, on disk, and always look for them there. If not found, they would be searched for on the Internet, and that repository would be updated. It is also possible to set check times or other strategies. The idea is that grammars do not change over time, or at least are retroactively compatible.

Initial grammar

Keep in mind that it is not possible to make a parser without having the grammar beforehand. To parse a grammar, you must have the base grammar definition already parsed. For this reason, there will be a definition of the initial grammar in the code itself.

Details to keep in mind

There are some details to keep in mind during parsing: