STxT Parsing
Generic Process
Line-by-line Parsing
The parsing process can be done line by line, so we can generally say that we have:
while not end of file read line process line end while
During the process, it is appropriate to have a list of the last nodes we have found according to the level, as the correct processing depends on this.
Line Processing
The first step in processing the line is the normalization of the line. A line is normalized when it is in compact (or semi-compact) form, so it must be checked if it is, and if not, transformed. Normalization also removes comment lines.
It should be taken into account during normalization that if the previous node was a multitext node, when exceeding a certain level it will be part of that same node. That is, it will be text that follows. It will also be part of it if it does not reach the level but the line is completely blank, in which case it will be translated as text with a line break.
Once we have compacted the line, processing continues independently, and all that remains is to obtain the level of the new line and distinguish between a few cases:
- Are we at the first node?
- Is the line text from a previous text node?
- Does a new node start?
In each case, the goal is to update the state of our variables and continue with the process.
Note: The most important thing here is to see that this is a process that can be done line by line, and the decisions to be made are relatively simple. This allows us to have a very efficient parser, which in turn can act as a validator of the grammar and nodes.
Node validations with namespace
Validations are done at several points during parsing:
- When creating a new node: When creating a new node, it is validated that its namespace can be deduced. Otherwise, it means it could not be created in that position and would be incorrect.
- When closing a node: When a node is considered closed, it is validated. ** If it is of type NODE, it is validated that the number of children is correct. ** If it is not NODE, it is validated that it has the appropriate content depending on its type.
When do we consider a node closed? There are two circumstances that cause a node to be considered closed. One of them is when another node appears with a level equal to or lower than this node. The other is when the entire file has been processed and there are no more nodes to validate. At these points, the node is considered closed and validations can begin.
Language nodes
In the language description, we said that data types have no limitation nor are they tied to a language, so validations should only be checked through regular expressions or methods that ensure this fact.
We have the following types of nodes:
- NODE
- TEXT
- URL
- NATURAL
- INTEGER
- RATIONAL
- NUMBER
- BINARY
- HEXADECIMAL
- BASE64
- BOOLEAN
The regular expressions we could use to validate nodes are:
BINARY = ^(0|1|\s)+$ BOOLEAN = ^0|1$ HEXADECIMAL = ^([a-f0-9]|\s)+$ INTEGER = ^(\-|\+)?\d+$ NATURAL = ^\d+$ NUMBER = ^(\-|\+)?\d+\.\d+(e(\-|\+)?\d+)?$ RATIONAL = ^(\-|\+)?\d+\/\d+$
Namespaces
Storage
Namespaces must be obtained independently. One strategy is to have a namespace repository on disk, and always look them up there.
Details to keep in mind
There are some details to keep in mind during parsing:
- Case-insensitive: All nodes are considered CASE-INSENSITIVE, so the appropriate transformations must be made during the parsing process.
- Base64: With BASE64 text, line breaks must be allowed, and a standard parsing of the content thus obtained must be performed.
- For line reading, both UNIX and DOS formats must be taken into account. Therefore, both line break and line break + carriage return will be allowed. This is done to allow quick file edits from any environment, although the most appropriate would always be to use the UNIX standard (line break character only).