Introduction to Parsing
Suppose we have the following expression:
10 + 5
This expression is called a binary expression — it consists of two
operands, 10
and 2
. Viewing it as a tree:
More generally, binary expressions have the form:
We can encapsulate this structure in code with some struct
:
struct expression {
type: 'binaryExpression',
operator: '+',
right: 5,
left: 10
}
Now suppose we have this expression:
10 + (3 ⨉ 1)
This, too, is just a binary expression:
So, how do we represent this structure above in terms of code? In other words, how do we take a string consisting of the expressions we've seen, and structure it in such a way that it yields the visualization above? The answer is through a lexer.
Tokens
The expressions we've seen consist of characters. For example, for the expression we have the characters:
Above, the red squares indicate characters that have inherent meaning (according to some language). The whitespace characters — the white boxes — have no meaning in our language. When we give the red-boxed characters meaning, they become tokens.
Stripping all of the whitespaces, the string we're working with is:
10+(3⨉1)
We'll start by creating a class called Lexer
.1
class Lexer:
private:
stream = "";
cursor = 0;
Footnotes
-
The examples used in this volume employ pseudocode to keep the discussion general. The pseudocode, however, was constructed with JavaScript in mind. ↩