Introduction to Parsing

Suppose we have the following expression:

10 + 5

This expression is called a binary expression — it consists of two operands, 10 and 2. Viewing it as a tree:

More generally, binary expressions have the form:

We can encapsulate this structure in code with some struct:

struct expression {
	type: 'binaryExpression',
	operator: '+',
	right: 5,
	left: 10
}

Now suppose we have this expression:

10 + (31)

This, too, is just a binary expression:

So, how do we represent this structure above in terms of code? In other words, how do we take a string consisting of the expressions we've seen, and structure it in such a way that it yields the visualization above? The answer is through a lexer.

Tokens

The expressions we've seen consist of characters. For example, for the expression 10+(3×1),{10 + (3 \times 1),} we have the characters:

Above, the red squares indicate characters that have inherent meaning (according to some language). The whitespace characters — the white boxes — have no meaning in our language. When we give the red-boxed characters meaning, they become tokens.

Stripping all of the whitespaces, the string we're working with is:

10+(31)

We'll start by creating a class called Lexer.1

class Lexer:
	private:
		stream = "";
		cursor = 0;

Footnotes

  1. The examples used in this volume employ pseudocode to keep the discussion general. The pseudocode, however, was constructed with JavaScript in mind.