Compilers

"Trying to outsmart a compiler defeats much of the purpose of using one."

Brian Kernighan & P.J. Plauger, The Elements of Programming Style 11 (1978).

This chapter covers notes on compilers and language design. The notes below assume knowledge from the Virtual Machines chapter.

Basic Concepts

This section provides an overview of some basic concepts. The subsections are kept brief, as later sections will build on the concepts in further detail.

What is a compiler?

Simply put, a compiler is a program that translates text in one language (called the source language) to another (called the target language). It might perform numerous other tasks alongside translation — optimization, enforcing security, ensuring compliance with the target language's rules — but its principal task is translation.

For most compilers, the source language is some high-level language; e.g., Java, C++, C, Python, Swift, etc. The target language is some language a little "closer to the metal" — Java bytecode or some other virtual machine language, varieties of Assembly, or for some languages, machine code. This is a key point to keep in mind. Generally, compilers fall into two categories:

  1. Two-tiered compilers are compilers that take source code inputs and return virtual machine code outputs. The virtual machine code outputs are then translated by the relevant virtual machine into machine code.
  2. Single-tiered compilers are compilers that take source code inputs and translate them straight to either the system's Assembly language or the system's machine code.

Anatomy of a Compiler

Most compilers — two- or single-tiered — have the following anatomy:

The compiler's components

From the diagram, we see that a compiler consists of three main components:

  1. A tokenizer,
  2. A parser, and
  3. A code generator.

The first two components are encapsulated in a larger module called the syntax analyzer. We begin by considering the tokenizer.

The Tokenizer

Lexical Analysis

Suppose we had the following code snippet:

if (x < 0) {
	// determines the sign
	return "negative input";
}

The snippet above contains various characters: i, f, (, x, <, and so on. To create meaning, a language can assign a particular character a meaning (e.g., the symbol < has the meaning "less than"), or assign a group of characters a meaning (e.g., the grouping if marks the introduction of a condition). In the world of programming languages, these basic units — a character or group of characters — are called tokens.

Informal Definition. A token is a string of characters that has a meaning.

The language can further classify these tokens into token types. For example, a simple language might have the following token types:

  1. keywords,
  2. symbols,
  3. integers,
  4. strings,
  5. identifiers

For example, tokenizing the snippet above might produce outputs that look like any of the following:

// in some implementation A
<keyword> if </keyword>
<symbol> { </symbol>
<identifier> x </identifier>
// in some implementation B
{ token: "if", type: "keyword" }
{ token: "{", type: "symbol" }
{ token: "x", type: "identifier" }
// in some implementation C
["if", "keyword"]
["{", "symbol"]
["x", "identifier"]

Ultimately, how the token is represented is entirely up to the designer. In the first approach, an XML-like output is used; in the second, a struct-like output; and in the third, a nested array.
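
To make the second, struct-like representation concrete, here's a minimal sketch in TypeScript. The names TokenType and Token are illustrative, not drawn from any particular compiler:

// The token types our simple language recognizes.
type TokenType = "keyword" | "symbol" | "integer" | "string" | "identifier";

// A token pairs the matched text with its classification.
interface Token {
	token: string;
	type: TokenType;
}

// The first few tokens of the snippet above.
const tokens: Token[] = [
	{ token: "if", type: "keyword" },
	{ token: "(", type: "symbol" },
	{ token: "x", type: "identifier" },
];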

Broadly, a tokenizer is a program that performs the following:

  1. abstracts the compiler input as a token stream,
  2. ensures the token stream continues flowing so long as there are inputs, and
  3. supplies the current token's value and type.

To put this in object-oriented pseudocode, the tokenizer might look like the following:

const tokenizer = new Tokenizer("program.file");
while (tokenizer.hasMoreTokens()) {
	tokenizer.advance();
	// handle the current token's value and type
}
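
To go one step further, here's a fuller sketch of the Tokenizer in TypeScript. It is illustrative only: it covers the keywords, symbols, integers, and identifiers from the token types listed earlier, and omits string constants and comments for brevity:

type TokenType = "keyword" | "symbol" | "integer" | "string" | "identifier";

class Tokenizer {
	private pos = 0;
	private token = "";
	private type: TokenType = "identifier";
	private static KEYWORDS = new Set(["if", "while", "let", "return"]);
	private static SYMBOLS = new Set("(){}[]<>=+-*/;,.");

	constructor(private source: string) {}

	// True if any non-whitespace input remains; also skips past whitespace.
	hasMoreTokens(): boolean {
		while (this.pos < this.source.length && /\s/.test(this.source[this.pos])) {
			this.pos++;
		}
		return this.pos < this.source.length;
	}

	// Reads the next token; assumes hasMoreTokens() was just checked.
	advance(): void {
		const c = this.source[this.pos];
		if (Tokenizer.SYMBOLS.has(c)) {
			this.token = c;
			this.type = "symbol";
			this.pos++;
		} else if (/\d/.test(c)) {
			// A run of digits is an integer constant.
			this.token = this.readWhile(/\d/);
			this.type = "integer";
		} else {
			// Otherwise read a word and classify it as keyword or identifier.
			const word = this.readWhile(/\w/);
			if (word === "") throw new Error(`Unexpected character '${c}'`);
			this.token = word;
			this.type = Tokenizer.KEYWORDS.has(word) ? "keyword" : "identifier";
		}
	}

	tokenValue(): string {
		return this.token;
	}

	tokenType(): TokenType {
		return this.type;
	}

	private readWhile(pattern: RegExp): string {
		let out = "";
		while (this.pos < this.source.length && pattern.test(this.source[this.pos])) {
			out += this.source[this.pos++];
		}
		return out;
	}
}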

Grammars

A language needs more than just tokens to convey meaning. For example, consider this sequence of words:

hates sprouts she brussels

Each word has a meaning, but all together, the sequence doesn't make much sense. If we put the sequence in another order:

she hates brussels sprouts

Now it makes sense. Programming languages are the same way. For the language to convey meaning, it must have rules for how tokens should be arranged. The set of all these rules is called the language's grammar.

Informal Definition. A grammar is a set of rules for how tokens must be arranged to create valid language constructs.

Describing Grammar

A grammar consists of grammatical rules. To help us describe a grammatical rule, we'll use a simple notation system before graduating to a more formal system called Backus-Naur Form.

All grammatical rules consist of a left-hand side naming the rule, and a right-hand side describing how the rule is composed. For example, we can define a small subset of the English language as follows:

sentence   : nounPhrase verbPhrase
nounPhrase : determiner? noun
verbPhrase : verb nounPhrase
noun       : 'apple' | 'he' | 'she' | 'bananas' | ...
verb       : 'ate' | 'shared' | 'loves' | 'grows' | ...
determiner : 'the' | 'to' | 'my' | 'a' | ...

The grammatical rules above allow us to write sentences like:

he grows bananas
he grows the bananas
she ate the apple
he shared the bananas
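
To see these rules at work, here is one way to derive "she ate the apple", expanding one rule at a time:

sentence
  → nounPhrase verbPhrase
  → noun verbPhrase                (the optional determiner is omitted)
  → she verbPhrase
  → she verb nounPhrase
  → she ate nounPhrase
  → she ate determiner noun
  → she ate the apple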

From the simple notation, we see that grammatical rules fall under two broad categories:

  1. Terminal rules are rules where the right-hand side defines only constants. The name terminal is descriptive — it's the end of the road when we break down a language's construct.
  2. Non-terminal rules are all other rules in the language. These are rules that can be further decomposed into simpler constructs, all the way down to terminal rules.

The simple notation allows us to define a programming language's grammar. For example, some language's grammar might be described as follows:

statement : ifStatement
          | whileStatement
          | letStatement

The notation above tells us the following:

A statement is either (1) an ifStatement, (2) a whileStatement, or (3) a letStatement.

Of course, we can have a sequence of statements in the language. To describe this, we write:

statement  : ifStatement
           | whileStatement
           | letStatement
statements : statement*

Here, the asterisk denotes zero or more occurrences of statement. Let's describe the ifStatement:

ifStatement : 'if' '(' expression ')' '{' statement '}'
            | 'if' '(' expression ')' '{' statements '}'

The description above tells us the following:

An ifStatement starts with the token if, followed by the token (, followed by an expression, followed by the token ), followed by the token {, followed by a single statement or a sequence of statements, and ending with the token }.

The whileStatement's description follows the same form:

whileStatement : 'while' '(' expression ')' '{' statement '}'
               | 'while' '(' expression ')' '{' statements '}'

The letStatement has a fairly short description, but includes some new rules:

letStatement : 'let' identifier '=' expression ';'

Let's first define the identifier rule:

identifier : a string not beginning with a digit

This is a terminal rule. It tells us that an identifier is a string not beginning with a digit. For the expression rule, we have the following:

expression : term (op term)?

This rule provides that an expression is a term, optionally followed by a single (op term) pair. The question mark denotes an optional part: zero or one occurrence. What's a term? Let's define it:

term : identifier | constant

This description provides that a term is either (1) an identifier, or (2) a constant. A constant is then defined as a decimal number:

constant : a decimal number

And finally, an op is defined as any one of the following tokens:

op : '+' | '-' | '=' | '>' | '<'

Putting all of this together, we have:

statement      : ifStatement
               | whileStatement
               | letStatement
statements     : statement*
ifStatement    : 'if' '(' expression ')' '{' statement '}'
               | 'if' '(' expression ')' '{' statements '}'
whileStatement : 'while' '(' expression ')' '{' statement '}'
               | 'while' '(' expression ')' '{' statements '}'
letStatement   : 'let' identifier '=' expression ';'
identifier     : a string not beginning with a digit
expression     : term (op term)?
term           : identifier | constant
constant       : a decimal number
op             : '+' | '-' | '=' | '>' | '<'

The compiler uses these rules to determine whether we have syntax errors. For example, from the rules above, we can determine that this statement is valid in the language:

let pi = 3.14;

but this statement is not:

let pi = 3.14

Why? Because we're missing the token ;. Likewise, this statement is valid in the language:

if (x < 0) {
	let x = 0;
}

but this statement is not:

if x < 0 {
	let x = 0;
}

Here, we're missing the tokens ( and ). How does the compiler know we've violated a grammatical rule? Through the parser.

Parsing

Parsing is the act of verifying that a given input conforms to a language's grammar. More interestingly, as we verify the grammar, we steadily uncover the input's grammatical structure. This reveals a key difference between computers and humans. Humans have a truly remarkable ability to uncover grammatical structure quickly. Even if the input we receive doesn't follow a grammatical rule we're familiar with, we can often uncover its structure, and ultimately, its meaning. This ability is most obvious when we read poetry.

Computers, on the other hand, do not have this ability. The tiniest deviation from the established rules — a missing semicolon or closing curly brace — is enough for the computer to throw its hands in the air and say, "You don't make sense."1
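
To make this concrete, here's a sketch of a recursive-descent parser for the letStatement rule from the grammar above, written in TypeScript. It assumes the token shape from the tokenizer section, and it only verifies the input; real parsers also build a parse tree as they go:

interface Token {
	token: string;
	type: "keyword" | "symbol" | "integer" | "string" | "identifier";
}

class Parser {
	private pos = 0;

	constructor(private tokens: Token[]) {}

	// letStatement : 'let' identifier '=' expression ';'
	parseLetStatement(): void {
		this.expectToken("let");
		this.expectType("identifier");
		this.expectToken("=");
		this.parseExpression();
		this.expectToken(";");
	}

	// expression : term (op term)?
	private parseExpression(): void {
		this.parseTerm();
		if (["+", "-", "=", ">", "<"].includes(this.peek()?.token ?? "")) {
			this.pos++; // consume the op
			this.parseTerm();
		}
	}

	// term : identifier | constant
	private parseTerm(): void {
		const t = this.peek();
		if (t && (t.type === "identifier" || t.type === "integer")) {
			this.pos++;
		} else {
			throw new Error(`Expected a term, got '${t?.token}'`);
		}
	}

	private peek(): Token | undefined {
		return this.tokens[this.pos];
	}

	private expectToken(value: string): void {
		const t = this.tokens[this.pos++];
		if (t?.token !== value) throw new Error(`Expected '${value}', got '${t?.token}'`);
	}

	private expectType(type: Token["type"]): void {
		const t = this.tokens[this.pos++];
		if (t?.type !== type) throw new Error(`Expected ${type}, got '${t?.token}'`);
	}
}

Fed the tokens for let x = y + 1;, parsing succeeds; drop the semicolon and the final expectToken(";") throws — precisely the missing-token error described above.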

Code Generation

Consider the following pseudocode in some high-level language:

int sum = x * (1 + rate)

The code generator's job is to take this source code, and output something that looks like:

push x
push 1
push rate
+
*
pop sum

This is pseudo VM code. From the virtual machine code section, we know that real VM code has no notion of x, rate, or sum. Instead, all it has are virtual memory segments. Thus, a more accurate pseudo VM code would look like:

push argument 2
push constant 1
push static 0
add
call Math.multiply 2
pop local 3

Accordingly, the code generator's job is fairly challenging. It must output code that matches the VM code's specifications. This requires the code generator to determine a number of things, including:

  • Whether a variable is a field, static, local, or an argument,
  • whether each variable is the first, second, third, fourth, etc. of its particular kind.

For the code generator to make these determinations, the language's designers must first make distinctions between variables. For example, some object-oriented language might have code that looks like the following:

class Point {
	field int x, y;
	static int pointCount;

	method int distance(Point other) {
		var int dx, dy;
		let dx = x - other.getX();
		let dy = y - other.getY();
		return Math.sqrt((dx*dx) + (dy*dy));
	}
}

Design-wise, the language might distinguish the variables as follows:

  1. Class-level variables map to field and static variables (the variables inside the class but outside methods).
  2. Subroutine-level variables map to argument and local variables (the variables inside the methods).

For all these variables, the compiler must track several properties to generate correct VM code:

  1. name (an identifier in the higher-level language),
  2. type (int, char, bool, a class name),
  3. kind (field, static, local, argument),
  4. scope (class level, subroutine level)

For a sufficiently complex language, these properties add up to a great amount of data. In most modern compilers, that data is managed with a symbol table.

Symbol Tables

Using the same pseudocode from earlier:

class Point {
	field int x, y;
	static int pointCount;

	method int distance(Point other) {
		var int dx, dy;
		let dx = x - other.getX();
		let dy = y - other.getY();
		return Math.sqrt((dx*dx) + (dy*dy));
	}
}

a symbol table might look like the following:

name        type  kind    #
x           int   field   0
y           int   field   1
pointCount  int   static  0

Above, we see four columns. The name column provides each variable's name, the type column its type, and so on. The # column is a running count of how many times the variable's kind has appeared, starting at 0. Thus, y is the second field variable, yielding the 1 we see in its # column.

The class method is a little different. Here, the symbol table might look like the following:

name  type   kind      #
this  Point  argument  0

Why doesn't this look like the other symbol table? Because distance is a class method, and a method always acts on the current object, so the compiler inserts this as an implicit first argument. If distance were a function, we would have just three variables: dx, dy, and other. The rest of the symbol table is as we'd expect:

name   type   kind      #
this   Point  argument  0
other  Point  argument  1
dx     int    local     0
dy     int    local     1
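
To make the structure concrete, here's a minimal sketch of such a table in TypeScript. The class name SymbolTable and its method names are illustrative, not taken from any particular compiler; the per-kind counters produce the # column:

type Kind = "field" | "static" | "argument" | "local";

interface Entry {
	type: string; // int, char, bool, or a class name
	kind: Kind;
	index: number; // the running # within the kind
}

class SymbolTable {
	private entries = new Map<string, Entry>();
	private counts: Record<Kind, number> = { field: 0, static: 0, argument: 0, local: 0 };

	// Record a declaration, assigning the next index for its kind.
	define(name: string, type: string, kind: Kind): void {
		this.entries.set(name, { type, kind, index: this.counts[kind]++ });
	}

	// Return a name's entry, or undefined if it was never declared here.
	find(name: string): Entry | undefined {
		return this.entries.get(name);
	}
}

Defining x, y, and pointCount in that order reproduces the class-level table above: x and y receive field indices 0 and 1, and pointCount receives static index 0.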

Managing Symbol Table Memory Usage

Because symbol tables can grow very large, languages like Java and C# generate symbol tables at two levels:

  1. At the class-level (called class-level symbol tables), and
  2. at the subroutine-level (called subroutine-level symbol tables).

Once the classes or subroutines are no longer needed, the symbol tables are forgotten. This ensures that the symbol tables do not consume an inordinate amount of memory.

As we can likely tell, the sizes of our symbol tables depend on how many types and kinds the language has. The more primitive types and variable kinds the language has, the larger the resulting symbol tables; the fewer, the smaller. This indicates a tradeoff.

The fewer primitive types and kinds a language has, the more efficient it is, but the less expressive. The more primitive types and kinds a language has, the more expressive it is, but at the cost of efficiency.

The Code Writer

The program responsible for creating and updating the symbol table is the codewriter. To maintain the table, the codewriter must do two things:

  1. Handle variable declarations, and
  2. Handle variable usage

Handling Variable Declarations

All of the following are examples of variable declarations in some language:

double y = 1;
class Foo {
	field n = 0;
	static int count;
}

When the codewriter encounters these declarations, it writes to the symbol table. Importantly, that's the only thing the codewriter does with variable declarations. It does not output code of its own.

For function arguments, the codewriter performs the same task: it writes each argument to the symbol table, just as it would the other variables above. And as we saw previously, when the codewriter encounters a class method in an OOP language, the first entry it writes is one for the this identifier.
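
Sketching this with the SymbolTable class from the previous section (compileVarDec is a hypothetical name for whatever routine handles declarations):

// Handling a declaration like `field int x;`: one table write, no output.
function compileVarDec(table: SymbolTable, name: string, type: string, kind: Kind): void {
	table.define(name, type, kind);
	// Deliberately nothing else: declarations produce no VM code.
}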

Handling Variable Usage

Variable usage occurs when we have something like the following:

let x = y - other.getX();

Suppose this statement appears in a class method. When the codewriter sees the variable y, it looks it up in the current symbol table (the subroutine-level symbol table). If the codewriter does not find it there, it looks it up in the class-level symbol table. If the codewriter doesn't find it there either, it throws an error — something along the lines of "variable undefined".
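
Continuing with the same SymbolTable sketch, the two-level lookup might read as follows (lookupVariable is again a hypothetical name):

// Resolve a variable: subroutine-level table first, then class-level.
function lookupVariable(
	name: string,
	subroutineTable: SymbolTable,
	classTable: SymbolTable,
): Entry {
	const entry = subroutineTable.find(name) ?? classTable.find(name);
	if (entry === undefined) {
		throw new Error(`variable undefined: ${name}`);
	}
	return entry;
}

With the entry in hand, the codewriter knows the variable's kind and index, which is exactly what it needs to emit instructions like push local 0 or push argument 1.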

Nested Scoping

In languages like Java and C++, programmers have access to unlimited nested scoping. For example, say we had the following pseudocode in such a language:

class foo {
	// code_block_1
	method f() {
		// code_block_2
		let x;
		{
			// code_block_3
			let x;
			{
				// code_block_4
				let x;
				for (let i = 0; i < arr.length; i++) {
					// code_block_5
					{
						// code_block_6_a
						let x;
					}
					{
						// code_block_6_b
						let x;
					}
				}
			}
		}
	}
}

In the code above, every opening curly brace and its matching closing curly brace denotes a scope. Because these are all separate scopes, the identifier x can be reused multiple times throughout the code. How is this accomplished? By creating a linked list of symbol tables.

As we saw earlier, when a program is fed to the compiler, it first looks for the main entry point. In languages like C++, this is indicated by the function main(), and in Java, by a class's main method. Focusing on the compiler's codewriter: a symbol table is created for the program. When the codewriter encounters a scope (e.g., a method), it creates a symbol table for that method, then links that symbol table to the class- or program-level symbol table. When it encounters a scope within the method, it creates another symbol table for that scope, and links it to the method's symbol table. And when it encounters yet another scope, it links that, and so on. The result is a linked list in which each node is a symbol table.
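
A sketch of this arrangement in TypeScript, generalizing the two-level lookup from the previous section (the Scope class and its names are illustrative):

interface Entry {
	type: string;
	kind: string;
	index: number;
}

class Scope {
	private entries = new Map<string, Entry>();

	// Each scope links to the scope enclosing it; null at the outermost level.
	constructor(private parent: Scope | null = null) {}

	define(name: string, entry: Entry): void {
		this.entries.set(name, entry);
	}

	// Walk outward through the enclosing scopes until the name is found.
	find(name: string): Entry | undefined {
		return this.entries.get(name) ?? this.parent?.find(name);
	}
}

// Mirroring the pseudocode above: each nested block gets its own node.
const classScope = new Scope();            // code_block_1
const methodScope = new Scope(classScope); // code_block_2
const innerScope = new Scope(methodScope); // code_block_3

Each x in the pseudocode lands in a different node, so innerScope.find("x") resolves to the innermost declaration without disturbing the outer ones.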

Handling Expressions
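
Recall the expression x * (1 + rate) from the start of this section. A common way to generate its VM code is to parse the expression into a tree and emit code from a postorder traversal: operands first, then the operator. Here's a minimal sketch under that assumption; the Expr type and the segment assignments are illustrative:

// An expression is a constant, a variable, or an op applied to two subexpressions.
type Expr =
	| { kind: "const"; value: number }
	| { kind: "var"; segment: string; index: number }
	| { kind: "op"; op: "+" | "*"; left: Expr; right: Expr };

// Only the two ops from the example; extend with the VM's other commands.
const OPS = { "+": "add", "*": "call Math.multiply 2" };

// Postorder traversal: push both operands, then apply the operator.
function generate(e: Expr, emit: (line: string) => void): void {
	switch (e.kind) {
		case "const":
			emit(`push constant ${e.value}`);
			break;
		case "var":
			emit(`push ${e.segment} ${e.index}`);
			break;
		case "op":
			generate(e.left, emit);
			generate(e.right, emit);
			emit(OPS[e.op]);
			break;
	}
}

// x * (1 + rate), with x as argument 2 and rate as static 0:
const expr: Expr = {
	kind: "op", op: "*",
	left: { kind: "var", segment: "argument", index: 2 },
	right: {
		kind: "op", op: "+",
		left: { kind: "const", value: 1 },
		right: { kind: "var", segment: "static", index: 0 },
	},
};
generate(expr, console.log);
// Prints the VM code shown earlier, minus the final pop.

Handling the assignment's final pop local 3 is then a matter of looking up sum in the symbol table and emitting a pop to its segment and index.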

Footnotes

  1. Of note, some compilers are getting better at handling these small mistakes, parsing the remaining expressions to determine whether they can uncover structure and meaning despite the error.