4. Compilers

Compilers vs Interpreters

Given a high-level program $P$ , input $I$ and output $O$ , an interpreter acts like the function converting the program and code into the output: $\text{interpreter}(P, I) \rightarrow O$ . On the other hand, the compiler, given the program, outputs low-level machine code $M$ which can then be fed to a processor: $\text{processor}(\text{compiler}(P), I) \rightarrow O$ .

Structure of a Compiler

Scanner/lexical analyser
- Convert source code into tokens
Parser/syntactic analyser
- Parse into syntax tree
Semantic analyser: annotate syntax tree
- Type checking, some optimizations
Machine-independent optimizer
- Converts to intermediate form
Code generator
- Converts to target code
Machine-dependent optimizer
- Generates optimized target code

Lexical Analysis

Reads source code character-by-character, and outputs sequence of tokens to feed to parser.

In reality, the scanner is usually called by the parser, producing a token at a time.

Examples:

Reserved words
- if, then, else, while
Symbols
- +, -, <, (, ), +=
Identifiers: variable names etc.
Number constants
String constants

The scanner needs to distinguish between reserved words and symbols, and identifiers and constants.

Each type of token can be described using regular expressions. Reserved words/symbols may be identified by string literals, while others may be defined via a sequence of definitions.

The sequence of definitions must not be cyclic. Use NFA to do so. Example:

digit := 0-9, where 0-9 is a character class
lower := a-z
upper := A-Z
letter := lower|upper
identifier := (letter|_)(letter|digit|_)*, where _ is $\epsilon$

Scanners and Automata Example: Integers

Construct an NFA for each type of integer: end state should identify the type of integer
- dec_integer := 0+ | non_zero_digit digit*
- oct_integer := 0(o|O)oct_digit+
Combine NFAs using new initial state and $\epsilon$ transitions to the start of each existing NFA
Convert to a DFA and minimize
- Mark accept states using subset construction

Operation:

Be greedy: consume as much input as possible
- That is, while there is more input and a transition is possible
If more than one accept state is possible, the user must specify which to return
If the result is not accepting, return to the last accepting state
- Put corresponding symbols back into the input
- If none is possible, return an error

Extensions

Extended regular expressions:

Eliminate at regular expression state e.g. $[p] \rightarrow p/ \epsilon$
Reducing $p^+$ to $pp^*$ can cause a state explosion if the expression is nested
- Special automata can be devised in these cases
Eliminate at the automaton stage

Example: C style /**/ Comments

Delimited by multiple characters, so can’t use character classes
Use regular expression: $/* \overline{\Sigma^* */ \Sigma^*} */$
- Where $\Sigma$ is any character

Constructing the minimal DFA: $p\Sigma^* r \Sigma^* r$

Construct NFA for $\Sigma^* r \Sigma^*$
Convert to DFA
Complement the DFA
Plug result in between NFAs for $p$ and $r$
Convert to DFA
Minimize

This can be automated:

Lexical structure defined with a sequence of definitions using extended regular expressions
Scanner generator uses the description to produce a scanner
- PLY is an example of such generator

Syntax Analysis

Parser produces the syntax tree
Usually produces AST
Lookahead tokens allow it to more efficiently recognize the structure
May generate code on the fly, instead of constructing a full syntax tree

Backus-Naur form:

Each rule describes structure of program fragment
Rules can be recursive
Example:
- Expression := Expression Arithmetic Expression | (Expression) | number | identifier
- Comparison := Expression Relation Expression
- Relation := = | != | < | <= | >= | >
- Statement := Statement| Statement;Statement
Terminals are single tokens (e.g. Number, identifier)
- They are the leaves of the AST
Item on the LHS of a rule are non-terminal

Syntax Trees

Each leaf is labelled with a terminal
Each inner node corresponds to the application of a rule
- e.g. if, x, >, y would be the leaves of the tree
- x and y are expressions, and > is a relation
- Together, they form a comparison
- The if and the comparison together are part of the If rule

The BNF grammar can be ambiguous for a given token syntax. To fix this, the language may specify precedence rules, or the programmer may use parentheses.

In regular expressions, the choice on the left of the ‘or’ will have precedence, and this can be used for precedence. For example:

Expression := Term | Expression Additive Term
Additive := +|-
Term := Factor | Term Multiplication Factor
Multiplicative := *|-
Factor := (Expression) | number | identifier

Types:

Top down: LL, recursive descent
Bottom up: LR, LALR, SLR, shift-reduce
LR more expressive than LL: defers decisions until more information is available
Shift-reduce more expressive, but way harder to understand

BNF Extensions

[p] means p is optional: shorthand for $(\epsilon \mid p)$
$*$ is also allowed

Syntax Diagram

Terminals in circles
Non-terminals in rectangles
Connect using arrows, loops allowed
Start of diagram: line, with name of end result above it
End of diagram: arrow

Recursive-Descent Parsers

Mutually recursive functions: Every addition/subtraction operand is a term, which is either a number/identifier or multiplication/division operation. Thus, this hierarchy enforces the order of operations: for an addition to occur, all of the multiplication must have already been run;

# Expression  =  Term((+|-)Term)*
def expression():
  term() # Call function to parse term
  while lookahead() in [ADD, SUB]: # Check if next term is plus or minus
    consume(ADD, SUB) # Consume that next token, which is in expected tokens:
    add or subtract
    term()

# Term  =  Factor((*|/)Term)*
def term():
  factor()
  while lookahead() in [MUL, DIV]:
    consume(MUL, DIV)
  factor()

# Factor  =  (Expression) | number | identifier
def factor():
  if lookahead()  = = LPAR:
    consume(LPAR)
    expression()
    consume(RPAR)
  elif lookahead() in [NUM, ID]:
    consume(NUM, ID)
  else:
    raise Exception

Extended NBF to Recursive-Descent Parser

Two types of recursion:

Recursion over the structure of the regular expression
Recursion over structure of the grammar

Operation:

Every non-terminal n becomes a function
Body of function from RHS of the BNF rule: parse(RHS)
Parse function:
- If there are conditions, use lookahead and if statements
- For star, use while loops
- For multiple tokens, just run parse on them sequentially
- If the passed token is a terminal, consume the token

Even with single token lookahead, there can be ambiguities. The language may need to be changed to resolve this.

Abstract Syntax Tree

Classes for each type token (e.g. Expression, Number, Identifier). If number or identifier, needs to store the value. If it is an expression, it needs to store the left- and right-hand operands and the operator.

Details such as parentheses are omitted: the structure of the tree contains this information.

Previously, there needed to be multiple types of arithmetic nodes e.g. (+, *), but with this, only one type of node is required.

Basically the same as parser, except it returns a object.

Semantic Analysis

Type checking
Type inference
Declaration: declared multiple times in different scopes?
Definite assignment? Assigned before use?

Type systems

Type: a set of possible values
Many functions cannot apply to all arguments e.g. Addition on lists won’t work
Processors just work on bits. Garbage in, garbage out
Type checking
- Dynamic: check at run-time
- Static: check during complication
  - Also used for compiler optimizations

Attribute Grammars

An extension of CFGs:

Define functions to calculate attributes

\text{typeof}(op, t_1, t_2 ) = \begin{cases} \text{int} &{op } \in (+, -, *) \text{ and } t_1 \text{ and } t_2 \text{ integers} \\ \text{float} &\text{otherwise} \end{cases}

Define attributes for specific terminals/non-terminals: store the name of the attribute, domain for the attribute and a list of terminals/non-terminals it applies to
Define rules and attribute values e.g.

\begin{aligned} E = T &E.\text{type} = T.\text{type} \\ E = EAT \quad &E_0.\text{type} = \text{typeof}(A.op, E_1.\text{type}, T.\text{type}) \\ A = + \quad &a.\text{op} = + \\ A = - &a.\text{op} = - \end{aligned}

Here, $E$ is an addition/subtraction operation, $T$ is a number/identifier/multiplication/division operation, and $A$ is the plus or minus symbol.

Instead of defining it using the star symbol, it has been defined recursively using two rules.

$E$ has the type attribute, so the type must be set, either inline or with a function defined previously

Synthesized attributes:

The information is propagated bottom-up
Type of number constant determined by scanner
Type of identifier determined by declaration or assignments

Type checking:

Type of identifier and expression must match, or be able to be automatically converted

Inherited attributes:

Value of non-terminal on RHS of rule depends on other values
Information is propagated top-down

Machine-Independent Optimization

Takes place on a syntax tree or other intermediate form. Whatever optimization is done must not change the meaning of the program.

Constant Propagation
- When a variable is used, if that variable is assigned a constant value, the variable can be substituted with the constant.
Constant Folding
- Constant expressions can be evaluated
  - The compiler must implement the semantics of the program
  - e.g. Python compiler compiling Java: must ensure integer addition uses the correct integer size
Dead code elimination
- Code that does not affect the result can be eliminated
Moving calculations out of loops
Reorder calculations
Unroll loops
Removing tail recursion
Eliminating common sub-expressions

Optimizations often lead to further optimization:

Constant folding enables constant propagation
Constant propagation enables further constant folding
Constant folding enables dead-code elimination

Constant Folding: evaluating arithmetic expressions that only use constants. Use similar structure to attribute grammars, with a helper function that evaluates arithmetic expressions if the operands are known, and rules which

Code Generation

Traverses the AST, emitting code as soon as a sufficient portion is processed. It outputs a data structure representing code for a virtual machine, after which it can be further analysed and optimized.

Expression: program fragments that yield a value
Statements: fragments that modify the values of variables
Declarations: information for type checking etc.

JVM

Stack:

Calculations performed on stack
Operations pop operands from the stack, then push the result onto the stack
- sipush stands for short integer push
- iadd, isub, imul, idiv
e_1 + e_2 pushes the code for e_1 and e_2 onto the stack, and then runs iadd
- The invariant is that the code for e_1will end up adding one element to the stack
- In the intermediate steps, it may add any number of elements onto the stack, but they will all be popped away

Variables:

Identifiers refer to variables
Value of variable stored in memory: this is known as their R-value
To access and modify the variable, use the location/address of the variable: the L value
The L-value is fixed at compile time
During compile time, keep a map between the variable name and L-value
Frames
- Contains arguments, local data, calculation stack
- Local/statically allocated data offset relative to frame
- Global/dynamically allocated data in absolute memory
- Assume a single frame for this example
iload n pushes value stored in location n on to the stack
istore n pops value from stack, and stores it in location n

Control flow

Labels mark positions in the code
- l1
Unconditional jumps tells the machine to go to where the label is defined
- goto l1
Conditional jumps do so if a given condition holds
- Removes top two elements from stack
- e.g. if_icmpeq l1: f integer compare equal
- if_icmpeq := =, if_icmpne := !=
- if_icmpge := <=, if_icmpgt := <
- if_icmple := >=, if_icmplt := >

Machine-Dependent Optimization

Peephole optimization: looks at only a small number of generated instructions, replacing them with a shorter/faster sequence of instructions.

Example: it may read a window of 3 instructions, working its way down one instruction at a time.