Compilers vs Interpreters
Given a high-level program
Structure of a Compiler
- Scanner/lexical analyser
- Convert source code into tokens
- Parser/syntactic analyser
- Parse into syntax tree
- Semantic analyser: annotate syntax tree
- Type checking, some optimizations
- Machine-independent optimizer
- Converts to intermediate form
- Code generator
- Converts to target code
- Machine-dependent optimizer
- Generates optimized target code
Lexical Analysis
Reads source code character-by-character, and outputs sequence of tokens to feed to parser.
In reality, the scanner is usually called by the parser, producing a token at a time.
Examples:
- Reserved words
if,then,else,while
- Symbols
+,-,<,(,),+=
- Identifiers: variable names etc.
- Number constants
- String constants
The scanner needs to distinguish between reserved words and symbols, and identifiers and constants.
Each type of token can be described using regular expressions. Reserved words/symbols may be identified by string literals, while others may be defined via a sequence of definitions.
The sequence of definitions must not be cyclic. Use NFA to do so. Example:
digit := 0-9, where0-9is a character classlower := a-zupper := A-Zletter := lower|upperidentifier := (letter|_)(letter|digit|_)*, where_is
Scanners and Automata Example: Integers
- Construct an NFA for each type of integer: end state should identify the type of integer
dec_integer := 0+ | non_zero_digit digit*oct_integer := 0(o|O)oct_digit+
- Combine NFAs using new initial state and
transitions to the start of each existing NFA - Convert to a DFA and minimize
- Mark accept states using subset construction
Operation:
- Be greedy: consume as much input as possible
- That is, while there is more input and a transition is possible
- If more than one accept state is possible, the user must specify which to return
- If the result is not accepting, return to the last accepting state
- Put corresponding symbols back into the input
- If none is possible, return an error
Extensions
Extended regular expressions:
- Eliminate at regular expression state e.g.
- Reducing
to can cause a state explosion if the expression is nested - Special automata can be devised in these cases
- Eliminate at the automaton stage
Example: C style /**/ Comments
- Delimited by multiple characters, so can’t use character classes
- Use regular expression:
- Where
is any character
- Where
Constructing the minimal DFA:
- Construct NFA for
- Convert to DFA
- Complement the DFA
- Plug result in between NFAs for
and - Convert to DFA
- Minimize
This can be automated:
- Lexical structure defined with a sequence of definitions using extended regular expressions
- Scanner generator uses the description to produce a scanner
- PLY is an example of such generator
Syntax Analysis
- Parser produces the syntax tree
- Usually produces AST
- Lookahead tokens allow it to more efficiently recognize the structure
- May generate code on the fly, instead of constructing a full syntax tree
Backus-Naur form:
- Each rule describes structure of program fragment
- Rules can be recursive
- Example:
Expression := Expression Arithmetic Expression | (Expression) | number | identifierComparison := Expression Relation ExpressionRelation := = | != | < | <= | >= | >Statement := Statement| Statement;Statement
- Terminals are single tokens (e.g.
Number,identifier)- They are the leaves of the AST
- Item on the LHS of a rule are non-terminal
Syntax Trees
- Each leaf is labelled with a terminal
- Each inner node corresponds to the application of a rule
- e.g.
if,x,>,ywould be the leaves of the tree xandyare expressions, and>is a relation- Together, they form a comparison
- The if and the comparison together are part of the
Ifrule
- e.g.
The BNF grammar can be ambiguous for a given token syntax. To fix this, the language may specify precedence rules, or the programmer may use parentheses.
In regular expressions, the choice on the left of the ‘or’ will have precedence, and this can be used for precedence. For example:
Expression := Term | Expression Additive TermAdditive := +|-Term := Factor | Term Multiplication FactorMultiplicative := *|-Factor := (Expression) | number | identifier
Types:
- Top down: LL, recursive descent
- Bottom up: LR, LALR, SLR, shift-reduce
- LR more expressive than LL: defers decisions until more information is available
- Shift-reduce more expressive, but way harder to understand
BNF Extensions
[p]meanspis optional: shorthand for-
is also allowed
Syntax Diagram
- Terminals in circles
- Non-terminals in rectangles
- Connect using arrows, loops allowed
- Start of diagram: line, with name of end result above it
- End of diagram: arrow
Recursive-Descent Parsers
Mutually recursive functions: Every addition/subtraction operand is a term, which is either a number/identifier or multiplication/division operation. Thus, this hierarchy enforces the order of operations: for an addition to occur, all of the multiplication must have already been run;
# Expression = Term((+|-)Term)*
def expression():
term() # Call function to parse term
while lookahead() in [ADD, SUB]: # Check if next term is plus or minus
consume(ADD, SUB) # Consume that next token, which is in expected tokens:
add or subtract
term()
# Term = Factor((*|/)Term)*
def term():
factor()
while lookahead() in [MUL, DIV]:
consume(MUL, DIV)
factor()
# Factor = (Expression) | number | identifier
def factor():
if lookahead() = = LPAR:
consume(LPAR)
expression()
consume(RPAR)
elif lookahead() in [NUM, ID]:
consume(NUM, ID)
else:
raise Exception
Extended NBF to Recursive-Descent Parser
Two types of recursion:
- Recursion over the structure of the regular expression
- Recursion over structure of the grammar
Operation:
- Every non-terminal
nbecomes a function - Body of function from RHS of the BNF rule:
parse(RHS) - Parse function:
- If there are conditions, use lookahead and if statements
- For star, use while loops
- For multiple tokens, just run parse on them sequentially
- If the passed token is a terminal, consume the token
Even with single token lookahead, there can be ambiguities. The language may need to be changed to resolve this.
Abstract Syntax Tree
Classes for each type token (e.g. Expression, Number, Identifier). If number or identifier, needs to store the value. If it is an expression, it needs to store the left- and right-hand operands and the operator.
Details such as parentheses are omitted: the structure of the tree contains this information.
Previously, there needed to be multiple types of arithmetic nodes e.g. (+, *), but with this, only one type of node is required.
Basically the same as parser, except it returns a object.
Semantic Analysis
- Type checking
- Type inference
- Declaration: declared multiple times in different scopes?
- Definite assignment? Assigned before use?
Type systems
- Type: a set of possible values
- Many functions cannot apply to all arguments e.g. Addition on lists won’t work
- Processors just work on bits. Garbage in, garbage out
- Type checking
- Dynamic: check at run-time
- Static: check during complication
- Also used for compiler optimizations
Attribute Grammars
An extension of CFGs:
- Define functions to calculate attributes
- Define attributes for specific terminals/non-terminals: store the name of the attribute, domain for the attribute and a list of terminals/non-terminals it applies to
- Define rules and attribute values e.g.
Here,
Instead of defining it using the star symbol, it has been defined recursively using two rules.
Synthesized attributes:
- The information is propagated bottom-up
- Type of number constant determined by scanner
- Type of identifier determined by declaration or assignments
Type checking:
- Type of identifier and expression must match, or be able to be automatically converted
Inherited attributes:
- Value of non-terminal on RHS of rule depends on other values
- Information is propagated top-down
Machine-Independent Optimization
Takes place on a syntax tree or other intermediate form. Whatever optimization is done must not change the meaning of the program.
- Constant Propagation
- When a variable is used, if that variable is assigned a constant value, the variable can be substituted with the constant.
- Constant Folding
- Constant expressions can be evaluated
- The compiler must implement the semantics of the program
- e.g. Python compiler compiling Java: must ensure integer addition uses the correct integer size
- Constant expressions can be evaluated
- Dead code elimination
- Code that does not affect the result can be eliminated
- Moving calculations out of loops
- Reorder calculations
- Unroll loops
- Removing tail recursion
- Eliminating common sub-expressions
Optimizations often lead to further optimization:
- Constant folding enables constant propagation
- Constant propagation enables further constant folding
- Constant folding enables dead-code elimination
Constant Folding: evaluating arithmetic expressions that only use constants. Use similar structure to attribute grammars, with a helper function that evaluates arithmetic expressions if the operands are known, and rules which
Code Generation
Traverses the AST, emitting code as soon as a sufficient portion is processed. It outputs a data structure representing code for a virtual machine, after which it can be further analysed and optimized.
- Expression: program fragments that yield a value
- Statements: fragments that modify the values of variables
- Declarations: information for type checking etc.
JVM
Stack:
- Calculations performed on stack
- Operations pop operands from the stack, then push the result onto the stack
sipushstands for short integer pushiadd,isub,imul,idiv
e_1 + e_2pushes the code fore_1ande_2onto the stack, and then runsiadd- The invariant is that the code for
e_1will end up adding one element to the stack - In the intermediate steps, it may add any number of elements onto the stack, but they will all be popped away
- The invariant is that the code for
Variables:
- Identifiers refer to variables
- Value of variable stored in memory: this is known as their R-value
- To access and modify the variable, use the location/address of the variable: the L value
- The L-value is fixed at compile time
- During compile time, keep a map between the variable name and L-value
- Frames
- Contains arguments, local data, calculation stack
- Local/statically allocated data offset relative to frame
- Global/dynamically allocated data in absolute memory
- Assume a single frame for this example
iload npushes value stored in locationnon to the stackistore npops value from stack, and stores it in locationn
Control flow
- Labels mark positions in the code
l1
- Unconditional jumps tells the machine to go to where the label is defined
goto l1
- Conditional jumps do so if a given condition holds
- Removes top two elements from stack
- e.g.
if_icmpeq l1: f integer compare equal if_icmpeq := =,if_icmpne := !=if_icmpge := <=,if_icmpgt := <if_icmple := >=,if_icmplt := >
Machine-Dependent Optimization
Peephole optimization: looks at only a small number of generated instructions, replacing them with a shorter/faster sequence of instructions.
Example: it may read a window of 3 instructions, working its way down one instruction at a time.