4. Compilers

Compilers vs Interpreters

Given a high-level program PP, input II and output OO, an interpreter acts like the function converting the program and code into the output: interpreter(P,I)O\text{interpreter}(P, I) \rightarrow O. On the other hand, the compiler, given the program, outputs low-level machine code MM which can then be fed to a processor: processor(compiler(P),I)O\text{processor}(\text{compiler}(P), I) \rightarrow O.

Structure of a Compiler

Lexical Analysis

Reads source code character-by-character, and outputs sequence of tokens to feed to parser.

In reality, the scanner is usually called by the parser, producing a token at a time.

Examples:

The scanner needs to distinguish between reserved words and symbols, and identifiers and constants.

Each type of token can be described using regular expressions. Reserved words/symbols may be identified by string literals, while others may be defined via a sequence of definitions.

The sequence of definitions must not be cyclic. Use NFA to do so. Example:

Scanners and Automata Example: Integers

Operation:

Extensions

Extended regular expressions:

Example: C style /**/ Comments

Constructing the minimal DFA: pΣrΣrp\Sigma^* r \Sigma^* r

This can be automated:

Syntax Analysis

Backus-Naur form:

Syntax Trees

The BNF grammar can be ambiguous for a given token syntax. To fix this, the language may specify precedence rules, or the programmer may use parentheses.

In regular expressions, the choice on the left of the ‘or’ will have precedence, and this can be used for precedence. For example:

Types:

BNF Extensions

Syntax Diagram

Recursive-Descent Parsers

Mutually recursive functions: Every addition/subtraction operand is a term, which is either a number/identifier or multiplication/division operation. Thus, this hierarchy enforces the order of operations: for an addition to occur, all of the multiplication must have already been run;

# Expression  =  Term((+|-)Term)*
def expression():
  term() # Call function to parse term
  while lookahead() in [ADD, SUB]: # Check if next term is plus or minus
    consume(ADD, SUB) # Consume that next token, which is in expected tokens:
    add or subtract
    term()

# Term  =  Factor((*|/)Term)*
def term():
  factor()
  while lookahead() in [MUL, DIV]:
    consume(MUL, DIV)
  factor()

# Factor  =  (Expression) | number | identifier
def factor():
  if lookahead()  = = LPAR:
    consume(LPAR)
    expression()
    consume(RPAR)
  elif lookahead() in [NUM, ID]:
    consume(NUM, ID)
  else:
    raise Exception

Extended NBF to Recursive-Descent Parser

Two types of recursion:

Operation:

Even with single token lookahead, there can be ambiguities. The language may need to be changed to resolve this.

Abstract Syntax Tree

Classes for each type token (e.g. Expression, Number, Identifier). If number or identifier, needs to store the value. If it is an expression, it needs to store the left- and right-hand operands and the operator.

Details such as parentheses are omitted: the structure of the tree contains this information.

Previously, there needed to be multiple types of arithmetic nodes e.g. (+, *), but with this, only one type of node is required.

Basically the same as parser, except it returns a object.

Semantic Analysis

Type systems

Attribute Grammars

An extension of CFGs:

  1. Define functions to calculate attributes
typeof(op,t1,t2)={intop(+,,) and t1 and t2 integersfloatotherwise \text{typeof}(op, t_1, t_2 ) = \begin{cases} \text{int} &{op } \in (+, -, *) \text{ and } t_1 \text{ and } t_2 \text{ integers} \\ \text{float} &\text{otherwise} \end{cases}
  1. Define attributes for specific terminals/non-terminals: store the name of the attribute, domain for the attribute and a list of terminals/non-terminals it applies to
  2. Define rules and attribute values e.g.
E=TE.type=T.typeE=EATE0.type=typeof(A.op,E1.type,T.type)A=+a.op=+A=a.op= \begin{aligned} E = T &E.\text{type} = T.\text{type} \\ E = EAT \quad &E_0.\text{type} = \text{typeof}(A.op, E_1.\text{type}, T.\text{type}) \\ A = + \quad &a.\text{op} = + \\ A = - &a.\text{op} = - \end{aligned}

Here, EE is an addition/subtraction operation, TT is a number/identifier/multiplication/division operation, and AA is the plus or minus symbol.

Instead of defining it using the star symbol, it has been defined recursively using two rules.

EE has the type attribute, so the type must be set, either inline or with a function defined previously

Synthesized attributes:

Type checking:

Inherited attributes:

Machine-Independent Optimization

Takes place on a syntax tree or other intermediate form. Whatever optimization is done must not change the meaning of the program.

Optimizations often lead to further optimization:

Constant Folding: evaluating arithmetic expressions that only use constants. Use similar structure to attribute grammars, with a helper function that evaluates arithmetic expressions if the operands are known, and rules which

Code Generation

Traverses the AST, emitting code as soon as a sufficient portion is processed. It outputs a data structure representing code for a virtual machine, after which it can be further analysed and optimized.

JVM

Stack:

Variables:

Control flow

Machine-Dependent Optimization

Peephole optimization: looks at only a small number of generated instructions, replacing them with a shorter/faster sequence of instructions.

Example: it may read a window of 3 instructions, working its way down one instruction at a time.