All Files in ‘COSC261 (2020-S1)’ Merged

1. DFAs and NFAs

Introduction

Definitions

Concatenation

Strings

$xy$ is $y$ concatenated onto $x$, where $x$ and $y$ are either strings or characters.

Concatenation of strings is associative: brackets are not needed.

Languages

$$AB = \{ xy \mid x \in A \text{ and } y \in B \}$$

i.e. `A.flatMap(a => B.map(b => a + b))`

Language concatenation is associative.

$\{ \epsilon \}$ is the identity language: $\{ \epsilon \}A = A\{ \epsilon \} = A$.

Powers: Strings/Characters

$a^n$ is the string/character $a$ concatenated $n$ times.

Base case: $a^0 = \epsilon$; inductive step: $a^{n+1} = a^n a$.

Powers: Languages

$$\begin{aligned} A^0 &= \{ \epsilon \} \\ A^1 &= A \\ A^* &= \bigcup_{n \in \mathbb{N}} A^n = A^0 \cup A^1 \cup A^2 \cup A^3 \cup \dots \\ A^+ &= A^1 \cup A^2 \cup A^3 \cup \dots \end{aligned}$$

Thus, $A^*$ will always contain the empty string, while $A^+$ will only contain it if it is a member of $A$.
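These definitions translate directly into set operations. A minimal Python sketch, representing a language as a set of strings and truncating the infinite union in $A^*$ at a chosen bound $n$:

```python
def concat(A, B):
    """Language concatenation: every string xy with x in A and y in B."""
    return {x + y for x in A for y in B}

def power(A, n):
    """A^n: A concatenated with itself n times; A^0 is {''}."""
    result = {''}  # the identity language {epsilon}
    for _ in range(n):
        result = concat(result, A)
    return result

def star_upto(A, n):
    """Finite approximation of A*: the union of A^0 .. A^n."""
    out = set()
    for k in range(n + 1):
        out |= power(A, k)
    return out
```

Note that `star_upto` always contains `''`, matching the remark above that $A^*$ always contains the empty string.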

Properties
$$\begin{aligned} A^* A^* &= A^{**} = A^* \\ \emptyset A &= \emptyset = A \emptyset \end{aligned}$$

Deterministic Finite Automata

$$M = (Q, \Sigma, \delta, q_0, F)$$

Where:

$M$ reads the input string $w \in \Sigma^*$ symbol by symbol. For each symbol, the transition function is applied to the current state and symbol. The input string is accepted if $M$ ends in an accept state.

Extended Transition Function

$\hat{\delta}$ extends $\delta$ to take in strings (not just symbols) by processing one symbol at a time and passing the remaining substring on recursively.

$$\begin{aligned} \hat{\delta} &: Q \times \Sigma^* \rightarrow Q \\ \hat{\delta}(q, \epsilon) &= q \\ \hat{\delta}(q, ax) &= \hat{\delta}(\delta(q, a), x) \text{ where } a \in \Sigma,\ x \in \Sigma^* \text{ and the input string is } ax \end{aligned}$$
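A sketch of $\hat{\delta}$ in Python, assuming the transition function is stored as a dict from (state, symbol) to state; the even-number-of-a's DFA is a made-up example:

```python
def delta_hat(delta, q, w):
    """Extended transition function: delta_hat(q, '') = q and
    delta_hat(q, a + x) = delta_hat(delta(q, a), x)."""
    if w == '':
        return q
    return delta_hat(delta, delta[(q, w[0])], w[1:])

# Hypothetical DFA over {a, b} accepting strings with an even number of a's
delta = {('even', 'a'): 'odd', ('even', 'b'): 'even',
         ('odd', 'a'): 'even', ('odd', 'b'): 'odd'}
F = {'even'}
```

A string `w` is accepted iff `delta_hat(delta, 'even', w) in F`.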

All finite languages are regular (e.g. OR together an automaton for every single string), but not all infinite languages are regular.

Regular languages are closed over:

Proving Regular Languages Are Closed under the Complement Operator

Let $A \subseteq \Sigma^*$ be regular. Show $\bar{A}$ is regular.

By definition, $A = L(M)$ for some DFA $M = (Q, \Sigma, \delta, q_0, F)$.

Swap accepting and non-accepting states:

Let $M' = (Q, \Sigma, \delta, q_0, Q - F)$. Now, show that $L(M') = \bar{A}$.

By the definition of acceptance, $L(M') = \{ w \in \Sigma^* \mid \hat{\delta}(q_0, w) \in Q - F \}$.

For any $w \in \Sigma^*$, $w \in L(M')$ holds exactly when:

$$\hat{\delta}(q_0, w) \in Q - F$$

$Q - F$ is the complement of $F$, so: $\hat{\delta}(q_0, w) \notin F$

By definition, $w \in L(M)$ iff $\hat{\delta}(q_0, w) \in F$, which is the negation of the above. Thus $w \in L(M')$ iff:

$$\begin{aligned} w &\notin L(M) \\ w &\in \overline{L(M)} \\ w &\in \bar{A} \end{aligned}$$

Hence $L(M') = \bar{A}$, so $\bar{A}$ is regular.
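The construction itself is tiny: a Python sketch, assuming the dict representation of $\delta$; the even-length DFA is a made-up example:

```python
def run(delta, q0, F, w):
    """Run a DFA and report whether w is accepted."""
    q = q0
    for a in w:
        q = delta[(q, a)]
    return q in F

# Hypothetical DFA over {a} accepting strings of even length
Q = {'even', 'odd'}
delta = {('even', 'a'): 'odd', ('odd', 'a'): 'even'}
F = {'even'}
F_complement = Q - F  # swap accepting and non-accepting states
```

The same transition function with $Q - F$ as accept states accepts exactly the odd-length strings, i.e. the complement.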

Intersection of Two Automata

Let $A, B \subseteq \Sigma^*$ be regular. Show that $A \cap B$ will be regular.

$A = L(M_1)$ for DFA $M_1 = (Q_1, \Sigma, \delta_1, q_1, F_1)$.

$B = L(M_2)$ for DFA $M_2 = (Q_2, \Sigma, \delta_2, q_2, F_2)$.

Only the alphabet is the same between the automata

Idea: keep track of the state $M_1$ is in and the state $M_2$ is in at the same time. Define the state as a pair of two states.

$$\text{Let } M = (Q_1 \times Q_2, \Sigma, \delta, (q_1, q_2), F_1 \times F_2)$$

Now, we need to define the transition function, where $a \in \Sigma$ is the transition symbol, $w \in \Sigma^*$ is the transition string, $p_1 \in Q_1$ and $p_2 \in Q_2$:

$$\begin{aligned} \delta((p_1, p_2), a) &= (\delta_1(p_1, a), \delta_2(p_2, a)) \\ \hat{\delta}((p_1, p_2), w) &= (\widehat{\delta_1}(p_1, w), \widehat{\delta_2}(p_2, w)) \end{aligned}$$

Need to show $L(M) = L(M_1) \cap L(M_2)$. For any string $w \in \Sigma^*$ where $w \in L(M)$:

$$\begin{aligned} \hat{\delta}((q_1, q_2), w) &\in F_1 \times F_2 \\ (\widehat{\delta_1}(q_1, w), \widehat{\delta_2}(q_2, w)) &\in F_1 \times F_2 \\ \widehat{\delta_1}(q_1, w) \in F_1 &\text{ and } \widehat{\delta_2}(q_2, w) \in F_2 \\ w \in L(M_1) &\text{ and } w \in L(M_2) \\ w &\in L(M_1) \cap L(M_2) \text{ as required} \end{aligned}$$
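The product construction can be sketched in Python; the two component DFAs here are hypothetical examples (an even number of a's; ends in b):

```python
def product_dfa(states1, delta1, q1, F1, states2, delta2, q2, F2, alphabet):
    """Product construction: a state is a pair; accept iff both components accept."""
    delta = {}
    for p1 in states1:
        for p2 in states2:
            for a in alphabet:
                delta[((p1, p2), a)] = (delta1[(p1, a)], delta2[(p2, a)])
    accept = {(f1, f2) for f1 in F1 for f2 in F2}
    return delta, (q1, q2), accept

def run(delta, start, accept, w):
    """Run the product DFA on w."""
    q = start
    for a in w:
        q = delta[(q, a)]
    return q in accept

# Hypothetical components: M1 accepts an even number of a's, M2 accepts strings ending in b
d1 = {('even', 'a'): 'odd', ('even', 'b'): 'even',
      ('odd', 'a'): 'even', ('odd', 'b'): 'odd'}
d2 = {('no', 'a'): 'no', ('no', 'b'): 'yes',
      ('yes', 'a'): 'no', ('yes', 'b'): 'yes'}
delta, start, accept = product_dfa({'even', 'odd'}, d1, 'even', {'even'},
                                   {'no', 'yes'}, d2, 'no', {'yes'}, 'ab')
```

The product accepts exactly the strings accepted by both components.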

Union of Two Automata

Let $A, B \subseteq \Sigma^*$ be regular. Show that $A \cup B$ will be regular.

$A \cup B = \overline{\bar{A} \cap \bar{B}}$ by De Morgan's law. There is closure under the complement and intersection, so there must be closure under the union operation.

Non-Deterministic Finite Automata

NFA: a state can have zero or multiple transitions using a single symbol.

$$M = (Q, \Sigma, \delta, q_0, F)$$

$\delta: Q \times \Sigma \rightarrow P(Q)$, where $P(Q)$ is the power set. That is, the transition function will return a set of states. If multiple states are returned, $M$ can move to any of those states. If the empty set is returned, $M$ will get stuck and that route should be ignored.

If any route leads to an accept state, the string should be accepted.

Extended Transition Relation

$$\hat{\delta}: Q \times \Sigma^* \rightarrow P(Q)$$
$$\begin{aligned} \hat{\delta}(q, \epsilon) &= \{ q \} \\ \hat{\delta}(q, ax) &= \bigcup_{p \in \delta(q, a)} \hat{\delta}(p, x), \quad a \in \Sigma,\ x \in \Sigma^* \end{aligned}$$

That is, for every state it can be in, calculate the possible transition(s) for the next symbol, put them all in a set, and repeat until the input string is empty.

$w$ is accepted by $M$ if and only if $\hat{\delta}(q_0, w) \cap F \neq \emptyset$.

The language accepted by MM is the set of all strings where the above is true:

$$L(M) = \{ w \in \Sigma^* \mid \hat{\delta}(q_0, w) \cap F \neq \emptyset \}$$
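The extended transition relation can be computed iteratively; a Python sketch, assuming $\delta$ is a dict from (state, symbol) to a set of states, with missing keys meaning the empty set (the ends-in-01 NFA is a made-up example):

```python
def nfa_delta_hat(delta, q, w):
    """Extended transition relation: the set of all states reachable on input w."""
    states = {q}
    for a in w:
        # union of the successor sets of every state we could currently be in
        states = set().union(*(delta.get((p, a), set()) for p in states))
    return states

# Hypothetical NFA over {0, 1} accepting strings ending in 01
nfa_d = {('q0', '0'): {'q0', 'q1'}, ('q0', '1'): {'q0'}, ('q1', '1'): {'q2'}}
F = {'q2'}
```

A string `w` is accepted iff `nfa_delta_hat(nfa_d, 'q0', w) & F` is non-empty.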

NFA to DFA Conversion Through Subset Construction

DFA states become sets containing all the NFA states the automaton could be in given the transitions. If it can be in the states $\{ q_x, q_y, q_z \}$, denote the DFA state as $q_{xyz}$. A DFA state is accepting if its set contains any element from the accept states set. Add a new reject state (the empty set) with a self-transition under all symbols to ensure that every state has transitions for every symbol.
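A sketch of the subset construction in Python, exploring only reachable subsets; frozensets serve as DFA state names, and the empty frozenset plays the role of the reject state:

```python
def subset_construction(delta, q0, F, alphabet):
    """DFA whose states are (reachable) sets of NFA states.
    delta maps (state, symbol) to a set of states; missing keys mean stuck."""
    start = frozenset({q0})
    dfa_delta, seen, stack = {}, {start}, [start]
    while stack:
        S = stack.pop()
        for a in alphabet:
            # successor set: union over the members of S
            T = frozenset().union(*(delta.get((q, a), set()) for q in S))
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                stack.append(T)
    accept = {S for S in seen if S & F}
    return dfa_delta, start, accept

# Hypothetical NFA over {0, 1} accepting strings ending in 01
nfa_delta = {('q0', '0'): {'q0', 'q1'}, ('q0', '1'): {'q0'}, ('q1', '1'): {'q2'}}
dfa_delta, start, accept = subset_construction(nfa_delta, 'q0', {'q2'}, '01')
```

Only subsets reachable from $\{ q_0 \}$ are generated, so the resulting DFA usually has far fewer than $2^{|Q|}$ states.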

Proving Every NFA is Accepted by a DFA

$$M = (Q, \Sigma, \delta, q_0, F)$$

Idea: keep track of all states the NFA $M$ can be in while reading the input.

Subset automaton (DFA) $M' = (P(Q), \Sigma, \delta', \{ q_0 \}, F')$ where $F' = \{ S \subseteq Q \mid S \cap F \neq \emptyset \}$ (any state whose set contains an element from the accept states).

$$\delta': P(Q) \times \Sigma \rightarrow P(Q)$$
$$\begin{aligned} \delta'(S, a) &= \bigcup_{q \in S} \delta(q, a) \\ \hat{\delta}'(S, w) &= \bigcup_{q \in S} \hat{\delta}(q, w) \end{aligned}$$

Now show that $L(M') = L(M)$. For any $w \in \Sigma^*$, if $w \in L(M')$:

By the definition of $L(M')$: $\hat{\delta}'(\{ q_0 \}, w) \in F'$, i.e. the set of reachable states contains an accept state.

$$\begin{aligned} \hat{\delta}'(\{ q_0 \}, w) \cap F &\neq \emptyset \\ \left( \bigcup_{q \in \{ q_0 \}} \hat{\delta}(q, w) \right) \cap F &\neq \emptyset \\ \hat{\delta}(q_0, w) \cap F &\neq \emptyset \end{aligned}$$

Hence, $w \in L(M)$.

NFAs with $\epsilon$-Transitions

Example use case: the union of two regular languages: one new start state, with two $\epsilon$-transitions, one to each language's automaton.

$$\epsilon \notin \Sigma \\ \delta: Q \times (\Sigma \cup \{ \epsilon \}) \rightarrow P(Q)$$

$\epsilon$-closure of $q$: $E(q) = \{ p \in Q \mid p \text{ is reachable from } q \text{ with a sequence of } \epsilon \text{ transitions} \}$.

The sequence can be of length zero or arbitrarily long. Note that $E(q)$ will always contain $q$.
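$E(q)$ is a plain graph-reachability computation; a Python sketch, using `''` to stand for $\epsilon$ in the transition dict:

```python
def epsilon_closure(delta, q):
    """E(q): every state reachable from q using only epsilon transitions.
    delta maps (state, symbol) to a set of states; '' stands for epsilon."""
    closure, stack = {q}, [q]
    while stack:
        p = stack.pop()
        for r in delta.get((p, ''), set()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure

# Hypothetical epsilon-NFA fragment: 0 --eps--> 1 --eps--> 2, plus 1 --a--> 0
eps = {('0', ''): {'1'}, ('1', ''): {'2'}, ('1', 'a'): {'0'}}
```

The zero-length sequence is covered by initialising the closure with $q$ itself.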

To convert an NFA with $\epsilon$-transitions to a DFA, run the same subset-construction process as for NFAs:

Extended Transition Relation

$$\begin{aligned} \hat{\delta}(q, \epsilon) &= E(q) \\ \hat{\delta}(q, ax) &= \bigcup_{p \in E(q)} \bigcup_{p' \in \delta(p, a)} \hat{\delta}(p', x), \quad a \in \Sigma,\ x \in \Sigma^* \end{aligned}$$

2. Regular Expressions

Patterns

Atomic

Compound

Language Definition

$L(r)$ is the language generated by the regular expression $r$.

Proving Every Language Generated by a Regular Expression is an NFA

Where $a \in \Sigma$:

∅:
---> ◯      ⭘


a:                     ε:
        a                      ε
---> ◯ ---> ⭘          ---> ◯ ---> ⭘


p|q:
                ε             ε
               ---> ◯  p   ◯ ---
        ε    /                   \
---> ◯ ---> ◯                     ⭘
             \  ε             ε  /
               ---> ◯  q   ◯ ---


pq:
        ε             ε              ε
---> ◯ ---> ◯  p  ◯ -----> ◯  q   ◯ ---> ⭘ 


p*:
        ε             ε
---> ◯ ---> ◯  p   ◯ ---> ⭘ 
     |      ^      |      ^
     |      --------      |
     ----------------------

Additionally:

Finite Automata to RegExp

Every language accepted by an NFA is generated by a RegExp. We can successively remove states and replace them with longer and longer RegExps as transition labels. To remove $q_j$, for all $q_i$ and $q_k$ where $i \neq j$ and $k \neq j$:


To remove qj, for every path qi -> qj -> qk, where qi ≠ qj and qk ≠ qj:

      rij           rjk
qi ---------> qj ---------> qk
|            ^  |           ^
|            |__|           |
|            rjj            |
|___________________________|
             rik

    rik | (rij (rjj)* rjk)
qi ----------------------> qk


For two-state loops:

      rij           rjk
qi ---------> qj ---------> qk
 ^    rji     |
 |------------|

Remove q_j: i -> j -> k, and i -> j -> i

        rij rjk
     qi ---------> qk
    ^  |
    |__|
   rij rji


      (rij rji)* rij rjk
qi -----------------------> qk

Minimization of DFA

Equivalence of states pp and qq:

$$\hat{\delta}(p, w) \in F \Leftrightarrow \hat{\delta}(q, w) \in F \text{ for each } w \in \Sigma^*$$

i.e. two states are equivalent if, for every string, starting at $p$ or at $q$ either both runs reach an accept state or both are rejected.

If states are not equivalent, they are distinguishable:

Algorithm to find distinguishable states

Write $p \sim q$ if the two states are equivalent. The $\sim$ relation is:

Thus it is an equivalence relation   A×A\sim \; \subseteq A \times A:
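The algorithm for finding distinguishable states can be sketched as iterative table-filling: first mark pairs where exactly one state accepts, then repeatedly mark pairs that some symbol sends to an already-marked pair. The four-state DFA below is a made-up example in which B and D end up equivalent:

```python
from itertools import combinations

def distinguishable(states, alphabet, delta, F):
    """Return the set of distinguishable (marked) pairs of states."""
    # base case: pairs where exactly one state is accepting
    marked = {frozenset(p) for p in combinations(states, 2)
              if (p[0] in F) != (p[1] in F)}
    changed = True
    while changed:
        changed = False
        for p, q in combinations(states, 2):
            pair = frozenset((p, q))
            if pair in marked:
                continue
            for a in alphabet:
                succ = frozenset((delta[(p, a)], delta[(q, a)]))
                # some symbol leads to an already-distinguishable pair
                if len(succ) == 2 and succ in marked:
                    marked.add(pair)
                    changed = True
                    break
    return marked

# Hypothetical DFA over {0}: A -> B -> C (accept, self-loop), D -> C
d = {('A', '0'): 'B', ('B', '0'): 'C', ('C', '0'): 'C', ('D', '0'): 'C'}
```

Unmarked pairs are equivalent and can be merged when minimizing.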

Minimization Algorithm

Notes:

Decision Problems

Languages $A$, $B$; string $x$:

Membership

Whether the string $x$ is accepted by the language.

Run $M$ with input $x$ and check whether it ends in an accepting state. Linear time in the length of $x$ and the size of $M$.

Emptiness

If no string is accepted by the language.

Reachability to an accepting state. Use BFS or DFS.

Running time is linear in the size of the input $M$: $O(|Q| + |Q||\Sigma|)$.
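A sketch of the emptiness check as a DFS over the transition graph (dict-encoded DFA, as in the earlier sketches):

```python
def is_empty(delta, q0, F, alphabet):
    """L(M) is empty iff no accepting state is reachable from the start state."""
    seen, stack = {q0}, [q0]
    while stack:
        q = stack.pop()
        if q in F:
            return False  # an accept state is reachable
        for a in alphabet:
            r = delta[(q, a)]
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return True

# Hypothetical DFAs: one with a reachable accept state, one without
d_reach = {('s', 'a'): 't', ('t', 'a'): 't'}
d_unreach = {('s', 'a'): 's', ('u', 'a'): 'u'}
```

Each state and transition is visited at most once, giving the linear bound above.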

Finiteness

If the number of strings accepted by the language is finite.

Running time is linear in the size of $M$.

Universality

If the language accepts all possible strings.

True if and only if $\overline{L(M)} = \emptyset$. Construct the complement DFA by inverting accept states, then check emptiness.

Running time is linear.

Intersection Emptiness

True if $L(M) \cap L(M') = \emptyset$.

Reduce the problem to the emptiness problem by constructing the product automaton.

Inclusion

Whether $L(M) \subseteq L(M')$. This holds exactly when $L(M) \cap \overline{L(M')} = \emptyset$.

Hence, take the complement of $M'$ and use the intersection-emptiness property. Running time is quadratic.

Equivalence

Use inclusion property going both ways.

$L(M) = L(M')$ if and only if $L(M) \subseteq L(M')$ and $L(M') \subseteq L(M)$.

Running time is quadratic.

Non-Regular Languages

Not all languages are regular. Counterexample: $A = \{ a^n b^n \mid n \in \mathbb{N} \}$.

Assume $A$ is regular, so it is accepted by some DFA with $k$ states:

Pumping Lemma

If $A$ is regular, there is a number $n$ such that every $w \in A$ with $|w| \ge n$ can be expressed as $w = xyz$ where:

- $y \neq \epsilon$
- $|xy| \le n$
- $xy^i z \in A$ for all $i \ge 0$

Three parts: getting to cycle, looping through the cycle ii times (or zero times), getting to end state.

To show $A = \{ a^n b^n \mid n \in \mathbb{N} \}$ is not regular, first assume it is regular:

Set $w = a^n b^n$. Clearly $|w| \ge n$, so $w = xyz$ where $y \neq \epsilon$ and $|xy| \le n$.

Therefore $y = a^j$, where $1 \le j \le n$. Let $i = 2$:

By the pumping lemma, $xy^i z \in A$, but $xy^2 z = xyyz = a^{n+j} b^n \notin A$: a contradiction, so $A$ is not regular.

Modelling Independent Processes

Shuffling:

Shuffle of a language with itself: an NFA with $|Q|^2$ states, with transitions going horizontally or vertically, but not diagonally: one symbol can only move it in one of the two DFAs. Both are the same, so there is symmetry; the upper or lower triangle can be cut off. The accept states are the pairs $F \times F$.

Product Automaton

The product of two automata accepts the intersection of the two languages: strings that satisfy the conditions of both languages. The languages must have the same alphabet.

Create by drawing a matrix of states, each axis representing states from one automaton (useful to have different naming schemes for each automaton such as 1, 2, 3 etc. for one and a, b, c etc. for the other).

To be in a product state is to be in both component states simultaneously. To determine transitions, determine the state transitions individually, and find the 'coordinates' for the product by concatenating the two. e.g. if you are in state a3 and want the transition for symbol 0, with transition(dfa1, a, 0) = a and transition(dfa2, 3, 0) = 4, draw a transition to the resultant state a4. Accept states are states which are accepting in both automata.

3. Context-Free Grammars

Context-Free Grammars

$$G = (N, \Sigma, P, S)$$

Where:

Productions

A production $(A, \omega) \in P$ is written as $A \rightarrow \omega$.

Multiple productions are written as $A \rightarrow \omega_1 \mid \dots \mid \omega_n$.

The RHS can be empty: in this case, the RHS is $\epsilon$.

The LHS is always one non-terminal but the RHS can be any length and be made up of terminals and non-terminals.

In $E + F \Rightarrow_G^1 E + N$, $G$ denotes the grammar being used and the $1$ denotes that the LHS can be transformed into the RHS in one step.

Derivability

Derivability $\Rightarrow^*$ is a relation on $(N \cup \Sigma)^*$:

Context-Free Languages

Building Blocks

Leftmost and Rightmost

Leftmost: if there are multiple non-terminals, always replace the left-most one first. The choice of production will be determined by the sentence you are trying to get.

If there is more than one leftmost derivation, the sentence is ambiguous. The CFG that generates it is also ambiguous.

The CFL is inherently ambiguous if every CFG generating it is ambiguous, e.g. $\{ a^i b^j c^k \mid i = j \text{ or } j = k \}$.

Regular Grammar

$$G = (N, \Sigma, P, S)$$

For each AωPA \rightarrow \omega \in P:

Every regular language is generated by a regular grammar:

Every regular language is accepted by a DFA, NFA (with or without ϵ\epsilon transitions), regular expression, regular grammar.

Regular languages are a strict subset of CFLs.

Left/Right Regular Grammars

For each production $A \rightarrow \omega \in P$:

Productions with $\omega \in \Sigma N \cup N\Sigma \cup \{ \epsilon \}$ (mixing left- and right-linear rules) or $\omega \in NN \cup \Sigma$ do NOT guarantee regular languages.

Chomsky Normal Form

A CFG $G = (N, \Sigma, P, S)$ is in Chomsky normal form if every production's RHS is made up of:

Every CFG can be transformed into Chomsky normal form (unless $\epsilon \in L(G)$) by eliminating:

  1. $\epsilon$-productions of the form $A \rightarrow \epsilon$
  2. Unit productions of the form $A \rightarrow B$, where a production $B \rightarrow w$ exists
  3. Non-generating non-terminals
  4. Non-reachable non-terminals
  5. Productions with RHS length $\ge 2$ that contain terminals
  6. Productions with RHS length $\ge 3$ (by this point such productions contain only non-terminals; split them into chains of length-2 productions)

If $\epsilon$ is needed:

Cocke-Younger-Kasami Algorithm

Given a string $\omega \in \Sigma^*$ and a CFL $A$ (given by a CFG in Chomsky normal form), check if $\omega \in A$.

In Chomsky normal form, every derivation step either emits a terminal or lengthens the sentential form, so there is an upper bound on the length of derivations that need to be checked (though naively the number of derivations is exponential).

Use dynamic programming to improve speeds.

If the original problem is $S \Rightarrow^* w$, solve $N \Rightarrow^* w_{ij}$ for all non-terminals and for all substrings of $w$, where $i$ and $j$ are the start and end indices, e.g. $abcde_{13}$ is $bc$.

Hence, the algorithm is $O(n^3)$.

If $S \in N_{0, \text{len}(w)}$, the string $w$ is in the language of the grammar.
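A sketch of CYK in Python, assuming the CNF grammar is a dict mapping each non-terminal to its productions (a terminal string, or a pair of non-terminals); the $a^n b^n$ grammar is a made-up example:

```python
def cyk(grammar, start, w):
    """CYK membership test for a CNF grammar."""
    n = len(w)
    if n == 0:
        return False  # CNF grammars cannot derive the empty string
    # table[i][l] holds the non-terminals deriving the length-l substring at i
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, a in enumerate(w):
        for A, prods in grammar.items():
            if a in prods:  # terminal production A -> a
                table[i][1].add(A)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for A, prods in grammar.items():
                    for prod in prods:
                        # binary production A -> BC covering both halves
                        if isinstance(prod, tuple) and prod[0] in table[i][split] \
                                and prod[1] in table[i + split][length - split]:
                            table[i][length].add(A)
    return start in table[0][n]

# Hypothetical CNF grammar for {a^n b^n | n >= 1}:
# S -> AB | AX, X -> SB, A -> a, B -> b
grammar = {'S': [('A', 'B'), ('A', 'X')], 'X': [('S', 'B')],
           'A': ['a'], 'B': ['b']}
```

The four nested loops over length, position, split point and productions give the cubic running time in $n$ noted above.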

Pushdown Automata

Uses a stack. Transitions can depend on symbols at the top of the stack, and modify symbols at the top of the stack. The stack and transition alphabet can be different.

$$M = (Q, \Sigma, \Gamma, \delta, q_0, F)$$

Where:

For a transition to take place, there must be a matching transition for the input symbol, and the character on top of the stack must be the same one as specified in the transition (unless it is $\epsilon$, in which case nothing needs to be read from the stack).

Upon reading, the specified element is popped from the stack and another element is pushed. The syntax is *character*, *stack symbol to pop* / *stack symbol to push*, e.g. $a, \epsilon / 1$ reads $a$, pops nothing and pushes $1$. Strings can be pushed and popped as well (when pushing, the first character of the string is pushed last, so it ends up on top).

For a string to be accepted, $M$ must end in an accept state AND the stack must be empty.

Configuration

$$(q, x, \alpha) \in Q \times \Sigma^* \times \Gamma^*$$

Where:

The configuration is a snapshot of the current state of $M$. The next configuration is:

$$(p, ax, \alpha\gamma) \rightarrow (q, x, \beta\gamma) \text{ if } (q, \beta) \in \delta(p, a, \alpha)$$

where $\gamma \in \Gamma^*$ is the rest of the stack.

$\rightarrow^*$ means the configuration is reachable in zero or more transitions.

The language accepted by $M$:

$$L(M) = \{ x \in \Sigma^* \mid (q_0, x, \epsilon) \rightarrow^* (q, \epsilon, \epsilon),\ q \in F \}$$

Converting CFGs to PDAs

The resultant PDA will have two states: a start state q0q_0 and accept state q1q_1.

A transition from $q_0$ to $q_1$, $\epsilon, \epsilon / S$, pushes the start symbol onto the stack.

$q_1$ will have transitions to itself for each production: for each LHS and RHS, an $\epsilon$-transition reading nothing, popping the LHS and pushing the RHS of the production onto the stack: $\epsilon, \text{LHS} / \text{RHS}$.

For each terminal $\alpha$, there should also be a read transition taking the terminal as input and popping that terminal from the stack: $\alpha, \alpha / \epsilon$.
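The whole construction can be sketched as a function building the PDA's transition table; the triple-key encoding and the $a^n b^n$ grammar below are illustrative choices, not fixed notation:

```python
def cfg_to_pda(productions, terminals, start='S'):
    """Two-state PDA: transitions as (state, input, pop) -> set of (state, push).
    '' stands for epsilon; pushed strings end up with their first character on top."""
    # initial move: push the start symbol
    delta = {('q0', '', ''): {('q1', start)}}
    for lhs, rhss in productions.items():
        for rhs in rhss:
            # expand a non-terminal on top of the stack into one of its RHSs
            delta.setdefault(('q1', '', lhs), set()).add(('q1', rhs))
    for a in terminals:
        # match an input terminal against the same terminal on the stack
        delta.setdefault(('q1', a, a), set()).add(('q1', ''))
    return delta

# Hypothetical grammar S -> aSb | epsilon
pda = cfg_to_pda({'S': ['aSb', '']}, {'a', 'b'})
```

Acceptance then means reaching $q_1$ with empty input and an empty stack.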

4. Compilers

Compilers vs Interpreters

Given a high-level program $P$, input $I$ and output $O$, an interpreter acts like a function converting the program and input into the output: $\text{interpreter}(P, I) \rightarrow O$. On the other hand, the compiler, given the program, outputs low-level machine code $M$ which can then be fed to a processor: $\text{processor}(\text{compiler}(P), I) \rightarrow O$.

Structure of a Compiler

Lexical Analysis

Reads source code character-by-character, and outputs sequence of tokens to feed to parser.

In reality, the scanner is usually called by the parser, producing a token at a time.

Examples:

The scanner needs to distinguish between reserved words and symbols, and identifiers and constants.

Each type of token can be described using regular expressions. Reserved words/symbols may be identified by string literals, while others may be defined via a sequence of definitions.

The sequence of definitions must not be cyclic. Use NFA to do so. Example:

Scanners and Automata Example: Integers

Operation:

Extensions

Extended regular expressions:

Example: C style /**/ Comments

Constructing the minimal DFA: pΣrΣrp\Sigma^* r \Sigma^* r

This can be automated:

Syntax Analysis

Backus-Naur form:

Syntax Trees

The BNF grammar can be ambiguous for a given token syntax. To fix this, the language may specify precedence rules, or the programmer may use parentheses.

In regular expressions, the choice on the left of the ‘or’ will have precedence, and this can be used for precedence. For example:

Types:

BNF Extensions

Syntax Diagram

Recursive-Descent Parsers

Mutually recursive functions: every addition/subtraction operand is a term, which is either a number/identifier or a multiplication/division operation. This hierarchy enforces the order of operations: for an addition to be parsed, all of its multiplications must already have been parsed:

# Expression  =  Term((+|-)Term)*
def expression():
  term() # Parse the first term
  while lookahead() in [ADD, SUB]: # While the next token is plus or minus
    consume(ADD, SUB) # Consume that token: add or subtract
    term()

# Term  =  Factor((*|/)Factor)*
def term():
  factor()
  while lookahead() in [MUL, DIV]:
    consume(MUL, DIV)
    factor()

# Factor  =  (Expression) | number | identifier
def factor():
  if lookahead() == LPAR:
    consume(LPAR)
    expression()
    consume(RPAR)
  elif lookahead() in [NUM, ID]:
    consume(NUM, ID)
  else:
    raise Exception

Extended BNF to Recursive-Descent Parser

Two types of recursion:

Operation:

Even with single token lookahead, there can be ambiguities. The language may need to be changed to resolve this.

Abstract Syntax Tree

Classes for each token type (e.g. Expression, Number, Identifier). A number or identifier needs to store its value; an expression needs to store the left- and right-hand operands and the operator.

Details such as parentheses are omitted: the structure of the tree contains this information.

Previously, there needed to be multiple types of arithmetic nodes e.g. (+, *), but with this, only one type of node is required.

Basically the same as the parser, except it returns an object.

Semantic Analysis

Type systems

Attribute Grammars

An extension of CFGs:

  1. Define functions to calculate attributes, e.g.
$$\text{typeof}(op, t_1, t_2) = \begin{cases} \text{int} & op \in \{ +, -, * \} \text{ and } t_1, t_2 \text{ both integers} \\ \text{float} & \text{otherwise} \end{cases}$$
  2. Define attributes for specific terminals/non-terminals: store the name of the attribute, the domain for the attribute and a list of terminals/non-terminals it applies to
  3. Define rules giving attribute values, e.g.
$$\begin{aligned} E = T \quad & E.\text{type} = T.\text{type} \\ E = EAT \quad & E_0.\text{type} = \text{typeof}(A.\text{op}, E_1.\text{type}, T.\text{type}) \\ A = + \quad & A.\text{op} = + \\ A = - \quad & A.\text{op} = - \end{aligned}$$

Here, $E$ is an addition/subtraction operation, $T$ is a number/identifier/multiplication/division operation, and $A$ is the plus or minus symbol.

Instead of defining it using the star symbol, it has been defined recursively using two rules.

$E$ has the type attribute, so the type must be set, either inline or with a function defined previously.

Synthesized attributes:

Type checking:

Inherited attributes:

Machine-Independent Optimization

Takes place on a syntax tree or other intermediate form. Whatever optimization is done must not change the meaning of the program.

Optimizations often lead to further optimization:

Constant Folding: evaluating arithmetic expressions that use only constants. Use a structure similar to attribute grammars, with a helper function that evaluates arithmetic expressions when the operands are known, and rules that replace such subexpressions with their computed values.
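A sketch of constant folding over a tuple-based AST (the `('op', left, right)` representation is an assumption for illustration):

```python
def fold(node):
    """Constant folding: ('op', left, right) tuples with numbers or
    identifier strings as leaves; all-constant subtrees become their value."""
    if not isinstance(node, tuple):
        return node  # number or identifier: nothing to fold
    op, left, right = node
    left, right = fold(left), fold(right)  # fold bottom-up
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        if op == '+': return left + right
        if op == '-': return left - right
        if op == '*': return left * right
    return (op, left, right)  # operands not both constant: keep the node
```

Folding bottom-up is what lets one optimization enable another: once `('*', 2, 3)` becomes `6`, an enclosing constant expression can fold too.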

Code Generation

Traverses the AST, emitting code as soon as a sufficient portion is processed. It outputs a data structure representing code for a virtual machine, after which it can be further analysed and optimized.

JVM

Stack:

Variables:

Control flow

Machine-Dependent Optimization

Peephole optimization: looks at only a small number of generated instructions, replacing them with a shorter/faster sequence of instructions.

Example: it may read a window of 3 instructions, working its way down one instruction at a time.

5. Computability and Complexity

Decision Problems

$L$ is a language over the alphabet $\Sigma$, and $w \in \Sigma^*$ a string. Whether $w \in L$ is a decision problem.

LL is decidable, reversible, computable, iff there is an algorithm that outputs if wLw \in L or not.

The algorithm must terminate for all inputs. If there is no such algorithm, L is undecidable.

Optimization problems can be encoded into decision problems: for example, to find the length of the shortest path, simply check if there is a path with length less than $k$, incrementing $k$ until you get a yes.

There are uncountably many languages over an alphabet: $|\mathcal{P}(\Sigma^*)| = 2^{|\mathbb{N}|} = |\mathbb{R}|$. However, there are only countably many algorithms: $|\Sigma^*| = |\mathbb{N}|$. Hence, for most languages there is no decision algorithm.

$L$ is semi-decidable (recursively enumerable, partially computable) iff there is an algorithm that outputs yes if $w \in L$ and outputs nothing otherwise, i.e. it does not terminate if $w \notin L$. Hence, you do not know the answer unless it outputs a result.

Every decidable language is also semi-decidable: whenever no would be output, enter an endless loop instead.

Some undecidable languages are semi-decidable.

If $L$ and $\overline{L}$ are both semi-decidable, then $L$ is decidable: run both algorithms simultaneously for each string; one of them is guaranteed to terminate.

Algorithms correspond to functions $f: \Sigma^* \rightarrow \Sigma^*$, mapping an input to an output, both encoded using the alphabet $\Sigma$.

NB: Functions on numbers can be encoded using binary (otherwise, the alphabet would be infinite).

Semi-decidable:

Turing Machines

$$M = (Q, \Sigma, \Gamma, \delta, q_0, \_, q_a, q_r)$$

Where:

The TM operates as follows:

Notes:

Configuration

Storing the current state of the Turing machine: $(x, q, y) \in \Gamma^* \times Q \times \Gamma^*$ where:

Since only one cell on the tape can be modified at a time, a configuration relation can be used to describe transitions, e.g. where $a, b, c \in \Gamma$ and $p, q \in Q$:

$$(xa, p, by) \rightarrow \begin{cases} (x, q, acy) & \delta(p, b) = (q, c, L) \\ (xa, q, cy) & \delta(p, b) = (q, c, N) \\ (xac, q, y) & \delta(p, b) = (q, c, R) \end{cases}$$
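The configuration relation can be executed directly; a Python sketch where the two tape halves are strings and the head scans the first symbol of the right half (`tm_step` and this encoding are illustrative):

```python
def tm_step(x, q, y, delta, blank='_'):
    """One TM step on configuration (x, q, y): x is the tape left of the
    head; the head scans y[0] (the blank symbol if y is empty)."""
    b = y[0] if y else blank
    q2, c, move = delta[(q, b)]  # next state, written symbol, head move
    rest = y[1:]
    if move == 'L':
        # written symbol and the cell to its left join the right half
        return x[:-1], q2, (x[-1] if x else blank) + c + rest
    if move == 'R':
        return x + c, q2, rest
    return x, q2, c + rest  # move == 'N': head stays put
```

Each case matches one line of the relation above, e.g. `tm_step('xa', 'p', 'by', ...)` with an L-move yields `('x', 'q', 'acy')`.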

Notes:

Computation Models

There are many computation models e.g. TMs with multiple tapes, pushdown automata with two stacks etc. Many of these are equivalent: they accept the same languages and compute the same functions.

Church-Turing thesis: the intuitively computable functions (that is, that humans can compute - hard to formalize) are those partially computable by a Turing machine.

Chomsky Hierarchy

$$G = (N, \Sigma, P, S)$$

Where:

Different grammars are obtained by restricting the form of each production $x \rightarrow y$:

Notes:

| Grammar | Language | Automata |
| --- | --- | --- |
| Type-0 | General/recursively enumerable | Turing machine (or NTM) |
| Type-1 | Context-sensitive | Linearly space-bounded non-deterministic automaton |
| Type-2 | Context-free | Push-down automaton (not deterministic PDA) |
| Type-3 | Regular | Deterministic finite automaton (or NFA) |

Undecidability

Programs can be the input of other programs e.g. compiler takes a program as input, Python interpreter.

Special Halting Problem

The following, called the special halting problem, is undecidable:

$$SHP = \{ \langle M \rangle \mid M \text{ halts on input } \langle M \rangle \}$$

$\langle M \rangle$ is the representation of the machine on the tape.

Thus, the SHP is the set of all programs which, when given themselves as input, halt.

Proof
def M_SHP(program_str):
  # Hypothetical decider for SHP: True iff the encoded program halts on its own source
  return True if exec(program_str)(program_str) halts else False

def M_dash(program_str):
  if M_SHP(program_str):
    while True: # loop forever if M_SHP says the program halts on itself
      continue
  return False # halt if M_SHP says it does not halt

M_dash(as_string(M_dash))
# If M_SHP says M_dash halts on itself, M_dash loops forever: it does not halt
# If M_SHP says it does not halt, M_dash returns False: it halts
# Either way M_SHP answered wrongly; hence, M_SHP cannot exist

Halting Problem

$\{ \langle M \rangle \# w \mid M \text{ halts on input } w \}$, where $\#$ is a symbol that does not appear in $\langle M \rangle$, allowing it to act as a separator.

Assume there is a total TM which determines whether a given program halts on a given arbitrary string. If this were possible, solving the special halting problem would be trivial: just pass the specific input $\langle M \rangle$. Since the special halting problem is not decidable, neither is the halting problem.

Some other undecidable languages:

Complexity Classes

Running times of programs:

$t(M, x)$ returns the number of steps the total TM $M$ takes for a given input $x$ until it halts.

A function $f: \mathbb{N} \rightarrow \mathbb{N}$ gives an upper bound on how long a machine may take for a given input size: $T(f)$ is a set of languages; the language decided by a total TM $M$ is in $T(f)$ if $t(M, x) < f(|x|)$ for all inputs $x$.

A problem can be solved efficiently if it can be solved in polynomial time. These languages form the complexity class $P$. Examples include:

Non-Deterministic Turing Machine (NTM)

Some problems known to be in $NP$ (and not known to be in $P$):

Problem Complexity

Compare problem complexities by reducing one to another:

NP-Complete Problems

A problem is $NP$-complete if it is in $NP$ and every other $NP$ problem is polynomial-time reducible to it ($NP$-hard). Thus, if one $NP$-complete problem is solved in polynomial time, then $P = NP$.

Showing that a problem is $NP$-hard is difficult: there are a lot of $NP$ problems.

The satisfiability problem (SAT) was the first problem shown to be $NP$-complete:

Hence, there is a tree of reductions.

CLIQUE

Finding the largest subset of the graph's nodes that is complete: all pairs of nodes in it are connected.

SAT

Given logic formula:

$$\begin{aligned} \text{Formula} = \; & \text{Variable} \\ \mid \; & \text{!Formula} \\ \mid \; & \text{Formula AND Formula} \\ \mid \; & \text{Formula OR Formula} \end{aligned}$$

Output: can the variables be set such that the formula evaluates to true?

On a NTM, guess and verify:
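A deterministic stand-in for guess-and-verify is to enumerate all assignments (the NTM's guesses) and verify each in polynomial time; a brute-force Python sketch over CNF input (the clause encoding is an assumption):

```python
from itertools import product

def brute_force_sat(cnf):
    """cnf is a list of clauses; each clause is a list of (variable, polarity)
    literals, with polarity False meaning a negated variable.
    Tries all 2^n assignments: exponential, unlike the NTM's single guess."""
    variables = sorted({v for clause in cnf for v, _ in clause})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        # verification step: every clause must contain one satisfied literal
        if all(any(assignment[v] == want for v, want in clause) for clause in cnf):
            return assignment  # satisfying assignment found
    return None  # unsatisfiable
```

The verification inside the loop is polynomial; only the enumeration of guesses is exponential.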

3-SAT

Limit the structure of logic formulas using Conjunctive Normal Form (CNF):

$$\begin{aligned} \text{CNF} &= \text{Clause } \text{(AND Clause)}^* \\ \text{Clause} &= \text{Literal } \text{(OR Literal)}^* \\ \text{Literal} &= \text{Variable OR !Variable} \end{aligned}$$

Every formula can be converted into CNF.

If the distributivity rule is used in highly nested formulas there is exponential blow-up as each use of it causes duplication: thus, it cannot be used.

3-SAT

Given a formula in CNF with exactly 3 literals per clause: $\text{Clause} = \text{Literal OR Literal OR Literal}$

The result is in 3-CNF, is obtained in polynomial time, and is satisfiable if and only if the original formula is satisfiable.