All Files in ‘COSC261 (2020-S1)’ Merged

1. DFAs and NFAs

Introduction

Definitions

Concatenation

Strings

$xy$ is $y$ concatenated onto $x$, where $x$ and $y$ are either strings or characters.

Concatenation of strings is associative: brackets are not needed.

Languages

$$AB = \{ xy \mid x \in A \text{ and } y \in B \}$$

i.e. `A.flatMap(a => B.map(b => a + b))`

Language concatenation is associative.

$\{ \epsilon \}$ is the identity language: $\{ \epsilon \}A = A\{ \epsilon \} = A$.

Powers: Strings/Characters

$a^n$ is the string/character $a$ concatenated $n$ times.

Base case: $a^0 = \epsilon$; inductive step: $a^{n+1} = a^n a$.

Powers: Languages

$$\begin{aligned} A^0 &= \{ \epsilon \} \\ A^1 &= A \\ A^* &= \bigcup_{n \in \mathbb{N}} A^n = A^0 \cup A^1 \cup A^2 \cup A^3 \cup \dots \\ A^+ &= A^1 \cup A^2 \cup A^3 \cup \dots \end{aligned}$$

Thus, $A^*$ will always contain the empty string, while $A^+$ will only contain it if it is a member of $A$.
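These definitions translate directly into set operations. A minimal Python sketch, representing a language as a set of strings and truncating the infinite union in $A^*$ at a chosen bound $n$:

```python
def concat(A, B):
    """Language concatenation: every string xy with x in A and y in B."""
    return {x + y for x in A for y in B}

def power(A, n):
    """A^n: A concatenated with itself n times; A^0 is {''}."""
    result = {''}  # the identity language {epsilon}
    for _ in range(n):
        result = concat(result, A)
    return result

def star_upto(A, n):
    """Finite approximation of A*: the union of A^0 .. A^n."""
    out = set()
    for k in range(n + 1):
        out |= power(A, k)
    return out
```

Note that `star_upto` always contains `''`, matching the remark above that $A^*$ always contains the empty string.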

Properties
$$\begin{aligned} A^* A^* &= A^{**} = A^* \\ \emptyset A &= \emptyset = A \emptyset \end{aligned}$$

Deterministic Finite Automata

$$M = (Q, \Sigma, \delta, q_0, F)$$

Where:

$M$ reads the input string $w \in \Sigma^*$ symbol by symbol. For each symbol, the transition function is applied to the current state and symbol. The input string is accepted if $M$ ends in an accept state.

Extended Transition Function

$\hat{\delta}$ extends $\delta$ to take in strings (not just symbols) by processing one symbol at a time and passing the remaining substring on recursively.

$$\begin{aligned} \hat{\delta} &: Q \times \Sigma^* \rightarrow Q \\ \hat{\delta}(q, \epsilon) &= q \\ \hat{\delta}(q, ax) &= \hat{\delta}(\delta(q, a), x) \text{ where } a \in \Sigma,\ x \in \Sigma^* \text{ and the input string is } ax \end{aligned}$$
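A sketch of $\hat{\delta}$ in Python, assuming the transition function is stored as a dict from (state, symbol) to state; the even-number-of-a's DFA is a made-up example:

```python
def delta_hat(delta, q, w):
    """Extended transition function: delta_hat(q, '') = q and
    delta_hat(q, a + x) = delta_hat(delta(q, a), x)."""
    if w == '':
        return q
    return delta_hat(delta, delta[(q, w[0])], w[1:])

# Hypothetical DFA over {a, b} accepting strings with an even number of a's
delta = {('even', 'a'): 'odd', ('even', 'b'): 'even',
         ('odd', 'a'): 'even', ('odd', 'b'): 'odd'}
F = {'even'}
```

A string `w` is accepted iff `delta_hat(delta, 'even', w) in F`.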

All finite languages are regular (e.g. OR together an automaton for every single string), but not all infinite languages are regular.

Regular languages are closed over:

Proving Regular Languages Are Closed under the Complement Operator

Let $A \subseteq \Sigma^*$ be regular. Show $\bar{A}$ is regular.

By definition, $A = L(M)$ for some DFA $M = (Q, \Sigma, \delta, q_0, F)$.

Swap accepting and non-accepting states:

Let $M' = (Q, \Sigma, \delta, q_0, Q - F)$. Now, show that $L(M') = \bar{A}$.

By the definition of acceptance, $L(M') = \{ w \in \Sigma^* \mid \hat{\delta}(q_0, w) \in Q - F \}$.

For any $w \in \Sigma^*$, $w \in L(M')$ holds exactly when:

$$\hat{\delta}(q_0, w) \in Q - F$$

$Q - F$ is the complement of $F$, so: $\hat{\delta}(q_0, w) \notin F$

By definition, $w \in L(M)$ iff $\hat{\delta}(q_0, w) \in F$, which is the negation of the above. Thus $w \in L(M')$ iff:

$$\begin{aligned} w &\notin L(M) \\ w &\in \overline{L(M)} \\ w &\in \bar{A} \end{aligned}$$

Hence $L(M') = \bar{A}$, so $\bar{A}$ is regular.
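The construction itself is tiny: a Python sketch, assuming the dict representation of $\delta$; the even-length DFA is a made-up example:

```python
def run(delta, q0, F, w):
    """Run a DFA and report whether w is accepted."""
    q = q0
    for a in w:
        q = delta[(q, a)]
    return q in F

# Hypothetical DFA over {a} accepting strings of even length
Q = {'even', 'odd'}
delta = {('even', 'a'): 'odd', ('odd', 'a'): 'even'}
F = {'even'}
F_complement = Q - F  # swap accepting and non-accepting states
```

The same transition function with $Q - F$ as accept states accepts exactly the odd-length strings, i.e. the complement.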

Intersection of Two Automata

Let $A, B \subseteq \Sigma^*$ be regular. Show that $A \cap B$ will be regular.

$A = L(M_1)$ for DFA $M_1 = (Q_1, \Sigma, \delta_1, q_1, F_1)$.

$B = L(M_2)$ for DFA $M_2 = (Q_2, \Sigma, \delta_2, q_2, F_2)$.

Only the alphabet is the same between the automata

Idea: keep track of the state $M_1$ is in and the state $M_2$ is in at the same time. Define the state as a pair of two states.

$$\text{Let } M = (Q_1 \times Q_2, \Sigma, \delta, (q_1, q_2), F_1 \times F_2)$$

Now, we need to define the transition function, where $a \in \Sigma$ is the transition symbol, $w \in \Sigma^*$ is the transition string, $p_1 \in Q_1$ and $p_2 \in Q_2$:

$$\begin{aligned} \delta((p_1, p_2), a) &= (\delta_1(p_1, a), \delta_2(p_2, a)) \\ \hat{\delta}((p_1, p_2), w) &= (\widehat{\delta_1}(p_1, w), \widehat{\delta_2}(p_2, w)) \end{aligned}$$

Need to show $L(M) = L(M_1) \cap L(M_2)$. For any string $w \in \Sigma^*$ where $w \in L(M)$:

$$\begin{aligned} \hat{\delta}((q_1, q_2), w) &\in F_1 \times F_2 \\ (\widehat{\delta_1}(q_1, w), \widehat{\delta_2}(q_2, w)) &\in F_1 \times F_2 \\ \widehat{\delta_1}(q_1, w) \in F_1 &\text{ and } \widehat{\delta_2}(q_2, w) \in F_2 \\ w \in L(M_1) &\text{ and } w \in L(M_2) \\ w &\in L(M_1) \cap L(M_2) \text{ as required} \end{aligned}$$
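The product construction can be sketched in Python; the two component DFAs here are hypothetical examples (an even number of a's; ends in b):

```python
def product_dfa(states1, delta1, q1, F1, states2, delta2, q2, F2, alphabet):
    """Product construction: a state is a pair; accept iff both components accept."""
    delta = {}
    for p1 in states1:
        for p2 in states2:
            for a in alphabet:
                delta[((p1, p2), a)] = (delta1[(p1, a)], delta2[(p2, a)])
    accept = {(f1, f2) for f1 in F1 for f2 in F2}
    return delta, (q1, q2), accept

def run(delta, start, accept, w):
    """Run the product DFA on w."""
    q = start
    for a in w:
        q = delta[(q, a)]
    return q in accept

# Hypothetical components: M1 accepts an even number of a's, M2 accepts strings ending in b
d1 = {('even', 'a'): 'odd', ('even', 'b'): 'even',
      ('odd', 'a'): 'even', ('odd', 'b'): 'odd'}
d2 = {('no', 'a'): 'no', ('no', 'b'): 'yes',
      ('yes', 'a'): 'no', ('yes', 'b'): 'yes'}
delta, start, accept = product_dfa({'even', 'odd'}, d1, 'even', {'even'},
                                   {'no', 'yes'}, d2, 'no', {'yes'}, 'ab')
```

The product accepts exactly the strings accepted by both components.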

Union of Two Automata

Let $A, B \subseteq \Sigma^*$ be regular. Show that $A \cup B$ will be regular.

$A \cup B = \overline{\bar{A} \cap \bar{B}}$ by De Morgan's law. There is closure under the complement and intersection, so there must be closure under the union operation.

Non-Deterministic Finite Automata

NFA: a state can have zero or multiple transitions using a single symbol.

$$M = (Q, \Sigma, \delta, q_0, F)$$

$\delta: Q \times \Sigma \rightarrow P(Q)$, where $P(Q)$ is the power set. That is, the transition function will return a set of states. If multiple states are returned, $M$ can move to any of those states. If the empty set is returned, $M$ will get stuck and that route should be ignored.

If any route leads to an accept state, the string should be accepted.

Extended Transition Relation

$$\hat{\delta}: Q \times \Sigma^* \rightarrow P(Q)$$
$$\begin{aligned} \hat{\delta}(q, \epsilon) &= \{ q \} \\ \hat{\delta}(q, ax) &= \bigcup_{p \in \delta(q, a)} \hat{\delta}(p, x), \quad a \in \Sigma,\ x \in \Sigma^* \end{aligned}$$

That is, for every state it can be in, calculate the possible transition(s) for the next symbol, put them all in a set, and repeat until the input string is empty.

$w$ is accepted by $M$ if and only if $\hat{\delta}(q_0, w) \cap F \neq \emptyset$.

The language accepted by MM is the set of all strings where the above is true:

$$L(M) = \{ w \in \Sigma^* \mid \hat{\delta}(q_0, w) \cap F \neq \emptyset \}$$
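The extended transition relation can be computed iteratively; a Python sketch, assuming $\delta$ is a dict from (state, symbol) to a set of states, with missing keys meaning the empty set (the ends-in-01 NFA is a made-up example):

```python
def nfa_delta_hat(delta, q, w):
    """Extended transition relation: the set of all states reachable on input w."""
    states = {q}
    for a in w:
        # union of the successor sets of every state we could currently be in
        states = set().union(*(delta.get((p, a), set()) for p in states))
    return states

# Hypothetical NFA over {0, 1} accepting strings ending in 01
nfa_d = {('q0', '0'): {'q0', 'q1'}, ('q0', '1'): {'q0'}, ('q1', '1'): {'q2'}}
F = {'q2'}
```

A string `w` is accepted iff `nfa_delta_hat(nfa_d, 'q0', w) & F` is non-empty.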

NFA to DFA Conversion Through Subset Construction

DFA states become sets containing all the NFA states the automaton could be in given the transitions. If it can be in the states $\{ q_x, q_y, q_z \}$, denote the DFA state as $q_{xyz}$. A DFA state is accepting if its set contains any element from the accept states set. Add a new reject state (the empty set) with a self-transition under all symbols to ensure that every state has transitions for every symbol.
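A sketch of the subset construction in Python, exploring only reachable subsets; frozensets serve as DFA state names, and the empty frozenset plays the role of the reject state:

```python
def subset_construction(delta, q0, F, alphabet):
    """DFA whose states are (reachable) sets of NFA states.
    delta maps (state, symbol) to a set of states; missing keys mean stuck."""
    start = frozenset({q0})
    dfa_delta, seen, stack = {}, {start}, [start]
    while stack:
        S = stack.pop()
        for a in alphabet:
            # successor set: union over the members of S
            T = frozenset().union(*(delta.get((q, a), set()) for q in S))
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                stack.append(T)
    accept = {S for S in seen if S & F}
    return dfa_delta, start, accept

# Hypothetical NFA over {0, 1} accepting strings ending in 01
nfa_delta = {('q0', '0'): {'q0', 'q1'}, ('q0', '1'): {'q0'}, ('q1', '1'): {'q2'}}
dfa_delta, start, accept = subset_construction(nfa_delta, 'q0', {'q2'}, '01')
```

Only subsets reachable from $\{ q_0 \}$ are generated, so the resulting DFA usually has far fewer than $2^{|Q|}$ states.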

Proving Every NFA is Accepted by a DFA

$$M = (Q, \Sigma, \delta, q_0, F)$$

Idea: keep track of all states the NFA $M$ can be in while reading the input.

Subset automaton (DFA) $M' = (P(Q), \Sigma, \delta', \{ q_0 \}, F')$ where $F' = \{ S \subseteq Q \mid S \cap F \neq \emptyset \}$ (any state whose set contains an element from the accept states).

$$\delta': P(Q) \times \Sigma \rightarrow P(Q)$$
$$\begin{aligned} \delta'(S, a) &= \bigcup_{q \in S} \delta(q, a) \\ \hat{\delta}'(S, w) &= \bigcup_{q \in S} \hat{\delta}(q, w) \end{aligned}$$

Now show that $L(M') = L(M)$. For any $w \in \Sigma^*$, if $w \in L(M')$:

By the definition of $L(M')$: $\hat{\delta}'(\{ q_0 \}, w) \in F'$, i.e. the set of reachable states contains an accept state.

$$\begin{aligned} \hat{\delta}'(\{ q_0 \}, w) \cap F &\neq \emptyset \\ \left( \bigcup_{q \in \{ q_0 \}} \hat{\delta}(q, w) \right) \cap F &\neq \emptyset \\ \hat{\delta}(q_0, w) \cap F &\neq \emptyset \end{aligned}$$

Hence, $w \in L(M)$.

NFAs with $\epsilon$-Transitions

Example use case: the union of two regular languages: one new start state, with two $\epsilon$-transitions, one to each language's automaton.

$$\epsilon \notin \Sigma \\ \delta: Q \times (\Sigma \cup \{ \epsilon \}) \rightarrow P(Q)$$

$\epsilon$-closure of $q$: $E(q) = \{ p \in Q \mid p \text{ is reachable from } q \text{ with a sequence of } \epsilon \text{ transitions} \}$.

The sequence can be of length zero or arbitrarily long. Note that $E(q)$ will always contain $q$.
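$E(q)$ is a plain graph-reachability computation; a Python sketch, using `''` to stand for $\epsilon$ in the transition dict:

```python
def epsilon_closure(delta, q):
    """E(q): every state reachable from q using only epsilon transitions.
    delta maps (state, symbol) to a set of states; '' stands for epsilon."""
    closure, stack = {q}, [q]
    while stack:
        p = stack.pop()
        for r in delta.get((p, ''), set()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure

# Hypothetical epsilon-NFA fragment: 0 --eps--> 1 --eps--> 2, plus 1 --a--> 0
eps = {('0', ''): {'1'}, ('1', ''): {'2'}, ('1', 'a'): {'0'}}
```

The zero-length sequence is covered by initialising the closure with $q$ itself.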

To convert an NFA with $\epsilon$-transitions to a DFA, run the same subset-construction process as for NFAs:

Extended Transition Relation

$$\begin{aligned} \hat{\delta}(q, \epsilon) &= E(q) \\ \hat{\delta}(q, ax) &= \bigcup_{p \in E(q)} \bigcup_{p' \in \delta(p, a)} \hat{\delta}(p', x), \quad a \in \Sigma,\ x \in \Sigma^* \end{aligned}$$

2. Regular Expressions

Patterns

Atomic

Compound

Language Definition

$L(r)$ is the language generated by the regular expression $r$.

Proving Every Language Generated by a Regular Expression is an NFA

Where $a \in \Sigma$:

∅:
---> ◯      ⭘


a:                     ε:
        a                      ε
---> ◯ ---> ⭘          ---> ◯ ---> ⭘


p|q:
                ε             ε
               ---> ◯  p   ◯ ---
        ε    /                   \
---> ◯ ---> ◯                     ⭘
             \  ε             ε  /
               ---> ◯  q   ◯ ---


pq:
        ε             ε              ε
---> ◯ ---> ◯  p  ◯ -----> ◯  q   ◯ ---> ⭘ 


p*:
        ε             ε
---> ◯ ---> ◯  p   ◯ ---> ⭘ 
     |      ^      |      ^
     |      --------      |
     ----------------------

Additionally:

Finite Automata to RegExp

Every language accepted by an NFA is generated by a RegExp. We can successively remove states and replace them with longer and longer RegExps as transition labels. To remove $q_j$, for all $q_i$ and $q_k$ where $i \neq j$ and $k \neq j$:


To remove qj, for every path qi -> qj -> qk, where qi ≠ qj and qk ≠ qj:

      rij           rjk
qi ---------> qj ---------> qk
|            ^  |           ^
|            |__|           |
|            rjj            |
|___________________________|
             rik

    rik | (rij (rjj)* rjk)
qi ----------------------> qk


For two-state loops:

      rij           rjk
qi ---------> qj ---------> qk
 ^    rji     |
 |------------|

Remove q_j: i -> j -> k, and i -> j -> i

        rij rjk
     qi ---------> qk
    ^  |
    |__|
   rij rji


      (rij rji)* rij rjk
qi -----------------------> qk

Minimization of DFA

Equivalence of states pp and qq:

$$\hat{\delta}(p, w) \in F \Leftrightarrow \hat{\delta}(q, w) \in F \text{ for each } w \in \Sigma^*$$

i.e. two states are equivalent if, for every string, starting at $p$ or at $q$ either both runs reach an accept state or both are rejected.

If states are not equivalent, they are distinguishable:

Algorithm to find distinguishable states

Write $p \sim q$ if the two states are equivalent. The $\sim$ relation is:

Thus it is an equivalence relation   A×A\sim \; \subseteq A \times A:
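The algorithm for finding distinguishable states can be sketched as iterative table-filling: first mark pairs where exactly one state accepts, then repeatedly mark pairs that some symbol sends to an already-marked pair. The four-state DFA below is a made-up example in which B and D end up equivalent:

```python
from itertools import combinations

def distinguishable(states, alphabet, delta, F):
    """Return the set of distinguishable (marked) pairs of states."""
    # base case: pairs where exactly one state is accepting
    marked = {frozenset(p) for p in combinations(states, 2)
              if (p[0] in F) != (p[1] in F)}
    changed = True
    while changed:
        changed = False
        for p, q in combinations(states, 2):
            pair = frozenset((p, q))
            if pair in marked:
                continue
            for a in alphabet:
                succ = frozenset((delta[(p, a)], delta[(q, a)]))
                # some symbol leads to an already-distinguishable pair
                if len(succ) == 2 and succ in marked:
                    marked.add(pair)
                    changed = True
                    break
    return marked

# Hypothetical DFA over {0}: A -> B -> C (accept, self-loop), D -> C
d = {('A', '0'): 'B', ('B', '0'): 'C', ('C', '0'): 'C', ('D', '0'): 'C'}
```

Unmarked pairs are equivalent and can be merged when minimizing.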

Minimization Algorithm

Notes:

Decision Problems

Languages $A$, $B$; string $x$:

Membership

Whether the string $x$ is accepted by the language.

Run $M$ with input $x$ and check whether it ends in an accepting state. Linear time in the length of $x$ and the size of $M$.

Emptiness

If no string is accepted by the language.

Reachability to an accepting state. Use BFS or DFS.

Running time is linear in the size of the input $M$: $O(|Q| + |Q||\Sigma|)$.
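A sketch of the emptiness check as a DFS over the transition graph (dict-encoded DFA, as in the earlier sketches):

```python
def is_empty(delta, q0, F, alphabet):
    """L(M) is empty iff no accepting state is reachable from the start state."""
    seen, stack = {q0}, [q0]
    while stack:
        q = stack.pop()
        if q in F:
            return False  # an accept state is reachable
        for a in alphabet:
            r = delta[(q, a)]
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return True

# Hypothetical DFAs: one with a reachable accept state, one without
d_reach = {('s', 'a'): 't', ('t', 'a'): 't'}
d_unreach = {('s', 'a'): 's', ('u', 'a'): 'u'}
```

Each state and transition is visited at most once, giving the linear bound above.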

Finiteness

If the number of strings accepted by the language is finite.

Running time is linear in the size of $M$.

Universality

If the language accepts all possible strings.

True if and only if $\overline{L(M)} = \emptyset$. Construct the complement DFA by inverting accept states, then check emptiness.

Running time is linear.

Intersection Emptiness

True if $L(M) \cap L(M') = \emptyset$.

Reduce the problem to the emptiness problem by constructing the product automaton.

Inclusion

Whether $L(M) \subseteq L(M')$. This holds exactly when $L(M) \cap \overline{L(M')} = \emptyset$.

Hence, take the complement of $M'$ and use the intersection-emptiness property. Running time is quadratic.

Equivalence

Use inclusion property going both ways.

$L(M) = L(M')$ if and only if $L(M) \subseteq L(M')$ and $L(M') \subseteq L(M)$.

Running time is quadratic.

Non-Regular Languages

Not all languages are regular. Counterexample: $A = \{ a^n b^n \mid n \in \mathbb{N} \}$.

Assume $A$ is regular, so it is accepted by some DFA with $k$ states:

Pumping Lemma

If $A$ is regular, there is a number $n$ such that every $w \in A$ with $|w| \ge n$ can be expressed as $w = xyz$ where:

- $y \neq \epsilon$
- $|xy| \le n$
- $xy^i z \in A$ for all $i \ge 0$

Three parts: getting to cycle, looping through the cycle ii times (or zero times), getting to end state.

To show $A = \{ a^n b^n \mid n \in \mathbb{N} \}$ is not regular, first assume it is regular:

Set $w = a^n b^n$. Clearly $|w| \ge n$, so $w = xyz$ where $y \neq \epsilon$ and $|xy| \le n$.

Therefore $y = a^j$, where $1 \le j \le n$. Let $i = 2$:

By the pumping lemma, $xy^i z \in A$, but $xy^2 z = xyyz = a^{n+j} b^n \notin A$: a contradiction, so $A$ is not regular.

Modelling Independent Processes

Shuffling:

Shuffle of a language with itself: an NFA with $|Q|^2$ states, with transitions going horizontally or vertically, but not diagonally: one symbol can only move it in one of the two DFAs. Both are the same, so there is symmetry; the upper or lower triangle can be cut off. The accept states are the pairs $F \times F$.

Product Automaton

The product of two automata accepts the intersection of the two languages: strings that satisfy the conditions of both languages. The languages must have the same alphabet.

Create by drawing a matrix of states, each axis representing states from one automaton (useful to have different naming schemes for each automaton such as 1, 2, 3 etc. for one and a, b, c etc. for the other).

To be in a product state is to be in both component states simultaneously. To determine transitions, determine the state transitions individually, and find the 'coordinates' for the product by concatenating the two. e.g. if you are in state a3 and want the transition for symbol 0, with transition(dfa1, a, 0) = a and transition(dfa2, 3, 0) = 4, draw a transition to the resultant state a4. Accept states are states which are accepting in both automata.

3. Context-Free Grammars

Context-Free Grammars

$$G = (N, \Sigma, P, S)$$

Where:

Productions

A production $(A, \omega) \in P$ is written as $A \rightarrow \omega$.

Multiple productions are written as $A \rightarrow \omega_1 \mid \dots \mid \omega_n$.

The RHS can be empty: in this case, the RHS is $\epsilon$.

The LHS is always one non-terminal but the RHS can be any length and be made up of terminals and non-terminals.

In $E + F \Rightarrow_G^1 E + N$, $G$ denotes the grammar being used and the $1$ denotes that the LHS can be transformed into the RHS in one step.

Derivability

Derivability $\Rightarrow^*$ is a relation on $(N \cup \Sigma)^*$:

Context-Free Languages

Building Blocks

Leftmost and Rightmost

Leftmost: if there are multiple non-terminals, always replace the left-most one first. The choice of production will be determined by the sentence you are trying to get.

If there is more than one leftmost derivation, the sentence is ambiguous. The CFG that generates it is also ambiguous.

The CFL is inherently ambiguous if every CFG generating it is ambiguous, e.g. $\{ a^i b^j c^k \mid i = j \text{ or } j = k \}$.

Regular Grammar

$$G = (N, \Sigma, P, S)$$

For each AωPA \rightarrow \omega \in P:

Every regular language is generated by a regular grammar:

Every regular language is accepted by a DFA, NFA (with or without ϵ\epsilon transitions), regular expression, regular grammar.

Regular languages are a strict subset of CFLs.

Left/Right Regular Grammars

For each production $A \rightarrow \omega \in P$:

Productions with $\omega \in \Sigma N \cup N\Sigma \cup \{ \epsilon \}$ (mixing left- and right-linear rules) or $\omega \in NN \cup \Sigma$ do NOT guarantee regular languages.

Chomsky Normal Form

A CFG $G = (N, \Sigma, P, S)$ is in Chomsky normal form if every production's RHS is made up of:

Every CFG can be transformed into Chomsky normal form (unless $\epsilon \in L(G)$) by eliminating:

  1. $\epsilon$-productions of the form $A \rightarrow \epsilon$
  2. Unit productions of the form $A \rightarrow B$, where a production $B \rightarrow w$ exists
  3. Non-generating non-terminals
  4. Non-reachable non-terminals
  5. Productions with RHS length $\ge 2$ that contain terminals
  6. Productions with RHS length $\ge 3$ (by this point such productions contain only non-terminals; split them into chains of length-2 productions)

If $\epsilon$ is needed:

Cocke-Younger-Kasami Algorithm

Given a string $\omega \in \Sigma^*$ and a CFL $A$ (given by a CFG in Chomsky normal form), check if $\omega \in A$.

In Chomsky normal form, every derivation step either emits a terminal or lengthens the sentential form, so there is an upper bound on the length of derivations that need to be checked (though naively the number of derivations is exponential).

Use dynamic programming to improve speeds.

If the original problem is $S \Rightarrow^* w$, solve $N \Rightarrow^* w_{ij}$ for all non-terminals and for all substrings of $w$, where $i$ and $j$ are the start and end indices, e.g. $abcde_{13}$ is $bc$.

Hence, the algorithm is $O(n^3)$.

If $S \in N_{0, \text{len}(w)}$, the string $w$ is in the language of the grammar.
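A sketch of CYK in Python, assuming the CNF grammar is a dict mapping each non-terminal to its productions (a terminal string, or a pair of non-terminals); the $a^n b^n$ grammar is a made-up example:

```python
def cyk(grammar, start, w):
    """CYK membership test for a CNF grammar."""
    n = len(w)
    if n == 0:
        return False  # CNF grammars cannot derive the empty string
    # table[i][l] holds the non-terminals deriving the length-l substring at i
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, a in enumerate(w):
        for A, prods in grammar.items():
            if a in prods:  # terminal production A -> a
                table[i][1].add(A)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for A, prods in grammar.items():
                    for prod in prods:
                        # binary production A -> BC covering both halves
                        if isinstance(prod, tuple) and prod[0] in table[i][split] \
                                and prod[1] in table[i + split][length - split]:
                            table[i][length].add(A)
    return start in table[0][n]

# Hypothetical CNF grammar for {a^n b^n | n >= 1}:
# S -> AB | AX, X -> SB, A -> a, B -> b
grammar = {'S': [('A', 'B'), ('A', 'X')], 'X': [('S', 'B')],
           'A': ['a'], 'B': ['b']}
```

The four nested loops over length, position, split point and productions give the cubic running time in $n$ noted above.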

Pushdown Automata

Uses a stack. Transitions can depend on symbols at the top of the stack, and modify symbols at the top of the stack. The stack and transition alphabet can be different.

$$M = (Q, \Sigma, \Gamma, \delta, q_0, F)$$

Where:

For a transition to take place, there must be a matching transition for the input symbol, and the character on top of the stack must be the same one as specified in the transition (unless it is $\epsilon$, in which case nothing needs to be read from the stack).

Upon reading, the specified element is popped from the stack and another element is pushed. The syntax is *character*, *stack symbol to pop* / *stack symbol to push*, e.g. $a, \epsilon / 1$ reads $a$, pops nothing and pushes $1$. Strings can be pushed and popped as well (when pushing, the first character of the string is pushed last, so it ends up on top).

For a string to be accepted, $M$ must end in an accept state AND the stack must be empty.

Configuration

$$(q, x, \alpha) \in Q \times \Sigma^* \times \Gamma^*$$

Where:

The configuration is a snapshot of the current state of $M$. The next configuration is:

$$(p, ax, \alpha\gamma) \rightarrow (q, x, \beta\gamma) \text{ if } (q, \beta) \in \delta(p, a, \alpha)$$

where $\gamma \in \Gamma^*$ is the rest of the stack.

$\rightarrow^*$ means the configuration is reachable in zero or more transitions.

The language accepted by $M$:

$$L(M) = \{ x \in \Sigma^* \mid (q_0, x, \epsilon) \rightarrow^* (q, \epsilon, \epsilon),\ q \in F \}$$

Converting CFGs to PDAs

The resultant PDA will have two states: a start state q0q_0 and accept state q1q_1.

A transition from $q_0$ to $q_1$, $\epsilon, \epsilon / S$, pushes the start symbol onto the stack.

$q_1$ will have transitions to itself for each production: for each LHS and RHS, an $\epsilon$-transition reading nothing, popping the LHS and pushing the RHS of the production onto the stack: $\epsilon, \text{LHS} / \text{RHS}$.

For each terminal $\alpha$, there should also be a read transition taking the terminal as input and popping that terminal from the stack: $\alpha, \alpha / \epsilon$.
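The whole construction can be sketched as a function building the PDA's transition table; the triple-key encoding and the $a^n b^n$ grammar below are illustrative choices, not fixed notation:

```python
def cfg_to_pda(productions, terminals, start='S'):
    """Two-state PDA: transitions as (state, input, pop) -> set of (state, push).
    '' stands for epsilon; pushed strings end up with their first character on top."""
    # initial move: push the start symbol
    delta = {('q0', '', ''): {('q1', start)}}
    for lhs, rhss in productions.items():
        for rhs in rhss:
            # expand a non-terminal on top of the stack into one of its RHSs
            delta.setdefault(('q1', '', lhs), set()).add(('q1', rhs))
    for a in terminals:
        # match an input terminal against the same terminal on the stack
        delta.setdefault(('q1', a, a), set()).add(('q1', ''))
    return delta

# Hypothetical grammar S -> aSb | epsilon
pda = cfg_to_pda({'S': ['aSb', '']}, {'a', 'b'})
```

Acceptance then means reaching $q_1$ with empty input and an empty stack.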

4. Compilers

Compilers vs Interpreters

Given a high-level program $P$, input $I$ and output $O$, an interpreter acts like a function converting the program and input into the output: $\text{interpreter}(P, I) \rightarrow O$. On the other hand, the compiler, given the program, outputs low-level machine code $M$ which can then be fed to a processor: $\text{processor}(\text{compiler}(P), I) \rightarrow O$.

Structure of a Compiler

Lexical Analysis

Reads source code character-by-character, and outputs sequence of tokens to feed to parser.

In reality, the scanner is usually called by the parser, producing a token at a time.

Examples:

The scanner needs to distinguish between reserved words and symbols, and identifiers and constants.

Each type of token can be described using regular expressions. Reserved words/symbols may be identified by string literals, while others may be defined via a sequence of definitions.

The sequence of definitions must not be cyclic. Use NFA to do so. Example:

Scanners and Automata Example: Integers

Operation:

Extensions

Extended regular expressions:

Example: C style /**/ Comments

Constructing the minimal DFA: pΣrΣrp\Sigma^* r \Sigma^* r

This can be automated:

Syntax Analysis

Backus-Naur form:

Syntax Trees

The BNF grammar can be ambiguous for a given token syntax. To fix this, the language may specify precedence rules, or the programmer may use parentheses.

In regular expressions, the choice on the left of the ‘or’ will have precedence, and this can be used for precedence. For example:

Types:

BNF Extensions

Syntax Diagram

Recursive-Descent Parsers

Mutually recursive functions: every addition/subtraction operand is a term, which is either a number/identifier or a multiplication/division operation. This hierarchy enforces the order of operations: for an addition to be parsed, all of its multiplications must already have been parsed:

# Expression  =  Term((+|-)Term)*
def expression():
  term() # Parse the first term
  while lookahead() in [ADD, SUB]: # While the next token is plus or minus
    consume(ADD, SUB) # Consume that token: add or subtract
    term()

# Term  =  Factor((*|/)Factor)*
def term():
  factor()
  while lookahead() in [MUL, DIV]:
    consume(MUL, DIV)
    factor()

# Factor  =  (Expression) | number | identifier
def factor():
  if lookahead() == LPAR:
    consume(LPAR)
    expression()
    consume(RPAR)
  elif lookahead() in [NUM, ID]:
    consume(NUM, ID)
  else:
    raise Exception

Extended BNF to Recursive-Descent Parser

Two types of recursion:

Operation:

Even with single token lookahead, there can be ambiguities. The language may need to be changed to resolve this.

Abstract Syntax Tree

Classes for each token type (e.g. Expression, Number, Identifier). A number or identifier needs to store its value; an expression needs to store the left- and right-hand operands and the operator.

Details such as parentheses are omitted: the structure of the tree contains this information.

Previously, there needed to be multiple types of arithmetic nodes e.g. (+, *), but with this, only one type of node is required.

Basically the same as the parser, except it returns an object.

Semantic Analysis

Type systems

Attribute Grammars

An extension of CFGs:

  1. Define functions to calculate attributes, e.g.
$$\text{typeof}(op, t_1, t_2) = \begin{cases} \text{int} & op \in \{ +, -, * \} \text{ and } t_1, t_2 \text{ both integers} \\ \text{float} & \text{otherwise} \end{cases}$$
  2. Define attributes for specific terminals/non-terminals: store the name of the attribute, the domain for the attribute and a list of terminals/non-terminals it applies to
  3. Define rules giving attribute values, e.g.
$$\begin{aligned} E = T \quad & E.\text{type} = T.\text{type} \\ E = EAT \quad & E_0.\text{type} = \text{typeof}(A.\text{op}, E_1.\text{type}, T.\text{type}) \\ A = + \quad & A.\text{op} = + \\ A = - \quad & A.\text{op} = - \end{aligned}$$

Here, $E$ is an addition/subtraction operation, $T$ is a number/identifier/multiplication/division operation, and $A$ is the plus or minus symbol.

Instead of defining it using the star symbol, it has been defined recursively using two rules.

$E$ has the type attribute, so the type must be set, either inline or with a function defined previously.

Synthesized attributes:

Type checking:

Inherited attributes:

Machine-Independent Optimization

Takes place on a syntax tree or other intermediate form. Whatever optimization is done must not change the meaning of the program.

Optimizations often lead to further optimization:

Constant Folding: evaluating arithmetic expressions that use only constants. Use a structure similar to attribute grammars, with a helper function that evaluates arithmetic expressions when the operands are known, and rules that replace such subexpressions with their computed values.
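A sketch of constant folding over a tuple-based AST (the `('op', left, right)` representation is an assumption for illustration):

```python
def fold(node):
    """Constant folding: ('op', left, right) tuples with numbers or
    identifier strings as leaves; all-constant subtrees become their value."""
    if not isinstance(node, tuple):
        return node  # number or identifier: nothing to fold
    op, left, right = node
    left, right = fold(left), fold(right)  # fold bottom-up
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        if op == '+': return left + right
        if op == '-': return left - right
        if op == '*': return left * right
    return (op, left, right)  # operands not both constant: keep the node
```

Folding bottom-up is what lets one optimization enable another: once `('*', 2, 3)` becomes `6`, an enclosing constant expression can fold too.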

Code Generation

Traverses the AST, emitting code as soon as a sufficient portion is processed. It outputs a data structure representing code for a virtual machine, after which it can be further analysed and optimized.

JVM

Stack:

Variables:

Control flow

Machine-Dependent Optimization

Peephole optimization: looks at only a small number of generated instructions, replacing them with a shorter/faster sequence of instructions.

Example: it may read a window of 3 instructions, working its way down one instruction at a time.

5. Computability and Complexity

Decision Problems

$L$ is a language over the alphabet $\Sigma$, and $w \in \Sigma^*$ a string. Whether $w \in L$ is a decision problem.

LL is decidable, reversible, computable, iff there is an algorithm that outputs if wLw \in L or not.

The algorithm must terminate for all inputs. If there is no such algorithm, L is undecidable.

Optimization problems can be encoded into decision problems: for example, to find the length of the shortest path, simply check if there is a path with length less than $k$, incrementing $k$ until you get a yes.

There are uncountably many languages over an alphabet: $|\mathcal{P}(\Sigma^*)| = 2^{|\mathbb{N}|} = |\mathbb{R}|$. However, there are only countably many algorithms: $|\Sigma^*| = |\mathbb{N}|$. Hence, for most languages there is no decision algorithm.

$L$ is semi-decidable (recursively enumerable, partially computable) iff there is an algorithm that outputs yes if $w \in L$ and outputs nothing otherwise, i.e. it does not terminate if $w \notin L$. Hence, you do not know the answer unless it outputs a result.

Every decidable language is also semi-decidable: whenever no would be output, enter an endless loop instead.

Some undecidable languages are semi-decidable.

If $L$ and $\overline{L}$ are both semi-decidable, then $L$ is decidable: run both algorithms simultaneously for each string; one of them is guaranteed to terminate.

Algorithms correspond to functions $f: \Sigma^* \rightarrow \Sigma^*$, mapping an input to an output, both encoded using the alphabet $\Sigma$.

NB: Functions on numbers can be encoded using binary (otherwise, the alphabet would be infinite).

Semi-decidable:

Turing Machines

$$M = (Q, \Sigma, \Gamma, \delta, q_0, \_, q_a, q_r)$$

Where:

The TM operates as follows:

Notes:

Configuration

Storing the current state of the Turing machine: $(x, q, y) \in \Gamma^* \times Q \times \Gamma^*$ where:

Since only one cell on the tape can be modified at a time, a configuration relation can be used to describe transitions, e.g. where $a, b, c \in \Gamma$ and $p, q \in Q$:

$$(xa, p, by) \rightarrow \begin{cases} (x, q, acy) & \delta(p, b) = (q, c, L) \\ (xa, q, cy) & \delta(p, b) = (q, c, N) \\ (xac, q, y) & \delta(p, b) = (q, c, R) \end{cases}$$
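The configuration relation can be executed directly; a Python sketch where the two tape halves are strings and the head scans the first symbol of the right half (`tm_step` and this encoding are illustrative):

```python
def tm_step(x, q, y, delta, blank='_'):
    """One TM step on configuration (x, q, y): x is the tape left of the
    head; the head scans y[0] (the blank symbol if y is empty)."""
    b = y[0] if y else blank
    q2, c, move = delta[(q, b)]  # next state, written symbol, head move
    rest = y[1:]
    if move == 'L':
        # written symbol and the cell to its left join the right half
        return x[:-1], q2, (x[-1] if x else blank) + c + rest
    if move == 'R':
        return x + c, q2, rest
    return x, q2, c + rest  # move == 'N': head stays put
```

Each case matches one line of the relation above, e.g. `tm_step('xa', 'p', 'by', ...)` with an L-move yields `('x', 'q', 'acy')`.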

Notes:

Computation Models

There are many computation models e.g. TMs with multiple tapes, pushdown automata with two stacks etc. Many of these are equivalent: they accept the same languages and compute the same functions.

Church-Turing thesis: the intuitively computable functions (that is, that humans can compute - hard to formalize) are those partially computable by a Turing machine.

Chomsky Hierarchy

$$G = (N, \Sigma, P, S)$$

Where:

Different grammars are obtained by restricting the form of each production $x \rightarrow y$:

Notes:

| Grammar | Language | Automata |
| --- | --- | --- |
| Type-0 | General/recursively enumerable | Turing machine (or NTM) |
| Type-1 | Context-sensitive | Linearly space-bounded non-deterministic automaton |
| Type-2 | Context-free | Push-down automaton (not deterministic PDA) |
| Type-3 | Regular | Deterministic finite automaton (or NFA) |

Undecidability

Programs can be the input of other programs e.g. compiler takes a program as input, Python interpreter.

Special Halting Problem

The following, called the special halting problem, is undecidable:

$$SHP = \{ \langle M \rangle \mid M \text{ halts on input } \langle M \rangle \}$$

$\langle M \rangle$ is the representation of the machine on the tape.

Thus, the SHP is the set of all programs which, when given themselves as input, halt.

Proof
def M_SHP(program_str):
  # Hypothetical decider for SHP: True iff the encoded program halts on its own source
  return True if exec(program_str)(program_str) halts else False

def M_dash(program_str):
  if M_SHP(program_str):
    while True: # loop forever if M_SHP says the program halts on itself
      continue
  return False # halt if M_SHP says it does not halt

M_dash(as_string(M_dash))
# If M_SHP says M_dash halts on itself, M_dash loops forever: it does not halt
# If M_SHP says it does not halt, M_dash returns False: it halts
# Either way M_SHP answered wrongly; hence, M_SHP cannot exist

Halting Problem

$\{ \langle M \rangle \# w \mid M \text{ halts on input } w \}$, where $\#$ is a symbol that does not appear in $\langle M \rangle$, allowing it to act as a separator.

Assume there is a total TM which determines whether a given program halts on a given arbitrary string. If this were possible, solving the special halting problem would be trivial: just pass the specific input $\langle M \rangle$. Since the special halting problem is not decidable, neither is the halting problem.

Some other undecidable languages:

Complexity Classes

Running times of programs:

$t(M, x)$ returns the number of steps the total TM $M$ takes for a given input $x$ until it halts.

A function $f: \mathbb{N} \rightarrow \mathbb{N}$ gives an upper bound on how long a machine may take for a given input size: $T(f)$ is a set of languages; the language decided by a total TM $M$ is in $T(f)$ if $t(M, x) < f(|x|)$ for all inputs $x$.

A problem can be solved efficiently if it can be solved in polynomial time. These languages form the complexity class $P$. Examples include:

Non-Deterministic Turing Machine (NTM)

Some problems known to be in $NP$ (and not known to be in $P$):

Problem Complexity

Compare problem complexities by reducing one to another:

NP-Complete Problems

A problem is $NP$-complete if it is in $NP$ and every other $NP$ problem is polynomial-time reducible to it ($NP$-hard). Thus, if one $NP$-complete problem is solved in polynomial time, then $P = NP$.

Showing that a problem is $NP$-hard is difficult: there are a lot of $NP$ problems.

The satisfiability problem (SAT) was the first problem shown to be $NP$-complete:

Hence, there is a tree of reductions.

CLIQUE

Finding the largest subset of the graph's nodes that is complete: all pairs of nodes in it are connected.

SAT

Given logic formula:

$$\begin{aligned} \text{Formula} = \; & \text{Variable} \\ \mid \; & \text{!Formula} \\ \mid \; & \text{Formula AND Formula} \\ \mid \; & \text{Formula OR Formula} \end{aligned}$$

Output: can the variables be set such that the formula evaluates to true?

On a NTM, guess and verify:
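A deterministic stand-in for guess-and-verify is to enumerate all assignments (the NTM's guesses) and verify each in polynomial time; a brute-force Python sketch over CNF input (the clause encoding is an assumption):

```python
from itertools import product

def brute_force_sat(cnf):
    """cnf is a list of clauses; each clause is a list of (variable, polarity)
    literals, with polarity False meaning a negated variable.
    Tries all 2^n assignments: exponential, unlike the NTM's single guess."""
    variables = sorted({v for clause in cnf for v, _ in clause})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        # verification step: every clause must contain one satisfied literal
        if all(any(assignment[v] == want for v, want in clause) for clause in cnf):
            return assignment  # satisfying assignment found
    return None  # unsatisfiable
```

The verification inside the loop is polynomial; only the enumeration of guesses is exponential.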

3-SAT

Limit the structure of logic formulas using Conjunctive Normal Form (CNF):

$$\begin{aligned} \text{CNF} &= \text{Clause } \text{(AND Clause)}^* \\ \text{Clause} &= \text{Literal } \text{(OR Literal)}^* \\ \text{Literal} &= \text{Variable OR !Variable} \end{aligned}$$

Every formula can be converted into CNF.

If the distributivity rule is used in highly nested formulas there is exponential blow-up as each use of it causes duplication: thus, it cannot be used.

3-SAT

Given a formula in CNF with exactly 3 literals per clause: $\text{Clause} = \text{Literal OR Literal OR Literal}$

The result is in 3-CNF, is obtained in polynomial time, and is satisfiable if and only if the original formula is satisfiable.