3. Context-Free Grammars

A context-free grammar (CFG) is a 4-tuple:

G = (N, \Sigma, P, S)

Where:

$N$ is a finite set of non-terminals. By convention, they are denoted by upper-case letters
$\Sigma$ is a finite set of terminals that is disjoint from $N$
$P \subseteq N \times {(N \cup \Sigma)}^*$ is a finite set of productions
$S \in N$ is the start symbol

Productions

A production $(A, \omega) \in P$ is written as $A \rightarrow \omega$.

Multiple productions with the same LHS are written as $A \rightarrow \omega_1 | \dots | \omega_n$.

The RHS can be empty: in this case, the RHS is $\epsilon$ .

The LHS is always one non-terminal but the RHS can be any length and be made up of terminals and non-terminals.

In $E+F \Rightarrow_G^1 E + N$ , $G$ denotes the grammar being used and the $1$ denotes that the LHS can be transformed to the RHS in one step.

Derivability

Derivability is a relation on ${(N \cup \Sigma)}^*$:

Let $x_i \in {(N \cup \Sigma)}^*$ for each $i \in \mathbb{N}$
$x_n$ is derivable from $x_0$ in $n$ steps if $x_i \Rightarrow_G^1 x_{i+1}$ for each $0 \le i<n$ : $x_0 \Rightarrow_G^n x_n$
$x \Rightarrow_G^0 y$ iff $x = y$
$y$ is derivable from $x$ if it is derivable in any number of steps: $x \Rightarrow_G^* y$ . This relation is the reflexive-transitive closure of the relation $\Rightarrow_G^1$
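
The one-step and $n$-step relations above can be sketched in Python. This is only an illustration under some assumptions that are not in the notes: every symbol is a single character, a grammar is a dict from each non-terminal to a list of RHS strings, and the example grammar `GRAMMAR` is hypothetical.

```python
def one_step(grammar, x):
    """Return every y with x =>^1 y: replace one non-terminal occurrence by one RHS."""
    result = set()
    for pos, symbol in enumerate(x):
        for rhs in grammar.get(symbol, []):
            result.add(x[:pos] + rhs + x[pos + 1:])
    return result


def derivable_in(grammar, x, n):
    """Return every y with x =>^n y, by applying one_step n times."""
    current = {x}
    for _ in range(n):
        current = {y for s in current for y in one_step(grammar, s)}
    return current


# Hypothetical example grammar: E -> E+F | F, F -> N, N -> 0 | 1
GRAMMAR = {"E": ["E+F", "F"], "F": ["N"], "N": ["0", "1"]}

print("E+N" in one_step(GRAMMAR, "E+F"))   # E+F =>^1 E+N
print(derivable_in(GRAMMAR, "E", 0))       # zero steps: only E itself
```

With `n = 0` the loop body never runs, which matches $x \Rightarrow_G^0 y$ iff $x = y$.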

Context-Free Languages

Sentential form: any $x \in (N \cup \Sigma )^*$ derivable from the start symbol $S$: $S \Rightarrow_G^* x$
Sentence: sentential form containing only terminal symbols
$L(G) = \{ x \in \Sigma ^* \mid S \Rightarrow_G^* x \}$ is the language generated by $G$ i.e. the set of all sentences of $G$ (not every string of terminals, and not sentential forms that still contain non-terminals)
The language $A \subseteq \Sigma^*$ is context free if $A = L(G)$ for some CFG $G$

Building Blocks

$T \rightarrow aTb \mid ab$
- Gives $a^n b^n$ where $n \ge 1$
$A \rightarrow a \mid aA$
- Gives the RegExp equivalent to $a^+$
$S \rightarrow (S) \mid SS \mid \epsilon$
- Balanced parentheses
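
To check what a building block generates, one can enumerate its short sentences. The sketch below assumes single-character symbols, a dict-of-lists grammar, and no $\epsilon$-productions (so sentential forms never shrink and can be pruned by length); the function name `sentences` is made up for this example.

```python
from collections import deque


def sentences(grammar, start, max_len):
    """Enumerate sentences of length <= max_len using leftmost derivations.

    Assumes no epsilon-productions, so any form longer than max_len is a dead end.
    """
    found, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()
        # Position of the leftmost non-terminal, or None if the form is a sentence.
        pos = next((i for i, s in enumerate(form) if s in grammar), None)
        if pos is None:
            found.add(form)
            continue
        for rhs in grammar[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return found


print(sorted(sentences({"T": ["aTb", "ab"]}, "T", 6), key=len))
print(sorted(sentences({"A": ["a", "aA"]}, "A", 3), key=len))
```

The balanced-parentheses grammar is left out on purpose: it has an $\epsilon$-production, so the length-based pruning used here would not be valid for it.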

Leftmost and Rightmost

Leftmost: if there are multiple non-terminals, always replace the left-most one first. The choice of production will be determined by the sentence you are trying to get.

Rightmost: the same, but always replace the right-most non-terminal first.

If a sentence has more than one leftmost derivation, the sentence is ambiguous. A CFG that generates an ambiguous sentence is also ambiguous.

A CFL is inherently ambiguous if every CFG generating it is ambiguous e.g. $\{ a^i b^j c^k \mid i = j \text{ or } j = k \}$.
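
Ambiguity of a particular sentence can be tested by counting its leftmost derivations. The following is a rough sketch, assuming single-character symbols and a grammar without $\epsilon$-productions or unit-production cycles (otherwise the recursion may not terminate). The grammar $E \rightarrow E+E \mid a$ is a hypothetical example, not one from these notes.

```python
def count_leftmost_derivations(grammar, form, target):
    """Count leftmost derivations of target starting from the sentential form."""
    pos = next((i for i, s in enumerate(form) if s in grammar), None)
    if pos is None:
        # No non-terminals left: the form is a sentence.
        return 1 if form == target else 0
    # Prune: forms never shrink, and the terminal prefix must already match.
    if len(form) > len(target) or form[:pos] != target[:pos]:
        return 0
    return sum(
        count_leftmost_derivations(grammar, form[:pos] + rhs + form[pos + 1:], target)
        for rhs in grammar[form[pos]]
    )


AMBIGUOUS = {"E": ["E+E", "a"]}
print(count_leftmost_derivations(AMBIGUOUS, "E", "a+a+a"))  # 2, so a+a+a is ambiguous
```

The two derivations correspond to the two groupings $(a+a)+a$ and $a+(a+a)$.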

Regular Grammar

G = (N, \Sigma, P, S)

$G$ is a regular grammar if, for each $A \rightarrow \omega \in P$, either:

$\omega = \epsilon$, or
$\omega \in \Sigma N$

That is, the RHS is $\epsilon$ or a terminal followed by a non-terminal.

Every regular language is generated by a regular grammar:

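Each state becomes a non-terminal. For each transition from state $A$ to state $B$ on symbol $a$, add the production $A \rightarrow aB$: the RHS is the concatenation of the transition symbol and the state it goes to
Accept states $A$ also get the production $A \rightarrow \epsilon$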
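
A minimal sketch of this construction, assuming the DFA is given as a dict from `(state, symbol)` to the next state and that states are named with single upper-case letters so they can double as non-terminals. The example DFA (an even number of `a`s) is hypothetical.

```python
def dfa_to_grammar(transitions, accept_states):
    """Build a right-regular grammar (dict: non-terminal -> list of RHS strings)."""
    grammar = {}
    for (state, symbol), target in transitions.items():
        # Transition state --symbol--> target becomes state -> symbol target.
        grammar.setdefault(state, []).append(symbol + target)
        grammar.setdefault(target, [])
    for state in accept_states:
        # Accept states can stop the derivation: state -> epsilon ("" here).
        grammar.setdefault(state, []).append("")
    return grammar


# Hypothetical DFA over {a, b}: E = even number of a's (start, accepting), O = odd.
EVEN_AS = {("E", "a"): "O", ("O", "a"): "E", ("E", "b"): "E", ("O", "b"): "O"}
print(dfa_to_grammar(EVEN_AS, {"E"}))
```

The start state of the DFA plays the role of the start symbol of the grammar.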

Every regular language is accepted by a DFA and by an NFA (with or without $\epsilon$ transitions), described by a regular expression, and generated by a regular grammar.

Regular languages are a strict subset of CFLs.

Left/Right Regular Grammars

For each production $A \rightarrow \omega \in P$:

Right regular: $\omega \in \Sigma N\cup \{ \epsilon \}$ e.g. $A \rightarrow bC$
Left regular: $\omega \in N\Sigma \cup \{ \epsilon \}$ e.g. $A \rightarrow Bc$

There are other forms such as $\omega \in \Sigma^* N \cup \Sigma ^*$ and $\omega \in N \Sigma \cup \Sigma \cup \{ \epsilon \}$.

Grammars with $\omega \in \Sigma N\cup N\Sigma \cup \{ \epsilon \}$ (mixing the left and right forms) or $\omega \in NN \cup \Sigma$ are NOT guaranteed to yield regular languages.

Chomsky Normal Form

A CFG $G = (N, \Sigma, P, S)$ is in Chomsky normal form if every production’s RHS is made up of:

two non-terminals or
one terminal

Every CFG can be transformed into Chomsky normal form (unless $\epsilon \in L(G)$ ) by eliminating:

$\epsilon$ -productions of the form $A \rightarrow \epsilon$

For productions of the form $B \rightarrow uAv$ where $A \rightarrow \epsilon$ exists, add the production $B \rightarrow uv$
- $u,v \in (\Sigma \cup N)^*$ : $u$ and $v$ can be empty strings or be multiple characters long
- Do this recursively until there are no more changes
- Then, delete all $\epsilon$ -productions
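
One way to sketch this step is to first compute the set of nullable non-terminals (those that can derive $\epsilon$) and then emit every variant of each RHS with some nullable symbols left out. This reaches the same result as repeating the rule above until nothing changes. The representation (single-character symbols, dict of RHS lists, `""` for $\epsilon$) and the function names are assumptions made for this example.

```python
from itertools import product


def nullable_set(grammar):
    """Non-terminals that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhss in grammar.items():
            # all() over an empty RHS is True, so A -> epsilon makes A nullable.
            if lhs not in nullable and any(all(s in nullable for s in rhs) for rhs in rhss):
                nullable.add(lhs)
                changed = True
    return nullable


def eliminate_epsilon(grammar):
    """Add the variants with nullable symbols omitted, then drop epsilon-productions."""
    nullable = nullable_set(grammar)
    result = {}
    for lhs, rhss in grammar.items():
        new = set()
        for rhs in rhss:
            # Each nullable symbol may be kept or dropped; other symbols are kept.
            options = [(s, "") if s in nullable else (s,) for s in rhs]
            for choice in product(*options):
                candidate = "".join(choice)
                if candidate:
                    new.add(candidate)
        result[lhs] = sorted(new)
    return result


print(eliminate_epsilon({"S": ["(S)", "SS", ""]}))
```

For the balanced-parentheses grammar this leaves $S \rightarrow () \mid (S) \mid S \mid SS$. The leftover $S \rightarrow S$ is a unit production, and the new grammar no longer generates $\epsilon$.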

Unit-productions of the form $A \rightarrow B$ where $B \rightarrow w$ exists

$w \in (\Sigma \cup N)^*$
Replace with $A \rightarrow w$
Then remove all unit productions $A \rightarrow B$
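
A possible sketch of unit-production elimination: for each non-terminal, collect everything reachable through unit productions alone, then copy over the non-unit productions. It assumes single-character symbols and that every non-terminal appears as a key of the grammar dict. The function name is made up for this example.

```python
def eliminate_unit(grammar):
    """Replace unit productions A -> B by the non-unit productions of B."""
    result = {}
    for lhs in grammar:
        # Non-terminals reachable from lhs using unit productions only.
        reach, stack = {lhs}, [lhs]
        while stack:
            a = stack.pop()
            for rhs in grammar[a]:
                if rhs in grammar and rhs not in reach:  # rhs is a single non-terminal
                    reach.add(rhs)
                    stack.append(rhs)
        result[lhs] = sorted(
            {rhs for b in reach for rhs in grammar[b] if rhs not in grammar}
        )
    return result


# Hypothetical example: E -> E+F | F, F -> N, N -> 0 | 1
print(eliminate_unit({"E": ["E+F", "F"], "F": ["N"], "N": ["0", "1"]}))
```

Computing the reachable set first avoids looping forever on cycles such as $A \rightarrow B$, $B \rightarrow A$.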

Non-generating non-terminals

Any non-terminal that can never be transformed into a string of terminals
Create a set of generating non-terminals: start by including any non-terminals that have a production that generates only terminals
Then add non-terminals that have a production that generates only terminals or generating non-terminals, repeating until the set stops changing
Remove all productions containing non-terminals that are not in the set of generating non-terminals
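
The fixed-point computation described above might look like the sketch below. It assumes single-character symbols and that every non-terminal is a key of the grammar dict; the example grammar is hypothetical.

```python
def generating(grammar):
    """Set of non-terminals that can derive some string of terminals."""
    gen = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhss in grammar.items():
            if lhs in gen:
                continue
            # A symbol is fine if it is a terminal or already known to be generating.
            if any(all(s not in grammar or s in gen for s in rhs) for rhs in rhss):
                gen.add(lhs)
                changed = True
    return gen


def remove_non_generating(grammar):
    """Drop every production that mentions a non-generating non-terminal."""
    gen = generating(grammar)
    return {
        lhs: [rhs for rhs in rhss if all(s not in grammar or s in gen for s in rhs)]
        for lhs, rhss in grammar.items()
        if lhs in gen
    }


# Hypothetical example: B -> BB can never terminate, so B and S -> AB are removed.
print(remove_non_generating({"S": ["AB", "a"], "A": ["b"], "B": ["BB"]}))
```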

Non-reachable non-terminals

Reachable if $S \Rightarrow ^* uAv$ for $u, v \in (\Sigma \cup N)^*$
- That is, the start symbol can be transformed into a string containing the non-terminal
Remove all productions containing non-reachable non-terminals
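
Reachability is a graph search from the start symbol. A small sketch, assuming single-character symbols and a dict-of-lists grammar (the example grammar is hypothetical):

```python
def reachable(grammar, start):
    """Non-terminals that occur in some sentential form derivable from start."""
    seen, stack = {start}, [start]
    while stack:
        a = stack.pop()
        for rhs in grammar.get(a, []):
            for s in rhs:
                if s in grammar and s not in seen:
                    seen.add(s)
                    stack.append(s)
    return seen


def remove_unreachable(grammar, start):
    """Keep only the productions of reachable non-terminals."""
    keep = reachable(grammar, start)
    return {lhs: rhss for lhs, rhss in grammar.items() if lhs in keep}


# Hypothetical example: nothing derivable from S ever mentions B.
print(remove_unreachable({"S": ["aA", "a"], "A": ["b"], "B": ["c"]}, "S"))
```

Dropping the productions whose LHS is unreachable is enough here, because an unreachable non-terminal cannot appear on the RHS of a reachable one.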

Productions with RHS length $\ge 2$ that contain terminals

Create a new production $B \rightarrow b$ for each terminal $b$ that meets the criteria
Then replace the productions to use $B$ instead of $b$

Productions with RHS length $\ge 3$ (should only be left with productions with only non-terminals of this length)

For $A \rightarrow B_1 B_2 \dots B_n$ , create a new production and non-terminal $C \rightarrow B_2\dots B_n$ , then let $A \rightarrow B_1 C$
- Repeat if $C$ is too long
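
The last two steps (replacing terminals inside long right-hand sides, then splitting right-hand sides of length $\ge 3$) can be sketched together. Since new non-terminals are needed, this example assumes a different representation from single characters: a production is a pair `(lhs, rhs)` where `rhs` is a tuple of symbol names. The fresh names `T_a` and `X1` are hypothetical and assumed not to clash with existing non-terminals.

```python
def to_binary(productions, nonterminals):
    """Rewrite productions so every RHS is one terminal or two non-terminals.

    Assumes epsilon- and unit-productions have already been eliminated.
    """
    result = []
    terminal_nt = {}  # terminal -> the non-terminal standing in for it
    counter = 0
    for lhs, rhs in productions:
        if len(rhs) >= 2:
            new_rhs = []
            for s in rhs:
                if s in nonterminals:
                    new_rhs.append(s)
                    continue
                if s not in terminal_nt:
                    terminal_nt[s] = f"T_{s}"
                    result.append((terminal_nt[s], (s,)))  # B -> b
                new_rhs.append(terminal_nt[s])
            rhs = tuple(new_rhs)
        while len(rhs) >= 3:
            # A -> B1 B2 ... Bn becomes A -> B1 C and C -> B2 ... Bn.
            counter += 1
            fresh = f"X{counter}"
            result.append((lhs, (rhs[0], fresh)))
            lhs, rhs = fresh, rhs[1:]
        result.append((lhs, rhs))
    return result


CNF = to_binary([("T", ("a", "T", "b")), ("T", ("a", "b"))], {"T"})
for lhs, rhs in CNF:
    print(lhs, "->", " ".join(rhs))
```

Applied to $T \rightarrow aTb \mid ab$, this yields five productions, each with either one terminal or two non-terminals on the right.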

If $\epsilon$ is needed:

Create a new non-terminal $S'$ and production $S' \rightarrow \epsilon$
For each production $S \rightarrow \omega$ of the start symbol, add $S' \rightarrow \omega$
Make $S'$ the new start symbol

Cocke-Younger-Kasami Algorithm

Given a string $w \in \Sigma ^*$ and a CFL $A$, check if $w \in A$

Test membership of CFL
$A = L(G) = \{ x \in \Sigma^* \mid S \Rightarrow_G^* x \} \therefore w \in A \text{ iff } S \Rightarrow_G^* w$
But there may be infinitely many derivations to search through

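In Chomsky normal form, every production either replaces one non-terminal with two non-terminals or replaces one non-terminal with one terminal, so a derivation of a string of length $n \ge 1$ takes exactly $2n - 1$ steps. Therefore, there is an upper bound on the length of derivations that need to be checked, but the number of such derivations is exponential in $n$.

Use dynamic programming to avoid the exponential search.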

If the original problem is $S \Rightarrow ^* w$ , solve $A \Rightarrow ^* w_{i,j}$ for all non-terminals $A$ and for all substrings $w_{i,j}$ of $w$ , where $i$ and $j$ are positions in the string (position $0$ is before the first character) e.g. $abcde_{1,3}$ returns $bc$

Build a staircase table indexed by positions $0 \le i < j \le \text{len}(w)$
Starting with the main diagonal (thus one-length substrings), for each cell, build a set, $N_{i, i+1}$ , of non-terminals from which you can derive that character
With the next diagonals:
- Split up the string $w_{i,j}$ into two substrings, $w_{i, k}$ and $w_{k, j}$ , for each split point $i < k < j$
- For all non-terminals $B \in N_{i, k}$ and $C \in N_{k, j}$ , if there is a production $A \rightarrow BC$ , add $A$ to $N_{i,j}$

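There are $O(n^2)$ cells and $O(n)$ split points per cell, where $n = \text{len}(w)$ . Hence, for a fixed grammar, the algorithm is $O(n^3)$ .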

If $S \in N_{0, \text{len}(w)}$ , the string $w$ is in the language of the grammar.
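
A compact sketch of the table-filling procedure, using the same position-based indices as above. The grammar representation (a list of `(lhs, rhs)` pairs with `rhs` a tuple of symbol names) and the example grammar for $a^n b^n$ in Chomsky normal form are assumptions made for this example.

```python
def cyk(productions, start, w):
    """Return True iff start =>* w, for a grammar in Chomsky normal form."""
    n = len(w)
    if n == 0:
        return False  # a CNF grammar as defined above cannot derive epsilon
    table = {}
    # Main diagonal: non-terminals deriving each single character.
    for i in range(n):
        table[i, i + 1] = {lhs for lhs, rhs in productions if rhs == (w[i],)}
    # Longer substrings, shortest first, so the smaller cells are already filled.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = set()
            for k in range(i + 1, j):  # split point
                for lhs, rhs in productions:
                    if len(rhs) == 2 and rhs[0] in table[i, k] and rhs[1] in table[k, j]:
                        cell.add(lhs)
            table[i, j] = cell
    return start in table[0, n]


# Hypothetical CNF grammar for a^n b^n (n >= 1).
ANBN = [
    ("T_a", ("a",)),
    ("T_b", ("b",)),
    ("T", ("T_a", "X1")),
    ("X1", ("T", "T_b")),
    ("T", ("T_a", "T_b")),
]
print(cyk(ANBN, "T", "aabb"))  # True
print(cyk(ANBN, "T", "aab"))   # False
```

The three nested loops over `length`, `i` and `k` are where the cubic running time comes from.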

Pushdown Automata

Uses a stack. Transitions can depend on symbols at the top of the stack, and modify symbols at the top of the stack. The stack alphabet and the input alphabet can be different.

M = (Q, \Sigma, \Gamma, \delta, q_0, F)

Where:

$\Gamma$ is the stack alphabet, a finite set
$\epsilon \notin \Sigma \cup \Gamma$
$\delta$ is a transition function of the form $Q \times (\Sigma \cup \{ \epsilon \}) \times \Gamma^* \rightarrow \mathcal{P}(Q \times \Gamma^*)$ i.e. based on an NFA with $\epsilon$ transitions, but with a stack

A transition can be taken only if the input symbol matches, as in an NFA, and the symbol on the top of the stack is the same one as specified in the transition (unless it is $\epsilon$ , in which case nothing needs to be popped).

When the transition is taken, the matched symbol is popped from the stack, and the specified symbol is pushed onto the stack. The syntax is "input character, stack symbol to pop / stack symbol to push" e.g. $a, \epsilon /1$ reads $a$, pops nothing and pushes $1$. Strings can be pushed and popped as well (when pushing, the first character of the string is pushed last, so it ends up on top).

For a string to be accepted, the PDA must have read all of it, be in an accept state AND have an empty stack.

Configuration

(q, x, \alpha) \in Q \times \Sigma ^* \times \Gamma^*

Where:

$q$ is the current state
$x$ is the remaining input
$\alpha$ is contents of the stack

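The configuration is a snapshot of the current state of $M$ . The next configuration is:

(p, ax, \alpha \gamma) \rightarrow (q, x, \beta \gamma) \text{ if } (q, \beta) \in \delta (p, a, \alpha)

where $a \in \Sigma \cup \{ \epsilon \}$ and $\gamma \in \Gamma^*$ is the rest of the stack.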

$\rightarrow^*$ means the configuration is reachable in zero or more transitions.

The language accepted by $M$ :

L(M) = \{ x \in \Sigma^* \mid (q_0, x, \epsilon) \rightarrow^* (q, \epsilon, \epsilon), q \in F \}
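
The acceptance condition can be explored with a breadth-first search over configurations. This is only a sketch: the stack is a string with its top at index 0, $\epsilon$ is written `""`, and the `max_stack` bound is an assumption added here so the search terminates even when $\epsilon$-transitions could push forever. The example PDA for $a^n b^n$ is hypothetical.

```python
from collections import deque


def accepts(delta, start, accept, x, max_stack=50):
    """delta maps (state, input symbol or "", string to pop) to a set of
    (next state, string to push). Returns True iff x is accepted."""
    seen = {(start, x, "")}
    queue = deque(seen)
    while queue:
        q, rest, stack = queue.popleft()
        if q in accept and rest == "" and stack == "":
            return True
        for (p, a, pop), targets in delta.items():
            # The input prefix and the top of the stack must both match.
            if p != q or not rest.startswith(a) or not stack.startswith(pop):
                continue
            for r, push in targets:
                nxt = (r, rest[len(a):], push + stack[len(pop):])
                if len(nxt[2]) <= max_stack and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False


# Hypothetical PDA for a^n b^n (n >= 1): push 1 per a, pop 1 per b.
DELTA = {
    ("p", "a", ""): {("p", "1")},
    ("p", "b", "1"): {("q", "")},
    ("q", "b", "1"): {("q", "")},
}
print(accepts(DELTA, "p", {"q"}, "aabb"))  # True
print(accepts(DELTA, "p", {"q"}, "aab"))   # False: a 1 is left on the stack
```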

Converting CFGs to PDAs

The resultant PDA will have two states: a start state $q_0$ and accept state $q_1$ .

A transition from $q_0$ to $q_1$ , $\epsilon , \epsilon / S$ , pushes the start symbol onto the stack.

$q_1$ will have transitions to itself for each production: for each production, an epsilon transition reading no input, popping the LHS of the production and pushing its RHS onto the stack should be made: $\epsilon, \text{LHS}_\text{production} / \text{RHS}_\text{production}$ .

For each terminal $\alpha$ , there should also be a read transition taking the terminal as input and popping that terminal from the stack: $\alpha,\alpha/ \epsilon$ .
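
The construction can be written down as a transition table. In this sketch the table maps `(state, input symbol or "", string to pop)` to a set of `(next state, string to push)`, with `""` standing for $\epsilon$. That format, the single-character symbols and the function name are all assumptions made for the example.

```python
def cfg_to_pda(grammar, start, terminals):
    """Build the two-state PDA transition table for a CFG."""
    # q0 -> q1: read nothing, pop nothing, push the start symbol.
    delta = {("q0", "", ""): {("q1", start)}}
    # One epsilon-transition per production: pop the LHS, push the RHS.
    for lhs, rhss in grammar.items():
        delta[("q1", "", lhs)] = {("q1", rhs) for rhs in rhss}
    # One read transition per terminal: read it and pop it from the stack.
    for a in terminals:
        delta[("q1", a, a)] = {("q1", "")}
    return delta


for key, value in cfg_to_pda({"T": ["aTb", "ab"]}, "T", "ab").items():
    print(key, "->", sorted(value))
```

The PDA accepts when it is in $q_1$ with the input fully read and the stack empty, which happens exactly when the stack contents have traced out a leftmost derivation of the input.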