Context-Free Grammars
Where:
-
is a finite set of non-terminals. They are always denoted with upper case letters -
is a finite set of terminals that is disjoint from -
is a finite set of productions -
is the start symbol
Productions
A production,
Multiple productions written as
The RHS can be empty: in this case, the RHS is
The LHS is always one non-terminal but the RHS can be any length and be made up of terminals and non-terminals.
In
Derivability
On a relation
-
for each -
is derivable from in steps if for each : -
iff -
is derivable from if it is derivable in any number of steps: . This relation is the reflexive-transitive closure of the relation
Context-Free Languages
- Sentential form: any
derivable from start symbol - Sentence: sentential form containing only terminal symbols
-
is the *language generated by . i.e. all possible strings of terminals and all sentential forms - The language
is context free if for some CFG
Building Blocks
-
- Gives
where
- Gives
-
- Gives the RegExp equivalent to
- Gives the RegExp equivalent to
-
- Balanced parentheses
Leftmost and Rightmost
Leftmost: if there are multiple non-terminals, always replace the left-most one first. The choice of production will be determined by the sentence you are trying to get.
If there is more than one leftmost derivation, the sentence is ambiguous. The CFG that generates it is also ambiguous.
The CFL is inherently ambiguous if every CFG generating it is ambiguous e.g.
Regular Grammar
For each
-
-
That is, the RHS is or a terminal followed by a non-terminal
Every regular language is generated by a regular grammar:
- For each state and transition, the production will be the concatenation of the transition symbol and state it goes to
- Accept states will also have the epsilon
Every regular language is accepted by a DFA, NFA (with or without
Regular languages are a strict subset of CFLs.
Left/Right Regular Grammars
If
- Right regular:
e.g. - Left regular:
e.g. There are other forms such as , .
$ \omega \in \Sigma N\cup N\Sigma \cup { \epsilon }$ and
Chomsky Normal Form
CFG
- two non-terminals or
- one terminal
Every CFG can be transformed into Chomsky normal form (unless
-
-productions of the form
- For productions of the form
, add the production -
: and can be empty strings or be multiple characters long - Do this recursively until there are no more changes
- Then, delete all
-productions
-
- Unit-productions of the form
where exists
-
- Replace with
- Then remove all unit productions
- Non-generating non-terminals
- Any non-terminal that can never be transformed into a terminal
- Create a set of generating non-terminals: start by including any non-terminals that have a production that generates only terminals
- Then add non-terminals that have a production that generates only terminals or generating non-terminals
- Remove all productions containing non-terminals that are not in the set of generating non-terminals
- Non-reachable non-terminals
- Reachable if
for - That is, the start symbol can be transformed into the non-terminal
- Remove all productions containing non-reachable non-terminals
- Productions with RHS length
that contain terminals
- Create a new production
for each terminal that meets the criteria - Then replace the productions to use
instead of
- Productions with RHS length
(should only be left with productions with only non-terminals of this length)
- For
, create a new production and non-terminal , then let - Repeat if
is too long
- Repeat if
If
- Create a new non terminal
, production - For each production
of the start symbol, add - Make
the new start symbol
Cocke-Younger-Kasami Algorithm
Given a string
- Test membership of CFL
-
- But there are a lot of (infinitely many) strings
Chomsky normal form means that each non-terminal produces at least one terminal. Therefore, there is an upper bound on the length of derivations that need to be checked (exponential).
Use dynamic programming to improve speeds.
If the original problem is
- Build a staircase table of character indexes (one-indexing)
- Starting with the main diagonal (thus one-length substrings), for each cell, build a set,
, of non-terminals from which you can derive that character - With the next diagonals:
- Split up the string into two substrings,
and - For all non-terminals
and , if there is a production such that , add to
- Split up the string into two substrings,
Hence, the algorithm is
If
Pushdown Automata
Uses a stack. Transitions can depend on symbols at the top of the stack, and modify symbols at the top of the stack. The stack and transition alphabet can be different.
Where:
-
is the stack alphabet, a finite set -
-
is a transition function of the form i.e. based on NFA with transitions, but with a stack
For a transition to take place, there must be a normal transition, but the character on the top of the stack must be the same one as specified in the transition (unless it is
Upon reading, the element is removed from the stack, and an element is added to the stack. The syntax is character, stack symbol to read/pop stack symbol to push e.g.
For a string to be read, it must be on an accept state AND the stack must be empty.
Configuration
Where:
-
is the current state -
is the remaining input -
is contents of the stack
The configuration is a snapshot of the current state of
The language accepted by
Converting CFGs to PDAs
The resultant PDA will have two states: a start state
A transition from
For each terminal