1. Algorithms

A well-defined procedure:

Input: instance of the problem
Output: solution to the instance

The procedure must work for all instances of the problem.

Defining Algorithms

English
Pseudo-code
Program
Formal mathematical notation The further down you go, the more precise but less readable it gets.

Properties

Correctness (for a defined set of inputs)
- Don’t write clever code! Write correct code
Scalability/Efficiency
- Time, RAM, disk, CPU, power etc.
Clarity
Generality
- Is it sufficient for it to work on a subset of problems? Then redefine the problem space
- Is an optimal solution required, or does it just have to be good-enough?

Determining Correctness

Testing can not prove an algorithm is correct (unless you can test all cases): you can only prove it is incorrect. Only formal proofs can prove it. Not covered in the course.

Start with a reasonably precise statement of the problem and algorithm.

Then:

Find failing cases
Try proof by induction
- Can you test every single case when the size is small?

Example: Robot Arm (Skiena 1.1)

Robot needs to place solder into $n$ points on a PCB. What is the shortest cycle tour that visits each point in the set $S$ ? This is the TSP, so an optimal solution cannot be found. Thus, heuristics need to be used.

Heuristics: greedy algorithm?

Example: Selecting Jobs (Skiena, 1.2)

Multiple movies that need to be filmed, need to maximize number of movies. Set $I$ of $n$ intervals on the line.

Heuristics: start with the shortest movie? Start with the first available movie?

Look for counter examples that show that the heuristic is incorrect:

Think small
Think exhaustively
Find weaknesses
- Ties
- Extremes

Proof by induction

def sum(n):
    s = 0
    for i in range(1, n + 1):
        s += i
    return s

Assume that the loop works up to $k$ . Thus, the loop must work up to when calculating the sum for $k+1$ .

Insertion sort

After $n$ iterations of the outer for loop, the sub array, $[0, n)$ is sorted. Now, you only need to prove that it is able to sort the $n$ th element correctly.

2. Algorithm Analysis

Once an algorithm is shown to be correct, consider the:

Efficiency in time and space
- Big O
- Big Omega
- Big Theta
Complexity in terms of the size of the input
Worse, average, and best case
- Focus on the worst case
Asymptotic analysis: $\lim_{n\rightarrow \infty}{time(n)}$

Complexity

The time taken will depend on the instance of the problem being solved.

RAM Model of Computation

Abstract random access machine

Accessing any byte in memory takes constant time
Basic operations (arithmetic, comparison, assignment) take one unit of time
Time inside loops proportional to the number of iterations
Subroutine time is the time to execute its body

Formal Definitions

Big O (upper bound)

O(G) = {f|\exists c>0: \exists x_0: \forall x\ge x_0: 0 \le f(x) \le c \cdot g(x)}

That is, $f(x)$ is part of the set $O(g)$ if there exists a positive multiplier $c$ such that for every value of $x$ greater than $x_0$ , $f(x) \le c\cdot g(x)$ .

Big Omega (lower bound)

\Omega(G) = {f \exists c>0: \exists x_0: \forall x\ge x_0: 0 \le c\ cdot g(x) \le f(x)}

That is, $f(x)$ is part of the set $\Omega(g)$ if there exists a positive multiplier $c$ such that for every value of $x$ greater than $x_0$ , $c\cdot g(x) \le f(x)$ .

Big Theta (tight bound)

\Omega(g) = O(g) \cap \Omega(g)

That is, $f(x)$ is part of the set $\Theta(g)$ if it is such that for every value of $x$ above $x_0$ , there exists $c_1$ and $c_2$ such that that $c_1 \cdot g(x) \le f(x) \le c_2 \cdot g(x)$ .

The worst case and best case are functions: the algorithm is not, so we cannot define the algorithm by Big O/Omega/Theta notation.

The best case of an algorithm will be $\Theta(g)$ ; the algorithm is not $\Theta(g)$ . Ideally, an algorithm would be separately categorized by its best and worst case. In practice, we often categorize the algorithm by its worst case.

Common functions, sorted by complexity descending:

Functions
$x!$
$a^x$
$x^a$
$x \log(x)$
$x$
$\log(x)$
$a$

3. Recursion

The Process

Divide the problem into sub-problems
- Usually $(1, n-1)$ or ( $\frac{n}{2}, \frac{n}{2})$
Solve the sub-problems
Merge the sub-problems

Example: Merge Sort

Merge sort from left to middle, merge sort from middle to right and then merge the results:

Divide: find the middle element
Solve: merge sort the two sub-arrays
Merge: mere the two sub-arrays

The base case is when the length of the array is 0 or 1.

Analysis

For $T(n)$ , the cost will be either the base case of $\Theta(1)$ , or the sum of:

Subdivision: $\Theta(1)$ to find the center
Solving sub-problems: $2T(\frac{n}{2})$
Merge: $\Theta(n)$

\begin{cases} \Theta(1) & n = 1 \\ 2T(\frac{n}{2}) + \Theta(n) & n > 1 \end{cases}

At each level, the cost will be:

\begin{aligned} \Theta(n) &= 2\Theta(\frac{n}{2}) \\ &= 4\Theta(\frac{n}{4}) \\ &\dots \\ &= \Theta(n) \end{aligned}

There are $\log_2{n}$ levels, so the total cost is $n\log_2n$ .

4. Greedy Algorithms

Greedy algorithms:

Are simple and intuitive
Choose the locally optimal choice (for some given objective function)
- Hope it will lead to a globally optional choice
Do not use backtracking

Examples

Minimum Spanning Tree

Prim’s Algorithm
- Start anywhere
- Connect to the cheapest node that doesn’t create a cycle
Kruskal’s algorithm
- Forest of trees
- Look for the cheapest node, create a new tree or connect two trees together

These can be proved using proof by counter-example.

Coin Changing Problem

Coins of denominations $C = \{c_1, \dots, c_n\}$ . Find the minimum number of coins such that the total value is $V$ . Assume:

You have an unlimited amount of each coin
$c_1 < c_2 < \dots < c_n$
$c_1=1$ and hence, there is a solution for every integer solution

Greedy strategy:

Select the largest coin such that $c_i<V$
Add the coin to the list and decrement $V$ : $V := V - c_i$

Counter example: $C=\{1, 10, 25\}$ , $V=30$ . The optimal solution is $3 \cdot 10$ , but the greed algorithm gives $1\cdot 25 + 5\cdot 1$ .

What is a sufficient property such that the greedy algorithm will always be optimal? Every coin is a multiple of the one below it.

Example: Interval Scheduling

There is a set of jobs with a start ( $s_j$ ) and end ( $f_j$ ) times. Find the subset that contains the most jobs.

You could:

Sort by start times
Sort by job length
Sort by finish times
- This is correct, but it is hard to prove this is optimal

To prove a greedy algorithm right, assume its wrong and find contradiction. To prove it wrong, simply find a counter-example.

Fractional Knapsack Problem

There is a set $S$ of $n$ items. Each item has a positive value $b_i$ and weight $w_i$ . Find the subset which yields the maximum value without exceeding given weight $W$ .

You can select fraction of item $x_i$ ; the weight and values are scaled proportionally.

To solve this with a greedy algorithm, sort it by the benefit/weight ratio ( $b_i/w_i$ ), filling up the knapsack with the item with the greatest ratio that is still available until target weight is reached.

0/1 Knapsack Problem

Can only take the whole item or not take it at all. Requires dynamic programming.

Huffman Coding

Variable vs Fixed length

Fixed length codes are convenient for indexing, counting (e.g. ASCII: 7-bit binary string per character, one character per byte).

Variable length codes may be more efficient (e.g. Morse code: frequent letters have short encodings; UTF-8: most common characters 1 byte long).

Variable-Length Coding

Let $a=0$ , $b=1$ , $c=01$ etc.

Problem: there is no spacing, so it is ambiguous where a character starts and ends.

Solution: ensure no binary string in the code is a prefix of any other binary string.

For $a$ to $f$ , one possible solution is $a=000$ , $b=0010$ , $c=0011$ , $d=01$ , $e=10$ , $f=11$ .

This can be represented as a tree with two children per node and characters being found in leaf nodes:

            .
           /  \
        0 /    \ 1
         .       .
        /\       /\
     0 /  \ 1 0 /  \ 1
      .    d   e    f
     /\
  0 /  \ 1
   a    .
        /\
     0 /  \ 1
      b    c

Encoding

The most frequent characters should have the shortest encoding. However, this approach will fail.

Instead, the key is that the two least frequent characters should be leaves at the bottom of the tree. Hence, to create a Huffman tree:

Create a min-heap where the value is:
- The frequency of the character for a leaf node or
- The sum of frequencies for all its children
While there are nodes:
- Join the two smallest nodes together
- Add the combined node back into the min-heap

Popping and insertion into a min-heap are $O(\log{n})$ operations and $n$ iterations are required. Hence, the time complexity for encoding a Huffman tree is $O(n \log{n})$

Disadvantages

The encoding table must be sent along with the encoded document. Hence, the overhead may be large for small documents.

Huffman encoding is optimal for global character frequencies. However:

The whole document must be processed
Local optimization not possible (e.g. character frequencies change half way through)
Can’t recognize patterns; it only recognizes letters, not words

Lempel-Ziv usually gives better compression with substring recognition. The deflate algorithm combines the LZ and Huffman algorithms.

5. Dynamic Programming

Memoisation

Fibonacci can be defined as a recurrence relation: $f(n) = f(n - 1) + f(n - 2)$ . However, this approach is inefficient due to lots of duplicate calls: when expanded, it becomes $f(n) = f(1)+ f(1) + f(1) + \dots$ .

One solution to this is memoisation: the caching of results.

General Purpose Memoiser

def memoise(f):
  known = {}
  def new_func(*params):
    if param not in known:
      known[param] = f(*params)
    return known[params]
  return new_func

@memoise
def function_definition():

# Or use the built in tools:
from func_tools import lru_cache
# LRU: least recently used; gets rid of oldest result once number of results stored reaches `maxsize`

@lru_cache(maxsize = None)
def function_definition():
  // ...

Notes:

A nested function has access to variables within the outer function’s scope
*param in the function definition returns a tuple, *param passes the elements of the tuple individually as arguments
@func acts as a decorator that takes the function as an argument and modifies it

Fibonacci, Iteratively

Bottom-up iteration:

def fibonacci_i(n):
  a, b = 0, 1
  for i in range(n):
    a, b = b, a + b
  return a

This approach is $O(n)$ in time and $\Theta(1)$ in space.

Top-down Dynamic Programming

Recursion and memoisation.

Bottom-up Dynamic Programming

Store a table of solutions, solving the simplest sub-problems first
Fill out the table by combining sub-problems until target problem is reached

Example: Minimum Cost Path

Rectangular grid of cells, each with a weight. You can move to any adjacent cell (including diagonals) from each cell. Find the minimum cost needed to get from top to bottom.

Recurrence formula

$C_{i, j}$ is optimal cost of getting to the cell $(i,j)$ where $i\in [0, I)$ , $j\in[0, J)$ :

C_{i,j} = \begin{cases} \infty &\text{cell outside of the grid}\\ W_{i,j} &i = 0 \text{ (cell is at the the top of the grid)} \\ W_{i,j} + \min_{k\in(j-1, j, j+1)}(C_{i-1, k}) &\text{otherwise} \end{cases}

Hence in the normal case, the cost of the cell is the cell weight plus minimum cost to get to cell above it (left, directly above, or right).

Using a top-down approach:

Infinite cost if outside grid
Top row: just get the weight of the cell
Otherwise
- Check cache, return if found
- Find the minimum cost of the cells above/neighboring it
- Add the weight of the cell
- Save it to cache and return
Find the cost for each cell in the bottom row, and find the minimum from there

Using a bottom-up approach::

Create a table of all cells
Get weights for top row
For all cells in second row, find the minimum cost to get to that cell and add the cost of that cell
Repeat for all rows
Find the minimum-cost path by backtracking
- This is easier if you store the column of the cell you came from while generating the table

If the grid is square, it is $O(n^2)$ in time and space.

This problem can also be found using Dijkstra’s shortest path by converting the problem to a multi-stage graph (with weights moved to edges). However, this is harder to implement and less efficient.

Suitably of DP

DP approaches are likely to work when the problem:

Is combinatorial: brute force solution of either or
Involves only integers or slices of fixed strings with well-define bounds: a finite search domain
Is looking for the optimal solution

Then build the solution from solutions to sub-problems:

Requires optimal substructure
Expect overlapping subproblems
- Otherwise, the cache is useless
- NB: cache will often require the path to the result

Optimal Substructure

When solutions to subproblems can be reused (multiple times!) to find the optimal solution to the problem.

For example, in the shortest path problem, the shortest path for a sub-problem will be inside the shortest path for the larger problem.

The longest path problem does not have an optimal substructure.

Properties of Recursive Solutions

Want to make a cache of all the calls you make
How many unique function calls is it possible to make? How big could the cache be?
- For $fn(x_1, \dots, x_n)$ , the cache would be $|domain(x_1)| \cdot \dots \cdot |domain(x_n)|$
- Hence, this can become very large

Coin Changing Problem

A greedy approach does not produce optimal results. Hence, DP must be used.

Top-Down

Given coinage = [1, 10, 25] and amount = 80, the solution will always be minimum of:

1 + change(80 - 25, coinage, 3): the solution given all three coins are available
OR change(80, coinage, 2): the solution given only the first two coins are available

Either amount or coinage decreases to decrease the size of the problem.

An alternative approach would be finding the minimum of:

1 + change(80 - 1, coinage)
1 + change(80 - 10, coinage)
1 + change(80 - 25, coinage)

That is 1 + min(coinage_calculate(amount - coin, coinage) for coin in coinage)

Without memoisation, there are a maximum of len(coinage) sub-calls per call. With memoisation, the graph collapses to amount + 1 calls:

The number of coins is the same for every call (given the second approach). Hence, the only varying argument is the amount.
This would be inefficient using LRU cache as it would save both arguments for each call. Hence:
- Pass home-grown cache as an argument or
- Define an inner function and use LRU cache on that instead

0/1 Knapsack

No fractional amounts. For each item you have two options:

Pick up the item: get a smaller knapsack that can carry $W-w_i$ and have a current worth of $v_i$
Don’t pick up the item: get a knapsack that can carry $W$ and have a current worth of $0$

Repeat for each item, picking the option that gives the greater worth.

Edge cases:

No more items
Item can’t fit

Cache using index of item ( $i$ ) and weight of knapsack ( $C$ ): $$ V(i, c) = \begin{cases} 0 &i < 0 \ V(i-1, C) &i\ge 0, w_i > C \ \max(V(i-1, c), v_i + V(i-1, C-w_i)) &i\ge0, w_i \le c \end{cases} $$ This is easier when using 1 indexing: length and index are the same.

Top-Down

The number of possible calls is $nC$ , where $n$ is the number of items, and each function call will call two other calls until it reaches the base case.

If using LRU cache, run the computation in in a nested function using LRU cache, and store the array of items outside the scope of the function.

Bottom-Up

Generate table of knapsack weight (column) and index (row), filling table row-wise. Go backwards to reconstruct solution.

1 indexing is also useful in this case, in which an extra row and column (0) will be needed.

For each cell, you have two options:

Don’t pick up: go to the value of cell directly above
Pick up: value of item plus value of the cell one row above and $w_i$ to the left

Take the maximum of these two options.

This has a time and space complexity of $O(nW)$ . If only the maximum value and not the path is required, only two rows are needed. Hence, the space complexity gets reduced to $O(W)$ .

NB: this is pseudo-polynomial time - it is polynomial with the size of the knapsack, not the number of items.

Assuming the path is required, use backtracking to reconstruct the solution: start at the end, and If the value of the item directly above is the same, it means the optimal solution came from not picking up the item. Else, go up one and left by the weight of the item. This has a complexity of $O(n)$ .

Subset Sum Problem

Given a set of integers, find a subset that sums to zero.

Generalize this to finding a subset that sums to some value $S$ .

def subset_sum_exists(numbers, n, required_sum)
# returns true iff subset within first n elements that sum to required_sum

Longest Common Subsequence

Given two strings $A$ of length $n$ and $B$ of length $m$ , find the length of longest subsequence common to both strings. Characters do not need to be contiguous.

Brute forcing this is not possible - there are $2^n-1$ subsequences of $A$ , and determining if it is a subsequence of $B$ would take $O(m$ ) time. Hence, the time complexity of brute forcing a solution is $O(m\cdot 2^n)$ .

Create a function, $L_{i,j}$ , which returns the longest common subsequence given substrings of $A$ and $B$ : $A=a_1a_2\dots a_i$ and $B=b_1b_2\dots b_j$ . $$ L_{i,j} = \begin{cases} 0 & i = 0 \text{ or } j = 0 \ L_{i-1, j-1} + 1 & a_i = b_j \ \max(L_{i, j-1}, L_{i-1, j}) & a_i \neq b_j \end{cases} $$

Bottom-Up

Create a table of 1-origin subscripts for all possible values of $L_{i, j}$ , initializing $L_{0, j} = 0$ . Then, built up the table row-by-row:

If $a_i = b_j$ , add 1 to the value of the cell one row up and one column to the left
Else, get the max of either the cell to the left or the cell above

Edit Distance

The diff algorithm. Transform source string $A = a_1\dots a_n$ into $B = b_1\dots b_m$ using the minimum number of insertions, deletions, and substitutions. This number is called the edit distance.

When a transformation is not needed, it is called the alignment operation, and has zero cost.

Work from end of string to start.

Copy/Alignment: If $a_i = b_j$ , $D_{i,j} = D_{i-1, j-1}$
Otherwise, use the minimum of:
- Deletion: $D_{i,j} = D_{i-1,j} + 1$
- Insertion: $D_{i,j} = D_{i, j-1} + 1$
- Substitution: $D_{i,j} = D_{i-1, j-1} + 1$
  - Prefer substitution over deletion/insertion $$ D_{i, j} = \begin{cases} 0 & i = j = 0 \ i & j = 0 \ j & i = 0 \ D_{i-1, j-1} & a_i = b_j \ 1 + \min(D_{i-1, j}, D_{i, j-1}, D_{i-1, j-1}) & \text{otherwise} \end{cases} $$

Longest Increasing Subsequence

One solution is to sort, remove duplicates, and then run LCS on the two sequences. However, a more efficient solution is possible.

Let $L_j$ be the length of the longest increasing substring up to and including position $s_j$ in the string $s\dots s_n$ .

Recurrence relation: find the maximum of LIS for all solutions before the current index, provided the last (and so biggest) element is smaller than the current element.

L_j = \begin{cases} i & i=1, s_j \ge s_i \; \forall 0 < j < i \\ 1 + \max_{0<j<i, \; s_j < s_i}{L_j} & \text{otherwise} \end{cases}

This can be further optimized into an $O(n\log{n})$ solution: optional sequences are in increasing order, so a binary search can be done instead of searching all elements before it.

6. Graph Theory

Definitions

Graph: $G(V,E)$ , where $V$ is a set of vertices, $E$ is a set of edges, each relating two vertices together. $$ n=|V| \ m=|E| \ m \le n^2 $$

Sparse graph: $m \in O(n)$
Dense graph: $m \in \Theta(n^2)$
Directed graph: edges are ordered pairs $E \subseteq V \times V$ , so one way.
Undirected graph: edges are a set of vertices, $E \in \{\{u,v\}, u,v \in V\}$

Weighted graphs: edges have a weight $\omega$ .

Path: a sequence of vertices where:

Each vertex is connected to each other by a edge
Each edge is not used more than once

Length of path: number of edges.

Cycle: path where the source and destination vertex is the same:

Cyclic if graph contains at least one cycle
- Acyclic if not
  - If it is also directed, it is called a directed acyclic graph
Loop: cycle of length one

Empty graph: $(\{\}, \{\})$ .

Subgraphs

$G'=(V',E')$ where $V' \subseteq V, E' \subseteq E$

Component: a non-empty subgraph where every pair of vertices connected by a path. If there is a path between two vertices, they are in the same component.

Tree: graph with a root vertex such that there is only one path between the vertex and root.

Directed tree: tree where all edges point either towards or away from the root.

Forest: graph where every component is a tree

Textual Representation of Graphs

Header:

D or U for directed/undirected
n: number of vertices (which are numbered from 0)
W: appears if it is a weighted graph

All other lines after the header define a edge: V1 V2 or V1 V2 Weight

Data Structures

Adjacency Matrix:

$n \times n$ binary matrix $A$ , where $A[i,j] = 1$ if $i$ and $j$ are connected to each other via a edge and 0 otherwise
If undirected, the matrix is symmetric along the main diagonal
If the graph is weighted, the value is the weight, or null if no edge

Adjacency List:

A list of length $n$ , each element being a list for every vertex
Each sub-list contains the vertices it is connected to ○ If directed, $j$ is in the list for $i$ if there is a edge $(i,j)$ ○ If the graph is weighted, each item in the list is of the form $(j,w)$

Adjacency matrices are faster ( $\Theta(1)$ vs $\Theta(n)$ ) for operations such as checking for the existence of an edge, adding, and deleting edges, but uses $\Theta(n^2)$ space compared with $\Theta(n)$ for adjacency lists.

Forests:

Each vertex has zero or one parents
Store as an array: if vertex $v$ has parent $u$ , $parent[v]=u$

Graph Traversal

Each vertex can be in one of three states:

Undiscovered
Discovered
Processed

Predecessor Tree:

Root: starting vertex
For $u$ in the list of vertices:
- If $(u,v)$ or $(u, v)$ , where $v$ is undiscovered:
  - Add $v$ to the tree: $v$ is the child of $u$ (even if the graph is undirected)
    - Set $parent(u) = v$ . Note that this is different from the forest map as a child can have multiple parents, but will only ever specify one parent: whatever was first

BFS

Two procedures: BFS-Tree initializes the data structure and calls BFS-Loop:

def BFSTree(adj, index):
  # Where:
  # - adj is the adjacency matrix
  # - index is the index of the root
  n = len(vertices)
  state = [UNDISCOVERED * n]
  state[index] = DISCOVERED

  parent = [NULL * n]
  Q = Queue()
  Q.add(index)
  return BFSLoop(adj, Q, state, parent)

def BFSLoop(adj, Q, state, parent):
   while Q is not empty:
    u = Q.remove()
    for v in adj[u]:
      # For each vertex u is connected to
      if state[v] == UNDISCOVERED:
        state[v] = DISCOVERED
        parent[v] = u
        Q.add(v)
    state[u] = PROCESSED
  return parent

Analysis

Size of $G$ : $|V|=n$ , $|E|=m$ , where $m \le n^2$ .

Time complexity of BFSTree: $\Theta(n)$
Time complexity of BFSLoop: proportional to number of times the loop is run, so maximum of number of edges in the graph: $O(m)$

So complexity is $O(n+m)=O(\max(n,m))$

DFS

def DFSLoop(adj, index, state, parent):
  for v in adj[index]:

    # For each vertex the given index is connected to
    if state[v] == UNDISCOVERED:
      state[v] = DISCOVERED
      parent[v] = index
      DFSLoop(adj, v, state, parent)
    state[index] = PROCESSED
  return parent

Properties of DFS and BFS

Works with directed and undirected, although weights are ignored
Only creates a single tree
DFS can be done by replacing BFS queue with stack, and replacing enqueue and dequeue with push and pop
BFS: vertices discovered in order of increasing distance from start
DFS: vertex only marked as processed once all vertices reachable from it are discovered

Shortest Path

Use BFS to produce parent/predecessor tree, working backwards from the end until you get to the start.

def treePath(parentTree, start, end):
    if start == end:
        return [start]
    path = treePath(parentTree, start, parentTree[end])
    path.append(end)
    return path

Connected Components

When BFS or DFS is run, all vertices reachable from the start node are processed: that is, it finds a subgraph.

Thus, to find all components, run BFS or DFS on each vertex that has not yet been discovered. To make sorting the vertices into components easier, there can be an array that stores the component number; on what run of the DFS/BFS the vertex was first discovered.

Strong Connectivity

A directed graph is strongly connected If there is a path between every ordered pair of vertices.

This can be determined inefficiently by running BFS and check if all vertices are processed for all possible starting vertices.

Transpose operation, $G^T$ : for all edges, reverse the order:

Adjacency matrix: mirror along the main diagonal. $\Theta(n^2)$
Adjacency list: $\Theta(m+n)$

Hub: vertex $h$ is a hub if $h$ is reachable from every vertex AND $h$ can visit every vertex.

If there is such a vertex, then the graph must be strongly connected as any vertex can go through $h$ to get to any other vertex. Thus, if there is a hub, every vertex is a hub. To determine this:

Run a traversal starting with $h$
Return false if any vertex is undiscovered
Construct the transpose of the graph
Run a traversal on the transpose starting at $h$
Return false if any vertex is undiscovered

Thus, time complexity is the same as graph traversal: $O(n+m)$

Topological Ordering (DAGs only)

Ordering of vertices such that if $(u,v)$ exists, $u$ is ordered before $v$ .

All DAGs have at least one such ordering.

Programs such as make will use it to find a valid order to compile rules; package installers will use it to find the correct installer to install dependencies.

Initialize empty stack
For each vertex:
- If undiscovered, start modified DFS: when vertex is processed, push it to the stack
  - The vertex at the end will be pushed to the stack first
  - This ordering will continue until it gets to the start vertex
  - When the DFS is run again with another start vertex, if it is ‘before’ the previous start vertex, vertices between the two will get added to the stack
Return the stack from top to bottom

Cycle Detection

Directed graphs: run DFS: if a vertex that has already been discovered is reached, there is a cycle. Run this with each possible starting vertex, or until a loop has been discovered.

Run DFS: if, while on vertex $u$ , if $u$ goes to $v$ and $v$ has already been discovered, there is a loop. If the graph is undirected, ignore the vertex $(u,parent[u])$ .

If the graph has multiple components, the DFS must be run multiple times, using a similar algorithm to when finding connected components.

Weighted Algorithms

Minimum Spanning Tree

Tree that includes all vertices of the graph and has the lowest total weight among all spanning trees. The graph must be undirected.

Prim’s Algorithm

Create a list: the parent of each vertex
Create a list: if a vertex is in the tree
Create an list containing the closest distance of a vertex to the tree, each with a starting value of infinity
While there are vertices remaining:
- Find the vertex v that is closest to the tree
  - For the first loop, pick vertex 0 (or anything, really)
- Add v to the list of vertices in the tree
- For all vertices still not in the tree:
  - If the distance between the vertex and v is smaller than the distance stored in the list
    - Update the distance, and set the parent of the vertex to v
Return the parent and distance

The algorithm is $O(n^2)$ , but if a heap-based priority queue used for the distance array, the time complexity of finding the next vertex is $O(\log{n})$ . Hence, the time complexity of the overall algorithm to $O(m+n\log{n}).$

def prim(adj):
# Prim's algorithm: minimum spanning tree
  in_tree  = [False]        * len(adj)
  distance = [float('inf')] * len(adj) # Stores distance from tree
  parent   = [None]         * len(adj)
  distance[0] = 0

  while False in in_tree:
    u = next_vertex(in_tree, distance)
    # Pick closest vertex not in tree. If distance is min heap with edit 
    # support, then function is log(n) and algorithm is nlog(n)
    in_tree[u] = True
    for (vertex, weight) in adj[u]:
      if not in_tree[vertex] and weight < distance[vertex]:
        distance[vertex] = weight
        parent  [vertex] = u
  
  return (parent, distance)

Shortest Path Trees

The shortest path tree rooted at vertex $s$ is the spanning tree such that distance from root to any other vertex is the smallest possible.

Dijkstra’s Algorithm

NB: this will not always produce correct results on graphs with negative weights.

def dijkstra(adj, s):
  # adj = adjacency list
  # s = index for start vertex
  in_tree = [False]         * len(adj)
  distance = [float('inf')] * len(adj)
  parent = [False]          * len(adj)
  distance[s] = 0
  while False in in_tree:
    u = next_vertex(in_tree, distance)

    # Pick the closest vertex not yet in the tree. Efficiency can improve
    # if priority tree (min heap) is used
    in_tree[u] = True
    for (vertex, weight) in adj[u]:
      if not in_tree[vertex] and distance[u] + weight < distance[vertex]:
        # If the distance to the vertex using `u` is closer than it was
        # when `u` was not in the tree, update the tree
        distance[vertex] = distance[u] + weight
        parent  [vertex] = u

  return (parent, distance)

All-Pairs Shortest Path

Dijkstra gives the shortest path from a given source vertex to every vertex in graph.

All-pairs gives the shortest path between every pair of vertices in graph. This can be done running Dijkstra for each vertex: $O(n^3)$ . However, it is greedy so will not work with graphs with negative edges.

Floyd-Warshall Algorithm

DP algorithm to all-pairs. Solutions are limited to graphs without negative cycles: the sum of weights along a cycle is not negative.

Intermediate vertices: vertices between source and destination vertices. $p = (v_s,v_1,v_2, \dots,v_l,v_d)$ . Path has $l$ intermediate vertices.

Properties of Shortest Paths

Simple: no vertex repeated along the path (assuming the condition that there are no negative cycles).

Any sub-path of a shortest path is also a shortest path. If this was false, that sub-path could be used to make a shorter shortest path.

The Algorithm

$n$ vertices, $0\dots n-1$ , with $d(i,j,k)$ being the shortest distance from $i$ to $j$ if intermediate vertices are all in $0 \dots k$ .

Use recursion on $d(i,j,k)$ , using results from $d(i,j,k-1)$ :

If $k$ is an intermediate vertex on the shortest path:
- It is equivalent to joining the shortest paths from $i$ to $k$ , and from $k$ to $j$
That is, adding the lengths of $d(i,k,k-1)$ and $(k,j,k-1)$
$k$ is not an intermediate vertex: the same as $d(i,j,k-1)$

There are three base cases:

$d(i,j,k) = 0$
- When $k = -1$ and $i = j$
$d(i,j,k) = \infty$
- When $k = -1$ and $(i,j) \notin E$
- No intermediate edges, and no edges connecting $i$ and $j$
$d(i,j,k) = w((i,j))$
- No intermediate edges, and a edge connecting $i$ and $j$ : set the distance to the weight of that edge

Top-Down

All three parameters can take on $n$ values, so the size of the cache is $O(n^3)$ , and each individual call runs in constant time. Thus, the algorithm is $O(n^3)$ .

Bottom-up

Uses table of size $n\times n$ , with the columns and rows being $i$ and $j$ respectively. Initially, the table contains the weights of the edge between $i$ and $j$ , infinity if it doesn’t exist, or $0$ if $i = j$ .

Thus, the space complexity is $\Theta(n^2)$ . This works as the value of $d(i,j,k)$ is dependent only on the values of $d(i,j,k-1$ so only the $k-1$ values are needed, and the matrix can be incrementally updated.

Run the algorithm with $k=0$ , and once all cells are filled in, move on to $k=1, 2, \dots$ .

Backtracking

Technique to generate all or some solutions to a problem incrementally. Generate one possible solution, and if it isn’t correct, then use it to generate children, and repeat.

def dfs_backtrack(candidate, input, output):
  if is_solution(candidate, input):
    add_to_output(candidate, output)
  else:
    for child in children(candidate, input):
      dfs_backtrack(child, input, output)

candidate: vertex of the tree
input: information about instance of the problem i.e. desired output (e.g. size of the solution)
output: list of valid solutions
is_solution: returns true if it is a leaf node: valid solutions won’t have valid children, so there is no need to go further down the tree
children: returns list of possible candidates

In some cases such as search, only one solution may be required.

Pruning

In some cases, it may be known that it is impossible for any children (and their children etc.) of a node to produce a solution, meaning making further calls is unnecessary.

Check if pruning is required at the start of the DFS/BFS backtrack function.

7. Computational Geometry I

Introduction

Be aware of:

Floating point errors
- Nearly-parallel lines
- Very thin/long rectangles
- Points very close together
Special cases
- Coincident points
- Intersecting lines
Degenerate cases
- Zero area triangles

This course simplifies this by using integer cartesian grids and avoiding division, square roots, trig etc.

Sidenote - basic matplotlib example:

import matplotlib.pyplot as plt
vertices = [(0,0), (1,2), (2,4)]
vx, vy = zip(*vertices) # Unzip the array into x and y components
axes = plt.axes()
axes.plot(vx, vy, color="blue", marker="o")
plt.show()

Points, Vectors & Lines

Definitions

Point:

Position in space
$(x, y)$ tuple
Uppercase, non-bold, non-italic letters

Vector:

Difference between two points
$(dx, dy)$ tuple
The vector from $A$ to $B$ is $v = (b_x-a_x, b_y-a_y )$
Lowercase, bold, italic letters
Position vector
- Vector from origin to some given point

Basic Operations

Points:

Subtract to get displacement vector
Linear/affine sum:
- $P(\alpha) = (1-\alpha)A + \alpha B$
- Indivisible, two operation operand
- Points on the line between A and B
- $P(0) = A$
- $P(1) = B$
The linear sum is the equation of the line: use this instead of $y = mx + c$ as it does not work with vertical lines

Vectors:

Addition, subtractions
Multiplication by scalar
Addition to a point to get another point
Magnitude: $|v|^2=\sqrt{v_x^2+v_y^2}$
Dot product
Signed triangle area

Vector Operations

Dot Product:

$p \cdot q = p_x q_x + p_y q_y=|p\|q|\cos{\theta}$
$p \cdot q > 0$ when the angle is acute: angle in $(0,90)$ or $(270,360)$
$p \cdot q = 0$ when they are perpendicular

Signed Area/Turn Direction

The area of parallelogram is:

p_x q_y-p_y q_x=|p\|q|\sin{\theta}

This is equivalent to the magnitude of the cross product where $p$ , $q$ are two edge vectors for vectors $B$ and $C$ respectively, relative to the point $A$ .

The area of triangle is half of this.

The area is positive if $b$ is ‘counterclockwise’ to $c$ , centered around $a$ and zero when the points are collinear (parallel).

Checking if $P$ is on the line $(A,B)$ :

Check if signed area 0
To check if it is on the line segment (i.e. Between $A$ and $B$ ):
- $|P,A| \le |A,B|$ and $|P,B|<|A,B|$
- Where $|X,Y|$ is the distance between the points $X$ and $Y$

Turn Direction:

Just check the sign of the signed area (unless the area is zero)

Line Segment Intersection:

Two line segments $AB$ and $CD$ intersect iff $C$ and $D$ are on opposite sides of the line $AB$ and vice-versa
True if isCCW(A,B,D) != isCCW(A,B,C) && isCCW(C,D,A) != isCCW(C,D,B) (where the first argument is the origin)

Degenerate cases:

One point on the line
Sharing a point
All four points collinear

Simple Polygons

By convention, vertices are in counter-clockwise order.

Convex Polygons:

Every interior angle less than 180 degrees
- Strictly less than for strictly convex polygon
Every counter-clockwise traversal turns left at every vertex (or goes straight)

Point Inclusion Test

Convex Polygon

For a convex polygon, a point is in a polygon (PiP) if it is on the left side of every edge for an anticlockwise traversal of the polygon:

all([isCCW(vertex[i], vertex[(i + 1) % len(vertex)], P) for i in range(len(vertex))])

Simple Polygon

Create a semi-infinite horizontal line starting from $P$ going either left or right
The point is inside the polygon iff it intersects the polygon an odd number of times
The line only needs to extend to past the maximum $x$ value for the polygon’s edges

Special cases:

The line goes through a vertex
- On the ‘inside’: count 0 or 2 times
- Otherwise: count 1 time
Goes along an edge

The correct way to solve this is to:

Ignore horizontal edges
If edge directed upwards, count the start vertex as an intersection
If edge directed down, count the end vertex as an intersection

The hacky solution: all vertices are on an integer grid, so simply move $P$ up by a very small amount (e.g. $10^{-8}$ ).

Improvements:

Find the rectangle that encompasses the polygon
- If its not in the rectangle, then it can’t be inside

Convex Hull Problem

For a finite set of points, the convex hull is the smallest convex polygon enclosing all points. The convex hull:

Is unique
Contains every point in the set
has $(x_{min}, ?)$ , $(x_{max}, ?)$ , $(?, y_{min})$ and $(?, y_{max})$ as vertices of the convex hull

Naïve solution:

Construct all possible edges using pairs of points
For every other point, check if it lies on the left side of the edge. If so, it belongs to the convex hull
$O(n^3)$

Gift Wrap Algorithm

Start from the point with the minimum $y$ coordinate (or max $y$ , or min/max $x$ )
- Choose the left-most one if multiple points have the same minimum
Select the ‘rightmost’ point from the perspective of the current point $A$
- Pick any point $B$
- For all points $C$
  - Check if $C$ is to the right of $AB$
  - If so, $B = C$
- The current $B$ is part of the convex hull. Repeat until a closed loop is formed
Worst case $O(n^2)$
Normal case $O(mn)$ , where $m$ is the number of vertices in the convex hull

Graham-Scan Algorithm

An $O(n\log{n})$ algorithm. First construct a simple closed path that includes all points:

Point $0$ will have the minimum $y$ value, and so is guaranteed to be on the convex hull
Create a line centered around point $0$ and pointing right
Rotate the line anticlockwise. When a vertex is found, add them to the list - we are essentially sorting the vertices by angle
Connect the vertices in the sorted list to create the simple closed path
Not all vertices will be on the convex hull, so the concavities need to be filled in

Sorting without calculating angles

We don’t need to know the angle: only the ordering. This can be done using isCCW, Python’s built in sort function (which will be $n\log{n}$ ), and from functools import cmp\_to\_key or by defining a SortKey class with a custom __lt__ operator and converting the point to that class in the sort function’s key lambda.

Consider edge cases:

$P_1$ or $P_2$ are the anchor
Several points colinear with $P_0$ (i.e. on a straight line)
- Don’t worry about them

Concavities:

$P_0$ and $P_1$ always on the convex hull
$P_2$ may not be

Initialize a stack with $[P_0, P_1, P_2]$ :

Check if the next point, $n$ , is on the left edge of the line connecting $P_{n-2}$ and $P_{n-1}$
If so, add the point to the stack
Otherwise, pop from the stack until this is true

Complexity:

Step 1: find the point with the minimum $y$ Y value: $O(n)$
Step 2: sort operation: $O(n\log{n})$
Step 3: initialize hull
Step 4: generate the convex hull
- Not $O(n^2)$ as although the nested while loop may run several times, it can only remove previously added elements, so you can only ever pop $n$ elements
- Hence, it is $2n \in O(n)$

8. Computational Geometry II

Range Searching

Introduction

There are a large number of geometric objects. How do you find all objects intersecting/within a given area/volume?

We assume:

Objects are points
Range is rectangular
There are many queries on the same point set
- Hence, we can pre-process the data

1D range searching

Determine which points lie in $[r_{min}, r_{max}]$ :

matches = [x for x in nums if rmin <= x <= rmax]

To improve performance, sort the items: $O(n\log{n})$ . Each query will use binary search to find the smallest point in the range, and then iterate through the list until the item is out of the range. Hence, the cost of each query is $O(\log{n} + m)$ , where $m$ is the number of points in the range.

However, this approach does not generalize to multiple dimensions.

1D Binary Search Tree (1D k-d tree)

A balanced binary search tree with numbers at the leaves
Internal nodes are medians - (almost) len(data)//2 - 1
Left subtree contains $\text{numbers} \le \text{median}$
Right subtree contains $\text{median} < \text{numbers}$
Assume no duplicates

To create the search tree, sort the elements; the left subtree should contain the lower half of the list, plus element with index len(list)//2 - 1. Repeat the process with the subtrees until you reach a leaf node.

from collections import namedtuple
Node = namedtuple("Node", ["value", "left", "right" # Cheap way to make a class

def search_tree(nums, sorted=False):
  if not sorted:
    nums = sorted(nums)

  if len(nums) == 1:
    tree = Node(nums[0], None, None) # Leaf node
    else:
      midpoint = len(nums) // 2 # Halfway, rounding down
      left = search_tree(nums[:midpoint], True)
      # Up to but not including midpoint
      right = search_tree(nums[midpoint:], True)
      # Including the midpoint and the end
      tree = Node(nums[midpoint - 1], left, right)
      # Left child will be equal to parent, which meets the
      # less than or equal to criteria

    return tree

Search

def range_query(tree, low, high):
  # Inclusive range
  matches = []
  if tree.left is None and tree.right is None:
    if low <= tree.value <= high:
      matches.append(tree.value)
  else:
    if low <= node.value and tree.left is not None:
      matches += query(tree.left, low, high)
    if node.value < high and tree.right is not None:
      # Range may include both children, so use if and not else
      matches += range_query(tree.right, low, high)
  return matches

The algorithm visits at least one leaf node, so the minimum cost is $O(\log{n})$ .

If it visits $m$ leaves, then it also traversed $m/2$ parents, $m/4$ , grandparents, $\dots$ , 1 root node.

$\frac{m}{2} + \frac{m}{4} + \frac{m}{8} + \dots \approx 2m$ , so the order of complexity is $O(\log{n+m})$ .

k-d trees

Spatial Subdivision

Regular grid: regular divisions.

Quad tree: regular grid, but each cell contains a sufficiently small number of cells; if it has more, recursively subdivide it into quarters.

k-d tree: divide cell into halves along the median (each half contains the same number of cells) on one axis. For each half, divide it again into equal halves again, but this time on the other axis. Rinse and repeat until there a sufficiently small number of cells in each cell or a maximum depth is reached.

Differences:

The leaf nodes can contain multiple points
Nodes labelled with the axis it is subdividing on
- Subscript is a list showing path from root to node (0 for left, 1 for right)
  - e.g. 01 means root/left/right
Only worth rebuilding the entire tree if there are significant additions/removals

def kd_tree(points, depth = 0):
  if len(points) <= threshold or depth >= max_depth:
    return Leaf(points)
  else:
    axis = depth % num_axes
    points.sort(key=lambda point: point[axis])
    mid_index = len(points) // 2
    midpoint = points[mid_index - 1][axis]
    # Subtract 1 since left <= midpoint, left does not include midpoint
    left =  kd_tree(points[:mid_index], depth + 1)
    right = kd_tree(points[mid_index:], depth + 1)
    return Node(axis, midpoint, left, right)

k-d Range Search:

def points_in_range(node, query_shape):
  if node is leaf:
    matches = points in node.points within shape
  else:
    matches = []
    if intersection of shape and lower half space not empty:
      matches += points_in_range(node.left, query_shape)
    if intersection of space and upper half space not empty:
      matches += points_in_range(node.right, query_shape)
  
  return matches

Unlike in the one dimensional case, multiple points can have the same value for a given dimension as when the split occurs on another dimension, it could end up on either side. Hence, both the lower and upper half of the shape can include points on the line.

Closest Neighbor

For a given point (that does not need to be in the tree), find the closest point within the tree.

Use the k-d tree to find the cell the point would be in - it is guaranteed to contain at least one point
Find the closest point in the cell, and take that distance as the upper bound on the minimum distance
Search for cells that intersect with the square (notes)/circle (optimal) centred around the original point
- Check if any of the points inside the cells are closer

Quadtrees

Split each region into 4 sub-regions, partitioned evenly on both axes
Repeat until the maximum depth is reached, or the region contains fewer than some specified number off points
Node contains coordinates of centre, width/height of the square, and their children
Node labels: path from root e.g. 0/1/1 would be root/quadrant 1/quadrant 1

This only works in 2D. Performance:

Best case: equally distributed, $T(n) = n + T(\frac{n}{4}) = n\log{n}$
Worst case: all points in the same cell at depth $d$ : $O(nd)$

Line Sweep

Simulates a vertical line sweeping across objects on $xy$ plane on one direction
Sort points by the $x$ coordinate
At any point, solution to problem for all points to the left is known
When point added, check if solution changes

Data structures:

Event queue
- Positions of event points
- Visited points deleted from queue
Status structure
- State of algorithm at each position
Solution set
- Solution at current position

Closest Point-Pair

$n$ points on a plane: find the points $P$ and $Q$ such that they are the minimum distance from each other.

In the 1D case, simply sort then search through the list sequentially: $O(n\log{n})$ .

2D Case

Idea:

Store current closest distance $d$ , current point $P$
Frontier set: set of points where the $x$ coordinate is in the range $(P_x - d, P_x]$
- Kind-of-but-not-really points where the $y$ coordinate is also in the range $(P_y - d, P+y + d)$
Only need to the check distance $P$ is to for points in the frontier set - any point further away cannot be the closest point

Data structure:

Point list
- Points sorted by x coordinate
State
- Current sweep position - index in point list
  - Initially $P_2$
- Current best point pair and distance (NOT $x$ distance)
  - Initially $(P_0, P_1)$
- Frontier set, $F$
  - Includes current point
  - Initially $\{ P_0, P_1, P_2 \}$

At each step:

Get the current point, add it to the frontier set
Remove elements from $F$ where the horizontal distance from the current point is greater than $d$
- DO NOT remove elements further away than $d$ vertically
If they are closer than $d$ vertically, calculate the distance and update $d$ if necessary

The frontier set must be sorted by $y$ to make it efficient and have efficient addition and removal. Hence, a self-balancing binary tree or SortedContainer could be used.

Complexity:

Sorting by $x$ : $O(n\log{n})$
Each point added and removed from frontier set: $O(n\log{n})$ total
Search for closest pair takes total $O(n\log{n})$ time ○ $O(\log{n})$ to select the range of points $d$ from the point in the $y$ axis
Hence, the algorithm is $O(n\log{n})$ in terms of time