All Files in ‘COSC262 (2020-S1)’ Merged

1. Algorithms

A well-defined procedure:

The procedure must work for all instances of the problem.

Defining Algorithms

Properties

Determining Correctness

Testing cannot prove an algorithm is correct (unless you can test all cases): it can only prove that it is incorrect. Only a formal proof can establish correctness; formal proofs are not covered in this course.

Start with a reasonably precise statement of the problem and algorithm.

Then:

Example: Robot Arm (Skiena 1.1)

A robot needs to place solder at $n$ points on a PCB. What is the shortest cyclic tour that visits each point in the set $S$? This is the TSP, so an optimal solution cannot be found efficiently. Thus, heuristics need to be used.

Heuristics: greedy algorithm?

Example: Selecting Jobs (Skiena, 1.2)

Multiple movies need to be filmed; maximize the number of movies filmed. This is a set $I$ of $n$ intervals on the line.

Heuristics: start with the shortest movie? Start with the first available movie?

Look for counter examples that show that the heuristic is incorrect:

Proof by induction

def sum(n):
    s = 0
    for i in range(1, n + 1):
        s += i
    return s

Assume that the loop computes the correct sum after $k$ iterations. Show that it must then also compute the correct sum after iteration $k+1$.

Insertion sort

After $n$ iterations of the outer for loop, the sub-array $[0, n)$ is sorted. Now, you only need to prove that it inserts the $n$th element correctly.
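For reference, a minimal insertion sort sketch (not from the notes) that makes this loop invariant explicit:

def insertion_sort(array):
    """Sort `array` in place."""
    for n in range(1, len(array)):
        # Invariant: array[0:n] is already sorted
        key = array[n]
        i = n - 1
        while i >= 0 and array[i] > key:
            array[i + 1] = array[i]  # Shift larger elements right
            i -= 1
        array[i + 1] = key  # Insert the nth element into its correct position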

2. Algorithm Analysis

Once an algorithm is shown to be correct, consider the:

Complexity

The time taken will depend on the instance of the problem being solved.

RAM Model of Computation

Abstract random access machine

Formal Definitions

Big O (upper bound)

$$O(g) = \{f \mid \exists c > 0: \exists x_0: \forall x \ge x_0: 0 \le f(x) \le c \cdot g(x)\}$$

That is, $f(x)$ is part of the set $O(g)$ if there exists a positive multiplier $c$ such that for every value of $x$ greater than or equal to $x_0$, $f(x) \le c \cdot g(x)$.

Big Omega (lower bound)

$$\Omega(g) = \{f \mid \exists c > 0: \exists x_0: \forall x \ge x_0: 0 \le c \cdot g(x) \le f(x)\}$$

That is, $f(x)$ is part of the set $\Omega(g)$ if there exists a positive multiplier $c$ such that for every value of $x$ greater than or equal to $x_0$, $c \cdot g(x) \le f(x)$.

Big Theta (tight bound)

$$\Theta(g) = O(g) \cap \Omega(g)$$

That is, $f(x)$ is part of the set $\Theta(g)$ if there exist $c_1$ and $c_2$ such that for every value of $x$ above $x_0$, $c_1 \cdot g(x) \le f(x) \le c_2 \cdot g(x)$.

The worst-case and best-case running times are functions; the algorithm itself is not, so strictly it is these functions, not the algorithm, that we describe with Big O/Omega/Theta notation.

The best case of an algorithm will be $\Theta(g)$; the algorithm itself is not $\Theta(g)$. Ideally, an algorithm would be separately categorized by its best and worst case. In practice, we often categorize the algorithm by its worst case.

Common functions, sorted by complexity descending:

Functions
$x!$
$a^x$
$x^a$
$x \log(x)$
$x$
$\log(x)$
$a$

3. Recursion

The Process

Example: Merge Sort

Merge sort from left to middle, merge sort from middle to right and then merge the results:

The base case is when the length of the array is 0 or 1.
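A minimal sketch of the recursion described above (the helper name merge_sort is illustrative, not from the notes):

def merge_sort(array):
    if len(array) <= 1:
        return array  # Base case: length 0 or 1 is already sorted
    middle = len(array) // 2
    left = merge_sort(array[:middle])
    right = merge_sort(array[middle:])
    # Merge the two sorted halves
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]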

Analysis

For $T(n)$, the cost is either the base case of $\Theta(1)$, or the cost of the two recursive calls plus the $\Theta(n)$ merge:

$$T(n) = \begin{cases} \Theta(1) & n \le 1 \\ 2T(\frac{n}{2}) + \Theta(n) & n > 1 \end{cases}$$

At each level, the cost will be:

$$\begin{aligned} \Theta(n) &= 2\Theta\left(\frac{n}{2}\right) \\ &= 4\Theta\left(\frac{n}{4}\right) \\ &= \dots \\ &= \Theta(n) \end{aligned}$$

There are $\log_2{n}$ levels, so the total cost is $n\log_2{n}$.

4. Greedy Algorithms

Greedy algorithms:

Examples

Minimum Spanning Tree

These can be proved using proof by counter-example.

Coin Changing Problem

Coins of denominations $C = \{c_1, \dots, c_n\}$. Find the minimum number of coins such that the total value is $V$. Assume:

Greedy strategy:

Counter-example: $C = \{1, 10, 25\}$, $V = 30$. The optimal solution is $3 \cdot 10$, but the greedy algorithm gives $1 \cdot 25 + 5 \cdot 1$.

What is a sufficient property such that the greedy algorithm will always be optimal? Every coin is a multiple of the one below it.

Example: Interval Scheduling

There is a set of jobs, each with a start time ($s_j$) and finish time ($f_j$). Find the largest subset of non-overlapping jobs.

You could:

To prove a greedy algorithm correct, assume it is wrong and find a contradiction. To prove it wrong, simply find a counter-example.

Fractional Knapsack Problem

There is a set $S$ of $n$ items. Each item has a positive value $b_i$ and weight $w_i$. Find the subset which yields the maximum value without exceeding a given weight $W$.

You can select a fraction $x_i$ of an item; its weight and value are scaled proportionally.

To solve this with a greedy algorithm, sort the items by benefit/weight ratio ($b_i/w_i$), then fill up the knapsack with the highest-ratio item still available until the target weight is reached.
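A sketch of this greedy strategy, assuming items are given as (value, weight) pairs (the function name is illustrative):

def fractional_knapsack(items, capacity):
    # items: list of (value, weight) pairs; capacity: maximum total weight W
    total = 0
    for value, weight in sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True):
        if capacity <= 0:
            break
        fraction = min(1, capacity / weight)  # Take as much of the best item as fits
        total += fraction * value
        capacity -= fraction * weight
    return total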

0/1 Knapsack Problem

Can only take the whole item or not take it at all. Requires dynamic programming.

Huffman Coding

Variable vs Fixed length

Fixed length codes are convenient for indexing, counting (e.g. ASCII: 7-bit binary string per character, one character per byte).

Variable length codes may be more efficient (e.g. Morse code: frequent letters have short encodings; UTF-8: most common characters 1 byte long).

Variable-Length Coding

Let $a = 0$, $b = 1$, $c = 01$ etc.

Problem: there is no spacing, so it is ambiguous where a character starts and ends.

Solution: ensure no binary string in the code is a prefix of any other binary string.

For $a$ to $f$, one possible solution is $a = 000$, $b = 0010$, $c = 0011$, $d = 01$, $e = 10$, $f = 11$.

This can be represented as a tree with two children per node and characters being found in leaf nodes:

            .
           /  \
        0 /    \ 1
         .       .
        /\       /\
     0 /  \ 1 0 /  \ 1
      .    d   e    f
     /\
  0 /  \ 1
   a    .
        /\
     0 /  \ 1
      b    c

Encoding

The most frequent characters should get the shortest encodings. However, assigning codes greedily from the top of the tree down fails.

Instead, the key is that the two least frequent characters should be sibling leaves at the bottom of the tree. Hence, to create a Huffman tree, repeatedly merge the two least frequent subtrees (kept in a min-heap keyed by frequency) until only one tree remains.

Popping from and inserting into a min-heap are $O(\log{n})$ operations and $n$ iterations are required. Hence, the time complexity of building a Huffman tree is $O(n \log{n})$.
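A sketch of the construction using Python's heapq module (my representation, not from the notes: a tree is either a character or a (left, right) pair, and a tie-breaking counter keeps the heap from ever comparing trees):

import heapq

def huffman_tree(frequencies):
    # frequencies: dict mapping character -> frequency
    heap = [(freq, i, char) for i, (char, freq) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # Two least frequent subtrees...
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))  # ...become siblings
        counter += 1
    return heap[0][2]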

Disadvantages

The encoding table must be sent along with the encoded document. Hence, the overhead may be large for small documents.

Huffman encoding is optimal for global character frequencies. However:

Lempel-Ziv usually gives better compression with substring recognition. The deflate algorithm combines the LZ and Huffman algorithms.

5. Dynamic Programming

Memoisation

Fibonacci can be defined as a recurrence relation: $f(n) = f(n-1) + f(n-2)$. However, this approach is inefficient due to lots of duplicate calls: when expanded, it becomes $f(n) = f(1) + f(1) + f(1) + \dots$

One solution to this is memoisation: the caching of results.

General Purpose Memoiser

def memoise(f):
  known = {}
  def new_func(*params):
    if params not in known:
      known[params] = f(*params)
    return known[params]
  return new_func

@memoise
def function_definition():
  ...

# Or use the built-in tools:
from functools import lru_cache
# LRU: least recently used; gets rid of the oldest result once the number of
# results stored reaches `maxsize`

@lru_cache(maxsize=None)
def function_definition():
  ...

Notes:

Fibonacci, Iteratively

Bottom-up iteration:

def fibonacci_i(n):
  a, b = 0, 1
  for i in range(n):
    a, b = b, a + b
  return a

This approach is $O(n)$ in time and $\Theta(1)$ in space.

Top-down Dynamic Programming

Recursion and memoisation.

Bottom-up Dynamic Programming

Example: Minimum Cost Path

Rectangular grid of cells, each with a weight. You can move to any adjacent cell (including diagonals) from each cell. Find the minimum cost needed to get from top to bottom.

Recurrence formula

$C_{i,j}$ is the optimal cost of getting to cell $(i, j)$, where $i \in [0, I)$, $j \in [0, J)$:

$$C_{i,j} = \begin{cases} \infty & \text{cell outside of the grid} \\ W_{i,j} & i = 0 \text{ (cell is in the top row of the grid)} \\ W_{i,j} + \min_{k \in \{j-1,\ j,\ j+1\}}(C_{i-1,k}) & \text{otherwise} \end{cases}$$

Hence in the normal case, the cost of the cell is the cell weight plus minimum cost to get to cell above it (left, directly above, or right).

Using a top-down approach:

Using a bottom-up approach:

If the grid is square, it is $O(n^2)$ in time and space.
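A bottom-up sketch of this recurrence (grid given as a list of rows of weights; it keeps only one row of costs and returns the minimum cost, not the path):

def min_cost_path(grid):
    INF = float('inf')
    costs = grid[0][:]  # Top row: the cost is just the cell weight
    for row in grid[1:]:
        new_costs = []
        for j, weight in enumerate(row):
            # Minimum over the three cells above (treating off-grid as infinity)
            left = costs[j - 1] if j > 0 else INF
            right = costs[j + 1] if j + 1 < len(costs) else INF
            new_costs.append(weight + min(left, costs[j], right))
        costs = new_costs
    return min(costs)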

This problem can also be solved with Dijkstra's shortest path algorithm by converting it to a multi-stage graph (with weights moved to edges). However, this is harder to implement and less efficient.

Suitability of DP

DP approaches are likely to work when the problem:

Then build the solution from solutions to sub-problems:

Optimal Substructure

When solutions to subproblems can be reused (multiple times!) to find the optimal solution to the problem.

For example, in the shortest path problem, the shortest path for a sub-problem will be inside the shortest path for the larger problem.

The longest path problem does not have an optimal substructure.

Properties of Recursive Solutions

Coin Changing Problem

A greedy approach does not produce optimal results. Hence, DP must be used.

Top-Down

Given coinage = [1, 10, 25] and amount = 80, the solution will always be minimum of:

Either amount or coinage decreases to decrease the size of the problem.

An alternative approach would be finding the minimum of:

That is 1 + min(coinage_calculate(amount - coin, coinage) for coin in coinage)

Without memoisation, there are a maximum of len(coinage) sub-calls per call. With memoisation, the graph collapses to amount + 1 calls:
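A top-down sketch with memoisation via lru_cache (the function names are illustrative, not from the notes):

from functools import lru_cache

def min_coins(amount, coinage):
    @lru_cache(maxsize=None)
    def solve(remaining):
        if remaining == 0:
            return 0
        options = [solve(remaining - coin) for coin in coinage if coin <= remaining]
        return 1 + min(options) if options else float('inf')
    return solve(amount)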

0/1 Knapsack

No fractional amounts. For each item you have two options:

Repeat for each item, picking the option that gives the greater worth.

Edge cases:

Cache using the index of the item ($i$) and the remaining capacity of the knapsack ($C$):

$$V(i, C) = \begin{cases} 0 & i < 0 \\ V(i-1, C) & i \ge 0,\ w_i > C \\ \max(V(i-1, C),\ v_i + V(i-1, C - w_i)) & i \ge 0,\ w_i \le C \end{cases}$$

This is easier when using 1-indexing: length and index are the same.

Top-Down

The number of possible calls is $nC$, where $n$ is the number of items, and each call makes at most two further calls until it reaches a base case.

If using an LRU cache, run the computation in a nested function decorated with lru_cache, and store the array of items outside the scope of that function.

Bottom-Up

Generate table of knapsack weight (column) and index (row), filling table row-wise. Go backwards to reconstruct solution.

1 indexing is also useful in this case, in which an extra row and column (0) will be needed.

For each cell, you have two options:

Take the maximum of these two options.

This has a time and space complexity of $O(nW)$. If only the maximum value and not the path is required, only two rows are needed. Hence, the space complexity is reduced to $O(W)$.

NB: this is pseudo-polynomial time - the running time is polynomial in the numeric capacity of the knapsack ($W$), not just in the number of items.

Assuming the path is required, use backtracking to reconstruct the solution: start at the end, and if the value of the cell directly above is the same, the optimal solution came from not picking up the item; otherwise, go up one row and left by the weight of the item. This has a complexity of $O(n)$.
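A bottom-up sketch of the table-filling and backtracking described above (items given as (value, weight) pairs; rows are 1-indexed, so row 0 is the "no items" base case; the names are illustrative):

def knapsack(items, W):
    # items: list of (value, weight) pairs; W: knapsack capacity
    n = len(items)
    V = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        value, weight = items[i - 1]
        for c in range(W + 1):
            V[i][c] = V[i - 1][c]  # Option 1: don't take item i
            if weight <= c:
                # Option 2: take item i and fill the rest with the best of the others
                V[i][c] = max(V[i][c], value + V[i - 1][c - weight])
    # Backtrack to recover which items were taken
    taken, c = [], W
    for i in range(n, 0, -1):
        if V[i][c] != V[i - 1][c]:   # Value changed, so item i was taken
            taken.append(i - 1)
            c -= items[i - 1][1]
    return V[n][W], taken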

Subset Sum Problem

Given a set of integers, find a subset that sums to zero.

Generalize this to finding a subset that sums to some value SS.

def subset_sum_exists(numbers, n, required_sum):
    # Returns True iff a subset of the first n elements sums to required_sum
    ...
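One possible recursive sketch of this function (each element is either excluded or included; memoisation over (n, required_sum) could be added with lru_cache):

def subset_sum_exists(numbers, n, required_sum):
    # Returns True iff some subset of the first n elements sums to required_sum
    if required_sum == 0:
        return True            # The empty subset works
    if n == 0:
        return False           # No elements left and the sum is not yet reached
    # Either exclude the nth element, or include it
    return (subset_sum_exists(numbers, n - 1, required_sum) or
            subset_sum_exists(numbers, n - 1, required_sum - numbers[n - 1]))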

Longest Common Subsequence

Given two strings, $A$ of length $n$ and $B$ of length $m$, find the length of the longest subsequence common to both strings. Characters do not need to be contiguous.

Brute-forcing this is not feasible - there are $2^n - 1$ non-empty subsequences of $A$, and determining whether each is a subsequence of $B$ takes $O(m)$ time. Hence, the time complexity of a brute-force solution is $O(m \cdot 2^n)$.

Create a function, $L_{i,j}$, which returns the length of the longest common subsequence of the prefixes $A = a_1 a_2 \dots a_i$ and $B = b_1 b_2 \dots b_j$:

$$L_{i,j} = \begin{cases} 0 & i = 0 \text{ or } j = 0 \\ L_{i-1,j-1} + 1 & a_i = b_j \\ \max(L_{i,j-1},\ L_{i-1,j}) & a_i \neq b_j \end{cases}$$

Bottom-Up

Create a table with 1-origin subscripts for all possible values of $L_{i,j}$, initializing $L_{0,j} = 0$ and $L_{i,0} = 0$. Then, build up the table row by row:
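A bottom-up sketch of this table construction (returns only the length; the function name is illustrative):

def lcs_length(A, B):
    n, m = len(A), len(B)
    # L[i][j] = length of the LCS of A[:i] and B[:j]; row/column 0 is the base case
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if A[i - 1] == B[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i][j - 1], L[i - 1][j])
    return L[n][m]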

Edit Distance

The diff algorithm. Transform source string A=a1anA = a_1\dots a_n into B=b1bmB = b_1\dots b_m using the minimum number of insertions, deletions, and substitutions. This number is called the edit distance.

When a transformation is not needed, it is called the alignment operation, and has zero cost.

Work from end of string to start.

Longest Increasing Subsequence

One solution is to sort, remove duplicates, and then run LCS on the two sequences. However, a more efficient solution is possible.

Let $L_j$ be the length of the longest increasing subsequence ending at (and including) position $j$ in the sequence $s_1 \dots s_n$.

Recurrence relation: find the maximum of LIS for all solutions before the current index, provided the last (and so biggest) element is smaller than the current element.

$$L_i = \begin{cases} 1 & s_j \ge s_i \ \forall\, 0 < j < i \\ 1 + \max_{0 < j < i,\ s_j < s_i} L_j & \text{otherwise} \end{cases}$$

This can be further optimized into an $O(n \log{n})$ solution: the smallest possible tail values of subsequences of each length form an increasing sequence, so a binary search can be done instead of searching all previous elements.
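A sketch of the quadratic version of the recurrence above (the $O(n\log n)$ variant would replace the inner loop with a binary search over tail values, e.g. using the bisect module):

def longest_increasing_subsequence(s):
    if not s:
        return 0
    # L[i] = length of the longest increasing subsequence ending at position i
    L = [1] * len(s)
    for i in range(1, len(s)):
        for j in range(i):
            if s[j] < s[i]:
                L[i] = max(L[i], L[j] + 1)
    return max(L)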

6. Graph Theory

Definitions

Graph: $G = (V, E)$, where $V$ is a set of vertices and $E$ is a set of edges, each relating two vertices.

$$n = |V|, \quad m = |E|, \quad m \le n^2$$

Weighted graphs: edges have a weight $\omega$.

Path: a sequence of vertices where:

Length of path: number of edges.

Cycle: path where the source and destination vertex is the same:

Empty graph: $(\{\}, \{\})$.

Subgraphs

$G' = (V', E')$ where $V' \subseteq V$, $E' \subseteq E$.

Component: a non-empty subgraph in which every pair of vertices is connected by a path. If there is a path between two vertices, they are in the same component.

Tree: a graph with a root vertex such that there is exactly one path between each vertex and the root.

Directed tree: tree where all edges point either towards or away from the root.

Forest: graph where every component is a tree

Textual Representation of Graphs

Header:

All other lines after the header define an edge: V1 V2 or V1 V2 Weight

Data Structures

Adjacency Matrix:

Adjacency List:

Adjacency matrices are faster ($\Theta(1)$ vs $\Theta(n)$) for operations such as checking for the existence of an edge, adding, and deleting edges, but use $\Theta(n^2)$ space compared with $\Theta(n + m)$ for adjacency lists.

Forests:

Graph Traversal

Each vertex can be in one of three states:

Predecessor Tree:

BFS

Two procedures: BFS-Tree initializes the data structure and calls BFS-Loop:

from collections import deque

def BFSTree(adj, index):
  # Where:
  # - adj is the adjacency list
  # - index is the index of the root
  n = len(adj)
  state = [UNDISCOVERED] * n
  state[index] = DISCOVERED

  parent = [None] * n
  Q = deque()  # FIFO queue
  Q.append(index)
  return BFSLoop(adj, Q, state, parent)

def BFSLoop(adj, Q, state, parent):
  while Q:
    u = Q.popleft()
    for v in adj[u]:
      # For each vertex u is connected to
      if state[v] == UNDISCOVERED:
        state[v] = DISCOVERED
        parent[v] = u
        Q.append(v)
    state[u] = PROCESSED
  return parent

Analysis

Size of $G$: $|V| = n$, $|E| = m$, where $m \le n^2$.

So the complexity is $O(n + m) = O(\max(n, m))$.

DFS

def DFSLoop(adj, index, state, parent):
  for v in adj[index]:
    # For each vertex the given index is connected to
    if state[v] == UNDISCOVERED:
      state[v] = DISCOVERED
      parent[v] = index
      DFSLoop(adj, v, state, parent)
  state[index] = PROCESSED
  return parent

Properties of DFS and BFS

Shortest Path

Use BFS to produce parent/predecessor tree, working backwards from the end until you get to the start.

def treePath(parentTree, start, end):
    if start == end:
        return [start]
    path = treePath(parentTree, start, parentTree[end])
    path.append(end)
    return path

Connected Components

When BFS or DFS is run, all vertices reachable from the start node are processed: that is, it finds a subgraph.

Thus, to find all components, run BFS or DFS from each vertex that has not yet been discovered. To make sorting the vertices into components easier, an array can store the component number for each vertex: i.e. on which run of the DFS/BFS the vertex was first discovered.

Strong Connectivity

A directed graph is strongly connected if there is a path between every ordered pair of vertices.

This can be determined inefficiently by running BFS from every possible starting vertex and checking that all vertices end up processed.

Transpose operation, $G^T$: reverse the direction of every edge:

Hub: vertex $h$ is a hub if $h$ is reachable from every vertex AND $h$ can reach every vertex.

If there is such a vertex, then the graph must be strongly connected, as any vertex can go through $h$ to get to any other vertex. Thus, if there is a hub, every vertex is a hub. To determine this:

  1. Run a traversal starting with $h$
  2. Return false if any vertex is undiscovered
  3. Construct the transpose of the graph
  4. Run a traversal on the transpose starting at $h$
  5. Return false if any vertex is undiscovered

Thus, the time complexity is the same as graph traversal: $O(n + m)$.

Topological Ordering (DAGs only)

Ordering of vertices such that if the edge $(u, v)$ exists, $u$ is ordered before $v$.

All DAGs have at least one such ordering.

Programs such as make use it to find a valid order in which to build rules; package installers use it to find a valid order in which to install dependencies.

  1. Initialize empty stack
  2. For each vertex:
    • If undiscovered, start modified DFS: when vertex is processed, push it to the stack
      • The vertex at the end will be pushed to the stack first
      • This ordering will continue until it gets to the start vertex
      • When the DFS is run again with another start vertex, if it is ‘before’ the previous start vertex, vertices between the two will get added to the stack
  3. Return the stack from top to bottom
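A sketch of this DFS-based ordering, assuming the adjacency-list representation and the UNDISCOVERED/DISCOVERED/PROCESSED constants from the traversal section:

def topological_order(adj):
    n = len(adj)
    state = [UNDISCOVERED] * n
    stack = []

    def dfs(u):
        state[u] = DISCOVERED
        for v in adj[u]:
            if state[v] == UNDISCOVERED:
                dfs(v)
        state[u] = PROCESSED
        stack.append(u)  # Pushed once everything reachable from u is done

    for u in range(n):
        if state[u] == UNDISCOVERED:
            dfs(u)
    return stack[::-1]  # Read the stack from top to bottom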

Cycle Detection

Directed graphs: run DFS; if a vertex that has already been discovered is reached, there is a cycle. Run this from each possible starting vertex, or until a cycle has been discovered.

Run DFS: if, while on vertex $u$, $u$ goes to $v$ and $v$ has already been discovered, there is a cycle. If the graph is undirected, ignore the edge $(u, parent[u])$.

If the graph has multiple components, the DFS must be run multiple times, using a similar algorithm to when finding connected components.

Weighted Algorithms

Minimum Spanning Tree

Tree that includes all vertices of the graph and has the lowest total weight among all spanning trees. The graph must be undirected.

Prim’s Algorithm

The algorithm is $O(n^2)$, but if a heap-based priority queue is used for the distance array, the time complexity of finding the next vertex is $O(\log{n})$. Hence, the time complexity of the overall algorithm improves to $O(m + n\log{n})$.

def prim(adj):
# Prim's algorithm: minimum spanning tree
  in_tree  = [False]        * len(adj)
  distance = [float('inf')] * len(adj) # Stores distance from tree
  parent   = [None]         * len(adj)
  distance[0] = 0

  while False in in_tree:
    u = next_vertex(in_tree, distance)
    # Pick closest vertex not in tree. If distance is min heap with edit 
    # support, then function is log(n) and algorithm is nlog(n)
    in_tree[u] = True
    for (vertex, weight) in adj[u]:
      if not in_tree[vertex] and weight < distance[vertex]:
        distance[vertex] = weight
        parent  [vertex] = u
  
  return (parent, distance)

Shortest Path Trees

The shortest path tree rooted at vertex ss is the spanning tree such that distance from root to any other vertex is the smallest possible.

Dijkstra’s Algorithm

NB: this will not always produce correct results on graphs with negative weights.

def dijkstra(adj, s):
  # adj = adjacency list
  # s = index for start vertex
  in_tree = [False]         * len(adj)
  distance = [float('inf')] * len(adj)
  parent = [None]           * len(adj)
  distance[s] = 0
  while False in in_tree:
    u = next_vertex(in_tree, distance)

    # Pick the closest vertex not yet in the tree. Efficiency can improve
    # if a priority queue (min-heap) is used
    in_tree[u] = True
    for (vertex, weight) in adj[u]:
      if not in_tree[vertex] and distance[u] + weight < distance[vertex]:
        # If the distance to the vertex using `u` is closer than it was
        # when `u` was not in the tree, update the tree
        distance[vertex] = distance[u] + weight
        parent  [vertex] = u

  return (parent, distance)

All-Pairs Shortest Path

Dijkstra gives the shortest path from a given source vertex to every vertex in graph.

All-pairs gives the shortest path between every pair of vertices in the graph. This can be done by running Dijkstra from each vertex: $O(n^3)$. However, Dijkstra is greedy, so this will not work on graphs with negative edges.

Floyd-Warshall Algorithm

A DP algorithm for the all-pairs problem. Solutions are limited to graphs without negative cycles: the sum of weights along any cycle must not be negative.

Intermediate vertices: vertices between the source and destination vertices. The path $p = (v_s, v_1, v_2, \dots, v_l, v_d)$ has $l$ intermediate vertices.

Properties of Shortest Paths

Simple: no vertex repeated along the path (assuming the condition that there are no negative cycles).

Any sub-path of a shortest path is also a shortest path. If this was false, that sub-path could be used to make a shorter shortest path.

The Algorithm

$n$ vertices, $0 \dots n-1$, with $d(i,j,k)$ being the shortest distance from $i$ to $j$ if the intermediate vertices are all in $0 \dots k$.

Use recursion on $d(i,j,k)$, using results from $d(i,j,k-1)$:

There are three base cases:

Top-Down

All three parameters can take on $n$ values, so the size of the cache is $O(n^3)$, and each individual call runs in constant time. Thus, the algorithm is $O(n^3)$.

Bottom-up

Uses a table of size $n \times n$, with the columns and rows being $i$ and $j$ respectively. Initially, the table contains the weight of the edge between $i$ and $j$, infinity if it doesn't exist, or $0$ if $i = j$.

Thus, the space complexity is $\Theta(n^2)$. This works because the value of $d(i,j,k)$ depends only on the values of $d(i,j,k-1)$, so only the $k-1$ values are needed, and the matrix can be updated incrementally.

Run the algorithm with $k = 0$, and once all cells are filled in, move on to $k = 1, 2, \dots$.
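A bottom-up sketch using an $n \times n$ distance matrix updated in place (dist initialised from the edge weights as described above; the function name is illustrative):

def floyd_warshall(dist):
    # dist: n x n matrix with dist[i][j] = weight of edge (i, j),
    # float('inf') if there is no edge, and 0 if i == j. Updated in place.
    n = len(dist)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Allow k as an intermediate vertex if it shortens the path
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist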

Backtracking

Technique to generate all or some solutions to a problem incrementally. Generate one possible solution, and if it isn’t correct, then use it to generate children, and repeat.

def dfs_backtrack(candidate, input, output):
  if is_solution(candidate, input):
    add_to_output(candidate, output)
  else:
    for child in children(candidate, input):
      dfs_backtrack(child, input, output)

In some cases such as search, only one solution may be required.

Pruning

In some cases, it may be known that it is impossible for any children (and their children etc.) of a node to produce a solution, meaning making further calls is unnecessary.

Check if pruning is required at the start of the DFS/BFS backtrack function.

7. Computational Geometry I

Introduction

Be aware of:

This course simplifies things by using integer Cartesian grids and avoiding division, square roots, trigonometry, etc.

Sidenote - basic matplotlib example:

import matplotlib.pyplot as plt
vertices = [(0,0), (1,2), (2,4)]
vx, vy = zip(*vertices) # Unzip the array into x and y components
axes = plt.axes()
axes.plot(vx, vy, color="blue", marker="o")
plt.show()

Points, Vectors & Lines

Definitions

Point:

Vector:

Basic Operations

Points:

Vectors:

Vector Operations

Dot Product:

Signed Area/Turn Direction

The area of parallelogram is:

$$p_x q_y - p_y q_x = |p|\,|q|\sin{\theta}$$

This is equivalent to the magnitude of the cross product, where $p$ and $q$ are the edge vectors to points $B$ and $C$ respectively, relative to the point $A$.

The area of triangle is half of this.

The area is positive if $b$ is 'counterclockwise' to $c$, centered around $a$, and zero when the points are collinear (parallel).
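A sketch of these operations on an integer grid (the helper names are illustrative; isCCW here is the left-turn predicate also used in the point-inclusion and Graham scan code later in these notes):

def signed_area_double(a, b, c):
    # Twice the signed area of triangle abc; points are (x, y) tuples
    px, py = b[0] - a[0], b[1] - a[1]
    qx, qy = c[0] - a[0], c[1] - a[1]
    return px * qy - py * qx

def isCCW(a, b, c):
    # True if travelling a -> b -> c makes a counterclockwise (left) turn
    return signed_area_double(a, b, c) > 0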

Checking if $P$ is on the line $(A, B)$:

Turn Direction:

Line Segment Intersection:

Degenerate cases:

Simple Polygons

By convention, vertices are in counter-clockwise order.

Convex Polygons:

Point Inclusion Test

Convex Polygon

For a convex polygon, a point $P$ is inside the polygon if it is on the left side of every edge for an anticlockwise traversal of the polygon:

all([isCCW(vertex[i], vertex[(i + 1) % len(vertex)], P) for i in range(len(vertex))])

Simple Polygon

Special cases:

The correct way to solve this is to:

The hacky solution: all vertices are on an integer grid, so simply move $P$ up by a very small amount (e.g. $10^{-8}$).

Improvements:

Convex Hull Problem

For a finite set of points, the convex hull is the smallest convex polygon enclosing all points. The convex hull:

Naïve solution:

Gift Wrap Algorithm

Graham-Scan Algorithm

An $O(n \log{n})$ algorithm. First construct a simple closed path that includes all points:

Sorting without calculating angles

We don't need to know the angles, only the ordering. This can be done using isCCW and Python's built-in sort function (which is $O(n \log{n})$), either with from functools import cmp_to_key or by defining a SortKey class with a custom __lt__ operator and converting each point to that class in the sort function's key lambda.
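A sketch of the cmp_to_key approach, assuming the isCCW predicate from the signed-area sketch earlier, points as (x, y) tuples, and p0 as the bottom-most (then left-most) point so all other points lie in a half-plane:

from functools import cmp_to_key

def angular_sort(points, p0):
    # Order points counterclockwise around p0 without computing any angles
    def compare(a, b):
        if isCCW(p0, a, b):
            return -1   # a comes before b
        if isCCW(p0, b, a):
            return 1    # b comes before a
        return 0        # Collinear with p0; a tie-break on distance could go here
    return sorted(points, key=cmp_to_key(compare))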

Consider edge cases:

Concavities:

Initialize a stack with $[P_0, P_1, P_2]$:

Complexity:

8. Computational Geometry II

Range Searching

Introduction

There are a large number of geometric objects. How do you find all objects intersecting/within a given area/volume?

We assume:

1D range searching

Determine which points lie in $[r_{min}, r_{max}]$:

matches = [x for x in nums if rmin <= x <= rmax]

To improve performance, sort the items: $O(n \log{n})$. Each query will use binary search to find the smallest point in the range, and then iterate through the list until an item is out of the range. Hence, the cost of each query is $O(\log{n} + m)$, where $m$ is the number of points in the range.
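A sketch of the sorted-list query using the bisect module (the function name is illustrative):

from bisect import bisect_left

def range_query_1d(sorted_nums, rmin, rmax):
    matches = []
    i = bisect_left(sorted_nums, rmin)  # First index with value >= rmin
    while i < len(sorted_nums) and sorted_nums[i] <= rmax:
        matches.append(sorted_nums[i])
        i += 1
    return matches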

However, this approach does not generalize to multiple dimensions.

1D Binary Search Tree (1D k-d tree)

To create the search tree, sort the elements; the left subtree contains the lower half of the list (up to and including the element at index len(list)//2 - 1, whose value is also stored in the node). Repeat the process with the subtrees until you reach a leaf node.

from collections import namedtuple
Node = namedtuple("Node", ["value", "left", "right"])  # Cheap way to make a class

def search_tree(nums, is_sorted=False):
  if not is_sorted:
    nums = sorted(nums)

  if len(nums) == 1:
    tree = Node(nums[0], None, None)  # Leaf node
  else:
    midpoint = len(nums) // 2  # Halfway, rounding down
    left = search_tree(nums[:midpoint], True)
    # Up to but not including the midpoint
    right = search_tree(nums[midpoint:], True)
    # Including the midpoint to the end
    tree = Node(nums[midpoint - 1], left, right)
    # The node stores the largest value of the left subtree, which meets the
    # less-than-or-equal-to criterion

  return tree

def range_query(tree, low, high):
  # Inclusive range
  matches = []
  if tree.left is None and tree.right is None:
    if low <= tree.value <= high:
      matches.append(tree.value)
  else:
    if low <= tree.value and tree.left is not None:
      matches += range_query(tree.left, low, high)
    if tree.value < high and tree.right is not None:
      # Range may include both children, so use if and not else
      matches += range_query(tree.right, low, high)
  return matches

The algorithm visits at least one leaf node, so the minimum cost is $O(\log{n})$.

If it visits $m$ leaves, then it also traverses $m/2$ parents, $m/4$ grandparents, $\dots$, and 1 root node.

$m + \frac{m}{2} + \frac{m}{4} + \frac{m}{8} + \dots \approx 2m$, so the order of complexity is $O(\log{n} + m)$.

k-d trees

Spatial Subdivision

Regular grid: regular divisions.

Quad tree: like a regular grid, but each cell contains a sufficiently small number of points; if it has more, recursively subdivide it into quarters.

k-d tree: divide the cell into halves along the median (so each half contains the same number of points) on one axis. For each half, divide it into equal halves again, but this time on the other axis. Rinse and repeat until each cell contains a sufficiently small number of points or a maximum depth is reached.

Differences:

def kd_tree(points, depth = 0):
  if len(points) <= threshold or depth >= max_depth:
    return Leaf(points)
  else:
    axis = depth % num_axes
    points.sort(key=lambda point: point[axis])
    mid_index = len(points) // 2
    midpoint = points[mid_index - 1][axis]
    # Subtract 1 since left <= midpoint, left does not include midpoint
    left =  kd_tree(points[:mid_index], depth + 1)
    right = kd_tree(points[mid_index:], depth + 1)
    return Node(axis, midpoint, left, right)

k-d Range Search:

def points_in_range(node, query_rect):
  # A sketch assuming the Leaf/Node structure built by kd_tree above, and an
  # axis-aligned query rectangle given as one (low, high) pair per axis
  if isinstance(node, Leaf):
    return [p for p in node.points
            if all(lo <= p[axis] <= hi for axis, (lo, hi) in enumerate(query_rect))]
  matches = []
  low, high = query_rect[node.axis]
  if low <= node.midpoint:    # Query overlaps the lower half-space
    matches += points_in_range(node.left, query_rect)
  if high >= node.midpoint:   # Query overlaps the upper half-space
    matches += points_in_range(node.right, query_rect)
  return matches

Unlike in the one-dimensional case, multiple points can have the same value in a given dimension; when the split occurs on another dimension, such points can end up on either side. Hence, both the lower and upper half-spaces can include points that lie on the splitting line.

Closest Neighbor

For a given point (that does not need to be in the tree), find the closest point within the tree.

Quadtrees

This only works in 2D. Performance:

Line Sweep

Data structures:

Closest Point-Pair

$n$ points on a plane: find the points $P$ and $Q$ that are the minimum distance from each other.

In the 1D case, simply sort and then search through the list sequentially: $O(n \log{n})$.

2D Case

Idea:

Data structure:

At each step:

The frontier set must be kept sorted by $y$ and support efficient addition and removal. Hence, a self-balancing binary tree or SortedContainer could be used.

Complexity: