1. Algorithms
A well-defined procedure:
- Input: instance of the problem
- Output: solution to the instance
The procedure must work for all instances of the problem.
Defining Algorithms
- English
- Pseudo-code
- Program
- Formal mathematical notation The further down you go, the more precise but less readable it gets.
Properties
- Correctness (for a defined set of inputs)
- Don’t write clever code! Write correct code
- Scalability/Efficiency
- Time, RAM, disk, CPU, power etc.
- Clarity
- Generality
- Is it sufficient for it to work on a subset of problems? Then redefine the problem space
- Is an optimal solution required, or does it just have to be good-enough?
Determining Correctness
Testing can not prove an algorithm is correct (unless you can test all cases): you can only prove it is incorrect. Only formal proofs can prove it. Not covered in the course.
Start with a reasonably precise statement of the problem and algorithm.
Then:
- Find failing cases
- Try proof by induction
- Can you test every single case when the size is small?
Example: Robot Arm (Skiena 1.1)
Robot needs to place solder into
Heuristics: greedy algorithm?
Example: Selecting Jobs (Skiena, 1.2)
Multiple movies that need to be filmed, need to maximize number of movies. Set
Heuristics: start with the shortest movie? Start with the first available movie?
Look for counter examples that show that the heuristic is incorrect:
- Think small
- Think exhaustively
- Find weaknesses
- Ties
- Extremes
Proof by induction
def sum(n):
s = 0
for i in range(1, n + 1):
s += i
return s
Assume that the loop works up to
Insertion sort
After
2. Algorithm Analysis
Once an algorithm is shown to be correct, consider the:
- Efficiency in time and space
- Big O
- Big Omega
- Big Theta
- Complexity in terms of the size of the input
- Worse, average, and best case
- Focus on the worst case
- Asymptotic analysis:
Complexity
The time taken will depend on the instance of the problem being solved.
RAM Model of Computation
Abstract random access machine
- Accessing any byte in memory takes constant time
- Basic operations (arithmetic, comparison, assignment) take one unit of time
- Time inside loops proportional to the number of iterations
- Subroutine time is the time to execute its body
Formal Definitions
Big O (upper bound)
That is,
Big Omega (lower bound)
That is,
Big Theta (tight bound)
That is,
The worst case and best case are functions: the algorithm is not, so we cannot define the algorithm by Big O/Omega/Theta notation.
The best case of an algorithm will be
Common functions, sorted by complexity descending:
| Functions |
|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
3. Recursion
The Process
- Divide the problem into sub-problems
- Usually
or (
- Usually
- Solve the sub-problems
- Merge the sub-problems
Example: Merge Sort
Merge sort from left to middle, merge sort from middle to right and then merge the results:
- Divide: find the middle element
- Solve: merge sort the two sub-arrays
- Merge: mere the two sub-arrays
The base case is when the length of the array is 0 or 1.
Analysis
For
- Subdivision:
to find the center - Solving sub-problems:
- Merge:
At each level, the cost will be:
There are
4. Greedy Algorithms
Greedy algorithms:
- Are simple and intuitive
- Choose the locally optimal choice (for some given objective function)
- Hope it will lead to a globally optional choice
- Do not use backtracking
Examples
Minimum Spanning Tree
- Prim’s Algorithm
- Start anywhere
- Connect to the cheapest node that doesn’t create a cycle
- Kruskal’s algorithm
- Forest of trees
- Look for the cheapest node, create a new tree or connect two trees together
These can be proved using proof by counter-example.
Coin Changing Problem
Coins of denominations
- You have an unlimited amount of each coin
-
-
and hence, there is a solution for every integer solution
Greedy strategy:
- Select the largest coin such that
- Add the coin to the list and decrement
:
Counter example:
What is a sufficient property such that the greedy algorithm will always be optimal? Every coin is a multiple of the one below it.
Example: Interval Scheduling
There is a set of jobs with a start (
You could:
- Sort by start times
- Sort by job length
- Sort by finish times
- This is correct, but it is hard to prove this is optimal
To prove a greedy algorithm right, assume its wrong and find contradiction. To prove it wrong, simply find a counter-example.
Fractional Knapsack Problem
There is a set
You can select fraction of item
To solve this with a greedy algorithm, sort it by the benefit/weight ratio (
0/1 Knapsack Problem
Can only take the whole item or not take it at all. Requires dynamic programming.
Huffman Coding
Variable vs Fixed length
Fixed length codes are convenient for indexing, counting (e.g. ASCII: 7-bit binary string per character, one character per byte).
Variable length codes may be more efficient (e.g. Morse code: frequent letters have short encodings; UTF-8: most common characters 1 byte long).
Variable-Length Coding
Let
Problem: there is no spacing, so it is ambiguous where a character starts and ends.
Solution: ensure no binary string in the code is a prefix of any other binary string.
For
This can be represented as a tree with two children per node and characters being found in leaf nodes:
.
/ \
0 / \ 1
. .
/\ /\
0 / \ 1 0 / \ 1
. d e f
/\
0 / \ 1
a .
/\
0 / \ 1
b c
Encoding
The most frequent characters should have the shortest encoding. However, this approach will fail.
Instead, the key is that the two least frequent characters should be leaves at the bottom of the tree. Hence, to create a Huffman tree:
-
Create a min-heap where the value is:
- The frequency of the character for a leaf node or
- The sum of frequencies for all its children
-
While there are nodes:
- Join the two smallest nodes together
- Add the combined node back into the min-heap
Popping and insertion into a min-heap are
Disadvantages
The encoding table must be sent along with the encoded document. Hence, the overhead may be large for small documents.
Huffman encoding is optimal for global character frequencies. However:
- The whole document must be processed
- Local optimization not possible (e.g. character frequencies change half way through)
- Can’t recognize patterns; it only recognizes letters, not words
Lempel-Ziv usually gives better compression with substring recognition. The deflate algorithm combines the LZ and Huffman algorithms.
5. Dynamic Programming
Memoisation
Fibonacci can be defined as a recurrence relation:
One solution to this is memoisation: the caching of results.
General Purpose Memoiser
def memoise(f):
known = {}
def new_func(*params):
if param not in known:
known[param] = f(*params)
return known[params]
return new_func
@memoise
def function_definition():
# Or use the built in tools:
from func_tools import lru_cache
# LRU: least recently used; gets rid of oldest result once number of results stored reaches `maxsize`
@lru_cache(maxsize = None)
def function_definition():
// ...
Notes:
- A nested function has access to variables within the outer function’s scope
*paramin the function definition returns a tuple,*parampasses the elements of the tuple individually as arguments@funcacts as a decorator that takes the function as an argument and modifies it
Fibonacci, Iteratively
Bottom-up iteration:
def fibonacci_i(n):
a, b = 0, 1
for i in range(n):
a, b = b, a + b
return a
This approach is
Top-down Dynamic Programming
Recursion and memoisation.
Bottom-up Dynamic Programming
- Store a table of solutions, solving the simplest sub-problems first
- Fill out the table by combining sub-problems until target problem is reached
Example: Minimum Cost Path
Rectangular grid of cells, each with a weight. You can move to any adjacent cell (including diagonals) from each cell. Find the minimum cost needed to get from top to bottom.
Recurrence formula
Hence in the normal case, the cost of the cell is the cell weight plus minimum cost to get to cell above it (left, directly above, or right).
Using a top-down approach:
- Infinite cost if outside grid
- Top row: just get the weight of the cell
- Otherwise
- Check cache, return if found
- Find the minimum cost of the cells above/neighboring it
- Add the weight of the cell
- Save it to cache and return
- Find the cost for each cell in the bottom row, and find the minimum from there
Using a bottom-up approach::
- Create a table of all cells
- Get weights for top row
- For all cells in second row, find the minimum cost to get to that cell and add the cost of that cell
- Repeat for all rows
- Find the minimum-cost path by backtracking
- This is easier if you store the column of the cell you came from while generating the table
If the grid is square, it is
This problem can also be found using Dijkstra’s shortest path by converting the problem to a multi-stage graph (with weights moved to edges). However, this is harder to implement and less efficient.
Suitably of DP
DP approaches are likely to work when the problem:
- Is combinatorial: brute force solution of
either or - Involves only integers or slices of fixed strings with well-define bounds: a finite search domain
- Is looking for the optimal solution
Then build the solution from solutions to sub-problems:
-
Requires optimal substructure
-
Expect overlapping subproblems
- Otherwise, the cache is useless
- NB: cache will often require the path to the result
Optimal Substructure
When solutions to subproblems can be reused (multiple times!) to find the optimal solution to the problem.
For example, in the shortest path problem, the shortest path for a sub-problem will be inside the shortest path for the larger problem.
The longest path problem does not have an optimal substructure.
Properties of Recursive Solutions
- Want to make a cache of all the calls you make
- How many unique function calls is it possible to make? How big could the cache be?
- For
, the cache would be - Hence, this can become very large
- For
Coin Changing Problem
A greedy approach does not produce optimal results. Hence, DP must be used.
Top-Down
Given coinage = [1, 10, 25] and amount = 80, the solution will always be minimum of:
1 + change(80 - 25, coinage, 3): the solution given all three coins are available- OR
change(80, coinage, 2): the solution given only the first two coins are available
Either amount or coinage decreases to decrease the size of the problem.
An alternative approach would be finding the minimum of:
1 + change(80 - 1, coinage)1 + change(80 - 10, coinage)1 + change(80 - 25, coinage)
That is 1 + min(coinage_calculate(amount - coin, coinage) for coin in coinage)
Without memoisation, there are a maximum of len(coinage) sub-calls per call. With memoisation, the graph collapses to amount + 1 calls:
- The number of coins is the same for every call (given the second approach). Hence, the only varying argument is the amount.
- This would be inefficient using LRU cache as it would save both arguments for each call. Hence:
- Pass home-grown cache as an argument or
- Define an inner function and use LRU cache on that instead
0/1 Knapsack
No fractional amounts. For each item you have two options:
- Pick up the item: get a smaller knapsack that can carry
and have a current worth of - Don’t pick up the item: get a knapsack that can carry
and have a current worth of
Repeat for each item, picking the option that gives the greater worth.
Edge cases:
- No more items
- Item can’t fit
Cache using index of item (
Top-Down
The number of possible calls is
If using LRU cache, run the computation in in a nested function using LRU cache, and store the array of items outside the scope of the function.
Bottom-Up
Generate table of knapsack weight (column) and index (row), filling table row-wise. Go backwards to reconstruct solution.
1 indexing is also useful in this case, in which an extra row and column (0) will be needed.
For each cell, you have two options:
- Don’t pick up: go to the value of cell directly above
- Pick up: value of item plus value of the cell one row above and
to the left
Take the maximum of these two options.
This has a time and space complexity of
NB: this is pseudo-polynomial time - it is polynomial with the size of the knapsack, not the number of items.
Assuming the path is required, use backtracking to reconstruct the solution: start at the end, and If the value of the item directly above is the same, it means the optimal solution came from not picking up the item. Else, go up one and left by the weight of the item. This has a complexity of
Subset Sum Problem
Given a set of integers, find a subset that sums to zero.
Generalize this to finding a subset that sums to some value
def subset_sum_exists(numbers, n, required_sum)
# returns true iff subset within first n elements that sum to required_sum
Longest Common Subsequence
Given two strings
Brute forcing this is not possible - there are
Create a function,
Bottom-Up
Create a table of 1-origin subscripts for all possible values of
- If
, add 1 to the value of the cell one row up and one column to the left - Else, get the max of either the cell to the left or the cell above
Edit Distance
The diff algorithm. Transform source string
When a transformation is not needed, it is called the alignment operation, and has zero cost.
Work from end of string to start.
- Copy/Alignment: If
, - Otherwise, use the minimum of:
- Deletion:
- Insertion:
- Substitution:
- Prefer substitution over deletion/insertion $$ D_{i, j} = \begin{cases} 0 & i = j = 0 \ i & j = 0 \ j & i = 0 \ D_{i-1, j-1} & a_i = b_j \ 1 + \min(D_{i-1, j}, D_{i, j-1}, D_{i-1, j-1}) & \text{otherwise} \end{cases} $$
- Deletion:
Longest Increasing Subsequence
One solution is to sort, remove duplicates, and then run LCS on the two sequences. However, a more efficient solution is possible.
Let
Recurrence relation: find the maximum of LIS for all solutions before the current index, provided the last (and so biggest) element is smaller than the current element.
This can be further optimized into an
6. Graph Theory
Definitions
Graph:
-
Sparse graph:
-
Dense graph:
-
Directed graph: edges are ordered pairs
, so one way. -
Undirected graph: edges are a set of vertices,
Weighted graphs: edges have a weight
Path: a sequence of vertices where:
- Each vertex is connected to each other by a edge
- Each edge is not used more than once
Length of path: number of edges.
Cycle: path where the source and destination vertex is the same:
- Cyclic if graph contains at least one cycle
- Acyclic if not
- If it is also directed, it is called a directed acyclic graph
- Acyclic if not
- Loop: cycle of length one
Empty graph:
Subgraphs
Component: a non-empty subgraph where every pair of vertices connected by a path. If there is a path between two vertices, they are in the same component.
Tree: graph with a root vertex such that there is only one path between the vertex and root.
Directed tree: tree where all edges point either towards or away from the root.
Forest: graph where every component is a tree
Textual Representation of Graphs
Header:
DorUfor directed/undirectedn: number of vertices (which are numbered from 0)W: appears if it is a weighted graph
All other lines after the header define a edge: V1 V2 or V1 V2 Weight
Data Structures
Adjacency Matrix:
-
binary matrix , where if and are connected to each other via a edge and 0 otherwise - If undirected, the matrix is symmetric along the main diagonal
- If the graph is weighted, the value is the weight, or null if no edge
Adjacency List:
- A list of length
, each element being a list for every vertex - Each sub-list contains the vertices it is connected to
○ If directed,
is in the list for if there is a edge ○ If the graph is weighted, each item in the list is of the form
Adjacency matrices are faster (
Forests:
- Each vertex has zero or one parents
- Store as an array: if vertex
has parent ,
Graph Traversal
Each vertex can be in one of three states:
- Undiscovered
- Discovered
- Processed
Predecessor Tree:
- Root: starting vertex
- For
in the list of vertices: - If
or , where is undiscovered: - Add
to the tree: is the child of (even if the graph is undirected) - Set
. Note that this is different from the forest map as a child can have multiple parents, but will only ever specify one parent: whatever was first
- Set
- Add
- If
BFS
Two procedures: BFS-Tree initializes the data structure and calls BFS-Loop:
def BFSTree(adj, index):
# Where:
# - adj is the adjacency matrix
# - index is the index of the root
n = len(vertices)
state = [UNDISCOVERED * n]
state[index] = DISCOVERED
parent = [NULL * n]
Q = Queue()
Q.add(index)
return BFSLoop(adj, Q, state, parent)
def BFSLoop(adj, Q, state, parent):
while Q is not empty:
u = Q.remove()
for v in adj[u]:
# For each vertex u is connected to
if state[v] == UNDISCOVERED:
state[v] = DISCOVERED
parent[v] = u
Q.add(v)
state[u] = PROCESSED
return parent
Analysis
Size of
- Time complexity of
BFSTree: - Time complexity of
BFSLoop: proportional to number of times the loop is run, so maximum of number of edges in the graph:
So complexity is
DFS
def DFSLoop(adj, index, state, parent):
for v in adj[index]:
# For each vertex the given index is connected to
if state[v] == UNDISCOVERED:
state[v] = DISCOVERED
parent[v] = index
DFSLoop(adj, v, state, parent)
state[index] = PROCESSED
return parent
Properties of DFS and BFS
- Works with directed and undirected, although weights are ignored
- Only creates a single tree
- DFS can be done by replacing BFS queue with stack, and replacing enqueue and dequeue with push and pop
- BFS: vertices discovered in order of increasing distance from start
- DFS: vertex only marked as processed once all vertices reachable from it are discovered
Shortest Path
Use BFS to produce parent/predecessor tree, working backwards from the end until you get to the start.
def treePath(parentTree, start, end):
if start == end:
return [start]
path = treePath(parentTree, start, parentTree[end])
path.append(end)
return path
Connected Components
When BFS or DFS is run, all vertices reachable from the start node are processed: that is, it finds a subgraph.
Thus, to find all components, run BFS or DFS on each vertex that has not yet been discovered. To make sorting the vertices into components easier, there can be an array that stores the component number; on what run of the DFS/BFS the vertex was first discovered.
Strong Connectivity
A directed graph is strongly connected If there is a path between every ordered pair of vertices.
This can be determined inefficiently by running BFS and check if all vertices are processed for all possible starting vertices.
Transpose operation,
- Adjacency matrix: mirror along the main diagonal.
- Adjacency list:
Hub: vertex
If there is such a vertex, then the graph must be strongly connected as any vertex can go through
- Run a traversal starting with
- Return false if any vertex is undiscovered
- Construct the transpose of the graph
- Run a traversal on the transpose starting at
- Return false if any vertex is undiscovered
Thus, time complexity is the same as graph traversal:
Topological Ordering (DAGs only)
Ordering of vertices such that if
All DAGs have at least one such ordering.
Programs such as make will use it to find a valid order to compile rules; package installers will use it to find the correct installer to install dependencies.
- Initialize empty stack
- For each vertex:
- If undiscovered, start modified DFS: when vertex is processed, push it to the stack
- The vertex at the end will be pushed to the stack first
- This ordering will continue until it gets to the start vertex
- When the DFS is run again with another start vertex, if it is ‘before’ the previous start vertex, vertices between the two will get added to the stack
- If undiscovered, start modified DFS: when vertex is processed, push it to the stack
- Return the stack from top to bottom
Cycle Detection
Directed graphs: run DFS: if a vertex that has already been discovered is reached, there is a cycle. Run this with each possible starting vertex, or until a loop has been discovered.
Run DFS: if, while on vertex
If the graph has multiple components, the DFS must be run multiple times, using a similar algorithm to when finding connected components.
Weighted Algorithms
Minimum Spanning Tree
Tree that includes all vertices of the graph and has the lowest total weight among all spanning trees. The graph must be undirected.
Prim’s Algorithm
- Create a list: the parent of each vertex
- Create a list: if a vertex is in the tree
- Create an list containing the closest distance of a vertex to the tree, each with a starting value of infinity
- While there are vertices remaining:
- Find the vertex v that is closest to the tree
- For the first loop, pick vertex 0 (or anything, really)
- Add v to the list of vertices in the tree
- For all vertices still not in the tree:
- If the distance between the vertex and v is smaller than the distance stored in the list
- Update the distance, and set the parent of the vertex to v
- If the distance between the vertex and v is smaller than the distance stored in the list
- Find the vertex v that is closest to the tree
- Return the parent and distance
The algorithm is
def prim(adj):
# Prim's algorithm: minimum spanning tree
in_tree = [False] * len(adj)
distance = [float('inf')] * len(adj) # Stores distance from tree
parent = [None] * len(adj)
distance[0] = 0
while False in in_tree:
u = next_vertex(in_tree, distance)
# Pick closest vertex not in tree. If distance is min heap with edit
# support, then function is log(n) and algorithm is nlog(n)
in_tree[u] = True
for (vertex, weight) in adj[u]:
if not in_tree[vertex] and weight < distance[vertex]:
distance[vertex] = weight
parent [vertex] = u
return (parent, distance)
Shortest Path Trees
The shortest path tree rooted at vertex
Dijkstra’s Algorithm
NB: this will not always produce correct results on graphs with negative weights.
def dijkstra(adj, s):
# adj = adjacency list
# s = index for start vertex
in_tree = [False] * len(adj)
distance = [float('inf')] * len(adj)
parent = [False] * len(adj)
distance[s] = 0
while False in in_tree:
u = next_vertex(in_tree, distance)
# Pick the closest vertex not yet in the tree. Efficiency can improve
# if priority tree (min heap) is used
in_tree[u] = True
for (vertex, weight) in adj[u]:
if not in_tree[vertex] and distance[u] + weight < distance[vertex]:
# If the distance to the vertex using `u` is closer than it was
# when `u` was not in the tree, update the tree
distance[vertex] = distance[u] + weight
parent [vertex] = u
return (parent, distance)
All-Pairs Shortest Path
Dijkstra gives the shortest path from a given source vertex to every vertex in graph.
All-pairs gives the shortest path between every pair of vertices in graph. This can be done running Dijkstra for each vertex:
Floyd-Warshall Algorithm
DP algorithm to all-pairs. Solutions are limited to graphs without negative cycles: the sum of weights along a cycle is not negative.
Intermediate vertices: vertices between source and destination vertices.
Properties of Shortest Paths
Simple: no vertex repeated along the path (assuming the condition that there are no negative cycles).
Any sub-path of a shortest path is also a shortest path. If this was false, that sub-path could be used to make a shorter shortest path.
The Algorithm
Use recursion on
- If
is an intermediate vertex on the shortest path: - It is equivalent to joining the shortest paths from
to , and from to
- It is equivalent to joining the shortest paths from
- That is, adding the lengths of
and -
is not an intermediate vertex: the same as
There are three base cases:
-
- When
and
- When
-
- When
and - No intermediate edges, and no edges connecting
and
- When
-
- No intermediate edges, and a edge connecting
and : set the distance to the weight of that edge
- No intermediate edges, and a edge connecting
Top-Down
All three parameters can take on
Bottom-up
Uses table of size
Thus, the space complexity is
Run the algorithm with
Backtracking
Technique to generate all or some solutions to a problem incrementally. Generate one possible solution, and if it isn’t correct, then use it to generate children, and repeat.
def dfs_backtrack(candidate, input, output):
if is_solution(candidate, input):
add_to_output(candidate, output)
else:
for child in children(candidate, input):
dfs_backtrack(child, input, output)
candidate: vertex of the treeinput: information about instance of the problem i.e. desired output (e.g. size of the solution)output: list of valid solutionsis_solution: returns true if it is a leaf node: valid solutions won’t have valid children, so there is no need to go further down the treechildren: returns list of possible candidates
In some cases such as search, only one solution may be required.
Pruning
In some cases, it may be known that it is impossible for any children (and their children etc.) of a node to produce a solution, meaning making further calls is unnecessary.
Check if pruning is required at the start of the DFS/BFS backtrack function.
7. Computational Geometry I
Introduction
Be aware of:
- Floating point errors
- Nearly-parallel lines
- Very thin/long rectangles
- Points very close together
- Special cases
- Coincident points
- Intersecting lines
- Degenerate cases
- Zero area triangles
This course simplifies this by using integer cartesian grids and avoiding division, square roots, trig etc.
Sidenote - basic matplotlib example:
import matplotlib.pyplot as plt
vertices = [(0,0), (1,2), (2,4)]
vx, vy = zip(*vertices) # Unzip the array into x and y components
axes = plt.axes()
axes.plot(vx, vy, color="blue", marker="o")
plt.show()
Points, Vectors & Lines
Definitions
Point:
- Position in space
-
tuple - Uppercase, non-bold, non-italic letters
Vector:
- Difference between two points
-
tuple - The vector from
to is - Lowercase, bold, italic letters
- Position vector
- Vector from origin to some given point
Basic Operations
Points:
- Subtract to get displacement vector
- Linear/affine sum:
-
- Indivisible, two operation operand
- Points on the line between A and B
-
-
-
- The linear sum is the equation of the line: use this instead of
as it does not work with vertical lines
Vectors:
- Addition, subtractions
- Multiplication by scalar
- Addition to a point to get another point
- Magnitude:
- Dot product
- Signed triangle area
Vector Operations
Dot Product:
-
-
when the angle is acute: angle in or -
when they are perpendicular
Signed Area/Turn Direction
The area of parallelogram is:
This is equivalent to the magnitude of the cross product where
The area of triangle is half of this.
The area is positive if
Checking if
- Check if signed area 0
- To check if it is on the line segment (i.e. Between
and ): -
and - Where
is the distance between the points and
-
Turn Direction:
- Just check the sign of the signed area (unless the area is zero)
Line Segment Intersection:
- Two line segments
and intersect iff and are on opposite sides of the line and vice-versa - True if
isCCW(A,B,D) != isCCW(A,B,C) && isCCW(C,D,A) != isCCW(C,D,B)(where the first argument is the origin)
Degenerate cases:
- One point on the line
- Sharing a point
- All four points collinear
Simple Polygons
By convention, vertices are in counter-clockwise order.
Convex Polygons:
- Every interior angle less than 180 degrees
- Strictly less than for strictly convex polygon
- Every counter-clockwise traversal turns left at every vertex (or goes straight)
Point Inclusion Test
Convex Polygon
For a convex polygon, a point is in a polygon (PiP) if it is on the left side of every edge for an anticlockwise traversal of the polygon:
all([isCCW(vertex[i], vertex[(i + 1) % len(vertex)], P) for i in range(len(vertex))])
Simple Polygon
- Create a semi-infinite horizontal line starting from
going either left or right - The point is inside the polygon iff it intersects the polygon an odd number of times
- The line only needs to extend to past the maximum
value for the polygon’s edges
Special cases:
- The line goes through a vertex
- On the ‘inside’: count 0 or 2 times
- Otherwise: count 1 time
- Goes along an edge
The correct way to solve this is to:
- Ignore horizontal edges
- If edge directed upwards, count the start vertex as an intersection
- If edge directed down, count the end vertex as an intersection
The hacky solution: all vertices are on an integer grid, so simply move
Improvements:
- Find the rectangle that encompasses the polygon
- If its not in the rectangle, then it can’t be inside
Convex Hull Problem
For a finite set of points, the convex hull is the smallest convex polygon enclosing all points. The convex hull:
- Is unique
- Contains every point in the set
- has
, , and as vertices of the convex hull
Naïve solution:
- Construct all possible edges using pairs of points
- For every other point, check if it lies on the left side of the edge. If so, it belongs to the convex hull
-
Gift Wrap Algorithm
- Start from the point with the minimum
coordinate (or max , or min/max ) - Choose the left-most one if multiple points have the same minimum
- Select the ‘rightmost’ point from the perspective of the current point
- Pick any point
- For all points
- Check if
is to the right of - If so,
- Check if
- The current
is part of the convex hull. Repeat until a closed loop is formed
- Pick any point
- Worst case
- Normal case
, where is the number of vertices in the convex hull
Graham-Scan Algorithm
An
- Point
will have the minimum value, and so is guaranteed to be on the convex hull - Create a line centered around point
and pointing right - Rotate the line anticlockwise. When a vertex is found, add them to the list - we are essentially sorting the vertices by angle
- Connect the vertices in the sorted list to create the simple closed path
- Not all vertices will be on the convex hull, so the concavities need to be filled in
Sorting without calculating angles
We don’t need to know the angle: only the ordering. This can be done using isCCW, Python’s built in sort function (which will be from functools import cmp\_to\_key or by defining a SortKey class with a custom __lt__ operator and converting the point to that class in the sort function’s key lambda.
Consider edge cases:
-
or are the anchor - Several points colinear with
(i.e. on a straight line) - Don’t worry about them
Concavities:
-
and always on the convex hull -
may not be
Initialize a stack with
- Check if the next point,
, is on the left edge of the line connecting and - If so, add the point to the stack
- Otherwise, pop from the stack until this is true
Complexity:
- Step 1: find the point with the minimum
Y value: - Step 2: sort operation:
- Step 3: initialize hull
- Step 4: generate the convex hull
- Not
as although the nested while loop may run several times, it can only remove previously added elements, so you can only ever pop elements - Hence, it is
- Not
8. Computational Geometry II
Range Searching
Introduction
There are a large number of geometric objects. How do you find all objects intersecting/within a given area/volume?
We assume:
- Objects are points
- Range is rectangular
- There are many queries on the same point set
- Hence, we can pre-process the data
1D range searching
Determine which points lie in
matches = [x for x in nums if rmin <= x <= rmax]
To improve performance, sort the items:
However, this approach does not generalize to multiple dimensions.
1D Binary Search Tree (1D k-d tree)
- A balanced binary search tree with numbers at the leaves
- Internal nodes are medians - (almost)
len(data)//2 - 1 - Left subtree contains
- Right subtree contains
- Assume no duplicates
To create the search tree, sort the elements; the left subtree should contain the lower half of the list, plus element with index len(list)//2 - 1. Repeat the process with the subtrees until you reach a leaf node.
from collections import namedtuple
Node = namedtuple("Node", ["value", "left", "right" # Cheap way to make a class
def search_tree(nums, sorted=False):
if not sorted:
nums = sorted(nums)
if len(nums) == 1:
tree = Node(nums[0], None, None) # Leaf node
else:
midpoint = len(nums) // 2 # Halfway, rounding down
left = search_tree(nums[:midpoint], True)
# Up to but not including midpoint
right = search_tree(nums[midpoint:], True)
# Including the midpoint and the end
tree = Node(nums[midpoint - 1], left, right)
# Left child will be equal to parent, which meets the
# less than or equal to criteria
return tree
Search
def range_query(tree, low, high):
# Inclusive range
matches = []
if tree.left is None and tree.right is None:
if low <= tree.value <= high:
matches.append(tree.value)
else:
if low <= node.value and tree.left is not None:
matches += query(tree.left, low, high)
if node.value < high and tree.right is not None:
# Range may include both children, so use if and not else
matches += range_query(tree.right, low, high)
return matches
The algorithm visits at least one leaf node, so the minimum cost is
If it visits
k-d trees
Spatial Subdivision
Regular grid: regular divisions.
Quad tree: regular grid, but each cell contains a sufficiently small number of cells; if it has more, recursively subdivide it into quarters.
k-d tree: divide cell into halves along the median (each half contains the same number of cells) on one axis. For each half, divide it again into equal halves again, but this time on the other axis. Rinse and repeat until there a sufficiently small number of cells in each cell or a maximum depth is reached.
Differences:
- The leaf nodes can contain multiple points
- Nodes labelled with the axis it is subdividing on
- Subscript is a list showing path from root to node (0 for left, 1 for right)
- e.g.
01meansroot/left/right
- e.g.
- Subscript is a list showing path from root to node (0 for left, 1 for right)
- Only worth rebuilding the entire tree if there are significant additions/removals
def kd_tree(points, depth = 0):
if len(points) <= threshold or depth >= max_depth:
return Leaf(points)
else:
axis = depth % num_axes
points.sort(key=lambda point: point[axis])
mid_index = len(points) // 2
midpoint = points[mid_index - 1][axis]
# Subtract 1 since left <= midpoint, left does not include midpoint
left = kd_tree(points[:mid_index], depth + 1)
right = kd_tree(points[mid_index:], depth + 1)
return Node(axis, midpoint, left, right)
k-d Range Search:
def points_in_range(node, query_shape):
if node is leaf:
matches = points in node.points within shape
else:
matches = []
if intersection of shape and lower half space not empty:
matches += points_in_range(node.left, query_shape)
if intersection of space and upper half space not empty:
matches += points_in_range(node.right, query_shape)
return matches
Unlike in the one dimensional case, multiple points can have the same value for a given dimension as when the split occurs on another dimension, it could end up on either side. Hence, both the lower and upper half of the shape can include points on the line.
Closest Neighbor
For a given point (that does not need to be in the tree), find the closest point within the tree.
- Use the k-d tree to find the cell the point would be in - it is guaranteed to contain at least one point
- Find the closest point in the cell, and take that distance as the upper bound on the minimum distance
- Search for cells that intersect with the square (notes)/circle (optimal) centred around the original point
- Check if any of the points inside the cells are closer
Quadtrees
- Split each region into 4 sub-regions, partitioned evenly on both axes
- Repeat until the maximum depth is reached, or the region contains fewer than some specified number off points
- Node contains coordinates of centre, width/height of the square, and their children
- Node labels: path from root e.g.
0/1/1would beroot/quadrant 1/quadrant 1
This only works in 2D. Performance:
- Best case: equally distributed,
- Worst case: all points in the same cell at depth
:
Line Sweep
- Simulates a vertical line sweeping across objects on
plane on one direction - Sort points by the
coordinate - At any point, solution to problem for all points to the left is known
- When point added, check if solution changes
Data structures:
- Event queue
- Positions of event points
- Visited points deleted from queue
- Status structure
- State of algorithm at each position
- Solution set
- Solution at current position
Closest Point-Pair
In the 1D case, simply sort then search through the list sequentially:
2D Case
Idea:
- Store current closest distance
, current point - Frontier set: set of points where the
coordinate is in the range - Kind-of-but-not-really points where the
coordinate is also in the range
- Kind-of-but-not-really points where the
- Only need to the check distance
is to for points in the frontier set - any point further away cannot be the closest point
Data structure:
- Point list
- Points sorted by x coordinate
- State
- Current sweep position - index in point list
- Initially
- Initially
- Current best point pair and distance (NOT
distance) - Initially
- Initially
- Frontier set,
- Includes current point
- Initially
- Current sweep position - index in point list
At each step:
- Get the current point, add it to the frontier set
- Remove elements from
where the horizontal distance from the current point is greater than - DO NOT remove elements further away than
vertically
- DO NOT remove elements further away than
- If they are closer than
vertically, calculate the distance and update if necessary
The frontier set must be sorted by SortedContainer could be used.
Complexity:
- Sorting by
: - Each point added and removed from frontier set:
total - Search for closest pair takes total
time ○ to select the range of points from the point in the axis - Hence, the algorithm is
in terms of time