All Files in ‘COSC367 (2020-S2)’ Merged

Exam Notes

Prune on insert/remove if the end node has been expanded.

A*: $f(path) = cost(path) + h(path[-1])$.

Bidirectional search (BFS, LCFS, A*): $2 \cdot b^{d/2} \ll b^d$; saves time and space.

Iterative deepening: max depth incremented until solution found. DFS is $\geq b^k$; iterative deepening is $\leq b^k \cdot (\frac{b}{b-1})^2$.

Propositions and Inference

Soundness and Completeness

Bottom-up/forward-chaining

Fixed point: set of consequences generated.

If $I$ is the interpretation where every element of the fixed point is true and every other one is false, $I$ is the minimal model of the KB.

Top-down procedure

Answer clause: $yes \leftarrow a_1 \land \dots \land a_m$.

Until the answer clause is an answer ($yes \leftarrow$), repeatedly run SLD resolution.

Prolog

Constraint Satisfaction Problems

A set of variables, each with a domain, and a set of constraints; a solution is an assignment that satisfies the constraints.

Constraint Network:

Arc Consistency

For a given constraint and variable $X$ in the scope of the constraint, the arc is arc consistent if, for each element in the domain of $X$, there is a valid assignment for all other variables in the constraint's scope.

Elements may need to be removed from the domain of $X$ to make it arc consistent.

Algorithm:

Empty domains: no solution; multiple values in domains: may or may not have a solution.

Domain Splitting

Halve the domain for some variable, create two instances of the problem with the new domain, and solve both, revisiting only constraints involving the split variable.

Variable Elimination

Eliminate variables by passing constraints on to their neighbors.

Roulette wheel: return the first individual at which the running total of fitnesses exceeds a random number between 0 and the sum of the fitnesses.

Probabilistic Inference and Belief Networks

$$ P(x|y,z) = \frac{P(x,y|z)}{P(y|z)} $$

For a full assignment:

$$ P(x_1, \dots, x_n) = \prod_{i=1}^{n}{P(x_i | parents(X_i))} $$

If it is the probability over all variables, it is called the joint probability distribution.

Basic Machine Learning

Artificial Neural Networks

Prediction: $a = \sum_{i = 0}^{n}{w_i x_i}$, where $x_0 = 1$.

Activation function: $g(a) = 1$ if $a \geq 0$, else $0$.

Learning: $weight \leftarrow weight + \eta \cdot x \cdot (actual - prediction)$, where $\eta$ is the learning rate. Repeat for each training example and loop until no mis-classifications or limit reached.

Weeks 01-02: Searching the State Space

State Space

State: object representing a possible configuration of the world (agent and environment).

State space: set of all possible states (cross product of all elements of the state).

State Space Graphs

Directed Graph

Many problems can be abstracted into the problem of finding a path in a directed graph.

Explicit Graphs

Implicit Graphs

Searching Graphs

Generic algorithm

def search(graph, start_nodes, is_goal_node):
  frontier = [[s] for s in start_nodes]
  while len(frontier) != 0:
    path = frontier.pop()
    if is_goal_node(path[-1]):
      return path

    for n in path[-1].neighbors():
      frontier.append(path + [n])
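
The order in which paths are popped from the frontier determines the strategy: treating the frontier as a stack (pop the newest path) gives DFS, treating it as a queue (pop the oldest) gives BFS, and treating it as a priority queue ordered by path cost gives LCFS.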

DFS

Explicit graphs

Tracing the frontier:

BFS

Lowest-cost-first search (LCFS)

Priority Queue

Pruning

Two problems: cycles (an infinite search tree), and multiple paths leading to the same node.

We need memory - the frontier should keep track of which nodes have been expanded/closed.

Prune when:

LCFS finds an optimal solution, but it explores options in every direction, and knows nothing about the goal location.

Heuristics

Extra knowledge that can be used to guide a search:

A* Search Strategy

$f(p) = cost(p) + h(n)$ where:

Monotonicity

A requirement that is stronger than admissibility. A function is monotone/consistent if, for any two nodes $n$ and $n'$ (where $n'$ is reachable from $n$):

$$ h(n) \leq cost(n, n') + h(n') $$

That is, the estimated cost from $n$ to the goal must be no more than the cost of reaching $n'$ plus the estimated cost from $n'$ to the goal. Where $s$ is the start node:

$$ \begin{aligned} f(n') &= cost(s, n') + h(n') \\ &= cost(s, n) + cost(n, n') + h(n') \\ &\geq cost(s, n) + h(n) \\ \therefore f(n') &\geq f(n) \end{aligned} $$

Thus, $f(n)$ is non-decreasing along any path.
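
A minimal A* sketch built on the generic algorithm above, using Python's heapq as the priority queue ordered by $f$; the neighbours, cost, h and is_goal callables are assumed problem-specific (hypothetical names):

import heapq
from itertools import count

def a_star(start, neighbours, cost, h, is_goal):
  tie = count()  # tie-breaker so the heap never has to compare paths
  frontier = [(h(start), next(tie), 0, [start])]  # (f, tie, cost so far, path)
  expanded = set()  # prune a path when its end node has already been expanded
  while frontier:
    f, _, g, path = heapq.heappop(frontier)
    node = path[-1]
    if node in expanded:
      continue
    expanded.add(node)
    if is_goal(node):
      return path
    for n in neighbours(node):
      g2 = g + cost(node, n)  # cost of the extended path
      heapq.heappush(frontier, (g2 + h(n), next(tie), g2, path + [n]))
  return None

With a monotone $h$, the first expansion of a node uses a cheapest path to it, which is what makes this pruning safe.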

Finding good heuristics

Solve a simpler version of the problem:

Sliding puzzle example:

Dominance Relation

Dominance: for two heuristics, $h_a$ dominates $h_c$ if $\forall{n}: h_a(n) \geq h_c(n)$.

Heuristics form a semi-lattice: if neither is dominating, could use $max(h_a(n), h_c(n))$.

The bottom of the lattice is the zero heuristic - makes A* search just an LCFS.

Iterative Deepening

Takes a bound (cost or depth) and does not expand paths that exceed the bound:

Uses less memory but more CPU when compared to BFS:

Complexity with solution at depth $k$ and branching factor $b$:

| Level | BFS | Iterative Deepening | Num. Nodes |
| --- | --- | --- | --- |
| $1$ | $1$ (only looks at nodes once) | $k$ (repeat search $k$ times) | $b$ |
| $2$ | $1$ | $k - 1$ | $b^2$ |
| $\vdots$ | | | |
| $k$ | $1$ | $1$ | $b^k$ |
| Total | $\geq b^k$ | $\leq b^k(\frac{b}{b-1})^2$ | |

As branching factor increases, complexity gets closer and closer to BFS - thus, it is not very wasteful.
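
A sketch of depth-bounded DFS with an incrementing bound, under the same assumed neighbours/is_goal interface as the earlier sketches:

def depth_limited(path, neighbours, is_goal, bound):
  # DFS that does not expand paths exceeding the depth bound
  node = path[-1]
  if is_goal(node):
    return path
  if bound == 0:
    return None
  for n in neighbours(node):
    result = depth_limited(path + [n], neighbours, is_goal, bound - 1)
    if result is not None:
      return result
  return None

def iterative_deepening(start, neighbours, is_goal):
  # Increment the max depth until a solution is found
  # (loops forever if no solution exists)
  bound = 0
  while True:
    result = depth_limited([start], neighbours, is_goal, bound)
    if result is not None:
      return result
    bound += 1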

Week 03: Propositions and Inference

Simple Language: Propositional Definite Clauses

Interpretation

An interpretation assigns a truth value to each atom. Thus, there are $2^{num\_atoms}$ possible interpretations.

Models and Logical Consequences

Example: $$ KB = \begin{cases} p \leftarrow q,\ q,\ r \leftarrow s \end{cases} $$ $m = (p = true, q = true, r = false, s = false)$ would be a model of $KB$.

Proof Procedures

Bottom-Up Proof Procedure

If $h \leftarrow b_1 \land \cdots \land b_m$ is a clause in the knowledge base and each $b_i$ has been derived (all are consequences), then $h$ can be derived. This method is called forward chaining.

First, set $C := \{\}$. Then, select a clause $h \leftarrow b_1 \land \cdots \land b_m$ in $KB$ such that:

Then set $C := C \cup \{h\}$.

Repeat until no more clauses can be selected. (Only atomic clauses can be added to the set at the beginning).

$KB \vdash g$ if $g \in C$ at the end of the procedure.
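
A minimal sketch of this bottom-up procedure, assuming the KB is given as a list of (head, body) pairs where atomic clauses have an empty body:

def forward_chain(kb):
  C = set()
  changed = True
  while changed:
    changed = False
    for head, body in kb:
      # Select a clause whose body atoms have all been derived
      if head not in C and all(b in C for b in body):
        C.add(head)
        changed = True
  return C  # the fixed point

# e.g. forward_chain([('q', []), ('p', ['q']), ('r', ['s'])]) == {'q', 'p'}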

Proof of soundness

If there is a $g$ such that $KB \vdash g$ but $KB \nvDash g$, there must be some atoms added to $C$ which aren't true in every model of $KB$. Call the first such atom $h$.

Thus, there must be a clause of the form $h \leftarrow b_1 \land \dots \land b_m$. As $h$ is the first wrong atom, each $b_i$ is true in every model of $KB$. Take a model $I$ of $KB$ in which $h$ is false: this clause is then false in $I$, so $I$ cannot be a model of $KB$ - a contradiction.

Fixed Point

The $C$ generated at the end of the bottom-up algorithm is called a fixed point.

If $I$ is the interpretation in which every element of the fixed point is true, and every other atom is false:

Proof of Completeness

Top-Down Procedure

Search backwards from a query to determine if it is a logical consequence of $KB$ (i.e. asking if an atom is true).

An answer clause: $yes \leftarrow a_1 \land \dots \land a_m$.

The SLD resolution of the answer clause on atom $a_i$ with the clause $a_i \leftarrow b_1 \land \dots \land b_p$ is another answer clause:

$$ yes \leftarrow a_1 \land \dots \land a_{i-1} \land b_1 \land \dots \land b_p \land a_{i+1} \land \dots \land a_m $$

Basically: replace the atom with its clause, repeating until no more replacements can be made.

An answer is an answer clause with $m = 0$; that is, the answer clause is $yes \leftarrow$.

Derivations

Derivation of query $?\ q_1 \land \dots \land q_k$ is a sequence of answer clauses $\gamma_0, \dots, \gamma_n$ such that:

Procedure

To solve the query $?\ q_1 \land \dots \land q_k$:

Either don't-care non-determinism, in which case if a selection does not lead to a solution there is no point in trying other alternatives, or don't-know non-determinism, in which other choices may lead to a solution.

A successful derivation would end with $ac := yes \leftarrow$.

A failing derivation would return something in the form $ac := yes \leftarrow a_0 \land \dots \land a_n$. This does not mean it cannot be derived, just that it failed. Use DFS; backtrack.
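
A minimal sketch of the top-down procedure with DFS and backtracking, using the same assumed (head, body) representation as the bottom-up sketch; note it can loop on cyclic clauses:

def solve(kb, query):
  # query is the body of the answer clause; empty means `yes <-`, an answer
  if not query:
    return True
  a, rest = query[0], query[1:]
  for head, body in kb:
    # SLD resolution: replace the selected atom with the clause's body
    if head == a and solve(kb, body + rest):
      return True
  return False  # failing derivation: backtrack and try other clauses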

Weeks 04-05: Prolog

Prolog - programming in logic

Intro

Basic Syntax

Variables

Variables start with an underscore or upper-case letter, and may contain upper- and lower-case letters, digits, or underscores. happy(X). returns a term that replaces X such that the rule is met. Typing ; tries to find another term that satisfies the rule.

Conjunction can also be used e.g. happy(X), listensToMusic(X)..

Variables can be in the knowledge base as well e.g. jealous(X,Y) :- loves(X,Z), loves(Y,Z).

Atoms

A sequence of characters (upper, lower, digits, underscore) starting with a lowercase letter OR a sequence of special characters (:, ,, ;, ., :-). Atoms can be enclosed in single quotes if they do not meet the naming requirements (e.g. 'Yolanda').

Complex terms

Functor directly followed by a sequence of arguments - put in brackets, separated by commas. e.g. listensToMusic(yolanda), hide(X, father(father(father(butch)))).

Arity

The number of arguments a complex term has; predicates with the same functor but different arity can be defined.

Unification

Two terms unify if they are the same term or contain variables that can be uniformly instantiated with terms in a way such that the resulting terms are equal.

e.g. woman(mia) = woman(mia). woman(Z) and woman(mia) will be unified, with Z taking the value of mia.

e.g. defining horizontal(line(point(X,Y), point(X,Z))). and running horizontal(line(point(1,2), point(X,3))). will unify to X=1.

Search Tree

f(a).
f(b).
g(a).
g(b).
h(b).
k(X):-f(X),g(X),h(X).

Asking ?- k(Y). will:

Recursion

For recursive predicates using conjunction, place the non-recursive goal first - Prolog uses DFS, so it may recurse forever (running out of memory) otherwise.

numeral(0).
numeral(succ(X)):-numeral(X).

Where succ(X) represents X + 1: 3 could be defined as numeral(succ(succ(succ(0)))). numeral(X). will continue generating numerals forever. Beware of running out of memory, e.g. p :- p.

Addition

add(0,X,X). is the base clause. No return values, so three arguments needed (the last is the return value).

add(succ(X),Y,succ(Z)):-add(X,Y,Z). is for the recursive case.

Dif predicate

mother(X, Y) :- parent(X, Y), female(X).
sister(X, Y) :- parent(Z, X), parent(Z, Y), female(X).

Any female would be her own sister, so the dif predicate is required: append X \= Y to the end of the sister body.

Predicate description - argument mode indicator

Character prepended to argument when describing a functor:

Logical quantification

Variables that appear in the head of a rule are universally quantified.

Variables that appear only in the body are existentially quantified.

path(X,Y) :- edge(X,Z), path(Z,Y).: For all nodes X and Y, there exists a node Z such that there is an edge from X to Z and a path from Z to Y.

Lists

A finite sequence of elements e.g. [[], dead(z), [2, [b, c]], Z, 3].

Lists as implemented as linked lists. A non-empty list has two parts:

The | operator decomposes a list into a head and tail:

Anonymous variables

If you do not need the value of a variable, use _ - the anonymous variable. Each occurrence of _ is a distinct variable, so two instances need not be equal.

Membership

Is an element a member of a list? Use member/2:

member(X, [X|_]). % If element is the head, it is in the list
member(X, [_|T]) :- member(X, T). % Else, recurse through the list until it fails

member(X, [a, b, c]) can unify to three separate values.

% Exercise: a2b/2; first list all a's and second list all b's; both of the same length

a2b([], []).
a2b([a|L1], [b|L2]) :- a2b(L1, L2).

Append

Append two lists together using append/3, where the third argument is the result of concatenating the lists together.

append([], L, L).
append([H|L1], L2, [H|L3]) :- append(L1, L2, L3).
% If L3 is result, first elements of L1 and L3 must be the same. Hence, remove the first element from both L1 and L3 until L1 is empty.

Concatenating a list is done by traversing down one of the lists; hence, it is inefficient.

Prefix and Suffix

prefix(P, L) :- append(P, _, L). with prefix(X, [a, b, c]) generating all possible prefix lists of [a, b, c].

suffix(S, L) :- append(_, S, L). with suffix(X, [a, b, c]) generating all possible suffix lists of [a, b, c].

Sublist

Sublists are prefixes of suffixes of the list:

sublist(Sub, List) :- suffix(Suffix, List), prefix(Sub, Suffix).

Reversing a list

If given [H|T], reverse the list by reversing T and appending [H] to the result.

naiveReverse([], []).
naiveReverse([H|T], R) :- naiveReverse(T, RT), append(RT, [H], R).

This is inefficient: append/3 is O(n) and this is done at each stage, so it is O(n^2).

By using an accumulator, we can improve things:

% Third argument is an accumulator
accReverse([], L, L). % If list is empty, the reversed list is the accumulator
accReverse([H|T], Acc, Rev) :- accReverse(T, [H|Acc], Rev).

reverse(L1, L2) :- accReverse(L1, [], L2).
% append/3 not used
List          Acc
[a, b, c, d]  []
[b, c, d]     [a]
[c, d]        [b, a]
[d]           [c, b, a]
[]            [d, c, b, a]

Arithmetic and other operators

| C | Prolog |
| --- | --- |
| < | < |
| <= | =< |
| == | =:= |
| != | =\= |
| >= | >= |
| > | > |

These force the left and right hand arguments to be evaluated.

= is the unification predicate and \= is the negation. == is the identity predicate, which succeeds if the arguments are identical.

2+2 = 4. is false as +(2, 2) does not unify with 4. You must use 2+2 =:= 4.

! is the cut operator - it suppresses backtracking. The fail predicate always fails. Using these two allows us to invert the result:

neg(Goal) :- Goal, !, fail.
neg(_). % If Goal fails, this second clause makes neg(Goal) succeed

If Goal succeeds, it gets to ! so can never backtrack, then gets to fail and fails. If Goal fails, the first clause fails and the second clause, neg(_), succeeds. If the ! was not there, it would attempt to evaluate Goal again.

As this is so common, there is a built-in operator that does this: \+.

Week 06: Constraint Satisfaction Problems

These problems are characterized by:

A solution is a n-tuple of values for the variables that satisfies the constraints.

Examples

Australian map colouring:

Sudoku:

Eight queens puzzle:

Basic Algorithms

Generate-and-Test algorithm

Generate the assignment space $D = D_{v_1} \times \dots \times D_{v_n}$ (Cartesian product), then test each assignment with the constraints.

It is exponential in the number of variables.

Backtracking

Systematically explore $D$ by instantiating variables one at a time, evaluating each constraint as soon as all its variables are bound.

Thus, any partial assignment that doesn't satisfy the constraints can be pruned (e.g. if the constraint $A \neq B$ is violated, the assignment can be pruned even if $C$, $D$, etc. have not been instantiated yet).

A node is an assignment of values to some of the variables.

Search:

The start node is the empty assignment, and the goal node is a total assignment that satisfies the constraints.

CSP Algorithms

CSP is NP-hard. However, some instances of CSP can be solved more efficiently by exploiting certain properties.

Constraint Networks

An instance of CSP can be represented as a network:

Arc Consistency

An arc $\langle X, r(X, \overline{Y})\rangle$ is arc consistent if for every value of $X$, there is a value of $\overline{Y}$ such that $r(x,\overline{y})$ is satisfied. $\overline{Y}$ may be a set of multiple variables (the variables in the scope of the constraint, except $X$).

A network is arc consistent if all arcs are arc consistent.

If a constraint has only one variable in its scope and every value in the domain satisfies the constraint, then the arc is domain consistent.

If there is an arc that is not arc consistent, remove values from $X$'s domain to make it arc consistent.

Example:

One arc would be $\langle A, A+B=C \rangle$:

By repeating this with other nodes, the network can be made arc consistent.

Arc Consistency Algorithm

Arcs can be considered in series. An arc $\langle X, r(X, \overline{Y})\rangle$ needs to be revisited if the domain of one of the $Y$'s is reduced.

def GAC(variables, domains, constraints):
  # GAC: Generalized Arc Consistency algorithm
  # Start with every arc: one (variable, constraint) pair per variable in each scope
  todo = [(X, C) for C in constraints for X in scope(C)]
  return GAC2(variables, domains, constraints, todo)

def GAC2(variables, domains, constraints, todo):
  while todo:
    X, constraint = todo.pop() # The variable is in the scope of the constraint (i.e. it is relevant)
    Ys = [Y for Y in scope(constraint) if Y != X] # The other variables in the constraint's scope

    # Keep x if there exist values (y_1, ..., y_n) for the Ys such that
    # is_consistent(constraint, X=x, Y_1=y_1, ..., Y_n=y_n) holds;
    # assignments(Ys, domains) is assumed to enumerate all value tuples for the Ys
    new_domain = {x for x in domains[X]
                  if any(is_consistent(constraint, x, ys) for ys in assignments(Ys, domains))}

    if new_domain != domains[X]:
      # X's domain shrank, so re-check all other arcs involving constraints on X
      todo += [(Z, C) for C in constraints if X in scope(C) for Z in scope(C) if Z != X]
      domains[X] = new_domain

  return domains

There are three possible outcomes of this algorithm: every domain has a single value (a unique solution), some domain is empty (no solution), or some domain still has multiple values (there may or may not be a solution).

Domain Splitting

Split a problem into a number of disjoint cases and solve each case separately: the set of solutions is the union of all solutions to each case.

e.g. if $X \in \{0, 1\}$, find all solutions where $X = 0$, then where $X = 1$.

Algorithm

Given a CSP instance:

def CSPSolver(variables, domains, constraints, todo):
  # Make the network arc consistent first (GAC2 from above), then inspect the domains
  domains = GAC2(variables, domains, constraints, todo)

  if any(len(domains[X]) == 0 for X in variables):
    return False # Some domain is empty: no solution
  if all(len(domains[X]) == 1 for X in variables):
    # Every domain has one possible value: a solution
    return {X: next(iter(domains[X])) for X in variables}

  X = next(X for X in variables if len(domains[X]) > 1)
  D1, D2 = split(domains[X])

  domains1 = {**domains, X: D1}
  domains2 = {**domains, X: D2}

  # Domain of X has been split, so recheck all arcs involving constraints on X
  todo = [(Z, C) for C in constraints if X in scope(C) for Z in scope(C) if Z != X]

  return (CSPSolver(variables, domains1, constraints, todo)
          or CSPSolver(variables, domains2, constraints, todo))

Variable Elimination

Eliminate variables one-by-one, passing constraints onto their neighbors.

Constraints can be thought of as a relation containing tuples for all possible valid values.

A join operation can be done on two constraints to capture both constraints, joining on the variable being eliminated. Then, this table can be projected with that column (variable being eliminated) removed.

Algorithm

def VE_CSP(variables, constraints):
  if len(variables) == 1:
    return join(constraints) # Join the remaining constraints into a single relation
  else:
    X = variables.pop() # Select a variable to eliminate and remove it from variables
    constraints_X = [C for C in constraints if X in scope(C)]
    R = join(constraints_X) # Join all constraints involving X
    R2 = project(R, [V for V in scope(R) if V != X]) # Project X's column away

    new_constraints = [C for C in constraints if X not in scope(C)] + [R2]
    recursed = VE_CSP(variables, new_constraints)
    return join([R, recursed]) # Join back onto R to recover the values of X

Week 07: Local and Global Search

An optimization problem has:

The assignment that optimizes (maximizes or minimizes) the value of the objective function must be found.

A constrained optimization problem adds a set of constraints which determines which assignments are allowed.

Use algorithms to iteratively improve a state.

A single current state is kept in memory and in each iteration, we move to one of its neighbors to improve it.

Most local search algorithms are greedy. Two such algorithms are hill climbing and greedy descent.

e.g. TSP: start with any complete tour, and perform pairwise exchanges (pick two edges, switch the vertex to one in the other edge). This type of approach can get close to the optimal solution quickly.

Local Search for CSPs

CSP can be reduced to an optimization problem.

If each variable is assigned a value, a conflict is an unsatisfied constraint: the goal is to find an assignment with zero conflicts.

Hence, the heuristic function to to be minimized will be the number of conflicts.

Example: n-queens problem

Put n queens on an n by n board such that no two queens can attack each other.

Heuristic: number of pairs of queens that can attack each other.

One queen on each column, queens can move up and down only. For each state there will be $n(n-1)$ neighbors (each can move to $n-1$ locations).
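
A sketch of the conflict-counting heuristic under the one-queen-per-column representation (queens[i] is the row of the queen in column i; hypothetical names):

def conflicts(queens):
  # Number of pairs of queens that can attack each other
  n = len(queens)
  return sum(1
             for i in range(n) for j in range(i + 1, n)
             if queens[i] == queens[j]                 # same row
             or abs(queens[i] - queens[j]) == j - i)   # same diagonal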

Variants of Greedy Descent

Issues

Can get stuck in local optima/flat areas of the landscape - randomized greedy descent can sometimes help:

These two make the search global.

A total assignment is called an individual.

Maintain a population of k individuals instead of one and update each individual at every stage. If an individual is a solution, it can be reported (and the search can stop).

With local search, random restarts will occur when it reaches a non-zero local minimum and finish when a solution is found.

With parallel search, if a solution is found all individuals can be stopped.

Simulated Annealing

Given:

The probability of adopting the new value is:

$$ e^{(h(n) - h(n'))/T} $$

As the temperature is reduced, the probability of accepting a worsening change decreases.
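
A sketch of the acceptance test, assuming h is the heuristic being minimized:

import math
import random

def accept(h_n, h_n2, T):
  # h_n = h(n) for the current state, h_n2 = h(n') for the proposed neighbor
  if h_n2 <= h_n:
    return True  # always accept an improvement
  # Accept a worsening move with probability e^((h(n) - h(n')) / T),
  # which shrinks as the temperature T is lowered
  return random.random() < math.exp((h_n - h_n2) / T)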

Gradient Descent

Objective function must be (mostly) differentiable:

def gradient_descent(f, initial_guess, step_size, epsilon):
  # f = objective function; grad(f) returns its gradient function
  x = initial_guess # x is a vector
  while magnitude(grad(f)(x)) > epsilon:
    # f is a multi-variable function, so grad(f)(x) returns a vector;
    # the steepest descent direction is against the gradient, so step along it
    x = x - step_size * grad(f)(x)
  return x

Evolutionary Algorithms

Requires:

Flow

After initializing the population:

Evaluation/Fitness Function

A value relating to how ‘good’ an individual is.

Used for:

Selection

Better individuals should have a higher chance of surviving and breeding.

Roulette wheel

Each individual is assigned to a part of the roulette wheel, with the ‘angle’ being proportional to its fitness, and it is spun n times to select n individuals.

def roulette_wheel_select(population, fitness, r):
  # r is a uniform random number in [0, 1)
  total_fitness = sum([fitness(individual) for individual in population])

  rolling_sum = 0

  for individual in population:
    rolling_sum += fitness(individual)
    if rolling_sum > total_fitness * r:
      return individual

Tournaments

n (size of the tournament) individuals chosen randomly: the fittest one is selected as a parent.

If n is one, it is equivalent to randomly selecting an individual.

If n is the population size, it is completely deterministic.
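
A sketch of tournament selection, assuming a fitness callable:

import random

def tournament_select(population, fitness, n):
  # Choose n individuals at random; the fittest becomes a parent.
  # n = 1 is random selection; n = len(population) is fully deterministic.
  contestants = random.sample(population, n)
  return max(contestants, key=fitness)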

Elitism

Keeps at least one copy of the fittest solution so far for the next generation - ensures the fitness will not decrease over generations.

Reproduction Operators

Crossover

Two parents produce two offspring. 1, 2 or n crossover points are generated:

One point crossover: the child has parent one's chromosomes up until the random crossover point; after that, it has the other parent's chromosomes.

n point crossover: at n separate points, it swaps between taking chromosomes from one parent and the other.
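
A sketch of one-point crossover on sequence chromosomes (n-point crossover repeats the same idea at n cut points):

import random

def one_point_crossover(parent1, parent2):
  # Each child copies one parent up to a random cut point,
  # then the other parent after it
  point = random.randint(1, len(parent1) - 1)
  child1 = parent1[:point] + parent2[point:]
  child2 = parent2[:point] + parent1[point:]
  return child1, child2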

Mutation

Mutation gives a low probability of randomly changing the gene of a child.

Mutation brings in diversity as compared to combining candidates to (hopefully) produce better children.

Randomly Generating Programs (Trees)

Mutation: pick a random node and replace the subtree with a randomly-generated subtree.

Crossover: pick a random node in each parent and exchange them.

Week 08: Probabilistic Inference and Belief Networks

In the real world, there will be uncertainty and randomness due to several factors:

Hence, using probabilities is often required.

Random Variable

Some aspect of the world about which there is uncertainty.

RVs are denoted with a capital letter and have an associated domain.

Unobserved RVs have distributions; a table of probabilities of values. A probability is a single number e.g. $P(W = rain) = 0.1$. The probabilities sum to 1 and none are negative.

Joint distribution over a set of RVs: a map from assignments/outcomes/atomic events to reals; $P(X_1 = x_1, \dots, X_n = x_n)$.

Event: set $E$ of assignments; $P(E) = \sum_{(x_1, \dots, x_n) \in E}{P(x_1, \dots, x_n)}$.

Marginalization (summing out): projecting a joint distribution to a sub-distribution over a subset of variables: $P(X_1 = x_1) = \sum_{x_2 \in domain(X_2)}{P(X_1 = x_1, X_2 = x_2)}$.

Conditional probability: $P(a|b) = \frac{P(a, b)}{P(b)}$.

Conditional distribution: probability distribution over some variables given fixed values of others. If $W$ and $T$ take binary values, $P(W, T)$ is a 2 by 2 table, and $P(W|T)$ is two 2-row tables, each summing to 1.

To get the whole conditional distribution at once, select the joint probabilities matching the evidence and normalize the selection (make it sum to 1). Example:

$P(T|rain)$: select rows where $R = rain$, then divide the probabilities by the sum $P(warm, rain) + P(cold, rain)$.

$$ \begin{aligned} P(x_1|x_2) &= \frac{P(x_1, x_2)}{P(x_2)} \\ &= \frac{P(x_1, x_2)}{\sum_{x_1 \in domain(X_1)}{P(x_1, x_2)}} \end{aligned} $$

Product rule:

$$ \begin{aligned} P(x|y) &= \frac{P(x, y)}{P(y)} \\ \therefore P(x, y) &= P(x|y) \cdot P(y) \end{aligned} $$

Chain rule: $P(x_1, x_2, x_3) = P(x_1) \cdot P(x_2|x_1) \cdot P(x_3|x_1, x_2)$. More generally:

$$ P(x_1, \dots, x_n) = \prod_{i=1}^{n}{P(x_i|x_1, \dots, x_{i-1})} $$

Probabilistic Inference

Computing a desired probability from other known probabilities.

$P(x, y) = P(x|y) \cdot P(y) = P(y|x) \cdot P(x)$. By dividing this by the marginal, we get Bayes' rule:

$$ P(x|y) = \frac{P(y|x) \cdot P(x)}{P(y)} $$

This allows us to invert a conditional distribution - often one conditional is simple but the other is tricky. $$ \begin{aligned} P(x,y|z) &= P(y,x|z) \\ P(x,y|z) &= \frac{P(x,y,z)}{P(z)} \\ P(y|x,z) &= \frac{P(x,y,z)}{P(x,z)} \\ \therefore P(x,y,z) &= P(y|x,z) \cdot P(x,z) \\ \therefore P(x,y|z) &= \frac{P(y|x,z) \cdot P(x,z)}{P(z)} \\ &= P(y|x,z) \cdot P(x|z) \\ \therefore P(x|y,z) &= \frac{P(y|x,z) \cdot P(x|z)}{P(y|z)} \\ \text{if } z \text{ is implicit:} \quad P(x|y) &= \frac{P(y|x) \cdot P(x)}{P(y)} \text{ (Bayes' rule)} \end{aligned} $$

Inference by Enumeration

A more general procedure: $P(Y_1, \dots, Y_m|e_1, \dots, e_k)$ where:

These variables can be referred to as $X_1, \dots, X_n$.

First, select entries consistent with the evidence.

Then, sum out $H$:

$$ P(Y_1, \dots, Y_m, e_1, \dots, e_k) = \sum_{h_1, \dots, h_r}{P(Y_1, \dots, Y_m, h_1, \dots, h_r, e_1, \dots, e_k)} $$

Finally, normalize the remaining entries.
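
A minimal sketch of inference by enumeration, assuming the joint distribution is available as a callable from a full assignment (a dict) to a probability; all names are hypothetical:

from itertools import product

def enumerate_query(joint, domains, query_vars, evidence):
  hidden = [v for v in domains if v not in query_vars and v not in evidence]
  result = {}
  for ys in product(*(domains[v] for v in query_vars)):
    assignment = {**dict(zip(query_vars, ys)), **evidence}
    # Sum out the hidden variables
    result[ys] = sum(joint({**assignment, **dict(zip(hidden, hs))})
                     for hs in product(*(domains[v] for v in hidden)))
  total = sum(result.values())  # normalization constant
  return {ys: p / total for ys, p in result.items()}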

Complexity of Models

Simple models are easier to build, explain, and usually lead to lower time complexity and space requirements.

To measure complexity, count the number of free parameters that must be specified; a joint distribution over $n$ variables, each with a domain of size $d$, requires $d^n$ entries in the table, and the number of free parameters will be $d^n - 1$ (the last one can be inferred as probabilities must sum to 1).

Independence

Two variables are independent if: $$ \begin{aligned} P(X, Y) &= P(X) \cdot P(Y) \text{ or} \\ \forall x, y: P(x, y) &= P(X=x) \cdot P(Y=y) \end{aligned} $$ That is, their joint distribution factors into a product of two simpler distributions.

Absolute independence: $X {\perp\!\!\!\perp} Y$

Independence can be used as a modelling assumption; if all the variables are independent, instead of having $d^n - 1$ parameters in the joint model, we only need $n(d - 1)$.

Absolute (unconditional) independence is very rare; conditional independence is more robust. For a given value of $Z$, the probability of $X$ is independent of $Y$; $X {\perp\!\!\!\perp} Y | Z$ if:

$$ \begin{aligned} &\forall x, y, z: P(x, y|z) = P(x|z) \cdot P(y|z) \text{ or} \\ &\forall x, y, z: P(x|z, y) = P(x|z) \end{aligned} $$

In this case:

$$ \begin{aligned} P(X, Y, Z) &= P(X|Y, Z) \cdot P(Y, Z) \\ &= P(X|Y, Z) \cdot P(Y|Z) \cdot P(Z) \\ &= P(X|Z) \cdot P(Y|Z) \cdot P(Z) \end{aligned} $$

This can occur if $X$ and $Y$ are both dependent on $Z$ but are independent of each other; the value of $X$ modifies the probability of the parent $Z$'s value and thus modifies the probability of $Y$ in turn.

Belief Networks

Graphical Model Notation

Arcs:

Nodes:

Distributions:

D-separation can be used to decide if a set of nodes $X$ is independent of $Y$ given $Z$.

Encoding

BNs implicitly encode joint distributions; this can be calculated as a product of local conditional distributions.

Example:

$$ \begin{aligned} A &\rightarrow B \\ A &\rightarrow C \\ B, C &\rightarrow D \\ C &\rightarrow E \end{aligned} $$
$$ \begin{aligned} P(a, b, c, d, e) &= P(e|a, b, c, d) \cdot P(a, b, c, d) \quad\text{(by the product rule)} \\ &= P(e|c) \cdot P(a, b, c, d) \quad\text{($e$ is dependent on $c$ but independent of all others)} \\ &= P(e|c) \cdot P(d|a, b, c) \cdot P(a, b, c) \\ &= P(e|c) \cdot P(d|b, c) \cdot P(a, b, c) \\ &= P(e|c) \cdot P(d|b, c) \cdot P(c|a, b) \cdot P(a, b) \\ &= P(e|c) \cdot P(d|b, c) \cdot P(c|a) \cdot P(b|a) \cdot P(a) \end{aligned} $$

More generally, if you have a full assignment, multiplying the relevant conditionals gives the probability:

$$ P(x_1, \dots, x_n) = \prod_{i=1}^{n}{P(x_i | parents(X_i))} $$
product = 1
for i in range(n):
  # Multiply in P(x_i | parents(X_i)) from node i's conditional probability table
  product *= probability(x[i], parents(i))

Thus, we can reconstruct any entry of the full joint. However, not every BN can represent every full joint.

Inference Enumeration

Computing $P(Y | e)$. If $H$ are the hidden variables and $\alpha$ normalizes the values so the sum of the probabilities is 1:

$$ P(Y|e) = \alpha \cdot P(Y, e) = \alpha \sum_H{P(Y, e, H)} $$

(NB: $\sum_H$ means $\sum_{H_1}{\sum_{H_2}{\dots}}$)

This has to be computed for every value in the domain of $Y$.

Week 09: Basic Machine Learning

Learning: improving behavior based on experience. This could be:

Components of a learning problem:

Learning architecture:

Supervised Learning

Given the following as input:

This is fed to a learning algorithm to build a predictive model that takes a new instance and returns/predicts the value for the target feature.

For continuous target variables, regression is used.

Measuring performance

Common performance measures:

$$ \text{error} = \frac{\text{number of incorrectly classified instances}}{\text{total number of instances}} $$

$$ \text{accuracy} = 1 - \text{error} = \frac{\text{number of correctly classified instances}}{\text{total number of instances}} $$

In binary classification problems, one class is called positive (p) and the other negative (n).

Common performance measures for regression:

Training and Test Sets

A set (multi-set) of examples is divided into training and test examples; this is required as the model can overfit the training data, giving high performance on the training data but low performance on unseen data.

The more complex the model is, the lower the error on the training data.

A general pattern is that at a certain complexity, increasing the complexity of the model increases the error on the test data.

Naïve Bayes Model

$$ P(C | X_1, \dots, X_n) = \alpha \cdot \prod_{i=1}^n{P(X_i | C)} \cdot P(C) $$

Where:

Conditional probabilities can be estimated from labelled data.

Find $P(Class | an\_input\_vector)$ for different classes and pick the class with the highest probability.

Problem: it is hard to learn $P(Class | Evidence)$ directly, as there need to be examples for every possible assignment. As the number of features increases, the number of assignments grows exponentially.

Thus, assume the input features are conditionally independent given the class (the naïve Bayes model).

Example: Building a Classifier

Determine if the patient is susceptible to heart disease (y/n) given family history (t/f), fasting blood sugar level (l/h), BMI (l, n, h).

Model it as $Hist$, $BG$, $BMI$ all having $Class$ as a parent:

$$ P(Class | Hist, BG, BMI) = \alpha \cdot P(Class) \cdot P(Hist | Class) \cdot P(BG | Class) \cdot P(BMI | Class) $$

The class can take two values, so there are two tables per feature, with two rows for $Hist$/$BG$ per table (three for $BMI$ as it has three values).

NB: in the quiz, you only store the value for when the class is true.

To calculate $\alpha$, calculate the sum of $P(Class | Hist, BG, BMI)$ for all values of $Class$, and then take the inverse.
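
A sketch of the posterior calculation, assuming the prior and per-feature conditional tables have been estimated from labelled data (hypothetical data structures):

def naive_bayes_posterior(prior, likelihoods, evidence):
  # prior[c] = P(Class=c)
  # likelihoods[feature][c][value] = P(feature=value | Class=c)
  unnormalized = {}
  for c, p_c in prior.items():
    p = p_c
    for feature, value in evidence.items():
      p *= likelihoods[feature][c][value]
    unnormalized[c] = p
  alpha = 1 / sum(unnormalized.values())  # inverse of the sum over classes
  return {c: alpha * p for c, p in unnormalized.items()}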

Laplace Smoothing

Zero counts in small data sets lead to zero probabilities - this is too strong a claim based on only a sample.

To fix this, add a non-negative pseudo-count to the counts - this can reduce the confidence.

$domain(A)$ is the set of values $A$ can take; $|domain(A)|$ is the number of values $A$ can take.

$count(constraints)$ is the number of examples in the dataset that satisfy the given constraints:

Given these:

$$ P(A=a | B=b) \approx \frac{count(A=a, B=b) + pseudo\_count}{\sum_{a' \in domain(A)}{(count(A=a', B=b) + pseudo\_count)}} $$

This is equivalent to:

$$ P(A=a | B=b) \approx \frac{count(A=a, B=b) + pseudo\_count}{count(B=b) + pseudo\_count \cdot |domain(A)|} $$

The greater the pseudo-count, the more the probabilities even out (approaching $\frac{1}{|domain(A)|}$).
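
A one-line sketch of the smoothed estimate (hypothetical argument names):

def smoothed_estimate(count_a_and_b, count_b, domain_size_a, pseudo_count=1):
  # P(A=a | B=b) ~ (count(A=a, B=b) + pc) / (count(B=b) + pc * |domain(A)|)
  return (count_a_and_b + pseudo_count) / (count_b + pseudo_count * domain_size_a)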

Parametric vs Non-Parametric Models

Parametric models are described with a set of parameters; learning means finding the optimal values for these parameters.

Non-parametric models are not characterized by parameters - one such family is called instance-based learning:

k-Nearest Neighbors

An example of an instance-based learning algorithm is k-nearest neighbors:

Training only requires storing all the examples.

Prediction: $H(x_{new})$:

If k is too high, it will be under-fit.

Geometrically, each data point $x$ defines a 'cell' of space where any point within that space has $x$ as the closest point to it. As the target feature is discrete, decision boundaries can be made where the class is different on each side of the boundary.
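
A sketch of the k-nearest-neighbors prediction step, assuming the stored examples are (input, label) pairs and a distance metric is given (hypothetical names):

from collections import Counter

def knn_predict(examples, x_new, k, distance):
  # Take the k stored examples nearest to x_new and return the majority label
  nearest = sorted(examples, key=lambda ex: distance(ex[0], x_new))[:k]
  return Counter(label for _, label in nearest).most_common(1)[0][0]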

Week 10: Artificial Neural Networks

Perceptron

A neuron receives signals from multiple inputs (inputs are weighted), and if the overall signal is above a threshold, the neuron fires. A perceptron models this with:

$$ a = \sum_{i = 1}^{n}{w_i \cdot x_i} + bias $$

Sometimes, bias is represented as another weight $w_0$ - in this case, there is a virtual input $x_0 = 1$ and hence:

$$ a = \sum_{i = 0}^{n}{w_i \cdot x_i} $$

For this course, $g(a) = 1$ if $a \ge 0$ and $0$ otherwise; a Heaviside (step) function.

A perceptron can be seen as a predicate: given a vector $x$, $f(x) = 1$ if the predicate over $x$ is true, and 0 otherwise. Hence, it can be used in decision making and binary classification problems ($f(x) = 1$ if in the positive class).

The function partitions the input space into two sections: if there are two inputs, the decision boundary will be a straight line: $w_1$ and $w_2$ determine the gradient and $w_0$ determines the intercept. For a given value of these weights, the decision boundary can be found (e.g. points where $x_1$ or $x_2$ equal zero).

If there are three inputs, the decision boundary is a plane. For n dimensions, it will be a hyperplane.

The vector $\underline{w}$ (without $w_0$) can be thought of as the normal to the line. The minimum distance between the origin and the decision boundary is $\frac{w_0}{\|\underline{w}\|}$.

$\underline{w}$ can be used to show which side of the hyper-plane will be classified as positive (the direction it points in will be positive).

Learning

Given a data set - a collection of training vectors of the form $(x_1, \dots, x_n, t)$ where $t$ is the target value:

If examples are linearly separable, the weights and bias will, in finite time, converge to values that produce a perfect separation.

If $\eta$, the learning rate, is too high, the boundary may oscillate and never perfectly partition the values. If it is too small, it will take a long time to converge.
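
A sketch of the perceptron training loop, assuming each example is (x, t) with the virtual input x[0] = 1 folded in:

def train_perceptron(examples, eta, max_epochs):
  w = [0.0] * len(examples[0][0])
  for _ in range(max_epochs):
    errors = 0
    for x, t in examples:
      prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
      if prediction != t:
        errors += 1
        # w_i <- w_i + eta * x_i * (actual - prediction)
        w = [wi + eta * xi * (t - prediction) for wi, xi in zip(w, x)]
    if errors == 0:
      break  # perfect separation on the training set
  return w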

Multi-Layer Perceptrons

Motivation

y
^
|
| false    true
|
| true     false
|-----------------> x

A single perceptron cannot partition these four points into a decision boundary. However, this can be done with multi-layer perceptrons.

Define two perceptrons, $P_1$ and $P_2$, which receive the same vectors but have their own weights and biases. By some algorithm, $P_1$ could partition the input space so that the upper left is separated from the rest of the points, and $P_2$ the same for the bottom right.

y          P_1            y   P_2
^                        ^
|          -----------   |
| false(0) | true(1)     | false(1)  true(1)
|----------              |         -----------
| true(1)    false(1)    | true(1) | false(0)
|-------------------> x  |--------------------> x

Now, pass the output of the perceptrons as input to another perceptron $P_3$:

P2        P_3
^
| --------
| (0, 1)  |  (1, 1) <- two points superimposed
|         ----------
|           (1, 0)  |
|---------------------> P1

Now, this perceptron can form a decision boundary that correctly partitions the input space.

Description

The feed-forward networks we are dealing with arrange perceptrons into layers where:

Some notes:

As more layers/neurons are added, the complexity of the boundary shape(s) can increase. If you have two dimensions:

Multi-class Classification

This can be done by having multiple numeric outputs and picking the node with the largest value.

Outputting numeric values instead of a Boolean requires the sigmoid function: $g(a) = \frac{1}{1 + e^{-a}}$. This function is differentiable at all points and handles uncertainty.

Error Function

The mean squared error is typically used, where $t_i$ is the desired output (according to the training data), $y_i$ is the output of the network, and $n$ is the number of examples in the training set:

$$ E = \sum_{i=1}^{n}{(t_i - y_i)^2} $$

The weights can be updated incrementally:

$$ W \leftarrow W - \eta \nabla E(W) $$

$\nabla E(W)$ is the gradient of the error; a vector of partial derivatives (derivatives for each input scalar given all other inputs are fixed). The gradient in the output layer is easy to compute, but in the hidden layer neurons can influence multiple other neurons, so back-propagation is needed (not covered).

Typical Architecture

The number of input nodes is determined by the number of attributes and the number of outputs is determined by the number of classes. A single hidden layer is enough for many classification tasks.

Guidelines:

Week 11: Games - Non-cooperative Multi-agent Systems

Many problems can be modelled as games: multiple agents with (possibly competing) interests. They can be described by:

Example: paper scissors rock:

Game Trees

In perfect-information games, a game tree is a finite tree where the nodes are states and the arcs correspond to actions by the agents.

Hence:

Perfect-information, Zero-sum, Turn-based games

Properties:

These properties create adversaries.

Optimal Strategy: Min-Max Function

Designed to find the best move at each stage:

  1. Generate the whole game tree
  2. Apply the utility function to each leaf
  3. Back-up values from the leaves through branch nodes
    • A max node computes the max of its child values, and vice-versa for a min node
  4. If a max node is at the root, choose the move leading to the child with the largest value (and vice-versa if the root is a min node)

This is optimal if both players are playing optimally.

def min_max_decision(state, root_is_max = True):
  # Return the move (child state) with the best backed-up value for the root player
  if root_is_max:
    return max(state.children(), key=min_value)
  else:
    return min(state.children(), key=max_value)

def max_value(state):
  # A max node's value is the maximum of its children's (min node) values
  if state.is_terminal:
    return state.calculate_utility()

  return max(min_value(child) for child in state.children())

def min_value(state):
  # A min node's value is the minimum of its children's (max node) values
  if state.is_terminal:
    return state.calculate_utility()

  return min(max_value(child) for child in state.children())

Reducing Search Space

Game tree size:

Hence, the search space needs to be reduced:

Alpha-Beta Pruning

If an option is bad compared to other available options, there is no point in expending search time to find out exactly how bad it is.

Example:

MAX           3
           /     \
          /       \
Min      3         ?
       / | \     /   \
Leaf  3  9  6   2     ?

We have computed that the value for the first child (min node) is 3 and that the first child of the second child has a value of 2. Hence, the second child has a maximum value of 2, so regardless of the true value of the second child, the outcome of the max will not change.

More generally:

(player)   MAX              (player)   MIN
           / \                         / \
min       m   ...            max     m    ...
               ...                         ...
                 \                           \
min               ?          max              ?
                /   \                       /   \
max    m > n   n     ?       min    m < n  n     ?

If m > n, the max node is guaranteed to get a utility of at least m - hence, the min node with utility n or less will never be reached. Thus, it does not need to be evaluated further. The opposite occurs for a min node.

The algorithm has:

These two values should be passed down to the child nodes during search. If the value returned by the child node is greater than $\alpha$ for a max node, then $\alpha$ should be updated, and vice versa if it is a min node. If $\alpha \geq \beta$, the node can be pruned.

from math import inf

def alpha_beta_search(tree, is_max_node = True, alpha = -inf, beta = inf):
  # Input: nested array. Numbers are leaf nodes, arrays are inner nodes
  # Returns a tuple of the best utility and path that gets that utility (index numbers)
  # If path is none, pruning occurred or it is a leaf node
  best_path = None
  best_utility = -inf if is_max_node else inf

  if isinstance(tree, int):
    return (tree, None)

  for (i, child) in enumerate(tree):
    utility, path = alpha_beta_search(child, not is_max_node, alpha, beta)
    path = [i] if path is None else ([i] + path) # Append index of child to path received by child

    if is_max_node:
      if utility > best_utility:
        (best_utility, best_path) = (utility, path)

        if utility > alpha:
          # This child is now the largest child encountered by this max node or any parent max nodes
          alpha = utility
    else:
      if utility < best_utility:
        (best_utility, best_path) = (utility, path)

        if utility < beta:
          beta = utility

    if alpha >= beta:
      return (utility, None)
      # In the case that this is a max node:
      # The child node is min node and its value is larger than or equal to the smallest value
      # encountered so far by any parent min nodes. Hence, this (max) node will pick this child or
      # a larger one, so this node's parent will reject it. Thus, this node can be pruned
      # Similar logic applies if it is a min node

  return (best_utility, best_path)
Effectiveness

Worst case: if branches are ordered, no pruning takes place.

Best case: each player’s best move is the left-most child/evaluated first.

Good move ordering improves effectiveness. Examples:

Alpha-beta search often gives us $O(b^\frac{d}{2})$; an improvement over $O(b^d)$ (i.e. it effectively reduces the branching factor to its square root).

Static (Heuristic Evaluation) Function

Estimates how good the current board configuration (a non-terminal state) is for a player.

A typical function could evaluate how good the state is for the player, minus how good it is for the opponent. If the board evaluation is $X$ for one player, it is $-X$ for its opponent.

This can be used to perform a cut-off search: after a maximum depth is reached, use the heuristic evaluation instead of finding the actual utility.

Side note: with three players, we now have two degrees of freedom; the utility is a tuple, not a scalar. Each player will attempt to maximize its own dimension.

Week 12: Concluding Remarks

Final Exam:

Topics covered:

A recurring pattern across these topics is that the problems they solve can be reduced to search problems.

Related topics:

CSSE courses: