Exam Notes
Search
Prune on insert/remove if the end node has been expanded.
A*:
- Admissible: the heuristic estimate is always an underestimate of (or equal to) the true cost to a goal
- Monotone/consistent: $h(n) \le cost(n, n') + h(n')$ for all neighbors $n'$ of $n$ - that is, the estimate is never more than the cost of moving to a neighbor plus the estimate from that neighbor
- Fails if pruning is used and the heuristic is not monotone
Bidirectional search (BFS, LCFS, A*): search from both ends simultaneously
Iterative deepening: max depth incremented until a solution is found. Uses DFS, so memory is linear in the depth
Propositions and Inference
- $h \leftarrow b$: if $b$ is true, $h$ must be true. The head $h$ is an atom, the body $b$ is an atom or conjunction of atoms
- KB: set of definite clauses
- Interpretation: assignment of truth value to each atom
- Model: interpretation where all clauses are true
- Logical consequence: atoms that are true in every model of the KB
Soundness and Completeness
- $KB \vdash g$ means $g$ can be derived (is a consequence) from the given proof procedure
- Sound if every consequence derived is true (if $KB \vdash g$, then $KB \models g$)
- Complete if all consequences are derived (if $KB \models g$, then $KB \vdash g$)
Bottom-up/forward-chaining
- Initialize the set of consequences to be the set of atomic clauses
- Find an atom $h$ that is not yet a consequence where $h \leftarrow b_1 \wedge \dots \wedge b_m$ is in the KB and all $b_i$'s are consequences; add it to the set
- Repeat until no more clauses can be selected
Fixed point: the set of consequences generated.
If $g$ is in the fixed point, then $KB \models g$.
Top-down procedure
Answer clause: $yes \leftarrow a_1 \wedge \dots \wedge a_m$
Until the answer clause is an answer ($yes \leftarrow$, with an empty body):
- Pick an atom $a_i$ from the body
- Choose a clause with $a_i$ as its head
- Replace $a_i$ with its body
- If it is just a definition (an atomic clause like $a_i \leftarrow$), the answer clause will shrink
Prolog
- `consult(filename).` loads a knowledge base
- `make.` re-consults all changed knowledge bases
- `,` is AND, `;` is OR
- `predicate(args) :- definition.`
- Search uses DFS: place the non-recursive term first
- List is `[]` or `[Head_Element|Rest_Of_List]`
- `_` is the anonymous variable
- Example predicates:
% append(list, list_to_append, result)
append([], L, L).
append([H|L1], L2, [H|L3]) :- append(L1, L2, L3).
% accReverse(original, accumulator, reversed)
accReverse([], L, L).
accReverse([Head|Tail], Acc, Rev) :- accReverse(Tail, [Head|Acc], Rev).
- `!` suppresses backtracking; `fail` always fails; they can be combined to invert a result. `\+` does the same
Constraint Satisfaction Problems
A set of variables with domains for each, and set of constraints; a solution is an assignment that satisfies the constraints.
Constraint Network:
- Oval for variable; has associated domain
- Rectangle for constraint
- Arc from each variable to each constraint
Arc Consistency
For a given constraint $c$ and variable $X$ in its scope, the arc $\langle X, c \rangle$ is consistent if every value in the domain of $X$ can be extended to a satisfying assignment.
Elements may need to be removed from the domain of $X$ to make the arc consistent.
Algorithm:
- Get all pairings of constraints and variables in their scope
- For each pairing and element, find values in the domain that allow for valid assignments
- If the domain has changed, need to recheck all variables involved with any constraints involving the variable
Empty domains: no solution; multiple values in domains: may or may not have a solution.
Domain Splitting
Halve the domain for some variable, create two instances of the problem with the new domain, and solve both, revisiting only constraints involving the split variable.
Variable Elimination
Eliminate variables by passing constraints on to their neighbors.
- Select a variable $X$
- For each constraint involving $X$, find all assignments that satisfy that constraint
- If a variable is not involved in any of the constraints, it does not need to be assigned
- Join all constraints involving $X$, then remove $X$ with a project operation
- Repeat
Local and Global Search
- Optimization problem: finding the assignment that optimizes (min/max) the value of an objective function
- Local search
  - Each iteration, move to one of the current assignment's neighbors
  - Use random restarts/steps (moving to random neighbors) to prevent getting stuck in local optima
- Constraint satisfaction problem
  - The objective function is the number of unsatisfied constraints
- Global search with parallel search
  - Get $k$ individuals/total assignments
  - At each iteration update all individuals - if any individual is a solution the search can stop
- Simulated Annealing
  - Pick a random variable and value, adopting it if it is an improvement
  - If it is not an improvement, adopt it with probability $e^{-\Delta h / T}$
  - Decrease temperature over time
- Gradient descent
  - Walk along the gradient of the objective function to find a minimum
Roulette wheel: return the first individual whose fitness, added to the running total, makes the total larger than a random number between 1 and the sum of the fitnesses.
Probabilistic Inference and Belief Networks
For a full assignment, the probability is $P(X_1 = x_1 \wedge \dots \wedge X_n = x_n)$.
If it is the probability over all variables, it is called the joint probability distribution.
Basic Machine Learning
- Error = num incorrect / total
- Naïve Bayes model: assume features depend only on the class variable (the thing being predicted), i.e. features are conditionally independent given the class
- Laplace smoothing: add pseudo-count to reduce confidence
- Add pseudo-count to counts for every tuple
- K-nearest neighbors
- Non-parametric, instance-based learning: needs to store all examples
- Uses k examples closest to one being retrieved and method to merge them
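Laplace smoothing from the list above can be sketched in Python (a minimal illustration; the `laplace_estimate` helper and its arguments are names made up for this example):

```python
from collections import Counter

def laplace_estimate(counts, value, domain_size, pseudo=1):
    # Add a pseudo-count to the count of every possible value
    total = sum(counts.values()) + pseudo * domain_size
    return (counts.get(value, 0) + pseudo) / total

counts = Counter(["spam", "spam", "ham"])
p_spam = laplace_estimate(counts, "spam", domain_size=3)     # (2+1)/(3+3) = 0.5
p_unseen = laplace_estimate(counts, "other", domain_size=3)  # non-zero despite no observations
```

The pseudo-count keeps unseen values from getting probability zero, which is the "reduce confidence" effect described above.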
Artificial Neural Networks
Prediction: $\hat{y} = g\left(\sum_i w_i x_i\right)$ for weights $w_i$ and inputs $x_i$
Activation function: $g$, e.g. the sigmoid $g(z) = \frac{1}{1 + e^{-z}}$
Learning: adjust the weights to reduce prediction error, e.g. $w_i \leftarrow w_i + \eta (y - \hat{y}) x_i$
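A single-unit network following these formulas can be sketched as follows (assuming a sigmoid activation and a simple delta-rule update; `predict` and `train_step` are illustrative names):

```python
import math

def predict(weights, inputs, bias=0.0):
    # Weighted sum of inputs passed through a sigmoid activation
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 / (1 + math.exp(-z))

def train_step(weights, inputs, target, lr=0.1):
    # One update: w_i <- w_i + lr * (y - y_hat) * x_i
    y_hat = predict(weights, inputs)
    return [w + lr * (target - y_hat) * x for w, x in zip(weights, inputs)]
```

With zero weights the prediction is 0.5; one training step toward a target of 1 moves the prediction above 0.5.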
Weeks 01-02: Searching the State Space
State Space
State: object representing a possible configuration of the world (agent and environment).
State space: set of all possible states (cross product of all elements of the state).
State Space Graphs
- Actions change the state of the world
- Each action may cause a change in the state
- Basically a FSM
Directed Graph
Many problems can be abstracted into the problem of finding a path in a directed graph.
- Node: state (vertices)
- Arc: action (edges)
- Path: sequence of $n+1$ nodes (length $n$). Implemented as a sequence of arcs in practice
- Arcs can have an associated cost: the cost of a path is the sum of the costs of its arcs
- Solution: path from a start node to a goal node - multiple starting and goal nodes allowed
Explicit Graphs
- Entire graph in memory, stored as adjacency list or matrix
- Complexity measured in terms of number of vertices/edges
Implicit Graphs
- `outgoing_arcs` method returns a set of outgoing directed arcs for the given node
- Graph generated on the fly
- Complexity measured by depth of goal node (path length) and average branching factor (number of outgoing arcs at each node)
Searching Graphs
Generic algorithm
- Frontier: list of paths (starting from a start node) that are candidates for further exploration
- Explore graph and update frontier until goal node encountered
- Way in which paths are added and removed: search strategy
- e.g. pruning: store visited nodes to avoid cycles
def search(graph, start_nodes, is_goal_node):
    frontier = [[s] for s in start_nodes]
    while len(frontier) != 0:
        path = frontier.pop()
        if is_goal_node(path[-1]):
            return path
        for n in path[-1].neighbors():
            frontier.append(path + [n])
- The value selected from the frontier at each stage defines the search strategy - above, `pop` is used on the frontier object
- The `neighbors` function defines the graph - `outgoing_arcs` is used above
- The goal function defines the solution that is accepted
- Finish after you remove a path from the frontier. Otherwise, in graphs with costs, the lowest cost path may not be found
- If more than one answer is required, the search can continue - use the `yield` keyword
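The generic algorithm can be made runnable; this sketch (with an assumed adjacency-dict graph format) shows that only the frontier discipline changes between DFS and BFS:

```python
from collections import deque

def search(neighbors, start_nodes, is_goal, strategy="dfs"):
    # neighbors: dict mapping node -> list of successor nodes (assumed format)
    frontier = deque([s] for s in start_nodes)
    while frontier:
        # DFS pops the newest path, BFS the oldest - the only difference
        path = frontier.pop() if strategy == "dfs" else frontier.popleft()
        if is_goal(path[-1]):
            return path
        for n in neighbors.get(path[-1], []):
            frontier.append(path + [n])
    return None

graph = {"s": ["a", "b"], "a": ["g"], "b": ["g"]}
```

On this graph BFS finds the path through `a` (first enqueued at the shallowest level) while DFS dives down the most recently added branch.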
DFS
- Pop last element from stack
- If algorithm continues, the paths that extend the popped element are pushed to the stack
- Thus, the algorithm expands the deepest path
- Does not guarantee it will find the solution with the fewest arcs
- Does not halt on every graph and is not complete - is not guaranteed to find a solution if one exists
- No pruning, so it can get stuck in a cycle
- Time complexity as a function of the length of the selected path:
- $O(b^m)$, where $b$ is the branching factor and $m$ is the depth of the path
- Space complexity as a function of the length of the selected path:
- $O(bm)$ - there can only be linearly many frontier paths for a given depth (plus the branching of the current node that was explored)
Explicit graphs
- Used in exercises
- Nodes are specified as a set - order does not matter
- Edges are in a list: (tail, head, cost?) - order does matter
Tracing the frontier:
- Each line starts with a plus or minus
  - `+` to indicate that this will be added to the frontier
  - `-` to indicate that something has been selected and returned from the frontier
- List of nodes with no separator character
- Optional `!` at the end - means pruning has occurred
  - When adding or removing from the frontier but the end node is already expanded
BFS
- FIFO queue
- Pop from the start
- Check if goal node. If not
- Enqueue paths that extend the node to the queue
- Shallowest path expanded first
- Guarantees the solution with fewest arcs will be found first
- Complete - if there is a solution it will find it
- Does not always halt - if there is no solution and there is a cycle
- Time complexity: $O(b^d)$
- Space complexity: $O(b^d)$
Lowest-cost-first search (LCFS)
- Cost of path: sum of costs of arcs
- Frontier is priority queue ordered by path cost
- At each stage, select the path on the frontier with the lowest cost
- Finds the optimal solution: least-cost path to goal node
Priority Queue
- Each element has priority
- Element with higher priority always selected/removed/dequeued before element with lower priority
- Queue is stable: if two or more elements have the same priority, FIFO
- Does not affect the correctness of the algorithm, but makes it deterministic
- Python has `heapq`
  - Won't automatically be stable
Pruning
Two problems: cycles - an infinite search tree, and when expanding multiple paths leads to the same node.
We need memory - the frontier should keep track of which nodes have been expanded/closed.
- Expansion happens when a path is removed from the frontier and its extensions are added
- Store the expanded nodes as a set
- The nodes must be hashable
Prune when:
- When adding a path to a frontier but the end node has already been expanded
- When the frontier is asked for the path, but the end node of that path has already been expanded
LCFS finds an optimal solution, but it explores options in every direction, and knows nothing about the goal location.
Heuristics
Extra knowledge that can be used to guide a search:
- $h(n)$ is an estimate of the cost of the shortest path from a given node to a goal node
- An underestimate (or equal): any estimate it gives never exceeds the actual cost to a goal node
- If so, the heuristic function is admissible
- Multiple goal nodes: one approach is to return the estimate for the 'closest' goal node
Best-first Search
- Select the path on the frontier with minimal $h$-value - priority queue ordered by $h$
- Explores more promising paths first - usually faster than LCFS
- May NOT find the optimal solution
- May NOT find the optimal solution
A* Search Strategy
- Not as wasteful as LCFS
- Not as greedy as best-first search
- Avoids expanding paths that are already expensive
- $p$ is a path and $n$ is the last node on $p$
- $cost(p)$ is the real cost from the starting node to $n$
- $h(n)$ is an estimate of the cost from $n$ to the closest goal node
- $f(p) = cost(p) + h(n)$ is the estimated total cost of the path
- The frontier is a priority queue, ordered by $f$
- $h$ should be admissible (an underestimate) - makes it prefer less-explored (i.e. shallower) paths
- Fails if there is pruning and the heuristic is not monotone
  - An expensive path may be expanded before a cheaper one ending at the same node. If pruning occurs, the cheaper path cannot be used
Monotonicity
A requirement that is stronger than admissibility. A heuristic function is monotone/consistent if, for any two neighboring nodes $n$ and $n'$, $h(n) \le cost(n, n') + h(n')$.
That is, the estimated cost from $n$ is never more than the cost of reaching a neighbor $n'$ plus the estimate from $n'$.
Thus, $f$ never decreases along a path, so the first path found to any node is a cheapest one and pruning is safe.
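A compact A* sketch with pruning (assumed graph format: dict of `(neighbor, cost)` lists; the pruning on expanded nodes is only safe when the heuristic passed in is monotone):

```python
import heapq

def a_star(neighbors, start, is_goal, h):
    # Frontier entries are (f, cost_so_far, path); f = cost + heuristic
    frontier = [(h(start), 0, [start])]
    expanded = set()
    while frontier:
        f, cost, path = heapq.heappop(frontier)
        node = path[-1]
        if node in expanded:
            continue  # pruning: a cheaper path to this node was already expanded
        if is_goal(node):
            return cost, path
        expanded.add(node)
        for nxt, arc_cost in neighbors.get(node, []):
            if nxt not in expanded:
                g = cost + arc_cost
                heapq.heappush(frontier, (g + h(nxt), g, path + [nxt]))
    return None

graph = {"s": [("a", 1), ("b", 4)], "a": [("g", 3)], "b": [("g", 1)]}
```

With the zero heuristic (trivially monotone) this behaves like LCFS and returns the cheapest path.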
Finding good heuristics
Solve a simpler version of the problem:
- Finding admissible heuristics is hard
- Admissible heuristics are usually consistent. Yay!
- Inadmissible heuristics are often quite effective, although that sacrifices optimality
- Scaling an inadmissible heuristic down by a constant factor is a hacky way of making it admissible
Sliding puzzle example:
- Number of misplaced tiles: admissible as only one tile can move at a time and if n tiles are in the wrong spot, it needs at least n steps
- Total Manhattan distance: closer to the actual value, so improves performance
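The Manhattan-distance heuristic can be computed as follows (assuming a row-major tuple encoding with 0 for the blank, which does not count toward the distance):

```python
def manhattan(state, goal, width=3):
    """Sum of Manhattan distances from each tile to its goal position."""
    total = 0
    for idx, tile in enumerate(state):
        if tile == 0:
            continue  # the blank does not count
        goal_idx = goal.index(tile)
        # Row distance plus column distance in the width-wide grid
        total += abs(idx // width - goal_idx // width) + abs(idx % width - goal_idx % width)
    return total
```

Each move slides one tile one square, so the true solution length is at least this sum - hence admissibility.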
Dominance Relation
Dominance: for two admissible heuristics, $h_2$ dominates $h_1$ if $h_2(n) \ge h_1(n)$ for every node $n$ - the dominating heuristic is at least as informative.
Heuristics form a semi-lattice: if neither is dominating, use $h(n) = \max(h_1(n), h_2(n))$.
The bottom of the lattice is the zero heuristic - makes A* search just an LCFS.
Bidirectional Search
- Search from both the start nodes and the goal nodes simultaneously
- Can be used with BFS, LCFS, or A*
- $b^{k/2} + b^{k/2} \ll b^k$, thus exponential savings in time and space
Bounded Depth-First Search
Takes a bound (cost or depth) and does not expand paths that exceed the bound:
- Explores part of the search tree
- Uses space linear in the depth of the search
- Kind of acts like BFS while using little memory
Iterative-deepening Search
Uses less memory but more CPU when compared to BFS:
- Start with a bound of 0
- Do a bounded depth-first search with that bound
- While a solution is not found, increment the bound and repeat
- Nothing is remembered between iterations; wasteful
- This will find the same first solution as BFS
- But space is linear in the depth of the goal node
- If there is no path to the goal: identical behavior to BFS. Infinite loop if not pruning
Complexity with solution at depth $d$ and branching factor $b$ (entries are the number of times each node at that level is expanded):
| Level | BFS | Iterative Deepening | Num. Nodes |
|---|---|---|---|
| $1$ | $1$ | $d$ | $b$ |
| $2$ | $1$ | $d-1$ | $b^2$ |
| $\dots$ | $\dots$ | $\dots$ | $\dots$ |
| $d-1$ | $1$ | $2$ | $b^{d-1}$ |
| $d$ | $1$ | $1$ | $b^d$ |
| Total | $\ge b^d$ | $\le b^d \left(\frac{b}{b-1}\right)^2$ | |
As branching factor increases, complexity gets closer and closer to BFS - thus, it is not very wasteful.
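The bound-then-increment loop can be sketched as follows (illustrative helper names, adjacency-dict graph; note that nothing is kept between bounds):

```python
def depth_bounded_dfs(neighbors, path, is_goal, bound):
    # Plain DFS that refuses to extend paths beyond the bound
    if is_goal(path[-1]):
        return path
    if bound == 0:
        return None
    for n in neighbors.get(path[-1], []):
        result = depth_bounded_dfs(neighbors, path + [n], is_goal, bound - 1)
        if result:
            return result
    return None

def iterative_deepening(neighbors, start, is_goal, max_bound=50):
    # Re-run bounded DFS with an incrementing bound; no state survives between runs
    for bound in range(max_bound + 1):
        result = depth_bounded_dfs(neighbors, [start], is_goal, bound)
        if result:
            return result
    return None
```

Because the bound grows one level at a time, the first solution found is a shallowest one, matching BFS.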
Week 03: Propositions and Inference
Simple Language: Propositional Definite Clauses
- Atom: symbol starting with lower case letter
- Body: an atom, or of the form $b_1 \wedge b_2$, where $b_1$ and $b_2$ are bodies
- Definite clause: an atom, or a rule of the form $h \leftarrow b$, where $h$ is an atom and $b$ is a body. If it has an empty body, $h$ is called an atomic clause/fact
- Knowledge base: set of definite clauses
Interpretation
An interpretation $I$ assigns a truth value to each atom. Thus, there are $2^n$ interpretations for $n$ atoms.
- Body $b_1 \wedge b_2$ is true in $I$ if both $b_1$ and $b_2$ are true in $I$
- Rule $h \leftarrow b$ is false in $I$ if $b$ is true and $h$ is false
  - i.e. if $b$ is true, $h$ must be true
- Knowledge base $KB$ is true in $I$ iff every clause in $KB$ is true in $I$
Models and Logical Consequences
- Model of a set of clauses: interpretation in which all the clauses are true
- If $KB$ is a set of clauses and $g$ is a conjunction of atoms, $g$ is a logical consequence of $KB$ ($KB \models g$) if $g$ is true in every model of $KB$
Example:
$$
KB = \begin{cases}
p \leftarrow q \\
q \\
r \leftarrow s
\end{cases}
$$
Here $q$ and $p$ are logical consequences of $KB$, but $r$ and $s$ are not (there is a model in which both are false).
Proof Procedures
- A possibly non-deterministic algorithm for deriving consequences of a knowledge base
- Given a proof procedure, $KB \vdash g$ means $g$ can be derived from the knowledge base
- A proof procedure is sound if $KB \vdash g$ implies $KB \models g$
  - Every consequence it finds is true
- A proof procedure is complete if $KB \models g$ implies $KB \vdash g$
  - All consequences are found by the procedure
Bottom-Up Proof Procedure
If there is a clause $h \leftarrow b_1 \wedge \dots \wedge b_m$ in $KB$ such that each $b_i$ has been derived, then $h$ can be derived.
First, set the set of consequences $C$ to the set of atomic clauses.
Then, select a clause $h \leftarrow b_1 \wedge \dots \wedge b_m$ where each $b_i \in C$ and $h \notin C$, and add $h$ to $C$.
Repeat until no more clauses can be selected. (Only atomic clauses can be added to the set at the beginning.)
Proof of soundness
If there is a $g$ with $KB \vdash g$ but $KB \not\models g$, let $h$ be the first atom added to $C$ that is not true in every model of $KB$.
Thus, there must be a clause of the form $h \leftarrow b_1 \wedge \dots \wedge b_m$ where each $b_i$ is true in every model; but then $h$ is true in every model too - a contradiction.
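The bottom-up procedure translates directly to code (assuming clauses are encoded as `(head, body)` pairs, with an empty body for atomic clauses - an illustrative encoding):

```python
def bottom_up(kb):
    """Return the fixed point: all atoms derivable from the KB.
    kb: list of (head, body) pairs, body a list of atoms ([] = atomic clause)."""
    consequences = set()
    changed = True
    while changed:
        changed = False
        for head, body in kb:
            # Add the head once every body atom is already a consequence
            if head not in consequences and all(b in consequences for b in body):
                consequences.add(head)
                changed = True
    return consequences
```

On the example KB $\{p \leftarrow q,\ q,\ r \leftarrow s\}$ the fixed point is $\{q, p\}$; $r$ never fires because $s$ is never derived.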
Fixed Point
The set $C$ generated at the end is called the fixed point.
Let $I$ be the interpretation in which every atom in the fixed point is true and every other atom is false.
- $I$ is a model of $KB$
  - Suppose some clause $h \leftarrow b_1 \wedge \dots \wedge b_m$ in $KB$ is false in $I$. Then $h$ is false but each $b_i$ is true in $I$
  - Thus, $h$ can be added to $C$ (each $b_i$ is in the fixed point), meaning $C$ is not the fixed point - a contradiction
- $I$ is called a minimal model
Proof of Completeness
- Suppose $KB \models g$; $g$ is true in all models of $KB$
- Thus, $g$ is true in the minimal model
- Thus, $g$ is in the fixed point
- Thus, $g$ must have been generated by the bottom-up algorithm
- Thus, $KB \vdash g$
Top-Down Procedure
Search backwards from a query to determine if it is a logical consequence of $KB$.
An answer clause: $yes \leftarrow a_1 \wedge a_2 \wedge \dots \wedge a_m$
The SLD resolution of the answer clause on atom $a_i$ with a clause $a_i \leftarrow b_1 \wedge \dots \wedge b_p$ replaces $a_i$ with that clause's body:
$yes \leftarrow a_1 \wedge \dots \wedge a_{i-1} \wedge b_1 \wedge \dots \wedge b_p \wedge a_{i+1} \wedge \dots \wedge a_m$
Basically: replace the atom with its clause's body, repeating until no more replacements can be made.
An answer is an answer clause with an empty body: $yes \leftarrow$.
Derivations
Derivation of query $?a_1 \wedge \dots \wedge a_m$ is a sequence of answer clauses $\gamma_0, \gamma_1, \dots, \gamma_n$ where:
- $\gamma_0$ is the answer clause $yes \leftarrow a_1 \wedge \dots \wedge a_m$
- $\gamma_i$ is the answer clause obtained by resolving $\gamma_{i-1}$ against a clause in $KB$
- $\gamma_n$ is an answer
Procedure
To solve the query $?a_1 \wedge \dots \wedge a_m$:
- Set the answer clause to $yes \leftarrow a_1 \wedge \dots \wedge a_m$
- Until the answer clause is an answer:
  - Select an atom $a_i$ from the body of the answer clause
  - Choose a clause from $KB$ with $a_i$ as the head
  - Replace $a_i$ with the body of that clause
  - If the clause is just a definition (an atomic clause like $a_i \leftarrow$), then $a_i$ is not replaced with anything; the answer clause shrinks
Select is don't-care non-determinism: if a selection does not lead to a solution, there is no point in trying other alternatives. Choose is don't-know non-determinism: other choices may lead to a solution.
A successful derivation ends with $yes \leftarrow$.
A failing derivation ends with an answer clause whose selected atom is the head of no clause in $KB$.
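The procedure can be sketched with the don't-know non-determinism implemented by backtracking over clause choices (assuming clauses are `(head, body)` pairs; note this sketch can loop forever on cyclic knowledge bases, as the notes warn for Prolog):

```python
def top_down(kb, query):
    """Try to derive the conjunction `query` (a list of atoms) from kb.
    Returns True on a successful derivation, False on failure."""
    if not query:                      # answer clause is an answer: yes <-
        return True
    atom, rest = query[0], query[1:]   # select an atom (don't-care)
    for head, body in kb:              # choose a clause with that head (don't-know)
        if head == atom and top_down(kb, body + rest):
            return True
    return False                       # failing derivation: no clause applies
```

On the example KB, `["p"]` resolves through `p <- q` and the fact `q` down to the empty body, while `["r"]` fails because no clause has `s` as its head.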
Weeks 04-05: Prolog
Prolog - programming in logic
Intro
- Describe the situation of interest
- Ask a question
- Prolog logically deduces new facts, and gives deductions back as answers
- Prolog has an interactive interpreter
- To exit, use `halt.`
- Load a knowledge base using `consult(filename).`
- Reload using `reconsult(filename).`
  - `make.` reloads all changed source files
- Ask queries in interactive mode (`?-`)
- Comments: `/* */` and `%`
- `write(+Term)` always succeeds. It prints out stuff
- `trace/1` for debugging
Basic Syntax
- Facts and rules are both clauses
- The end of a clause is marked with a full stop
- `happy(yolanda).` is a fact
  - `happy` is a predicate; a test - it is 'true' (provably true) or 'false' (unknown) for the given argument; it is not a function call
- `listensToMusic(yolanda) :- happy(yolanda).` is a rule
  - If the RHS (body) is true, then the LHS (head) must be true
  - `:-` means 'if'
- `,` means conjunction (AND)
- `;` means disjunction (OR). It can be defined by having two rules; this is just syntactic sugar
Variables
Variables start with an underscore or upper case letter, and may contain upper and lower case letters, digits, or underscores. `happy(X).` returns a term that replaces `X` such that the rule is met. Typing `;` tries to find another term that satisfies the rule.
Conjunction can also be used e.g. `happy(X), listensToMusic(X).`
Variables can be in the knowledge base as well e.g. jealous(X,Y) :- loves(X,Z), loves(Y,Z).
Atoms
A sequence of characters (upper, lower, digits, underscore) starting with a lowercase letter, OR a sequence of special characters (`:`, `,`, `;`, `.`, `:-`). Atoms can be enclosed in single quotes if they do not meet the naming requirements (e.g. `'Yolanda'`).
Complex terms
Functor directly followed by a sequence of arguments - put in brackets, separated by commas. e.g. `listensToMusic(yolanda)`, `hide(X, father(father(father(butch))))`.
Arity
The number of arguments a complex term has; predicates with the same functor but different arity can be defined.
Unification
Two terms unify if they are the same term or contain variables that can be uniformly instantiated with terms in a way such that the resulting terms are equal.
e.g. woman(mia) = woman(mia). woman(Z) and woman(mia) will be unified, with Z taking the value of mia.
e.g. defining horizontal(line(point(X,Y), point(X,Z))). and running horizontal(line(point(1,2), point(X,3))). will unify to X=1.
Search Tree
f(a).
f(b).
g(a).
g(b).
h(b).
k(X):-f(X),g(X),h(X).
Asking `?- k(Y).` will:
- Set `Y=X`
- Replace `k(Y).` with `f(X), g(X), h(X)`
- Try `X=a`
  - `f(a)` succeeds
  - `g(a), h(a).`: dead end (`h(a)` fails)
- Try `X=b`
  - `f(b), g(b), h(b)` becomes `g(b), h(b)` becomes `h(b)` becomes empty; success
- Get `Y=b`
Recursion
For recursive predicates using conjunction, place the non-recursive term first - using DFS, so will run out of memory otherwise.
numeral(0).
numeral(succ(X)):-numeral(X).
Where `succ(X)` represents X + 1: 3 could be defined as `numeral(succ(succ(succ(0))))`. `numeral(X).` will continue generating numerals forever. Beware of running out of memory e.g. `p :- p.`.
Addition
add(0,X,X). is the base clause. No return values, so three arguments needed (the last is the return value).
add(succ(X),Y,succ(Z)):-add(X,Y,Z). is for the recursive case.
Dif predicate
mother(X, Y) :- parent(X, Y), female(X).
sister(X, Y) :- parent(Z, X), parent(Z, Y), female(X).
Any female is their own sister, so the dif predicate is required: append `dif(X, Y)` (or `X \= Y`) to the end of the sister body.
Predicate description - argument mode indicator
Character prepended to argument when describing a functor:
- A `+` argument must be fully instantiated: must be an input
- A `-` argument must be unbound: must be a variable
- A `?` argument can be either instantiated or unbound
Logical quantification
Variables that appear in the head of a rule are universally quantified.
Variables that appear only in the body are existentially quantified.
`path(X,Y) :- edge(X,Z), path(Z,Y).`: for all nodes X and Y, there is a path from X to Y if there exists a node Z such that there is an edge from X to Z and a path from Z to Y.
Lists
A finite sequence of elements e.g. [[], dead(z), [2, [b, c]], Z, 3].
Lists as implemented as linked lists. A non-empty list has two parts:
- The head: the first item in the list (type element)
- The tail: everything else (type list)
- The last element will be a special empty list `[]` - it has neither a head nor a tail
The `|` operator decomposes a list into a head and tail:
- `[Head|Tail] = [a, b, c, d]` (or vice-versa) sets `Head` to `a` and `Tail` to `[b, c, d]`
- `[X,Y|Z] = [a, b, c, d]` also works, setting `Z` to `[c, d]`
- `[X|Y] = []` fails as the empty list has neither a head nor a tail
Anonymous variables
If you do not need the value of a variable, use `_` - the anonymous variable. Two instances of `_` are independent and need not be equal.
Membership
Is an element a member of a list? Use member/2:
member(X, [X|_]). % If element is the head, it is in the list
member(X, [_|T]) :- member(X, T). % Else, recurse through the list until it fails
member(X, [a, b, c]) can unify to three separate values.
% Exercise: a2b/2; first list all a's and second list all b's; both of the same length
a2b([], []).
a2b([a|L1], [b|L2]) :- a2b(L1, L2).
Append
Append two lists together using append/3, where the third argument is the result of concatenating the lists together.
append([], L, L).
append([H|L1], L2, [H|L3]) :- append(L1, L2, L3).
% If L3 is result, first elements of L1 and L3 must be the same. Hence, remove the first element from both L1 and L3 until L1 is empty.
Concatenating a list is done by traversing down one of the lists; hence, it is inefficient.
Prefix and Suffix
prefix(P, L) :- append(P, _, L). with prefix(X, [a, b, c]) generating all possible prefix lists of [a, b, c].
suffix(S, L) :- append(_, S, L). with suffix(X, [a, b, c]) generating all possible suffix lists of [a, b, c].
Sublist
Sublists are prefixes of suffixes of the list:
sublist(Sub, List) :- suffix(Suffix, List), prefix(Sub, Suffix).
Reversing a list
If given [H|T], reverse the list by reversing T and appending [H] to the result.
naiveReverse([], []).
naiveReverse([H|T], R) :- naiveReverse(T, RT), append(RT, [H], R).
This is inefficient: append/3 is O(n) and this is done at each stage, so it is O(n^2).
By using an accumulator, we can improve things:
- The accumulator is a list, initially empty
- Head of the list is prepended to the head of the accumulator
- Repeat until the list is empty
% Third argument is an accumulator
accReverse([], L, L). % If list is empty, reversed array is the accumulator
accReverse([H|T], Acc, Rev) :- accReverse(T, [H|Acc], Rev).
reverse(L1, L2) :- accReverse(L1, [], L2).
% append/3 not used
| List | Acc |
|---|---|
| [a, b, c, d] | [] |
| [b, c, d] | [a] |
| [c, d] | [b, a] |
| [d] | [c, b, a] |
| [] | [d, c, b, a] |
Arithmetic and other operators
| C | Prolog |
|---|---|
| < | < |
| <= | =< |
| == | =:= |
| != | =\= |
| >= | >= |
| > | > |
These force the left and right hand arguments to be evaluated.
= is the unification predicate and \= is the negation. == is the identity predicate, which succeeds if the arguments are identical.
2+2 = 4. is false as +(2, 2) does not unify to 4. You must use 2+2 =:= 4.
`!` is the cut operator - it suppresses backtracking. The `fail` predicate always fails. Using these two allows us to invert a result:
neg(Goal) :- Goal, !, fail.
neg(_).
If `Goal` succeeds, it reaches `!` so it can never backtrack, then reaches `fail` and fails. If `Goal` fails, the second clause makes `neg(Goal)` succeed. Without the `!`, Prolog would attempt to re-evaluate `Goal`.
As this is so common, there is a built in operator that does this: \+.
Week 06: Constraint Satisfaction Problems
These problems are characterized by:
- A set of variables $X_1, X_2, \dots, X_n$
- A set of domains for each variable: $D_i$ is the domain for $X_i$
- A set of constraints on various subsets of the variables which specify legal combinations of values for these variables (e.g. $X_1 \ne X_2$)
A solution is an n-tuple of values for the variables that satisfies the constraints.
Examples
Australian map colouring:
- Variables: one per state/territory (WA, NT, SA, Q, NSW, V, T)
- Domains: {red, green, blue}
- Constraints: adjacent regions must have different colours
Sudoku:
- Variables: unfilled values
- Domains: 1-9
- Constraints: sudoku rules
Eight queens puzzle:
- Variables: one per column - the row in which that column's queen is located
- Domains: 1-8
- Constraints: no two queens in the same row or diagonal
Basic Algorithms
Generate-and-Test algorithm
Generate the assignment space $D = D_1 \times D_2 \times \dots \times D_n$; test each total assignment against the constraints.
It is exponential in the number of variables.
Backtracking
Systematically explore D by instantiating variables one at a time, evaluating each constraint as all its variables are bound.
Thus, any partial assignment that doesn't satisfy the constraints can be pruned (e.g. if an assignment to the first two variables already violates a constraint, no assignment extending it needs to be checked).
CSP as graph search
A node is an assignment of values to some of the variables.
Search:
- Select a variable $Y$ that is not assigned in node $N$
- Generate a neighbor of $N$ for each possible value of $Y$, with $Y$ assigned that value
- Prune if the assignment is not consistent with the constraints
The start node is the empty assignment, and the goal node is a total assignment that satisfies the constraints.
CSP Algorithms
CSP is NP-hard. However, some instances of CSP can be solved more efficiently by exploiting certain properties.
Constraint Networks
An instance of CSP can be represented as a network:
- Oval node for each variable
- Each variable has a domain associated with it
- Rectangle node for each constraint
- Arcs from each variable to constraints that involve it
Arc Consistency
An arc $\langle X, c \rangle$ is arc consistent if, for every value in the domain of $X$, there are values for the other variables in the scope of $c$ such that $c$ is satisfied.
A network is arc consistent if all arcs are arc consistent.
If a constraint has only one variable in its scope and every value in the domain satisfies the constraint, then it is domain consistent.
If there is an arc that is not arc consistent, remove values from the domain of $X$ until it is.
Example:
- Variables $A$ and $B$
- Domain $\{1, 2, 3, 4\}$ for all variables
- Constraint $A < B$
One arc would be $\langle A, A < B \rangle$:
- $A = 4$ leaves no possible value for $B$
- To make it arc consistent, $4$ must be removed from the domain of $A$
By repeating this with other nodes, the network can be made arc consistent.
Arc Consistency Algorithm
Arcs can be considered in series. An arc $\langle X, c \rangle$ only needs to be reconsidered if the domain of another variable in the scope of $c$ has been reduced.
def GAC(variables, domains, constraints):
    # GAC: Generalized Arc Consistency algorithm
    todo = [(X, C) for C in constraints for X in scope(C)]
    return GAC2(variables, domains, constraints, todo)
def GAC2(variables, domains, constraints, todo):
    while todo:
        X, constraint = todo.pop()  # X is in the scope of the constraint (i.e. it is relevant)
        Ys = [Y for Y in scope(constraint) if Y != X]  # the other variables in the scope
        new_domain = [x for x in domain(X)
                      # such that there exist values (y_1, ..., y_n) for the Ys with
                      # is_consistent(X=x, Y_1=y_1, ..., Y_n=y_n)
                      ]
        if new_domain != domain(X):
            # Need to re-check all variables that are involved with any constraints involving X
            todo += [(Z, C) for C in constraints if X in scope(C)
                     for Z in scope(C) if Z != X]
            domain(X) = new_domain
There are three possible outcomes of this algorithm:
- One or more domains are empty: there is no solution
- Each domain has a single value: unique solution
- Some domains have multiple values: there may or may not be a solution. More work must be done
Domain Splitting
Split a problem into a number of disjoint cases and solve each case separately: the set of solutions is the union of all solutions to each case.
e.g. if a variable has domain $\{1, 2, 3, 4\}$, split into the cases $\{1, 2\}$ and $\{3, 4\}$.
Algorithm
Given a CSP instance:
- Select any variable $X$ that has more than one value in its domain
- Split the domain of $X$ into two disjoint, non-empty sets
- Create two new CSP instances with the new domains for $X$ and solve each; the solutions are the union of the two
def CSPSolver(variables, domains, constraints, todo):
    domains = GAC2(variables, domains, constraints, todo)  # make arc consistent first
    if any(len(domain) == 0 for domain in domains):
        return False  # an empty domain: no solution
    if all(len(domain) == 1 for domain in domains):
        # Every domain has exactly one possible value: solution found
        return [domain[0] for domain in domains]
    X = [X for X in variables if len(domain(X)) > 1][0]  # pick a variable to split
    D1, D2 = split(domain(X))
    domains1 = domains.replace(X, D1)
    domains2 = domains.replace(X, D2)
    # Domain of X has been split, so only arcs for constraints involving X need rechecking
    todo = [(Z, C) for C in constraints if X in scope(C)
            for Z in scope(C) if Z != X]
    return CSPSolver(variables, domains1, constraints, todo) or \
           CSPSolver(variables, domains2, constraints, todo)
Variable Elimination
Eliminate variables one-by-one, passing constraints onto their neighbors.
Constraints can be thought of as a relation containing tuples for all possible valid values.
A join operation can be done on two constraints to capture both constraints, joining on the variable being eliminated. Then, this table can be projected with that column (variable being eliminated) removed.
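The join and project operations can be sketched on constraints represented as `(variables, set-of-tuples)` relations (an assumed encoding for illustration):

```python
def join(c1, c2):
    """Natural join of two constraints on their shared variables."""
    vars1, rows1 = c1
    vars2, rows2 = c2
    shared = [v for v in vars1 if v in vars2]
    out_vars = tuple(vars1) + tuple(v for v in vars2 if v not in vars1)
    rows = set()
    for t1 in rows1:
        for t2 in rows2:
            # Keep combined tuples that agree on every shared variable
            if all(t1[vars1.index(v)] == t2[vars2.index(v)] for v in shared):
                extra = tuple(t2[vars2.index(v)] for v in vars2 if v not in vars1)
                rows.add(t1 + extra)
    return (out_vars, rows)

def project(constraint, keep):
    """Keep only the columns for the variables in `keep` (drops the eliminated one)."""
    variables, rows = constraint
    idx = [variables.index(v) for v in keep]
    return (tuple(keep), {tuple(r[i] for i in idx) for r in rows})
```

Joining constraints on the eliminated variable and then projecting it away is exactly the per-variable step of variable elimination.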
Algorithm
- If there is only one variable, return the join of the constraints on it
- Select a variable $X$
- Join the constraints in which $X$ appears to form constraint $R$
- Project $R$ onto its variables other than $X$ to form $R_2$
- Replace all constraints in which $X$ appears with $R_2$
- Recursively solve the simplified problem
def VE_CSP(variables, constraints):
    if len(variables) == 1:
        return join(constraints)
    X = variables.pop()  # select a variable to eliminate and remove it from variables
    constraints_X = [C for C in constraints if X in scope(C)]
    R = join(constraints_X)  # join all constraints involving X
    R2 = project(R, [V for V in scope(R) if V != X])  # project X away
    new_constraints = [C for C in constraints if X not in scope(C)] + [R2]
    recursed = VE_CSP(variables, new_constraints)
    return join([R, recursed])  # reintroduce X to recover its values
Week 07: Local and Global Search
An optimization problem has:
- A set of variables and their domains
- An objective function
The assignment that optimizes (maximize/minimize) the value of the objective function must be found.
A constrained optimization problem adds a set of constraints which determines which assignments are allowed.
Local Search
Use algorithms to iteratively improve a state.
A single current state is kept in memory and in each iteration, we move to one of its neighbors to improve it.
Most local search algorithms are greedy. Two such algorithms are hill climbing and greedy descent.
e.g. TSP: start with any complete tour, and perform pairwise exchanges (pick two edges, switch the vertex to one in the other edge). This type of approach can get close to the optimal solution quickly.
Local Search for CSPs
CSP can be reduced to an optimization problem.
If each variable is assigned a value, a conflict is an unsatisfied constraint: the goal is to find an assignment with zero conflicts.
Hence, the heuristic function to be minimized will be the number of conflicts.
Example: n-queens problem
Put n queens on an n by n board such that no two queens can attack each other.
Heuristic: number of pairs of queens that can attack each other.
One queen on each column, queens can move up and down only. Each state has $n(n-1)$ neighbors: each of the $n$ queens can move to one of $n-1$ other rows.
Variants of Greedy Descent
- Find the variable-value pair that minimizes the number of conflicts at each step
- Select the variable that participates in the largest number of conflicts, and find the value of that variable that minimizes the number of conflicts
- Select a variable that appears in any conflict and find the value of that variable that minimizes the number of conflicts
Issues
Can get stuck in local optima/flat areas of the landscape - randomized greedy descent can sometimes help:
- Random step: move to a random neighbor
- Random restart: reassign random values to all variables
These two make the search global.
Parallel Search
A total assignment is called an individual.
Maintain a population of k individuals instead of one and update each individual at every stage. If an individual is a solution, it can be reported (and the search can stop).
With local search, random restarts will occur when it reaches a non-zero local minimum and finish when a solution is found.
With parallel search, if a solution is found all individuals can be stopped.
Simulated Annealing
- Pick a variable at random
- Pick a value at random
- If it is an improvement, adopt it
- If not, adopt it with some probability
Given:
- The current assignment is $A$
- The proposed assignment is $A'$
- The objective function (to be minimized) is $h$
- The current temperature parameter is $T$
The probability of adopting a worsening change is:
$$P = e^{-\frac{h(A') - h(A)}{T}}$$
As temperature gets reduced, the probability of accepting a change decreases.
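The acceptance rule can be sketched as follows (a Metropolis-style criterion for minimizing $h$; `accept` is an illustrative helper name):

```python
import math
import random

def accept(h_current, h_proposed, temperature):
    # Always adopt improvements (or equal moves); otherwise adopt
    # with probability e^(-(h' - h) / T)
    if h_proposed <= h_current:
        return True
    return random.random() < math.exp(-(h_proposed - h_current) / temperature)
```

At high temperature even bad moves are often taken; as the temperature approaches zero the probability of accepting a worsening move vanishes.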
Gradient Descent
Objective function must be (mostly) differentiable:
def gradient_descent(f, initial_guess):
    # f = objective function
    x = initial_guess  # x is a vector
    while magnitude(grad(f)(x)) > epsilon:
        # f is a multi-variable function, so grad(f)(x) returns a vector
        x = x - step_size * grad(f)(x)
        # The steepest descent direction is the negative gradient, so step that way
    return x
Evolutionary Algorithms
Requires:
- Representation
- Evaluation function
- Selection of parents
- Reproduction operators
- Initialize procedures
- Parameter settings
Flow
After initializing the population:
- Calculate fitness of individuals
- Terminate if ‘perfect’ individual found, time limit reached etc.
- Select individuals (some randomness required)
- Crossover: select two or more candidate individuals and somehow combine them
- Mutation
Evaluation/Fitness Function
A value relating to how ‘good’ an individual is.
Used for:
- Parent selection: which parents will reproduce
- Measure of convergence: when performance no longer increases significantly between generations
- Steady state: selecting which individuals will die
Selection
Better individuals should have a higher chance of surviving and breeding.
Roulette wheel
Each individual is assigned to a part of the roulette wheel, with the ‘angle’ being proportional to its fitness, and it is spun n times to select n individuals.
- Sum the fitness of individuals to variable T
- Generate a random number N between 0 and T
- Return the first individual whose fitness, when added to the running total, makes the total equal to or larger than N
def roulette_wheel_select(population, fitness, r):  # r: a random number in [0, 1)
total_fitness = sum([fitness(individual) for individual in population])
rolling_sum = 0
for individual in population:
rolling_sum += fitness(individual)
if rolling_sum > total_fitness * r:
return individual
Tournaments
n (size of the tournament) individuals chosen randomly: the fittest one is selected as a parent.
If n is one, it is equivalent to randomly selecting an individual.
If n is the population size, it is completely deterministic.
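A minimal sketch of tournament selection (the function and parameter names are illustrative, not from the course code):

```python
import random

def tournament_select(population, fitness, n):
    # Pick n individuals uniformly at random; the fittest of them is the parent
    contestants = random.sample(population, n)
    return max(contestants, key=fitness)
```

With `n` equal to the population size every individual enters the tournament, so the result is fully deterministic; with `n = 1` it is a uniform random pick.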
Elitism
Keeps at least one copy of the fittest solution so far for the next generation - ensures the fitness will not decrease over generations.
Reproduction Operators
Crossover
Two parents produce two offspring. 1, 2 or n crossover points are generated:
One point crossover: the child has parent one’s chromosomes up until the random crossover point; after that, the other parent’s chromosomes.
n point crossover: at n separate points, it swaps from getting chromosomes from one parent to the other.
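One-point crossover on two equal-length chromosomes can be sketched as (names are illustrative):

```python
import random

def one_point_crossover(parent_a, parent_b, point=None):
    # Children copy one parent up to the crossover point, then switch parents
    if point is None:
        point = random.randint(1, len(parent_a) - 1)
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2
```

n-point crossover repeats the parent switch at each of the n points.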
Mutation
Mutation gives a low probability of randomly changing the gene of a child.
Mutation brings in diversity as compared to combining candidates to (hopefully) produce better children.
Randomly Generating Programs (Trees)
- Pick a non-terminal
- Pick children for that element, either a terminal or non-terminal
- Repeat
Mutation: pick a random node and replace the subtree with a randomly-generated subtree.
Crossover: pick a random node in each parent and exchange them.
Week 08: Probabilistic Inference and Belief Networks
In the real world, there will be uncertainty and randomness due to several factors:
- Theoretical and modelling limitations
- Coin toss: incomplete physics model
- Sensory and measurement limitations
- Measurements not accurate enough
- Computational limitations
- Computation too time consuming
Hence, using probabilities is often required.
Random Variable
Some aspect of the world about which there is uncertainty.
RVs are denoted with a capital letter and have an associated domain.
Unobserved RVs have distributions; a table of probabilities of values. A probability is a single number, e.g. $P(X = x) = 0.3$.
Joint distribution over a set of RVs: a map from assignments/outcomes/atomic events to reals: $P(X_1 = x_1, \dots, X_n = x_n)$
Event: a set $E$ of assignments; $P(E) = \sum_{(x_1, \dots, x_n) \in E} P(x_1, \dots, x_n)$
Marginalization (summing out): projecting a joint distribution to a sub-distribution over a subset of variables: $P(X = x) = \sum_{y} P(X = x, Y = y)$
Conditional probability: $P(x \mid y) = \frac{P(x, y)}{P(y)}$
Conditional distribution: probability distribution over some variables given fixed values of others.
To get the whole conditional distribution at once, select the joint probabilities matching the evidence and normalize the selection (make it sum to 1).
Product rule: $P(x, y) = P(x \mid y) \cdot P(y)$
Chain rule: $P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \dots, x_{i-1})$
Probabilistic Inference
Computing a desired probability from other known probabilities.
This allows us to invert a conditional distribution - often one conditional is simple but the other is tricky. $$ \begin{aligned} P(x,y|z) &= P(y,x|z) \\ \\ P(x,y|z) &= \frac{P(x,y,z)}{P(z)} \\ P(y|x,z) &= \frac{P(x,y,z)}{P(x,z)} \\ \therefore P(x,y,z) &= P(y|x,z) \cdot P(x,z) \\ \therefore P(x,y|z) &= \frac{P(y|x,z) \cdot P(x,z)}{P(z)} \\ &= P(y|x,z) \cdot P(x|z) \\ \\ \therefore P(x|y,z) &= \frac{P(y|x,z) \cdot P(x|z)}{P(y|z)} \\ \text{if } z \text{ is implicit:} \quad P(x|y) &= \frac{P(y|x) \cdot P(x)}{P(y)} \text{ (Bayes' rule)} \end{aligned} $$
Inference by Enumeration
A more general procedure:
- $E_1, \dots, E_k$ are evidence variables
- $Q$ is the query variable
- $H_1, \dots, H_r$ are hidden variables
Together, these variables cover all the variables of the model.
First, select entries consistent with the evidence.
Then, sum out the hidden variables: $P(Q, e_1, \dots, e_k) = \sum_{h_1, \dots, h_r} P(Q, h_1, \dots, h_r, e_1, \dots, e_k)$
Finally, normalize the remaining entries to obtain $P(Q \mid e_1, \dots, e_k)$.
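These three steps can be sketched over a toy joint distribution (the representation here - a dict mapping frozensets of (variable, value) pairs to probabilities - is an assumption for illustration):

```python
def enumerate_query(joint, query_var, evidence):
    # joint: {frozenset of (variable, value) pairs: probability}
    totals = {}
    for assignment, p in joint.items():
        values = dict(assignment)
        # 1. Select entries consistent with the evidence
        if all(values.get(var) == val for var, val in evidence.items()):
            # 2. Sum out the hidden variables by accumulating per query value
            q = values[query_var]
            totals[q] = totals.get(q, 0.0) + p
    # 3. Normalize the remaining entries
    z = sum(totals.values())
    return {q: p / z for q, p in totals.items()}
```

For example, with a joint over `Rain` and `Wet`, `enumerate_query(joint, 'Rain', {'Wet': True})` returns the conditional distribution of `Rain` given the evidence.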
Complexity of Models
Simple models are easier to build, explain, and usually lead to lower time complexity and space requirements.
To measure complexity, count the number of free parameters that must be specified; a joint distribution over n variables, each with a domain size of d, requires $d^n - 1$ free parameters (the last probability is fixed by normalization).
Independence
Two variables are independent if: $$ \begin{aligned} P(X, Y) &= P(X) \cdot P(Y) \text{ or} \ \forall x, y: P(x, y) &= P(X=x) \cdot P(Y=y) \end{aligned} $$ That is, their joint distribution factors into a product of two simpler distributions.
Absolute independence: $X \perp Y$.
Independence can be used as a modelling assumption; if all $n$ variables are independent, the joint can be stored with $O(n \cdot d)$ parameters instead of $O(d^n)$.
Absolute (unconditional) independence is very rare; conditional independence is more robust. For a given value of a third variable $Z$, $X$ and $Y$ may become independent: $X \perp Y \mid Z$.
In this case: $P(X, Y \mid Z) = P(X \mid Z) \cdot P(Y \mid Z)$, or equivalently $P(X \mid Y, Z) = P(X \mid Z)$.
This can occur if, for example, $Z$ is a common cause of both $X$ and $Y$.
Belief Networks
- A way to describe complex joint distributions
- Sometimes called Bayes nets or Bayesian networks
- Local interactions chain together to give global, indirect interactions
Graphical Model Notation
Arcs:
- Allows dependence between variables
- Arcs don’t force dependence
- Arrows may imply causation (but don’t have to)
Nodes:
- One node per RV
- Either assigned (observed) or unassigned (unobserved)
- Conditionally independent of its non-descendants given its parents’ state
- Conditionally independent of all other nodes given its parents, children, and children’s parents’ state
- Called a Markov blanket
Distributions:
- A collection of distributions (CPTs) over each node; one for each combination of the parents’ values
- $P(X \mid \mathrm{parents}(X))$, specified for all combinations of the parents’ values
D-separation can be used to decide if a set of nodes $X$ is conditionally independent of another set $Y$, given a third set $Z$.
Encoding
BNs implicitly encode joint distributions; this can be calculated as a product of local conditional distributions.
Example: for a chain $A \rightarrow B \rightarrow C$: $P(a, b, c) = P(a) \cdot P(b \mid a) \cdot P(c \mid b)$.
More generally, if you have a full assignment, multiplying the relevant conditionals gives the probability: $P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{parents}(X_i))$
In pseudocode:
product = 1
for i in range(1, n + 1):
    product *= P(x[i] given parents(i))  # pseudocode: the matching CPT entry
Thus, we can reconstruct any entry of the full joint. However, not every BN can represent every full joint.
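The product above can be made runnable; here is a sketch with a hypothetical two-node network Rain → Wet (the CPT encoding and the numbers are assumptions for illustration):

```python
def joint_probability(assignment, parents, cpt):
    # Multiply, for every variable, the CPT entry matching its value
    # and the values its parents take in this assignment
    product = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        product *= cpt[var][(value, parent_values)]
    return product

parents = {'Rain': [], 'Wet': ['Rain']}
cpt = {
    'Rain': {(True, ()): 0.3, (False, ()): 0.7},
    'Wet': {(True, (True,)): 0.9, (False, (True,)): 0.1,
            (True, (False,)): 0.2, (False, (False,)): 0.8},
}
# P(Rain=true, Wet=true) = P(Rain=true) * P(Wet=true | Rain=true) = 0.3 * 0.9
```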
Inference Enumeration
Computing $P(Q \mid e)$: select the joint entries consistent with the evidence $e$ (each entry is a product of CPT values), sum out the hidden variables, and normalize.
This has to be computed for every value in the domain of $Q$.
- Answering $P(Q)$: no evidence; all variables except the query are hidden
- Answering $P(Q \mid e)$: fix the evidence variables, then sum out the remaining hidden variables
- Answering $P(q \mid e)$: answer $P(Q \mid e)$, then pick the result for $Q = q$
Week 09: Basic Machine Learning
Learning: improving behavior based on experience. This could be:
- The range of behaviors increasing
- The accuracy of its tasks increasing
- The speed at which it executes its tasks increasing
Components of a learning problem:
- Task: the behavior/task being improved e.g. classification
- Data: experiences used to improve the performance of its tasks
- Measure of improvement: a way of measuring the performance/improvement e.g. accuracy of classification
Learning architecture:
- The learner is fed experiences/data and background knowledge/bias
- The model is what does the reasoning/prediction
- The reasoner is fed the problem/task, and outputs an answer/performance
Supervised Learning
Given the following as input:
- A set of input attributes/features/random variables: $X_1, \dots, X_n$
- A target feature $Y$ (discrete class value or continuous/real value) - what is being predicted
- A set of training examples/instances where the values of the input and target features are given
This is fed to a learning algorithm to build a predictive model that takes a new instance and returns/predicts the value for the target feature.
For continuous target variables, regression is used.
Measuring performance
Common performance measures:
In binary classification problems, one class is called positive (p) and the other negative (n).
Common performance measures for regression:
- Mean squared error (MSE): $\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$
- Mean absolute error (MAE): $\frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i|$
Training and Test Sets
A set (multi-set) of examples is divided into training and test examples; this is required as the model can overfit the training data, giving high performance on the training data but low performance on unseen data.
The more complex the model is, the lower the error on the training data.
A general pattern is that at a certain complexity, increasing the complexity of the model increases the error on the test data.
Naïve Bayes Model
Where:
- Features $X_1, \dots, X_n$ are independent given the class variable $Y$
- $P(Y)$: prior distribution of the class
- $P(X_i \mid Y)$: likelihood conditional distributions
- $P(Y \mid X_1, \dots, X_n)$: posterior distribution
Conditional probabilities can be estimated from labelled data.
Find $\operatorname{argmax}_y P(y \mid x_1, \dots, x_n)$.
Problem: $P(Y \mid X_1, \dots, X_n)$ is hard to learn directly - it has too many free parameters.
Thus, assume the input features are conditionally independent given the class variable: $P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i} P(x_i \mid y)$
Example: Building a Classifier
Determine if the patient is susceptible to heart disease (y/n) given family history (t/f), fasting blood sugar level (l/h), BMI (l, n, h).
Model it as a naive Bayes network: the class variable is heart disease, and the features are family history, blood sugar, and BMI.
The class can take two values, so there are two tables (one per class value) for each feature, and two rows for the prior $P(Y)$.
NB: in the quiz, you only store the value for when the class is true.
To calculate the posterior for a new patient, multiply the prior by the likelihoods of the observed feature values under each class, then normalize.
Laplace Smoothing
Zero counts in small data sets lead to zero probabilities - this is too strong a claim based on only a sample.
To fix this, add a non-negative pseudo-count to the counts - this can reduce the confidence.
- e.g. $n_{x,y}$ is the number of examples in the dataset where $X = x$ and $Y = y$
- $n_y$ is the number of examples in the dataset where $Y = y$
Given these, with pseudo-count $c$: $P(x \mid y) = \frac{n_{x,y} + c}{n_y + c \cdot |\mathrm{dom}(X)|}$
This is equivalent to adding $c$ fictitious examples of each value of $X$ to every class.
The greater the pseudo-count is, the more the probabilities even out (closer to the uniform $\frac{1}{|\mathrm{dom}(X)|}$).
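A sketch of the smoothed estimate (the dataset layout - a list of (features, label) pairs - and the parameter names are assumptions, not the course's code):

```python
def smoothed_likelihood(examples, feature, value, label, pseudo_count, domain_size):
    # Estimate P(feature = value | class = label) with a pseudo-count added
    # to every count, so unseen combinations never get probability zero
    in_class = [features for features, y in examples if y == label]
    matching = sum(1 for features in in_class if features[feature] == value)
    return (matching + pseudo_count) / (len(in_class) + pseudo_count * domain_size)
```

With `pseudo_count = 0` this reduces to the raw frequency estimate; as it grows, the estimate approaches uniform.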
Parametric vs Non-Parametric Models
Parametric models described with a set of parameters; learning means finding the optimal values for these parameters.
Non-parametric models are not characterized by a fixed set of parameters - one such family is instance-based learning:
- Instance-based learning is based on memorization of the dataset
- The cost of learning is 0; all the cost is in the computation of the prediction
- It is called lazy-learning: learning is put off until it is required
k-Nearest Neighbors
An example of an instance-based learning algorithm is k-nearest neighbors:
- It uses the local neighborhood to obtain a prediction - the k memorized examples most similar to the one being classified are retrieved
- A distance function is used to compare similarity (e.g. Euclidean or Manhattan distance)
- If the distance function is changed, how examples are classified changes
Training only requires storing all the examples.
Prediction:
- Let $N$ be the set of the $k$ most similar examples to the query
- Given the k nearest neighbors, calculate which value the target should have (e.g. a majority vote for classification, an average for regression)
If k is too high, the model will under-fit (predictions tend towards the overall majority).
Geometrically, each data point dominates the region of the input space closest to it; for $k = 1$ this partitions the space into Voronoi cells.
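A minimal k-nearest-neighbors classifier along these lines (Euclidean distance and majority vote are the usual defaults; the function names are illustrative):

```python
from collections import Counter

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(training, query, k, distance=euclidean):
    # training: list of (feature_vector, label) pairs
    # Retrieve the k stored examples closest to the query...
    neighbours = sorted(training, key=lambda example: distance(example[0], query))[:k]
    # ...and take a majority vote over their labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Training is just storing `training`; all the work happens at prediction time, which is why this is called lazy learning.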
Week 10: Artificial Neural Networks
Perceptron
A neuron receives signals from multiple inputs (inputs are weighted), and if the overall signal is above a threshold, the neuron fires. A perceptron models this with:
- Inputs $x_1, \dots, x_n$
- Parameters (weights) $w_1, \dots, w_n$, and a bias $b$
- Output (activation function): $a = 1$ if $\sum_{i} w_i x_i + b \geq 0$, otherwise $a = 0$
Sometimes, the bias is represented as another weight $w_0$ attached to a virtual input fixed at $x_0 = 1$.
A perceptron can be seen as a predicate: given a vector of input values, it outputs true (1) or false (0).
The function partitions the input space into two sections: if there are two inputs, the decision boundary will be a straight line:
If there are three inputs, the decision boundary is a plane. For n dimensions, it will be a hyperplane.
The weight vector $\mathbf{w}$ is normal (perpendicular) to the decision boundary.
Learning
Given a data set - a collection of training vectors of the form $(\mathbf{x}, t)$, where $t$ is the target output:
- Randomly initialize the weights and bias
- While there are mis-classifications and the number of iterations is less than the maximum number of epochs:
  - For each training example, classify the example. If it is misclassified, then update the weights, where:
    - $\eta$ is the learning rate
    - $x_i$ is the value for input $i$
    - $t$ is the actual output
    - $o$ is the current prediction (perceptron output)
  - Update the weights/bias using:
    - $w_i \leftarrow w_i + \eta (t - o) x_i$
    - $b \leftarrow b + \eta (t - o)$ (the same equation as above can be used for the bias if it is represented as a virtual input)
If the examples are linearly separable, the weights and bias will, in finite time, converge to values that produce a perfect separation.
If the examples are not linearly separable, the weights will never converge - hence the cap on the number of epochs.
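The training loop above can be sketched as follows (step activation with threshold 0 and the bias kept as a separate term; the default learning rate and epoch cap are arbitrary choices):

```python
def train_perceptron(examples, learning_rate=0.1, max_epochs=100):
    # examples: list of (inputs, target) pairs with target in {0, 1}
    n = len(examples[0][0])
    weights = [0.0] * n
    bias = 0.0
    for _ in range(max_epochs):
        misclassified = False
        for inputs, target in examples:
            # Classify with the current weights (step activation)
            output = 1 if sum(w * x for w, x in zip(weights, inputs)) + bias >= 0 else 0
            if output != target:
                misclassified = True
                # w_i <- w_i + eta * (t - o) * x_i; same rule for the bias
                for i in range(n):
                    weights[i] += learning_rate * (target - output) * inputs[i]
                bias += learning_rate * (target - output)
        if not misclassified:
            break  # perfect separation reached
    return weights, bias
```

On a linearly separable set such as logical AND, this converges within a handful of epochs.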
Multi-Layer Perceptrons
Motivation
y
^
|
| false true
|
| true false
|-----------------> x
A single perceptron cannot partition these four points into a decision boundary. However, this can be done with multi-layer perceptrons.
Define two perceptrons, $P_1$ and $P_2$, with the decision boundaries shown below:
y P_1 y P_2
^ ^
| ----------- |
| false(0) | true(1) | false(1) true(1)
|---------- | -----------
| true(1) false(1) | true(1) | false(0)
|-------------------> x |--------------------> x
Now, pass the outputs of the two perceptrons as inputs to another perceptron, $P_3$:
P2 P_3
^
| --------
| (0, 1) | (1, 1) <- two points superimposed
| ----------
| (1, 0) |
|---------------------> P1
Now, this perceptron can form a decision boundary that correctly partitions the input space.
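One concrete choice of weights realizing this idea (these particular hidden perceptrons - an OR and a NAND, combined by an AND - differ from the diagrams above, but combine two linear boundaries in the same way):

```python
def step(weights, bias, inputs):
    # A perceptron with a step activation: fire iff the weighted sum clears 0
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias >= 0 else 0

def xor(x, y):
    p1 = step([1, 1], -1, [x, y])      # fires unless both inputs are 0 (OR)
    p2 = step([-1, -1], 1, [x, y])     # fires unless both inputs are 1 (NAND)
    return step([1, 1], -2, [p1, p2])  # fires only if both p1 and p2 fire (AND)
```

No single perceptron can compute XOR, but this two-layer arrangement can.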
Description
The feed-forward networks we are dealing with arrange perceptrons into layers where:
- Adjacent layers are fully connected: all outputs from one layer are used as inputs for each perceptron in the next layer
- There are no backwards connections (DAG)
- Layers are not skipped
Some notes:
- The first layer contains only input nodes
- Hence, the number of input nodes is the number of inputs in the problem domain
- The last layer is called the output layer
- A network with only one layer is an identity function
- Weights and biases are between layers
- Between layer $i$ (with $n_i$ nodes) and layer $i+1$ (with $n_{i+1}$ nodes):
  - The number of weights is $n_i \times n_{i+1}$
  - The number of biases is $n_{i+1}$
- Layers between the input and output are called hidden layers
As more layers/neurons are added, the complexity of the boundary shape(s) can increase. If you have two dimensions:
- With 2 layers (one perceptron - the first layer contains the input nodes), a straight line can be formed
- With 3 layers, any polygon can be formed
- The number of perceptrons in the hidden layer determines the number of sides of the polygon
- If there are only two perceptrons, the boundary is not a polygon but two intersecting lines
- With 4 layers, multiple polygons can be formed
- Polygons within polygons etc.
Multi-class Classification
This can be done by having multiple numeric outputs and picking the node with the largest value.
Outputting numeric values instead of a Boolean requires a smooth activation such as the Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
Error Function
The mean squared error is typically used, where the error for one example is $E = \frac{1}{2} \sum_{j} (t_j - o_j)^2$, summed over the output nodes $j$.
The weights can be updated incrementally (after each example) by gradient descent: $w \leftarrow w - \eta \frac{\partial E}{\partial w}$
Typical Architecture
The number of input nodes is determined by the number of attributes and the number of outputs is determined by the number of classes. A single hidden layer is enough for many classification tasks.
Guidelines:
- Use as few hidden layers/nodes as possible
- Forces better generalization
- Fewer weights need to be found, reducing training time and cost
- Too many nodes may lead to overfitting
- For the hidden layer
- Make a guess at how many nodes you need - a number between the number of input and output nodes
- If unsuccessful, increase the number of nodes
- If successful, reduce the number of nodes to force better generalization
Week 11: Games - Non-cooperative Multi-agent Systems
Many problems can be modelled as games: multiple agents with (possibly competing) interests. They can be described by:
- Actions available to each agent in various stages of a game
- Utility (payoff) functions; one for each agent. They assign a real number to every possible outcome of the game (typically terminal states)
- If the utility functions are the same, the agents become cooperative
- Strategy functions; one for each agent. They determine what action is taken in each state
- Typically, the goal is to find the strategy that maximizes the utility for an agent or group of agents
Example: paper, scissors, rock:
- Simultaneous action game instead of turn-based
- 9 possible outcomes
- One utility function for each player (`1` for win, `-1` for loss, `0` for tie)
  - The sum of the scores is a constant value
Game Trees
In perfect-information games, a game tree is a finite tree where the nodes are states and the arcs correspond to actions by the agents.
- Each internal node is labelled with an agent (or nature) - the agent controls the node
- Each internal node labelled with nature has a probability distribution over its children
- Each arc coming out of a node with agent i corresponds to an action for i
- The leaves represent final outcomes and are labelled with a utility function for each agent
Hence:
- A complete game tree contains all possibilities
- The root is the current state
- A play-through is a path through the tree
Perfect-information, Zero-sum, Turn-based games
Properties:
- Typically two agents
- Take turns to play
- No chance
- State of the game fully observable by all agents
- Utility values for the two agents are the negation of each other: zero sum
- Hence, one player tries to maximize the utility and the other to minimize it
- Each level of the tree switches between a max and min node as the players switch turns
These properties make the agents adversaries.
Optimal Strategy: Min-Max Function
Designed to find the best move at each stage:
- Generate the whole game tree
- Apply the utility function to each leaf
- Back-up values from the leaves through branch nodes
- A max node computes the max of its child values, and vice-versa for a min node
- If a max node at the root, choose the move leading to the value with the largest value (and vice-versa if root is a min node)
This is optimal if both players are playing optimally.
def min_max_decision(state, root_is_max=True):
    # Return the child (move) with the best backed-up value for the root player
    if root_is_max:
        return max(state.children(), key=min_value)
    else:
        return min(state.children(), key=max_value)
def max_value(state):
    # Value of a max node: the maximum of its children's (min-node) values
    if state.is_terminal:
        return state.calculate_utility()
    return max(min_value(child) for child in state.children())
def min_value(state):
    # Value of a min node: the minimum of its children's (max-node) values
    if state.is_terminal:
        return state.calculate_utility()
    return min(max_value(child) for child in state.children())
Reducing Search Space
Game tree size:
- Tic-tac-toe
  - 4-5 legal actions per state on average
  - at most 9 moves total
  - roughly $9! \approx 3.6 \times 10^5$ possible games: searching the entire tree is reasonable
- Chess
  - ~35 legal moves per state on average
  - ~100 moves in a typical game
  - $35^{100} \approx 10^{154}$ nodes: searching the entire tree is completely infeasible
Hence, the search space needs to be reduced:
- Pruning
- Heuristic evaluation of states (e.g. ML, number of pieces left in chess)
- Table lookup instead of search (e.g. for opening/closing situations)
- Monte-Carlo tree search: randomly sample tree
Alpha-Beta Pruning
If an option is bad compared to other available options, there is no point in expending search time to find out exactly how bad it is.
Example:
MAX 3
/ \
/ \
Min 3 ?
/ | \ / \
Leaf 3 9 6 2 ?
We have computed that the value for the first child (min node) is 3 and that the first child of the second child has a value of 2. Hence, the second child has a maximum value of 2, so regardless of the true value of the second child, the outcome of the max will not change.
More generally:
(player) MAX (player) MIN
/ \ / \
min m ... max m ...
... ...
\ \
min ? max ?
/ \ / \
max m > n n ? min m < n n ?
If m > n, the max node is guaranteed to get a utility of at least m - hence, the min node with utility n or less will never be reached. Thus, it does not need to be evaluated further. The opposite occurs for a min node.
The algorithm has:
- $\alpha$: the highest-value choice found so far for a max node higher up (parents) in the tree. Initially $-\infty$
- $\beta$: the lowest-value choice found so far for a min node higher up (parents) in the tree. Initially $\infty$
These two values are passed down to the child nodes during search. If the value returned by a child of a max node is greater than or equal to $\beta$, the remaining children can be pruned; symmetrically, a min node can stop once a child's value is less than or equal to $\alpha$.
from math import inf
def alpha_beta_search(tree, is_max_node = True, alpha = -inf, beta = inf):
# Input: nested array. Numbers are leaf nodes, arrays are inner nodes
# Returns a tuple of the best utility and path that gets that utility (index numbers)
# If path is none, pruning occurred or it is a leaf node
best_path = None
best_utility = -inf if is_max_node else inf
if isinstance(tree, int):
return (tree, None)
for (i, child) in enumerate(tree):
utility, path = alpha_beta_search(child, not is_max_node, alpha, beta)
path = [i] if path is None else ([i] + path) # Append index of child to path received by child
if is_max_node:
if utility > best_utility:
(best_utility, best_path) = (utility, path)
if utility > alpha:
# This child is now the largest child encountered by this max node or any parent max nodes
alpha = utility
else:
if utility < best_utility:
(best_utility, best_path) = (utility, path)
if utility < beta:
beta = utility
if alpha >= beta:
return (utility, None)
# In the case that this is a max node:
# The child node is min node and its value is larger than or equal to the smallest value
# encountered so far by any parent min nodes. Hence, this (max) node pick this child or
# a larger one, so this node's parent will reject it. Thus, this node can be pruned
# Similar logic applies if it is a min node
return (best_utility, best_path)
Effectiveness
- Pruning does not affect the final result
- An entire sub-tree can be pruned
Worst case: if branches are ordered worst-first (the best move is always evaluated last), no pruning takes place.
Best case: each player’s best move is the left-most child/evaluated first.
Good move ordering improves effectiveness. Examples:
- Sort moves by the remembered move values found last time
- Expand captures first, then threats, then forward moves etc.
- Run iterative deepening search, sort by value last iteration
Alpha-beta search often lets us search roughly twice as deep in the same time: with perfect move ordering, the best-case time complexity is $O(b^{m/2})$ instead of $O(b^m)$.
Static (Heuristic Evaluation) Function
Estimates how good the current board configuration (a non-terminal state) is for a player.
A typical function could evaluate how good the state is for the player minus the opponent's score. If the board evaluation is $x$ for one player, it is $-x$ for the opponent (zero-sum).
This can be used to perform a cut-off search: after a maximum depth is reached, use the heuristic evaluation instead of finding the actual utility.
Sidenote: with three players, there are two degrees of freedom; the utility is a tuple, not a scalar. Each player will attempt to maximize its own component.
Week 12: Concluding Remarks
Final Exam:
- Most programming questions have feedback; some have hidden test cases
- Covers content from the entire course, with slightly less emphasis on lab-test content so that the lab test (20%) and the exam together give even coverage of the course
- 2019 exam will be available: note that some exam questions have been used in this year’s quizzes
- Past exams had some different content (e.g. Clojure, computer vision)
- Wrong selections in multi-choice questions subtract from the question's mark (weighted more heavily than correct selections)
- No access to lecture notes; there will be info on how to run prolog etc.
Topics covered:
- Planning (graph algorithms, path finding)
- Problem solving (constraint satisfaction problems)
- Logic
- Reasoning (classical logic, probabilistic reasoning)
- Working with uncertainty (probabilistic methods)
- Learning
- Competing with adversaries (games)
A recurring pattern across these topics is that the problems they solve can be reduced to search problems.
- The industrial revolution fundamentally changed how physical goods were produced, redistributing wealth and job opportunities. Will AI lead to a fundamental shift in how decisions are made?
- Fairness: AI in justice systems, recruiting etc.
- Singularity: recursive self-improvement leading to runaway systems
- AI probably won’t destroy humanity; we’re doing a good enough job already
Related topics:
- Probability theory, multivariate statistics, Bayesian inference
- Linear algebra, optimization, functional analysis
CSSE courses:
- COSC401 machine learning (more fundamental and theoretical)
- COSC440 deep learning (new!): deep neural networks
- COSC420 intelligent tutoring systems (unavailable - Tanja away next year): applied AI
- COSC428 computer vision
- DATA430: medical data informatics: applied AI