All Files in ‘COSC367 (2020-S2)’ Merged

Exam Notes

Prune on insert/remove if the end node has been expanded.

A*: $f(path) = cost(path) + h(path[-1])$.

Bidirectional search (BFS, LCFS, A*): $2 \cdot b^{d/2} \ll b^d$; saves time and space.

Iterative deepening: max depth incremented until solution found. DFS is $\geq b^k$; iterative deepening is $\leq b^k \cdot (\frac{b}{b-1})^2$.

Propositions and Inference

Soundness and Completeness

Bottom-up/forward-chaining

Fixed point: set of consequences generated.

If $I$ is the interpretation where every element of the fixed point is true and every other one is false, $I$ is the minimal model of the KB.

Top-down procedure

Answer clause: $yes \leftarrow a_1 \land \dots \land a_m$.

Until the answer clause is an answer ($yes \leftarrow$), repeatedly run SLD resolution.

Prolog

Constraint Satisfaction Problems

A set of variables, each with a domain, and a set of constraints; a solution is an assignment that satisfies the constraints.

Constraint Network:

Arc Consistency

For a given constraint and variable $X$ in the scope of the constraint, the arc is arc consistent if, for each element in the domain of $X$, there is a valid assignment for all other variables in the constraint's scope.

Elements may need to be removed from the domain of $X$ to make it arc consistent.

Algorithm:

Empty domains: no solution; multiple values in domains: may or may not have a solution.

Domain Splitting

Halve the domain for some variable, create two instances of the problem with the new domain, and solve both, revisiting only constraints involving the split variable.

Variable Elimination

Eliminate variables by passing constraints on to their neighbors.

Roulette wheel: return the first individual at which the running total of fitnesses exceeds a random number between 0 and the sum of the fitnesses.

Probabilistic Inference and Belief Networks

$$ P(x|y,z) = \frac{P(x,y|z)}{P(y|z)} $$

For a full assignment:

$$ P(x_1, \dots, x_n) = \prod_{i=1}^{n}{P(x_i | parents(X_i))} $$

If it is the probability over all variables, it is called the joint probability distribution.

Basic Machine Learning

Artificial Neural Networks

Prediction: $a = \sum_{i = 0}^{n}{w_i x_i}$, where $x_0 = 1$.

Activation function: $g(a) = 1$ if $a \geq 0$, else $0$.

Learning: $weight \leftarrow weight + \eta \cdot x \cdot (actual - prediction)$, where $\eta$ is the learning rate. Repeat for each training example and loop until no mis-classifications or limit reached.

Weeks 01-02: Searching the State Space

State Space

State: object representing a possible configuration of the world (agent and environment).

State space: set of all possible states (cross product of all elements of the state).

State Space Graphs

Directed Graph

Many problems can be abstracted into the problem of finding a path in a directed graph.

Explicit Graphs

Implicit Graphs

Searching Graphs

Generic algorithm

def search(graph, start_nodes, is_goal_node):
  frontier = [[s] for s in start_nodes]
  while len(frontier) != 0:
    path = frontier.pop()
    if is_goal_node(path[-1]):
      return path

    for n in path[-1].neighbors():
      frontier.append(path + [n])
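
The order in which paths are popped from the frontier determines the strategy: treating the frontier as a stack (pop the newest path) gives DFS, treating it as a queue (pop the oldest) gives BFS, and treating it as a priority queue ordered by path cost gives LCFS.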

DFS

Explicit graphs

Tracing the frontier:

BFS

Lowest-cost-first search (LCFS)

Priority Queue

Pruning

Two problems: cycles (an infinite search tree), and multiple paths leading to the same node.

We need memory - the frontier should keep track of which nodes have been expanded/closed.

Prune when:

LCFS finds an optimal solution, but it explores options in every direction, and knows nothing about the goal location.

Heuristics

Extra knowledge that can be used to guide a search:

A* Search Strategy

$f(p) = cost(p) + h(n)$ where:

Monotonicity

A requirement that is stronger than admissibility. A function is monotone/consistent if, for any two nodes $n$ and $n'$ (where $n'$ is reachable from $n$):

$$ h(n) \leq cost(n, n') + h(n') $$

That is, the estimated cost from $n$ to the goal must be no more than the cost of reaching $n'$ plus the estimated cost from $n'$ to the goal. Where $s$ is the start node:

$$ \begin{aligned} f(n') &= cost(s, n') + h(n') \\ &= cost(s, n) + cost(n, n') + h(n') \\ &\geq cost(s, n) + h(n) \\ \therefore f(n') &\geq f(n) \end{aligned} $$

Thus, $f(n)$ is non-decreasing along any path.
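
A minimal A* sketch built on the generic algorithm above, using Python's heapq as the priority queue ordered by $f$; the neighbours, cost, h and is_goal callables are assumed problem-specific (hypothetical names):

import heapq
from itertools import count

def a_star(start, neighbours, cost, h, is_goal):
  tie = count()  # tie-breaker so the heap never has to compare paths
  frontier = [(h(start), next(tie), 0, [start])]  # (f, tie, cost so far, path)
  expanded = set()  # prune a path when its end node has already been expanded
  while frontier:
    f, _, g, path = heapq.heappop(frontier)
    node = path[-1]
    if node in expanded:
      continue
    expanded.add(node)
    if is_goal(node):
      return path
    for n in neighbours(node):
      g2 = g + cost(node, n)  # cost of the extended path
      heapq.heappush(frontier, (g2 + h(n), next(tie), g2, path + [n]))
  return None

With a monotone $h$, the first expansion of a node uses a cheapest path to it, which is what makes this pruning safe.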

Finding good heuristics

Solve a simpler version of the problem:

Sliding puzzle example:

Dominance Relation

Dominance: for two heuristics, $h_a$ dominates $h_c$ if $\forall{n}: h_a(n) \geq h_c(n)$.

Heuristics form a semi-lattice: if neither is dominating, could use $max(h_a(n), h_c(n))$.

The bottom of the lattice is the zero heuristic - makes A* search just an LCFS.

Iterative Deepening

Takes a bound (cost or depth) and does not expand paths that exceed the bound:

Uses less memory but more CPU when compared to BFS:

Complexity with solution at depth $k$ and branching factor $b$:

| Level | BFS | Iterative Deepening | Num. Nodes |
| --- | --- | --- | --- |
| $1$ | $1$ (only looks at nodes once) | $k$ (repeat search $k$ times) | $b$ |
| $2$ | $1$ | $k - 1$ | $b^2$ |
| $\vdots$ | | | |
| $k$ | $1$ | $1$ | $b^k$ |
| Total | $\geq b^k$ | $\leq b^k(\frac{b}{b-1})^2$ | |

As branching factor increases, complexity gets closer and closer to BFS - thus, it is not very wasteful.
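
A sketch of depth-bounded DFS with an incrementing bound, under the same assumed neighbours/is_goal interface as the earlier sketches:

def depth_limited(path, neighbours, is_goal, bound):
  # DFS that does not expand paths exceeding the depth bound
  node = path[-1]
  if is_goal(node):
    return path
  if bound == 0:
    return None
  for n in neighbours(node):
    result = depth_limited(path + [n], neighbours, is_goal, bound - 1)
    if result is not None:
      return result
  return None

def iterative_deepening(start, neighbours, is_goal):
  # Increment the max depth until a solution is found
  # (loops forever if no solution exists)
  bound = 0
  while True:
    result = depth_limited([start], neighbours, is_goal, bound)
    if result is not None:
      return result
    bound += 1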

Week 03: Propositions and Inference

Simple Language: Propositional Definite Clauses

Interpretation

An interpretation assigns a truth value to each atom. Thus, there are $2^{num\_atoms}$ possible interpretations.

Models and Logical Consequences

Example: $$ KB = \begin{cases} p \leftarrow q,\ q,\ r \leftarrow s \end{cases} $$ $m = (p = true, q = true, r = false, s = false)$ would be a model of $KB$.

Proof Procedures

Bottom-Up Proof Procedure

If $h \leftarrow b_1 \land \cdots \land b_m$ is a clause in the knowledge base and each $b_i$ has been derived (all are consequences), then $h$ can be derived. This method is called forward chaining.

First, set $C := \{\}$. Then, select a clause $h \leftarrow b_1 \land \cdots \land b_m$ in $KB$ such that:

Then set $C := C \cup \{h\}$.

Repeat until no more clauses can be selected. (Only atomic clauses can be added to the set at the beginning).

$KB \vdash g$ if $g \in C$ at the end of the procedure.
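
A minimal sketch of this bottom-up procedure, assuming the KB is given as a list of (head, body) pairs where atomic clauses have an empty body:

def forward_chain(kb):
  C = set()
  changed = True
  while changed:
    changed = False
    for head, body in kb:
      # Select a clause whose body atoms have all been derived
      if head not in C and all(b in C for b in body):
        C.add(head)
        changed = True
  return C  # the fixed point

# e.g. forward_chain([('q', []), ('p', ['q']), ('r', ['s'])]) == {'q', 'p'}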

Proof of soundness

If there is a $g$ such that $KB \vdash g$ but $KB \nvDash g$, there must be some atoms added to $C$ which aren't true in every model of $KB$. Call the first such atom $h$.

Thus, there must be a clause of the form $h \leftarrow b_1 \land \dots \land b_m$. As $h$ is the first wrong atom, each $b_i$ is true in every model of $KB$. Take a model $I$ of $KB$ in which $h$ is false: this clause is then false in $I$, so $I$ cannot be a model of $KB$ - a contradiction.

Fixed Point

The $C$ generated at the end of the bottom-up algorithm is called a fixed point.

If $I$ is the interpretation in which every element of the fixed point is true, and every other atom is false:

Proof of Completeness

Top-Down Procedure

Search backwards from a query to determine if it is a logical consequence of $KB$ (i.e. asking if an atom is true).

An answer clause: $yes \leftarrow a_1 \land \dots \land a_m$.

The SLD resolution of the answer clause on atom $a_i$ with the clause $a_i \leftarrow b_1 \land \dots \land b_p$ is another answer clause:

$$ yes \leftarrow a_1 \land \dots \land a_{i-1} \land b_1 \land \dots \land b_p \land a_{i+1} \land \dots \land a_m $$

Basically: replace the atom with its clause, repeating until no more replacements can be made.

An answer is an answer clause with $m = 0$; that is, the answer clause is $yes \leftarrow$.

Derivations

Derivation of query $?\ q_1 \land \dots \land q_k$ is a sequence of answer clauses $\gamma_0, \dots, \gamma_n$ such that:

Procedure

To solve the query $?\ q_1 \land \dots \land q_k$:

Either don't-care non-determinism, in which case if a selection does not lead to a solution there is no point in trying other alternatives, or don't-know non-determinism, in which other choices may lead to a solution.

A successful derivation would end with $ac := yes \leftarrow$.

A failing derivation would return something in the form $ac := yes \leftarrow a_0 \land \dots \land a_n$. This does not mean it cannot be derived, just that it failed. Use DFS; backtrack.
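
A minimal sketch of the top-down procedure with DFS and backtracking, using the same assumed (head, body) representation as the bottom-up sketch; note it can loop on cyclic clauses:

def solve(kb, query):
  # query is the body of the answer clause; empty means `yes <-`, an answer
  if not query:
    return True
  a, rest = query[0], query[1:]
  for head, body in kb:
    # SLD resolution: replace the selected atom with the clause's body
    if head == a and solve(kb, body + rest):
      return True
  return False  # failing derivation: backtrack and try other clauses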

Weeks 04-05: Prolog

Prolog - programming in logic

Intro

Basic Syntax

Variables

Variables start with an underscore or upper-case letter, and may contain upper- and lower-case letters, digits, or underscores. happy(X). returns a term that replaces X such that the rule is met. Typing ; tries to find another term that satisfies the rule.

Conjunction can also be used e.g. happy(X), listensToMusic(X)..

Variables can be in the knowledge base as well e.g. jealous(X,Y) :- loves(X,Z), loves(Y,Z).

Atoms

A sequence of characters (upper, lower, digits, underscore) starting with a lowercase letter OR a sequence of special characters (:, ,, ;, ., :-). Atoms can be enclosed in single quotes if they do not meet the naming requirements (e.g. 'Yolanda').

Complex terms

Functor directly followed by a sequence of arguments - put in brackets, separated by commas. e.g. listensToMusic(yolanda), hide(X, father(father(father(butch)))).

Arity

The number of arguments a complex term has; predicates with the same functor but different arity can be defined.

Unification

Two terms unify if they are the same term or contain variables that can be uniformly instantiated with terms in a way such that the resulting terms are equal.

e.g. woman(mia) = woman(mia). woman(Z) and woman(mia) will be unified, with Z taking the value of mia.

e.g. defining horizontal(line(point(X,Y), point(X,Z))). and running horizontal(line(point(1,2), point(X,3))). will unify to X=1.

Search Tree

f(a).
f(b).
g(a).
g(b).
h(b).
k(X):-f(X),g(X),h(X).

Asking ?- k(Y). will:

Recursion

For recursive predicates using conjunction, place the non-recursive goal first - Prolog uses DFS, so it may recurse forever (running out of memory) otherwise.

numeral(0).
numeral(succ(X)):-numeral(X).

Where succ(X) represents X + 1: 3 could be defined as numeral(succ(succ(succ(0)))). numeral(X). will continue generating numerals forever. Beware of running out of memory, e.g. p :- p.

Addition

add(0,X,X). is the base clause. No return values, so three arguments needed (the last is the return value).

add(succ(X),Y,succ(Z)):-add(X,Y,Z). is for the recursive case.

Dif predicate

mother(X, Y) :- parent(X, Y), female(X).
sister(X, Y) :- parent(Z, X), parent(Z, Y), female(X).

Any female would be her own sister, so the dif predicate is required: append X \= Y to the end of the sister body.

Predicate description - argument mode indicator

Character prepended to argument when describing a functor:

Logical quantification

Variables that appear in the head of a rule are universally quantified.

Variables that appear only in the body are existentially quantified.

path(X,Y) :- edge(X,Z), path(Z,Y).: For all nodes X and Y, there exists a node Z such that there is an edge from X to Z and a path from Z to Y.

Lists

A finite sequence of elements e.g. [[], dead(z), [2, [b, c]], Z, 3].

Lists as implemented as linked lists. A non-empty list has two parts:

The | operator decomposes a list into a head and tail:

Anonymous variables

If you do not need the value of a variable, use _ - the anonymous variable. Each occurrence of _ is a distinct variable, so two instances need not be equal.

Membership

Is an element a member of a list? Use member/2:

member(X, [X|_]). % If element is the head, it is in the list
member(X, [_|T]) :- member(X, T). % Else, recurse through the list until it fails

member(X, [a, b, c]) can unify to three separate values.

% Exercise: a2b/2; first list all a's and second list all b's; both of the same length

a2b([], []).
a2b([a|L1], [b|L2]) :- a2b(L1, L2).

Append

Append two lists together using append/3, where the third argument is the result of concatenating the lists together.

append([], L, L).
append([H|L1], L2, [H|L3]) :- append(L1, L2, L3).
% If L3 is result, first elements of L1 and L3 must be the same. Hence, remove the first element from both L1 and L3 until L1 is empty.

Concatenating a list is done by traversing down one of the lists; hence, it is inefficient.

Prefix and Suffix

prefix(P, L) :- append(P, _, L). with prefix(X, [a, b, c]) generating all possible prefix lists of [a, b, c].

suffix(S, L) :- append(_, S, L). with suffix(X, [a, b, c]) generating all possible suffix lists of [a, b, c].

Sublist

Sublists are prefixes of suffixes of the list:

sublist(Sub, List) :- suffix(Suffix, List), prefix(Sub, Suffix).

Reversing a list

If given [H|T], reverse the list by reversing T and appending [H] to the result.

naiveReverse([], []).
naiveReverse([H|T], R) :- naiveReverse(T, RT), append(RT, [H], R).

This is inefficient: append/3 is O(n) and this is done at each stage, so it is O(n^2).

By using an accumulator, we can improve things:

% Third argument is an accumulator
accReverse([], L, L). % If list is empty, the reversed list is the accumulator
accReverse([H|T], Acc, Rev) :- accReverse(T, [H|Acc], Rev).

reverse(L1, L2) :- accReverse(L1, [], L2).
% append/3 not used
List          Acc
[a, b, c, d]  []
[b, c, d]     [a]
[c, d]        [b, a]
[d]           [c, b, a]
[]            [d, c, b, a]

Arithmetic and other operators

| C | Prolog |
| --- | --- |
| < | < |
| <= | =< |
| == | =:= |
| != | =\= |
| >= | >= |
| > | > |

These force the left and right hand arguments to be evaluated.

= is the unification predicate and \= is the negation. == is the identity predicate, which succeeds if the arguments are identical.

2+2 = 4. is false as +(2, 2) does not unify with 4. You must use 2+2 =:= 4.

! is the cut operator - it suppresses backtracking. The fail predicate always fails. Using these two allows us to invert the result:

neg(Goal) :- Goal, !, fail.
neg(_). % If Goal fails, this second clause makes neg(Goal) succeed

If Goal succeeds, it gets to ! so can never backtrack, then gets to fail and fails. If Goal fails, the first clause fails and the second clause, neg(_), succeeds. If the ! was not there, it would attempt to evaluate Goal again.

As this is so common, there is a built-in operator that does this: \+.

Week 06: Constraint Satisfaction Problems

These problems are characterized by:

A solution is a n-tuple of values for the variables that satisfies the constraints.

Examples

Australian map colouring:

Sudoku:

Eight queens puzzle:

Basic Algorithms

Generate-and-Test algorithm

Generate the assignment space $D = D_{v_1} \times \dots \times D_{v_n}$ (Cartesian product), then test each assignment with the constraints.

It is exponential in the number of variables.

Backtracking

Systematically explore $D$ by instantiating variables one at a time, evaluating each constraint as soon as all its variables are bound.

Thus, any partial assignment that doesn't satisfy the constraints can be pruned (e.g. if the constraint $A \neq B$ is violated, the assignment can be pruned even if $C$, $D$, etc. have not been instantiated yet).

A node is an assignment of values to some of the variables.

Search:

The start node is the empty assignment, and the goal node is a total assignment that satisfies the constraints.

CSP Algorithms

CSP is NP-hard. However, some instances of CSP can be solved more efficiently by exploiting certain properties.

Constraint Networks

An instance of CSP can be represented as a network:

Arc Consistency

An arc $\langle X, r(X, \overline{Y})\rangle$ is arc consistent if for every value of $X$, there is a value of $\overline{Y}$ such that $r(x,\overline{y})$ is satisfied. $\overline{Y}$ may be a set of multiple variables (the variables in the scope of the constraint, except $X$).

A network is arc consistent if all arcs are arc consistent.

If a constraint has only one variable in its scope and every value in the domain satisfies the constraint, then the arc is domain consistent.

If there is an arc that is not arc consistent, remove values from $X$'s domain to make it arc consistent.

Example:

One arc would be $\langle A, A+B=C \rangle$:

By repeating this with other nodes, the network can be made arc consistent.

Arc Consistency Algorithm

Arcs can be considered in series. An arc $\langle X, r(X, \overline{Y})\rangle$ needs to be revisited if the domain of one of the $Y$'s is reduced.

def GAC(variables, domains, constraints):
  # GAC: Generalized Arc Consistency algorithm
  # Start with every arc: one (variable, constraint) pair per variable in each scope
  todo = [(X, C) for C in constraints for X in scope(C)]
  return GAC2(variables, domains, constraints, todo)

def GAC2(variables, domains, constraints, todo):
  while todo:
    X, constraint = todo.pop() # The variable is in the scope of the constraint (i.e. it is relevant)
    Ys = [Y for Y in scope(constraint) if Y != X] # The other variables in the constraint's scope

    # Keep x if there exist values (y_1, ..., y_n) for the Ys such that
    # is_consistent(constraint, X=x, Y_1=y_1, ..., Y_n=y_n) holds;
    # assignments(Ys, domains) is assumed to enumerate all value tuples for the Ys
    new_domain = {x for x in domains[X]
                  if any(is_consistent(constraint, x, ys) for ys in assignments(Ys, domains))}

    if new_domain != domains[X]:
      # X's domain shrank, so re-check all other arcs involving constraints on X
      todo += [(Z, C) for C in constraints if X in scope(C) for Z in scope(C) if Z != X]
      domains[X] = new_domain

  return domains

There are three possible outcomes of this algorithm: every domain has a single value (a unique solution), some domain is empty (no solution), or some domain still has multiple values (there may or may not be a solution).

Domain Splitting

Split a problem into a number of disjoint cases and solve each case separately: the set of solutions is the union of all solutions to each case.

e.g. if $X \in \{0, 1\}$, find all solutions where $X = 0$, then where $X = 1$.

Algorithm

Given a CSP instance:

def CSPSolver(variables, domains, constraints, todo):
  # Make the network arc consistent first (GAC2 from above), then inspect the domains
  domains = GAC2(variables, domains, constraints, todo)

  if any(len(domains[X]) == 0 for X in variables):
    return False # Some domain is empty: no solution
  if all(len(domains[X]) == 1 for X in variables):
    # Every domain has one possible value: a solution
    return {X: next(iter(domains[X])) for X in variables}

  X = next(X for X in variables if len(domains[X]) > 1)
  D1, D2 = split(domains[X])

  domains1 = {**domains, X: D1}
  domains2 = {**domains, X: D2}

  # Domain of X has been split, so recheck all arcs involving constraints on X
  todo = [(Z, C) for C in constraints if X in scope(C) for Z in scope(C) if Z != X]

  return (CSPSolver(variables, domains1, constraints, todo)
          or CSPSolver(variables, domains2, constraints, todo))

Variable Elimination

Eliminate variables one-by-one, passing constraints onto their neighbors.

Constraints can be thought of as a relation containing tuples for all possible valid values.

A join operation can be done on two constraints to capture both constraints, joining on the variable being eliminated. Then, this table can be projected with that column (variable being eliminated) removed.

Algorithm

def VE_CSP(variables, constraints):
  if len(variables) == 1:
    return join(constraints) # Join the remaining constraints into a single relation
  else:
    X = variables.pop() # Select a variable to eliminate and remove it from variables
    constraints_X = [C for C in constraints if X in scope(C)]
    R = join(constraints_X) # Join all constraints involving X
    R2 = project(R, [V for V in scope(R) if V != X]) # Project X's column away

    new_constraints = [C for C in constraints if X not in scope(C)] + [R2]
    recursed = VE_CSP(variables, new_constraints)
    return join([R, recursed]) # Join back onto R to recover the values of X

Week 07: Local and Global Search

An optimization problem has:

The assignment that optimizes (maximizes or minimizes) the value of the objective function must be found.

A constrained optimization problem adds a set of constraints which determines which assignments are allowed.

Use algorithms to iteratively improve a state.

A single current state is kept in memory and in each iteration, we move to one of its neighbors to improve it.

Most local search algorithms are greedy. Two such algorithms are hill climbing and greedy descent.

e.g. TSP: start with any complete tour, and perform pairwise exchanges (pick two edges, switch the vertex to one in the other edge). This type of approach can get close to the optimal solution quickly.

Local Search for CSPs

CSP can be reduced to an optimization problem.

If each variable is assigned a value, a conflict is an unsatisfied constraint: the goal is to find an assignment with zero conflicts.

Hence, the heuristic function to to be minimized will be the number of conflicts.

Example: n-queens problem

Put n queens on an n by n board such that no two queens can attack each other.

Heuristic: number of pairs of queens that can attack each other.

One queen on each column, queens can move up and down only. For each state there will be $n(n-1)$ neighbors (each can move to $n-1$ locations).
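
A sketch of the conflict-counting heuristic under the one-queen-per-column representation (queens[i] is the row of the queen in column i; hypothetical names):

def conflicts(queens):
  # Number of pairs of queens that can attack each other
  n = len(queens)
  return sum(1
             for i in range(n) for j in range(i + 1, n)
             if queens[i] == queens[j]                 # same row
             or abs(queens[i] - queens[j]) == j - i)   # same diagonal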

Variants of Greedy Descent

Issues

Can get stuck in local optima/flat areas of the landscape - randomized greedy descent can sometimes help:

These two make the search global.

A total assignment is called an individual.

Maintain a population of k individuals instead of one and update each individual at every stage. If an individual is a solution, it can be reported (and the search can stop).

With local search, random restarts will occur when it reaches a non-zero local minimum and finish when a solution is found.

With parallel search, if a solution is found all individuals can be stopped.

Simulated Annealing

Given:

The probability of adopting the new value is:

$$ e^{(h(n) - h(n'))/T} $$

As the temperature is reduced, the probability of accepting a worsening change decreases.
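
A sketch of the acceptance test, assuming h is the heuristic being minimized:

import math
import random

def accept(h_n, h_n2, T):
  # h_n = h(n) for the current state, h_n2 = h(n') for the proposed neighbor
  if h_n2 <= h_n:
    return True  # always accept an improvement
  # Accept a worsening move with probability e^((h(n) - h(n')) / T),
  # which shrinks as the temperature T is lowered
  return random.random() < math.exp((h_n - h_n2) / T)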

Gradient Descent

Objective function must be (mostly) differentiable:

def gradient_descent(f, initial_guess, step_size, epsilon):
  # f = objective function; grad(f) returns its gradient function
  x = initial_guess # x is a vector
  while magnitude(grad(f)(x)) > epsilon:
    # f is a multi-variable function, so grad(f)(x) returns a vector;
    # the steepest descent direction is against the gradient, so step along it
    x = x - step_size * grad(f)(x)
  return x

Evolutionary Algorithms

Requires:

Flow

After initializing the population:

Evaluation/Fitness Function

A value relating to how ‘good’ an individual is.

Used for:

Selection

Better individuals should have a higher chance of surviving and breeding.

Roulette wheel

Each individual is assigned to a part of the roulette wheel, with the ‘angle’ being proportional to its fitness, and it is spun n times to select n individuals.

def roulette_wheel_select(population, fitness, r):
  # r is a uniform random number in [0, 1)
  total_fitness = sum([fitness(individual) for individual in population])

  rolling_sum = 0

  for individual in population:
    rolling_sum += fitness(individual)
    if rolling_sum > total_fitness * r:
      return individual

Tournaments

n (size of the tournament) individuals chosen randomly: the fittest one is selected as a parent.

If n is one, it is equivalent to randomly selecting an individual.

If n is the population size, it is completely deterministic.
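
A sketch of tournament selection, assuming a fitness callable:

import random

def tournament_select(population, fitness, n):
  # Choose n individuals at random; the fittest becomes a parent.
  # n = 1 is random selection; n = len(population) is fully deterministic.
  contestants = random.sample(population, n)
  return max(contestants, key=fitness)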

Elitism

Keeps at least one copy of the fittest solution so far for the next generation - ensures the fitness will not decrease over generations.

Reproduction Operators

Crossover

Two parents produce two offspring. 1, 2 or n crossover points are generated:

One point crossover: the child has parent one's chromosomes up until the random crossover point; after that, it has the other parent's chromosomes.

n point crossover: at n separate points, it swaps between taking chromosomes from one parent and the other.
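
A sketch of one-point crossover on sequence chromosomes (n-point crossover repeats the same idea at n cut points):

import random

def one_point_crossover(parent1, parent2):
  # Each child copies one parent up to a random cut point,
  # then the other parent after it
  point = random.randint(1, len(parent1) - 1)
  child1 = parent1[:point] + parent2[point:]
  child2 = parent2[:point] + parent1[point:]
  return child1, child2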

Mutation

Mutation gives a low probability of randomly changing the gene of a child.

Mutation brings in diversity as compared to combining candidates to (hopefully) produce better children.

Randomly Generating Programs (Trees)

Mutation: pick a random node and replace the subtree with a randomly-generated subtree.

Crossover: pick a random node in each parent and exchange them.

Week 08: Probabilistic Inference and Belief Networks

In the real world, there will be uncertainty and randomness due to several factors:

Hence, using probabilities is often required.

Random Variable

Some aspect of the world about which there is uncertainty.

RVs are denoted with a capital letter and have an associated domain.

Unobserved RVs have distributions; a table of probabilities of values. A probability is a single number e.g. $P(W = rain) = 0.1$. The probabilities sum to 1 and none are negative.

Joint distribution over a set of RVs: a map from assignments/outcomes/atomic events to reals; $P(X_1 = x_1, \dots, X_n = x_n)$.

Event: set $E$ of assignments; $P(E) = \sum_{(x_1, \dots, x_n) \in E}{P(x_1, \dots, x_n)}$.

Marginalization (summing out): projecting a joint distribution to a sub-distribution over a subset of variables: $P(X_1 = x_1) = \sum_{x_2 \in domain(X_2)}{P(X_1 = x_1, X_2 = x_2)}$.

Conditional probability: $P(a|b) = \frac{P(a, b)}{P(b)}$.

Conditional distribution: probability distribution over some variables given fixed values of others. If $W$ and $T$ take binary values, $P(W, T)$ is a 2 by 2 table, and $P(W|T)$ is two 2-row tables, each summing to 1.

To get the whole conditional distribution at once, select the joint probabilities matching the evidence and normalize the selection (make it sum to 1). Example:

$P(T|rain)$: select rows where $R = rain$, then divide the probabilities by the sum $P(warm, rain) + P(cold, rain)$.

$$ \begin{aligned} P(x_1|x_2) &= \frac{P(x_1, x_2)}{P(x_2)} \\ &= \frac{P(x_1, x_2)}{\sum_{x_1 \in domain(X_1)}{P(x_1, x_2)}} \end{aligned} $$

Product rule:

$$ \begin{aligned} P(x|y) &= \frac{P(x, y)}{P(y)} \\ \therefore P(x, y) &= P(x|y) \cdot P(y) \end{aligned} $$

Chain rule: $P(x_1, x_2, x_3) = P(x_1) \cdot P(x_2|x_1) \cdot P(x_3|x_1, x_2)$. More generally:

$$ P(x_1, \dots, x_n) = \prod_{i=1}^{n}{P(x_i|x_1, \dots, x_{i-1})} $$

Probabilistic Inference

Computing a desired probability from other known probabilities.

$P(x, y) = P(x|y) \cdot P(y) = P(y|x) \cdot P(x)$. By dividing this by the marginal, we get Bayes' rule:

$$ P(x|y) = \frac{P(y|x) \cdot P(x)}{P(y)} $$

This allows us to invert a conditional distribution - often one conditional is simple but the other is tricky. $$ \begin{aligned} P(x,y|z) &= P(y,x|z) \\ P(x,y|z) &= \frac{P(x,y,z)}{P(z)} \\ P(y|x,z) &= \frac{P(x,y,z)}{P(x,z)} \\ \therefore P(x,y,z) &= P(y|x,z) \cdot P(x,z) \\ \therefore P(x,y|z) &= \frac{P(y|x,z) \cdot P(x,z)}{P(z)} \\ &= P(y|x,z) \cdot P(x|z) \\ \therefore P(x|y,z) &= \frac{P(y|x,z) \cdot P(x|z)}{P(y|z)} \\ \text{if } z \text{ is implicit:} \quad P(x|y) &= \frac{P(y|x) \cdot P(x)}{P(y)} \text{ (Bayes' rule)} \end{aligned} $$

Inference by Enumeration

A more general procedure: $P(Y_1, \dots, Y_m|e_1, \dots, e_k)$ where:

These variables can be referred to as $X_1, \dots, X_n$.

First, select entries consistent with the evidence.

Then, sum out $H$:

$$ P(Y_1, \dots, Y_m, e_1, \dots, e_k) = \sum_{h_1, \dots, h_r}{P(Y_1, \dots, Y_m, h_1, \dots, h_r, e_1, \dots, e_k)} $$

Finally, normalize the remaining entries.
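
A minimal sketch of inference by enumeration, assuming the joint distribution is available as a callable from a full assignment (a dict) to a probability; all names are hypothetical:

from itertools import product

def enumerate_query(joint, domains, query_vars, evidence):
  hidden = [v for v in domains if v not in query_vars and v not in evidence]
  result = {}
  for ys in product(*(domains[v] for v in query_vars)):
    assignment = {**dict(zip(query_vars, ys)), **evidence}
    # Sum out the hidden variables
    result[ys] = sum(joint({**assignment, **dict(zip(hidden, hs))})
                     for hs in product(*(domains[v] for v in hidden)))
  total = sum(result.values())  # normalization constant
  return {ys: p / total for ys, p in result.items()}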

Complexity of Models

Simple models are easier to build, explain, and usually lead to lower time complexity and space requirements.

To measure complexity, count the number of free parameters that must be specified; a joint distribution over $n$ variables, each with a domain of size $d$, requires $d^n$ entries in the table, and the number of free parameters will be $d^n - 1$ (the last one can be inferred as probabilities must sum to 1).

Independence

Two variables are independent if: $$ \begin{aligned} P(X, Y) &= P(X) \cdot P(Y) \text{ or} \\ \forall x, y: P(x, y) &= P(X=x) \cdot P(Y=y) \end{aligned} $$ That is, their joint distribution factors into a product of two simpler distributions.

Absolute independence: $X {\perp\!\!\!\perp} Y$

Independence can be used as a modelling assumption; if all the variables are independent, instead of having $d^n - 1$ parameters in the joint model, we only need $n(d - 1)$.

Absolute (unconditional) independence is very rare; conditional independence is more robust. For a given value of $Z$, the probability of $X$ is independent of $Y$; $X {\perp\!\!\!\perp} Y | Z$ if:

$$ \begin{aligned} &\forall x, y, z: P(x, y|z) = P(x|z) \cdot P(y|z) \text{ or} \\ &\forall x, y, z: P(x|z, y) = P(x|z) \end{aligned} $$

In this case:

$$ \begin{aligned} P(X, Y, Z) &= P(X|Y, Z) \cdot P(Y, Z) \\ &= P(X|Y, Z) \cdot P(Y|Z) \cdot P(Z) \\ &= P(X|Z) \cdot P(Y|Z) \cdot P(Z) \end{aligned} $$

This can occur if $X$ and $Y$ are both dependent on $Z$ but are independent of each other; the value of $X$ modifies the probability of the parent $Z$'s value and thus modifies the probability of $Y$ in turn.

Belief Networks

Graphical Model Notation

Arcs:

Nodes:

Distributions:

D-separation can be used to decide if a set of nodes $X$ is independent of $Y$ given $Z$.

Encoding

BNs implicitly encode joint distributions; this can be calculated as a product of local conditional distributions.

Example:

$$ \begin{aligned} A &\rightarrow B \\ A &\rightarrow C \\ B, C &\rightarrow D \\ C &\rightarrow E \end{aligned} $$
$$ \begin{aligned} P(a, b, c, d, e) &= P(e|a, b, c, d) \cdot P(a, b, c, d) \quad\text{(by the product rule)} \\ &= P(e|c) \cdot P(a, b, c, d) \quad\text{($e$ is dependent on $c$ but independent of all others)} \\ &= P(e|c) \cdot P(d|a, b, c) \cdot P(a, b, c) \\ &= P(e|c) \cdot P(d|b, c) \cdot P(a, b, c) \\ &= P(e|c) \cdot P(d|b, c) \cdot P(c|a, b) \cdot P(a, b) \\ &= P(e|c) \cdot P(d|b, c) \cdot P(c|a) \cdot P(b|a) \cdot P(a) \end{aligned} $$

More generally, if you have a full assignment, multiplying the relevant conditionals gives the probability:

$$ P(x_1, \dots, x_n) = \prod_{i=1}^{n}{P(x_i | parents(X_i))} $$
product = 1
for i in range(n):
  # Multiply in P(x_i | parents(X_i)) from node i's conditional probability table
  product *= probability(x[i], parents(i))

Thus, we can reconstruct any entry of the full joint. However, not every BN can represent every full joint.

Inference Enumeration

Computing $P(Y | e)$. If $H$ are the hidden variables and $\alpha$ normalizes the values so the sum of the probabilities is 1:

$$ P(Y|e) = \alpha \cdot P(Y, e) = \alpha \sum_H{P(Y, e, H)} $$

(NB: $\sum_H$ means $\sum_{H_1}{\sum_{H_2}{\dots}}$)

This has to be computed for every value in the domain of $Y$.

Week 09: Basic Machine Learning

Learning: improving behavior based on experience. This could be:

Components of a learning problem:

Learning architecture:

Supervised Learning

Given the following as input:

This is fed to a learning algorithm to build a predictive model that takes a new instance and returns/predicts the value for the target feature.

For continuous target variables, regression is used.

Measuring performance

Common performance measures:

$$ \text{error} = \frac{\text{number of incorrectly classified instances}}{\text{total number of instances}} $$

$$ \text{accuracy} = 1 - \text{error} = \frac{\text{number of correctly classified instances}}{\text{total number of instances}} $$

In binary classification problems, one class is called positive (p) and the other negative (n).

Common performance measures for regression:

Training and Test Sets

A set (multi-set) of examples is divided into training and test examples; this is required as the model can overfit the training data, giving high performance on the training data but low performance on unseen data.

The more complex the model is, the lower the error on the training data.

A general pattern is that at a certain complexity, increasing the complexity of the model increases the error on the test data.

Naïve Bayes Model

$$ P(C | X_1, \dots, X_n) = \alpha \cdot \prod_{i=1}^n{P(X_i | C)} \cdot P(C) $$

Where:

Conditional probabilities can be estimated from labelled data.

Find $P(Class | an\_input\_vector)$ for different classes and pick the class with the highest probability.

Problem: it is hard to learn $P(Class | Evidence)$ directly, as there need to be examples for every possible assignment. As the number of features increases, the number of assignments grows exponentially.

Thus, assume the input features are conditionally independent given the class (the naïve Bayes model).

Example: Building a Classifier

Determine if the patient is susceptible to heart disease (y/n) given family history (t/f), fasting blood sugar level (l/h), BMI (l, n, h).

Model it as $Hist$, $BG$, $BMI$ all having $Class$ as a parent:

$$ P(Class | Hist, BG, BMI) = \alpha \cdot P(Class) \cdot P(Hist | Class) \cdot P(BG | Class) \cdot P(BMI | Class) $$

The class can take two values, so there are two tables per feature, with two rows for $Hist$/$BG$ per table (three for $BMI$ as it has three values).

NB: in the quiz, you only store the value for when the class is true.

To calculate $\alpha$, calculate the sum of $P(Class | Hist, BG, BMI)$ for all values of $Class$, and then take the inverse.
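
A sketch of the posterior calculation, assuming the prior and per-feature conditional tables have been estimated from labelled data (hypothetical data structures):

def naive_bayes_posterior(prior, likelihoods, evidence):
  # prior[c] = P(Class=c)
  # likelihoods[feature][c][value] = P(feature=value | Class=c)
  unnormalized = {}
  for c, p_c in prior.items():
    p = p_c
    for feature, value in evidence.items():
      p *= likelihoods[feature][c][value]
    unnormalized[c] = p
  alpha = 1 / sum(unnormalized.values())  # inverse of the sum over classes
  return {c: alpha * p for c, p in unnormalized.items()}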

Laplace Smoothing

Zero counts in small data sets lead to zero probabilities - this is too strong a claim based on only a sample.

To fix this, add a non-negative pseudo-count to the counts - this can reduce the confidence.

$domain(A)$ is the set of values $A$ can take; $|domain(A)|$ is the number of values $A$ can take.

$count(constraints)$ is the number of examples in the dataset that satisfy the given constraints:

Given these:

$$ P(A=a | B=b) \approx \frac{count(A=a, B=b) + pseudo\_count}{\sum_{a' \in domain(A)}{(count(A=a', B=b) + pseudo\_count)}} $$

This is equivalent to:

$$ P(A=a | B=b) \approx \frac{count(A=a, B=b) + pseudo\_count}{count(B=b) + pseudo\_count \cdot |domain(A)|} $$

The greater the pseudo-count, the more the probabilities even out (approaching $\frac{1}{|domain(A)|}$).
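
A one-line sketch of the smoothed estimate (hypothetical argument names):

def smoothed_estimate(count_a_and_b, count_b, domain_size_a, pseudo_count=1):
  # P(A=a | B=b) ~ (count(A=a, B=b) + pc) / (count(B=b) + pc * |domain(A)|)
  return (count_a_and_b + pseudo_count) / (count_b + pseudo_count * domain_size_a)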

Parametric vs Non-Parametric Models

Parametric models are described with a set of parameters; learning means finding the optimal values for these parameters.

Non-parametric models are not characterized by parameters - one such family is called instance-based learning:

k-Nearest Neighbors

An example of an instance-based learning algorithm is k-nearest neighbors:

Training only requires storing all the examples.

Prediction: $H(x_{new})$:

If k is too high, it will be under-fit.

Geometrically, each data point $x$ defines a 'cell' of space where any point within that space has $x$ as the closest point to it. As the target feature is discrete, decision boundaries can be made where the class is different on each side of the boundary.
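
A sketch of the k-nearest-neighbors prediction step, assuming the stored examples are (input, label) pairs and a distance metric is given (hypothetical names):

from collections import Counter

def knn_predict(examples, x_new, k, distance):
  # Take the k stored examples nearest to x_new and return the majority label
  nearest = sorted(examples, key=lambda ex: distance(ex[0], x_new))[:k]
  return Counter(label for _, label in nearest).most_common(1)[0][0]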

Week 10: Artificial Neural Networks

Perceptron

A neuron receives signals from multiple inputs (inputs are weighted), and if the overall signal is above a threshold, the neuron fires. A perceptron models this with:

$$ a = \sum_{i = 1}^{n}{w_i \cdot x_i} + bias $$

Sometimes, bias is represented as another weight $w_0$ - in this case, there is a virtual input $x_0 = 1$ and hence:

$$ a = \sum_{i = 0}^{n}{w_i \cdot x_i} $$

For this course, $g(a) = 1$ if $a \ge 0$ and $0$ otherwise; a Heaviside (step) function.

A perceptron can be seen as a predicate: given a vector $x$, $f(x) = 1$ if the predicate over $x$ is true, and 0 otherwise. Hence, it can be used in decision making and binary classification problems ($f(x) = 1$ if in the positive class).

The function partitions the input space into two sections: if there are two inputs, the decision boundary will be a straight line: $w_1$ and $w_2$ determine the gradient and $w_0$ determines the intercept. For a given value of these weights, the decision boundary can be found (e.g. points where $x_1$ or $x_2$ equal zero).

If there are three inputs, the decision boundary is a plane. For n dimensions, it will be a hyperplane.

The vector $\underline{w}$ (without $w_0$) can be thought of as the normal to the line. The minimum distance between the origin and the decision boundary is $\frac{w_0}{\|\underline{w}\|}$.

$\underline{w}$ can be used to show which side of the hyper-plane will be classified as positive (the direction it points in will be positive).

Learning

Given a data set - a collection of training vectors of the form $(x_1, \dots, x_n, t)$ where $t$ is the target value:

If examples are linearly separable, the weights and bias will, in finite time, converge to values that produce a perfect separation.

If $\eta$, the learning rate, is too high, the boundary may oscillate and never perfectly partition the values. If it is too small, it will take a long time to converge.
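
A sketch of the perceptron training loop, assuming each example is (x, t) with the virtual input x[0] = 1 folded in:

def train_perceptron(examples, eta, max_epochs):
  w = [0.0] * len(examples[0][0])
  for _ in range(max_epochs):
    errors = 0
    for x, t in examples:
      prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
      if prediction != t:
        errors += 1
        # w_i <- w_i + eta * x_i * (actual - prediction)
        w = [wi + eta * xi * (t - prediction) for wi, xi in zip(w, x)]
    if errors == 0:
      break  # perfect separation on the training set
  return w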

Multi-Layer Perceptrons

Motivation

y
^
|
| false    true
|
| true     false
|-----------------> x

A single perceptron cannot partition these four points into a decision boundary. However, this can be done with multi-layer perceptrons.

Define two perceptrons, $P_1$ and $P_2$, which receive the same vectors but have their own weights and biases. By some algorithm, $P_1$ could partition the input space so that the upper left is separated from the rest of the points, and $P_2$ the same for the bottom right.

y          P_1            y   P_2
^                        ^
|          -----------   |
| false(0) | true(1)     | false(1)  true(1)
|----------              |         -----------
| true(1)    false(1)    | true(1) | false(0)
|-------------------> x  |--------------------> x

Now, pass the output of the perceptrons as input to another perceptron $P_3$:

P2        P_3
^
| --------
| (0, 1)  |  (1, 1) <- two points superimposed
|         ----------
|           (1, 0)  |
|---------------------> P1

Now, this perceptron can form a decision boundary that correctly partitions the input space.

Description

The feed-forward networks we are dealing with arrange perceptrons into layers where:

Some notes:

As more layers/neurons are added, the complexity of the boundary shape(s) can increase. If you have two dimensions:

Multi-class Classification

This can be done by having multiple numeric outputs and picking the node with the largest value.

Outputting numeric values instead of a Boolean requires the sigmoid function: $g(a) = \frac{1}{1 + e^{-a}}$. This function is differentiable at all points and handles uncertainty.

Error Function

The mean squared error is typically used, where $t_i$ is the desired output (according to the training data), $y_i$ is the output of the network, and $n$ is the number of examples in the training set:

$$ E = \sum_{i=1}^{n}{(t_i - y_i)^2} $$

The weights can be updated incrementally:

$$ W \leftarrow W - \eta \nabla E(W) $$

$\nabla E(W)$ is the gradient of the error; a vector of partial derivatives (derivatives for each input scalar given all other inputs are fixed). The gradient in the output layer is easy to compute, but in the hidden layer neurons can influence multiple other neurons, so back-propagation is needed (not covered).

Typical Architecture

The number of input nodes is determined by the number of attributes and the number of outputs is determined by the number of classes. A single hidden layer is enough for many classification tasks.

Guidelines:

Week 11: Games - Non-cooperative Multi-agent Systems

Many problems can be modelled as games: multiple agents with (possibly competing) interests. They can be described by:

Example: paper scissors rock:

Game Trees

In perfect-information games, a game tree is a finite tree where the nodes are states and the arcs correspond to actions by the agents.

Hence:

Perfect-information, Zero-sum, Turn-based games

Properties:

These properties create adversaries.

Optimal Strategy: Min-Max Function

Designed to find the best move at each stage:

  1. Generate the whole game tree
  2. Apply the utility function to each leaf
  3. Back-up values from the leaves through branch nodes
    • A max node computes the max of its child values, and vice-versa for a min node
  4. If a max node is at the root, choose the move leading to the child with the largest value (and vice-versa if the root is a min node)

This is optimal if both players are playing optimally.

def min_max_decision(state, root_is_max = True):
  # Return the move (child state) with the best backed-up value for the root player
  if root_is_max:
    return max(state.children(), key=min_value)
  else:
    return min(state.children(), key=max_value)

def max_value(state):
  # A max node's value is the maximum of its children's (min node) values
  if state.is_terminal:
    return state.calculate_utility()

  return max(min_value(child) for child in state.children())

def min_value(state):
  # A min node's value is the minimum of its children's (max node) values
  if state.is_terminal:
    return state.calculate_utility()

  return min(max_value(child) for child in state.children())

Reducing Search Space

Game tree size:

Hence, the search space needs to be reduced:

Alpha-Beta Pruning

If an option is bad compared to other available options, there is no point in expending search time to find out exactly how bad it is.

Example:

MAX           3
           /     \
          /       \
Min      3         ?
       / | \     /   \
Leaf  3  9  6   2     ?

We have computed that the value for the first child (min node) is 3 and that the first child of the second child has a value of 2. Hence, the second child has a maximum value of 2, so regardless of the true value of the second child, the outcome of the max will not change.

More generally:

(player)   MAX              (player)   MIN
           / \                         / \
min       m   ...            max     m    ...
               ...                         ...
                 \                           \
min               ?          max              ?
                /   \                       /   \
max    m > n   n     ?       min    m < n  n     ?

If m > n, the max node is guaranteed to get a utility of at least m - hence, the min node with utility n or less will never be reached. Thus, it does not need to be evaluated further. The opposite occurs for a min node.

The algorithm has:

These two values should be passed down to the child nodes during search. If the value returned by the child node is greater than $\alpha$ for a max node, then $\alpha$ should be updated, and vice versa if it is a min node. If $\alpha \geq \beta$, the node can be pruned.

from math import inf

def alpha_beta_search(tree, is_max_node = True, alpha = -inf, beta = inf):
  # Input: nested array. Numbers are leaf nodes, arrays are inner nodes
  # Returns a tuple of the best utility and path that gets that utility (index numbers)
  # If path is none, pruning occurred or it is a leaf node
  best_path = None
  best_utility = -inf if is_max_node else inf

  if isinstance(tree, int):
    return (tree, None)

  for (i, child) in enumerate(tree):
    utility, path = alpha_beta_search(child, not is_max_node, alpha, beta)
    path = [i] if path is None else ([i] + path) # Append index of child to path received by child

    if is_max_node:
      if utility > best_utility:
        (best_utility, best_path) = (utility, path)

        if utility > alpha:
          # This child is now the largest child encountered by this max node or any parent max nodes
          alpha = utility
    else:
      if utility < best_utility:
        (best_utility, best_path) = (utility, path)

        if utility < beta:
          beta = utility

    if alpha >= beta:
      return (utility, None)
      # In the case that this is a max node:
      # The child node is min node and its value is larger than or equal to the smallest value
      # encountered so far by any parent min nodes. Hence, this (max) node will pick this child or
      # a larger one, so this node's parent will reject it. Thus, this node can be pruned
      # Similar logic applies if it is a min node

  return (best_utility, best_path)
Effectiveness

Worst case: if branches are ordered, no pruning takes place.

Best case: each player’s best move is the left-most child/evaluated first.

Good move ordering improves effectiveness. Examples:

Alpha-beta search often gives us $O(b^\frac{d}{2})$; an improvement over $O(b^d)$ (i.e. it effectively reduces the branching factor to its square root).

Static (Heuristic Evaluation) Function

Estimates how good the current board configuration (a non-terminal state) is for a player.

A typical function could evaluate how good the state is for the player, minus how good it is for the opponent. If the board evaluation is $X$ for one player, it is $-X$ for its opponent.

This can be used to perform a cut-off search: after a maximum depth is reached, use the heuristic evaluation instead of finding the actual utility.

Side note: with three players, we now have two degrees of freedom; the utility is a tuple, not a scalar. Each player will attempt to maximize its own dimension.

Week 12: Concluding Remarks

Final Exam:

Topics covered:

A recurring pattern across these topics is that the problems they solve can be reduced to search problems.

Related topics:

CSSE courses: