Week 10: Artificial Neural Networks

Perceptron

A neuron receives signals from multiple inputs (inputs are weighted), and if the overall signal is above a threshold, the neuron fires. A perceptron models this with:

a = \sum_{i = 1}^{n}{w_i \cdot x_i} + bias

Sometimes, bias is represented as another weight w_0 - in this case, there is a virtual input x_0 = 1 and hence:

a = \sum_{i = 0}^{n}{w_i \cdot x_i}

For this course, the activation function is g(a) = 1 if a \ge 0 and 0 otherwise; a Heaviside (step) function.
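As an illustration, here is a minimal sketch of this forward pass in Python (the function and variable names are illustrative, not from the course):

def perceptron(x, w, bias):
    # Weighted sum of the inputs plus the bias: a = w_1*x_1 + ... + w_n*x_n + bias
    a = sum(w_i * x_i for w_i, x_i in zip(w, x)) + bias
    # Heaviside (step) activation: fire (output 1) if a >= 0, otherwise output 0
    return 1 if a >= 0 else 0

# Example: weights and bias chosen so the perceptron computes logical AND
print(perceptron([1, 1], [1.0, 1.0], -1.5))   # 1
print(perceptron([1, 0], [1.0, 1.0], -1.5))   # 0

With the w_0 convention above, bias would simply be stored as w[0] and every input vector prefixed with the virtual input x_0 = 1.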

A perceptron can be seen as a predicate: given a vector x, f(x) = 1 if the predicate over x is true, and 0 otherwise. Hence, it can be used in decision making and binary classification problems (f(x) = 1 if x is in the positive class).

The function partitions the input space into two regions: with two inputs, the decision boundary is a straight line, where w_1 and w_2 determine the gradient and w_0 determines the intercept. For given values of these weights, the decision boundary can be found (e.g. from the points where x_1 or x_2 equals zero).
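For example, with two inputs the boundary is the set of points where a = 0 (assuming w_1 and w_2 are non-zero):

w_1 x_1 + w_2 x_2 + w_0 = 0

Setting x_2 = 0 gives the intercept x_1 = -\frac{w_0}{w_1}, and setting x_1 = 0 gives x_2 = -\frac{w_0}{w_2}; the straight line through these two points is the decision boundary.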

If there are three inputs, the decision boundary is a plane. For n dimensions, it will be a hyperplane.

The vector \underline{w} (without w_0) can be thought of as the normal to the decision boundary. The minimum distance between the origin and the decision boundary is \frac{|w_0|}{\|\underline{w}\|}.

\underline{w} can be used to determine which side of the hyperplane will be classified as positive: it points towards the positive side.

Learning

Given a data set - a collection of training vectors of the form (x_1, \dots, x_n, t), where t is the target value:

If examples are linearly separable, the weights and bias will, in finite time, converge to values that produce a perfect separation.

If \eta, the learning rate, is too high, the boundary may oscillate and never perfectly partition the values; if it is too small, convergence will take a long time.
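A sketch of the training loop in Python is given below. The update used here, w_i \leftarrow w_i + \eta (t - y) x_i, is the standard perceptron learning rule; the course's exact pseudocode may differ, and the names are illustrative.

def train_perceptron(data, eta=0.1, epochs=100):
    # data is a list of (x, t) pairs, where x is a list of inputs and t is 0 or 1
    n = len(data[0][0])
    w = [0.0] * n
    bias = 0.0
    for _ in range(epochs):
        converged = True
        for x, t in data:
            a = sum(wi * xi for wi, xi in zip(w, x)) + bias
            y = 1 if a >= 0 else 0            # step activation
            if y != t:                         # misclassified: nudge the boundary
                w = [wi + eta * (t - y) * xi for wi, xi in zip(w, x)]
                bias += eta * (t - y)
                converged = False
        if converged:                          # perfect separation reached
            break
    return w, bias

# Example: learn logical OR, which is linearly separable
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w, b = train_perceptron(data)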

Multi-Layer Perceptrons

Motivation

y
^
|
| false    true
|
| true     false
|-----------------> x

A single perceptron cannot produce a decision boundary that correctly partitions these four points, because they are not linearly separable. However, this can be done with a multi-layer perceptron.

Define two perceptrons, P_1 and P_2, which receive the same input vectors but have their own weights and biases. By some algorithm, P_1 could partition the input space so that the upper-left point is separated from the rest of the points, and P_2 could do the same for the bottom-right point.

y          P_1            y   P_2
^                        ^
|          -----------   |
| false(0) | true(1)     | false(1)  true(1)
|----------              |         -----------
| true(1)    false(1)    | true(1) | false(0)
|-------------------> x  |--------------------> x

Now, pass the outputs of the perceptrons as inputs to another perceptron, P_3:

P_2        P_3
^
| --------
| (0, 1)  |  (1, 1) <- two points superimposed
|         ----------
|           (1, 0)  |
|---------------------> P_1

Now, this perceptron can form a decision boundary that correctly partitions the input space.
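Here is a sketch of this three-perceptron construction in Python, reusing the perceptron() helper from the earlier sketch. The four points are placed at the corners of the unit square and the weights are hand-picked for illustration; any weights producing the same partitions would work.

def P1(x, y):
    # Separates the upper-left point from the other three
    return perceptron([x, y], [1.0, -1.0], 0.5)

def P2(x, y):
    # Separates the bottom-right point from the other three
    return perceptron([x, y], [-1.0, 1.0], 0.5)

def P3(p1, p2):
    # Fires only when both P_1 and P_2 fire (a logical AND)
    return perceptron([p1, p2], [1.0, 1.0], -1.5)

def network(x, y):
    return P3(P1(x, y), P2(x, y))

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x, y), network(x, y))   # 1 for (0,0) and (1,1), 0 otherwise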

Description

The feed-forward networks we are dealing with arrange perceptrons into layers where:

Some notes:

As more layers/neurons are added, the complexity of the boundary shape(s) can increase. If you have two dimensions: a single perceptron gives a straight-line boundary, one hidden layer can combine several lines into a convex region, and further layers allow more complex, non-convex regions.

Multi-class Classification

This can be done by having one numeric output node per class and picking the node with the largest value.

Outputting numeric values instead of a Boolean requires a different activation function: the sigmoid, g(a) = \frac{1}{1 + e^{-a}}. This function is differentiable at all points and its graded output can express uncertainty.
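A minimal sketch of a sigmoid unit and the "pick the largest output" rule in Python (names are illustrative):

import math

def sigmoid(a):
    # Smooth, differentiable alternative to the step function
    return 1.0 / (1.0 + math.exp(-a))

def classify(outputs, classes):
    # outputs is a list of activations, one output node per class;
    # return the class whose node has the largest value
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return classes[best]

# Example: three output nodes for three classes
print(classify([0.12, 0.81, 0.35], ["cat", "dog", "bird"]))   # "dog"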

Error Function

The mean squared error is typically used, where t_i is the desired output (according to the training data), y_i is the output of the network, and n is the number of examples in the training set:

E = \sum_{i=1}^{n}{(t_i - y_i)^2}

The weights can be updated incrementally:

W \leftarrow W - \eta \nabla E(W)

\nabla E(W) is the gradient of the error: a vector of partial derivatives (the derivative of E with respect to each weight, with all other weights held fixed). The gradient for the output layer is easy to compute, but hidden-layer neurons can influence multiple downstream neurons, so back-propagation is needed (not covered).
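As an illustration of the incremental update for weights feeding an output node directly (where no back-propagation is needed), here is a sketch for a single sigmoid unit and the per-example squared error. The gradient \frac{\partial E}{\partial w_i} = -2(t - y)\,y(1 - y)\,x_i follows from the chain rule applied to the definitions above; all names are illustrative.

import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gradient_step(w, x, t, eta=0.5):
    # One update W <- W - eta * grad E(W) for a single sigmoid unit and one
    # example; w includes the bias as w[0], and x is prefixed with x_0 = 1
    a = sum(wi * xi for wi, xi in zip(w, x))
    y = sigmoid(a)
    # dE/dw_i = -2 (t - y) * y (1 - y) * x_i   (chain rule on E = (t - y)^2)
    grad = [-2.0 * (t - y) * y * (1.0 - y) * xi for xi in x]
    return [wi - eta * gi for wi, gi in zip(w, grad)]

# Example: nudge the weights towards producing output 1 for input (1, 0)
w = [0.0, 0.0, 0.0]          # bias w_0 plus two input weights
x = [1.0, 1.0, 0.0]          # virtual input x_0 = 1, then x_1, x_2
for _ in range(100):
    w = gradient_step(w, x, t=1)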

Typical Architecture

The number of input nodes is determined by the number of attributes, and the number of output nodes is determined by the number of classes. A single hidden layer is enough for many classification tasks.

Guidelines: