Week 09: Basic Machine Learning

Learning: improving behavior based on experience. This could be:

Components of a learning problem:

Learning architecture:

Supervised Learning

Given the following as input:

This is fed to a learning algorithm to build a predictive model that takes a new instance and returns/predicts the value for the target feature.

For continuous target variables, regression is used.

Measuring Performance

Common performance measures:

$$\text{error} = \frac{\text{number of incorrectly classified instances}}{\text{total number of instances}}$$

$$\text{accuracy} = 1 - \text{error} = \frac{\text{number of correctly classified instances}}{\text{total number of instances}}$$
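As a quick sketch (the label lists here are made up purely for illustration), these measures could be computed as:

```python
def error_rate(y_true, y_pred):
    """Fraction of instances classified incorrectly."""
    incorrect = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return incorrect / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of instances classified correctly (1 - error)."""
    return 1 - error_rate(y_true, y_pred)

# Hypothetical labels for illustration only
y_true = ["p", "n", "p", "p", "n"]
y_pred = ["p", "n", "n", "p", "p"]
print(error_rate(y_true, y_pred))  # 0.4
print(accuracy(y_true, y_pred))    # 0.6
```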

In binary classification problems, one class is called positive (p) and the other negative (n).

Common performance measures for regression:

Training and Test Sets

A set (multi-set) of examples is divided into training and test examples; this is required as the model can overfit the training data, giving high performance on the training data but low performance on unseen data.

The more complex the model is, the lower the error on the training data.

A general pattern is that beyond a certain complexity, increasing the complexity of the model increases the error on the test data.
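A rough sketch of this pattern, using polynomial degree as a stand-in for model complexity (the data, split sizes, and degrees below are arbitrary choices for illustration, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D regression data: noisy sine curve
x = rng.uniform(0, 3, 40)
y = np.sin(2 * x) + rng.normal(0, 0.2, 40)

# Random train/test split: 30 training examples, 10 test examples
idx = rng.permutation(len(x))
train, test = idx[:30], idx[30:]

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# Increasing polynomial degree = increasing model complexity
for degree in (1, 3, 6, 12):
    coeffs = np.polyfit(x[train], y[train], degree)
    train_err = mse(y[train], np.polyval(coeffs, x[train]))
    test_err = mse(y[test], np.polyval(coeffs, x[test]))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
# Typically: training error keeps falling with degree, while test error
# eventually rises again once the model starts to overfit.
```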

Naïve Bayes Model

$$P(C | X_1, \dots, X_n) = \alpha \cdot \prod_{i=1}^n{P(X_i | C)} \cdot P(C)$$

Where:

Conditional probabilities can be estimated from labelled data.

Find $P(Class | an\_input\_vector)$ for different classes and pick the class with the highest probability.

Problem: it is hard to learn $P(Class | Evidence)$ directly, as there need to be examples for every possible assignment of the evidence variables. As the number of features increases, the number of assignments grows exponentially.

Thus, assume the input features are conditionally independent given the class; this is the naïve Bayes model.

Example: Building a Classifier

Determine if the patient is susceptible to heart disease (y/n) given family history (t/f), fasting blood sugar level (l/h), and BMI (l, n, h).

Model it as $Hist$, $BG$, $BMI$ all having $Class$ as a parent:

$$P(Class | Hist, BG, BMI) = \alpha \cdot P(Class) \cdot P(Hist | Class) \cdot P(BG | Class) \cdot P(BMI | Class)$$

The class can take two values, so there are two tables per feature and two rows for $Hist$/$BG$ per table (three for $BMI$ as it has three values).

NB: in the quiz, you only store the value for when the class is true.

To calculate $\alpha$, calculate the sum of the unnormalized values of $P(Class | Hist, BG, BMI)$ (the products above) for all values of $Class$, and then take the reciprocal.
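A minimal sketch of the whole prediction step; the probability tables below are made-up placeholder numbers for illustration, not estimates from real data:

```python
# Hypothetical CPTs: P(Class) and P(Feature | Class) for each feature
p_class = {"y": 0.3, "n": 0.7}
p_hist = {"y": {"t": 0.6, "f": 0.4}, "n": {"t": 0.2, "f": 0.8}}
p_bg   = {"y": {"l": 0.3, "h": 0.7}, "n": {"l": 0.7, "h": 0.3}}
p_bmi  = {"y": {"l": 0.1, "n": 0.3, "h": 0.6}, "n": {"l": 0.3, "n": 0.5, "h": 0.2}}

def predict(hist, bg, bmi):
    # Unnormalized scores: P(Class) * P(Hist|Class) * P(BG|Class) * P(BMI|Class)
    scores = {c: p_class[c] * p_hist[c][hist] * p_bg[c][bg] * p_bmi[c][bmi]
              for c in p_class}
    alpha = 1 / sum(scores.values())            # normalization constant
    posterior = {c: alpha * s for c, s in scores.items()}
    return max(posterior, key=posterior.get), posterior

label, posterior = predict("t", "h", "h")
print(label, posterior)  # most likely class and its posterior probability
```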

Laplace Smoothing

Zero counts in small data sets lead to zero probabilities - this is too strong a claim based on only a sample.

To fix this, add a positive pseudo-count to the counts - this reduces the confidence of the estimates.

$domain(A)$ is the set of values $A$ can take; $|domain(A)|$ is the number of values $A$ can take

$count(constraints)$ is the number of examples in the dataset that satisfy the given constraints.

Given these:

$$P(A=a | B=b) \approx \frac{count(A=a, B=b) + pseudo\_count}{\sum_{a' \in domain(A)}{\left( count(A=a', B=b) + pseudo\_count \right)}}$$

This is equivalent to:

$$P(A=a | B=b) \approx \frac{count(A=a, B=b) + pseudo\_count}{count(B=b) + pseudo\_count \cdot |domain(A)|}$$

The greater the pseudo-count, the more the probabilities even out (move closer to $\frac{1}{|domain(A)|}$).
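A small sketch of the smoothed estimate, assuming the dataset is represented as a list of dictionaries mapping variable names to values (the variable names and toy data here are hypothetical):

```python
def count(data, **constraints):
    """Number of examples satisfying all the given variable=value constraints."""
    return sum(all(ex[var] == val for var, val in constraints.items()) for ex in data)

def smoothed_prob(data, A, a, B, b, domain_A, pseudo_count=1):
    """Laplace-smoothed estimate of P(A=a | B=b)."""
    numer = count(data, **{A: a, B: b}) + pseudo_count
    denom = count(data, **{B: b}) + pseudo_count * len(domain_A)
    return numer / denom

# Hypothetical toy dataset
data = [{"Hist": "t", "Class": "y"}, {"Hist": "t", "Class": "y"},
        {"Hist": "f", "Class": "n"}]
# No example has Hist=f with Class=y, yet the estimate is not zero:
print(smoothed_prob(data, "Hist", "f", "Class", "y", domain_A=("t", "f")))  # 0.25
```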

Parametric vs Non-Parametric Models

Parametric models are described by a fixed set of parameters; learning means finding the optimal values for these parameters.

Non-parametric models are not characterized by a fixed set of parameters - one family of non-parametric models is instance-based learning:

k-Nearest Neighbors

An example of an instance-based learning algorithm is k-nearest neighbors:

Training only requires storing all the examples.

Prediction: $H(x_{new})$:

If $k$ is too high, the model will underfit.
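A minimal k-nearest neighbors sketch, assuming numeric feature vectors, Euclidean distance, and prediction by majority vote among the $k$ closest stored examples (the example data is made up):

```python
import math
from collections import Counter

def knn_predict(examples, x_new, k=3):
    """examples: list of (feature_vector, label) pairs; predict the label of x_new."""
    nearest = sorted(examples, key=lambda ex: math.dist(ex[0], x_new))[:k]
    votes = Counter(label for _, label in nearest)   # majority vote among the k nearest
    return votes.most_common(1)[0][0]

# Hypothetical 2-D data: "training" is just storing the examples
examples = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((4.0, 4.1), "b"), ((3.8, 4.3), "b")]
print(knn_predict(examples, (1.1, 1.0), k=3))  # "a"
```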

Geometrically, each data point $x$ defines a ‘cell’ of space where any point within that cell has $x$ as its closest data point. As the target feature is discrete, decision boundaries can be drawn where the class is different on each side of the boundary.