07. Frequent Items, Association Rules, Similarity

Frequent Items

Exercise

100 items numbered 1 to 100; 100 baskets numbered 1 to 100.

Item $i$ is in basket $b$ iff $b \% i == 0$, i.e. $i$ divides $b$. Hence, item $i = 1$ is in all 100 baskets, and in general item $i$ is in $\lfloor 100 / i \rfloor$ baskets.

If support threshold is 5:

What items are frequent? $[1, \dots, 20]$, since $\lfloor 100 / i \rfloor \ge 5$ iff $i \le 20$.

What pairs of items are frequent? Pairs $(a, b)$ where both $a$ and $b$ divide some basket $c$ with $1 \le c \le 20$; equivalently, $\operatorname{lcm}(a, b) \le 20$.

Items greater than 20 can be eliminated immediately: such an item alone does not meet the support threshold, so no pair including it can meet the support threshold either.
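The exercise can be verified by brute force. This is a sketch (not part of the original notes) that builds the baskets, counts supports directly, and checks both answers:

```python
from math import lcm
from itertools import combinations

SUPPORT = 5

# Basket b contains item i iff i divides b.
baskets = [{i for i in range(1, 101) if b % i == 0} for b in range(1, 101)]

def count(items):
    """Number of baskets containing every item in `items`."""
    return sum(items <= basket for basket in baskets)

# Frequent items: item i appears in floor(100 / i) baskets, so i <= 20.
freq_items = [i for i in range(1, 101) if count({i}) >= SUPPORT]
assert freq_items == list(range(1, 21))

# Frequent pairs: (a, b) appears in every basket divisible by lcm(a, b),
# so its count is floor(100 / lcm(a, b)) — frequent iff lcm(a, b) <= 20.
freq_pairs = [(a, b) for a, b in combinations(freq_items, 2)
              if count({a, b}) >= SUPPORT]
assert all(lcm(a, b) <= 20 for a, b in freq_pairs)
```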

Association Rules

If-then rules about the contents of baskets.

$\{ i_1, \dots, i_k \} \rightarrow j$ means that if a basket contains all items $i_1, \dots, i_k$, then it is likely to contain $j$.

There are many rules; we want to find significant/interesting ones.

Confidence

The fraction of baskets containing $I = \{ i_1, \dots, i_k \}$ that also contain $j$:

$$ \text{confidence}(I \rightarrow j) = \frac{\text{support}(I \cup \{j\})}{\text{support}(I)} $$

Not all high-confidence rules are interesting (e.g. milk could be purchased very often so there are many association rules predicting milk will be bought).
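A minimal sketch of the confidence computation on made-up toy baskets (the items and basket contents are illustrative assumptions, not from the notes):

```python
# Toy market baskets (hypothetical data).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(items):
    """Number of baskets containing every item in `items`."""
    return sum(items <= basket for basket in baskets)

def confidence(I, j):
    """support(I ∪ {j}) / support(I)."""
    return support(I | {j}) / support(I)

print(confidence({"bread"}, "butter"))  # 2 of 3 bread baskets contain butter -> 2/3
```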

Interest

The absolute difference between the confidence of $I \rightarrow j$ and the proportion of baskets that contain $j$, $P(j)$.

$$ \text{Interest}(I \rightarrow j) = |\text{confidence}(I \rightarrow j) - P(j)| $$

The standard threshold for interest is $0.5$.
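The milk example above can be made concrete. In this sketch (toy baskets, assumed for illustration), milk is in most baskets, so a rule predicting milk has low interest even though its confidence is decent:

```python
# Toy market baskets (hypothetical data); milk is very common.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(items):
    return sum(items <= basket for basket in baskets)

def interest(I, j):
    """|confidence(I -> j) - P(j)|."""
    conf = support(I | {j}) / support(I)
    p_j = support({j}) / len(baskets)  # fraction of baskets containing j
    return abs(conf - p_j)

# confidence({bread} -> milk) = 2/3, but P(milk) = 3/4,
# so the interest |2/3 - 3/4| is low — the rule is uninteresting.
print(interest({"bread"}, "milk"))
```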

A-Priori Algorithm

Naïve approach to finding frequent pairs: count every pair of items in main memory.

Even just counting pairs requires $O(n^2)$ memory for $n$ items. Hence, a more efficient algorithm is needed.

Key idea: monotonicity. If a set of items $I$ appears at least $s$ times, so does every subset $J$ of $I$.

Contrapositive: if item $i$ does not appear in $s$ baskets, then no set including $i$ can appear in $s$ baskets.

Base case: the empty item-set, which appears in every basket.

Pass 1: read baskets and count, in main memory, the number of occurrences of each individual item. $O(n)$ memory required.

Pass 2: filter out items whose count is less than the support threshold. Then, read the baskets again and count only the pairs whose members are both frequent. $O(n^2)$ memory in the worst case, plus $O(n)$ memory for the frequent items from pass 1.
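The two passes can be sketched as follows (a minimal in-memory version; real implementations stream baskets from disk):

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Two-pass A-Priori for frequent pairs with support threshold s."""
    # Pass 1: count occurrences of each individual item.
    item_counts = Counter(i for basket in baskets for i in basket)
    frequent = {i for i, c in item_counts.items() if c >= s}
    # Pass 2: count only pairs of frequent items — by monotonicity,
    # a pair containing an infrequent item cannot be frequent.
    pair_counts = Counter(
        pair
        for basket in baskets
        for pair in combinations(sorted(i for i in basket if i in frequent), 2)
    )
    return {pair for pair, c in pair_counts.items() if c >= s}

# Item 4 appears only once, so pairs containing it are never counted.
baskets = [{1, 2, 3}, {1, 2, 4}, {1, 3}, {2, 3}, {1, 2, 3}]
print(apriori_pairs(baskets, s=3))  # each remaining pair appears 3 times
```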

Similarity

Many problems can be expressed as finding ‘similar’ sets, i.e. finding nearest neighbors in a high-dimensional space.

Examples:

Given:

Goal:

This can be done in $O(N)$.

Distance metrics

Jaccard Distance: based on Jaccard similarity, where the similarity of two sets is the size of their intersection divided by the size of their union; the distance is $1$ minus the similarity.

$$ \text{similarity}(C_1, C_2) = \frac{| C_1 \cap C_2 |}{| C_1 \cup C_2 |} $$
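As a quick sketch, the similarity and the corresponding distance fall straight out of Python's set operations:

```python
def jaccard_similarity(c1, c2):
    """|intersection| / |union| of two sets."""
    return len(c1 & c2) / len(c1 | c2)

def jaccard_distance(c1, c2):
    """1 minus the Jaccard similarity."""
    return 1 - jaccard_similarity(c1, c2)

print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # |{2,3}| / |{1,2,3,4}| = 0.5
```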

Distance Measure Axioms

$$ \begin{aligned} d(x, y) &\ge 0 \\ d(x, y) &= 0 \text{ iff } x = y \\ d(x, y) &= d(y, x) \\ d(x, y) &\le d(x, z) + d(z, y) \end{aligned} $$

The last one is the triangle inequality: it ensures the direct route between two points is never longer than a detour through a third point.
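Jaccard distance satisfies all four axioms. A brute-force spot check of the triangle inequality on random sets (a sketch, not a proof):

```python
from itertools import combinations
from random import sample, seed

def jaccard_distance(c1, c2):
    """1 - |intersection| / |union|."""
    return 1 - len(c1 & c2) / len(c1 | c2)

seed(0)
# Random 8-element subsets of {0, ..., 19}.
sets = [set(sample(range(20), 8)) for _ in range(30)]

# Check d(x, y) <= d(x, z) + d(z, y) for every triple.
for x, y, z in combinations(sets, 3):
    assert jaccard_distance(x, y) <= jaccard_distance(x, z) + jaccard_distance(z, y)
```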