07. Frequent Items, Association Rules, Similarity

Frequent Items

Exercise

100 items numbered 1 to 100; 100 baskets numbered 1 to 100.

Item $i$ is in basket $b$ iff $b \% i == 0$, i.e. $i$ divides $b$. Hence, item $i = 1$ is in all 100 baskets, and in general item $i$ is in $\lfloor 100 / i \rfloor$ baskets.

If support threshold is 5:

What items are frequent? $[1, \dots, 20]$, since $\lfloor 100 / i \rfloor \ge 5$ iff $i \le 20$.

What pairs of items are frequent? Pairs $(a, b)$ where both $a$ and $b$ divide some basket $c$ with $1 \le c \le 20$; equivalently, $\operatorname{lcm}(a, b) \le 20$.

Items greater than 20 can be eliminated immediately: such an item alone does not meet the support threshold, so no pair including it can meet the support threshold either.
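The exercise can be verified by brute force. This is a sketch (not part of the original notes) that builds the baskets, counts supports directly, and checks both answers:

```python
from math import lcm
from itertools import combinations

SUPPORT = 5

# Basket b contains item i iff i divides b.
baskets = [{i for i in range(1, 101) if b % i == 0} for b in range(1, 101)]

def count(items):
    """Number of baskets containing every item in `items`."""
    return sum(items <= basket for basket in baskets)

# Frequent items: item i appears in floor(100 / i) baskets, so i <= 20.
freq_items = [i for i in range(1, 101) if count({i}) >= SUPPORT]
assert freq_items == list(range(1, 21))

# Frequent pairs: (a, b) appears in every basket divisible by lcm(a, b),
# so its count is floor(100 / lcm(a, b)) — frequent iff lcm(a, b) <= 20.
freq_pairs = [(a, b) for a, b in combinations(freq_items, 2)
              if count({a, b}) >= SUPPORT]
assert all(lcm(a, b) <= 20 for a, b in freq_pairs)
```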

Association Rules

If-then rules about the contents of baskets.

$\{ i_1, \dots, i_k \} \rightarrow j$ means that if a basket contains all items $i_1, \dots, i_k$, then it is likely to contain $j$.

There are many rules; we want to find significant/interesting ones.

Confidence

The fraction of baskets containing $I = \{ i_1, \dots, i_k \}$ that also contain $j$:

$$ \text{confidence}(I \rightarrow j) = \frac{\text{support}(I \cup \{j\})}{\text{support}(I)} $$

Not all high-confidence rules are interesting (e.g. milk could be purchased very often so there are many association rules predicting milk will be bought).
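A minimal sketch of the confidence computation on made-up toy baskets (the items and basket contents are illustrative assumptions, not from the notes):

```python
# Toy market baskets (hypothetical data).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(items):
    """Number of baskets containing every item in `items`."""
    return sum(items <= basket for basket in baskets)

def confidence(I, j):
    """support(I ∪ {j}) / support(I)."""
    return support(I | {j}) / support(I)

print(confidence({"bread"}, "butter"))  # 2 of 3 bread baskets contain butter -> 2/3
```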

Interest

The absolute difference between the confidence of $I \rightarrow j$ and the proportion of baskets that contain $j$, $P(j)$.

$$ \text{Interest}(I \rightarrow j) = |\text{confidence}(I \rightarrow j) - P(j)| $$

The standard threshold for interest is $0.5$.
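The milk example above can be made concrete. In this sketch (toy baskets, assumed for illustration), milk is in most baskets, so a rule predicting milk has low interest even though its confidence is decent:

```python
# Toy market baskets (hypothetical data); milk is very common.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(items):
    return sum(items <= basket for basket in baskets)

def interest(I, j):
    """|confidence(I -> j) - P(j)|."""
    conf = support(I | {j}) / support(I)
    p_j = support({j}) / len(baskets)  # fraction of baskets containing j
    return abs(conf - p_j)

# confidence({bread} -> milk) = 2/3, but P(milk) = 3/4,
# so the interest |2/3 - 3/4| is low — the rule is uninteresting.
print(interest({"bread"}, "milk"))
```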

A-Priori Algorithm

Naïve approach to finding frequent pairs: count every pair of items in main memory.

Even just counting pairs requires $O(n^2)$ memory for $n$ items. Hence, a more efficient algorithm is needed.

Key idea: monotonicity. If a set of items $I$ appears at least $s$ times, so does every subset $J$ of $I$.

Contrapositive: if item $i$ does not appear in $s$ baskets, then no set including $i$ can appear in $s$ baskets.

Base case: the empty item-set, which appears in every basket.

Pass 1: read baskets and count, in main memory, the number of occurrences of each individual item. $O(n)$ memory required.

Pass 2: filter out items whose count is less than the support threshold. Then, read the baskets again and count only the pairs whose members are both frequent. $O(n^2)$ memory in the worst case, plus $O(n)$ memory for the frequent items from pass 1.
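The two passes can be sketched as follows (a minimal in-memory version; real implementations stream baskets from disk):

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Two-pass A-Priori for frequent pairs with support threshold s."""
    # Pass 1: count occurrences of each individual item.
    item_counts = Counter(i for basket in baskets for i in basket)
    frequent = {i for i, c in item_counts.items() if c >= s}
    # Pass 2: count only pairs of frequent items — by monotonicity,
    # a pair containing an infrequent item cannot be frequent.
    pair_counts = Counter(
        pair
        for basket in baskets
        for pair in combinations(sorted(i for i in basket if i in frequent), 2)
    )
    return {pair for pair, c in pair_counts.items() if c >= s}

# Item 4 appears only once, so pairs containing it are never counted.
baskets = [{1, 2, 3}, {1, 2, 4}, {1, 3}, {2, 3}, {1, 2, 3}]
print(apriori_pairs(baskets, s=3))  # each remaining pair appears 3 times
```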

Similarity

Many problems can be expressed as finding ‘similar’ sets, i.e. finding nearest neighbors in a high-dimensional space.

Examples:

Given:

Goal:

This can be done in $O(N)$.

Distance metrics

Jaccard Distance: based on Jaccard similarity, where the similarity of two sets is the size of their intersection divided by the size of their union; the distance is $1$ minus the similarity.

$$ \text{similarity}(C_1, C_2) = \frac{| C_1 \cap C_2 |}{| C_1 \cup C_2 |} $$
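As a quick sketch, the similarity and the corresponding distance fall straight out of Python's set operations:

```python
def jaccard_similarity(c1, c2):
    """|intersection| / |union| of two sets."""
    return len(c1 & c2) / len(c1 | c2)

def jaccard_distance(c1, c2):
    """1 minus the Jaccard similarity."""
    return 1 - jaccard_similarity(c1, c2)

print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # |{2,3}| / |{1,2,3,4}| = 0.5
```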

Distance Measure Axioms

$$ \begin{aligned} d(x, y) &\ge 0 \\ d(x, y) &= 0 \text{ iff } x = y \\ d(x, y) &= d(y, x) \\ d(x, y) &\le d(x, z) + d(z, y) \end{aligned} $$

The last one is the triangle inequality: it ensures the direct route between two points is never longer than a detour through a third point.
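Jaccard distance satisfies all four axioms. A brute-force spot check of the triangle inequality on random sets (a sketch, not a proof):

```python
from itertools import combinations
from random import sample, seed

def jaccard_distance(c1, c2):
    """1 - |intersection| / |union|."""
    return 1 - len(c1 & c2) / len(c1 | c2)

seed(0)
# Random 8-element subsets of {0, ..., 19}.
sets = [set(sample(range(20), 8)) for _ in range(30)]

# Check d(x, y) <= d(x, z) + d(z, y) for every triple.
for x, y, z in combinations(sets, 3):
    assert jaccard_distance(x, y) <= jaccard_distance(x, z) + jaccard_distance(z, y)
```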