Frequent Items
Exercise
100 items numbered 1 to 100; 100 baskets numbered 1 to 100.
Item i is in basket
If support threshold is 5:
What items are frequent?
What pairs of items are frequent? items
Numbers greater than 20 can be eliminated immediately: the item alone does not exceed the support threshold, so pairs including that item cannot exceed the support threshold either.
Association Rules
If-then rules about the contents of baskets.
There are many rules; we want to find significant/interesting ones.
Confidence
The probability of
Not all high-confidence rules are interesting (e.g. milk could be purchased very often so there are many association rules predicting milk will be bought).
Interest
The difference between the confidence in
The standard threshold for interest is
A-Priori Algorithm
Naïve approach to finding frequent pairs:
- Read file once, count occurrence of each pairs
- From each basket with
items, generate its pairs ( )
Even finding pairs is an
Key idea: monotonicity. If a set of items
Contrapositive: if item
Base case: item-set containing nothing.
Pass 1: read baskets and count in main memory number of occurrences for each individual item.
Pass 2: Filter out items where the count is less than the support threshold. Then, calculate counts for all pairs.
Similarity
Many problems can be expressed as finding ‘similar’ sets: finding nearest neighbors in a high-dimensional space
Examples:
- Searching for pages with similar words/phrasing e.g. plagiarism detection
- Customers who purchased similar products - find products with similar customer sets
Given:
- Data points
(may need to flatten out matrices into a vector) - Distance function $d(x_1, x_2)
Goal:
- Find all pairs of data points
that are within some threshold distance
This can be done in
Distance metrics
Jaccard Distance: similarity of two sets is the size of their intersection divided by size of their union
Distance Measure Axioms
The last one is triangle inequality: ensures that direct route is always faster.