Similarity
Finding near-neighbors in a high-dimensional space.
Jaccard Similarity
If dimensions are Boolean values (e.g. whether some specific keyword is present or not), the Jaccard similarity of two sets is the size of their intersection divided by the size of their union: sim(A, B) = |A ∩ B| / |A ∪ B|.
Jaccard bag similarity:
- Don’t remove duplicates; if a number appears multiple times, count it multiple times
- e.g. take the minimum count of each element for the intersection
- Size of union is the sum of the sizes of the two bags
- Max similarity is 1/2, since each shared element counts once in the intersection but twice in the union
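A minimal sketch of both variants (function names are illustrative), using Python's `set` and `collections.Counter`:

```python
from collections import Counter

def jaccard_set(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def jaccard_bag(a, b):
    """Bag (multiset) Jaccard: duplicates are kept; the intersection
    takes the minimum count of each element, and the union size is
    the sum of the bag sizes, so the maximum similarity is 1/2."""
    ca, cb = Counter(a), Counter(b)
    inter = sum((ca & cb).values())            # min count per element
    union = sum(ca.values()) + sum(cb.values())
    return inter / union

print(jaccard_set("abc", "abd"))          # {a,b} shared of {a,b,c,d} → 0.5
print(jaccard_bag([1, 1, 2], [1, 2, 2]))  # intersection 2, union 6 → 1/3
```

Note that two identical bags score only 1/2, the maximum for the bag variant.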
Cosine Distance
Angle whose cosine is the normalized dot product of the vectors: d(x, y) = arccos((x · y) / (‖x‖ ‖y‖)).
Converting to frequency set: convert from array to table with value and number of times the value appears in the array.
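A small sketch of both ideas (names are illustrative): the arrays are converted to frequency tables with `collections.Counter`, then the angle between the frequency vectors is taken:

```python
import math
from collections import Counter

def cosine_distance(xs, ys):
    """Angle (in radians) whose cosine is the normalized dot product
    of the two frequency vectors."""
    fx, fy = Counter(xs), Counter(ys)  # array → {value: count} table
    dot = sum(fx[v] * fy[v] for v in fx)
    norm = math.sqrt(sum(c * c for c in fx.values())) * \
           math.sqrt(sum(c * c for c in fy.values()))
    # Clamp to guard against floating-point overshoot before acos
    return math.acos(max(-1.0, min(1.0, dot / norm)))

print(cosine_distance([1], [2]))  # disjoint values → orthogonal vectors, angle π/2
```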
Other Distance Measures
- L₁ norm: Manhattan distance; sum of absolute differences in each dimension
- L₂ norm: Pythagorean distance; square root of the sum of squared differences
- L∞ norm: maximum absolute difference in any one dimension
- Edit distance
- Hamming distance
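The measures above (other than edit distance) can be sketched in a few lines; function names are illustrative:

```python
def l1(x, y):
    """L1 (Manhattan) norm: sum of absolute differences per dimension."""
    return sum(abs(a - b) for a, b in zip(x, y))

def l2(x, y):
    """L2 (Pythagorean) norm: square root of summed squared differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def linf(x, y):
    """L-infinity norm: maximum absolute difference in any one dimension."""
    return max(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Hamming distance: number of positions at which symbols differ."""
    return sum(a != b for a, b in zip(x, y))

print(l1((0, 0), (3, 4)), l2((0, 0), (3, 4)), linf((0, 0), (3, 4)))  # 7 5.0 4
print(hamming("karolin", "kathrin"))  # 3
```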
Hashing
Given a large number of documents, find the pairs that are near-duplicates of each other.
Hashing has previously been useful for finding exact matches; for this problem, we need a locality-sensitive hash that puts similar documents into the same bucket.
Shingling
Converting documents to sets: a k-shingle is any sequence of k consecutive tokens appearing in the document.
- Tokens can be characters or words (usually characters)
e.g. the 2-shingles of abcab are {ab, bc, ca}
When words are misspelt, only a few characters are wrong or out of order; hence, most of the shingles are likely to be the same as those of the correctly-spelt word.
Compression: hash long shingles and represent a document by the set of hash values. This can lead to false positives, but these can be checked using a slower check before returning the results.
Caveat: k must be large enough that any given shingle is unlikely to appear in an arbitrary document:
- k = 5 okay for short documents
- k = 10 better for long documents
Sampling may be required for larger datasets.
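A minimal sketch of shingling and hash compression (the choice of `hashlib.sha1` and the 4-byte truncation are illustrative, not prescribed by the notes):

```python
import hashlib

def shingles(doc, k=5):
    """Set of all length-k character shingles of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def hashed_shingles(doc, k=9):
    """Compress long shingles to 4-byte hash values. Collisions can
    cause false positives, so candidate pairs should be re-checked
    with a slower comparison before returning results."""
    return {int.from_bytes(hashlib.sha1(s.encode()).digest()[:4], "big")
            for s in shingles(doc, k)}

print(shingles("abcab", k=2))  # {'ab', 'bc', 'ca'}
```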
Min-Hashing
NB: see DATA301 Exam Notes for better notes.
Naïve approach is to compute pairwise similarities for every pair of documents: for N documents this is N(N − 1)/2 ≈ N²/2 comparisons, which is infeasible at scale.
The problem can be reduced to finding subsets that have significant intersections: encode sets as a vector of Booleans, one for every dimension; intersections are bitwise ANDs and unions bitwise ORs.
The set of documents can be stored as a Boolean matrix:
- Each row represents an element (shingles)
- Each column represents a set (documents)
- `matrix[e][s]` is 1 if shingle `shingles[e]` appears in `documents[s]`
- Column similarity is the Jaccard similarity of the corresponding sets
This generates a sparse matrix, and computing column similarities directly still requires comparing every pair of columns.
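The Boolean-matrix representation can be sketched as follows (function names are illustrative; a dense list-of-lists is used for clarity, although a real implementation would use a sparse structure):

```python
def boolean_matrix(documents, k=2):
    """Rows = shingles, columns = documents;
    matrix[e][s] is 1 iff shingle shingles[e] appears in documents[s]."""
    doc_shingles = [{d[i:i + k] for i in range(len(d) - k + 1)}
                    for d in documents]
    shingles = sorted(set().union(*doc_shingles))
    matrix = [[int(sh in ds) for ds in doc_shingles] for sh in shingles]
    return shingles, matrix

def column_similarity(matrix, c1, c2):
    """Jaccard similarity of two columns: bitwise AND for the
    intersection, bitwise OR for the union."""
    inter = sum(row[c1] & row[c2] for row in matrix)
    union = sum(row[c1] | row[c2] for row in matrix)
    return inter / union

shingles, matrix = boolean_matrix(["abcd", "bcda"], k=2)
print(column_similarity(matrix, 0, 1))  # shared {bc, cd} of {ab, bc, cd, da} → 0.5
```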
Key idea: ‘hash’ each column C to a small signature h(C):
- The signature fits in RAM
- If sim(C1, C2) is high, there is a high probability that h(C1) = h(C2), and vice versa if they are dissimilar
This works as the domain of the shingles is very large but only a small number of them are likely to be encountered. Hashing reduces the size of the domain to reduce memory requirements at the cost of some overlaps.
Hence, given a set of documents, the hash function outputs an array of hashes.
By doing this multiple times with different hash functions, the similarity of two signatures becomes the fraction of hash functions on which they agree.
Hence, this produces a signature matrix: one row per hash function, one column per document.
For some hash function defined by a random permutation π of the rows, the min-hash of a column C is the index of the first row, in the permuted order, in which C has a 1: hπ(C) = min{ π(r) : C[r] = 1 }.
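A permutation-based sketch of min-hashing (in practice, random hash functions simulate the permutations; names and the fixed seed are illustrative):

```python
import random

def minhash_signatures(sets, universe, n_hashes=100, seed=42):
    """Signature matrix: one row per random permutation of the universe
    of shingles, one column per set. Each entry is the position of the
    first element, in permuted order, that the set contains."""
    rng = random.Random(seed)
    universe = list(universe)
    sig = []
    for _ in range(n_hashes):
        perm = universe[:]
        rng.shuffle(perm)  # one random permutation = one hash function
        sig.append([next(i for i, e in enumerate(perm) if e in s)
                    for s in sets])
    return sig

def signature_similarity(sig, c1, c2):
    """Fraction of hash functions on which two columns agree;
    this estimates the Jaccard similarity of the underlying sets."""
    return sum(row[c1] == row[c2] for row in sig) / len(sig)

docs = [{"a", "b", "c"}, {"b", "c", "d"}]  # true Jaccard similarity = 0.5
sig = minhash_signatures(docs, "abcd", n_hashes=200)
print(signature_similarity(sig, 0, 1))  # ≈ 0.5
```

More hash functions give a tighter estimate, at the cost of a longer signature.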
Locality-Sensitive Hashing
Find documents with Jaccard similarity of at least some threshold s.
The signature matrix is divided into b bands of r rows each.
For each column, generate hashes for each band and put them into buckets (separate set of buckets for each band).
The more buckets match for a pair of documents, the more likely they are to be similar:
Using 1 band of one row gives a linear relationship between similarity and the probability of sharing a bucket. However, with b bands of r rows each, the probability that two documents with similarity s share at least one bucket is 1 − (1 − s^r)^b, which forms an S-curve: pairs well above the threshold are very likely to become candidates, and pairs well below it are not.
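The banding probability can be sketched directly (the parameter values below are illustrative):

```python
def candidate_probability(s, b, r):
    """Probability that two documents with signature similarity s land
    in the same bucket for at least one of b bands of r rows each."""
    return 1 - (1 - s ** r) ** b

# One band of one row is linear in s; many bands form an S-curve.
for s in (0.2, 0.5, 0.8):
    print(s, candidate_probability(s, b=20, r=5))
```

Raising r sharpens the curve; raising b shifts the threshold lower.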