08. Similarity and Hashing

Similarity

Finding near-neighbors in a high-dimensional space.

Jaccard Similarity

If dimensions are Boolean values (e.g. whether some specific keyword is present or not):

$$ \text{similarity}(C_1, C_2) = \frac{\left| C_1 \cap C_2 \right|}{\left| C_1 \cup C_2 \right|}, \qquad \text{distance} = 1 - \text{similarity} $$

Jaccard bag similarity: count each element in the intersection the minimum number of times it appears in the two bags, and in the union the total number of times it appears across both bags (so the bag similarity is at most 1/2).
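
A minimal Python sketch of both variants (the bag version uses the min-over-sum convention described above; function names are illustrative):

```python
from collections import Counter

def jaccard_set(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def jaccard_bag(a, b):
    """Bag (multiset) variant: intersection counts the minimum multiplicity,
    union counts the sum of multiplicities (so the result is at most 1/2)."""
    ca, cb = Counter(a), Counter(b)
    inter = sum(min(ca[x], cb[x]) for x in ca.keys() & cb.keys())
    union = sum(ca.values()) + sum(cb.values())
    return inter / union if union else 1.0

print(jaccard_set("abcd", "bcde"))      # 3/5 = 0.6
print(1 - jaccard_set("abcd", "bcde"))  # Jaccard distance = 0.4
print(jaccard_bag("aaab", "aab"))       # min counts: a=2, b=1 -> 3/7
```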

Cosine Distance

Angle whose cosine is the normalized dot product of the vectors:

$$ d(p_1, p_2) = \theta = \arccos{\frac{p_1 \cdot p_2}{|p_1| \cdot |p_2|}} $$

Converting to a frequency set: convert the array into a table mapping each distinct value to the number of times it appears in the array.
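
A sketch of both steps in Python, assuming sparse frequency vectors built with `collections.Counter`:

```python
import math
from collections import Counter

def to_freq(values):
    """Frequency table: each distinct value mapped to its count in the array."""
    return Counter(values)

def cosine_distance(f1, f2):
    """Angle (in radians) between two sparse frequency vectors."""
    dot = sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())
    norm1 = math.sqrt(sum(v * v for v in f1.values()))
    norm2 = math.sqrt(sum(v * v for v in f2.values()))
    # Clamp to [-1, 1] to guard against floating-point error before arccos.
    return math.acos(max(-1.0, min(1.0, dot / (norm1 * norm2))))

d1 = to_freq(["apple", "banana", "apple"])
d2 = to_freq(["apple", "cherry"])
print(cosine_distance(d1, d2))  # angle in radians; 0 means identical direction
```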

Other Distance Measures

Hashing

Given a large number ($N > 1{,}000{,}000$) of documents, find ‘near duplicate’ pairs.

Hashing has previously been useful for finding exact matches; for this problem, we need a locality-sensitive hash that puts similar documents into the same bucket.

Shingling

Converting documents to sets (e.g. set of words).

A $k$-shingle (or $k$-gram) is a sequence of $k$ tokens that appears in a document:

e.g. $D_1 = abcab$; its 2-grams are $S(D_1) = \{ab, bc, ca\}$ (and, as a bag, $S'(D_1) = \{ab, bc, ca, ab\}$).
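
A sketch of character shingling in Python reproducing the example above; the set/bag distinction is just whether duplicate shingles are kept:

```python
def shingles(doc, k=2):
    """Set of k-grams (consecutive length-k character sequences) in the document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def shingle_bag(doc, k=2):
    """Bag variant: keeps repeated shingles."""
    return [doc[i:i + k] for i in range(len(doc) - k + 1)]

print(shingles("abcab"))     # {'ab', 'bc', 'ca'}
print(shingle_bag("abcab"))  # ['ab', 'bc', 'ca', 'ab']
```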

When a word is misspelt, only a few characters are wrong or out of order, so most of its shingles are likely to match those of the correctly-spelt word.

Compression: hash long shingles and represent a document by the set of hash values. This can lead to false positives, but these can be eliminated with a slower exact check before returning the results.
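
A sketch of the compression step, hashing each shingle to an integer; the choice of MD5 and the bucket count are illustrative assumptions:

```python
import hashlib

def hashed_shingles(doc, k=9, buckets=2**32):
    """Represent a document by the set of hashed k-shingles.
    Distinct shingles can collide (false positives), so candidate matches
    should be re-checked against the original text before being reported."""
    out = set()
    for i in range(len(doc) - k + 1):
        shingle = doc[i:i + k].encode("utf-8")
        # Stable hash: first 8 hex digits of MD5, reduced to the bucket range.
        out.add(int(hashlib.md5(shingle).hexdigest()[:8], 16) % buckets)
    return out

sig = hashed_shingles("the quick brown fox jumps over the lazy dog")
print(len(sig), "hashed 9-shingles")
```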

Caveat: the value of $k$ must be picked appropriately: if $k$ is too small, most documents will share most of their shingles; if $k$ is too large, the shingle sets become unnecessarily large. Commonly $k \approx 5$ for short documents and $k \approx 9$–$10$ for long ones.

Sampling may be required for larger datasets.

Min-Hashing

NB: see DATA301 Exam Notes for better notes.

The naïve approach is to compute the similarity of every pair of documents: $O(n^2)$ comparisons.

The problem can be reduced to finding subsets that have significant intersections: encode each set as a vector of Booleans, one per dimension; intersections are then bitwise ANDs and unions bitwise ORs.

The set of documents can be stored as a Boolean matrix: rows are shingles (elements of the universal set), columns are documents (sets), and entry $(i, j)$ is 1 if document $j$ contains shingle $i$.

This generates a sparse matrix and requires $O(n^2)$ space.
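
A sketch of the Boolean encoding with NumPy (rows are shingles, columns are documents); the AND/OR trick then gives pairwise Jaccard similarities directly:

```python
import numpy as np

docs = ["abcab", "bcabd", "xyzxy"]
k = 2

# Universe of shingles (one row per shingle), documents as Boolean columns.
shingle_sets = [{d[i:i + k] for i in range(len(d) - k + 1)} for d in docs]
universe = sorted(set().union(*shingle_sets))
M = np.array([[s in ss for ss in shingle_sets] for s in universe], dtype=bool)

def jaccard_cols(M, i, j):
    """Jaccard similarity of columns i and j: bitwise AND over bitwise OR."""
    inter = np.count_nonzero(M[:, i] & M[:, j])
    union = np.count_nonzero(M[:, i] | M[:, j])
    return inter / union if union else 1.0

print(jaccard_cols(M, 0, 1))  # 0.75 — documents 0 and 1 share most shingles
print(jaccard_cols(M, 0, 2))  # 0.0 — no overlap
```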

Key idea: ‘hash’ each column $C$ to a small signature $h(C)$ such that $h(C)$ is small enough to fit in memory and similar columns are likely to produce the same signature.

This works because the domain of the shingles is very large but only a small fraction of them are likely to be encountered. Hashing shrinks the domain, reducing memory requirements at the cost of some collisions.

Hence, given a set of documents, a hash function outputs an array of hash values, one per document.

By doing this multiple times with different hash functions, the similarity of two signatures becomes the fraction of hash functions they agree on.

Hence, this produces a signature matrix $M$, with each row representing a hash function and each column a set (document).

For a min-hash function $h_\pi$ defined by a random permutation $\pi$ of the rows:

$$ P(h_\pi(C_1) = h_\pi(C_2)) \approx \text{similarity}(C_1, C_2) $$
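
A sketch of min-hashing with explicit random row permutations (in practice, random hash functions of the row index are used instead of materialised permutations); it assumes every document contains at least one shingle:

```python
import numpy as np

rng = np.random.default_rng(42)

def minhash_signatures(M, num_hashes=100):
    """Signature matrix: one row per permutation, one column per document.
    Each entry is the position (under that permutation) of the first row
    in which the document's column contains a 1."""
    n_rows, n_docs = M.shape
    sig = np.empty((num_hashes, n_docs), dtype=int)
    for h in range(num_hashes):
        perm = rng.permutation(n_rows)
        # For each column, the minimum permuted index over its non-zero rows.
        sig[h] = [perm[M[:, c]].min() for c in range(n_docs)]
    return sig

def estimate_similarity(sig, i, j):
    """Fraction of hash functions on which the two signatures agree."""
    return np.mean(sig[:, i] == sig[:, j])

# Example (reusing M from the Boolean-matrix sketch above):
# sig = minhash_signatures(M, num_hashes=200)
# print(estimate_similarity(sig, 0, 1))  # should approximate jaccard_cols(M, 0, 1)
```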

Locality-Sensitive Hashing

Find documents with Jaccard similarity of at least $s$. LSH uses a function $f(x, y)$ that determines whether $x$ and $y$ are a candidate pair: a pair of elements whose similarity must be evaluated in full.

The signature matrix $M$ is partitioned into $b$ bands, with $r$ rows (hash functions) per band.

For each column, hash each band's portion of the column and put the column into the corresponding bucket (a separate set of buckets is used for each band).

The more bands in which a pair of documents share a bucket, the more likely they are to be similar:

Using one band of one row gives a linear relationship between similarity and the probability of sharing a bucket. However, with $b$ bands of $r$ rows:

$$ P(\text{sharing buckets}) = 1 - (1 - t^r)^b $$

Here $t$ is the similarity of the pair, and $rb$ gives the total number of hash functions per document.
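
A sketch of the banding step; using the raw band slice as the bucket key (rather than a further hash of it) is an implementation shortcut:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(sig, b, r):
    """Split the signature matrix (rb rows x n docs) into b bands of r rows;
    documents whose band slices land in the same bucket in any band
    become candidate pairs."""
    num_hashes, n_docs = sig.shape
    assert num_hashes == b * r, "signature length must equal b * r"
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)  # separate set of buckets per band
        for doc in range(n_docs):
            key = tuple(sig[band * r:(band + 1) * r, doc])
            buckets[key].append(doc)
        for docs_in_bucket in buckets.values():
            candidates.update(combinations(docs_in_bucket, 2))
    return candidates

# The S-curve 1 - (1 - t^r)^b rises steeply around the threshold (1/b)^(1/r).
b, r = 20, 5
print((1 / b) ** (1 / r))  # ≈ 0.55: pairs above this similarity are likely candidates
```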