08. Similarity and Hashing

Similarity

Finding near-neighbors in a high-dimensional space.

Jaccard Similarity

If dimensions are Boolean values (e.g. whether some specific keyword is present or not):

$$ \text{similarity}(C_1, C_2) = \frac{\left| C_1 \cap C_2 \right|}{\left| C_1 \cup C_2 \right|}, \qquad \text{distance} = 1 - \text{similarity} $$

Jaccard bag similarity: count each element in the intersection the minimum number of times it appears in the two bags, and in the union the total number of times it appears across both bags (so the bag similarity is at most 1/2).
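
A minimal Python sketch of both variants (the bag version uses the min-over-sum convention described above; function names are illustrative):

```python
from collections import Counter

def jaccard_set(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def jaccard_bag(a, b):
    """Bag (multiset) variant: intersection counts the minimum multiplicity,
    union counts the sum of multiplicities (so the result is at most 1/2)."""
    ca, cb = Counter(a), Counter(b)
    inter = sum(min(ca[x], cb[x]) for x in ca.keys() & cb.keys())
    union = sum(ca.values()) + sum(cb.values())
    return inter / union if union else 1.0

print(jaccard_set("abcd", "bcde"))      # 3/5 = 0.6
print(1 - jaccard_set("abcd", "bcde"))  # Jaccard distance = 0.4
print(jaccard_bag("aaab", "aab"))       # min counts: a=2, b=1 -> 3/7
```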

Cosine Distance

Angle whose cosine is the normalized dot product of the vectors:

$$ d(p_1, p_2) = \theta = \arccos{\frac{p_1 \cdot p_2}{|p_1| \cdot |p_2|}} $$

Converting to a frequency set: convert the array into a table mapping each distinct value to the number of times it appears in the array.
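
A sketch of both steps in Python, assuming sparse frequency vectors built with `collections.Counter`:

```python
import math
from collections import Counter

def to_freq(values):
    """Frequency table: each distinct value mapped to its count in the array."""
    return Counter(values)

def cosine_distance(f1, f2):
    """Angle (in radians) between two sparse frequency vectors."""
    dot = sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())
    norm1 = math.sqrt(sum(v * v for v in f1.values()))
    norm2 = math.sqrt(sum(v * v for v in f2.values()))
    # Clamp to [-1, 1] to guard against floating-point error before arccos.
    return math.acos(max(-1.0, min(1.0, dot / (norm1 * norm2))))

d1 = to_freq(["apple", "banana", "apple"])
d2 = to_freq(["apple", "cherry"])
print(cosine_distance(d1, d2))  # angle in radians; 0 means identical direction
```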

Other Distance Measures

Hashing

Given a large number ($N > 1{,}000{,}000$) of documents, find ‘near duplicate’ pairs.

Hashing has previously been useful for finding exact matches; for this problem, we need a locality-sensitive hash that puts similar documents into the same bucket.

Shingling

Converting documents to sets (e.g. set of words).

A $k$-shingle (or $k$-gram) is a sequence of $k$ tokens that appears in a document:

e.g. $D_1 = abcab$; its 2-grams are $S(D_1) = \{ab, bc, ca\}$ (and, as a bag, $S'(D_1) = \{ab, bc, ca, ab\}$).
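
A sketch of character shingling in Python reproducing the example above; the set/bag distinction is just whether duplicate shingles are kept:

```python
def shingles(doc, k=2):
    """Set of k-grams (consecutive length-k character sequences) in the document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def shingle_bag(doc, k=2):
    """Bag variant: keeps repeated shingles."""
    return [doc[i:i + k] for i in range(len(doc) - k + 1)]

print(shingles("abcab"))     # {'ab', 'bc', 'ca'}
print(shingle_bag("abcab"))  # ['ab', 'bc', 'ca', 'ab']
```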

When a word is misspelt, only a few characters are wrong or out of order, so most of its shingles are likely to match those of the correctly-spelt word.

Compression: hash long shingles and represent a document by the set of hash values. This can lead to false positives, but these can be eliminated with a slower exact check before returning the results.
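
A sketch of the compression step, hashing each shingle to an integer; the choice of MD5 and the bucket count are illustrative assumptions:

```python
import hashlib

def hashed_shingles(doc, k=9, buckets=2**32):
    """Represent a document by the set of hashed k-shingles.
    Distinct shingles can collide (false positives), so candidate matches
    should be re-checked against the original text before being reported."""
    out = set()
    for i in range(len(doc) - k + 1):
        shingle = doc[i:i + k].encode("utf-8")
        # Stable hash: first 8 hex digits of MD5, reduced to the bucket range.
        out.add(int(hashlib.md5(shingle).hexdigest()[:8], 16) % buckets)
    return out

sig = hashed_shingles("the quick brown fox jumps over the lazy dog")
print(len(sig), "hashed 9-shingles")
```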

Caveat: the value of $k$ must be picked appropriately: if $k$ is too small, most documents will share most of their shingles; if $k$ is too large, the shingle sets become unnecessarily large. Commonly $k \approx 5$ for short documents and $k \approx 9$–$10$ for long ones.

Sampling may be required for larger datasets.

Min-Hashing

NB: see DATA301 Exam Notes for better notes.

The naïve approach is to compute the similarity of every pair of documents: $O(n^2)$ comparisons.

The problem can be reduced to finding subsets that have significant intersections: encode each set as a vector of Booleans, one per dimension; intersections are then bitwise ANDs and unions bitwise ORs.

The set of documents can be stored as a Boolean matrix: rows are shingles (elements of the universal set), columns are documents (sets), and entry $(i, j)$ is 1 if document $j$ contains shingle $i$.

This generates a sparse matrix and requires $O(n^2)$ space.
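
A sketch of the Boolean encoding with NumPy (rows are shingles, columns are documents); the AND/OR trick then gives pairwise Jaccard similarities directly:

```python
import numpy as np

docs = ["abcab", "bcabd", "xyzxy"]
k = 2

# Universe of shingles (one row per shingle), documents as Boolean columns.
shingle_sets = [{d[i:i + k] for i in range(len(d) - k + 1)} for d in docs]
universe = sorted(set().union(*shingle_sets))
M = np.array([[s in ss for ss in shingle_sets] for s in universe], dtype=bool)

def jaccard_cols(M, i, j):
    """Jaccard similarity of columns i and j: bitwise AND over bitwise OR."""
    inter = np.count_nonzero(M[:, i] & M[:, j])
    union = np.count_nonzero(M[:, i] | M[:, j])
    return inter / union if union else 1.0

print(jaccard_cols(M, 0, 1))  # 0.75 — documents 0 and 1 share most shingles
print(jaccard_cols(M, 0, 2))  # 0.0 — no overlap
```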

Key idea: ‘hash’ each column $C$ to a small signature $h(C)$ such that $h(C)$ is small enough to fit in memory and similar columns are likely to produce the same signature.

This works because the domain of the shingles is very large but only a small fraction of them are likely to be encountered. Hashing shrinks the domain, reducing memory requirements at the cost of some collisions.

Hence, given a set of documents, a hash function outputs an array of hash values, one per document.

By doing this multiple times with different hash functions, the similarity of two signatures becomes the fraction of hash functions they agree on.

Hence, this produces a signature matrix $M$, with each row representing a hash function and each column a set (document).

For a min-hash function $h_\pi$ defined by a random permutation $\pi$ of the rows:

$$ P(h_\pi(C_1) = h_\pi(C_2)) \approx \text{similarity}(C_1, C_2) $$
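
A sketch of min-hashing with explicit random row permutations (in practice, random hash functions of the row index are used instead of materialised permutations); it assumes every document contains at least one shingle:

```python
import numpy as np

rng = np.random.default_rng(42)

def minhash_signatures(M, num_hashes=100):
    """Signature matrix: one row per permutation, one column per document.
    Each entry is the position (under that permutation) of the first row
    in which the document's column contains a 1."""
    n_rows, n_docs = M.shape
    sig = np.empty((num_hashes, n_docs), dtype=int)
    for h in range(num_hashes):
        perm = rng.permutation(n_rows)
        # For each column, the minimum permuted index over its non-zero rows.
        sig[h] = [perm[M[:, c]].min() for c in range(n_docs)]
    return sig

def estimate_similarity(sig, i, j):
    """Fraction of hash functions on which the two signatures agree."""
    return np.mean(sig[:, i] == sig[:, j])

# Example (reusing M from the Boolean-matrix sketch above):
# sig = minhash_signatures(M, num_hashes=200)
# print(estimate_similarity(sig, 0, 1))  # should approximate jaccard_cols(M, 0, 1)
```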

Locality-Sensitive Hashing

Find documents with Jaccard similarity of at least $s$. LSH uses a function $f(x, y)$ that determines whether $x$ and $y$ are a candidate pair: a pair of elements whose similarity must be evaluated in full.

The signature matrix $M$ is partitioned into $b$ bands, with $r$ rows (hash functions) per band.

For each column, hash each band's portion of the column and put the column into the corresponding bucket (a separate set of buckets is used for each band).

The more bands in which a pair of documents share a bucket, the more likely they are to be similar:

Using one band of one row gives a linear relationship between similarity and the probability of sharing a bucket. However, with $b$ bands of $r$ rows:

$$ P(\text{sharing buckets}) = 1 - (1 - t^r)^b $$

Here $t$ is the similarity of the pair, and $rb$ gives the total number of hash functions per document.
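
A sketch of the banding step; using the raw band slice as the bucket key (rather than a further hash of it) is an implementation shortcut:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(sig, b, r):
    """Split the signature matrix (rb rows x n docs) into b bands of r rows;
    documents whose band slices land in the same bucket in any band
    become candidate pairs."""
    num_hashes, n_docs = sig.shape
    assert num_hashes == b * r, "signature length must equal b * r"
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)  # separate set of buckets per band
        for doc in range(n_docs):
            key = tuple(sig[band * r:(band + 1) * r, doc])
            buckets[key].append(doc)
        for docs_in_bucket in buckets.values():
            candidates.update(combinations(docs_in_bucket, 2))
    return candidates

# The S-curve 1 - (1 - t^r)^b rises steeply around the threshold (1/b)^(1/r).
b, r = 20, 5
print((1 / b) ** (1 / r))  # ≈ 0.55: pairs above this similarity are likely candidates
```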