Locality Sensitive Hashing

Overview

Applications of set-similarity

Many data mining problems can be expressed as finding similar sets:
- Pages with similar words, e.g., for classification by topic.
- Netflix users with similar tastes in movies, for recommendation systems.
- Dual: movies with similar sets of fans.
- Entity resolution.
Example application: Given a body of documents, e.g., the Web, find pairs of documents with a lot of text in common, such as:
- Mirror sites, or approximate mirrors.
  - Application: Don’t want to show both in a search.
- Plagiarism, including large quotations.
- Similar news articles at many news sites.
  - Application: Cluster articles by same story: topic modeling, trend identification.

Problem statement

Given: High dimensional data points $x_1, x_2, …$
- For example: Image is a long vector of pixel colors
\[\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} \rightarrow [1\ 2\ 1\ 0\ 2\ 1\ 0\ 1\ 0]\]
And some distance function $d(x_1,x_2)$
- Which quantifies the distance between $x_1$ and $x_2$
Goal: Find all pairs of data points $x_i, x_j$ that are within some distance threshold $d(x_i,x_j) \leq s$
Naïve solution would take $O(N^2)$ where N is the number of data points
This can be done in $O(N)$!!!

Motivation from frequent itemsets

A-Priori:
- First pass: Find frequent singletons. For a pair to be a frequent pair candidate, its singletons have to be frequent!
- Second pass: Count only candidate pairs!
PCY
- First pass: Count exact frequency of each item:
  - Take pairs of items {i,j}, hash them into B buckets and count of the number of pairs that hashed to each bucket
- Second pass:
  - For a pair {i,j} to be a candidate for a frequent pair, its singletons {i}, {j} have to be frequent and the pair has to hash to a frequent bucket!
What can we learn from PCY:
- First pass: Take documents and hash them to buckets such that documents that are similar hash to the same bucket
- Second pass: Only compare documents that are candidates (i.e., they hashed to a same bucket)
- Benefits: Instead of $O(N^2)$ comparisons, we need $O(N)$ comparisons to find similar documents

Finding Similar Items

Distance measures

Goal: Find near-neighbors in high-dimensional space
- We formally define near neighbors as points that are a small distance apart
For each application, we first need to define what distance means
Jaccard distance/similarity
- The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
  - $sim(C_1, C_2) = C_1 \cap C_2 / C_1 \cup C_2 $
- Jaccard distance:
  - $d(C_1, C_2) = 1 - C_1 \cap C_2 / C_1 \cup C_2 $
The Jaccard similiarity of two sets is the size of their intersection divided by the size of their union.

Three essential techniques for similar documents

Shingling : convert documents, emails, etc., to sets.
Minhashing : convert large sets to short signatures, while preserving similarity.
Locality sensitive hashing : focus on pairs of signatures likely to be similar.

Shingling

Shingles

Documents as high-dimensional data
Simple approaches
- Document = set of words appearing in document
- Document = set of important words
- Don’t work well for this application: Need to account for ordering of words!
Shingles
- A k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document.
- Example:
  - k = 2;
  - Document:abcab;
  - Set of 2-shingles: {ab, bc, ca}.

Compressing shingles

To compress long shingles, we can hash them to (say) 4 bytes
Represent a document by the set of hash values of its k-shingles
- Idea: Two documents could (rarely) appear to have shingles in common, when in fact only the hash-values were shared
Example:
- k=2; document D1= abcab
- Set of 2-shingles: S(D1) = {ab, bc, ca}
- Select a random hash function $h$ such that
\[\begin{align} \begin{aligned} h(ab) = 1\\ h(bc) = 5\\ h(ca) = 7 \end{aligned} \end{align}\]
- Therefore, the hash of the shingles: h(D1) = {1, 5, 7}

Similarity metrics for shingles

Document $D_i$ is a set of its k-shingles $C_i=S(D_i)$
If we collect all unique k-shingles from all documents, each document can be represent as a 0/1 vector in the space of the k-shingles
- Each unique shingle is a dimension
- Vectors represent documents will be sparse, as no document should contain the majority of all k-shingles

Example

Given the following documents:
- D1: practice makes perfect
- D2: perfect practice prevents poor performance
- D3: persistent practice produces progress
The 1-shingle set for each documents are:
- D1: {practice,makes,perfect}
- D2: {perfect,practice,prevents,poor,performance}
- D3: {persistent,practice,produces,progress}
The 1-shingle space:
- {makes, perfect, performance, persistent, poor, practice, prevents, produces, progress}
The corresponding 0/1 vectors for each documents are:

\[\left[ \begin{array}{cccc} & D1 & D2 & D3\\ makes & 1 & 0 & 0 \\ perfect & 1 & 1 & 0 \\ performance & 0 & 1 & 0 \\ persistent & 0 & 0 & 1 \\ poor & 0 & 1 & 0 \\ practice & 1 & 1 & 1 \\ prevents & 0 & 1 & 0 \\ produces & 0 & 0 & 1 \\ progress & 0 & 0 & 1 \\ \end{array} \right]\]

A natural similarity measure is the Jaccard similarity

Motivation for Minhash/LSH

Documents that are intuitively similar will have many shingles in common.
- k = 5 is ok for short documents
- k = 10 is better for long documents
Problem: find near-duplicate documents among $N=10^6$ documents
Naïvely, we would have to compute pairwise Jaccard similarities for every pair of docs
- $\frac{N(N-1)}{2} \approx 5 \times 10^{11}$ comparisons
- At $10^5\ secs/day$ and $10^6\ comparisons/sec$, it would take 5 days
For $N=10^7$, it takes more than a year…

Minhashing: Jaccard Similarity

Encoding sets as bit vectors

Many similarity problems can be formalized as finding subsets that have significant intersection
Encode sets using 0/1 (bit, boolean) vectors
- One dimension per element in the universal set
Interpret set intersection as bitwise AND, and set union as bitwise OR
Example: C1 = 10111; C2 = 10011
- Size of intersection = 3; size of union = 4,
- Jaccard similarity (not distance) = 3/4
- Distance: $d(C_1,C_2)$ = 1 – (Jaccard similarity) = 1/4

Convert from sets to boolean matrices

Rows: elements of the universal set. In other words, all elements in the union.
Columns: individual sets.
A cell value of 1 in row e and column S if and only if e is a member of S.
A cell value of 0 otherwise.
Column similarity is the Jaccard similarity of the sets of their rows that have the value of 1.
Typically sparse.
This gives you another way to calculate similarity: column similarity = Jaccard similarity.

Finding similar columns

Naïve approach:
- Signatures of columns: small summaries of columns
- Examine pairs of signatures to find similar columns
  - Essential: Similarities of signatures and columns are related
- Optional: Check that columns with similar signatures are really similar

Hashing columns: Signatures

Key idea: “hash” each column C to a small signature h(C), such that:
- (1) h(C) is small enough that the signature fits in RAM
- (2) $sim(C_1, C_2)$ is the same as the “similarity” of signatures $h(C_1)$ and $h(C_2)$
Solution: min-hashing

Minhashing

Imagine the rows permuted randomly.
Define minhash function h(C) = the number of the first (in the permuted order) row in which column C has 1.
Use several (e.g., 100) independent hash functions to create a signature for each column.
The signatures can be displayed in another matrix called the signature matrix
The signature matrix has
- its columns represent the sets and
- the rows represent the minhash values, in order for that column.
Example: Assuming the following matrix for 4 documents, C1, C2, C3, and C4:
- 0/1 vector matrix:
\[\left[ \begin{array}{ccccc} & C1 & C2 & C3 & C4 \\ 1 & 1 & 0 & 1 & 0 \\ 2 & 1 & 0 & 0 & 1 \\ 3 & 0 & 1 & 0 & 1 \\ 4 & 0 & 1 & 0 & 1 \\ 5 & 0 & 1 & 0 & 1 \\ 6 & 1 & 0 & 1 & 0 \\ 7 & 1 & 0 & 0 & 0 \\ \end{array} \right]\]
- We permute the index of the row, then start going through the rows using the values in the Permutation column as indices.
Permutation 1
\[\left[ \begin{array}{ccccc} Permutation & C1 & C2 & C3 & C4 \\ 3 & 1 & 0 & 1 & 0 \\ 4 & 1 & 0 & 0 & 1 \\ 7 & 0 & 1 & 0 & 1 \\ 6 & 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 & 1 \\ 2 & 1 & 0 & 1 & 0 \\ 5 & 1 & 0 & 0 & 0 \\ \end{array} \right]\]

For column C1:
- row with value 1 in permutation column has a 0
- row with value 2 in permutation column has a 1 (first 1)
  - Resulting signature is 2
For column C2:
- row with value 1 in permutation column has a 1 (first 1)
  - Resulting signature is 1
For column C3:
- row with value 1 in permutation column has a 0
- row with value 2 in permutation column has a 1 (first 1)
  - Resulting signature is 2
For column C4:
- row with value 1 in permutation column has a 1 (first 1)
  - Resulting signature is 1
Resulting signature set for this permutation is {2,1,2,1}

</details>

<summary>Permutation 2</summary> [\left[ \begin{array}{ccccc} Permutation & C1 & C2 & C3 & C4
4 & 1 & 0 & 1 & 0
2 & 1 & 0 & 0 & 1
1 & 0 & 1 & 0 & 1
3 & 0 & 1 & 0 & 1
6 & 0 & 1 & 0 & 1
7 & 1 & 0 & 1 & 0
5 & 1 & 0 & 0 & 0
\end{array} \right]]

For column C1:
- row with value 1 in permutation column has a 0
- row with value 2 in permutation column has a 1 (first 1)
  - Resulting signature is 2
For column C2:
- row with value 1 in permutation column has a 1 (first 1)
  - Resulting signature is 1
For column C3:
- row with value 1 in permutation column has a 0
- row with value 2 in permutation column has a 0
- row with value 3 in permutation column has a 0
- row with value 4 in permutation column has a 1 (first 1)
  - Resulting signature is 4
For column C4:
- row with value 1 in permutation column has a 1 (first 1)
  - Resulting signature is 1
Resulting signature set for this permutation is {2,1,4,1}

</details>

<summary>Permutation 3</summary> [\left[ \begin{array}{ccccc} Permutation & C1 & C2 & C3 & C4
1 & 1 & 0 & 1 & 0
3 & 1 & 0 & 0 & 1
7 & 0 & 1 & 0 & 1
6 & 0 & 1 & 0 & 1
2 & 0 & 1 & 0 & 1
5 & 1 & 0 & 1 & 0
4 & 1 & 0 & 0 & 0
\end{array} \right]]

For column C1:
- row with value 1 in permutation column has a 1 (first 1)
  - Resulting signature is 1
For column C2:
- row with value 1 in permutation column has a 0
- row with value 2 in permutation column has a 1 (first 1)
  - Resulting signature is 2
For column C3:
- row with value 1 in permutation column has a 1 (first 1)
  - Resulting signature is 1
For column C4:
- row with value 1 in permutation column has a 0
- row with value 2 in permutation column has a 1 (first 1)
  - Resulting signature is 2
Resulting signature set for this permutation is {1,2,1,2}

</details> - The resulting signature matrix is:

 $$
\left[
\begin{array}{ccccc}
& C1 & C2 & C3 & C4 \\
1 & 2  & 1  & 2  & 1 \\
2 & 2  & 1  & 4  & 1 \\
3 & 1  & 2  & 1  & 2 \\
\end{array}
\right]
$$
 

Minhashing: surprising property

The probability (over all permutations of the rows) that $h(C_1) = h(C_2)$ is the same as $Sim(C_1, C_2)$.
The similarity of signatures is the fraction of the minhash functions in which they agree.
The expected similarity of two signatures equals the Jaccard similarity of the columns.
The longer the signatures, the smaller the expected error will be.
Example:
- Given the following document matrix
\[\left[ \begin{array}{ccccc} & C1 & C2 & C3 & C4 \\ 1 & 1 & 0 & 1 & 0 \\ 2 & 1 & 0 & 0 & 1 \\ 3 & 0 & 1 & 0 & 1 \\ 4 & 0 & 1 & 0 & 1 \\ 5 & 0 & 1 & 0 & 1 \\ 6 & 1 & 0 & 1 & 0 \\ 7 & 1 & 0 & 0 & 0 \\ \end{array} \right]\]
- and its corresponding signature matrix
\[\left[ \begin{array}{ccccc} & C1 & C2 & C3 & C4 \\ 1 & 2 & 1 & 2 & 1 \\ 2 & 2 & 1 & 4 & 1 \\ 3 & 1 & 2 & 1 & 2 \\ \end{array} \right]\]
- The Jaccard similarity calculation for the two matrices are:
\[\left[ \begin{array}{cccc} Similarity & C1\ and\ C3 & C2\ and\ C4 & C1\ and\ C2 \\ Column & \frac{2}{4} & 1 & 0 \\ Signature & \frac{2}{3} & 1 & 0 \\ \end{array} \right]\]

Minhashing: implementation

Can’t realistically permute billion of rows:
- Too many permutation entries to store.
- Random access on big data (big no no).
How to calculate hashes in sequential order?
- Pick approximately 100 hash functions (100 permutations).
- Go through the data set row by row.
- For each row r, for each hash function i,
  - Maintain a variable M(i,c) which will maintain the smallest value value of $h_i(r)$ for which column c has 1 in row r.

Locality-sensitive-hashing (LSH)

Overview

Generate from the collection of signatures a list of candidate pairs: pairs of elements where similarity must be evaluated.
- For signature matrices: hash columns to many buckets and make elements of the same bucket candidate pairs.
Pick a similarity threshold t, a fraction < 1.
- We want a pair of columns c and d of the signature matrix M to be a candidate pair if and only if their signatures agree in at least fraction t of the rows.

LSH

Big idea: hash columns of signature matrix M several times and arrange that only similar columns are likely to hash to the same bucket.
Reality: we don’t need to study the entire column.

Divide matrix M into b bands of r rows each.
For each band, hash its portion of each column to a hash table with k buckets, with k as large as possible.
Candidate column pairs are those that hash to the same bucket for at least one band.
Fine tune b and r.
We will not go into the math here …

Hands on LSH

Download the set of inaugural speeches from https://www.cs.wcupa.edu/lngo/data/inaugural_speeches.zip.