Therefore, the hash of the shingles: h(D1) = {1, 5, 7}
3.3. Similarity metrics for shingles
Document $D_i$ is a set of its k-shingles $C_i=S(D_i)$
If we collect all unique k-shingles from all documents, each document can be represent as a 0/1 vector in the space of the k-shingles
Each unique shingle is a dimension
Vectors represent documents will be sparse, as no document should contain the majority of all k-shingles
???example “Example” - Given the following documents: - D1: practice makes perfect - D2: perfect practice prevents poor performance - D3: persistent practice produces progress - The 1-shingle set for each documents are: - D1: {practice,makes,perfect} - D2: {perfect,practice,prevents,poor,performance} - D3: {persistent,practice,produces,progress} - The 1-shingle space: - {makes, perfect, performance, persistent, poor, practice, prevents, produces, progress} - The corresponding 0/1 vectors for each documents are:
$$
\left[
\begin{array}{ccccc}
Permutation & C1 & C2 & C3 & C4 \\
3 & 1 & 0 & 1 & 0 \\
4 & 1 & 0 & 0 & 1 \\
7 & 0 & 1 & 0 & 1 \\
6 & 0 & 1 & 0 & 1 \\
1 & 0 & 1 & 0 & 1 \\
2 & 1 & 0 & 1 & 0 \\
5 & 1 & 0 & 0 & 0 \\
\end{array}
\right]
$$
- For column C1:
- row with value 1 in permutation column has a 0
- row with value 2 in permutation column has a 1 (first 1)
- Resulting signature is 2
- For column C2:
- row with value 1 in permutation column has a 1 (first 1)
- Resulting signature is 1
- For column C3:
- row with value 1 in permutation column has a 0
- row with value 2 in permutation column has a 1 (first 1)
- Resulting signature is 2
- For column C4:
- row with value 1 in permutation column has a 1 (first 1)
- Resulting signature is 1
- Resulting signature set for this permutation is {2,1,2,1}
$$
\left[
\begin{array}{ccccc}
Permutation & C1 & C2 & C3 & C4 \\
4 & 1 & 0 & 1 & 0 \\
2 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 1 \\
3 & 0 & 1 & 0 & 1 \\
6 & 0 & 1 & 0 & 1 \\
7 & 1 & 0 & 1 & 0 \\
5 & 1 & 0 & 0 & 0 \\
\end{array}
\right]
$$
- For column C1:
- row with value 1 in permutation column has a 0
- row with value 2 in permutation column has a 1 (first 1)
- Resulting signature is 2
- For column C2:
- row with value 1 in permutation column has a 1 (first 1)
- Resulting signature is 1
- For column C3:
- row with value 1 in permutation column has a 0
- row with value 2 in permutation column has a 0
- row with value 3 in permutation column has a 0
- row with value 4 in permutation column has a 1 (first 1)
- Resulting signature is 4
- For column C4:
- row with value 1 in permutation column has a 1 (first 1)
- Resulting signature is 1
- Resulting signature set for this permutation is {2,1,4,1}
$$
\left[
\begin{array}{ccccc}
Permutation & C1 & C2 & C3 & C4 \\
1 & 1 & 0 & 1 & 0 \\
3 & 1 & 0 & 0 & 1 \\
7 & 0 & 1 & 0 & 1 \\
6 & 0 & 1 & 0 & 1 \\
2 & 0 & 1 & 0 & 1 \\
5 & 1 & 0 & 1 & 0 \\
4 & 1 & 0 & 0 & 0 \\
\end{array}
\right]
$$
- For column C1:
- row with value 1 in permutation column has a 1 (first 1)
- Resulting signature is 1
- For column C2:
- row with value 1 in permutation column has a 0
- row with value 2 in permutation column has a 1 (first 1)
- Resulting signature is 2
- For column C3:
- row with value 1 in permutation column has a 1 (first 1)
- Resulting signature is 1
- For column C4:
- row with value 1 in permutation column has a 0
- row with value 2 in permutation column has a 1 (first 1)
- Resulting signature is 2
- Resulting signature set for this permutation is {1,2,1,2}
Pick approximately 100 hash functions (100 permutations).
Go through the data set row by row.
For each row r, for each hash function i,
Maintain a variable M(i,c) which will maintain the smallest value value of $h_i(r)$ for which column c has 1 in row r.
5. Locality-sensitive-hashing (LSH)
5.1. Overview
Generate from the collection of signatures a list of candidate pairs: pairs of elements where similarity must be evaluated.
For signature matrices: hash columns to many buckets and make elements of the same bucket candidate pairs.
Pick a similarity threshold t, a fraction < 1.
We want a pair of columns c and d of the signature matrix M to be a candidate pair if and only if their signatures agree in at least fraction t of the rows.
5.2. LSH
Big idea: hash columns of signature matrix M several times and arrange that only similar columns are likely to hash to the same bucket.
Reality: we don’t need to study the entire column.
Divide matrix M into b bands of r rows each.
For each band, hash its portion of each column to a hash table with k buckets, with k as large as possible.
Candidate column pairs are those that hash to the same bucket for at least one band.
Fine tune b and r.
We will not go into the math here …
5.3. Hands on LSH
Download the set of inaugural speeches from https://www.cs.wcupa.edu/lngo/data/inaugural_speeches.zip.