Locality Sensitive Hashing

Locality Sensitive Hashing

1. Overview

1.1. Applications of set-similarity

1.2. Problem statement

1.3. Motivation from frequent itemsets


2. Finding Similar Items

2.1. Distance measures

2.2. Three essential techniques for similar documents


3. Shingling

3.1. Shingles

3.2. Compressing shingles

3.3. Similarity metrics for shingles

???example “Example” - Given the following documents: - D1: practice makes perfect - D2: perfect practice prevents poor performance - D3: persistent practice produces progress - The 1-shingle set for each documents are: - D1: {practice,makes,perfect} - D2: {perfect,practice,prevents,poor,performance} - D3: {persistent,practice,produces,progress} - The 1-shingle space: - {makes, perfect, performance, persistent, poor, practice, prevents, produces, progress} - The corresponding 0/1 vectors for each documents are:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
$$
\left[
\begin{array}{cccc}
            & D1 & D2 & D3\\
makes       & 1  & 0  & 0 \\
perfect     & 1  & 1  & 0 \\
performance & 0  & 1  & 0 \\
persistent  & 0  & 0  & 1 \\
poor        & 0  & 1  & 0 \\
practice    & 1  & 1  & 1 \\
prevents    & 0  & 1  & 0 \\
produces    & 0  & 0  & 1 \\
progress    & 0  & 0  & 1 \\
\end{array}
\right]
$$

3.4. Motivation for Minhash/LSH


4. Minhashing: Jaccard Similarity

4.1. Encoding sets as bit vectors

4.2. Convert from sets to boolean matrices

4.3. Finding similar columns

4.4. Hashing columns: Signatures

4.5. Minhashing

4.6. Minhashing: surprising property

4.7. Minhashing: implementation


5. Locality-sensitive-hashing (LSH)

5.1. Overview

5.2. LSH

5.3. Hands on LSH