similar sets: same story: topic modeling, trend identification.distance between $x_1$ and $x_2$near neighbors as points that are a small distance apartdistance meansJaccard similarity of two sets is the size of their intersection divided by the size of their union: | $sim(C_1, C_2) = | C_1 \cap C_2 | / | C_1 \cup C_2 | $ |
Jaccard distance: | $d(C_1, C_2) = 1 - | C_1 \cap C_2 | / | C_1 \cup C_2 | $ |
Jaccard similiarity of two sets is the size of their intersection divided by the size of their union.
important wordsabcab;compress long shingles, we can hash them to (say) 4 bytesJaccard similarity 1 in row e and column S if and only if e is a member of S.0 otherwise.1.
min-hashing minhash function h(C) = the number of the first (in the permuted order) row in which column C has 1.minhash values, in order for that column.Permutation column as indices.</details>
4 & 1 & 0 & 1 & 0
2 & 1 & 0 & 0 & 1
1 & 0 & 1 & 0 & 1
3 & 0 & 1 & 0 & 1
6 & 0 & 1 & 0 & 1
7 & 1 & 0 & 1 & 0
5 & 1 & 0 & 0 & 0
\end{array} \right]]
</details>
1 & 1 & 0 & 1 & 0
3 & 1 & 0 & 0 & 1
7 & 0 & 1 & 0 & 1
6 & 0 & 1 & 0 & 1
2 & 0 & 1 & 0 & 1
5 & 1 & 0 & 1 & 0
4 & 1 & 0 & 0 & 0
\end{array} \right]]
</details> - The resulting signature matrix is:
1
2
3
4
5
6
7
8
9
10
$$
\left[
\begin{array}{ccccc}
& C1 & C2 & C3 & C4 \\
1 & 2 & 1 & 2 & 1 \\
2 & 2 & 1 & 4 & 1 \\
3 & 1 & 2 & 1 & 2 \\
\end{array}
\right]
$$
t, a fraction < 1. c and d of the signature matrix M to be a candidate pair if and only if their signatures agree in at least fraction t of the rows.
b bands of r rows each.k buckets, with k as large as possible.Candidate column pairs are those that hash to the same bucket for at least one band.b and r.