Frequent Itemsets

Frequent Itemsets

1. Association rule discovery

1.1. Overview

1.2. The market-basket model

1.3. Example applications


2. Definition of a frequent itemsets.

2.1. Definition

2.2. Example of frequent itemsets

2.3. Association rules

2.4. Interesting association rules

2.5. Finding association rules

2.6. Mining association rules

2.7. Example


3. Finding frequent itemsets

3.1. Computation model

3.2. Computational cost

3.3. Main-memory bottleneck

3.4. Finding frequent pairs

3.5. Naive approach

3.6. Counting pairs in memory

???note “Approach 1: Count all pairs using a matrix”

1
2
3
4
5
6
7
- Requires 4 bytes per pair
- If n = total number items
    - Count pair of items {i, j} only if $i \leq j$
    - Keep pair counts in lexicographic order: 
        - {1,2}, {1,3},…, {1,n}, {2,3}, {2,4},…,{2,n}, {3,4},…
    - Pair {i, j} is at position (i –1)(n – i/2) + j –1
    - Total number of pairs n(n –1)/2; total bytes= $2n^2$

???note “Approach 2: Keep a table of triples [i, j, c]”

1
2
3
4
5
6
- The count of the pair of items {i, j} is c.
- If integers and item ids are 4 bytes, we need approximately 
12 bytes for pairs with count > 0
- Plus some additional overhead for the hashtable
- **but only for pairs with count > 0**
- Beats Approach 1 if less than 1/3 of possible pairs actually occur

4. A-Priori algorithm.

4.1. Overview

4.2. A-Priori algorithm

4.3. Frequent triples

4.4. Example


5. PCY (Park-Chen-Yu) Algorithm

5.1. Observation

5.2. PCY Algorithm - First Pass

1
2
3
4
5
6
FOR (each basket) :
    FOR (each item in the basket) :
        add 1 to item’s count;
    FOR (each pair of items) :
        hash the pair to a bucket;
        add 1 to the count for that bucket;

5.3. PCY Algorithm - Second Pass


6. SON: Savasere-Omiecinski-Navathe

6.1. Overview

6.2. SON: MapReduce Phase 1

6.3. SON: MapReduce Phase 2