
S = set of Items
| Avatar | LOTR | Matrix | Pirates | |
|---|---|---|---|---|
| Alice | 1 | 0.2 | ||
| Bob | 0.5 | 0.3 | ||
| Carol | 0.2 | 1 | ||
| David | 0.4 |
???question “Key Problems” - Gathering known ratings for matrix - How to collect the data in the utility matrix - Extrapolate unknown ratings from the known ones - Mainly interested in high unknown ratings - We are not interested in knowing what you don’t like but what you like - Evaluating extrapolation methods - How to measure success/performance of recommendation methods
???info “Gathering ratings” - Explicit - Ask people to rate items - Doesn’t work well in practice: people can’t be bothered - Implicit - Learn ratings from user actions - E.g., purchase implies high rating - What about low ratings?
???info “Extrapolating utilities” - Key problem: Utility matrix U is sparse - Most people have not rated most items - Cold start: - New items have no ratings - New users have no history - Three approaches to recommender systems: - Content-based - Collaborative - Latent factor based
similar content
similar to x’s ratings
???question “Potential problem?” Treats missing ratings as negative
| HP1 | HP2 | HP3 | TW | SW1 | SW2 | SW3 | |
|---|---|---|---|---|---|---|---|
| A | 4 | 5 | 1 | ||||
| B | 5 | 5 | 4 | ||||
| C | 2 | 4 | 5 | ||||
| D | 3 | 3 |
???info “Immediate intuition” A is more similar to B than A is to C
???failure “Jaccard similarity” - $r_A={1,0,0,1,1,0,0}$ - $r_B={1,1,1,0,0,0,0}$ - $r_C={0,0,0,1,1,1,0}$ - $sim(A,B) = \frac{1}{5}$ - $sim(A,C) = \frac{2}{4}$ - A is more similar to C than to B!!!
???warning “Cosine similarity” - $r_A={4,0,0,5,1,0,0}$ - $r_B={5,5,4,0,0,0,0}$ - $r_C={0,0,0,2,4,5,0}$
1
2
3
4
5
6
7
8
9
10
11
12
13
14
$$
sim(x,y)=cos(r_x,r_y)=\frac{r_x \cdot r_y}{||r_x|| \cdot ||r_y||}
$$
$$
sim(A,B)=\frac{r_A \cdot r_B}{||r_A|| \cdot ||r_B||}=\frac{4 \times 5 + 0 \times 5 + 0 \times 4 + 5 \times 0 + 1 \times 0 + 0 \times 0 + 0 \times 0 }{\sqrt{4^2+5^2+1^2} \times \sqrt{5^2+5^2+4^2}}=0.380
$$
$$
sim(A,C)=\frac{r_A \cdot r_C}{||r_A|| \cdot ||r_C||}=\frac{5 \times 2 + 1 \times 4}{\sqrt{4^2+5^2+1^2} \times \sqrt{2^2+4^2+5^2}}=0.322
$$
- A is more similar to B than to C!
- What else?
???success “Normalization” - Sutract the row mean - This normalizes data and turns cosine similarity into Pearson correlation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
$$
sim(x,y)=cos(r_x,r_y)=\frac{r_x \cdot r_y}{||r_x|| \cdot ||r_y||}
$$
$$
sim(x,y) = \frac{\sum_{s \in S_{xy}}(r_{xs} - \overline{r_x})(r_{ys} - \overline{r_y})}{\sqrt{\sum_{s \in S_{xy}}(r_{xs} - \overline{r_x})^2}\sqrt{\sum_{s \in S_{xy}}(r_{ys} - \overline{r_y})^2}}
$$
- $r_A={4,0,0,5,1,0,0}$ and $\overline{r_A}=\frac{10}{3}$
- $r_B={5,5,4,0,0,0,0}$ and $\overline{r_B}=\frac{14}{3}$
- $r_C={0,0,0,2,4,5,0}$ and $\overline{r_C}=\frac{11}{3}$
- $r_D={0,3,0,0,0,0,3}$ and $\overline{r_D}=3$
| | HP1 | HP2 | HP3 | TW | SW1 | SW2 | SW3 |
| - | --- | --- | ---- | ---- | ---- | --- | --- |
| A | 2/3 | | | 5/3 | -7/3 | | |
| B | 1/3 | 1/3 | -2/3 | | | | |
| C | | | | -5/3 | 1/3 | 4/3 | |
| D | | 0 | | | | | 0 |
- sim(A,B)=0.092
- sim(A,C)=-0.559
- A is more similar to B than to C!
- Because the value is also the correlation value:
- A is very similar to B
- A is very different from C
Can use same similarity metrics and prediction functions as in user-user model
| $u_1$ | $u_2$ | $u_3$ | $u_4$ | $u_5$ | $u_6$ | $u_7$ | $u_8$ | $u_9$ | $u_{10}$ | $u_{11}$ | $u_{12}$ | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $m_1$ | 1 | 3 | 5 | 5 | 4 | |||||||
| $m_2$ | 5 | 4 | 4 | 2 | 1 | 3 | ||||||
| $m_3$ | 2 | 4 | 1 | 2 | 3 | 4 | 3 | 5 | ||||
| $m_4$ | 2 | 4 | 5 | 4 | 2 | |||||||
| $m_5$ | 4 | 3 | 4 | 2 | 2 | 5 | ||||||
| $m_6$ | 1 | 3 | 3 | 2 | 4 |
???info “Estimate rating of movie 1 by user 5”
1
2
3
4
5
6
7
8
| | $u_1$ | $u_2$ | $u_3$ | $u_4$ | $u_5$ | $u_6$ | $u_7$ | $u_8$ | $u_9$ | $u_{10}$ | $u_{11}$ | $u_{12}$ |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | -------- | -------- | -------- |
| $m_1$ | 1 | | 3 | | ??? | 5 | | | 5 | | 4 | |
| $m_2$ | | | 5 | 4 | | | 4 | | | 2 | 1 | 3 |
| $m_3$ | 2 | 4 | | 1 | 2 | | 3 | | 4 | 3 | 5 | |
| $m_4$ | | 2 | 4 | | 5 | | | 4 | | | 2 | |
| $m_5$ | | | 4 | 3 | 4 | 2 | | | | | 2 | 5 |
| $m_6$ | 1 | | 3 | | 3 | | | 2 | | | 4 | |
???info “Identify movies similar to movie 1, rated by user 5” - Pearson correlation for similarity calculation: - Movie 3 - Movie 5
1
2
3
4
5
6
7
8
| | $u_1$ | $u_2$ | $u_3$ | $u_4$ | $u_5$ | $u_6$ | $u_7$ | $u_8$ | $u_9$ | $u_{10}$ | $u_{11}$ | $u_{12}$ | $sim(1,m_i)$ |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | -------- | -------- | -------- | ------------ |
| $m_1$ | 1 | | 3 | | ??? | 5 | | | 5 | | 4 | | 1.0 |
| $m_2$ | | | 5 | 4 | | | 4 | | | 2 | 1 | 3 | -0.18 |
| $m_3$ | 2 | 4 | | 1 | 2 | | 3 | | 4 | 3 | 5 | | 0.41 |
| $m_4$ | | 2 | 4 | | 5 | | | 4 | | | 2 | | -0.10 |
| $m_5$ | | | 4 | 3 | 4 | 2 | | | | | 2 | 5 | -0.31 |
| $m_6$ | 1 | | 3 | | 3 | | | 2 | | | 4 | | 0.59 |
???info “Similarity weights” - $s_{1,3} = 0.41$ - $s_{1,6} = 0.59$
1
2
3
4
5
6
7
8
| | $u_1$ | $u_2$ | $u_3$ | $u_4$ | $u_5$ | $u_6$ | $u_7$ | $u_8$ | $u_9$ | $u_{10}$ | $u_{11}$ | $u_{12}$ | $sim(1,m_i)$ |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | -------- | -------- | -------- | ------------ |
| $m_1$ | 1 | | 3 | | ??? | 5 | | | 5 | | 4 | | 1.0 |
| $m_2$ | | | 5 | 4 | | | 4 | | | 2 | 1 | 3 | -0.18 |
| $m_3$ | 2 | 4 | | 1 | 2 | | 3 | | 4 | 3 | 5 | | 0.41 |
| $m_4$ | | 2 | 4 | | 5 | | | 4 | | | 2 | | -0.10 |
| $m_5$ | | | 4 | 3 | 4 | 2 | | | | | 2 | 5 | -0.31 |
| $m_6$ | 1 | | 3 | | 3 | | | 2 | | | 4 | | 0.59 |
???info “Predict rating” - $s_{1,3} = 0.41$ - $s_{1,6} = 0.59$ - $r_{1,5} = \frac{0.41 * 2 + 0.59 * 3}{0.41 + 0.59} = 2.6$
1
2
3
4
5
6
7
8
| | $u_1$ | $u_2$ | $u_3$ | $u_4$ | $u_5$ | $u_6$ | $u_7$ | $u_8$ | $u_9$ | $u_{10}$ | $u_{11}$ | $u_{12}$ | $sim(1,m_i)$ |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | -------- | -------- | -------- | ------------ |
| $m_1$ | 1 | | 3 | | 2.6 | 5 | | | 5 | | 4 | | 1.0 |
| $m_2$ | | | 5 | 4 | | | 4 | | | 2 | 1 | 3 | -0.18 |
| $m_3$ | 2 | 4 | | 1 | 2 | | 3 | | 4 | 3 | 5 | | 0.41 |
| $m_4$ | | 2 | 4 | | 5 | | | 4 | | | 2 | | -0.10 |
| $m_5$ | | | 4 | 3 | 4 | 2 | | | | | 2 | 5 | -0.31 |
| $m_6$ | 1 | | 3 | | 3 | | | 2 | | | 4 | | 0.59 |

???tip “Add data” - Leverage all the data - Don’t try to reduce data size in an effort to make fancy algorithms work - Simple methods on large data do best
1
2
3
4
- Add more data
- e.g., add IMDB data on genres
- [More data beats better algorithms](http://anand.typepad.com/datawocky/2008/03/more-data-usual.html)
???example “Example: modeling local and global effects” - Global: - Mean movie rating: 3.7 stars - The Sixth Sense is 0.5 stars above avg. - Joe rates 0.2 stars below avg. - Baseline estimation: Joe will rate The Sixth Sense 4 stars - Local neighborhood (CF/NN): - Joe didn’t like related movie Signs - Final estimate: Joe will rate The Sixth Sense 3.8 stars
\(r_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)}s_{ij} \cdot (r_{xj} - b_{xj})}{\sum_{j \in N(i;x)}s_{ij}}\)
???question “Problems” - Similarity measures are “arbitrary” - Pairwise similarities neglect interdependencies among users - aking a weighted average can be restricting
???success “Solution” Instead of $s_{ij}$ use $w_{ij}$ that we estimate directly from data
1
2
3
4
5
6
7
8
9
10
11
$$
r_{xi} = b_{xi} + \sum_{j \in N(i;x)}w_{ij}(r_{xj} - b_{xj})
$$
- $N(i;x)$ is set of movies rated by user x that are similar to movie i
- $w_{ij}$ is the interpolation weight (some real number)
- $w_{ij}$ models interaction between pairs of movies (it does not depend on user $x$)
- Goals:
- Find $w_{ij}$ that minimize SSE (sum of squared errors) on training data!
- Models relationships between item i and its neighbors j
- $w_{ij}$ can be learned/estimated based on x and all other users that rated i
| HP1 | HP2 | HP3 | TW | SW1 | SW2 | SW3 | |
|---|---|---|---|---|---|---|---|
| A | 4 | 5 | 1 | ||||
| B | 5 | 5 | 4 | ||||
| C | 2 | 4 | 5 | ||||
| D | 3 | 3 |
small number of features in movies: genres, actors, actresses, theme songs, etc.

