Recommendation Systems

Recommendation Systems

Recommendations

Overview
  • Shelf space is a scarce commodity for traditional retailers
    • Also: TV networks, movie theaters,…
  • Web enables near-zero-cost dissemination of information about products
    • From scarcity to abundance
  • More choice necessitates better filters
    • Recommendation engines
    • How Into Thin Air made Touching the Void a bestseller: http://www.wired.com/wired/archive/12.10/tail.html
Types of recommendations
  • Editorial and hand curated
    • List of favorites
    • Lists of “essential” items
  • Simple aggregates
    • Top 10, Most Popular, Recent Uploads
  • Tailored to individual users
    • Amazon, Netflix, …
Formal Model
  • X = set of Customers
  • S = set of Items

  • Utility function u: $X \times S \rightarrow R$
    • R = set of ratings
    • R is a totally ordered set
    • e.g., 0-5 stars, real number in [0,1]
  Avatar LOTR Matrix Pirates
Alice 1   0.2  
Bob   0.5   0.3
Carol 0.2   1  
David       0.4
Key Problems
  • Gathering known ratings for matrix
    • How to collect the data in the utility matrix
  • Extrapolate unknown ratings from the known ones
    • Mainly interested in high unknown ratings
    • We are not interested in knowing what you don’t like but what you like
  • Evaluating extrapolation methods
    • How to measure success/performance of recommendation methods
Gathering ratings
  • Explicit
    • Ask people to rate items
    • Doesn’t work well in practice: people can’t be bothered
  • Implicit
    • Learn ratings from user actions
      • E.g., purchase implies high rating
    • What about low ratings?
Extrapolating utilities
  • Key problem: Utility matrix U is sparse
    • Most people have not rated most items
    • Cold start:
      • New items have no ratings
      • New users have no history
  • Three approaches to recommender systems:
    • Content-based
    • Collaborative
    • Latent factor based

Content-based Recommender Systems

Main Idea
  • Main idea: Recommend items to customer x similar to previous items rated highly by x
  • Example:
    • Movie recommendations: Recommend movies with same actor(s), director, genre, …
    • Websites, blogs, news: Recommend other sites with similar content
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets
Item profiles
  • For each item, create an item profile
  • Profile is a set (vector) of features
    • Movies: author, title, actor, director,…
    • Text: Set of “important” words in document
  • How to pick important features?
    • Usual heuristic from text mining is TF-IDF (Term frequency * Inverse Doc Frequency)
    • Term … Feature
    • Document … Item
User profiles and prediction
  • User profile possibilities:
    • Weighted average of rated item profiles
    • Variation: weight by difference from average rating for item
  • Prediction heuristic:
    • Given user profile $x$ and item profile $i$, estimate
    \[u(x,i)=cos(x,i)=\frac{x \cdot i}{||x|| \times ||i||}\]
Pros and cons
  • Pros:
    • No need for data on other users
      • No cold-start or sparsity problems
    • Able to recommend to users with unique tastes
    • Able to recommend new & unpopular items
      • No first-rater problem
    • Able to provide explanations
      • Can provide explanations of recommended items by listing content-features that caused an item to be recommended
  • Cons:
    • Finding the appropriate features is hard
      • E.g., images, movies, music
    • Recommendations for new users
      • How to build a user profile?
    • Overspecialization
      • Never recommends items outside user’s content profile
      • People might have multiple interests
      • Unable to exploit quality judgments of other users

Collaborative Filtering

Overview
  • Consider user x
  • Find set N of other users whose ratings are similar to x’s ratings
  • Estimate x’s ratings based on ratings of users in N
How do we find similar users
  • Assume:
    • Ratings are 1 to 3 stars
    • Two users:
      • $r_x=[,_,_,,***]$
      • $r_y=[*,_,,,_]$
  • Jaccard similarity measure
    • $r_x={1,4,5}$ or $r_x={1,0,0,1,1}$
    • $r_y={1,3,4}$ or $r_x={1,0,1,1,0}$
    Potential problem?

Ignores the value of the rating

</details>

  • Cosine similarity measure
    • $r_x={1,0,0,1,3}$
    • $r_y={1,0,2,2,0}$
    \[sim(x,y)=cos(r_x,r_y)=\frac{r_x \cdot r_y}{||r_x|| \times ||r_y||}\]
    Potential problem?

Treats missing ratings as negative

</details>

  • Person correlation coefficient
    • $r_x={1,0,0,1,3}$ with average rating $\overline{r_x}$
    • $r_y={1,0,2,2,0}$ with average rating $\overline{r_y}$
    • $S_{xy}$ are items rated by both x and y
      • $r_{xs}={1,1}$
      • $r_{ys}={1,2}$
    \[sim(x,y) = \frac{\sum_{s \in S_{xy}}(r_{xs} - \overline{r_x})(r_{ys} - \overline{r_y})}{\sqrt{\sum_{s \in S_{xy}}(r_{xs} - \overline{r_x})^2}\sqrt{\sum_{s \in S_{xy}}(r_{ys} - \overline{r_y})^2}}\]
Example differences of similarity metrics
  • HP: Harry Potter
  • TW: Twilight
  • SW: Star Wars
  HP1 HP2 HP3 TW SW1 SW2 SW3
A 4     5 1    
B 5 5 4        
C       2 4 5  
D   3         3
Immediate intuition

A is more similar to B than A is to C

Jaccard similarity
  • $r_A={1,0,0,1,1,0,0}$
  • $r_B={1,1,1,0,0,0,0}$
  • $r_C={0,0,0,1,1,1,0}$
  • $sim(A,B) = \frac{1}{5}$
  • $sim(A,C) = \frac{2}{4}$
  • A is more similar to C than to B!!!
Cosine similarity
  • $r_A={4,0,0,5,1,0,0}$
  • $r_B={5,5,4,0,0,0,0}$
  • $r_C={0,0,0,2,4,5,0}$
\[sim(x,y)=cos(r_x,r_y)=\frac{r_x \cdot r_y}{||r_x|| \cdot ||r_y||}\] \[sim(A,B)=\frac{r_A \cdot r_B}{||r_A|| \cdot ||r_B||}=\frac{4 \times 5 + 0 \times 5 + 0 \times 4 + 5 \times 0 + 1 \times 0 + 0 \times 0 + 0 \times 0 }{\sqrt{4^2+5^2+1^2} \times \sqrt{5^2+5^2+4^2}}=0.380\] \[sim(A,C)=\frac{r_A \cdot r_C}{||r_A|| \cdot ||r_C||}=\frac{5 \times 2 + 1 \times 4}{\sqrt{4^2+5^2+1^2} \times \sqrt{2^2+4^2+5^2}}=0.322\]
  • A is more similar to B than to C!
  • What else?
Normalization
  • Sutract the row mean
    • This normalizes data and turns cosine similarity into Pearson correlation
\[sim(x,y)=cos(r_x,r_y)=\frac{r_x \cdot r_y}{||r_x|| \cdot ||r_y||}\] \[sim(x,y) = \frac{\sum_{s \in S_{xy}}(r_{xs} - \overline{r_x})(r_{ys} - \overline{r_y})}{\sqrt{\sum_{s \in S_{xy}}(r_{xs} - \overline{r_x})^2}\sqrt{\sum_{s \in S_{xy}}(r_{ys} - \overline{r_y})^2}}\]
  • $r_A={4,0,0,5,1,0,0}$ and $\overline{r_A}=\frac{10}{3}$
  • $r_B={5,5,4,0,0,0,0}$ and $\overline{r_B}=\frac{14}{3}$
  • $r_C={0,0,0,2,4,5,0}$ and $\overline{r_C}=\frac{11}{3}$
  • $r_D={0,3,0,0,0,0,3}$ and $\overline{r_D}=3$
  HP1 HP2 HP3 TW SW1 SW2 SW3
A 2/3     5/3 -7/3    
B 1/3 1/3 -2/3        
C       -5/3 1/3 4/3  
D   0         0
  • sim(A,B)=0.092
  • sim(A,C)=-0.559

  • A is more similar to B than to C!
  • Because the value is also the correlation value:
    • A is very similar to B
    • A is very different from C
Rating Predictions
  • From similarity metric to recommendations:
    • Let $r_x$ be the vector of user x’s ratings
    • Let $N$ be the set of $k$ users most similar to $x$ who have rated item $i$
  • Prediction for item $s$ of user $x$:
\[r_{xi} = \frac{1}{k}\sum_{y \in N}r_{yi}\] \[r_{xi} = \frac{\sum_{y \in N}s_{xy} \cdot r_{yi}}{\sum_{y \in N}s_{xy}}\]
  • Other approaches
Item-item collaborative filtering
  • Another view: item-item
  • For item i, find other similar items
  • Estimate rating for item i based on ratings for similar items
  • Can use same similarity metrics and prediction functions as in user-user model

    • $s_{ij}$: similarity of items $i$ and $j$
    • $r_{xj}$: rating of user $x$ on item $j$
    • $N(i;x)$: set items rated by $x$ similar to $i$
\[r_{xi} = \frac{\sum_{j \in N(i;x)}s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)}s_{ij}}\]
Example: item to item
  $u_1$ $u_2$ $u_3$ $u_4$ $u_5$ $u_6$ $u_7$ $u_8$ $u_9$ $u_{10}$ $u_{11}$ $u_{12}$
$m_1$ 1   3     5     5   4  
$m_2$     5 4     4     2 1 3
$m_3$ 2 4   1 2   3   4 3 5  
$m_4$   2 4   5     4     2  
$m_5$     4 3 4 2         2 5
$m_6$ 1   3   3     2     4  
Estimate rating of movie 1 by user 5
  $u_1$ $u_2$ $u_3$ $u_4$ $u_5$ $u_6$ $u_7$ $u_8$ $u_9$ $u_{10}$ $u_{11}$ $u_{12}$
$m_1$ 1   3   ??? 5     5   4  
$m_2$     5 4     4     2 1 3
$m_3$ 2 4   1 2   3   4 3 5  
$m_4$   2 4   5     4     2  
$m_5$     4 3 4 2         2 5
$m_6$ 1   3   3     2     4  
Identify movies similar to movie 1, rated by user 5
  • Pearson correlation for similarity calculation:
    • Movie 3
    • Movie 5
  $u_1$ $u_2$ $u_3$ $u_4$ $u_5$ $u_6$ $u_7$ $u_8$ $u_9$ $u_{10}$ $u_{11}$ $u_{12}$ $sim(1,m_i)$
$m_1$ 1   3   ??? 5     5   4   1.0
$m_2$     5 4     4     2 1 3 -0.18
$m_3$ 2 4   1 2   3   4 3 5   0.41
$m_4$   2 4   5     4     2   -0.10
$m_5$     4 3 4 2         2 5 -0.31
$m_6$ 1   3   3     2     4   0.59
Similarity weights
  • $s_{1,3} = 0.41$
  • $s_{1,6} = 0.59$
  $u_1$ $u_2$ $u_3$ $u_4$ $u_5$ $u_6$ $u_7$ $u_8$ $u_9$ $u_{10}$ $u_{11}$ $u_{12}$ $sim(1,m_i)$
$m_1$ 1   3   ??? 5     5   4   1.0
$m_2$     5 4     4     2 1 3 -0.18
$m_3$ 2 4   1 2   3   4 3 5   0.41
$m_4$   2 4   5     4     2   -0.10
$m_5$     4 3 4 2         2 5 -0.31
$m_6$ 1   3   3     2     4   0.59
Predict rating
  • $s_{1,3} = 0.41$
  • $s_{1,6} = 0.59$
  • $r_{1,5} = \frac{0.41 * 2 + 0.59 * 3}{0.41 + 0.59} = 2.6$
  $u_1$ $u_2$ $u_3$ $u_4$ $u_5$ $u_6$ $u_7$ $u_8$ $u_9$ $u_{10}$ $u_{11}$ $u_{12}$ $sim(1,m_i)$
$m_1$ 1   3   2.6 5     5   4   1.0
$m_2$     5 4     4     2 1 3 -0.18
$m_3$ 2 4   1 2   3   4 3 5   0.41
$m_4$   2 4   5     4     2   -0.10
$m_5$     4 3 4 2         2 5 -0.31
$m_6$ 1   3   3     2     4   0.59
CF: Common Practice
  • Define similarity $s_{ij}$ of items $i$ and $j$
  • Select k nearest neighbors $N(i; x)$
    • Items most similar to $i$, that were rated by $x$
  • Estimate rating $r_{xi}$ as the weighted average:
\[r_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)}s_{ij} \cdot (r_{xj} - b_{xj})}{\sum_{j \in N(i;x)}s_{ij}}\]
  • $b_{xi}$: baseline estimate for $r_{xi}$
  • $b_{xi} = \mu + b_x + b_i$
    • $\mu$ is overall mean rating
    • $b_x$ is the rating deviation of user $x$ (average rating minus overall mean rating)
    • $b_i$ is the rating deviation of movie $i$
CF: Pros and cons
  • Pros:
    • Works for any kind of item
    • No feature selection needed
  • Cons:
    • Cold Start:
      • Need enough users in the system to find a match
    • Sparsity:
      • The user/ratings matrix is sparse
      • Hard to find users that have rated the same items
    • First rater:
      • Cannot recommend an item that has not been previously rated
      • New items, Esoteric items
    • Popularity bias:
      • Cannot recommend items to someone with unique taste
      • Tends to recommend popular items
  • Hybrids:
    • Implement two or more different recommenders and combine predictions
      • Perhaps using a linear model
    • Add content-based methods to collaborative filtering
      • Item profiles for new item problem
      • Demographics to deal with new user problem
Evaluation
  • Compare predictions with known ratings
  • Root-mean-square error (RMSE)
    • $r_{xi}$: predicted
    • ${r^*}_{xi}$: true rating
    \[RMSE = \sqrt{\sum_{xi}(r_{xi} - {r^*}_{xi})^2}\]
  • Precision at top 10:
    • % of those in top 10
  • Rank Correlation:
    • Spearman’s correlation between system’s and user’s complete rankings
Add data
  • Leverage all the data
    • Don’t try to reduce data size in an effort to make fancy algorithms work
    • Simple methods on large data do best
  • Add more data
    • e.g., add IMDB data on genres
  • More data beats better algorithms

The Netflix Prize

Training data
  • Training data
    • 100 million ratings, 480,000 users, 17,770 movies
    • 6 years of data: 2000-2005
  • Test data
    • Last few ratings of each user (2.8 million)
  • Evaluation criterion:
    • Root Mean Square Error
    • Netflix’s system RMSE: 0.9514
  • Competition
    • 2,700+ teams
    • $1 million prize for 10% improvement on Netflix
  • Data no longer available due to later researchers were able to extract personal information based on reviews.
Winner: BellKor Recommender System
Example: modeling local and global effects
  • Global:
    • Mean movie rating: 3.7 stars
    • The Sixth Sense is 0.5 stars above avg.
    • Joe rates 0.2 stars below avg.
    • Baseline estimation: Joe will rate The Sixth Sense 4 stars
  • Local neighborhood (CF/NN):
    • Joe didn’t like related movie Signs
    • Final estimate: Joe will rate The Sixth Sense 3.8 stars
Interpolation weights \[r_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)}s_{ij} \cdot (r_{xj} - b_{xj})}{\sum_{j \in N(i;x)}s_{ij}}\]
Problems
  • Similarity measures are “arbitrary”
  • Pairwise similarities neglect interdependencies among users
  • aking a weighted average can be restricting
Solution

Instead of $s_{ij}$ use $w_{ij}$ that we estimate directly from data

\[r_{xi} = b_{xi} + \sum_{j \in N(i;x)}w_{ij}(r_{xj} - b_{xj})\]
  • $N(i;x)$ is set of movies rated by user x that are similar to movie i
  • $w_{ij}$ is the interpolation weight (some real number)
  • $w_{ij}$ models interaction between pairs of movies (it does not depend on user $x$)
  • Goals:
    • Find $w_{ij}$ that minimize SSE (sum of squared errors) on training data!
      • Models relationships between item i and its neighbors j
    • $w_{ij}$ can be learned/estimated based on x and all other users that rated i
Recommendations via Optimization
  • Idea: Let’s set values w such that they work well on known (user, item) ratings
  • How to find such values w?
  • Idea: Define an objective function and solve the optimization problem
  • Find $w{ij}$ that minimize SSE on training data!
    • Think if w as a vector of numbers.
\[J(w)=\sum_{x,i} \left( \left[ b_{xi} + \sum_{j \in N(i;x)}w_{ij}(r_{xj} - b_{xj}) \right] - r_{xi} \right)^2\]
Performance
  • Optimization via gradient descent
    • Not there yet
Latent factor models (singular value decomposition - SVD)
  • Utility matrix:
  HP1 HP2 HP3 TW SW1 SW2 SW3
A 4     5 1    
B 5 5 4        
C       2 4 5  
D   3         3
  • Idea:
    • Users respond to a small number of features in movies: genres, actors, actresses, theme songs, etc.
    • We can break utility matrix M(n,m) into two matrices, U (n,d) and V (d,v)
      • U correponds to movies and features
      • V corresponds to features and users
    • Instead of estimating the rating on M, we estimate users’ response toward features, and movies’ relationship with features.
      • We can calculate the missing rating as a dot product of these two matrices.
      • Optimization/estimation of U and V using Stochastic Gradient Descent.
Bringing everything together
  • Baseline predictor
    • Separates users and movies
    • Benefits from insights into user’s behavior
    • Among the main practical contributions of the competition
  • User-Movie interaction
    • Characterizes the matching between users and movies
    • Attracts most research in the field
    • Benefits from algorithmic and mathematical innovations
\[r_{xi} = \mu + b_x + b_i + q_i \cdot p_x\]
  • Example
    • Mean rating: $\mu$ = 3.7
    • You are a critical reviewer: your ratings are 1 star lower than the mean: $b_x$ = -1
    • Star Wars gets a mean rating of 0.5 higher than average movie: $b_i$ = + 0.5
    • Predicted rating for you on Star Wars: = 3.7 - 1 + 0.5 = 3.2
Additional details
  • Sudden rise in the average movie rating (early 2004)
    • Improvements in Netflix
    • GUI improvements
    • Meaning of rating changed
  • Movie age
    • Users prefer new movies without any reasons
    • Older movies are just inherently better than newer ones
\[r_{xi} = \mu + b_x(t) + b_i(t) + q_i \cdot p_x\]
  • Add temporal dependence
Still not there yet
  • Continuous calibration of models