1. The Problem
Movie Recommendation
Given a history of ratings that users have assigned to movies, our goal is to recommend movies a user is likely to enjoy but hasn't seen yet.
Imagine a matrix where rows represent users and columns represent movies. The entries are ratings (e.g., 1-5 stars). Most of this matrix is empty (missing ratings).
| User \ Movie | M1 | M2 | M3 | M4 | M5 |
|---|---|---|---|---|---|
| Ana | 5 | ? | 1 | 5 | ? |
| Bob | ? | 5 | ? | 2 | 4 |
| Carlos | 1 | ? | 5 | ? | 1 |
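The table above can be represented as a partially observed matrix. A minimal sketch using NumPy, with `np.nan` standing in for the missing "?" entries:

```python
import numpy as np

# Ratings from the table above; np.nan marks a missing rating ("?").
A = np.array([
    [5, np.nan, 1, 5, np.nan],   # Ana
    [np.nan, 5, np.nan, 2, 4],   # Bob
    [1, np.nan, 5, np.nan, 1],   # Carlos
])

# Boolean mask of the entries we actually observe.
known = ~np.isnan(A)
print(int(known.sum()))  # 9 observed ratings out of 15 cells
```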
We rely on dependencies in the data. For example, if Ana and Carlos have similar ratings for M1 and M3, they might have similar tastes for M5.
2. Matrix Factorization
We assume the ratings matrix \( A \) is low-rank. This means users' preferences and movies' characteristics can be described by a small number of latent features (e.g., "Action", "Comedy", "Romance").
User Feature Matrix (\( P \))
Describes how much each user likes each feature.
Movie Feature Matrix (\( Q \))
Describes how much each movie belongs to each feature.
The Prediction Model
The predicted rating for user \( u \) on movie \( i \) is the dot product of their feature vectors:
\[ \hat{A}_{ui} = p_u \cdot q_i \]
Matrix form: \( \hat{A} \approx P Q^T \)
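As a toy numeric check of the dot-product prediction, assume two latent features ("Action" and "Comedy"); the feature values below are made up for illustration:

```python
import numpy as np

# Hypothetical 2-feature latent vectors; numbers invented for illustration.
p_ana = np.array([2.2, 0.3])   # Ana leans strongly toward "Action"
q_m5  = np.array([2.0, 0.5])   # M5 is mostly an "Action" movie

pred = p_ana @ q_m5            # p_u · q_i
print(round(pred, 2))          # 4.55 — a high predicted rating
```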
3. Optimization Model
We want to find matrices \( P \) and \( Q \) that minimize the difference between predicted ratings and actual ratings (for the ones we know).
Loss Function
Let \( K \) be the set of indices \( (u,i) \) where we have a rating. We minimize the squared error:
\[ \min_{P,Q} \sum_{(u,i) \in K} (A_{ui} - p_u \cdot q_i)^2 \]
Regularization
To prevent overfitting (learning noise instead of patterns), we add a penalty for large feature values:
\[ \min_{P,Q} \sum_{(u,i) \in K} \left[ (A_{ui} - p_u \cdot q_i)^2 + \lambda \left( \|p_u\|^2 + \|q_i\|^2 \right) \right] \]
Here \( \lambda \) is the regularization parameter: larger values shrink the feature vectors toward zero, trading training fit for generalization.
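A minimal sketch of evaluating this regularized loss over the known entries (assuming `np.nan` marks missing ratings, and applying the \( \lambda \) penalty once per observed rating, which is one common convention):

```python
import numpy as np

def loss(A, P, Q, lam):
    """Regularized squared error over the known entries of A (nan = missing).

    P has one row per user (p_u); Q has one row per movie (q_i)."""
    total = 0.0
    for u, i in zip(*np.where(~np.isnan(A))):
        e = A[u, i] - P[u] @ Q[i]                       # prediction error
        total += e**2 + lam * (P[u] @ P[u] + Q[i] @ Q[i])  # + lambda penalty
    return total
```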
4. Training the Model
We use Gradient Descent to find the optimal \( P \) and \( Q \). Let \( e_{ui} = A_{ui} - p_u \cdot q_i \) denote the prediction error for a known rating \( (u,i) \), and let \( \alpha \) be the learning rate. Differentiating the regularized loss for that single rating gives:
Update rule for \( p_u \):
\( p_u \leftarrow p_u + \alpha (e_{ui} q_i - \lambda p_u) \)
Update rule for \( q_i \):
\( q_i \leftarrow q_i + \alpha (e_{ui} p_u - \lambda q_i) \)
We repeat this for all known ratings until the error converges.
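The training loop above can be sketched as follows. This is a minimal stochastic-gradient-descent implementation under assumed hyperparameters (`k`, `alpha`, `lam`, `epochs` are illustrative choices, not prescribed by the notes):

```python
import numpy as np

def train(A, k=2, alpha=0.01, lam=0.02, epochs=1000, seed=0):
    """SGD on the regularized factorization loss.

    Entries of A equal to np.nan are treated as unknown ratings."""
    rng = np.random.default_rng(seed)
    n_users, n_items = A.shape
    P = 0.1 * rng.standard_normal((n_users, k))  # small random init
    Q = 0.1 * rng.standard_normal((n_items, k))
    known = list(zip(*np.where(~np.isnan(A))))
    for _ in range(epochs):
        for u, i in known:
            e = A[u, i] - P[u] @ Q[i]            # e_ui, the prediction error
            p_old = P[u].copy()                  # keep old p_u for the q_i update
            P[u] += alpha * (e * Q[i] - lam * P[u])
            Q[i] += alpha * (e * p_old - lam * Q[i])
    return P, Q
```

On the toy 3x5 matrix from Section 1, a run of this loop drives the error on the known entries well below that of a trivial constant predictor; `P @ Q.T` then fills in predictions for the missing cells.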
5. Test Your Understanding
1. Why do we use matrix factorization for recommender systems?
2. What is the purpose of the regularization term \( \lambda (\|p\|^2 + \|q\|^2) \)?
🔍 Spot the Mistake!
A developer implements the loss function:
"I will minimize \( \| PQ^T - A \|_F^2 \) over all entries of the matrix A."