🎬

Recommender Systems

Applying optimization to predict user preferences and recommend content.

🍿

1. The Problem

Movie Recommendation

Given a history of movie ratings submitted by users, our goal is to recommend movies that a user is likely to enjoy but hasn't seen yet.

Imagine a matrix where rows represent users and columns represent movies. The entries are ratings (e.g., 1-5 stars). Most of this matrix is empty (missing ratings).

User \ Movie    M1    M2    M3    M4    M5
Ana              5     ?     1     5     ?
Bob              ?     5     ?     2     4
Carlos           1     ?     5     ?     1

We rely on statistical dependencies in the data: users who rate shared movies in a consistent way are likely to agree (or disagree) on unseen movies too. In the table above, Ana and Carlos rate M1 and M3 in exactly opposite ways, suggesting opposite tastes; since Carlos gave M5 a 1, Ana may well like M5.
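For concreteness, here is the toy matrix above as a small NumPy array, with np.nan marking the missing ratings (a minimal sketch; the variable names are illustrative):

```python
import numpy as np

# Rows: Ana, Bob, Carlos; columns: M1..M5. np.nan marks a missing rating.
ratings = np.array([
    [5,      np.nan, 1,      5,      np.nan],   # Ana
    [np.nan, 5,      np.nan, 2,      4     ],   # Bob
    [1,      np.nan, 5,      np.nan, 1     ],   # Carlos
])

known = ~np.isnan(ratings)   # Boolean mask of the observed entries
```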

🧩

2. Matrix Factorization

We assume the ratings matrix \( A \) is approximately low-rank. This means users' preferences and movies' characteristics can be described by a small number of latent features (e.g., "Action", "Comedy", "Romance").

User Feature Matrix (\( P \))

Describes how much each user likes each feature.

Row u: [Comedy_score, Action_score, ...]

Movie Feature Matrix (\( Q \))

Describes how strongly each movie exhibits each feature.

Row i: [Comedy_score, Action_score, ...]

The Prediction Model

The predicted rating for user \( u \) on movie \( i \) is the dot product of their feature vectors:

\[ \hat{A}_{ui} = p_u \cdot q_i \]

Matrix form: \( \hat{A} \approx P Q^T \)
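A minimal NumPy sketch of the prediction model (the dimensions and random initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 2                      # number of latent features
P = rng.random((3, k))     # user features: one row p_u per user
Q = rng.random((5, k))     # movie features: one row q_i per movie

u, i = 0, 1                # Ana (row 0), movie M2 (column 1)
rating_hat = P[u] @ Q[i]   # predicted rating: dot product p_u . q_i

A_hat = P @ Q.T            # all predictions at once: A_hat ~ P Q^T, shape (3, 5)
```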

📉

3. Optimization Model

We want to find matrices \( P \) and \( Q \) that minimize the difference between predicted ratings and actual ratings (for the ones we know).

Loss Function

Let \( K \) be the set of indices \( (u,i) \) where we have a rating. We minimize the squared error:

\[ \min_{P,Q} \sum_{(u,i) \in K} (A_{ui} - p_u \cdot q_i)^2 \]

Regularization

To prevent overfitting (learning noise instead of patterns), we add a penalty for large feature values:

\[ \min_{P,Q} \sum_{(u,i) \in K} \left[ (A_{ui} - p_u \cdot q_i)^2 + \lambda \left( \|p_u\|^2 + \|q_i\|^2 \right) \right] \]

Here \( \lambda \) is the regularization parameter.
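A sketch of this objective in NumPy, assuming ratings uses np.nan for missing entries and P, Q hold the user and movie feature rows (the names and the default lam value are illustrative):

```python
import numpy as np

def regularized_loss(P, Q, ratings, lam=0.1):
    """Squared error over the known ratings plus the L2 penalty."""
    total = 0.0
    for u, i in zip(*np.where(~np.isnan(ratings))):  # only (u, i) in K
        err = ratings[u, i] - P[u] @ Q[i]            # A_ui - p_u . q_i
        total += err**2 + lam * (P[u] @ P[u] + Q[i] @ Q[i])
    return total
```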

🚀

4. Training the Model

We use Stochastic Gradient Descent (SGD) to find good \( P \) and \( Q \): for each known rating \( (u,i) \), we take a small step (with learning rate \( \alpha \)) that reduces that rating's error.

Let the error be \( e_{ui} = A_{ui} - p_u \cdot q_i \).

Update rule for \( p_u \):
\( p_u \leftarrow p_u + \alpha (e_{ui} q_i - \lambda p_u) \)

Update rule for \( q_i \):
\( q_i \leftarrow q_i + \alpha (e_{ui} p_u - \lambda q_i) \)

We repeat this for all known ratings until the error converges.
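A minimal SGD training sketch that implements exactly these update rules (the hyperparameter values are illustrative, not tuned):

```python
import numpy as np

def train(ratings, k=2, alpha=0.01, lam=0.1, n_epochs=500, seed=0):
    """Factorize a partially observed ratings matrix with SGD."""
    rng = np.random.default_rng(seed)
    n_users, n_movies = ratings.shape
    P = rng.random((n_users, k))
    Q = rng.random((n_movies, k))
    known = list(zip(*np.where(~np.isnan(ratings))))   # indices (u, i) in K

    for _ in range(n_epochs):
        for u, i in known:
            e = ratings[u, i] - P[u] @ Q[i]            # error e_ui
            p_old = P[u].copy()                        # keep old p_u for the q_i update
            P[u] += alpha * (e * Q[i] - lam * P[u])    # p_u <- p_u + a(e q_i - lam p_u)
            Q[i] += alpha * (e * p_old - lam * Q[i])   # q_i <- q_i + a(e p_u - lam q_i)
    return P, Q
```

After training, \( \hat{A} = PQ^T \) fills in the missing entries, and the highest predicted ratings among a user's unseen movies become the recommendations.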

✍️

5. Test Your Understanding

1. Why do we use matrix factorization for recommender systems?

2. What is the purpose of the regularization term \( \lambda (\|p\|^2 + \|q\|^2) \)?

🔍 Spot the Mistake!

A developer implements the loss function:

"I will minimize \( \| PQ^T - A \|_F^2 \) over all entries of the matrix A."

Mistake: We don't know all entries of A! The sum must run only over the known ratings \( (u,i) \in K \); otherwise the model is trained to reproduce arbitrary values in the missing cells. (The Frobenius norm itself is a perfectly valid matrix norm; the problem is which entries it sums over.)
