Recommender Systems
Problem formulation
Example: Predicting movie ratings
Users rate movies on a scale of 0-5; a ? means the user has not seen the movie.
Movie | Alice(1) | Bob(2) | Carol(3) | Dave(4) |
---|---|---|---|---|
Love at last | 5 | 5 | 0 | 0 |
Romance forever | 5 | ? | ? | 0 |
Cute puppies of love | ? | 4 | 0 | ? |
Nonstop car chases | 0 | 0 | 5 | 4 |
Swords vs. karate | 0 | 0 | 5 | ? |
From the table we can see that Alice and Bob like romance movies and dislike action movies, while Carol and Dave are just the opposite. We can recommend movies they have not seen accordingly.
Notation is defined as follows:
- \(n_u\) = no. users
- \(n_m\) = no. movies
- \(r(i,j)\) = 1 if user \(j\) has rated movie \(i\)
- \(y^{(i,j)}\) = rating given by user \(j\) to movie \(i\) (defined only if \(r(i,j) = 1\))
In this example, \(n_u = 4\), \(n_m = 5\), \(r(1,2) = 1\), \(y^{(1,2)} = 5\).
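As a concrete sketch (Python/NumPy, not part of the course material; encoding a "?" as NaN is an implementation choice), the table and notation above can be represented as a ratings matrix \(Y\) and an indicator matrix \(R\):

```python
import numpy as np

# Ratings matrix Y (rows = movies, columns = users); np.nan marks a "?"
Y = np.array([
    [5, 5, 0, 0],             # Love at last
    [5, np.nan, np.nan, 0],   # Romance forever
    [np.nan, 4, 0, np.nan],   # Cute puppies of love
    [0, 0, 5, 4],             # Nonstop car chases
    [0, 0, 5, np.nan],        # Swords vs. karate
])

R = ~np.isnan(Y)        # r(i,j) = 1 iff user j has rated movie i
n_m, n_u = Y.shape      # number of movies, number of users

print(n_u, n_m)         # 4 5
print(int(R[0, 1]))     # r(1,2) = 1 (1-indexed, as in the text)
print(Y[0, 1])          # y^(1,2) = 5.0
```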
Content-based recommendations
For the example above, we look at the known ratings and try to predict the values of the missing entries (the question marks).
In a content-based recommendation system, we assume we have data describing features of the things we want to recommend. Suppose each movie has two features: \(x_1\) measures the degree of romance and \(x_2\) the degree of action. Each movie then has a feature vector; for example \(x^{(1)}\), the feature vector of the first movie, is [0.9, 0].
Movie | Alice(1) | Bob(2) | Carol(3) | Dave(4) | \(x_1\)(romance) | \(x_2\)(action) |
---|---|---|---|---|---|---|
Love at last | 5 | 5 | 0 | 0 | 0.9 | 0 |
Romance forever | 5 | ? | ? | 0 | 1.0 | 0.01 |
Cute puppies of love | ? | 4 | 0 | ? | 0.99 | 0 |
Nonstop car chases | 0 | 0 | 5 | 4 | 0.1 | 1.0 |
Swords vs. karate | 0 | 0 | 5 | ? | 1 | 0.9 |
For each user \(j\), learn a parameter vector \(\theta^{(j)} \in R^3\). Predict user \(j\)'s rating of movie \(i\) as \((\theta^{(j)})^Tx^{(i)}\) stars.
Now we want to build a recommendation algorithm on top of these features. Suppose we use a linear regression model and train one model per user; for example, \(\theta^{(1)}\) is the parameter vector of the first user's model. We then have:
- \(\theta^{(j)}\): parameter vector for user \(j\), of dimension \(n+1\).
- \(x^{(i)}\): feature vector for movie \(i\).
- For user \(j\) and movie \(i\), the predicted rating is \((\theta^{(j)})^Tx^{(i)}\).
An example:
Given \(x^{(3)} = \left[ \begin{matrix} 1 \\ 0.99 \\ 0 \end{matrix} \right]\), if we know \(\theta^{(1)} = \left[ \begin{matrix} 0 \\ 5 \\ 0 \end{matrix} \right]\), then Alice's predicted rating for movie 3 is \((\theta^{(1)})^Tx^{(3)} = 5 \times 0.99 = 4.95\).
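In code, this prediction is just a dot product (a minimal NumPy sketch; the parameter values follow the worked example and are illustrative):

```python
import numpy as np

# Feature vector of movie 3 (intercept feature x_0 = 1 included)
x3 = np.array([1.0, 0.99, 0.0])
# Parameters for user 1 (Alice); values assumed for illustration
theta1 = np.array([0.0, 5.0, 0.0])

prediction = theta1 @ x3      # (theta^(1))^T x^(3)
print(round(prediction, 2))   # 4.95
```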
Optimization objective
For user \(j\), the cost is the sum of squared prediction errors over the movies the user has rated, plus a regularization term:
To learn \(\theta^{(j)}\) (the parameters for user \(j\)):
\[ min_{\theta^{(j)}} \frac{1}{2} \sum_{i:r(i,j)=1} \left((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum^n_{k=1}(\theta_k^{(j)})^2 \]
Here \(i:r(i,j)=1\) means we sum only over the movies that user \(j\) has rated. In an ordinary linear regression model both the error term and the regularization term would be multiplied by \(\frac{1}{2m}\); here we drop the \(m\). As usual, we do not regularize the bias term \(\theta_0\).
The cost function above is for a single user; to learn the parameters of all users, we sum the cost over all users:
To learn \(\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)}\):
\[ min_{\theta^{(1)},\dots,\theta^{(n_u)}}\frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum^n_{k=1}(\theta_k^{(j)})^2 \]
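A vectorized sketch of this summed cost (Python/NumPy; the matrix layout, with movies as rows of X and users as rows of Theta, is an implementation choice, not from the text):

```python
import numpy as np

def cost(Theta, X, Y, R, lam):
    """Regularized sum of squared errors over all observed ratings.

    Theta: (n_u, n+1) user parameters, X: (n_m, n+1) movie features
    (column 0 is the intercept feature x_0 = 1), Y: (n_m, n_u) ratings,
    R: (n_m, n_u) boolean indicator r(i,j).
    """
    err = np.where(R, X @ Theta.T - Y, 0.0)  # zero out unrated entries
    # theta_0 (column 0) is not regularized
    return 0.5 * np.sum(err ** 2) + lam / 2 * np.sum(Theta[:, 1:] ** 2)
```

Using `np.where` on the indicator matrix R reproduces the \(i:r(i,j)=1\) condition: unrated entries contribute nothing to the error term.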
Optimization algorithm
Gradient descent update:
If we use gradient descent to find the optimal solution, taking the partial derivatives of the cost function gives the update rules:
\[ \theta_k^{(j)} := \theta_k^{(j)} - \alpha \sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}\right)x_k^{(i)} \ \ (for \ \ k = 0) \\ \theta_k^{(j)} := \theta_k^{(j)} - \alpha \left(\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}\right)x_k^{(i)} + \lambda \theta_k^{(j)}\right) \ \ (for \ \ k \ne 0) \]
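One gradient descent step on all users' parameters can be sketched in vectorized form (Python/NumPy; the matrix shapes are an assumed convention, with movies as rows of X and users as rows of Theta):

```python
import numpy as np

def theta_gradient_step(Theta, X, Y, R, alpha, lam):
    """One gradient-descent step on the user parameters Theta.

    Column 0 of X is the intercept feature x_0 = 1, and theta_0
    (column 0 of Theta) is not regularized, matching the k = 0 case.
    """
    err = np.where(R, X @ Theta.T - Y, 0.0)  # (n_m, n_u), unrated -> 0
    grad = err.T @ X                         # (n_u, n+1) error term
    reg = lam * Theta
    reg[:, 0] = 0.0                          # do not regularize theta_0
    return Theta - alpha * (grad + reg)
```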
Collaborative filtering
Problem motivation
In the content-based approach above, we assumed the features of every movie were available and used them to train each user's parameters. Conversely, if we have the users' parameters, we can learn the movies' features.
Optimization objective
Given \(\theta^{(1)},\dots,\theta^{(n_u)}\), to learn \(x^{(i)}\):
\[ min_{x^{(i)}} \frac{1}{2} \sum_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2}\sum^n_{k=1}(x_k^{(i)})^2 \]
Given \(\theta^{(1)},\dots,\theta^{(n_u)}\), to learn \(x^{(1)},\dots,x^{(n_m)}\):
\[ min_{x^{(1)},\dots,x^{(n_m)}} \frac{1}{2}\sum^{n_m}_{i=1} \sum_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2}\sum^{n_m}_{i=1}\sum^n_{k=1}(x_k^{(i)})^2 \]
Collaborative filtering
Given \(x^{(1)},\dots,x^{(n_m)}\) (and movie ratings), can estimate \(\theta^{(1)},\dots,\theta^{(n_u)}\).
Given \(\theta^{(1)},\dots,\theta^{(n_u)}\), can estimate \(x^{(1)},\dots,x^{(n_m)}\).
If we know \(\theta\) we can learn \(x\), and if we know \(x\) we can learn \(\theta\). By randomly initializing the parameters and iterating back and forth, we eventually converge to a reasonable set of movie features and a reasonable set of user parameter estimates.
Collaborative filtering algorithm
Collaborative filtering optimization objective
We now combine the two optimization objectives above into one.
Minimizing over \(x^{(1)},\dots,x^{(n_m)}\) and \(\theta^{(1)},\dots,\theta^{(n_u)}\) simultaneously:
\(min_{x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}}J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)})\), where the cost function is:
\[ J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}) = \frac{1}{2}\sum_{(i,j):r(i,j) = 1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2 \]
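A NumPy sketch of this combined cost. Note that in collaborative filtering no intercept feature is used, so every component of \(x\) and \(\theta\) is regularized:

```python
import numpy as np

def J(X, Theta, Y, R, lam):
    """Combined collaborative-filtering cost over X and Theta.

    X: (n_m, n) movie features, Theta: (n_u, n) user parameters,
    Y: (n_m, n_u) ratings, R: (n_m, n_u) boolean indicator r(i,j).
    """
    err = np.where(R, X @ Theta.T - Y, 0.0)  # zero out unrated entries
    return (0.5 * np.sum(err ** 2)
            + lam / 2 * np.sum(X ** 2)       # regularize all of x
            + lam / 2 * np.sum(Theta ** 2))  # regularize all of theta
```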
Collaborative filtering algorithm
- Initialize \(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}\) to small random values.
- Minimize \(J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)})\) using gradient descent (or an advanced optimization algorithm), e.g. for every \(j = 1,\dots,n_u\), \(i = 1,\dots,n_m\):
\[ x_k^{(i)} := x_k^{(i)} - \alpha\left(\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})\theta_k^{(j)} + \lambda x_k^{(i)}\right) \\ \theta_k^{(j)} := \theta_k^{(j)} - \alpha\left(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})x_k^{(i)} + \lambda\theta_k^{(j)}\right) \]
- For a user with parameters \(\theta\) and a movie with (learned) features \(x\), predict a star rating of \(\theta^Tx\). This completes the training algorithm.
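The full training loop above can be sketched as follows (Python/NumPy; the learning rate, iteration count, and initialization scale are illustrative choices, not values from the course):

```python
import numpy as np

def collab_filter(Y, R, n=2, alpha=0.01, lam=0.0, iters=3000, seed=0):
    """Jointly learn movie features X and user parameters Theta by
    gradient descent on the combined cost (a minimal sketch)."""
    rng = np.random.default_rng(seed)
    n_m, n_u = Y.shape
    # initialize X and Theta to small random values
    X = 0.1 * rng.standard_normal((n_m, n))
    Theta = 0.1 * rng.standard_normal((n_u, n))
    for _ in range(iters):
        err = np.where(R, X @ Theta.T - Y, 0.0)  # unrated entries -> 0
        # simultaneous update of both parameter sets
        X_new = X - alpha * (err @ Theta + lam * X)
        Theta = Theta - alpha * (err.T @ X + lam * Theta)
        X = X_new
    return X, Theta
```

After training, the predicted rating of movie i by user j is `(X @ Theta.T)[i, j]`.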
Vectorization: Low rank matrix factorization
Having introduced the collaborative filtering algorithm, this section describes a vectorized implementation and some other things the algorithm can do, such as:
- Given one product, find other products related to it.
- When a user has recently viewed a product, recommend related products to them.
Approach: write out the collaborative filtering algorithm's predictions in a different, vectorized form.
Collaborative filtering
Suppose we have the rating data of four users for five movies. We collect all of these ratings into a matrix \(Y\) with five rows and four columns:
We write \(X = \left[ \begin{matrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \dots \\ (x^{(n_m)})^T \end{matrix} \right]\) and \(\Theta = \left[ \begin{matrix} (\theta^{(1)})^T \\ (\theta^{(2)})^T \\ \dots \\ (\theta^{(n_u)})^T \end{matrix} \right]\); the predicted ratings are then: \(\left[ \begin{matrix} (\theta^{(1)})^T(x^{(1)}) & (\theta^{(2)})^T(x^{(1)}) & \dots & (\theta^{(n_u)})^T(x^{(1)}) \\ (\theta^{(1)})^T(x^{(2)}) & (\theta^{(2)})^T(x^{(2)}) & \dots & (\theta^{(n_u)})^T(x^{(2)}) \\ \vdots & \vdots & \vdots & \vdots \\ (\theta^{(1)})^T(x^{(n_m)}) & (\theta^{(2)})^T(x^{(n_m)}) & \dots & (\theta^{(n_u)})^T(x^{(n_m)}) \end{matrix} \right]\).
This is the vectorized (matrix) form of collaborative filtering: the full matrix of predicted ratings is \(X\Theta^T\).
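In NumPy this vectorized prediction is a single matrix product (the feature and parameter values below are made up for illustration):

```python
import numpy as np

# Movie feature vectors stacked as rows of X, user parameter vectors
# stacked as rows of Theta; all predicted ratings are then X @ Theta.T,
# whose (i, j) entry is (theta^(j))^T x^(i).
X = np.array([[0.9, 0.0],
              [0.1, 1.0]])
Theta = np.array([[5.0, 0.0],
                  [0.0, 5.0]])

predictions = X @ Theta.T
print(predictions.shape)   # (2, 2): n_m x n_u
```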
Finding related movies
For each product \(i\), we learn a feature vector \(x^{(i)} \in R^n\).
How to find movies \(j\) related to movie \(i\)?
5 most similar movies to movie \(i\): Find the 5 movies \(j\) with the smallest \(||x^{(i)} - x^{(j)}||\).
By comparing feature vectors we can identify similar products: the 5 products \(j\) that minimize \(||x^{(i)} - x^{(j)}||\) are the 5 most similar to product \(i\), and can be recommended as related products.
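A minimal sketch of this nearest-neighbour search over learned feature vectors:

```python
import numpy as np

def most_similar(X, i, k=5):
    """Return the indices of the k movies whose feature vectors are
    closest (in Euclidean distance) to that of movie i."""
    d = np.linalg.norm(X - X[i], axis=1)  # ||x^(i) - x^(j)|| for all j
    d[i] = np.inf                          # exclude movie i itself
    return np.argsort(d)[:k]
```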
Implementational detail: Mean normalization
Users who have not rated any movies
Consider the following data set, in which one user, Eve, has not rated any movie. If we run the minimization as before over every movie's ratings, then because Eve has rated nothing, the condition \(r(i,j) = 1\) is never satisfied for any \(i\); the error term therefore contributes nothing for Eve, and the only part of the objective that affects her parameters is \(min_{\theta^{(5)}}\frac{\lambda}{2}[(\theta_1^{(5)})^2 + (\theta_2^{(5)})^2]\). Minimizing this drives both components to \(0\), so \((\theta^{(5)})^Tx^{(i)} = 0\) for every movie \(i\), and all of Eve's predicted ratings are \(0\).
Although this is what the optimization produces, the result is useless for recommendation: every movie gets the same predicted rating of \(0\) for Eve, so there is no basis for ranking movies for her.
Mean Normalization
For each movie, compute the average of its existing ratings (ignoring the ?s), denoted \(\mu\). Normalize the original matrix \(Y\) by subtracting from each row the corresponding movie's average, obtaining a new \(Y\); then learn \(\theta\) and \(x\) from this new \(Y\). For Eve, the earlier minimization analysis still holds, i.e. \(min_{\theta^{(5)}}\frac{\lambda}{2}[(\theta_1^{(5)})^2 + (\theta_2^{(5)})^2]\).
After normalization, the prediction formula becomes \((\theta^{(5)})^Tx^{(i)} + \mu_i\); since \(\theta^{(5)} = 0\), Eve's predicted ratings are simply \(\mu = \left[ \begin{matrix} 2.5 \\ 2.5 \\ 2 \\ 2.25 \\ 1.25 \end{matrix} \right]\).
In fact, this prediction is acceptable: since we know nothing about Eve's preferences, we predict her rating of each movie to be that movie's average rating.
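A sketch of the mean-normalization step (here Dave's missing rating of the last movie is taken as 0 so the per-movie means match the values quoted in the text; that value is an assumption):

```python
import numpy as np

def mean_normalize(Y, R):
    """Subtract each movie's mean observed rating from its row of Y."""
    mu = np.array([Y[i, R[i]].mean() if R[i].any() else 0.0
                   for i in range(Y.shape[0])])
    Ynorm = np.where(R, Y - mu[:, None], 0.0)
    return Ynorm, mu

# The 5x4 ratings table with Eve as a fifth user who has rated nothing;
# np.nan marks a "?" (Dave's last rating assumed 0, as discussed above)
Y = np.array([
    [5, 5, 0, 0, np.nan],
    [5, np.nan, np.nan, 0, np.nan],
    [np.nan, 4, 0, np.nan, np.nan],
    [0, 0, 5, 4, np.nan],
    [0, 0, 5, 0, np.nan],
])
R = ~np.isnan(Y)
Ynorm, mu = mean_normalize(Y, R)
print(mu)   # per-movie means; Eve's predictions after training
```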
Special case: if some movie has no ratings at all, we can analogously normalize each column to mean \(0\): compute the mean of each column of \(Y\) and subtract it from the corresponding column to obtain the new \(Y\) matrix.