Machine Learning - Andrew Ng study notes (XVI)

Recommender Systems

Problem formulation

Example: Predicting movie ratings


Users rate movies on a scale of 0-5; a ? means the user has not rated (seen) that movie.

| Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) |
| --- | --- | --- | --- | --- |
| Love at last | 5 | 5 | 0 | 0 |
| Romance forever | 5 | ? | ? | 0 |
| Cute puppies of love | ? | 4 | 0 | ? |
| Nonstop car chases | 0 | 0 | 5 | 4 |
| Swords vs. karate | 0 | 0 | 5 | ? |

From the table we can see that Alice and Bob like romance movies and dislike action movies, while Carol and Dave are the opposite. Based on this we can recommend to each user the movies they have not seen yet.

Notation is defined as follows:

  1. \(n_u\) = no. users
  2. \(n_m\) = no. movies
  3. \(r(i,j)\) = 1 if user \(j\) has rated movie \(i\)
  4. \(y^{(i,j)}\) = rating given by user \(j\) to movie \(i\) (defined only if \(r(i,j) = 1\))

In this example, \(n_u = 4\), \(n_m = 5\), \(r(1,2) = 1\), \(y^{(1,2)} = 5\).
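As a concrete illustration (not part of the original notes), the table above can be encoded as a rating matrix \(Y\) and an indicator matrix \(R\) in NumPy; the variable names here are purely illustrative:

```python
import numpy as np

# Ratings from the table above; np.nan marks a "?" (movie not rated by that user).
Y_raw = np.array([
    [5, 5, 0, 0],             # Love at last
    [5, np.nan, np.nan, 0],   # Romance forever
    [np.nan, 4, 0, np.nan],   # Cute puppies of love
    [0, 0, 5, 4],             # Nonstop car chases
    [0, 0, 5, np.nan],        # Swords vs. karate
])

R = ~np.isnan(Y_raw)       # R[i, j] is True (1) if user j has rated movie i
Y = np.nan_to_num(Y_raw)   # replace ? with 0; R records which entries actually count

n_m, n_u = Y.shape
print(n_m, n_u, R[0, 1], Y[0, 1])   # 5 4 True 5.0  (i.e. r(1,2) = 1, y^(1,2) = 5)
```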

Content-based recommendations


Content-based recommender systems

Returning to the example above: we look at the data, note the missing movie ratings, and try to predict what the values behind the question marks should be.

In a content-based recommender system we assume that, for the items we want to recommend, we have data describing their features. Suppose each movie has two features: \(x_1\) measures how much it is a romance movie and \(x_2\) measures how much it is an action movie. Each movie then has a feature vector; for example \(x^{(1)}\), the feature vector of the first movie, is [0.9, 0].

| Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) | \(x_1\) (romance) | \(x_2\) (action) |
| --- | --- | --- | --- | --- | --- | --- |
| Love at last | 5 | 5 | 0 | 0 | 0.9 | 0 |
| Romance forever | 5 | ? | ? | 0 | 1.0 | 0.01 |
| Cute puppies of love | ? | 4 | 0 | ? | 0.99 | 0 |
| Nonstop car chases | 0 | 0 | 5 | 4 | 0.1 | 1.0 |
| Swords vs. karate | 0 | 0 | 5 | ? | 0 | 0.9 |

For each user \(j\), learn a parameter \(\theta^{(j)} \in R^3\). Predict that user \(j\) rates movie \(i\) with \((\theta^{(j)})^Tx^{(i)}\) stars.

Now we want to build a recommendation algorithm from these features. Suppose we use linear regression and train a separate linear regression model for each user; for example, \(\theta^{(1)}\) is the parameter vector of the first user's model. We then have:

  1. \(\theta^{(j)}\): parameter vector for user \(j\), of dimension \(n+1\).
  2. \(x^{(i)}\): feature vector for movie \(i\).
  3. For user \(j\) and movie \(i\), the predicted rating is \((\theta^{(j)})^Tx^{(i)}\).

An example:

Suppose \(x^{(3)} = \left[ \begin{matrix} 1 \\ 0.99 \\ 0 \end{matrix} \right]\) and we know \(\theta^{(1)} = \left[ \begin{matrix} 0 \\ 5 \\ 0 \end{matrix} \right]\); then Alice's predicted rating for movie 3 is \((\theta^{(1)})^Tx^{(3)} = 5 \times 0.99 = 4.95\).
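A quick numerical check of this example (a purely illustrative sketch):

```python
import numpy as np

x3 = np.array([1.0, 0.99, 0.0])     # feature vector of movie 3 (intercept x0 = 1, romance, action)
theta1 = np.array([0.0, 5.0, 0.0])  # Alice's parameter vector from the example

print(theta1 @ x3)                  # 4.95, Alice's predicted rating for movie 3
```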

Optimization objective

For user \(j\), the cost is the linear regression model's sum of squared prediction errors plus a regularization term:

To learn \(\theta^{(j)}\) (parameter for user \(j\)):
\[ min_{\theta^{(j)}} \frac{1}{2}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum^n_{k=1}(\theta_k^{(j)})^2 \]
where \(i:r(i,j)=1\) means we sum only over the movies that user \(j\) has rated. In ordinary linear regression both the error term and the regularization term would be multiplied by \(\frac{1}{2m}\); here the constant \(m\) is dropped. As before, we do not regularize the bias term \(\theta_0\).

The cost function above is for a single user; to learn the parameters of all users, we sum the cost over all users:

To learn \(\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)}\):
\[ min_{\theta^{(1)},\dots,\theta^{(n_u)}}\frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum^n_{k=1}(\theta_k^{(j)})^2 \]
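A minimal NumPy sketch of this objective, assuming the \(Y\) and \(R\) matrices built earlier and that the first column of \(X\) holds the intercept feature \(x_0 = 1\); the function and variable names are illustrative:

```python
import numpy as np

def content_based_cost(Theta, X, Y, R, lam):
    """Regularized cost over all users; X is (n_m, n+1) with an intercept column of ones,
    Theta is (n_u, n+1), and Y, R are the rating and indicator matrices."""
    err = (X @ Theta.T - Y) * R                  # only rated entries (r(i,j) = 1) contribute
    reg = (lam / 2) * np.sum(Theta[:, 1:] ** 2)  # theta_0 is not regularized
    return 0.5 * np.sum(err ** 2) + reg
```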

Optimization algorithm

Gradient descent update:

If we use gradient descent to find the optimal solution, taking partial derivatives of the cost function gives the gradient descent update rules:
\[ \theta_k^{(j)} := \theta_k^{(j)} - \alpha\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}\right)x_k^{(i)} \quad (\text{for } k = 0) \\ \theta_k^{(j)} := \theta_k^{(j)} - \alpha\left(\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}\right)x_k^{(i)} + \lambda\theta_k^{(j)}\right) \quad (\text{for } k \ne 0) \]
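The same update, sketched in vectorized NumPy form for all users at once (an illustrative implementation, not code from the course):

```python
import numpy as np

def gradient_step_theta(Theta, X, Y, R, alpha, lam):
    """One gradient descent step on every user's theta (content-based model)."""
    err = (X @ Theta.T - Y) * R        # (n_m, n_u), zero where r(i,j) = 0
    grad = err.T @ X                   # row j holds sum_i err[i, j] * x^(i)
    grad[:, 1:] += lam * Theta[:, 1:]  # regularize every theta_k except k = 0
    return Theta - alpha * grad
```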

Collaborative filtering


Problem motivation

In the content-based recommender system above, we assumed that for every movie we already had its features and used them to train each user's parameters. Conversely, if we have the users' parameters, we can learn the movies' features.

Optimization algorithm

Given \(\theta^{(1)},\dots,\theta^{(n_u)}\), to learn \(x^{(i)}\):
\[ min_{x^{(i)}} \frac{1}{2} \sum_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2}\sum^n_{k=1}(x_k^{(i)})^2 \]
Given \(\theta^{(1)},\dots,\theta^{(n_u)}\), to learn \(x^{(1)},\dots,x^{(n_m)}\):
\[ min_{x^{(1)},\dots,x^{(n_m)}} \frac{1}{2}\sum^{n_m}_{i=1} \sum_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2}\sum^{n_m}_{i=1}\sum^n_{k=1}(x_k^{(i)})^2 \]

Collaborative filtering

Given \(x^{(1)},\dots,x^{(n_m)}\) (and movie ratings), can estimate \(\theta^{(1)},\dots,\theta^{(n_u)}\).

Given \(\theta^{(1)},\dots,\theta^{(n_u)}\), can estimate \(x^{(1)},\dots,x^{(n_m)}\).

So if we know \(\theta\) we can learn \(x\), and if we know \(x\) we can learn \(\theta\). By initializing the parameters randomly and then alternating between the two, the iterations eventually converge to a reasonable set of movie features and a reasonable set of user parameters.

Collaborative filtering algorithm

Collaborative filtering optimization objective

The two optimization objectives above can be combined into one.

Minimizing over \(x^{(1)},\dots,x^{(n_m)}\) and \(\theta^{(1)},\dots,\theta^{(n_u)}\) simultaneously:

\(min_{x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}}J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)})\), where the cost function is:
\[ J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}) = \frac{1}{2}\sum_{(i,j):r(i,j) = 1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2 \]
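A sketch of this cost in NumPy. Note that in the combined collaborative filtering objective the intercept feature \(x_0 = 1\) is conventionally dropped, so both \(X\) and \(\Theta\) are regularized in full; the names below are illustrative:

```python
import numpy as np

def cofi_cost(X, Theta, Y, R, lam):
    """Collaborative filtering cost J(x^(1),...,x^(n_m), theta^(1),...,theta^(n_u))."""
    err = (X @ Theta.T - Y) * R   # prediction errors, counted only where r(i,j) = 1
    return (0.5 * np.sum(err ** 2)
            + (lam / 2) * np.sum(X ** 2)
            + (lam / 2) * np.sum(Theta ** 2))
```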

Collaborative filtering algorithm

  1. Initialize \(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}\) to small random values.

  2. Minimize \(J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)})\) using gradient descent (or an advanced optimization algorithm), e.g. for every \(j = 1,\dots,n_u\), \(i = 1,\dots,n_m\):
    \[ x_k^{(i)} := x_k^{(i)} - \alpha\left(\sum_{j:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}\right)\theta_k^{(j)} + \lambda x_k^{(i)}\right) \\ \theta_k^{(j)} := \theta_k^{(j)} - \alpha\left(\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}\right)x_k^{(i)} + \lambda\theta_k^{(j)}\right) \]
  3. For a user with parameters \(\theta\) and a movie with (learned) features \(x\), predict a star rating of \(\theta^Tx\). This completes the training algorithm.
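Putting the three steps together, a compact and purely illustrative gradient-descent training loop might look like the following; it assumes the \(Y\) and \(R\) matrices from earlier, and the hyperparameters (learning rate, lambda, number of features, iterations) are assumptions rather than values from the notes:

```python
import numpy as np

def train_cofi(Y, R, n_features=2, alpha=0.01, lam=0.1, iters=5000, seed=0):
    """Illustrative gradient-descent version of the three steps above."""
    rng = np.random.default_rng(seed)
    n_m, n_u = Y.shape
    X = rng.normal(scale=0.1, size=(n_m, n_features))      # 1. small random initialization
    Theta = rng.normal(scale=0.1, size=(n_u, n_features))
    for _ in range(iters):                                  # 2. minimize J by gradient descent
        err = (X @ Theta.T - Y) * R                         # errors on rated entries only
        X_grad = err @ Theta + lam * X
        Theta_grad = err.T @ X + lam * Theta
        X, Theta = X - alpha * X_grad, Theta - alpha * Theta_grad
    return X, Theta                                         # 3. predict with X @ Theta.T
```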

Vectorization: Low rank matrix factorization


Having introduced the collaborative filtering algorithm, this section describes a vectorized implementation and mentions other things the algorithm lets us do, such as:

  1. Given one product, find other related products.
  2. When a user has recently looked at a product, recommend related products to them.

To do this, we first write the collaborative filtering predictions in a vectorized form.

Collaborative filtering

We have a data set of five movies; we take the ratings the users gave these movies and group them into a single matrix.

(Figure: Vectorization - Collaborative filtering)

With five movies and four users, the matrix \(Y\) of user rating data is a 5×4 matrix that contains all of these ratings:

(Figure: Vectorization - Collaborative filtering, the rating matrix \(Y\))

We write \(X = \left[ \begin{matrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(n_m)})^T \end{matrix} \right]\) and \(\Theta = \left[ \begin{matrix} (\theta^{(1)})^T \\ (\theta^{(2)})^T \\ \vdots \\ (\theta^{(n_u)})^T \end{matrix} \right]\); the matrix of predicted ratings is then \(\left[ \begin{matrix} (\theta^{(1)})^T(x^{(1)}) & (\theta^{(2)})^T(x^{(1)}) & \dots & (\theta^{(n_u)})^T(x^{(1)}) \\ (\theta^{(1)})^T(x^{(2)}) & (\theta^{(2)})^T(x^{(2)}) & \dots & (\theta^{(n_u)})^T(x^{(2)}) \\ \vdots & \vdots & \vdots & \vdots \\ (\theta^{(1)})^T(x^{(n_m)}) & (\theta^{(2)})^T(x^{(n_m)}) & \dots & (\theta^{(n_u)})^T(x^{(n_m)}) \end{matrix} \right]\).

In other words, the full matrix of predicted ratings is simply \(X\Theta^T\); this vectorized form of collaborative filtering is also called low rank matrix factorization.
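In code, the whole prediction matrix is a single matrix product (illustrative sketch):

```python
import numpy as np

def predict_ratings(X, Theta):
    """Entry (i, j) of the result is (theta^(j))^T x^(i), user j's predicted rating of movie i."""
    return X @ Theta.T   # shape (n_m, n_u): the low rank reconstruction of Y
```

Used together with the training sketch above, `predict_ratings(*train_cofi(Y, R))` fills in all the missing ratings at once.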

For each product \(i\), we learn a feature vector \(x^{(i)} \in R^n\).


How to find movies \(j\) related to movie \(i\)?


5 most similar movies to movie \(i\): Find the 5 movies \(j\) with the smallest \(||x^{(i)} - x^{(j)}||\).

By comparing the learned feature vectors of two products we can tell how similar they are: find the 5 products \(j\) that minimize \(||x^{(i)} - x^{(j)}||\); these are the 5 products most similar to product \(i\) and can be recommended as related products.
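An illustrative sketch of this nearest-neighbour search over the learned movie features (function name is an assumption):

```python
import numpy as np

def most_similar(X, i, k=5):
    """Return the indices of the k movies whose learned features are closest to movie i."""
    dists = np.linalg.norm(X - X[i], axis=1)   # ||x^(i) - x^(j)|| for every movie j
    dists[i] = np.inf                          # exclude movie i itself
    return np.argsort(dists)[:k]
```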

Implementational detail: Mean normalization

Users who have not rated any movies

Consider the data set below, in which one user, Eve, has not rated any movie. If we run the minimization as before over everyone's ratings, then because Eve has rated nothing, the condition \(r(i,j) = 1\) is never satisfied for her, so the squared-error part of the objective plays no role in determining her parameters; the only term involving Eve is \(min_{\theta^{(5)}} \frac{\lambda}{2}[(\theta_1^{(5)})^2 + (\theta_2^{(5)})^2]\). Minimizing this makes both components \(0\), so \((\theta^{(5)})^Tx^{(i)} = 0\) for every movie \(i\), and all of Eve's predicted ratings are \(0\).

(Figure: Mean normalization - example)

Although this is what the optimization produces, it is not useful for making recommendations: every one of Eve's predicted ratings is \(0\), so there is no way to rank the movies for her.

Mean Normalization

For each movie, take the ratings it has already received (ignoring the ?s) and compute their average, denoted \(\mu\). Then subtract the corresponding row (movie) average from every entry of the original matrix \(Y\) to obtain a new \(Y\), as shown on the right below, and learn \(\theta\) and \(x\) from this new \(Y\). For Eve the earlier analysis still holds, i.e. the only term involving her is \(min_{\theta^{(5)}} \frac{\lambda}{2}[(\theta_1^{(5)})^2 + (\theta_2^{(5)})^2]\), so again \(\theta^{(5)} = 0\).

(Figure: Mean normalization)

After normalization, the prediction becomes \((\theta^{(j)})^Tx^{(i)} + \mu_i\). For Eve, \((\theta^{(5)})^Tx^{(i)} + \mu_i = \mu_i\), so her predicted ratings are simply the movie means \(\mu = \left[ \begin{matrix} 2.5 \\ 2.5 \\ 2 \\ 2.25 \\ 1.25 \end{matrix} \right]\).

In fact this prediction is acceptable: since we know nothing about Eve's preferences, predicting her rating of each movie as that movie's average rating is reasonable.

Special case: if instead there is a movie that nobody has rated, we can analogously normalize the columns, i.e. compute the mean of each column and subtract the corresponding column mean from \(Y\) to obtain the new \(Y\) matrix.
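An illustrative sketch of per-movie mean normalization and the corresponding prediction rule (the function name is an assumption, and \(Y\), \(R\) are the matrices from earlier):

```python
import numpy as np

def mean_normalize(Y, R):
    """Subtract each movie's average rating (computed over rated entries only) from its row."""
    mu = (Y * R).sum(axis=1) / np.maximum(R.sum(axis=1), 1)  # per-movie mean; guard against /0
    Y_norm = (Y - mu[:, None]) * R                            # only rated entries are normalized
    return Y_norm, mu

# Train on Y_norm instead of Y, then predict (theta^(j))^T x^(i) + mu[i].
# For a user with no ratings (like Eve) theta = 0, so the prediction falls back to mu.
```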
