Recommendation System (2) Collaborative Filtering

In the previous article, "Recommendation System (1) Overview", the author introduced two common methods for generating the candidate item pool: content-based filtering and collaborative filtering. Content-based filtering is simple, but its limitations are equally obvious. Collaborative filtering, in contrast, exploits similarities between Users and Items to make recommendations.

Table of contents

1. Movie recommendation example

1.1 One-dimensional embedding

1.2 Two-dimensional embedding

1.3 Revisiting embeddings

2. Matrix factorization

3. Select the objective function

4. Minimize the objective function

4.1 SGD

4.2 WALS

5. Advantages and disadvantages of collaborative filtering

5.1 Advantages

5.2 Disadvantages

5.2.1 Unable to process newly uploaded Items

5.2.2 Difficulty adding auxiliary features

6. References


1. Movie recommendation example

Consider a movie recommendation system whose training data consists of a feedback matrix in which:

  • Each row represents a user.
  • Each column represents an Item (a movie).

Feedback about the movie falls into one of two categories:

  • Explicit – Users specify how much they like a particular movie by providing a numerical rating.
  • Implicit - If a user watches a movie, the system infers that the user is interested.

For simplicity, we assume that the feedback matrix is binary; that is, a value of 1 indicates interest in the movie.

When a user visits the homepage, the system should recommend movies based on:

  • Similarity to movies the user has liked in the past
  • Movies that similar users have liked

For ease of illustration, let's hand-design some features for the movies described in the table below:

Movie | Rating | Description
The Dark Knight Rises | PG-13 | In this sequel to The Dark Knight, set in the DC Comics universe, Batman works to save Gotham City from nuclear annihilation.
Harry Potter and the Philosopher's Stone | PG | An orphan discovers he is a wizard and enrolls at Hogwarts School of Witchcraft and Wizardry, where he fights his first battle against the evil Voldemort.
Shrek | PG | An adorable ogre and his donkey sidekick set out on a mission to rescue Princess Fiona, who is imprisoned in a castle by a dragon.
The Triplets of Belleville | PG-13 | When a champion professional cyclist is kidnapped during the Tour de France, his grandmother and her overweight dog, with the help of three elderly jazz singers, travel overseas to rescue him.
Memento | R | An amnesiac desperately tries to solve his wife's murder by tattooing clues onto his body.

1.1 One-dimensional embedding

As shown in Figure 1, suppose we assign each movie a scalar in [−1, 1] describing whether the movie is suitable for children (negative values) or adults (positive values). Suppose we also assign each user a scalar in [−1, 1] describing the user's interest in children's movies (close to −1) or adult movies (close to +1). For movies that we expect a user to like, the product of the movie embedding and the user embedding should be higher (closer to 1).

Figure 1 Rating movies based on a single feature (adult or child)

In Figure 2, checkmarks (✅) mark the movies watched by a specific user. The preferences of the third and fourth users are well explained by this single feature, since the movies they watch are clearly "polarized": the third user prefers children's movies and the fourth user prefers adult movies. However, this single feature cannot fully explain the preferences of the first and second users, whose tastes span both children's and adult films.

Figure 2 One-dimensional embedding diagram

1.2 Two-dimensional embedding

The single child-versus-adult characteristic above is clearly not enough to describe the preferences of all users. To solve this problem, let's add a second characteristic: the degree to which each film is a blockbuster or an art-house film. With this second feature, we can now represent each movie using the following two-dimensional embedding.

Figure 3 describes movies based on two characteristics

We again place the users in the same embedding space so as to best explain the feedback matrix: for each (User, Item) pair, we want the dot product of the User embedding and the Item embedding to be close to 1 when the user has watched the movie, and close to 0 otherwise.

Figure 4 Two-dimensional embedding diagram
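
To make the dot-product scoring concrete, here is a minimal sketch with hand-made, hypothetical 2D embeddings (the numbers are illustrative, not taken from the figures):

```python
import numpy as np

# Hypothetical 2D embeddings: [child (-1) vs. adult (+1), art-house (-1) vs. blockbuster (+1)]
movies = {
    "Shrek":                 np.array([-1.0,  0.9]),
    "The Dark Knight Rises": np.array([ 0.9,  1.0]),
    "Memento":               np.array([ 0.8, -0.9]),
}
user = np.array([0.9, 0.2])   # leans toward adult movies, mild blockbuster preference

for title, movie_emb in movies.items():
    score = float(np.dot(user, movie_emb))   # higher dot product => stronger predicted interest
    print(f"{title:25s} score = {score:+.2f}")
```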

Note: In the above example, we represent Items and Users in the same embedding space, which may seem a bit strange - after all, Users and Items are two different kinds of entities. However, we can think of the embedding space as a shared abstract representation of Items and Users. One benefit is obvious: with both in the same space, we can use a single similarity measure to evaluate how well a User and an Item match.

In this example, we designed the embedding by hand. In practice, embeddings can be learned automatically, which is the power of collaborative filtering models. In the next two sections, we discuss different models for learning these embeddings and how to train them.

The collaborative nature of this approach is evident when the model learns embeddings. Assume that the embedding vector of the movie is fixed. The model can then learn an embedding vector for the user that best explains their preferences. Therefore, the embeddings of users with similar preferences will be close. Similarly, if the user's embeddings are fixed, then we can learn movie embeddings to best explain the feedback matrix. Therefore, embeddings of movies liked by similar users will be close in the embedding space.

1.3 Revisiting embeddings

Embedding has well-established definitions in both academia and industry; here, the author offers his own understanding. Users and Items are complex and diverse, each with its own characteristics, and it is hard to evaluate those characteristics directly. We therefore abstract these complicated features into a common space in order to "unify the world view". In this unified space, the originally complicated features become relatively simple numeric vectors.

On this understanding, an embedding is essentially a kind of abstraction. Through abstraction we discard the rough and keep the fine, discard the false and keep the true, so as to capture the essence of Users and Items. Through abstraction, we project Users and Items into another, much leaner space where they can be expressed in mathematical form.


2. Matrix factorization

Matrix factorization is a simple embedding model. Given a feedback matrix A \in \mathbb R^{m \times n}, where m is the number of Users (or Queries) and n is the number of Items, the model learns:

  • User embedding matrix U \in \mathbb R^{m \times d}, where row i is the embedding of User i.
  • Item embedding matrix V \in \mathbb R^{n \times d}, where row j is the embedding of Item j.
  • d is the embedding dimension.

The embeddings are learned such that the product U V^T is a good approximation of the feedback matrix A. Note that the (i, j) entry of U V^T is simply the dot product \langle U_i, V_j \rangle of the embedding of User i and the embedding of Item j. Ideally, this dot product should be as close as possible to the corresponding entry A_{i, j} of the feedback matrix - the closer it is, the more accurate the prediction.

Note: Matrix factorization usually provides a much more compact representation than learning the full matrix. The full matrix has O(nm) entries, whereas the embedding matrices U and V together have O((n+m)d) entries, where the embedding dimension d is usually much smaller than m and n. Matrix factorization therefore uncovers latent structure in the data, under the assumption that the observations lie close to a low-dimensional subspace. In the example above, the values of n, m, and d are so small that the advantage is negligible; in real-world recommender systems, however, matrix factorization can be far more compact than learning the full matrix.
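
As a rough numerical illustration of this compactness argument (using made-up sizes, not figures from any real system), the following sketch compares the entry counts and shows that a single prediction is just one dot product:

```python
import numpy as np

# Hypothetical sizes; only the entry counts are compared, nothing this large is allocated.
m, n, d = 1_000_000, 500_000, 64
print("full matrix entries:       ", m * n)        # O(nm)
print("embedding matrices entries:", (m + n) * d)  # O((n+m)d)

# A tiny factorization showing that one prediction is a single dot product.
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))    # 5 users, embedding dimension d = 3
V = rng.normal(size=(4, 3))    # 4 items
A_hat = U @ V.T                # approximation of the 5x4 feedback matrix
pred = U[2] @ V[1]             # predicted score for (User 2, Item 1)
```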


3. Select the objective function

An intuitive objective function is the squared distance: we minimize the sum of squared errors over all pairs of observed entries, as follows:

\min_{U \in \mathbb R^{m \times d},\ V \in \mathbb R^{n \times d}} \sum_{(i, j) \in \text{obs}} (A_{ij} - \langle U_{i}, V_{j} \rangle)^2.

In this objective function, we sum only over the observed pairs (i, j), i.e. over the non-zero values in the feedback matrix. However, summing only over the observed values is not a good idea - a matrix of all ones would achieve minimal loss yet produce a model that cannot make useful recommendations and generalizes poorly (the observed entries may be very sparse; for example, out of 100 movies we might only observe that a user watched one of them).
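
To make the objective concrete, here is a minimal sketch on a tiny, made-up feedback matrix, where zero entries are treated as unobserved and simply excluded from the sum:

```python
import numpy as np

A = np.array([[1, 0, 1],
              [0, 1, 0]], dtype=float)            # toy feedback matrix; 0 = unobserved

rng = np.random.default_rng(0)
d = 2
U = rng.normal(scale=0.1, size=(A.shape[0], d))   # user embeddings
V = rng.normal(scale=0.1, size=(A.shape[1], d))   # item embeddings

obs = A > 0                                       # mask of observed (i, j) pairs
loss = np.sum((A[obs] - (U @ V.T)[obs]) ** 2)     # squared error over observed entries only
print(loss)
```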

Perhaps we could instead treat the unobserved values as zero and sum over all entries in the matrix. This corresponds to minimizing the squared Frobenius distance between A and its approximation U V^T:

\min_{U \in \mathbb R^{m \times d},\ V \in \mathbb R^{n \times d}} \|A - U V^T\|_F^2.

This quadratic problem can be solved via the singular value decomposition (SVD) of the matrix. However, SVD is not a good solution either, because in real applications the matrix A can be extremely sparse - think of all the videos on YouTube versus the videos a specific user has actually watched. The solution U V^T (the model's approximation of the input matrix) will then likely be close to zero, resulting in poor generalization performance.
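
For reference, a sketch of this SVD route on a toy matrix (unobserved entries filled with zeros) looks as follows; keeping only the top d singular values gives the rank-d approximation U V^T discussed above:

```python
import numpy as np

A = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 1]], dtype=float)     # unobserved entries filled with zeros

d = 2
U_full, s, Vt = np.linalg.svd(A, full_matrices=False)
U = U_full[:, :d] * np.sqrt(s[:d])            # user embeddings from the top-d singular vectors
V = Vt[:d, :].T * np.sqrt(s[:d])              # item embeddings
A_hat = U @ V.T                               # best rank-d approximation of A
print(np.round(A_hat, 2))
```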

In contrast, weighted matrix factorization decomposes the objective into the following two sums:

  • The sum of pairs of observed entries - equivalent to the sum of squares of (observed - predicted).
  • The sum of pairs of unobserved entries (treated as zero) - equivalent to the sum of squares of the predicted values.
\min_{U \in \mathbb R^{m \times d},\ V \in \mathbb R^{n \times d}} \sum_{(i, j) \in \text{obs}} (A_{ij} - \langle U_{i}, V_{j} \rangle)^2 + w_0 \sum_{(i, j) \not \in \text{obs}} (\langle U_i, V_j\rangle)^2.

where w_{0} is a hyperparameter that weights the two terms so that the objective is not dominated by either one. Tuning this hyperparameter is very important.

Note: In practical applications, careful weighting is required. For example, high-frequency Items (such as extremely popular videos) or high-frequency Users (such as heavy users) may cause the first term to dominate the objective function. We can correct for this effect by weighting the training examples to account for item frequency. In other words, the objective function can be replaced by:

\sum_{(i, j) \in \text{obs}} w_{i, j} (A_{i, j} - \langle U_i, V_j \rangle)^2 + w_0 \sum_{(i, j) \not\in \text{obs}} \langle U_i, V_j \rangle^2

where w_{i, j} is a function of the frequency of User i and Item j.
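
A minimal sketch of this weighted objective on toy data is shown below; the particular choice of w_{i, j} (down-weighting by item frequency) is only an illustrative assumption, not a prescribed formula:

```python
import numpy as np

A = np.array([[1, 0, 1],
              [0, 1, 0]], dtype=float)             # toy feedback matrix; 0 = unobserved

rng = np.random.default_rng(0)
d = 2
U = rng.normal(scale=0.1, size=(A.shape[0], d))
V = rng.normal(scale=0.1, size=(A.shape[1], d))

obs = (A > 0)
pred = U @ V.T
w0 = 0.1                                           # hyperparameter weighting unobserved entries

item_freq = obs.sum(axis=0)                        # how many users interacted with each item
w_ij = 1.0 / np.maximum(item_freq, 1)              # illustrative per-entry weight (by item frequency)

loss = np.sum(w_ij * obs * (A - pred) ** 2) + w0 * np.sum((~obs) * pred ** 2)
print(loss)
```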


4. Minimize the objective function

Common algorithms for minimizing the objective function include:

  • Stochastic gradient descent (SGD), a generic method for minimizing loss functions.
  • Weighted Alternating Least Squares (WALS), which is specialized for this particular objective.

The objective is quadratic in each of the two matrices U and V. (Note, however, that the problem is not jointly convex.) WALS works by randomly initializing the embeddings and then alternating between:

  • Fixing U and solving for V
  • Fixing V and solving for U

Each stage can be solved exactly (by solving a linear system) and can be distributed. This technique guarantees convergence because each step is guaranteed to reduce the loss.
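
The following is a minimal WALS-style sketch on a toy matrix. For brevity it uses the unweighted Frobenius objective (every entry, including zeros, is fit, i.e. w_0 = 1), so each alternating step is an exact least-squares solve:

```python
import numpy as np

A = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 0, 0]], dtype=float)          # toy feedback matrix

rng = np.random.default_rng(0)
d = 2
U = rng.normal(scale=0.1, size=(A.shape[0], d))
V = rng.normal(scale=0.1, size=(A.shape[1], d))

for step in range(20):
    # Fix V, solve for U:  min_U ||A - U V^T||_F^2  <=>  least squares  V @ U^T ~= A^T
    U = np.linalg.lstsq(V, A.T, rcond=None)[0].T
    # Fix U, solve for V:  min_V ||A - U V^T||_F^2  <=>  least squares  U @ V^T ~= A
    V = np.linalg.lstsq(U, A, rcond=None)[0].T

print(np.sum((A - U @ V.T) ** 2))                  # loss decreases at every alternating step
```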

SGD and WALS each have their own advantages and disadvantages. Check out the information below to see how they compare:

4.1 SGD

Advantages:

  • Very flexible - other loss functions can be used.
  • Can be parallelized.

Disadvantages:

  • Slower - does not converge as quickly.
  • Harder to handle unobserved entries.
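
For comparison, a bare-bones SGD loop over the observed entries only (toy data, plain squared loss, no special handling of unobserved entries) might look like this:

```python
import numpy as np

A = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 0, 0]], dtype=float)          # toy feedback matrix; 0 = unobserved

rng = np.random.default_rng(0)
d, lr = 2, 0.1                                     # embedding dimension, learning rate
U = rng.normal(scale=0.1, size=(A.shape[0], d))
V = rng.normal(scale=0.1, size=(A.shape[1], d))
observed = list(zip(*np.nonzero(A)))               # list of observed (i, j) pairs

for epoch in range(100):
    for idx in rng.permutation(len(observed)):     # visit observed entries in random order
        i, j = observed[idx]
        u_i, v_j = U[i].copy(), V[j].copy()
        err = A[i, j] - u_i @ v_j                  # residual for this single entry
        U[i] += lr * err * v_j                     # gradient step on the user embedding
        V[j] += lr * err * u_i                     # gradient step on the item embedding

print(np.sum((A[A > 0] - (U @ V.T)[A > 0]) ** 2))  # squared error over observed entries
```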

4.2 WALS

Advantages:

  • Can be parallelized.
  • Converges faster than SGD.
  • Easier to handle unobserved entries.

Disadvantages:

  • Relies on squared loss only.


5. Advantages and disadvantages of collaborative filtering

5.1 Advantages

1- No domain knowledge required: We don’t need domain knowledge because the embeddings are learned automatically.

2-Discover interests: This model can help users discover new interests. In isolation, a machine learning system may not know that a user is interested in a given item, but the model may still recommend it because similar users are interested in that item.

3-Easy to get started: To some extent, the system only needs the feedback matrix to train the matrix factorization model. In particular, the system does not require contextual features. In practice, this can be used as one of several candidate generators.

5.2 Disadvantages

5.2.1 Unable to process newly uploaded Items

The model's prediction for a given (User, Item) pair is the dot product of the corresponding embeddings. Therefore, if an Item was not seen during training, the system cannot create an embedding for it or query the model with it. This problem is often called the cold start problem. However, the following techniques can alleviate the cold start problem to some extent:

  • Projection in WALS. Given a new item i_{0} not seen during training, if the system has a few interactions between this item and users, it can easily compute an embedding v_{i_0} for the item without retraining the entire model. The system only needs to solve the following equation, or its weighted version:

    \min_{v_{i_0} \in \mathbb R^d} \|A_{i_0} - U v_{i_0}\|

    The preceding equation corresponds to one iteration of WALS: the user embeddings are kept fixed and the system solves for the embedding of item i_{0}. The same can be done for a new user; see the sketch after this list.

  • Heuristics to generate embeddings for new Items. If the system has no interactions for the Item at all, it can approximate the Item's embedding by averaging the embeddings of Items in the same category, from the same uploader (for example, on YouTube), and so on.
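
As a sketch of the projection approach referenced above (toy numbers; the trained user embedding matrix U is assumed given and kept fixed), the new Item's embedding comes from a single least-squares solve:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
U = rng.normal(size=(5, d))                 # trained user embeddings, kept fixed
a_new = np.array([1., 0., 0., 1., 0.])      # feedback column A_{i0} for the new item

# Solve  min_v ||a_new - U v||  (one WALS-style step; nothing else is retrained).
v_new, *_ = np.linalg.lstsq(U, a_new, rcond=None)
print(v_new)                                # embedding v_{i0} for the new item
```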

5.2.2 Difficulty adding auxiliary features

Auxiliary features are any features other than the User (Query) ID or the Item ID. For movie recommendations, auxiliary features might include country or age. Including available auxiliary features can improve the quality of the model. Although adding auxiliary features to WALS may not be easy, a generalization of WALS makes it possible.

To generalize WALS, augment the input matrix with the features by defining a block matrix \bar A, where:

  • Block(0, 0) is the original feedback matrix A.
  • Block(0, 1) is the multi-hot encoding of User features.
  • Block(1, 0) is the multi-hot encoding of Item features.

Block (1, 1) is usually left blank. If matrix factorization is applied to \bar A, then in addition to the User and Item embeddings, the system also learns embeddings for the auxiliary features.
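
A small sketch of assembling such a block matrix is shown below; the feature encodings are made up for illustration, and block (1, 1) is filled with zeros to stand in for "left blank":

```python
import numpy as np

A = np.array([[1, 0, 1],
              [0, 1, 0]], dtype=float)          # original feedback matrix (2 users x 3 items)

user_feats = np.array([[1, 0],                  # multi-hot User features (2 users x 2 features)
                       [0, 1]], dtype=float)
item_feats = np.array([[1, 0, 0],               # multi-hot Item features (2 features x 3 items)
                       [0, 1, 1]], dtype=float)

A_bar = np.block([
    [A,          user_feats],                   # block (0,0) and block (0,1)
    [item_feats, np.zeros((2, 2))],             # block (1,0) and block (1,1), left "blank"
])
print(A_bar.shape)                              # (4, 5); factorizing A_bar also yields feature embeddings
```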

6. References

Link: https://developers.google.cn/machine-learning/recommendation/collaborative/basics
