Recommendation System (4): Deep Neural Networks (DNN)

In the article "Recommendation System (2): Collaborative Filtering", the author introduced how to use matrix factorization to learn embeddings. Matrix factorization has several limitations:

  • Basic matrix factorization uses only two dimensions of information, the UserID (QueryID) and the ItemID, and everything it learns is contained in the User embedding and the Item embedding, so interpretability is poor. It is also difficult to incorporate additional useful features during training, such as a user's demographic information (income level, education, age, life stage, etc.) or an item's basic attributes (category, brand, etc.). As a result, the generalization ability of basic matrix factorization is limited.
  • Factorization machines (FM) can be seen as a generalization of basic matrix factorization: they can incorporate features from many more dimensions, giving the learned model stronger generalization ability. For details, see the article "The Evolution and Comparison of Mainstream CTR Models".
  • Matrix factorization is difficult to update incrementally online, so it cannot handle users' real-time behavioral feedback and can only be computed from historical behavior. A recommendation system without real-time processing capability will not be a good recommendation system. For example, during JD's Double 11 event, users' behavior patterns on that day are very different from their usual patterns. If the recommendation system cannot process users' real-time behavior that day and capture their new preferences in time, the recommendation quality will drop sharply.

Deep neural network (DNN) models can address these limitations of matrix factorization. Because the network's input layer is flexible, a DNN can easily incorporate User features and Item features, which helps capture a user's specific interests and improves the relevance of recommendations.

Table of contents

1. Softmax DNN for recommendation

1.1 Model input

1.2 Model architecture

1.3 Softmax output: predicted probability distribution

1.4 Loss function

1.5 Softmax embedding

2. DNN and matrix factorization

2.1 Can I use the Item attribute?

3. Softmax training

3.1 Training data

3.2 Negative sampling

4. Matrix factorization and softmax

5. References


1. Softmax DNN for recommendation

One possible DNN model is softmax, which treats the problem as a multi-class prediction problem, where:

  • The input is a user query.
  • The output is a probability vector whose size equals the number of Items in the corpus, representing the probability of interacting with each Item; for example, the probability of clicking or watching a YouTube video.

1.1 Model input

Inputs to DNN can include:

  • Dense features (e.g. viewing time and time since last viewing)
  • Sparse features (e.g. viewing history and country)

Unlike matrix factorization methods, a DNN can incorporate auxiliary features such as age and country/region. We use x to denote the input vector.

Figure 1. Input layer x
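As an illustration only (not part of the original article), the following PyTorch sketch shows one way the input vector x could be assembled: sparse features such as country and watch history are mapped to embeddings, while dense features are used directly. All feature names, vocabulary sizes, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class InputLayer(nn.Module):
    """Builds the input vector x from dense and sparse features (illustrative)."""
    def __init__(self, n_countries=200, n_videos=100_000, id_dim=16):
        super().__init__()
        # Sparse (categorical) features are mapped to dense embeddings.
        self.country_emb = nn.Embedding(n_countries, id_dim)
        self.video_emb = nn.Embedding(n_videos, id_dim)  # for the watch history

    def forward(self, country_id, watch_history, dense_feats):
        # watch_history: (batch, history_len) video ids -> average of their embeddings
        history_vec = self.video_emb(watch_history).mean(dim=1)
        country_vec = self.country_emb(country_id)
        # Dense features (e.g. watch time, time since last watch) are concatenated as-is.
        return torch.cat([dense_feats, country_vec, history_vec], dim=-1)
```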

1.2 Model architecture

Model architecture determines the complexity and expressiveness of the model. By adding hidden layers and nonlinear activation functions (such as ReLU), models can capture more complex relationships in the data. However, increasing the number of parameters also generally makes the model more difficult to train and more expensive to serve. We denote the output of the last hidden layer as \psi(x) \in \mathbb{R}^d.

Figure 2. Output of the last hidden layer, \psi(x).
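A minimal sketch of such a stack of hidden layers with ReLU activations producing \psi(x) \in \mathbb R^d; the layer widths and the output dimension d are illustrative assumptions, not values from the article:

```python
import torch.nn as nn

def make_query_tower(input_dim: int, d: int = 32) -> nn.Sequential:
    """Hidden layers with ReLU nonlinearities; the final output is psi(x) in R^d."""
    return nn.Sequential(
        nn.Linear(input_dim, 256),
        nn.ReLU(),
        nn.Linear(256, 64),
        nn.ReLU(),
        nn.Linear(64, d),  # output of the last hidden layer: psi(x)
    )
```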

1.3 Softmax output: predicted probability distribution

The model maps the output of the last hidden layer \psi(x) through the softmax layer to obtain the probability distribution \hat p = h(\psi(x) V^T), where:

  • h : \mathbb R^n \to \mathbb R^n is the softmax function, given by h(y)_i=\frac{e^{y_i}}{\sum_j e^{y_j}}
  • V \in \mathbb R^{n \times d} is the weight matrix of the softmax layer.

Softmax layers map vectors of scores y \in \mathbb R^n (sometimes called logits) to probability distributions.

Figure 3. Predicted probability distribution, \hat p = h(\psi(x) V^T)
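In code, the softmax layer can be sketched as a bias-free linear layer whose weight matrix plays the role of V, followed by a softmax; the corpus size n and dimension d below are assumptions:

```python
import torch
import torch.nn as nn

n_items, d = 10_000, 32                              # assumed corpus size and embedding dim
softmax_layer = nn.Linear(d, n_items, bias=False)    # weight matrix V in R^{n x d}

def predict_distribution(psi_x: torch.Tensor) -> torch.Tensor:
    logits = softmax_layer(psi_x)           # psi(x) V^T, shape (batch, n)
    return torch.softmax(logits, dim=-1)    # h(psi(x) V^T): a probability vector
```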

PS:

The name softmax is a play on words. A "hard" max assigns probability 1 to the Item with the highest score y_i. In contrast, softmax assigns a non-zero probability to every Item, giving higher probability to Items with higher scores. When the scores are scaled, softmax h(\alpha y) converges to a "hard" max in the limit \alpha \to \infty.
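A small numerical illustration of this limit (the scores are made up for the example):

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())  # subtract the max for numerical stability
    return e / e.sum()

y = np.array([2.0, 1.0, 0.5])
for alpha in (1, 10, 100):
    print(alpha, softmax(alpha * y).round(3))
# As alpha grows, the distribution concentrates on the highest-scoring Item,
# approaching the "hard" max [1, 0, 0].
```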

1.4 Loss function

Finally, define a loss function to compare the following:

  • \hat p, the output of the softmax layer (probability distribution)
  • p, the actual observations, representing the Items the User actually interacted with (for example, YouTube videos the user clicked or watched), expressed as a normalized multi-hot distribution (a probability vector).

For example, you can use cross-entropy loss to compare two probability distributions.

Figure 4. Loss function
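For concreteness, here is a sketch of the cross-entropy between the normalized multi-hot target p and the prediction \hat p; the tensor shapes and the tiny corpus of n = 5 Items are illustrative assumptions:

```python
import torch

def cross_entropy(p: torch.Tensor, p_hat: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between two probability vectors, averaged over the batch."""
    return -(p * torch.log(p_hat + 1e-12)).sum(dim=-1).mean()

# Example: the user interacted with Items 1 and 3 out of n = 5 Items.
p = torch.tensor([[0.0, 0.5, 0.0, 0.5, 0.0]])      # normalized multi-hot target
p_hat = torch.softmax(torch.randn(1, 5), dim=-1)   # model's predicted distribution
loss = cross_entropy(p, p_hat)
```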

1.5 Softmax embedding

The probability of Item j is \hat p_j = \frac{\exp(\langle \psi(x), V_j\rangle)}{Z}, where Z is a normalization constant that does not depend on j.

In other words, \log(\hat p_j) = \langle \psi(x), V_j\rangle - \log(Z), so the log probability of Item j is (up to an additive constant) the dot product of two d-dimensional vectors. These two vectors can be understood as the embeddings of the User (Query) and the Item:

  • \psi(x) \in \mathbb R^d is the output of the last hidden layer. We call this the embedding of Query x.
  • V_j \in \mathbb R^d is the vector of weights connecting the last hidden layer to output j. We call this the embedding of Item j.

Note: Since \log is an increasing function, the Items j with the highest probability \hat p_j are those with the highest dot product \langle \psi(x), V_j\rangle. Therefore, the dot product can be interpreted as a similarity measure in this embedding space.

Figure 5. Embedding of Item V_j \in \mathbb R^d
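Because the log probability is a dot product up to a constant, candidates can be ranked at serving time by the dot product alone, as in this sketch (psi_x and the Item matrix V are assumed to have been computed already):

```python
import torch

def top_k_items(psi_x: torch.Tensor, V: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Return the indices of the k Items with the largest <psi(x), V_j>."""
    scores = V @ psi_x                # (n, d) @ (d,) -> (n,) dot products
    return torch.topk(scores, k).indices
```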


2. DNN and matrix factorization

In both the softmax model and the matrix factorization model, the system learns one embedding vector V_j for each Item j, forming the Item embedding matrix V \in \mathbb R^{n \times d}. What was the Item embedding matrix in matrix factorization is now the weight matrix of the softmax layer.

However, the User (Query) embeddings are different. Instead of learning one embedding U_i per User i, the system learns a mapping from the User's features x to an embedding \psi(x) \in \mathbb R^d. Therefore, this DNN model can be viewed as a generalization of matrix factorization in which the User (Query) side is replaced by a nonlinear function \psi(\cdot).

2.1 Can I use the Item attribute?

Can the same idea be applied to Items? That is, instead of learning one embedding per Item, can the model learn a nonlinear function that maps Item features to an embedding? Yes. To do this, use a two-tower neural network, which consists of two neural networks:

  • A neural network maps Query features (essentially User features)  x_{\text{query}} to embeddings \psi(x_{\text{query}}) \in \mathbb R^d
  • A neural network maps Item features  x_{\text{item}} to embeddings \phi(x_{\text{item}}) \in \mathbb R^d

The output of the model can be defined as the dot product \langle \psi(x_{\text{query}}), \phi(x_{\text{item}}) \rangle. Note that this is no longer a softmax model: the new model predicts one value per pair (x_{\text{query}}, x_{\text{item}}) instead of a probability vector for each Query x_{\text{query}}.
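A minimal two-tower sketch under these definitions; the layer widths, input dimensions, and d are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    """One tower for Query features, one for Item features, scored by a dot product."""
    def __init__(self, query_dim: int, item_dim: int, d: int = 32):
        super().__init__()
        self.query_tower = nn.Sequential(
            nn.Linear(query_dim, 128), nn.ReLU(), nn.Linear(128, d))
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 128), nn.ReLU(), nn.Linear(128, d))

    def forward(self, x_query: torch.Tensor, x_item: torch.Tensor) -> torch.Tensor:
        psi = self.query_tower(x_query)   # psi(x_query) in R^d
        phi = self.item_tower(x_item)     # phi(x_item) in R^d
        return (psi * phi).sum(dim=-1)    # dot product score for each pair
```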


3. Softmax training

The previous sections explained how to incorporate a softmax layer into a deep neural network for recommendation. This section looks at the training data for such a system in more detail.

3.1 Training data

The softmax training data consists of the Query (User) features x and a vector of the Items the user interacted with (represented as a probability distribution p, marked in blue in the figure below). The variables of the model are the weights of the different layers, marked in orange in the figure below. The model is typically trained using stochastic gradient descent or one of its variants.
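A sketch of a single stochastic-gradient-descent step on this kind of training data; the modules, dimensions, and the random batch are assumptions used only for illustration:

```python
import torch
import torch.nn as nn

input_dim, d, n_items = 64, 32, 10_000
tower = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, d))
softmax_layer = nn.Linear(d, n_items, bias=False)   # weight matrix V
opt = torch.optim.SGD(
    list(tower.parameters()) + list(softmax_layer.parameters()), lr=0.1)

x = torch.randn(8, input_dim)                        # a fake batch of Query features
p = torch.zeros(8, n_items)                          # target distribution p
p[torch.arange(8), torch.randint(0, n_items, (8,))] = 1.0  # one interacted Item each

logits = softmax_layer(tower(x))                     # psi(x) V^T
loss = -(p * torch.log_softmax(logits, dim=-1)).sum(dim=-1).mean()  # cross-entropy
opt.zero_grad()
loss.backward()
opt.step()
```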

3.2 Negative sampling

Since the loss function compares two probability vectors p, \hat p(x) \in \mathbb R^n (the actual observations and the model's prediction, respectively), computing the gradient of the loss for a single Query x can be prohibitively expensive if the corpus size n is very large.

We could set up the system to compute gradients only for the positive Items (i.e. the Items that are active in the actual observation vector). However, if the system is trained only on positive pairs, the model may suffer from folding, as explained below.

Figure 6. Illustration of folding

In the figure above, assume that each color represents a different category of Queries and Items. Each Query (shown as a square) interacts only with Items (shown as circles) of the same color. For example, think of each category as a different language on YouTube: typically, a user interacts mainly with videos in one language (their native language).

The model can learn to place the Query/Item embeddings of a given color correctly relative to each other (capturing similarity within that color), but embeddings of different colors may end up in the same region of the embedding space by chance. This phenomenon is called folding and can lead to spurious recommendations: at query time, the model may incorrectly predict high scores for Items from a different group.

Negative examples are Items labeled "irrelevant" to a given Query. Showing the model negative examples during training teaches it that embeddings of different groups should be pushed away from each other.

Instead of using all Items to compute the gradient (which can be too expensive) or using only the positive Items (which makes the model prone to folding), we can use negative sampling (a minimal sketch follows the sampling strategies below). More precisely, an approximate gradient can be computed using the following Items:

  • All positive Items (the Items that appear in the target label, i.e. the actually observed Items)
  • A sample of negative Items (j in \{1, \dots, n\}), rather than all of them

There are different strategies for sampling negative examples:

  • Uniform sampling
  • Give higher probability to Items j with a higher score \psi(x) \cdot V_j. Intuitively, these are the examples that contribute the most to the gradient; such examples are often called "hard negatives".
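As referenced above, here is a rough sketch of the negative-sampling idea (a simplified sampled softmax with uniform negatives): for each example, the loss is computed over the positive Item plus k sampled Items instead of the full corpus. Variable names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(psi_x: torch.Tensor,   # (batch, d) Query embeddings
                         V: torch.Tensor,       # (n, d) Item embedding matrix
                         pos: torch.Tensor,     # (batch,) indices of positive Items
                         k: int = 100) -> torch.Tensor:
    n = V.shape[0]
    # Uniformly sampled negatives (collisions with the positive are ignored in this sketch).
    neg = torch.randint(0, n, (psi_x.shape[0], k))
    idx = torch.cat([pos.unsqueeze(1), neg], dim=1)        # (batch, 1 + k)
    logits = (psi_x.unsqueeze(1) * V[idx]).sum(dim=-1)     # scores of the sampled Items
    target = torch.zeros(psi_x.shape[0], dtype=torch.long) # positive sits at column 0
    return F.cross_entropy(logits, target)
```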


4. Matrix factorization and softmax

DNN models address many of the limitations of matrix factorization, but are often more expensive to train and query. The table below summarizes some important differences between the two models.

| | Matrix factorization | Softmax DNN |
| --- | --- | --- |
| Query features | Not easy to include. | Can be included. |
| Cold start | Cannot easily handle Queries (Users) or Items outside the vocabulary; some heuristics can be used. | Easily handles new Queries. |
| Folding | Folding can easily be reduced by adjusting the unobserved weight in WALS. | Prone to folding; techniques such as negative sampling or gravity are needed. |
| Training scalability | Easily scales to very large corpora (perhaps hundreds of millions of Items or more), but only if the input matrix is sparse. | Harder to scale to very large corpora; techniques such as hashing and negative sampling can be used. |
| Serving scalability | The embeddings U, V are static, and a set of candidates can be pre-computed and stored. | The Item embeddings V are static and can be stored, but the Query embedding usually must be computed at query time, making the model more expensive to serve. |

In short:

  • For large corpora, matrix factorization is usually the better choice: it is easier to scale, cheaper to query, and less prone to folding.
  • DNN models can better capture personalized preferences, but are harder to train and more expensive to query. DNN models outperform matrix factorization for scoring because they can use more features to capture relevance better. Furthermore, folding is generally acceptable for DNN models used in scoring, since we are primarily concerned with ranking a pre-filtered set of candidates that are assumed to be relevant.


5. References

1. https://developers.google.cn/machine-learning/recommendation/dnn/training

2. https://developers.google.cn/machine-learning/recommendation/dnn/softmax
