Recommendation Systems (11): Embedding in Recommendation Systems

The topics discussed this time include:

  1. What is Embedding?
  2. Why do recommendation systems need Embedding?
  3. How to use data to generate Embedding in recommendation system code?
  4. What are the categories of Embedding technologies in recommendation system code?

1. What is Embedding?

Put simply, an Embedding is a vector (for example, [0.2, 0.4] is a two-dimensional Embedding). Put more formally, it is a low-dimensional dense vector used to "represent" an object, where the object can be a word (Word2Vec), an item (Item2Vec), or a node in a network (Graph Embedding). "Represent" means that the Embedding vector can express certain characteristics of the corresponding object, and that the distance between vectors reflects the similarity between objects.
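As a toy illustration of "distance between vectors reflects similarity", here is a minimal sketch (plain NumPy; the vectors are made-up numbers, not real learned Embeddings) that compares items by cosine similarity:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity: values close to 1.0 mean the two Embeddings point in the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

item_a = np.array([0.2, 0.4])     # two-dimensional Embeddings, as in the text
item_b = np.array([0.25, 0.35])
item_c = np.array([-0.4, 0.1])

print(cosine_sim(item_a, item_b))  # close to 1: similar items
print(cosine_sim(item_a, item_c))  # much lower: dissimilar items
```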

In recommendation scenarios, Users and Items are complex and diverse, each with its own characteristics, and it is hard to evaluate those characteristics directly. We therefore need to abstract these complicated features into a shared space in order to "unify the world view". In this unified space, the originally complicated features become relatively simple numeric vectors.

Based on this understanding, Embedding is essentially a form of abstraction. Abstraction strips away the rough and the false and keeps what is essential about a User and an Item. Through abstraction, we project Users and Items into another, more compact space where they can be expressed in mathematical form.

Before Embedding became popular, one-hot encoding was the representation of choice. Compared with one-hot, an Embedding can intuitively be seen as a smoothed version of a one-hot vector, while a one-hot vector can be seen as a max-pooled version of an Embedding. In terms of abstraction, one-hot encoding sits at a low level of abstraction and has a clear, direct correspondence with the object it represents; Embedding, by contrast, sits at a very high level of abstraction and has no directly interpretable connection with the object it represents, yet its expressive power is far greater than that of one-hot encoding.

Figure 1 The difference between Embedding and one-hot encoding
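To make the relationship concrete, here is a minimal sketch (plain NumPy; names such as `emb_table` are made up for illustration) showing that an Embedding lookup is mathematically a one-hot vector multiplied by the Embedding table, so the table row itself is the dense representation:

```python
import numpy as np

vocab_size, emb_dim = 5, 3
emb_table = np.random.rand(vocab_size, emb_dim)    # learned during training

item_id = 2
one_hot = np.eye(vocab_size)[item_id]              # [0, 0, 1, 0, 0]
dense_from_onehot = one_hot @ emb_table            # one-hot times Embedding table
dense_from_lookup = emb_table[item_id]             # direct table lookup

assert np.allclose(dense_from_onehot, dense_from_lookup)
```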

In the general sense, an Embedding is the parameter weight of the penultimate layer of a neural network. It only has holistic, relative meaning (this is very important!); it has no local or absolute meaning (which is also why Embeddings are called latent factors or latent vectors; see the reference on the DNN implementation of FM). This is tied to how an Embedding is produced: every Embedding starts as random numbers and is then iteratively updated by the optimization algorithm. When the network converges and iteration stops, the parameters of each layer are effectively frozen and we obtain the hidden-layer weight table, which is exactly the Embedding we want; the Embedding of each element can then be read individually from that lookup table.

Let me explain why the Embedding is the parameter weight of the penultimate layer of the neural network. First, the last layer is the prediction layer, so the penultimate layer is strongly related to the target task; once the Embedding is obtained, a sample can be characterized by these weights. Second, the purpose of obtaining an Embedding is to enable retrieval. Retrieval means finding the closest candidates, i.e., maximizing the inner product (equivalently, minimizing the distance). Multiplying the output of the layer before the penultimate layer by the penultimate layer's weights can itself be understood as the retrieval process, because it is also an inner product and it scores all candidate items at once, which is why these weights can be used directly as Embeddings.
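A minimal sketch of that retrieval view (NumPy; `user_vec`, `item_weights` and the shapes are hypothetical, not taken from a real model): scoring every candidate is a single matrix product between the user vector and the penultimate-layer weight matrix.

```python
import numpy as np

emb_dim, num_items = 64, 10000
user_vec = np.random.rand(emb_dim)                  # output of the last hidden layer
item_weights = np.random.rand(num_items, emb_dim)   # penultimate-layer weights, one row per item

scores = item_weights @ user_vec                    # inner product with every candidate at once
top_k = np.argsort(-scores)[:10]                    # highest-scoring items = retrieval result
```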

From this we can see that looking at an Embedding directly is meaningless, because its mathematical form (a vector) has no direct, element-wise mapping to the object it represents. An Embedding is the expression of an object in another space; since the spaces are different, one space cannot simply be used to inspect the other.


2. Why does the recommendation system need Embedding?

In a recommendation system, an Embedding can be used as a vector in nearest-neighbor (NN) recommendation, supporting item-to-item, user-to-item, and user-to-user recommendation.

As an essential part of the recommendation algorithm, the Embedding vector has four main application directions:

2.1 As Embedding layer in deep learning network

Complete the conversion from high-dimensional sparse feature vectors to low-dimensional dense feature vectors (as in Wide&Deep, DIN and other models). Recommendation scenarios make heavy use of one-hot encoding for categorical and ID features, which makes the sample feature vectors extremely sparse, and the structure of deep learning models makes them ill-suited to processing sparse inputs. Almost all deep learning recommendation models therefore use an Embedding layer to convert high-dimensional sparse feature vectors into dense low-dimensional ones, so mastering the various Embedding techniques is a basic skill for building deep learning recommendation models.

2.2 As pre-trained Embedding feature vector

The pre-trained Embedding is concatenated with other feature vectors and fed into a deep learning network for training (as in the FNN model). The Embedding itself is an extremely important feature vector. Compared with feature vectors produced by traditional methods such as MF matrix factorization, an Embedding has stronger expressive power, especially since Graph Embedding techniques were proposed, which allow almost any information to be encoded into it, so the Embedding itself carries a large amount of valuable information. On this basis, the Embedding vector is often concatenated with other recommendation features and then fed into the downstream deep learning network for training.

2.3 Calculate the Embedding similarity between users and items

An Embedding can be used directly as the recall layer, or as one of the recall strategies, of a recommendation system (as in the YouTube recommendation model). Computing the similarity between item and user Embeddings is a commonly used recall-layer technique. With fast approximate nearest-neighbor search technologies such as Locality-Sensitive Hashing applied to recommendation systems, Embeddings are well suited to quickly "screening" a huge candidate pool down to hundreds or thousands of items, which are then "fine-ranked" by the deep learning network.

YouTube pioneered the use of Embedding features for recommendation. In its paper, user_vec is learned through a DNN; the advantage of introducing a DNN is that arbitrary continuous and discrete features can easily be added to the model. By contrast, although the matrix factorization methods commonly used in recommendation systems can also produce user_vec and item_vec, they cannot embed additional features as flexibly.

Figure 2 User behavior sequence modeling via the pooling-based approach proposed by YouTube

  1. The overall architecture is a DNN with three hidden layers. The input is a vector concatenated from the user's watch-history vector, search-history vector, demographic information (gender, age) and other contextual information; the output is split into an offline training part and an online serving part.
  2. As in word2vec, each video is embedded into a fixed-dimensional vector. The user's viewing history is a variable-length sequence of videos, which is reduced to a fixed-dimensional watch vector by a weighted average (the weights can reflect importance and recency) and used as the input of the DNN.
  3. In offline training the output layer is a softmax layer, while online the user vector is used directly to retrieve related items. During serving, the whole model is not run for inference; instead the user Embedding and item Embedding are used directly for similarity computation, where the user Embedding is the output of the last MLP layer and the video Embedding is taken directly from the softmax weights.

2.4 Calculate user and item Embeddings as real-time features

By calculating the Embeddings of users and items, they can be fed into recommendation or search models as real-time features (as in Airbnb's Embedding application). It is worth mentioning that Embeddings used to be computed offline, but in 2017 Facebook released the faiss library, which made it possible to add Embeddings in a streaming fashion and shortened similarity computation over millions of vectors to the millisecond level.
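A rough sketch of how such a recall / real-time lookup might be done with faiss (the index type, dimensions and variable names below are arbitrary choices for illustration, not a production setup):

```python
import numpy as np
import faiss

emb_dim, num_items = 64, 100000
item_emb = np.random.rand(num_items, emb_dim).astype("float32")  # item Embeddings from some model

index = faiss.IndexFlatIP(emb_dim)   # exact inner-product search; IVF/HNSW indexes trade accuracy for speed
index.add(item_emb)                  # new item Embeddings can be appended as they arrive

user_emb = np.random.rand(1, emb_dim).astype("float32")
scores, item_ids = index.search(user_emb, 100)   # top-100 candidates for the fine-ranking stage
```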

The main contribution of Airbnb's 2018 paper was its innovation in constructing sparse training samples. Airbnb's approach partly made up for the poor fit of YouTube's method in domains such as news recommendation. From an Embedding point of view, its innovations boil down to two points: one is group embedding, and the other is joint training of user and item Embeddings in the same space.

Figure 3 Embedding development history


3. How to use data to generate Embedding in the recommendation system code?

Combining with the figure above, let's walk through three code scenarios to explain how an Embedding is computed. The three scenarios are word2vec, collaborative filtering, and DNN.

3.1 Content-based word2vec

First look at the three sentences in the red box as input. word2vec then computes the Embedding of each word. On the right-hand side, if we replace documents with users and words with movies, we obtain movie recommendations in exactly the same way.

Figure 4 word2vec
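A minimal sketch of this idea with gensim's word2vec (the "sentences" here are hypothetical per-user watch sequences, with movie ids standing in for words):

```python
from gensim.models import Word2Vec

# Each "sentence" is one user's watch sequence; each "word" is a movie id.
watch_sequences = [
    ["m1", "m2", "m3"],
    ["m2", "m3", "m4"],
    ["m1", "m4", "m5"],
]

model = Word2Vec(sentences=watch_sequences, vector_size=16, window=2,
                 min_count=1, sg=1, epochs=50)

print(model.wv["m2"])                        # the Embedding of movie m2
print(model.wv.most_similar("m2", topn=3))   # nearest movies = recommendation candidates
```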

3.2 Matrix factorization for collaborative filtering

Here you supply the input as "user id", "movie id" and "rating" triples, and fit them with the ALS algorithm (Alternating Least Squares, a matrix factorization algorithm based on the collaborative filtering idea). This yields the Embedding vector (the features column in the table) of each entity (the id column in the table), after which we can again do item-to-item, user-to-item, and user-to-user recommendation.

Figure 5 Collaborative filtering
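A minimal sketch with Spark ML's ALS (the toy ratings and hyperparameters are made up for illustration); the learned `userFactors` and `itemFactors` tables are exactly the id-to-Embedding tables described above:

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-embedding").getOrCreate()
ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 2.0), (1, 10, 5.0), (1, 12, 3.0), (2, 11, 4.0)],
    ["userId", "movieId", "rating"],
)

als = ALS(rank=8, maxIter=10, regParam=0.1,
          userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(ratings)

# model.userFactors / model.itemFactors hold the learned Embedding vectors (id, features).
model.itemFactors.show(truncate=False)
```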

3.3 DNN deep learning method

This was actually touched on above, but let's go into more detail here. An Embedding is really a by-product of the DNN. What does that mean? In the picture on the lower left, the red arrow points to the ReLU layer, whose output is a 256-dimensional vector; this vector is in fact the Embedding, and the Embedding is nothing other than the learned weights. The video vectors are then obtained on the far left, and with nearest-neighbor search, the scores computed from these Embedding weights can be fed into the softmax to output the final top-N prediction.
Does this feel familiar? The transformer works in a similar spirit: the weights computed from Q and K are passed through a softmax and applied to V. The main difference is that a plain DNN is used here, whereas the transformer uses multi-head attention.

Figure 6 Embedding of dnn
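A minimal sketch of this "Embedding as a by-product of the DNN" idea in Keras (layer sizes, pooling choice and all names are illustrative assumptions, not the exact YouTube architecture): the last hidden layer output serves as the user Embedding, and the weights of the final logits/softmax layer serve as the video Embeddings.

```python
import tensorflow as tf

num_videos, emb_dim = 1000, 32
watch_ids = tf.keras.Input(shape=(10,), dtype="int32")           # ids of the last 10 watched videos
video_table = tf.keras.layers.Embedding(num_videos, emb_dim)      # input-side video Embeddings
x = tf.keras.layers.GlobalAveragePooling1D()(video_table(watch_ids))  # average-pooled watch vector
x = tf.keras.layers.Dense(256, activation="relu")(x)              # the 256-unit ReLU layer from the figure
user_vec = tf.keras.layers.Dense(64, activation="relu")(x)        # user Embedding = last hidden layer output
logits = tf.keras.layers.Dense(num_videos, use_bias=False)(user_vec)  # its weights = output-side video Embeddings

model = tf.keras.Model(watch_ids, logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# After training: one 64-d Embedding per candidate video, usable for nearest-neighbor retrieval.
video_emb = model.layers[-1].get_weights()[0].T   # shape (num_videos, 64)
```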


4. What are the categories of Embedding technologies in recommendation system codes?

4.1 Feature Embedding

In feature engineering, discrete, continuous, and multi-valued features each have roughly the Embedding methods shown below. Embeddings can come from two sources: pre-trained Embedding feature vectors, which benefit from larger training samples and more fully learned parameters; or end-to-end training, where the Embedding layer inside the model converts high-dimensional sparse vectors into low-dimensional dense feature vectors. The advantage of the end-to-end approach is that gradients flow consistently through the whole network; the disadvantages are that it has many parameters and converges slowly, and if the amount of data is small the parameters are hard to train fully.

 Figure 7 Feature Embedding
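A minimal Keras sketch of the end-to-end route for the three feature types mentioned above (feature names, dimensions and vocabulary sizes are all made-up assumptions): an Embedding lookup for the discrete id, mean-pooled Embeddings for the multi-valued feature, and a dense projection for the continuous value.

```python
import tensorflow as tf

vocab_size, num_tags, emb_dim = 10000, 500, 16

item_id = tf.keras.Input(shape=(1,), dtype="int32", name="item_id")   # discrete feature
price = tf.keras.Input(shape=(1,), dtype="float32", name="price")     # continuous feature
tags = tf.keras.Input(shape=(5,), dtype="int32", name="tags")         # multi-valued feature (padded)

id_emb = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(vocab_size, emb_dim)(item_id))
tag_emb = tf.keras.layers.GlobalAveragePooling1D()(tf.keras.layers.Embedding(num_tags, emb_dim)(tags))
price_feat = tf.keras.layers.Dense(emb_dim, activation="relu")(price)  # project the continuous value

concat = tf.keras.layers.Concatenate()([id_emb, tag_emb, price_feat])
output = tf.keras.layers.Dense(1, activation="sigmoid")(concat)
model = tf.keras.Model([item_id, price, tags], output)
model.compile(optimizer="adam", loss="binary_crossentropy")
```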

4.2 Embedding operation

In different deep learning models, besides the various optimizations of the network structure, many optimization attempts have also been made on the Embedding operations themselves; indeed, many optimizations of the network structure are essentially optimizations of how Embeddings are operated on.

Figure 8 Embedding operation

4.3 Embedding defects

Embedding is a very popular technique, but it also has shortcomings: for example, semantics are not stable under incremental updates, it is difficult to incorporate multiple features at once, and long-tail data is hard to train well.

 Figure 9 Embedding defects

At KDD 2020, a Huawei paper on AutoFIS discussed further optimization of Embedding: it obtains a suitable vectorized representation of features while also learning a "more appropriate" inner-product value to represent the importance of feature interactions.

The idea is to add a learnable coefficient in front of <Vi, Vj>. One might ask whether introducing yet another parameter is unnecessary. On the contrary, I think this is the essence of AutoFIS. The role of the Embedding is to vectorize features so that similar features end up closer together, and therefore the inner product of similar features is larger. But "similar" and "important" are two different things: features that are very "similar" do not necessarily play a more "important" role in prediction. DeepFM does not decouple these two aspects during training, and the possible result is that the Embedding's vectorized representation does not necessarily bring similar features closer, and the inner products of important feature pairs are not necessarily large.
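A minimal sketch of that gating idea (tensor shapes and names are illustrative, not the paper's exact implementation): a learnable coefficient alpha_ij scales each pairwise inner product, so the model can mark an interaction as unimportant even when the two Embeddings are similar.

```python
import tensorflow as tf

num_fields, emb_dim = 4, 8
v = tf.Variable(tf.random.normal([num_fields, emb_dim]))     # field Embeddings (capture "similarity")
num_pairs = num_fields * (num_fields - 1) // 2
alpha = tf.Variable(tf.ones([num_pairs]))                    # per-interaction gates (capture "importance")

pairs = [(i, j) for i in range(num_fields) for j in range(i + 1, num_fields)]
inner = tf.stack([tf.reduce_sum(v[i] * v[j]) for i, j in pairs])  # plain FM interactions <v_i, v_j>
gated = alpha * inner                                             # AutoFIS-style gated interactions
logit = tf.reduce_sum(gated)                                      # fed into the final prediction
```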

Figure 10 Summary

5. References

1. Various popular Embedding methods in deep learning recommendation systems (Part 1) - Cloud+ Community - Tencent Cloud (tencent.com)
2. Summary of Embedding technical practice in recommendation systems - Zhihu (zhihu.com)
3. Close reading of the Deep Neural Networks for YouTube Recommendations paper - Zhihu (zhihu.com)
4. Embedding technology in recommendation systems - Zhihu (zhihu.com)
5. DNN implementation of FM - the latent vector can be regarded as the weights learned by the Embedding - Gorilla's blog, CSDN Blog
6. Embedding in recommendation systems - Jianshu
