The DIN model for recommender systems (when the attention mechanism understands the business)

foreword

As mentioned in the previous article, AFM is a fairly plain attempt at the attention mechanism, with no model design rooted in a concrete business. Click-through rate (CTR) prediction is an important task in industrial settings such as online advertising. In a cost-per-click (CPC) advertising system, ads are ranked by an effective price, the effective cost per mille (eCPM), which is the product of the bid price and the CTR, and the CTR has to be estimated by the system. The quality of the CTR prediction model therefore directly affects final revenue and plays a key role in the advertising system. DIN (Deep Interest Network) is a model proposed by Alibaba in 2018, and compared with AFM it is far more business-driven. It was built for Alibaba's e-commerce advertising: when users browse an e-commerce site they leave behind a large number of historical browsing records, and DIN uses these records together with an attention mechanism to model each user's interests. In plain terms, when recommending a product to a user, the model can learn from the products that user has bought or browsed before and recommend accordingly.

All the recommendation models covered so far can be regarded as Embedding&MLP models. They follow a similar recipe and differ mainly in how embeddings and feature crossing are handled: a large number of sparse features are converted into dense embedding vectors, the vectors are concatenated, and different feature crossings are then applied. The problem with this approach is that different users have different historical behaviors, and these behaviors contribute differently to the prediction. Brute-force concatenation and crossing of these features cannot express the breadth of a user's interests well, and this is exactly the problem DIN sets out to solve.

1. DIN background analysis

Recommending e-commerce ads requires capturing a user's interests in a timely manner, but those interests are diverse and constantly changing, so they have to be learned and predicted from the user's historical behavior. However, not every historical behavior represents the user's current interest. Suppose the advertised product is a keyboard. If the products the user has clicked historically include cosmetics, bags, clothes and facial cleanser, then the user is very likely not interested in the keyboard; if the historical products include a mouse, a computer, an iPad and a mobile phone, then the user very likely is. And if the user's history contains a mouse, cosmetics, a T-shirt and facial cleanser, the embedding of the mouse should matter more than the other three when predicting the click-through rate of the keyboard ad. The earlier models that brute-force concatenate dense embedding vectors are therefore not very practical here.

We need to adaptively capture the historical behaviors that genuinely help predict the user's interest, so that we can recommend products more effectively. "Adaptive" naturally points to the attention mechanism. The authors introduce attention into the model with a structure called the "local activation unit", which computes a weight from the relevance between the candidate product and each historical product. The weight represents how important each product in the user's history is for predicting the current ad, and the deep network equipped with these attention weights is this article's protagonist, DIN.

Most recommendation systems involve the two stages of recall (candidate retrieval) and ranking. What distinguishes product recommendation from other settings is that it pays more attention to the user's historical behavior, because that behavior is directly related to the user's interests. This is the background and the motivation of DIN.


 2. Data processing and baseline

In the datasets processed by the previous models, each field contains a single feature value, but in real business a field often contains multiple values, as shown in the following figure:

[Figure: feature fields used in the display advertising system; some fields, such as the visited category list, contain multiple values]

For feature encoding, the authors give the example [weekday=Friday, gender=Female, visited_cate_ids={Bag, Book}, ad_cate_id=Book]. Normally such features are one-hot encoded, turning them into sparse binary features. But notice visited_cate_ids, the user's list of historically visited product categories: for a given user this is a multi-valued feature, and its length differs from user to user, because the number of products each user has interacted with differs. For this kind of feature we generally use multi-hot encoding, that is, more than one position may be 1, with a 1 at the position of every category that appears. The encoded data then looks like this:

[Figure: one-hot and multi-hot encoding of the example feature group]

This is the data format fed to the model. Note also that the features above are not combined or crossed in any way; the interaction information is left entirely to the neural network that follows.
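
The tiny sketch below shows what this one-hot / multi-hot encoding looks like for the example feature group above; the vocabularies and field names are made up for illustration.

```python
# A minimal sketch of one-hot / multi-hot encoding for
# [weekday=Friday, gender=Female, visited_cate_ids={Bag, Book}, ad_cate_id=Book]
# (the vocabularies below are hypothetical).
import numpy as np

weekday_vocab = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
gender_vocab = ["Female", "Male"]
cate_vocab = ["Bag", "Book", "Keyboard", "Mouse"]

def one_hot(value, vocab):
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[vocab.index(value)] = 1.0
    return vec

def multi_hot(values, vocab):
    vec = np.zeros(len(vocab), dtype=np.float32)
    for v in values:
        vec[vocab.index(v)] = 1.0            # more than one position can be 1
    return vec

weekday = one_hot("Fri", weekday_vocab)             # [0,0,0,0,1,0,0]
gender = one_hot("Female", gender_vocab)            # [1,0]
visited = multi_hot({"Bag", "Book"}, cate_vocab)    # [1,1,0,0]  <- variable number of 1s
ad_cate = one_hot("Book", cate_vocab)               # [0,1,0,0]
print(weekday, gender, visited, ad_cate)
```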

Having covered the data, let's look at the baseline model. The idea is simple: it is the Embedding&MLP approach described above, and DIN adds an attention network on top of this baseline to adaptively learn the relevance between the current ad and the user's historical behavior. The baseline consists of three modules: the Embedding layer, the Pooling & Concat layer, and the MLP. The structure is as follows:

[Figure: structure of the baseline Embedding&MLP model, with Embedding, Pooling & Concat and MLP layers]

Embedding layer: this layer converts the high-dimensional sparse input into low-dimensional dense vectors. Each discrete feature has its own embedding dictionary of dimension D × K, where D is the dimension of the embedding vector and K is the number of distinct values (nunique()) of that feature. An example:

Suppose a user's weekday feature is Friday. As one-hot encoding it becomes [0,0,0,0,1,0,0]. If the embedding dimension is D, the embedding dictionary of this feature is a D × 7 matrix (each column is one embedding vector, 7 columns for Monday through Sunday). Passing the one-hot vector through the embedding layer yields a D × 1 vector, the embedding of Friday; the computation is simply embedding_matrix * [0,0,0,0,1,0,0]^T. If this sounds abstract, write out a small matrix and vector and check it yourself.
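
A quick sketch to verify this equivalence: multiplying the embedding dictionary by the one-hot vector selects exactly one column, which is the same as a table lookup (D = 4 is arbitrary here).

```python
import torch

D, K = 4, 7                        # embedding dimension D, 7 possible weekday values
emb_matrix = torch.randn(D, K)     # embedding dictionary: one column per weekday

one_hot = torch.tensor([0., 0., 0., 0., 1., 0., 0.])  # Friday
via_matmul = emb_matrix @ one_hot                      # result of matrix * one-hot vector
via_lookup = emb_matrix[:, 4]                          # simply picking column 4
assert torch.allclose(via_matmul, via_lookup)          # both give Friday's embedding
```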

A user's behavior field may contain multiple products. Its multi-hot encoding then contains several 1s, and the columns of the embedding dictionary at the positions with value 1 are extracted, giving a list of embedding vectors for the user's history:

[Figure: looking up a multi-hot behavior feature yields a list of embedding vectors t_i]

Through this layer, each of the input features above obtains its corresponding dense embedding vector(s).
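
For the multi-hot case, frameworks implement the lookup by gathering the vectors at the positions that are 1. A minimal PyTorch sketch with toy sizes (note that nn.Embedding stores the dictionary as K × D rows, the transpose of the D × K convention above):

```python
import torch
import torch.nn as nn

K, D = 5, 4                                # 5 product categories, embedding dim 4 (toy sizes)
emb = nn.Embedding(K, D)                   # embedding dictionary stored as K x D rows

multi_hot = torch.tensor([1, 0, 1, 1, 0])        # the user visited categories 0, 2 and 3
visited_ids = multi_hot.nonzero().squeeze(-1)    # tensor([0, 2, 3])
t_i = emb(visited_ids)                           # embedding list, shape (3, D); length varies per user
print(t_i.shape)
```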

Pooling layer and Concat layer: the pooling layer converts the user's historical behavior embeddings into a fixed-length vector. Each user has bought a different number of items, i.e. the number of 1s in each user's multi-hot vector differs, so the number of historical behavior embeddings obtained after the embedding layer also differs; in other words the embedding lists t_i above have different lengths, and the users' historical behavior features cannot be aligned to the same length. The fully connected network that follows needs a fixed-length input, so a pooling layer is used first to turn the user's historical behavior embeddings into a fixed (uniform) length:

$$\boldsymbol{e}_{i}=\operatorname{pooling}\left(\boldsymbol{e}_{i 1}, \boldsymbol{e}_{i 2}, \ldots, \boldsymbol{e}_{i k}\right)$$

Here k is the number of products the user has bought in the corresponding feature group, i.e. the number of historical embeddings; the user behaviors part of the figure above shows this process. The Concat layer then concatenates all the feature embedding vectors (continuous features, if any, are included too) along the feature dimension and uses the result as the input to the MLP. Note that the user's historical behavior data is spliced in indiscriminately here; DIN's later improvement targets exactly this part.
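
A sketch of the Pooling & Concat step with sum pooling over padded, variable-length histories (the shapes and the extra profile features are illustrative):

```python
import torch

def sum_pooling(behavior_emb, mask):
    """behavior_emb: (B, T, D) padded histories; mask: (B, T), 1 for real items, 0 for padding."""
    return (behavior_emb * mask.unsqueeze(-1)).sum(dim=1)        # (B, D): fixed length per user

B, T, D = 2, 4, 8                                   # toy sizes: batch 2, max history 4, embedding 8
behavior_emb = torch.randn(B, T, D)
mask = torch.tensor([[1., 1., 1., 0.],              # user 1 has 3 historical items
                     [1., 1., 0., 0.]])             # user 2 has 2 historical items

user_hist = sum_pooling(behavior_emb, mask)         # (2, 8)
other_feats = torch.randn(B, 6)                     # e.g. profile / context embeddings
mlp_input = torch.cat([user_hist, other_feats], dim=-1)   # Concat layer output, shape (2, 14)
```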

Loss: since click-through rate prediction is a binary classification task, the loss function here is the negative log-likelihood:

$$L=-\frac{1}{N} \sum_{(\boldsymbol{x}, y) \in \mathcal{S}}\left(y \log p(\boldsymbol{x})+(1-y) \log (1-p(\boldsymbol{x}))\right)$$

where $\mathcal{S}$ is the training set of size N, $\boldsymbol{x}$ is the input, $y \in \{0,1\}$ is the label, and $p(\boldsymbol{x})$ is the predicted click probability.

As the figure shows, before the user's historical behavior features and the candidate ad features are fed into the neural network together, there is no interaction between them at all. After they enter the network they do interact, but part of the original information is already lost, for example the information of each individual historical product, because what interacts with the candidate ad is the pooled historical embedding, a mixture of all historical products. As analysed earlier, not every historical product helps predict the click-through rate of the current ad, and mixing them all in adds noise; think of the keyboard and mouse example above, where throwing in facial cleanser and clothes is counterproductive. Second, once combined this way we can no longer see which product in the user's history is most related to the current ad, i.e. the importance of each historical product for the current prediction is lost. Finally, compressing all the products a user has browsed into a fixed-length embedding via pooling limits the model's ability to learn the user's diverse interests.

DIN handles this by taking the given candidate ad and attending to the parts of the user's history (the local interests) related to that ad. Instead of expressing all of a user's different interests with the same vector, DIN adaptively computes a representation vector of user interest for the given ad by considering the relevance of each historical behavior, so the representation varies from ad to ad. Let's look at the DIN model.

3. DIN model

Given this background, DIN's design feels very natural: a local activation unit is added on top of the pooling layer. It is applied to the user's historical behavior features and weights the embedding of each historical product by its relevance to the current ad. First the model architecture:

[Figure: DIN model architecture, the baseline plus a local activation unit over the user behavior embeddings]

The improvement is that the embedding vector of each historical product now interacts with the embedding vector of the candidate product, and this interaction is the local activation unit. It is essentially a feed-forward network that takes the two vectors as input and outputs their relevance, which serves as the weight of that historical product. Multiplying these weights with the original historical behavior embeddings and summing gives the user's interest representation $\boldsymbol{v}_{U}(A)$:

$$\boldsymbol{v}_{U}(A)=f\left(\boldsymbol{v}_{A}, \boldsymbol{e}_{1}, \boldsymbol{e}_{2}, \ldots, \boldsymbol{e}_{H}\right)=\sum_{j=1}^{H} a\left(\boldsymbol{e}_{j}, \boldsymbol{v}_{A}\right) \boldsymbol{e}_{j}=\sum_{j=1}^{H} w_{j} \boldsymbol{e}_{j}$$

Here $\{\boldsymbol{e}_{1}, \boldsymbol{e}_{2}, \ldots, \boldsymbol{e}_{H}\}$ is the list of embedding vectors of user U's historical behaviors, $\boldsymbol{v}_{A}$ is the embedding vector of the candidate ad A, and $a(\boldsymbol{e}_{j}, \boldsymbol{v}_{A})=w_{j}$ is the weight of historical product j, i.e. its relevance to the candidate ad A. The feed-forward network $a(\cdot)$ is the so-called attention mechanism here. Looking at the figure, besides the historical behavior vector and the candidate ad vector, the input also includes their element-wise product; the authors say this is explicit knowledge that helps relevance modelling, and some implementations also add the element-wise difference, similar in spirit to concat(A, B, A-B, A*B) and the product operations in PNN.

One point needs special attention: the weights here do not sum to 1. Strictly speaking they are not normalised weights but the relevance scores used directly as weights, i.e. the usual pre-softmax scores. The softmax is deliberately dropped in order to preserve the intensity of the user's interest.
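
Below is a minimal PyTorch sketch of such a local activation unit; the hidden size, the use of PReLU instead of Dice, and the masking details are my own simplifications, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class LocalActivationUnit(nn.Module):
    """Scores each historical behavior against the candidate ad and returns the
    weighted sum of behavior embeddings (a sketch of DIN's attention, not the
    paper's exact layer sizes)."""
    def __init__(self, emb_dim, hidden_dim=36):
        super().__init__()
        # input: behavior embedding, candidate embedding and their element-wise product
        self.ffn = nn.Sequential(
            nn.Linear(emb_dim * 3, hidden_dim),
            nn.PReLU(),                       # the paper uses Dice; PReLU keeps the sketch short
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, behaviors, candidate, mask):
        # behaviors: (B, T, D), candidate: (B, D), mask: (B, T) with 1 for real items
        cand = candidate.unsqueeze(1).expand_as(behaviors)                    # (B, T, D)
        scores = self.ffn(torch.cat([behaviors, cand, behaviors * cand], dim=-1)).squeeze(-1)
        scores = scores * mask     # drop padded positions; note: no softmax, so the
                                   # weights keep the user's interest intensity
        return (scores.unsqueeze(-1) * behaviors).sum(dim=1)                  # v_U(A): (B, D)
```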

That is the whole of DIN. Next come two training tricks used with this model, which also grew out of real business practice.

4. Training skills

1. Mini-batch Aware Regularization

This is a regularization method, an optimisation of L2 regularization. Ordinary L2 regularization penalises all parameters, but in recommendation the data is extremely sparse: for the product id feature, for example, the embedding matrix is huge, and there is no need to compute over all its parameters, because most features are zero in any given sample. It is enough to constrain only the embeddings of the features that actually appear (are non-zero) in the current mini-batch. This is the authors' improvement, called mini-batch aware regularization.

The authors first note that most of the training cost comes from updating the embedding dictionary of each feature, i.e. the big D × K matrices, so if that computation can be reduced, training is no longer a problem; the regularization therefore focuses on these matrices. Following the authors' idea, the regularization over all samples is:

$$L_{2}(\mathbf{W})=\|\mathbf{W}\|_{2}^{2}=\sum_{j=1}^{K}\left\|\boldsymbol{w}_{j}\right\|_{2}^{2}=\sum_{(\boldsymbol{x}, y) \in \mathcal{S}} \sum_{j=1}^{K} \frac{I\left(x_{j} \neq 0\right)}{n_{j}}\left\|\boldsymbol{w}_{j}\right\|_{2}^{2}$$

In the formula, the numerator $I(x_{j} \neq 0)$ is an indicator function: if feature j of a sample is non-zero, the corresponding embedding parameter $\boldsymbol{w}_{j}$ is constrained. The denominator $n_{j}$ is the number of samples in which feature j is non-zero (i.e. appears), and K is the number of features. Switching to mini-batches, it becomes:

$$L_{2}(\mathbf{W}) \approx \sum_{j=1}^{K} \sum_{m=1}^{B} \sum_{(\boldsymbol{x}, y) \in \mathcal{B}_{m}} \frac{I\left(x_{j} \neq 0\right)}{n_{j}}\left\|\boldsymbol{w}_{j}\right\|_{2}^{2}$$

With the previous formula understood, this one is easy: only the embedding parameters of features that appear in the mini-batch are constrained. In the formula over all samples, (x, y) ranges over the whole training set; here it ranges over one mini-batch, hence the extra summation over the B mini-batches. For simplicity the authors then replace the indicator function with a fixed value, since the number of samples with a non-zero value differs from feature to feature and is a bit troublesome to compute exactly; they take the approximation $\alpha_{m j}=\max _{(\boldsymbol{x}, y) \in \mathcal{B}_{m}} I\left(x_{j} \neq 0\right)$, which gives:

$$L_{2}(\mathbf{W}) \approx \sum_{j=1}^{K} \sum_{m=1}^{B} \frac{\alpha_{m j}}{n_{j}}\left\|\boldsymbol{w}_{j}\right\|_{2}^{2}$$
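
A rough sketch of how such a regularizer could be applied in code: only the embedding rows of feature ids that occur in the current mini-batch are penalised, each scaled by 1/n_j. The function and variable names are hypothetical and the scaling follows my reading of the formula above:

```python
import torch

def mini_batch_aware_l2(embedding_weight, batch_feature_ids, n_j, lam=0.01):
    """embedding_weight: (K, D) embedding matrix; batch_feature_ids: ids appearing in
    the mini-batch; n_j: (K,) global occurrence count of each feature."""
    unique_ids = torch.unique(batch_feature_ids)       # alpha_mj = 1 only for these ids
    rows = embedding_weight[unique_ids]                # (U, D) rows touched by this batch
    return lam * ((rows ** 2).sum(dim=1) / n_j[unique_ids]).sum()

# usage sketch: total_loss = ce_loss + mini_batch_aware_l2(emb.weight, ids_in_batch, n_j)
```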

2. Data Adaptive Activation Function

Next an adaptive activation function that adjusts with the data distribution is proposed; it is a generalisation of PReLU, a very commonly used activation function in neural networks, which the authors also use in the attention network. PReLU can be written as:

$$f(s)=\begin{cases}s, & s>0 \\ \alpha s, & s \leq 0\end{cases} \;=\; p(s) \cdot s+(1-p(s)) \cdot \alpha s, \qquad p(s)=I(s>0)$$

The graph of this p(s) is a hard step function jumping at 0 (the left panel of the control-function figure in the paper).

The authors argue that PReLU's weakness is this hard rectification point at 0, which may not be appropriate when the inputs of each layer follow different distributions. So they design a new adaptive activation function called Dice, whose p(s) is the smooth curve in the right panel of that figure:

$$f(s)=p(s) \cdot s+(1-p(s)) \cdot \alpha s, \qquad p(s)=\frac{1}{1+e^{-\frac{s-E[s]}{\sqrt{\operatorname{Var}[s]+\epsilon}}}}$$

Here E[s] and Var[s] are the mean and variance of the input over each mini-batch during training; at test time, moving averages of the mean and variance estimated on the training data are used instead, similar to the moving averages in Adam and batch normalization. Because the mean and variance are taken into account, the activation adapts to the distribution of the data, which is more flexible and reasonable, and its p(s) looks like a smooth, sigmoid-like curve.
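
A possible PyTorch sketch of Dice, using BatchNorm1d (without affine parameters) to track E[s] and Var[s] during training and their moving averages at test time; treat it as an illustration rather than a reference implementation:

```python
import torch
import torch.nn as nn

class Dice(nn.Module):
    """p(s) is a sigmoid of the batch-standardised input; alpha is the learned
    slope for the 'negative' part, as in PReLU."""
    def __init__(self, num_features, eps=1e-8):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features, eps=eps, affine=False)  # keeps running E[s], Var[s]
        self.alpha = nn.Parameter(torch.zeros(num_features))

    def forward(self, s):                      # s: (B, num_features)
        p = torch.sigmoid(self.bn(s))          # p(s) follows the data distribution
        return p * s + (1 - p) * self.alpha * s
```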

5. Other details

Some other details, mainly from the experiments. First the datasets: the authors use three, one real-scenario dataset from Alibaba and two public datasets, the Amazon product dataset and MovieLens. The specific descriptions are in the original paper; later we will also implement DIN in PyTorch, run it on the Amazon dataset, and interpret that dataset there. One thing worth mentioning is that MovieLens can also be used for click-through rate prediction. I have used it before, but only looked at movie ratings without considering the user's historical rating behavior. Here the authors convert it into a binary classification task: a rating above 3 counts as a click and the rest as no click, turning it into a 0/1 classification task that also takes the user's historical click behavior into account. If there is time later we will try this dataset as well.

The second detail is evaluation. The paper introduces a metric called RelaImpr to measure the relative improvement of a model, computed as:

$$\text{RelaImpr}=\left(\frac{\mathrm{AUC}(\text{measured model})-0.5}{\mathrm{AUC}(\text{base model})-0.5}-1\right) \times 100 \%$$

This measures the improvement relative to the base model. The paper also uses an impression-weighted AUC (GAUC), where the AUC is computed per user and weighted by that user's number of impressions:

$$\mathrm{GAUC}=\frac{\sum_{i=1}^{n} \#\text{impressions}_{i} \times \mathrm{AUC}_{i}}{\sum_{i=1}^{n} \#\text{impressions}_{i}}$$
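
A small sketch of both metrics, treating each user as one group and using scikit-learn's AUC (grouping and edge-case handling are simplified):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gauc(labels, scores, user_ids):
    """Impression-weighted AUC: per-user AUC weighted by that user's number of
    impressions; users whose labels are all 0 or all 1 are skipped."""
    total, weight = 0.0, 0.0
    for u in np.unique(user_ids):
        idx = user_ids == u
        y = labels[idx]
        if y.min() == y.max():                 # AUC undefined for a single-class user
            continue
        w = idx.sum()
        total += w * roc_auc_score(y, scores[idx])
        weight += w
    return total / weight

def rela_impr(auc_model, auc_base):
    """RelaImpr = ((AUC_model - 0.5) / (AUC_base - 0.5) - 1) * 100%."""
    return ((auc_model - 0.5) / (auc_base - 0.5) - 1) * 100
```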

Summary:

The DIN model is grounded in a real business scenario and addresses the pain point that earlier deep learning models could not express users' diverse interests. It computes a user interest vector by considering the representation of a given candidate ad together with the user's historical behavior. Specifically, it introduces local activation units that focus on the relevant user interests by softly searching the relevant parts of the historical behavior, and uses a weighted sum to obtain the representation of user interest with respect to the candidate ad. Behaviors that are more relevant to the candidate ad receive higher activation weights and dominate the interest representation. This representation vector varies across ads, which greatly improves the expressive power of the model. Of course, applying this model requires a dataset with rich user historical behavior data.


Reference: Recommendation on AI: the AFM and DIN models (when the recommender system meets the attention mechanism)
