Recommender systems: the Deep Interest Network (DIN)

Evolution of deep learning recommendation models

Since recommender systems and computational advertising entered the deep learning era, models have made significant progress over traditional recommendation models in two respects:
(1) Compared with traditional machine learning, deep learning models are more expressive and can uncover more of the hidden patterns in the data.
(2) Deep learning model structures are very flexible: the structure can be adjusted to the business scenario and the characteristics of the data, so that the model fits the scenario closely.

The evolution map of deep learning recommendation models is shown below. With the multilayer perceptron (MLP) at the core, deep learning recommendation models with different characteristics are built by changing the structure of the neural network.

P.S. This figure is adapted from Wang Zhe's book "Deep Learning Recommender Systems". I have only just got the book and merely flipped through it, but I have to admit: if you haven't read it, you're out of date. Wang Zhe's criterion for selecting models is that they are classics and have been applied successfully at leading companies, such as Alibaba's DIN and DIEN. Alibaba's DSIN, TDM and ESMM, however, were not selected, perhaps because of when the book was written or because they had not yet been rolled out at full scale in practice. Top CS conference papers still track the state of the art, but with more than 1,000 papers a year at the top conferences, the volume really is dazzling.
[Figure: evolution map of deep learning recommendation models]

DIN: Deep Interest Network

About DIN

The Deep Interest Network (DIN) was proposed at KDD 2018 by Alimama's precision targeted advertising team. It is a CTR prediction model for e-commerce scenarios that models user interest in depth. The core of DIN is to combine the attention mechanism with the traditional Embedding & MLP model. Attention had already achieved great success in CV and NLP, but bringing it successfully into CTR prediction owes much to the Alibaba engineers' precise understanding of the e-commerce business.

By analyzing user behavior data, Alibaba found that user interests have two important characteristics:

  • Diversity: a user may be interested in many categories of goods.
  • Local activation: because user interests are diverse, only part of the user's historical behavior helps predict a click on the current item, not all of it.

The traditional Embedding & MLP paradigm works as follows: an embedding layer first projects the large-scale sparse features into low-dimensional continuous embedding vectors; these vectors are then concatenated and fed into a fully connected network that computes the final prediction target. In e-commerce, accurate prediction requires fully exploiting the user's historical behavior to understand the user's interests. A user may be interested in several different kinds of goods at the same time, and this is reflected in the historical behavior. The traditional Embedding & MLP model represents a user with one fixed vector, which is not enough to capture this diversity of interests.
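To make the baseline concrete, here is a minimal NumPy sketch of the Embedding & MLP paradigm described above. It is only an illustration under my own assumptions: the sizes, the random untrained weights, the two-layer MLP and the names `user_vector` and `predict_ctr` are all hypothetical and not taken from the paper or any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not values from the paper).
NUM_ITEMS, EMB_DIM, HIDDEN = 1000, 8, 16

# Embedding table: maps a sparse item ID to a dense low-dimensional vector.
item_embedding = rng.normal(scale=0.1, size=(NUM_ITEMS, EMB_DIM))

# Weights of a tiny two-layer MLP (randomly initialised, untrained).
W1 = rng.normal(scale=0.1, size=(2 * EMB_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(HIDDEN, 1))
b2 = np.zeros(1)


def user_vector(history_ids):
    """Fixed user representation: average-pool the clicked-item embeddings."""
    return item_embedding[history_ids].mean(axis=0)


def predict_ctr(history_ids, candidate_id):
    """Concatenate the user and candidate vectors, then run the MLP."""
    x = np.concatenate([user_vector(history_ids), item_embedding[candidate_id]])
    hidden = np.maximum(x @ W1 + b1, 0.0)      # ReLU hidden layer
    logit = (hidden @ W2 + b2)[0]
    return 1.0 / (1.0 + np.exp(-logit))        # sigmoid -> click probability


history = np.array([3, 17, 256])               # hypothetical clicked item IDs
p = predict_ctr(history, candidate_id=42)
```

Note that `user_vector` never looks at the candidate item: whatever is being scored, the user is compressed into the same fixed-length vector. That is the limitation DIN targets.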

The dimension of the fixed user vector V_u limits the overall solution space of the model, and because of compute cost and generalization concerns that dimension cannot be enlarged indefinitely. Alibaba therefore proposed a user vector that changes dynamically with the prediction target. Concretely, predicting the CTR of User_i on the target Item_i does not require V_u to express all of the user's interests, only the interests related to Item_i. For example, suppose the targeted ad is a keyboard and the user's click history contains a mouse, a face cream and a T-shirt. Common sense says the mouse matters more than the other two for predicting the keyboard's click-through rate; in modeling terms, the "attention" paid to the mouse feature should be greater than that paid to the other two.

Alibaba therefore introduced an attention mechanism to capture the state of a user's interest with respect to different goods, using a V_u that changes dynamically with the candidate item to express the relevant user interest.

DIN model architecture

[Figure: DIN model structure]
The DIN model structure is shown in the figure. An interest activation module (the Activation Unit) uses the information of the prediction target, the Candidate Ad, to activate the items in the user's click history and so extract the user interests related to the current target. Historical behaviors with high weights are those related to the current ad; those with low weights are "interest noise" unrelated to the ad. Each activated item embedding is multiplied by its weight, and the results are summed to give the user's interest representation for the current Ad. Finally, this interest representation is concatenated with the user's static features, the context features and the Ad features, and fed into the subsequent multi-layer DNN, which predicts the probability that the user clicks on the current target Ad.
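The activation-then-weighted-sum step can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: the activation unit here is a small untrained MLP over the behavior embedding, the candidate embedding and their element-wise product (the element-wise product, the fixed PReLU slope of 0.25, the sizes and all function names are my assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, HIDDEN = 4, 8                          # illustrative sizes (assumption)

# Untrained weights of the small MLP inside the hypothetical activation unit.
W1 = rng.normal(scale=0.5, size=(3 * EMB_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.5, size=(HIDDEN, 1))
b2 = np.zeros(1)


def activation_unit(behaviors, candidate):
    """Return one weight per historical behavior, conditioned on the candidate.

    behaviors: (T, D) embeddings of the clicked items
    candidate: (D,)   embedding of the Candidate Ad
    """
    cand = np.broadcast_to(candidate, behaviors.shape)
    # Interaction features: both embeddings plus their element-wise product.
    feats = np.concatenate([behaviors, cand, behaviors * cand], axis=1)  # (T, 3D)
    pre = feats @ W1 + b1
    hidden = np.where(pre > 0, pre, 0.25 * pre)  # PReLU with a fixed slope
    return (hidden @ W2 + b2)[:, 0]              # (T,) activation weights


def interest_vector(behaviors, candidate):
    """Weighted sum pooling: multiply each behavior by its weight, then sum."""
    weights = activation_unit(behaviors, candidate)
    return weights @ behaviors                   # (D,)


behaviors = rng.normal(size=(3, EMB_DIM))        # e.g. mouse, face cream, T-shirt
cand_a = rng.normal(size=EMB_DIM)                # e.g. a keyboard ad
cand_b = rng.normal(size=EMB_DIM)                # some other ad
v_a = interest_vector(behaviors, cand_a)
v_b = interest_vector(behaviors, cand_b)
```

With the same click history, `v_a` and `v_b` differ because the weights depend on the candidate. In a trained model one would expect the mouse to receive a larger weight than the face cream when the candidate is a keyboard; with the random weights used here the values carry no meaning.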

Attention mechanism

Put simply, an attention mechanism assigns different weights to different features, so that some features dominate the current prediction; in effect the model "pays attention" to certain features. DIN does not apply the attention mechanism directly, however, because the user interest representation (embedding vector) should differ for different candidate ads.

The user's interest is no longer a single point but a multimodal function: each peak represents one interest, and the height of the peak represents the intensity of that interest. For different candidate ads the intensity of the user's interest differs; in other words, as the candidate ad changes, the intensity of the user's interest changes with it.

The DIN model adaptively adjusts the User Representation for each Candidate Ad. That is, when the Embedding Layer -> Pooling Layer step produces the representation of the user's interests, different historical behaviors are given different weights, which achieves local activation. Viewed in the reverse direction, the current Candidate Ad is used to activate the user's historical interests, assigning a different weight to each historical behavior. Mathematically, the only difference from the earlier approach is that the plain average (or sum) pooling is replaced by an attention-based weighted sum or weighted average.
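Written out, the difference between the two pooling schemes looks like this. The notation is my own reading of the DIN paper and should be treated as an assumption: e_j is the embedding of the j-th historical behavior, H is the number of behaviors, v_A is the embedding of the Candidate Ad A, and a(·,·) is the scalar output of the Activation Unit.

```latex
% Embedding & MLP baseline: average pooling, identical for every candidate ad
v_U = \frac{1}{H} \sum_{j=1}^{H} e_j

% DIN: weighted sum pooling, conditioned on the candidate ad A
v_U(A) = \sum_{j=1}^{H} a(e_j, v_A)\, e_j = \sum_{j=1}^{H} w_j\, e_j
```

As far as I understand the paper, the weights w_j are not passed through a softmax, so they need not sum to 1; the total weight then also reflects how strongly the history as a whole relates to the candidate.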
[Figure: activation weights of a user's historical behaviors for a candidate ad]
The interest activation module in DIN assigns weights to the historical behaviors according to the prediction target. The length of the yellow bar represents the activation weight: the longer the bar, the higher the weight and the more relevant the behavior is to the prediction target. Intuitively, the items related to the prediction target, a jacket, are given relatively higher weights.

Dice activation function

PReLU (a Leaky ReLU whose negative slope is learned) and ReLU both switch between two regimes through a step function. Their shared problem is that the breakpoint is fixed at 0: it stays the same whatever the input distribution is, whereas in practice the outputs of different neurons are distributed differently, so the breakpoint should be determined by the data. Alibaba therefore proposed the Dice (Data Adaptive Activation Function) activation function, which describes the data distribution through the mean and variance of the neuron's outputs. Dice adjusts its breakpoint adaptively according to the data distribution, which improves the model's overall learning ability.
[Figure: Dice activation function]
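A minimal NumPy sketch of the idea, for comparison with PReLU. It follows the form of Dice as I understand it from the paper: a sigmoid gate centered on the batch mean and scaled by the batch standard deviation replaces the hard switch at 0. The fixed `alpha=0.25`, the use of plain batch statistics (a real training setup would keep moving averages for inference and learn `alpha`) and the function names are assumptions for illustration.

```python
import numpy as np


def prelu(s, alpha=0.25):
    """PReLU with a fixed slope: hard switch between the two regimes at s = 0."""
    return np.where(s > 0, s, alpha * s)


def dice(s, alpha=0.25, eps=1e-8):
    """Dice sketch: soft switch centered on the batch mean of the inputs.

    s: array whose first axis is the batch dimension.
    """
    mean = s.mean(axis=0)
    var = s.var(axis=0)
    # Gate p(s) in (0, 1): 0.5 exactly at the batch mean, not at 0.
    p = 1.0 / (1.0 + np.exp(-(s - mean) / np.sqrt(var + eps)))
    return p * s + (1.0 - p) * alpha * s


batch = np.array([1.0, 2.0, 3.0])
out = dice(batch)
out_shifted = dice(batch + 100.0)   # the gate moves with the data distribution
```

At the batch mean the gate is 0.5, so the output there is the average of `s` and `alpha * s`. Shifting the whole batch by 100 shifts the switching point with it, which a fixed breakpoint at 0 cannot do.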

DIN Visualization

[Figure: visualization of the user interest distribution]
The figure shows the distribution of a user's interest: the warmer the color, the stronger the user's interest. The interest distribution clearly has multiple peaks.

Summary

  1. User interests are diverse: once a user has clicked on many items or shops, sum or average pooling of the embedding vectors loses a lot of information. DIN therefore introduces an attention mechanism that, through local activation, assigns a different weight to each behavior ID; the weight is determined jointly by that behavior ID and the Candidate Ad.
  2. DIN uses the Activation Unit to capture local activation, and weighted sum pooling to capture diversity.
  3. For model optimization, DIN proposes the Dice activation function and an adaptive regularization method, which noticeably improve model performance and convergence speed.

Origin blog.csdn.net/liheng301/article/details/105338953