CTR Prediction for Recommender Systems (Paper Reading): The Wide & Deep Model

After reading the FM and FNN/PNN papers, I moved on to one of Google's papers from 2016. It combines a traditional LR with a DNN into the Wide & Deep model (a parallel structure), which retains the fitting ability of the LR while gaining the generalization ability of the DNN. The two parts do not need to be trained separately, so the model is easy to iterate on. Let's take a look.


Paper: Wide & Deep Learning for Recommender Systems

Address: [https://arxiv.org/pdf/1606.07792.pdf](https://arxiv.org/pdf/1606.07792.pdf)

 

1. The origin of the problem
 
1.1 Background
This paper is presented for recommender systems, though it can also be applied to CTR prediction. First, two terms that appear throughout the paper:
  • memorization: learning the correlations between items or features that appear in the historical data.
  • generalization: transferring those correlations to discover novel feature combinations that rarely or never appear in the historical data.
An example to explain them: over the course of human cognition and learning, the (very complex) human brain can memorize what happens every day (sparrows can fly, pigeons can fly) and generalize that knowledge to things it has never seen before (creatures with wings can fly). But the generalized rules are not always accurate and are sometimes wrong (can any animal with wings fly?). That is when memorization is needed to amend the generalized rules with exceptions (penguins have wings but cannot fly). This is what memorization and generalization mean, and why both matter.
 
1.2 Problems with existing models
  • LR is a simple linear model: fast, interpretable, and very good at fitting the data. But as a linear model its capacity is limited and its generalization ability is weak. It requires feature engineering, especially cross features, to perform well; yet in industrial settings the number of features can reach hundreds of thousands, so the feature engineering is hard to do well and does not necessarily improve results.
  • A DNN needs little elaborate feature engineering to achieve very good results. It can cross features automatically and learn the interactions between them, especially high-order interactions, so its generalization ability is strong. In addition, by adding an embedding layer, a DNN handles sparse features effectively and prevents the feature space from exploding. Generalization is very important for recommender systems, since it increases the diversity of the recommended items; however, a DNN fits the data somewhat less tightly than LR.
  • In conclusion:
  1. a linear model cannot learn feature combinations that do not appear in the training set;
  2. an FM or DNN can learn feature combinations that do not appear in the training set through learned embedding vectors, but it may over-generalize.
To improve both the fit and the generalization of a recommender system, LR and a DNN can be combined to strengthen both abilities at once. Wide & Deep is exactly this combination: the wide part is the LR, the deep part is the DNN, and the outputs of the two are combined.
 
2. Model details
 
To briefly restate the two terms:
Memorization: previously, large-scale sparse inputs were handled with a linear model plus cross features. This makes memorization very efficient and interpretable, but achieving generalization then requires more manual feature engineering.
Generalization: in contrast, a DNN needs almost no feature engineering. Through combinations of low-dimensional dense embeddings it can learn deeper hidden features. Its drawback is a tendency to over-generalize: the recommender system suggests items that are not very relevant to the user, especially when the user-item matrix is sparse and high-rank.
The difference between the two: memorization tends to be more conservative, recommending items close to those the user has interacted with before. By comparison, generalization tends to increase the diversity of the recommendations.
 
2.1 Wide and Deep
 
Wide & Deep consists of two parts: a linear model and a DNN. It combines the advantages described above, balancing memorization and generalization; the rationale is to integrate the strengths of both to serve the recommender system. In the paper's experiments, Wide & Deep improves significantly over the wide-only and deep-only models. The figure below shows the overall structure of the model:

[Figure: model structure, with wide-only on the left, Wide & Deep in the middle, and deep-only on the right]
As the figure shows, Wide is a special neural network whose input and output are directly connected, so it belongs to the family of generalized linear models. Deep is a standard deep neural network, which is easy to understand. The wide linear model is used for memorization; the deep neural network is used for generalization. The left side is wide-only, the right side is deep-only, and the middle is Wide & Deep.
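For reference, the paper expresses this combination for binary classification as a single logistic unit over the sum of the two parts' outputs:

$$P(Y = 1 \mid \mathbf{x}) = \sigma\left(\mathbf{w}_{wide}^{T}\,[\mathbf{x};\, \phi(\mathbf{x})] + \mathbf{w}_{deep}^{T}\, a^{(l_f)} + b\right)$$

where $\sigma$ is the sigmoid, $\phi(\mathbf{x})$ are the cross-product transformations of $\mathbf{x}$ (defined in the next section), $a^{(l_f)}$ are the final activations of the deep network, and $b$ is the bias term.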
 
2.2 Cross-product transformation
 
The paper keeps coming back to the cross-product transformation the wide part uses to generate combination features; it is important here. It is defined as follows:

$$\phi_k(\mathbf{x}) = \prod_{i=1}^{d} x_i^{c_{ki}}, \qquad c_{ki} \in \{0, 1\}$$
Here k indexes the k-th combination feature, i indexes the i-th dimension of the input x, c_ki indicates whether the i-th feature participates in constructing the k-th combination feature, and d is the dimensionality of the input x. Exactly which features participate in which combinations is set by hand (meaning manual feature engineering is still required) and is not reflected in the formula.
In fact, this intimidating formula is just the one-hot feature combination we have been talking about all along: the new combination feature AND(gender=female, language=en) is 1 only when the input sample's features gender=female and language=en are both 1. So it simply multiplies the two feature values together. (Such cross-product transformations capture the binary interactions between features and add non-linearity to the linear model.)
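A minimal Python sketch of the formula above; which indices participate in the cross is hand-picked, mirroring the manual feature engineering the paper mentions:

```python
def cross_product(x, c):
    """phi_k(x) = prod_i x_i^{c_ki} for one combination feature k.

    x: list of 0/1 values for the one-hot encoded input features.
    c: list of 0/1 indicators; c[i] = 1 iff feature i takes part in
       this cross (chosen by hand, as the paper notes).
    Returns 1 only when every participating feature is active.
    """
    out = 1
    for xi, ci in zip(x, c):
        if ci:             # x_i^0 == 1, so non-participating features are skipped
            out *= xi
    return out

# One-hot slice: [gender=female, language=en, language=zh]
x = [1, 1, 0]
c = [1, 1, 0]              # cross gender=female with language=en
print(cross_product(x, c))  # -> 1, i.e. AND(gender=female, language=en) fires
```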
 
2.3 The Wide Component
 
As mentioned above, the wide part is in fact a generalized linear model. The features it uses include:
  • raw input: the original features
  • cross-product transformations: the combination features described above
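In the paper, the wide part itself is simply the linear model

$$y = \mathbf{w}^{T}\mathbf{x} + b$$

where the input $\mathbf{x}$ contains both the raw features and the cross-product transformations.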
The same example illustrates it: you issue a query (the food you want to eat), the model returns an item (a dish), and you then buy/consume the recommendation or not. In other words, what the recommender system actually learns is a conditional probability: P(consumption | query, item). The wide part can memorize exceptions. For example, AND(query="fried chicken", item="chicken fried rice") looks very close at the character level, but the two are completely different things; the wide part can remember that this combination is a bad, special case, so the next time you order fried chicken it will not recommend chicken fried rice.
 
2.4 The Deep Component
 
As shown on the right of the model figure: the deep part learns a low-dimensional dense representation (an embedding vector) for each query and item, and generalizes by recommending items that look less relevant on the surface but that you may still want. For example: you want fried chicken; in the embedding space, fried chicken and hamburgers are very close, so it will recommend a hamburger.
The embedding vectors are randomly initialized and updated by backpropagating the final training loss. These low-dimensional dense embedding vectors are fed into the first hidden layer, and the hidden layers usually use the ReLU activation function.
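A minimal PyTorch sketch of this component. The vocabulary sizes, embedding dimension, and layer widths below are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

class DeepComponent(nn.Module):
    """Sketch of the deep part: per-feature embeddings + ReLU MLP."""

    def __init__(self, vocab_sizes, embed_dim=32, hidden=(256, 128)):
        super().__init__()
        # One randomly initialized embedding table per sparse feature;
        # the tables are updated by backpropagating the final loss.
        self.embeddings = nn.ModuleList(
            nn.Embedding(v, embed_dim) for v in vocab_sizes
        )
        layers, in_dim = [], embed_dim * len(vocab_sizes)
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        self.mlp = nn.Sequential(*layers)

    def forward(self, sparse_ids):
        # sparse_ids: LongTensor (batch, num_sparse_features), one id per feature
        dense = torch.cat(
            [emb(sparse_ids[:, i]) for i, emb in enumerate(self.embeddings)],
            dim=1,
        )
        return self.mlp(dense)  # final activations, a^{(l_f)} in the paper

deep = DeepComponent(vocab_sizes=[1000, 500])  # e.g. query ids, item ids
out = deep(torch.randint(0, 500, (4, 2)))      # ids must stay below each vocab size
print(out.shape)                               # torch.Size([4, 128])
```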
 
3. Model training
 
During training, the original sparse features are used by both components, for example query="fried chicken" and item="chicken fried rice":

[Figure: how the example's sparse features feed the wide and deep components]
At training time, the loss is computed from the final output and the gradient is backpropagated into both the wide and the deep part, each updating its own parameters. In other words, the two modules are trained together (the paper's joint training); note that this is not model ensembling.
  • the wide part can memorize sparse, specific rules among the combination features
  • the deep part generalizes through embeddings to recommend similar items
The wide module can learn specific feature combinations very efficiently, but this also means it cannot learn combinations that never appear in the training set. Fortunately, the deep module makes up for this shortcoming. In addition, because the two are trained together, the sizes of both parts can be reduced: the wide component only needs to patch the deep component's weaknesses with a small number of cross-product feature transformations, rather than being a full-size wide model. For the specific training method and experiments, refer to the original paper.
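A minimal joint-training sketch in PyTorch. The sizes and data below are illustrative; the paper trains the wide part with FTRL plus L1 regularization and the deep part with AdaGrad, so plain SGD and Adagrad stand in here as readily available substitutes:

```python
import torch
import torch.nn as nn

wide_dim, deep_in, deep_hidden = 100, 64, 32     # illustrative sizes

wide = nn.Linear(wide_dim, 1)                    # LR over raw + crossed features
deep = nn.Sequential(                            # stands in for embeddings + MLP
    nn.Linear(deep_in, deep_hidden), nn.ReLU(), nn.Linear(deep_hidden, 1)
)

opt_wide = torch.optim.SGD(wide.parameters(), lr=0.01)
opt_deep = torch.optim.Adagrad(deep.parameters(), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()

x_wide = torch.rand(8, wide_dim)                 # batch of wide features (dummy)
x_deep = torch.rand(8, deep_in)                  # batch of dense inputs (dummy)
y = torch.randint(0, 2, (8,)).float()            # consume / not-consume labels

logit = wide(x_wide) + deep(x_deep)              # summed logits, one sigmoid
loss = loss_fn(logit.squeeze(1), y)

opt_wide.zero_grad(); opt_deep.zero_grad()
loss.backward()                                  # one loss drives both parts
opt_wide.step(); opt_deep.step()
print(float(loss))
```

Because there is a single loss over the summed logits, one backward pass updates both parts at once; this is what distinguishes joint training from ensembling separately trained models.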
 
4. Summary
 
Drawback: the wide part still requires manual feature engineering. Pros: it models memorization and generalization in a single unified framework, and learns low-order and high-order feature combinations at the same time.

 

Origin: www.cnblogs.com/Jesee/p/11237084.html