[Repost] A Deep Learning Model for Predicting Users' Online Ad-Click Behavior

This talk covers the application of deep learning to multi-field categorical data sets. Such data has several distinct fields, and the values in each field are IDs (categories). The topic falls under the broad heading of information retrieval, with applications mainly in web search, recommender systems, and ad display. Deep learning works well on continuous and sequential data (image pixels, speech, natural language, etc.) and already has mature applications there, such as image recognition and speech recognition.

Many real-world phenomena, however, are naturally described by multi-field categorical data. So how well does deep learning handle such data? This article answers that question through a worked application: predicting users' online ad-click behavior.

The article introduces in detail the advantages of the FM and FNN algorithms for multi-field categorical data, compares how these algorithms and ordinary neural networks handle feature variables, and finally uses the ad-click prediction task to compare the actual performance of LR, FM, FNN, CCPM, PNN-I, and other algorithms.

Current application status of deep learning

Deep learning's mature applications are currently concentrated in machine vision, speech recognition, and natural language processing. What these fields have in common is that their data are continuous or sequential: in image recognition, each region of an image is closely related to the regions around it; speech carries strong temporal correlation; and in natural language processing, although each word is discrete, neighboring words are strongly correlated. People understand such data easily, but it is very hard for conventional machine learning algorithms to process, whereas deep learning can learn high-level patterns from the raw input, layer by layer. That is its advantage.

The data we consider today, multi-field categorical data, is different from the continuous or sequential data above. It has many distinct fields, for example [Weekday=Wednesday, Gender=Male, City=London, ...], and the relationships between such features are much harder to identify. An intuitive scenario: suppose the Phoenix website carries a Disney advertisement, and we want to know whether a user entering the site will click on that ad. This kind of user click-through-rate prediction is a core problem in information retrieval.

The general practice is to describe the event through a number of fields and then predict the user's click behavior, and there can be many such fields, for example:

 
  - Date: 20160320
  - Hour: 14
  - Weekday: 7
  - IP: 119.163.222.*
  - Region: England
  - City: London
  - Country: UK
  - Ad Exchange: Google
  - Domain: yahoo.co.uk
  - URL: http://www.yahoo.co.uk/abc/xyz.html
  - OS: Windows
  - Browser: Chrome
  - Ad size: 300*250
  - Ad ID: a1890
  - User tags: Sports, Electronics

We may also have identity information about these users, such as whether the user is a student. We can then describe the event through these multi-field values and predict the user's click behavior. Returning to the earlier scenario: what kind of user will click on this ad? We might guess that a young student currently in Shanghai could be interested, and that seeing this ad on a Friday might prompt a click while planning weekend activities. The relevant features would be [Weekday=Friday, Occupation=Student, City=Shanghai]: when these features appear together, we expect the probability that the user clicks on the Disney ad to be higher.

Such scenarios arise constantly in web search, ad display, and recommender systems. For example, when Google and Baidu predict advertising click-through rates, they manually construct fourth- or fifth-order combinations of these categorical features and then learn on very large data sets, a process that takes enormous human effort in feature engineering. This talk is about applying deep learning to learn features from such data directly.

The traditional approach is one-hot binary encoding. Suppose we have data with three fields, X = [Weekday=Wednesday, Gender=Male, City=Shanghai]. Weekday has 7 possible values, so we encode it as a 7-dimensional binary vector in which only the Wednesday position is 1 and the rest are 0, since the field takes only one value; Gender gets two dimensions, one of which is 1; and if there are 10,000 cities, City gets 10,000 dimensions, with only Shanghai set to 1 and the rest 0.
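The encoding above can be sketched in a few lines. This is a minimal illustration; the field vocabularies here are toy examples, not the talk's actual data:

```python
# Minimal sketch of one-hot encoding for multi-field categorical data.
# Vocabularies are illustrative; in practice they are built from the data.
fields = {
    "Weekday": ["Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday"],
    "Gender": ["Male", "Female"],
    "City": ["Shanghai", "London", "Beijing"],  # real data: ~10,000 cities
}

def one_hot_encode(sample):
    """Concatenate one one-hot segment per field into a single binary vector."""
    vec = []
    for field, vocab in fields.items():
        segment = [0] * len(vocab)
        segment[vocab.index(sample[field])] = 1  # exactly one active value per field
        vec.extend(segment)
    return vec

x = one_hot_encode({"Weekday": "Wednesday", "Gender": "Male", "City": "Shanghai"})
print(x)  # [0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0]
```

Note that the resulting vector has exactly one 1 per field, so its sparsity grows with vocabulary sizes.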

[Figure: one-hot encoding of the three fields]

The result is a very high-dimensional sparse vector, and such a data set cannot be trained directly with a neural network: with plain one-hot encoding the input easily reaches a million features, the first layer needs at least 500 nodes, so the first layer alone has 500 million parameters to train, which would in turn require a data set of 2 to 5 billion examples. Data sets that large are basically unobtainable.

FM, FNN and PNN models

For these reasons, we need to embed the very large feature vector into a low-dimensional vector space to reduce model complexity, and FM (Factorisation Machine) is widely regarded in industry as the most effective embedding model:

[Figure: the FM model, ŷ(x) = w₀ + Σᵢ wᵢxᵢ + Σᵢ<ⱼ ⟨vᵢ, vⱼ⟩ xᵢxⱼ]

The first part is still logistic regression; the second part judges the relationship between feature pairs and the target variable through the dot product of their latent vectors. In the Disney example above, the angle between the vectors for Occupation=Student and City=Shanghai should be less than 90°, so their dot product is greater than 0, indicating that this feature combination is positively correlated with clicking the Disney ad. The algorithm is widely used in recommender systems.
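As a concrete reference, here is a minimal sketch of the standard second-order FM score just described (the variable names are ours; the talk shows only the formula as an image):

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Second-order FM score: w0 + <w, x> + sum_{i<j} <v_i, v_j> x_i x_j.

    x: (n,) feature vector; w0: bias; w: (n,) linear weights;
    V: (n, k) matrix of latent vectors, one k-dimensional row per feature.
    Uses the O(n*k) identity
      sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * (||V.T @ x||^2 - sum_i ||v_i||^2 x_i^2)
    instead of looping over all feature pairs.
    """
    vx = V.T @ x
    pairwise = 0.5 * (vx @ vx - np.sum((V ** 2).T @ (x ** 2)))
    return w0 + w @ x + pairwise
```

With one-hot inputs, each `x_i` is 0 or 1, so the pairwise term reduces to the dot products of the latent vectors of the active features, exactly as in the Student/Shanghai example.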

Now consider building a neural network from this model. In fact, FM is essentially already a three-layer network:

[Figure: FM viewed as a three-layer network]

In the second layer, vectors are combined by multiplication (the blue nodes in the figure are simply products of two vectors; there are no parameters to learn on the connecting edges), and each field is mapped to a single low-dimensional vector, with no interaction between fields at this stage. The first layer is therefore dramatically reduced in dimension, and a neural network model can be applied on top of it.

If we use the FM algorithm to embed the underlying fields and build a network on that basis, we get the FNN (Factorisation-machine supported Neural Network) model:

[Figure: FNN architecture]

The bottom layer of the model first embeds the one-hot-encoded input with FM, mapping the sparse binary feature vector to a dense real-valued layer, which then serves as the input for the model above it. This successfully avoids the computational cost of high-dimensional binary inputs.
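The idea can be sketched as follows. All sizes, the random stand-in for the pre-trained FM tables, and the two-layer head are illustrative assumptions, not the talk's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 fields with these vocabulary sizes, embedding dim k = 4.
field_sizes, k = [7, 2, 3], 4

# Stand-in for a pre-trained FM: one (vocab_size, k) latent table per field.
fm_latent = [rng.normal(size=(n, k)) for n in field_sizes]

def fnn_input(feature_ids):
    """Look up the FM embedding of the single active feature in each field and
    concatenate them into a dense vector that replaces the sparse one-hot input."""
    return np.concatenate([fm_latent[f][i] for f, i in enumerate(feature_ids)])

def mlp_forward(z, W1, b1, W2, b2):
    """A small MLP head: one tanh hidden layer and a sigmoid CTR output."""
    h = np.tanh(W1 @ z + b1)
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))
```

The key point is that the network only ever sees the dense `len(field_sizes) * k`-dimensional vector, never the million-dimensional one-hot input.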

We then applied these models to the iPinYou dataset, with the following results:

[Figure: comparison of LR, FM, and FNN on the iPinYou dataset]

We can see that FNN outperforms the LR and FM models. What, then, distinguishes FNN from an ordinary neural network? Most neural networks combine vectors by addition, while FM measures the relationship between two vectors by multiplication. A multiplicative relationship is essentially a logical "and": in the example above, only users who are both students and in Shanghai have a high probability of clicking the Disney ad. Addition corresponds only to a logical "or", and "and" obviously discriminates the target variable more strictly than "or" does.

So our next job is to model the multiplicative relationship explicitly. We can take either the inner product or the outer product of two vectors:

[Figure: inner and outer products of two vectors]

For the matrix obtained by the outer product, if we keep only the diagonal, its entries are exactly the terms of the inner product, so the inner product can be regarded as a special case of the outer product. Either way, these products let us measure the relationship between different fields.
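The diagonal relationship is easy to verify numerically (the two vectors here are arbitrary examples):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

outer = np.outer(u, v)   # full 3x3 matrix of all pairwise term products
inner = u @ v            # single scalar: 1*4 + 2*5 + 3*6 = 32

# Keeping only the diagonal of the outer product and summing it
# recovers the inner product exactly:
assert np.isclose(np.trace(outer), inner)
```

So the inner product keeps one number per feature pair, while the outer product keeps a full matrix of interactions per pair, at correspondingly higher cost.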

The neural network we built on this basis is as follows:

[Figure: PNN (Product-based Neural Network) architecture]

First, the input data is embedded to obtain a low-dimensional vector layer. Taking the inner or outer product of any two of these field vectors gives the blue nodes (P) in the figure; in addition, each embedding is multiplied by 1 and copied directly into Z in the layer above. Z and P are then concatenated to form the input layer of the neural network, on top of which the usual network layers are applied.
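A minimal sketch of this Z/P construction for the inner-product variant (PNN-I); the function name and array shapes are ours, not from the talk:

```python
import numpy as np

def pnn_product_layer(embeddings):
    """Build the PNN input from per-field embeddings (inner-product variant).

    embeddings: (F, k) array, one k-dimensional embedding per field.
    z: the embeddings themselves, flattened (the part copied up by 1).
    p: inner products <e_i, e_j> for all field pairs i < j (the product part).
    """
    F, _ = embeddings.shape
    z = embeddings.reshape(-1)
    p = np.array([embeddings[i] @ embeddings[j]
                  for i in range(F) for j in range(i + 1, F)])
    return np.concatenate([z, p])   # Z and P concatenated feed the MLP above
```

For F fields the product part contributes F*(F-1)/2 nodes, which is what creates the complexity problem discussed next.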

Taking inner or outer products of the features raises a complexity problem: with 60 fields, there are 60×59/2 = 1,770 pairwise products, which leads to a large weight matrix and therefore a huge number of parameters to learn, more than our data set may support. The next step addresses this: since the weight matrix is symmetric, we can factorize it as a small matrix multiplied by its own transpose, which greatly reduces the number of parameters we need to train:

[Figure: low-rank factorization of the symmetric weight matrix]
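The parameter saving can be illustrated directly. The sizes below (60 fields, rank 5) are assumptions for the sake of the example:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 60, 5                      # hypothetical: 60 fields, rank-5 factorisation

theta = rng.normal(size=(n, r))   # only n*r = 300 parameters to learn
W = theta @ theta.T               # symmetric n*n matrix; never formed in training

def quadratic_form(x, theta):
    """x.T @ (theta @ theta.T) @ x computed in O(n*r) instead of O(n^2)."""
    tx = theta.T @ x
    return tx @ tx
```

Instead of the full n×n = 3,600-entry symmetric matrix, only the 300 entries of `theta` are learned, and the quadratic form never needs the full matrix.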

Model effect evaluation

Next we can look at the application effect of the model on two different data sets:

The first, the Criteo Terabyte Dataset, has 13 numerical variables and 26 categorical variables. We selected nearly 300 GB of data covering 8 days and used the first 7 days as the training set. Because positive examples are very scarce in this data set, we under-sampled the negative examples.

The second, the iPinYou Dataset, has 24 categorical variables; we took 10 consecutive days of data.

The algorithms we compared are: LR (Logistic Regression), FM (Factorisation Machine), FNN (Factorisation-machine supported Neural Network), CCPM (Convolutional Click Prediction Model), PNN-I (Inner-product Neural Network), PNN-II (Outer-product Neural Network), and PNN-III (an ensemble of the inner- and outer-product networks).

To evaluate the models, we mainly look at the following metrics:

  1. Area under the ROC curve (AUC): the key metric

  2. Log loss: the smaller the value, the more accurate the click-through-rate estimate

  3. Root mean squared error (RMSE): the smaller the value, the better the model (used as a secondary reference)

  4. Relative Information Gain (RIG): the larger the value, the better the model
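The formula images are missing here, so as a reference these are the standard definitions of the last three metrics (the base-rate baseline used for RIG is our assumption about the talk's setup):

```python
import numpy as np

def log_loss(y, p, eps=1e-12):
    """Average negative log-likelihood of labels y under predictions p; smaller is better."""
    p = np.clip(p, eps, 1 - eps)           # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def rmse(y, p):
    """Root mean squared error; smaller is better."""
    return np.sqrt(np.mean((y - p) ** 2))

def rig(y, p):
    """Relative Information Gain over always predicting the base rate; larger is better."""
    baseline = np.full_like(p, y.mean())
    return 1.0 - log_loss(y, p) / log_loss(y, baseline)
```

A model that predicts no better than the overall click rate scores RIG ≈ 0; a perfect model approaches 1.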

The final effects of each model are as follows:

[Figure: overall performance of all models on both data sets]

We mainly look at AUC. In industry, a 2% improvement in this metric generally brings huge benefits, and from LR to PNN the metric improves by nearly 5%, showing that the FM, FNN, and PNN family of models significantly outperforms LR.

There are other experimental results as well. For dropout, industry experience holds that 0.5 is a good rate, and the results of our PNN-II model confirm this:

[Figure: AUC of PNN-II at different dropout rates]

We also searched for the optimal number of hidden layers. More layers is not always better: models with too many layers overfit. The appropriate depth is related to the size of the data set; generally speaking, the larger the data set, the more hidden layers it can support. For our models the best depth here is 3 layers:

[Figures: AUC versus number of hidden layers on both data sets]

We also examined the robustness of each model on the smaller iPinYou Dataset and found that PNN-I and PNN-II surpass the other models and improve steadily over training:

[Figure: learning curves of each model on the iPinYou Dataset]

After that, we also studied how the number of nodes should be distributed across the network's layers, trying four different shapes. The constant and diamond shapes perform best and the increasing shape performs worst, indicating that the feature vector should not be compressed too aggressively in the first layer:

[Figure: AUC for the constant, increasing, decreasing, and diamond layer shapes]

We also compared the activation functions of the hidden-layer nodes and found that tanh and relu are significantly better than sigmoid:

[Figures: comparison of sigmoid, tanh, and relu activations on both data sets]

Summary

  1. Deep learning can also achieve significant results on multi-field categorical data sets;

  2. Inner- and outer-product operations can capture the correlations between features;

  3. In advertising click-through-rate prediction, PNN outperforms the other models.



Origin blog.csdn.net/Xiaoxll12/article/details/102898700