Recommendation system CTR prediction models (paper reading): NFM

This is the sixth article in this series; let's read the paper together ~

My knowledge is limited; corrections and discussion are welcome.

Today I'll share another deep model, NFM (a serial structure). NFM also tackles the problem of modeling with FM + DNN. Compared with the previously covered Wide & Deep (Google), DeepFM (Huawei + HIT), and PNN (SJTU), and the DCN (Google), DIN (Alibaba), etc. to be shared later, what advantages does NFM have? Let's take a look at the model below.

Paper: Neural Factorization Machines for Sparse Predictive Analytics

Address: https://arxiv.org/pdf/1708.05027.pdf

1. The origin of the problem


As usual, let's start with the data: advertising data contains many categorical features, so the number of possible feature combinations is also very large. The traditional approach is manual feature engineering, or using decision trees for feature selection to pick out the more important features. But this approach has a drawback: it cannot learn feature combinations that never appear in the training set.

In recent years, embedding-based methods have become mainstream: by _embedding_ the high-dimensional sparse input into a dense, low-dimensional latent vector space, the model can learn feature combinations that never appear in the training set.

Embedding-based methods can be roughly divided into two categories:

1. factorization machine-based linear models

2. neural network-based non-linear models
(not elaborated further here)
* * *

FM: learns second-order feature interactions in a linear fashion, and is not expressive enough to capture the non-linear and complex internal structure of real-world data;

Deep networks: for example, Wide & Deep and DeepCross simply concatenate the embedding vectors, which encodes no interaction between features at the input; they rely on deep network structures to learn non-linear feature interactions, and such deep networks are very difficult to train and optimize;

NFM abandons the approach of feeding concatenated embedding vectors directly into the neural network, and instead adds a _Bi-Interaction_ operation after the embedding layer to model second-order feature combinations. This makes the low-level input representation much richer, and greatly improves the ability of the subsequent hidden layers to learn higher-order, non-linear feature combinations.

2. NFM


2.1 NFM Model

Similar to FM (factorization machines), NFM works with real-valued feature vectors. Given a sparse vector x ∈ R^n as input, where a feature value x_i = 0 indicates that the i-th feature is absent, NFM estimates the target as:

ŷ_NFM(x) = w_0 + Σ_{i=1..n} w_i x_i + f(x)
The first and second terms are the linear regression part, similar to FM, which models the global bias and the weight of each individual feature. The third term f(x) is the core component of NFM, used to model feature interactions; it is a multi-layer feed-forward neural network. Next, we describe the design of f(x) layer by layer.
The overall model structure (Figure 2 in the paper) is as follows:


2.1.1 Embedding Layer

Like other DNN models that take sparse input, the embedding layer converts the input into dense embeddings in a low-dimensional space for later processing. The treatment here is slightly different: each embedding vector is multiplied by the original feature value, so that the model can handle real-valued features.
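As a rough sketch (in numpy, with made-up sizes and names; not the authors' code), this rescaled embedding lookup is just:

```python
import numpy as np

np.random.seed(0)
n_features, k = 10000, 8                    # vocabulary size and embedding dimension (illustrative)
V = np.random.randn(n_features, k) * 0.01   # embedding table

def embed(feat_ids, feat_vals):
    """Look up v_i for each non-zero feature and rescale it by the feature value: x_i * v_i."""
    return V[feat_ids] * np.asarray(feat_vals, dtype=float)[:, None]

# A sample with two one-hot categorical features and one real-valued feature
Vx = embed([17, 523, 4096], [1.0, 1.0, 0.37])   # shape: (3, k)
```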

2.1.2 Bi-Interaction Layer

Bi is short for Bi-linear. This layer is actually a pooling layer: it converts a set of embedding vectors into a single vector, formalized as follows:

f_BI(V_x) = Σ_{i=1..n} Σ_{j=i+1..n} (x_i v_i) ⊙ (x_j v_j)

Here f_BI takes all the input embedding vectors; x_i, x_j are feature values and v_i, v_j are the corresponding embedding vectors. ⊙ denotes the element-wise (Hadamard) product. In other words, every pair of embedding vectors is multiplied element-wise to produce a new vector of the same dimension; all these new vectors are then summed, and the sum is the output of the Bi-Interaction layer. The output is a single vector.

NOTE: Bi-Interaction introduces no extra parameters, and its computational complexity is linear. Borrowing the optimization trick used in FM, it can be simplified as follows:

f_BI(V_x) = 1/2 [ (Σ_{i=1..n} x_i v_i)^2 − Σ_{i=1..n} (x_i v_i)^2 ]

where v^2 denotes the element-wise square v ⊙ v.
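A minimal numpy sketch of this pooling (variable names are mine, not the paper's); the naive pairwise sum and the linear-time identity produce the same vector:

```python
import numpy as np

def bi_interaction_naive(Vx):
    """Sum of element-wise products over all feature pairs: O(k * N^2)."""
    n, k = Vx.shape
    out = np.zeros(k)
    for i in range(n):
        for j in range(i + 1, n):
            out += Vx[i] * Vx[j]
    return out

def bi_interaction_fast(Vx):
    """Equivalent linear-time form: 0.5 * ((sum_i v_i)^2 - sum_i v_i^2), element-wise."""
    sum_then_square = Vx.sum(axis=0) ** 2
    square_then_sum = (Vx ** 2).sum(axis=0)
    return 0.5 * (sum_then_square - square_then_sum)

Vx = np.random.randn(5, 8)   # 5 non-zero features (already scaled by x_i), embedding dim 8
assert np.allclose(bi_interaction_naive(Vx), bi_interaction_fast(Vx))
```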

2.1.3 Hidden Layer

This part is basically the same as in other models: hidden layers are stacked so the network can learn higher-order feature combinations. In general, a "constant" structure (all hidden layers having the same width) tends to work better.

2.1.4 Prediction Layer

Finally, the output z_L of the last hidden layer is mapped to the final prediction score by the output layer, formalized as follows:

f(x) = h^T z_L

Here h is the weight vector of the prediction layer. Expanding the weight matrices of the preceding hidden layers, f(x) can be written in full as:

f(x) = h^T σ_L( W_L ( ... σ_1( W_1 f_BI(V_x) + b_1 ) ... ) + b_L )

In fact, compared with FM, the only extra parameters here are those of the hidden layers. FM can therefore be viewed as a special case of this neural architecture: NFM with the hidden layers removed (and h fixed to a vector of ones).
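Putting the pieces together, a toy forward pass with one hidden layer might look like the sketch below (my own weight shapes and initialization, purely illustrative, not a faithful reimplementation of the paper):

```python
import numpy as np

np.random.seed(0)
n_features, k, hidden = 10000, 8, 64

# Parameters: linear part (w0, w), embedding table V, one hidden layer (W1, b1), output vector h
w0 = 0.0
w  = np.zeros(n_features)
V  = np.random.randn(n_features, k) * 0.01
W1 = np.random.randn(hidden, k) * 0.1
b1 = np.zeros(hidden)
h  = np.random.randn(hidden) * 0.1

def nfm_predict(feat_ids, feat_vals):
    x = np.asarray(feat_vals, dtype=float)
    linear = w0 + np.dot(w[feat_ids], x)                         # w0 + sum_i w_i x_i
    Vx = V[feat_ids] * x[:, None]                                 # x_i * v_i for the non-zero features
    bi = 0.5 * (Vx.sum(axis=0) ** 2 - (Vx ** 2).sum(axis=0))      # Bi-Interaction pooling
    z1 = np.maximum(0.0, W1 @ bi + b1)                            # one ReLU hidden layer
    return linear + h @ z1                                        # prediction layer: h^T z_L

print(nfm_predict([17, 523, 4096], [1.0, 1.0, 0.37]))
```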

2.2 NFM vs Wide & Deep, DeepCross

In essence:

The most important difference in NFM is the Bi-Interaction layer. Wide & Deep and DeepCross both replace the Bi-Interaction with a concatenation operation.

The biggest drawback of concatenation is that it carries no feature-interaction information at all, so the model has to rely entirely on the subsequent MLP to learn feature combinations; unfortunately, deep MLPs are very difficult to train and optimize.

Bi-Interaction, by contrast, encodes second-order feature combinations, so the input representation carries more information. This reduces the learning burden on the MLP that follows, allowing a simpler model (only one hidden layer in the experiments) to achieve better results.
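To make the contrast concrete, here is a toy comparison of the two representations that would be fed into the MLP (sizes are illustrative only):

```python
import numpy as np

Vx = np.random.randn(20, 8)   # 20 feature-field embeddings, dimension 8

concat = Vx.reshape(-1)                                       # concatenation: 20 * 8 = 160 dims, no interactions encoded
bi = 0.5 * (Vx.sum(axis=0) ** 2 - (Vx ** 2).sum(axis=0))      # Bi-Interaction: 8 dims, sums all pairwise products

print(concat.shape, bi.shape)   # (160,) (8,)
```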

3. Summary (for comparative experiments and specific implementation details, please refer to the original paper)


The main features of NFM are as follows:

1. The core of NFM is introducing the Bilinear Interaction (Bi-Interaction) pooling operation into the neural network. Thanks to this, the NN can capture feature-combination information at a low level.

2. It deepens FM with hidden layers to learn higher-order, non-linear feature combinations.

3. Compared with the DNN models mentioned above, NFM's structure is lighter and simpler (shallower), yet it performs better and is easier to train and tune.

So it is still the FM + DNN recipe; the difference lies in how the embedding vectors are handled, which is where each model focuses. Looking at how industry handles high-dimensional sparse data with DNNs, there is still no single universal method; we are all still groping in the dark.

I have implemented a DeepFM demo; interested readers can take a look at my GitHub.

Source: www.cnblogs.com/Jesee/p/11267985.html