NFM for recommendation systems

Foreword

In the CTR prediction task, learning both high-order and low-order feature interactions is very important. We have already covered quite a few recommendation models: from the simplest linear model (LR), to FM which adds second-order (low-order) feature crosses, to neural networks that model high-order crosses, and then to the W&D combination model that tries to cover both. Each of these models has its own problems, which is exactly why later models kept improving on them. The main points are as follows:

Although the simple linear model is simple, its expressive power is limited. With large and complex data it cannot fully mine the hidden information, and it ignores interactions between features; if you want interactions, you need heavy manual feature engineering.
The FM model adds second-order feature crosses, but the crossing stops at second order. Higher orders are possible in principle, yet the computation and complexity grow sharply with the order, so second order is by far the most common setting, and the information carried by higher-order crosses is ignored (the standard FM formula is shown after this list for reference).
A DNN is naturally suited to learning high-order cross information, but it tends to ignore low-order crosses; its parameters cannot be updated in real time, and its memorization ability is weaker.
The W&D model made a great attempt by combining a simple LR model with a DNN, so that the model can learn both high-order combination features and low-order feature patterns. However, the wide part of W&D is an LR model, which still needs some hand-crafted feature engineering, and the wide and deep parts require two different input formats, so in practice this demands strong business experience.
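For concreteness, here is the standard FM prediction formula referred to above (second-order interactions only):

$$\hat{y}_{FM}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j$$

The inner product $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$ only covers pairwise crosses, which is exactly the second-order limitation discussed above.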
So the corresponding algorithms came into being. FNN and DeepFM have been explained before; today I will talk about NFM. I hope nobody gets them mixed up, haha.

1. The principle of the NFM model and the details of the paper

NFM (Neural Factorization Machines) was proposed by He Xiangnan of the National University of Singapore at SIGIR 2017. When we introduced DeepFM, we saw that its authors started from the limitations of FM: it cannot model high-order feature interactions, so its ability is limited when the real data has a complex internal structure and regularities. The DeepFM authors therefore connected FM and a DNN in parallel: the two parts receive the same input but learn different things (one is responsible for low-order interactions, the other for high-order interactions), and their outputs are combined to produce the final prediction. Experiments showed that this strategy is effective.

Although FM is recognized as one of the most effective methods for prediction on sparse data, real-world data is often nonlinear. FM handles sparse data well and can learn second-order interactions between sparse features, but at the end of the day it is still a linear model, and its interactions are limited to second order.

Therefore, the author's idea is to use the nonlinearity and strong expressive power of neural networks to enhance FM and obtain a stronger version of it. The familiar routine is to combine the two models: the strengths of FM and DNN on sparse data are obvious and complementary, and W&D pioneered this kind of combination. The key question is how to combine them. FNN gives one answer and DeepFM gives another; NFM is also a combination, but it stacks FM and a DNN vertically (in series), which again exploits the advantages of both.

First look at the prediction formula of NFM:

$$\hat{y}_{NFM}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + f(\mathbf{x})$$

Note that NFM does not combine the final outputs of FM and DNN (the way DeepFM does); it fuses the two models into one, using a DNN in place of the high-order feature interactions.

The formula above looks a lot like FM, except that FM's second-order interaction term is replaced by f(x). Extending FM itself to higher-order interactions would increase the computational complexity sharply, so a DNN is used here to take care of the higher-order interactions. Let's look at the network structure first.
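For reference, the complete prediction function in the NFM paper stacks L hidden layers on top of the Bi-Interaction pooling result (a sketch of the paper's equation; the exact symbols may differ slightly from the original):

$$\hat{y}_{NFM}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \mathbf{h}^{T}\,\sigma_L\big(W_L\big(\cdots\sigma_1\big(W_1 f_{BI}(\mathcal{V}_x) + \mathbf{b}_1\big)\cdots\big) + \mathbf{b}_L\big)$$

Here $f_{BI}(\mathcal{V}_x)$ is the output of the Bi-Interaction pooling layer described in the next section, $\sigma_l$, $W_l$, $\mathbf{b}_l$ are the activation, weights and bias of the $l$-th hidden layer, and $\mathbf{h}$ is the weight vector of the final prediction layer.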

You may notice that the architecture looks very similar to the FNN mentioned above, but there is still a big difference. FNN pre-trains the embedding vectors with FM and then feeds these embeddings into a DNN for indiscriminate crossing. NFM is closer to PNN, except that PNN's product layer is replaced by a Bi-Interaction pooling layer.

2. Bi-Interaction Pooling layer

First comes the embedding layer, which works the same way as in the other models. Note, though, that the embedding is organized by feature field: each sparse feature field has its own embedding table, and the different values of a field share that table. In practice, the sparse features are usually encoded with a LabelEncoder (rather than a one-hot encoder), so the embedding vector of each non-zero feature can be looked up directly. After all, a LabelEncoder effectively builds a dictionary over all the values of a feature field; to get the embedding of a particular value, we simply look up its index in that dictionary.
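As an illustration, a minimal sketch of this lookup (the feature values and dimensions are made up; only the LabelEncoder + nn.Embedding pattern is the point):

import torch
import torch.nn as nn
from sklearn.preprocessing import LabelEncoder

# one sparse feature field with made-up categorical values
city = ['beijing', 'shanghai', 'beijing', 'shenzhen']
idx = torch.LongTensor(LabelEncoder().fit_transform(city))   # e.g. tensor([0, 1, 0, 2])

# one embedding table for this field: 3 distinct values, embedding dimension k = 4
embed = nn.Embedding(num_embeddings=3, embedding_dim=4)
vectors = embed(idx)          # (4, 4): the k-dim embedding of each sample, looked up directly by index
print(vectors.shape)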

Then comes the biggest innovation of the paper: Bi-Interaction pooling, which is essentially a second-order feature-interaction pooling layer. First look at the formula:

$$f_{BI}(\mathcal{V}_x) = \sum_{i=1}^{n}\sum_{j=i+1}^{n} x_i \mathbf{v}_i \odot x_j \mathbf{v}_j$$

Here ⊙ denotes the element-wise (Hadamard) product of two vectors, i.e. the vector obtained by multiplying the two vectors dimension by dimension; for the k-th dimension:

$$(\mathbf{v}_i \odot \mathbf{v}_j)_k = v_{ik}\, v_{jk}$$

In other words, the formula above crosses the embedding vectors of pairs of feature fields. The difference from FM's second-order interaction is that in FM the cross of two latent vectors is an inner product, i.e. a single number, whereas here each pairwise cross yields a k-dimensional vector; all the pairwise crosses over the feature fields are then summed (pooled) into one k-dimensional vector, which is finally fed into the DNN.
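To make the pooling concrete, here is a minimal PyTorch sketch (made-up shapes; for one-hot sparse features the non-zero x_i are simply 1, so they are absorbed into the embeddings):

import torch

# 4 samples, 3 sparse feature fields, embedding dimension k = 5 (made-up shapes)
embeds = torch.randn(4, 3, 5)            # (batch, field_num, k): one embedding per field

batch, field_num, k = embeds.shape
pooled = torch.zeros(batch, k)
for i in range(field_num):
    for j in range(i + 1, field_num):
        # element-wise product of two field embeddings is still a k-dim vector
        pooled += embeds[:, i, :] * embeds[:, j, :]

print(pooled.shape)   # torch.Size([4, 5]): one k-dim vector per sample, not a scalar as in FM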

In fact, if the DNN is removed, NFM degenerates into FM, so the key improvement is exactly this layer: it aggregates the second-order interaction information and then hands it to the DNN to learn the higher-order interactions. This form of combination also appeared in FNN (FM at the bottom, DNN on top), but FNN does not really integrate FM into the DNN: the two are trained separately, the pre-trained FM vectors are only used to initialize the embeddings of the whole network, which is then fine-tuned. It looks like a combination of FM and DNN, but it is really a loose coupling. NFM follows the same combination idea, but the feature pooling it designs welds FM and the DNN into a single whole that can be trained end to end with one complete forward and backward pass, so it truly exploits the second-order (linear) interactions of FM and the higher-order (nonlinear) interactions of the DNN.

The author also points out that the Bi-Interaction layer introduces no additional parameters to learn and, more importantly, can be computed in linear time. Following the same trick as FM, the formula can be rewritten as:

$$f_{BI}(\mathcal{V}_x) = \frac{1}{2}\Big[\Big(\sum_{i=1}^{n} x_i \mathbf{v}_i\Big)^2 - \sum_{i=1}^{n} (x_i \mathbf{v}_i)^2\Big]$$

where the square of a vector is taken element-wise.
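A quick sanity check of this identity (a minimal sketch with made-up shapes; the double loop is the pairwise definition, the one-liner is the linear-time form):

import torch

embeds = torch.randn(4, 3, 5)            # (batch, field_num, k), made-up shapes

# definition: explicit sum of element-wise products over all field pairs (i < j)
pairwise = torch.zeros(4, 5)
for i in range(3):
    for j in range(i + 1, 3):
        pairwise += embeds[:, i, :] * embeds[:, j, :]

# linear-time form: 1/2 * [ (sum of embeddings)^2 - sum of squared embeddings ]
linear_time = 0.5 * (embeds.sum(dim=1) ** 2 - (embeds ** 2).sum(dim=1))

print(torch.allclose(pairwise, linear_time, atol=1e-5))   # True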

In addition, the author applies dropout and BN in the DNN part: dropout to counter overfitting, and BN to keep the updates of the embedding vectors from shifting the input distribution of the hidden and output layers.

3. Other details

Finally, let's summarize NFM:

Compared with the other models, the core innovation of NFM is the feature-interaction pooling layer (Bi-Interaction pooling). It makes the connection between FM and the DNN seamless and lets the network learn more informative combined features at the lower layers. Combining the second-order linear interactions of FM with the higher-order nonlinear interactions of the DNN makes NFM well suited to tasks with sparse data.

Applying dropout in the feature-interaction pooling layer and in the hidden layers helps alleviate overfitting; dropout also serves as a regularization strategy for the linear latent-vector (FM) part of the model.

In NFM, using BN and dropout together can make learning less stable, so be careful when combining them.

The feature-interaction pooling layer already learns and encodes the second-order interactions well, which takes a lot of the burden off the DNN: only a few hidden layers are needed to capture the higher-order information. In other words, compared with earlier DNN-based models, NFM has a shallower and simpler structure but better performance, and it is easier to train and tune.

NFM is relatively insensitive to parameter initialization, i.e. it does not rely heavily on pre-training, and the model is fairly robust.

Deeper is not always better for deep models: too many layers cause overfitting and make optimization harder.

4. Demo reproduction

import torch
import torch.nn as nn
import torch.nn.functional as F


class NFM(nn.Module):
    def __init__(self, feature_columns, hidden_units, dnn_dropout=0.):
        """
        NFM:
        :param feature_columns: feature information (the fea_cols structure: [dense_feature_cols, sparse_feature_cols])
        :param hidden_units: list of hidden-layer sizes; its length is the number of layers, each element the number of neurons in that layer
        :param dnn_dropout: dropout rate used in the DNN part
        """
        super(NFM, self).__init__()
        self.dense_feature_cols, self.sparse_feature_cols = feature_columns
        
        # embedding: one table per sparse feature field
        self.embed_layers = nn.ModuleDict({
            'embed_' + str(i): nn.Embedding(num_embeddings=feat['feat_num'], embedding_dim=feat['embed_dim'])
            for i, feat in enumerate(self.sparse_feature_cols)
        })
        
        # Note the difference between PyTorch's Linear and tf's Dense: Linear needs both the input and
        # output dimensions, while hidden_units[0] is only the size of the first hidden layer, so the
        # input dimension has to be prepended here.
        self.fea_num = len(self.dense_feature_cols) + self.sparse_feature_cols[0]['embed_dim']
        hidden_units.insert(0, self.fea_num)
        
        self.bn = nn.BatchNorm1d(self.fea_num)
        self.dnn_network = Dnn(hidden_units, dnn_dropout)
        self.nn_final_linear = nn.Linear(hidden_units[-1], 1)
    
    def forward(self, x):
        dense_inputs, sparse_inputs = x[:, :len(self.dense_feature_cols)], x[:, len(self.dense_feature_cols):]
        sparse_inputs = sparse_inputs.long()       # nn.Embedding expects long (integer) indices
        sparse_embeds = [self.embed_layers['embed_' + str(i)](sparse_inputs[:, i]) for i in range(sparse_inputs.shape[1])]
        sparse_embeds = torch.stack(sparse_embeds)        # stack the embeddings: (field_num, None, embed_dim)
        sparse_embeds = sparse_embeds.permute((1, 0, 2))  # -> (None, field_num, embed_dim)
        # Bi-Interaction pooling over the embeddings, using the linear-time formula:
        # 1/2 * [ (sum of embeddings)^2 - sum of squared embeddings ]
        embed_cross = 1 / 2 * (
            torch.pow(torch.sum(sparse_embeds, dim=1), 2) - torch.sum(torch.pow(sparse_embeds, 2), dim=1)
        )  # (None, embed_dim)
        
        # concatenate the pooled sparse part with the dense features as the DNN input
        x = torch.cat([embed_cross, dense_inputs], dim=-1)
        # BatchNormalization
        x = self.bn(x)
        # deep part
        dnn_outputs = self.nn_final_linear(self.dnn_network(x))
        outputs = torch.sigmoid(dnn_outputs)      # torch.sigmoid instead of the deprecated F.sigmoid
        
        return outputs
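The class above references a Dnn module that is not shown in the snippet. Below is a minimal sketch of what it might look like, assuming it is a plain multi-layer perceptron driven by the hidden_units list with dropout at the end (this layer layout is an assumption, not the original author's code), followed by a tiny smoke test with made-up feature columns. It continues the snippet above, so the same imports apply.

class Dnn(nn.Module):
    # Assumed implementation: a simple MLP with ReLU activations and dropout.
    def __init__(self, hidden_units, dropout=0.):
        super(Dnn, self).__init__()
        self.dnn_network = nn.ModuleList([
            nn.Linear(in_dim, out_dim) for in_dim, out_dim in zip(hidden_units[:-1], hidden_units[1:])
        ])
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        for linear in self.dnn_network:
            x = F.relu(linear(x))
        return self.dropout(x)


# smoke test: 2 dense features and 3 sparse fields, each with 10 possible values and 8-dim embeddings (made up)
dense_cols = ['d1', 'd2']
sparse_cols = [{'feat_num': 10, 'embed_dim': 8} for _ in range(3)]
model = NFM([dense_cols, sparse_cols], hidden_units=[64, 32], dnn_dropout=0.2)
x = torch.cat([torch.rand(4, 2), torch.randint(0, 10, (4, 3)).float()], dim=-1)
print(model(x).shape)   # torch.Size([4, 1])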

Reference: Rolling Xiaoqiang

Origin blog.csdn.net/qq_38375203/article/details/125428079