Recommendation Algorithm: DeepFM

Foreword

Today's model is DeepFM, a classic model. Before introducing it, let us briefly summarize the shortcomings of earlier models, which are one of the reasons DeepFM was proposed.

In the CTR prediction task, learning both high-order and low-order feature interactions is very important. We have already covered many recommendation models: from the simplest linear model (LR), to FM, which considers low-order feature crosses, to neural networks, which consider high-order crosses, and then to the W&D combined model, which considers both. Each of these models has its own problems, which is why later models keep improving on them. The main points are as follows:

  • The linear model is simple, but that simplicity limits its expressive power. On large, complex data it cannot fully mine the hidden information, and it ignores interactions between features; modeling interactions requires complex manual feature engineering.
  • The FM model considers second-order feature crosses, but in practice it stops there. Higher orders are possible in principle, but the computation and complexity grow sharply with the order, so second order is the most common case and high-order cross information is ignored.
  • A DNN is naturally suited to learning high-order cross information, but it neglects low-order cross information, its parameters cannot be updated in real time, and its memorization ability is weak.
  • The W&D model was an important attempt to combine the simple LR model with a DNN, so that the model learns both high-order feature combinations and low-order feature patterns. However, the Wide part of W&D is an LR model, which still requires experience-based feature engineering, and the Wide and Deep parts take two different inputs, which demands strong business experience in practice.

So DeepFM came into being. As usual, let us first look at the knowledge map:

1. Principle of DeepFM model

DeepFM is a model jointly proposed by Harbin Institute of Technology and Huawei in 2017. In one sentence: it replaces the LR in the Wide part of W&D with FM. It really is that simple, but knowing only this is not enough; the paper contains many details, as well as insights into recommendation systems.

First, let's take a look at the specific model diagram of DeepFM:

This model also consists of two parts: FM on the left and a DNN on the right. The structure is not very complicated and looks very similar to the W&D and DCN models. The DNN part is unchanged; the change is mainly in the Wide part. W&D uses LR, Deep&Cross uses a Cross network, and DeepFM uses FM. The latter two both address the defect that the Wide part of W&D cannot combine features automatically.

The computation in DeepFM is also relatively simple. The FM on the left and the DNN on the right share the same Embedding layer as input. This is an important difference from W&D: DeepFM does not merely replace Wide with FM, it also changes how the model takes its input. The FM on the left crosses the Embeddings of different feature fields pairwise (the Embedding vectors serve as the latent vectors of the original FM, and the FM parameters are learned by gradient descent), while the DNN on the right performs deep crossing on the same Embeddings. Finally, the FM output and the Deep output are both sent to the output layer to fit the final target. The formula is as follows:
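
In the notation of the DeepFM paper, where $y_{FM}$ and $y_{DNN}$ denote the outputs of the FM component and the deep component, the combined prediction can be written as:

```latex
\hat{y} = \mathrm{sigmoid}\left(y_{FM} + y_{DNN}\right), \qquad \hat{y} \in (0, 1)
```

Here $\hat{y}$ is the predicted click-through rate.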

There is one more small detail in the figure that you may have noticed: the difference between the red lines and the black lines. In the paper's legend, a black line is a normal connection with a learnable weight, while a red line is a weight-1 connection that passes the value through unchanged. The embedding vectors therefore enter the FM units directly, with no extra weights on the FM side: the FM component reuses the shared embedding vectors as they are rather than learning a separate copy.

2. Embedding

The input contains discrete categorical fields (gender, region, etc.) and continuous numerical fields (age, etc.). Categorical fields are generally encoded as one-hot or multi-hot vectors (a user's browsing history, for example); numerical fields can be used directly as input features, or discretized and then one-hot encoded. An Embedding operation is performed separately for each field, because the fields are almost unrelated to one another (gender and region, for example). Numerical features that are used directly do not need an Embedding. Unlike Wide&Deep, the Wide part of DeepFM shares its input with the Deep part, namely the Embedding vectors.

The Embedding layer looks like this:
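
In code, the per-field lookup described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the field names, vocabulary sizes, and the embedding dimension of 8 are made-up assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical setup: three categorical fields with different vocabulary sizes,
# e.g. gender, region, item category (illustrative values only).
vocab_sizes = [2, 10, 100]
embed_dim = 8

# One embedding table per field, because the fields are unrelated to each other.
embeddings = nn.ModuleList([nn.Embedding(v, embed_dim) for v in vocab_sizes])

# A batch of 4 samples: one integer id per categorical field.
sparse_ids = torch.tensor([[0, 3, 42], [1, 9, 7], [0, 0, 99], [1, 5, 13]])
# Two numerical features per sample, used directly without embedding.
dense = torch.rand(4, 2)

# Look up each field separately, then concatenate everything into one input
# that both the FM part and the Deep part would consume.
field_vectors = [emb(sparse_ids[:, i]) for i, emb in enumerate(embeddings)]
shared_input = torch.cat(field_vectors + [dense], dim=-1)

print(shared_input.shape)  # 3 fields * 8 dims + 2 dense features = 26 columns
```

Each field keeps its own table, so a field with 2 categories and a field with 100 categories both end up as vectors of the same length.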

3. FM

This part needs little explanation. As in FNN, it is a standard FM model, responsible for the low-order interactions between features. The output of FM is the sum of an Addition unit and several Inner Product units: the Addition unit reflects the individual influence of the first-order features, while the Inner Product units represent the influence of second-order feature interactions.
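
The second-order term sums ⟨v_i, v_j⟩·x_i·x_j over all feature pairs i < j. FM implementations usually avoid the double loop by using the identity "half of (square of the sum minus sum of the squares)". The sketch below checks that identity numerically on random data; the function names and tensor sizes are illustrative assumptions.

```python
import torch


def fm_second_order_naive(x, v):
    """Sum over all pairs i < j of <v_i, v_j> * x_i * x_j, computed pair by pair."""
    n = x.shape[1]
    out = torch.zeros(x.shape[0], 1)
    for i in range(n):
        for j in range(i + 1, n):
            out[:, 0] += (v[i] @ v[j]) * x[:, i] * x[:, j]
    return out


def fm_second_order_fast(x, v):
    """Same quantity via 0.5 * ((x V)^2 - (x^2)(V^2)), summed over latent dims."""
    square_of_sum = torch.mm(x, v).pow(2)
    sum_of_square = torch.mm(x.pow(2), v.pow(2))
    return 0.5 * (square_of_sum - sum_of_square).sum(dim=1, keepdim=True)


torch.manual_seed(0)
x = torch.rand(4, 6)  # 4 samples, 6 features
v = torch.rand(6, 3)  # one 3-dimensional latent vector per feature

naive = fm_second_order_naive(x, v)
fast = fm_second_order_fast(x, v)
print(torch.allclose(naive, fast, atol=1e-5))
```

The fast form costs on the order of (features × latent dimension) per sample instead of growing quadratically with the number of features.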

Unlike FNN, the FM latent vectors here are ordinary network parameters: they are learned jointly with the rest of the network, which removes the FM pre-training step and lets the whole network be trained end to end. This training scheme has another advantage: the authors found that updating the parameters by backpropagating through both the high-order and the low-order interactions makes the model perform better. This, of course, relies on the shared Embedding input.
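
To make the joint training concrete, the following sketch checks that a shared embedding table receives gradients from both an FM-style second-order term and a small MLP. It is a simplified illustration under stated assumptions: a single table shared by all fields, made-up sizes, and no first-order term.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_fields, vocab, k = 3, 10, 4

# Assumption for brevity: one embedding table shared by all fields.
embedding = nn.Embedding(vocab, k)
mlp = nn.Sequential(nn.Linear(num_fields * k, 8), nn.ReLU(), nn.Linear(8, 1))

ids = torch.randint(0, vocab, (5, num_fields))
e = embedding(ids)  # (5, num_fields, k)

# FM second-order term over the field embeddings:
# 0.5 * ((sum of vectors)^2 - sum of squared vectors), summed over latent dims.
fm_out = 0.5 * (e.sum(dim=1).pow(2) - e.pow(2).sum(dim=1)).sum(dim=1, keepdim=True)
# Deep part: the same embeddings, flattened and passed through the MLP.
deep_out = mlp(e.flatten(start_dim=1))

# Gradient of each component with respect to the shared embedding table.
grad_fm = torch.autograd.grad(fm_out.sum(), embedding.weight, retain_graph=True)[0]
grad_deep = torch.autograd.grad(deep_out.sum(), embedding.weight)[0]

print(grad_fm.abs().sum().item() > 0, grad_deep.abs().sum().item() > 0)
```

Both gradients land on the same parameters, which is what allows end-to-end training without pre-training the FM part.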

4. Paper details

Question 1: What is feature interaction and why is feature interaction required?

  • Second-order feature interaction: from a study of a mainstream app market, the authors found that people often download food delivery apps at meal times, which indicates that the (order-2) interaction between app category and timestamp is a signal for CTR prediction.
  • Third-order or higher-order feature interaction: the authors also found that male teenagers like shooting games and RPGs, which means that the (order-3) interaction of app category, user gender, and age is another signal for CTR.
  • Based on the application of Google's W&D model, the authors conclude that considering low-order and high-order feature interactions together brings more improvement than considering either alone.

Question 2: Why is manual feature engineering challenging?

  • Some feature interactions are easy to understand, such as the two mentioned above, and such features are easy to design or combine by hand. However, most other feature interactions are hidden in the data and hard to identify a priori (the classic association rule "diapers and beer", for example, was mined from data, not discovered by experts), so they can only be captured automatically by machine learning. Even for well-understood interactions, experts are unlikely to be able to model them exhaustively, especially when the number of features is large.

Third, the authors compare the earlier PNN, FNN, and W&D models with their proposed DeepFM:

[Figure: comparison of FNN, PNN, W&D, and DeepFM]

Here is a brief summary:

  • FNN model: the pre-training step adds overhead, the model's capacity is bounded by the representational capacity of FM, and only high-order interactions are considered.
  • PNN model: the inner product computation of IPNN is very expensive, the approximate outer product computation of OPNN loses a lot of information and gives unstable results, and low-order interactions are also ignored.
  • W&D model: it considers both low-order and high-order interactions and balances generalization and memorization, but the input of the Wide part requires expert feature engineering. The authors give an example: in app recommendation, the cross between the apps a user has installed and the app being shown (the impression app) requires strong business experience to design.

Therefore, DeepFM addresses all of the above problems at once. It replaces the LR in W&D with FM; the Wide part and the Deep part both shape the feature representation, through low-order and high-order interactions respectively, so that it is modeled more accurately; and the feature Embedding is shared between the two parts. A small diagram to summarize:

[Figure: summary diagram]

Below are some rules of thumb from industry, where this model is also commonly used:

  1. Number of MLP layers: in industrial practice no more than 3; two layers are generally enough.
  2. Number of hidden units per MLP layer: in industrial practice generally about 128, and at most 500.
  3. Embedding dimension: generally no more than 50; typical values are 10-50.
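
These rules of thumb can be turned into a quick sanity check on a configuration. The helper below is hypothetical (its name and messages are not from the original post) and simply encodes the three ranges quoted above.

```python
def check_rules_of_thumb(hidden_units, embed_dim):
    """Return notes for settings that fall outside the ranges quoted above."""
    notes = []
    if len(hidden_units) > 3:
        notes.append("more than 3 MLP layers")
    if any(units > 500 for units in hidden_units):
        notes.append("hidden layer wider than 500 units")
    if not 10 <= embed_dim <= 50:
        notes.append("embedding dimension outside 10-50")
    return notes


# hidden_units here lists hidden layer sizes only (no input dimension).
ok_notes = check_rules_of_thumb([128, 64], embed_dim=16)
big_notes = check_rules_of_thumb([1024, 512, 256, 128], embed_dim=64)
print(ok_notes)
print(big_notes)
```

These are experience values, not hard limits, so a note from this check is a prompt to double-check the setting rather than an error.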

5. Model reproduction


import torch
import torch.nn as nn
import torch.nn.functional as F

import warnings
warnings.filterwarnings('ignore')


class FM(nn.Module):
    """FM part"""

    def __init__(self, latent_dim, fea_num):
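        """
        latent_dim: dimension of the latent vector of each feature
        fea_num: total number of input features, i.e. the concatenated
                 sparse-feature embeddings plus the dense features
        Note: this FM learns its own latent matrix w2 over the concatenated input.
        """
        super(FM, self).__init__()

        self.latent_dim = latent_dim
        # Three learnable tensors: the global bias, the first-order weights, and
        # the second-order latent matrix. Because they are learnable parameters,
        # they must be wrapped in nn.Parameter.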
        self.w0 = nn.Parameter(torch.zeros([1, ]))
        self.w1 = nn.Parameter(torch.rand([fea_num, 1]))
        self.w2 = nn.Parameter(torch.rand([fea_num, latent_dim]))

    def forward(self, inputs):
        # First-order term (global bias + linear part)
        first_order = self.w0 + torch.mm(inputs, self.w1)  # (samples_num, 1)
        # Second-order term, using the simplified FM formula
        second_order = 1 / 2 * torch.sum(
            torch.pow(torch.mm(inputs, self.w2), 2) - torch.mm(torch.pow(inputs, 2), torch.pow(self.w2, 2)),
            dim=1,
            keepdim=True
        )  # (samples_num, 1)

        return first_order + second_order


class Dnn(nn.Module):
    """Dnn part"""

    def __init__(self, hidden_units, dropout=0.):
        """
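        hidden_units: list of layer sizes; the first element is the input dimension.
            For example, [256, 128, 64] is a two-layer network with input size 256,
            128 units in the first layer, and 64 units in the second layer.
        dropout: dropout probability, applied to the output of the last layer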
        """
        super(Dnn, self).__init__()

        self.dnn_network = nn.ModuleList(
            [nn.Linear(layer[0], layer[1]) for layer in list(zip(hidden_units[:-1], hidden_units[1:]))])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        for linear in self.dnn_network:
            x = linear(x)
            x = F.relu(x)
        x = self.dropout(x)
        return x


class DeepFM(nn.Module):
    def __init__(self, feature_columns, hidden_units, dnn_dropout=0.):
        """
        DeepFM:
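        :param feature_columns: feature information (fea_cols), a pair
            (dense_feature_cols, sparse_feature_cols)
        :param hidden_units: list of hidden layer sizes; the length of the list is the
            number of layers and each element is the number of neurons in that layer
        :param dnn_dropout: dropout probability for the DNN part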
        """
        super(DeepFM, self).__init__()
        self.dense_feature_cols, self.sparse_feature_cols = feature_columns

        # embedding
        self.embed_layers = nn.ModuleDict({
            'embed_' + str(i): nn.Embedding(num_embeddings=feat['feat_num'], embedding_dim=feat['embed_dim'])
            for i, feat in enumerate(self.sparse_feature_cols)
        })

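        # Note the difference between PyTorch's nn.Linear and TensorFlow's Dense:
        # nn.Linear needs both the input and the output dimension, while the first
        # element of hidden_units is the size of the first hidden layer, so the
        # input dimension has to be prepended.
        self.fea_num = len(self.dense_feature_cols) + len(self.sparse_feature_cols) * self.sparse_feature_cols[0][
            'embed_dim']
        # Build a new list rather than calling hidden_units.insert(0, ...),
        # which would mutate the caller's list.
        hidden_units = [self.fea_num] + list(hidden_units)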

        self.fm = FM(self.sparse_feature_cols[0]['embed_dim'], self.fea_num)
        self.dnn_network = Dnn(hidden_units, dnn_dropout)
        self.nn_final_linear = nn.Linear(hidden_units[-1], 1)

    def forward(self, x):
        dense_inputs, sparse_inputs = x[:, :len(self.dense_feature_cols)], x[:, len(self.dense_feature_cols):]
        sparse_inputs = sparse_inputs.long()  # nn.Embedding requires integer (long) indices
        sparse_embeds = [self.embed_layers['embed_' + str(i)](sparse_inputs[:, i]) for i in
                         range(sparse_inputs.shape[1])]
        sparse_embeds = torch.cat(sparse_embeds, dim=-1)

        # Concatenate the sparse embeddings and the dense features as the input to both FM and DNN
        x = torch.cat([sparse_embeds, dense_inputs], dim=-1)
        # Wide
        wide_outputs = self.fm(x)
        # deep
        deep_outputs = self.nn_final_linear(self.dnn_network(x))

        # Final model output (torch.sigmoid replaces the deprecated F.sigmoid)
        outputs = torch.sigmoid(torch.add(wide_outputs, deep_outputs))

        return outputs


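# Illustrative feature description (an assumption: fea_cols is not defined in the
# original post): 2 dense features and 3 sparse features with 8-dimensional embeddings.
dense_feature_cols = ['I1', 'I2']
sparse_feature_cols = [{'feat_num': n, 'embed_dim': 8} for n in (10, 20, 30)]
fea_cols = [dense_feature_cols, sparse_feature_cols]

hidden_units = [128, 64, 32]
dnn_dropout = 0.

model = DeepFM(fea_cols, hidden_units, dnn_dropout)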

Reference: Rolling Xiaoqiang

Origin blog.csdn.net/qq_38375203/article/details/125285710