CTR model understanding

AFM

(Paper reading) CTR prediction: AFM model analysis for recommender systems

self.P.shape
Out[18]: torch.Size([4])

If there are 39 fields, then there are m(m−1)/2 = 39·38/2 = 741 feature interactions. The attention layer's W is a 4×4 matrix, h is a 4×1 output vector, and a_ij has shape [741], normalized with softmax.
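
A minimal sketch of how these pieces fit together, with made-up tensor names and random data (not the library's code), assuming embedding dimension 4 and 741 pairwise interactions per sample:

import torch
import torch.nn.functional as F

# batch, number of field pairs (39*38/2), embedding dim
B, n_pairs, k = 32, 741, 4
interactions = torch.randn(B, n_pairs, k)   # element-wise products v_i * v_j for every field pair

W = torch.randn(k, k)   # attention weight matrix, 4x4
b = torch.randn(k)
h = torch.randn(k)      # attention output layer, 4x1
p = torch.randn(k)      # final projection, corresponds to self.P with shape [4]

# a'_ij = h^T ReLU(W e_ij + b), then softmax over the 741 pairs
scores = torch.relu(interactions @ W.t() + b) @ h   # [B, 741]
a = F.softmax(scores, dim=1)                        # attention weights a_ij

# attention-weighted sum of the interactions, projected to a scalar by p
afm_term = (a.unsqueeze(-1) * interactions).sum(dim=1) @ p   # [B]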

DCN

(Paper reading) CTR prediction: DCN model analysis for recommender systems

Someone in the comments section knows the author.

Demystifying Deep & Cross: How to automatically construct high-order cross features

DCN implemented with tensorflow

You are only one step away from using the enterprise-level Deep & Cross Network model

xl_w.shape
Out[9]: torch.Size([32, 1, 1])
dot_.shape
Out[10]: torch.Size([32, 117, 1])
x_l.shape
Out[11]: torch.Size([32, 117, 1])

Suppose there are 117 features. The inner product of x_l with the weight vector (whose length matches the 117 features) collapses the feature dimension to a single scalar; x_0 is then multiplied by that scalar to form the cross term, which is what the shapes above show.
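
A minimal sketch of one cross layer with these shapes (batch 32, 117 features; variable names are mine), mirroring the xl_w / dot_ printout above:

import torch

B, d = 32, 117                 # batch size, total feature dimension
x_0 = torch.randn(B, d, 1)     # input to the cross network
x_l = x_0.clone()              # current layer input, [32, 117, 1]

w = torch.randn(d, 1)          # layer weight: one entry per feature
bias = torch.zeros(d, 1)

# x_l^T w: the 117-dim feature vector collapses to a single scalar per sample
xl_w = torch.tensordot(x_l, w, dims=([1], [0]))   # [32, 1, 1]

# x_0 * (x_l^T w): the scalar rescales x_0, producing the cross term
dot_ = torch.matmul(x_0, xl_w)                    # [32, 117, 1]

# residual connection: x_{l+1} = x_0 (x_l^T w) + b + x_l
x_next = dot_ + bias + x_l                        # [32, 117, 1]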

NFM

FM can also be regarded as a neural network architecture: namely, an NFM with the hidden layers removed.

The most important difference in NFM is the Bi-Interaction layer; both Wide&Deep and DeepCross replace Bi-Interaction with concatenation.

The biggest drawback of concatenation is that it carries no feature-interaction information at all, so learning feature combinations is left entirely to the subsequent MLP; unfortunately, optimizing an MLP for this is very difficult.

Bi-Interaction, in contrast, captures second-order feature combinations, so the input representation carries more information and the learning burden on the subsequent MLP is reduced; a simpler model (only one hidden layer is used in the experiments) can then achieve better results.
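
A minimal sketch of Bi-Interaction pooling, computed with the usual identity 0.5 * ((sum of embeddings)^2 − sum of squared embeddings); the shapes and names are assumptions, not the official code:

import torch

B, m, k = 32, 39, 8            # batch, number of fields, embedding dim
embeds = torch.randn(B, m, k)  # per-field embeddings v_i * x_i

# Bi-Interaction pooling: the sum of all pairwise element-wise products,
# which stays a single k-dim vector and is then fed to the MLP
sum_then_square = embeds.sum(dim=1) ** 2         # [B, k]
square_then_sum = (embeds ** 2).sum(dim=1)       # [B, k]
bi_interaction = 0.5 * (sum_then_square - square_then_sum)   # [B, k]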

I was almost misled by one implementation; let's look at the official one:

https://github.com/guoyang9/NFM-pyorch

DIN

There are 3 samples, so each sequence feature has 3 rows. The longest sequence has 4 elements; shorter sequences are padded with 0.
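
A tiny illustration of the padding, with made-up item ids chosen to match the keys_length = [2, 3, 3] printed further down:

import torch

# hypothetical history sequences for the 3 samples, actual lengths 2, 3, 3,
# zero-padded up to the maximum length 4
hist_item = torch.tensor([
    [1, 2, 0, 0],
    [3, 2, 1, 0],
    [1, 2, 3, 0],
])
keys_length = (hist_item != 0).sum(dim=1)   # tensor([2, 3, 3])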

Passing through the code at deepctr_torch/models/basemodel.py:185, the number of axes of every tensor is normalized to 2.

6 non-sequence features plus 2 sequence features of length 4: 6 + 4×2 = 14 columns.

Enter deepctr_torch.models.din.DIN#forward

The input_from_feature_columns function is used to extract sparse and dense features.

_, dense_value_list = self.input_from_feature_columns(...

dense_value_list
Out[10]: 
[tensor([[0.3000],
         [0.1000],
         [0.2000]], device='cuda:0')]

query_emb_list is the embedding of the sparse_feature_columns.

self.history_feature_list # specified by the user
Out[16]: ['item', 'item_gender']
self.history_fc_names     # specified by the system, prefixed with hist_
Out[17]: ['hist_item', 'hist_item_gender']

query_emb_list is the embedding of ['item', 'item_gender']:

query_emb_list[0].shape
Out[21]: torch.Size([3, 1, 8])

keys_emb_list is the embedding of ['hist_item', 'hist_item_gender']:

keys_emb_list[0].shape
Out[20]: torch.Size([3, 4, 8])

Looking back at this small piece of the DIN constructor, it is easy to tell whether a feature is an ordinary variable-length sparse feature or a historical behavior feature.

Concatenate along the embedding dimension:

# concatenate
query_emb = torch.cat(query_emb_list, dim=-1)                     # [B, 1, E]
keys_emb = torch.cat(keys_emb_list, dim=-1)     
keys_emb.shape
Out[24]: torch.Size([3, 4, 16]) # 16 = 8 * 2
keys_length
Out[29]: tensor([2, 3, 3], device='cuda:0')

See deepctr_torch.layers.sequence.AttentionSequencePoolingLayer#forward

See deepctr_torch.layers.core.LocalActivationUnit#forward

queries = query.expand(-1, user_behavior_len, -1)

query.shape
Out[38]: torch.Size([3, 1, 16])
queries.shape
Out[37]: torch.Size([3, 4, 16])

The query is copied along the sequence-length axis to match the keys.

Along the embedding axis, concatenate [query, key, element-wise difference, element-wise product]:

attention_input.shape
Out[40]: torch.Size([3, 4, 64])
attention_output.shape
Out[39]: torch.Size([3, 4, 16])
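
Putting the two steps together, a sketch of how the attention input is built (the 64 comes from concatenating four 16-dim pieces; the names are mine, not the library's):

import torch

B, T, E = 3, 4, 16              # batch, sequence length, concatenated embedding dim
query = torch.randn(B, 1, E)    # candidate item embedding
keys = torch.randn(B, T, E)     # historical behavior embeddings

# repeat the query along the sequence axis so it lines up with every key
queries = query.expand(-1, T, -1)          # [3, 4, 16]

# concatenate [query, key, difference, product] on the embedding axis
attention_input = torch.cat(
    [queries, keys, queries - keys, queries * keys], dim=-1)   # [3, 4, 64]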

This deepctr_torch.layers.core.DNN is a bit interesting:

self.dnn
Out[41]: 
DNN(
  (dropout): Dropout(p=0)
  (linears): ModuleList(
    (0): Linear(in_features=64, out_features=64, bias=True)
    (1): Linear(in_features=64, out_features=16, bias=True)
  )
  (activation_layers): ModuleList(
    (0): Dice(
      (bn): BatchNorm1d(64, eps=1e-08, momentum=0.1, affine=True, track_running_stats=True)
      (sigmoid): Sigmoid()
    )
    (1): Dice(
      (bn): BatchNorm1d(16, eps=1e-08, momentum=0.1, affine=True, track_running_stats=True)
      (sigmoid): Sigmoid()
    )
  )
)
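
Dice is the adaptive activation from the DIN paper: it gates x with p(x) = sigmoid(BatchNorm(x)) and lets the rest leak through a learnable alpha. A rough sketch for 2-D input, not the deepctr_torch source:

import torch
import torch.nn as nn

class DiceSketch(nn.Module):
    # Dice activation: f(x) = p(x) * x + (1 - p(x)) * alpha * x,
    # with p(x) = sigmoid(BatchNorm(x)); input is [batch, num_features].
    def __init__(self, num_features, eps=1e-8):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features, eps=eps)
        self.alpha = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        p = torch.sigmoid(self.bn(x))
        return p * x + (1 - p) * self.alpha * x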

Note that multiple feature fields are put together here to produce a single attention weight.

I think it can be understood like this: a user has visited many items, and each item has many attributes, such as [item_id, cat_id, brand_id]. These attributes need to be considered jointly, but when weighting an item there can be only one weight (rather than one weight per attribute).

(There is room for improvement)

attention_score.shape
Out[42]: torch.Size([3, 4, 1])

Back in deepctr_torch.models.din.DIN#forward: after applying attention, the result is equivalent to a weighted sum over the historical sequence.
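
A sketch of that last step: mask the padded positions with keys_length, then pool (names are mine; deepctr_torch does the equivalent with a batched matmul):

import torch

B, T, E = 3, 4, 16
keys_emb = torch.randn(B, T, E)          # historical behavior embeddings
attention_score = torch.randn(B, T, 1)   # output of the local activation unit
keys_length = torch.tensor([2, 3, 3])

# mask out the zero-padded positions before pooling
mask = (torch.arange(T).unsqueeze(0) < keys_length.unsqueeze(1)).unsqueeze(-1)  # [3, 4, 1]
weights = attention_score.masked_fill(~mask, 0.0)

# weighted sum over the sequence axis: one pooled vector per sample
hist = (weights * keys_emb).sum(dim=1, keepdim=True)   # [3, 1, 16]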

Origin blog.csdn.net/TQCAI666/article/details/113779670