AFM
(Reading the paper) CTR prediction-AFM model analysis of recommendation system
self.P.shape
Out[18]: torch.Size([4])
If there are m = 39 fields, there are $\frac{m(m-1)}{2} = \frac{39 \cdot 38}{2} = 741$ pairwise feature interactions. The attention layer's weight $W$ is a 4x4 matrix, $h$ is a 4x1 output vector, and the attention weights $a_{ij}$ form a [741]-length vector, normalized by softmax.
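A minimal sketch of the AFM attention mechanism over pairwise interactions, assuming 39 fields, embedding dim 4, and attention factor 4 as above (names `v`, `W`, `b`, `h` are illustrative; batch dimension omitted):

```python
import torch
import torch.nn.functional as F

m, k, t = 39, 4, 4              # fields, embedding dim, attention factor
v = torch.randn(m, k)           # one embedding per field (batch omitted)

i, j = torch.triu_indices(m, m, offset=1)  # all pairs with i < j, 741 of them
pairs = v[i] * v[j]             # element-wise interaction vectors, [741, 4]

W = torch.randn(t, k)           # attention layer weight, 4x4
b = torch.randn(t)
h = torch.randn(t)              # projects the 4-dim hidden state to a score

scores = torch.relu(pairs @ W.T + b) @ h   # one score per interaction, [741]
a = F.softmax(scores, dim=0)               # a_ij, sums to 1
print(a.shape)                             # torch.Size([741])
```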
DCN
(Reading the paper) CTR prediction-DCN model analysis of recommendation system
Someone in the comment area knows the author
Demystifying Deep & Cross: How to automatically construct high-order cross features
DCN implemented with tensorflow
You are only one step away from playing the enterprise-level Deep&Cross Network model
xl_w.shape
Out[9]: torch.Size([32, 1, 1])
dot_.shape
Out[10]: torch.Size([32, 117, 1])
x_l.shape
Out[11]: torch.Size([32, 117, 1])
Suppose there are 117 feature columns. The inner product x_l^T w collapses the feature axis to a single scalar per sample (hence xl_w has shape [32, 1, 1]); multiplying x_0 by that scalar restores shape [32, 117, 1]. The length of the weight vector w matches the 117 feature columns.
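A sketch of one cross layer reproducing the shapes above (batch 32, 117 features); deepctr_torch's real implementation differs in details, and the variable names mirror the REPL session rather than the library:

```python
import torch

batch, d = 32, 117
x0 = torch.randn(batch, d, 1)   # original input, reused at every layer
x_l = x0                        # current layer input
w = torch.randn(d, 1)           # one weight per feature column
b = torch.randn(d, 1)

xl_w = torch.matmul(x_l.transpose(1, 2), w)  # x_l^T w, one scalar per sample: [32, 1, 1]
dot_ = torch.matmul(x0, xl_w)                # x0 scaled by that scalar: [32, 117, 1]
x_next = dot_ + b + x_l                      # cross term + bias + residual
print(xl_w.shape, dot_.shape, x_next.shape)
```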
NFM
FM itself can be regarded as a neural network architecture: an NFM with the hidden layers removed.
The most important ingredient of NFM is the Bi-Interaction layer; both Wide&Deep and DeepCross replace Bi-Interaction with plain concatenation.
The biggest drawback of concatenation is that it carries no feature-interaction information, so everything depends on the subsequent MLP to learn the interactions, and MLP optimization is notoriously difficult.
Bi-Interaction explicitly encodes second-order feature interactions, so the input representation carries more information and the burden on the subsequent MLP is reduced; a much simpler model (only one hidden layer in the experiments) then achieves better results.
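A sketch of Bi-Interaction pooling using the standard FM identity (sum-of-squares trick); shapes are illustrative, not taken from the paper:

```python
import torch

batch, m, k = 32, 10, 8
emb = torch.randn(batch, m, k)   # one k-dim embedding per field

# Bi-Interaction pooling: sum of all pairwise element-wise products,
# computed in O(m*k) via the FM identity 0.5 * ((sum v)^2 - sum v^2)
sum_sq = emb.sum(dim=1).pow(2)   # (sum of vectors)^2, [32, 8]
sq_sum = emb.pow(2).sum(dim=1)   # sum of squared vectors, [32, 8]
bi = 0.5 * (sum_sq - sq_sum)     # [32, 8], fed to the small MLP

# brute-force check against the explicit pairwise sum
ref = sum(emb[:, i] * emb[:, j] for i in range(m) for j in range(i + 1, m))
print(torch.allclose(bi, ref, atol=1e-4))  # True
```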
I was almost led astray by one third-party implementation; let's look at the official one instead:
https://github.com/guoyang9/NFM-pyorch
DIN
There are 3 samples, so each sequence feature has 3 rows. The longest sequence has 4 elements; shorter sequences are zero-padded to length 4.
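The zero-padding described above can be sketched with PyTorch's built-in helper (the sequence values here are made up):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# three samples; the longest behavior sequence has 4 elements
seqs = [torch.tensor([1, 2]),
        torch.tensor([1, 2, 3]),
        torch.tensor([1, 2, 3, 4])]
padded = pad_sequence(seqs, batch_first=True)  # zero-pads to the longest: [3, 4]
print(padded)
```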
deepctr_torch/models/basemodel.py:185
After this code runs, every tensor is normalized to 2 axes.
6 non-sequence features plus 2 sequence features of length 4: 6 + 4x2 = 14 columns.
Enter deepctr_torch.models.din.DIN#forward
input_from_feature_columns
This function is used to extract sparse and dense features.
_, dense_value_list = self.input_from_feature_columns(...
dense_value_list
Out[10]:
[tensor([[0.3000],
[0.1000],
[0.2000]], device='cuda:0')]
query_emb_list is the embedding of the sparse_feature_columns.
self.history_feature_list  # specified by the user
Out[16]: ['item', 'item_gender']
self.history_fc_names  # derived by the framework, prefixed with hist_
Out[17]: ['hist_item', 'hist_item_gender']
query_emb_list is the result of embedding ['item', 'item_gender']:
query_emb_list[0].shape
Out[21]: torch.Size([3, 1, 8])
keys_emb_list is the result of embedding ['hist_item', 'hist_item_gender']:
keys_emb_list[0].shape
Out[20]: torch.Size([3, 4, 8])
Looking back at this small piece of the DIN constructor, it is easy to tell a plain variable-length sparse feature from a history feature.
The embeddings are concatenated along the embedding dimension:
# concatenate
query_emb = torch.cat(query_emb_list, dim=-1) # [B, 1, E]
keys_emb = torch.cat(keys_emb_list, dim=-1)
keys_emb.shape
Out[24]: torch.Size([3, 4, 16]) # 16 = 8 * 2
keys_length
Out[29]: tensor([2, 3, 3], device='cuda:0')
See deepctr_torch.layers.sequence.AttentionSequencePoolingLayer#forward
and deepctr_torch.layers.core.LocalActivationUnit#forward
queries = query.expand(-1, user_behavior_len, -1)
query.shape
Out[38]: torch.Size([3, 1, 16])
queries.shape
Out[37]: torch.Size([3, 4, 16])
The query is repeated along the sequence-length axis to match the keys.
Along the embedding axis, concatenate [query, key, element-wise difference, element-wise product]:
attention_input.shape
Out[40]: torch.Size([3, 4, 64])
attention_output.shape
Out[39]: torch.Size([3, 4, 16])
This is deepctr_torch.layers.core.DNN, which is a bit interesting:
self.dnn
Out[41]:
DNN(
(dropout): Dropout(p=0)
(linears): ModuleList(
(0): Linear(in_features=64, out_features=64, bias=True)
(1): Linear(in_features=64, out_features=16, bias=True)
)
(activation_layers): ModuleList(
(0): Dice(
(bn): BatchNorm1d(64, eps=1e-08, momentum=0.1, affine=True, track_running_stats=True)
(sigmoid): Sigmoid()
)
(1): Dice(
(bn): BatchNorm1d(16, eps=1e-08, momentum=0.1, affine=True, track_running_stats=True)
(sigmoid): Sigmoid()
)
)
)
Note that multiple feature fields are concatenated here to produce a single attention score per behavior.
My understanding: a user has interacted with many items, and each item has several attributes, e.g. [item_id, cat_id, brand_id]. These attributes should be considered jointly, but when weighting an item there can only be one weight per item, not one weight per attribute.
(There may be room for improvement here.)
attention_score.shape
Out[42]: torch.Size([3, 4, 1])
Back in deepctr_torch.models.din.DIN#forward: after applying attention, the result is equivalent to a weighted average of the history sequence.
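A sketch of the weighted pooling step, including masking of the zero-padded positions indicated by keys_length above; the library's actual masking (e.g. softmax normalization options) differs in details, and the score tensor here is a random stand-in:

```python
import torch

B, T, E = 3, 4, 16
keys = torch.randn(B, T, E)            # behavior embeddings
scores = torch.rand(B, T, 1)           # raw attention scores from the DNN
keys_length = torch.tensor([2, 3, 3])  # true lengths; the rest is padding

# zero out scores at padded positions before pooling
mask = torch.arange(T).unsqueeze(0) < keys_length.unsqueeze(1)  # [3, 4]
scores = scores.masked_fill(~mask.unsqueeze(-1), 0.0)

pooled = torch.matmul(scores.transpose(1, 2), keys)  # weighted sum, [3, 1, 16]
print(pooled.shape)
```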