Ranking models and model fusion

LGB's ranking model
LGB's classification model
Deep learning classification model DIN

Two classic model fusion methods:

Weighted fusion
Stacking of the output results (a simple model is trained on the outputs of the base models); my impression is that a statistical averaging combination is what is used here
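
To make weighted fusion concrete, here is a minimal sketch on toy data. It assumes each model has saved a table with `user_id`, `click_article_id` and `pred_score` (the layout used for the score files later in this post); the helper names `min_max` and `weighted_fusion` and the weights 0.6 / 0.4 are made up for illustration and are not tuned values.

```python
import pandas as pd

def min_max(s: pd.Series) -> pd.Series:
    """Scale a score column to [0, 1]; a constant column maps to 1.0."""
    lo, hi = s.min(), s.max()
    if hi == lo:
        return pd.Series(1.0, index=s.index)
    return (s - lo) / (hi - lo)

def weighted_fusion(score_dfs, weights):
    """Blend per-model scores for the same (user_id, click_article_id) pairs."""
    keys = ['user_id', 'click_article_id']
    fused = None
    for i, (df, w) in enumerate(zip(score_dfs, weights)):
        part = df[keys].copy()
        # Normalize first so that models with different score ranges are comparable
        part[f'score_{i}'] = w * min_max(df['pred_score'])
        fused = part if fused is None else fused.merge(part, on=keys, how='outer')
    score_cols = [c for c in fused.columns if c.startswith('score_')]
    # A pair missing from one model simply contributes 0 for that model
    fused['pred_score'] = fused[score_cols].fillna(0.0).sum(axis=1)
    return fused[keys + ['pred_score']]

# Toy scores from two hypothetical models
ranker = pd.DataFrame({'user_id': [1, 1, 2], 'click_article_id': [10, 11, 10],
                       'pred_score': [0.2, 1.2, -0.8]})
cls = pd.DataFrame({'user_id': [1, 1, 2], 'click_article_id': [10, 11, 10],
                    'pred_score': [0.9, 0.1, 0.5]})
fused = weighted_fusion([ranker, cls], weights=[0.6, 0.4])
print(fused)
```

In practice the weights would usually be chosen from offline validation results rather than fixed by hand.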

import numpy as np
import pandas as pd
import pickle
from tqdm import tqdm
import gc, os
import time
from datetime import datetime
import lightgbm as lgb
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings('ignore')

Read the ranking features

data_path = './data_raw/'
save_path = './temp_results/'
offline = False

When the data is read back from CSV, click_article_id comes in as a floating-point number, so it is converted to int

trn_user_item_feats_df = pd.read_csv(save_path + 'trn_user_item_feats_df.csv')
trn_user_item_feats_df['click_article_id'] = trn_user_item_feats_df['click_article_id'].astype(int)

if offline:
    val_user_item_feats_df = pd.read_csv(save_path + 'val_user_item_feats_df.csv')
    val_user_item_feats_df['click_article_id'] = val_user_item_feats_df['click_article_id'].astype(int)
else:
    val_user_item_feats_df = None

tst_user_item_feats_df = pd.read_csv(save_path + 'tst_user_item_feats_df.csv')
tst_user_item_feats_df['click_article_id'] = tst_user_item_feats_df['click_article_id'].astype(int)

For convenience, the test set was also given a placeholder label when the features were built; delete it here.

del tst_user_item_feats_df['label']

Generate the submission from the ranked results

def submit(recall_df, topk=5, model_name=None):
    recall_df = recall_df.sort_values(by=['user_id', 'pred_score'])
    recall_df['rank'] = recall_df.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')

    # Check that every user has at least topk candidate articles
    tmp = recall_df.groupby('user_id').apply(lambda x: x['rank'].max())
    assert tmp.min() >= topk

    del recall_df['pred_score']
    submit = recall_df[recall_df['rank'] <= topk].set_index(['user_id', 'rank']).unstack(-1).reset_index()

    submit.columns = [int(col) if isinstance(col, int) else col for col in submit.columns.droplevel(0)]
    # Rename the columns to match the submission format
    submit = submit.rename(columns={'': 'user_id', 1: 'article_1', 2: 'article_2',
                                    3: 'article_3', 4: 'article_4', 5: 'article_5'})

    save_name = save_path + model_name + '_' + datetime.today().strftime('%m-%d') + '.csv'
    submit.to_csv(save_name, index=False, header=True)
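
The core of this function is "rank within each user, keep the top k, then spread the ranks into columns". Here is a small sketch of just that step on made-up scores, using `pivot` instead of `set_index(...).unstack()` purely because it is easier to read; the data and `topk=2` are illustrative only.

```python
import pandas as pd

scores = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2],
    'click_article_id': [10, 11, 12, 20, 21, 22],
    'pred_score': [0.1, 0.9, 0.5, 0.7, 0.2, 0.3],
})

# Rank candidates inside each user, highest score first (rank 1 = best)
scores['rank'] = (scores.groupby('user_id')['pred_score']
                        .rank(ascending=False, method='first')
                        .astype(int))

# Keep the top-2 per user and turn the ranks into columns
top2 = scores[scores['rank'] <= 2]
wide = top2.pivot(index='user_id', columns='rank', values='click_article_id')
wide.columns = [f'article_{r}' for r in wide.columns]
wide = wide.reset_index()
print(wide)
```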

Normalize the ranking scores

def norm_sim(sim_df, weight=0.0):
    # print(sim_df.head())
    min_sim = sim_df.min()
    max_sim = sim_df.max()
    if max_sim == min_sim:
        sim_df = sim_df.apply(lambda sim: 1.0)
    else:
        sim_df = sim_df.apply(lambda sim: 1.0 * (sim - min_sim) / (max_sim - min_sim))

    sim_df = sim_df.apply(lambda sim: sim + weight)  # add the optional offset
    return sim_df

LGB ranking model

Work on copies so the data does not have to be re-read if a later step fails

trn_user_item_feats_df_rank_model = trn_user_item_feats_df.copy()

if offline:
    val_user_item_feats_df_rank_model = val_user_item_feats_df.copy()

tst_user_item_feats_df_rank_model = tst_user_item_feats_df.copy()

Define feature columns

lgb_cols = ['sim0', 'time_diff0', 'word_diff0', 'sim_max', 'sim_min', 'sim_sum',
            'sim_mean', 'score', 'click_size', 'time_diff_mean', 'active_level',
            'click_environment', 'click_deviceGroup', 'click_os', 'click_country',
            'click_region', 'click_referrer_type', 'user_time_hob1', 'user_time_hob2',
            'words_hbo', 'category_id', 'created_at_ts', 'words_count']

Group sizes for the ranking model (one group per user)

trn_user_item_feats_df_rank_model.sort_values(by=['user_id'], inplace=True)
g_train = trn_user_item_feats_df_rank_model.groupby(['user_id'], as_index=False).count()["label"].values

if offline:
    val_user_item_feats_df_rank_model.sort_values(by=['user_id'], inplace=True)
    g_val = val_user_item_feats_df_rank_model.groupby(['user_id'], as_index=False).count()["label"].values
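
The `group` argument of `LGBMRanker.fit` expects the number of rows in each query group, in the same order as the rows of the training data, which is why the frame is sorted by `user_id` before counting. A tiny sketch on made-up data shows what the array looks like:

```python
import pandas as pd

df = pd.DataFrame({'user_id': [3, 1, 3, 2, 1, 3],
                   'label':   [0, 1, 1, 0, 0, 0]})

# Rows of the same user must be contiguous, so sort first
df = df.sort_values('user_id', kind='stable').reset_index(drop=True)

# One entry per user: how many candidate rows belong to that user
group_sizes = df.groupby('user_id', sort=True).size().values
print(group_sizes)

# The sizes must add up to the number of rows passed to fit()
assert group_sizes.sum() == len(df)
```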

Ranking model definition

lgb_ranker = lgb.LGBMRanker(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,
                            max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,
                            learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs=16)

Ranking model training

# Note: LightGBM 4.0 removed early_stopping_rounds from fit(); with newer
# versions pass callbacks=[lgb.early_stopping(50)] instead.
if offline:
    lgb_ranker.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'], group=g_train,
                   eval_set=[(val_user_item_feats_df_rank_model[lgb_cols], val_user_item_feats_df_rank_model['label'])],
                   eval_group=[g_val], eval_at=[1, 2, 3, 4, 5], eval_metric=['ndcg', ], early_stopping_rounds=50, )
else:
    # Use the sorted copy so that the row order matches g_train
    lgb_ranker.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'], group=g_train)

Model prediction

tst_user_item_feats_df['pred_score'] = lgb_ranker.predict(tst_user_item_feats_df[lgb_cols], num_iteration=lgb_ranker.best_iteration_)

Save a copy of the ranking scores here for the later model fusion

tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'lgb_ranker_score.csv', index=False)

Re-rank the predictions and generate the submission

rank_results = tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].copy()
rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)
submit(rank_results, topk=5, model_name='lgb_ranker')

Five-fold cross-validation, with the folds split by user

This part is independent of the separate training and validation above

def get_kfold_users(trn_df, n=5):
    user_ids = trn_df['user_id'].unique()
    user_set = [user_ids[i::n] for i in range(n)]
    return user_set

k_fold = 5
trn_df = trn_user_item_feats_df_rank_model
user_set = get_kfold_users(trn_df, n=k_fold)

score_list = []
score_df = trn_df[['user_id', 'click_article_id', 'label']]
sub_preds = np.zeros(tst_user_item_feats_df_rank_model.shape[0])
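
Splitting by user rather than by row keeps all candidates of one user in the same fold, so the per-user groups of the ranking model stay intact. The slicing `user_ids[i::n]` deals users out to the folds round-robin; here is a quick check on twelve made-up user ids (the function name `split_users` is only for this sketch):

```python
import numpy as np

def split_users(user_ids, n=5):
    """Deal user ids out to n folds round-robin, as user_ids[i::n] does."""
    user_ids = np.asarray(user_ids)
    return [user_ids[i::n] for i in range(n)]

folds = split_users(np.arange(12), n=5)
for i, fold in enumerate(folds):
    print(i, fold)

# Every user lands in exactly one fold
all_users = np.concatenate(folds)
assert len(all_users) == len(set(all_users)) == 12
```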

Five-fold cross-validation, saving the intermediate results for stacking

for n_fold, valid_user in enumerate(user_set):
    # Split by user; .copy() avoids SettingWithCopyWarning on the edits below
    train_idx = trn_df[~trn_df['user_id'].isin(valid_user)].copy()
    valid_idx = trn_df[trn_df['user_id'].isin(valid_user)].copy()

    # Per-user group sizes for the training and validation folds
    train_idx.sort_values(by=['user_id'], inplace=True)
    g_train = train_idx.groupby(['user_id'], as_index=False).count()["label"].values

    valid_idx.sort_values(by=['user_id'], inplace=True)
    g_val = valid_idx.groupby(['user_id'], as_index=False).count()["label"].values

    # Define the model
    lgb_ranker = lgb.LGBMRanker(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,
                                max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,
                                learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs=16)
    # Train the model
    lgb_ranker.fit(train_idx[lgb_cols], train_idx['label'], group=g_train,
                   eval_set=[(valid_idx[lgb_cols], valid_idx['label'])], eval_group=[g_val],
                   eval_at=[1, 2, 3, 4, 5], eval_metric=['ndcg', ], early_stopping_rounds=50, )

    # Predict on the validation fold
    valid_idx['pred_score'] = lgb_ranker.predict(valid_idx[lgb_cols], num_iteration=lgb_ranker.best_iteration_)

    # Normalize the output scores
    valid_idx['pred_score'] = norm_sim(valid_idx['pred_score'])

    # rank() works per group regardless of row order, so no extra sort is needed
    valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')

    # Collect the validation predictions of each fold; they are concatenated afterwards
    score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])

    # For the online test, add up the test-set predictions of every fold; they are averaged afterwards
    if not offline:
        # num_iteration must be passed by keyword: the second positional argument of predict() is raw_score
        sub_preds += lgb_ranker.predict(tst_user_item_feats_df_rank_model[lgb_cols],
                                        num_iteration=lgb_ranker.best_iteration_)

score_df_ = pd.concat(score_list, axis=0)
score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])

Save the new features generated for the training set by cross-validation

score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_lgb_ranker_feats.csv', index=False)

The test-set predictions are averaged over the cross-validation folds, and the predicted score and the corresponding rank feature are saved for later stacking; more features could also be constructed here.

tst_user_item_feats_df_rank_model['pred_score'] = sub_preds / k_fold
tst_user_item_feats_df_rank_model['pred_score'] = norm_sim(tst_user_item_feats_df_rank_model['pred_score'])
# rank() works per group regardless of row order, so no extra sort is needed
tst_user_item_feats_df_rank_model['pred_rank'] = tst_user_item_feats_df_rank_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')

Save the new cross-validation features of the test set

tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv(save_path + 'tst_lgb_ranker_feats.csv', index=False)

Re-rank the predictions and generate the submission

Submission generated from the single model

rank_results = tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score']].copy()
rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)
submit(rank_results, topk=5, model_name='lgb_ranker')

LGB classification model

Definition of the model and its parameters

lgb_Classfication = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,
                                       max_depth=-1, n_estimators=500, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,
                                       learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs=16, verbose=10)

Model training

if offline:
    lgb_Classfication.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'],
                          eval_set=[(val_user_item_feats_df_rank_model[lgb_cols], val_user_item_feats_df_rank_model['label'])],
                          eval_metric=['auc', ], early_stopping_rounds=50, )
else:
    lgb_Classfication.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'])

Model prediction

tst_user_item_feats_df['pred_score'] = lgb_Classfication.predict_proba(tst_user_item_feats_df[lgb_cols])[:, 1]

Save a copy of the ranking scores here for the later model fusion

tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'lgb_cls_score.csv', index=False)

Re-rank the predictions and generate the submission

rank_results = tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].copy()
rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)
submit(rank_results, topk=5, model_name='lgb_cls')

Five-fold cross-validation, with the folds split by user

This part is independent of the separate training and validation above

def get_kfold_users(trn_df, n=5):
    user_ids = trn_df['user_id'].unique()
    user_set = [user_ids[i::n] for i in range(n)]
    return user_set

k_fold = 5
trn_df = trn_user_item_feats_df_rank_model
user_set = get_kfold_users(trn_df, n=k_fold)

score_list = []
score_df = trn_df[['user_id', 'click_article_id', 'label']]
sub_preds = np.zeros(tst_user_item_feats_df_rank_model.shape[0])

Five-fold cross-validation, saving the intermediate results for stacking

for n_fold, valid_user in enumerate(user_set):
    # Split by user; .copy() avoids SettingWithCopyWarning on the edits below
    train_idx = trn_df[~trn_df['user_id'].isin(valid_user)].copy()
    valid_idx = trn_df[trn_df['user_id'].isin(valid_user)].copy()

    # Definition of the model and its parameters
    lgb_Classfication = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,
                                           max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,
                                           learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs=16, verbose=10)
    # Train the model
    lgb_Classfication.fit(train_idx[lgb_cols], train_idx['label'], eval_set=[(valid_idx[lgb_cols], valid_idx['label'])],
                          eval_metric=['auc', ], early_stopping_rounds=50, )

    # Predict on the validation fold
    valid_idx['pred_score'] = lgb_Classfication.predict_proba(valid_idx[lgb_cols],
                                                              num_iteration=lgb_Classfication.best_iteration_)[:, 1]

    # The classifier already outputs a probability, so no normalization is needed here
    # valid_idx['pred_score'] = valid_idx[['pred_score']].transform(lambda x: norm_sim(x))

    # rank() works per group regardless of row order, so no extra sort is needed
    valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')

    # Collect the validation predictions of each fold; they are concatenated afterwards
    score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])

    # For the online test, add up the test-set predictions of every fold; they are averaged afterwards
    if not offline:
        sub_preds += lgb_Classfication.predict_proba(tst_user_item_feats_df_rank_model[lgb_cols],
                                                     num_iteration=lgb_Classfication.best_iteration_)[:, 1]

score_df_ = pd.concat(score_list, axis=0)
score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])

Save the new features generated for the training set by cross-validation

score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_lgb_cls_feats.csv', index=False)

The test-set predictions are averaged over the cross-validation folds, and the predicted score and the corresponding rank feature are saved for later stacking; more features could also be constructed here.

tst_user_item_feats_df_rank_model['pred_score'] = sub_preds / k_fold
tst_user_item_feats_df_rank_model['pred_score'] = norm_sim(tst_user_item_feats_df_rank_model['pred_score'])
# rank() works per group regardless of row order, so no extra sort is needed
tst_user_item_feats_df_rank_model['pred_rank'] = tst_user_item_feats_df_rank_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')

Save the new cross-validation features of the test set

tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv(save_path + 'tst_lgb_cls_feats.csv', index=False)

Re-rank the predictions and generate the submission

rank_results = tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score']].copy()
rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)
submit(rank_results, topk=5, model_name='lgb_cls')

DIN model

List of each user's historical clicks

This is prepared for the DIN model below

if offline:
    all_data = pd.read_csv('./data_raw/train_click_log.csv')
else:
    trn_data = pd.read_csv('./data_raw/train_click_log.csv')
    tst_data = pd.read_csv('./data_raw/testA_click_log.csv')
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    all_data = pd.concat([trn_data, tst_data])

# One row per user, with the clicked article ids collected into a list
hist_click = all_data[['user_id', 'click_article_id']].groupby('user_id').agg(list).reset_index()
his_behavior_df = pd.DataFrame()
his_behavior_df['user_id'] = hist_click['user_id']
his_behavior_df['hist_click_article_id'] = hist_click['click_article_id']

trn_user_item_feats_df_din_model = trn_user_item_feats_df.copy()

if offline:
    val_user_item_feats_df_din_model = val_user_item_feats_df.copy()
else:
    val_user_item_feats_df_din_model = None

tst_user_item_feats_df_din_model = tst_user_item_feats_df.copy()

# Attach the click history to every (user, candidate article) row
trn_user_item_feats_df_din_model = trn_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')

if offline:
    val_user_item_feats_df_din_model = val_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')
else:
    val_user_item_feats_df_din_model = None

tst_user_item_feats_df_din_model = tst_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')

Introduction to the DIN model

Next we try the DIN model. DIN stands for Deep Interest Network, a model proposed by Alibaba in 2018 to address the fact that earlier deep learning models could not express the diversity of a user's interests. It computes the representation vector of the user's interest from the correlation between the given candidate ad and the user's historical behavior. Specifically, it introduces a local activation unit that soft-searches the relevant parts of the historical behavior to focus on the relevant user interests, and uses a weighted sum to obtain the representation of the user's interest with respect to the candidate ad. Behaviors that are more relevant to the candidate ad get higher activation weights and dominate the user's interest representation. Because the representation vector differs from ad to ad, the expressive power of the model improves considerably. This also makes the model a good fit for news recommendation: here we compute the user's interest in an article from the correlation between the current candidate article and the articles the user clicked in the past. The structure of the model is as follows:

[Figure: structure of the DIN model]

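
The weighted-sum idea described above can be sketched in a few lines of NumPy. This is a simplified illustration, not DIN itself: a plain dot product stands in for the small feed-forward activation unit that DIN learns, and the function name and toy embeddings are made up. As with `att_weight_normalization=False`, the weights are not passed through a softmax.

```python
import numpy as np

def attention_pooling(candidate, history, mask):
    """
    candidate: (d,)   embedding of the candidate article
    history:   (T, d) embeddings of the user's clicked articles (padded to T)
    mask:      (T,)   1.0 for real clicks, 0.0 for padding positions
    """
    weights = history @ candidate   # (T,) relevance of each past click to the candidate
    weights = weights * mask        # padded positions contribute nothing
    return weights @ history        # (d,) weighted sum = interest vector

# Two real clicks and one padding position, embedding size 2
history = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.0, 0.0]])
mask = np.array([1.0, 1.0, 0.0])

# The same history yields a different interest vector for each candidate
interest_a = attention_pooling(np.array([1.0, 0.0]), history, mask)
interest_b = attention_pooling(np.array([0.0, 2.0]), history, mask)
print(interest_a, interest_b)
```

The point to notice is that the user vector depends on the candidate: the first candidate activates the first click, the second candidate activates the second one.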
Here we simply call a library to use this model; the details of the model will be covered in the next installment of the recommendation system team study. As for how to use the model, the function prototype in deepctr is as follows:

def DIN(dnn_feature_columns, history_feature_list, dnn_use_bn=False,
        dnn_hidden_units=(200, 80), dnn_activation='relu', att_hidden_size=(80, 40), att_activation="dice",
        att_weight_normalization=False, l2_reg_dnn=0, l2_reg_embedding=1e-6, dnn_dropout=0, seed=1024,
        task='binary'):

dnn_feature_columns: feature columns, a list containing all features of the data
history_feature_list: user history behavior columns, a list of the features that reflect the user's historical behavior
dnn_use_bn: whether to use BatchNormalization
dnn_hidden_units: number of layers of the fully connected network and number of neurons in each layer, as a list or tuple
dnn_activation: activation function of the fully connected network
att_hidden_size: number of layers of the attention layer's fully connected network and number of neurons in each layer
att_activation: activation function of the attention layer
att_weight_normalization: whether to normalize the attention scores
l2_reg_dnn: L2 regularization coefficient of the fully connected network
l2_reg_embedding: L2 regularization coefficient of the embedding vectors
dnn_dropout: dropout probability for the neurons of the fully connected network
task: the task type, classification or regression

To use the model we must pass in the feature columns and the historical behavior columns, and the feature columns need some preprocessing first. The details are as follows:

First, we process the data set. Since we predict whether the user clicks the current article from the user's past behavior, the feature columns are divided into three parts: numerical features, discrete features and historical behavior features. The DIN model treats each part differently.
Discrete features are the categorical features of our data set, such as user_id. Such features first go through an embedding layer to obtain a low-dimensional dense representation. Because of the embedding, we need to build a vocabulary for the values of each categorical column and specify the embedding dimension. So when preparing data for deepctr's DIN model, these categorical features are declared with the SparseFeat function, whose arguments are the column name, the number of unique values in the column (used to build the vocabulary) and the embedding dimension.
The user's historical behavior columns, such as article id or article category, also go through embedding first. The difference is that after the embedding, an attention layer computes the correlation between the user's historical behavior and the current candidate article to obtain the embedding vector of the current user. This vector reflects the user's interest through the similarity between the candidate article and the articles the user clicked in the past, and it changes with the user's click history, so it dynamically models how the user's interest evolves. This kind of feature is a sequence per user, and the sequence length differs between users: some users have clicked many articles, others only a few. The lengths therefore have to be unified. When preparing data for the DIN model, we first declare these categorical features with SparseFeat and then wrap them in VarLenSparseFeat, padding the sequences so that every user's history has the same length; this is why the function takes a maxlen parameter giving the maximum sequence length.
For continuous feature columns, we only need the DenseFeat function with the column name and dimension.
After the feature columns are defined, we map the data to the corresponding columns to obtain the final input.
Now let's look at the actual code. The logic is as follows: first we write a data preparation function that follows the steps above and returns the data and the feature columns, then we build and train the DIN model, and finally we predict with the trained model.
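
To show what "padding the sequences to the same length" means, here is a small NumPy stand-in for the padding step. It is meant to mimic `pad_sequences(..., padding='post')` with its default `truncating='pre'` (zeros appended at the end, the oldest clicks dropped when a history is too long); the helper name `pad_post` is made up for this sketch.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Pad each sequence with `value` at the end; keep only the last `maxlen` items."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for i, seq in enumerate(sequences):
        trunc = list(seq)[-maxlen:]      # drop the oldest clicks if the history is too long
        out[i, :len(trunc)] = trunc      # real clicks first, padding after
    return out

hist = [[5, 7],            # short history: padded with zeros
        [1, 2, 3, 4]]      # long history: truncated to the last 3 clicks
padded = pad_post(hist, maxlen=3)
print(padded)
```

The result is a 2-D array with one row per sample, which is the shape the model expects for the history input.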

Import deepctr

from deepctr.models import DIN
from deepctr.feature_column import SparseFeat, VarLenSparseFeat, DenseFeat, get_feature_names
from tensorflow.keras.preprocessing.sequence import pad_sequences

from tensorflow.keras import backend as K
from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from tensorflow.keras.callbacks import *
import tensorflow as tf

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

Data preparation function

def get_din_feats_columns(df, dense_fea, sparse_fea, behavior_fea, hist_behavior_fea, emb_dim=32, max_len=100):
    """
    Data preparation function:
    df: dataset
    dense_fea: numerical feature columns
    sparse_fea: discrete feature columns
    behavior_fea: the user's candidate behavior feature columns
    hist_behavior_fea: the user's historical behavior feature columns
    emb_dim: embedding dimension; for simplicity, all discrete feature columns use the same embedding dimension
    max_len: the maximum length of the user sequence
    """

    sparse_feature_columns = [SparseFeat(feat, vocabulary_size=df[feat].nunique() + 1, embedding_dim=emb_dim) for feat in sparse_fea]

    dense_feature_columns = [DenseFeat(feat, 1, ) for feat in dense_fea]

    # The history columns share the embedding table of click_article_id
    var_feature_columns = [VarLenSparseFeat(SparseFeat(feat, vocabulary_size=df['click_article_id'].nunique() + 1,
                                                       embedding_dim=emb_dim, embedding_name='click_article_id'), maxlen=max_len) for feat in hist_behavior_fea]

    dnn_feature_columns = sparse_feature_columns + dense_feature_columns + var_feature_columns

    # Build x as a dict that maps feature name to input array
    x = {}
    for name in get_feature_names(dnn_feature_columns):
        if name in hist_behavior_fea:
            # Historical behavior sequence
            his_list = [l for l in df[name]]
            x[name] = pad_sequences(his_list, maxlen=max_len, padding='post')      # 2-D array
        else:
            x[name] = df[name].values

    return x, dnn_feature_columns

Split the features by type

sparse_fea = ['user_id', 'click_article_id', 'category_id', 'click_environment', 'click_deviceGroup',
              'click_os', 'click_country', 'click_region', 'click_referrer_type', 'is_cat_hab']

behavior_fea = ['click_article_id']

hist_behavior_fea = ['hist_click_article_id']

dense_fea = ['sim0', 'time_diff0', 'word_diff0', 'sim_max', 'sim_min', 'sim_sum', 'sim_mean', 'score',
             'rank', 'click_size', 'time_diff_mean', 'active_level', 'user_time_hob1', 'user_time_hob2',
             'words_hbo', 'words_count']

Normalize the dense features; neural network training needs the numerical inputs to be normalized

mm = MinMaxScaler()

The following is special-case handling: if invalid values (inf) were produced in earlier steps, the normalization fails unless they are handled first. You can comment these lines out at first and run the code below.

If that then raises an error, the better fix is to find where the inf values come from and avoid producing them.

trn_user_item_feats_df_din_model.replace([np.inf, -np.inf], 0, inplace=True)

tst_user_item_feats_df_din_model.replace([np.inf, -np.inf], 0, inplace=True)

for feat in dense_fea:
    trn_user_item_feats_df_din_model[feat] = mm.fit_transform(trn_user_item_feats_df_din_model[[feat]])

    if val_user_item_feats_df_din_model is not None:
        val_user_item_feats_df_din_model[feat] = mm.fit_transform(val_user_item_feats_df_din_model[[feat]])

    tst_user_item_feats_df_din_model[feat] = mm.fit_transform(tst_user_item_feats_df_din_model[[feat]])
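
Note that the loop above calls `fit_transform` on every split, so each split is scaled by its own minimum and maximum. A common alternative, shown here only as a sketch on made-up numbers, is to fit the scaler on the training set once and reuse it for the other splits, so that the same raw value always maps to the same scaled value:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

trn = pd.DataFrame({'sim0': [0.0, 5.0, 10.0]})
tst = pd.DataFrame({'sim0': [2.5, 10.0]})

scaler = MinMaxScaler()
trn_scaled = scaler.fit_transform(trn[['sim0']])   # learn min/max from the training set
tst_scaled = scaler.transform(tst[['sim0']])       # reuse them for the test set

print(trn_scaled.ravel(), tst_scaled.ravel())
```

With this variant, test values outside the training range fall outside [0, 1], which may or may not matter for the model; whether it improves the score here has not been checked.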

Prepare the training data

x_trn, dnn_feature_columns = get_din_feats_columns(trn_user_item_feats_df_din_model, dense_fea,
                                                   sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)
y_trn = trn_user_item_feats_df_din_model['label'].values

if offline:
    # Prepare the validation data
    x_val, dnn_feature_columns = get_din_feats_columns(val_user_item_feats_df_din_model, dense_fea,
                                                       sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)
    y_val = val_user_item_feats_df_din_model['label'].values

dense_fea = [x for x in dense_fea if x != 'label']
x_tst, dnn_feature_columns = get_din_feats_columns(tst_user_item_feats_df_din_model, dense_fea,
                                                   sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)

Modeling

model = DIN(dnn_feature_columns, behavior_fea)

View model structure

model.summary()

Model compilation

model.compile('adam', 'binary_crossentropy', metrics=['binary_crossentropy', tf.keras.metrics.AUC()])
Model: "model"

Layer (type)                     Output Shape    Param #   Connected to
=======================================================================================================
user_id (InputLayer)             [(None, 1)]     0
click_article_id (InputLayer)    [(None, 1)]     0
category_id (InputLayer)         [(None, 1)]     0
click_environment (InputLayer)   [(None, 1)]     0
click_deviceGroup (InputLayer)   [(None, 1)]     0
click_os (InputLayer)            [(None, 1)]     0
click_country (InputLayer)       [(None, 1)]     0
click_region (InputLayer)        [(None, 1)]     0
click_referrer_type (InputLayer  [(None, 1)]     0
is_cat_hab (InputLayer)          [(None, 1)]     0
sparse_emb_user_id (Embedding)   (None, 1, 32)   1600032   user_id[0][0]
sparse_seq_emb_hist_click_artic  multiple        525664    click_article_id[0][0]
                                                           hist_click_article_id[0][0]
                                                           click_article_id[0][0]
sparse_emb_category_id (Embeddi  (None, 1, 32)   7776      category_id[0][0]
sparse_emb_click_environment (E  (None, 1, 32)   128       click_environment[0][0]
sparse_emb_click_deviceGroup (E  (None, 1, 32)   160       click_deviceGroup[0][0]
sparse_emb_click_os (Embedding)  (None, 1, 32)   288       click_os[0][0]
sparse_emb_click_country (Embed  (None, 1, 32)   384       click_country[0][0]
sparse_emb_click_region (Embedd  (None, 1, 32)   928       click_region[0][0]
sparse_emb_click_referrer_type   (None, 1, 32)   256       click_referrer_type[0][0]
sparse_emb_is_cat_hab (Embeddin  (None, 1, 32)   64        is_cat_hab[0][0]
no_mask (NoMask)                 (None, 1, 32)   0         sparse_emb_user_id[0][0]
                                                           sparse_seq_emb_hist_click_article
                                                           sparse_emb_category_id[0][0]
                                                           sparse_emb_click_environment[0][0
                                                           sparse_emb_click_deviceGroup[0][0
                                                           sparse_emb_click_os[0][0]
                                                           sparse_emb_click_country[0][0]
                                                           sparse_emb_click_region[0][0]
                                                           sparse_emb_click_referrer_type[0]
                                                           sparse_emb_is_cat_hab[0][0]
hist_click_article_id (InputLay  [(None, 50)]    0
concatenate (Concatenate)        (None, 1, 320)  0         no_mask[0][0]
                                                           no_mask[1][0]
                                                           no_mask[2][0]
                                                           no_mask[3][0]
                                                           no_mask[4][0]
                                                           no_mask[5][0]
                                                           no_mask[6][0]
                                                           no_mask[7][0]
                                                           no_mask[8][0]
                                                           no_mask[9][0]
no_mask_1 (NoMask)               (None, 1, 320)  0         concatenate[0][0]
attention_sequence_pooling_laye  (None, 1, 32)   13961     sparse_seq_emb_hist_click_article
                                                           sparse_seq_emb_hist_click_article
concatenate_1 (Concatenate)      (None, 1, 352)  0         no_mask_1[0][0]
                                                           attention_sequence_pooling_layer[
sim0 (InputLayer)                [(None, 1)]     0
time_diff0 (InputLayer)          [(None, 1)]     0
word_diff0 (InputLayer)          [(None, 1)]     0
sim_max (InputLayer)             [(None, 1)]     0
sim_min (InputLayer)             [(None, 1)]     0
sim_sum (InputLayer)             [(None, 1)]     0
sim_mean (InputLayer)            [(None, 1)]     0
score (InputLayer)               [(None, 1)]     0
rank (InputLayer)                [(None, 1)]     0
click_size (InputLayer)          [(None, 1)]     0
time_diff_mean (InputLayer)      [(None, 1)]     0
active_level (InputLayer)        [(None, 1)]     0
user_time_hob1 (InputLayer)      [(None, 1)]     0
user_time_hob2 (InputLayer)      [(None, 1)]     0
words_hbo (InputLayer)           [(None, 1)]     0
words_count (InputLayer)         [(None, 1)]     0
flatten (Flatten)                (None, 352)     0         concatenate_1[0][0]
no_mask_3 (NoMask)               (None, 1)       0         sim0[0][0]
                                                           time_diff0[0][0]
                                                           word_diff0[0][0]
                                                           sim_max[0][0]
                                                           sim_min[0][0]
                                                           sim_sum[0][0]
                                                           sim_mean[0][0]
                                                           score[0][0]
                                                           rank[0][0]
                                                           click_size[0][0]
                                                           time_diff_mean[0][0]
                                                           active_level[0][0]
                                                           user_time_hob1[0][0]
                                                           user_time_hob2[0][0]
                                                           words_hbo[0][0]
                                                           words_count[0][0]
no_mask_2 (NoMask)               (None, 352)     0         flatten[0][0]
concatenate_2 (Concatenate)      (None, 16)      0         no_mask_3[0][0]
                                                           no_mask_3[1][0]
                                                           no_mask_3[2][0]
                                                           no_mask_3[3][0]
                                                           no_mask_3[4][0]
                                                           no_mask_3[5][0]
                                                           no_mask_3[6][0]
                                                           no_mask_3[7][0]
                                                           no_mask_3[8][0]
                                                           no_mask_3[9][0]
                                                           no_mask_3[10][0]
                                                           no_mask_3[11][0]
                                                           no_mask_3[12][0]
                                                           no_mask_3[13][0]
                                                           no_mask_3[14][0]
                                                           no_mask_3[15][0]
flatten_1 (Flatten)              (None, 352)     0         no_mask_2[0][0]
flatten_2 (Flatten)              (None, 16)      0         concatenate_2[0][0]
no_mask_4 (NoMask)               multiple        0         flatten_1[0][0]
                                                           flatten_2[0][0]
concatenate_3 (Concatenate)      (None, 368)     0         no_mask_4[0][0]
                                                           no_mask_4[1][0]
dnn_1 (DNN)                      (None, 80)      89880     concatenate_3[0][0]
dense (Dense)                    (None, 1)       80        dnn_1[0][0]
prediction_layer (PredictionLay  (None, 1)       1         dense[0][0]
=======================================================================================================
Total params: 2,239,602
Trainable params: 2,239,362
Non-trainable params: 240

Model training

if offline:
    history = model.fit(x_trn, y_trn, verbose=1, epochs=10, validation_data=(x_val, y_val), batch_size=256)
else:
    # Alternatively, use the commented-out call below to hold out part of the training data as a validation set
    # history = model.fit(x_trn, y_trn, verbose=1, epochs=3, validation_split=0.3, batch_size=256)
    history = model.fit(x_trn, y_trn, verbose=1, epochs=2, batch_size=256)

Epoch 1/2
290964/290964 [==============================] - 55s 189us/sample - loss: 0.4209 - binary_crossentropy: 0.4206 - auc: 0.7842
Epoch 2/2
290964/290964 [==============================] - 52s 178us/sample - loss: 0.3630 - binary_crossentropy: 0.3618 - auc: 0.8478

Model prediction

# model.predict returns an array of shape (n, 1); take the first column
tst_user_item_feats_df_din_model['pred_score'] = model.predict(x_tst, verbose=1, batch_size=256)[:, 0]
tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'din_rank_score.csv', index=False)

500000/500000 [==============================] - 20s 39us/sample

Re-rank the predictions and generate the submission

rank_results = tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score']]
submit(rank_results, topk=5, model_name='din')

Five-fold cross-validation, where the five-fold cross-over is based on the user as the target for five-fold division

This part is separate from the previous separate training and verification

def get_kfold_users(trn_df, n=5):
    user_ids = trn_df['user_id'].unique()
    user_set = [user_ids[i::n] for i in range(n)]
    return user_set
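
To make the user-based split concrete, here is a small sketch on made-up data (the toy frame is an assumption for illustration only). It repeats the same strided slicing and checks that each user ends up in exactly one fold, so all candidate rows of a user stay together.

```python
import numpy as np
import pandas as pd

def get_kfold_users(trn_df, n=5):
    """Split the unique user ids into n disjoint groups by strided slicing."""
    user_ids = trn_df['user_id'].unique()
    return [user_ids[i::n] for i in range(n)]

# Toy frame: 7 users with 2 candidate articles each (made-up data).
toy_df = pd.DataFrame({
    'user_id': np.repeat(np.arange(7), 2),
    'click_article_id': np.arange(14),
})

folds = get_kfold_users(toy_df, n=3)
fold_sizes = [len(f) for f in folds]

# Every user should land in exactly one fold.
all_users = np.concatenate(folds)
no_overlap = len(all_users) == len(set(all_users.tolist())) == toy_df['user_id'].nunique()
print(fold_sizes, no_overlap)
```

Splitting by user rather than by row means a validation user's candidates never appear in that fold's training data.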

k_fold = 5
trn_df = trn_user_item_feats_df_din_model
user_set = get_kfold_users(trn_df, n=k_fold)

score_list = []
score_df = trn_df[['user_id', 'click_article_id', 'label']]
# Size the accumulator from the DIN test frame, which is the one predicted on below
sub_preds = np.zeros(tst_user_item_feats_df_din_model.shape[0])

dense_fea = [x for x in dense_fea if x != 'label']
x_tst, dnn_feature_columns = get_din_feats_columns(tst_user_item_feats_df_din_model, dense_fea,
                                                   sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)

Run the five-fold cross-validation and save the intermediate results for stacking

for n_fold, valid_user in enumerate(user_set):
    train_idx = trn_df[~trn_df['user_id'].isin(valid_user)]
    # Copy the validation slice because new columns are added to it below
    valid_idx = trn_df[trn_df['user_id'].isin(valid_user)].copy()

    # Prepare the training data
    x_trn, dnn_feature_columns = get_din_feats_columns(train_idx, dense_fea,
                                                       sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)
    y_trn = train_idx['label'].values

    # Prepare the validation data
    x_val, dnn_feature_columns = get_din_feats_columns(valid_idx, dense_fea,
                                                       sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)
    y_val = valid_idx['label'].values

    # Note: the same model object keeps training from fold to fold here;
    # re-create it at the start of each fold if strictly out-of-fold scores are needed
    history = model.fit(x_trn, y_trn, verbose=1, epochs=2, validation_data=(x_val, y_val), batch_size=256)

    # Predict on the validation fold
    valid_idx['pred_score'] = model.predict(x_val, verbose=1, batch_size=256)[:, 0]

    valid_idx = valid_idx.sort_values(by=['user_id', 'pred_score'])
    valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')

    # Collect the validation predictions in a list; they are concatenated after the loop
    score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])

    # For the online test, accumulate the test predictions of every fold; they are averaged afterwards
    if not offline:
        sub_preds += model.predict(x_tst, verbose=1, batch_size=256)[:, 0]

score_df_ = pd.concat(score_list, axis=0)
score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])
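
The out-of-fold pattern used here can be sketched on synthetic data. This is only an illustration under stated assumptions: a scikit-learn LogisticRegression stands in for DIN, the single feature `feat` and all data are made up, and a fresh model is created in every fold so that a user's validation scores come from a model that never saw that user.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: 20 users x 5 candidates, one dense feature.
n_users, n_cand = 20, 5
trn = pd.DataFrame({
    'user_id': np.repeat(np.arange(n_users), n_cand),
    'click_article_id': np.tile(np.arange(n_cand), n_users),
    'feat': rng.normal(size=n_users * n_cand),
})
trn['label'] = (trn['feat'] + rng.normal(scale=0.5, size=len(trn)) > 0).astype(int)
tst = pd.DataFrame({'feat': rng.normal(size=30)})

k_fold = 5
user_ids = trn['user_id'].unique()
user_set = [user_ids[i::k_fold] for i in range(k_fold)]

oof_parts = []
sub_preds = np.zeros(len(tst))
for valid_user in user_set:
    is_valid = trn['user_id'].isin(valid_user)
    train_part, valid_part = trn[~is_valid], trn[is_valid].copy()

    # A fresh model per fold (stand-in for rebuilding the DIN model).
    clf = LogisticRegression()
    clf.fit(train_part[['feat']], train_part['label'])

    # Out-of-fold scores for the held-out users.
    valid_part['pred_score'] = clf.predict_proba(valid_part[['feat']])[:, 1]
    oof_parts.append(valid_part[['user_id', 'click_article_id', 'pred_score']])

    # Average the test predictions over the folds.
    sub_preds += clf.predict_proba(tst[['feat']])[:, 1] / k_fold

oof = trn.merge(pd.concat(oof_parts), how='left', on=['user_id', 'click_article_id'])
```

Every training row receives exactly one out-of-fold score, which is what makes these scores usable as features for a second-level model.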

Save the new features generated by cross-validation of the training set

score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_din_cls_feats.csv', index=False)

The test-set predictions are averaged over the cross-validation folds. The averaged score and the corresponding rank feature are saved for the stacking step later; more features could also be constructed here.

tst_user_item_feats_df_din_model['pred_score'] = sub_preds / k_fold
tst_user_item_feats_df_din_model['pred_score'] = tst_user_item_feats_df_din_model['pred_score'].transform(lambda x: norm_sim(x))
# sort_values returns a new frame, so assign the result back
tst_user_item_feats_df_din_model = tst_user_item_feats_df_din_model.sort_values(by=['user_id', 'pred_score'])
tst_user_item_feats_df_din_model['pred_rank'] = tst_user_item_feats_df_din_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')
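
As a small illustration of the per-user rank feature, the sketch below applies the same `groupby(...).rank(ascending=False, method='first')` call to made-up scores. With `method='first'`, tied scores get distinct ranks in order of appearance, so every candidate of a user receives a unique rank.

```python
import pandas as pd

# Made-up scores: user 1 has a tie between articles 11 and 12.
scores = pd.DataFrame({
    'user_id':          [1, 1, 1, 2, 2],
    'click_article_id': [10, 11, 12, 10, 13],
    'pred_score':       [0.2, 0.9, 0.9, 0.4, 0.1],
})

# Rank 1 is the highest score within each user.
scores['pred_rank'] = (
    scores.groupby('user_id')['pred_score']
          .rank(ascending=False, method='first')
)
ranks = scores['pred_rank'].tolist()
print(scores)
```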

Save the new features of test set cross-validation

tst_user_item_feats_df_din_model[['user_id','click_article_id','pred_score','pred_rank']].to_csv(save_path +'tst_din_cls_feats.csv', index=False)

Model fusion

Weighted fusion

Read the sort result files of multiple models

lgb_ranker = pd.read_csv(save_path + 'lgb_ranker_score.csv')
lgb_cls = pd.read_csv(save_path + 'lgb_cls_score.csv')
din_ranker = pd.read_csv(save_path + 'din_rank_score.csv')

The test-set outputs from cross-validation could be used here instead for the weighted fusion.

rank_model = {'lgb_ranker': lgb_ranker,
              'lgb_cls': lgb_cls,
              'din_ranker': din_ranker}

def get_ensumble_predict_topk(rank_model, topk=5):
    # Normalise the LGB ranker scores before combining them with the other two models
    rank_model['lgb_ranker']['pred_score'] = rank_model['lgb_ranker']['pred_score'].transform(lambda x: norm_sim(x))

    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    final_recall = pd.concat([rank_model['lgb_cls'], rank_model['din_ranker'], rank_model['lgb_ranker']])
    # Sum the three scores per (user, article) pair, i.e. every model gets the same weight
    final_recall = final_recall.groupby(['user_id', 'click_article_id'])['pred_score'].sum().reset_index()

    submit(final_recall, topk=topk, model_name='ensemble_fuse')

get_ensumble_predict_topk(rank_model)
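
The fusion above adds the scores with equal weights. A minimal sketch of fusion with explicit per-model weights is given below. The helper names `min_max` and `weighted_fuse`, the weights 0.6 and 0.4, and the toy scores are all assumptions for illustration; `min_max` is a plain min-max scaler and is not necessarily the same as the `norm_sim` helper used in this post.

```python
import pandas as pd

def min_max(scores):
    """Scale a score column to [0, 1]; a constant column maps to all zeros."""
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else scores * 0.0

def weighted_fuse(model_scores, weights):
    """Sum weight * normalised score per (user_id, click_article_id)."""
    parts = []
    for name, df in model_scores.items():
        part = df[['user_id', 'click_article_id']].copy()
        part['pred_score'] = weights[name] * min_max(df['pred_score'])
        parts.append(part)
    fused = pd.concat(parts, ignore_index=True)
    return (fused.groupby(['user_id', 'click_article_id'])['pred_score']
                 .sum().reset_index())

# Made-up scores for one user and two articles.
toy = {
    'lgb_ranker': pd.DataFrame({'user_id': [1, 1], 'click_article_id': [10, 11],
                                'pred_score': [2.0, -1.0]}),
    'lgb_cls':    pd.DataFrame({'user_id': [1, 1], 'click_article_id': [10, 11],
                                'pred_score': [0.3, 0.7]}),
}
fused = weighted_fuse(toy, weights={'lgb_ranker': 0.6, 'lgb_cls': 0.4})
print(fused)
```

In practice the weights would be tuned on offline validation results rather than fixed by hand.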

Stacking

Read the result file generated by cross-validation of multiple models

Training set

trn_lgb_ranker_feats = pd.read_csv(save_path + 'trn_lgb_ranker_feats.csv')
trn_lgb_cls_feats = pd.read_csv(save_path + 'trn_lgb_cls_feats.csv')
trn_din_cls_feats = pd.read_csv(save_path + 'trn_din_cls_feats.csv')

Test set

tst_lgb_ranker_feats = pd.read_csv(save_path + 'tst_lgb_ranker_feats.csv')
tst_lgb_cls_feats = pd.read_csv(save_path + 'tst_lgb_cls_feats.csv')
tst_din_cls_feats = pd.read_csv(save_path + 'tst_din_cls_feats.csv')

Combine features output from multiple models

# Copy the key columns so that new columns can be added without a SettingWithCopyWarning
finall_trn_ranker_feats = trn_lgb_ranker_feats[['user_id', 'click_article_id', 'label']].copy()
finall_tst_ranker_feats = tst_lgb_ranker_feats[['user_id', 'click_article_id']].copy()

for idx, trn_model in enumerate([trn_lgb_ranker_feats, trn_lgb_cls_feats, trn_din_cls_feats]):
    for feat in ['pred_score', 'pred_rank']:
        col_name = feat + '_' + str(idx)
        finall_trn_ranker_feats[col_name] = trn_model[feat]

for idx, tst_model in enumerate([tst_lgb_ranker_feats, tst_lgb_cls_feats, tst_din_cls_feats]):
    for feat in ['pred_score', 'pred_rank']:
        col_name = feat + '_' + str(idx)
        finall_tst_ranker_feats[col_name] = tst_model[feat]
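
The column assignment above lines rows up by index, so it assumes that every model file lists the same (user, article) rows in the same order. If that is not guaranteed, one option is to match on the key columns instead. The sketch below shows this on made-up data; the helper name `combine_model_feats` is hypothetical.

```python
import pandas as pd

keys = ['user_id', 'click_article_id']

def combine_model_feats(base_df, model_dfs):
    """Attach pred_score / pred_rank of each model, matched on the key columns."""
    out = base_df.copy()
    for idx, model_df in enumerate(model_dfs):
        renamed = model_df[keys + ['pred_score', 'pred_rank']].rename(
            columns={'pred_score': f'pred_score_{idx}',
                     'pred_rank': f'pred_rank_{idx}'})
        out = out.merge(renamed, how='left', on=keys)
    return out

model_a = pd.DataFrame({'user_id': [1, 1], 'click_article_id': [10, 11],
                        'pred_score': [0.8, 0.2], 'pred_rank': [1.0, 2.0]})
# Same rows in a different order, as could happen if one file was sorted differently.
model_b = pd.DataFrame({'user_id': [1, 1], 'click_article_id': [11, 10],
                        'pred_score': [0.1, 0.9], 'pred_rank': [2.0, 1.0]})

combined = combine_model_feats(model_a[keys], [model_a, model_b])
print(combined)
```

With the merge, article 10 gets the score 0.9 from the second model even though it sits in the second row of that file.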

Define a logistic regression model that is fitted on the features generated by cross-validation and then used to predict the test set

Note that during cross-validation you can construct more features from the predicted values to enrich the inputs of this simple model.

from sklearn.linear_model import LogisticRegression

feat_cols = ['pred_score_0', 'pred_rank_0', 'pred_score_1', 'pred_rank_1', 'pred_score_2', 'pred_rank_2']

trn_x = finall_trn_ranker_feats[feat_cols]
trn_y = finall_trn_ranker_feats['label']

tst_x = finall_tst_ranker_feats[feat_cols]

Define the model

lr = LogisticRegression()

Model training

lr.fit(trn_x, trn_y)

Model prediction

finall_tst_ranker_feats['pred_score'] = lr.predict_proba(tst_x)[:, 1]

Reorder prediction results and generate submission results

rank_results = finall_tst_ranker_feats[['user_id','click_article_id','pred_score']]
submit(rank_results, topk=5, model_name='ensumble_staking')
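
Since the stacking features mix scores and ranks that live on different scales, one possible variation is to standardise them before the logistic regression. The sketch below shows this on synthetic data; the scaling step, the two feature columns, and all values are assumptions for illustration, not part of the original pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200
label = rng.integers(0, 2, size=n)

# Synthetic base-model outputs: scores loosely correlated with the label.
trn = pd.DataFrame({
    'pred_score_0': label * 0.5 + rng.normal(scale=0.5, size=n),
    'pred_score_1': label * 0.3 + rng.normal(scale=0.5, size=n),
    'label': label,
})
feat_cols = ['pred_score_0', 'pred_score_1']

# Second-level model: standardise the inputs, then fit a logistic regression.
meta = make_pipeline(StandardScaler(), LogisticRegression())
meta.fit(trn[feat_cols], trn['label'])

# Two made-up test rows: one scored high by both base models, one scored low.
tst = pd.DataFrame({'pred_score_0': [1.5, -1.0], 'pred_score_1': [1.0, -0.8]})
tst['pred_score'] = meta.predict_proba(tst[feat_cols])[:, 1]
print(tst)
```

The pipeline keeps the scaler and the classifier together, so the test set is scaled with the statistics learned on the training set.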

Summary

This chapter covered three ranking models: the LGB ranker, the LGB classifier, and the deep learning model DIN. We did not go into the principles behind these three models in detail; please explore them on your own after class. You are also welcome to share what you find, so that we can learn and improve together. Finally, we applied two simple model fusion strategies: weighted fusion and stacking.

About Datawhale: Datawhale is an open source organization focused on data science and AI. It brings together learners from many universities and well-known companies, along with team members who share an open source and exploratory spirit. With the vision "for the learner, grow with learners", Datawhale encourages genuine self-expression, openness and tolerance, mutual trust and mutual help, the courage to try and make mistakes, and the courage to take responsibility. Datawhale also applies the open source idea to content, learning, and solutions in order to support talent development and to connect people with each other, with knowledge, with companies, and with the future. The material for this data mining learning path is shared on Tianchi. For details, follow Datawhale.


Origin blog.csdn.net/m0_49978528/article/details/110732169