LGB's ranking model
LGB's classification model
Deep learning classification model DIN
Two classic model fusion methods are also covered:
Weighted fusion
Stacking (a simple model is trained to predict from the outputs of the base models). I feel that the method used here is really a statistical, combined averaging of the outputs.
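As a rough illustration of weighted fusion, the sketch below rescales each model's scores to [0, 1] and adds them up with one weight per model. The helper names `min_max` and `weighted_fusion`, the toy scores and the weights 0.7/0.3 are assumptions made for this example, not values taken from the tutorial.

```python
import pandas as pd

def min_max(scores: pd.Series) -> pd.Series:
    """Rescale a score column to [0, 1]; a constant column maps to 1.0."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        return pd.Series(1.0, index=scores.index)
    return (scores - lo) / (hi - lo)

def weighted_fusion(score_dfs, weights):
    """Sum the normalized scores per (user_id, click_article_id), one weight per model."""
    parts = []
    for df, w in zip(score_dfs, weights):
        part = df[['user_id', 'click_article_id']].copy()
        part['pred_score'] = min_max(df['pred_score']) * w
        parts.append(part)
    fused = pd.concat(parts, ignore_index=True)
    return fused.groupby(['user_id', 'click_article_id'], as_index=False)['pred_score'].sum()

# Toy scores from two hypothetical models for one user and two candidate articles.
model_a = pd.DataFrame({'user_id': [1, 1], 'click_article_id': [10, 20], 'pred_score': [0.2, 0.8]})
model_b = pd.DataFrame({'user_id': [1, 1], 'click_article_id': [10, 20], 'pred_score': [3.0, 1.0]})
fused = weighted_fusion([model_a, model_b], weights=[0.7, 0.3])
```

Normalizing first matters because a ranker's raw scores and a classifier's probabilities live on different scales; without it the weights would not mean what they appear to mean.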
import numpy as np
import pandas as pd
import pickle
from tqdm import tqdm
import gc, os
import time
from datetime import datetime
import lightgbm as lgb
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings('ignore')
Read the ranking features
data_path = './data_raw/'
save_path = './temp_results/'
offline = False
When the data is re-read, click_article_id comes back as a floating-point number, so it is converted to int
trn_user_item_feats_df = pd.read_csv(save_path + 'trn_user_item_feats_df.csv')
trn_user_item_feats_df['click_article_id'] = trn_user_item_feats_df['click_article_id'].astype(int)
if offline:
    val_user_item_feats_df = pd.read_csv(save_path + 'val_user_item_feats_df.csv')
    val_user_item_feats_df['click_article_id'] = val_user_item_feats_df['click_article_id'].astype(int)
else:
    val_user_item_feats_df = None
tst_user_item_feats_df = pd.read_csv(save_path + 'tst_user_item_feats_df.csv')
tst_user_item_feats_df['click_article_id'] = tst_user_item_feats_df['click_article_id'].astype(int)
For convenience, the test set was also given a placeholder label during feature engineering; simply delete it here.
del tst_user_item_feats_df['label']
Return the ranked results
def submit(recall_df, topk=5, model_name=None):
    recall_df = recall_df.sort_values(by=['user_id', 'pred_score'])
    recall_df['rank'] = recall_df.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')
    # Check that every user has at least topk candidate articles
    tmp = recall_df.groupby('user_id').apply(lambda x: x['rank'].max())
    assert tmp.min() >= topk
    del recall_df['pred_score']
    submit = recall_df[recall_df['rank'] <= topk].set_index(['user_id', 'rank']).unstack(-1).reset_index()
    submit.columns = [int(col) if isinstance(col, int) else col for col in submit.columns.droplevel(0)]
    # Name the columns according to the submission format
    submit = submit.rename(columns={'': 'user_id', 1: 'article_1', 2: 'article_2',
                                    3: 'article_3', 4: 'article_4', 5: 'article_5'})
    save_name = save_path + model_name + '_' + datetime.today().strftime('%m-%d') + '.csv'
    submit.to_csv(save_name, index=False, header=True)
Normalization of the ranking scores
def norm_sim(sim_df, weight=0.0):
    # Min-max scale a score Series to [0, 1], then shift it by `weight`
    min_sim = sim_df.min()
    max_sim = sim_df.max()
    if max_sim == min_sim:
        sim_df = sim_df.apply(lambda sim: 1.0)
    else:
        sim_df = sim_df.apply(lambda sim: 1.0 * (sim - min_sim) / (max_sim - min_sim))
    sim_df = sim_df.apply(lambda sim: sim + weight)  # add the offset
    return sim_df
LGB ranking model
Work on copies so the data does not have to be re-read if a later step fails
trn_user_item_feats_df_rank_model = trn_user_item_feats_df.copy()
if offline:
    val_user_item_feats_df_rank_model = val_user_item_feats_df.copy()
tst_user_item_feats_df_rank_model = tst_user_item_feats_df.copy()
Define the feature columns
lgb_cols = ['sim0', 'time_diff0', 'word_diff0', 'sim_max', 'sim_min', 'sim_sum',
            'sim_mean', 'score', 'click_size', 'time_diff_mean', 'active_level',
            'click_environment', 'click_deviceGroup', 'click_os', 'click_country',
            'click_region', 'click_referrer_type', 'user_time_hob1', 'user_time_hob2',
            'words_hbo', 'category_id', 'created_at_ts', 'words_count']
Grouping for the ranking model (number of candidate rows per user, in row order)
trn_user_item_feats_df_rank_model.sort_values(by=['user_id'], inplace=True)
g_train = trn_user_item_feats_df_rank_model.groupby(['user_id'], as_index=False).count()["label"].values
if offline:
    val_user_item_feats_df_rank_model.sort_values(by=['user_id'], inplace=True)
    g_val = val_user_item_feats_df_rank_model.groupby(['user_id'], as_index=False).count()["label"].values
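To make the grouping step concrete, here is a small sketch on a made-up candidate table. It shows why the rows are sorted by user first: the group array lists how many consecutive rows belong to each query (user), so the sizes must come out in the same order as the rows. The toy frame and the use of `size()` (which counts rows, whereas the code above counts non-null labels) are assumptions of this example.

```python
import pandas as pd

# Toy candidate table: user 7 has 3 candidates, user 3 has 2.
toy = pd.DataFrame({
    'user_id': [7, 3, 7, 3, 7],
    'click_article_id': [1, 2, 3, 4, 5],
    'label': [0, 1, 1, 0, 0],
})

# Rows of one user must be contiguous, so sort by user first.
toy = toy.sort_values('user_id', kind='stable').reset_index(drop=True)

# groupby sorts its keys by default, so the sizes follow the order of the sorted rows.
group_sizes = toy.groupby('user_id', sort=True).size().to_numpy()

# The group sizes must add up to the number of rows passed to the ranker.
total_rows = int(group_sizes.sum())
```

If the feature frame passed to `fit` is not the sorted one, the group boundaries no longer line up with the users, which silently degrades the ranker.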
Ranking model definition
lgb_ranker = lgb.LGBMRanker(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,
    max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,
    learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs=16)
Ranking model training
# Note: LightGBM 4.0+ no longer accepts early_stopping_rounds in fit(); pass callbacks=[lgb.early_stopping(50)] there.
if offline:
    lgb_ranker.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'], group=g_train,
        eval_set=[(val_user_item_feats_df_rank_model[lgb_cols], val_user_item_feats_df_rank_model['label'])],
        eval_group=[g_val], eval_at=[1, 2, 3, 4, 5], eval_metric=['ndcg'], early_stopping_rounds=50)
else:
    # Train on the sorted copy so that the rows line up with g_train
    lgb_ranker.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'], group=g_train)
Model prediction
tst_user_item_feats_df['pred_score'] = lgb_ranker.predict(tst_user_item_feats_df[lgb_cols], num_iteration=lgb_ranker.best_iteration_)
Save a copy of the ranking scores here for the model fusion later on
tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'lgb_ranker_score.csv', index=False)
Re-rank the predictions and generate the submission file
rank_results = tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].copy()
rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)
submit(rank_results, topk=5, model_name='lgb_ranker')
Five-fold cross-validation, with the folds split by user
This part is independent of the single train/validation run above
def get_kfold_users(trn_df, n=5):
    user_ids = trn_df['user_id'].unique()
    user_set = [user_ids[i::n] for i in range(n)]
    return user_set
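The strided slice `user_ids[i::n]` is what assigns users to folds: fold i takes users i, i+n, i+2n and so on. The sketch below replays that on a toy array. It uses `np.unique` (sorted order) rather than the order-of-appearance `unique()` of a pandas column, which is an assumption made to keep the example NumPy-only.

```python
import numpy as np

def kfold_users(user_ids, n=5):
    """Split the distinct users into n disjoint folds by strided slicing."""
    user_ids = np.unique(user_ids)
    return [user_ids[i::n] for i in range(n)]

# A user column with repeats, as in a user-item candidate table.
user_col = np.array([5, 1, 1, 9, 3, 3, 7, 5, 2, 8, 4])
folds = kfold_users(user_col, n=3)

# Every distinct user lands in exactly one fold.
all_users = np.sort(np.concatenate(folds))
```

Splitting by user rather than by row keeps all candidates of one user in the same fold, so the validation users are never seen during training of that fold.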
k_fold = 5
trn_df = trn_user_item_feats_df_rank_model
user_set = get_kfold_users(trn_df, n=k_fold)
score_list = []
score_df = trn_df[['user_id', 'click_article_id', 'label']]
sub_preds = np.zeros(tst_user_item_feats_df_rank_model.shape[0])
Five-fold cross-validation, saving the intermediate results for stacking
for n_fold, valid_user in enumerate(user_set):
    train_idx = trn_df[~trn_df['user_id'].isin(valid_user)].copy()  # add slide user
    valid_idx = trn_df[trn_df['user_id'].isin(valid_user)].copy()
    # Group sizes per user for the training and validation sets
    train_idx.sort_values(by=['user_id'], inplace=True)
    g_train = train_idx.groupby(['user_id'], as_index=False).count()["label"].values
    valid_idx.sort_values(by=['user_id'], inplace=True)
    g_val = valid_idx.groupby(['user_id'], as_index=False).count()["label"].values
    # Define the model
    lgb_ranker = lgb.LGBMRanker(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,
        max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,
        learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs=16)
    # Train the model (LightGBM 4.0+: use callbacks=[lgb.early_stopping(50)] instead of early_stopping_rounds)
    lgb_ranker.fit(train_idx[lgb_cols], train_idx['label'], group=g_train,
        eval_set=[(valid_idx[lgb_cols], valid_idx['label'])], eval_group=[g_val],
        eval_at=[1, 2, 3, 4, 5], eval_metric=['ndcg'], early_stopping_rounds=50)
    # Predict on the validation fold
    valid_idx['pred_score'] = lgb_ranker.predict(valid_idx[lgb_cols], num_iteration=lgb_ranker.best_iteration_)
    # Normalize the output scores
    valid_idx['pred_score'] = norm_sim(valid_idx['pred_score'])
    valid_idx = valid_idx.sort_values(by=['user_id', 'pred_score'])
    valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')
    # Collect the validation predictions in a list; they are concatenated afterwards
    score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])
    # For the online test, accumulate the test predictions of every fold and average them at the end
    if not offline:
        # num_iteration must be passed by keyword; the second positional argument of predict() is raw_score
        sub_preds += lgb_ranker.predict(tst_user_item_feats_df_rank_model[lgb_cols], num_iteration=lgb_ranker.best_iteration_)
score_df_ = pd.concat(score_list, axis=0)
score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])
Save the new features produced by cross-validation on the training set
score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_lgb_ranker_feats.csv', index=False)
The test-set predictions are averaged over the folds, and the predicted score and the corresponding rank are saved as features for the later stacking step; more features could be constructed here.
tst_user_item_feats_df_rank_model['pred_score'] = sub_preds / k_fold
tst_user_item_feats_df_rank_model['pred_score'] = norm_sim(tst_user_item_feats_df_rank_model['pred_score'])
tst_user_item_feats_df_rank_model = tst_user_item_feats_df_rank_model.sort_values(by=['user_id', 'pred_score'])
tst_user_item_feats_df_rank_model['pred_rank'] = tst_user_item_feats_df_rank_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')
Save the new cross-validation features of the test set
tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv(save_path + 'tst_lgb_ranker_feats.csv', index=False)
Re-rank the predictions and generate the submission file
Submission from the single model
rank_results = tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score']].copy()
rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)
submit(rank_results, topk=5, model_name='lgb_ranker')
LGB classification model
Definition of the model and its parameters
lgb_Classfication = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,
    max_depth=-1, n_estimators=500, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,
    learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs=16, verbose=10)
Model training
if offline:
    lgb_Classfication.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'],
        eval_set=[(val_user_item_feats_df_rank_model[lgb_cols], val_user_item_feats_df_rank_model['label'])],
        eval_metric=['auc'], early_stopping_rounds=50)
else:
    lgb_Classfication.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'])
Model prediction
tst_user_item_feats_df['pred_score'] = lgb_Classfication.predict_proba(tst_user_item_feats_df[lgb_cols])[:, 1]
Save a copy of the ranking scores here for the model fusion later on
tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'lgb_cls_score.csv', index=False)
Re-rank the predictions and generate the submission file
rank_results = tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].copy()
rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)
submit(rank_results, topk=5, model_name='lgb_cls')
Five-fold cross-validation, with the folds split by user
This part is independent of the single train/validation run above
def get_kfold_users(trn_df, n=5):
    user_ids = trn_df['user_id'].unique()
    user_set = [user_ids[i::n] for i in range(n)]
    return user_set
k_fold = 5
trn_df = trn_user_item_feats_df_rank_model
user_set = get_kfold_users(trn_df, n=k_fold)
score_list = []
score_df = trn_df[['user_id', 'click_article_id', 'label']]
sub_preds = np.zeros(tst_user_item_feats_df_rank_model.shape[0])
Five-fold cross-validation, saving the intermediate results for stacking
for n_fold, valid_user in enumerate(user_set):
    train_idx = trn_df[~trn_df['user_id'].isin(valid_user)].copy()  # add slide user
    valid_idx = trn_df[trn_df['user_id'].isin(valid_user)].copy()
    # Definition of the model and its parameters
    lgb_Classfication = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,
        max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,
        learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs=16, verbose=10)
    # Train the model (LightGBM 4.0+: use callbacks=[lgb.early_stopping(50)] instead of early_stopping_rounds)
    lgb_Classfication.fit(train_idx[lgb_cols], train_idx['label'], eval_set=[(valid_idx[lgb_cols], valid_idx['label'])],
        eval_metric=['auc'], early_stopping_rounds=50)
    # Predict on the validation fold
    valid_idx['pred_score'] = lgb_Classfication.predict_proba(valid_idx[lgb_cols],
        num_iteration=lgb_Classfication.best_iteration_)[:, 1]
    # No normalization here: the classifier already outputs a probability
    # valid_idx['pred_score'] = norm_sim(valid_idx['pred_score'])
    valid_idx = valid_idx.sort_values(by=['user_id', 'pred_score'])
    valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')
    # Collect the validation predictions in a list; they are concatenated afterwards
    score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])
    # For the online test, accumulate the test predictions of every fold and average them at the end
    if not offline:
        sub_preds += lgb_Classfication.predict_proba(tst_user_item_feats_df_rank_model[lgb_cols],
            num_iteration=lgb_Classfication.best_iteration_)[:, 1]
score_df_ = pd.concat(score_list, axis=0)
score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])
Save the new features produced by cross-validation on the training set
score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_lgb_cls_feats.csv', index=False)
The test-set predictions are averaged over the folds, and the predicted score and the corresponding rank are saved as features for the later stacking step; more features could be constructed here.
tst_user_item_feats_df_rank_model['pred_score'] = sub_preds / k_fold
tst_user_item_feats_df_rank_model['pred_score'] = norm_sim(tst_user_item_feats_df_rank_model['pred_score'])
tst_user_item_feats_df_rank_model = tst_user_item_feats_df_rank_model.sort_values(by=['user_id', 'pred_score'])
tst_user_item_feats_df_rank_model['pred_rank'] = tst_user_item_feats_df_rank_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')
Save the new cross-validation features of the test set
tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv(save_path + 'tst_lgb_cls_feats.csv', index=False)
Re-rank the predictions and generate the submission file
rank_results = tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score']].copy()
rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)
submit(rank_results, topk=5, model_name='lgb_cls')
DIN model
User's historical click behavior list
This is prepared for the DIN model below
if offline:
    all_data = pd.read_csv('./data_raw/train_click_log.csv')
else:
    trn_data = pd.read_csv('./data_raw/train_click_log.csv')
    tst_data = pd.read_csv('./data_raw/testA_click_log.csv')
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    all_data = pd.concat([trn_data, tst_data])
# Collect each user's clicked articles into one list per user
hist_click = all_data[['user_id', 'click_article_id']].groupby('user_id').agg(list).reset_index()
his_behavior_df = pd.DataFrame()
his_behavior_df['user_id'] = hist_click['user_id']
his_behavior_df['hist_click_article_id'] = hist_click['click_article_id']
trn_user_item_feats_df_din_model = trn_user_item_feats_df.copy()
if offline:
    val_user_item_feats_df_din_model = val_user_item_feats_df.copy()
else:
    val_user_item_feats_df_din_model = None
tst_user_item_feats_df_din_model = tst_user_item_feats_df.copy()
trn_user_item_feats_df_din_model = trn_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')
if offline:
    val_user_item_feats_df_din_model = val_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')
else:
    val_user_item_feats_df_din_model = None
tst_user_item_feats_df_din_model = tst_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')
Introduction to the DIN model
Next we try the DIN model. DIN stands for Deep Interest Network, a model proposed by Alibaba in 2018 to address the inability of earlier deep learning models to express users' diverse interests. It computes a representation vector of the user's interest from the relevance between a given candidate ad and the user's historical behavior. Specifically, it introduces a local activation unit that soft-searches the relevant parts of the historical behavior to focus on the relevant interests, and takes a weighted sum to obtain the user-interest representation with respect to the candidate ad. Behaviors that are more relevant to the candidate ad receive higher activation weights and dominate the representation. Because this vector differs from ad to ad, the expressive power of the model improves considerably. The model is therefore also well suited to news recommendation: here we compute the user's interest in an article from the relevance between the current candidate article and the articles the user has clicked in the past. The structure of the model is as follows:
[Figure: structure of the DIN model]
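The weighted-sum idea behind the local activation unit can be sketched in a few lines of NumPy. This is a simplified illustration, not DIN's actual layer: the relevance score here is a plain dot product, whereas DIN learns it with a small MLP, and the function name `attention_pooling` and the toy embeddings are assumptions of this example. The weights are left unnormalized, in line with the `att_weight_normalization=False` default.

```python
import numpy as np

def attention_pooling(candidate, history, mask):
    """Weighted sum of history embeddings, weighted by relevance to the candidate.

    candidate: (d,)   embedding of the candidate article
    history:   (T, d) embeddings of the clicked articles, padded to length T
    mask:      (T,)   1.0 for real clicks, 0.0 for padding
    """
    # Simplified relevance score: dot product (DIN uses a learned MLP here).
    scores = history @ candidate
    # Padded positions must not contribute to the user interest vector.
    weights = scores * mask
    return weights @ history

history = np.array([[1.0, 0.0],   # a click on a "topic A" article
                    [0.0, 1.0],   # a click on a "topic B" article
                    [0.0, 0.0]])  # padding
mask = np.array([1.0, 1.0, 0.0])

# The same history yields a different interest vector for each candidate.
interest_for_a = attention_pooling(np.array([1.0, 0.0]), history, mask)
interest_for_b = attention_pooling(np.array([0.0, 1.0]), history, mask)
```

The two results differ even though the click history is identical, which is the point of the model: the user representation depends on the candidate being scored.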
Here we simply call an existing package to use the model; its details will be covered in the next installment of the recommendation-system team study. As for how to use it, the function prototype in deepctr is as follows:
def DIN(dnn_feature_columns, history_feature_list, dnn_use_bn=False,
        dnn_hidden_units=(200, 80), dnn_activation='relu', att_hidden_size=(80, 40), att_activation="dice",
        att_weight_normalization=False, l2_reg_dnn=0, l2_reg_embedding=1e-6, dnn_dropout=0, seed=1024,
        task='binary'):
dnn_feature_columns: feature columns, a list containing all features of the data
history_feature_list: user history behavior columns, a list of the features that reflect the user's historical behavior
dnn_use_bn: whether to use BatchNormalization
dnn_hidden_units: a list or tuple giving the number of layers of the fully connected network and the number of neurons in each layer
dnn_activation: the activation function of the fully connected network
att_hidden_size: the number of layers of the attention layer's fully connected network and the number of neurons in each layer
att_activation: the activation function of the attention layer
att_weight_normalization: whether to normalize the attention scores
l2_reg_dnn: the regularization coefficient of the fully connected network
l2_reg_embedding: the regularization coefficient of the embedding vectors
dnn_dropout: the dropout probability of the neurons of the fully connected network
task: the task type, either classification or regression
To use it, we must pass in the feature columns and the historical behavior columns, but the feature columns need to be preprocessed first. The details are as follows:
First, we process the dataset to obtain the data. Since we predict whether the user clicks the current article from the user's past behavior, we split the feature columns into three parts: numerical features, discrete features and historical behavior features. The DIN model handles each part differently.
For discrete features, i.e. the categorical features in our dataset such as user_id, we first apply an embedding to obtain a low-dimensional dense representation of each feature. Because of the embedding, we need to build a vocabulary for the values of each categorical column and specify the embedding dimension. So when preparing data for deepctr's DIN model, we declare these categorical features with SparseFeat, whose arguments are the column name, the number of unique values in the column (used to build the vocabulary) and the embedding dimension.
For the user's historical behavior feature columns, such as article id or article category, we also apply an embedding first. The difference is that, once we have the embedding of each item, an attention layer computes the relevance between the user's historical behavior and the current candidate article to produce the embedding vector of the current user. This vector reflects the user's interest through the similarity between the candidate article and the articles the user clicked in the past, and it changes with the user's click history, which dynamically models how the user's interest evolves. A feature of this type is a sequence of historical behavior per user, and its length differs from user to user: some users have clicked many articles, others only a few. The lengths therefore have to be unified. When preparing data for the DIN model, we first declare these categorical features with SparseFeat and then wrap them with VarLenSparseFeat, which pads the sequences so that every user's history has the same length; this is why it takes a maxlen argument giving the maximum sequence length.
For continuous feature columns, we only need DenseFeat with the column name and the dimension.
After processing the feature columns, we match the data to the columns to obtain the final data.
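The length-unification step for the history sequences can be illustrated without any deep learning dependency. The sketch below pads at the end and, for sequences that are too long, keeps the most recent items, which to my understanding matches what Keras `pad_sequences` does with `padding='post'` and its default `truncating='pre'`. The helper name `pad_post` and the toy histories are assumptions of this example.

```python
import numpy as np

def pad_post(sequences, max_len, value=0):
    """Return a (num_users, max_len) int array: truncate from the front, pad at the end."""
    out = np.full((len(sequences), max_len), value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        kept = list(seq)[-max_len:]  # keep the most recent max_len clicks
        if kept:
            out[row, :len(kept)] = kept
    return out

# Click histories of three hypothetical users: short, too long, and empty.
hist = [[11, 12], [21, 22, 23, 24, 25], []]
padded = pad_post(hist, max_len=4)
```

Because 0 is used as the padding value, an article whose id is 0 would be indistinguishable from padding, so in practice the ids are often shifted to start at 1.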
Let's look at the actual code. The logic is as follows: first we write a data preparation function that follows the steps above and returns the data and the feature columns; then we build and train the DIN model; finally we run the trained model on the test set.
Import deepctr
from deepctr.models import DIN
from deepctr.feature_column import SparseFeat, VarLenSparseFeat, DenseFeat, get_feature_names
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import backend as K
from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from tensorflow.keras.callbacks import *
import tensorflow as tf
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
Data preparation function
def get_din_feats_columns(df, dense_fea, sparse_fea, behavior_fea, hist_behavior_fea, emb_dim=32, max_len=100):
    """
    Data preparation function:
    df: dataset
    dense_fea: numerical feature columns
    sparse_fea: discrete feature columns
    behavior_fea: the user's candidate behavior feature columns
    hist_behavior_fea: the user's historical behavior feature columns
    emb_dim: embedding dimension; for simplicity, all discrete feature columns share the same dimension
    max_len: the maximum length of the user sequence
    """
    sparse_feature_columns = [SparseFeat(feat, vocabulary_size=df[feat].nunique() + 1, embedding_dim=emb_dim) for feat in sparse_fea]
    dense_feature_columns = [DenseFeat(feat, 1, ) for feat in dense_fea]
    var_feature_columns = [VarLenSparseFeat(SparseFeat(feat, vocabulary_size=df['click_article_id'].nunique() + 1,
        embedding_dim=emb_dim, embedding_name='click_article_id'), maxlen=max_len) for feat in hist_behavior_fea]
    dnn_feature_columns = sparse_feature_columns + dense_feature_columns + var_feature_columns
    # Build x as a dict that maps each feature name to its array
    x = {}
    for name in get_feature_names(dnn_feature_columns):
        if name in hist_behavior_fea:
            # This is a historical behavior sequence
            his_list = [l for l in df[name]]
            x[name] = pad_sequences(his_list, maxlen=max_len, padding='post')  # 2-D array
        else:
            x[name] = df[name].values
    return x, dnn_feature_columns
Separate the features by type
sparse_fea = ['user_id', 'click_article_id', 'category_id', 'click_environment', 'click_deviceGroup',
              'click_os', 'click_country', 'click_region', 'click_referrer_type', 'is_cat_hab']
behavior_fea = ['click_article_id']
hist_behavior_fea = ['hist_click_article_id']
dense_fea = ['sim0', 'time_diff0', 'word_diff0', 'sim_max', 'sim_min', 'sim_sum', 'sim_mean', 'score',
             'rank', 'click_size', 'time_diff_mean', 'active_level', 'user_time_hob1', 'user_time_hob2',
             'words_hbo', 'words_count']
Normalize the dense features; neural network training needs the values to be on a common scale
mm = MinMaxScaler()
The following is some special handling: if invalid values (inf) appear in the features, normalization fails unless they are handled first. You can comment this out at first and run the code below;
if an error occurs, first work out how to avoid the inf values
trn_user_item_feats_df_din_model.replace([np.inf, -np.inf], 0, inplace=True)
tst_user_item_feats_df_din_model.replace([np.inf, -np.inf], 0, inplace=True)
# Note: each split is scaled independently here; fitting on the training set and
# calling mm.transform on the other splits would keep the scales consistent.
for feat in dense_fea:
    trn_user_item_feats_df_din_model[feat] = mm.fit_transform(trn_user_item_feats_df_din_model[[feat]])
    if val_user_item_feats_df_din_model is not None:
        val_user_item_feats_df_din_model[feat] = mm.fit_transform(val_user_item_feats_df_din_model[[feat]])
    tst_user_item_feats_df_din_model[feat] = mm.fit_transform(tst_user_item_feats_df_din_model[[feat]])
Prepare the training data
x_trn, dnn_feature_columns = get_din_feats_columns(trn_user_item_feats_df_din_model, dense_fea,
    sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)
y_trn = trn_user_item_feats_df_din_model['label'].values
if offline:
    # Prepare the validation data
    x_val, dnn_feature_columns = get_din_feats_columns(val_user_item_feats_df_din_model, dense_fea,
        sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)
    y_val = val_user_item_feats_df_din_model['label'].values
dense_fea = [x for x in dense_fea if x != 'label']
x_tst, dnn_feature_columns = get_din_feats_columns(tst_user_item_feats_df_din_model, dense_fea,
    sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)
Modeling
model = DIN(dnn_feature_columns, behavior_fea)
View model structure
model.summary()
Model compilation
model.compile('adam', 'binary_crossentropy', metrics=['binary_crossentropy', tf.keras.metrics.AUC()])
Model: "model"
Layer (type) Output Shape Param # Connected to
user_id (InputLayer) [(None, 1)] 0
click_article_id (InputLayer) [(None, 1)] 0
category_id (InputLayer) [(None, 1)] 0
click_environment (InputLayer) [(None, 1)] 0
click_deviceGroup (InputLayer) [(None, 1)] 0
click_os (InputLayer) [(None, 1)] 0
click_country (InputLayer) [(None, 1)] 0
click_region (InputLayer) [(None, 1)] 0
click_referrer_type (InputLayer [(None, 1)] 0
is_cat_hab (InputLayer) [(None, 1)] 0
sparse_emb_user_id (Embedding) (None, 1, 32) 1600032 user_id[0][0]
sparse_seq_emb_hist_click_artic multiple 525664 click_article_id[0][0]
hist_click_article_id[0][0]
click_article_id[0][0]
sparse_emb_category_id (Embeddi (None, 1, 32) 7776 category_id[0][0]
sparse_emb_click_environment (E (None, 1, 32) 128 click_environment[0][0]
sparse_emb_click_deviceGroup (E (None, 1, 32) 160 click_deviceGroup[0][0]
sparse_emb_click_os (Embedding) (None, 1, 32) 288 click_os[0][0]
sparse_emb_click_country (Embed (None, 1, 32) 384 click_country[0][0]
sparse_emb_click_region (Embedd (None, 1, 32) 928 click_region[0][0]
sparse_emb_click_referrer_type (None, 1, 32) 256 click_referrer_type[0][0]
sparse_emb_is_cat_hab (Embeddin (None, 1, 32) 64 is_cat_hab[0][0]
no_mask (NoMask) (None, 1, 32) 0 sparse_emb_user_id[0][0]
sparse_seq_emb_hist_click_article
sparse_emb_category_id[0][0]
sparse_emb_click_environment[0][0
sparse_emb_click_deviceGroup[0][0
sparse_emb_click_os[0][0]
sparse_emb_click_country[0][0]
sparse_emb_click_region[0][0]
sparse_emb_click_referrer_type[0]
sparse_emb_is_cat_hab[0][0]
hist_click_article_id (InputLay [(None, 50)] 0
concatenate (Concatenate) (None, 1, 320) 0 no_mask[0][0]
no_mask[1][0]
no_mask[2][0]
no_mask[3][0]
no_mask[4][0]
no_mask[5][0]
no_mask[6][0]
no_mask[7][0]
no_mask[8][0]
no_mask[9][0]
no_mask_1 (NoMask) (None, 1, 320) 0 concatenate[0][0]
attention_sequence_pooling_laye (None, 1, 32) 13961 sparse_seq_emb_hist_click_article
sparse_seq_emb_hist_click_article
concatenate_1 (Concatenate) (None, 1, 352) 0 no_mask_1[0][0]
attention_sequence_pooling_layer[
sim0 (InputLayer) [(None, 1)] 0
time_diff0 (InputLayer) [(None, 1)] 0
word_diff0 (InputLayer) [(None, 1)] 0
sim_max (InputLayer) [(None, 1)] 0
sim_min (InputLayer) [(None, 1)] 0
sim_sum (InputLayer) [(None, 1)] 0
sim_mean (InputLayer) [(None, 1)] 0
score (InputLayer) [(None, 1)] 0
rank (InputLayer) [(None, 1)] 0
click_size (InputLayer) [(None, 1)] 0
time_diff_mean (InputLayer) [(None, 1)] 0
active_level (InputLayer) [(None, 1)] 0
user_time_hob1 (InputLayer) [(None, 1)] 0
user_time_hob2 (InputLayer) [(None, 1)] 0
words_hbo (InputLayer) [(None, 1)] 0
words_count (InputLayer) [(None, 1)] 0
flatten (Flatten) (None, 352) 0 concatenate_1[0][0]
no_mask_3 (NoMask) (None, 1) 0 sim0[0][0]
time_diff0[0][0]
word_diff0[0][0]
sim_max[0][0]
sim_min[0][0]
sim_sum[0][0]
sim_mean[0][0]
score[0][0]
rank[0][0]
click_size[0][0]
time_diff_mean[0][0]
active_level[0][0]
user_time_hob1[0][0]
user_time_hob2[0][0]
words_hbo[0][0]
words_count[0][0]
no_mask_2 (NoMask) (None, 352) 0 flatten[0][0]
concatenate_2 (Concatenate) (None, 16) 0 no_mask_3[0][0]
no_mask_3[1][0]
no_mask_3[2][0]
no_mask_3[3][0]
no_mask_3[4][0]
no_mask_3[5][0]
no_mask_3[6][0]
no_mask_3[7][0]
no_mask_3[8][0]
no_mask_3[9][0]
no_mask_3[10][0]
no_mask_3[11][0]
no_mask_3[12][0]
no_mask_3[13][0]
no_mask_3[14][0]
no_mask_3[15][0]
flatten_1 (Flatten) (None, 352) 0 no_mask_2[0][0]
flatten_2 (Flatten) (None, 16) 0 concatenate_2[0][0]
no_mask_4 (NoMask) multiple 0 flatten_1[0][0]
flatten_2[0][0]
concatenate_3 (Concatenate) (None, 368) 0 no_mask_4[0][0]
no_mask_4[1][0]
dnn_1 (DNN) (None, 80) 89880 concatenate_3[0][0]
dense (Dense) (None, 1) 80 dnn_1[0][0]
prediction_layer (PredictionLay (None, 1) 1 dense[0][0]
Total params: 2,239,602
Trainable params: 2,239,362
Non-trainable params: 240
Model training
if offline:
    history = model.fit(x_trn, y_trn, verbose=1, epochs=10, validation_data=(x_val, y_val), batch_size=256)
else:
    # Alternatively, use the statement below to validate on a split sampled from the training data
    # history = model.fit(x_trn, y_trn, verbose=1, epochs=3, validation_split=0.3, batch_size=256)
    history = model.fit(x_trn, y_trn, verbose=1, epochs=2, batch_size=256)
Epoch 1/2
290964/290964 [] - 55s 189us/sample - loss: 0.4209 - binary_crossentropy: 0.4206 - auc: 0.7842
Epoch 2/2
290964/290964 [] - 52s 178us/sample - loss: 0.3630 - binary_crossentropy: 0.3618 - auc: 0.8478
Model prediction
# model.predict returns an array of shape (n, 1); take the single column
tst_user_item_feats_df_din_model['pred_score'] = model.predict(x_tst, verbose=1, batch_size=256)[:, 0]
tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'din_rank_score.csv', index=False)
500000/500000 [==============================] - 20s 39us/sample
Re-rank the predictions and generate the submission file
rank_results = tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score']]
submit(rank_results, topk=5, model_name='din')
Five-fold cross-validation, with the folds split by user
This part is independent of the single train/validation run above
def get_kfold_users(trn_df, n=5):
    user_ids = trn_df['user_id'].unique()
    user_set = [user_ids[i::n] for i in range(n)]
    return user_set
k_fold = 5
trn_df = trn_user_item_feats_df_din_model
user_set = get_kfold_users(trn_df, n=k_fold)
score_list = []
score_df = trn_df[['user_id', 'click_article_id', 'label']]
# Size the accumulator from the DIN test frame, since that is the frame it is written back to
sub_preds = np.zeros(tst_user_item_feats_df_din_model.shape[0])
dense_fea = [x for x in dense_fea if x != 'label']
x_tst, dnn_feature_columns = get_din_feats_columns(tst_user_item_feats_df_din_model, dense_fea,
    sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)
Five-fold cross-validation, and save intermediate results for staking
for n_fold, valid_user in enumerate(user_set):
    # Despite their names, train_idx and valid_idx are DataFrames, not index arrays.
    # .copy() allows adding prediction columns without a SettingWithCopyWarning.
    train_idx = trn_df[~trn_df['user_id'].isin(valid_user)]
    valid_idx = trn_df[trn_df['user_id'].isin(valid_user)].copy()
    # Prepare the training data
    x_trn, dnn_feature_columns = get_din_feats_columns(train_idx, dense_fea,
                                                       sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)
    y_trn = train_idx['label'].values
    # Prepare the validation data
    x_val, dnn_feature_columns = get_din_feats_columns(valid_idx, dense_fea,
                                                       sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)
    y_val = valid_idx['label'].values
    # Note: the same `model` object is fitted in every fold, so its weights carry over between folds.
    # For strictly out-of-fold predictions, rebuild and recompile the model at the start of each fold.
    history = model.fit(x_trn, y_trn, verbose=1, epochs=2, validation_data=(x_val, y_val), batch_size=256)
    # Predict on the validation fold; rank() works per user, so no sorting is needed
    valid_idx['pred_score'] = model.predict(x_val, verbose=1, batch_size=256)[:, 0]
    valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')
    # Collect this fold's validation predictions; they are concatenated after the loop
    score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])
    # For an online submission, accumulate the test-set predictions of every fold; they are averaged later
    if not offline:
        sub_preds += model.predict(x_tst, verbose=1, batch_size=256)[:, 0]
score_df_ = pd.concat(score_list, axis=0)
score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])
Save the new features generated by cross-validation of the training set
score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_din_cls_feats.csv', index=False)
The test-set predictions are averaged over the cross-validation folds. The averaged score and the corresponding per-user rank are saved as features for later stacking; more features could be constructed here.
tst_user_item_feats_df_din_model['pred_score'] = sub_preds / k_fold
tst_user_item_feats_df_din_model['pred_score'] = tst_user_item_feats_df_din_model['pred_score'].transform(lambda x: norm_sim(x))
# No sorting is needed: rank() works per user regardless of row order
tst_user_item_feats_df_din_model['pred_rank'] = tst_user_item_feats_df_din_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')
Save the new features of test set cross-validation
tst_user_item_feats_df_din_model[['user_id','click_article_id','pred_score','pred_rank']].to_csv(save_path +'tst_din_cls_feats.csv', index=False)
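The averaging and ranking steps above can be sketched on toy data as follows. The helper `min_max_norm` is a hypothetical stand-in for `norm_sim`, whose definition is not shown in this part, and the fold predictions are made-up numbers.

```python
import numpy as np
import pandas as pd

def min_max_norm(scores):
    # Hypothetical stand-in for norm_sim: rescale a Series to [0, 1]
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        return pd.Series(1.0, index=scores.index)
    return (scores - lo) / (hi - lo)

# Toy test set: two users with a few candidate articles each
tst = pd.DataFrame({'user_id': [1, 1, 1, 2, 2],
                    'click_article_id': [10, 11, 12, 10, 13]})

# Made-up predictions from two folds, one value per test row
fold_preds = [np.array([0.2, 0.6, 0.4, 0.9, 0.1]),
              np.array([0.4, 0.8, 0.0, 0.7, 0.3])]

# Accumulate the fold predictions, then divide by the number of folds
sub_preds = np.zeros(len(tst))
for preds in fold_preds:
    sub_preds += preds
avg_score = pd.Series(sub_preds / len(fold_preds), index=tst.index)

tst['pred_score'] = min_max_norm(avg_score)
# Rank within each user; rank 1 is the highest score
tst['pred_rank'] = tst.groupby('user_id')['pred_score'].rank(ascending=False, method='first')
print(tst)
```

Normalizing over the whole test set changes the scale of the scores but not their order, so the per-user ranks are the same with or without it.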
Model fusion
Weighted fusion
Read the sort result files of multiple models
lgb_ranker = pd.read_csv(save_path + 'lgb_ranker_score.csv')
lgb_cls = pd.read_csv(save_path + 'lgb_cls_score.csv')
din_ranker = pd.read_csv(save_path + 'din_rank_score.csv')
The test-set predictions produced by cross-validation can also be used here for the weighted fusion instead
rank_model = {'lgb_ranker': lgb_ranker,
              'lgb_cls': lgb_cls,
              'din_ranker': din_ranker}
def get_ensumble_predict_topk(rank_model, topk=5):
    # The LGB ranker outputs raw scores rather than probabilities, so rescale them with norm_sim first
    rank_model['lgb_ranker']['pred_score'] = rank_model['lgb_ranker']['pred_score'].transform(lambda x: norm_sim(x))
    # DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way
    final_recall = pd.concat([rank_model['lgb_cls'], rank_model['din_ranker'], rank_model['lgb_ranker']])
    # Sum the scores per (user, article) pair; every model carries an implicit weight of 1
    final_recall = final_recall.groupby(['user_id', 'click_article_id'])['pred_score'].sum().reset_index()
    submit(final_recall, topk=topk, model_name='ensemble_fuse')
get_ensumble_predict_topk(rank_model)
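The function above sums the model scores with equal weight. A minimal sketch of fusion with explicit per-model weights might look like the following; the helper `weighted_fuse`, the weight values, and the toy scores are all assumptions for illustration, not values tuned for this competition.

```python
import pandas as pd

def weighted_fuse(model_scores, weights):
    """Sum weight * pred_score per (user_id, click_article_id) across models."""
    parts = []
    for name, df in model_scores.items():
        part = df[['user_id', 'click_article_id', 'pred_score']].copy()
        part['pred_score'] = part['pred_score'] * weights[name]
        parts.append(part)
    fused = pd.concat(parts, ignore_index=True)
    return (fused.groupby(['user_id', 'click_article_id'])['pred_score']
                 .sum()
                 .reset_index())

# Toy scores from two models for one user (made-up numbers, already on a 0-1 scale)
model_scores = {
    'lgb_cls': pd.DataFrame({'user_id': [1, 1],
                             'click_article_id': [10, 11],
                             'pred_score': [0.8, 0.4]}),
    'din_ranker': pd.DataFrame({'user_id': [1, 1],
                                'click_article_id': [10, 11],
                                'pred_score': [0.2, 0.6]}),
}
# Hypothetical weights; in practice they would be chosen on a validation set
weights = {'lgb_cls': 0.75, 'din_ranker': 0.25}

fused = weighted_fuse(model_scores, weights)
print(fused)
```

The weights only make sense when the models' scores are on comparable scales, which is why the ranker's scores are normalized before fusion.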
Stacking
Read the result file generated by cross-validation of multiple models
Training set
trn_lgb_ranker_feats = pd.read_csv(save_path + 'trn_lgb_ranker_feats.csv')
trn_lgb_cls_feats = pd.read_csv(save_path + 'trn_lgb_cls_feats.csv')
trn_din_cls_feats = pd.read_csv(save_path + 'trn_din_cls_feats.csv')
Test set
tst_lgb_ranker_feats = pd.read_csv(save_path + 'tst_lgb_ranker_feats.csv')
tst_lgb_cls_feats = pd.read_csv(save_path + 'tst_lgb_cls_feats.csv')
tst_din_cls_feats = pd.read_csv(save_path + 'tst_din_cls_feats.csv')
Combine features output from multiple models
# .copy() avoids a SettingWithCopyWarning when new columns are added below
finall_trn_ranker_feats = trn_lgb_ranker_feats[['user_id', 'click_article_id', 'label']].copy()
finall_tst_ranker_feats = tst_lgb_ranker_feats[['user_id', 'click_article_id']].copy()
# Columns are assigned by row index, so all files must list the same rows in the same order
for idx, trn_model in enumerate([trn_lgb_ranker_feats, trn_lgb_cls_feats, trn_din_cls_feats]):
    for feat in ['pred_score', 'pred_rank']:
        col_name = feat + '_' + str(idx)
        finall_trn_ranker_feats[col_name] = trn_model[feat]
for idx, tst_model in enumerate([tst_lgb_ranker_feats, tst_lgb_cls_feats, tst_din_cls_feats]):
    for feat in ['pred_score', 'pred_rank']:
        col_name = feat + '_' + str(idx)
        finall_tst_ranker_feats[col_name] = tst_model[feat]
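The column assignment above relies on every file having the same rows in the same order. If that cannot be guaranteed, one alternative is to join on the keys instead. The sketch below is a hedged illustration: the helper `combine_model_feats` and the toy frames are hypothetical and not part of the original pipeline.

```python
import pandas as pd

def combine_model_feats(base_df, model_dfs, keys=('user_id', 'click_article_id')):
    """Attach each model's pred_score/pred_rank to base_df by joining on the keys."""
    keys = list(keys)
    out = base_df.copy()
    for idx, model_df in enumerate(model_dfs):
        renamed = model_df[keys + ['pred_score', 'pred_rank']].rename(
            columns={'pred_score': f'pred_score_{idx}',
                     'pred_rank': f'pred_rank_{idx}'})
        # A left join keeps every row of base_df, in its original order
        out = out.merge(renamed, how='left', on=keys)
    return out

base = pd.DataFrame({'user_id': [1, 1], 'click_article_id': [10, 11]})
m0 = pd.DataFrame({'user_id': [1, 1], 'click_article_id': [10, 11],
                   'pred_score': [0.9, 0.1], 'pred_rank': [1.0, 2.0]})
# Same rows as m0 but stored in a different order
m1 = pd.DataFrame({'user_id': [1, 1], 'click_article_id': [11, 10],
                   'pred_score': [0.3, 0.7], 'pred_rank': [2.0, 1.0]})

combined = combine_model_feats(base, [m0, m1])
print(combined)
```

Joining on `user_id` and `click_article_id` costs a little more than positional assignment, but each score is attached to the right candidate even when the files are ordered differently.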
Define a logistic regression model, fit it on the features generated by cross-validation, and use it to predict on the test set
Note that during cross-validation you can construct more features from the predicted values to enrich the inputs of this simple model.
from sklearn.linear_model import LogisticRegression
feat_cols = ['pred_score_0', 'pred_rank_0', 'pred_score_1', 'pred_rank_1', 'pred_score_2', 'pred_rank_2']
trn_x = finall_trn_ranker_feats[feat_cols]
trn_y = finall_trn_ranker_feats['label']
tst_x = finall_tst_ranker_feats[feat_cols]
Define the model
lr = LogisticRegression()
Model training
lr.fit(trn_x, trn_y)
Model prediction
finall_tst_ranker_feats['pred_score'] = lr.predict_proba(tst_x)[:, 1]
Re-rank the predictions and generate the submission file
rank_results = finall_tst_ranker_feats[['user_id','click_article_id','pred_score']]
submit(rank_results, topk=5, model_name='ensumble_staking')
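To make the stacking step concrete without the competition files, here is a small self-contained sketch. The base-model scores are synthetic (noisy functions of a random label, generated only for illustration), and only two score features are used as inputs to the second-level model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
label = rng.integers(0, 2, size=n)

# Synthetic out-of-fold scores from two hypothetical base models
trn = pd.DataFrame({
    'pred_score_0': np.clip(0.6 * label + rng.normal(0.2, 0.2, n), 0, 1),
    'pred_score_1': np.clip(0.4 * label + rng.normal(0.3, 0.2, n), 0, 1),
    'label': label,
})
feat_cols = ['pred_score_0', 'pred_score_1']

# Second-level model: learns how much to trust each base model
meta = LogisticRegression()
meta.fit(trn[feat_cols], trn['label'])

# Toy test candidates for one user, with made-up base-model scores
tst = pd.DataFrame({'user_id': [1, 1, 1],
                    'click_article_id': [10, 11, 12],
                    'pred_score_0': [0.9, 0.5, 0.1],
                    'pred_score_1': [0.8, 0.5, 0.2]})
tst['pred_score'] = meta.predict_proba(tst[feat_cols])[:, 1]
tst['pred_rank'] = tst.groupby('user_id')['pred_score'].rank(ascending=False, method='first')
print(tst[['click_article_id', 'pred_score', 'pred_rank']])
```

Training the second-level model on out-of-fold predictions matters here: scores that a base model produced for rows it was trained on would look better than they are and lead the logistic regression to overweight that model.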
Summary
This chapter covered three ranking models: the LGB ranker, the LGB classifier, and the deep learning model DIN. We did not go into the principles behind these models in detail; please explore them on your own, and you are welcome to share what you find so that we can learn and improve together. Finally, we applied two simple model fusion strategies: weighted fusion and stacking.
About Datawhale: Datawhale is an open source organization focusing on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, forming a team that shares an open source and exploratory spirit. With the vision of "for the learner, grow with learners", Datawhale encourages genuine self-expression, openness and tolerance, mutual trust and mutual assistance, and the courage to experiment, make mistakes, and take responsibility. Datawhale also applies the open source approach to content, learning, and solutions, supporting the training and growth of talent and building connections between people, between people and knowledge, between people and enterprises, and between people and the future. The material for this data mining learning path is shared on Tianchi. For details, please follow Datawhale.