Feature Engineering: Understanding

The idea behind feature construction is this: the article a user clicks next is closely related to the articles they clicked historically, for example through shared topics or content similarity. Therefore, an important family of features combines each candidate article with the user's click history. We already have, for each user, a table of (user, candidate article) pairs, and our goal is to predict the user's last clicked article. A natural idea is to relate each candidate to the user's last few clicks rather than only the whole history, because news is highly time-sensitive and a user's last click is usually strongly related to the few clicks just before it. So, for each candidate article, we can build the following features with respect to the last few clicks:

  1. Similarity between the candidate item and the last few clicked articles (embedding inner product); this relates directly to the user's historical behavior (a toy sketch appears after this list)

  2. Statistics (max, min, sum, mean) of those similarities; statistics smooth out fluctuations and outliers

  3. Word-count difference between the candidate item and the last few clicked articles; word count reflects the user's reading preferences

  4. Time difference between the candidate item and the last few clicked articles; this captures how much the user cares about the recency of articles

  5. If YouTubeDNN recall is used, a similarity feature between the user embedding and the candidate item embedding

The overall procedure is:

  1. Obtain each user's last click and historical clicks from the log data set

  2. Build features from the user's historical behavior, using the user's historical click table, the final recall list, the article information table and the embedding vectors

  3. Make labels to form the final supervised learning data set
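
The following toy sketch illustrates features 1 and 2 above; the names item_emb, last_clicks and candidate are hypothetical placeholders, and in the real pipeline the embeddings come from the recall stage.

import numpy as np

# Toy data: item embeddings stored as a dict {article_id: vector}
item_emb = {aid: np.random.rand(16) for aid in [1, 2, 3, 9]}
last_clicks = [1, 2, 3]   # the user's last N clicked articles
candidate = 9             # one recalled candidate article

# Feature 1: inner-product similarity with each of the last N clicks
sims = [np.dot(item_emb[h], item_emb[candidate]) for h in last_clicks]
# Feature 2: statistics over those similarities
sim_stats = {'sim_max': max(sims), 'sim_min': min(sims),
             'sim_sum': sum(sims), 'sim_mean': sum(sims) / len(sims)}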

Import packages
import numpy as np
import pandas as pd
import pickle
from tqdm import tqdm
import gc, os
import logging
import time
import lightgbm as lgb
from gensim.models import Word2Vec
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings('ignore')
A function to reduce DataFrame memory usage

def reduce_mem(df):
    starttime = time.time()
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if pd.isnull(c_min) or pd.isnull(c_max):
                continue
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    print('-- Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction),time spend:{:2.2f} min'.format(end_mem,
                                                                                                    100*(start_mem-end_mem)/start_mem,
                                                                                                    (time.time()-starttime)/60))
    return df
Define the data paths

data_path = './data_raw/'
save_path = './temp_results/'

Data reading

Division of training and validation sets

The reason for dividing training and validation sets is to evaluate the model parameters offline. To fully simulate the test set, we extract all of the logs of a sample of users from the training set and use them as the validation set. Splitting the training and validation sets in advance also spreads out the workload when building the ranking features, since building the ranking features for the whole data set in one pass can take a long time.

all_click_df refers to the training set

sample_user_nums is the number of users sampled out as the validation set

def trn_val_split(all_click_df, sample_user_nums):
    all_click = all_click_df
    all_user_ids = all_click.user_id.unique()

    # replace=True would allow repeated sampling; here we sample without replacement
    sample_user_ids = np.random.choice(all_user_ids, size=sample_user_nums, replace=False)

    click_val = all_click[all_click['user_id'].isin(sample_user_ids)]
    click_trn = all_click[~all_click['user_id'].isin(sample_user_ids)]

    # Extract the last click of each validation user as the answer
    click_val = click_val.sort_values(['user_id', 'click_timestamp'])
    val_ans = click_val.groupby('user_id').tail(1)

    click_val = click_val.groupby('user_id').apply(lambda x: x[:-1]).reset_index(drop=True)

    # Remove users from val_ans that have only one click: if such a user's only click went into the answer,
    # the validation set would contain no clicks for that user, causing a cold-start problem and complicating offline validation
    val_ans = val_ans[val_ans.user_id.isin(click_val.user_id.unique())]  # every user in the answer must still appear in the validation set
    click_val = click_val[click_val.user_id.isin(val_ans.user_id.unique())]

    return click_trn, click_val, val_ans
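
A hypothetical call of the split function (the sample size of 10,000 users is an illustrative assumption, not taken from the source):

# Hold out a sample of users' full logs as the offline validation set
sample_user_nums = 10000
click_trn, click_val, val_ans = trn_val_split(all_click_df, sample_user_nums)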

Get the historical clicks and the last click

Get the historical clicks and the last click from the current data

def get_hist_and_last_click(all_click):
    all_click = all_click.sort_values(by=['user_id', 'click_timestamp'])
    click_last_df = all_click.groupby('user_id').tail(1)

    # If a user has only one click, the history would be empty and the user would be invisible during training,
    # so in that case we deliberately leak the single click into the history
    def hist_func(user_df):
        if len(user_df) == 1:
            return user_df
        else:
            return user_df[:-1]

    click_hist_df = all_click.groupby('user_id').apply(hist_func).reset_index(drop=True)

    return click_hist_df, click_last_df

Read the training, validation and test sets

def get_trn_val_tst_data(data_path, offline=True):
    if offline:
        click_trn_data = pd.read_csv(data_path + 'train_click_log.csv')  # training set user click log
        click_trn_data = reduce_mem(click_trn_data)
        # sample_user_nums is the validation-set size defined outside this function
        click_trn, click_val, val_ans = trn_val_split(click_trn_data, sample_user_nums)
    else:
        click_trn = pd.read_csv(data_path + 'train_click_log.csv')
        click_trn = reduce_mem(click_trn)
        click_val = None
        val_ans = None

    click_tst = pd.read_csv(data_path + 'testA_click_log.csv')

    return click_trn, click_val, click_tst, val_ans
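
The history and last-click tables used later in this section (click_trn_hist, click_trn_last, click_val_hist, click_val_last, click_tst_hist) are not shown being built in the text; a plausible sketch, assuming the online setting (offline=False), is:

click_trn, click_val, click_tst, val_ans = get_trn_val_tst_data(data_path, offline=False)

# For the training data, split each user's log into history and last click
click_trn_hist, click_trn_last = get_hist_and_last_click(click_trn)

# For the validation data, trn_val_split already separated the answer clicks
if click_val is not None:
    click_val_hist, click_val_last = click_val, val_ans
else:
    click_val_hist, click_val_last = None, None

# For the test set, the whole log is treated as history
click_tst_hist = click_tst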

Read the recall list

Returns either the multi-channel recall results or a single-channel recall result

def get_recall_list(save_path, single_recall_model=None, multi_recall=False):
    if multi_recall:
        return pickle.load(open(save_path + 'final_recall_items_dict.pkl', 'rb'))

    if single_recall_model == 'i2i_itemcf':
        return pickle.load(open(save_path + 'itemcf_recall_dict.pkl', 'rb'))
    elif single_recall_model == 'i2i_emb_itemcf':
        return pickle.load(open(save_path + 'itemcf_emb_dict.pkl', 'rb'))
    elif single_recall_model == 'user_cf':
        return pickle.load(open(save_path + 'youtubednn_usercf_dict.pkl', 'rb'))
    elif single_recall_model == 'youtubednn':
        return pickle.load(open(save_path + 'youtube_u2i_dict.pkl', 'rb'))

Read all kinds of embeddings

Word2Vec training with gensim

The main idea of Word2Vec is that the context of a word expresses the word's semantics very well; it is an unsupervised way to learn word vectors. Word2Vec has two classic models: skip-gram and CBOW.
• skip-gram: predict the surrounding words from the known center word.
• CBOW: predict the center word from the known surrounding words.

When using gensim to train word2vec, several parameters are important (a minimal training sketch follows the notes below):
• size: the dimension of the word vectors.
• window: how far away from the target word the context still counts.
• sg: 0 for the CBOW model, 1 for the Skip-Gram model.
• workers: the number of threads used during training.
• min_count: the minimum word frequency; words that appear fewer times are ignored.
• iter: the number of passes over the whole data set during training.

Notes

  1. The training corpus must be a two-dimensional list of tokens, for example: [['北', '京', '你', '好'], ['上', '海', '你', '好']]
  2. The model has some default parameter values, which can be inspected with Word2Vec?? in Jupyter
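
A minimal training sketch on the click log, assuming gensim 3.x (where the parameters are still called size and iter; in gensim 4 they became vector_size and epochs). The function name trn_item_w2v_emb and the parameter values are illustrative assumptions:

def trn_item_w2v_emb(click_df, embed_size=64):
    # Word2Vec was imported from gensim.models at the top of this section
    click_df = click_df.sort_values('click_timestamp').copy()
    # The corpus must be a 2-D list of tokens, so article ids are cast to str
    click_df['click_article_id'] = click_df['click_article_id'].astype(str)
    docs = click_df.groupby('user_id')['click_article_id'].apply(list).tolist()

    # sg=1 selects Skip-Gram; window/min_count/iter are example values only
    w2v = Word2Vec(docs, size=embed_size, sg=1, window=5, min_count=1, iter=5, workers=4, seed=2020)

    # Map each article id back to its learned vector
    return {k: w2v.wv[k] for k in click_df['click_article_id'].unique()}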

Negative sampling of the training data.
Through recall we convert the data into triples of the form (user1, item1, label), and we observe that positive and negative samples are extremely unbalanced. We can first downsample the negative samples: downsampling alleviates the positive/negative ratio problem on the one hand, and reduces the work needed to build the ranking features on the other. What should we pay attention to when doing negative sampling?

  1. Only downsample negative samples (if there is a good way to augment positive samples, that can also be considered)
  2. After negative sampling, make sure every user and every article still appears in the sampled data
  3. The downsampling ratio can be controlled manually according to the actual situation
  4. After negative sampling, update the per-user list of recalled articles, because relative position information may be used in later features.
    In fact, negative sampling could also be postponed until after the features are built. Here, because building the ranking features is slow, the negative-sampling step is moved forward.

Convert the recall list into a DataFrame

def recall_dict_2_df(recall_list_dict):
    df_row_list = []  # [user, item, score]
    for user, recall_list in tqdm(recall_list_dict.items()):
        for item, score in recall_list:
            df_row_list.append([user, item, score])

    col_names = ['user_id', 'sim_item', 'score']
    recall_list_df = pd.DataFrame(df_row_list, columns=col_names)

    return recall_list_df
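
For reference, the recall dict passed to this function is assumed to have the shape {user_id: [(article_id, score), ...]}; the ids and scores below are purely hypothetical:

# Hypothetical shape of a single-channel recall result
toy_recall_dict = {196453: [(276970, 0.92), (158536, 0.87)],
                   244197: [(63646, 0.76)]}
toy_recall_df = recall_dict_2_df(toy_recall_dict)   # columns: user_id, sim_item, score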

Negative sampling function; the sampling ratio can be controlled here, and a default value is given

def neg_sample_recall_data(recall_items_df, sample_rate=0.001):
    pos_data = recall_items_df[recall_items_df['label'] == 1]
    neg_data = recall_items_df[recall_items_df['label'] == 0]

    print('pos_data_num:', len(pos_data), 'neg_data_num:', len(neg_data), 'pos/neg:', len(pos_data)/len(neg_data))

    # Group-wise sampling function
    def neg_sample_func(group_df):
        neg_num = len(group_df)
        sample_num = max(int(neg_num * sample_rate), 1)  # keep at least one
        sample_num = min(sample_num, 5)  # keep at most five; adjust according to the actual situation
        return group_df.sample(n=sample_num, replace=True)

    # Sample negatives per user, so that every user still appears after sampling
    neg_data_user_sample = neg_data.groupby('user_id', group_keys=False).apply(neg_sample_func)
    # Sample negatives per article, so that every article still appears after sampling
    neg_data_item_sample = neg_data.groupby('sim_item', group_keys=False).apply(neg_sample_func)

    # Merge the two sampled sets
    neg_data_new = neg_data_user_sample.append(neg_data_item_sample)
    # The two samplings are independent, so the same row may have been picked twice; deduplicate the merged data
    neg_data_new = neg_data_new.sort_values(['user_id', 'score']).drop_duplicates(['user_id', 'sim_item'], keep='last')

    # Merge back with the positive samples
    data_new = pd.concat([pos_data, neg_data_new], ignore_index=True)

    return data_new

Label the recall data

def get_rank_label_df(recall_list_df, label_df, is_test=False):
    # The test set has no labels; to keep the later code uniform, use -1 as a placeholder
    if is_test:
        recall_list_df['label'] = -1
        return recall_list_df

    label_df = label_df.rename(columns={'click_article_id': 'sim_item'})
    recall_list_df_ = recall_list_df.merge(label_df[['user_id', 'sim_item', 'click_timestamp']], \
                                               how='left', on=['user_id', 'sim_item'])
    recall_list_df_['label'] = recall_list_df_['click_timestamp'].apply(lambda x: 0.0 if np.isnan(x) else 1.0)
    del recall_list_df_['click_timestamp']

    return recall_list_df_

def get_user_recall_item_label_df(click_trn_hist, click_val_hist, click_tst_hist, click_trn_last, click_val_last, recall_list_df):
    # Get the recall list of the training users
    trn_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_trn_hist['user_id'].unique())]
    # Label the training data
    trn_user_item_label_df = get_rank_label_df(trn_user_items_df, click_trn_last, is_test=False)
    # Negative sampling of the training data
    trn_user_item_label_df = neg_sample_recall_data(trn_user_item_label_df)

    if click_val_hist is not None:
        val_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_val_hist['user_id'].unique())]
        val_user_item_label_df = get_rank_label_df(val_user_items_df, click_val_last, is_test=False)
        val_user_item_label_df = neg_sample_recall_data(val_user_item_label_df)
    else:
        val_user_item_label_df = None

    # The test data needs no negative sampling; all recalled items simply get the -1 placeholder label
    tst_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_tst_hist['user_id'].unique())]
    tst_user_item_label_df = get_rank_label_df(tst_user_items_df, None, is_test=True)

    return trn_user_item_label_df, val_user_item_label_df, tst_user_item_label_df

Read the recall list

recall_list_dict = get_recall_list(save_path, single_recall_model='i2i_itemcf')  # here only a single-channel recall result is used; multi-channel results could also be selected

Convert recall data to df

recall_list_df = recall_dict_2_df(recall_list_dict)
100%|██████████| 250000/250000 [00:12<00:00, 20689.39it/s]

Label the training and validation data and do negative sampling (this part takes a long time)

trn_user_item_label_df, val_user_item_label_df, tst_user_item_label_df = get_user_recall_item_label_df(click_trn_hist,
                                                                                                        click_val_hist,
                                                                                                        click_tst_hist,
                                                                                                        click_trn_last,
                                                                                                        click_val_last,
                                                                                                        recall_list_df)
pos_data_num: 64190 neg_data_num: 1935810 pos/neg: 0.03315924600038227
trn_user_item_label_df.label
Convert the recalled data into a dictionary

Convert the final recalled DataFrame into a dictionary, to be used when building the ranking features

def make_tuple_func(group_df):
    row_data = []
    for name, row_df in group_df.iterrows():
        row_data.append((row_df['sim_item'], row_df['score'], row_df['label']))

    return row_data

trn_user_item_label_tuples = trn_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()
trn_user_item_label_tuples_dict = dict(zip(trn_user_item_label_tuples['user_id'], trn_user_item_label_tuples[0]))

if val_user_item_label_df is not None:
    val_user_item_label_tuples = val_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()
    val_user_item_label_tuples_dict = dict(zip(val_user_item_label_tuples['user_id'], val_user_item_label_tuples[0]))
else:
    val_user_item_label_tuples_dict = None

tst_user_item_label_tuples = tst_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()
tst_user_item_label_tuples_dict = dict(zip(tst_user_item_label_tuples['user_id'], tst_user_item_label_tuples[0]))

Build history-related features for each user's recalled articles

Each recalled article of each user gets its own feature row. The specific steps are as follows:
• For each user, get the item_id of the last N articles they clicked
• For each of this user's recalled articles, compute the similarities with those last N clicked articles and their statistics (max, min, sum, mean), the time-difference features, the word-count-difference features, and, if a user embedding is available, the similarity feature between the user and the article

The function below builds these history-related features from the data:

def create_feature(users_id, recall_list, click_hist_df, articles_info, articles_emb, user_emb=None, N=1):
    """
    Build features based on the user's historical behavior
    :param users_id: user ids
    :param recall_list: the list of recalled candidate articles for each user
    :param click_hist_df: the user's historical click information
    :param articles_info: article information
    :param articles_emb: article embedding vectors; can be item_content_emb, item_w2v_emb or item_youtube_emb
    :param user_emb: user embedding vectors (user_youtube_emb); optional, but if used, articles_emb must be item_youtube_emb so that the dimensions match
    :param N: the number of most recent clicks to use. Many users in the testA log have only one historical click, so to avoid null values the default is 1
    """

    # Collect results in a 2-D list, to be converted to a DataFrame later
    all_user_feas = []
    i = 0
    for user_id in tqdm(users_id):
        # The user's last N clicks
        hist_user_items = click_hist_df[click_hist_df['user_id']==user_id]['click_article_id'][-N:]

        # Iterate over the user's recall list
        for rank, (article_id, score, label) in enumerate(recall_list[user_id]):
            # Creation time and word count of the candidate article
            a_create_time = articles_info[articles_info['article_id']==article_id]['created_at_ts'].values[0]
            a_words_count = articles_info[articles_info['article_id']==article_id]['words_count'].values[0]
            single_user_fea = [user_id, article_id]
            # Similarities with the last clicked articles, plus their sum, max, min and mean
            sim_fea = []
            time_fea = []
            word_fea = []
            # Iterate over the user's last N clicked articles
            for hist_item in hist_user_items:
                b_create_time = articles_info[articles_info['article_id']==hist_item]['created_at_ts'].values[0]
                b_words_count = articles_info[articles_info['article_id']==hist_item]['words_count'].values[0]

                sim_fea.append(np.dot(articles_emb[hist_item], articles_emb[article_id]))
                time_fea.append(abs(a_create_time-b_create_time))
                word_fea.append(abs(a_words_count-b_words_count))

            single_user_fea.extend(sim_fea)      # similarity features
            single_user_fea.extend(time_fea)     # time-difference features
            single_user_fea.extend(word_fea)     # word-count-difference features
            single_user_fea.extend([max(sim_fea), min(sim_fea), sum(sim_fea), sum(sim_fea) / len(sim_fea)])  # similarity statistics

            if user_emb:  # if user embeddings are available, compute the similarity between the recalled article and the user
                single_user_fea.append(np.dot(user_emb[user_id], articles_emb[article_id]))

            single_user_fea.extend([score, rank, label])
            # Append to the overall table
            all_user_feas.append(single_user_fea)

    # Define column names
    id_cols = ['user_id', 'click_article_id']
    sim_cols = ['sim' + str(i) for i in range(N)]
    time_cols = ['time_diff' + str(i) for i in range(N)]
    word_cols = ['word_diff' + str(i) for i in range(N)]
    sat_cols = ['sim_max', 'sim_min', 'sim_sum', 'sim_mean']
    user_item_sim_cols = ['user_item_sim'] if user_emb else []
    user_score_rank_label = ['score', 'rank', 'label']
    cols = id_cols + sim_cols + time_cols + word_cols + sat_cols + user_item_sim_cols + user_score_rank_label

    # Convert to DataFrame
    df = pd.DataFrame(all_user_feas, columns=cols)

    return df

article_info_df = get_article_info_df()
all_click = click_trn.append(click_tst)
item_content_emb_dict, item_w2v_emb_dict, item_youtube_emb_dict, user_youtube_emb_dict = get_embedding(save_path, all_click)
-- Mem. usage decreased to 5.56 Mb (50.0% reduction),time spend:0.00 min

Build the features of the recalled articles for the training, validation and test data

trn_user_item_feats_df = create_feature(trn_user_item_label_tuples_dict.keys(), trn_user_item_label_tuples_dict,
                                        click_trn_hist, article_info_df, item_content_emb_dict)

if val_user_item_label_tuples_dict is not None:
    val_user_item_feats_df = create_feature(val_user_item_label_tuples_dict.keys(), val_user_item_label_tuples_dict,
                                            click_val_hist, article_info_df, item_content_emb_dict)
else:
    val_user_item_feats_df = None

tst_user_item_feats_df = create_feature(tst_user_item_label_tuples_dict.keys(), tst_user_item_label_tuples_dict,
                                        click_tst_hist, article_info_df, item_content_emb_dict)
100%|██████████| 200000/200000 [50:16<00:00, 66.31it/s]
100%|██████████| 50000/50000 [1:07:21<00:00, 12.37it/s]

Save a copy so this does not have to be rerun every time, since each run takes a long time

trn_user_item_feats_df.to_csv(save_path + 'trn_user_item_feats_df.csv', index=False)

if val_user_item_feats_df is not None:
    val_user_item_feats_df.to_csv(save_path + 'val_user_item_feats_df.csv', index=False)

tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=False)

click_tst.head()

Read article features

articles = pd.read_csv(data_path + 'articles.csv')
articles = reduce_mem(articles)

The log data is all of the click data seen so far

all_data = click_trn
if click_val is not None:
    all_data = all_data.append(click_val)
all_data = all_data.append(click_tst)
all_data = reduce_mem(all_data)

Merge in the article information

all_data = all_data.merge(articles, left_on='click_article_id', right_on='article_id')
all_data.shape
Analyze click times and the number of clicked articles to measure user activity

If the intervals between a user's clicks are relatively small and the number of clicked articles is large, we consider the user to be active. There are of course many possible ways to measure user activity; here we only provide one of them. We write a function that produces a user-activity feature, with the following logic:

  1. First group by user_id; for each user, compute the number of clicked articles and the mean time interval between two consecutive clicks

  2. Take the reciprocal of the click count, normalize it together with the mean time interval, and add the two; the smaller the resulting value, the more active the user

  3. Note that the mean click interval above is null if the user clicked only once; in that case, assign a large value to the final feature so that such users are set apart.
    In short, the measure takes the reciprocal of the click count and normalizes it, normalizes the mean click interval, and adds the two; the smaller the value, the more clicks and the shorter the intervals between them.

def active_level(all_data, cols):
    """
    Build a feature that measures user activity
    :param all_data: data set
    :param cols: feature columns used
    """
    data = all_data[cols]
    data.sort_values(['user_id', 'click_timestamp'], inplace=True)
    user_act = pd.DataFrame(data.groupby('user_id', as_index=False)[['click_article_id', 'click_timestamp']].
                            agg({'click_article_id': np.size, 'click_timestamp': {list}}).values,
                            columns=['user_id', 'click_size', 'click_timestamp'])

    # Mean of the time intervals between consecutive clicks
    def time_diff_mean(l):
        if len(l) == 1:
            return 1
        else:
            return np.mean([j - i for i, j in list(zip(l[:-1], l[1:]))])

    user_act['time_diff_mean'] = user_act['click_timestamp'].apply(lambda x: time_diff_mean(x))

    # Reciprocal of the click count
    user_act['click_size'] = 1 / user_act['click_size']

    # Normalize both
    user_act['click_size'] = (user_act['click_size'] - user_act['click_size'].min()) / (user_act['click_size'].max() - user_act['click_size'].min())
    user_act['time_diff_mean'] = (user_act['time_diff_mean'] - user_act['time_diff_mean'].min()) / (user_act['time_diff_mean'].max() - user_act['time_diff_mean'].min())
    user_act['active_level'] = user_act['click_size'] + user_act['time_diff_mean']

    user_act['user_id'] = user_act['user_id'].astype('int')
    del user_act['click_timestamp']

    return user_act

user_act_fea = active_level(all_data, ['user_id', 'click_article_id', 'click_timestamp'])
user_act_fea.head()

Analyze click times and the number of clicks an article receives, to measure article popularity

def hot_level(all_data, cols):
    """
    Build a feature that measures article popularity
    :param all_data: data set
    :param cols: feature columns used
    """
    data = all_data[cols]
    data.sort_values(['click_article_id', 'click_timestamp'], inplace=True)
    article_hot = pd.DataFrame(data.groupby('click_article_id', as_index=False)[['user_id', 'click_timestamp']].
                               agg({'user_id': np.size, 'click_timestamp': {list}}).values,
                               columns=['click_article_id', 'user_num', 'click_timestamp'])

    # Mean of the time intervals between consecutive clicks on the article
    def time_diff_mean(l):
        if len(l) == 1:
            return 1
        else:
            return np.mean([j - i for i, j in list(zip(l[:-1], l[1:]))])

    article_hot['time_diff_mean'] = article_hot['click_timestamp'].apply(lambda x: time_diff_mean(x))

    # Reciprocal of the click count
    article_hot['user_num'] = 1 / article_hot['user_num']

    # Normalize both
    article_hot['user_num'] = (article_hot['user_num'] - article_hot['user_num'].min()) / (article_hot['user_num'].max() - article_hot['user_num'].min())
    article_hot['time_diff_mean'] = (article_hot['time_diff_mean'] - article_hot['time_diff_mean'].min()) / (article_hot['time_diff_mean'].max() - article_hot['time_diff_mean'].min())
    article_hot['hot_level'] = article_hot['user_num'] + article_hot['time_diff_mean']

    article_hot['click_article_id'] = article_hot['click_article_id'].astype('int')

    del article_hot['click_timestamp']

    return article_hot

article_hot_fea = hot_level(all_data, ['user_id', 'click_article_id', 'click_timestamp'])
article_hot_fea.head()

The user's device habits
def device_fea(all_data, cols):
    """
    Build the user's device features
    :param all_data: data set
    :param cols: feature columns used
    """
    user_device_info = all_data[cols]

    # Use the mode to represent each user's device information
    user_device_info = user_device_info.groupby('user_id').agg(lambda x: x.value_counts().index[0]).reset_index()

    return user_device_info

Device features (this part takes a while)

device_cols = ['user_id', 'click_environment', 'click_deviceGroup', 'click_os', 'click_country', 'click_region', 'click_referrer_type']
user_device_info = device_fea(all_data, device_cols)
user_device_info.head()

The user's time habits
def user_time_hob_fea(all_data, cols):
    """
    Build the user's time-habit features
    :param all_data: data set
    :param cols: feature columns used
    """
    user_time_hob_info = all_data[cols]

    # First normalize the timestamps
    mm = MinMaxScaler()
    user_time_hob_info['click_timestamp'] = mm.fit_transform(user_time_hob_info[['click_timestamp']])
    user_time_hob_info['created_at_ts'] = mm.fit_transform(user_time_hob_info[['created_at_ts']])

    user_time_hob_info = user_time_hob_info.groupby('user_id').agg('mean').reset_index()

    user_time_hob_info.rename(columns={'click_timestamp': 'user_time_hob1', 'created_at_ts': 'user_time_hob2'}, inplace=True)
    return user_time_hob_info

user_time_hob_cols = ['user_id','click_timestamp','created_at_ts']
user_time_hob_info = user_time_hob_fea(all_data, user_time_hob_cols)
The user's topic preferences.
Here we first turn the categories of the articles clicked by each user into a list; later, when merging, a feature is made that is 1 if the recalled article's topic is in that list and 0 otherwise.
def user_cat_hob_fea(all_data, cols):
    """
    Build the user's topic-preference features
    :param all_data: data set
    :param cols: feature columns used
    """
    user_category_hob_info = all_data[cols]
    user_category_hob_info = user_category_hob_info.groupby('user_id').agg({list}).reset_index()

    user_cat_hob_info = pd.DataFrame()
    user_cat_hob_info['user_id'] = user_category_hob_info['user_id']
    user_cat_hob_info['cate_list'] = user_category_hob_info['category_id']

    return user_cat_hob_info

user_category_hob_cols = ['user_id', 'category_id']
user_cat_hob_info = user_cat_hob_fea(all_data, user_category_hob_cols)

The user's word-count preference feature

user_wcou_info = all_data.groupby('user_id')['words_count'].agg('mean').reset_index()
user_wcou_info.rename(columns={'words_count': 'words_hbo'}, inplace=True)

The user's information features are merged and saved

Merge all tables

user_info = pd.merge(user_act_fea, user_device_info, on='user_id')
user_info = user_info.merge(user_time_hob_info, on='user_id')
user_info = user_info.merge(user_cat_hob_info, on='user_id')
user_info = user_info.merge(user_wcou_info, on='user_id')

Save it so that the user features can be read back directly in the future

user_info.to_csv(save_path + 'user_info.csv', index=False)

Read user features directly

If the user feature engineering has already been done before, the features can be read back directly

Read in the user information directly

user_info = pd.read_csv(save_path + 'user_info.csv')
if os.path.exists(save_path + 'trn_user_item_feats_df.csv'):
    trn_user_item_feats_df = pd.read_csv(save_path + 'trn_user_item_feats_df.csv')

if os.path.exists(save_path + 'tst_user_item_feats_df.csv'):
    tst_user_item_feats_df = pd.read_csv(save_path + 'tst_user_item_feats_df.csv')

if os.path.exists(save_path + 'val_user_item_feats_df.csv'):
    val_user_item_feats_df = pd.read_csv(save_path + 'val_user_item_feats_df.csv')
else:
    val_user_item_feats_df = None

Merge the user features

The following is for offline validation

trn_user_item_feats_df = trn_user_item_feats_df.merge(user_info, on='user_id', how='left')

if val_user_item_feats_df is not None:
    val_user_item_feats_df = val_user_item_feats_df.merge(user_info, on='user_id', how='left')
else:
    val_user_item_feats_df = None

tst_user_item_feats_df = tst_user_item_feats_df.merge(user_info, on='user_id', how='left')
trn_user_item_feats_df.columns
Index(['user_id', 'click_article_id', 'sim0', 'time_diff0', 'word_diff0',
       'sim_max', 'sim_min', 'sim_sum', 'sim_mean', 'score', 'rank', 'label',
       'click_size', 'time_diff_mean', 'active_level', 'click_environment',
       'click_deviceGroup', 'click_os', 'click_country', 'click_region',
       'click_referrer_type', 'user_time_hob1', 'user_time_hob2', 'cate_list',
       'words_hbo'],
      dtype='object')

Read the article features directly

articles = pd.read_csv(data_path + 'articles.csv')
articles = reduce_mem(articles)
-- Mem. usage decreased to 5.56 Mb (50.0% reduction),time spend:0.00 min

Merge the article features

trn_user_item_feats_df = trn_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')

if val_user_item_feats_df is not None:
    val_user_item_feats_df = val_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')
else:
    val_user_item_feats_df = None

tst_user_item_feats_df = tst_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')

Whether the topic of the recalled article is among the user's preferences

trn_user_item_feats_df['is_cat_hab'] = trn_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)
if val_user_item_feats_df is not None:
    val_user_item_feats_df['is_cat_hab'] = val_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)
else:
    val_user_item_feats_df = None
tst_user_item_feats_df['is_cat_hab'] = tst_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)

Offline validation

del trn_user_item_feats_df['cate_list']

if val_user_item_feats_df is not None:
    del val_user_item_feats_df['cate_list']
else:
    val_user_item_feats_df = None

del tst_user_item_feats_df['cate_list']

del trn_user_item_feats_df['article_id']

if val_user_item_feats_df is not None:
    del val_user_item_feats_df['article_id']
else:
    val_user_item_feats_df = None

del tst_user_item_feats_df['article_id']

Save the features

Training and validation features

trn_user_item_feats_df.to_csv(save_path + 'trn_user_item_feats_df.csv', index=False)
if val_user_item_feats_df is not None:
    val_user_item_feats_df.to_csv(save_path + 'val_user_item_feats_df.csv', index=False)
tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=False)

Feature engineering and data cleaning/transformation are a crucial part of the competition, because data and features determine the upper limit of machine learning, while algorithms and models can only approach this upper limit. The quality of feature engineering therefore often determines the final result. Feature engineering can further enhance the expressive power of the data: by constructing new features, we dig out more information from the data and enlarge its expressiveness. In this section we mainly turned the prediction problem into a supervised learning problem by building features and labels, and then built a series of features around the user profile and the article profile. In addition, to keep positive and negative samples balanced, we also learned negative-sampling techniques.

Source: blog.csdn.net/m0_49978528/article/details/110559469