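Competition registration: algo.qq.com/person/mobile/landingPage?from=dsbryan
As an internal employee I am not allowed to enter this competition, so I will try to explain my approach and give everyone a baseline.
GitHub:
https://github.com/YouChouNoBB/2018-tencent-ad-competition-baseline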
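1. First, process the 4 GB user feature file
The file is large and is not in a format pandas can read directly, so it has to be converted first: each line is parsed into a dict, and the list of dicts is used to initialize a DataFrame.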
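As a rough illustration of that conversion, here is a minimal sketch of parsing one record into a dict. The sample lines are made up, and the helper name `parse_user_line` is hypothetical; the field layout (`|` between fields, a space between the field name and its values) is taken from the parsing loop in the script.

```python
def parse_user_line(line):
    """Turn 'uid 1|age 3|interest1 5 12 9' into {'uid': '1', 'age': '3', ...}."""
    record = {}
    for field in line.strip().split('|'):
        parts = field.split(' ')
        # First token is the feature name; the rest are its values,
        # kept as one space-separated string.
        record[parts[0]] = ' '.join(parts[1:])
    return record


# Made-up sample lines in the assumed layout.
sample_lines = [
    'uid 1|age 3|interest1 5 12 9\n',
    'uid 2|age 1|kw1 88 4\n',
]
records = [parse_user_line(line) for line in sample_lines]
print(records)
```

Passing a list like `records` to `pd.DataFrame` gives one column per key, with NaN wherever a user lacks that feature.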
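2. Join the user features and ad features
Negative samples in the training data are labeled -1, so convert them to 0 first. Then set the label of the prediction data to -1, which makes the two datasets easy to tell apart after merging. Missing values are filled with the string '-1'. Why not the number -1? Because LabelEncoder has to sort the values, and a column holding both string and int values cannot be compared. So the fill value has to be the string '-1'.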
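The sorting problem can be reproduced in plain Python without any library. This is only a sketch of the underlying cause, with made-up values, not a trace of what LabelEncoder does internally.

```python
# A column that mixes ints with the string fill value cannot be ordered.
mixed = [3, 1, '-1']
try:
    sorted(mixed)
    mixed_sortable = True
except TypeError:
    # Python 3 refuses to compare int and str.
    mixed_sortable = False

# Once every value is a string, sorting works (lexicographically).
as_strings = sorted(str(v) for v in mixed)

print(mixed_sortable, as_strings)
```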
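3. One-hot encode the single-valued categorical features as sparse matrices
Why split the data into training and test sets first? Because the sparse matrix built here cannot be sliced (sparse.hstack yields COO format in this script, and slicing would need a conversion such as .tocsr()), so the simplest route is to split the data first and stack the sparse features for each part separately. If you use pd.get_dummies() for the one-hot features, the result can be sliced, but dense storage is a fatal weakness at this data size.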
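A minimal sketch of the split-then-stack idea on toy data. The arrays are made up, and `format='csr'` is my own choice here so that the result can still be row-sliced; the script itself does not pass it.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy data: one label-encoded categorical column and one numeric column.
codes = np.array([0, 2, 1, 2, 0]).reshape(-1, 1)
creative_size = np.array([35.0, 22.0, 35.0, 59.0, 22.0]).reshape(-1, 1)
is_train = np.array([True, True, True, False, False])

# Fit on all rows so train and test end up with the same columns.
enc = OneHotEncoder()
enc.fit(codes)
train_a = enc.transform(codes[is_train])   # sparse, shape (3, 3)
test_a = enc.transform(codes[~is_train])   # sparse, shape (2, 3)

# Stack the numeric column and the one-hot block side by side.
train_x = sparse.hstack(
    (sparse.csr_matrix(creative_size[is_train]), train_a), format='csr')
test_x = sparse.hstack(
    (sparse.csr_matrix(creative_size[~is_train]), test_a), format='csr')

print(train_x.shape, test_x.shape)
```

Each one-hot block stores only one nonzero per row, whereas a dense frame would store every zero as well.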
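Many people on GitHub have asked me what train_x=train[['creativeSize']] means. creativeSize is a numeric feature, so it needs no special handling; if you do want to process it, consider pd.cut to discretize it into bins. The other reason is that pulling this feature out builds a new DataFrame that is easy to stack with the sparse features generated later. That is why the selection uses [[]], which returns a DataFrame, rather than [], which returns a Series.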
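The difference between the two selections, plus the optional pd.cut binning, can be checked on a tiny frame. The values, bin edges and labels below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'aid': [1, 2, 3, 4], 'creativeSize': [22, 35, 59, 109]})

as_frame = df[['creativeSize']]   # DataFrame, shape (4, 1): ready for hstack
as_series = df['creativeSize']    # Series, shape (4,)

# Optional discretization; the bin edges here are arbitrary.
size_bucket = pd.cut(df['creativeSize'], bins=[0, 30, 60, 120],
                     labels=['small', 'medium', 'large'])

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
print(list(size_bucket))
```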
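4. Vectorize the multi-valued categorical features as sparse matrices
Many of you have probably not seen this operation before. It usually shows up in natural language processing, for example when computing TF-IDF or LDA, but it works just as well for generating a sparse vector to use as new features. One source feature produces many features at once, which is more convenient than handling each value separately.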
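A minimal sketch of the idea: each row holds a space-separated list of IDs, and CountVectorizer turns it into one column per distinct ID. The rows are made up, and every ID has at least two digits so the default tokenizer keeps all of them.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up multi-valued feature: one string of IDs per user.
interest = ['12 305 77', '305 18', '12 18 77 305']

cv = CountVectorizer()
x = cv.fit_transform(interest)   # sparse matrix, one column per distinct ID

columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(columns)
print(x.toarray())
```

Columns follow the sorted vocabulary, so the IDs are ordered as strings ('12', '18', '305', '77'), not as numbers.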
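If line 65 (the CountVectorizer step) raises "empty vocabulary; perhaps the documents only contain stop words", initialize it with CountVectorizer(token_pattern='(?u)\\b\\w+\\b') instead. The default pattern only keeps tokens of two or more characters, so single-character IDs are dropped.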
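The effect of the two patterns can be seen with the standard `re` module alone. The default shown is scikit-learn's documented `token_pattern`; the sample strings are made up.

```python
import re

DEFAULT_PATTERN = r'(?u)\b\w\w+\b'   # scikit-learn default: 2+ characters
RELAXED_PATTERN = r'(?u)\b\w+\b'     # also keeps single-character tokens

row = '7 12 305'
default_tokens = re.findall(DEFAULT_PATTERN, row)
relaxed_tokens = re.findall(RELAXED_PATTERN, row)

# A column where every ID is a single digit yields no tokens at all
# under the default pattern, which is what leads to an empty vocabulary.
only_single = re.findall(DEFAULT_PATTERN, '1 2 3')

print(default_tokens, relaxed_tokens, only_single)
```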
5. Offline testing
Split the data with train_test_split; that line is commented out in the script.
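6. Online submission
For the online prediction, the early_stopping_rounds parameter does little during training, because the only evaluation set is the training data itself. The n_estimators parameter has to be set again based on the offline test. I have seen some people set it to 10000 and score 0.74...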
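To make the link between the offline test and n_estimators concrete, here is a library-free sketch of the early-stopping rule: stop once the validation score has not improved for a given number of rounds, and reuse the best round as the tree count. The AUC values are hypothetical, and `best_round` is an illustrative helper, not part of LightGBM.

```python
def best_round(valid_scores, patience):
    """Return the 1-based round with the best score, stopping early
    once `patience` rounds pass without improvement."""
    best_idx = 0
    for i, score in enumerate(valid_scores):
        if score > valid_scores[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            break
    return best_idx + 1


# Hypothetical validation AUC per boosting round (not real results).
valid_auc = [0.700, 0.720, 0.730, 0.729, 0.728, 0.727, 0.750]

n_estimators_short = best_round(valid_auc, patience=3)    # stops at round 6
n_estimators_long = best_round(valid_auc, patience=10)    # sees every round
print(n_estimators_short, n_estimators_long)
```

With a fitted LGBMClassifier the same number is available as `best_iteration_` after early stopping, and it can be passed as n_estimators when retraining on the full training set.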
# coding=utf-8
# @author:bryan
# blog: https://blog.csdn.net/bryan__
# github: https://github.com/YouChouNoBB/2018-tencent-ad-competition-baseline
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder,LabelEncoder
from scipy import sparse
import os

ad_feature=pd.read_csv('../data/adFeature.csv')
if os.path.exists('../data/userFeature.csv'):
    user_feature=pd.read_csv('../data/userFeature.csv')
else:
    # Convert the raw user feature file to CSV, one chunk of rows at a time.
    userFeature_data = []
    with open('../data/userFeature.data', 'r') as f:
        cnt = 0
        for i, line in enumerate(f):
            line = line.strip().split('|')
            userFeature_dict = {}
            for each in line:
                each_list = each.split(' ')
                userFeature_dict[each_list[0]] = ' '.join(each_list[1:])
            userFeature_data.append(userFeature_dict)
            if i % 100000 == 0:
                print(i)
            if i % 1000000 == 0:
                # Flush the current chunk to disk to keep memory usage down.
                user_feature = pd.DataFrame(userFeature_data)
                user_feature.to_csv('../data/userFeature_' + str(cnt) + '.csv', index=False)
                cnt += 1
                del userFeature_data, user_feature
                userFeature_data = []
        # Write the remaining rows, then merge all chunks into one CSV.
        user_feature = pd.DataFrame(userFeature_data)
        user_feature.to_csv('../data/userFeature_' + str(cnt) + '.csv', index=False)
        del userFeature_data, user_feature
        user_feature = pd.concat([pd.read_csv('../data/userFeature_' + str(i) + '.csv') for i in range(cnt + 1)]).reset_index(drop=True)
        user_feature.to_csv('../data/userFeature.csv', index=False)

train=pd.read_csv('../data/train.csv')
predict=pd.read_csv('../data/test1.csv')
train.loc[train['label']==-1,'label']=0   # negative samples: -1 -> 0
predict['label']=-1                        # mark the rows to predict
data=pd.concat([train,predict])
data=pd.merge(data,ad_feature,on='aid',how='left')
data=pd.merge(data,user_feature,on='uid',how='left')
data=data.fillna('-1')
one_hot_feature=['LBS','age','carrier','consumptionAbility','education','gender','house','os','ct','marriageStatus','advertiserId','campaignId', 'creativeId', 'adCategoryId', 'productId', 'productType']
vector_feature=['appIdAction','appIdInstall','interest1','interest2','interest3','interest4','interest5','kw1','kw2','kw3','topic1','topic2','topic3']
for feature in one_hot_feature:
    try:
        data[feature] = LabelEncoder().fit_transform(data[feature].apply(int))
    except (ValueError, TypeError):
        # Values that cannot be cast to int are encoded as they are.
        data[feature] = LabelEncoder().fit_transform(data[feature])

train=data[data.label!=-1]
train_y=train.pop('label')
# train, test, train_y, test_y = train_test_split(train,train_y,test_size=0.2, random_state=2018)
test=data[data.label==-1]
res=test[['aid','uid']].copy()   # copy so that adding 'score' later does not touch a view
test=test.drop('label',axis=1)
enc = OneHotEncoder()
train_x=train[['creativeSize']]
test_x=test[['creativeSize']]

for feature in one_hot_feature:
    enc.fit(data[feature].values.reshape(-1, 1))
    train_a=enc.transform(train[feature].values.reshape(-1, 1))
    test_a = enc.transform(test[feature].values.reshape(-1, 1))
    train_x= sparse.hstack((train_x, train_a))
    test_x = sparse.hstack((test_x, test_a))
print('one-hot prepared !')

cv=CountVectorizer()
for feature in vector_feature:
    cv.fit(data[feature])
    train_a = cv.transform(train[feature])
    test_a = cv.transform(test[feature])
    train_x = sparse.hstack((train_x, train_a))
    test_x = sparse.hstack((test_x, test_a))
print('cv prepared !')

# Note: LightGBM 4.0+ no longer accepts early_stopping_rounds in fit();
# pass callbacks=[lgb.early_stopping(100)] there instead.
def LGB_test(train_x,train_y,test_x,test_y):
    from multiprocessing import cpu_count
    print("LGB test")
    clf = lgb.LGBMClassifier(
        boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,
        max_depth=-1, n_estimators=1000, objective='binary',
        subsample=0.7, colsample_bytree=0.7, subsample_freq=1,
        learning_rate=0.05, min_child_weight=50,random_state=2018,n_jobs=cpu_count()-1
    )
    clf.fit(train_x, train_y,eval_set=[(train_x, train_y),(test_x,test_y)],eval_metric='auc',early_stopping_rounds=100)
    # print(clf.feature_importances_)
    return clf,clf.best_score_[ 'valid_1']['auc']

def LGB_predict(train_x,train_y,test_x,res):
    print("LGB predict")
    clf = lgb.LGBMClassifier(
        boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,
        max_depth=-1, n_estimators=1500, objective='binary',
        subsample=0.7, colsample_bytree=0.7, subsample_freq=1,
        learning_rate=0.05, min_child_weight=50, random_state=2018, n_jobs=100
    )
    clf.fit(train_x, train_y, eval_set=[(train_x, train_y)], eval_metric='auc',early_stopping_rounds=100)
    res['score'] = clf.predict_proba(test_x)[:,1]
    res['score'] = res['score'].apply(lambda x: float('%.6f' % x))
    res.to_csv('../data/submission.csv', index=False)
    os.system('zip baseline.zip ../data/submission.csv')
    return clf

model=LGB_predict(train_x,train_y,test_x,res)