Tianchi-IJCAI-18 Alimama search advertising conversion prediction, a beginner's experience (Part 3: LightGBM parameter tuning and ensembling)

LightGBM parameter notes

boosting='gbdt': switching the booster to 'rf' gave slightly better results in my runs.

is_unbalance=True: the real data is indeed imbalanced, but turning this parameter on actually hurt the training results.

       bagging_fraction=0.7,

       bagging_freq=1,

With bagging enabled, 70% of the data is randomly sampled for each tree, and the bagging is re-done every iteration.

The improvement was not obvious, though in principle it should help. (My features are probably just too weak.)

       num_leaves=64,

       max_depth=6

num_leaves is usually set to roughly 2^max_depth, but a value slightly away from that may work better.

colsample_bytree=0.8

Randomly selects 80% of the features for each tree to reduce overfitting. A combined sketch of these settings is given below.
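Putting the parameters above together, here is a minimal sketch of a LightGBM training call. X and y stand for the feature matrix and labels prepared elsewhere, and the values are illustrative rather than my actual submission settings:

import lightgbm as lgb

# Hypothetical parameter set illustrating the options discussed above.
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting': 'gbdt',        # 'rf' was slightly better in my runs
    'num_leaves': 64,          # roughly 2**max_depth
    'max_depth': 6,
    'bagging_fraction': 0.7,   # sample 70% of the rows for each iteration
    'bagging_freq': 1,         # re-do the bagging every iteration
    'feature_fraction': 0.8,   # same idea as colsample_bytree
    # 'is_unbalance': True,    # hurt the results in my experiments, left off
}

train_set = lgb.Dataset(X, label=y)  # X, y: features and labels built earlier
model = lgb.train(params, train_set, num_boost_round=200)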

Recommending a recent Chinese translation of the LightGBM documentation:

http://lightgbm.apachecn.org/cn/latest/Parameters.html

bagging

I have not implemented bagging in any sophisticated way: I simply added the two best sets of predictions and averaged them. A slightly more advanced version weights each model's predictions by the inverse of its training-set logloss. Even so, bagging's boost is fairly consistent and almost always improves the score.
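As a minimal sketch of that weighted average, assuming pred_a and pred_b are two models' predicted probabilities and loss_a and loss_b their validation loglosses (all of these names are placeholders):

# Weight each model's predictions by the inverse of its logloss.
w_a = 1.0 / loss_a
w_b = 1.0 / loss_b
blend = (w_a * pred_a + w_b * pred_b) / (w_a + w_b)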

boosting

This idea is already built into the packages used here, so there is not much to say about it.

stacking

Stacking is said to be a killer technique in competitions, but it is expensive to run, and xgboost in particular is slow.

So I only implemented it in a basic form, limiting each first-layer learner to about 200 iterations.

First layer: xgb + lgb + cat

Second layer: xgb (the attached code uses lgb or LR as the second layer)

The score actually got worse. In principle this is a method I took for granted would work, so it may simply be that I have not understood it well enough yet.

Attached below is simple stacking code, plus bagging via a (weighted) average of the results.

Input: a DataFrame `data` containing the constructed features.

import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold

#-------------------------------------------- Declare training and test data
train = data[(data['day'] >= 18) & (data['day'] <= 23)]
test = data[(data['day'] == 24)]
drop_name = ['is_trade',
             'item_category_list', 'item_property_list',
             'predict_category_property',
             'realtime', 'context_timestamp'
             ]
col = [c for c in train if c not in drop_name]

X = train[col]
y = train['is_trade'].values
X_tes = test[col]
y_tes = test['is_trade'].values

#-------------------------------------------- The first layer of stacking
clfs = [
# RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
# ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
xgb.sklearn.XGBClassifier(n_estimators=200, max_depth=4, seed=5,
                          learning_rate=0.1, subsample=0.8,
                          min_child_weight=6, colsample_bytree=.8,
                          scale_pos_weight=1.6, gamma=10,
                          reg_alpha=8, reg_lambda=1.3, silent=False,
                          eval_metric='logloss'),
CatBoostClassifier(verbose=True, depth=8, iterations=200, learning_rate=0.1,
                   eval_metric='AUC', bagging_temperature=0.8,
                   l2_leaf_reg=4,
                   rsm=0.8, random_seed=10086),
lgb.LGBMClassifier(
    objective='binary',
    metric='binary_logloss',
    num_leaves=35,
    max_depth=8,
    learning_rate=0.1,
    seed=2018,
    colsample_bytree=0.8,
    # min_child_samples=8,
    subsample=0.9,
    n_estimators=200),
lgb.LGBMClassifier(
    objective='binary',
    metric='auc',
    num_leaves=35,
    max_depth=8,
    learning_rate=0.1,
    seed=2018,
    colsample_bytree=0.8,
    # min_child_samples=8,
    subsample=0.9,
    subsample_freq=1,   # bagging must be active when boosting='rf'
    n_estimators=200,
    boosting='rf'
)
]

cfs = len(clfs)          # number of first-layer models (also used as the number of folds)
ntrain = X.shape[0]
ntest = X_tes.shape[0]
kf = KFold(n_splits=cfs, shuffle=True, random_state=2017)
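# Out-of-fold (OOF) scheme: each first-layer model predicts every training row
# only in the fold where that row was held out, so the second-layer features
# are never produced by a model that has seen the row it predicts.
# Test-set predictions are averaged over the folds.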

oof_train = np.zeros((ntrain, cfs))  # one column of out-of-fold predictions per first-layer model
oof_test = np.zeros((ntest, cfs))    # one column of averaged test predictions per first-layer model
for j, clf in enumerate(clfs):
    print('Training classifier [%s]' % (j))
    oof_test_skf = np.zeros((ntest, cfs))  # per-fold test predictions for this model
    for i, (train_index, test_index) in enumerate(kf.split(X)):
        kf_x_train = X.iloc[train_index]   # positional indexing, since X keeps its original labels
        kf_y_train = y[train_index]
        kf_x_test = X.iloc[test_index]

        clf.fit(kf_x_train, kf_y_train)

        oof_train[test_index, j] = clf.predict_proba(kf_x_test)[:, 1]
        oof_test_skf[:, i] = clf.predict_proba(X_tes)[:, 1]

    oof_test[:, j] = oof_test_skf.mean(axis=1)




#---------------------- The second layer: either LR or lgb can be used
#-------------------------------Select lgb---------0.08248948001930827
lgb0 = lgb.LGBMClassifier(
    objective='binary',
    metric='binary_logloss',
    num_leaves=35,
    max_depth=8,
    learning_rate=0.05,
    seed=2018,
    colsample_bytree=0.8,
    # min_child_samples=8,
    subsample=0.9,
    n_estimators=200)

lgb_model = lgb0.fit(oof_train, y, eval_set=[(oof_test, y_tes)], early_stopping_rounds=200)
best_iter = lgb_model.best_iteration_
# best_iter is available from the early-stopped run; the refit below keeps a fixed 300 rounds
lgb2 = lgb.LGBMClassifier(
    objective='binary',
    metric='binary_logloss',
    num_leaves=35,
    max_depth=8,
    learning_rate=0.05,
    seed=2018,
    colsample_bytree=0.8,
    # min_child_samples=8,
    subsample=0.9,
    n_estimators=300)
lgb2.fit(oof_train, y)
Y_test_predict = lgb2.predict_proba(oof_test)[:, 1]
pred = pd.DataFrame()
pred['stacking_pred'] = Y_test_predict
pred['is_trade'] = y_tes
logloss = log_loss(pred['is_trade'], pred['stacking_pred'])


#-----------------------------select lr-------0.08304983559720211
lr = LogisticRegression(n_jobs=-1)
lr.fit(oof_train, y)
Y_test_predict = lr.predict_proba(oof_test)[:,1]
pred = pd.DataFrame()
pred['stacking_pred'] = Y_test_predict
pred['is_trade'] = y_tes
logloss = log_loss(pred['is_trade'],pred['stacking_pred'])
#--------------------Comparison of results

#Use the original data to predict the lgb-----0.08245197939071315
lgb0 = lgb.LGBMClassifier(
    objective='binary',
    # metric='binary_error',
    num_leaves=35,
    max_depth=8,
    learning_rate=0.05,
    seed=2018,
    colsample_bytree=0.8,
    # min_child_samples=8,
    subsample=0.9,
    n_estimators=300)
lgb_model = lgb0.fit(X, y)
Y_test_predict = lgb_model.predict_proba(test[col])[:, 1]
pred = pd.DataFrame()
pred['lgb_pred'] = Y_test_predict
pred['is_trade'] = y_tes
logloss = log_loss(pred['is_trade'], pred['lgb_pred'])
#cat with the original data---------0.08236626308985852
cat0 = CatBoostClassifier(verbose=True,depth=8,iterations=200,learning_rate=0.1,
                   eval_metric='AUC',bagging_temperature=0.8,
                   l2_leaf_reg = 4,
                   rsm=0.8,random_seed=10086)
cat_model = cat0.fit(X, y)
Y_test_predict = cat_model.predict_proba(X_tes)[:, 1]
# keep the lgb predictions above in the same DataFrame so they can be blended later
pred['cat_pred'] = Y_test_predict
pred['is_trade'] = y_tes
logloss = log_loss(pred['is_trade'], pred['cat_pred'])

#Use the original data to predict lr---------------0.0914814461101563
lr = LogisticRegression(n_jobs=-1)
lr.fit(X, y)
Y_test_predict = lr.predict_proba(test[col])[:, 1]
# pred = pd.DataFrame()
pred['lr_pred'] = Y_test_predict
pred['is_trade'] = y_tes
logloss = log_loss(pred['is_trade'], pred['lr_pred'])
# Do a simple bagging -------------- 0.08223091072342852
pred['bagg_pred'] = (pred['cat_pred'] + pred['lgb_pred']) / 2
logloss = log_loss(pred['is_trade'], pred['bagg_pred'])
# Do a weighted bagging -------------
# the weights are the inverse of each model's logloss on the validation day
lr_loss = 1 / 0.0914814461101563
cat_loss = 1 / 0.08222911268463354
lgb_loss = 1 / 0.08237142613041541
# pred['weight_bagg_pred'] = (pred['cat_pred']*cat_loss +
#                             pred['lgb_pred']*lgb_loss + pred['lr_pred']*lr_loss
#                             ) / (lr_loss + cat_loss + lgb_loss)
pred['weight_bagg_pred'] = (pred['cat_pred']*cat_loss +
                            pred['lgb_pred']*lgb_loss
                            ) / (cat_loss + lgb_loss)
logloss = log_loss(pred['is_trade'], pred['weight_bagg_pred'])





