LightGBM parameter explanation
boosting='gbdt' — switching the booster to 'rf' gave slightly better results here.
is_unbalance=True — the real data is indeed imbalanced, but setting this parameter made the results worse.
bagging_fraction=0.7,
bagging_freq=1,
With the bagging method, 70% of the data is randomly sampled for training, and bagging is performed every iteration.
The improvement was not significant, although in principle it should help (the features are probably too weak).
num_leaves=64,
max_depth=6
num_leaves should be set to roughly 2^max_depth, but a value near that number may work better than the exact value.
colsample_bytree=0.8
Randomly sample 80% of the features for each tree to prevent overfitting. All of the parameters above are collected into a sketch right below.
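
A minimal sketch combining the settings discussed above (the training data X_train/y_train are placeholders, not from the original code; in the sklearn API, subsample/subsample_freq correspond to bagging_fraction/bagging_freq):

import lightgbm as lgb

clf = lgb.LGBMClassifier(
    boosting_type='gbdt',   # 'rf' scored slightly better in this experiment
    max_depth=6,
    num_leaves=64,          # about 2**max_depth; nearby values may do better
    subsample=0.7,          # bagging_fraction: train on a random 70% of rows
    subsample_freq=1,       # bagging_freq: re-sample every iteration
    colsample_bytree=0.8,   # random 80% of features per tree, against overfitting
    # is_unbalance=True,    # the data is imbalanced, but this hurt the score
)
# clf.fit(X_train, y_train)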
Recommending a fairly fresh Chinese translation of the LightGBM documentation:
http://lightgbm.apachecn.org/cn/latest/Parameters.html
bagging
Bagging was not implemented in a full-fledged way here: the two best sets of predictions were simply added and averaged. An improved version weights each model's predictions by the inverse of its training-set logloss. Either way, the gain from bagging is fairly consistent and almost always improves the result; a sketch of the weighted version follows.
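
A minimal sketch of the inverse-logloss weighting just described (the prediction arrays and loss values are placeholder assumptions):

import numpy as np

# Hypothetical predictions from two models on the same samples,
# plus each model's logloss on the training set (smaller is better).
pred_a = np.array([0.10, 0.80, 0.30])
pred_b = np.array([0.20, 0.70, 0.40])
loss_a, loss_b = 0.0824, 0.0915

# Simple bagging: plain average.
simple = (pred_a + pred_b) / 2

# Weighted bagging: weight each model by the inverse of its logloss,
# so the model with the smaller loss contributes more.
w_a, w_b = 1 / loss_a, 1 / loss_b
weighted = (pred_a * w_a + pred_b * w_b) / (w_a + w_b)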
boosting
This idea is built into many packages, so there is not much to cover here.
stacking
Stacking is said to be a competition killer, but it is very expensive to run, and xgb in particular is slow.
So the algorithm was only roughly implemented, with each base learner capped at about 200 iterations.
First layer: xgb + lgb + cat
Second layer: xgb
The score actually got worse. In principle this is exactly the kind of method that ought to work, so perhaps I have not fully understood it; a minimal sketch of the out-of-fold idea it relies on follows.
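
For reference, a minimal sketch of the out-of-fold construction that stacking relies on (numpy-array inputs and the LogisticRegression base model are placeholder assumptions, not the original code); the full version used in this experiment is attached below:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def get_oof(clf, X, y, X_test, n_splits=5):
    # Each training row is predicted by a model that never saw it,
    # so the second-layer features carry no label leakage.
    oof_train = np.zeros(len(X))
    oof_test = np.zeros((len(X_test), n_splits))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for i, (tr, te) in enumerate(kf.split(X)):
        clf.fit(X[tr], y[tr])
        oof_train[te] = clf.predict_proba(X[te])[:, 1]
        oof_test[:, i] = clf.predict_proba(X_test)[:, 1]
    # Test-set predictions are averaged over the folds.
    return oof_train, oof_test.mean(axis=1)

# Usage sketch: the out-of-fold column from each first-layer model becomes
# an input feature of the second-layer model.
# meta_train, meta_test = get_oof(LogisticRegression(), X, y, X_test)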
Attached below is the simple stacking + bagging (weighted average of results) code.
Input: a DataFrame data containing the constructed features.
import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold

# Assumes `data` is a DataFrame with the constructed features (see above).
#-------------------------------------------- Declare training and test data
train = data[(data['day'] >= 18) & (data['day'] <= 23)]
test = data[(data['day'] == 24)]
drop_name = ['is_trade', 'item_category_list', 'item_property_list',
             'predict_category_property', 'realtime', 'context_timestamp']
col = [c for c in train if c not in drop_name]
X = train[col].reset_index(drop=True)   # reset so KFold's positional indices line up
y = train['is_trade'].values
X_tes = test[col]
y_tes = test['is_trade'].values

#-------------------------------------------- First layer of the stacking
clfs = [
    # RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
    # ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
    xgb.XGBClassifier(n_estimators=200, max_depth=4, seed=5,
                      learning_rate=0.1, subsample=0.8,
                      min_child_weight=6, colsample_bytree=.8,
                      scale_pos_weight=1.6, gamma=10,
                      reg_alpha=8, reg_lambda=1.3,
                      eval_metric='logloss'),
    CatBoostClassifier(verbose=True, depth=8, iterations=200,
                       learning_rate=0.1, eval_metric='AUC',
                       bagging_temperature=0.8, l2_leaf_reg=4,
                       rsm=0.8, random_seed=10086),
    lgb.LGBMClassifier(objective='binary', metric='binary_logloss',
                       num_leaves=35, max_depth=8, learning_rate=0.1,
                       seed=2018, colsample_bytree=0.8,
                       # min_child_samples=8,
                       subsample=0.9, n_estimators=200),
    lgb.LGBMClassifier(objective='binary', metric='auc',
                       num_leaves=35, max_depth=8, learning_rate=0.1,
                       seed=2018, colsample_bytree=0.8,
                       # min_child_samples=8,
                       subsample=0.9, subsample_freq=1,  # 'rf' mode requires bagging
                       n_estimators=200, boosting='rf'),
]
cfs = len(clfs)
ntrain = X.shape[0]   # e.g. 891
ntest = X_tes.shape[0]   # e.g. 418
kf = KFold(n_splits=cfs, shuffle=True, random_state=2017)

# Out-of-fold predictions: one column per first-layer model.
oof_train = np.zeros((ntrain, cfs))
oof_test = np.zeros((ntest, cfs))
for j, clf in enumerate(clfs):
    print('Training classifier [%s]' % j)
    oof_test_skf = np.zeros((ntest, cfs))   # one column per fold (n_splits == cfs here)
    for i, (train_index, test_index) in enumerate(kf.split(X)):
        kf_x_train = X.iloc[train_index]   # e.g. 4/5 of the rows
        kf_y_train = y[train_index]
        kf_x_test = X.iloc[test_index]     # the held-out fold
        clf.fit(kf_x_train, kf_y_train)
        oof_train[test_index, j] = clf.predict_proba(kf_x_test)[:, 1]
        oof_test_skf[:, i] = clf.predict_proba(X_tes)[:, 1]
    oof_test[:, j] = oof_test_skf.mean(axis=1)   # average test preds over folds

#---------------------- Second layer: LR or lgb both work
#------------------------------- lgb as second layer --------- 0.08248948001930827
lgb0 = lgb.LGBMClassifier(objective='binary', metric='binary_logloss',
                          num_leaves=35, max_depth=8, learning_rate=0.05,
                          seed=2018, colsample_bytree=0.8,
                          # min_child_samples=8,
                          subsample=0.9, n_estimators=200)
# On lightgbm >= 4 use callbacks=[lgb.early_stopping(200)] instead.
lgb_model = lgb0.fit(oof_train, y, eval_set=[(oof_test, y_tes)],
                     early_stopping_rounds=200)
best_iter = lgb_model.best_iteration_
lgb2 = lgb.LGBMClassifier(objective='binary', metric='binary_logloss',
                          num_leaves=35, max_depth=8, learning_rate=0.05,
                          seed=2018, colsample_bytree=0.8,
                          # min_child_samples=8,
                          subsample=0.9, n_estimators=300)
lgb2.fit(oof_train, y)
# Collect everything in one DataFrame so the bagging section below can mix columns.
pred = pd.DataFrame()
pred['is_trade'] = y_tes
pred['stacking_pred'] = lgb2.predict_proba(oof_test)[:, 1]
logloss = log_loss(pred['is_trade'], pred['stacking_pred'])

#----------------------------- LR as second layer ------- 0.08304983559720211
lr = LogisticRegression(n_jobs=-1)
lr.fit(oof_train, y)
pred['stacking_lr_pred'] = lr.predict_proba(oof_test)[:, 1]
logloss = log_loss(pred['is_trade'], pred['stacking_lr_pred'])

#-------------------- Comparison of the results
# lgb on the raw features ----- 0.08245197939071315
lgb0 = lgb.LGBMClassifier(objective='binary',
                          # metric='binary_error',
                          num_leaves=35, max_depth=8, learning_rate=0.05,
                          seed=2018, colsample_bytree=0.8,
                          # min_child_samples=8,
                          subsample=0.9, n_estimators=300)
lgb_model = lgb0.fit(X, y)
pred['lgb_pred'] = lgb_model.predict_proba(test[col])[:, 1]
logloss = log_loss(pred['is_trade'], pred['lgb_pred'])

# cat on the raw features --------- 0.08236626308985852
cat0 = CatBoostClassifier(verbose=True, depth=8, iterations=200,
                          learning_rate=0.1, eval_metric='AUC',
                          bagging_temperature=0.8, l2_leaf_reg=4,
                          rsm=0.8, random_seed=10086)
cat_model = cat0.fit(X, y)
pred['cat_pred'] = cat_model.predict_proba(X_tes)[:, 1]
logloss = log_loss(pred['is_trade'], pred['cat_pred'])

# LR on the raw features --------------- 0.0914814461101563
lr = LogisticRegression(n_jobs=-1)
lr.fit(X, y)
pred['lr_pred'] = lr.predict_proba(test[col])[:, 1]
logloss = log_loss(pred['is_trade'], pred['lr_pred'])

# Simple bagging -------------- 0.08223091072342852
pred['bagg_pred'] = (pred['cat_pred'] + pred['lgb_pred']) / 2
logloss = log_loss(pred['is_trade'], pred['bagg_pred'])

# Weighted bagging: each model weighted by the inverse of its logloss -------------
lr_w = 1 / 0.0914814461101563
cat_w = 1 / 0.08222911268463354
lgb_w = 1 / 0.08237142613041541
# Version that also includes LR (tried and commented out):
# pred['weight_bagg_pred'] = (pred['cat_pred']*cat_w + pred['lgb_pred']*lgb_w
#                             + pred['lr_pred']*lr_w) / (lr_w + cat_w + lgb_w)
pred['weight_bagg_pred'] = (pred['cat_pred'] * cat_w +
                            pred['lgb_pred'] * lgb_w) / (cat_w + lgb_w)
logloss = log_loss(pred['is_trade'], pred['weight_bagg_pred'])
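
For reference, the loglosses recorded in the comments above, gathered in one place (lower is better):

method                         logloss
lgb on the raw features        0.08245
cat on the raw features        0.08237
LR on the raw features         0.09148
stacking, second layer lgb     0.08249
stacking, second layer LR      0.08305
simple bagging (cat + lgb)/2   0.08223

The simple average of cat and lgb scores best here, which matches the note in the bagging section that bagging almost always helps, while the stacking variants underperformed the single models.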