Insight Trend Series (3): Model Training (Baseline Models)

Before training the baseline models, the data needs to be normalized.
Min-max normalization first centers each feature by subtracting its minimum, then scales it by the range (maximum minus minimum), so every value falls in [0, 1]: x' = (x - x_min) / (x_max - x_min). The purpose is to bring all features onto a uniform scale.
sklearn.preprocessing's MinMaxScaler implements this transform.
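
As a quick illustration of the formula above, a toy single-feature column [1, 5, 9] has minimum 1 and range 8, so it maps to [0.0, 0.5, 1.0]:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
# (1-1)/8=0.0, (5-1)/8=0.5, (9-1)/8=1.0
demo=np.array([[1.0],[5.0],[9.0]])
print(MinMaxScaler().fit_transform(demo).ravel())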

1. Data normalization

from sklearn.preprocessing import MinMaxScaler
# Separate the labels from the features
labels_train=df_train['TARGET']
feature_train=df_train.drop(['TARGET'],axis=1)
feature_names=feature_train.columns
# Normalize: fit the scaler on the training set only,
# then apply the same transform to both train and test
scaler=MinMaxScaler()
scaler.fit(feature_train)
feature_train=scaler.transform(feature_train)
feature_test=scaler.transform(df_test)

2. Logistic regression model

First, use sklearn's LogisticRegression as the first model, with L2 regularization and penalty coefficient C. Note that C is the inverse of the regularization strength: smaller values mean stronger regularization, which helps control overfitting.

from sklearn.linear_model import LogisticRegression
# Create the logistic regression model with L2 penalty and C=1.0;
# class_weight='balanced' compensates for the imbalanced target
lr=LogisticRegression(penalty='l2',C=1.0,class_weight='balanced')
lr.fit(feature_train,labels_train)
# predict_proba outputs values in [0,1] that can be read as probabilities;
# take the second column, the probability of class 1
lr_pred=lr.predict_proba(feature_test)[:,1]
lr_pred

Once you have the predictions, build the submission DataFrame

submit=df_test[['SK_ID_CURR']].copy()  # .copy() avoids a SettingWithCopyWarning
submit['TARGET']=lr_pred

Save the submission result as a CSV file

submit.to_csv('lr_baseline.csv',index=False)
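
Submitting is not the only way to gauge the model. A quick sanity check is to hold out part of the training data and score it locally; below is a minimal sketch, reusing feature_train and labels_train from step 1 and scoring with ROC AUC (a standard metric for imbalanced binary classification):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
# Hold out 20% of the training data, stratified to keep the class ratio
X_tr,X_val,y_tr,y_val=train_test_split(
	feature_train,labels_train,test_size=0.2,random_state=0,stratify=labels_train)
lr_val=LogisticRegression(penalty='l2',C=1.0,class_weight='balanced')
lr_val.fit(X_tr,y_tr)
# AUC on the held-out split approximates the leaderboard score
print('Validation AUC:',roc_auc_score(y_val,lr_val.predict_proba(X_val)[:,1]))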

3. Random Forest

Next, try another machine-learning model, the random forest, to see whether it performs better on the same data set.

from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(
	n_estimators=100,        # number of trees
	criterion='entropy',     # measure split quality by information gain
	max_depth=8,             # limit depth to curb overfitting
	min_samples_leaf=50,     # each leaf must cover at least 50 samples
	max_leaf_nodes=200,      # cap the number of leaves per tree
	bootstrap=True,          # each tree trains on a bootstrap sample
	n_jobs=1,                # set to -1 to use all CPU cores
	random_state=0,
	max_samples=0.8,         # each tree sees 80% of the rows (sklearn>=0.22)
	class_weight='balanced', # reweight classes for the imbalanced target
)
rf.fit(feature_train,labels_train)
rf_pred=rf.predict_proba(feature_test)[:,1]
rf_pred

Create the submission DataFrame

submit=df_test[['SK_ID_CURR']].copy()
submit['TARGET']=rf_pred

Save the submission result file

submit.to_csv('rf_baseline.csv',index=False)

4. LightGBM model

Finally, train a LightGBM gradient-boosted tree model with a typical set of baseline hyperparameters.

import lightgbm as lgb
params={
	'boosting_type':'gbdt',    # gradient-boosted decision trees
	'objective':'binary',      # binary classification
	'n_estimators':100,
	'learning_rate':0.01,
	'max_depth':6,
	'num_leaves':50,
	'subsample':0.8,           # row sampling per boosting round
	'colsample_bytree':0.8,    # feature sampling per tree
	'subsample_freq':10,       # resample rows every 10 iterations
	'reg_alpha':0.1,           # L1 regularization
	'reg_lambda':0.1,          # L2 regularization
	'is_unbalance':True,       # adjust for the imbalanced target
}
model=lgb.LGBMClassifier(**params,random_state=0)
model.fit(feature_train,labels_train)
lgb_pred=model.predict_proba(feature_test)[:,1]

Create the submission DataFrame

submit=df_test[['SK_ID_CURR']].copy()
submit['TARGET']=lgb_pred

Save the submission result file

submit.to_csv('lgb_baseline.csv',index=False)

5. Test the effectiveness of the new features

5.1 Test the validity of polynomial features

To verify whether the polynomial and domain-knowledge features actually help the model, add them to the data set, generate predictions, and compare the submitted scores.
First, normalize the data as before:

# Fit the scaler on the polynomial-feature training set, then transform both sets
scaler=MinMaxScaler()
scaler.fit(df_train_poly)
feats_train_poly=scaler.transform(df_train_poly)
feats_test_poly=scaler.transform(df_test_poly)

Use LightGBM for the check, with the same hyperparameters as before:

params={
	'boosting_type':'gbdt',
	'objective':'binary',
	'n_estimators':100,
	'learning_rate':0.01,
	'max_depth':6,
	'num_leaves':50,
	'subsample':0.8,
	'colsample_bytree':0.8,
	'subsample_freq':10,
	'reg_alpha':0.1,
	'reg_lambda':0.1,
	'is_unbalance':True,
}
lgb_poly=lgb.LGBMClassifier(**params,random_state=0)
lgb_poly.fit(feats_train_poly,labels_train)
lgb_pred_poly=lgb_poly.predict_proba(feats_test_poly)[:,1]

Create the submission DataFrame

submit=df_test[['SK_ID_CURR']].copy()
submit['TARGET']=lgb_pred_poly

Save the submission result file

submit.to_csv('lgb_domain.csv',index=False)

Score

  • The score is only 0.582, so the newly constructed features bring little improvement to the LightGBM model's predictions (facepalm...)

5.2 Verifying the domain-knowledge features

Data normalization

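A minimal sketch, mirroring the normalization in 5.1 and assuming the domain-knowledge features were collected into hypothetical df_train_domain and df_test_domain DataFrames during feature engineering:

# Hypothetical names: df_train_domain / df_test_domain hold the domain-knowledge features
scaler=MinMaxScaler()
scaler.fit(df_train_domain)
feats_train_domain=scaler.transform(df_train_domain)
feats_test_domain=scaler.transform(df_test_domain)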

5.3 View feature importance

One way to gauge the usefulness of features is to inspect the feature_importances_ attribute of a tree model:

feature_importance_values=model.feature_importances_
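
To make the raw scores readable, one possible sketch pairs each importance with its feature name (reusing the feature_names saved in step 1) and sorts them in descending order:

import pandas as pd
# Pair each importance score with its feature name and sort descending
feature_importances=pd.DataFrame({
	'feature':feature_names,
	'importance':model.feature_importances_,
}).sort_values('importance',ascending=False)
print(feature_importances.head(15))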

Origin: blog.csdn.net/weixin_42961082/article/details/113875328