Baseline model (to be continued)
Before training the baseline models, the data must first be normalized.
Min-max normalization works as follows: each feature is centered by subtracting its minimum value and then scaled by its range (maximum minus minimum), so that all values fall within [0, 1]. The purpose is to bring features of different magnitudes onto a uniform scale.
Use MinMaxScaler from sklearn.preprocessing to achieve this.
1. Data normalization
from sklearn.preprocessing import MinMaxScaler
labels_train=df_train['TARGET']
feature_train=df_train.drop(['TARGET'],axis=1)
feature_names=feature_train.columns
# Normalization
scaler=MinMaxScaler()
scaler.fit(feature_train)
feature_train=scaler.transform(feature_train)
feature_test=scaler.transform(df_test)
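As a quick sanity check, the transformed training features should all lie in [0, 1]. A minimal sketch with synthetic data standing in for the real feature matrix (the variable names here are illustrative, not from the original notebook):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real feature matrix
rng = np.random.default_rng(0)
X_train = rng.normal(loc=100, scale=20, size=(1000, 5))
X_test = rng.normal(loc=100, scale=20, size=(200, 5))

scaler = MinMaxScaler()
scaler.fit(X_train)                  # learn per-feature min and range on the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(), X_train_scaled.max())  # exactly 0.0 and 1.0 on the training set
```

Note that the test set is transformed with statistics learned on the training set, so test values can fall slightly outside [0, 1]; fitting the scaler on the training data only avoids leaking test-set information.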
2. Logistic regression model
First, use sklearn's LogisticRegression as the first model, with the L2 penalty and penalty coefficient C (which controls the strength of regularization: a larger C means weaker regularization and a closer fit to the training data, while a smaller C means stronger regularization).
from sklearn.linear_model import LogisticRegression
# Create the logistic regression model, with penalty coefficient C set to 1.0
lr=LogisticRegression(penalty='l2',C=1.0,class_weight='balanced')
lr.fit(feature_train,labels_train)
# Use predict_proba for prediction; the outputs lie in [0, 1] and can be read as probabilities. Take the second column (probability of the positive class).
lr_pred=lr.predict_proba(feature_test)[:,1]
lr_pred
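To see the role of C concretely, here is a small sketch on synthetic data (all variable names are illustrative): with the L2 penalty, a larger C shrinks the coefficients less, so the norm of the learned weights grows as C increases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(penalty='l2', C=C, max_iter=1000).fit(X, y)
    norms[C] = np.linalg.norm(clf.coef_)  # L2 norm of the fitted weights

# Weaker regularization (larger C) leaves larger coefficients
print(norms)
```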
Once you have the prediction results, build the submission file and submit it first:
submit=df_test[['SK_ID_CURR']].copy()
submit['TARGET']=lr_pred
Save the submission result as a CSV file
submit.to_csv('lr_baseline.csv',index=False)
3. Random Forest
Next, try other machine learning models, such as Random Forest, to see whether they perform better on the same data set.
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(
n_estimators=100,
criterion='entropy',
max_depth=8,
min_samples_leaf=50,
max_leaf_nodes=200,
bootstrap=True,
n_jobs=1,
random_state=0,
max_samples=0.8,
class_weight='balanced',
)
rf.fit(feature_train,labels_train)
rf_pred=rf.predict_proba(feature_test)[:,1]
rf_pred
Create submission result format
submit=df_test[['SK_ID_CURR']].copy()
submit['TARGET']=rf_pred
Save the submission result file
submit.to_csv('rf_Baseline.csv',index=False)
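Rather than submitting every model to compare leaderboard scores, a held-out validation split scored with ROC AUC (this competition's metric) can rank models locally. A minimal sketch on synthetic imbalanced data, with illustrative names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data, roughly mimicking a rare-default target
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

aucs = {}
for name, model in [
    ('logreg', LogisticRegression(class_weight='balanced', max_iter=1000)),
    ('rf', RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=0)),
]:
    model.fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f'{name}: AUC = {aucs[name]:.3f}')
```

AUC is computed from the predicted probabilities (`predict_proba`), not the hard class labels, which matches how the submission files above are built.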
4. LightGBM model
import lightgbm as lgb
params={
'boosting_type':'gbdt',
'objective':'binary',
'n_estimators':100,
'learning_rate':0.01,
'max_depth':6,
'num_leaves':50,
'subsample':0.8,
'colsample_bytree':0.8,
'subsample_freq':10,
'reg_alpha':0.1,
'reg_lambda':0.1,
'is_unbalance':True,
}
model=lgb.LGBMClassifier(**params,random_state=0)
model.fit(feature_train,labels_train)
lgb_pred=model.predict_proba(feature_test)[:,1]
Create submission result file
submit=df_test[['SK_ID_CURR']].copy()
submit['TARGET']=lgb_pred
Save the submission result file
submit.to_csv('lgb_baseline.csv',index=False)
5. Test the effectiveness of the new features
5.1 Test the validity of polynomial features
Verify whether the polynomial features and domain-knowledge features are useful to the model: add them to the data set, obtain actual predictions, and compare based on the submitted scores.
First do data normalization
scaler=MinMaxScaler()
scaler.fit(df_train_poly)
feats_train_poly=scaler.transform(df_train_poly)
feats_test_poly=scaler.transform(df_test_poly)
Use lightgbm for verification
params={
'boosting_type':'gbdt',
'objective':'binary',
'n_estimators':100,
'learning_rate':0.01,
'max_depth':6,
'num_leaves':50,
'subsample':0.8,
'colsample_bytree':0.8,
'subsample_freq':10,
'reg_alpha':0.1,
'reg_lambda':0.1,
'is_unbalance':True,
}
lgb_poly=lgb.LGBMClassifier(**params,random_state=0)
lgb_poly.fit(feats_train_poly,labels_train)
lgb_pred_poly=lgb_poly.predict_proba(feats_test_poly)[:,1]
Create submission result format
submit=df_test[['SK_ID_CURR']].copy()
submit['TARGET']=lgb_pred_poly
Save the submission result file
submit.to_csv('lgb_domain.csv',index=False)
- The score is only 0.582; the newly constructed features bring little improvement to the lightgbm model's predictions (facepalm...)
5.3 Verifying domain knowledge features
Data normalization
(Insert code snippet here)
5.4 View feature importance
One way to assess the validity of features is to inspect the feature_importances_ attribute of a tree model
feature_importance_values=model.feature_importances_
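A sketch of ranking features by importance, using synthetic data and a Random Forest (the real code would use the trained model and the feature_names saved above; names here are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
feature_names = [f'f{i}' for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each feature with its importance and sort descending
importances = (
    pd.DataFrame({'feature': feature_names,
                  'importance': model.feature_importances_})
    .sort_values('importance', ascending=False)
    .reset_index(drop=True)
)
print(importances.head())
```

For Random Forest the importances are normalized to sum to 1, so the top rows show which features carry most of the split gain.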