Copyright notice: this is an original article by the author and may not be reproduced without permission. https://blog.csdn.net/u014281392/article/details/81177794
Previously, starting from application_train.csv and application_test.csv, some simple feature engineering produced the following files (three different feature treatments of the recoded data):
- Polynomial Features :poly_train_data.csv,poly_test_data.csv
- Domain Knowledge Features: domain_train_data.csv,domain_test_data.csv
- Featuretools:auto_train_data.csv, auto_test_data.csv
Model Training
Logistic Regression
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import MinMaxScaler, Imputer
from sklearn.linear_model import LogisticRegression
# load data
poly_train = pd.read_csv('data/poly_train_data.csv')
poly_test = pd.read_csv('data/poly_test_data.csv')
domain_train = pd.read_csv('data/domain_train_data.csv')
domain_test = pd.read_csv('data/domain_test_data.csv')
auto_train = pd.read_csv('data/auto_train_data.csv')
auto_test = pd.read_csv('data/auto_test_data.csv')
Missing-value imputation and scaling
target = poly_train['TARGET']
Id = poly_test[['SK_ID_CURR']]
Polynomial features
poly_train = poly_train.drop(['TARGET'], axis = 1)
# feature names
poly_features = list(poly_train.columns)
# fill missing values with the median
imputer = Imputer(strategy = 'median')
# scale feature to 0-1
scaler = MinMaxScaler(feature_range = (0, 1))
# fit train data
imputer.fit(poly_train)
# Transform train test data
poly_train = imputer.transform(poly_train)
poly_test = imputer.transform(poly_test)
# scaler
scaler.fit(poly_train)
poly_train = scaler.transform(poly_train)
poly_test = scaler.transform(poly_test)
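Note that `Imputer` was removed from scikit-learn in 0.22; a minimal sketch of the same median-impute-then-scale step with the newer `SimpleImputer` (the tiny matrices here are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# toy train/test matrices with missing values (hypothetical data)
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_test = np.array([[2.0, 15.0], [np.nan, 25.0]])

imputer = SimpleImputer(strategy='median')   # median fill, as above
scaler = MinMaxScaler(feature_range=(0, 1))

# fit on the training data only, then transform both,
# so no test-set statistics leak into the preprocessing
X_train = scaler.fit_transform(imputer.fit_transform(X_train))
X_test = scaler.transform(imputer.transform(X_test))
```

Fitting the imputer and scaler on the training split only is the same leakage-avoiding pattern used above.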
Domain features
domain_train = domain_train.drop(['TARGET'], axis = 1)
domain_features = list(domain_train.columns)
# fit train data
imputer.fit(domain_train)
# Transform train test data
domain_train = imputer.transform(domain_train)
domain_test = imputer.transform(domain_test)
# scaler
scaler.fit(domain_train)
domain_train = scaler.transform(domain_train)
domain_test = scaler.transform(domain_test)
Featuretools
auto_train = auto_train.drop(['TARGET'], axis = 1)
auto_features = list(auto_train.columns)
# fit train data
imputer.fit(auto_train)
# Transform train test data
auto_train = imputer.transform(auto_train)
auto_test = imputer.transform(auto_test)
# scaler
scaler.fit(auto_train)
auto_train = scaler.transform(auto_train)
auto_test = scaler.transform(auto_test)
print('poly_train',poly_train.shape)
print('poly_test',poly_test.shape)
print('domain_train',domain_train.shape)
print('domain_test',domain_test.shape)
print('auto_train',auto_train.shape)
print('auto_test',auto_test.shape)
poly_train (307511, 274)
poly_test (48744, 274)
domain_train (307511, 244)
domain_test (48744, 244)
auto_train (307511, 239)
auto_test (48744, 239)
LogisticRegression
lr = LogisticRegression(C = 0.0001, class_weight = 'balanced') # C: regularization parameter (inverse strength)
Polynomial
lr.fit(poly_train, target)
lr_poly_pred = lr.predict_proba(poly_test)[:,1]
# submission dataframe
submit = Id.copy()
submit['TARGET'] = lr_poly_pred
submit.to_csv('lr_poly_submit.csv',index = False)
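Before submitting, a quick local estimate of the model's AUC is possible with cross-validation; a sketch on synthetic data (the dataset below is generated with `make_classification`, not the competition data, with a class imbalance roughly mimicking the real target):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic, imbalanced stand-in for the real training matrix
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.92, 0.08], random_state=0)

lr = LogisticRegression(C=0.0001, class_weight='balanced')
# 5-fold ROC AUC, mirroring the competition metric
scores = cross_val_score(lr, X, y, cv=5, scoring='roc_auc')
print('mean CV AUC: %.3f' % scores.mean())
```

A local CV score like this tracks the leaderboard score reasonably well and saves submission attempts.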
Domain Knowledge
lr = LogisticRegression(C = 0.0001,class_weight='balanced',solver='sag')
lr.fit(domain_train, target)
lr_domain_pred = lr.predict_proba(domain_test)[:,1]
submit = Id.copy()
submit['TARGET'] = lr_domain_pred
submit.to_csv('lr_domain_submit.csv',index = False)
FeatureTools
lr = LogisticRegression(C = 0.0001,class_weight='balanced',solver='sag')
lr.fit(auto_train, target)
lr_auto_pred = lr.predict_proba(auto_test)[:,1]
submit = Id.copy()
submit['TARGET'] = lr_auto_pred
submit.to_csv('lr_auto_submit.csv',index = False)
Public leaderboard scores:
- Polynomial: 0.723
- Domain: 0.670
- Featuretools: 0.669
Next, upgrade the algorithm and train on the same three datasets with a random forest.
Random Forest
from sklearn.ensemble import RandomForestClassifier
Polynomial
random_forest = RandomForestClassifier(n_estimators = 100, random_state = 55, verbose = 1,
n_jobs = -1)
random_forest.fit(poly_train, target)
# extract feature importances
poly_importance_feature_values = random_forest.feature_importances_
poly_importance_features = pd.DataFrame({'feature':poly_features,
'importance':poly_importance_feature_values})
rf_poly_pred = random_forest.predict_proba(poly_test)[:,1]
# submission
submit = Id.copy()
submit['TARGET'] = rf_poly_pred
submit.to_csv('rf_poly_submit.csv', index = False)
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 1.6min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 3.7min finished
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 0.4s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 0.9s finished
Feature importance
poly_importance_features = poly_importance_features.set_index(['feature'])
poly_importance_features.sort_values(by = 'importance').plot(kind='barh',figsize=(10, 120))
Based on the chart above we can do some feature selection: dropping features with no importance at all also reduces the dimensionality of the data. Then we upgrade the algorithm once more, to a heavy hitter of machine learning: the Light Gradient Boosting Machine.
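A minimal sketch of that selection step, dropping zero-importance columns via an importance table like `poly_importance_features` (the feature names and values below are made up):

```python
import pandas as pd

# toy importance table in the same shape as poly_importance_features
imp = pd.DataFrame({'feature': ['f1', 'f2', 'f3', 'f4'],
                    'importance': [0.5, 0.0, 0.3, 0.0]}).set_index('feature')

# columns the forest never used
useless = imp[imp['importance'] == 0.0].index.tolist()

df = pd.DataFrame({'f1': [1], 'f2': [2], 'f3': [3], 'f4': [4]})
df = df.drop(useless, axis=1)   # dimensionality drops from 4 to 2
```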
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import gc
import numpy as np
import warnings
warnings.filterwarnings('ignore')
def model(features, test_features, n_folds = 10):
    # keep the ID columns
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']
    # TARGET
    labels = features[['TARGET']]
    # drop ID and TARGET
    features = features.drop(['SK_ID_CURR', 'TARGET'], axis = 1)
    test_features = test_features.drop(['SK_ID_CURR'], axis = 1)
    # feature names
    feature_names = list(features.columns)
    # DataFrame --> array
    #features = np.array(features)
    #test_features = np.array(test_features)
    # randomly split the training data into n_folds parts: train on n-1, validate on 1
    k_fold = KFold(n_splits = n_folds, shuffle = True, random_state = 50)
    # test predictions
    test_predictions = np.zeros(test_features.shape[0])
    # validation (out-of-fold) predictions
    out_of_fold = np.zeros(features.shape[0])
    # record the scores of each fold
    valid_scores = []
    train_scores = []
    # iterate through each fold
    count = 0
    for train_indices, valid_indices in k_fold.split(features):
        # training data for the fold
        train_features = features.loc[train_indices, :]
        train_labels = labels.loc[train_indices, :]
        # validation data for the fold
        valid_features = features.loc[valid_indices, :]
        valid_labels = labels.loc[valid_indices, :]
        # create the model
        model = lgb.LGBMClassifier(n_estimators = 10000, objective = 'binary',
                                   class_weight = 'balanced', learning_rate = 0.05,
                                   reg_alpha = 0.1, reg_lambda = 0.1,
                                   subsample = 0.8, n_jobs = -1, random_state = 50)
        # train the model
        model.fit(train_features, train_labels, eval_metric = 'auc',
                  eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names = ['valid', 'train'], categorical_feature = 'auto',
                  early_stopping_rounds = 100, verbose = 200)
        # record the best iteration
        best_iteration = model.best_iteration_
        # test-set predictions, averaged over the folds
        test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / n_folds
        # out-of-fold predictions on the validation split
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
        # record the best scores
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']
        valid_scores.append(valid_score)
        train_scores.append(train_score)
        # clean up memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()
        count += 1
        print("%d_fold is over" % count)
    # make the submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
    # overall validation score
    valid_auc = roc_auc_score(labels, out_of_fold)
    # add the overall scores to the metrics
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))
    # dataframe of validation scores
    fold_names = list(range(n_folds))
    fold_names.append('overall')
    metrics = pd.DataFrame({'fold': fold_names,
                            'train': train_scores,
                            'valid': valid_scores})
    return submission, metrics
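The out-of-fold bookkeeping in `model()` can be checked on a small example: `KFold` hands each row to exactly one validation split, so the `out_of_fold` array ends up with one prediction per training row. A sketch with logistic regression standing in for LightGBM, on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
oof = np.full(X.shape[0], np.nan)   # NaN marks "not yet predicted"

kf = KFold(n_splits=5, shuffle=True, random_state=50)
for tr_idx, va_idx in kf.split(X):
    clf = LogisticRegression(max_iter=1000).fit(X[tr_idx], y[tr_idx])
    # each row receives exactly one out-of-fold prediction
    oof[va_idx] = clf.predict_proba(X[va_idx])[:, 1]

assert not np.isnan(oof).any()   # every row was predicted exactly once
```

Scoring `oof` against the true labels, as `roc_auc_score(labels, out_of_fold)` does above, gives an overall validation AUC untainted by training-set leakage.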
# load data
poly_train = pd.read_csv('data/poly_train_data.csv')
poly_test = pd.read_csv('data/poly_test_data.csv')
print('poly_train:',poly_train.shape)
print('poly_test:',poly_test.shape)
poly_train: (307511, 275)
poly_test: (48744, 274)
Select Features
Rank the features by importance, ascending:
poly_importance_features = poly_importance_features.sort_values(by = 'importance')
Drop these 20 features:
poly_importance_features.head(20).plot(kind = 'barh')
s_train_1 = poly_train.copy()
s_test_1 = poly_test.copy()
# names of the 20 least important features
drop_feature_names = poly_importance_features.index[:20]
# drop those 20 features
s_train_1 = s_train_1.drop(drop_feature_names, axis = 1)
s_test_1 = s_test_1.drop(drop_feature_names, axis = 1)
submit2, metrics2 = model(s_train_1, s_test_1, n_folds= 5)
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.800686 valid's auc: 0.755447
[400] train's auc: 0.831722 valid's auc: 0.755842
Early stopping, best iteration is:
[351] train's auc: 0.824767 valid's auc: 0.756092
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.800338 valid's auc: 0.757507
[400] train's auc: 0.831318 valid's auc: 0.757378
Early stopping, best iteration is:
[307] train's auc: 0.818238 valid's auc: 0.757819
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.799557 valid's auc: 0.762719
Early stopping, best iteration is:
[160] train's auc: 0.791849 valid's auc: 0.763023
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.80053 valid's auc: 0.758546
Early stopping, best iteration is:
[224] train's auc: 0.804828 valid's auc: 0.758703
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.799826 valid's auc: 0.758312
[400] train's auc: 0.831271 valid's auc: 0.758623
Early stopping, best iteration is:
[319] train's auc: 0.819603 valid's auc: 0.758971
metrics2
| | fold | train | valid |
|---|---|---|---|
| 0 | 0 | 0.824767 | 0.756092 |
| 1 | 1 | 0.818238 | 0.757819 |
| 2 | 2 | 0.791849 | 0.763023 |
| 3 | 3 | 0.804828 | 0.758703 |
| 4 | 4 | 0.819603 | 0.758971 |
| 5 | overall | 0.811857 | 0.758799 |
submit2.to_csv('submit2.csv',index = False)
Leaderboard score: 0.734
Dropping a few features gave a slight improvement, so let's try dropping 30.
s_train_2 = poly_train.copy()
s_test_2 = poly_test.copy()
# names of the 30 least important features
drop_feature_names = poly_importance_features.index[:30]
# drop those 30 features
s_train_2 = s_train_2.drop(drop_feature_names, axis = 1)
s_test_2 = s_test_2.drop(drop_feature_names, axis = 1)
submit3, metrics3 = model(s_train_2, s_test_2, n_folds= 5)
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.800547 valid's auc: 0.755442
Early stopping, best iteration is:
[267] train's auc: 0.81211 valid's auc: 0.755868
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.80048 valid's auc: 0.757653
Early stopping, best iteration is:
[258] train's auc: 0.81057 valid's auc: 0.758107
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.799261 valid's auc: 0.76291
Early stopping, best iteration is:
[189] train's auc: 0.797314 valid's auc: 0.762962
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.800499 valid's auc: 0.758385
Early stopping, best iteration is:
[202] train's auc: 0.800851 valid's auc: 0.758413
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.799977 valid's auc: 0.758234
Early stopping, best iteration is:
[284] train's auc: 0.814454 valid's auc: 0.758612
metrics3
| | fold | train | valid |
|---|---|---|---|
| 0 | 0 | 0.812110 | 0.755868 |
| 1 | 1 | 0.810570 | 0.758107 |
| 2 | 2 | 0.797314 | 0.762962 |
| 3 | 3 | 0.800851 | 0.758413 |
| 4 | 4 | 0.814454 | 0.758612 |
| 5 | overall | 0.807060 | 0.758735 |
submit3.to_csv('submit3.csv',index = False)
The score is 0.733, essentially unchanged. Using only the main application table clearly won't get much further; this was mostly for fun.