Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/qq_14959801/article/details/51275559
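Ensemble learning is an important way to improve model robustness. Once data and feature processing are done, if the algorithm itself offers no further gains, it is worth putting effort into model ensembling, which can sometimes bring unexpectedly good results. That said, using an ensemble method does not guarantee a better result. Stacking, for example, is in theory asymptotically equivalent to the best first-layer submodel, so at the very least it should not substantially degrade model performance.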
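1. Voting
Two variants are common: soft voting and hard voting. In soft voting, each model outputs the probability that a sample belongs to each class (a support vector machine can do this, for example), and the weighted combination of these probabilities across models gives the ensemble result. Hard voting is simpler: each model casts one vote for a class label, and the minority yields to the majority.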
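To make the difference concrete, here is a minimal pure-Python sketch of both schemes. The helper names `hard_vote` and `soft_vote` and the three probability vectors are made up for illustration and are not part of the original notebook.

```python
from collections import Counter

def hard_vote(labels):
    """Return the label predicted by the most models (majority rule)."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probas, weights=None):
    """Weighted average of per-model class probabilities.

    Returns (index of the winning class, averaged probabilities).
    """
    if weights is None:
        weights = [1.0] * len(probas)
    total = sum(weights)
    n_classes = len(probas[0])
    avg = [sum(w * p[c] for w, p in zip(weights, probas)) / total
           for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: avg[c])
    return best, avg

# Three hypothetical models scoring one sample: [P(class 0), P(class 1)]
probas = [[0.55, 0.45], [0.60, 0.40], [0.10, 0.90]]
# Each model's hard label is its most probable class
labels = [max(range(2), key=lambda c: p[c]) for p in probas]

hard_label = hard_vote(labels)
soft_label, avg = soft_vote(probas)
print(hard_label, soft_label, avg)
```

In this toy case two of the three models lean weakly towards class 0 while one is very confident in class 1, so hard voting picks class 0 and soft voting picks class 1. That is the practical difference: soft voting takes each model's confidence into account.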
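2. Bagging
The best-known example is random forest, whose idea is covered in detail in the article on random forest principles. While building the model, each base learner is trained on a subset of samples drawn at random with replacement, and the base learners' outputs are then combined to give the final bagging result.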
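The sampling-and-combining loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the base learner is a toy 1-nearest-neighbour rule on a single numeric feature (random forest uses decision trees instead), and the function names and data are invented for the example.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n row indices uniformly at random, with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def fit_1nn(xs, ys):
    """Toy base learner (illustrative only): 1-nearest-neighbour on one feature."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict

def bagging_fit(xs, ys, n_estimators=25, seed=0):
    """Train each base learner on its own bootstrap sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        idx = bootstrap_indices(len(xs), rng)
        models.append(fit_1nn([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Combine the base learners by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Made-up one-dimensional data: small values are class 0, large values class 1
xs = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

models = bagging_fit(xs, ys)
pred_low = bagging_predict(models, 0.0)
pred_high = bagging_predict(models, 14.0)
print(pred_low, pred_high)
```

Because every base learner sees a different resample, the learners can be trained independently of one another, which is why bagging parallelises easily.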
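3. Boosting
Boosting is covered in more detail in the article on boosting principles. Bagging can be trained in parallel, whereas Boosting is iterative: each round pays more attention to the samples that were misclassified and increases their weights, so that the next round aims to classify the previous round's mistakes correctly. The final model is a weighted combination of these weak classifiers.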
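As a worked example of the reweighting step, the sketch below performs one round of an AdaBoost-style update. AdaBoost is one concrete boosting algorithm, chosen here as an assumption because the text does not name a specific variant. Labels are taken to be in {-1, +1}, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost-style reweighting step for labels in {-1, +1}.

    Returns (alpha, new_weights), where alpha is the weak learner's weight
    in the final combination. Assumes 0 < weighted error < 1.
    """
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # t * p is +1 for a correct prediction and -1 for a mistake, so
    # correct samples are scaled down and mistakes are scaled up
    raw = [w * math.exp(-alpha * t * p)
           for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(raw)
    return alpha, [w / norm for w in raw]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]   # only the last sample is misclassified
weights = [0.2] * 5           # start from uniform weights

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

With a weighted error of 0.2, alpha comes out as ln 2 (about 0.693). After normalisation the misclassified sample carries weight 0.5 and each correctly classified sample 0.125, so the next weak learner is pushed towards getting that sample right.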
4. Stacking
The rest of this post focuses on the Stacking method.
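import pandas as pd
df = pd.read_csv('C:/Users/Titanic/train.csv')
# Hold out 10% of the rows as a local test set (unseeded, so the rows picked differ between runs)
test = df.sample(frac=0.1)
test.to_csv('C:/Users/Titanic/test.csv', index=False)
train = df[~df.PassengerId.isin(test.PassengerId)]
# Note: this overwrites the original train.csv with the remaining 90% of the rows
train.to_csv('C:/Users/Titanic/train.csv', index=False)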
In [43]:
import pandas as pd
usedColumnFeature = ['PassengerId','Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','Survived']
train = pd.read_csv('C:/Users/Titanic/train.csv', usecols=usedColumnFeature)
train.dropna(subset=['Age'],how='any',axis=0,inplace=True)
test = pd.read_csv('C:/Users/Titanic/test.csv', usecols=usedColumnFeature)
test.dropna(subset=['Age'],how='any',axis=0,inplace=True)
train = train.set_index('PassengerId')
test = test.set_index('PassengerId')
train.head()
Out[43]:
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked
---|---|---|---|---|---|---|---|---
1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S
2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C
3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S
4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S
5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S
In [44]:
y_train = train['Survived']
train.drop('Survived',axis=1,inplace=True)
y_test = test['Survived']
test.drop('Survived',axis=1,inplace=True)
In [45]:
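typeNames = ['Pclass','Sex','Embarked']
# One-hot encode the categorical columns (assumes train and test contain the same category values, so the dummy columns line up)
for item in typeNames:
    train = pd.concat([train,pd.get_dummies(train[item],prefix=item+'_')], axis=1)
    test = pd.concat([test,pd.get_dummies(test[item],prefix=item+'_')], axis=1)
# Drop the original categorical columns now that they are encoded
train.drop(typeNames, axis=1, inplace=True)
test.drop(typeNames, axis=1, inplace=True)
test.head()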
Out[45]:
PassengerId | Age | SibSp | Parch | Fare | Pclass__1 | Pclass__2 | Pclass__3 | Sex__female | Sex__male | Embarked__C | Embarked__Q | Embarked__S
---|---|---|---|---|---|---|---|---|---|---|---|---
374 | 22.0 | 0 | 0 | 135.6333 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0
517 | 34.0 | 0 | 0 | 10.5000 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1
310 | 30.0 | 0 | 0 | 56.9292 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0
165 | 1.0 | 4 | 1 | 39.6875 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1
387 | 1.0 | 5 | 2 | 46.9000 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1
In [46]:
#rf = RandomForestClassifier(oob_score=True, random_state=9)
#gbm = GradientBoostingClassifier(random_state=9)
from sklearn.linear_model import LogisticRegression
lr_model=LogisticRegression()
lr_model.fit(train,y_train)
pred = lr_model.predict(test)
In [47]:
import numpy as np
result = pd.DataFrame(pred, columns=['pred'], index=y_test.index)
result['y_test'] = y_test
print(len(result[result.pred == result.y_test]))
result.head()
55
Out[47]:
PassengerId | pred | y_test
---|---|---
374 | 1 | 0
517 | 1 | 1
310 | 1 | 1
165 | 0 | 0
387 | 0 | 0
In [48]:
print(train.shape)
print(test.shape)
(644, 12)
(70, 12)
In [49]:
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
y_train = y_train.to_frame().reset_index(drop=True)
In [50]:
# Out-of-Fold Predictions
from sklearn.model_selection import KFold
ntrain = train.shape[0]
ntest = test.shape[0]
# No shuffling, so random_state is omitted (recent scikit-learn versions reject random_state when shuffle=False)
kf = KFold(n_splits=5)
clf = LogisticRegression()
def get_oof(clf, train, y_train, test):
    oof_train = np.zeros((ntrain,))                      # 644 out-of-fold predictions
    oof_test = np.zeros((ntest,))                        # 70 averaged test predictions
    oof_test_skf = np.empty((kf.get_n_splits(), ntest))  # 5 * 70, one row per fold
    for i, (train_index, test_index) in enumerate(kf.split(train)):  # train: 644 * 12
        kf_X_train = train.iloc[train_index, :]                  # about 515 * 12
        kf_y_train = y_train.iloc[train_index].values.ravel()    # about 515 labels as a 1-D array
        kf_X_test = train.iloc[test_index, :]                    # about 129 * 12
        clf.fit(kf_X_train, kf_y_train)
        oof_train[test_index] = clf.predict(kf_X_test)  # fill the held-out fold's slots
        oof_test_skf[i, :] = clf.predict(test)          # this fold's predictions for the 70 test rows
    oof_test[:] = oof_test_skf.mean(axis=0)             # average over the 5 fold models
    return oof_train.reshape(-1,1), oof_test.reshape(-1,1)
# shapes: 644 * 1 and 70 * 1
new_train_lr, new_test_lr = get_oof(clf, train, y_train, test)
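The key property of the out-of-fold scheme is that every training row is predicted exactly once, by a model that never saw that row. The pure-Python sketch below mimics an unshuffled 5-fold split on 644 rows (the training size printed earlier) to check this. `kfold_indices` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, holdout_idx) for contiguous, unshuffled folds.

    The first n_samples % n_splits folds get one extra row.
    """
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for k in range(n_splits):
        size = base + (1 if k < extra else 0)
        holdout = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, holdout
        start += size

n_train, n_splits = 644, 5
times_held_out = [0] * n_train   # how often each row lands in a held-out fold
fold_sizes = []
for train_idx, holdout_idx in kfold_indices(n_train, n_splits):
    fold_sizes.append((len(train_idx), len(holdout_idx)))
    for i in holdout_idx:
        times_held_out[i] += 1

print(fold_sizes)
print(set(times_held_out))
```

Each row is held out exactly once, so the column of out-of-fold predictions covers the whole training set without any model having been fitted on the row it predicts. That is what makes the column safe to use as a training feature for the second-layer model.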
In [51]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
new_train_knn, new_test_knn = get_oof(neigh, train, y_train, test)
In [52]:
new_train = y_train.copy()  # copy so that adding feature columns does not modify y_train
new_train['lr_feature'] = new_train_lr
new_train['knn_feature'] = new_train_knn
new_train.head()
Out[52]:
 | Survived | lr_feature | knn_feature
---|---|---|---
0 | 0 | 0.0 | 0.0
1 | 1 | 1.0 | 1.0
2 | 1 | 1.0 | 1.0
3 | 1 | 1.0 | 1.0
4 | 0 | 0.0 | 0.0
In [53]:
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier()
dt_model.fit(new_train[['lr_feature','knn_feature']].values, new_train['Survived'])
new_test = pd.DataFrame(new_test_lr,columns=['lr_feature'])
new_test['knn_feature'] = new_test_knn
pred = dt_model.predict(new_test.values)
In [54]:
new_result = pd.DataFrame(y_test)
new_result['pred'] = pred
len(new_result[new_result.Survived == new_result.pred])
Out[54]:
55
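For a quick comparison, the two correct-prediction counts printed above can be turned into accuracies. The snippet is just arithmetic on numbers already shown in the notebook output (55 correct for logistic regression alone, 55 correct for the stacked model, 70 test rows); the variable names are illustrative.

```python
n_test = 70            # rows in the held-out test set (from test.shape)
n_correct_lr = 55      # correct predictions from logistic regression alone
n_correct_stack = 55   # correct predictions from the stacked model

acc_lr = n_correct_lr / n_test
acc_stack = n_correct_stack / n_test
print(f"logistic regression: {acc_lr:.3f}, stacking: {acc_stack:.3f}")
```

On this particular split the stacked model matches the single logistic regression rather than beating it, which fits the remark at the start that ensembling does not always improve the result.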