Easy-to-understand machine learning - sklearn ensemble learning code implementation

Comparison of ordinary decision trees and random forests

Generate the moons dataset

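import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

X, y = datasets.make_moons(n_samples=500, noise=0.3, random_state=42)
plt.scatter(X[y == 0, 0], X[y == 0, 1])  # points of class 0
plt.scatter(X[y == 1, 0], X[y == 1, 1])  # points of class 1
plt.show()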

Plotting function for the decision boundary

def plot_decision_boundary(model, X, y):
    x0_min, x0_max = X[:,0].min()-1, X[:,0].max()+1
    x1_min, x1_max = X[:,1].min()-1, X[:,1].max()+1
    x0, x1 = np.meshgrid(np.linspace(x0_min, x0_max, 100), np.linspace(x1_min, x1_max, 100))
    Z = model.predict(np.c_[x0.ravel(), x1.ravel()]) 
    Z = Z.reshape(x0.shape)
    
    plt.contourf(x0, x1, Z, cmap=plt.cm.Spectral)
    plt.ylabel('x1')
    plt.xlabel('x0')
    plt.scatter(X[:, 0], X[:, 1], c=np.squeeze(y))
    plt.show()

Prediction using decision trees

Build a decision tree and train it

from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(max_depth=6)
dt_clf.fit(X, y)
plot_decision_boundary(dt_clf,X,y)

draw

Because a decision tree splits on one feature at a time, its decision boundary is made up of straight, axis-aligned segments, as the plot drawn by plot_decision_boundary shows.

Cross-validation

from sklearn.model_selection import cross_val_score
print(cross_val_score(dt_clf, X, y, cv=5).mean())  # cv sets the number of cross-validation folds

# Stratified k-fold cross-validation: each fold keeps the original class proportions
from sklearn.model_selection import StratifiedKFold

strKFold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(dt_clf, X, y, cv=strKFold).mean())

# Leave-one-out cross-validation
from sklearn.model_selection import LeaveOneOut

loout = LeaveOneOut()
print(cross_val_score(dt_clf, X, y, cv=loout).mean())

# ShuffleSplit controls the number of split iterations and the train/test proportions of each split
# (so some samples may end up in neither the training set nor the test set)
from sklearn.model_selection import ShuffleSplit

shufspl = ShuffleSplit(train_size=.5, test_size=.4, n_splits=8)  # 8 iterations
print(cross_val_score(dt_clf, X, y, cv=shufspl).mean())

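The ShuffleSplit call above asks for 50% training data and 40% test data, so the two fractions do not add up to 1. A minimal sketch, assuming the same 500-sample dataset size and a random_state that is added here only for repeatability, counts how many samples each split leaves unused:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 500  # same size as the moons dataset used in this article
# random_state=0 is an assumption added for repeatability
shufspl = ShuffleSplit(train_size=.5, test_size=.4, n_splits=8, random_state=0)

unused_counts = []
for train_idx, test_idx in shufspl.split(np.zeros((n_samples, 1))):
    # samples that appear in either the training or the test part of this split
    used = np.union1d(train_idx, test_idx)
    unused_counts.append(n_samples - used.size)

print(unused_counts)
```

With these fractions, every iteration puts 250 samples in the training set and 200 in the test set, so 50 samples take part in neither.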

Voting with Voting Classifier

Building a voting classifier

With voting='hard', VotingClassifier performs hard voting: each classifier first predicts a class label, and the final prediction is the label that receives the most votes.

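from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Assumption: X_train / X_test come from a default train_test_split with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(estimators=[
    ('knn_clf', KNeighborsClassifier(n_neighbors=7)),
    ('gnb_clf', GaussianNB()),
    ('dt_clf', DecisionTreeClassifier(max_depth=6))
], voting='hard')
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)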

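To make hard voting concrete, the sketch below reproduces the majority vote by hand and compares it with VotingClassifier. The train/test split, the random_state values, and the names make_models, preds and manual_vote are assumptions made for this illustration; they are not part of the article's code.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.make_moons(n_samples=500, noise=0.3, random_state=42)
# Assumption: a default 75/25 split with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

def make_models():
    # the same three estimators as in the article; random_state=0 is added for repeatability
    return [
        ('knn_clf', KNeighborsClassifier(n_neighbors=7)),
        ('gnb_clf', GaussianNB()),
        ('dt_clf', DecisionTreeClassifier(max_depth=6, random_state=0)),
    ]

# Manual hard vote: every model predicts a label, the most frequent label wins
preds = np.array([m.fit(X_train, y_train).predict(X_test) for _, m in make_models()])
manual_vote = (preds.sum(axis=0) >= 2).astype(int)  # labels are 0/1, so 2 of 3 votes is a majority

voting_clf = VotingClassifier(estimators=make_models(), voting='hard').fit(X_train, y_train)
agreement = np.mean(manual_vote == voting_clf.predict(X_test))
print(agreement)
```

Because there are three voters and two classes, ties cannot occur, so the hand-made vote and the library's vote should agree on every test sample.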

draw

plot_decision_boundary(voting_clf,X,y)

Soft Voting Classifier

With voting='soft', VotingClassifier performs soft voting: it averages the class probabilities predicted by each classifier and picks the class with the highest average probability.

Building a voting classifier

voting_clf = VotingClassifier(estimators=[
    ('knn_clf',KNeighborsClassifier(n_neighbors=7)),
    ('gnb_clf',GaussianNB()),
    ('dt_clf', DecisionTreeClassifier(max_depth=6))
],voting='soft')
voting_clf.fit(X_train,y_train)
voting_clf.score(X_test,y_test)

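The difference between the two voting modes can be sketched in two steps: a toy case where hard and soft voting disagree, then a hand-made probability average compared against VotingClassifier. The toy probabilities, the train/test split, the random_state values and the variable names are all assumptions chosen for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy case: three models' predicted probabilities of class 1 for a single sample
p_class1 = np.array([0.45, 0.45, 0.90])
hard_label = int((p_class1 > 0.5).sum() >= 2)  # two of the three models vote for class 0
soft_label = int(p_class1.mean() > 0.5)        # but the average probability of class 1 is 0.6
print(hard_label, soft_label)

X, y = datasets.make_moons(n_samples=500, noise=0.3, random_state=42)
# Assumption: a default 75/25 split with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = [
    ('knn_clf', KNeighborsClassifier(n_neighbors=7)),
    ('gnb_clf', GaussianNB()),
    ('dt_clf', DecisionTreeClassifier(max_depth=6, random_state=0)),
]

# Manual soft vote: average the class probabilities, then take the most probable class
probas = np.array([m.fit(X_train, y_train).predict_proba(X_test) for _, m in models])
manual_pred = probas.mean(axis=0).argmax(axis=1)  # column index equals the label because classes are 0/1

voting_clf = VotingClassifier(estimators=models, voting='soft').fit(X_train, y_train)
same = np.array_equal(manual_pred, voting_clf.predict(X_test))
print(same)
```

In the toy case a single confident model outweighs two mildly uncertain ones, which is exactly what soft voting allows and hard voting does not.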

draw

Calling plot_decision_boundary(voting_clf, X, y) draws the decision boundary of the soft voting classifier.

bagging

Explanation

Build multiple models, train each one on a random subset of the data (drawn with replacement when bootstrap=True), and determine the final prediction by combining the predictions of all the models.

Multiple Decision Tree Models

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

bagging_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True)

bagging_clf.fit(X_train, y_train)
bagging_clf.score(X_test, y_test)  # score on held-out data; X, y would include the training samples

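The bagging idea can also be written out by hand. The following is a minimal sketch, not how BaggingClassifier is implemented internally (sklearn averages predicted probabilities when the base estimator provides them, while this sketch uses a plain majority vote). The train/test split, the seeds, the smaller number of trees and all variable names are assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.make_moons(n_samples=500, noise=0.3, random_state=42)
# Assumption: a default 75/25 split with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(0)
n_estimators, max_samples = 50, 100  # fewer trees than the article's 500, to keep the sketch fast

trees = []
for i in range(n_estimators):
    # bootstrap: draw max_samples training indices with replacement
    idx = rng.integers(0, len(X_train), size=max_samples)
    trees.append(DecisionTreeClassifier(random_state=i).fit(X_train[idx], y_train[idx]))

# majority vote over the trees (labels are 0/1, so the mean is the share of votes for class 1)
votes = np.array([t.predict(X_test) for t in trees])
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = np.mean(y_pred == y_test)
print(accuracy)
```

Each tree sees a different random sample, so the trees make different mistakes, and the vote averages many of those mistakes out.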

draw

Calling plot_decision_boundary(bagging_clf, X, y) draws the decision boundary of the bagging classifier.

Personally, I feel the prediction result is similar to that of a single decision tree: the decision boundary is still made up of axis-aligned segments.

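Out-of-bag (OOB) evaluation

This builds on bagging. Because each model is trained on a bootstrap sample, some of the data is never selected for that model. These out-of-bag samples can be used to evaluate the model, so the ensemble can be scored without setting aside a separate test set.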

Code

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

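bagging_clf = BaggingClassifier(DecisionTreeClassifier(),  # base classifier
                                n_estimators=500,  # number of classifiers
                                max_samples=100,  # number of samples drawn to train each model
                                bootstrap=True,  # sample with replacement
                                oob_score=True)  # evaluate on the out-of-bag samples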

bagging_clf.fit(X,y)
bagging_clf.oob_score_

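How much data ends up out of bag depends on how many samples each model draws. A small simulation, with the function name oob_fraction, the seed and the number of trials chosen only for illustration, compares the simulated share of never-drawn samples with the closed-form value (1 - 1/n)^k for k draws from n samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 500  # size of the dataset used in this article

def oob_fraction(n_draws, n_trials=200):
    """Average share of the n_samples points that a bootstrap draw of size n_draws never picks."""
    fractions = []
    for _ in range(n_trials):
        idx = rng.integers(0, n_samples, size=n_draws)  # sampling with replacement
        fractions.append(1 - np.unique(idx).size / n_samples)
    return float(np.mean(fractions))

full = oob_fraction(n_draws=500)   # classic bootstrap: as many draws as samples
small = oob_fraction(n_draws=100)  # max_samples=100, as in the code above

theory_full = (1 - 1 / n_samples) ** 500
theory_small = (1 - 1 / n_samples) ** 100
print(full, theory_full)
print(small, theory_small)
```

A full-size bootstrap leaves roughly 37% of the samples out of bag, while drawing only 100 of 500 leaves roughly 82% out, so with max_samples=100 each tree is evaluated on most of the dataset.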

draw

Calling plot_decision_boundary(bagging_clf, X, y) draws the decision boundary of this bagging classifier as well.

The prediction results here are also similar; the difference might be more visible on other datasets.

random forest

A random forest is essentially an ensemble of multiple decision trees trained with bagging; in addition, each split considers only a random subset of the features.

Code

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=500,random_state=666,oob_score=True)

rf_clf.fit(X,y)
rf_clf.oob_score_

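The relationship between a random forest and bagged decision trees can be checked with a rough sketch: a BaggingClassifier whose base trees pick from a random feature subset at each split is set up next to a RandomForestClassifier. The two are not guaranteed to be identical internally; n_estimators=100 and the explicit max_features="sqrt" are assumptions made to keep the comparison quick and explicit.

```python
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.make_moons(n_samples=500, noise=0.3, random_state=42)

# Random forest: bootstrap samples plus a random feature subset at every split
rf_clf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                oob_score=True, random_state=666).fit(X, y)

# Hand-assembled counterpart: bagging around trees that use the same feature rule
bag_clf = BaggingClassifier(DecisionTreeClassifier(max_features="sqrt"),
                            n_estimators=100, bootstrap=True,
                            oob_score=True, random_state=666).fit(X, y)

print(rf_clf.oob_score_, bag_clf.oob_score_)
```

The two OOB scores should land close to each other, which is the sense in which a random forest behaves like an ensemble of bagged decision trees.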

draw

Calling plot_decision_boundary(rf_clf, X, y) draws the decision boundary of the random forest.

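Extremely randomized trees (Extra-Trees)

Both the extremely randomized trees algorithm and the random forest algorithm are ensembles of many decision trees. The main difference is that a random forest uses the bagging model, training each tree on a bootstrap sample, whereas Extra-Trees by default trains each tree on all the samples and randomizes the splits instead: candidate features are chosen at random and their split thresholds are drawn at random rather than optimized. Because the splitting is random, variance is reduced further at the cost of slightly more bias, so to some extent the results can be better than those of a random forest. The code in this section sets bootstrap=True only because an OOB score requires bootstrap sampling.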

Code

from sklearn.ensemble import ExtraTreesClassifier
et_clf = ExtraTreesClassifier(n_estimators=500,random_state=666,bootstrap=True,oob_score=True)
et_clf.fit(X,y)
et_clf.oob_score_

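The two differences between the algorithms are visible on the fitted estimators themselves. A short sketch, using default settings and a small n_estimators=10 chosen only to keep it fast, inspects the sampling strategy of each ensemble and the split strategy of its trees:

```python
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = datasets.make_moons(n_samples=500, noise=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
et = ExtraTreesClassifier(n_estimators=10, random_state=0).fit(X, y)

# default sampling strategy of each ensemble (True means bootstrap samples)
print(rf.bootstrap, et.bootstrap)
# how the individual trees pick their split thresholds
print(rf.estimators_[0].splitter, et.estimators_[0].splitter)
```

By default the random forest bootstraps and searches for the best split, while Extra-Trees uses the whole dataset and a random splitter.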

draw

Calling plot_decision_boundary(et_clf, X, y) draws the decision boundary of the Extra-Trees classifier.

AdaBoost

The models are trained one after another. After each model is trained, the weights of the samples it predicted incorrectly are increased, and the next model is trained on the reweighted data. The final prediction is a weighted vote of all the models.

Code

from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=6),n_estimators=500)
ada_clf.fit(X_train,y_train)
ada_clf.score(X_test,y_test)

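One round of the reweighting can be worked through by hand. This sketch follows the textbook binary AdaBoost update with labels in {-1, +1}; it is not a copy of sklearn's internal implementation, and the five toy labels and predictions are made up for the example.

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])  # the weak model gets sample 1 wrong

w = np.full(5, 1 / 5)                      # start with uniform sample weights
err = w[y_true != y_pred].sum()            # weighted error rate of this model
alpha = 0.5 * np.log((1 - err) / err)      # the model's say in the final vote

# misclassified samples (y_true * y_pred = -1) are scaled up, correct ones scaled down
w_new = w * np.exp(-alpha * y_true * y_pred)
w_new /= w_new.sum()                       # renormalize so the weights sum to 1

print(err, alpha)
print(w_new)
```

The single misclassified sample ends up carrying half of the total weight, so the next model is pushed to get it right.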

draw

Calling plot_decision_boundary(ada_clf, X, y) draws the decision boundary of the AdaBoost classifier.

Gradient Boosting

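Each new model is trained on the errors (residuals) left by the models before it, and its prediction is added to theirs to correct those errors.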

Code

from sklearn.ensemble import GradientBoostingClassifier

gd_clf = GradientBoostingClassifier(max_depth=6,n_estimators=500)

gd_clf.fit(X_train,y_train)
gd_clf.score(X_test,y_test)

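The idea behind gradient boosting is easiest to see on a regression problem with squared loss, where fitting the errors means fitting the residuals. The sketch below is a simplified illustration on made-up sine data, not what GradientBoostingClassifier does internally for classification; the learning rate, tree depth, number of rounds and variable names are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)  # toy regression target

learning_rate, n_rounds = 0.5, 20
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    # for squared loss, the negative gradient is the residual (up to a constant factor)
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # add a damped correction
    mse_history.append(np.mean((y - prediction) ** 2))

print(mse_history[0], mse_history[-1])
```

Every round fits a small tree to what the current ensemble still gets wrong, so the training error shrinks step by step.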

draw

Calling plot_decision_boundary(gd_clf, X, y) draws the decision boundary of the gradient boosting classifier.