Comparison of ordinary decision trees and random forests
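Generate the moons dataset
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
# Two interleaving half-circles with Gaussian noise
X,y = datasets.make_moons(n_samples=500,noise=0.3,random_state=42)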
plt.scatter(X[y==0,0],X[y==0,1])
plt.scatter(X[y==1,0],X[y==1,1])
plt.show()
Plotting function
def plot_decision_boundary(model, X, y):
    # Grid covering the data range, padded by 1 on every side
    x0_min, x0_max = X[:,0].min()-1, X[:,0].max()+1
    x1_min, x1_max = X[:,1].min()-1, X[:,1].max()+1
    x0, x1 = np.meshgrid(np.linspace(x0_min, x0_max, 100), np.linspace(x1_min, x1_max, 100))
    # Predict a class for every grid point and shade the regions
    Z = model.predict(np.c_[x0.ravel(), x1.ravel()])
    Z = Z.reshape(x0.shape)
    plt.contourf(x0, x1, Z, cmap=plt.cm.Spectral)
    plt.ylabel('x1')
    plt.xlabel('x0')
    # Overlay the samples, colored by label
    plt.scatter(X[:, 0], X[:, 1], c=np.squeeze(y))
    plt.show()
Prediction using decision trees
Build a decision tree and train it
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier(max_depth=6)
dt_clf.fit(X, y)
plot_decision_boundary(dt_clf,X,y)
Plot
Because every split in a decision tree thresholds a single feature, the decision boundary is made up of straight, axis-parallel segments.
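To see where those straight edges come from, here is a minimal sketch of how a single tree node might pick its split: it tries thresholds on one feature at a time and keeps the one with the lowest weighted Gini impurity. The helper names (`gini`, `best_split`) and the toy data are made up for illustration; scikit-learn's real implementation is more elaborate.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(points, labels):
    """Return (feature_index, threshold, weighted_gini) of the best single-feature split."""
    best = None
    n = len(points)
    for f in range(len(points[0])):
        values = sorted(set(p[f] for p in points))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for p, y in zip(points, labels) if p[f] <= t]
            right = [y for p, y in zip(points, labels) if p[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


# Hypothetical toy data: the classes are separated by feature 0 alone.
toy_points = [(0.0, 1.0), (1.0, 3.0), (2.0, 0.5), (3.0, 2.5)]
toy_labels = [0, 0, 1, 1]
feature, threshold, impurity = best_split(toy_points, toy_labels)
print(feature, threshold, impurity)
```

Each chosen split has the form "feature f <= t", which is a vertical or horizontal line in the (x0, x1) plane. Stacking such splits gives the blocky regions seen in the plot.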
Cross-validation
from sklearn.model_selection import cross_val_score
print(cross_val_score(dt_clf,X, y, cv=5).mean()) # cv sets the number of folds
# Stratified k-fold cross-validation: each fold preserves the original class proportions
from sklearn.model_selection import StratifiedKFold
strKFold = StratifiedKFold(n_splits=5,shuffle=True,random_state=0)
print(cross_val_score(dt_clf,X, y,cv=strKFold).mean())
# Leave-one-out cross-validation
from sklearn.model_selection import LeaveOneOut
loout = LeaveOneOut()
print(cross_val_score(dt_clf,X, y,cv=loout).mean())
# ShuffleSplit controls the number of iterations and the train/test proportions of each split
# (so a sample may end up in neither the training set nor the test set)
from sklearn.model_selection import ShuffleSplit
shufspl = ShuffleSplit(train_size=.5,test_size=.4,n_splits=8) # 8 iterations
print(cross_val_score(dt_clf,X, y,cv=shufspl).mean())
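As a rough illustration of the ShuffleSplit settings above (train_size=.5, test_size=.4), the sketch below shuffles the indices of 500 samples once and carves out the two sets. The function `shuffle_split` is a hypothetical stand-in written for this post, not scikit-learn's code, and the order in which it assigns indices is an assumption.

```python
import random


def shuffle_split(n_samples, train_size, test_size, seed):
    """One shuffle-split iteration: returns (train, test, unused) index lists."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(train_size * n_samples)
    n_test = round(test_size * n_samples)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    unused = indices[n_train + n_test:]  # in neither set for this iteration
    return train, test, unused


train_idx, test_idx, unused_idx = shuffle_split(500, 0.5, 0.4, seed=0)
print(len(train_idx), len(test_idx), len(unused_idx))
```

With these proportions, 10% of the samples are left out of both sets in every iteration, which is the situation the comment in the code above describes.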
Voting with Voting Classifier
Building a voting classifier
With voting='hard', VotingClassifier performs hard voting: each classifier first predicts a class label, and the label chosen by the majority becomes the final prediction.
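The counting step of hard voting can be sketched in a few lines. The function `hard_vote` and the example predictions are made up for illustration; in particular, how ties are broken here (first label encountered) is an assumption and may differ from scikit-learn.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Return the label predicted by the most classifiers."""
    return Counter(predicted_labels).most_common(1)[0][0]


# Hypothetical predictions of three classifiers for one sample.
votes = {"knn_clf": 1, "gnb_clf": 0, "dt_clf": 1}
final_label = hard_vote(list(votes.values()))
print(final_label)
```

Two of the three classifiers say class 1, so class 1 wins regardless of how confident each classifier was.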
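from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
# Hold out a test set (the split is not shown in the original; random_state=42 is an assumed value)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
voting_clf = VotingClassifier(estimators=[
    ('knn_clf',KNeighborsClassifier(n_neighbors=7)),
    ('gnb_clf',GaussianNB()),
    ('dt_clf', DecisionTreeClassifier(max_depth=6))
],voting='hard')
voting_clf.fit(X_train,y_train)
voting_clf.score(X_test,y_test)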
Plot
plot_decision_boundary(voting_clf,X,y)
Soft Voting Classifier
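With voting='soft', the classifier votes using the class probabilities produced by each estimator: the probabilities are averaged and the class with the highest average wins. Every estimator therefore has to be able to output probabilities (predict_proba).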
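A small sketch makes the difference from hard voting concrete. The probabilities below are invented, and `soft_vote` / `hard_vote_from_proba` are hypothetical helpers that assume equal weights for all classifiers.

```python
def soft_vote(probabilities):
    """Average per-class probabilities and return (winning_class, averages)."""
    n_classes = len(probabilities[0])
    averages = [
        sum(p[c] for p in probabilities) / len(probabilities)
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=lambda c: averages[c])
    return winner, averages


def hard_vote_from_proba(probabilities):
    """Turn each probability vector into a label first, then take the majority."""
    labels = [max(range(len(p)), key=lambda c: p[c]) for p in probabilities]
    return max(set(labels), key=labels.count)


# Hypothetical [P(class 0), P(class 1)] from three classifiers for one sample.
probas = [[0.55, 0.45], [0.60, 0.40], [0.05, 0.95]]
soft_label, averages = soft_vote(probas)
hard_label = hard_vote_from_proba(probas)
print(soft_label, hard_label, averages)
```

Two classifiers lean slightly towards class 0, so hard voting picks class 0. The third is very sure about class 1, which pulls the average probability of class 1 to about 0.6, so soft voting picks class 1.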
Building a voting classifier
voting_clf = VotingClassifier(estimators=[
('knn_clf',KNeighborsClassifier(n_neighbors=7)),
('gnb_clf',GaussianNB()),
('dt_clf', DecisionTreeClassifier(max_depth=6))
],voting='soft')
voting_clf.fit(X_train,y_train)
voting_clf.score(X_test,y_test)
Plot
Bagging
Explanation
Bagging builds multiple models, trains each one on a random subset of the training data (drawn with replacement when bootstrap=True), and combines the predictions of all models to decide the final result.
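The random selection step can be sketched as follows. The training-set size of 375 is an assumption (500 samples with a 25% hold-out), the sample size of 100 mirrors max_samples in the code that follows, and `bootstrap_indices` is a hypothetical helper.

```python
import random


def bootstrap_indices(n_samples, sample_size, rng):
    """Draw sample_size row indices with replacement from range(n_samples)."""
    return [rng.randrange(n_samples) for _ in range(sample_size)]


rng = random.Random(42)
# One index list per model; only 5 models here to keep the output short.
samples = [bootstrap_indices(375, 100, rng) for _ in range(5)]
for model_id, idx in enumerate(samples):
    # Sampling with replacement means a row can appear more than once.
    print(model_id, len(idx), len(set(idx)))
```

Every model sees a different 100-row subset, which is what makes the individual trees differ from each other before their votes are combined.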
Multiple Decision Tree Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
bagging_clf = BaggingClassifier(DecisionTreeClassifier(),n_estimators=500,max_samples=100,bootstrap=True)
bagging_clf.fit(X_train,y_train)
bagging_clf.score(X_test,y_test)
Plot
Personally, I feel that the decision boundary here looks similar to that of a single decision tree: it is still made up of straight, axis-parallel segments.
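Out-of-bag (OOB)
This is still bagging. Because each model is trained on a bootstrap sample, some of the data is never drawn for that model; these left-over samples are its out-of-bag samples. With oob_score=True, each sample is predicted only by the models that did not train on it, and the resulting accuracy is reported as oob_score_, so no separate test set is needed.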
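In bagging, the out-of-bag samples of a model are simply the rows that its bootstrap draw never picked. The sketch below computes them for one model, using 500 rows and 100 draws to mirror the settings in the code that follows; `out_of_bag` is a hypothetical helper, not part of scikit-learn.

```python
import random


def out_of_bag(n_samples, sample_size, rng):
    """Indices never picked when drawing sample_size rows with replacement."""
    drawn = {rng.randrange(n_samples) for _ in range(sample_size)}
    return [i for i in range(n_samples) if i not in drawn]


n_samples, sample_size = 500, 100
# Probability that one given row is missed by all 100 draws.
expected_fraction = (1 - 1 / n_samples) ** sample_size

rng = random.Random(0)
oob = out_of_bag(n_samples, sample_size, rng)
print(round(expected_fraction, 3), len(oob) / n_samples)
```

With only 100 draws out of 500 rows, roughly four fifths of the data is out-of-bag for any single model, so every row has plenty of models that never saw it and can be scored fairly.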
Code
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
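bagging_clf = BaggingClassifier(DecisionTreeClassifier(), # base classifier
                                n_estimators=500, # number of classifiers
                                max_samples=100, # samples drawn to train each model
                                bootstrap=True, # sample with replacement
                                oob_score=True) # evaluate on out-of-bag samples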
bagging_clf.fit(X,y)
bagging_clf.oob_score_
Plot
The decision boundary here is also similar; the difference might show up more clearly on other datasets.
Random forest
A random forest works much like a bagging ensemble made of many decision trees, with one addition: each split only considers a random subset of the features.
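What a random forest adds on top of bagged trees is a random choice of candidate features at each split. A minimal sketch of that step is below; the square-root rule is, as far as I know, the default for RandomForestClassifier in recent scikit-learn versions, but treat that and the helper `candidate_features` as assumptions.

```python
import math
import random


def candidate_features(n_features, rng):
    """Pick the random subset of feature indices that one split may consider."""
    k = max(1, int(math.sqrt(n_features)))  # assumed "sqrt" rule
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(666)
for n_features in (2, 16):
    print(n_features, candidate_features(n_features, rng))
```

On the two-feature moons data this leaves a single candidate feature per split, so each tree is noticeably more random than a plain bagged decision tree.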
Code
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=500,random_state=666,oob_score=True)
rf_clf.fit(X,y)
rf_clf.oob_score_
Plot
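Extra-Trees (extremely randomized trees)
Like a random forest, the Extra-Trees algorithm is an ensemble of many decision trees. The main differences are these: a random forest uses bagging, so each tree is trained on a bootstrap sample, whereas Extra-Trees by default trains every tree on all the samples. Both pick candidate features at random, but Extra-Trees also draws the split thresholds at random instead of searching for the best one. Because the splits are random, a random forest tends, to some extent, to give better results. Note that bootstrap=True is set in the code here because oob_score=True requires bootstrap sampling.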
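The random splitting can be sketched like this: for each feature, draw one threshold uniformly between that feature's minimum and maximum, then keep the best of those random candidates by Gini impurity. This is a simplified sketch (it tries every feature rather than a random subset), and `gini` and `random_split` are hypothetical helper names.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def random_split(points, labels, rng):
    """Best (feature, threshold, weighted_gini) among one random threshold per feature."""
    best = None
    for f in range(len(points[0])):
        column = [p[f] for p in points]
        # The threshold is drawn at random, not optimized.
        t = rng.uniform(min(column), max(column))
        left = [y for v, y in zip(column, labels) if v <= t]
        right = [y for v, y in zip(column, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[2]:
            best = (f, t, score)
    return best


# Hypothetical toy data, for illustration only.
toy_points = [(0.0, 1.0), (1.0, 3.0), (2.0, 0.5), (3.0, 2.5)]
toy_labels = [0, 0, 1, 1]
feature, threshold, impurity = random_split(toy_points, toy_labels, random.Random(666))
print(feature, threshold, impurity)
```

A single tree built this way is usually worse than one built with optimized thresholds, but it is cheaper to grow, and averaging many such trees smooths out the randomness.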
Code
from sklearn.ensemble import ExtraTreesClassifier
et_clf = ExtraTreesClassifier(n_estimators=500,random_state=666,bootstrap=True,oob_score=True)
et_clf.fit(X,y)
et_clf.oob_score_
Plot
AdaBoost
AdaBoost trains its models one after another and weights the samples along the way: the samples that the previous model predicted wrongly get a higher weight, and the next model is trained on the reweighted data.
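One reweighting round can be worked through by hand. The sketch below uses the textbook binary AdaBoost update, which is an assumption on my part and not necessarily the exact variant scikit-learn implements; `adaboost_reweight` is a hypothetical helper.

```python
import math


def adaboost_reweight(weights, is_wrong):
    """One AdaBoost round: return (model_weight, new_normalized_sample_weights).

    Assumes the weighted error is strictly between 0 and 1.
    """
    err = sum(w for w, wrong in zip(weights, is_wrong) if wrong)
    alpha = 0.5 * math.log((1.0 - err) / err)  # say of this model in the final vote
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, is_wrong)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Five samples with equal weight; the model misclassifies only the third one.
weights = [0.2] * 5
is_wrong = [False, False, True, False, False]
alpha, new_weights = adaboost_reweight(weights, is_wrong)
print(alpha, new_weights)
```

After this round the misclassified sample carries half of the total weight, so the next model is pushed hard to get it right.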
Code
from sklearn.ensemble import AdaBoostClassifier
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=6),n_estimators=500)
ada_clf.fit(X_train,y_train)
ada_clf.score(X_test,y_test)
Plot
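Gradient Boosting
Gradient boosting also trains its models one after another, but each new model is fit to the errors left by the models before it (the residuals, or more generally the negative gradient of the loss) rather than to the original labels.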
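The idea of fitting each new model to what the previous ones got wrong is easiest to see in a regression setting with squared error, where the negative gradient is just the residual. The sketch below uses one-split "stumps" as weak learners on made-up 1-D data; `fit_stump` and `boost` are hypothetical helpers, and a classifier such as GradientBoostingClassifier uses a different loss.

```python
def fit_stump(xs, residuals):
    """Least-squares one-split regressor: returns (threshold, left_mean, right_mean)."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def boost(xs, ys, n_rounds, learning_rate):
    """Gradient boosting for squared error; returns (predictions, mse_per_round)."""
    preds = [sum(ys) / len(ys)] * len(ys)  # start from the mean
    history = []
    for _ in range(n_rounds):
        # The residuals are what the ensemble so far still gets wrong.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        preds = [p + learning_rate * (lm if x <= t else rm) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return preds, history


# Hypothetical toy data, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.3, 0.2, 1.1, 0.9, 1.2]
preds, mse_history = boost(xs, ys, n_rounds=20, learning_rate=0.5)
print(mse_history[0], mse_history[-1])
```

Each round shrinks the remaining error a little, and the learning rate controls how much of each new stump's correction is applied.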
Code
from sklearn.ensemble import GradientBoostingClassifier
gd_clf = GradientBoostingClassifier(max_depth=6,n_estimators=500)
gd_clf.fit(X_train,y_train)
gd_clf.score(X_test,y_test)