Introduction to Python Machine Learning: Random Forest Ensemble Algorithm Study Notes


Foreword


The previous study note introduced the decision tree algorithm, a simple and efficient model in machine learning. Even so, a single decision tree is weak, and there are many problems it cannot solve on its own; if we introduce multiple trees, the situation improves. If a tree is an individual in nature, then a forest is a population, a collection of individuals. The random forest model, as the name suggests, builds a forest in a random way. The forest contains many decision trees, and the trees in a random forest are independent of one another, which increases the diversity of the individual trees and the competitiveness of the forest as a whole.

1. Introduction to Ensemble Learning


An ensemble algorithm constructs multiple learners and completes the learning task by combining them with some strategy. As the saying goes, three cobblers together are worth one Zhuge Liang: when weak learners are combined correctly, we obtain a more accurate and robust learner. Because there is a tension between the accuracy and the diversity of the individual learners, pursuing diversity inevitably sacrifices some accuracy, so the key is to combine individual learners that are "good but different". It is like a forest: different trees have different viability, and combining them yields a more lush jungle.


Depending on how the individual learners relate to one another, ensemble algorithms fall into three categories: Bagging, Boosting, and Stacking.

Bagging-based algorithms
Bagging is short for bootstrap aggregating. It repeatedly draws random samples (with replacement) from the training set, trains a new model on each resampled set, and finally averages (or votes over) these models.
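As a rough illustration of this resample-train-aggregate loop, the sketch below uses sklearn's BaggingClassifier to bag shallow decision trees on the moons dataset; the hyperparameter values are only placeholders, not tuned settings.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

x, y = make_moons(n_samples=500, noise=0.3, random_state=56)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=56)

# 100 trees, each trained on a bootstrap resample of the training set;
# predictions are aggregated by majority vote
bag_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),  # called base_estimator in older sklearn versions
    n_estimators=100,
    bootstrap=True,   # sample with replacement
    random_state=56,
)
bag_clf.fit(x_train, y_train)
print(bag_clf.score(x_test, y_test))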

Boosting-based algorithms
Boosting is a commonly used and effective statistical learning technique that belongs to the family of iterative algorithms. Each new weak learner is trained to make up for the "deficiencies" of the previous weak learners, and the serially constructed weak learners are combined into a strong learner whose objective function value can be made small enough.

Stacking-based algorithms
Stacking retrains a new model on the predictions of the base models, which is like "stacking" a model on top of the original ones. It captures the parts where each base model extracts features well and discards the parts where each performs poorly, which effectively optimizes the predictions and improves the final score.
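A minimal sketch of this idea with sklearn's StackingClassifier; the choice of base learners and meta-learner here is purely illustrative.

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_moons

x, y = make_moons(n_samples=500, noise=0.3, random_state=56)

# The meta-learner is trained on the base models' out-of-fold predictions,
# i.e. a model "stacked" on top of the original models
stack_clf = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=100, random_state=56)),
                ('svc', SVC(probability=True, random_state=56))],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack_clf.fit(x, y)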

2. Mathematical Principles of Ensemble Learning

1. Random Forest Algorithm

The core idea of Bagging: given a training set of size N, repeatedly draw a sub-dataset of K samples from it, m times in total, and train m models on these m sub-datasets. At prediction time, the m models each make a prediction, and the final result is obtained by averaging (for regression) or majority vote (for classification). We use f(x) to denote the final classifier:

f(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x)


Random forest follows the core idea of Bagging. Suppose the training set has N samples and M features. For each individual decision tree, K samples are drawn by random sampling with replacement as that tree's training data, and n of the M features are randomly selected as its candidate node attributes. Repeating this procedure builds m trees, and the final result is obtained by averaging (or majority-voting) the outputs of the m decision trees.
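The two sources of randomness can be sketched directly with numpy, as below. Note that sklearn's RandomForestClassifier actually re-draws the candidate features at every split (the max_features parameter) rather than once per tree, so this per-tree version is a simplification, and the function name and sizes are just placeholders.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(56)

def train_random_tree(x, y, n_sub_features):
    # One tree of the forest: bootstrap-sample the rows, randomly pick a subset of the columns
    n_samples = x.shape[0]
    row_idx = rng.integers(0, n_samples, size=n_samples)                   # sampling with replacement
    col_idx = rng.choice(x.shape[1], size=n_sub_features, replace=False)   # random feature subset
    tree = DecisionTreeClassifier(max_depth=3)
    tree.fit(x[row_idx][:, col_idx], y[row_idx])
    return tree, col_idx

# The forest prediction then averages (or majority-votes) the outputs of the trained trees.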

2. Adaboost algorithm

The core idea of the Boosting algorithm is to first train a weak learner; the classifier generated in the next round of iteration is then trained on the basis of the previous round's weak learner, that is, the next weak learner learns from the errors made in the previous round of classification and tries to minimize the classification error of the current round. Finally, the results of the previous learners and the current weak learner are combined according to certain weights to update the current strong learner. We use F(x) to denote the strong learner; we want the combination of a new weak learner h(x) with the old learner to yield the lowest loss:

F_m(x) = F_{m-1}(x) + \arg\min_{h \in H} \sum_{i=1}^{n} \mathrm{Loss}\left( y_i, F_{m-1}(x_i) + h(x_i) \right)


The AdaBoost algorithm is based on the Boosting strategy. First, the weight distribution over the training samples is initialized so that every sample has the same weight; then a weak classifier is trained. If a sample is classified correctly, its weight is reduced when building the next training set; if it is misclassified, its weight is increased. It is like taking an exam: you pay special attention to the questions you got wrong and redo them repeatedly, without spending too much time on the questions you already know. The updated sample weights are used to train the next classifier. After each weak classifier is trained, the weak classifiers with small classification error rates are given larger weights, those with large error rates are given smaller weights, and all the weak classifiers are combined into a strong classifier.
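For reference, the standard textbook AdaBoost updates behind this description (for labels y_i ∈ {−1, +1}, with ε_m the weighted error rate of the m-th weak classifier h_m) are:

\alpha_m = \frac{1}{2} \ln \frac{1 - \epsilon_m}{\epsilon_m}, \qquad w_{m+1,i} = \frac{w_{m,i} \exp\left( -\alpha_m y_i h_m(x_i) \right)}{Z_m}

where Z_m normalizes the weights to sum to one, and the final strong classifier is F(x) = \mathrm{sign}\left( \sum_m \alpha_m h_m(x) \right).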

3. GBDT algorithm

The Gradient Boosting strategy builds on Boosting. Its core idea is to generate multiple weak learners serially, where the goal of each weak learner is to fit the negative gradient of the loss function of the model accumulated so far, so that after the weak learner is added the cumulative model's loss decreases along the negative gradient direction. If the m-th round weak learner fits the negative gradient of the loss with respect to the cumulative model, then the loss of the cumulative model after adding this weak learner is smallest and the prediction effect improves.

\nabla g_m = \frac{\partial \mathrm{Loss}\left( y, F_{m-1}(x) \right)}{\partial F_{m-1}(x)}

The GBDT algorithm is based on the Gradient Boosting strategy, and each weak learner is a CART regression tree. In regression problems, the loss function is the squared loss:
\mathrm{Loss}\left( y, F_{m-1}(x) \right) = \left( y - F_{m-1}(x) \right)^2
Each CART decision tree is generated with the goal of fitting the (negative) gradient of the loss produced by the previous round's cumulative model, so the loss of the cumulative model keeps decreasing in this round. Iterating this process yields a strong learner; in effect, it applies the idea of gradient descent to achieve a better prediction result.
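One small step makes the connection to the code below explicit: under the squared loss, the negative gradient is just (twice) the residual,

-\nabla g_m = -\frac{\partial \left( y - F_{m-1}(x) \right)^2}{\partial F_{m-1}(x)} = 2\left( y - F_{m-1}(x) \right)

so fitting the negative gradient amounts to fitting the residual y - F_{m-1}(x), which is exactly what each new CART tree is trained on in the implementation further down.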

3. Implementing Ensemble Learning Algorithms in Python

1. Random Forest Algorithm

The idea of implementing the random forest algorithm in Python is the same as before: first import the usual packages and the moons dataset, then instantiate a random forest and a decision tree for comparison, train each and make predictions, then use matplotlib to draw the prediction grid and decision boundaries, and finally show the figure.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
from matplotlib.colors import ListedColormap

x, y = make_moons(n_samples=500, noise=10, random_state=56)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=56)
RF_clf = RandomForestClassifier(n_estimators=500, max_depth=3, criterion='gini', min_samples_leaf=8, bootstrap=True, n_jobs=-1, random_state=56)
RF_clf.fit(x_train, y_train)
y_RF_pred = RF_clf.predict(x_test)
tree_clf = DecisionTreeClassifier(random_state=56, max_depth=3, min_samples_leaf=8, criterion='gini')
tree_clf.fit(x_train, y_train)
y_tree_pred = tree_clf.predict(x_test)


def plot_decision_boundary(clf, x, y, alpha=0.5):
    # Evaluate the classifier on a dense grid of points and draw its decision regions together with the data
    axes = [-36, 36, -42, 42]
    x1s = np.linspace(axes[0], axes[1], 300)
    x2s = np.linspace(axes[2], axes[3], 300)
    x1, x2 = np.meshgrid(x1s, x2s)
    x_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(x_new).reshape(x1.shape)
    custom_cmap2 = ListedColormap(['#7d7d58', '#4c4c7f', '#507d50'])
    plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    plt.contourf(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.2)
    plt.plot(x[:, 0][y == 0], x[:, 1][y == 0], 'yo', alpha=0.6)
    plt.plot(x[:, 0][y == 1], x[:, 1][y == 1], 'bs', alpha=0.6)
    plt.xlabel('x1')
    plt.ylabel('x2')

plt.figure(figsize=(12,8))
plt.subplot(121)
plot_decision_boundary(RF_clf, x, y)
plt.title('Random Forest')
plt.subplot(122)
plot_decision_boundary(tree_clf, x, y)
plt.title('Decision Tree')
plt.show()
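To put numbers on the comparison (reusing the variables from the block above), the held-out accuracies can be printed directly:

from sklearn.metrics import accuracy_score

# Compare the two classifiers on the same test split
print('Random Forest accuracy:', accuracy_score(y_test, y_RF_pred))
print('Decision Tree accuracy:', accuracy_score(y_test, y_tree_pred))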


2. Adaboost algorithm

The idea of implementing the AdaBoost-style algorithm in Python is the same as before: first import the usual packages and the moons dataset, then instantiate a support vector machine (SVM) as the individual learner and train it iteratively. After each round, the weights of the samples the SVM misclassified are increased (scaled by the learning rate), so the next round focuses on the hard samples. Then matplotlib is used to draw the prediction grid and decision boundary, and finally the figure is shown.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from matplotlib.colors import ListedColormap

x, y = make_moons(n_samples=500, noise=0.5, random_state=56)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=56)


def plot_decision_boundary(clf, x, y):
    axes = [-2.5, 2.5, -2.5, 3.5]
    x1s = np.linspace(axes[0], axes[1], 200)
    x2s = np.linspace(axes[2], axes[3], 200)
    x1, x2 = np.meshgrid(x1s, x2s)
    x_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(x_new).reshape(x1.shape)
    custom_cmap2 = ListedColormap(['#7d7d58', '#4c4c7f', '#507d50'])
    plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    plt.contourf(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.2)
    plt.plot(x[:, 0][y == 0], x[:, 1][y == 0], 'yo', alpha=0.6)
    plt.plot(x[:, 0][y == 1], x[:, 1][y == 1], 'bs', alpha=0.6)
    plt.xlabel('x1')
    plt.ylabel('x2')

m = len(x_train)

plt.figure(figsize=(12, 8))

for subplot, learning_rate in ((121, 0.2), (122, 0.5)):
    sample_weights = np.ones(m)
    plt.subplot(subplot)
    for i in range(5):
        svm_clf = SVC(kernel='rbf', C=0.05, random_state=56)
        svm_clf.fit(x_train, y_train, sample_weight=sample_weights)
        y_pred = svm_clf.predict(x_train)
        sample_weights[y_pred != y_train] *= (1 + learning_rate)
        plot_decision_boundary(svm_clf, x, y)
        plt.title('learning_rate={}'.format(learning_rate))


plt.show()
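The loop above only mimics the boosting idea by re-weighting misclassified samples each round; for completeness, sklearn ships a full implementation in AdaBoostClassifier, roughly as sketched here (hyperparameter values are illustrative, and it reuses x_train, y_train, x_test, y_test from above):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps; called base_estimator in older sklearn
    n_estimators=200,
    learning_rate=0.5,
    random_state=56,
)
ada_clf.fit(x_train, y_train)
print(ada_clf.score(x_test, y_test))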


3. GBDT algorithm

The idea of implementing the GBDT algorithm in Python is the same as before: first import the usual packages and generate a nonlinear function to fit, then instantiate decision trees as the individual learners, here used as regressors. The trees are trained iteratively so that each new tree fits the residual left by the previous round's fit; then matplotlib is used to plot the fitted curves, and finally the figure is shown.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

np.random.seed(36)
x = np.random.randn(200, 1) - 0.3
y0 = 2*x[:, 0]**3 + 0.05*np.random.rand(200)


def plot_predictions(regs, x, y, axes, label=None, data_label=None, style="r--", data_style="b.",):
    # Plot the training data and the cumulative prediction of the given list of regressors
    x1 = np.linspace(axes[0], axes[1], 600)
    y_pred = sum(reg.predict(x1.reshape(-1, 1)) for reg in regs)
    plt.plot(x[:, 0], y, data_style, label=data_label)
    plt.plot(x1, y_pred, style, label=label, linewidth=2)
    if data_label or label:
        plt.legend(loc="upper center", fontsize=10)
    plt.axis(axes)



tree_reg = DecisionTreeRegressor(max_depth=4)
tree_reg.fit(x, y0)

y1 = y0 - tree_reg.predict(x)
tree_reg1 = DecisionTreeRegressor(max_depth=4)
tree_reg1.fit(x, y1)

y2 = y1 - tree_reg1.predict(x)
tree_reg2 = DecisionTreeRegressor(max_depth=4)
tree_reg2.fit(x, y2)

plt.figure(figsize=(12, 8))

plt.subplot(321)
plot_predictions([tree_reg], x, y0, axes=[-2, 2, -2, 2], label='tree_reg', data_label='training set')
plt.subplot(322)
plot_predictions([tree_reg1], x, y1, axes=[-2, 2, -2, 2], label='tree_reg1')
plt.subplot(323)
plot_predictions([tree_reg, tree_reg1], x, y0, axes=[-2, 2, -2, 2], label='tree_reg + tree_reg1')
plt.subplot(324)
plot_predictions([tree_reg2], x, y2, axes=[-2, 2, -2, 2], label='tree_reg2', data_style="g+")
plt.subplot(325)
plot_predictions([tree_reg1, tree_reg2], x, y0, axes=[-2, 2, -2, 2], label='tree_reg1 + tree_reg2')
plt.subplot(326)
plot_predictions([tree_reg, tree_reg1, tree_reg2], x, y0, axes=[-2, 2, -2, 2], label='tree_reg + tree_reg1 + tree_reg2')
plt.show()
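The three manually chained trees above are what sklearn's GradientBoostingRegressor automates; a minimal equivalent sketch (reusing x and y0, with illustrative parameters) is shown here. The result is not bit-for-bit identical, because sklearn starts from the mean of y as the initial prediction, but the residual-fitting mechanism is the same.

from sklearn.ensemble import GradientBoostingRegressor

# Three depth-4 trees; learning_rate=1.0 so each tree's full prediction is added,
# mirroring the manual residual fitting above
gbrt = GradientBoostingRegressor(max_depth=4, n_estimators=3, learning_rate=1.0, random_state=36)
gbrt.fit(x, y0)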


Summary

The above are my study notes on machine learning ensemble algorithms. This note briefly records the common ensemble algorithms and the ideas behind their Python implementations. Pooling many minds has been recognized as wisdom since ancient times: an ensemble algorithm combines different learners to achieve a more powerful learning effect. While achieving good results, it also generalizes well and remains relatively interpretable. Even today, when deep neural networks dominate, ensemble learning still has its irreplaceable place and room to develop.


Origin blog.csdn.net/m0_55202222/article/details/130061975