Mathematical Modeling Study Notes (27): Random Forest

The previous article sorted out the relevant knowledge of decision trees. This article builds on decision trees and introduces the concept of the random forest.

Random forest is an algorithm that combines multiple trees through the idea of ensemble learning. Its basic unit is the decision tree, and in essence it belongs to a large branch of machine learning: ensemble learning (Ensemble Learning).

Intuitively, each decision tree is a classifier (assuming a classification problem for now), so for one input sample, N trees give N classification results. The random forest collects all of the votes and designates the category with the most votes as the final output. This is the simplest form of the Bagging idea.
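As a small illustration of this voting step (the seven tree predictions below are made-up numbers, not the output of any real model):

import numpy as np

# Hypothetical predictions of N = 7 trees for a single input sample (classes 0, 1, 2)
tree_votes = np.array([2, 1, 2, 2, 0, 2, 1])

# The forest outputs the class that receives the most votes
final_class = np.bincount(tree_votes).argmax()
print(final_class)   # 2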

Bagging, short for bootstrap aggregating, is an ensemble technique that trains classifiers on k new data sets obtained by resampling the original data set with replacement. To classify a new sample, each trained classifier makes a prediction, and the results are combined by majority voting (or by averaging the outputs); the category with the most votes is the final label.
[Bootstrap] Bootstrap resampling draws a fixed number of samples from the training set, returning each sample to the pool after it is drawn; in other words, a sample that has already been collected may be collected again.
[OOB] In each round of random sampling in Bagging, about 36.8% of the training set is never drawn into the sampled set: the probability that a given sample is missed in all n draws is (1 - 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. This uncollected part is called Out Of Bag (OOB) data. Since it takes no part in fitting the model, it can be used to test the model's generalization ability (a quick numerical check of the 36.8% figure is sketched after this list).
[Randomness] The Bagging algorithm generally uses bootstrap sampling to collect samples at random, and each tree is trained on the same number of samples, usually fewer than the original sample size. The content of the sampled set is therefore different every time; generating k classification trees by this self-sampling method forms a random forest and achieves sample randomness.
[Output] Bagging's aggregation strategy is also simple. For classification problems, simple voting is usually used, and the category (or one of the categories) with the most votes is the final model output. For regression problems, simple averaging is usually used: the regression results of the T weak learners are arithmetically averaged to obtain the final model output.
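Here is a minimal numerical sketch of one round of bootstrap sampling and the resulting out-of-bag fraction; the data set size n is arbitrary and chosen only for illustration:

import numpy as np

rng = np.random.default_rng(0)
n = 10000                      # size of the original training set
idx = np.arange(n)

# One round of bootstrap sampling: draw n indices with replacement
sample = rng.choice(idx, size=n, replace=True)

# Indices never drawn in this round form the out-of-bag (OOB) set
oob = np.setdiff1d(idx, sample)
print(len(oob) / n)            # close to (1 - 1/n)**n ≈ 1/e ≈ 0.368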

To put it simply, the [random] in random forest means that samples are drawn from the data set at random with replacement, and the forest is a collection of decision trees. Each decision tree produces its own result; the forest summarizes them and decides by voting, and the answer returned most often by the trees is the forest's result.

Analogous to an exam: every student gets a different fraction of a set of test papers right. A random forest is like collecting the whole class's papers and, for each question, taking the answer chosen most often as the answer to that question, so the accuracy is generally higher than any individual's.

Random forest construction steps:
1. Use bootstrap sampling to draw n samples with replacement from the original training set; repeating this k times gives k different sampled sets.
2. Grow one decision tree on each sampled set; at every node only a random subset of the features is considered when choosing the best split, and the tree is usually grown fully without pruning.
3. The k trees together form the random forest.
4. To predict, a classification forest takes the majority vote of the k trees, and a regression forest averages their outputs.
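In scikit-learn these construction steps map onto a few constructor parameters of RandomForestClassifier. A minimal sketch (the toy data set is generated only so the snippet runs on its own):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data just to make the sketch self-contained
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,      # k: number of trees in the forest
    max_features='sqrt',   # size of the random feature subset tried at each split
    bootstrap=True,        # draw each tree's training set with replacement
    oob_score=True,        # evaluate on the out-of-bag samples
    random_state=0,
)
rf.fit(X, y)
print(rf.oob_score_)       # OOB estimate of generalization accuracy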
Application directions:
[Figure: application directions of random forest]
Application example: feature screening (selecting important features)
Idea: the more a feature influences the decision tree's classification, the more important that feature is.
Example: the Wine classification problem, using the UCI Wine data set (downloaded automatically in the code below).

Example 2: analyzing the importance of features in classifying red wine. The more a change in an attribute affects the classification, the more important the attribute. For example, alcohol is indispensable to wine, but every wine contains alcohol; so although alcohol itself matters to wine, it is not necessarily important for deciding "which class a wine belongs to". Feature screening is mainly about how well a variable discriminates between the classes, that is, which class a wine belongs to.

Python code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load the UCI Wine data set (fetched automatically); the file has no header row
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
df = pd.read_csv(url, header=None)
df.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
              'Alcalinity of ash', 'Magnesium', 'Total phenols',
              'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
              'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']

# First column is the class label, the remaining 13 columns are the features
x, y = df.iloc[:, 1:].values, df.iloc[:, 0].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
feat_labels = df.columns[1:]
print(np.unique(df['Class label']))   # the three wine classes: [1 2 3]

# Train a small random forest and read off the impurity-based feature importances
forest = RandomForestClassifier(n_estimators=5, random_state=0, n_jobs=-1)
forest.fit(x_train, y_train)
importances = forest.feature_importances_

# Rank the features from most to least important and print the table
indices = np.argsort(importances)[::-1]
for f in range(x_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))

# Plot the importances in the same sorted order so the figure matches the table
plt.title('Feature Importance')
plt.bar(range(x_train.shape[1]), importances[indices], color='lightblue', align='center')
plt.xticks(range(x_train.shape[1]), feat_labels[indices], rotation=90)
plt.xlim([-1, x_train.shape[1]])
plt.tight_layout()
plt.show()

Result: the script prints the 13 features ranked by importance and draws the corresponding bar chart.
[Figure: feature importance ranking table and bar chart]
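The importances obtained above can be used directly for feature screening. A minimal sketch with scikit-learn's SelectFromModel, reusing the fitted forest and feat_labels from the code above (the 0.15 threshold is an arbitrary value chosen only for illustration):

from sklearn.feature_selection import SelectFromModel

# Keep only the features whose importance exceeds the chosen threshold
sfm = SelectFromModel(forest, threshold=0.15, prefit=True)
x_selected = sfm.transform(x_train)
print('Features kept:', list(feat_labels[sfm.get_support()]))
print('Shape before/after screening:', x_train.shape, x_selected.shape)

The columns kept this way are the ones that discriminate best between the wine classes, which is exactly the feature screening described above.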
