"Machine Learning Formula Derivation and Code Implementation" chapter15-Random Forest

"Machine Learning Formula Derivation and Code Implementation" study notes, record your own learning process, please buy the author's book for detailed content.

Random Forest

Bagging is an ensemble learning framework different from Boosting: it obtains different subsets by resampling the data set itself, trains a base classifier on each subset, and then combines these base classifiers into an ensemble.

Bagging is a parallel ensemble learning method, and random forest is the representative model of the Bagging framework. A random forest is formed from multiple decision trees, with each base classifier constructed by combining two kinds of randomness: random samples and random features.

1 Bagging

The ensemble learning models in the previous chapters all belong to the Boosting framework, which builds ensemble tree models through continuous iteration and residual fitting. Bagging is the most typical framework of the parallel ensemble learning methods, and its core concept is bootstrap sampling. Given a data set of m samples, one sample is randomly drawn with replacement and put into a sampling set; after m draws we obtain a sampling set of the same size as the original data set. Repeating this procedure, we can draw T sampling sets of m samples each, train a base classifier on each sampling set, and finally combine these base classifiers. This is the main idea of Bagging.
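Because each draw is with replacement, some samples appear several times in a sampling set while others never appear; as m grows, the fraction of distinct original samples in a bootstrap set approaches 1 - 1/e ≈ 63.2%. A minimal NumPy sketch (illustrative only, not from the book) that demonstrates this:

import numpy as np

m = 10000
rng = np.random.default_rng(0)
# draw m indices with replacement, as in bootstrap sampling
idx = rng.choice(m, size=m, replace=True)
# fraction of distinct original samples that made it into the bootstrap set
print(len(np.unique(idx)) / m)  # ~0.632, close to 1 - 1/e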

The biggest feature of Bagging is that it can be implemented in parallel, whereas Boosting is inherently sequential and iterative.

2 Basic Principles of Random Forest

Random forest (RF) uses decision trees as the base classifiers to ensemble, and further introduces random feature selection into the decision tree training process, hence the name random forest.
Simply put, the random forest algorithm comes down to two kinds of randomness:

  • Suppose there are M samples; randomly draw M samples with replacement.
  • Suppose there are N features; when a node is to be split during decision tree training, randomly select n features (n << N) from the N features, and choose the splitting feature from these n.

Finally, a large number of decision trees are built to form the random forest, and the results of the individual trees are combined (voting for classification, averaging for regression).
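As a quick illustration of this aggregation step, here is a hedged sketch with made-up base predictions (not code from the book):

import numpy as np

# hypothetical predictions from 5 base trees for 3 samples (classification)
clf_preds = np.array([[0, 1, 1], [0, 1, 0], [1, 1, 0], [0, 1, 0], [0, 0, 0]])
# hypothetical predictions from 3 base trees for 2 samples (regression)
reg_preds = np.array([[1.2, 3.1], [0.9, 2.8], [1.1, 3.0]])

# classification: majority vote across trees for each sample
votes = [np.bincount(col).argmax() for col in clf_preds.T]
print(votes)                    # [0, 1, 0]
# regression: mean of the base predictions
print(reg_preds.mean(axis=0))   # [1.0667 2.9667]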

3 Implementation of Random Forest Algorithm

The decision tree used is the CART classification tree implemented in an earlier chapter, imported below as ClassificationTree from the cart module.

from cart import ClassificationTree
import numpy as np

class RandomForest:
    def __init__(self, n_estimators=100, min_samples_split=2, min_gain=999, max_depth=float("inf"), max_features=None):
        self.n_estimators = n_estimators            # number of trees in the forest
        self.min_gini_impurity = min_gain           # Gini impurity threshold for splitting
        self.min_samples_split = min_samples_split  # minimum samples required to split a node
        self.max_depth = max_depth                  # maximum depth of each tree
        self.max_features = max_features            # number of features sampled per tree
        self.trees = []                             # container for the base trees

        # build the forest from decision trees
        for _ in range(self.n_estimators):
            tree = ClassificationTree(
                min_samples_split=self.min_samples_split, 
                min_gini_impurity=self.min_gini_impurity, 
                max_depth=self.max_depth
            )
            self.trees.append(tree)
        
    def bootstrap_sampling(self, X, y):

        X_y = np.concatenate([X, y.reshape(-1, 1)], axis=1) # merge inputs and labels
        np.random.shuffle(X_y) # shuffle the data
        n_samples = X_y.shape[0] # number of samples
        sampling_subsets = [] # initialize the list of sampling subsets
        # generate one sampling subset per tree
        for _ in range(self.n_estimators):
            # first randomness: row (sample) sampling with replacement
            idx1 = np.random.choice(n_samples, n_samples, replace=True)
            bootstrap_Xy = X_y[idx1, :]
            bootstrap_X = bootstrap_Xy[:, :-1]
            bootstrap_y = bootstrap_Xy[:, -1]
            sampling_subsets.append([bootstrap_X, bootstrap_y])
        return sampling_subsets
    
    def fit(self, X, y):

        # draw one doubly random sampling subset for each tree in the forest
        sub_sets = self.bootstrap_sampling(X, y)
        n_features = X.shape[1]
        # default max_features to sqrt(n_features)
        if self.max_features is None:
            self.max_features = int(np.sqrt(n_features))

        # fit each tree on its subset
        for i in range(self.n_estimators):
            sub_X, sub_y = sub_sets[i]
            # second randomness: column (feature) sampling, without replacement
            # so that a tree never sees the same feature twice
            idx2 = np.random.choice(n_features, self.max_features, replace=False)
            sub_X = sub_X[:, idx2]
            self.trees[i].fit(sub_X, sub_y)
            self.trees[i].feature_indices = idx2 # remember which features this tree saw
            print(f'the {i}th tree is trained done')
    
    def predict(self, X):
        y_preds = []
        for i in range(self.n_estimators):
            idx = self.trees[i].feature_indices # features this tree was trained on
            sub_X = X[:, idx]
            y_pred = self.trees[i].predict(sub_X)
            y_preds.append(y_pred)

        y_preds = np.array(y_preds).T # shape: (n_samples, n_estimators)
        res = []
        # majority vote across trees for each sample
        for j in y_preds:
            res.append(np.bincount(j.astype('int')).argmax())
        return res
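The training log and accuracy below come from a driver that is not included in the notes. A minimal sketch of what it might look like, assuming a classification dataset from sklearn (the dataset choice here is a guess, not from the book):

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# hypothetical data: the notes do not say which dataset produced the 0.733 accuracy
data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0)

rf = RandomForest(n_estimators=10)  # 10 trees, matching the log below
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(accuracy_score(y_test, y_pred))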
the 0th tree is trained done
the 1th tree is trained done
the 2th tree is trained done
the 3th tree is trained done
the 4th tree is trained done
the 5th tree is trained done
the 6th tree is trained done
the 7th tree is trained done
the 8th tree is trained done
the 9th tree is trained done
0.7333333333333333

4 Implementation of random forest algorithm based on sklearn

The sklearn classes for random-forest classification and regression are ensemble.RandomForestClassifier and ensemble.RandomForestRegressor, respectively.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# X_train, X_test, y_train, y_test are assumed to come from an earlier split
clf = RandomForestClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(acc)
0.81
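For regression the usage is analogous; a hedged sketch (not from the book), assuming a regression target and mean squared error as the metric:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# X_train, X_test, y_train, y_test assumed to hold a regression dataset
reg = RandomForestRegressor(max_depth=3, random_state=0)
reg.fit(X_train, y_train)
print(mean_squared_error(y_test, reg.predict(X_test)))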

Notebook GitHub address


Origin blog.csdn.net/cjw838982809/article/details/131311074