"Machine Learning: Formula Derivation and Code Implementation" (《机器学习公式推导与代码实现》), Chapter 15: Random Forest

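Study notes on 《机器学习公式推导与代码实现》 ("Machine Learning: Formula Derivation and Code Implementation"), recording my own learning process. For the full details, please buy the author's book.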

Random Forest

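Bagging is an ensemble learning framework distinct from Boosting. It obtains different subsets by sampling from the dataset itself, trains a base classifier on each subset, and combines these classifiers into an ensemble.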

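Bagging is a parallel ensemble learning method. Random forest is a representative of the Bagging framework: it builds its base classifiers using two sources of randomness, one over samples and one over features, and combines many decision trees into a forest.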

1 Bagging

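The ensemble models covered in the previous chapters all belong to the Boosting framework, which builds an ensemble of trees through repeated iteration and residual fitting. Bagging is the most typical framework for parallel ensemble learning, and its core concept is bootstrap sampling. Given a dataset of m samples, draw one sample at random with replacement and put it into the sampling set; after m draws this yields a sampling set of the same size as the original dataset. Repeating the procedure gives T sampling sets of m samples each. A base classifier is then trained on each sampling set, and finally the base classifiers are combined. This is the main idea of Bagging.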
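The sampling step described above can be sketched in a few lines of NumPy. This is only an illustration, not the book's code: the helper name `bootstrap_indices` and the values of `m` and `T` are chosen here for the example. Because the draws are made with replacement, each sampling set misses some of the original samples; for large m, roughly 1 - 1/e (about 63.2%) of the distinct samples appear in each set.

```python
import numpy as np

def bootstrap_indices(m, T, seed=0):
    """Return a (T, m) array: row t holds the indices of the t-th sampling set."""
    rng = np.random.default_rng(seed)
    # Each entry is drawn uniformly from 0..m-1, i.e. with replacement.
    return rng.integers(0, m, size=(T, m))

m, T = 10_000, 5
idx = bootstrap_indices(m, T)

# Fraction of distinct original samples that ended up in each sampling set.
unique_fraction = np.array([np.unique(row).size / m for row in idx])
print(unique_fraction)
```

Indexing `X[idx[t]]` and `y[idx[t]]` would then give the data for the t-th base classifier.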

The defining feature of Bagging is that its base learners can be trained in parallel, whereas Boosting is a sequential, iterative procedure.
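Since each base learner depends only on its own sampling set, the T training jobs share no state and can be dispatched independently. The sketch below illustrates this with a deliberately simple base learner, a one-feature threshold rule; `fit_stump` and `predict_stump` are hypothetical helpers written for this example, not taken from the book. Threads from the standard library are used only to keep the example self-contained; for CPU-bound training a process pool would be the more usual choice.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def fit_stump(X, y, seed):
    """Fit a one-feature threshold rule on one bootstrap sample (toy base learner)."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap rows
    Xb, yb = X[idx], y[idx]
    best = (0, 0.0, -1.0)                        # (feature, threshold, accuracy)
    for j in range(Xb.shape[1]):
        for thr in np.unique(Xb[:, j]):
            acc = np.mean((Xb[:, j] > thr).astype(int) == yb)
            if acc > best[2]:
                best = (j, thr, acc)
    return best[:2]

def predict_stump(stump, X):
    """Predict 1 where the chosen feature exceeds the learned threshold."""
    j, thr = stump
    return (X[:, j] > thr).astype(int)

# Toy data: the label is 1 when the first feature is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

# The T fits are independent of one another, so they can run concurrently.
T = 11
with ThreadPoolExecutor(max_workers=4) as pool:
    stumps = list(pool.map(lambda s: fit_stump(X, y, s), range(T)))

# Combine the base learners by majority vote.
votes = np.stack([predict_stump(s, X) for s in stumps])   # shape (T, n_samples)
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", np.mean(y_hat == y))
```

A Boosting loop could not be parallelized in this way, because round t needs the fitted result of round t-1.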

2 Basic principles of random forest

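Random forest (RF) uses decision trees as its base classifiers and additionally introduces random selection of features during tree training, hence the name.
In short, the random forest algorithm comes down to two sources of randomness: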

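Given M samples, randomly draw M samples with replacement.
Given N features, whenever a node of a decision tree is to be split, randomly select n of the N features (n << N) and choose the splitting feature from among these n.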

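Finally, a large number of decision trees are built to form the forest, and the results of the individual trees are combined (majority voting for classification, averaging for regression).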
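The second source of randomness and the final aggregation step can be sketched as follows. The function names (`candidate_features`, `majority_vote`, `mean_aggregate`) are illustrative, and taking n as the integer part of sqrt(N) is one common choice rather than something fixed by the description above.

```python
import numpy as np

def candidate_features(N, n, rng):
    """Pick n distinct feature indices out of N for one node split."""
    return rng.choice(N, size=n, replace=False)

def majority_vote(tree_preds):
    """tree_preds: (n_trees, n_samples) array of non-negative integer class labels."""
    tree_preds = np.asarray(tree_preds, dtype=int)
    # For each sample (one column), return the most frequent label across trees.
    return np.array([np.bincount(col).argmax() for col in tree_preds.T])

def mean_aggregate(tree_preds):
    """tree_preds: (n_trees, n_samples) array of regression outputs."""
    return np.asarray(tree_preds, dtype=float).mean(axis=0)

rng = np.random.default_rng(0)
N = 16
n = int(np.sqrt(N))                      # 4 candidate features per split
feats = candidate_features(N, n, rng)

class_preds = [[0, 1, 2], [0, 1, 1], [1, 1, 2]]   # 3 trees, 3 samples
reg_preds = [[1.0, 2.0], [3.0, 4.0]]              # 2 trees, 2 samples
print(majority_vote(class_preds))   # [0 1 2]
print(mean_aggregate(reg_preds))    # [2. 3.]
```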

3 Implementing the random forest algorithm

The decision tree used here is CART; see the CART implementation.

from cart import ClassificationTree
import numpy as np

class RandomForest:
    def __init__(self, n_estimators=100, min_samples_split=2, min_gain=999, max_depth=float("inf"), max_features=None):
        self.n_estimators = n_estimators
        self.min_gini_impurity = min_gain
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        self.max_features = max_features
        self.trees = []

        # Build the forest from decision trees
        for _ in range(self.n_estimators):
            tree = ClassificationTree(
                min_samples_split=self.min_samples_split, 
                min_gini_impurity=self.min_gini_impurity, 
                max_depth=self.max_depth
            )
            self.trees.append(tree)
        
    def bootstrap_sampling(self, X, y):

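        X_y = np.concatenate([X, y.reshape(-1, 1)], axis=1) # merge features and labels
        np.random.shuffle(X_y) # shuffle the rows
        n_samples = X_y.shape[0] # number of samples
        sampling_subsets = [] # list of bootstrap subsets
        # Generate one bootstrap subset per tree
        for _ in range(self.n_estimators):
            # First source of randomness: row sampling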
            idx1 = np.random.choice(n_samples, n_samples, replace=True)
            bootstrap_Xy = X_y[idx1, :]
            bootstrap_X = bootstrap_Xy[:, :-1]
            bootstrap_y = bootstrap_Xy[:, -1]
            sampling_subsets.append([bootstrap_X, bootstrap_y])
        return sampling_subsets
    
    def fit(self, X, y):
        
        # Draw one bootstrap subset for each tree in the forest
        sub_sets = self.bootstrap_sampling(X, y)
        n_features = X.shape[1]
        # Default max_features to sqrt(n_features)
        if self.max_features is None:
            self.max_features = int(np.sqrt(n_features))
        
        # Fit each tree in turn
        for i in range(self.n_estimators):
            sub_X, sub_y = sub_sets[i]
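            # Second source of randomness: column sampling (done once per tree here,
            # not per node); replace=False avoids duplicated feature columns
            idx2 = np.random.choice(n_features, self.max_features, replace=False)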
            sub_X = sub_X[:, idx2]
            self.trees[i].fit(sub_X, sub_y)
            self.trees[i].feature_indices = idx2
            print(f'the {i}th tree is trained done')
    
    def predict(self, X):
        y_preds = []
        for i in range(self.n_estimators):
            idx = self.trees[i].feature_indices
            sub_X = X[:, idx]
            y_pred = self.trees[i].predict(sub_X)
            y_preds.append(y_pred)

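        # Transpose so that each row holds all trees' predictions for one sample
        y_preds = np.array(y_preds).T
        res = []
        # Majority vote across trees for each sample
        for j in y_preds:
            res.append(np.bincount(j.astype('int')).argmax())
        return res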

the 0th tree is trained done
the 1th tree is trained done
the 2th tree is trained done
the 3th tree is trained done
the 4th tree is trained done
the 5th tree is trained done
the 6th tree is trained done
the 7th tree is trained done
the 8th tree is trained done
the 9th tree is trained done
0.7333333333333333

4 Random forest with sklearn

For random-forest-based classification and regression, the corresponding sklearn classes are ensemble.RandomForestClassifier and ensemble.RandomForestRegressor.

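from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# X_train, X_test, y_train, y_test are assumed to come from an earlier train/test split
clf = RandomForestClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(acc)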

0.81
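The regression class mentioned above is used in the same way. The following is a small self-contained sketch on synthetic data; the dataset, the split, and the hyperparameter values are assumptions made for illustration and are unrelated to the results reported in these notes.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only).
X, y = make_regression(n_samples=500, n_features=5, n_informative=5,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Each tree is fit on a bootstrap sample; predictions are averaged over the trees.
reg = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print(r2_score(y_test, y_pred))
```

Here `max_features="sqrt"` mirrors the square-root default used in the hand-written implementation, and `n_estimators` sets the number of trees in the forest.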

Notebook: GitHub link