Reading Notes - XGBoost: A Scalable Tree Boosting System

XGBoost: A Scalable Tree Boosting System


T.Q. Chen, C. Guestrin, XGBoost: A Scalable Tree Boosting System, KDD (2016)


Abstract

XGBoost: a scalable end-to-end tree boosting system.

A sparsity-aware algorithm for handling sparse data.

A weighted quantile sketch for approximate tree learning.

1 Introduction

Two important factors behind the success of machine learning and data-driven approaches:

(1) usage of effective (statistical) models that capture the complex data dependencies;

(2) scalable learning systems that learn the model of interest from large datasets.

2 Tree Boosting in a Nutshell

2.1 Regularized Learning Objective

Given a dataset $\mathcal{D} = \{ (\mathbf{x}_{i}, y_{i}) \}$ with $|\mathcal{D}| = n$, $\mathbf{x}_{i} \in \R^{m}$ and $y_{i} \in \R$, a tree ensemble model (Fig. 1) built from $K$ additive functions predicts the output as:

$$\hat{y}_{i} = \phi(\mathbf{x}_{i}) = \sum_{k = 1}^{K} f_{k}(\mathbf{x}_{i}), \ f_{k} \in \mathcal{F} \tag{1}$$

where $\mathcal{F} = \{ f(\mathbf{x}) = w_{q(\mathbf{x})} \}$ ($q : \R^{m} \rightarrow T$, $\mathbf{w} = (w_{1}, \dots, w_{T}) \in \R^{T}$) is the space of regression trees (CART). Here $q$ denotes the structure of each tree, which maps an example to the corresponding leaf index, $T$ is the number of leaves in the tree, and each $f_{k}$ corresponds to an independent tree structure $q$ with leaf weights $\mathbf{w}$. Unlike decision trees, a regression tree carries a continuous score on each leaf, and $w_{i}$ denotes the score of the $i$-th leaf.

[Fig. 1: tree ensemble model]

The set of functions in the model is learned by minimizing the following regularized objective:

$$\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_{i}, y_{i}) + \sum_{k} \Omega(f_{k}), \ \text{where} \ \Omega(f) = \gamma T + \frac{1}{2} \lambda \| \mathbf{w} \|^{2} \tag{2}$$

where $l$ is a differentiable convex loss function that measures the difference between the prediction $\hat{y}_{i}$ and the target $y_{i}$, and $\Omega$ penalizes the complexity of the model (i.e., the regression tree functions).
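As a quick illustration (not part of the paper), here is a minimal sketch of evaluating the regularized objective of Eq. (2) for a fitted ensemble; the helper name and the squared-error loss are assumptions chosen for the example:

```python
import numpy as np

def regularized_objective(y_true, y_pred, leaf_weights_per_tree, gamma=1.0, lam=1.0):
    """Eq. (2): convex loss over all instances plus a complexity penalty per tree.

    leaf_weights_per_tree: list of 1-D arrays, one array of leaf scores w per tree.
    Squared error is used here only as an example of a differentiable convex loss.
    """
    loss = np.sum((y_true - y_pred) ** 2)
    penalty = sum(gamma * len(w) + 0.5 * lam * np.sum(w ** 2)
                  for w in leaf_weights_per_tree)
    return loss + penalty
```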

2.2 Gradient Tree Boosting

The tree ensemble model in Eq. (2) includes functions as parameters and cannot be optimized using traditional optimization methods in Euclidean space; instead, the model is trained in an additive manner.

Let $\hat{y}_{i}^{(t)}$ be the prediction for the $i$-th instance at the $t$-th iteration, i.e.:

$$\hat{y}_{i}^{(t)} = \hat{y}_{i}^{(t - 1)} + f_{t}(\mathbf{x}_{i})$$

where $f_{t}$ is the function added at the $t$-th iteration, chosen to minimize the objective $\mathcal{L}^{(t)}$ of that iteration ($\hat{y}_{i}^{(t - 1)}$ is already known, and $\Omega$ applies only to $f_{t}$):

$$\mathcal{L}^{(t)} = \sum_{i = 1}^{n} l \left( y_{i}, \hat{y}_{i}^{(t - 1)} + f_{t}(\mathbf{x}_{i}) \right) + \Omega(f_{t})$$

At the $t$-th iteration, $f_{t}$ is thus chosen greedily to minimize the current objective $\mathcal{L}^{(t)}$. The paper optimizes a second-order approximation (Taylor expansion) of $\mathcal{L}^{(t)}$:

$$\mathcal{L}^{(t)} \cong \sum_{i = 1}^{n} \left[ l \left( y_{i}, \hat{y}^{(t - 1)} \right) + g_{i} f_{t}(\mathbf{x}_{i}) + \frac{1}{2} h_{i} f_{t}^{2}(\mathbf{x}_{i}) \right] + \Omega(f_{t})$$

where $g_{i} = \partial_{\hat{y}^{(t - 1)}} l \left( y_{i}, \hat{y}^{(t - 1)} \right)$ and $h_{i} = \partial_{\hat{y}^{(t - 1)}}^{2} l \left( y_{i}, \hat{y}^{(t - 1)} \right)$ are the first and second order gradient statistics on the loss function.
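For concreteness, a minimal sketch of computing $g_{i}$ and $h_{i}$ from the previous round's predictions; the squared-error and logistic losses are assumed examples, not losses prescribed at this point of the paper:

```python
import numpy as np

def grad_hess_squared_error(y_true, y_pred_prev):
    # l(y, yhat) = (y - yhat)^2: g = 2 (yhat - y), h = 2
    g = 2.0 * (y_pred_prev - y_true)
    h = 2.0 * np.ones_like(y_true)
    return g, h

def grad_hess_logistic(y_true, y_pred_prev):
    # Logistic loss on a raw margin yhat, with y in {0, 1}:
    # g = sigmoid(yhat) - y, h = sigmoid(yhat) * (1 - sigmoid(yhat))
    p = 1.0 / (1.0 + np.exp(-y_pred_prev))
    g = p - y_true
    h = p * (1.0 - p)
    return g, h
```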

■■

Taylor's formula: if $f(x)$ has derivatives up to order $n$ on a closed interval $[a, b]$ containing $x = x_{0}$, and a derivative of order $n + 1$ on the open interval $(a, b)$, then for any point $x$ in $[a, b]$,

$$f(x) = f(x_{0} + \Delta x) = \sum_{k=0}^{n} \frac{1}{k!} f^{(k)}(x_{0}) (\Delta x)^{k} + o((\Delta x)^{n})$$

Regard $\phi$ as a point in the function space $\mathcal{F}$; at the $t$-th iteration, $\phi = \sum_{k}^{t} f_{k} = \sum_{k}^{t - 1} f_{k} + f_{t}$. Let $\phi_{0} = \sum_{k}^{t - 1} f_{k}$ and $\Delta \phi = f_{t}$, and expand $l$ in a Taylor series around $\phi_{0}$:

$$\begin{aligned} l(y, \phi) & = l(y, \phi_{0} + \Delta \phi) \\ & \approx l(y, \phi_{0}) + \frac{\partial}{\partial \phi} l(y, \phi_{0}) \, \Delta \phi + \frac{1}{2} \frac{\partial^{2}}{\partial \phi^{2}} l(y, \phi_{0}) \, (\Delta \phi)^{2} \end{aligned}$$

At the instance $\mathbf{x}_{i}$,

$$\begin{aligned} l(y_{i}, \phi(\mathbf{x}_{i})) & = l(y_{i}, \phi_{0}(\mathbf{x}_{i}) + \Delta \phi(\mathbf{x}_{i})) \\ & \approx l(y_{i}, \phi_{0}(\mathbf{x}_{i})) + \frac{\partial l}{\partial \phi}(y_{i}, \phi_{0}(\mathbf{x}_{i})) \, \Delta \phi(\mathbf{x}_{i}) + \frac{1}{2} \frac{\partial^{2} l}{\partial \phi^{2}}(y_{i}, \phi_{0}(\mathbf{x}_{i})) \, (\Delta \phi(\mathbf{x}_{i}))^{2} \end{aligned}$$

i.e.,

$$l \left( y_{i}, \hat{y}_{i}^{(t - 1)} + f_{t}(\mathbf{x}_{i}) \right) \approx l \left( y_{i}, \hat{y}_{i}^{(t - 1)} \right) + \frac{\partial l}{\partial \hat{y}^{(t - 1)}} \left( y_{i}, \hat{y}_{i}^{(t - 1)} \right) f_{t}(\mathbf{x}_{i}) + \frac{1}{2} \frac{\partial^{2} l}{(\partial \hat{y}^{(t - 1)})^{2}} \left( y_{i}, \hat{y}_{i}^{(t - 1)} \right) f_{t}^{2}(\mathbf{x}_{i})$$

With $g_{i} = \partial_{\hat{y}^{(t - 1)}} l \left( y_{i}, \hat{y}_{i}^{(t - 1)} \right)$ and $h_{i} = \partial_{\hat{y}^{(t - 1)}}^{2} l \left( y_{i}, \hat{y}_{i}^{(t - 1)} \right)$, this becomes

$$l \left( y_{i}, \hat{y}_{i}^{(t - 1)} + f_{t}(\mathbf{x}_{i}) \right) \approx l \left( y_{i}, \hat{y}_{i}^{(t - 1)} \right) + g_{i} f_{t}(\mathbf{x}_{i}) + \frac{1}{2} h_{i} f_{t}^{2}(\mathbf{x}_{i})$$

In my opinion, the notation $\hat{y}^{(t - 1)}$ (without the instance subscript) in the original paper is not entirely clear and can be read ambiguously. ■

Removing the constant term $l \left( y_{i}, \hat{y}^{(t - 1)} \right)$, the simplified objective at the $t$-th iteration is

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i = 1}^{n} \left[ g_{i} f_{t}(\mathbf{x}_{i}) + \frac{1}{2} h_{i} f_{t}^{2}(\mathbf{x}_{i}) \right] + \Omega(f_{t}) \tag{3}$$

Define $I_{j} = \{ i \mid q(\mathbf{x}_{i}) = j \}$ as the instance set of leaf $j$, and expand the penalty term $\Omega$ in Eq. (3):

$$\begin{aligned} \tilde{\mathcal{L}}^{(t)} & = \sum_{i = 1}^{n} \left[ g_{i} f_{t}(\mathbf{x}_{i}) + \frac{1}{2} h_{i} f_{t}^{2}(\mathbf{x}_{i}) \right] + \gamma T + \frac{1}{2} \lambda \sum_{j = 1}^{T} w_{j}^{2} \\ & = \sum_{j = 1}^{T} \left[ \left( \sum_{i \in I_{j}} g_{i} \right) w_{j} + \frac{1}{2} \left( \sum_{i \in I_{j}} h_{i} + \lambda \right) w_{j}^{2} \right] + \gamma T \end{aligned} \tag{4}$$

■■ $f_{t}$ is a tree: it maps the input $\mathbf{x}_{i}$ to the corresponding leaf $j$ and returns that leaf's weight $w_{j}$, i.e., $f_{t}(\mathbf{x}_{i}) = w_{j}, \ \forall i \in I_{j}$. ■

For a fixed tree structure $q(\mathbf{x})$, the optimal weight $w_{j}^{\ast}$ of leaf $j$ is obtained by setting $\frac{\partial \tilde{\mathcal{L}}^{(t)}}{\partial w_{j}} = 0$:

$$w_{j}^{\ast} = - \frac{\sum_{i \in I_{j}} g_{i}}{\sum_{i \in I_{j}} h_{i} + \lambda} \tag{5}$$

and the corresponding optimal objective value is

$$\tilde{\mathcal{L}}^{(t)}(q) = - \frac{1}{2} \sum_{j = 1}^{T} \frac{\left( \sum_{i \in I_{j}} g_{i} \right)^{2}}{\sum_{i \in I_{j}} h_{i} + \lambda} + \gamma T \tag{6}$$

Eq. (6) can be used as a scoring function to measure the quality of a tree structure $q$, similar to the impurity score for evaluating decision trees. Fig. 2 illustrates how this score is computed.
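A minimal sketch of Eq. (5) and Eq. (6); the helper names are assumptions for illustration:

```python
import numpy as np

def optimal_leaf_weight(g, h, lam=1.0):
    # Eq. (5): w* = -sum(g) / (sum(h) + lambda) over the instances in one leaf
    return -np.sum(g) / (np.sum(h) + lam)

def structure_score(leaves, lam=1.0, gamma=1.0):
    """Eq. (6): quality score of a tree structure, given (g, h) arrays per leaf.

    leaves: list of (g, h) pairs, one pair of arrays per leaf.
    Lower is better, since it is the minimized objective value.
    """
    score = 0.0
    for g, h in leaves:
        score -= 0.5 * np.sum(g) ** 2 / (np.sum(h) + lam)
    return score + gamma * len(leaves)
```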

[Fig. 2: structure score calculation]
Enumerating all possible tree structures is an NP-hard problem, so the paper uses a greedy algorithm instead: starting from a single leaf, branches are iteratively added to the tree. Let $I_{L}$ and $I_{R}$ be the instance sets of the left and right nodes after a split, with $I = I_{L} \cup I_{R}$; the loss reduction after the split is:

$$\mathcal{L}_{\text{split}} = \frac{1}{2} \left[ \frac{\left( \sum_{i \in I_{L}} g_{i} \right)^{2}}{\sum_{i \in I_{L}} h_{i} + \lambda} + \frac{\left( \sum_{i \in I_{R}} g_{i} \right)^{2}}{\sum_{i \in I_{R}} h_{i} + \lambda} - \frac{\left( \sum_{i \in I} g_{i} \right)^{2}}{\sum_{i \in I} h_{i} + \lambda} \right] - \gamma \tag{7}$$

■■

Given a tree structure $q$, its objective value is

$$\tilde{\mathcal{L}}^{(t)}(q) = - \frac{1}{2} \sum_{j = 1, j \neq k}^{T} \frac{\left( \sum_{i \in I_{j}} g_{i} \right)^{2}}{\sum_{i \in I_{j}} h_{i} + \lambda} - \frac{1}{2} \frac{\left( \sum_{i \in I_{k}} g_{i} \right)^{2}}{\sum_{i \in I_{k}} h_{i} + \lambda} + \gamma T$$

Suppose leaf $k$ is split, with $I_{k} = I_{k, L} \cup I_{k, R}$; the number of leaves increases by $1$ (to $T + 1$), and the objective becomes

$$\tilde{\mathcal{L}}^{(t)}(q_{\text{split}}) = - \frac{1}{2} \sum_{j = 1, j \neq k}^{T} \frac{\left( \sum_{i \in I_{j}} g_{i} \right)^{2}}{\sum_{i \in I_{j}} h_{i} + \lambda} - \frac{1}{2} \frac{\left( \sum_{i \in I_{k, L}} g_{i} \right)^{2}}{\sum_{i \in I_{k, L}} h_{i} + \lambda} - \frac{1}{2} \frac{\left( \sum_{i \in I_{k, R}} g_{i} \right)^{2}}{\sum_{i \in I_{k, R}} h_{i} + \lambda} + \gamma (T + 1)$$

Hence the loss reduction after the split is

$$\begin{aligned} \mathcal{L}_{\text{split}} & = \tilde{\mathcal{L}}^{(t)}(q) - \tilde{\mathcal{L}}^{(t)}(q_{\text{split}}) \\ & = - \frac{1}{2} \frac{\left( \sum_{i \in I_{k}} g_{i} \right)^{2}}{\sum_{i \in I_{k}} h_{i} + \lambda} + \frac{1}{2} \frac{\left( \sum_{i \in I_{k, L}} g_{i} \right)^{2}}{\sum_{i \in I_{k, L}} h_{i} + \lambda} + \frac{1}{2} \frac{\left( \sum_{i \in I_{k, R}} g_{i} \right)^{2}}{\sum_{i \in I_{k, R}} h_{i} + \lambda} - \gamma \\ & = \frac{1}{2} \left[ \frac{\left( \sum_{i \in I_{k, L}} g_{i} \right)^{2}}{\sum_{i \in I_{k, L}} h_{i} + \lambda} + \frac{\left( \sum_{i \in I_{k, R}} g_{i} \right)^{2}}{\sum_{i \in I_{k, R}} h_{i} + \lambda} - \frac{\left( \sum_{i \in I_{k}} g_{i} \right)^{2}}{\sum_{i \in I_{k}} h_{i} + \lambda} \right] - \gamma \end{aligned}$$

■
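The loss reduction of Eq. (7) is what split finding maximizes; a minimal sketch, with assumed helper names, of evaluating it for one candidate partition of a leaf:

```python
import numpy as np

def split_gain(g, h, left_mask, lam=1.0, gamma=1.0):
    """Eq. (7): gain of splitting one leaf into left and right children.

    g, h: gradient statistics of the instances currently in the leaf.
    left_mask: boolean array selecting the instances that go to the left child.
    """
    def term(gs, hs):
        return np.sum(gs) ** 2 / (np.sum(hs) + lam)

    return 0.5 * (term(g[left_mask], h[left_mask])
                  + term(g[~left_mask], h[~left_mask])
                  - term(g, h)) - gamma
```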

2.3 Shrinkage and Column Subsampling

Beyond the regularized objective, the paper uses two additional techniques to further prevent overfitting: shrinkage and column (feature) subsampling.

Shrinkage: after each step of tree boosting, shrinkage scales the newly added weights by a factor $\eta$. It reduces the influence of each individual tree and leaves space for future trees to improve the model.

Column subsampling: using column subsampling prevents overfitting even more than traditional row subsampling does.
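As a usage note rather than part of the paper, these techniques map to parameters of the released xgboost Python package; a minimal sketch, assuming the package is installed:

```python
import xgboost as xgb

# learning_rate (eta) implements shrinkage, colsample_bytree implements column
# subsampling, subsample implements row subsampling; reg_lambda and gamma are the
# lambda and gamma terms of the regularized objective in Eq. (2).
model = xgb.XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,
    colsample_bytree=0.8,
    subsample=0.8,
    reg_lambda=1.0,
    gamma=0.0,
)
```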

3 Split Finding Algorithms

3.1 Basic Exact Greedy Algorithm

The exact greedy algorithm enumerates all possible splits on all features and picks the split according to Eq. (7) (Alg. 1).

[Alg. 1: exact greedy algorithm for split finding]
■■

Note that $g$ and $h$ are the first and second order partial derivatives of $l$ with respect to $\phi$ in the function space $\mathcal{F}$ ($g = \frac{\partial l}{\partial \phi}$, $h = \frac{\partial^{2} l}{\partial \phi^{2}}$). Expanding $l$ around $\phi_{0} = \sum_{k}^{t - 1} f_{k}$ (with $\Delta \phi = f_{t}$):

$$\begin{aligned} l(y, \phi) & = l(y, \phi_{0} + \Delta \phi) \\ & \approx l(y, \phi_{0}) + \frac{\partial l}{\partial \phi}(y, \phi_{0}) \, \Delta \phi + \frac{1}{2} \frac{\partial^{2} l}{\partial \phi^{2}}(y, \phi_{0}) \, (\Delta \phi)^{2} \\ & = l(y, \phi_{0}) + g(y, \phi_{0}) \, \Delta \phi + \frac{1}{2} h(y, \phi_{0}) \, (\Delta \phi)^{2} \end{aligned}$$

Therefore $g$ and $h$ do not depend on the tree $f_{t}$ (structure $q_{t}$) added at the $t$-th iteration, and hence $G_{L} + G_{R} = G$ and $H_{L} + H_{R} = H$, where $G = \sum_{i \in I} g_{i}$ and $H = \sum_{i \in I} h_{i}$. ■

To enumerate all possible split points of continuous features, the exact greedy algorithm must first sort the data according to feature values and then visit the data in sorted order, accumulating the gradient statistics needed to evaluate the structure score of Eq. (7).
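A minimal sketch of this sorted scan for a single feature, in the spirit of Alg. 1; the function name is an assumption, and tie handling between equal feature values is omitted:

```python
import numpy as np

def best_split_exact(x, g, h, lam=1.0, gamma=0.0):
    """Scan the sorted feature values, accumulating G_L, H_L; the remainder
    gives G_R, H_R. Returns the best threshold and its gain (Eq. (7))."""
    order = np.argsort(x)
    G, H = np.sum(g), np.sum(h)
    G_L = H_L = 0.0
    best_gain, best_value = -np.inf, None
    for idx in order[:-1]:                 # candidate split after each sorted instance
        G_L += g[idx]
        H_L += h[idx]
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam)
                      + G_R ** 2 / (H_R + lam)
                      - G ** 2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_value = gain, x[idx]   # split rule: go left if x <= best_value
    return best_value, best_gain
```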

3.2 Approximate Algorithm

When the data does not fit entirely into memory, or in a distributed setting, the exact greedy algorithm becomes inefficient. The paper therefore proposes an approximate framework (Alg. 2):

[Alg. 2: approximate algorithm for split finding]

(1) first propose candidate splitting points according to percentiles of the feature distribution;

(2) then map the continuous features into buckets split by these candidate points, aggregate the statistics, and find the best solution among the proposals based on the aggregated statistics.
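A minimal sketch of these two steps for one feature, using plain (unweighted) percentiles as candidates; the function name and defaults are assumptions, not the paper's implementation:

```python
import numpy as np

def approximate_best_split(x, g, h, n_candidates=32, lam=1.0, gamma=0.0):
    # Step 1: candidate split points from percentiles of the feature distribution.
    percentiles = np.linspace(0, 100, n_candidates + 1)[1:-1]
    candidates = np.unique(np.percentile(x, percentiles))

    # Step 2: bucket instances by the candidates and aggregate g, h per bucket.
    bucket = np.searchsorted(candidates, x, side="right")
    G = np.bincount(bucket, weights=g, minlength=len(candidates) + 1)
    H = np.bincount(bucket, weights=h, minlength=len(candidates) + 1)

    # Scan buckets exactly like the exact algorithm, but over aggregated statistics.
    G_tot, H_tot = G.sum(), H.sum()
    G_L, H_L = np.cumsum(G)[:-1], np.cumsum(H)[:-1]
    gains = 0.5 * (G_L ** 2 / (H_L + lam)
                   + (G_tot - G_L) ** 2 / (H_tot - H_L + lam)
                   - G_tot ** 2 / (H_tot + lam)) - gamma
    best = int(np.argmax(gains))
    return candidates[best], gains[best]
```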

Candidate split point proposal comes in two variants:

  • Global variant: all candidate splits are proposed during the initial phase of tree construction, and the same proposals are used for split finding at all levels.

  • Local variant: candidates are re-proposed after each split; because the local proposal refines the candidates after splits, it can potentially be more appropriate for deeper trees.

Given enough candidates, the global proposal can be as accurate as the local one.

Other possibilities noted in the paper:

  • directly construct approximate histograms of gradient statistics;

  • use other variants of binning strategies instead of quantiles.

The quantile strategy benefits from being distributable and recomputable, and with a reasonable approximation level it can reach the same accuracy as the exact greedy algorithm.


3.3 Weighted Quantile Sketch

One important step in the approximate algorithm is proposing candidate split points. Percentiles of a feature make the candidates distribute evenly over the data. Let the multi-set $\mathcal{D}_{k} = \{ (x_{1k}, h_{1}), (x_{2k}, h_{2}), \cdots, (x_{nk}, h_{n}) \}$ collect the $k$-th feature value and the second order gradient statistic of each training instance, and define the rank function $r_{k}: \R \rightarrow [0, +\infty)$ as:

$$r_{k}(z) = \frac{1}{\sum_{(x, h) \in \mathcal{D}_{k}} h} \sum_{(x, h) \in \mathcal{D}_{k}, x < z} h \tag{8}$$

which represents the proportion of instances whose $k$-th feature value is smaller than $z$. The goal is to find candidate split points $\{ s_{k, 1}, s_{k, 2}, \cdots, s_{k, l} \}$ such that:

$$| r_{k}(s_{k, j}) - r_{k}(s_{k, j + 1}) | < \epsilon, \quad s_{k, 1} = \min_{i} x_{ik}, \ s_{k, l} = \max_{i} x_{ik} \tag{9}$$

where $\epsilon$ is an approximation factor, so there are roughly $\frac{1}{\epsilon}$ candidate points. Each data point is weighted by $h_{i}$: from Eq. (3), the objective can be rewritten as

$$\sum_{i = 1}^{n} \frac{1}{2} h_{i} \left( f_{t}(\mathbf{x}_{i}) - \frac{g_{i}}{h_{i}} \right)^{2} + \Omega(f_{t}) + \text{constant}$$

which, as stated in the paper, is exactly weighted squared loss with labels $\frac{g_{i}}{h_{i}}$ and weights $h_{i}$.

■■

From Eq. (3),

$$\begin{aligned} \tilde{\mathcal{L}}^{(t)} & = \sum_{i = 1}^{n} \left[ g_{i} f_{t}(\mathbf{x}_{i}) + \frac{1}{2} h_{i} f_{t}^{2}(\mathbf{x}_{i}) \right] + \Omega(f_{t}) \\ & = \sum_{i = 1}^{n} \frac{1}{2} h_{i} \left[ 2 \frac{g_{i}}{h_{i}} f_{t}(\mathbf{x}_{i}) + f_{t}^{2}(\mathbf{x}_{i}) \right] + \Omega(f_{t}) \\ & = \sum_{i = 1}^{n} \frac{1}{2} h_{i} \left[ \left( \frac{g_{i}}{h_{i}} \right)^{2} + 2 \frac{g_{i}}{h_{i}} f_{t}(\mathbf{x}_{i}) + f_{t}^{2}(\mathbf{x}_{i}) \right] + \Omega(f_{t}) - \sum_{i = 1}^{n} \frac{1}{2} h_{i} \left( \frac{g_{i}}{h_{i}} \right)^{2} \\ & = \sum_{i = 1}^{n} \frac{1}{2} h_{i} \left( f_{t}(\mathbf{x}_{i}) - \left( - \frac{g_{i}}{h_{i}} \right) \right)^{2} + \Omega(f_{t}) + \text{constant} \end{aligned}$$

which is weighted squared loss with labels $-\frac{g_{i}}{h_{i}}$ and weights $h_{i}$; the label in the paper's formula therefore appears to have a sign error. ■
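A minimal sketch of $h$-weighted candidate proposal in the sense of Eqs. (8) and (9); it simply scans the sorted values rather than using the paper's mergeable quantile sketch data structure, and the function name is an assumption:

```python
import numpy as np

def weighted_quantile_candidates(x, h, eps=0.1):
    """Propose candidate splits so that the h-weighted rank (Eq. (8)) of
    consecutive candidates differs by less than eps (Eq. (9))."""
    order = np.argsort(x)
    x_sorted, h_sorted = x[order], h[order]
    rank = np.cumsum(h_sorted) / np.sum(h_sorted)   # h-weighted rank of each point

    candidates = [x_sorted[0]]                      # s_{k,1} = min_i x_ik
    last_rank = 0.0
    for xv, r in zip(x_sorted, rank):
        if r - last_rank >= eps:
            candidates.append(xv)
            last_rank = r
    if candidates[-1] != x_sorted[-1]:
        candidates.append(x_sorted[-1])             # s_{k,l} = max_i x_ik
    return np.array(candidates)
```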

3.4 Sparsity-aware Split Finding

If the input $\mathbf{x}$ is sparse (e.g., missing values, frequent zero entries in the statistics, or artifacts of feature engineering such as one-hot encoding), the algorithm should be made aware of the sparsity pattern in the data.

The paper proposes adding a default direction to each tree node, as shown in Fig. 4:

[Fig. 4: tree node with default directions for missing values]

When a value is missing in the sparse matrix $\mathbf{x}$, the instance is classified into the default direction. The optimal default direction is learned from the data (Alg. 3).

[Alg. 3: sparsity-aware split finding]
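A minimal sketch of the idea behind Alg. 3: only non-missing entries are enumerated, and each direction is tried as the default for the missing entries; the function name is an assumption:

```python
import numpy as np

def best_split_sparsity_aware(x, g, h, lam=1.0, gamma=0.0):
    """x may contain NaN for missing values; only non-missing entries are scanned,
    and the missing entries are sent to whichever child gives the larger gain."""
    missing = np.isnan(x)
    G_miss, H_miss = np.sum(g[missing]), np.sum(h[missing])
    xv, gv, hv = x[~missing], g[~missing], h[~missing]
    order = np.argsort(xv)

    G, H = np.sum(g), np.sum(h)            # totals include the missing entries
    best = (-np.inf, None, None)           # (gain, threshold, default_left)
    G_L = H_L = 0.0
    for idx in order[:-1]:
        G_L += gv[idx]
        H_L += hv[idx]
        for default_left in (True, False):
            GL = G_L + (G_miss if default_left else 0.0)
            HL = H_L + (H_miss if default_left else 0.0)
            GR, HR = G - GL, H - HL
            gain = 0.5 * (GL ** 2 / (HL + lam)
                          + GR ** 2 / (HR + lam)
                          - G ** 2 / (H + lam)) - gamma
            if gain > best[0]:
                best = (gain, xv[idx], default_left)
    return best
```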

4 System Design

4.1 Column Block for Parallel Learning

4.2 Cache-aware Access

4.3 Blocks for Out-of-core Computation

5 Related Work


Source: blog.csdn.net/zhaoyin214/article/details/102886615