I recently took a deep dive into the algorithm principles and the underlying model data structures of XGBoost, the long-time workhorse of data mining competitions.


Introduction: Anyone doing data mining work knows the XGBoost algorithm. This tool, which once shone in data mining competitions, is a classic algorithm proposed by Tianqi Chen in 2016. In essence, XGBoost is usually regarded as an optimized implementation of the GBDT algorithm, but beyond inheriting the ensemble idea, its concrete design details differ in many ways. Having studied it in depth recently and briefly explored the data structures it is built on, I cannot help admiring how elegant the algorithm is, so here is a summary for future reference.

(Image: Tianqi Chen at a 2016 sharing session on XGBoost)

XGBoost is an ensemble algorithm in machine learning. Ensemble learning is commonly divided into three families, and XGBoost belongs to Boosting, the most active and powerful of the three. Besides XGBoost, the Boosting family includes the earlier AdaBoost and GBDT and the later LightGBM and CatBoost; LightGBM, CatBoost, and XGBoost are all generally regarded as improved and optimized implementations of GBDT.

The theory behind XGBoost is mature and complete, and there are countless write-ups about it online, so this article does not try to repeat that work. It has two goals instead: first, to share my personal understanding of how XGBoost works by walking through its formulas; second, to briefly explore the underlying data structure design of XGBoost. The latter is in the same spirit as the earlier post "Data Science: Decision Trees in Sklearn, how is the bottom layer designed and stored?".

01 Understanding the principles through the formula derivations

To understand how the XGBoost algorithm works, this section walks through the derivations of the 5 main formulas behind it. All formulas are taken primarily from the official XGBoost documentation (xgboost.readthedocs.io/en/latest).

Equation 1 - The additive model of the Boosting ensemble algorithm:
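In the usual notation, with y_hat_i the ensemble prediction for sample x_i, f_k a single base learner, and K the number of base learners, this is typically written as:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$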

The formula is very simple, but it is important for understanding the formulas that follow. As a Boosting algorithm, XGBoost follows the typical additive model: the output of the ensemble equals the sum of the outputs of the individual base learners. (For regression the summation is easy to picture; for classification the model actually fits logloss on raw scores, similar to what logistic regression does, as mentioned later.) In the formula above, f_k(x) denotes a single base learner and K is the number of base learners. Take the ensemble with 2 base learners in the figure below as an example: the ensemble output for the "son" is the sum of the two base learners' outputs, that is 2 + 0.9 = 2.9, and the ensemble output for the "grandfather" is likewise the sum of the two base learners' outputs, that is -1 - 0.9 = -1.9.

(Figure: a two-tree ensemble where each person's final score is the sum of the leaf values assigned by the two trees)

Equation 2 - Objective function of the base learner in XGBoost

Machine learning has three elements: model, strategy, and algorithm. The strategy covers how to define and evaluate the quality of a model; in other words, it involves defining the loss function. Unlike GBDT, which simply keeps fitting the residuals, XGBoost additionally includes the structural loss of the model, which can be understood as the cost of model learning, while the residual corresponds to the empirical loss, which can be understood as the gap in, or shortfall of, the model's fitting ability.
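Putting the two kinds of loss together, the objective for an ensemble of K base learners over n samples is typically written as:

$$\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right)$$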

In the objective function above (an objective function is a broader term than a loss function: it generally consists of two parts, objective = loss + regularization term, and smaller is better; besides the objective and the loss there is also the related term cost function, which in a sense can be understood approximately as a loss function), the first summation measures the gap between the current model's predictions and the true values. This is the empirical risk, and how it is measured depends on the chosen loss function: typically the MSE loss for regression and the logloss for classification. The second summation reflects the structural risk of the model and affects its generalization ability. If the base learner is a decision tree, the structural risk here is defined as:
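With T the number of leaves of a tree and w_j the weight of leaf j, this term is usually given as:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$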

Here γ and λ are both regularization coefficients, and T is the number of leaf nodes in the tree. Note that this counts leaf nodes, not all nodes of the tree (the regularization term used in CCP post-pruning of CART trees counts all nodes). Also, this is the form used in most introductions to XGBoost and in Tianqi Chen's original paper; in the Python xgboost package, besides gamma and reg_lambda, which correspond to these two coefficients, the model also exposes a reg_alpha parameter for a first-order (L1) regularization term, in which case the term can be written as:
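Assuming reg_alpha maps to a coefficient α on the absolute leaf weights, the extended term would read:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \left|w_j\right|$$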

Equation 3 - The second-order Taylor expansion approximation in XGBoost

Understanding: this second-order Taylor approximation can be called the soul of XGBoost, and it is where its subtlety and power relative to GBDT show most clearly. To explain the approximation, first recall the Taylor expansion in its usual form:
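Truncated at second order, the expansion of a twice-differentiable f around x is:

$$f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^{2}$$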

Of course, the formula above is only an approximation up to second order; as long as f(x) is infinitely differentiable, higher-order approximations exist. In XGBoost, the second-order Taylor approximation is applied only to the empirical-risk part of the model, i.e. to each summand of the first sum in Equation 2. Here is the expression for a single round again:
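At round t, with the predictions of the first t-1 rounds already fixed, the objective for the new tree f_t is usually shown as (the regularization of the earlier trees folds into the constant):

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega\left(f_t\right) + \mathrm{constant}$$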

In the expression above, the subscript i refers to the i-th sample in the training set, and the superscripts t and t-1 refer to rounds t and t-1 of the ensemble. Going further, what plays the role of f, of x, and of Δx? Here f is just a function symbol corresponding to the loss l; what really needs to be understood is x and Δx.

During model training the training set is fixed; every base learner sees the same values, so y_i is no exception and can be treated as a constant in the formula above. In round t of ensemble training, the goal is to find the best round-t result on top of the (already determined) results of the first t-1 rounds, so as to obtain the smallest possible current loss.

In fact, in ensemble learning the first base learner can usually already fit most of the target. In the customary age-fitting example, if the value to fit is 100, the first base learner may well fit 90, and the remaining N-1 learners merely keep correcting the residual of 10. The point of this example is that, in the formula above, the fitted result y_hat of the first t-1 rounds corresponds to x in f(x + Δx), while the round-t fitted value can be viewed as the increment Δx. Following the shape of the Taylor expansion, the objective above can then be expanded concretely as:
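Keeping only terms up to second order, the standard expansion is:

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega\left(f_t\right)$$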

where g_i and h_i are the first and second derivatives respectively, written as follows:
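Both derivatives are taken with respect to the previous round's prediction:

$$g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right), \qquad h_i = \partial_{\hat{y}_i^{(t-1)}}^{2}\, l\left(y_i, \hat{y}_i^{(t-1)}\right)$$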

Going further, write y for the true value and y_hat for the fitted value. For regression, using the most common MSE loss, the loss function and its first and second derivatives are:
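Writing the squared error without the optional 1/2 factor (conventions differ, so the constants below are one common choice):

$$l(y, \hat{y}) = (y - \hat{y})^{2}, \qquad g = \frac{\partial l}{\partial \hat{y}} = 2(\hat{y} - y), \qquad h = \frac{\partial^{2} l}{\partial \hat{y}^{2}} = 2$$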

For classification, taking binary classification as an example, the default loss in XGBoost is logloss, and the corresponding loss function and its first and second derivatives are:
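With y_hat the raw (pre-sigmoid) score and p = σ(y_hat) = 1/(1 + e^(-y_hat)) the predicted probability, the usual results are:

$$l(y, \hat{y}) = -\left[\, y \ln p + (1 - y)\ln(1 - p) \,\right], \qquad g = p - y, \qquad h = p\,(1 - p)$$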

Equation 4 - Solving for the optimal leaf weights of the decision tree

In theory XGBoost can support any base learner, but in practice decision trees are by far the most common, and the Python xgboost library also uses gbtree as the default base learner. With trees, finding the optimal tree in round t is really the process of finding the optimal leaf weights, so understanding the optimal leaf weight is especially important.

Fortunately, deriving the two formulas above is very easy to follow, arguably within the scope of junior-high math, and much easier to grasp than, say, the Lagrangian dual problem in SVMs. First look at the following transformation:
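Writing I_j for the set of samples that fall into leaf j and w_j for that leaf's weight (so f_t(x_i) equals the weight of the leaf that x_i lands in), the transformation is typically shown as:

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n}\left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} = \sum_{j=1}^{T}\left[ \left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^{2} \right] + \gamma T$$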

The approximate equality in the first step of course comes from the second-order Taylor approximation of Equation 3, only with the constant part dropped; note that at this point the summation is at the granularity of samples, i.e. i is the sample index and n is the total number of samples. The equality in the second step regroups the sum at the granularity of leaf nodes, aggregating all samples that fall into the same leaf: every sample landing in leaf j gets the same predicted value, the leaf weight ω_j, and the inner Σ is the sum within each leaf.

With the approximate expansion above and the per-leaf aggregation, the following formula follows:
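Using the shorthand G_j and H_j defined just below, this is:

$$\mathrm{Obj}^{(t)} = \sum_{j=1}^{T}\left[ G_j\, w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^{2} \right] + \gamma T$$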

where G_j and H_j are the sums of the first and second derivatives over all samples in the j-th leaf, i.e.:
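With I_j again denoting the samples assigned to leaf j:

$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$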

The objective above can be viewed as the sum of T univariate quadratic expressions, each with ω_j as its variable. Minimizing something of the form f(x) = ax^2 + bx + c is clearly a junior-high math problem, so it is easy to obtain the optimal ω_j and the corresponding minimum value of the loss:
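Setting the derivative of each quadratic to zero gives the standard result:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad \mathrm{Obj}^{*} = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T$$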

The quadratic here is guaranteed to have a minimum, because its quadratic coefficient (1/2)(H_j + λ) is always positive!

Equation 5 - The split gain of the decision tree

Equation 4 solves for the optimal leaf weights, which actually sidesteps a prior question: how do the internal nodes of the tree split? This question can be broken down further into two sub-questions:

① Which feature should be used for the split?

② At what threshold should the samples be divided into left and right subtrees?

The first question is easy to answer: the simplest approach, still in use today, is to iterate over all features and compare which one brings the largest gain.

For the second question, the optimal split threshold is likewise found by searching over candidates, and this search can itself be broken into two further questions:

i) Which candidate split thresholds should be considered?

ii) How do we measure which split threshold is better?

Choosing the candidate split thresholds involves quite a few tricks: both XGBoost and LightGBM use a histogram-based method to reduce the set of possible optimal split points; the details are numerous and left aside here. As for measuring which threshold is better, this is exactly where the conclusion of Equation 4 comes in: the expression for a leaf's minimum loss under its optimal weight. Based on that, the procedure for evaluating a split threshold is as follows:

  • If the node is not split, i.e. it is treated as a leaf node, we obtain its current minimum loss;

  • For a chosen feature and threshold, split all samples of the current node into left and right subtrees, and obtain the minimum losses of the left and right subtrees;

Then, does going from keeping this node as a single leaf to splitting it into two child leaves reduce the loss? We only need to subtract the loss after the split from the loss before it. Why does the γT part turn into -γ after the subtraction? Simply because before the split this part of the regularization corresponds to 1 leaf node, while after the split it corresponds to 2, so subtracting the two γT parts leaves -γ.
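With G_L, H_L and G_R, H_R the gradient and hessian sums of the left and right children (their sums being those of the parent), the resulting gain is commonly written as:

$$\mathrm{Gain} = \frac{1}{2}\left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{\left(G_L + G_R\right)^{2}}{H_L + H_R + \lambda} \right] - \gamma$$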

That covers my understanding of the core formula derivations in XGBoost; once these 5 formulas are understood, it is basically possible to understand how XGBoost is designed and implemented. Of course, the power and clever design of XGBoost go well beyond the algorithmic principles above: there are many practical tricks and optimizations, which are what give XGBoost its scalable capability. For details, see the paper "XGBoost: A Scalable Tree Boosting System".

02 Looking at the source code to understand the underlying data structures

The first part introduced the core formulas of XGBoost; this part briefly shares how XGBoost designs its underlying data structures. I added this part, again, because of some recent pre-research work, so the focus is on how XGBoost stores all its base learners at the bottom layer, i.e. the training result of each decision tree.

To see how XGBoost stores the trained decision trees, we can look at the model-training part of the classifier's source code. A quick look locates the following code:

(Screenshot: the fit() method of XGBClassifier, which stores the trained model in the _Booster attribute)

In other words, the result of training an XGBClassifier is stored in its _Booster attribute.

The code above is from the sklearn-style interface provided by xgboost. In the native training API, model training is actually done by calling the xgboost.train function; both regression and classification tasks go through this same function, and the task type is distinguished only by the objective function.
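As a minimal sketch of that native interface (the parameter values here are illustrative, not the post's original settings):

```python
import xgboost
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X, y = X[y < 2], y[y < 2]                    # keep two classes for a binary task

dtrain = xgboost.DMatrix(X, label=y)         # native data container
params = {
    "objective": "binary:logistic",          # the objective decides the task type
    "eval_metric": "logloss",
}
bst = xgboost.train(params, dtrain, num_boost_round=100)   # returns a Booster
```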

To look further into this _Booster attribute, let's actually train an XGBoost binary classifier with the following simple example:

from sklearn.datasets import load_iris
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)
# the original iris dataset has three classes; keep only two of them for a binary task
X = X[y<2]
y = y[y<2]

xgb = XGBClassifier(use_label_encoder=False)
xgb.fit(X, y, eval_metric='logloss')

Then, inspect this _Booster with dir():

(Screenshot: dir() listing of the trained _Booster object)

In fact, this _Booster attribute is a class defined in xgboost, and the output above can also be checked directly against the definition of the Booster class in the xgboost source. Among the dir() results, several functions deserve special attention:

  • save_model: saves the trained xgboost model to a file. Very conveniently, since version 1.0.0 xgboost supports saving directly to JSON, which is much handier than pickle and the like and greatly improves readability.


  • load_model: where there is save_model there must be load_model; the two are inverse operations, i.e. load_model reads the JSON file written by save_model back into an xgboost model.

  • dump_model: dump also carries the meaning of writing out; for example, the read/write functions defined for json are load and dump. In xgboost, the difference between dump_model and save_model is that the result of dump_model is meant for human reading, but the process is one-way: a dumped result cannot be loaded back;

  • get_dump: similar to dump_model, except that instead of writing a file it returns the dump as strings (one per tree);

  • trees_to_dataframe: the meaning is self-evident: it converts the information of all trained trees into a dataframe. A short sketch exercising these methods follows below.
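Continuing from the classifier trained above (the file names are just placeholders, not from the original post), a rough usage sketch:

```python
import xgboost

booster = xgb.get_booster()              # the trained _Booster behind the XGBClassifier above

booster.save_model("model.json")         # reloadable JSON format (xgboost >= 1.0.0)
booster.dump_model("model_dump.txt")     # human-readable text; cannot be loaded back
text_dumps = booster.get_dump()          # same content as dump_model, one string per tree
df = booster.trees_to_dataframe()        # all trees flattened into a single DataFrame

restored = xgboost.Booster()
restored.load_model("model.json")        # inverse of save_model
```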

Here, let's first look at the result of trees_to_dataframe:

(Screenshot: the DataFrame returned by trees_to_dataframe())

Judging from the column names, apart from the last two fields, Cover and Category, whose meanings are not entirely obvious, the meaning of every other field is clear, so no further explanation is given.

Next, let's explore the results of save_model and dump_model. Since the result of dump_model is meant for human reading, let's look at it first:

(Screenshot: excerpt of the txt file produced by dump_model, showing three of the trees)

The excerpt shows the information of three decision trees from the dumped txt file. The dump_model result keeps only the split-related information of each tree. Taking booster[0], the first tree, as an example: it has three nodes, the indentation expresses the parent-child relationships, each internal node line records the chosen split feature and threshold, but the value that follows each leaf node is actually not its weight.

Then let's explore the JSON file produced by save_model, starting with the overall structure of the JSON:

(Screenshot: top-level structure of the JSON file written by save_model)

The trees part is a list of 100 items, corresponding to the information of the 100 base learners. Looking further, it resembles the array-based tree representation defined in sklearn: the node information of each decision tree here is also array-based, i.e. the i-th value of each attribute describes that attribute of the i-th node. The main fields and their meanings are as follows:

(Table: the main per-node fields of each tree in the JSON model and their meanings)

It is worth pointing out that, after comparing the values of the three attributes left_children, right_children, and parents, it is easy to infer that node numbering of a decision tree in xgboost follows a level-order (breadth-first) traversal, which differs from the pre-order traversal used in sklearn.

The above is only a simple and preliminary exploration of the base learner information in xgboost. If you are interested, you can experiment further and read the corresponding source code; I hope it helps in understanding how xgboost works.


Origin juejin.im/post/7096111249175347213