GBDT and xgboost interview

  1. How the competition was approached (first describe the problem to be solved: is it a regression or a binary classification problem; what the KS curve means, and whether the evaluation metric could be improved (e.g. replaced with AUC))
  • KS value: plot the cumulative true positive rate and the cumulative false positive rate (as ordinates) against the score threshold, giving two curves; these form the KS curve. The KS statistic is the maximum vertical gap between the two curves.
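A minimal sketch of computing the KS statistic from model scores, assuming scikit-learn is available; `y_true` and `y_score` are illustrative toy data:

```python
# Minimal sketch: KS statistic from binary labels and predicted scores.
import numpy as np
from sklearn.metrics import roc_curve

def ks_statistic(y_true, y_score):
    # roc_curve gives cumulative FPR and TPR over all score thresholds;
    # KS is the maximum vertical gap between the two cumulative curves.
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return np.max(tpr - fpr)

# Illustrative toy data
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.7, 0.2, 0.8, 0.65, 0.4, 0.9])
print(ks_statistic(y_true, y_score))  # 1.0 here, since the scores separate the classes perfectly
```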
  2. The difference between GBDT and XGBoost (wepon's answer on zhihu: https://www.zhihu.com/question/41354392 )
  • Traditional GBDT uses CART as the base learner, while xgboost also supports linear base classifiers, in which case xgboost is equivalent to logistic regression (for classification) or linear regression (for regression) with L1 and L2 regularization terms.
  • Traditional GBDT only uses first-order derivative information in the optimization, while xgboost performs a second-order Taylor expansion of the cost function and uses both first- and second-order derivatives. Note that xgboost also supports custom cost functions, as long as the function has first and second derivatives (see the custom-objective sketch after this list).
  • xgboost adds a regularization term to the cost function to control model complexity. The term contains the number of leaf nodes of the tree and the squared L2 norm of the scores output on the leaf nodes. From the bias-variance tradeoff point of view, the regularization term reduces the variance of the model, making the learned model simpler and preventing overfitting; this is another advantage of xgboost over traditional GBDT.
  • Shrinkage, equivalent to the learning rate (eta in xgboost). After completing one iteration, xgboost multiplies the leaf-node weights by this coefficient, mainly to weaken the influence of each individual tree and leave more room for later trees to learn. In practice eta is usually set small and the number of iterations large. (Note: implementations of traditional GBDT also have a learning rate.)
  • Column subsampling. xgboost borrows from random forests and supports column (feature) subsampling, which not only reduces overfitting but also reduces computation. This is another way xgboost differs from traditional GBDT.
  • Handling of missing values. For samples with missing feature values, xgboost automatically learns a default split direction.
  • The xgboost tool supports parallelism. Isn't boosting a sequential procedure, so how can it be parallel? Note that xgboost's parallelism is not at tree granularity: xgboost can only start the next iteration after the previous one finishes (the cost function of the t-th iteration contains the predictions of the previous t-1 trees). The parallelism is at feature granularity. One of the most time-consuming steps in decision tree learning is sorting the feature values (to determine the best split point). Before training, xgboost pre-sorts the data and saves it in a block structure; this structure is reused across iterations, which greatly reduces computation, and it also makes parallelization possible: when splitting a node, the gain of each feature must be computed and the feature with the largest gain is chosen, so the gain calculations for different features can run in multiple threads.
  • A parallelizable approximate histogram algorithm. When splitting a tree node, we need to compute the gain for every candidate split point of every feature, i.e. greedily enumerate all possible split points. When the data cannot fit into memory at once, or in a distributed setting, the exact greedy algorithm becomes very inefficient, so xgboost also proposes a parallel approximate histogram algorithm to efficiently generate candidate split points.
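To make the second-order point above concrete, here is a minimal sketch of training xgboost with a hand-written log-loss objective that returns the first derivative (grad) and second derivative (hess); it also sets eta, lambda and colsample_bytree, which correspond to the shrinkage, regularization and column-subsampling points above. The data and parameter values are illustrative, and this assumes the standard xgboost Python package:

```python
# Minimal sketch: custom objective for xgboost via first and second derivatives.
import numpy as np
import xgboost as xgb

def logistic_obj(preds, dtrain):
    # Hand-written binary log loss: preds are raw margins before the sigmoid.
    labels = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))
    grad = p - labels          # first-order derivative of the log loss
    hess = p * (1.0 - p)       # second-order derivative of the log loss
    return grad, hess

X = np.random.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "eta": 0.1,               # shrinkage / learning rate
    "max_depth": 4,
    "lambda": 1.0,            # L2 regularization on leaf scores
    "colsample_bytree": 0.8,  # column subsampling, borrowed from random forests
}
booster = xgb.train(params, dtrain, num_boost_round=50, obj=logistic_obj)
```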
  3. Improvements of LightGBM over XGBoost (the parallel approximate histogram algorithm?? I don't fully understand it; the difference in how trees are grown: leaf-wise vs level-wise? https://zhuanlan.zhihu.com/p/25308051 , http://msra.cn/zh-cn/news/blogs/2017/01/lightgbm-20170105.aspx)
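A minimal sketch of the tree-growth difference, assuming the lightgbm and xgboost Python packages; parameter values are illustrative. LightGBM grows trees leaf-wise (best-first), so num_leaves is its main complexity control, while xgboost's default growth is level-wise, controlled mainly by max_depth:

```python
# Minimal sketch: leaf-wise (LightGBM) vs level-wise (XGBoost) growth controls.
import numpy as np
import lightgbm as lgb
import xgboost as xgb

X = np.random.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# LightGBM grows leaf-wise: num_leaves is the main complexity knob.
lgb_model = lgb.LGBMClassifier(num_leaves=31, n_estimators=50, learning_rate=0.1)
lgb_model.fit(X, y)

# XGBoost grows level-wise by default: max_depth is the main knob.
xgb_model = xgb.XGBClassifier(max_depth=5, n_estimators=50, learning_rate=0.1,
                              tree_method="hist")  # histogram-based split finding
xgb_model.fit(X, y)
```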
  4. Common schemes for preventing model overfitting (cross-validation (CV), regularization (L1, L2))
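A minimal sketch of combining cross-validation with L1/L2 regularization, using scikit-learn's logistic regression on illustrative random data:

```python
# Minimal sketch: 5-fold cross-validation over L1- and L2-regularized models.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.random.rand(300, 10)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

for penalty in ("l1", "l2"):
    # Smaller C means stronger regularization in scikit-learn.
    clf = LogisticRegression(penalty=penalty, C=0.5, solver="liblinear")
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
    print(penalty, scores.mean())
```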
  5. How optimization is done when the L1 regularization term is non-differentiable (proximal gradient descent (or: coordinate descent??) && least angle regression)
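A minimal sketch of proximal gradient descent (ISTA) for the lasso, where the non-differentiable L1 term is handled by its proximal operator (soft-thresholding); the data and lambda are illustrative:

```python
# Minimal sketch: ISTA for min_w 0.5*||Xw - y||^2 + lam*||w||_1.
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of the L1 norm: handles the non-differentiable point at 0.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam=0.1, n_iter=500):
    n, d = X.shape
    w = np.zeros(d)
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)             # gradient of the smooth least-squares part
        w = soft_threshold(w - step * grad, step * lam)
    return w

X = np.random.randn(100, 5)
true_w = np.array([2.0, 0.0, -1.0, 0.0, 0.0])
y = X @ true_w + 0.1 * np.random.randn(100)
print(lasso_ista(X, y))  # coefficients for the zero features should shrink toward 0
```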
  6. Common optimization algorithms: the convex optimization family: gradient descent (BGD, SGD), the Newton family (variants: BFGS, DFP, etc.), Lagrangian duality; others?: heuristic optimization algorithms such as ant colony, genetic algorithms, simulated annealing, tabu search, greedy algorithms...
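A minimal sketch contrasting batch and stochastic gradient descent on a least-squares problem; step sizes, iteration counts and data are illustrative:

```python
# Minimal sketch: BGD (full-dataset gradient) vs SGD (single-sample gradient).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

# BGD: each step uses the gradient over the full dataset.
w = np.zeros(3)
for _ in range(200):
    grad = X.T @ (X @ w - y) / len(y)
    w -= 0.1 * grad
print("BGD:", w)

# SGD: each step uses the gradient of a single random sample.
w = np.zeros(3)
for _ in range(2000):
    i = rng.integers(len(y))
    grad = (X[i] @ w - y[i]) * X[i]
    w -= 0.01 * grad
print("SGD:", w)
```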
  7. How a CART tree is built (for internal nodes, discrete features are split on whether they take a given value or not; continuous features are split between adjacent sorted values, i.e. less than or greater than the threshold (a_i + a_{i+1})/2; each feature is recursively binary-split??) and the splitting criteria used (regression: minimize squared error; classification: Gini index)
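A minimal sketch of the CART split search on one continuous feature: candidate thresholds are the midpoints of adjacent sorted values, scored here by the weighted Gini index (a regression tree would use squared error instead); the data is illustrative:

```python
# Minimal sketch: best split of one continuous feature by weighted Gini index.
import numpy as np

def gini(labels):
    # Gini index: 1 - sum_k p_k^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_thr, best_score = None, np.inf
    for i in range(len(x) - 1):
        if x[i] == x[i + 1]:
            continue
        thr = (x[i] + x[i + 1]) / 2.0          # midpoint of adjacent sorted values
        left, right = y[: i + 1], y[i + 1:]
        # weighted Gini of the two children
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_thr, best_score = thr, score
    return best_thr, best_score

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # expect threshold 6.5 with weighted Gini 0.0
```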
  8. Common loss functions (linear regression: squared loss; 0-1 loss; LR: log loss; boosting trees (here meaning AdaBoost): exponential loss; SVM: hinge loss...)
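For reference, the corresponding formulas (y is the true label, f(x) the model output):

```latex
\begin{aligned}
\text{squared loss (regression):} \quad & L(y, f(x)) = (y - f(x))^2 \\
\text{0-1 loss:} \quad & L(y, f(x)) = \mathbf{1}[\, y \neq f(x) \,] \\
\text{log loss (LR, } y \in \{0,1\},\ p = P(y{=}1 \mid x)\text{):} \quad & L = -\bigl[\, y \log p + (1-y)\log(1-p) \,\bigr] \\
\text{exponential loss (AdaBoost, } y \in \{-1,+1\}\text{):} \quad & L(y, f(x)) = \exp(-y f(x)) \\
\text{hinge loss (SVM, } y \in \{-1,+1\}\text{):} \quad & L(y, f(x)) = \max(0,\ 1 - y f(x))
\end{aligned}
```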
  9. How does XGBoost handle missing values? (It can automatically learn the split direction for missing values.)
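A minimal sketch showing that xgboost accepts NaN directly and learns a default branch for missing values at each split; the data is illustrative and assumes the xgboost Python package:

```python
# Minimal sketch: training xgboost on data that contains NaN.
import numpy as np
import xgboost as xgb

X = np.random.rand(200, 4)
X[np.random.rand(200, 4) < 0.1] = np.nan          # inject ~10% missing values
y = (np.nan_to_num(X[:, 0]) > 0.5).astype(int)

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)  # NaN is treated as missing
params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1}
booster = xgb.train(params, dtrain, num_boost_round=20)
preds = booster.predict(dtrain)  # missing values are routed to the learned default branch
```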
  10. Tree pruning in XGBoost (CART pruning principles: pre-pruning and post-pruning strategies? a bit confusing)
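A minimal sketch of the xgboost parameters usually discussed under pruning, with illustrative values: max_depth and min_child_weight act as pre-pruning-style constraints, while gamma (min_split_loss) removes splits whose gain is too small after the tree is grown:

```python
# Minimal sketch: pruning-related xgboost parameters (illustrative values).
params = {
    "objective": "binary:logistic",
    "max_depth": 6,           # hard depth limit (pre-pruning-style constraint)
    "min_child_weight": 5,    # minimum hessian sum in a child; blocks splits on tiny nodes
    "gamma": 1.0,             # minimum loss reduction required to keep a split;
                              # splits with lower gain are pruned away
}
```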
  11. How to deal with imbalanced data (over-sampling, under-sampling, or a custom cost function; the scale_pos_weight parameter in xgboost)
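A minimal sketch of the common heuristic for scale_pos_weight, namely (number of negative samples) / (number of positive samples), on illustrative data:

```python
# Minimal sketch: weighting the positive class for an imbalanced binary problem.
import numpy as np
import xgboost as xgb

y = np.array([0] * 950 + [1] * 50)
X = np.random.rand(len(y), 3)

ratio = (y == 0).sum() / (y == 1).sum()           # 950 / 50 = 19
clf = xgb.XGBClassifier(scale_pos_weight=ratio, n_estimators=50, max_depth=3)
clf.fit(X, y)
```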
  1. Do you know Linux and shell? (No)
  2. The proportion of 0/1 labels
  3. SQL: the difference between WHERE and HAVING
  4. For logistic regression and xgboost, how to tune parameters when the 0/1 label ratio is highly imbalanced
  5. How feature_importance works
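A minimal sketch of inspecting xgboost feature importance; get_score supports several importance_type values, e.g. "weight" (how often a feature is used to split) and "gain" (average loss reduction when it is used). The data is illustrative:

```python
# Minimal sketch: reading feature importance from a trained xgboost booster.
import numpy as np
import xgboost as xgb

X = np.random.rand(300, 4)
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)

booster = xgb.train({"objective": "binary:logistic", "max_depth": 3},
                    xgb.DMatrix(X, label=y), num_boost_round=30)
print(booster.get_score(importance_type="weight"))  # split counts per feature
print(booster.get_score(importance_type="gain"))    # average gain per feature
```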
  1. Verbal coding: the minimum number in a rotated array (an original problem from 剑指offer; explain the approach)
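A minimal sketch of the standard binary-search idea (duplicates not handled); the example array is illustrative:

```python
# Minimal sketch: minimum of a rotated sorted array via binary search.
def rotated_min(nums):
    lo, hi = 0, len(nums) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if nums[mid] > nums[hi]:
            lo = mid + 1      # minimum lies strictly to the right of mid
        else:
            hi = mid          # minimum is at mid or to its left
    return nums[lo]

print(rotated_min([3, 4, 5, 1, 2]))  # 1
```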
  2. Ant Financial interview questions: the proportion of 0/1 labels; the difference between WHERE and HAVING in SQL; how to tune logistic regression and xgboost when the 0/1 label ratio is highly imbalanced; how feature_importance works; the definition of KS and how to judge model performance
