Financial Risk Control Task-04: Modeling and Parameter Tuning

1 Learning Objectives

  1. Learn the machine learning models commonly used in the field of financial risk control
  2. Learn the modeling and parameter tuning process for machine learning models
    1. Complete the corresponding learning tasks

2 Introduction

  1. Logistic regression model:
    a. Understand logistic regression model;
    b. Application of logistic regression model;
    c. Advantages and disadvantages of logistic regression;
  2. Tree model:
    a. Understand the tree model;
    b. The application of the tree model;
    c. The advantages and disadvantages of the tree model;
  3. Ensemble models
    a. Ensemble model based on the bagging idea: random forest
    b. Ensemble models based on the boosting idea: XGBoost, LightGBM, CatBoost
  4. Model comparison and performance evaluation:
    a. Regression model / tree model / ensemble model;
    b. Model evaluation method;
    c. Model evaluation results;
  5. Model tuning:
    a. Greedy tuning method;
    b. Grid tuning method;
    c. Bayesian tuning method;
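
Parameter tuning is only outlined above and not expanded later in this section, so here is a minimal sketch of the grid-search approach (item 5b), assuming scikit-learn and LightGBM are available; the toy data and parameter ranges are purely illustrative, not the competition's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMClassifier

# Toy binary-classification data standing in for the competition's credit data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Hypothetical parameter grid; real ranges depend on the data at hand.
param_grid = {
    "num_leaves": [15, 31],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 200],
}

search = GridSearchCV(
    estimator=LGBMClassifier(objective="binary"),
    param_grid=param_grid,
    scoring="roc_auc",  # AUC is the evaluation metric used in this task
    cv=5,               # 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

Greedy tuning would instead sweep one parameter at a time while fixing the rest, and Bayesian tuning replaces the exhaustive grid with a probabilistic search over the same space.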

3 Model related principles

3.1 Logistic regression model: https://blog.csdn.net/han_xiaoyang/article/details/49123419
3.2 Decision tree model: https://blog.csdn.net/c406495762/article/details/76262487
3.3 GBDT model: https://zhuanlan.zhihu.com/p/45145899
3.4 XGBoost model: https://blog.csdn.net/wuzhongqiang/article/details/104854890
3.5 LightGBM model: https://blog.csdn.net/wuzhongqiang/article/details/105350579
3.6 CatBoost model: https://mp.weixin.qq.com/s/xloTLr5NJBgBspMQtxPoFA
3.7 Time-series models (optional): RNN: https://zhuanlan.zhihu.com/p/45289691 , LSTM: https://zhuanlan.zhihu.com/p/83496936
3.8 Recommended textbooks:
Machine Learning (《机器学习》): https://book.douban.com/subject/26708119/
Statistical Learning Methods (《统计学习方法》): https://book.douban.com/subject/10590856/
Feature Engineering for Machine Learning (《面向机器学习的特征工程》): https://book.douban.com/subject/26826639/
Credit Scoring Model Techniques and Applications (《信用评分模型技术与应用》): https://book.douban.com/subject/1488075/
Data-Driven Risk Control (《数据化风控》): https://book.douban.com/subject/30282558/

4 Model comparison and performance evaluation

4.1 Logistic regression

  1. Advantages
    a. Training is fast; at prediction time the amount of computation depends only on the number of features;
    b. Simple and easy to understand, with very good interpretability: the feature weights show how each feature influences the final result;
    c. Well suited to binary classification problems, and the input features do not need to be scaled;
    d. Small memory footprint, since only the feature values of each dimension need to be stored;
  2. Disadvantages
    a. Missing values and outliers must be handled beforehand [see Task 3, feature engineering];
    b. Logistic regression cannot solve nonlinear problems, because its decision surface is linear;
    c. It is sensitive to multicollinearity in the data, and it has difficulty handling class imbalance;
    d. Accuracy is often not very high: the model form is very simple, so it is hard to fit the true distribution of the data;
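
As a rough illustration of points 1b and 1c above, here is a minimal scikit-learn sketch; the data from make_classification is a stand-in for the processed credit features (missing values and outliers are assumed to have been handled in the feature-engineering step), not the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data standing in for the cleaned feature matrix and default labels.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# The learned weights show how each feature pushes the prediction towards the
# positive or negative class, which is the interpretability advantage noted in 1b.
print(clf.coef_)
print(clf.score(X_test, y_test))
```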

4.2 Decision tree model

  1. Advantages
    a. Simple and intuitive; the generated decision tree can be visualized
    b. Requires little data preprocessing: no normalization is needed, and missing values can be handled
    c. Can handle both discrete and continuous values
  2. Disadvantages
    a. Decision trees overfit very easily, leading to poor generalization (appropriate pruning can mitigate this)
    b. They are built with a greedy algorithm, so it is easy to end up with a locally optimal solution
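
A minimal sketch of a pre-pruned decision tree with scikit-learn on toy data; max_depth and min_samples_leaf are illustrative pruning settings chosen to counter the overfitting tendency noted in 2a.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Limiting depth and leaf size acts as pre-pruning against overfitting.
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=50, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))
print(export_text(tree))  # text rendering of the learned tree (it can also be plotted)
```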

4.3 Ensemble models (ensemble methods)

An ensemble method completes the learning task by combining multiple learners. Through ensembling, several weak learners can be combined into one strong classifier, so the generalization ability of an ensemble is generally better than that of a single classifier.

The main ensemble approaches are Bagging and Boosting. Both combine existing classification or regression algorithms in a certain way to form a more powerful classifier; they integrate several learners into one, but they do so differently and therefore achieve different effects. Common ensemble models based on the Bagging idea include random forest; ensemble models based on the Boosting idea include AdaBoost, GBDT, XGBoost, LightGBM, etc.

The differences between Bagging and Boosting are summarized as follows:
  1. Sample selection: Bagging draws each round's training set from the original data with replacement, so the training sets of different rounds are independent of one another; Boosting keeps the training set unchanged across rounds and only changes the weight of each sample in the classifier, with the weights adjusted according to the previous round's classification results.
  2. Sample weights: Bagging uses uniform sampling, so every sample has equal weight; Boosting continuously adjusts the sample weights according to the error rate: the larger the error rate, the larger the weight.
  3. Prediction functions: In Bagging all prediction functions have equal weight; in Boosting each weak classifier has its own weight, and classifiers with smaller classification error receive larger weights.
  4. Parallel computation: In Bagging the individual prediction functions can be generated in parallel; in Boosting they can only be generated sequentially, because each model's parameters depend on the results of the previous round.
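
To make the contrast concrete, here is a small sketch comparing a Bagging-style ensemble (random forest) with a Boosting-style ensemble (GBDT) on toy data with scikit-learn; the models, data, and settings are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Bagging: trees are trained independently on bootstrap samples and averaged.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
# Boosting: trees are trained sequentially, each one correcting the errors
# of the ensemble built so far.
gbdt = GradientBoostingClassifier(n_estimators=200, random_state=0)

for name, model in [("random forest (bagging)", rf), ("GBDT (boosting)", gbdt)]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(name, round(auc, 4))
```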

4.4 Model Evaluation Method

For the model, its error on the training set is called training error or empirical error, and the error on the test set is called test error.

What we really care about is the model's ability to handle new samples; that is, we hope that by learning from the existing samples the model captures, as far as possible, the general patterns of all potential samples. If the model fits the training samples too well, it may treat characteristics specific to the training samples as common characteristics of all potential samples, and we then run into the problem of overfitting.

Therefore, we usually divide the existing data set into two parts, the training set and the test set. The training set is used to train the model, and the test set is used to evaluate the model's ability to discriminate new samples.

For the division of data sets, we usually need to ensure that the following two conditions are met:
  1. The distribution of the training set and the test set must be consistent with the real distribution of the sample, that is, the training set and the test set must be guaranteed to be independently and identically distributed from the real sample distribution;
  2. The training set and the test set should be mutually exclusive
There are three methods for dividing data sets: the hold-out method, cross-validation, and the bootstrap method, introduced one by one below (a code sketch follows the summary):
  1. Hold-out method
    The hold-out method directly splits the data set D into two mutually exclusive sets, one used as the training set S and the other as the test set T. Note that the consistency of the data distribution should be preserved as much as possible during the split, to avoid introducing extra bias that would affect the final result. To keep the distributions consistent, stratified sampling is usually used.
    Tips: Usually about 2/3 to 4/5 of the samples in the data set D are used as the training set, and the rest are used as the test set.
  2. Cross-validation
    K-fold cross-validation divides the data set D into k parts, of which k-1 are used as the training set and the remaining one as the test set. This yields k training/test set pairs, so training and testing can be performed k times, and the final result is the mean of the k test results. The splits in cross-validation are still based on stratified sampling.
    For cross-validation, the choice of k largely determines the stability and fidelity of the evaluation; k = 10 is a common choice. When k equals the number of samples in D, so that each fold contains a single sample, the method is called leave-one-out.
  3. Bootstrap method
    Each time we draw one sample from the data set D as an element of the training set and then put it back, repeating this m times. This produces a training set of size m in which some samples appear repeatedly and some never appear; the samples that never appear are used as the test set.
    This works because roughly 36.8% of the data in D never appears in the training set: a given sample is missed on one draw with probability 1 - 1/m, so the probability of never being drawn in m draws is (1 - 1/m)^m, which approaches 1/e ≈ 0.368. The hold-out and cross-validation methods rely on stratified sampling to split the data, while the bootstrap method relies on repeated sampling with replacement.
Dataset partition summary
  1. When the amount of data is sufficient, the hold-out method or k-fold cross-validation is usually used to split the training/test sets;
  2. Use the bootstrap method when the data set is small and hard to split effectively into training/test sets;
  3. When the data set is small but can be split effectively, it is best to use the leave-one-out method, since it is the most accurate
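
The sketch below shows, under illustrative assumptions (toy data, scikit-learn and NumPy), how the three partitioning schemes can be set up in code; the out-of-bag fraction printed at the end should come out near the 36.8% figure mentioned above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 1) Hold-out: a single stratified split, roughly 70/30 here.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# 2) K-fold cross-validation: stratified 10-fold splits.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    pass  # train on X[train_idx], evaluate on X[test_idx]

# 3) Bootstrap: draw m indices with replacement; samples never drawn form the test set.
m = len(X)
boot_idx = np.random.default_rng(0).choice(m, size=m, replace=True)
oob_mask = ~np.isin(np.arange(m), boot_idx)
print("out-of-bag fraction:", oob_mask.mean())  # typically close to 0.368
```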

4.5 Model Evaluation Criteria

For this competition, we choose AUC as the model evaluation metric. Similar metrics include KS, F1-score, etc.; for their specific introduction and implementation, you can review the content in Task 1.

Let's take a look at what exactly AUC is.

In logistic regression, a threshold is usually set to decide between positive and negative cases: samples scored above the threshold are predicted positive, and those below are predicted negative. If we lower the threshold, more samples are identified as the positive class, which raises the positive-class recognition rate, but at the same time more negative samples are misclassified as positive. The ROC curve is introduced to visualize this trade-off.

Each classification result corresponds to a point in ROC space, and connecting these points forms the ROC curve. The x-axis is the False Positive Rate (FPR) and the y-axis is the True Positive Rate (TPR). Under normal circumstances this curve lies above the line connecting (0,0) and (1,1), as shown in the figure:
[Figure: example ROC curve, FPR on the x-axis and TPR on the y-axis]

Four notable points on the ROC curve:

  1. Point (0,1): FPR = 0 and TPR = 1, which means FN = 0 and FP = 0: all samples are classified correctly;
  2. Point (1,0): FPR = 1 and TPR = 0, the worst possible classifier, which gets every prediction wrong;
  3. Point (0,0): FPR = TPR = 0, so FP = TP = 0: the classifier predicts every instance as the negative class;
  4. Point (1,1): the classifier predicts every instance as the positive class.

In summary: the closer the ROC curve is to the upper-left corner, the better the classifier performs and the better its generalization. And generally speaking, if the ROC curve is smooth, it can usually be concluded that there is not much overfitting.

But for two models, how do we judge which model has better generalization performance? Here we have mainly the following two methods:

If the ROC curve of model A completely covers the ROC curve of model B, then we think that model A is better than model B;

If the two curves intersect, we can compare the areas under the two ROC curves (the area enclosed by each curve and the X axis); the larger the area, the better the model's performance. This area is called the AUC (Area Under the ROC Curve).
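
As a concrete illustration, the following sketch computes the ROC curve and AUC for a toy logistic regression model with scikit-learn; the data and model are placeholders rather than the competition pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # predicted probability of the positive class

# Sweeping the decision threshold gives one (FPR, TPR) point per threshold;
# the area under the resulting curve is the AUC.
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))
```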



Origin blog.csdn.net/BigCabbageFy/article/details/108783152