Machine learning fusion model stacking: 14 lessons learned and 5 success stories (the most complete hardcore collection on the Internet)


I have read many articles about fusion model stacking. Many authors praise stacking and downplay its shortcomings, which easily misleads beginners. That is what prompted this article.

Many of my students like to use fusion models as an innovation point in theses or patents, since stacking is a hot technique.

Recently, a student asked during a thesis-modeling consultation whether fusion model stacking is really reliable. The question made me think hard, and I believe this article will help everyone understand stacking more clearly. It summarizes my years of experiments with fusion model stacking; writing it took half a month, most of which went into running the experiments. The article is long, covers a lot of ground, and uses many experimental data sets, so it will be hard to digest in one sitting. You can bookmark it first and ponder it slowly later; it should help you avoid countless pitfalls.

This article is best suited to fusion-model enthusiasts, competition participants, and students writing papers or patents.

Stacking, or stacked generalization, is an ensemble machine learning algorithm.

It uses a meta-learning algorithm to learn how best to combine the predictions of two or more base machine learning algorithms.

The benefit of stacking is that it can harness the capabilities of a set of well-performing models on a classification or regression task and potentially make better predictions than any single model in the ensemble. Note that I said potentially, not certainly.

The figure below shows the algorithm flow of the fusion model. Notice that each sub-model (base model) reads all of the training data, rather than each sub-model reading only a portion of it. More sub-models can therefore be added for observation in the early stage.

The final trained fusion model behaves like an ordinary model, with prediction, classification, and regression capability.
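As a concrete illustration, here is a minimal runnable sketch of this two-level flow using scikit-learn's StackingClassifier. The breast cancer data set and the particular sub-models are my own choices for demonstration, not the article's exact setup.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Level 1: every base model sees all the training rows (stacking uses
# internal cross-validation to produce out-of-fold predictions).
# Level 2: logistic regression learns how to combine those predictions.
sclf = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=5000),
    cv=5)

sclf.fit(X_tr, y_tr)
print(round(sclf.score(X_te, y_te), 4))  # behaves like any fitted model
```

Once fitted, the stacked model exposes the usual fit/predict interface, exactly as the text says.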

The previous article, "Model Contest Killer - Fusion Model (stacking)", introduced fusion model stacking in detail; see it for background.

1. Fusion model stacking is difficult to apply to business models

The computation time of fusion model stacking is much longer than that of a single machine learning model. Commercial models must weigh algorithmic complexity, time cost, and interpretability, and these are exactly stacking's pain points. In an earlier Kaggle competition, a contestant won the championship with a stacked fusion model, but the sponsoring company never deployed it: the model had too many sub-models, was very time-consuming, and was hard to apply to the actual business.

2. Fusion model stacking is very popular in academic papers

The disadvantages of fusion model stacking can also become advantages in academia, especially for publishing papers. Having handled a large number of paper consultations, I find that many academics believe the more complex the model, the higher its value; in their eyes deep learning outranks machine learning and statistical models. These are misunderstandings. The choice of algorithm should fit reality, depending on the scenario and the specific data set; there is no universally correct routine. It is understandable, since many academics have no business-modeling experience. I have read many papers on fusion models: because a stacked model can be composed of many sub-models, the number of possible combinations is huge, which makes it easy to generate paper "innovations".

3. The scikit-learn and mlxtend libraries

The scikit-learn and mlxtend libraries both provide standard Python implementations of stacking ensembles, and each has pros and cons. scikit-learn's advantage is that logistic regression works smoothly as the meta-model (second-layer model). mlxtend runs stacking faster, but in my tests, meta-models other than logistic regression and support vector machines raised errors.
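For reference, the two libraries' constructor conventions differ. This sketch (my own, not from the article) shows both side by side, skipping the mlxtend part gracefully if that package is not installed.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

lr = LogisticRegression(max_iter=5000)
knn, rf = KNeighborsClassifier(), RandomForestClassifier(random_state=0)

# scikit-learn: named (name, estimator) pairs; meta-model via final_estimator.
sk_stack = StackingClassifier(estimators=[("knn", knn), ("rf", rf)],
                              final_estimator=lr, cv=5)

# mlxtend: a plain list of classifiers; meta-model via meta_classifier.
try:
    from mlxtend.classifier import StackingCVClassifier
    mlx_stack = StackingCVClassifier(classifiers=[knn, rf],
                                     meta_classifier=lr, cv=5)
except ImportError:
    mlx_stack = None  # mlxtend not installed
```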

4. Fusion model stacking experiment time cost is high

The sub-models require different data preprocessing. Support vector machines and neural networks, for example, need missing values filled and the data smoothed, whereas ensemble learning algorithms do not; ensemble methods may even get better results from the raw data directly.

These differing preprocessing and prediction logics multiply the variety and number of fusion-model experiments, which drives up the time cost.

5. The performance of the fusion model is not necessarily higher than that of the sub-model

Many introductions to fusion models online convey a misconception: that a fusion model must outperform its single models, so after modeling we should always reach for stacking to boost performance. In reality this is not the case. In a large number of experiments we found that the fusion model often fails to improve, sometimes performs worse than its sub-models, and consumes a great deal of experiment time.

For example, in an experiment on the breast cancer data set, the fusion model's AUC was 0.9820, worse than the sub-models above.

However, after we added KNN and lightgbm sub-models, the fusion model's performance improved greatly and surpassed all sub-models.

We have seen many papers claiming the fusion model outperforms its sub-models. That is because the authors spent a great deal of time experimenting with fixed sets of sub-models until they reached that conclusion. The sub-model combination you see is not accidental; it is the result of careful screening after much time and experimentation.

6. Which metrics a fusion model is likely to improve

In a large number of experiments, we found the fusion model is more likely to improve accuracy and F1 score than AUC. A given group of sub-models may improve one metric after fusion, but there is no guarantee of improving all metrics.
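This point can be checked directly with scikit-learn's cross_validate, which scores several metrics in one pass; the models below are my own demonstration choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
sclf = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=5000), cv=5)

# One cross-validation run, three metrics: compare each against the
# sub-models' scores before concluding the fusion "improved".
scores = cross_validate(sclf, X, y, cv=5,
                        scoring=["accuracy", "f1", "roc_auc"])
for metric in ("test_accuracy", "test_f1", "test_roc_auc"):
    print(metric, round(scores[metric].mean(), 4))
```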

7. Fusion model improvement tip - applying the cv parameter

StratifiedKFold translates as stratified K-fold cross-validation. When the target variable of the data set is imbalanced, ordinary cross-validation can split the data with insufficient randomness: for example, good customers may dominate a fold while bad customers are rare or even absent.

StratifiedKFold handles imbalanced data well. With stratified K-fold cross-validation, each training round keeps the class proportions of the original labels, the training labels, and the validation labels consistent.

The figure below is the flow chart of the StratifiedKFold algorithm. The class target variable has three categories, and each category is sampled uniformly across the cross-validation folds.

When calling the cross_val_score function, remember to set the cv parameter, usually 5 or 10; any integer sets the number of stratified K-fold splits. For classifiers, an integer cv automatically uses stratified folds, so it quietly solves the target-imbalance problem for us. When the data set is small, cv=10 may perform better than cv=5.

from sklearn import model_selection

scores = model_selection.cross_val_score(clf, X, y,
                                         cv=5, scoring='roc_auc')
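A quick demonstration (my own toy example) that StratifiedKFold preserves the class ratio in every fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 90 "good" (0) vs 10 "bad" (1) customers.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold keeps the original 10% bad-customer ratio,
    # so no fold ends up with zero bad customers.
    assert y[val_idx].mean() == 0.10
```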

8. Fusion model improvement tip - choosing the meta-model (meta_classifier)

For most students, I recommend logistic regression as the meta_classifier. In experiments on several data sets, other algorithms fell short of logistic regression as the meta-model. On the breast cancer data set, logistic regression as meta-model gave a fusion AUC of 0.9959, while a support vector machine as meta-model gave 0.982.

In exchanges with some friends I have also found exceptions: on their data sets, an ensemble tree algorithm as the meta-model sometimes yields better fusion performance. A fusion model has too many parameters; everything depends on the experimental results.
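One way to run that comparison is a simple loop over candidate meta-models. This sketch uses scikit-learn's StackingClassifier (estimators and final_estimator) rather than mlxtend, and the specific sub-models are my own demonstration choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
base = [("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(random_state=0))]

# Same base models, different meta-models; let the CV score decide.
results = {}
for name, meta in [("logistic", LogisticRegression(max_iter=5000)),
                   ("tree", DecisionTreeClassifier(random_state=0))]:
    sclf = StackingClassifier(estimators=base, final_estimator=meta, cv=5)
    results[name] = cross_val_score(sclf, X, y, cv=5,
                                    scoring="roc_auc").mean()
print({k: round(v, 4) for k, v in results.items()})
```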

9. Fusion model improvement tip - use just the right number of sub-models

In our experiments, we found that more sub-models is not always better for the stacked fusion model, nor is fewer; just right is best.

We built a fusion model from 9 sub-models, including KNN and random forest, with an AUC of 0.9953.

After reducing the count to 6 sub-models, the fusion model's AUC rose to 0.9957, higher than the 9-sub-model version's. This shows that more sub-models is not necessarily better.

10. Fusion model improvement tip - deleting the weakest sub-models can improve the fusion model

When experimenting with the fusion model, we can first add as many sub-models as possible, observe which ones perform weakly, and delete the clear laggards; the fusion model then improves. As the figure below shows, the decision tree sub-model's AUC is 0.91 and Gaussian naive Bayes' is 0.98, noticeably below the other sub-models. After deleting these two, the fusion AUC rose from 0.9953 to 0.9957. In many experiments we found the decision tree and Gaussian naive Bayes performed poorly; of course this may be specific to our samples, and these two algorithms may do well on other data sets.
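The screening step can be automated by scoring every candidate sub-model on its own and dropping the laggards; the 0.02 AUC tolerance below is an arbitrary illustrative threshold of mine.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
candidates = {"knn": KNeighborsClassifier(),
              "rf": RandomForestClassifier(random_state=0),
              "tree": DecisionTreeClassifier(random_state=0),
              "gnb": GaussianNB()}

# Score each candidate sub-model alone via cross-validation.
aucs = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
        for name, m in candidates.items()}

# Keep only sub-models within 0.02 AUC of the best one (threshold is
# an assumption; tune it to your data).
best = max(aucs.values())
kept = [name for name, a in aucs.items() if a >= best - 0.02]
print({k: round(v, 4) for k, v in aucs.items()}, kept)
```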

11. Fusion model improvement tip - train the meta-model on predicted class probabilities

A meta-classifier can be trained on predicted class labels or on predicted class probabilities. Feeding the level-1 models' class probabilities into the level-2 meta-model gives better fusion performance; with predicted class labels, the fusion model performs poorly.

The logic is simple. Class labels carry very little information: in a binary classification model the labels are only 0 and 1, whereas probability predictions are real numbers between 0 and 1. The richer signal in class probabilities lets the meta-model learn more and improves performance.

Simply set use_probas=True to train the meta-model on class probabilities.

from mlxtend.classifier import StackingClassifier

sclf = StackingClassifier(classifiers=[clf1, clf2, clf3, clf4, clf5, clf6],
                          meta_classifier=lr, use_probas=True)

12. Fusion model improvement tip - diversity experiments

Some theory says that the greater the differences between sub-models, the more independent they are of each other, and the more room the fusion model has to improve. This fits the fact that the meta-model is usually logistic regression, and logistic regression requires removing highly correlated variables.

Multiple highly correlated variables can drag down model performance; the lower the correlation between sub-models, the more room logistic regression has to work. Ensemble tree algorithms are less sensitive to variable correlation, so the requirement can be relaxed. You can experiment: if the meta-model is an ensemble tree algorithm, does the sub-model independence requirement still hold?

The above is only theory, and actual tests can differ considerably. Let the experimental results decide; treat this as reference only.
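One way to quantify sub-model diversity (my own sketch, not from the article) is to correlate the sub-models' out-of-fold probability predictions; off-diagonal values near 1.0 indicate low diversity.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {"knn": KNeighborsClassifier(),
          "rf": RandomForestClassifier(random_state=0),
          "gnb": GaussianNB()}

# Out-of-fold P(class=1) for each sub-model.
preds = {name: cross_val_predict(m, X, y, cv=5,
                                 method="predict_proba")[:, 1]
         for name, m in models.items()}

corr = np.corrcoef([preds[n] for n in models])
print(np.round(corr, 3))  # off-diagonal entries near 1.0 = low diversity
```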

I watched Teacher Cai Cai's video, in which she explains diversity in detail, as follows:

12.1 Sample diversity: model with the same variables, but sample a different subset of rows for each training run. When the amount of data is small, downsampling can cause a drastic drop in model performance.

12.2 Variable diversity: use the same variable matrix, but sample a different subset of features for each training run. When there are few features, feature sampling may cause a sharp drop in model performance.

We can use a pipeline to encapsulate the selection of a subset of the data set's variables for training.

12.3 Random/training diversity: use the same algorithm but different random seeds (random_state), which leads to different features, samples, and starting points; or use different loss functions or different impurity-decrease criteria.

12.4 Algorithm diversity: mix different types of algorithms, such as ensemble, tree, probabilistic, and linear models. Note, however, that no model's performance should be too poor: whether voting or averaging, one very bad model can drag the fused result down badly.

13. Fusion model improvement tip - speeding up training

Fusion models use cross-validation and are very slow. A small data set is fine, but on a large one you must choose sub-models carefully. If the data set is very large and you want to save time, remove the SVM and catboost algorithms: both sub-models are very time-consuming. SVM training takes a long time on large data sets, and catboost, a symmetric-tree algorithm, is also slow to train.

Noise variables, and variables of little significance, can be deleted from the data set; reducing dimensionality shortens model training time.

Python reads Excel files much more slowly than CSV. Prefer pandas' read_csv() function, which saves a lot of time. If the data set is especially large, it can also be saved with the pickle module, which loads even faster.

In short, variable screening, algorithm screening, and CSV (or pickle) data loading all improve the fusion model's training speed.
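A small caching pattern (my own sketch; the file names are hypothetical) that parses the CSV once and serves subsequent loads from a pickle:

```python
import os
import pandas as pd

CSV_PATH = "train.csv"    # hypothetical raw data file
CACHE_PATH = "train.pkl"  # hypothetical pickle cache

# Toy data standing in for a large training file.
pd.DataFrame({"x": [1, 2, 3], "y": [0, 1, 0]}).to_csv(CSV_PATH, index=False)

def load_data():
    """Read from the pickle cache if present, else parse the CSV once."""
    if os.path.exists(CACHE_PATH):
        return pd.read_pickle(CACHE_PATH)  # fast binary load
    df = pd.read_csv(CSV_PATH)             # slower text parse
    df.to_pickle(CACHE_PATH)
    return df

df = load_data()   # first call: parses CSV and writes the cache
df2 = load_data()  # second call: hits the pickle cache
```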

14. Fusion model improvement tip - data standardization

When the data set's variance is large, the sub-models' predictive power varies greatly. In the medical field, variance is small: age and routine blood test values generally range from 0 to 100. In finance, variance is huge: Zhang San's monthly income is 5,000 yuan while Bill Gates' is 500 billion. When the variance is large and the sub-models are fairly independent, standardize the data to reduce the variance. If the sub-models are all ensemble tree algorithms, no standardization is needed. Mr. Toby likes to call standardization "smoothing": the processed data is smoother and no longer fluctuates wildly.

The Python code is very simple: just call the preprocessing.scale() function from the sklearn package.

from sklearn import preprocessing

X = preprocessing.scale(X)  # zero mean, unit variance per column

Stacking Fusion Model Success Stories

Successful case of stacking fusion model 1 - breast cancer cell data set

The breast cancer cell data set has more than thirty variables and is used to build a breast cancer cell recognition model.

For the Wisconsin breast cancer data set, Mr. Toby stacked a fusion model from six sub-models: knn, random forest, catboost, neural network, xgboost, and lightgbm. The fusion model's AUC was higher than that of every sub-model.

Teacher Toby also built a fusion model from seven sub-models: knn, random forest, catboost, neural network, xgboost, lightgbm, and svm. Its accuracy was higher than any sub-model's.

With five sub-models (knn, random forest, neural network, xgboost, and svm), the fusion model's F1 score was higher than any sub-model's.
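A runnable approximation of this case using only scikit-learn (catboost, xgboost, and lightgbm are separate installs, so an MLP and a random forest stand in for them here; the exact sub-model set and scores will therefore differ from Mr. Toby's):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale-sensitive sub-models get their own standardization step.
estimators = [
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ("rf", RandomForestClassifier(random_state=0)),
    ("mlp", make_pipeline(StandardScaler(),
                          MLPClassifier(max_iter=2000, random_state=0))),
]
sclf = StackingClassifier(estimators=estimators,
                          final_estimator=LogisticRegression(max_iter=5000),
                          cv=5, stack_method="predict_proba")

# The meta-model is trained on predicted probabilities (tip 11) and the
# folds are stratified (tip 7).
auc = cross_val_score(sclf, X, y, cv=5, scoring="roc_auc").mean()
print(round(auc, 4))
```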

The top ten classic machine learning modeling codes for breast cancer, plus the complete fusion model code, can be obtained through the course "Python Machine Learning - Breast Cancer Cell Mining".


Successful case of stacking fusion model 2 - Tianchi diabetes data set

The Tianchi diabetes data set, with a handful of variables and more than 5,000 rows, is used to build a diabetes risk prediction model.

Teacher Toby built a fusion model from three sub-models: random forest, adaboost, and gradient boosting. Its F1 score was higher than any sub-model's.

Improving a fusion model's F1 score is much easier than improving AUC, and it does not require many sub-models.

Teacher Toby also built a fusion model from random forest, adaboost, and xgboost whose AUC beat every sub-model. To achieve this, Mr. Toby first filled the missing data with the median and did some other preprocessing.

Accuracy is hard to improve on the Tianchi diabetes data set. Teacher Toby spent a long time experimenting: he first filled the missing values with the median, then built a fusion model from four sub-models (knn, neural network, xgboost, and svm) whose accuracy was higher than any sub-model's.

The algorithmic principles of these four sub-models differ greatly, which ensures algorithm diversity; the experimental results were good as well.
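The median-fill step this case relies on can be done with scikit-learn's SimpleImputer; the tiny array below is my own toy illustration, not the Tianchi data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with missing entries.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

# Replace each NaN with its column's median.
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Fit the imputer on the training folds only and reuse it on the validation data to avoid leakage.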

The figure below is Mr. Toby's box-plot visualization of the accuracy of the sub-models and the fusion model. The fusion model has the highest accuracy.

Successful case of stacking fusion model 3 - Lending Club data set

Lending Club is a well-known American fintech company. Its data set has more than 120 variables and millions of rows spanning roughly ten years. It belongs to the financial risk-control domain and suits banks, consumer finance companies, loan facilitators, and fintech companies.

Teacher Toby used just three sub-models (lightgbm, catboost, and xgboost) to build a fusion model that significantly improved the F1 score.

Because the Lending Club data set is relatively large and Mr. Toby's time was limited, improving accuracy and AUC with the tips above is left to everyone as homework.

If you are interested in Lending Club machine learning modeling, you can get it through "Python Risk Control Modeling Practical LendingClub".


Successful case of stacking fusion model 4 - p2p data set of Yilongdai (Wing Loong Loan), a Lenovo subsidiary

Yilongdai has set up operation centers in more than 100 prefecture-level cities nationwide, covering thousands of districts and counties and nearly 10,000 towns, and is building a nationwide service network across many first- and second-tier cities. The platform helps people with good credit and varied needs solve funding shortages, while letting customers seeking wealth appreciation invest surplus funds at higher returns. Its main borrowers are farmers, rural households, individual industrial and commercial households, and small and micro business owners. Owing to financial regulatory requirements, P2P platforms must transform, and Yilongdai's lending has been gradually scaled back.

Here too, Teacher Toby used just three sub-models (lightgbm, catboost, and xgboost) to build a fusion model that significantly improved the F1 score.

Successful case of stacking fusion model 5 - Crohn's disease-causing gene mining model

Crohn's disease, also known as regional or granulomatous enteritis, is an intestinal inflammatory disease of unknown cause that can occur in any part of the gastrointestinal tract but appears most often in the terminal ileum and right colon. Together with chronic nonspecific ulcerative colitis, it is classified as inflammatory bowel disease (IBD). Clinical manifestations include abdominal pain, diarrhea, and intestinal obstruction, with extraintestinal signs such as fever and nutritional disorders. The course is protracted and relapsing, and a radical cure is difficult. There is no universal cure, and many patients need surgery when complications occur. The recurrence rate is related to the extent of the lesions, the aggressiveness of the disease, the length of the disease course, and increasing age.

Many famous people have had Crohn's disease:

1. Larry Nance Jr., now a key player for the NBA Cavaliers and son of former NBA dunk champion Larry Nance, is a Crohn's patient who followed in his father's footsteps on the court. He developed Crohn's disease at 15; it robbed him of appetite and energy, left him lethargic and unable to throw himself into basketball and schoolwork, and at one point made him consider giving up basketball.

2. In 2004, Menino, then mayor of Boston, was diagnosed with Crohn's disease after eating peanuts at a baseball game triggered severe abdominal pain.

3. Best known is Eisenhower, Supreme Commander of the Allied Forces in World War II, five-star general, and later US president, who underwent surgery for Crohn's disease six months before the start of his campaign.

4. Wang Yiping, a doctoral supervisor at the Shanghai Institute of Materia Medica, Chinese Academy of Sciences, posthumously honored as a "model of the times", suffered from Crohn's disease for a long time before his death. From his diagnosis in 1993 to his death in 2018, through 25 years of illness he persisted in scientific research, racing against death, and left behind a bright chapter in the modernization of traditional Chinese medicine.

The pain of Crohn's disease is beyond what ordinary people can imagine. Symptoms include chronic diarrhea, abdominal pain, weight loss, loss of appetite, fever, rectal bleeding, intestinal obstruction, and joint pain, seriously affecting quality of life. Patients grow weak from being unable to eat and from diarrhea; joint pain keeps them from exercising; eating habits change completely. With frequent, uncontrollable trips to the toilet and flatulence, even normal social life is hindered. Teacher Toby once thought Crohn's disease was very rare, but the data show its prevalence increasing year by year. On Bilibili, many self-identified Crohn's patients post videos sharing their life with the disease.

Teacher Toby used just three sub-models (lightgbm, catboost, and xgboost) to build a fusion model that improved the accuracy rate.

Teacher Toby also used data mining to identify high-risk genes for Crohn's disease, which I will introduce later when time permits. Toby has worked with Chinese Academy of Sciences professors on chronic-disease projects. Seeing rare-disease numbers grow year by year, I can only urge everyone to take care of themselves, cherish their health, work in moderation, and not overwork.


Teacher Toby has more successful stacking cases, which will be added over time. You are welcome to follow and bookmark the course "Python Financial Risk Control Scorecard Model and Data Analysis Micro-Professional Course".

Copyright statement: this article comes from the official account (python risk control model); reproduction without permission is prohibited. It follows the CC 4.0 BY-SA license; please attach the original source link and this statement when reprinting.


Origin: blog.csdn.net/toby001111/article/details/131268924