Project purpose: predict the value of customers' transactions
Data source: https://www.kaggle.com/c/santander-value-prediction-challenge
Data content: 4459 customers with known transaction values plus customer attributes (the specific meaning of the attributes is not disclosed; they may be things like gender, age, income, tax payments, etc.; each customer has 4993 attributes).
Steps:
- Data analysis
- Feature selection
- Model building
- Debugging and tuning
Data analysis first.
There are 4459 rows and 4993 columns: 1845 columns are float, 3147 are int, and 1 is object (presumably the user ID).
Observe that the number of features is very large.
Preliminary processing: remove constant columns and duplicate columns (a minimal sketch follows).
This reduces the column count from 4993 to 4732.
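A minimal pandas sketch of this cleaning step, assuming the data sits in train.csv with ID and target columns (the competition's standard layout):

```python
import pandas as pd

train = pd.read_csv("train.csv")
features = train.drop(columns=["ID", "target"])

# Drop constant columns: a column with one unique value carries no information.
constant_cols = [c for c in features.columns if features[c].nunique() == 1]
features = features.drop(columns=constant_cols)

# Drop duplicate columns: keep only the first of each group of identical columns.
# (Transposing is simple but memory-hungry; fine at this dataset's size.)
features = features.T.drop_duplicates().T

print(features.shape)  # roughly 4730 feature columns should remain
```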
With so many features, plotting and analyzing each one individually is impractical.
Use all features directly.
For the prediction target, observe its distribution (left in the figure below): most values are concentrated on the left, and a log transform makes the data much closer to a Gaussian distribution (right in the figure below). Models usually predict Gaussian-distributed targets more accurately (the exact reason is not clear to me; my personal understanding is that when a few raw values are very large, even a small relative prediction error on them produces a huge loss, which hurts the fit).
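A short sketch of the transform and the two histograms described above (train.csv as before):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")
target = train["target"]

# Left: raw target, heavily right-skewed. Right: log1p(target), close to Gaussian.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(target, bins=50)
axes[0].set_title("raw target")
axes[1].hist(np.log1p(target), bins=50)
axes[1].set_title("log1p(target)")
plt.show()

y = np.log1p(target)  # train on the log scale; invert later with np.expm1
```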
Method 1
A likely problem: the sample count is very small, so overfitting is probable. Let's see how it performs first.
First, build a 4-layer DNN (see test_dnn.py for details; a plausible sketch follows).
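A sketch of such a network in Keras (the real architecture is in test_dnn.py and may differ; the layer sizes here are assumptions, and features/y come from the sketches above):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(features.shape[1],)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),  # single regression output on the log scale
])
model.compile(optimizer="adam", loss="mse")

# With ~4700 inputs the first Dense layer alone holds over 2 million weights,
# which on 4459 samples makes early overfitting very likely.
model.fit(features.values.astype("float32"), y.values,
          epochs=20, batch_size=64, validation_split=0.2)
```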
Forecast result analysis
Evaluate on the test set.
Metric: root mean squared error (RMSE)
Calculation: RMSE = sqrt( sum((predicted - actual)**2) / n )
RMSE = 1.84
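For reference, a minimal implementation of the metric; since the target was log1p-transformed, RMSE on that scale corresponds to the competition's RMSLE:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Square root of the mean squared error.
    return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))
```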
The figure below shows the distribution of prediction errors
Result analysis: the results are poor; the predictions are far from the true values, with some extremely large deviations.
Cause Analysis:
- Model structure is not ideal
- Hyperparameter settings
- The sample is too small: the model has about 2 million parameters but only 4000+ samples, so overfitting is severe (it appeared after only ~20 epochs)
Method 2
Use LightGBM
Use the lightgbm library directly (it works out of the box, but the parameters take some learning to tune).
See test_lightgbm.py for details; an illustrative sketch follows.
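A LightGBM sketch (the actual parameters are in test_lightgbm.py; the values below are assumptions, not tuned):

```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    features.values.astype("float32"), y.values, test_size=0.2, random_state=42)

params = {
    "objective": "regression",
    "metric": "rmse",
    "learning_rate": 0.01,
    "num_leaves": 31,
    "feature_fraction": 0.7,  # column subsampling helps with 4700+ features
}
dtrain = lgb.Dataset(X_train, label=y_train)
dval = lgb.Dataset(X_val, label=y_val, reference=dtrain)
booster = lgb.train(params, dtrain, num_boost_round=2000,
                    valid_sets=[dval], callbacks=[lgb.early_stopping(100)])
pred_lgb = booster.predict(X_val, num_iteration=booster.best_iteration)
```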
Forecast result analysis
Evaluate on the test set.
Root mean squared error:
RMSE = 1.35
Result analysis: the results are still not ideal, but better than the DNN, and there are no extremely large outliers in the predictions.
Cause Analysis:
- Overfitting
- Model parameter settings
Method 3
Use XGBoost
Same approach as above (sketch below).
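An XGBoost sketch in the same pattern (parameters are illustrative; X_train/X_val are the splits from the LightGBM sketch):

```python
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "eta": 0.01,
    "max_depth": 8,
    "colsample_bytree": 0.7,
}
bst = xgb.train(params, dtrain, num_boost_round=2000,
                evals=[(dval, "val")], early_stopping_rounds=100)
pred_xgb = bst.predict(dval)
```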
Forecast results
RMSE = 1.38
Result analysis: the results are still not ideal
Cause Analysis:
- 2000 boosting rounds may not be enough; the model had not converged
- Model parameter settings
Method 4
Use CatBoost
Same approach as above (sketch below).
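A CatBoost sketch in the same pattern (hyperparameters are again assumptions):

```python
from catboost import CatBoostRegressor

cat = CatBoostRegressor(iterations=2000, learning_rate=0.01, depth=8,
                        loss_function="RMSE", verbose=200)
cat.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=100)
pred_cat = cat.predict(X_val)
```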
Forecast results
RMSE = 1.47
Result analysis: the results are still not ideal
Method 5
Use the idea of ensemble learning and blend the above methods.
Take a weighted sum of the three learners' results to get the final prediction (sketch below).
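A minimal blending sketch, assuming the three learners are the tree models above (the weights are assumptions; the project's actual weights may differ):

```python
import numpy as np

# Weighted average of the three tree models' predictions (still on the log scale).
w_lgb, w_xgb, w_cat = 0.4, 0.3, 0.3
pred_ensemble = w_lgb * pred_lgb + w_xgb * pred_xgb + w_cat * pred_cat

# Map back to the original transaction-value scale.
final_pred = np.expm1(pred_ensemble)
```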
RMSE = 1.36
Result analysis:
Four methods were used to model the prediction target. The DNN overfitted very early because the dataset is too small.
XGBoost, LightGBM, and CatBoost perform much better than the DNN, though the value predictions still show deviations. Judging from the Kaggle forum posts, this is a good result for an approach that does not exploit the leak in the data. Since parameter tuning would take considerable time, it was not pursued; this project is just a verification. The conclusion: XGBoost, LightGBM, and CatBoost give very good results in low-data scenarios.