Machine learning | 5 methods for predicting and analyzing customer value

Project purpose: predict the value of customers' transactions

Data source: https://www.kaggle.com/c/santander-value-prediction-challenge

Data content: 4,459 customers with known transaction values and their attributes (what each attribute means is not disclosed; they may be gender, age, income, tax payments, etc.); each customer has 4,993 attributes.

Steps:

  • Data analysis
  • Feature selection
  • Model building
  • Debugging

 

Data analysis

The data has 4,459 rows and 4,993 columns: 1,845 columns are float type, 3,147 columns are int type, and 1 column is object type (presumably the user ID).

 

The number of features is very large.

Preliminary processing: remove constant columns and duplicate columns.

The number of columns drops from 4,993 to 4,732.
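A minimal sketch of this preprocessing step (file and column names follow the Kaggle competition layout; the actual script is not shown in the post):

```python
import pandas as pd

# Load the competition training data (assumed layout: an "ID" column, a "target" column,
# and the remaining anonymous feature columns).
train = pd.read_csv("train.csv")
features = train.drop(columns=["ID", "target"])

# Remove constant columns: a column with a single unique value carries no information.
constant_cols = [c for c in features.columns if features[c].nunique() == 1]
features = features.drop(columns=constant_cols)

# Remove duplicate columns: identical content stored under different names.
features = features.T.drop_duplicates().T

print(features.shape)  # number of remaining feature columns
```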

Because there are so many features, plotting and analyzing them individually is difficult.

All features are used directly.

For the prediction target, observe the data distribution (left in the figure below): most values are concentrated on the left, with a long right tail. Applying a log transform makes the data closer to a Gaussian distribution (right in the figure below). Predictions on roughly Gaussian targets are usually more accurate (the exact reason is not entirely clear to me; my personal understanding is that when a few very large values are present, even a small relative prediction error causes a large change in the loss, which is bad for the fit).
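A sketch of the log transform of the target (using numpy; column name as in the preprocessing sketch above):

```python
import numpy as np

# log1p compresses the long right tail so the target looks closer to Gaussian;
# expm1 reverses the transform after prediction.
y = train["target"].values
y_log = np.log1p(y)

# After training and predicting in log space, convert back:
# y_pred = np.expm1(y_pred_log)
```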

 

Method 1

A potential problem: there are very few samples, so overfitting is likely. Let's see how it performs first.

First, a 4-layer DNN is built (see test_dnn.py for details).
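test_dnn.py is not reproduced in the post; below is a minimal Keras sketch of what a 4-layer fully connected regressor of this size might look like. The layer widths are assumptions, chosen so the parameter count lands near the roughly 2 million mentioned in the cause analysis below.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(n_features):
    # 4 dense layers; with ~4,700 input features the first layer alone
    # already contributes over 2 million weights.
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(512, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(1),  # regression output, predicting the log-transformed target
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_dnn(n_features=features.shape[1])
model.fit(features.values.astype("float32"), y_log,
          epochs=50, batch_size=64, validation_split=0.2)
```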

Prediction result analysis

Evaluated on the test set

Root mean square error (RMSE)

Calculation: RMSE = sqrt(sum((predicted value − actual value)²) / number of samples)
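In code, the metric looks like this:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Square the errors, average over the samples, then take the square root.
    return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))
```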

RMSE = 1.84

The figure below shows the distribution of prediction errors

Result analysis: the performance is not ideal; the predicted values are far from the true values, with some very large deviations.

Cause analysis:

  1. The model structure is not ideal
  2. The hyperparameter settings may be poor
  3. The sample is too small: the model has about 2 million parameters but only 4,000+ samples, so overfitting is severe (overfitting appeared after 20 iterations)

[Figure: distribution of prediction errors]

 

Method 2

Use lightgbm

The lightgbm library is used directly (it works out of the box, but parameter tuning still needs to be learned).

See test_lightgbm.py for details
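test_lightgbm.py is not shown in the post; a minimal sketch of the general approach, with illustrative (untuned) parameters and a validation split carved out of the training data, might look like this:

```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# features: preprocessed feature matrix, y_log: log-transformed target (see the sketches above).
X_train, X_valid, y_train, y_valid = train_test_split(
    features.values.astype("float32"), y_log, test_size=0.2, random_state=42
)

params = {
    "objective": "regression",
    "metric": "rmse",
    "learning_rate": 0.01,
    "num_leaves": 31,
    "feature_fraction": 0.7,  # illustrative values, not the author's tuned settings
}

model = lgb.train(
    params,
    lgb.Dataset(X_train, label=y_train),
    num_boost_round=2000,
    valid_sets=[lgb.Dataset(X_valid, label=y_valid)],
    callbacks=[lgb.early_stopping(stopping_rounds=100)],
)
pred_lgb = model.predict(X_valid, num_iteration=model.best_iteration)
```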

Prediction result analysis

Evaluated on the test set

Root mean square error (RMSE)

RMSE = 1.35

[Figure]

Result analysis: the result is still not ideal, but better than the DNN, and there are no extremely large outlier predictions.

Cause analysis:

  1. Overfitting
  2. Model parameter settings

[Figure]

 

Method 3

Use xgboost

Same method as above
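A minimal xgboost sketch following the same pattern (parameters again illustrative):

```python
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "eta": 0.01,
    "max_depth": 6,
    "subsample": 0.8,  # illustrative values only
}

# 2000 boosting rounds, matching the iteration count discussed below.
model = xgb.train(
    params,
    dtrain,
    num_boost_round=2000,
    evals=[(dvalid, "valid")],
    early_stopping_rounds=100,
)
pred_xgb = model.predict(dvalid)
```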

Prediction result

RMSE = 1.38

[Figure]

Result analysis: the result is still not ideal.

Cause analysis:

  1. 2,000 iterations are not enough; the model has not converged
  2. Model parameter settings

 

Method 4

Use catboost

Same method as above
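A corresponding catboost sketch, again with illustrative parameters:

```python
from catboost import CatBoostRegressor

model = CatBoostRegressor(
    iterations=2000,
    learning_rate=0.01,
    depth=6,
    loss_function="RMSE",
    verbose=200,  # illustrative settings, not tuned
)
model.fit(X_train, y_train, eval_set=(X_valid, y_valid), early_stopping_rounds=100)
pred_cat = model.predict(X_valid)
```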

Prediction result

RMSE = 1.47

Result analysis: the result is still not ideal.

 

Method 5

Use the idea of ensemble learning and combine the above methods.

The results of the three tree-based learners are summed with weights to produce the final prediction, as sketched below.
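The exact weights are not stated in the post; a sketch of the weighted blend with placeholder weights:

```python
import numpy as np

# Placeholder weights (summing to 1); the post does not give the values actually used.
w_lgb, w_xgb, w_cat = 0.4, 0.3, 0.3

blend_log = w_lgb * pred_lgb + w_xgb * pred_xgb + w_cat * pred_cat

# Convert back from log space to the original transaction-value scale.
blend = np.expm1(blend_log)
print("RMSE (log space):", rmse(y_valid, blend_log))
```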

RMSE = 1.36

[Figure]

Result analysis:

Four methods were used to model the prediction target; the DNN overfit very early because there is too little data.

XGBoost, LightGBM, and CatBoost performed much better than the DNN, but their value predictions still show deviations. According to Kaggle forum posts, given the characteristics of this data, this is a good prediction for an approach that does not exploit the leak. Since parameter tuning would take considerable time, it was not pursued further; this was only a verification. The conclusion is that XGBoost, LightGBM, and CatBoost give very good results in scenarios with little data.



Origin: blog.51cto.com/11855672/2559560