Machine Learning | Predicting Bank Telemarketing Customers' Purchase Probability with Machine Learning

Data set: a bank's telephone-marketing records downloaded from the UCI repository, together with whether each customer subscribed to a term deposit.

Simulated goal: given customer data, predict the probability that the customer purchases a financial product.

I think that removing the telemarketing fields and keeping only the basic customer attributes simulates the data an actual bank can obtain.

The telemarketing fields stand in for data that influences the customer's decision but is hard to obtain, for example whether the customer is buying a house or a car, or has children starting school. A bank cannot obtain such data immediately, or only at a relatively high cost, so it is not used in the prediction here. Although this lowers the prediction accuracy, it better matches the real situation.

A term deposit can be treated as one kind of financial product, so once the prediction is implemented and validated for this product, the approach can be extended to forecasts for multiple products.

The fields used in this case are listed in the table below.

Age        customer age
Job        occupation
Marital    marital status
Education  education level
Default    credit in default (no = no default, yes = in default)
Balance    account balance
Housing    whether the customer owns property (no = no property, yes = owns property)
Loan       personal loan (no = no loan, yes = has a loan)

Picture 1.png

Data processing

Routine data cleaning (null-value checks, deduplication, and outlier removal)

Because this data set is clean, little processing is actually needed; real-world data, however, would likely require cleaning. For example, a missing age cannot simply be filled with 0.
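As a minimal sketch of these routine steps (assuming pandas and the UCI file bank-full.csv; the file name, clip thresholds, and median-fill strategy are assumptions, not part of the original post):

```python
import pandas as pd

# Assumed file name -- the UCI bank-marketing CSV is semicolon-separated.
df = pd.read_csv("bank-full.csv", sep=";")

# Null-value check and deduplication.
print(df.isnull().sum())      # count of missing values per column
df = df.drop_duplicates()     # drop exact duplicate rows

# This data set has no missing ages, but if it did, filling with the median
# (rather than 0) would be one reasonable choice.
df["age"] = df["age"].fillna(df["age"].median())

# Simple outlier handling on balance: clip to the 1st-99th percentile range.
low, high = df["balance"].quantile([0.01, 0.99])
df["balance"] = df["balance"].clip(low, high)
```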

Picture 2.png

Two treatments of the balance field are tried:

1. The distribution of balance is long-tailed (most values are small, with a sparse tail of large values), so a transform is applied to compress the tail; a sketch follows this list.

2. balance is left untransformed.
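The post does not say which transform was used; a signed log-style compression of the tail is one common option, sketched here:

```python
import numpy as np

# Compress the long-tailed balance distribution.
# balance can be negative (overdrawn accounts), so use a signed log1p.
df["balance_log"] = np.sign(df["balance"]) * np.log1p(df["balance"].abs())
```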

Categorical fields are one-hot encoded; binary fields with two categories such as yes/no are simply replaced with 1/0.
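A sketch of this encoding step with pandas, using the column names from the UCI file (default, housing, loan, job, marital, education, and the label column y):

```python
# Map the binary yes/no fields to 1/0.
for col in ["default", "housing", "loan"]:
    df[col] = df[col].map({"yes": 1, "no": 0})

# One-hot encode the multi-valued categorical fields.
df = pd.get_dummies(df, columns=["job", "marital", "education"])

# Encode the label (term-deposit subscription) the same way.
df["y"] = df["y"].map({"yes": 1, "no": 0})
```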

The processed data looks like this:

[ Picture: the data after processing ]

LightGBM is used for the modeling, with the parameters shown below.
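The original parameter listing is not reproduced in this text, so the sketch below uses assumed, commonly seen settings rather than the post's exact configuration:

```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Use only the basic customer attributes as features for now.
basic_cols = ["age", "balance", "default", "housing", "loan"] + \
    [c for c in df.columns if c.startswith(("job_", "marital_", "education_"))]
features, labels = df[basic_cols], df["y"]

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Assumed, typical parameters -- not the values from the original post.
model = lgb.LGBMClassifier(
    objective="binary",
    n_estimators=200,
    learning_rate=0.05,
    num_leaves=31,
)
model.fit(X_train, y_train)

# Predicted purchase probability for each test-set customer,
# collected next to the real label.
pred = model.predict_proba(X_test)[:, 1]
result = pd.DataFrame({"predict": pred, "real": y_test.values})
```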

In the test-set predictions below, the left column is the customer index, predict is the predicted purchase probability (used for the recommendation), and real is the actual outcome (0 = did not purchase, 1 = purchased).

Picture 5.png

To evaluate model quality: because only a small fraction of people buy (most predicted probabilities are below 50%), accuracy is a poor metric to evaluate with.

For example:

Suppose the true purchase rates are 0.1 for group A, 0.2 for group B, and 0.2 for group C.

That is, out of 100 people in each of groups A, B, and C, the actual numbers of buyers are 10, 20, and 20 respectively.

Two trained models predict the purchase probabilities of A, B, and C as 0.3, 0.2, 0.1 (model 1) and 0.15, 0.2, 0.2 (model 2). Since every prediction is below 0.5, both models conclude that none of the three groups will buy.

accuracy = number of correctly predicted people / total number of people

accuracy(model 1) = accuracy(model 2) = 250 / 300 ≈ 83% (the 250 people who did not buy are all predicted correctly, since both models predict that nobody buys)

Judged by accuracy, model 1 performs exactly as well as model 2.

But model 2 clearly matches the real purchase rates better, so accuracy is no longer used as the evaluation criterion here.
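A few lines of Python make the toy calculation concrete (group sizes and rates taken from the example above):

```python
import numpy as np

# 100 people in each of groups A, B, C; 10, 20 and 20 of them actually buy.
real = np.array([1] * 10 + [0] * 90 + [1] * 20 + [0] * 80 + [1] * 20 + [0] * 80)

# Both models predict probabilities below 0.5 for everyone, so with a 0.5
# threshold they both predict "no purchase" for all 300 people.
predicted = np.zeros(300)

accuracy = (predicted == real).mean()
print(accuracy)   # 250/300 = 0.833... for both models
```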

Instead, the predicted probabilities are sorted in descending order: if the model is good, the customers with the highest predicted probabilities should include more actual buyers. The chart below is used to measure model quality; the red line is random recommendation and the green line is recommendation sorted by predicted probability.

The faster the green line rises at the beginning, the better the model.

Besides measuring model quality, the chart also gives a useful practical conclusion:

For a given pool of customers, sort them by model score first and then select a top segment for the marketing campaign, which raises the conversion rate.

Picture 6.png

Here the top 1000 people are used as the reference point, and the later models are judged on the same basis. Recommending 1000 people before sorting (i.e. at random) yields 104 buyers, a purchase rate of 104/1000 = 10.4%; recommending the top 1000 after sorting yields 270 buyers, a rate of 270/1000 = 27%. The largest gap appears at 1362 people: when 1362 people are recommended, the difference in buyers between the sorted recommendation and the random recommendation is greatest.
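A sketch of how such a sorted-recommendation curve can be computed, reusing the pred and y_test variables from the modeling sketch above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Sort test customers by predicted probability, highest first.
order = np.argsort(pred)[::-1]
sorted_buyers = np.cumsum(y_test.values[order])              # buyers among the top-k recommended
random_buyers = np.arange(1, len(pred) + 1) * y_test.mean()  # expected buyers for random picks

plt.plot(random_buyers, color="red", label="random recommendation")
plt.plot(sorted_buyers, color="green", label="sorted by predicted probability")
plt.xlabel("number of people recommended")
plt.ylabel("cumulative number of buyers")
plt.legend()
plt.show()

print(sorted_buyers[999])                             # buyers among the top 1000
print(np.argmax(sorted_buyers - random_buyers) + 1)   # where sorting beats random the most
```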

Feature importances are shown in the figure below; account balance and age are the two most important features.

Picture 8.png
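A chart like this can be produced directly from the trained model, for example with LightGBM's built-in plotting helper:

```python
import lightgbm as lgb
import matplotlib.pyplot as plt

# Gain-based feature importances of the trained LightGBM model.
lgb.plot_importance(model, importance_type="gain", max_num_features=15)
plt.show()
```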

The data is also modeled with a DNN and with XGBoost (see the accompanying Python code for details).
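That accompanying code is not shown here; a rough sketch of the two alternatives might look like the following (the XGBoost parameters and the network architecture are assumptions):

```python
import xgboost as xgb
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

# XGBoost with assumed, typical parameters.
xgb_model = xgb.XGBClassifier(n_estimators=200, learning_rate=0.05, max_depth=5)
xgb_model.fit(X_train, y_train)
xgb_pred = xgb_model.predict_proba(X_test)[:, 1]

# A small fully connected network; neural networks usually benefit
# from scaled inputs, so standardize the features first.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

dnn = keras.Sequential([
    keras.Input(shape=(X_train_s.shape[1],)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
dnn.compile(optimizer="adam", loss="binary_crossentropy")
dnn.fit(X_train_s, y_train, epochs=20, batch_size=256, verbose=0)
dnn_pred = dnn.predict(X_test_s).ravel()
```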

Comparison of the three modeling approaches

For the top 1000 recommended people, the actual numbers of buyers are as follows.

( the number of buyers captured by each of the three models )

Picture 10.png

Conclusion: with the current parameter settings, the ranking is xgboost > lightgbm > dnn.

By way of comparison, the modeling so far used only the customers' basic attributes to predict the purchase probability.

Now the telemarketing data is added to simulate marketing information (for example, how many promotional messages have already been sent).

Three new fields are added:

Duration: duration of the last call

Campaign: number of contacts made during this campaign

Poutcome: outcome of the previous campaign; success and failure are self-explanatory, while the unknown and other values are not clearly defined
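Continuing the earlier sketches, the three columns could be added back roughly as follows (poutcome is one-hot encoded; duration and campaign are already numeric):

```python
# Keep the three telemarketing columns this time.
df = pd.get_dummies(df, columns=["poutcome"])
extra_cols = ["duration", "campaign"] + [c for c in df.columns if c.startswith("poutcome_")]

features_full, labels = df[basic_cols + extra_cols], df["y"]
X_train, X_test, y_train, y_test = train_test_split(
    features_full, labels, test_size=0.2, random_state=42, stratify=labels
)

# Retrain LightGBM on the extended feature set (same assumed parameters as before).
model_full = lgb.LGBMClassifier(objective="binary", n_estimators=200, learning_rate=0.05)
model_full.fit(X_train, y_train)
```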

The modeling results using LightGBM are as follows

Picture 12.png

Feature importances are shown in the figure below. The newly added features duration (call duration) and campaign (number of contacts) have a very strong influence on the prediction.

Picture 13.png

For comparison, here is the prediction without these three features (also using LightGBM):

Picture 14.png

The comparison shows that the more informative features are available, the more accurate the prediction.



Origin blog.51cto.com/11855672/2559171