Financial Scorecard Project—3. Data preprocessing and feature derivation in the churn warning model

Introduction

  The complete code for this part is available on GitHub: https://github.com/Libra-1023/data-mining/blob/master/Bank_customer_churn/outlier_missingvalues_date_process.ipynb

1. Treatment of extreme values

  Extreme values, also called outliers, tend to distort predictions and reduce model accuracy. Their influence is particularly large in regression models, so before fitting one we need to detect and handle them first.

1. The importance of detecting extreme values (outliers)

  • You need to judge for yourself how much extreme values affect the model, and choose a treatment that fits the actual problem
  • Why detection matters: extreme values can introduce large bias and variance into the model's estimates and predictions
  • Alternatively, choose a model that is less sensitive to extreme values, such as KNN or a decision tree

An example:
  Visualizing the data shows that, in general, extreme values bias the model. In linear regression, for example, extreme values significantly affect the parameter estimates.

2. Treatment of extreme values

  How should extreme values be handled in regression models?

  • Cap the extreme value at a normal value, for example by replacing it with the 95% quantile (truncation).
    Example: when overdrafts push credit card utilization above 100%, use 100% instead
  • Delete the extreme values
    Example: the very few cardholders who are over 85 years old
  • Build a separate model
    Example: accounts with an extremely high credit card limit

3. Extreme value detection: the 3σ criterion

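  Under the 3σ criterion, a value is treated as extreme if it lies more than three standard deviations (σ) from the mean (μ), that is, outside the interval [μ − 3σ, μ + 3σ]. For normally distributed data, about 99.7% of values fall inside this interval.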
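
A minimal sketch of a 3σ check using only the Python standard library is shown below. The function name `three_sigma_outliers` and the sample balances are made up for illustration, and the sketch assumes the sample standard deviation is an acceptable estimate of σ.

```python
from statistics import mean, stdev

def three_sigma_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    lower, upper = mu - k * sigma, mu + k * sigma
    return [v for v in values if v < lower or v > upper]

# Hypothetical balances: twenty ordinary values and one extreme value.
balances = [48, 49, 50, 51, 52] * 4 + [500]
outliers = three_sigma_outliers(balances)
```

Note that the extreme value itself inflates both the mean and the standard deviation, so on very small samples the rule may fail to flag anything.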

2. Treatment of missing values

1. Types of missing values

  • Missing completely at random (MCAR): the missingness is unrelated to any other variable, for example: missing marital status
  • Missing at random (MAR): the missingness depends on other variables, for example: a missing "spouse name" depends on "marital status"
  • Missing not at random (MNAR): the missingness depends on the missing value itself, for example: high-income people are unwilling to report family income

2. How to deal with missing values

  • Delete the attributes or samples that have missing values (a luxury you can afford only when data is plentiful)
  • Impute the missing values (usually used when values are missing completely at random and the missing rate is low)
  • Treat "missing" as an attribute value of its own (usually used when values are missing not at random)

3. Treatment of missing values of continuous variables

  • For values missing completely at random, when the missing rate is not high, you can:

    1. Fill the gaps with a constant, such as the mean. If there are extreme values, consider removing them before calculating the mean
    2. Assign each missing sample a value drawn at random from the non-missing values

  • For values missing at random, where the missingness depends on another variable, apply the methods above within each level (stratum) of that variable

    For example: the variable "income" depends on "work status". When "work status" = "working", a missing income can be replaced by the mean of the known incomes of all "working" cardholders,
    or by a random sample drawn from the known incomes of all "working" cardholders

  • For values missing not at random, you can treat "missing" as an attribute value and convert the variable into a categorical variable
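
The options above can be sketched with pandas as follows. The data, the column names (`work_status`, `income`) and the bin edges are hypothetical and chosen only to make the example small; they are not taken from the project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are assumptions.
df = pd.DataFrame({
    "work_status": ["working", "working", "working", "retired", "retired", "working"],
    "income": [5000.0, 7000.0, np.nan, 2000.0, np.nan, 6000.0],
})
mask = df["income"].isna()

# Missing completely at random, option 1: fill with the overall mean.
mean_filled = df["income"].fillna(df["income"].mean())

# Missing completely at random, option 2: draw from the observed values.
rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
observed = df["income"].dropna().to_numpy()
sampled = df["income"].copy()
sampled.loc[mask] = rng.choice(observed, size=int(mask.sum()))

# Missing at random: fill with the mean of the same work_status group.
group_mean = df.groupby("work_status")["income"].transform("mean")
group_filled = df["income"].fillna(group_mean)

# Missing not at random: bin the variable and keep "missing" as its own category.
binned = pd.cut(df["income"], bins=[0, 3000, 6000, np.inf],
                labels=["low", "mid", "high"])
income_cat = binned.cat.add_categories("missing").fillna("missing")
```

The group-wise fill uses `transform("mean")` so that the group means line up row by row with the original column.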

4. Treatment of missing values of categorical variables

  • When the missing rate is low,

    Fill with the most frequent category, or with a random sample drawn from the known values.

  • When the missing rate is high,

    Consider excluding the variable (feature)

  • When the missing rate is between "low" and "high",

    Treat "missing" as a category of its own
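
One way to express these three rules in code is sketched below. The helper name `handle_categorical_missing` and the cut-offs `low=0.05` and `high=0.50` are assumptions made for illustration; the text does not say where "low" ends or "high" begins.

```python
import pandas as pd

def handle_categorical_missing(df, column, low=0.05, high=0.50):
    """Apply the three rules by missing rate; the thresholds are assumed values."""
    rate = df[column].isna().mean()
    if rate == 0:
        return df
    if rate >= high:
        # High missing rate: exclude the variable.
        return df.drop(columns=[column])
    out = df.copy()
    if rate <= low:
        # Low missing rate: fill with the most frequent category.
        out[column] = out[column].fillna(out[column].mode().iloc[0])
    else:
        # In between: keep "missing" as a category of its own.
        out[column] = out[column].fillna("missing")
    return out

# Hypothetical example: one of five industry values is missing (rate 0.2).
customers = pd.DataFrame({"industry": ["IT", "IT", "finance", None, "retail"]})
as_category = handle_categorical_missing(customers, "industry")
```

With the default thresholds the example falls in the middle band, so the gap becomes the category `"missing"`.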

3. Treatment of special variables

1. Categorical variables

  Categorical variables express categories; they usually have no concept of "order" and take a limited set of values.

  In this data set, they include gender, industry, credit card type, and so on.

  • Some models can read categorical variables directly

    Decision trees

  • Some models cannot read categorical variables directly

    Regression models
    Neural networks
    Models with a "distance" measure (SVM, KNN, etc.): normalize before calculating the distance

  • When categorical variables cannot be put into the model directly, they must be encoded: replace the original value with a numerical value

    One-hot coding: the categorical variable is converted into a sparse matrix
    Dummy coding
    Concentration coding: mostly used in decision trees
    WOE coding
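
As a small illustration of the first two encodings, the sketch below uses `pandas.get_dummies` on a made-up `card_type` column (the column name and its values are assumptions). One-hot coding keeps one indicator column per category, while dummy coding drops one category as the reference level.

```python
import pandas as pd

# Hypothetical categorical column; the values are made up for illustration.
df = pd.DataFrame({"card_type": ["gold", "silver", "platinum", "gold"]})

# One-hot coding: one 0/1 column per category.
one_hot = pd.get_dummies(df["card_type"], prefix="card_type", dtype=int)

# Dummy coding: the first category ("gold") is dropped and acts as the baseline.
dummy = pd.get_dummies(df["card_type"], prefix="card_type",
                       drop_first=True, dtype=int)
```

Dummy coding is often preferred for regression models, because the full set of one-hot columns is perfectly collinear with the intercept.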

2. Date/time variable

  • It often appears as a string, for example: "2017-04-01 12:00:05"
  • It is essentially numeric
  • It can be converted into a number of days relative to a base date.
    Taking the observation point as the base, every account opening date is converted into the number of days between it and the observation point
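
The conversion can be sketched with the standard library as follows. The observation point of 2017-06-01 and the helper name `days_before` are assumptions made for this example; only the date string format comes from the text above.

```python
from datetime import datetime

def days_before(date_string, observation_point, fmt="%Y-%m-%d %H:%M:%S"):
    """Number of whole calendar days from the given date to the observation point."""
    opened = datetime.strptime(date_string, fmt)
    # Compare calendar dates so the time of day does not shorten the count.
    return (observation_point.date() - opened.date()).days

observation_point = datetime(2017, 6, 1)  # assumed observation point
days_open = days_before("2017-04-01 12:00:05", observation_point)
```

Here `days_open` is 61, since April has 30 days and May has 31.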

4. Constructing features of churn behavior

1. Internal data

  • The internal data has rich transaction details, including the volatility of local currency current savings, the average monthly and daily balance of local currency current savings, and the total number of phone banking transactions
  • Features that can be constructed:

    The ratio between the amounts of different transaction types: for example, the ATM transaction amount divided by the counter transaction amount
    The average amount of a single transaction: the total transaction amount / the total number of transactions
    The ratio of the number of transactions of a certain type to the total number of transactions

  • Some of the information is redundant and should be eliminated as the situation requires
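
The ratio features listed above can be sketched as plain Python. All field names (`atm_amt`, `counter_amt`, `total_amt`, `total_cnt`, `atm_cnt`) are hypothetical stand-ins for the real columns, and returning `None` for a zero denominator is just one possible convention.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning None when the denominator is zero."""
    return numerator / denominator if denominator else None

def derive_features(row):
    """Build ratio features from one customer's aggregated transactions."""
    return {
        # Amount ratio between two transaction types.
        "atm_to_counter_amt_ratio": safe_ratio(row["atm_amt"], row["counter_amt"]),
        # Average amount of a single transaction.
        "avg_amt_per_txn": safe_ratio(row["total_amt"], row["total_cnt"]),
        # Share of one transaction type in the total number of transactions.
        "atm_cnt_share": safe_ratio(row["atm_cnt"], row["total_cnt"]),
    }

# Hypothetical customer record; the field names are assumptions.
customer = {"atm_amt": 3000, "counter_amt": 1500,
            "total_amt": 4500, "total_cnt": 10, "atm_cnt": 6}
features = derive_features(customer)
```

Customers with no counter transactions, or no transactions at all, get `None` instead of raising a division error, which leaves the choice of imputation to the missing-value step.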

2. External data

  The external data contains the customer's records at the telecom operator:

  • Talk time and frequency
  • Call details
  • Specific call behavior
  • Other information

  The derived features are as follows:
[Figure: features derived from the external data]

Origin blog.csdn.net/weixin_46649052/article/details/114366393