Introduction
The complete code for the sections below is available on GitHub: https://github.com/Libra-1023/data-mining/blob/master/Bank_customer_churn/outlier_missingvalues_date_process.ipynb
1. Treatment of extreme values
Extreme values, also called outliers, tend to distort predictions and reduce model accuracy. Their influence is particularly large in regression models, so before using such a model we need to detect and handle them.
1. The importance of extreme value (outlier) detection
- You need to judge for yourself how much extreme values affect the modeling, and choose a treatment that fits the actual problem
- Why detection matters: when extreme values are present, the model's estimates and predictions can show large bias and variance
- You can also choose models that are less sensitive to extreme values, such as KNN or decision trees
As an example:
Visualization shows that extreme values generally bias the model. In linear regression, for instance, extreme values significantly affect the model's parameter estimates.
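To make the effect on parameter estimates concrete, here is a minimal sketch using made-up data (the values are assumptions for illustration, not taken from the bank data set). It fits a straight line with `numpy.polyfit` twice: once on clean data and once after a single point is replaced by an extreme value.

```python
import numpy as np

# Hypothetical data: y = 2x + 1 exactly, for x = 0..9
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Fit a straight line to the clean data
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# Replace a single point with an extreme value and refit
y_outlier = y.copy()
y_outlier[-1] = 100.0  # the value on the true line would be 19
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, deg=1)

print(f"clean:   slope={slope_clean:.2f}, intercept={intercept_clean:.2f}")
print(f"outlier: slope={slope_outlier:.2f}, intercept={intercept_outlier:.2f}")
```

One extreme point out of ten is enough to pull the fitted slope far away from the true value of 2, because least squares penalizes large residuals quadratically.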
2. Handling extreme values
How should extreme values be handled in regression models?
- Cap the extreme value at some normal value, for example by replacing it with the 95% quantile (truncation).
  Example: when overdrafts push credit card limit utilization above 100%, use 100% instead
- Delete the extreme values
  Example: the very few cardholders who are over 85 years old
- Build a separate model
  Example: customers whose credit card limit is extremely high
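The three treatments above can be sketched with pandas as follows. The column names, the sample values, and the 100,000 cut-off for "extremely high" limits are all assumptions made up for this illustration.

```python
import pandas as pd

# Hypothetical cardholder data; names and values are illustrative only
df = pd.DataFrame({
    "age": [25, 40, 33, 90, 52, 61],
    "utilization": [0.20, 0.95, 1.30, 0.50, 1.10, 0.75],  # used amount / limit
    "credit_limit": [5000.0, 8000.0, 6000.0, 7000.0, 500000.0, 9000.0],
})

# 1. Truncate: cap utilization above 100% at 100%
df["utilization_capped"] = df["utilization"].clip(upper=1.0)

# Variant of truncation: cap at the 95% quantile of the observed values
q95 = df["credit_limit"].quantile(0.95)
df["credit_limit_capped"] = df["credit_limit"].clip(upper=q95)

# 2. Delete: drop the few cardholders older than 85
df_without_oldest = df[df["age"] <= 85]

# 3. Model separately: split off customers with extremely high limits
#    (the 100,000 threshold is an arbitrary assumption)
high_limit = df[df["credit_limit"] > 100_000]
regular = df[df["credit_limit"] <= 100_000]
```

Which of the three to use depends on whether the extreme records are errors, rare but valid cases, or a distinct customer segment.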
3. Extreme value detection: the 3σ criterion
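The 3σ criterion flags a value as extreme when it lies more than three standard deviations from the mean, which implicitly assumes the data is roughly normally distributed. Below is a minimal sketch; the helper name `three_sigma_mask` and the simulated balances are hypothetical and only serve the illustration.

```python
import numpy as np

def three_sigma_mask(values: np.ndarray) -> np.ndarray:
    """Return a boolean mask that is True where a value lies more than
    three standard deviations away from the mean."""
    mean = values.mean()
    std = values.std()
    return np.abs(values - mean) > 3 * std

# Simulated balances: 200 ordinary values plus one injected extreme value
rng = np.random.default_rng(0)
balances = rng.normal(loc=1000.0, scale=50.0, size=200)
balances = np.append(balances, 5000.0)

mask = three_sigma_mask(balances)
outliers = balances[mask]
print(outliers)
```

Note that the extreme values themselves inflate the mean and standard deviation used for the threshold, so on small samples the criterion can miss them.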
2. Treatment of missing values
1. Types of missing values
- Missing completely at random (MCAR): the missingness is unrelated to any other variable, for example: missing marital status
- Missing at random (MAR): the missingness depends on other variables, for example: a missing "spouse name" depends on "marital status"
- Missing not at random (MNAR): the missingness depends on the missing value itself, for example: high-income people are unwilling to report family income
2. How to deal with missing values
- Delete attributes or samples with missing values (wasteful; affordable only when data is plentiful)
- Imputation (usually used when values are missing completely at random and the missing rate is low)
- Treat "missing" as an attribute value (usually used for values missing not at random)
3. Treatment of missing values of continuous variables
- For values missing completely at random, when the missing rate is not high, you can:
  1. Fill the gaps with a constant, such as the mean. If extreme values are present, consider removing them before computing the mean
  2. Randomly sample from the non-missing values and assign the samples to the missing entries
- For values missing at random that depend on some other variable, you can apply the methods for completely random missingness within each level of that variable
  For example: the variable "income" depends on work status. When "work status" = "work", a missing income can be replaced by the mean of the known incomes of all "working" cardholders, or by a random sample from the known incomes of all "working" cardholders
- For values missing not at random, you can treat "missing" as an attribute value and convert the variable into a categorical variable
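The strategies above can be sketched in pandas as follows. The column names, the sample values, and the income band boundary of 4000 are assumptions invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data; names and values are illustrative only
df = pd.DataFrame({
    "work_status": ["work", "work", "work", "retired", "retired", "work"],
    "income": [5000.0, np.nan, 7000.0, 2000.0, np.nan, 6000.0],
})
missing = df["income"].isna()

# Completely random, option 1: fill with a constant such as the mean
income_mean_filled = df["income"].fillna(df["income"].mean())

# Completely random, option 2: draw replacements from the observed values
rng = np.random.default_rng(42)
observed = df.loc[~missing, "income"].to_numpy()
income_sampled = df["income"].copy()
income_sampled.loc[missing] = rng.choice(observed, size=int(missing.sum()))

# Random, depends on work status: fill with the mean of the same group
group_mean = df.groupby("work_status")["income"].transform("mean")
income_group_filled = df["income"].fillna(group_mean)

# Not at random: bin the variable and keep "missing" as its own category
income_band = pd.cut(df["income"], bins=[0, 4000, np.inf], labels=["low", "high"])
income_band = income_band.cat.add_categories("missing").fillna("missing")
```

The group-wise fill gives the missing "work" row the mean income of working cardholders rather than the overall mean, which is the point of filling within the same layer.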
4. Treatment of missing values of categorical variables
- When the missing rate is low:
  fill with the most frequent category, or with a random sample drawn from the known values
- When the missing rate is high:
  consider excluding the variable (feature)
- When the missing rate is between "low" and "high":
  treat "missing" as a category of its own
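A minimal sketch of the fill options for a categorical variable, using a made-up "industry" column. The source gives no numeric thresholds for "low" and "high" missing rates, so none are hard-coded here; the choice is left to the reader.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries
industry = pd.Series(["retail", "finance", None, "retail", "tech", None, "retail"])
is_missing = industry.isna()

# Share of missing values; use it to decide which strategy applies
missing_rate = is_missing.mean()

# Low missing rate, option 1: fill with the most frequent category
most_frequent = industry.mode().iloc[0]
filled_mode = industry.fillna(most_frequent)

# Low missing rate, option 2: sample at random from the known values
rng = np.random.default_rng(0)
known = industry.dropna().to_numpy()
filled_sampled = industry.copy()
filled_sampled.loc[is_missing] = rng.choice(known, size=int(is_missing.sum()))

# Moderate missing rate: keep "missing" as a category of its own
filled_category = industry.fillna("missing")
```

Sampling from the known values preserves the observed category frequencies, whereas mode filling inflates the largest category.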
3. Treatment of special variables
1. Categorical variables
Categorical variables express categories. They usually have no notion of "order" and take a limited set of values.
In this data set they include: gender, industry, type of credit card, etc.
- Some models can read categorical variables directly
  - Decision trees
- Some models cannot read categorical variables directly
  - Regression models
  - Neural networks
  - Models based on a "distance" measure (SVM, KNN, etc.); normalize before computing the distance
- When categorical variables cannot be fed into the model directly, they must be encoded: the original values are replaced with numerical values
  - One-hot encoding: the categorical variable is converted into a sparse matrix
  - Dummy encoding
  - Concentration encoding
  - WOE encoding: more often used with decision trees
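As a small illustration of the first two encodings, the sketch below uses `pandas.get_dummies` on a made-up "card_type" column (the column name and values are assumptions). One-hot encoding creates one indicator column per category, while dummy encoding drops one category as the reference level.

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"card_type": ["gold", "silver", "platinum", "gold"]})

# One-hot encoding: one indicator column per category (k columns)
one_hot = pd.get_dummies(df["card_type"], prefix="card_type")

# Dummy encoding: drop the first category as the reference (k - 1 columns)
dummy = pd.get_dummies(df["card_type"], prefix="card_type", drop_first=True)

print(one_hot.columns.tolist())
print(dummy.columns.tolist())
```

Dummy encoding avoids perfectly collinear columns, which matters for regression models that include an intercept.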
2. Date/time variable
- Often appears as a string, for example: "2017-04-01 12:00:05"
- Is essentially numeric
- Can be converted into a number of days relative to a base date. For example, taking the observation point as the base, every account opening date is converted into the number of days from the observation point
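The conversion can be sketched with pandas as follows. The sample dates and the observation point of 2017-06-30 are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical account opening timestamps stored as strings
open_dates = pd.Series([
    "2017-04-01 12:00:05",
    "2016-12-25 08:30:00",
    "2015-01-01 00:00:00",
])

# Assumed observation point
observation_point = pd.Timestamp("2017-06-30")

# Parse the strings, drop the time of day, and count whole calendar days
open_dt = pd.to_datetime(open_dates).dt.normalize()
days_since_open = (observation_point - open_dt).dt.days

print(days_since_open.tolist())
```

Dropping the time of day with `dt.normalize()` before subtracting means the result counts calendar days rather than elapsed 24-hour periods.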
4. Constructing churn-behavior features
1. Internal data
- Rich internal transaction details, including the volatility of local-currency current savings, the monthly and daily average balances of local-currency current savings, and the total number of phone banking transactions
- Features that can be constructed:
  - The ratio between the amounts of different transaction types: the transaction amount at ATMs relative to the transaction amount at the counter
  - The average amount per transaction: total transaction amount / total number of transactions
  - The ratio of the number of transactions of a certain type to the total number of transactions
- The information is redundant and should be pruned as the situation requires
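The three kinds of ratio features listed above can be sketched as follows. The column names, the sample values, and the helper `safe_ratio` are hypothetical; the real data set's field names may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical per-customer transaction summary
df = pd.DataFrame({
    "atm_amount": [1200.0, 0.0, 300.0],
    "counter_amount": [400.0, 500.0, 0.0],
    "atm_count": [6, 0, 3],
    "counter_count": [2, 5, 0],
})

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two Series, returning NaN where the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)

total_amount = df["atm_amount"] + df["counter_amount"]
total_count = df["atm_count"] + df["counter_count"]

# Ratio between the amounts of two transaction types
df["atm_to_counter_amount"] = safe_ratio(df["atm_amount"], df["counter_amount"])

# Average amount per transaction
df["avg_amount_per_txn"] = safe_ratio(total_amount, total_count)

# Share of one transaction type in the total number of transactions
df["atm_count_share"] = safe_ratio(df["atm_count"], total_count)
```

Ratios with a zero denominator come out as NaN here, so they then fall under the missing-value treatment discussed earlier instead of producing infinities.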
2. External data
The external data contains the customer's records at the telecom operator:
- Talk time and frequency
- Call details
- Specific call behavior
- Other information
The derived features are as follows: