Chapter 4 Data Preprocessing and Feature Construction (Continued)

Application scorecard model

Data preprocessing and feature construction (continued)

  • Course description: The features of a logistic regression model must be numerical, so categorical variables cannot be put into the model directly and need to be encoded. In addition, to make the scoring model stable, numerical features need to be binned during modeling. Finally, before bringing the features into the model, we need to perform univariate and multivariate analysis on them.

Table of contents:

  • Feature binning
  • WOE and information value (IV)
  • Univariate analysis and multivariate analysis

1. Feature binning

The concept of binning

In the development of a scorecard model, variables need to be binned before they can be put into the model. Binning is defined as follows:

  • For numerical variables, divide them into a limited number of segments. For example, divide income into <5K, 5K~10K, 10K~20K, >20K, etc.
  • For categorical variables with many distinct values, combine the values into a smaller number of groups. For example, divide the provinces into {Beijing, Shanghai, Guangzhou}, {Jiangsu, Zhejiang, Anhui}, {Heilongjiang, Jilin, Liaoning}, {Fujian, Guangdong, Hunan}, and others.

Reasons for introducing binning in the scorecard model

  • The scoring results need to be reasonably stable. For example, when a borrower's overall credit quality is unchanged, the score should also remain stable: a small fluctuation in a single variable (such as income) should not move the score. With income binned as above, even if monthly income changes from 6K to 7K, the score does not change when other factors stay the same.
  • For categorical variables with many distinct values, the feature space explodes if they are not binned. For example, for the 31 provincial-level administrative regions (excluding Hong Kong, Macao and Taiwan), one-hot encoding produces 31 variables, while dummy-variable encoding produces 30; see the sketch below.
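To make the one-hot versus dummy-variable count concrete, here is a minimal pandas sketch (the toy data are ours, not from the original case):

```python
import pandas as pd

# Toy data: a "province" column with 3 of the 31 regions
df = pd.DataFrame({"province": ["Beijing", "Shanghai", "Guangdong", "Beijing"]})

# One-hot encoding: one indicator column per distinct value
onehot = pd.get_dummies(df["province"])

# Dummy encoding: drop one level as the reference category
dummy = pd.get_dummies(df["province"], drop_first=True)

print(onehot.shape[1], dummy.shape[1])  # 3 and 2 here; 31 and 30 for all regions
```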
Requirements for binning

Variables that do not need to be binned

For categorical variables with few distinct values, there is generally no need for binning.

Orderliness of binning results

For ordered variables (both numerical and ordered discrete variables, such as educational level), the binning must preserve the original ordering.

Balance of bins

Under stricter requirements, the proportions of the resulting bins should not differ too much. A common rule is that the smallest bin accounts for no less than 5% of the samples.

Monotonicity of binning

Under stricter requirements, after an ordered variable is binned, the bad sample rate of the bins must be monotonic in the bin order.

For example, after income is divided into <5K, 5K~10K, 10K~20K, >20K, the bad sample rates are 20%, 15%, 10%, and 5%, respectively.

Or, divide educational level into {below high school}, {high school, junior college}, {undergraduate, master}, and {doctorate and above}; the bad sample rates are 15%, 10%, 5%, and 1%, respectively.

Number of bins

The number of bins is usually required to be small, generally no more than 5 to 7.

Advantages and disadvantages of binning

Advantages:

Stability: after binning, fluctuations of the raw value within a bin do not affect the scoring result.

Missing-value handling: missing values can be treated as a separate bin, or merged with other values into one bin.

Outlier handling: outliers can be merged with other values into one bin.

No normalization needed: the variable changes from numerical to categorical, so there are no scale differences.

Disadvantages:

Some information is lost: after binning, a numerical variable becomes a small number of bins with limited values.

Encoding is required: the binned variable is categorical and cannot be put into a logistic regression model directly; numerical encoding is required.

Commonly used binning methods

a) Chi-square binning method

Among supervised binning algorithms, the chi-square binning method is commonly used. It relies on the chi-square distribution and the chi-square statistic to decide whether a factor affects the target variable. For example, to test whether gender affects the probability of default, a chi-square test can be used.

The null hypothesis H0 of the chi-square test is that there is no difference between the observed and expected frequencies, i.e., the factor does not affect the target variable. Under this hypothesis, the χ² statistic is computed; it measures the degree of deviation between the observed and theoretical values. From the χ² distribution and the degrees of freedom, we obtain the probability P of observing the current statistic, or a more extreme one, when H0 holds. If the P value is small, the observed values deviate too much from the theoretical values and the null hypothesis should be rejected, indicating a significant difference between the compared groups; otherwise, the null hypothesis cannot be rejected, and we cannot conclude that the actual situation represented by the sample differs from the theoretical assumption.

Calculation of the chi-square value:

χ² = Σᵢ Σⱼ (Aᵢⱼ − Eᵢⱼ)² / Eᵢⱼ,  i = 1, …, m;  j = 1, …, k

  • m: the number of values of the factor; k: the number of target categories
  • Aᵢⱼ: the observed frequency of target category j in group i of the factor
  • Eᵢⱼ: the expected frequency under the null hypothesis, Eᵢⱼ = (row total of group i) × (column total of category j) / N, where N is the total sample size

When the sample size is large, the χ² statistic approximately follows a chi-square distribution with (m−1)(k−1) degrees of freedom.

Chi-square test case

          Default    No default    Total
Male      120        200           320
Female    80         220           300

The overall default rate is (120+80)/(320+300) ≈ 32.25%.

If gender is unrelated to default, the default rates of men and women are the same, both 32.25%. Then:

The expected number of male defaults is 320 × 32.25% ≈ 104, and the expected number of male non-defaults is 320 − 104 = 216.

The expected number of female defaults is 300 × 32.25% ≈ 97, and the expected number of female non-defaults is 300 − 97 = 203.

Due to random factors, even if the hypothesis "gender is unrelated to default" holds, the observed numbers of male and female defaulters will not exactly equal 104 and 97. The idea of the chi-square test is to measure the probability that the difference between the expected and observed values is caused by chance alone. If this probability is small, the hypothesis "gender is unrelated to default" should be rejected, i.e., the default rates of men and women differ. This probability is described by the p-value corresponding to the chi-square statistic:

Since gender and default status each take two values, the chi-square test has (2−1)(2−1) = 1 degree of freedom. With these expected counts, χ² = 8.05, corresponding to a p-value ≈ 0.005, so gender has a significant impact on default behavior.
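The same test can be reproduced with scipy; a minimal sketch (scipy uses unrounded expected counts, so the statistic comes out near 8.3 rather than the hand-rounded 8.05 above):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: male, female; columns: default, no default
observed = np.array([[120, 200],
                     [80, 220]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p, dof)   # ≈ 8.34, p ≈ 0.004, dof = 1
print(expected)       # expected counts under independence
```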

Chi-square (ChiMerge) binning method (continued)

The ChiMerge method completes binning through bottom-up successive merging. At each step, the pair of adjacent intervals with the smallest chi-square value is the best candidate to merge. The core idea: if two intervals can be merged, the distributions of bad samples in them should be as close as possible, which means the chi-square value between the two intervals is the smallest. The steps of ChiMerge are as follows (a code sketch follows the termination conditions below):

  1. Sort the numerical variable and divide it into a relatively large number of initial intervals I₁, I₂, …, Iₙ
  2. For each pair of adjacent intervals, calculate the chi-square value of merging them: χ²(I₁, I₂), χ²(I₂, I₃), …, χ²(Iₙ₋₁, Iₙ)
  3. Find the smallest of these chi-square values; supposing it is χ²(Iⱼ, Iⱼ₊₁), merge Iⱼ and Iⱼ₊₁ into a new interval
  4. Repeat steps 2 and 3 until a termination condition is met

Common ChiMerge termination conditions are:

  1. After some merge, the p-value of the smallest chi-square value exceeds 0.9 (or 0.95, 0.99, etc.), or
  2. After some merge, the number of remaining intervals reaches a specified count (for example, 5, 10, 15, etc.)
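A compact ChiMerge sketch under the second termination condition (all names and the synthetic data are ours; a production version would also handle ties, empty bins, and special values):

```python
import numpy as np
import pandas as pd

def pair_chi2(a, b):
    """Chi-square statistic for two adjacent bins; a and b are (good, bad) counts."""
    obs = np.array([a, b], dtype=float)
    exp = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0, keepdims=True) / obs.sum()
    mask = exp > 0                      # a cell with zero expectation contributes 0
    return float((((obs - exp) ** 2)[mask] / exp[mask]).sum())

def chimerge(x, y, max_bins=5, init_bins=20):
    """Bottom-up ChiMerge; x is numeric, y is 0/1 (1 = bad). Returns bin edges."""
    # Step 1: sort and split into many initial quantile-based intervals
    edges = list(np.unique(np.quantile(x, np.linspace(0, 1, init_bins + 1))))
    labels = pd.cut(x, edges, include_lowest=True)
    tab = pd.crosstab(labels, y)                  # per-interval (good, bad) counts
    counts = [list(row) for row in tab.values]

    # Step 4: repeat steps 2 and 3 until the target number of bins is reached
    while len(counts) > max_bins:
        # Step 2: chi-square value of every adjacent pair
        chis = [pair_chi2(counts[i], counts[i + 1]) for i in range(len(counts) - 1)]
        # Step 3: merge the adjacent pair with the smallest chi-square value
        i = int(np.argmin(chis))
        counts[i] = [a + b for a, b in zip(counts[i], counts[i + 1])]
        del counts[i + 1]
        del edges[i + 1]                          # drop the edge between them
    return edges

# Example: bin a synthetic income-like variable against a synthetic label
rng = np.random.default_rng(0)
income = rng.lognormal(mean=9, sigma=0.5, size=2000)
bad = (rng.random(2000) < 1 / (1 + income / 8000)).astype(int)
print(chimerge(income, bad, max_bins=5))
```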


Merging bins when the bad sample rate is not monotonic

As mentioned earlier, when chi-square binning finishes, the bad sample rate of the bins does not necessarily satisfy the monotonicity requirement, and further merging is needed. There are two options:

  1. Rerun chi-square binning with fewer bins. For example, if the bad sample rate is not monotonic with 5 bins, set the number of bins to 4 and check monotonicity again. If it is satisfied, stop; if not, reduce the number of bins further. The minimum number of bins is 2, because with only two bins monotonicity loses its meaning.
  2. Merge a bin that violates monotonicity with its previous or its subsequent bin. For example, if the bad sample rate of the third bin is lower than that of both of its neighbors, it needs to be merged. Whether to merge with the previous or the subsequent bin is decided by the following principles:
  • Prefer the merge that reduces the degree of non-monotonicity. For example, if merging the third and fourth bins restores overall monotonicity, adopt that merge.
  • If both merges alleviate non-monotonicity, choose the "better" one, judged from two angles. For instance, merging bins 2 and 3 is better than merging bins 3 and 4 if:
    • the chi-square value after merging bins 2 and 3 is lower than the chi-square value after merging bins 3 and 4, or
    • the bin proportions after merging bins 2 and 3 are more balanced than those after merging bins 3 and 4.

Measuring the uniformity of the distribution after binning

  • Suppose the variable is divided into m bins, with bin proportions p₁, p₂, …, pₘ.
  • The following quantity can be used to measure the uniformity of the proportions:

Balance = Σᵢ pᵢ²,  i = 1, …, m

  • By the Cauchy–Schwarz inequality, Balance reaches its minimum 1/m when all bins are equal, pᵢ = 1/m. When one pᵢ equals 1 and the rest are 0, Balance reaches its maximum of 1. So the smaller the Balance, the more uniform the binning.
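A two-line check of this measure (the proportions are illustrative):

```python
import numpy as np

def balance(p):
    """Sum of squared bin proportions: 1/m for a uniform split, 1 for total concentration."""
    return float((np.asarray(p, dtype=float) ** 2).sum())

print(balance([0.25, 0.25, 0.25, 0.25]))  # 0.25 = 1/m, perfectly uniform
print(balance([0.97, 0.01, 0.01, 0.01]))  # ≈ 0.94, highly concentrated
```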

Binning with special values

In real business data, besides normal observations there are sometimes special values, such as missing values. From the earlier analysis, some variables in this case contain missing values. In a scorecard model, a missing value is usually treated as a special value. The binning of continuous variables must exclude these special values in advance; that is, special values do not participate in binning.

When a continuous variable has special values, each special value is treated as a separate bin, the remaining normal values are binned, and the number of bins for the normal values is the preset number minus the number of special values. Note:

  • Since a special value cannot be ordered against the other values, its bad sample rate is not considered when checking the monotonicity of the bad sample rate
  • When the proportion of a special value is small (for example, below 5%), consider merging it with a bin of normal values, usually the smallest bin or the largest bin

Binning of categorical (unordered) variables

The ChiMerge method introduced above applies to numerical variables such as income and age, and the binning process must preserve the ordering of the original values. For an unordered categorical variable with many distinct values, numerical encoding is required before ChiMerge binning, replacing the original categorical values with numbers. The commonly used encoding is the bad sample rate corresponding to each categorical value.

For example, province is a commonly used variable in scoring models. For the 31 provincial-level administrative regions (excluding Hong Kong, Macao and Taiwan), we replace each region with its bad sample rate in the sample. Under this conversion, the categorical variable becomes a numerical one, and ChiMerge binning can then be applied. The resulting bins might be {Beijing, Shanghai, Guangzhou, Shenzhen}, {Jiangsu, Zhejiang, Shandong, Fujian}, {Others}, etc.; a sketch of the encoding follows.
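A minimal sketch of the bad-rate encoding (provinces and labels made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "province": ["Jiangsu", "Zhejiang", "Jiangsu", "Guangdong", "Zhejiang", "Guangdong"],
    "bad":      [0, 1, 1, 0, 0, 1],   # 1 = bad sample
})

# Replace each province with its bad sample rate observed in the sample
bad_rate = df.groupby("province")["bad"].mean()
df["province_encoded"] = df["province"].map(bad_rate)

# The encoded column is numeric, so the numerical ChiMerge procedure applies to it
print(df)
```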

Binning of categorical (ordinal) variables

For an ordered categorical variable, such as education = {primary school, junior high school, high school, junior college, undergraduate, master, doctorate}, first sort the values by their natural order; then the variable can be binned with the same ChiMerge procedure used for numerical variables. The final binning of education might be {primary school, junior high school, high school}, {junior college, undergraduate}, {master, doctorate}.

Advantages and disadvantages of the ChiMerge binning method


2. WOE and information value (IV)

WOE encoding

Encoding replaces non-numerical values with numerical ones so that the model can perform mathematical operations on them. For example, a color can be encoded with three integers between 0 and 255. In scorecard development, every variable becomes a set of bins after binning, and the bins must be encoded before the next modeling step. In scorecard models, WOE (Weight of Evidence) encoding is commonly used after binning. The calculation formula is as follows:

WOEᵢ = ln[ (Gᵢ/G) / (Bᵢ/B) ]

where Gᵢ and Bᵢ are the numbers of good and bad samples in bin i, and G and B are the total numbers of good and bad samples in the whole sample.

The meaning of WOE encoding

Rewriting the WOE formula as

WOEᵢ = ln[ (Gᵢ/Bᵢ) / (G/B) ]

we have:

  1. The sign property of WOE: WOEᵢ > 0 exactly when the good-to-bad odds in bin i exceed the overall odds. That is, if the WOE of a bin is positive, the bad sample rate of that bin is lower than the average bad sample rate of the whole sample, and good samples are relatively more likely in it.

  2. The monotonicity property of WOE: writing the bad sample rate of bin i as rᵢ, we have Gᵢ/Bᵢ = (1 − rᵢ)/rᵢ, which decreases as rᵢ increases. So the monotonicity of WOE is opposite to the monotonicity of the bad sample rate.

Points to note when using WOE encoding

  • From the WOE formula, for a bin to be meaningful, both Gᵢ and Bᵢ must be greater than 0. This means that in the preceding binning step, every bin must contain both good and bad samples.
  • In the logarithm above, the good and bad sample shares are in the numerator and denominator respectively. The opposite convention (bad share over good share) is also possible, but all variables within one model must be handled the same way. The chosen convention also determines the expected signs of the variables in the subsequent logistic regression model.
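A minimal WOE computation under the good-over-bad convention above (function and toy data are ours):

```python
import numpy as np
import pandas as pd

def woe_table(bins, y):
    """WOE per bin with the good share in the numerator (1 = bad in y),
    so positive WOE means a below-average bad sample rate."""
    tab = pd.crosstab(bins, y)
    good, bad = tab[0], tab[1]
    # Every bin must contain both good and bad samples for WOE to be defined
    if (good == 0).any() or (bad == 0).any():
        raise ValueError("every bin needs both good and bad samples")
    woe = np.log((good / good.sum()) / (bad / bad.sum()))
    return pd.DataFrame({"good": good, "bad": bad, "woe": woe})

bins = pd.Series(["<5K", "5K~10K", "<5K", ">20K", "5K~10K", ">20K", "<5K", ">20K"])
y = pd.Series([1, 0, 1, 0, 1, 1, 0, 0])
print(woe_table(bins, y))
```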

Advantages and disadvantages of WOE encoding

Advantages of WOE encoding

Improves model performance: encoding each bin by its log-odds relative to the whole sample improves the predictive accuracy of the model

Unifies the scale of variables: empirically, WOE-encoded values generally fall between −4 and 4

Invariance under stratified sampling: if modeling requires stratified sampling of good and bad samples, the WOE computed after sampling is consistent with the WOE computed without sampling, because sampling within each class leaves the within-class shares Gᵢ/G and Bᵢ/B unchanged

Disadvantages of WOE encoding

Every bin must contain both good and bad samples: as explained above

Invalid for multi-class labels: if the target variable takes more than 2 values, the WOE of the bins cannot be calculated

Feature information value (IV)

In scorecard modeling, measuring the importance of variables is necessary work. In the early stage of feature engineering we can often derive a large number of variables, but there is no guarantee that they are all valuable for model development. By measuring importance, we can select the relatively more important variables, which also reduces dimensionality for subsequent analysis. Here, importance is measured by the Information Value (IV), calculated as follows:

IV = Σᵢ (Gᵢ/G − Bᵢ/B) × WOEᵢ,  i = 1, …, m

From this formula, the IV of a variable is a weighted sum of the WOE values of its bins, with weight (Gᵢ/G − Bᵢ/B). As mentioned earlier, WOE can also be computed with the bad share over the good share; in that case the weight is flipped to (Bᵢ/B − Gᵢ/G) accordingly, and IV is unchanged. Regarding IV, we have:

Non-negativity: if Gᵢ/G > Bᵢ/B, then WOEᵢ > 0 and the weight (Gᵢ/G − Bᵢ/B) > 0; if Gᵢ/G < Bᵢ/B, then both are negative. In either case the product is positive, so every term is non-negative and IV > 0 (IV = 0 only if every bin's good and bad shares coincide).

Weight: WOE reflects how much the good-to-bad ratio in each bin exceeds the good-to-bad ratio of the entire sample, while IV reflects how significant that excess is given the size of the bin. For example, suppose the good and bad samples in one bin account for 2% and 1% (of all good and all bad samples, respectively), and in another bin for 20% and 10%. In terms of WOE the two bins are identical, both ln(2). But the weights are (2% − 1%) = 1% and (20% − 10%) = 10%, respectively, so the latter bin is more significant.

Regarding IV, we need to pay attention to several points:

  1. IV measures the importance of the feature as a whole, not of each bin. The larger the IV, the more important the variable. However, IV should not be too large either, otherwise there may be a risk of overfitting.
  2. Like WOE, IV requires every bin to contain both good and bad samples.
  3. IV is affected not only by the importance of the variable but also by the binning scheme. Generally, the finer the binning granularity, the higher the IV, so the binning must be reasonable; IVs are only comparable when the variables have roughly similar numbers of bins.
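Continuing the same conventions, a sketch of the IV computation:

```python
import numpy as np
import pandas as pd

def information_value(bins, y):
    """IV = sum over bins of (good share - bad share) * WOE (1 = bad in y)."""
    tab = pd.crosstab(bins, y)
    good_share = tab[0] / tab[0].sum()
    bad_share = tab[1] / tab[1].sum()
    woe = np.log(good_share / bad_share)
    return float(((good_share - bad_share) * woe).sum())

bins = pd.Series(["<5K", "5K~10K", "<5K", ">20K", "5K~10K", ">20K", "<5K", ">20K"])
y = pd.Series([1, 0, 1, 0, 1, 1, 0, 0])
print(information_value(bins, y))   # non-negative by construction
```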

3. Univariate analysis and multivariate analysis

Univariate analysis

After variable binning, WOE encoding, and IV calculation, we need to perform univariate analysis. Generally, it considers two aspects:

  1. The importance of the variable, which can be judged from its IV value; different IV ranges reflect different degrees of importance. A commonly used rule of thumb (exact cutoffs vary in practice) is: IV below 0.02, not predictive; 0.02 to 0.1, weak; 0.1 to 0.3, medium; above 0.3, strong.

But when the IV is abnormally high, for example above 1, be alert that the binning of the variable may be unstable.

  2. The stability of the variable's distribution. For a suitable variable, the proportions of the bins should not differ too much. If the proportion of one bin is much lower than that of the others, the stability of the variable is weaker.

Univariate analysis thus considers both importance and distribution stability. Usually, variables whose IV exceeds a threshold (such as 0.2) are selected first, and then variables with more uniform binning are preferred, as in the sketch below.
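A trivial selection sketch (variable names and IV values are hypothetical):

```python
# Hypothetical per-variable IVs computed as in the previous section
ivs = {"income": 0.35, "age": 0.12, "province": 0.28, "tenure": 0.05}

iv_threshold = 0.2
selected = [name for name, iv in ivs.items() if iv >= iv_threshold]
print(selected)   # ['income', 'province'] — then check binning uniformity on these
```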


Multivariate analysis

After univariate analysis is completed, we also need to examine the variable set as a whole, using multivariate analysis to further reduce the number of variables and form a globally better variable system. Multivariate analysis examines the variables from the following two perspectives and completes the selection:

  • Pairwise linear correlation between variables
  • Multicollinearity between variables

Strong pairwise linear correlation between variables is not allowed, mainly because:

  • If two variables are strongly linearly correlated, there is information redundancy between them. Keeping both in the model is unnecessary and increases the burden of model development, deployment, and maintenance.
  • Strong linear correlation can even affect the parameter estimation of the regression model: when two variables are strongly linearly correlated, the parameter estimates can deviate substantially. A sketch of a pairwise correlation screen follows.
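In the sketch below, x2 is built as a near-copy of x1 (synthetic data; the 0.7 cutoff is a common but not universal choice):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
woe_df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.1 * rng.normal(size=500),   # nearly a copy of x1
    "x3": rng.normal(size=500),
})

corr = woe_df.corr().abs()
threshold = 0.7
pairs = [(a, b) for a in corr.columns for b in corr.columns
         if a < b and corr.loc[a, b] > threshold]
print(pairs)   # [('x1', 'x2')] — drop whichever of the two has the lower IV
```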


Multivariate analysis (continued)

After completing the pairwise linear correlation test between variables, we also need to test for multicollinearity. Multicollinearity means that one variable in a group is strongly linearly correlated with a linear combination of the other variables. As before, strong multicollinearity implies information redundancy and affects the parameter estimation of the model. Multicollinearity is usually measured by the variance inflation factor (VIF), calculated as follows:

VIFⱼ = 1 / (1 − Rⱼ²)

where Rⱼ² is the coefficient of determination obtained by regressing the j-th variable on the remaining variables.

Generally, a threshold of 10 is used: VIF > 10 is taken to indicate multicollinearity among the variables. In that case, remove variables one at a time and recompute the VIFs of the remaining ones; if removing a certain variable brings the remaining VIFs below 10, eliminate, among the variables involved, the one with the lower IV. If removing one variable at a time does not reduce the VIF, try removing two at a time, and so on, until no multicollinearity remains among the variables. A VIF sketch follows.
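A VIF sketch using statsmodels (synthetic data; x3 is built as a near-linear combination of x1 and x2, so the VIFs come out large):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
X = pd.DataFrame({"x1": x1, "x2": x2,
                  "x3": x1 + x2 + 0.1 * rng.normal(size=500)})

# VIF_j = 1 / (1 - R_j^2); add an intercept so the auxiliary regressions are proper
Xc = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns)
print(vif)   # all three far above 10 here — remove and recompute step by step
```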


Origin blog.csdn.net/weixin_42224488/article/details/109667117