Scorecard Model Development - Qualitative Indicator Screening

Quantitative indicators are numerical, so regression methods can be used to screen them. But what about qualitative indicators?
R offers a convenient way to calculate information value (IV). Using the InformationValue package, we can compute the IV of each qualitative indicator, use it as a measure of that indicator's importance, and select the indicators with high predictive power.
Many readers may not be familiar with information value, so here is a rough description:
IV is a common measure of the strength of association between two nominal variables, one of which is binary.
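To make the idea concrete, here is a minimal sketch that computes IV by hand in base R, using the standard definition: for each level, take the share of non-defaults and the share of defaults, form the weight of evidence (WOE) as the log of their ratio, and sum the share differences weighted by WOE. The helper name iv_by_hand and the toy data are made up for illustration and are not part of the InformationValue package. The sketch assumes every level contains at least one default and one non-default.

```r
# Hypothetical helper (not from any package): IV of a nominal variable x
# against a 0/1 target y, where 1 means default.
# Assumes every level of x has at least one 0 and one 1, otherwise the
# log ratio is undefined.
iv_by_hand <- function(x, y) {
  tab <- table(x, y)                        # rows: levels of x; columns: "0", "1"
  pct_good <- tab[, "0"] / sum(tab[, "0"])  # share of non-defaults in each level
  pct_bad  <- tab[, "1"] / sum(tab[, "1"])  # share of defaults in each level
  woe <- log(pct_good / pct_bad)            # weight of evidence per level
  sum((pct_good - pct_bad) * woe)           # information value
}

# Toy data, made up for illustration only.
toy_x <- c(rep("own", 50), rep("rent", 50))
toy_y <- c(rep(0, 40), rep(1, 10), rep(0, 20), rep(1, 30))

toy_iv <- iv_by_hand(toy_x, toy_y)
print(toy_iv)
```

Swapping the roles of good and bad flips the sign of every WOE value but leaves the IV unchanged, which is why it does not matter for IV whether 1 encodes default or non-default.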

library(InformationValue)
library(klaR)

# Encode the default status as 0/1, where 1 means default.
credit_risk <- ifelse(train_kfolddata[, "credit_risk"] == "good", 0, 1)
# Drop the original credit_risk column (column 21) and append the 0/1 version.
tmp <- train_kfolddata[, -21]
data <- cbind(tmp, credit_risk)
data <- as.data.frame(data)

# All nominal (qualitative) variables.
factor_vars <- c("status", "credit_history", "purpose", "savings", "employment_duration",
                 "personal_status_sex", "other_debtors", "property",
                 "other_installment_plans", "housing", "job", "telephone", "foreign_worker")

# Initialize the output data frame.
all_iv <- data.frame(VARS = factor_vars, IV = numeric(length(factor_vars)),
                     STRENGTH = character(length(factor_vars)), stringsAsFactors = FALSE)

for (factor_var in factor_vars) {
  # Compute the IV of each indicator once and reuse the result.
  iv <- InformationValue::IV(X = data[, factor_var], Y = data$credit_risk)
  all_iv[all_iv$VARS == factor_var, "IV"] <- as.numeric(iv)
  # Extract the textual description attached to each IV.
  all_iv[all_iv$VARS == factor_var, "STRENGTH"] <- attr(iv, "howgood")
}

# Sort by IV in descending order.
all_iv <- all_iv[order(-all_iv$IV), ]

Based on these results, the qualitative indicators selected as candidate model inputs are shown in Table 3.12.
[Image: Table 3.12]

In summary, the quantitative and qualitative indicators that enter the model during model development are shown in Table 3.13.
[Image: Table 3.13]

For the quantitative and qualitative indicators that enter the model, the next step is to segment the continuous variables (that is, to bin the quantitative indicators) so that the WOE of each quantitative indicator can be calculated, and to apply any necessary dimensionality reduction to the discrete variables. Continuous variables are usually segmented in one of two ways: equidistant segmentation or optimal segmentation. Equidistant segmentation divides a continuous variable into several intervals of equal width and then calculates the WOE of each interval separately. Optimal segmentation takes into account the distribution of the variable and how its ability to predict the default state variable changes across its range: following certain rules, values with similar properties are grouped into intervals of unequal width, yielding the segmentation with the strongest ability to predict the default state variable.
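As a rough illustration of equidistant segmentation, the sketch below bins a continuous variable into equal-width intervals with base R's cut() and computes the WOE of each interval. The variable names (amount, is_default, woe_table) and the toy values are made up for this example. The sketch assumes every interval contains at least one default and one non-default, which equal-width bins on real data do not guarantee.

```r
# Toy data, made up for illustration: a continuous indicator and a 0/1
# default flag (1 = default).
amount     <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
is_default <- c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1)

# Equidistant segmentation: a single number passed as `breaks` makes cut()
# split the range of the variable into that many equal-width intervals.
bins <- cut(amount, breaks = 3)

# Count non-defaults ("0") and defaults ("1") in each interval.
tab <- table(bins, is_default)
pct_good <- tab[, "0"] / sum(tab[, "0"])  # share of all non-defaults per interval
pct_bad  <- tab[, "1"] / sum(tab[, "1"])  # share of all defaults per interval

# WOE of each interval; undefined if an interval has no defaults or no
# non-defaults (assumed not to happen in this toy data).
woe_table <- data.frame(
  bin  = levels(bins),
  good = as.vector(tab[, "0"]),
  bad  = as.vector(tab[, "1"]),
  woe  = as.vector(log(pct_good / pct_bad))
)
print(woe_table)
```

When an interval ends up with no defaults or no non-defaults, a common workaround is to merge it with a neighbouring interval before computing WOE.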
