[Python common functions] A thorough guide to the toad.quality function in Python

Mastery comes from accumulation, and learning Python is no exception. Knowing a language's common functions well lets you work through problems with confidence and find a good solution quickly. This article examines the toad.quality function so that you can understand how it works in a short time, and you can revisit it in spare moments to reinforce what you have learned.


  

1. Install the toad package

  
quality is a function in the toad library, so the toad package must be installed before it can be called. Open cmd and run the following command:

pip install toad

If the installation succeeds, pip prints a line beginning with "Successfully installed".

  
  

2. quality function definition

  
The quality function calculates four indicators for each variable in a data frame: iv, gini, entropy, and unique. For the definition of iv, see the article "IV and WOE in risk control modeling"; the definitions of gini and entropy are easy to find online. These three indicators measure how well a variable distinguishes whether an event occurs. unique counts the number of distinct values of a variable.
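To make the four indicators concrete, here is a minimal sketch that computes them by hand for one already-discrete variable using only pandas and numpy. The helper name iv_gini_entropy and the toy data are hypothetical, and the formulas are the textbook definitions (WOE-based iv, group-weighted Gini impurity, and natural-log entropy); toad's own implementation may differ in details such as how it bins continuous variables.

```python
import numpy as np
import pandas as pd


def iv_gini_entropy(feature: pd.Series, target: pd.Series) -> dict:
    """Textbook iv, conditional gini, conditional entropy and unique count
    for one discrete feature against a 0/1 target (1 = bad)."""
    df = pd.DataFrame({"x": feature, "y": target})
    grouped = df.groupby("x")["y"].agg(bad="sum", total="count")
    grouped["good"] = grouped["total"] - grouped["bad"]

    # Share of all bad / good samples that falls into each group
    bad_share = grouped["bad"] / grouped["bad"].sum()
    good_share = grouped["good"] / grouped["good"].sum()
    woe = np.log(bad_share / good_share)
    iv = float(((bad_share - good_share) * woe).sum())

    # Impurity of the target inside each group, weighted by group size
    weight = (grouped["total"] / grouped["total"].sum()).to_numpy()
    p_bad = (grouped["bad"] / grouped["total"]).to_numpy()
    probs = np.stack([p_bad, 1 - p_bad])
    gini = float((weight * (1 - (probs ** 2).sum(axis=0))).sum())
    safe = np.where(probs > 0, probs, 1.0)  # treat 0 * log(0) as 0
    entropy = float((weight * -(probs * np.log(safe)).sum(axis=0)).sum())

    return {"iv": iv, "gini": gini, "entropy": entropy,
            "unique": int(feature.nunique())}


# Toy data: group "a" is mostly bad, group "b" is mostly good
x = pd.Series(["a"] * 4 + ["b"] * 4)
y = pd.Series([1, 1, 1, 0, 0, 0, 0, 1])
result = iv_gini_entropy(x, y)
print(result)
```

Note that a group containing no bad samples or no good samples makes the WOE infinite in this sketch; real implementations usually guard against that case.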
  
Its basic call syntax is as follows:

import toad

toad.quality(dataframe, target='target', cpu_cores=0, iv_only=False)

dataframe: the dataset (a pandas DataFrame).
  
target: the target column, i.e. the dependent variable.
  
cpu_cores: the maximum number of CPU cores to use; 0 means use all CPUs, and -1 means use all CPUs but one.
  
iv_only: Boolean; whether to calculate only the iv column. The default is False.
  
  

3. Examples of the quality function

  

1 Import libraries and load data

  
Background: we need to analyze the multi-platform borrowing, related risk, court enforcement, risk list, and overdue information of 7252 customers in order to build a pre-loan application scorecard (A card). Before building the scorecard, the customer information must be screened to select the variables most strongly related to overdue behavior.
  
First read the data. The code is as follows:

# [1] Read the data
import os
import toad
import numpy as np
import pandas as pd

os.chdir(r'F:\公众号\70.数据分析报告')
date = pd.read_csv('testtdmodel1.csv', encoding='gbk')
date.head(3)

os.chdir: sets the working directory to the folder where the data is stored.
  
pd.read_csv: reads the data.
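If you prefer not to change the working directory, the same read can be done by passing a full path to pd.read_csv. The sketch below is only an illustration: the temporary folder, the file name demo_gbk.csv, and its two rows are hypothetical stand-ins for the article's data, written here so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the article's data folder and file
data_dir = Path(tempfile.mkdtemp())
csv_path = data_dir / "demo_gbk.csv"
csv_path.write_text("申请状态,y\n通过,0\n拒绝,1\n", encoding="gbk")

# Pass the full path and the file's encoding instead of calling os.chdir
demo = pd.read_csv(csv_path, encoding="gbk")
print(demo.head(3))
```

The encoding argument matters because the file contains Chinese column names saved as GBK; reading it with the wrong encoding would fail or produce garbled names.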
  
The result is as follows:
  
[Image: first three rows of the data]

  
  

2 Examples

  

Example 1: Call the quality function with default parameters

  
Let's first look at the result when only the data frame and the dependent variable are passed in and the remaining parameters keep their default values. The code is as follows:

to_drop = ['input_time', '申请状态', '历史最高逾期天数.x']  # columns to drop before calling quality
quality_result = toad.quality(date.drop(to_drop,axis=1),'y')
quality_result

The result is as follows:

[Image: quality output table with the iv, gini, entropy, and unique columns]

  
The result contains the values of the four indicators iv, gini, entropy, and unique for each variable, sorted in descending order by iv. Readers familiar with modeling will recognize that this function is useful for variable selection.
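As a rough illustration of using such a table for variable selection, the sketch below ranks variables by iv and keeps those above a cut-off. The table quality_demo, its variable names, and all of its values are invented, and the threshold of 0.02 is an assumed rule of thumb rather than anything prescribed by toad.

```python
import pandas as pd

# Hypothetical table shaped like the toad.quality output (values are invented)
quality_demo = pd.DataFrame(
    {
        "iv": [0.45, 0.12, 0.015, 0.30],
        "gini": [0.20, 0.24, 0.25, 0.22],
        "entropy": [0.35, 0.40, 0.41, 0.37],
        "unique": [12, 5, 3, 40],
    },
    index=["var_a", "var_b", "var_c", "var_d"],
)

IV_THRESHOLD = 0.02  # assumed cut-off; tune it for your own project

# Rank by iv and keep the variables that reach the threshold
ranked = quality_demo.sort_values("iv", ascending=False)
selected = ranked.index[ranked["iv"] >= IV_THRESHOLD].tolist()
print(selected)
```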

  

Example 2: iv_only parameter is set to True

  
Next, look at the result when iv_only is set to True. The code is as follows:

to_drop = ['input_time', '申请状态', '历史最高逾期天数.x']  # columns to drop before calling quality
toad.quality(date.drop(to_drop,axis=1),'y',iv_only=True)

The result is as follows:

[Image: quality output with iv_only=True]
  

Compared with Example 1, when iv_only is set to True, only iv is calculated out of the three indicators iv, gini, and entropy. This setting is a good fit when the dataset is large and has many features and you want to save computation time.

  
  

4. Comparison with iv calculated by decile binning

  
To compare the iv calculated by toad.quality with the iv calculated from 10 equal-frequency bins, first define a function that calculates iv with 10 equal-frequency bins. The code is as follows:

# Equal-frequency binning function
def bin_frequency(x, y, n=10):  # x: variable to bin, y: target variable, n: number of bins
    total = y.count()       # 1 total number of samples
    bad = y.sum()           # 2 number of bad samples
    good = total - bad      # 3 number of good samples
    if x.value_counts().shape[0] == 2:    # 4 a two-valued variable (e.g. 0 and 1) gets only two bins
        d1 = pd.DataFrame({'x': x, 'y': y, 'bucket': pd.cut(x, 2)})
    else:
        d1 = pd.DataFrame({'x': x, 'y': y, 'bucket': pd.qcut(x, n, duplicates='drop')})  # 5 equal-frequency binning with pd.qcut
    d2 = d1.groupby('bucket', as_index=True)  # 6 group by bin
    d3 = pd.DataFrame({'min_bin': d2.x.min()})  # 7 left edge of each bin
    d3['max_bin'] = d2.x.max()               # 8 right edge of each bin
    d3['bad'] = d2.y.sum()                   # 9 number of bad samples in each bin
    d3['total'] = d2.y.count()               # 10 total number of samples in each bin
    d3['bad_rate'] = d3['bad'] / d3['total']   # 11 share of bad samples within each bin
    d3['badattr'] = d3['bad'] / bad            # 12 each bin's share of all bad samples
    d3['goodattr'] = (d3['total'] - d3['bad']) / good    # 13 each bin's share of all good samples
    d3['WOEi'] = np.log(d3['badattr'] / d3['goodattr'])  # 14 WOE of each bin
    IV = ((d3['badattr'] - d3['goodattr']) * d3['WOEi']).sum()  # 15 iv of the variable
    d3['IVi'] = (d3['badattr'] - d3['goodattr']) * d3['WOEi']   # 16 iv contribution of each bin
    d4 = (d3.sort_values(by='min_bin')).reset_index(drop=True)  # 17 sort bins in ascending order of their left edge
    cut = [float('-inf')]
    for i in d4.min_bin:
        cut.append(i)
    cut.append(float('inf'))  # close the cut points once, after the loop
    WOEi = list(d4['WOEi'].round(3))
    return IV, cut, WOEi, d4

Call the function to calculate the iv value of a single variable. The code is as follows:

IV,cut,WOEi,d4 = bin_frequency(date['1个月内申请人在多个平台申请借款'], date['y'], 10)
print(IV)
d4

The result is as follows:
  
[Image: printed IV and the per-bin table d4]

The iv of the variable [1个月内申请人在多个平台申请借款] (the applicant applied for loans on multiple platforms within one month) calculated with 10 equal-frequency bins is 0.397. In Example 1, however, toad.quality gives 0.613, clearly higher than the decile result. This shows that the binning method has a considerable effect on a variable's iv.
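The effect of the binning method can be reproduced on synthetic data. The sketch below is an illustration only: the helper binned_iv, the random data, and the choice of equal-width bins as the second method are all assumptions (it does not reproduce toad's binning), and bins without any bad or any good samples are simply skipped to avoid infinite WOE.

```python
import numpy as np
import pandas as pd


def binned_iv(bucket: pd.Series, y: pd.Series) -> float:
    """iv of a 0/1 target over pre-computed bins; bins that contain
    no bad or no good samples are skipped (a simplification)."""
    stats = y.groupby(bucket, observed=True).agg(bad="sum", total="count")
    stats["good"] = stats["total"] - stats["bad"]
    stats = stats[(stats["bad"] > 0) & (stats["good"] > 0)]
    bad_share = stats["bad"] / y.sum()
    good_share = stats["good"] / (y.count() - y.sum())
    return float(((bad_share - good_share) * np.log(bad_share / good_share)).sum())


rng = np.random.default_rng(0)
x_values = rng.exponential(scale=2.0, size=5000)
x = pd.Series(x_values)
# Synthetic target whose bad rate rises with x
bad_prob = 1 / (1 + np.exp(2 - x_values))
y = pd.Series((rng.random(5000) < bad_prob).astype(int))

iv_equal_freq = binned_iv(pd.qcut(x, 10, duplicates="drop"), y)
iv_equal_width = binned_iv(pd.cut(x, 10), y)
print(f"equal-frequency iv: {iv_equal_freq:.3f}")
print(f"equal-width iv:     {iv_equal_width:.3f}")
```

The same variable and target yield two different iv values purely because the bin edges differ, which is the point made above.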
  
Is that true for all variables?
  
We calculate the 10-bin iv for the variables in the data frame in a batch and then compare it with the iv from toad.quality. First loop over the variables to calculate the 10-bin iv. The code is as follows:

from IPython.display import display  # display() is available by default in Jupyter; the import makes it explicit

columns = list(quality_result.index)
dt = date
dt = dt.fillna(-999999)
dt = dt.replace('NaN', -999999)
pd.set_option("display.max_rows", 1000)
pd.set_option('display.max_columns', None)
IV_table = [0]     # placeholder first row, removed in the next step
IV_name = ['xx']
for i in columns:
    try:
        IV, cut, WOEi, d4 = bin_frequency(dt[i], dt['y'], 10)
        IV_table.append(IV)
        IV_name.append(i)
        print('Variable [', i, '] IV=', IV, end='\n')
        display(d4)
    except Exception:
        print(i)  # print the names of variables whose binning failed
IV_name_table = pd.DataFrame({'IV_name': IV_name, 'iv_value': IV_table})

IV_name_table

The result is as follows:

[Image: table of variable names and their decile-based iv values]

Put the two results side by side for comparison. The code is as follows:

IV_name_table = IV_name_table.loc[1:, :]  # remove the placeholder first row
quality_result['IV_name'] = quality_result.index  # add a column with the variable names
pd.merge(IV_name_table, quality_result, on=['IV_name'], how='left')  # merge the two tables

The result is as follows:
  
[Image: merged table comparing iv_value with the toad.quality columns]
  
Here the iv_value column is the iv calculated with decile binning, and the iv column is the iv calculated by toad.quality. For some variables the gap between the two is quite large, but the overall trend is the same. In practice you can choose one of the two methods according to the scenario, or calculate both and take the union of the variables they select.
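Taking the union of the two selections could look like the following sketch. The table compare mimics the merged result with invented variable names and values, and the threshold of 0.02 is an assumed cut-off, not a value from the article.

```python
import pandas as pd

# Invented values standing in for the merged comparison table
compare = pd.DataFrame(
    {
        "IV_name": ["var_a", "var_b", "var_c", "var_d"],
        "iv_value": [0.40, 0.01, 0.03, 0.005],  # decile-based iv
        "iv": [0.61, 0.05, 0.01, 0.008],        # toad.quality iv
    }
)

IV_THRESHOLD = 0.02  # assumed cut-off

# Variables that pass the threshold under each method
by_decile = set(compare.loc[compare["iv_value"] >= IV_THRESHOLD, "IV_name"])
by_toad = set(compare.loc[compare["iv"] >= IV_THRESHOLD, "IV_name"])

# Keep a variable if either method selects it
selected = sorted(by_decile | by_toad)
print(selected)
```

Using the union keeps a variable that only one binning method rates as informative, at the cost of a longer candidate list to review later.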
  
That concludes the explanation of the quality function in Python. To learn about more Python functions, see the related articles in the "Learning Python" section of the official account.
  

Origin blog.csdn.net/qq_32532663/article/details/132381541