Calculating WOE and IV values in Python

Computational logic

First calculate the WOE value of each group, then use it to calculate the IV value:

WOE_i = ln((Yi/YT)/(Ni/NT))
IV_i = (Yi/YT - Ni/NT) * WOE_i

Here Y and N stand for YES and NO, the two values (1 and 0) of the response variable.

  • Yi is the number of 1s in the i-th group; YT (T for Total) is the number of 1s overall.
  • Ni is the number of 0s in the i-th group; NT (T for Total) is the number of 0s overall.
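
As a minimal sketch, the per-group calculation can be written as a small helper that takes the four counts defined above. The function name woe_iv and its argument names are illustrative and do not come from the original code; the sketch also assumes that neither Yi nor Ni is zero, since the logarithm is undefined in that case.

```python
import math

def woe_iv(y_i, y_t, n_i, n_t):
    """Return (WOE_i, IV_i) for one group from its counts of 1s and 0s.

    Hypothetical helper for illustration; assumes y_i > 0 and n_i > 0.
    """
    p_event = y_i / y_t          # share of all 1s that fall in group i
    p_non_event = n_i / n_t      # share of all 0s that fall in group i
    woe = math.log(p_event / p_non_event)
    iv = (p_event - p_non_event) * woe
    return woe, iv
```

A group with no 1s or no 0s needs special handling; the full function further down floors such counts at 0.5.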

Example

The data are as follows: x takes the values 1 to 9, and y is the corresponding 1 or 0.

x,y
1,1
2,1
3,0
4,1
5,1
6,0
7,0
8,0
9,1

Suppose the 9 rows are divided into three groups by x:

  • Group 0: x = 1,2,3
  • Group 1: x = 4,5,6
  • Group 2: x = 7,8,9
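
The grouping and the per-group counts of 1s and 0s can be sketched in plain Python as follows. The integer-division rule used to assign groups is just one convenient way to reproduce the three groups listed above and is an assumption of this sketch, not something the original code does.

```python
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 0, 1, 1, 0, 0, 0, 1]

# counts[g] = [number of 1s, number of 0s] in group g
counts = {}
for x, y in zip(xs, ys):
    group = (x - 1) // 3                      # 1-3 -> 0, 4-6 -> 1, 7-9 -> 2
    ones_zeros = counts.setdefault(group, [0, 0])
    ones_zeros[0 if y == 1 else 1] += 1

for group, (ones, zeros) in sorted(counts.items()):
    print(f"Group {group}: Y = {ones}, N = {zeros}")
# Group 0: Y = 2, N = 1
# Group 1: Y = 2, N = 1
# Group 2: Y = 1, N = 2
```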

The WOE value of group 0 is calculated as follows.

  • Y0 = 2, because within the group y is 1 when x = 1 and x = 2, two rows in total.
  • YT = 5, because the y column contains five 1s in total.
  • N0 = 1, because within the group y is 0 only when x = 3, one row in total.
  • NT = 4, because the y column contains four 0s in total.

WOE_0
=ln((2/5)/(1/4))
=ln(0.4/0.25)
=ln(1.6)
≈0.47

With the WOE value, compute the IV:

IV_0
=(2/5-1/4)*WOE_0
=0.15*0.47
=0.0705

This gives IV_0 = 0.0705 (0.070501 without rounding WOE_0). Similarly, IV_1 = 0.070501 and IV_2 = 0.274887. The IV of x is the sum over all groups: IV = IV_0 + IV_1 + IV_2 = 0.415888.
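
As a quick check of the arithmetic, the following sketch recomputes the three group contributions and their sum. The (Yi, Ni) pairs are counted by hand from the data table above; the variable names are illustrative.

```python
import math

# (Yi, Ni) per group, counted from the example data
groups = [(2, 1), (2, 1), (1, 2)]
y_t = sum(y for y, _ in groups)   # total number of 1s: 5
n_t = sum(n for _, n in groups)   # total number of 0s: 4

iv_parts = []
for y_i, n_i in groups:
    p1, p0 = y_i / y_t, n_i / n_t
    iv_parts.append((p1 - p0) * math.log(p1 / p0))

total_iv = sum(iv_parts)
print([round(v, 6) for v in iv_parts], round(total_iv, 6))
# [0.070501, 0.070501, 0.274887] 0.415888
```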

Python code

import pandas as pd
import numpy as np
def iv_woe(data:pd.DataFrame, target:str, bins:int = 10) -> (pd.DataFrame, pd.DataFrame):
    """Compute WOE and IV values.

    Parameters:
    - data: the input DataFrame
    - target: name of the y column
    - bins: number of bins (default 10)
    """
    newDF,woeDF = pd.DataFrame(), pd.DataFrame()
    cols = data.columns
    for ivars in cols[~cols.isin([target])]:
        # Bin the column only if its dtype kind is in 'bifc' and it has more than 10 distinct values
        if (data[ivars].dtype.kind in 'bifc') and (len(np.unique(data[ivars]))>10):
            binned_x = pd.qcut(data[ivars], bins,  duplicates='drop')
            d0 = pd.DataFrame({'x': binned_x, 'y': data[target]})
        else:
            d0 = pd.DataFrame({'x': data[ivars], 'y': data[target]})
        d = d0.groupby("x", as_index=False).agg({"y": ["count", "sum"]})
        d.columns = ['Cutoff', 'N', 'Events']
        # Floor zero counts at 0.5 so the logarithm below never receives 0
        d['% of Events'] = np.maximum(d['Events'], 0.5) / d['Events'].sum()
        d['Non-Events'] = d['N'] - d['Events']
        d['% of Non-Events'] = np.maximum(d['Non-Events'], 0.5) / d['Non-Events'].sum()
        d['WoE'] = np.log(d['% of Events']/d['% of Non-Events'])
        d['IV'] = d['WoE'] * (d['% of Events'] - d['% of Non-Events'])
        d.insert(loc=0, column='Variable', value=ivars)
        print("Information value of " + ivars + " is " + str(round(d['IV'].sum(),6)))
        temp =pd.DataFrame({"Variable" : [ivars], "IV" : [d['IV'].sum()]}, columns = ["Variable", "IV"])
        newDF=pd.concat([newDF,temp], axis=0)
        woeDF=pd.concat([woeDF,d], axis=0)
    return newDF, woeDF

Calling the function

mydata = pd.read_csv("./data.csv",encoding='utf8')
newDF,woeDF=iv_woe(mydata,'y')

This returns the results. Note that the default is 10 bins. In the example above x takes only the values 1 to 9, which is too few distinct values to trigger binning, so each value becomes its own group. See the `if` condition in the code.
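
The sketch below illustrates that condition on the example's x column, and shows one possible way to get the three groups from the worked example by binning explicitly with pd.qcut. The three-way qcut call is an assumption of this sketch and is not part of the original function.

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Same test as in iv_woe: bin only numeric columns with more than 10 distinct values
needs_binning = (x.dtype.kind in 'bifc') and (len(np.unique(x)) > 10)
print(needs_binning)        # False, since x has only 9 distinct values

# Explicit quantile binning into 3 groups; labels=False returns the bin index
groups = pd.qcut(x, 3, labels=False)
print(groups.tolist())      # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Passing a column binned this way to the function would reproduce the three groups of the hand calculation, because a column with only three distinct values is not binned again.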

Origin www.cnblogs.com/heenhui2016/p/12517791.html