2020 Xiamen International Bank "Digital Innovation Financial Cup" Modeling Competition: Baseline Sharing

Score: 0.34

Competition page: https://www.dcjingsai.com/v2/cmptDetail.html?id=439&=76f6724e6fa9455a9b5ef44402c08653&ssoLoginpToken=&sso_global_session=e44c4d57-cd19-4ada-a1d3-a525090252bf86&stl_ksession=1205

Competition background

In the era of digital finance, big data and artificial intelligence technologies in the banking industry are developing rapidly, and institutions across the industry are accelerating their digital transformation. Xiamen International Bank, a leading small and medium-sized bank with a distinctive technology focus, has long drawn on the power of digital financial technology under its "digital empowerment" philosophy. It continues to advance smart risk control, smart marketing, smart operations, and smart management, using artificial intelligence and big data analytics to build an intelligent customer-service model and an intelligent financial marketing service system, improving the intelligence and precision of the marketing process and providing customers with more considerate and accessible financial services.

To build an industry exchange platform for exploring hot topics in machine learning and artificial intelligence with practitioners from all walks of life, Xiamen International Bank, the Data Mining Research Center of Xiamen University, and DataCastle jointly held the 2020 Second Xiamen International Bank "Digital Innovation Financial Cup" Modeling Competition. Under the theme of "finance + technology", the competition focuses on real financial marketing scenarios, with a total prize pool of 310,000 yuan.

Task

As technology develops, banks have built a variety of online and offline customer touchpoints to serve needs such as everyday business handling and channel transactions. Facing a large customer base, banks need more comprehensive and accurate insight into customer needs. In practice, they must detect customer churn and predict changes in customers' funds, then market to customers in advance or in a timely manner to reduce the loss of bank funds. This competition provides customer behavior and asset information from real business scenarios as the modeling target. It aims both to showcase each contestant's practical data-mining ability and, in the semi-finals, to have contestants combine their modeling results into a corresponding marketing plan, fully reflecting the value of data analysis.

Label description

label -1: decline

label 0: stable

label 1: increase

Official description

Customer contribution is mainly related to the customer's AUM (assets under management) value.

Evaluation metric: Kappa

The Kappa coefficient is an index used for consistency testing and can also measure classification performance: for a classification problem, "consistency" means whether the model's predictions agree with the actual classes. The coefficient is computed from the confusion matrix and lies between -1 and 1; in practice it is usually greater than 0.

Based on the confusion matrix, the kappa coefficient is computed as:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where:

$$p_o = \frac{\text{sum of diagonal elements}}{\text{sum of all matrix elements}}$$

which is simply the accuracy, and

$$p_e = \frac{\sum_i (\text{row-}i\text{ sum}) \times (\text{column-}i\text{ sum})}{(\text{sum of all matrix elements})^2}$$

that is, the sum over classes of "actual count × predicted count", divided by the square of the total number of samples.
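The confusion-matrix formula for kappa can be checked against scikit-learn's `cohen_kappa_score` (the toy label vectors below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

y_true = [-1, 0, 1, 0, 0, 1, -1, 0]
y_pred = [-1, 0, 0, 0, 1, 1, -1, 0]

cm = confusion_matrix(y_true, y_pred)
p_o = np.trace(cm) / cm.sum()                               # accuracy
p_e = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / cm.sum() ** 2
kappa_manual = (p_o - p_e) / (1 - p_e)

# The hand-rolled value matches scikit-learn's implementation
assert np.isclose(kappa_manual, cohen_kappa_score(y_true, y_pred))
print(kappa_manual)  # → 0.6
```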

For details, please refer to this URL:

https://zhuanlan.zhihu.com/p/67844308

Data introduction

See the competition website for the task and data descriptions.

Approach

Observing the data, the behavior fields B6, B7, and B8 are present only in the last month of each quarter. For the test set, that last month's cust_no covers most of the customers whose future trend must be predicted, so the forecast is based on the last quarter's situation. After merging features, the following three cust_no are lost (they are also lost when merging the first two months): ['0xb2d0afb2', '0xb2d2ed87', '0xb2d2d9d2']. Their labels are simply set to 0. Columns containing NaN are dropped.

    # These three cust_no were not found after the merge; just guessing label 0.
    low = pd.DataFrame()
    low['cust_no'] = ['0xb2d0afb2', '0xb2d2ed87', '0xb2d2d9d2']
    low['label'] = [0, 0, 0]
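As an aside, customers that silently disappear in an inner merge can be caught with pandas' merge `indicator` option. A toy sketch (the frame names `ids` and `feats` are hypothetical, not from the original code):

```python
import pandas as pd

# 'c' has no feature row, so an inner merge would silently drop it
ids = pd.DataFrame({'cust_no': ['a', 'b', 'c']})
feats = pd.DataFrame({'cust_no': ['a', 'b'], 'X1': [1.0, 2.0]})

# A left merge with indicator=True flags the unmatched cust_no
merged = ids.merge(feats, how='left', on='cust_no', indicator=True)
lost = merged.loc[merged['_merge'] == 'left_only', 'cust_no'].tolist()
print(lost)  # → ['c']  — these would need a fallback label
```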

Basic features

There is no feature engineering yet: I3, I8, and I12 are simply label-encoded. I3 is then used to split the data for group-wise training (at the customer-segment level), and the remaining features are fed in as-is.

    # Label-encode the categorical columns. Fitting on the combined train and
    # test values avoids "unseen label" errors at transform time.
    for col in ['I3', 'I8', 'I12']:
        le = LabelEncoder()
        le.fit(pd.concat([train_B7[col], test_B7[col]]).astype(str))
        train_B7[col] = le.transform(train_B7[col].astype(str))
        test_B7[col] = le.transform(test_B7[col].astype(str))

    predictionsB7 = pd.DataFrame()
    scoresB7 = []

    # Train one model per I3 group (customer segment)
    for eve_id in tqdm(test_B7.I3.unique()):
        prediction, score = run_lgb_id(train_B7, test_B7, target='label', eve_id=eve_id)
        predictionsB7 = pd.concat([predictionsB7, prediction])
        scoresB7.append(score)

Data used

For the training set, only the September and December data are used; the test set uses only the March data.

    # 1. Read the label files
    train_label_3 = pd.read_csv(r'E:\For_test2-10\data\厦门_data\train_label\y_Q3_3.csv')
    train_label_4 = pd.read_csv(r'E:\For_test2-10\data\厦门_data\train_label\y_Q4_3.csv')

    train_3 = pd.DataFrame()
    train_4 = pd.DataFrame()

    id3_data = pd.read_csv(r'E:\For_test2-10\data\厦门_data\train_feature\cust_avli_Q3.csv')
    id4_data = pd.read_csv(r'E:\For_test2-10\data\厦门_data\train_feature\cust_avli_Q4.csv')

    # Keep labels for valid customers only
    train_label_3 = pd.merge(left=id3_data, right=train_label_3, how='inner', on='cust_no')
    train_label_4 = pd.merge(left=id4_data, right=train_label_4, how='inner', on='cust_no')
    # Merge in the personal information
    inf3_data = pd.read_csv(r'E:\For_test2-10\data\厦门_data\train_feature\cust_info_q3.csv')
    inf4_data = pd.read_csv(r'E:\For_test2-10\data\厦门_data\train_feature\cust_info_q4.csv')
    train_label_3 = pd.merge(left=inf3_data, right=train_label_3, how='inner', on='cust_no')
    train_label_4 = pd.merge(left=inf4_data, right=train_label_4, how='inner', on='cust_no')

    # Q3 feature extraction (September only)
    for i in range(9, 10):
        aum_3 = pd.read_csv(r'E:\For_test2-10\data\厦门_data\train_feature\aum_m' + str(i) + '.csv')
        be_3 = pd.read_csv(r'E:\For_test2-10\data\厦门_data\train_feature\behavior_m' + str(i) + '.csv')
        cun_3 = pd.read_csv(r'E:\For_test2-10\data\厦门_data\train_feature\cunkuan_m' + str(i) + '.csv')

        fre_3 = pd.merge(left=aum_3, right=be_3, how='inner', on='cust_no')
        fre_3 = pd.merge(left=fre_3, right=cun_3, how='inner', on='cust_no')
        train_3 = pd.concat([train_3, fre_3])

    train_fe3 = pd.merge(left=fre_3, right=train_label_3, how='inner', on='cust_no')
    train_fe3.to_csv(r'E:\For_test2-10\data\厦门_data\train_feature\train3_fe_B7.csv', index=None)

    # Q4 feature extraction (December only)
    for i in range(12, 13):
        aum_4 = pd.read_csv(r'E:\For_test2-10\data\厦门_data\train_feature\aum_m' + str(i) + '.csv')
        be_4 = pd.read_csv(r'E:\For_test2-10\data\厦门_data\train_feature\behavior_m' + str(i) + '.csv')
        cun_4 = pd.read_csv(r'E:\For_test2-10\data\厦门_data\train_feature\cunkuan_m' + str(i) + '.csv')

        fre_4 = pd.merge(left=aum_4, right=be_4, how='inner', on='cust_no')
        fre_4 = pd.merge(left=fre_4, right=cun_4, how='inner', on='cust_no')
        train_4 = pd.concat([train_4, fre_4])  # bug fix: originally appended to train_3

    train_fe4 = pd.merge(left=fre_4, right=train_label_4, how='inner', on='cust_no')
    train_fe4.to_csv(r'E:\For_test2-10\data\厦门_data\train_feature\train4_fe_B7.csv', index=None)

    train_B7 = pd.concat([train_fe3, train_fe4])

    # Test set: Q1 feature extraction (March only)
    test = pd.DataFrame()
    idtest_data = pd.read_csv(r'E:\For_test2-10\data\厦门_data\test_feature\cust_avli_Q1.csv')
    inftest_data = pd.read_csv(r'E:\For_test2-10\data\厦门_data\test_feature\cust_info_q1.csv')
    test_inf = pd.merge(left=inftest_data, right=idtest_data, how='inner', on='cust_no')
    for i in range(3, 4):
        aum = pd.read_csv(r'E:\For_test2-10\data\厦门_data\test_feature\aum_m' + str(i) + '.csv')
        be = pd.read_csv(r'E:\For_test2-10\data\厦门_data\test_feature\behavior_m' + str(i) + '.csv')
        cun = pd.read_csv(r'E:\For_test2-10\data\厦门_data\test_feature\cunkuan_m' + str(i) + '.csv')

        fre = pd.merge(left=aum, right=be, how='inner', on='cust_no')
        fre = pd.merge(left=fre, right=cun, how='inner', on='cust_no')
        test = pd.concat([test, fre])

    test_fe = pd.merge(left=test, right=test_inf, how='inner', on='cust_no')
    test_fe.to_csv(r'E:\For_test2-10\data\厦门_data\train_feature\test_fe_B7.csv', index=None)

    # Drop every column that contains NaN
    test_B7 = test_fe.dropna(axis=1, how='any')
    train_B7 = train_B7.dropna(axis=1, how='any')

Model

A LightGBM regressor with 5-fold cross-validation, trained separately per I3 group.

import gc

import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def run_lgb_id(df_train, df_test, target, eve_id):
    feature_names = list(
        filter(lambda x: x not in ['label', 'cust_no'], df_train.columns))

    # Restrict train and test to the rows of this I3 group
    df_train = df_train[df_train.I3 == eve_id]
    df_test = df_test[df_test.I3 == eve_id]

    model = lgb.LGBMRegressor(num_leaves=32,
                              max_depth=6,
                              learning_rate=0.08,
                              n_estimators=10000,
                              subsample=0.9,
                              feature_fraction=0.8,
                              reg_alpha=0.5,
                              reg_lambda=0.8,
                              random_state=2020)
    oof = []
    prediction = df_test[['cust_no']].copy()
    prediction[target] = 0

    # shuffle=True is required when random_state is set
    kfold = KFold(n_splits=5, shuffle=True, random_state=2020)
    for fold_id, (trn_idx, val_idx) in enumerate(kfold.split(df_train, df_train[target])):
        X_train = df_train.iloc[trn_idx][feature_names]
        Y_train = df_train.iloc[trn_idx][target]
        X_val = df_train.iloc[val_idx][feature_names]
        Y_val = df_train.iloc[val_idx][target]

        # verbose / early_stopping_rounds keyword arguments follow the
        # lightgbm 3.x fit() API
        lgb_model = model.fit(X_train,
                              Y_train,
                              eval_names=['train', 'valid'],
                              eval_set=[(X_train, Y_train), (X_val, Y_val)],
                              verbose=0,
                              eval_metric='mse',
                              early_stopping_rounds=20)

        pred_val = lgb_model.predict(X_val, num_iteration=lgb_model.best_iteration_)
        df_oof = df_train.iloc[val_idx][[target, 'cust_no']].copy()
        df_oof['pred'] = pred_val
        oof.append(df_oof)

        pred_test = lgb_model.predict(df_test[feature_names], num_iteration=lgb_model.best_iteration_)
        # Average the test predictions across folds
        prediction[target] += pred_test / kfold.n_splits

        del lgb_model, pred_val, pred_test, X_train, Y_train, X_val, Y_val
        gc.collect()

    df_oof = pd.concat(oof)
    score = mean_squared_error(df_oof[target], df_oof['pred'])
    print('MSE:', score)

    return prediction, score
  

Finally, MSE is used as the offline evaluation metric.
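Since the model is a regressor, its continuous outputs still have to be mapped back to the {-1, 0, 1} label set before submission. The original post doesn't show this step; a minimal sketch (a hypothetical round-and-clip scheme, with `to_labels` being an illustrative helper) could look like:

```python
import numpy as np
import pandas as pd

def to_labels(pred):
    """Map continuous regression outputs to the {-1, 0, 1} label set by
    rounding to the nearest integer and clipping to the valid range."""
    return np.clip(np.rint(pred), -1, 1).astype(int)

# Toy predictions, made up for illustration
pred = pd.Series([-1.3, -0.2, 0.4, 0.6, 1.7])
print(to_labels(pred).tolist())  # → [-1, 0, 0, 1, 1]
```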

Feel free to modify and improve it.

Note that some cust_no may get dropped during the feature merges, haha.

Online score: around 0.34.

The code borrows heavily from the baseline shared by Hengmen.

This is my first time sharing a baseline.

Thanks, everyone.


Origin blog.csdn.net/wusiyang001025/article/details/109381815