Customer Loan Overdue Prediction [1] - Logistic Regression Model

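Task

      Predict whether a loan customer will become overdue. The response variable is status, which takes two values: 0 means not overdue and 1 means overdue.

Code: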

# -*- coding: utf-8 -*-
"""
Created on Thu Nov 15 13:02:11 2018

@author: keepi
"""

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
pd.set_option('display.max_rows',1000)  # show up to 1000 rows when printing

# Load the data (the file is GB18030-encoded)
data = pd.read_csv('data.csv',encoding='gb18030')
# Fill every missing value with the constant 10
data = data.fillna(10)

# Feature engineering
# Disabled alternative: label-encode reg_preference_for_trad as integer codes
'''
n = set(data['reg_preference_for_trad'])
dic = {}
for i,j in enumerate(n):
    dic[j] = i
data['reg_preference_for_trad'] = data['reg_preference_for_trad'].map(dic)
'''
# One-hot encode reg_preference_for_trad. The prefix keeps every new column name
# a string (the fill value 10 would otherwise become an integer column name),
# which recent scikit-learn releases require.
x_dummy = pd.get_dummies(data['reg_preference_for_trad'],prefix='reg_preference_for_trad')
data = pd.concat([data.drop('reg_preference_for_trad',axis=1),x_dummy],axis=1,sort=False)
# Drop identifier and timestamp columns that are not used as features
data = data.drop(['source','bank_card_no','latest_query_time','loans_latest_time','id_name'],axis=1)

# Split into training and test sets
train,test = train_test_split(data,test_size=0.3,random_state=25)
y_train = train.loc[:,'status']
train_2 = train.drop('status',axis=1)
y_test = test.loc[:,'status']
test_2 = test.drop('status',axis=1)

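# Train the model and predict
# dual=True is only supported by the liblinear solver, so name the solver explicitly
lr = LogisticRegression(C=190,dual=True,solver='liblinear',random_state=535)
lr.fit(train_2,y_train)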

y_test_pre = lr.predict(test_2)

# Score
score = f1_score(y_test,y_test_pre,average='macro')
print('Validation set score',score)

Validation set score: 0.43838
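
The score above is a macro-averaged F1: F1 is computed separately for label 0 and label 1 and the two values are averaged with equal weight, regardless of how many samples each label has. The sketch below checks this by hand on toy labels; the labels and the helper f1_for_label are made up for illustration and are not part of the original code.

```python
from sklearn.metrics import f1_score

# Toy labels, made up for illustration only
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

def f1_for_label(label, y_true, y_pred):
    """F1 for a single label, computed from its TP/FP/FN counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: treat F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Macro F1 is the unweighted mean of the per-label F1 values
per_label = [f1_for_label(label, y_true, y_pred) for label in (0, 1)]
macro_by_hand = sum(per_label) / len(per_label)
macro_sklearn = f1_score(y_true, y_pred, average='macro')
print(per_label, macro_by_hand, macro_sklearn)
```

Because both labels count equally, a model that does poorly on the rarer label is penalized heavily, which is worth keeping in mind when reading a score such as 0.43838.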

Problems encountered

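    1.SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

           The cause was that I modified the data in place: train is a subset returned by train_test_split, and calling drop on it with inplace=True triggers the warning.

train.drop('status',axis=1,inplace=True)
# Warning: SettingWithCopyWarning
# Changing it to the following fixes it
train_2 = train.drop('status',axis=1)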
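
A minimal sketch of two ways to avoid the warning, using a small made-up DataFrame in place of data.csv (the column names amount and age are hypothetical). Either build a new frame with a non-in-place drop, or take an explicit copy before editing in place.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the loan data; the values are made up
data = pd.DataFrame({
    'amount': [10, 20, 30, 40, 50, 60],
    'age': [25, 32, 47, 51, 38, 29],
    'status': [0, 1, 0, 0, 1, 0],
})

train, test = train_test_split(data, test_size=0.5, random_state=25)

# Option 1: a non-in-place drop returns a new DataFrame and leaves train untouched
y_train = train['status']
x_train = train.drop('status', axis=1)

# Option 2: take an explicit copy first, then edit the copy in place
train_copy = train.copy()
train_copy.drop('status', axis=1, inplace=True)

print(list(x_train.columns), list(train_copy.columns), list(train.columns))
```

Option 1 is what the fix above does; it also keeps the label column available in train for later use.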

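    2.The random seed for the train/test split was fixed, yet the score was different on every training run

           This was because the random seed of the logistic regression itself had not been set

lr = LogisticRegression(C=100,dual=True,solver='liblinear',random_state=535)   # setting random_state fixes it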
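
One way to confirm that the seed is what makes training repeatable is to fit the same model twice and compare the coefficients. The sketch below does this on synthetic data from make_classification rather than on the loan data; the helper fit_coef is hypothetical, and solver='liblinear' is named explicitly because dual=True is only available with that solver.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate reproducibility
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

def fit_coef(seed):
    """Fit a dual-form logistic regression and return its coefficients."""
    model = LogisticRegression(C=100, dual=True, solver='liblinear', random_state=seed)
    model.fit(X, y)
    return model.coef_

# Two fits with the same seed should produce the same coefficients
same_seed_equal = np.allclose(fit_coef(535), fit_coef(535))
print(same_seed_equal)
```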

    3.Computing the F1 score after predicting with an SVM raised a warning:

UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

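        The warning says that F1 is undefined for a label because one of its terms is zero: my model predicted 1 for every sample, while the test labels contain both 0 and 1, so label 0 has no predicted samples and its precision is 0/0. So why does LinearSVC predict only one value after training?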
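
The sketch below first reproduces the condition behind the warning on toy labels, then shows one thing that may be worth trying for the open question. It rests on an assumption that the original post does not confirm: that features on very different scales contribute to LinearSVC collapsing to a single label, so the features are standardized in a pipeline. The data is synthetic, and the zero_division parameter needs a reasonably recent scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Part 1: a model that predicts 1 everywhere, scored against mixed labels.
# Label 0 has no predicted samples, so its precision is undefined;
# zero_division=0 sets it to 0 explicitly instead of emitting the warning.
y_true = [0, 1, 0, 1]
y_all_ones = [1, 1, 1, 1]
macro_constant = f1_score(y_true, y_all_ones, average='macro', zero_division=0)

# Part 2 (assumption: feature scaling matters here): standardize, then fit LinearSVC
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X = X * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])  # mimic mixed feature scales
svm = make_pipeline(StandardScaler(), LinearSVC(random_state=0))
svm.fit(X, y)

# Count how often each label is predicted; a single entry means a constant model
labels, counts = np.unique(svm.predict(X), return_counts=True)
print(macro_constant, dict(zip(labels.tolist(), counts.tolist())))
```

Counting the predicted labels before computing F1 makes a constant model obvious right away. If standardizing does not help on the real data, class imbalance in status would be another thing to check.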

 

References

        Dummy variables and one-hot encoding

Reposted from blog.csdn.net/truffle528/article/details/84072452