Pitfalls when training cph = CoxPHFitter(), and how to plot the results

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/sinat_26566137/article/details/82746632

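Plotting error: a ValueError is raised, probably because the plotting library is older than the required version.
Solution: (1) update plt by installing the latest version (0.17); you may also need to install the latest lifelines.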
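Before upgrading anything, it can help to see which versions are actually installed. The snippet below is a minimal sketch using the standard library's `importlib.metadata` (Python 3.8+). The helper name `installed_version` and the list of packages checked are illustrative choices, not something from the original post.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Packages used in this post; extend the tuple as needed.
for name in ("matplotlib", "pandas", "lifelines"):
    print(name, installed_version(name))
```

A package that prints `None` is not installed in the current environment and needs to be installed before the plotting code can run.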
Plotting a DataFrame:
Reference: https://blog.csdn.net/grey_csdn/article/details/70768721
A DataFrame can be plotted as follows:

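  from pandas import DataFrame
  from numpy.random import randn
  import numpy as np
  import matplotlib.pyplot as plt
  # 10 rows x 5 columns of random values, indexed 0, 10, ..., 90
  df = DataFrame(randn(10, 5), columns=['A', 'B', 'C', 'D', 'E'], index=np.arange(0, 100, 10))
  df.plot()   # draws one line per column
  plt.show()  # needed when running outside an interactive/notebook session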
Plotting with cph = CoxPHFitter():
import pandas as pd
from lifelines import CoxPHFitter
import matplotlib.pyplot as plt

cph = CoxPHFitter()
df1 = pd.read_csv('/home/sc/Downloads/tmp/shixin_cox_all_data_to_model_new.csv')

# Training option 1: train with only the following features
# c = ['defendant_judgedoc_cnt','network_share_zhixing_cnt','shixin_label', 'survival_time','regcap','judgedoc_cnt']

c =['is_revoke','is_cancel','court_notice_is_no','established_year','r1_subsidiary_invest_max_dx_zx','r2_controlled_invest_max_dx_zx',
'r4_common_corporate_shi_xin',
'r4_common_corporate_zhi_xin','judgedoc_cnt',
    'network_share_judge_doc_cnt','network_all_link_defendant_judgedoc_cnt',
'companyname_change_cnt','business_range_change_cnt','regcap_change_cnt','share_change_cnt','fr_change_cnt',
'address_change_cnt','director_change_cnt','network_fr_judgedoc_cnt','shixin_label', 'survival_time']
#'is_cancel',
df1 =df1[c]


# Training option 2: drop the features that are all zeros.
# a =['company_name','r1_subsidiary_invest_max', 'r2_controlled_invest_max', 'r3_common_company_controlled_invest', 'r4_common_corporate']
# c_1 =['network_share_shixin_cnt','litigant_defendant_contract_dispute_cnt','litigant_defendant_bust_cnt','litigant_copyright_dispute_cnt']
# a.extend(c_1)
# df1 = df1.drop(a, axis=1)


df1 = df1.fillna(0)
# shixin_0 = df1[(df1['shixin_label'] == 0)][0:5000]
# shixin_1 = df1[(df1['shixin_label'] == 1)][0:2000]
# df1 = pd.concat([shixin_0,shixin_1])
shixin_0 = df1[(df1['shixin_label'] == 0)][0:100000]
shixin_1 = df1[(df1['shixin_label'] == 1)][0:30000]
df1 = pd.concat([shixin_0,shixin_1])
# df1 = df1.sort_values(by="survival_time" , ascending=True)
# print(df1["survival_time"])
# df1['group'] =(df1.groupby(['survival_time','shixin_label']).size()).tolist()
#
# print(df1['group'])


cph.fit(df1, duration_col='survival_time', event_col='shixin_label', show_progress=True, step_size=0.1)
cph.print_summary()
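cph.plot()  # plots each covariate's coefficient (log hazard ratio) with its confidence interval
plt.show()
# Survival curves as established_year takes each listed value, with the other covariates held fixed
# (later lifelines releases rename this method to plot_partial_effects_on_outcome)
cph.plot_covariate_groups('established_year', [0, 5, 10, 15])
plt.show()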
# harper = df1.drop(['survival_time', 'shixin_label'], axis=1).iloc[0:5]  # covariates of a few sample rows
# ax = plt.subplot(2,1,1)
# cph.predict_cumulative_hazard(harper).plot(ax=ax)
#
# ax = plt.subplot(2,1,2)
# cph.predict_survival_function(harper).plot(ax=ax)


# from lifelines import CoxPHFitter
# from lifelines.datasets import load_regression_dataset
# from lifelines.utils import k_fold_cross_validation
# import numpy as np
# regression_dataset = load_regression_dataset()
# cph = CoxPHFitter()
# ### During k-fold cross-validation some features can end up all zeros within a fold, which raises "ValueError: delta contains nan value(s). Convergence halted."
# scores = k_fold_cross_validation(cph, df1, duration_col='survival_time', event_col='shixin_label',k=3)
# print(scores)
# print(np.mean(scores))
# print(np.std(scores))
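
The k-fold error noted in the comments above comes from covariates that carry no variation in a training fold. One way to spot them before fitting is sketched below. The helper `risky_columns`, its `min_nonzero` threshold, and the toy columns `all_zero` and `rare_flag` are hypothetical names introduced for illustration; only `judgedoc_cnt`, `survival_time` and `shixin_label` come from the post.

```python
import pandas as pd

def risky_columns(df, exclude=(), min_nonzero=1):
    """List covariates that are constant, or non-zero in fewer than `min_nonzero` rows."""
    flagged = []
    for col in df.columns:
        if col in exclude:
            continue
        series = df[col]
        is_constant = series.nunique(dropna=False) <= 1
        too_sparse = (series != 0).sum() < min_nonzero
        if is_constant or too_sparse:
            flagged.append(col)
    return flagged

# Toy data standing in for the real CSV (values are made up).
toy = pd.DataFrame({
    "all_zero": [0, 0, 0, 0, 0, 0],
    "rare_flag": [0, 0, 0, 0, 0, 1],
    "judgedoc_cnt": [3, 0, 7, 1, 0, 2],
    "survival_time": [5, 8, 2, 9, 4, 6],
    "shixin_label": [1, 0, 1, 0, 0, 1],
})
exclude = ("survival_time", "shixin_label")

print(risky_columns(toy, exclude))                 # constant columns only
print(risky_columns(toy, exclude, min_nonzero=3))  # also flags rarely non-zero columns
```

Dropping the flagged columns with `df1.drop(columns=...)` before calling `k_fold_cross_validation` should make the all-zero case less likely, although a feature that is non-zero in only a handful of rows can still vanish from a single fold.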

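(2) Pitfalls encountered during training:
Although the Concordance metric improved considerably compared with earlier runs, the significance of every feature was very low. The cause was a step_size that was set too small: after changing step_size=0.00001 to step_size=0.1, some features show strong significance (three stars, ***). I have not yet worked out the reason behind this. In addition, the overall sample size and the ratio of positive to negative samples both have a slight effect on the results.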
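
One plausible reading of the step_size effect (an assumption, not something the post confirms) is that a very small step leaves the Newton iterations far from the optimum when the iteration limit is reached, so the coefficients stay close to zero. The toy below illustrates that idea with a hand-written damped Newton update for a one-covariate Cox partial likelihood on simulated data. It is not lifelines' implementation: the function `cox_newton_1d`, the limit of 50 iterations, and the simulated data are all illustrative choices.

```python
import math
import random

def cox_newton_1d(x, t, e, step_size, max_steps=50):
    """Damped Newton iterations for a one-covariate Cox partial likelihood (no tie handling)."""
    beta = 0.0
    n = len(x)
    for _ in range(max_steps):
        w = [math.exp(beta * xj) for xj in x]
        grad = hess = 0.0
        for i in range(n):
            if not e[i]:
                continue  # censored rows contribute only through risk sets
            # Risk set: subjects still under observation at time t[i]
            risk = [j for j in range(n) if t[j] >= t[i]]
            s0 = sum(w[j] for j in risk)
            s1 = sum(w[j] * x[j] for j in risk)
            s2 = sum(w[j] * x[j] ** 2 for j in risk)
            mean = s1 / s0
            grad += x[i] - mean             # first derivative of the log partial likelihood
            hess -= s2 / s0 - mean ** 2     # second derivative (non-positive)
        if hess == 0.0:
            break
        beta -= step_size * grad / hess     # Newton step scaled by step_size
    return beta

# Simulated data: hazard rises with x, true coefficient 1.0, no censoring.
random.seed(0)
true_beta = 1.0
x = [random.gauss(0.0, 1.0) for _ in range(100)]
t = [random.expovariate(math.exp(true_beta * xi)) for xi in x]
e = [1] * len(x)

for step in (0.00001, 0.1):
    print(step, cox_newton_1d(x, t, e, step_size=step))
```

In this toy, the run with step_size=0.00001 barely moves away from zero within 50 iterations, while step_size=0.1 gets close to the fitted value. Whether the same mechanism explains the significance change seen with CoxPHFitter would need checking against the lifelines version in use.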
