[DrivenData Competition] Predicting Flu Vaccination: My Entry

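Today I'm writing up a data competition I entered on the DrivenData platform. Current score: 0.8515, rank 209/1808 (roughly the top 11.6%). I'm still a beginner, with plenty of room to improve.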

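What is DrivenData? There are many data competition platforms these days; the better-known ones include Kaggle and Tianchi. DrivenData is another such platform. It hosts many data mining competitions, and you can pick one that suits you by problem type and difficulty. Quite a few of them are well suited to beginners.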

The one I entered is vaccination prediction, full name "Flu Shot Learning: Predict H1N1 and Seasonal Flu Vaccines". It is listed under that name on DrivenData.
Here is the basic setup of the competition:

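I. Competition overview

1. Problem description

First, this is a classification problem, and specifically a two-label one: we have to predict two variables, h1n1_vaccine and seasonal_vaccine, which represent the likelihood that a person received the H1N1 vaccine and the seasonal flu vaccine, respectively.
The competition is scored by ROC AUC: you submit the predicted probability for each vaccine, the ROC AUC is computed against the true values, and higher is better. The submitted values are the probabilities that a person received each vaccine, not binary labels.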
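To make the metric concrete, here is a minimal sketch on made-up numbers. It assumes the overall score is the mean of the two per-label AUCs, which is what `average="macro"` computes in scikit-learn; the arrays are toy values, not competition data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground truth: one row per respondent, columns = [h1n1_vaccine, seasonal_vaccine]
y_true = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])
# Toy predicted probabilities in the same layout
y_prob = np.array([[0.20, 0.9], [0.70, 0.6], [0.65, 0.3], [0.60, 0.2]])

# AUC of each label on its own
auc_h1n1 = roc_auc_score(y_true[:, 0], y_prob[:, 0])
auc_seasonal = roc_auc_score(y_true[:, 1], y_prob[:, 1])

# Macro average over the two labels in one call
macro_auc = roc_auc_score(y_true, y_prob, average="macro")
print(auc_h1n1, auc_seasonal, macro_auc)
```

Only the ranking of the probabilities matters for ROC AUC, which is why hard 0/1 predictions usually score worse than well-ordered probabilities.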

Note: this is not multi-class classification, it is a two-label problem!

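2. Data and features

The data comes from the National 2009 H1N1 Flu Survey, conducted in the United States in late 2009 and early 2010. There are 26707 rows and 35 features in total: 12 text features and 23 discrete numeric features.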

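II. Data preprocessing

First, there are many text features. Text data here is discrete, descriptive data, so I encode it as discrete integers of the form 0, 1, 2, ... I did not use one-hot encoding: I tried it, and it produced a very large number of feature dimensions without any real benefit.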
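The difference in dimensionality is easy to see on a small example. This is only a sketch: the toy frame and its values are hypothetical, and `pd.factorize` is one of several ways to get integer codes (it numbers categories in order of first appearance).

```python
import pandas as pd

# Hypothetical toy frame; the column names only mimic the survey's text features
toy = pd.DataFrame({
    "age_group": ["18 - 34 Years", "65+ Years", "35 - 44 Years", "65+ Years"],
    "sex": ["Female", "Male", "Male", "Female"],
})

# Integer codes: still one column per feature, categories mapped to 0, 1, 2, ...
encoded = pd.DataFrame({col: pd.factorize(toy[col])[0] for col in toy.columns})

# One-hot: one column per (feature, category) pair
one_hot = pd.get_dummies(toy)

print(encoded.shape, one_hot.shape)
```

Integer codes do impose an arbitrary order on the categories, which tree-based models tend to tolerate better than distance-based ones.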

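Next, it turns out the data contains many missing values:

[Figure: missing-value matrix]
In this missing-value visualization, each column shows the missingness of one feature. The plot shows where the missing entries are located; white marks a missing value.

The missing-value bar chart below shows the percentage missing for each feature directly, which makes it easier to decide how to handle them:
[Figure: bar chart of the missing-value percentage per feature]
The chart shows that many features have missing values, but the percentage missing in each is relatively low, so the gaps should be filled rather than dropped. I fill them using KNN.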
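The numbers behind such a bar chart can be computed directly with pandas. A minimal sketch, using a hypothetical three-column toy frame in place of the real training features:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the training features
toy = pd.DataFrame({
    "respondent_id": [0, 1, 2, 3],
    "h1n1_concern": [1.0, np.nan, 2.0, 3.0],
    "health_insurance": [np.nan, np.nan, 1.0, 0.0],
})

# Share of missing entries per column, as a percentage, largest first
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```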

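Basic idea: loop over the features that contain missing values, taking each one in turn as Y, and use the set of all features that currently have no missing values as X.

① Train a k-nearest-neighbors learner on the samples where the current Y is present, treat the samples where Y is missing as the test set, and predict Y for them with the trained learner.
② Fill the predicted Y values into the corresponding missing positions.
③ Use the filled data as the dataset for the next round: add the now-complete feature Y to the input feature set X, then move on to the next feature with missing values.
Repeat these steps until every feature has been filled and the dataset has no missing values left.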

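First, import the packages needed later:

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier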

# Assumed to be defined earlier (not shown in this post):
#   ffeature: list of column names that contain missing values
#   tfeature: list of column names with no missing values, including 'respondent_id'
#   train_data, test_set_features: the integer-encoded training and test features
clf = KNeighborsClassifier(n_neighbors=2)
mfeature = ffeature.copy()
for i in range(0, len(ffeature)):
    # Add the current incomplete feature as the last column; it is the target Y
    tfeature.append(ffeature[i])
    train = train_data[tfeature]
    test = test_set_features[tfeature]
    # Training rows without missing values go into training_nonull
    training_nonull = train[~train.isnull().any(axis=1)].copy()
    # Training rows with missing values go into training_null
    training_null = train[train.isnull().any(axis=1)].copy()
    # Same for the test set: complete rows go into test_nonull
    test_nonull = test[~test.isnull().any(axis=1)].copy()
    # Test rows with missing values go into test_null
    test_null = test[test.isnull().any(axis=1)].copy()
    # Train on the complete rows: every column but the last is X, the last is Y.
    # Note: if tfeature includes 'respondent_id', the id also enters the KNN distance.
    clf.fit(training_nonull.iloc[:, :-1], training_nonull.iloc[:, -1])
    # Predict the current feature and write it into the gaps (skip if nothing is missing)
    if len(training_null) > 0:
        training_null.iloc[:, -1] = clf.predict(training_null.iloc[:, :-1])
    train_c = pd.concat([training_nonull, training_null])
    train_c = train_c.sort_values(by=['respondent_id'], ascending=True)
    # Fill the test set the same way
    if len(test_null) > 0:
        test_null.iloc[:, -1] = clf.predict(test_null.iloc[:, :-1])
    test_c = pd.concat([test_nonull, test_null])
    test_c = test_c.sort_values(by=['respondent_id'], ascending=True)
    # Remove the current feature from the list of features still to fill
    mfeature.remove(ffeature[i])
    train_k = train_data[mfeature].copy()
    train_k['respondent_id'] = train_data['respondent_id']
    test_k = test_set_features[mfeature].copy()
    test_k['respondent_id'] = test_set_features['respondent_id']
    # Merge into the full dataset for the next round
    train_data = pd.merge(train_c, train_k, how='inner', on='respondent_id')
    test_set_features = pd.merge(test_c, test_k, how='inner', on='respondent_id')
print(test_set_features)
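For readers who want to try the idea without the competition files, here is a compact, self-contained sketch of the same sequential KNN fill on a toy frame. It is a simplified variant, not the code used for the submission: `knn_fill` and the toy columns are hypothetical names, it handles a single DataFrame, and it does not use an id column as an input.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

def knn_fill(df, complete_cols, incomplete_cols, n_neighbors=2):
    """Fill each incomplete column in turn, using already-complete columns as inputs."""
    df = df.copy()
    inputs = list(complete_cols)
    for target in incomplete_cols:
        missing = df[target].isnull()
        if missing.any():
            clf = KNeighborsClassifier(n_neighbors=n_neighbors)
            # Train on rows where the target is known
            clf.fit(df.loc[~missing, inputs], df.loc[~missing, target])
            # Predict the target for rows where it is missing and fill the gaps
            df.loc[missing, target] = clf.predict(df.loc[missing, inputs])
        # The filled column becomes an input feature for the next round
        inputs.append(target)
    return df

toy = pd.DataFrame({
    "a": [0, 0, 1, 1, 0, 1],
    "b": [0, 0, 1, 1, 0, 1],
    "c": [0.0, 0.0, 1.0, 1.0, np.nan, np.nan],
    "d": [1.0, np.nan, 0.0, 0.0, 1.0, 0.0],
})
filled = knn_fill(toy, complete_cols=["a", "b"], incomplete_cols=["c", "d"])
print(filled)
```

The order of `incomplete_cols` matters: columns filled early influence the fill of later ones, so starting with the columns that have the fewest gaps is a reasonable default.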

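III. Feature engineering

First, compute the correlation between features. Pearson correlation is not suitable for discrete features, so I use mutual information (MI), in its normalized form, to measure the dependence between features. In the figure, darker colors mean stronger correlation and lighter colors mean weaker correlation.

[Figure: heatmap of pairwise mutual information between features]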

# Feature correlation
def MI_matrix(dataframe):
    """Pairwise normalized mutual information between all columns."""
    data = np.array(dataframe)
    n = data.shape[1]
    result = np.zeros([n, n])

    for i in range(n):
        for j in range(n):
            result[i, j] = metrics.normalized_mutual_info_score(data[:, i], data[:, j])

    RT = pd.DataFrame(result)
    return RT
df_mic = MI_matrix(train)
print(df_mic)

feature = train.columns.values.tolist()
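A small sketch of why normalized MI fits integer-coded categories better than Pearson correlation: MI depends only on how the values group together, not on which integer each category happens to get. The arrays are toy values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

x = np.array([0, 0, 1, 1, 2, 2])
relabeled = np.array([2, 2, 0, 0, 1, 1])   # same grouping as x, different codes
unrelated = np.array([0, 1, 0, 1, 0, 1])   # statistically independent of x

# Normalized MI is 1 for a pure relabeling and 0 for independent variables
nmi_same = normalized_mutual_info_score(x, relabeled)
nmi_unrelated = normalized_mutual_info_score(x, unrelated)

# Pearson correlation changes with the arbitrary integer codes
pearson_relabel = np.corrcoef(x, relabeled)[0, 1]
print(nmi_same, nmi_unrelated, pearson_relabel)
```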

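Next, one of the columns is the respondent's id. It does nothing for the prediction, so it is removed.

I tried feature selection with random forests, logistic regression, and other methods, but predictive performance dropped after every selection, so the current best score uses all features for fitting.
Only the random forest selection is described here as an example: train a random forest model, then use the feature importances computed by the model to keep the 25 top-ranked features:
[Figure: random forest feature importance ranking]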
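The selection step can be sketched as follows. This is an illustration on synthetic data with a single binary target, not the competition pipeline: the feature names `f0`..`f9` and `top_k = 5` are placeholders (the post keeps the top 25 of the real features).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 10 features, 3 of them informative
X_arr, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

# Fit a random forest and rank features by impurity-based importance
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

# Keep the top_k highest-ranked features
top_k = 5
selected = importances.head(top_k).index.tolist()
X_selected = X[selected]
print(importances.head(top_k))
```

Whether the reduced set helps has to be checked on a validation split, as in the post, where every selection tried lowered the score.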

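IV. Splitting the dataset

I use train_test_split() to split the data in a 2:8 ratio (20% held out for validation, 80% for training), with stratify=y so that the split does not change the label distribution.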

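from sklearn.model_selection import train_test_split

X = train_data
y = train_set_labels.iloc[:, 1:]  # skip the first column, keep the two label columns
# train_test_split returns X_train, X_test, y_train, y_test, in that order
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=10)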
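With a two-column `y`, stratification works on the combination of both labels. A quick sketch on made-up labels shows that the share of each label combination is the same in both parts; the counts below are invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up labels: 60 x (0,0), 20 x (1,1), 10 x (0,1), 10 x (1,0)
labels = pd.DataFrame({
    "h1n1_vaccine":     [0] * 60 + [1] * 20 + [0] * 10 + [1] * 10,
    "seasonal_vaccine": [0] * 60 + [1] * 20 + [1] * 10 + [0] * 10,
})
features = pd.DataFrame({"x": range(100)})

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=10
)

# Share of each (h1n1_vaccine, seasonal_vaccine) combination in each part
train_share = y_train.value_counts(normalize=True).sort_index()
test_share = y_test.value_counts(normalize=True).sort_index()
print(train_share)
print(test_share)
```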

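V. Modeling and prediction

I tried several base models and stacking ensembles, but so far I have not managed to tune the stacking ensembles or the deep learning models to a better score; the best performer at the moment is XGBoost. Because this is a two-label problem and one model has to output predictions for both variables, MultiOutputClassifier can be used. It fits one copy of the base estimator per label.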

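# XGBoost
from sklearn.metrics import roc_auc_score
from sklearn.multioutput import MultiOutputClassifier
from xgboost import XGBClassifier
xgb = MultiOutputClassifier(estimator=XGBClassifier(booster='gbtree', n_estimators=1000, learning_rate=0.01, subsample=0.6, colsample_bytree=0.4, colsample_bylevel=0.8, max_depth=6))
xgb.fit(X_train, y_train)
# xgb.predict(X_test) would return hard 0/1 labels; the metric needs probabilities
# predict_proba returns a list with one (n_samples, 2) array per label
predictedpro_test = xgb.predict_proba(X_test)
y_preds = pd.DataFrame(
    {
        # Column 1 of each array is the probability of the positive class
        "h1n1_vaccine": predictedpro_test[0][:, 1],
        "seasonal_vaccine": predictedpro_test[1][:, 1],
    },
    index=y_test.index
)
print("y_preds.shape:", y_preds.shape)
print(y_preds.head())
# Mean of the two per-label ROC AUC scores on the held-out split
auc = roc_auc_score(y_test, y_preds, average="macro")
print(auc)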
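The shape of what `predict_proba` returns from a `MultiOutputClassifier` is easy to get wrong, so here is a small sketch of it. To keep it light it swaps in `LogisticRegression` for XGBoost and uses synthetic data; both are stand-ins, not the setup used for the submission.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic features and two binary labels derived from them
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Y = np.column_stack([X[:, 0] > 0, X[:, 1] + X[:, 2] > 0]).astype(int)

model = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# predict_proba returns a list: one (n_samples, n_classes) array per label
proba = model.predict_proba(X[:5])

# Stack the positive-class column of each label into one (n_samples, n_labels) array
p_positive = np.column_stack([p[:, 1] for p in proba])
print(len(proba), proba[0].shape, p_positive.shape)
```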

The hyperparameter tuning went as follows:

from sklearn.model_selection import RandomizedSearchCV
# Search space; the estimator__ prefix routes each parameter to the wrapped XGBClassifier
param_grid = {
    'estimator__n_estimators': list(range(100, 3501, 100)),
    'estimator__learning_rate': [0.001, 0.005, 0.01, 0.02, 0.05, 0.06, 0.07, 0.08, 0.1],
}
xgb = MultiOutputClassifier(estimator=XGBClassifier(booster='gbtree', subsample=0.6, colsample_bytree=0.4, colsample_bylevel=0.8, max_depth=6))
# No scoring argument is given, so candidates are ranked by the estimator's default score method
cv_xgb = RandomizedSearchCV(estimator=xgb, param_distributions=param_grid, cv=5, n_jobs=-1)
cv_xgb.fit(X_train, y_train)
print(cv_xgb.best_params_, cv_xgb.best_score_)
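When no `scoring` is passed, `RandomizedSearchCV` ranks candidates with the estimator's own `score` method, which for `MultiOutputClassifier` is accuracy-based and not ROC AUC. One possible way to search on the competition metric instead is a custom scorer. The sketch below is an assumption-laden illustration: `mean_auc` is a hypothetical helper, and `LogisticRegression` with a small `C` grid on synthetic data stands in for the XGBoost setup so that it runs quickly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multioutput import MultiOutputClassifier

def mean_auc(estimator, X, y):
    """Hypothetical scorer: mean ROC AUC over the labels of a multi-output classifier."""
    proba = estimator.predict_proba(X)
    p_positive = np.column_stack([p[:, 1] for p in proba])
    return roc_auc_score(y, p_positive, average="macro")

# Synthetic stand-in data with two binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
Y = np.column_stack([X[:, 0] > 0, X[:, 1] + X[:, 2] > 0]).astype(int)

search = RandomizedSearchCV(
    MultiOutputClassifier(LogisticRegression(max_iter=1000)),
    param_distributions={"estimator__C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=4,
    cv=3,
    scoring=mean_auc,  # a callable with signature (estimator, X, y)
    random_state=0,
)
search.fit(X, Y)
print(search.best_params_, search.best_score_)
```

With this scorer, `best_score_` is on the same scale as the leaderboard metric, which makes the cross-validation results easier to compare with submissions.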