Machine Learning Competition: 快来一起挖掘幸福感 (Come Dig for Happiness) - Alibaba Cloud Tianchi

Overview

  • Course link:
    https://tianchi.aliyun.com/specials/promotion/aicampml?invite_channel=3&accounttraceid=baca918333cb45008b70655b544a5aeadgkm

  • Topic: the machine learning competition 快来一起挖掘幸福感 (Come Dig for Happiness)

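  • Approach: following what was covered earlier, first fill the missing values with KNN imputation, then use logistic regression for classification and prediction.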

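  • Problem: the y values for the final test set are all 5, but my predictions are all 4. I am not sure whether one of the intermediate steps went wrong or whether logistic regression is simply not suited to this example.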

  • The final results do not match at all, which is baffling: every prediction is 4, while every reference value is 5.
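
When every prediction lands on a single class, one thing worth checking is how the labels are distributed in the training set, since a classifier that has learned little beyond the class frequencies tends to output the most common class. The sketch below uses a small synthetic label array (not the competition data) and scikit-learn's `DummyClassifier` to show what such a majority-class baseline looks like. The names `y_demo`, `X_demo`, and `baseline` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Synthetic, imbalanced labels for illustration only (not the competition data).
y_demo = pd.Series([4] * 60 + [5] * 20 + [3] * 15 + [2] * 4 + [1] * 1)
X_demo = np.zeros((len(y_demo), 1))  # the baseline ignores the features

# Share of each class among the labels.
class_share = y_demo.value_counts(normalize=True).sort_index()
print(class_share)

# A baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_pred = baseline.predict(X_demo)
baseline_acc = (baseline_pred == y_demo.to_numpy()).mean()
print("baseline predicts:", np.unique(baseline_pred), "accuracy:", baseline_acc)
```

If the fitted model's accuracy is close to this baseline and its predictions cover only one class, the model is probably not using the features effectively.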

1. Data processing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

train_abbr = pd.read_csv(r'D:\学习\数据\快来挖掘幸福感数据\happiness_train_abbr.csv')
train = pd.read_csv(r'D:\学习\数据\快来挖掘幸福感数据\happiness_train_complete.csv',encoding='GBK')
test_abbr = pd.read_csv(r'D:\学习\数据\快来挖掘幸福感数据\happiness_test_abbr.csv',encoding='ISO-8859-1')
test = pd.read_csv(r'D:\学习\数据\快来挖掘幸福感数据\happiness_test_complete.csv',encoding='GBK')

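# Mark invalid survey codes (-8, -1, -2, -3) as NaN
train = train.replace([-8, -1, -2, -3], np.nan)
# Sanity check: the count of remaining invalid codes should be 0
((train == -8) | (train == -1) | (train == -2) | (train == -3)).sum().sum()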
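
The replacement step can be tried on a toy frame first. This is a minimal sketch with made-up column names and values; `isin` is used here as a shorter way to count leftover invalid codes.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; column names and values are made up.
demo = pd.DataFrame({"income": [3000, -8, 5000], "health": [4, -1, -3]})
invalid_codes = [-8, -1, -2, -3]

# Replace every invalid code with NaN.
cleaned = demo.replace(invalid_codes, np.nan)

# Count leftover invalid codes and the NaN values that replaced them.
remaining = int(cleaned.isin(invalid_codes).sum().sum())
n_missing = int(cleaned.isna().sum().sum())
print(remaining, n_missing)
```

Note that this replaces the codes in every column, so a column where a negative number is a legitimate value would be blanked as well.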

# Print the number and percentage of missing values for every feature that has any
for i in range(train.shape[1]):
    n_miss = train.iloc[:,i].isnull().sum()
    perc = (n_miss / train.shape[0]) * 100
    if n_miss > 0:
        print('>Col: %d, Missing: %d, Missing ratio: (%.2f%%)' % (i, n_miss, perc))
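
The same summary can be produced without an explicit loop. The sketch below assumes a small synthetic frame and builds a table of missing counts and percentages sorted from worst to best.

```python
import numpy as np
import pandas as pd

# Synthetic frame for illustration.
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1, 2, 3, 4],
    "c": [np.nan, 1.0, 2.0, 3.0],
})

# Count and percentage of missing values per column.
summary = pd.DataFrame({
    "missing": demo.isna().sum(),
    "ratio_pct": demo.isna().mean() * 100,
})

# Keep only columns that have missing values, worst first.
summary = summary[summary["missing"] > 0].sort_values("ratio_pct", ascending=False)
print(summary)
```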


# List the columns that are not numeric
cols = train.columns
for col in cols:
    if str(train[col].dtype) == 'object':
        print(col)
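
pandas can also select non-numeric columns directly with `select_dtypes`. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Made-up frame with one numeric and two text columns.
demo = pd.DataFrame({
    "age": [30, 41],
    "survey_time": ["2015-08-04 14:18:00", "2015-12-31 09:05:00"],
    "edu_other": [None, "x"],
})

# Every column that is not a numeric dtype.
non_numeric_cols = demo.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```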

# Parse the survey timestamp and derive time features from it
train['survey_time'] = pd.to_datetime(train['survey_time'],format='%Y-%m-%d %H:%M:%S')
train["weekday"]=train["survey_time"].dt.weekday
train["year"]=train["survey_time"].dt.year
train["quarter"]=train["survey_time"].dt.quarter
train["hour"]=train["survey_time"].dt.hour
train["month"]=train["survey_time"].dt.month
# Attributes exposed by the pandas .dt accessor (same names as on datetime.datetime):
# dt.year, dt.month, dt.day       year, month, day
# dt.hour, dt.minute, dt.second   hour, minute, second
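
To see what the derived columns contain, the following sketch applies the same `.dt` attributes to two hand-written timestamps. The timestamps are made up and merely follow the format string used above.

```python
import pandas as pd

# Two made-up timestamps in the same format as the format string above.
ts = pd.to_datetime(
    pd.Series(["2015-08-04 14:18:00", "2015-12-31 09:05:00"]),
    format="%Y-%m-%d %H:%M:%S",
)

parts = pd.DataFrame({
    "weekday": ts.dt.weekday,  # Monday is 0, Sunday is 6
    "year": ts.dt.year,
    "quarter": ts.dt.quarter,
    "hour": ts.dt.hour,
    "month": ts.dt.month,
})
print(parts)
```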


# Drop variables with very many missing values and variables that are irrelevant
train = train.drop(columns=["id","survey_time","edu_other"])


# Party membership, other property, other investments: convert to 0/1 indicators (0 if missing, 1 otherwise)
train["join_party"]=train["join_party"].map(lambda x:0 if pd.isnull(x)  else 1)
train["property_other"]=train["property_other"].map(lambda x:0 if pd.isnull(x)  else 1)
train["invest_other"]=train["invest_other"].map(lambda x:0 if pd.isnull(x)  else 1)
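
The three `map` calls can also be written with `notna()`, which yields the same 0/1 indicator. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up values: NaN means the respondent gave no answer.
demo = pd.DataFrame({
    "join_party": [np.nan, 1998.0, np.nan],
    "invest_other": [None, "fund", None],
})

# 1 where a value is present, 0 where it is missing.
for col in ["join_party", "invest_other"]:
    demo[col] = demo[col].notna().astype(int)
print(demo)
```

Because the invalid codes were replaced with NaN earlier, an invalid answer in these columns also ends up as 0.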
  • Process the test data in the same way
test = test.replace([-8, -1, -2, -3], np.nan)

# Derive the time features
test['survey_time'] = pd.to_datetime(test['survey_time'],format='%Y-%m-%d %H:%M:%S')
test["weekday"]=test["survey_time"].dt.weekday
test["year"]=test["survey_time"].dt.year
test["quarter"]=test["survey_time"].dt.quarter
test["hour"]=test["survey_time"].dt.hour
test["month"]=test["survey_time"].dt.month

test = test.drop(columns=["id","survey_time","edu_other"])

# Party membership, other property, other investments: convert to 0/1 indicators (0 if missing, 1 otherwise)
test["join_party"]=test["join_party"].map(lambda x:0 if pd.isnull(x)  else 1)
test["property_other"]=test["property_other"].map(lambda x:0 if pd.isnull(x)  else 1)
test["invest_other"]=test["invest_other"].map(lambda x:0 if pd.isnull(x)  else 1)
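
Since the train and test frames go through identical steps, the steps could be collected in one function so the two cannot drift apart. The helper `preprocess` below is hypothetical (it is not part of the original code), and the demo frame contains made-up values.

```python
import numpy as np
import pandas as pd

INVALID_CODES = [-8, -1, -2, -3]
FLAG_COLS = ["join_party", "property_other", "invest_other"]
TIME_PARTS = ["weekday", "year", "quarter", "hour", "month"]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the cleaning steps to a copy of df."""
    df = df.replace(INVALID_CODES, np.nan)  # returns a new frame
    ts = pd.to_datetime(df["survey_time"], format="%Y-%m-%d %H:%M:%S")
    for part in TIME_PARTS:
        df[part] = getattr(ts.dt, part)
    df = df.drop(columns=["id", "survey_time", "edu_other"])
    for col in FLAG_COLS:
        df[col] = df[col].notna().astype(int)
    return df

# Made-up rows that mimic the columns used above.
demo = pd.DataFrame({
    "id": [1, 2],
    "survey_time": ["2015-08-04 14:18:00", "2015-12-31 09:05:00"],
    "edu_other": [np.nan, np.nan],
    "join_party": [np.nan, 1998],
    "property_other": [np.nan, np.nan],
    "invest_other": [np.nan, "fund"],
    "income": [-8, 5000],
})
clean = preprocess(demo)
print(clean)
```

With such a helper, the two frames would be prepared as `train = preprocess(train)` and `test = preprocess(test)`.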

# Fill missing values with KNN imputation

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5, metric='nan_euclidean')
test_filled = imputer.fit_transform(test)
test[test.columns] = test_filled

test.head()

2. Filling missing values with KNN

# Fill missing values with KNN imputation
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5, metric='nan_euclidean')
train_filled = imputer.fit_transform(train)
train[train.columns] = train_filled
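
The code above fits one imputer on the test set and another on the training set. A common alternative is to fit the imputer once on the training features and reuse it on the test features, so both sets are filled from the same neighbors. The sketch below shows that pattern on tiny synthetic frames. It assumes the `happiness` column has already been split off so that both frames share the same columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Tiny synthetic feature frames; column names are placeholders.
train_demo = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [10.0, np.nan, 30.0, 40.0]})
test_demo = pd.DataFrame({"a": [np.nan, 3.0], "b": [20.0, np.nan]})

imputer = KNNImputer(n_neighbors=2, metric="nan_euclidean")

# Learn the neighbors from the training features only.
train_filled = pd.DataFrame(imputer.fit_transform(train_demo), columns=train_demo.columns)

# Reuse the fitted imputer on the test features.
test_filled = pd.DataFrame(imputer.transform(test_demo), columns=test_demo.columns)
print(train_filled)
print(test_filled)
```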


train["happiness"]=train["happiness"].map(lambda x:x-1)
y = train["happiness"].astype(int)  # class labels must be integers

train = train.drop(columns = ["happiness"])
x = train
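
One detail in this step deserves attention: invalid codes in `happiness` were turned into NaN earlier, so the imputer fills the target as well, possibly with a non-integer average that `astype(int)` then truncates. A hedged alternative is to drop rows whose target is missing before imputing the features. The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; -8 in happiness stands for an invalid answer.
demo = pd.DataFrame({"happiness": [4, -8, 5, 3], "income": [3000, 4000, -1, 2500]})
demo = demo.replace([-8, -1, -2, -3], np.nan)

# Drop rows without a valid target instead of imputing the target.
demo = demo.dropna(subset=["happiness"])

# Shift labels from 1-5 to 0-4 and split off the features.
y_demo = demo["happiness"].astype(int) - 1
X_demo = demo.drop(columns=["happiness"])
print(y_demo.tolist(), X_demo.shape)
```

The remaining NaN values in `X_demo` are feature values only, which is what the imputer is meant to fill.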

3. Fitting and predicting with logistic regression

from sklearn.linear_model import LogisticRegression
## Define the logistic regression model
clf = LogisticRegression(random_state=56, solver='lbfgs', max_iter=10000)

# Fit the logistic regression model on the training set
clf.fit(x, y)
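
The model above is fitted on unscaled features. Regularized logistic regression can behave poorly when features sit on very different scales (an income column next to 1-5 survey answers, for example), so standardizing first is worth trying. The sketch below uses synthetic data from `make_classification`, not the competition data, and whether scaling changes the result on the real data is untested here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem for illustration.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=56
)
X_demo[:, 0] *= 10000  # put one feature on a much larger scale

# Standardize the features, then fit logistic regression.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=56),
)
model.fit(X_demo, y_demo)
pred = model.predict(X_demo)
train_acc = model.score(X_demo, y_demo)
print("training accuracy:", train_acc)
```

Because the scaler lives inside the pipeline, calling `model.predict` on new data applies the same scaling automatically.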


# Predict on both the training set and the test set with the fitted model
train_predict = clf.predict(x)
test_predict = clf.predict(test)
# Shift the predicted labels back from 0-4 to the original 1-5 scale
test_predict = test_predict + 1

# Measure accuracy
from sklearn import metrics
## Evaluate the model with accuracy (the share of predictions that are correct)
print('The accuracy of the Logistic Regression is:',metrics.accuracy_score(y,train_predict))


test_sub = pd.read_csv(r'D:\学习\数据\快来挖掘幸福感数据\happiness_submit.csv')
test_sub = test_sub.drop(columns = ['id'])
test_sub.head()
print('The accuracy of the Logistic Regression is:',metrics.accuracy_score(test_sub,test_predict))
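
If `happiness_submit.csv` is a submission template rather than the true test labels (an assumption: the file name suggests it, but the source does not say), comparing against it reveals little about model quality. A hold-out split of the training data gives a check that does not depend on that file. The sketch below uses synthetic data and shows validation accuracy together with a confusion matrix, which makes a single-class prediction pattern easy to spot.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 3-class data for illustration.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Hold out 25% of the rows, keeping class proportions similar.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
val_pred = clf_demo.predict(X_val)

val_acc = accuracy_score(y_val, val_pred)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_val, val_pred, labels=[0, 1, 2])
print("validation accuracy:", val_acc)
print(cm)
```

A confusion matrix with all counts in one column means the model predicts a single class regardless of the input.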


test_sub = test_sub.values.ravel()
score = np.sum(test_sub - (test_predict))/len(test_predict)
print(score)
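
The `score` above is the mean signed difference, in which errors of opposite sign cancel out. If the competition ranks submissions by mean squared error (an assumption, since the source does not state the metric), a squared version is closer to what would be reported. The arrays below are made up to show the difference.

```python
import numpy as np

y_ref = np.array([5, 5, 5, 5])

# Case 1: every prediction is one point too low.
pred_low = np.array([4, 4, 4, 4])
mean_diff_low = np.mean(y_ref - pred_low)
mse_low = np.mean((y_ref - pred_low) ** 2)

# Case 2: errors of +1 and -1 cancel in the signed mean but not in the MSE.
pred_mixed = np.array([4, 6, 4, 6])
mean_diff_mixed = np.mean(y_ref - pred_mixed)
mse_mixed = np.mean((y_ref - pred_mixed) ** 2)

print(mean_diff_low, mse_low, mean_diff_mixed, mse_mixed)
```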


result = pd.DataFrame(test_predict, columns = ['prediction'])
result['test_submit'] = test_sub
result.head()


Reposted from blog.csdn.net/weixin_49340599/article/details/112493286