Kaggle (1): Titanic

I have learned plenty of theory, but I have not built up much hands-on practice, so now I work through one problem a day.

---------------------------A journey of a thousand li begins with a single step---------------------------------------

The Titanic data is split into train.csv and test.csv. Each row represents one passenger's details, and each column is a feature. train.csv also carries a survival label: 1 means the passenger survived, 0 means they did not (test.csv omits this column). The task is to train a model on the train data, then use the fitted model and parameters to predict survival for every passenger in test, and finally compare the predictions against the true outcomes to see how well they match.

1: Import the packages

#data processing and wrangling
import pandas as pd
import numpy as np
import random as rnd
#plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#binary-classification models; Random Forest is used for the ensemble
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

2: Load the data

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
combine = [train_df,test_df]  #a list can hold str, dict, int and float, and it can also hold DataFrames. Iterating over this list applies the same processing to both datasets and keeps them consistent.
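As a minimal sketch of the pattern (the fill value 'S' is only an illustrative choice here, not a claim about this dataset):

#the list holds references, so editing dataset inside the loop edits train_df and test_df in place
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')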

3: Analyze the data

View the name of each feature column

print(train_df.columns.values)  #list(train_df.columns.values) or list(train_df) also work
['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']

Take a look at what kind of data each column holds. Numerical or categorical? Numerical data can go straight into computation; categorical data is qualitative.

train_df.head()  #head is a method, so calling it needs parentheses; remember how class methods are invoked. For the first ten rows, use head(10)
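One quick way to separate the numeric columns from the categorical ones is to run describe() twice:

train_df.describe()               #numeric columns: count, mean, std, quartiles
train_df.describe(include=['O'])  #object (string/categorical) columns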

So, keep the following points in mind about the data:

  • Which features are available in the dataset?  
  • Which features are categorical?
  • Which features are numerical?
  • Which features are mixed data types?
  • Which features may contain errors or typos?
  • Which features contain blank, null or empty values?
  • What are the data types for various features?
Check how many entries each column contains; a column with missing values will have a count smaller than the total row count.

train_df.info()
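An equivalent check that reports the missing counts directly:

train_df.isnull().sum()  #number of missing values per column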
Check the relationship between each column and Survived

Idea: filter-style feature selection, i.e., statistically measuring how much each feature contributes to the outcome.

A simpler method: groupby

train_df[['Pclass','Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(['Survived'], ascending=False)

This groups the 'Pclass' feature with the target 'Survived', using Pclass as the label to compute the mean survival rate per class. One drawback: Pclass takes three values, 1, 2, 3. As inputs to x·θ their contributions should be treated as equal categories, yet encoding them as 1, 2, 3 assigns them unequal weight, so one-hot encoding is needed.
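A minimal sketch of the one-hot idea (it is applied in full in the dummy-variable step later):

#turn the single 1/2/3 column into three equal-standing 0/1 columns
pclass_dummies = pd.get_dummies(train_df['Pclass'], prefix='Pclass')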

Check combinations of columns against Survived

grid = sns.FacetGrid(train_df, row='Embarked', col='Survived', size=2.2, aspect=1.6)  #one subplot per (Embarked, Survived) pair
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None)  #bar chart of Fare by Sex in each subplot
grid.add_legend()

4: Data cleaning

Drop dirty data and features that contribute nothing to the result.

train_df = train_df.drop(['Ticket','Cabin'], axis=1)  #axis=1 drops columns
test_df = test_df.drop(['Ticket','Cabin'], axis=1)
combine = [train_df, test_df]  #drop from each DataFrame separately, then rebuild combine

Create a new feature extracted from an old one; in effect, a reprocessing of the old feature.

We want to analyze if Name feature can be engineered to extract titles and test correlation between titles and survival, before dropping Name and PassengerId features. This feels a bit like alchemy, but it teaches something useful: how to do regex work on a DataFrame. Names in this dataset carry a title ending in a period, such as 'Mr.', so matching the word just before the '.' pulls out each passenger's title.

for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(r'([A-Za-z]+)\.', expand=False)      #dataset.ColumnName is equivalent to dataset['ColumnName']
pd.crosstab(train_df['Title'],train_df['Sex'])    #the first argument sets the index, the second sets the columns
Sex       female  male
Title
Capt           0     1
Col            0     2
Countess       1     0
Don            0     1
Dr             1     6
Jonkheer       0     1
Lady           1     0
Major          0     2
Master         0    40
You can take this a step further:
pd.crosstab(train_df['Title'], train_df['Sex'], values=train_df['Survived'], aggfunc=sum)  #sum the Survived values in each cell

Replace the feature values:

for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
        'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')   #replace(x, y) replaces x with y

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    
train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

train_df.head()

###########################Further data processing##############################

Dummy-variable handling (this is a second pass at the data, done after running the models below for comparison; the predictions below all use the data without dummy encoding)

dict_Pclass = {1:'1st',2:'2nd',3:'3rd'}
#astype(int) on Pclass never felt right: 3rd class (3), 2nd class (2) and 1st class (1) are equal-standing categories, so convert back: replace the numbers 1, 2, 3 with strings and handle them with get_dummies.
train_df["Pclass"] = train_df["Pclass"].map(dict_Pclass)
test_df["Pclass"] = test_df["Pclass"].map(dict_Pclass)
test_df = test_df.drop(['Age*Pclass'], axis=1)  #drop the Age*Pclass interaction feature (created in a step not shown here)
train_df = pd.get_dummies(train_df)  #split each string column into new 0/1 features: 1 if the sample has that category, 0 if not
test_df = pd.get_dummies(test_df)   #pro: easier to compute with; con: dilutes the features
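One pitfall worth guarding against here (a sketch, assuming some category might appear in only one of the two files): running get_dummies on train and test separately can produce mismatched columns, so the test frame can be aligned against the train features:

feature_cols = [c for c in train_df.columns if c != 'Survived']  #assumes both frames otherwise share the same id columns
test_aligned = test_df.reindex(columns=feature_cols, fill_value=0)  #train-only dummy columns are added to test as zeros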


Training-set accuracy after this processing:

(left: with dummy encoding; right: without)

LR:            80.92 < 81.37
SVC:           83.28 < 83.95
kNN:           84.51 < 84.95
naive_bayes:   75.53 > 73.73
perceptron:    81.14 > 74.06
linear_SVC:    81.93 > 81.14
Decision_Tree: 86.76 = 86.76
Random_Forest: 86.64 = 86.64

In principle, dummy encoding should make the linear models more accurate, so I don't know why the LR, SVC and kNN scores got smaller. Maybe the sample is too small for the effect to show?
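One way to probe this (a sketch, assuming the engineered features are all numeric and complete at this point): training-set accuracy rewards overfitting, so k-fold cross-validation gives a fairer comparison of the two encodings.

from sklearn.model_selection import cross_val_score

X = train_df.drop(['Survived'], axis=1)
y = train_df['Survived']
scores = cross_val_score(LogisticRegression(), X, y, cv=5)  #5-fold accuracy
print(round(scores.mean() * 100, 2))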

################In short, a first pass over the data is done###################

X_train = train_df.drop(['Survived'],axis = 1)
y_train = train_df['Survived']

X_test = test_df.drop(['PassengerId'],axis=1).copy()  #copy so later changes don't touch test_df

#Logistic Regression
logist = LogisticRegression()  #instantiate the classifier
logist.fit(X_train,y_train)   #call fit to train the model; no return value needed
y_predict = logist.predict(X_test)
acc_log = round(logist.score(X_train, y_train) * 100, 2)
acc_log

#81.37

coeff_df = pd.DataFrame(train_df.columns.delete(0))  #feature names, minus the first column

coeff_df["Correlation"] = pd.Series(logist.coef_[0])  #logistic-regression coefficient for each feature

coeff_df.sort_values(by='Correlation', ascending=False)  #strongest positive contributors first

svc = SVC()
svc.fit(X_train,y_train)
svc_predict = svc.predict(X_test)
acc_svc = round(svc.score(X_train,y_train ) *100,2)   #mean training accuracy of the model
print(acc_svc)

#83.95

kn = KNeighborsClassifier()
kn.fit(X_train,y_train)
y_test = kn.predict(X_test)
acc_knn = round(kn.score(X_train,y_train) * 100, 2)
acc_knn

#84.96

gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
y_test = gaussian.predict(X_test)
acc_gaus = round(gaussian.score(X_train,y_train) * 100, 2)
acc_gaus

#73.74

percep = Perceptron()
percep.fit(X_train, y_train)
y_test = percep.predict(X_test)
acc_per = round(percep.score(X_train, y_train) * 100, 2)
acc_per

#74.07

l_svc = LinearSVC()
l_svc.fit(X_train,y_train)
y_test = l_svc.predict(X_test)
acc_lsvc = round(l_svc.score(X_train, y_train) * 100 , 2)
acc_lsvc

#81.14

deci_tre = DecisionTreeClassifier()
deci_tre.fit(X_train,y_train)
deci_tre_y_test = deci_tre.predict(X_test)
acc_deci = round(deci_tre.score(X_train, y_train) * 100 , 2)
acc_deci

#86.76

ran_fo = RandomForestClassifier()
ran_fo.fit(X_train, y_train )
y_test = ran_fo.predict(X_test)
acc_ran_fo = round(ran_fo.score(X_train, y_train) * 100, 2)
acc_ran_fo

#86.64
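To compare everything at a glance, the accuracy variables defined above can be collected into one table:

models = pd.DataFrame({
    'Model': ['Logistic Regression', 'SVC', 'KNN', 'Naive Bayes',
              'Perceptron', 'Linear SVC', 'Decision Tree', 'Random Forest'],
    'Score': [acc_log, acc_svc, acc_knn, acc_gaus,
              acc_per, acc_lsvc, acc_deci, acc_ran_fo]})
models.sort_values(by='Score', ascending=False)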


submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": y_test    #y_test currently holds the random forest predictions (the last model run)
    })

submission.to_csv("Titanic_submission.csv",index=False)


Reprinted from blog.csdn.net/gaoyishu91/article/details/80087303