Project 1: Kaggle Titanic
Prerequisites: Chapters 2, 4, 5, and 6 of 《Python机器学习基础教程》 (Introduction to Machine Learning with Python).
Environment: Jupyter Notebook
Code:
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import warnings
warnings.filterwarnings(action='ignore')
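step 1: Loading DataSet
The training set and test set are stored in the same directory as the code.
train = pd.read_csv('titanic_train.csv')
test = pd.read_csv('titanic_test.csv')
# Second, untouched copy of the test set; its PassengerId column is needed for the submission
test2 = pd.read_csv('titanic_test.csv')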
titanic = pd.concat([train, test], sort=False)
len_train = train.shape[0]
step 2: Data Analysis
The code below shows that features such as Age and Cabin have missing values.
titanic.isnull().sum()[titanic.isnull().sum() > 0]
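To make the filter above easier to read, here is a minimal sketch on a toy frame. The column values are made up for illustration and are not the Titanic data; the only idea shown is counting missing values per column and keeping the columns whose count is above zero.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the combined Titanic frame.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

# Count missing values per column once, then keep only the non-zero counts.
missing = toy.isnull().sum()
missing = missing[missing > 0]
print(missing)
```

Storing the counts in a variable avoids computing `isnull().sum()` twice, which is what the one-line version does.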
Fill in the missing values
# Fill with the mean
train.Age = train.Age.fillna(train.Age.mean())
test.Age = test.Age.fillna(test.Age.mean())
train.Fare = train.Fare.fillna(train.Fare.mean())
test.Fare = test.Fare.fillna(test.Fare.mean())
# Missing cabins get a placeholder category
train.Cabin = train.Cabin.fillna('unknown')
test.Cabin = test.Cabin.fillna('unknown')
# Fill with the first mode
train.Embarked = train.Embarked.fillna(train['Embarked'].mode()[0])
test.Embarked = test.Embarked.fillna(test['Embarked'].mode()[0])
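The two filling strategies can be checked on a small example. This is only a sketch on made-up values: a numeric column is filled with its mean, and a categorical column with its first mode.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Titanic files.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", None, "S"],
})

# Numeric column: mean() skips NaN, so the gap is filled with (20 + 40) / 2.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Categorical column: mode() returns a Series (ties give several values),
# so [0] picks the first one.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])
print(toy)
```

Because `mode()` can return more than one value when there is a tie, indexing with `[0]` is what makes the fill value a single scalar.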
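step 3: Feature Engineering (the most important step!)
# Keep the part of Name before the first comma (the surname), strip surrounding whitespace,
# and store it as a new feature called Name2 (a new column needs bracket assignment)
train['Name2'] = train['Name'].apply(lambda x: x.split(',')[0].strip())
test['Name2'] = test['Name'].apply(lambda x: x.split(',')[0].strip())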
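As a quick illustration of what the lambda does, the sketch below applies it to two names in the "Surname, Title. Given names" format used by the dataset. The small frame `names` is a stand-in created for this example.

```python
import pandas as pd

# Two sample names in the dataset's "Surname, Title. Given names" format.
names = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]})

# split(",")[0] keeps everything before the first comma; strip() removes stray spaces.
names["Name2"] = names["Name"].apply(lambda x: x.split(",")[0].strip())
print(names["Name2"].tolist())
```

Note that a new column has to be created with bracket assignment (`df["Name2"] = ...`); attribute-style assignment (`df.Name2 = ...`) only sets a Python attribute and does not add a column to the frame.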
# Drop unwanted features
train.drop(['PassengerId', 'Name'], axis=1, inplace=True)
test.drop(['PassengerId', 'Name'], axis=1, inplace=True)
# Turn categorical (string) data into numerical data
titanic = pd.concat([train, test], sort=False)
titanic = pd.get_dummies(titanic)
# Split the processed data back into the training set and the test set
train = titanic[:len_train]
test = titanic[len_train:]
X_train = train.drop('Survived', axis=1)
y_train = train['Survived']
X_test = test.drop('Survived', axis=1)
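The reason for concatenating before calling `pd.get_dummies` is that the two sets may not contain the same categories. The sketch below uses made-up frames (`toy_train`, `toy_test`) to show the mismatch when encoding separately and the shared columns when encoding together.

```python
import pandas as pd

# Hypothetical toy frames; the test set contains a category ("Q") the training set lacks.
toy_train = pd.DataFrame({"Embarked": ["S", "C"], "Survived": [0, 1]})
toy_test = pd.DataFrame({"Embarked": ["Q", "S"]})
n_train = toy_train.shape[0]

# Encoding each set on its own gives different dummy columns.
sep_train_cols = list(pd.get_dummies(toy_train[["Embarked"]]).columns)
sep_test_cols = list(pd.get_dummies(toy_test[["Embarked"]]).columns)

# Encoding the concatenated frame gives one shared set of columns,
# which is then cut back into the two parts by row position.
combined = pd.get_dummies(pd.concat([toy_train, toy_test], sort=False))
enc_train = combined[:n_train]
enc_test = combined[n_train:].drop("Survived", axis=1)
print(sep_train_cols, sep_test_cols, list(enc_test.columns))
```

A side effect worth knowing: after concatenation the `Survived` column holds NaN for the test rows, so it is stored as floating point.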
step 4: Model
r = [0.0001, 0.001, 0.1, 1, 10, 50, 100]
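# Parameter grid: 'svc__' addresses the SVC step of the pipeline
PSVM = [{'svc__C': r, 'svc__kernel': ['linear']},
        {'svc__C': r, 'svc__gamma': r, 'svc__kernel': ['rbf']}]
svc = make_pipeline(StandardScaler(), SVC(random_state=1))
# Inner loop: grid search with 2-fold cross-validation
GSSVM = GridSearchCV(estimator=svc, param_grid=PSVM, scoring='accuracy', cv=2)
# Outer loop: 5-fold cross-validation of the whole grid search
scores = cross_val_score(GSSVM, X_train.astype(float), y_train, scoring='accuracy', cv=5)
# np.mean(scores)  # mean accuracy is approximately 0.82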
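Passing a grid search object to `cross_val_score` is a nested cross-validation: the inner loop picks hyperparameters, and the outer loop estimates how well that whole procedure generalizes. The sketch below shows the pattern on synthetic data from `make_classification`, with a smaller grid so that it runs quickly. The data and the grid values are assumptions for illustration, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the Titanic features).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(random_state=1))
# make_pipeline names each step after its lowercased class name,
# so SVC parameters are addressed as 'svc__<param>'.
step_names = [name for name, _ in pipe.steps]

grid = [{"svc__C": [0.1, 1, 10], "svc__kernel": ["linear"]},
        {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1], "svc__kernel": ["rbf"]}]

# Inner loop: 2-fold grid search over the pipeline.
inner = GridSearchCV(estimator=pipe, param_grid=grid, scoring="accuracy", cv=2)
# Outer loop: each of the 5 folds refits the whole grid search on its training part.
outer_scores = cross_val_score(inner, X, y, scoring="accuracy", cv=5)
print(outer_scores.mean())
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no statistics from the held-out fold leak into the scaling.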
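step 5: Submission
GSSVM.fit(X_train.astype(float), y_train)
pre = GSSVM.predict(X_test.astype(float))
# Survived became float when train and test were concatenated, so cast the predictions back to int
output = pd.DataFrame({'PassengerId': test2['PassengerId'],
                       'Survived': pre.astype(int)})
output.to_csv('Submission.csv', index=False)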
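Before uploading, it can help to check the shape of the file being written. The sketch below builds a tiny submission from made-up ids and predictions and writes it to an in-memory buffer instead of a file. The values are hypothetical, and the cast to `int` reflects the assumption that integer 0/1 labels are the safer format for the `Survived` column.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical ids and predictions; the predictions are floats here because
# a label column that picked up NaN during concatenation is stored as float.
ids = pd.Series([892, 893, 894], name="PassengerId")
pred = np.array([0.0, 1.0, 0.0])

submission = pd.DataFrame({"PassengerId": ids, "Survived": pred.astype(int)})

# Write to a string buffer so the resulting CSV text can be inspected directly.
buf = io.StringIO()
submission.to_csv(buf, index=False)
csv_text = buf.getvalue()
print(csv_text)
```

Passing `index=False` keeps the DataFrame's row index out of the file, so the CSV has exactly the two expected columns.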