First Kaggle experience: Titanic survival prediction

After studying the ID3, C4.5, and CART decision tree algorithms, I wanted somewhere to try them out. Kaggle's Titanic practice competition is a good fit, so here is a record of the attempt.

Process

First, register an account, then search for Titanic under Competitions in the top menu bar to find the Titanic practice competition. Practice competitions are labeled Getting Started, and the competition page has a lot of recommended introductory material that is worth a look.

Obtaining a data set
Explore the data set
Cleaning datasets
Feature Selection
Training data set
Forecast data set
Submit the results file

Obtaining a data set

The data is under the Data tab on the competition page. There are three files:

train.csv: the training data set
test.csv: the data set whose results need to be predicted
gender_submission.csv: a template for the submission file (submitting it directly scores about 77% accuracy, while my own Python code calling a machine learning algorithm only reached 75%, which is interesting)
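
As I understand it, the template file is not random: it predicts that every female passenger survived and every male passenger did not, which is why it makes such a strong baseline. A minimal sketch of that rule, using a small made-up frame (toy_test and baseline are names invented here, not part of the original code):

```python
import pandas as pd

# Hypothetical rows standing in for test.csv (illustrative values only)
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Gender-only rule: predict 1 (survived) for female, 0 otherwise
baseline = pd.DataFrame({
    "PassengerId": toy_test["PassengerId"],
    "Survived": (toy_test["Sex"] == "female").astype(int),
})
print(baseline)
```

Any model trained later has to beat this one-feature rule to be worth the extra effort.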

Explore the data set

After obtaining the data set, we need to look at what it contains, whether the data is complete, and so on. The site describes the columns as follows:

PassengerId: passenger ID
Survived: whether the passenger survived
Pclass: ticket class
Name: passenger name
Sex: passenger sex
Age: passenger age
SibSp: number of relatives aboard (siblings and spouses)
Parch: number of relatives aboard (parents and children)
Ticket: ticket number
Fare: ticket fare
Cabin: cabin number
Embarked: port of embarkation

The Python code used is as follows:

import pandas as pd

train_data = pd.read_csv("../docs/train.csv")
test_data = pd.read_csv("../docs/test.csv")
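# Basic overview of the table: row count, column count, dtype and completeness of each column
print(train_data.info())
print("_"*30)
# Summary statistics: count, mean, standard deviation, minimum, maximum
print(train_data.describe())
print("_"*30)
# Overview of the string (non-numeric) columns
print(train_data.describe(include=['O']))
print("_"*30)
# First five rows
print(train_data.head())
print("_"*30)
# Last five rows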
print(train_data.tail())
print("_"*30)

The output is as follows:

RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
______________________________
       PassengerId    Survived      Pclass  ...       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  ...  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642  ...    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071  ...    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000  ...    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000  ...    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000  ...    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000  ...    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000  ...    8.000000    6.000000  512.329200

[8 rows x 7 columns]
______________________________
                            Name   Sex  Ticket    Cabin Embarked
count                        891   891     891      204      889
unique                       891     2     681      147        3
top     Allen, Mr. William Henry  male  347082  B96 B98        S
freq                           1   577       7        4      644
______________________________
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]
______________________________
     PassengerId  Survived  Pclass  ...   Fare Cabin  Embarked
886          887         0       2  ...  13.00   NaN         S
887          888         1       1  ...  30.00   B42         S
888          889         0       3  ...  23.45   NaN         S
889          890         1       1  ...  30.00  C148         C
890          891         0       3  ...   7.75   NaN         Q

[5 rows x 12 columns]
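
The non-null counts in the info() output (714 for Age, 204 for Cabin, 889 for Embarked, out of 891 rows) can also be turned into direct missing-value counts. A minimal sketch, using a small made-up DataFrame (called toy here, not part of the original code) in place of train.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for train.csv (values are illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Number of missing values per column; on the real data this would be
# train_data.isnull().sum()
missing_counts = toy.isnull().sum()
print(missing_counts)

# Fraction of rows missing in each column
missing_ratio = toy.isnull().mean()
print(missing_ratio)
```

A column that is mostly missing, as Cabin is in the real data, is a candidate for dropping rather than filling.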

Cleaning datasets

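Exploration shows missing values in Age, Cabin, and Embarked (Fare is filled as well, to cover the test set). Age and Fare are numeric, so they are simply filled with the mean. Embarked is a string column whose most frequent value is S, so its missing values are simply filled with S. Cabin is dealt with in the feature selection step.

# Assign the filled column back; calling fillna(..., inplace=True) on a
# selected column is deprecated in recent pandas versions
train_data["Age"] = train_data["Age"].fillna(train_data["Age"].mean())
test_data["Age"] = test_data["Age"].fillna(test_data["Age"].mean())

train_data["Fare"] = train_data["Fare"].fillna(train_data["Fare"].mean())
test_data["Fare"] = test_data["Fare"].fillna(test_data["Fare"].mean())

train_data["Embarked"] = train_data["Embarked"].fillna("S")
test_data["Embarked"] = test_data["Embarked"].fillna("S")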
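
To check that a fill like this leaves no gaps, and to look up the most frequent port instead of hard-coding it, something along these lines should work. This is only a sketch on a made-up frame (toy and most_common_port are names invented here):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_data
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Fare": [10.0, 20.0, np.nan],
    "Embarked": ["S", np.nan, "S"],
})

# Numeric columns: fill with the column mean
for col in ["Age", "Fare"]:
    toy[col] = toy[col].fillna(toy[col].mean())

# String column: look up the most frequent value instead of hard-coding "S"
most_common_port = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common_port)

print(toy)
print(toy.isnull().sum().sum())  # total number of missing values left
```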

Feature Selection

From the data exploration: PassengerId is just the passenger number and is useless for classification; Name is the passenger's name and is also useless; Cabin has too many missing values and is dropped for now; Ticket is the ticket number, which is messy and irregular, so it is dropped too. The remaining fields are Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked. These features are used as training data, with the string-valued ones encoded into a numeric representation.

from sklearn.feature_extraction import DictVectorizer

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_features = train_data[features]
train_labels = train_data['Survived']
test_features = test_data[features]

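# One-hot encode the string columns (Sex, Embarked); numeric columns pass through unchanged
dvec = DictVectorizer(sparse=False)
# orient must be spelled 'records'; the short form 'record' is rejected by current pandas
train_features = dvec.fit_transform(train_features.to_dict(orient='records'))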
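
To see what DictVectorizer does to the string columns, here is a small sketch on two made-up passenger records (records and vec are names invented for this example):

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up passenger records (illustrative values only)
records = [
    {"Pclass": 3, "Sex": "male", "Embarked": "S"},
    {"Pclass": 1, "Sex": "female", "Embarked": "C"},
]

vec = DictVectorizer(sparse=False)
encoded = vec.fit_transform(records)

# Each string feature becomes one 0/1 column per observed value, named
# "feature=value"; numeric features keep a single column
print(vec.feature_names_)
print(encoded)
```

Because the column layout is learned in fit_transform, the test set must later go through transform on the same fitted object so that both matrices share the same columns.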

Training data set

Train a decision tree model using Python's scikit-learn machine learning library:

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(train_features, train_labels)
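
Setting criterion="entropy" makes the tree choose splits by information gain, the same measure ID3 is built on. As a rough illustration of the calculation, here is a small sketch. The root counts follow from the describe() output (a Survived mean of 0.383838 over 891 rows is 342 survivors), while the child counts are made up for the example:

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Entropy reduction obtained by splitting the parent into child nodes."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Class counts at the root: 549 did not survive, 342 survived
root = [549, 342]
print(entropy(root))

# Hypothetical split into two child nodes (counts are made up for
# illustration; they add up to the root counts)
children = [[450, 100], [99, 242]]
print(information_gain(root, children))
```

The tree builder evaluates candidate splits this way and keeps the one with the largest gain at each node.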

Forecast data set

Predict on the test set and write the results to a CSV file for submission to Kaggle:

import csv

# Encode the test features with the vectorizer fitted on the training data
test_features = dvec.transform(test_features.to_dict(orient="records"))
pred_labels = clf.predict(test_features)
print(test_features)
print(pred_labels)
print("_"*30)

# Write one row per test passenger: PassengerId and the predicted label
with open("submission.csv", encoding="utf-8", mode="w", newline="") as f:
    writer = csv.writer(f, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, label in zip(test_data["PassengerId"], pred_labels):
        writer.writerow([passenger_id, label])
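
Since the data already lives in pandas, the same two-column file could probably be written with DataFrame.to_csv instead. A sketch with made-up IDs and predictions, writing to an in-memory buffer rather than submission.csv (passenger_ids, pred, and buffer are names invented here):

```python
import io

import pandas as pd

# Made-up IDs and predictions standing in for test_data["PassengerId"] and pred_labels
passenger_ids = [892, 893, 894]
pred = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": pred})

# index=False keeps the row index out of the file, leaving exactly two columns
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

With the real data, passing the path "submission.csv" in place of the buffer would write the file to disk.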

Some validation can be done as well. The first check below simply scores the model on the training data it was fitted on; the second uses K-fold cross-validation:

import numpy as np
from sklearn.model_selection import cross_val_score

acc_decision_tree = round(clf.score(train_features, train_labels), 6)
print(acc_decision_tree)
print("_"*30)

print(np.mean(cross_val_score(clf, train_features, train_labels, cv=10)))
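
The score on the training data tends to be overly optimistic, because an unpruned tree can memorize the rows it was fitted on; cross-validation scores each fold with a model that never saw it. A rough sketch of what the K-fold loop does, on synthetic data instead of the Titanic features. Plain KFold is used here for simplicity, whereas cross_val_score with an integer cv uses stratified folds for classifiers:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features and labels
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

# Accuracy on the same data the tree was fitted on
full_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
train_score = full_tree.score(X, y)

# 10-fold cross-validation: fit on nine folds, score on the held-out fold
fold_scores = []
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, valid_idx in kfold.split(X):
    model = DecisionTreeClassifier(criterion="entropy", random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[valid_idx], y[valid_idx]))

print(train_score)
print(np.mean(fold_scores))
```

The cross-validated mean is the number to compare against the Kaggle score, not the training accuracy.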

Submit the results file

On the Kaggle Titanic competition page, click Submit Predictions and upload the submission.csv file generated in the previous step. My ranking came out somewhere past 10,000. Well, the ranking is not the point; the attempt was still worthwhile as a quick pass through the whole Kaggle prediction procedure and workflow.

Some further attempts of my own afterwards

Filling Age and Fare with the median instead: not as good as the mean. The prediction scored 69%, versus 73% with the mean.
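
The only change for this experiment is the statistic passed to fillna. On a skewed column the two can differ a lot: in the describe() output, Fare has a mean of 32.20 but a median of 14.45. A small sketch with made-up fares (fares, filled_mean, and filled_median are names invented here):

```python
import numpy as np
import pandas as pd

# Made-up skewed fares with one missing value at the end
fares = pd.Series([7.25, 8.05, 13.0, 30.0, 512.33, np.nan])

# Same fill, two different statistics
filled_mean = fares.fillna(fares.mean())
filled_median = fares.fillna(fares.median())

print(filled_mean.iloc[-1])    # value used for the missing entry with the mean
print(filled_median.iloc[-1])  # value used for the missing entry with the median
```

A single very large fare pulls the mean well above the typical ticket price, while the median is unaffected by it.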