The Titanic machine learning dataset

The Titanic passenger dataset, like the Iris dataset, is one of the most commonly used sample datasets in machine learning

Download the dataset

Log in to https://www.kaggle.com and open your account page, for example
https://www.kaggle.com/walterfan/account , then select "Create API Token" to download kaggle.json

The file contents look like this

{"username":"$user_name","key":"$user_key"}

Install kaggle and download the Titanic dataset. First set up the environment, using any one of the three options below

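mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json

# or point KAGGLE_CONFIG_DIR at the directory that contains kaggle.json

chmod 600 /workspace/config/kaggle.json
export KAGGLE_CONFIG_DIR=/workspace/config

# or pass the credentials through environment variables

export KAGGLE_USERNAME=$user_name
export KAGGLE_KEY=$user_key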

pip install kaggle
kaggle competitions download -c titanic
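
The plotting code later in this post reads ./titanic/train.csv, so the download has to be unpacked into a titanic directory first. Here is a minimal sketch of that step, assuming the command above left an archive named titanic.zip in the current directory. The archive name and the helper extract_archive are assumptions for illustration, not something the Kaggle tooling guarantees.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the sorted file names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return sorted(p.name for p in dest.iterdir())


if __name__ == "__main__":
    # Assumed archive name; adjust it if your download is named differently.
    archive_path = Path("titanic.zip")
    if archive_path.exists():
        print(extract_archive(archive_path, "titanic"))
```

If the files were downloaded individually rather than as an archive, moving them into a titanic directory by hand has the same effect.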

Dataset Overview

The data is split into three CSV files

  • Training set (train.csv): used to build the machine learning model
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
...
  • Test set (test.csv): used to test your machine learning model; it has the same structure as the training set, but without the Survived column

  • Sample submission (gender_submission.csv): a set of predictions that assumes all female passengers, and only they, survived

PassengerId,Survived
892,0
893,1
894,0
895,0
896,1
....
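
The women-only rule behind gender_submission.csv is simple enough to reproduce by hand. The sketch below is an illustration, not the way Kaggle built the file. The helper gender_baseline is a hypothetical name, and the Sex values for passengers 892 to 896 are inferred from the sample rows above under that rule rather than read from test.csv.

```python
import pandas as pd


def gender_baseline(passengers):
    """Predict Survived = 1 for female passengers and 0 for everyone else."""
    return pd.DataFrame({
        "PassengerId": passengers["PassengerId"],
        "Survived": (passengers["Sex"] == "female").astype(int),
    })


# Illustrative input: Sex is inferred from the sample submission rows, not loaded from test.csv.
sample = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896],
    "Sex": ["male", "female", "male", "male", "female"],
})

submission = gender_baseline(sample)
print(submission.to_csv(index=False))
```

With the real data, the same function could be applied to pd.read_csv('./titanic/test.csv') and the result written out with to_csv(index=False).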

Data Dictionary

Variable   Definition                                    Key
survival   Survival                                      0 = No, 1 = Yes
pclass     Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
sex        Sex
Age        Age in years
sibsp      # of siblings / spouses aboard the Titanic
parch      # of parents / children aboard the Titanic
ticket     Ticket number
fare       Passenger fare
cabin      Cabin number
embarked   Port of embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton
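
The keys in the data dictionary can be applied directly to the coded columns. The sketch below does this for the four train.csv rows shown earlier, using the column names from the CSV header. The helper decode and the two lookup tables are hypothetical names introduced only for this example.

```python
import io

import pandas as pd

# The header and first four rows of train.csv, as listed in the dataset overview.
SAMPLE_CSV = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
'''

PORTS = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
CLASSES = {1: "1st", 2: "2nd", 3: "3rd"}


def decode(df):
    """Return a copy of df with the coded columns replaced by readable labels."""
    out = df.copy()
    out["Survived"] = out["Survived"].map({0: "No", 1: "Yes"})
    out["Pclass"] = out["Pclass"].map(CLASSES)
    out["Embarked"] = out["Embarked"].map(PORTS)
    return out


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
decoded = decode(df)
print(decoded[["PassengerId", "Survived", "Pclass", "Embarked"]])
# Empty fields such as a missing Cabin are read as NaN.
print(df["Cabin"].isna().sum())
```

Decoding is mainly useful for reading and plotting; for model training the numeric codes are usually kept.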

Draw a histogram to show the age distribution of the Titanic passengers

import pandas as pd
import matplotlib.pyplot as plt

plt.title('Passenger Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')

# Read the Titanic training set
titanic = pd.read_csv('./titanic/train.csv')
# Drop the samples that have no age
titanic.dropna(subset=['Age'], inplace=True)


plt.hist(titanic.Age,         # data to plot
         bins=10,             # split the age range into 10 equal-width bins
         color='steelblue',   # fill color
         edgecolor='k',       # color of the bar edges
         label='age',         # legend label
         alpha=0.7)           # transparency

plt.legend()
plt.show()

The horizontal axis shows the age groups: the bins parameter splits the age range into 10 equal-width intervals, each about 8 years wide (the ages in train.csv run from 0.42 to 80).
The vertical axis shows the number of samples in each group. Most passengers appear to be roughly between 20 and 30 years old, and there are also quite a few children
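
To check where the bin boundaries fall when bins is an integer, the edges can be computed without drawing anything. This is a small sketch using np.histogram, which applies the same equal-width binning. The helper equal_width_edges is a hypothetical name, and the ages array is an illustrative stand-in that only keeps the youngest and oldest age from train.csv, since the edges depend on nothing else.

```python
import numpy as np


def equal_width_edges(values, bins=10):
    """Return the bin edges of an equal-width histogram over values."""
    _, edges = np.histogram(values, bins=bins)
    return edges


# Illustrative ages: only the minimum (0.42) and maximum (80) affect the edges.
ages = np.array([0.42, 22, 26, 35, 38, 80])
edges = equal_width_edges(ages, bins=10)
width = edges[1] - edges[0]

print(edges.round(2))
print(round(float(width), 3))
```

Because the range is 80 - 0.42 = 79.58, each of the 10 bins is a little under 8 years wide, so the bars do not line up with whole decades.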

1598924-ebd7ca87ec8503f9.png

Set the density parameter to True (named normed in older Matplotlib releases) to draw a frequency histogram

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from matplotlib.ticker import PercentFormatter

plt.title('Passenger Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')

# Read the Titanic training set
titanic = pd.read_csv('./titanic/train.csv')
# Drop the samples that have no age
titanic.dropna(subset=['Age'], inplace=True)


plt.hist(titanic.Age,                                   # data to plot
         bins=np.arange(0, titanic.Age.max() + 10, 10), # 10-year bins from 0 up to the oldest passenger
         density=True,                                  # frequency histogram (formerly normed=True)
         color='steelblue',                             # fill color
         edgecolor='k',                                 # color of the bar edges
         label='age frequency',                         # legend label
         alpha=0.7)                                     # transparency

# density is the share divided by the bin width (10), so 0.1 on the axis means 100%
plt.gca().yaxis.set_major_formatter(PercentFormatter(0.1))

plt.legend()
plt.show()

Children under 10 make up nearly 10% of the passengers with a known age, while passengers between 20 and 40 account for more than half
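
The shares read off the chart can also be computed as numbers. Below is a minimal sketch with pd.cut and 10-year bands. The helper share_by_age_band is a hypothetical name and the ages are made-up illustrative values, not rows from train.csv.

```python
import pandas as pd


def share_by_age_band(ages, width=10):
    """Return the fraction of known ages that falls into each [start, start + width) band."""
    ages = pd.Series(ages, dtype="float64").dropna()
    upper = (int(ages.max() // width) + 1) * width
    edges = list(range(0, upper + width, width))
    bands = pd.cut(ages, bins=edges, right=False)
    return bands.value_counts(normalize=True).sort_index()


# Made-up ages for illustration; None stands for a passenger with no recorded age.
ages = [2, 4, 14, 22, 26, 27, 35, 38, 54, 58, None]
shares = share_by_age_band(ages)
print(shares)
```

On the real data the same call would be share_by_age_band(titanic['Age']). The bands here are half-open on the right, so a passenger aged exactly 10 counts toward the 10 to 20 band.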

1598924-f0b3df901403d904.png

Set the cumulative parameter to True to look at the cumulative frequency distribution

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from matplotlib.ticker import PercentFormatter

plt.title('Passenger Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')

# Read the Titanic training set
titanic = pd.read_csv('./titanic/train.csv')
# Drop the samples that have no age
titanic.dropna(subset=['Age'], inplace=True)


plt.hist(titanic.Age,                                   # data to plot
         bins=np.arange(0, titanic.Age.max() + 10, 10), # 10-year bins from 0 up to the oldest passenger
         density=True,                                  # frequency histogram (formerly normed=True)
         color='steelblue',                             # fill color
         edgecolor='k',                                 # color of the bar edges
         cumulative=True,                               # cumulative histogram
         label='age cumulative frequency',              # legend label
         alpha=0.7)                                     # transparency

# with density and cumulative both set, the last bar equals 1, i.e. 100%
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))

plt.legend()
plt.show()

It can be seen that roughly 80% of the passengers were under 40 years old
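
The height of each bar in the cumulative chart is the running total of the bin counts divided by the overall count. The sketch below reproduces that calculation with np.histogram and np.cumsum. The helper cumulative_share is a hypothetical name and the ages are made-up illustrative values, not rows from train.csv.

```python
import numpy as np


def cumulative_share(ages, edges):
    """Return, for each bin, the fraction of ages that lies in that bin or an earlier one."""
    counts, _ = np.histogram(ages, bins=edges)
    return np.cumsum(counts) / counts.sum()


# Made-up ages for illustration, binned into 10-year bands from 0 to 60.
ages = np.array([2, 4, 14, 22, 26, 27, 35, 38, 54, 58])
edges = np.arange(0, 70, 10)
shares = cumulative_share(ages, edges)

for upper, share in zip(edges[1:], shares):
    print(f"under {upper}: {share:.0%}")
```

For this sample the fourth value is 0.8, meaning 80% of the illustrative ages are below 40, which is the kind of statement read off the chart above.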

1598924-c7c4a8d1002bea98.png

Look at how survival relates to age for female and male passengers

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Read the Titanic training set
titanic = pd.read_csv('./titanic/train.csv')
# Drop the samples that have no age
titanic.dropna(subset=['Age'], inplace=True)

survived = 'survived'
not_survived = 'not survived'

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
women = titanic[titanic['Sex'] == 'female']
men = titanic[titanic['Sex'] == 'male']

# histplot is used here because distplot is deprecated in recent seaborn releases
ax = sns.histplot(women[women['Survived'] == 1].Age, bins=18, label=survived, ax=axes[0])
ax = sns.histplot(women[women['Survived'] == 0].Age, bins=40, label=not_survived, ax=axes[0])
ax.legend()
ax.set_title('Female')

ax = sns.histplot(men[men['Survived'] == 1].Age, bins=18, label=survived, ax=axes[1])
ax = sns.histplot(men[men['Survived'] == 0].Age, bins=40, label=not_survived, ax=axes[1])
ax.legend()
ax.set_title('Male')
plt.show()

1598924-073c06ba7b18cdca.png

Look at the relationship between ticket class and survival

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

titanic = pd.read_csv('./titanic/train.csv')
# Each bar shows the mean of Survived, i.e. the survival rate, for one ticket class
sns.barplot(x='Pclass', y='Survived', data=titanic)
plt.show()

1598924-c304c6ba79307783.png
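
By default the bar plot draws the mean of Survived for each Pclass value, so the same survival rates can be obtained as plain numbers with a groupby. The sketch below uses only the Pclass and Survived values of the four train.csv rows listed in the dataset overview. The helper survival_rate_by is a hypothetical name.

```python
import pandas as pd


def survival_rate_by(df, column):
    """Return the mean of Survived (the survival rate) for each value of column."""
    return df.groupby(column)["Survived"].mean()


# Pclass and Survived of the first four rows of train.csv.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Survived": [0, 1, 1, 1],
})

rates = survival_rate_by(sample, "Pclass")
print(rates)
```

Four rows are far too few to draw conclusions from; on the full training set the call would be survival_rate_by(titanic, 'Pclass'), and passing 'Sex' instead groups by gender.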

Origin blog.csdn.net/weixin_33989780/article/details/90885650