The Titanic passenger data set, like the iris data set, is one of the most commonly used sample data sets in machine learning
Download dataset
Log in to https://www.kaggle.com and open your account page
On https://www.kaggle.com/walterfan/account select "Create API Token" to download kaggle.json
The file contents look like this
{"username":"$user_name","key":"$user_key"}
Install kaggle and download the Titanic data set. First set up the environment
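mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
# or keep the file elsewhere and point KAGGLE_CONFIG_DIR at its directory
chmod 600 /workspace/config/kaggle.json
export KAGGLE_CONFIG_DIR=/workspace/config
# or pass the credentials through environment variables
export KAGGLE_USERNAME=$user_name
export KAGGLE_KEY=$user_key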
pip install kaggle
kaggle competitions download -c titanic
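The download step typically leaves a zip archive in the working directory, while the plotting code further down reads ./titanic/train.csv. A minimal sketch of the unpacking step, assuming the archive is named titanic.zip (the file name and the helper `extract_dataset` are assumptions for illustration, not part of the original instructions):

```python
import zipfile
from pathlib import Path


def extract_dataset(archive="titanic.zip", target_dir="titanic"):
    """Unpack the archive into target_dir and return the sorted CSV file names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return sorted(p.name for p in target.glob("*.csv"))


if __name__ == "__main__":
    # Only unpack when the (assumed) archive name is actually present.
    if Path("titanic.zip").exists():
        print(extract_dataset())
```

If the archive has a different name on your machine, pass it as the first argument.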
Dataset Overview
The data is divided into three CSV files
- Training set (train.csv): used to build the machine learning model
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
...
- Test set (test.csv): used to test your machine learning model; it has the same structure as the training set, but without the Survived column
- Sample submission (gender_submission.csv): a set of predictions that assumes all female passengers, and only female passengers, survive
PassengerId,Survived
892,0
893,1
894,0
895,0
896,1
....
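The rule behind gender_submission.csv is simple enough to reproduce. The sketch below is one way to build such a baseline with pandas; the helper name `gender_baseline` and the toy rows are made up for illustration, and real input would come from test.csv:

```python
import pandas as pd


def gender_baseline(test_df):
    """Predict 1 (survived) for female passengers and 0 for everyone else."""
    return pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": (test_df["Sex"] == "female").astype(int),
    })


# Toy rows for illustration only; real input: pd.read_csv('./titanic/test.csv')
sample = pd.DataFrame({"PassengerId": [1, 2, 3],
                       "Sex": ["male", "female", "male"]})
submission = gender_baseline(sample)
print(submission.to_csv(index=False))
```

Writing the result with `submission.to_csv('my_submission.csv', index=False)` would produce a file in the same two-column layout as the sample above.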
Data Dictionary
variable | definition | key |
---|---|---|
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | Sex | |
age | Age in years | |
sibsp | # of siblings / spouses aboard the Titanic | |
parch | # of parents / children aboard the Titanic | |
ticket | Ticket number | |
fare | Passenger fare | |
cabin | Cabin number | |
embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
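To connect the dictionary to the actual file, it can help to load a few rows and look at the column types and missing values. The sketch below uses the header and the four train.csv rows shown earlier; the helper `summarize` is a hypothetical name. Note that the CSV header capitalizes the names (Survived, Pclass, SibSp, Parch), and those are the names pandas code has to use.

```python
import io

import pandas as pd

# Header and first rows of train.csv, copied from the excerpt above
SAMPLE_CSV = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
'''


def summarize(df):
    """Return one row per column with its dtype and its number of missing values."""
    return pd.DataFrame({"dtype": df.dtypes.astype(str),
                         "missing": df.isna().sum()})


sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = summarize(sample)
print(summary)
```

On the full file, the same call would be `summarize(pd.read_csv('./titanic/train.csv'))`.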
Draw a histogram to show the age distribution of the Titanic's passengers
import pandas as pd
import matplotlib.pyplot as plt
plt.title('Passenger Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
# Load the Titanic training set
titanic = pd.read_csv('./titanic/train.csv')
# Drop the samples that have no age data
titanic.dropna(subset=['Age'], inplace=True)
plt.hist(titanic.Age, # data to plot
    bins = 10, # split the age range into 10 equal-width bins
    color = 'steelblue', # fill color
    edgecolor = 'k', # color of the bar edges
    label = 'age', # legend label
    alpha = 0.7) # transparency
plt.legend()
plt.show()
The horizontal axis shows age, which the bins parameter splits into 10 equal-width intervals, each spanning one tenth of the observed age range (about 8 years here),
and the vertical axis shows the number of samples in each interval. Most passengers were between 20 and 30 years old, and there were also quite a few children
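To see exactly where the bin boundaries fall, the same binning can be computed without plotting. The sketch below relies on `np.histogram`, which `plt.hist` uses for its binning; the helper name `age_bins` and the toy ages are made up for illustration.

```python
import numpy as np


def age_bins(ages, bins=10):
    """Return (counts, edges) for equal-width bins over the observed age range."""
    counts, edges = np.histogram(ages, bins=bins)
    return counts, edges


# Toy ages for illustration only; real input would be titanic.Age after dropna
toy_ages = [0.5, 4, 19, 22, 24, 28, 35, 40, 61, 80]
counts, edges = age_bins(toy_ages, bins=10)
print(np.round(edges, 2))  # 11 edges delimit the 10 bins
print(counts)              # number of samples per bin
```

Because the edges run from the smallest to the largest observed age, the bin width is (max - min) / 10 rather than a round number of years.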
Setting the density parameter (called normed in older matplotlib releases) to True draws a frequency histogram
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
plt.title('Passenger Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
# Load the Titanic training set
titanic = pd.read_csv('./titanic/train.csv')
# Drop the samples that have no age data
titanic.dropna(subset=['Age'], inplace=True)
plt.hist(titanic.Age, # data to plot
    bins = np.arange(0, titanic.Age.max() + 10, 10), # bin edges every 10 years, from 0 up to the maximum age
    density = True, # normalize the histogram (formerly normed)
    color = 'steelblue', # fill color
    edgecolor = 'k', # color of the bar edges
    label = 'age frequency', # legend label
    alpha = 0.7) # transparency
# With 10-year bins, a density of 0.1 corresponds to 100% of the passengers
plt.gca().yaxis.set_major_formatter(PercentFormatter(0.1))
plt.legend()
plt.show()
Children under 10 make up nearly 10% of all passengers, while passengers between 20 and 40 years old account for more than half
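The percentages read off the chart can also be computed directly. A minimal sketch, assuming 10-year bands that start at age 0; the helper name `share_by_decade` and the toy ages are made up for illustration and are not results from the real data set.

```python
import numpy as np
import pandas as pd


def share_by_decade(ages, max_age=80):
    """Return the fraction of samples in each 10-year band from 0 to max_age."""
    edges = np.arange(0, max_age + 10, 10)
    counts, _ = np.histogram(ages, bins=edges)
    labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]
    return pd.Series(counts / counts.sum(), index=labels)


# Toy ages for illustration only; real input would be titanic.Age after dropna
toy_ages = [2, 8, 21, 25, 29, 33, 38, 45, 52, 67]
shares = share_by_decade(toy_ages)
print(shares)
```

Summing `shares["20-30"]` and `shares["30-40"]` gives the combined share of passengers aged 20 to 40.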
Setting the cumulative parameter to True shows the cumulative frequency distribution
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
plt.title('Passenger Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
# Load the Titanic training set
titanic = pd.read_csv('./titanic/train.csv')
# Drop the samples that have no age data
titanic.dropna(subset=['Age'], inplace=True)
plt.hist(titanic.Age, # data to plot
    bins = np.arange(0, titanic.Age.max() + 10, 10), # bin edges every 10 years, from 0 up to the maximum age
    density = True, # normalize the histogram (formerly normed)
    color = 'steelblue', # fill color
    edgecolor = 'k', # color of the bar edges
    cumulative = True, # accumulate the bars from left to right
    label = 'age cumulative frequency', # legend label
    alpha = 0.7) # transparency
# The cumulative histogram ends at 1.0, which is shown as 100%
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
plt.legend()
plt.show()
It can be seen that roughly 80% of the passengers were 40 years old or younger
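A single point on the cumulative curve can be checked with one line of arithmetic. The sketch below is a hypothetical helper, `share_at_or_below`, applied to toy ages that are made up for illustration.

```python
import numpy as np


def share_at_or_below(ages, threshold):
    """Return the fraction of ages that are less than or equal to threshold."""
    ages = np.asarray(ages, dtype=float)
    return float((ages <= threshold).mean())


# Toy ages for illustration only; real input would be titanic.Age after dropna
toy_ages = [2, 8, 21, 25, 29, 33, 38, 45, 52, 67]
print(share_at_or_below(toy_ages, 40))
```

Calling it with `titanic.Age` and a threshold of 40 gives the value that the cumulative histogram shows at the 40-year mark.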
Look at the relationship between survival and age for male and female passengers
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Titanic training set
titanic = pd.read_csv('./titanic/train.csv')
# Drop the samples that have no age data
titanic.dropna(subset=['Age'], inplace=True)
survived = 'survived'
not_survived = 'not survived'
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
women = titanic[titanic['Sex']=='female']
men = titanic[titanic['Sex']=='male']
# histplot is used here because distplot is deprecated in recent seaborn releases
ax = sns.histplot(women[women['Survived']==1].Age, bins=18, label=survived, ax=axes[0], kde=False)
ax = sns.histplot(women[women['Survived']==0].Age, bins=40, label=not_survived, ax=axes[0], kde=False)
ax.legend()
ax.set_title('Female')
ax = sns.histplot(men[men['Survived']==1].Age, bins=18, label=survived, ax=axes[1], kde=False)
ax = sns.histplot(men[men['Survived']==0].Age, bins=40, label=not_survived, ax=axes[1], kde=False)
ax.legend()
_ = ax.set_title('Male')
plt.show()
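The two panels compare counts, which can be hard to turn into rates by eye. One way to get numbers is to group by sex and age band and take the mean of Survived. This is a sketch under assumptions: the helper `survival_by_sex_and_age`, the 10-year band width, and the toy rows are all made up for illustration.

```python
import pandas as pd


def survival_by_sex_and_age(df, width=10):
    """Return the mean of Survived per (Sex, AgeBand); rows without Age are dropped."""
    known = df.dropna(subset=["Age"]).copy()
    # AgeBand is the lower edge of the band, e.g. 20 for ages 20 to 29
    known["AgeBand"] = (known["Age"] // width * width).astype(int)
    return known.groupby(["Sex", "AgeBand"])["Survived"].mean()


# Toy rows for illustration only; real input would be the train.csv DataFrame
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Age": [24, 28, 22, 27, None],
    "Survived": [1, 0, 0, 0, 1],
})
rates = survival_by_sex_and_age(toy)
print(rates)
```

Bands with few passengers give noisy rates, so it is worth checking the group sizes as well before drawing conclusions.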
Look at the relationship between ticket class and survival rate
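import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Titanic training set
titanic = pd.read_csv('./titanic/train.csv')
# Each bar shows the mean of Survived, i.e. the survival rate, for one ticket class
sns.barplot(x='Pclass', y='Survived', data=titanic)
plt.show()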
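By default, `sns.barplot` draws the mean of the y variable for each x category, so the bar heights are the survival rates per ticket class. The same numbers can be obtained with a groupby. The sketch below uses only the Pclass and Survived values of the four train.csv rows shown earlier, and the helper name `survival_rate_by_class` is hypothetical.

```python
import pandas as pd


def survival_rate_by_class(df):
    """Return the mean of Survived for each ticket class."""
    return df.groupby("Pclass")["Survived"].mean()


# Pclass and Survived of the four sample rows from train.csv shown earlier
sample = pd.DataFrame({"Pclass": [3, 1, 3, 1], "Survived": [0, 1, 1, 1]})
class_rates = survival_rate_by_class(sample)
print(class_rates)
```

Four rows say nothing about the real rates; running the function on the full training set gives the values behind the bars.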