Project URL
https://www.kaggle.com/c/titanic
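Project overview:
Besides the passenger ID, the data contains the 10 fields in the table below. survival is the prediction target, and the remaining fields are the features. (The CSV also has a Name column, which the table omits.)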
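| Variable | Definition | Key |
|---|---|---|
| survival | Survived or not | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age | |
| sibsp | Number of siblings or spouses aboard | |
| parch | Number of parents or children aboard | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |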
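The Key column can be applied directly as lookup dictionaries when readable labels are wanted for plots or group summaries. The following is a minimal sketch, not part of the original analysis: `decode_keys` and `toy_df` are hypothetical names, the toy values are made up, and the column names (`Survived`, `Pclass`, `Embarked`) are assumed to follow the CSV header rather than the lowercase names in the table.

```python
import pandas as pd

# Label maps copied from the Key column of the table above
SURVIVED_LABELS = {0: "No", 1: "Yes"}
PCLASS_LABELS = {1: "1st", 2: "2nd", 3: "3rd"}
EMBARKED_LABELS = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}


def decode_keys(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with coded columns replaced by readable labels."""
    out = df.copy()  # leave the caller's frame untouched
    out["Survived"] = out["Survived"].map(SURVIVED_LABELS)
    out["Pclass"] = out["Pclass"].map(PCLASS_LABELS)
    # Codes missing from the map (including nulls) become NaN
    out["Embarked"] = out["Embarked"].map(EMBARKED_LABELS)
    return out


# Toy rows standing in for the real data (values are illustrative only)
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Embarked": ["S", "C", None],
})
decoded = decode_keys(toy_df)
print(decoded)
```

Keeping the original coded columns for modeling and decoding only a copy avoids turning numeric features into strings by accident.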
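Load the data
import pandas as pd
# Forward slashes avoid "\t" being read as a tab escape inside the string
train_df = pd.read_csv("../train.csv")
test_df = pd.read_csv("../test.csv")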
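Check overall missingness
The output is shown below. In the training set, the fields with null values are Age, Cabin, and Embarked (Fare has none). Cabin is the worst affected, with 77.1% missing (687 of 891 rows).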
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
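The non-null counts printed by `info()` can be turned into explicit missing counts and rates rather than worked out by hand. The sketch below is one way to do that, not the original author's code: `missing_summary` and `toy_df` are hypothetical names, and the toy values are made up so the block runs without the CSV files.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(1),
    })
    # Keep only columns with at least one null, worst first
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)


# Toy frame standing in for train_df (values are illustrative only)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
summary = missing_summary(toy_df)
print(summary)
```

Applied to `train_df`, the same function would report Cabin as 891 - 204 = 687 missing rows, which is 687 / 891 ≈ 77.1%.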
Distribution of continuous variables
train_df.describe()
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
Discrete variables