Kaggle Project Walkthrough 1: Titanic: Machine Learning from Disaster

Project URL

    https://www.kaggle.com/c/titanic

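Project overview:

    Besides the passenger ID, the data contains the 10 fields listed in the table below. Here survival is the prediction target and the remaining fields are the features. The training file also contains a Name column, which the table does not list.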

    Variable    Definition                    Key
    survival    Survived or not               0 = No, 1 = Yes
    pclass      Ticket class                  1 = 1st, 2 = 2nd, 3 = 3rd
    sex         Sex
    Age         Age
    sibsp       Spouses or siblings aboard
    parch       Children or parents aboard
    ticket      Ticket number
    fare        Passenger fare
    cabin       Cabin number
    embarked    Port of embarkation           C = Cherbourg, Q = Queenstown, S = Southampton

Loading the data

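import pandas as pd
# Forward slashes avoid "\t" being read as a tab escape inside the path string
train_df = pd.read_csv("../train.csv")
test_df = pd.read_csv("../test.csv")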

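     Checking overall missing values

The result is shown below. In the training set, the fields with null values are Age, Cabin, and Embarked. Cabin is the worst affected, with 687 of 891 values missing (77.1%).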

train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
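
The missing rates can also be computed directly rather than read off the info() output. The following is a minimal sketch, not part of the original post: the helper name missing_summary and the toy DataFrame (with made-up values standing in for train_df) are assumptions for illustration.

```python
import pandas as pd


def missing_summary(df):
    """Return missing counts and rates (in percent) for columns that have nulls."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "rate_pct": (counts / len(df) * 100).round(1),
    })
    # Keep only columns with at least one null, worst first
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)


# Toy frame with made-up values; in the walkthrough this would be train_df
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
print(missing_summary(toy))
```

Applied to the counts in the info() output above, the same arithmetic gives 687/891 = 77.1% for Cabin, 177/891 = 19.9% for Age, and 2/891 = 0.2% for Embarked.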

Distribution of continuous variables

train_df.describe()
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   
            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  
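
In the describe() output above, Fare has a mean of 32.20 but a median of only 14.45, with a maximum of 512.33, which suggests a right-skewed distribution. One way to surface such gaps for every numeric column is sketched below. The helper name mean_median_gap and the toy values are assumptions for illustration, not part of the original post.

```python
import pandas as pd


def mean_median_gap(df):
    """Compare mean and median for each numeric column of df."""
    numeric = df.select_dtypes(include="number")
    out = pd.DataFrame({"mean": numeric.mean(), "median": numeric.median()})
    # A large positive gap hints at a long right tail
    out["mean_minus_median"] = out["mean"] - out["median"]
    return out


# Toy frame with made-up values; in the walkthrough this would be train_df
toy = pd.DataFrame({
    "Age": [20.0, 28.0, 38.0, 30.0],
    "Fare": [7.0, 8.0, 14.0, 500.0],
    "Sex": ["male", "female", "male", "female"],
})
print(mean_median_gap(toy))
```

A gap near zero, as for Age in the toy data, is consistent with a roughly symmetric column. The large gap for Fare comes from the single extreme value.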

Discrete variables
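
The source gives no output for this step. As a hedged sketch of how discrete columns might be inspected, the snippet below counts the values in each listed column, including missing ones. The helper name discrete_summary, the choice of columns, and the toy values are all assumptions for illustration.

```python
import pandas as pd


def discrete_summary(df, columns):
    """Return value counts (missing values included) for each listed column."""
    return {col: df[col].value_counts(dropna=False) for col in columns}


# Toy frame with made-up values; in the walkthrough this would be train_df
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", None, "S"],
    "Pclass": [3, 1, 3, 1, 2],
})
counts = discrete_summary(toy, ["Sex", "Embarked", "Pclass"])
for name, vc in counts.items():
    print(name)
    print(vc)
```

Passing dropna=False keeps missing values in the tally, so a column such as Embarked shows its nulls alongside its categories.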

Reposted from www.cnblogs.com/bethansy/p/9037513.html