Python Data Analysis and Machine Learning in Practice - 08. Common pandas Preprocessing Methods

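1. Checking for and filtering missing values: pd.isnull(age) returns True where a value is missing (displayed as NaN) and False where it is present.

2. Using boolean values as an index:

# Indexing with the mask keeps the rows where it is True (the missing values);
# comparing the mask with == False keeps the non-missing values instead
good_ages1 = titanic_survival["Age"][age_is_null]
good_ages2 = titanic_survival["Age"][age_is_null == False]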

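3. If a column contains missing values, summing it with Python's built-in sum gives nan: print(sum(titanic_survival["Age"]))  # the result is nan because of the missing values

4. Computing on the non-missing values only. This is rarely done in practice: missing values are usually filled in (for example with the mean) rather than discarded.

titanic_survival["Age"].mean(): computes the mean, skipping missing values

5. Pivot tables: index is the column whose values become the rows (the grouping key), values is the column to aggregate, and the aggfunc parameter is the aggregation function, which defaults to the mean (np.mean).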

titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)

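6. Getting a specific value by index label and column name: titanic_survival.loc[766,"Pclass"]

7. Sorting: titanic_survival.sort_values("Age",ascending=False)

8. Resetting the index: new_titanic_survival.reset_index(drop=True)  # drop=True discards the old index and numbers the rows again from 0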

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
titanic_survival = pd.read_csv("/Users/liyili2/Downloads/tang/pandas_data/titanic_train.csv")
head_data=titanic_survival.head()
age = titanic_survival["Age"]
age_is_null = pd.isnull(age)
print(age_is_null)

Result:
0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool
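To see the mask without needing the Titanic CSV, here is a minimal sketch on a made-up Series. The values and the names ages, mask and n_missing are assumptions for illustration only. Because True counts as 1, summing the mask gives the number of missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the "Age" column
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0], name="Age")

mask = pd.isnull(ages)        # equivalent to ages.isna()
n_missing = int(mask.sum())   # True counts as 1, False as 0

print(mask.tolist())
print(n_missing)
```

The same count on the real data would be age_is_null.sum().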

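2. Passing a boolean Series as the index (a very handy trick):

# A boolean Series can be passed directly as the index
# Indexing with the mask keeps the rows where it is True (the missing values);
# comparing the mask with == False keeps the non-missing values instead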
good_ages1 = titanic_survival["Age"][age_is_null]
good_ages2= titanic_survival["Age"][age_is_null== False]
print(good_ages1)
print(good_ages2)

5     NaN
17    NaN
19    NaN
26    NaN
28    NaN
       ..
859   NaN
863   NaN
868   NaN
878   NaN
888   NaN
Name: Age, Length: 177, dtype: float64
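A small sketch of the same boolean indexing on made-up data (the Series and variable names here are illustrative assumptions). It also shows two common alternatives to writing == False: inverting the mask with ~, or asking for notna() directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the "Age" column
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0], name="Age")
is_null = ages.isna()

missing_ages = ages[is_null]        # rows where the mask is True
present_ages = ages[~is_null]       # ~ inverts the mask, like is_null == False
also_present = ages[ages.notna()]   # same selection, written directly

print(present_ages)
```

Note that the selected rows keep their original index labels (0, 2, 4 here), just as the output above keeps labels such as 5, 17 and 19.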


29.69911764705882

# Drop missing data: axis=0 drops rows, axis=1 drops columns; dropping a whole column is uncommon
drop_na_columns = titanic_survival.dropna(axis=1)
new_titanic_survival = titanic_survival.dropna(axis=0,subset=["Age", "Sex"])
result=new_titanic_survival[["Age", "Sex"]]
result=new_titanic_survival["Age"]
print(result)
print(len(result))
print(len(titanic_survival))


0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
885    39.0
886    27.0
887    19.0
889    26.0
890    32.0
Name: Age, Length: 714, dtype: float64
714: number of rows after dropping missing values
891: original number of rows
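The difference between the two dropna calls is easier to see on a tiny table. This is a sketch under assumed data: the three-row DataFrame and the names rows_kept and cols_kept are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row table; "Sex" is complete, the other columns have gaps
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Sex": ["male", "female", "male"],
    "Cabin": [np.nan, "C85", np.nan],
})

# axis=0 with subset: drop only the rows whose "Age" or "Sex" is missing
rows_kept = df.dropna(axis=0, subset=["Age", "Sex"])

# axis=1: drop every column that contains at least one missing value
cols_kept = df.dropna(axis=1)

print(len(df), len(rows_kept))
print(list(cols_kept.columns))
```

Neither call modifies df; each returns a new DataFrame.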

3. If a column contains missing values, summing it with Python's built-in sum gives nan

print(sum(titanic_survival["Age"]))  # the result is nan because of the missing values

nan
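The nan comes from Python's built-in sum, which adds the values one by one so that a single NaN poisons the total. The Series method sum() skips missing values by default. A minimal sketch on made-up numbers (the variable names are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
ages = pd.Series([22.0, np.nan, 26.0])

builtin_total = sum(ages)               # plain addition, NaN propagates
pandas_total = ages.sum()               # skipna=True by default
strict_total = ages.sum(skipna=False)   # opt in to NaN propagation

print(builtin_total, pandas_total, strict_total)
```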

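4. Computing on the non-missing values only. This is rarely done in practice: missing values are usually filled in (for example with the mean) rather than discarded.

# The pandas method mean() averages the non-missing values directly
# In real projects this is not the recommended approach; missing values can instead be filled, for example with the mean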
correct_mean_age = titanic_survival["Age"].mean()
print(correct_mean_age)

29.69911764705882
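Filling with the mean, as mentioned above, could look like the following sketch. It uses fillna on a made-up Series; the data and the names mean_age and filled are assumptions, and mean imputation is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the "Age" column
ages = pd.Series([20.0, np.nan, 40.0], name="Age")

mean_age = ages.mean()            # mean of the non-missing values
filled = ages.fillna(mean_age)    # returns a new Series; ages is unchanged

print(filled.tolist())
```

On the Titanic data the equivalent would be titanic_survival["Age"].fillna(titanic_survival["Age"].mean()).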

5. Pivot tables: the aggfunc parameter defaults to the mean (np.mean)

passenger_classes = [1, 2, 3]
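# Build a pivot table: index is what we group by, values is the data being summarized, and aggfunc is the computation relating the two
# For example, to get the average score of each class: index is the class, values is the score, and aggfunc is the averaging function np.mean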
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
print (passenger_survival)

Result:
       Survived
Pclass          
1       0.629630
2       0.472826
3       0.242363

passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")
print(passenger_age)

              Age
Pclass           
1       38.233441
2       29.877630
3       25.140620

port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare","Survived"], aggfunc=np.sum)
print(port_stats)

                Fare  Survived
Embarked                      
C         10072.2962        93
Q          1022.2543        30
S         17439.3988       217
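As a cross-check on what pivot_table computes, here is a sketch on a made-up five-row table comparing it with groupby. The data and variable names are assumptions, and the aggregation is passed as a string name ("mean", "sum"), which pandas accepts in place of the NumPy functions.

```python
import pandas as pd

# Hypothetical toy table with the same column names as the Titanic data
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Survived": [1, 0, 1, 0, 0],
    "Fare": [80.0, 60.0, 20.0, 8.0, 7.0],
})

# Survival rate per class: the mean of a 0/1 column is the share of 1s
rate = df.pivot_table(index="Pclass", values="Survived", aggfunc="mean")

# The same numbers via groupby (returns a Series instead of a DataFrame)
same_rate = df.groupby("Pclass")["Survived"].mean()

# Several value columns can be aggregated at once
totals = df.pivot_table(index="Pclass", values=["Fare", "Survived"], aggfunc="sum")

print(rate)
print(totals)
```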

6. Getting a specific value by index label and column name

# Look up a single value, similar to a WHERE condition in SQL
row_index_83_age = titanic_survival.loc[83,"Age"]
row_index_766_pclass = titanic_survival.loc[766,"Pclass"]
print(row_index_83_age)
print(row_index_766_pclass)

28.0
1
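Note that loc looks rows up by index label, not by position. The two only coincide while the index is the default 0, 1, 2, ... numbering. A small sketch with a deliberately shuffled index (the data and names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical table whose index labels are not in positional order
df = pd.DataFrame(
    {"Age": [28.0, 40.0, 19.0], "Pclass": [2, 1, 3]},
    index=[83, 766, 5],
)

by_label = df.loc[766, "Pclass"]   # row labelled 766, column "Pclass"
by_position = df.iloc[1, 1]        # second row, second column

print(by_label, by_position)
```

Both lookups hit the same cell here, but df.iloc[766, 1] would raise an IndexError because the table has only three rows.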

# Sorting in pandas
new_titanic_survival = titanic_survival.sort_values("Age",ascending=False)
print(new_titanic_survival[0:10])
titanic_reindexed = new_titanic_survival.reset_index(drop=True)  # reset the index: drop=True discards the old index and numbers the rows again from 0
print(titanic_reindexed.iloc[0:10])



   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
630          631         1       1  ...  30.0000   A23         S
851          852         0       3  ...   7.7750   NaN         S
493          494         0       1  ...  49.5042   NaN         C
96            97         0       1  ...  34.6542    A5         C
116          117         0       3  ...   7.7500   NaN         Q
672          673         0       2  ...  10.5000   NaN         S
745          746         0       1  ...  71.0000   B22         S
33            34         0       2  ...  10.5000   NaN         S
54            55         0       1  ...  61.9792   B30         C
280          281         0       3  ...   7.7500   NaN         Q

[10 rows x 12 columns]
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0          631         1       1  ...  30.0000   A23         S
1          852         0       3  ...   7.7750   NaN         S
2          494         0       1  ...  49.5042   NaN         C
3           97         0       1  ...  34.6542    A5         C
4          117         0       3  ...   7.7500   NaN         Q
5          673         0       2  ...  10.5000   NaN         S
6          746         0       1  ...  71.0000   B22         S
7           34         0       2  ...  10.5000   NaN         S
8           55         0       1  ...  61.9792   B30         C
9          281         0       3  ...   7.7500   NaN         Q

[10 rows x 12 columns]
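The two printouts above differ only in the leftmost index column. The following sketch reproduces that on a made-up three-row table (the data and names are assumptions). It also shows that sort_values places missing values last by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one missing age
df = pd.DataFrame({"Name": ["a", "b", "c"], "Age": [30.0, np.nan, 80.0]})

by_age = df.sort_values("Age", ascending=False)   # NaN rows go last by default
reindexed = by_age.reset_index(drop=True)         # discard old labels, renumber from 0

print(by_age.index.tolist())
print(reindexed.index.tolist())
```

Without drop=True, reset_index would keep the old labels as a new column named "index" instead of discarding them.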

qq_39817865
