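1.Detecting and filtering missing values: pd.isnull(age) returns True where a value is missing (shown as NaN) and False where it is present
2.Using a boolean Series as an index:
## Indexing with the mask itself (True) selects the missing values; indexing with mask == False selects everything except the missing values
good_ages1 = titanic_survival["Age"][age_is_null]
good_ages2= titanic_survival["Age"][age_is_null== False]
3.If a column contains missing values, Python's built-in sum over that column is nan: print(sum(titanic_survival["Age"]))# the column has missing values, so the result is nan
4.Computing over the non-missing values only. This is seldom done in practice: missing values are usually filled in (for example with the mean) rather than discarded.
titanic_survival["Age"].mean(): computes the mean, skipping missing values
5.Pivot tables: index is the column that becomes the rows (the grouping key), values is the column to aggregate, and the aggfunc parameter defaults to the mean
titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
6.Getting a single value by row label and column name: titanic_survival.loc[766,"Pclass"]
7.Sorting: titanic_survival.sort_values("Age",ascending=False)
8.Resetting the index: new_titanic_survival.reset_index(drop=True)# drop=True discards the old index and renumbers the rows from 0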
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
titanic_survival = pd.read_csv("/Users/liyili2/Downloads/tang/pandas_data/titanic_train.csv")
head_data=titanic_survival.head()
age = titanic_survival["Age"]
age_is_null = pd.isnull(age)
Result:
0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 True
889 False
890 False
Name: Age, Length: 891, dtype: bool
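The mask above can also be used to count the missing entries and find where they are. A minimal sketch on made-up data (toy_age is a hypothetical stand-in for titanic_survival["Age"], not part of the original notes):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for titanic_survival["Age"]
toy_age = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0], name="Age")

toy_is_null = pd.isnull(toy_age)          # equivalent to toy_age.isnull()
n_missing = int(toy_is_null.sum())        # True counts as 1, False as 0
missing_labels = list(toy_age.index[toy_is_null])  # row labels of the NaN entries

print(toy_is_null.tolist())
print(n_missing, missing_labels)
```

Summing a boolean Series is a quick way to get the number of missing values without printing the whole mask.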
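2.Passing a boolean Series as an index, which is surprisingly handy:
# A boolean Series can be passed as the index
## Indexing with the mask itself (True) selects the missing values; indexing with mask == False selects everything except the missing values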
good_ages1 = titanic_survival["Age"][age_is_null]
good_ages2= titanic_survival["Age"][age_is_null== False]
print(good_ages1)
print(good_ages2)
5 NaN
17 NaN
19 NaN
26 NaN
28 NaN
..
859 NaN
863 NaN
868 NaN
878 NaN
888 NaN
Name: Age, Length: 177, dtype: float64
29.69911764705882
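A small sketch of the same boolean indexing on made-up data (toy_age is hypothetical). It also shows the ~ operator and notnull(), which are common alternatives to writing == False:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for titanic_survival["Age"]
toy_age = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0], name="Age")
mask = toy_age.isnull()

missing_part = toy_age[mask]                # rows where the mask is True
present_part = toy_age[~mask]               # ~ inverts the mask, like mask == False
also_present = toy_age[toy_age.notnull()]   # same selection, written directly

print(present_part.tolist())
```

Note that the selected rows keep their original row labels (0, 2, 4 here); nothing is renumbered.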
# Dropping missing data: axis=0 drops rows, axis=1 drops columns; dropping a whole column is rarely what you want
drop_na_columns = titanic_survival.dropna(axis=1)
new_titanic_survival = titanic_survival.dropna(axis=0,subset=["Age", "Sex"])
result=new_titanic_survival[["Age", "Sex"]]
result=new_titanic_survival["Age"]
print(result)
print(len(result))
print(len(titanic_survival))
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
...
885 39.0
886 27.0
887 19.0
889 26.0
890 32.0
Name: Age, Length: 714, dtype: float64
714 --- number of rows after dropping missing values
891 --- original number of rows
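To see how axis and subset change what dropna removes, here is a minimal sketch on a made-up four-row table (toy and its values are hypothetical, chosen only so that each call keeps something different):

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; None and np.nan both count as missing
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Sex": ["male", "female", None, "female"],
    "Cabin": [None, "C85", None, "C123"],
    "Pclass": [3, 1, 3, 1],
})

rows_any = toy.dropna(axis=0)                            # drop rows with any missing value
rows_subset = toy.dropna(axis=0, subset=["Age", "Sex"])  # only Age and Sex are checked
cols_any = toy.dropna(axis=1)                            # drop columns with any missing value

print(list(rows_any.index), list(rows_subset.index), list(cols_any.columns))
```

With subset, a missing Cabin no longer causes a row to be dropped, which is why more rows survive than with the plain axis=0 call.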
3.If a column contains missing values, Python's built-in sum over that column is nan
print(sum(titanic_survival["Age"]))# the column has missing values, so the result is nan
nan
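The nan comes from Python's built-in sum, which adds the values one by one so a single NaN propagates. The pandas method Series.sum() skips NaN by default. A minimal sketch on made-up data (toy_age is hypothetical):

```python
import math
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy_age = pd.Series([22.0, np.nan, 26.0])

builtin_total = sum(toy_age)               # plain float addition: NaN propagates
pandas_total = toy_age.sum()               # skipna=True by default, NaN is ignored
strict_total = toy_age.sum(skipna=False)   # opt back in to NaN propagation

print(builtin_total, pandas_total, strict_total)
```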
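4.Computing over the non-missing values only. This is seldom done in practice: missing values are usually filled in (for example with the mean) rather than discarded.
# The pandas method mean() skips missing values and averages the rest
# In practice this is not the recommended approach: missing values are usually filled in, for example with the mean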
correct_mean_age = titanic_survival["Age"].mean()
print(correct_mean_age)
29.69911764705882
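Filling with the mean, as mentioned above, could look like the following sketch. It uses made-up data (toy_age is hypothetical), and mean imputation is only one of several possible strategies:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for titanic_survival["Age"]
toy_age = pd.Series([20.0, np.nan, 40.0, np.nan], name="Age")

mean_age = toy_age.mean()            # NaN is skipped, so this is (20 + 40) / 2
filled = toy_age.fillna(mean_age)    # returns a new Series; toy_age is unchanged

print(mean_age)
print(filled.tolist())
```

Because fillna returns a new Series, the result has to be assigned back (for example to a column of the DataFrame) for the change to persist.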
5.Pivot tables: the aggfunc parameter defaults to the mean
passenger_classes = [1, 2, 3]
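# Pivot table: index is the column to group by, values is the column holding the data to aggregate, and aggfunc is the calculation applied to values within each index group
# For example, to get the average score of each class: index is the class, values is the score, and aggfunc is the averaging function np.mean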
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
print (passenger_survival)
Result:
        Survived
Pclass
1       0.629630
2       0.472826
3       0.242363
passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")
print(passenger_age)
              Age
Pclass
1       38.233441
2       29.877630
3       25.140620
port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare","Survived"], aggfunc=np.sum)
print(port_stats)
                Fare  Survived
Embarked
C         10072.2962        93
Q          1022.2543        30
S         17439.3988       217
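A pivot table with a single index column is essentially a group-by followed by an aggregation. The sketch below shows that equivalence on a made-up table (toy and its values are hypothetical). It passes aggfunc as the strings "mean" and "sum" rather than np.mean and np.sum, since recent pandas releases may emit a FutureWarning for the NumPy callables:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with the same column names as the Titanic data
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
    "Fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# Survival rate per class, written two ways
rate_pivot = toy.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
rate_group = toy.groupby("Pclass")["Survived"].mean()

# Several value columns at once, aggregated with a sum
totals = toy.pivot_table(index="Pclass", values=["Fare", "Survived"], aggfunc="sum")

print(rate_pivot)
print(totals)
```

pivot_table returns a DataFrame with one column per entry in values, while the groupby form above returns a Series.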
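6.Getting a single value by row label and column name
# Look up one cell: the first argument to loc is the row label, the second is the column name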
row_index_83_age = titanic_survival.loc[83,"Age"]
row_index_766_pclass = titanic_survival.loc[766,"Pclass"]
print(row_index_83_age)
print(row_index_766_pclass)
28.0
1
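loc looks rows up by label, not by position. The two coincide in the Titanic table only because its index is the default 0..890. A minimal sketch on a made-up table whose labels deliberately differ from positions (toy is hypothetical); the last line shows a boolean condition inside loc, which is the closer analogue of a SQL WHERE clause:

```python
import pandas as pd

# Hypothetical toy table; row labels 10, 20, 30 are not the positions 0, 1, 2
toy = pd.DataFrame(
    {"Age": [30.0, 22.0, 41.0], "Pclass": [1, 3, 2]},
    index=[10, 20, 30],
)

by_label = toy.loc[20, "Age"]      # row labelled 20, column "Age"
by_position = toy.iloc[1, 0]       # second row, first column (position based)
first_class = toy.loc[toy["Pclass"] == 1, "Age"]   # filter rows by a condition

print(by_label, by_position, first_class.tolist())
```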
# Sorting in pandas
new_titanic_survival = titanic_survival.sort_values("Age",ascending=False)
print(new_titanic_survival[0:10])
titanic_reindexed = new_titanic_survival.reset_index(drop=True)# reset the index: drop=True discards the old index and renumbers the rows from 0
print(titanic_reindexed.iloc[0:10])
PassengerId Survived Pclass ... Fare Cabin Embarked
630 631 1 1 ... 30.0000 A23 S
851 852 0 3 ... 7.7750 NaN S
493 494 0 1 ... 49.5042 NaN C
96 97 0 1 ... 34.6542 A5 C
116 117 0 3 ... 7.7500 NaN Q
672 673 0 2 ... 10.5000 NaN S
745 746 0 1 ... 71.0000 B22 S
33 34 0 2 ... 10.5000 NaN S
54 55 0 1 ... 61.9792 B30 C
280 281 0 3 ... 7.7500 NaN Q
[10 rows x 12 columns]
PassengerId Survived Pclass ... Fare Cabin Embarked
0 631 1 1 ... 30.0000 A23 S
1 852 0 3 ... 7.7750 NaN S
2 494 0 1 ... 49.5042 NaN C
3 97 0 1 ... 34.6542 A5 C
4 117 0 3 ... 7.7500 NaN Q
5 673 0 2 ... 10.5000 NaN S
6 746 0 1 ... 71.0000 B22 S
7 34 0 2 ... 10.5000 NaN S
8 55 0 1 ... 61.9792 B30 C
9 281 0 3 ... 7.7500 NaN Q
[10 rows x 12 columns]
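The two printouts above differ only in the leftmost column: after sorting, rows keep their old labels until reset_index renumbers them. A minimal sketch on made-up data (toy is hypothetical) that also shows what drop=False does and where NaN ages end up:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one missing age
toy = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Age": [22.0, np.nan, 80.0, 35.0],
})

by_age = toy.sort_values("Age", ascending=False)   # NaN rows are placed last by default
renumbered = by_age.reset_index(drop=True)         # old labels discarded, rows become 0..3
kept = by_age.reset_index()                        # old labels kept in a column named "index"

print(list(by_age.index))
print(renumbered["Name"].tolist())
print(list(kept.columns))
```

Without drop=True the old labels are not lost; they are moved into an ordinary column, which can be useful for tracing rows back to the original table.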