Chapter 2: Indexing

import numpy as np
import pandas as pd
df = pd.read_csv('data/table.csv',index_col='ID')
df.head()
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
df.describe()
Height Weight Math
count 35.000000 35.000000 35.000000
mean 174.142857 74.657143 61.351429
std 13.541098 12.895377 19.915164
min 155.000000 53.000000 31.500000
25% 161.000000 63.000000 47.400000
50% 173.000000 74.000000 61.700000
75% 187.500000 82.000000 77.100000
max 195.000000 100.000000 97.000000

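I. Single-Level Indexing

1. The loc method, the iloc method, and the [] operator

These are probably the three most common ways to index: iloc selects by integer position, loc selects by label, and [] is a convenient shorthand. Each has its own characteristics.

(a) The loc method (note: every slice used in loc includes its right endpoint)

① Single-row indexing:

(Why slices in loc include the right endpoint: a Pandas user usually does not know, or care, which label comes right after the last one they want. With a half-open slice they would first have to look up the name of the next row or column, which is inconvenient. Pandas therefore makes loc slices closed on both ends.)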

df.loc[1103]
School          S_1
Class           C_1
Gender            M
Address    street_2
Height          186
Weight           82
Math           87.2
Physics          B+
Name: 1103, dtype: object
df.loc[1101]
School          S_1
Class           C_1
Gender            M
Address    street_1
Height          173
Weight           63
Math             34
Physics          A+
Name: 1101, dtype: object

② Multi-row indexing:

df.loc[[1102,2201]]
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
2201 S_2 C_2 M street_5 193 100 39.1 B
df.loc[1304:].head()
School Class Gender Address Height Weight Math Physics
ID
1304 S_1 C_3 M street_2 195 70 85.2 A
1305 S_1 C_3 F street_5 187 69 61.7 B-
2101 S_2 C_1 M street_7 174 84 83.3 C
2102 S_2 C_1 F street_6 161 61 50.6 B+
2103 S_2 C_1 M street_4 157 61 52.5 B-
df.loc[2402:2401:-1].head()
School Class Gender Address Height Weight Math Physics
ID
2402 S_2 C_4 M street_7 166 82 48.7 B
2401 S_2 C_4 F street_2 192 62 45.3 A

③ Single-column indexing:

df.loc[:,'Height'].head()
df['Height'].head()
df.loc[1101:1102,'Height']
ID
1101    173
1102    192
Name: Height, dtype: int64

④ Multi-column indexing:

df.loc[:,['Height','Math']].head()
df[['Height','Math']].head()
Height Math
ID
1101 173 34.0
1102 192 32.5
1103 186 87.2
1104 167 80.4
1105 159 84.8
df.loc[:,'Height':'Math'].head()
Height Weight Math
ID
1101 173 63 34.0
1102 192 73 32.5
1103 186 82 87.2
1104 167 81 80.4
1105 159 64 84.8

⑤ Combined row and column indexing:

df.loc[1102:2201:3,'Height':'Math':2].head()
Height Math
ID
1102 192 32.5
1105 159 84.8
1203 160 58.8
1301 161 31.5
1304 195 85.2

⑥ Callable indexing:

# A function passed to loc receives the DataFrame it is called on (df here) as its argument
df.loc[lambda x:x['Gender']=='M'].head()
df[df['Gender']=='M'].head()
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1201 S_1 C_2 M street_5 188 68 97.0 A-
1203 S_1 C_2 M street_6 160 53 58.8 A+
1301 S_1 C_3 M street_4 161 68 31.5 B+
# This example shows that loc accepts a function. The function receives the whole table and must return a scalar, a slice, a valid list (every element present in the index), or a valid index
def lam(x):
    return x['Gender']=='M'
df.loc[lam].head()
def f(x):
    return [1101,1103]
df.loc[f]
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1103 S_1 C_1 M street_2 186 82 87.2 B+

⑦ Boolean indexing (covered in detail in Section 2)

df.loc[df['Address'].isin(['street_7','street_4'])].head()
School Class Gender Address Height Weight Math Physics
ID
1105 S_1 C_1 F street_4 159 64 84.8 B+
1202 S_1 C_2 F street_4 176 94 63.5 B-
1301 S_1 C_3 M street_4 161 68 31.5 B+
1303 S_1 C_3 M street_7 188 82 49.7 B
2101 S_2 C_1 M street_7 174 84 83.3 C
df.loc[[i[-1] in ('4','7') for i in df['Address'].values]].head()
School Class Gender Address Height Weight Math Physics
ID
1105 S_1 C_1 F street_4 159 64 84.8 B+
1202 S_1 C_2 F street_4 176 94 63.5 B-
1301 S_1 C_3 M street_4 161 68 31.5 B+
1303 S_1 C_3 M street_7 188 82 49.7 B
2101 S_2 C_1 M street_7 174 84 83.3 C

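Summary: in essence, loc accepts only two kinds of input, a boolean list or a list of labels drawn from the index; scalars, slices, and callables reduce to one of these. Keeping this in mind makes the operations above easy to understand.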
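The summary above can be checked on a small frame. This is a minimal sketch with made-up values; `df_demo` and its contents are hypothetical and stand in for the tutorial's table.

```python
import pandas as pd

# Hypothetical toy frame; the ID labels are deliberately not 0..n-1.
df_demo = pd.DataFrame(
    {"Height": [173, 192, 186, 167], "Math": [34.0, 32.5, 87.2, 80.4]},
    index=pd.Index([1101, 1102, 1103, 1104], name="ID"),
)

by_label_list = df_demo.loc[[1101, 1103]]                 # list of labels
by_label_slice = df_demo.loc[1102:1103]                   # label slice, right endpoint included
by_bool_list = df_demo.loc[[True, False, True, False]]    # boolean list, one flag per row
by_callable = df_demo.loc[lambda x: x["Math"] > 80]       # callable returning a boolean Series

print(by_label_slice.index.tolist())
```

The label slice returns both 1102 and 1103, which is the closed-interval behavior described earlier.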

(b) The iloc method (note: unlike loc, slices exclude the right endpoint)

① Single-row indexing:

df.iloc[0]
School          S_1
Class           C_1
Gender            M
Address    street_1
Height          173
Weight           63
Math             34
Physics          A+
Name: 1101, dtype: object

② Multi-row indexing:

df.iloc[[0,3]]
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1104 S_1 C_1 F street_2 167 81 80.4 B-
df.iloc[3:5]
#df.iloc[5:3:-1]
School Class Gender Address Height Weight Math Physics
ID
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
df.iloc[4:2:-1]
School Class Gender Address Height Weight Math Physics
ID
1105 S_1 C_1 F street_4 159 64 84.8 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-

③ Single-column indexing:

df.iloc[:,3].head()
# df.iloc[0,0]
# df.iloc[:,0:3].head()
ID
1101    street_1
1102    street_2
1103    street_2
1104    street_2
1105    street_4
Name: Address, dtype: object

④ Multi-column indexing:

df.iloc[:,7::-2].head()
Physics Weight Address Class
ID
1101 A+ 63 street_1 C_1
1102 B+ 73 street_2 C_1
1103 B+ 82 street_2 C_1
1104 B- 81 street_2 C_1
1105 B+ 64 street_4 C_1

⑤ Combined row and column indexing:

df.iloc[3::4,7::-2]
Physics Weight Address Class
ID
1104 B- 81 street_2 C_1
1203 A+ 53 street_6 C_2
1302 A- 57 street_1 C_3
2101 C 84 street_7 C_1
2105 A 81 street_4 C_1
2204 B- 74 street_1 C_2
2303 C 99 street_7 C_3
2402 B 82 street_7 C_4

⑥ Callable indexing:

df.iloc[lambda x:[3]].head()
df.iloc[[i for i in range(5)]]
def f(x):
    temp=[]
    index=0
    for i in x['Physics'].values:
        if i =='A+':
            temp.append(index)
        index+=1
    return temp
df.iloc[f]
#type(df['Physics'].values)
#type(df['Physics']=='A+')
df.iloc[list(df['Physics']=='A+')] # a plain boolean Series raises an error; converting it to a list works
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1203 S_1 C_2 M street_6 160 53 58.8 A+
2203 S_2 C_2 M street_4 155 91 73.8 A+

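Summary: iloc accepts only integers, integer lists (or integer slices), and boolean lists. A boolean Series is not accepted; to use one, extract its values as shown below.

#df.iloc[df['School']=='S_1'].head() # raises an error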
df.iloc[(df['School']=='S_1').values].head()
#type((df['School']=='S_1').values)
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
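The restriction on boolean Series can be reproduced in isolation. The sketch below uses a hypothetical three-row frame (`df_demo`); the exact exception class depends on the index type, so both candidates are caught.

```python
import pandas as pd

# Hypothetical toy frame standing in for the tutorial's table.
df_demo = pd.DataFrame(
    {"School": ["S_1", "S_2", "S_1"], "Math": [34.0, 39.1, 87.2]},
    index=pd.Index([1101, 2201, 1103], name="ID"),
)
mask = df_demo["School"] == "S_1"   # boolean Series that carries the ID index

try:
    df_demo.iloc[mask]              # iloc rejects an indexed boolean mask
    raised = False
except (ValueError, NotImplementedError):
    raised = True

picked = df_demo.iloc[mask.to_numpy()]   # plain NumPy array: accepted
same_as_loc = df_demo.loc[mask]          # loc accepts the Series directly

print(raised, picked.index.tolist())
```

`to_numpy()` plays the same role as `.values` in the cell above: it drops the index so that only positions remain.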

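(c) The [] operator

To stay out of trouble, avoid the [] operator when the row index is floating-point: a [] slice on a float-indexed Series is compared against label values rather than positions, which is a special case.

(c.1) [] on a Series

① Single-element indexing: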

s = pd.Series(df['Math'],index=df.index)
s[1101]
# looked up by index label
34.0

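② Multi-row indexing:

s[0:4]

# Easy to confuse: the original note says an integer slice in [] uses absolute positions and is unrelated to the label values. The recorded output below is empty, which is what a label-based reading of 0:4 gives, so the behavior appears to depend on the pandas version; iloc or loc makes the intent explicit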
Series([], Name: Math, dtype: float64)

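③ Callable indexing:

s[lambda x: x.index[16::-6]]
# Note: writing the slice directly inside the lambda (e.g. s[lambda x: 16::-6]) is an error. The callable here returns labels, so the lookup is by label rather than by absolute position, which is easy to get wrong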
ID
2102    50.6
1301    31.5
1105    84.8
Name: Math, dtype: float64

④ Boolean indexing:

s.head()
ID
1101    34.0
1102    32.5
1103    87.2
1104    80.4
1105    84.8
Name: Math, dtype: float64
s[s>80]
ID
1103    87.2
1104    80.4
1105    84.8
1201    97.0
1302    87.7
1304    85.2
2101    83.3
2205    85.4
2304    95.5
Name: Math, dtype: float64

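[Note] To stay out of trouble, do not use the [] operator when the row index is floating-point: a [] slice on a float-indexed Series is compared against the label values, not positions, which is a special case.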

s_int = pd.Series([1,2,3,4],index=[1,3,5,6])
s_float = pd.Series([1,2,3,4],index=[1.,3.,5.,6.])
s_int
1    1
3    2
5    3
6    4
dtype: int64
s_int[2:]
5    3
6    4
dtype: int64
s_float
1.0    1
3.0    2
5.0    3
6.0    4
dtype: int64
# Note that the result differs from s_int[2:], because 2 is treated here as a label value rather than a position
s_float[2:]
3.0    2
5.0    3
6.0    4
dtype: int64
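One way to avoid the ambiguity shown above is to state the intent with loc or iloc. A minimal sketch, reusing the same two Series:

```python
import pandas as pd

s_int = pd.Series([1, 2, 3, 4], index=[1, 3, 5, 6])
s_float = pd.Series([1, 2, 3, 4], index=[1.0, 3.0, 5.0, 6.0])

# loc always slices by label: keep every label >= 2
int_by_label = s_int.loc[2:]
float_by_label = s_float.loc[2:]

# iloc always slices by position: skip the first two rows
int_by_pos = s_int.iloc[2:]
float_by_pos = s_float.iloc[2:]

print(int_by_label.tolist(), int_by_pos.tolist())
```

With loc and iloc the integer-indexed and float-indexed Series behave the same way, so the result no longer depends on the index dtype.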

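(c.2) [] on a DataFrame

① Single-row indexing:

df[1:3]

# It is very easy to write df['label'] here by mistake, which raises an error because [] looks a single label up among the columns
# As with a Series, the slice is by absolute position
# To get one specific row by its label, use get_loc as follows: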
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
row = df.index.get_loc(1102)
df[row:row+1]
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+

② Multi-row indexing:

# Use a slice. To pick specific individual rows, loc is recommended; otherwise an error is likely
df[2:4]
School Class Gender Address Height Weight Math Physics
ID
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-

③ Single-column indexing:

df['School'].head()
ID
1101    S_1
1102    S_1
1103    S_1
1104    S_1
1105    S_1
Name: School, dtype: object

④ Multi-column indexing:

df[['School','Math']].head()
School Math
ID
1101 S_1 34.0
1102 S_1 32.5
1103 S_1 87.2
1104 S_1 80.4
1105 S_1 84.8

⑤ Callable indexing:

df[lambda x:['Math','Physics']].head()
Math Physics
ID
1101 34.0 A+
1102 32.5 B+
1103 87.2 B+
1104 80.4 B-
1105 84.8 B+

⑥ Boolean indexing:

df[df['Gender']=='F'].head()
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
1202 S_1 C_2 F street_4 176 94 63.5 B-
1204 S_1 C_2 F street_5 162 63 33.8 B

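Summary: in general, use the [] operator for column selection or boolean selection, and avoid it for row selection.

2. Boolean indexing

(a) Boolean operators '&', '|', '~': element-wise and, or, not respectively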

df[(df['Gender']=='F')&(df['Address']=='street_2')].head()
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
2401 S_2 C_4 F street_2 192 62 45.3 A
2404 S_2 C_4 F street_2 160 84 67.7 B
df[(df['Math']>90)|(df['Address']=='street_7')].head()
School Class Gender Address Height Weight Math Physics
ID
1201 S_1 C_2 M street_5 188 68 97.0 A-
1303 S_1 C_3 M street_7 188 82 49.7 B
2101 S_2 C_1 M street_7 174 84 83.3 C
2202 S_2 C_2 F street_7 194 77 68.5 B+
2205 S_2 C_2 F street_7 183 76 85.4 B
df[~((df['Math']>75)|(df['Address']=='street_1'))].head()
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
1202 S_1 C_2 F street_4 176 94 63.5 B-
1203 S_1 C_2 M street_6 160 53 58.8 A+
1204 S_1 C_2 F street_5 162 63 33.8 B
1205 S_1 C_2 F street_6 167 63 68.4 B-
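The three operators can be tried on a small frame. This is a sketch with hypothetical rows (`df_demo`); the parentheses matter because `&` and `|` bind more tightly than comparisons such as `==` and `>`.

```python
import pandas as pd

# Hypothetical toy frame with four students.
df_demo = pd.DataFrame(
    {
        "Gender": ["M", "F", "F", "M"],
        "Address": ["street_1", "street_2", "street_7", "street_2"],
        "Math": [34.0, 92.5, 68.5, 45.3],
    },
    index=pd.Index([1101, 1102, 2202, 2401], name="ID"),
)

is_f = df_demo["Gender"] == "F"
on_street_2 = df_demo["Address"] == "street_2"

both = df_demo[is_f & on_street_2]                                             # and
either = df_demo[(df_demo["Math"] > 90) | (df_demo["Address"] == "street_7")]  # or
neither = df_demo[~(is_f | on_street_2)]                                       # not

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```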

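Boolean lists can be used in the corresponding positions of both loc and []:

df.loc[df['Math']>60,(df[:8]['Address']=='street_6').values].head()
df.loc[df['Math']>60,df.columns=='Physics'].head()
# Without .values, index alignment kicks in and the lookup fails. Index alignment is an important Pandas feature and is often very useful,
# but it can hide bugs if you do not watch for it
# Exercise: why does df.loc[df['Math']>60,(df[:8]['Address']=='street_6').values].head() give the same result as the line above? Can .values be removed?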
ID
1101    street_1
1102    street_2
1103    street_2
1104    street_2
1105    street_4
1201    street_5
1202    street_4
1203    street_6
Name: Address, dtype: object
df[:8]['Address']=='street_6'
ID
1101    False
1102    False
1103    False
1104    False
1105    False
1201    False
1202    False
1203     True
Name: Address, dtype: bool
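Index alignment is easiest to see when a boolean Series carries the same labels in a different order. The sketch below uses a hypothetical three-row frame; `mask` is built by hand for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Math": [34.0, 32.5, 87.2]},
    index=pd.Index([1101, 1102, 1103], name="ID"),
)

# The single True value is attached to label 1103, although it sits in first position.
mask = pd.Series([True, False, False], index=[1103, 1102, 1101])

aligned = df_demo.loc[mask]                 # matched by label: selects 1103
positional = df_demo.loc[mask.to_numpy()]   # index dropped, matched by position: selects 1101

print(aligned.index.tolist(), positional.index.tolist())
```

This is why removing the index (with `.values` or `to_numpy()`) changes which rows or columns a mask selects.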

(b) The isin method

df[df['Address'].isin(['street_1','street_4'])&df['Physics'].isin(['A','A+'])]
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
2105 S_2 C_1 M street_4 170 81 34.2 A
2203 S_2 C_2 M street_4 155 91 73.8 A+
# The same selection can be written with a dict:
df[df[['Address','Physics']].isin({'Address':['street_1','street_4'],'Physics':['A','A+']}).all(axis=1)]
# all works like &: axis=1 checks across the columns whether every value in a row is True
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
2105 S_2 C_1 M street_4 170 81 34.2 A
2203 S_2 C_2 M street_4 155 91 73.8 A+
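The equivalence of the two spellings can be verified directly. A minimal sketch on hypothetical data; `wanted` is just a name chosen here for the dict of allowed values.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "Address": ["street_1", "street_2", "street_4", "street_4"],
        "Physics": ["A+", "A", "B", "A"],
    },
    index=pd.Index([1101, 1102, 2103, 2105], name="ID"),
)

# Spelling 1: one isin per column, combined with &
with_and = df_demo[
    df_demo["Address"].isin(["street_1", "street_4"])
    & df_demo["Physics"].isin(["A", "A+"])
]

# Spelling 2: a dict mapping column name to allowed values, then a row-wise all
wanted = {"Address": ["street_1", "street_4"], "Physics": ["A", "A+"]}
with_dict = df_demo[df_demo[["Address", "Physics"]].isin(wanted).all(axis=1)]

print(with_dict.index.tolist())
```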

3. Fast scalar access

When only a single element is needed, the at and iat methods provide a faster implementation:

display(df.at[1101,'School'])
display(df.loc[1101,'School'])
display(df.iat[0,0])
display(df.iloc[0,0])
# Compare the timings:
%timeit df.at[1101,'School']
%timeit df.loc[1101,'School']
%timeit df.iat[0,0]
%timeit df.iloc[0,0]
'S_1'
'S_1'
'S_1'
'S_1'
3.9 µs ± 796 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
6.08 µs ± 11 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.49 µs ± 881 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
6.73 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
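Outside a notebook, the four accessors can be compared as follows. This sketch uses a hypothetical two-row frame and checks only that the results agree; timings vary by machine and pandas version, so none are asserted.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"School": ["S_1", "S_2"], "Math": [34.0, 39.1]},
    index=pd.Index([1101, 1102], name="ID"),
)

via_at = df_demo.at[1101, "School"]     # label-based scalar access
via_loc = df_demo.loc[1101, "School"]   # general label-based access
via_iat = df_demo.iat[0, 0]             # position-based scalar access
via_iloc = df_demo.iloc[0, 0]           # general position-based access

# at and iat also support assigning a single cell
df_demo.at[1102, "Math"] = 40.0

print(via_at, via_loc, via_iat, via_iloc)
```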

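4. Interval indexing

Interval indexes are not limited to single-level indexing; they are introduced here simply because they are a special kind of index.

(a) Using interval_range

pd.interval_range(start=0,end=5,closed='both')
# closed can be 'left', 'right', 'both', or 'neither'; the default is 'right' (open on the left, closed on the right)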
IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]],
              closed='both',
              dtype='interval[int64]')
pd.interval_range(start=0,periods=9,freq=5,closed='both')
# periods sets the number of intervals and freq sets the step
IntervalIndex([[0, 5], [5, 10], [10, 15], [15, 20], [20, 25], [25, 30], [30, 35], [35, 40], [40, 45]],
              closed='both',
              dtype='interval[int64]')
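The effect of `closed` is easiest to see by testing membership at the endpoints. A short sketch; the variable names are chosen here for illustration.

```python
import pandas as pd

right_closed = pd.interval_range(start=0, end=5)                 # default: closed='right'
both_closed = pd.interval_range(start=0, end=5, closed="both")
by_periods = pd.interval_range(start=0, periods=3, freq=5)       # three intervals of width 5

first_right = right_closed[0]   # (0, 1]
first_both = both_closed[0]     # [0, 1]

print(right_closed.closed, 0 in first_right, 0 in first_both)
```

With the default setting the left endpoint 0 is excluded from the first interval, while `closed='both'` includes it.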

(b) Use cut to turn a numeric column into a categorical variable whose values are intervals, for example to bin the Math scores:

# Note: without the type conversion the result has category dtype, not interval dtype
math_interval=pd.cut(df['Math'],bins=[0,40,60,80,100]).astype('interval')
math_interval.head()
ID
1101      (0, 40]
1102      (0, 40]
1103    (80, 100]
1104    (80, 100]
1105    (80, 100]
Name: Math, dtype: interval
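The dtype remark can be checked without the conversion step. The sketch below bins four hypothetical scores and inspects what cut returns by default.

```python
import pandas as pd

# Hypothetical scores standing in for the Math column.
math = pd.Series(
    [34.0, 32.5, 87.2, 80.4],
    index=pd.Index([1101, 1102, 1103, 1104], name="ID"),
    name="Math",
)

binned = pd.cut(math, bins=[0, 40, 60, 80, 100])

is_categorical = isinstance(binned.dtype, pd.CategoricalDtype)  # category dtype by default
categories = binned.cat.categories                              # the bins, stored as an IntervalIndex
first_bin = binned.iloc[0]                                      # a pd.Interval, right-closed by default

print(binned.dtype, first_bin)
```

Note that 80.4 lands in (80, 100] because each bin is open on the left and closed on the right.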

(c) Selecting with an interval index

df_i = df.join(math_interval,rsuffix='_interval')[['Math','Math_interval']]\
            .reset_index().set_index('Math_interval')
df_i.head()
# Both frames contain a column named Math, so a suffix (lsuffix/rsuffix) is needed to tell the joined columns apart
ID Math
Math_interval
(0, 40] 1101 34.0
(0, 40] 1102 32.5
(80, 100] 1103 87.2
(80, 100] 1104 80.4
(80, 100] 1105 84.8
df_i.loc[65].head()
# A row is selected when its interval contains the given value
df_i.loc[80].head()
ID Math
Math_interval
(60, 80] 1202 63.5
(60, 80] 1205 68.4
(60, 80] 1305 61.7
(60, 80] 2104 72.2
(60, 80] 2202 68.5
df_i.loc[[65,90]].head()
#df_i.loc[[65,90]]
ID Math
Math_interval
(60, 80] 1202 63.5
(60, 80] 1205 68.4
(60, 80] 1305 61.7
(60, 80] 2104 72.2
(60, 80] 2202 68.5

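To select by an interval rather than by a point, first convert the categorical index to an interval index, then use the overlaps method:

#df_i.loc[pd.Interval(70,75)].head() raises an error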
display(df_i[df_i.index.astype('interval').overlaps(pd.Interval(70, 85))].head())
display(df_i.index.astype('interval').overlaps(pd.Interval(70,85)))
ID Math
Math_interval
(80, 100] 1103 87.2
(80, 100] 1104 80.4
(80, 100] 1105 84.8
(80, 100] 1201 97.0
(60, 80] 1202 63.5
array([False, False,  True,  True,  True,  True,  True, False, False,
        True, False,  True, False,  True,  True,  True, False, False,
        True, False, False,  True,  True, False,  True,  True, False,
        True,  True, False, False, False, False,  True, False])
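Both kinds of lookup, by a point and by an overlapping interval, can be sketched on a four-bin index. The labels in `scores` are hypothetical.

```python
import pandas as pd

# Bins (0, 40], (40, 60], (60, 80], (80, 100]
bins = pd.IntervalIndex.from_breaks([0, 40, 60, 80, 100])
scores = pd.Series(["low", "mid", "good", "top"], index=bins)

hit = scores.loc[65]    # 65 lies inside (60, 80]
edge = scores.loc[80]   # the right endpoint 80 still belongs to (60, 80]

# overlaps returns one boolean per bin: does it share any point with (70, 85]?
overlap_mask = bins.overlaps(pd.Interval(70, 85))
selected = scores[overlap_mask]

print(hit, edge, overlap_mask.tolist())
```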

II. Multi-Level Indexing

1. Creating a multi-level index

(a) With from_tuples or from_arrays

① Building the tuples directly

tuples=[('A','a'),('A','b'),('B','a'),('B','b')]
mul_index=pd.MultiIndex.from_tuples(tuples,names=('Upper','lower'))
mul_index
MultiIndex([('A', 'a'),
            ('A', 'b'),
            ('B', 'a'),
            ('B', 'b')],
           names=['Upper', 'lower'])
pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)
Score
Upper lower
A a perfect
b good
B a fair
b bad

② Building the tuples with zip

L1 = list('AABB')
L2 = list('abab')
tuples = list(zip(L1,L2))
mul_index = pd.MultiIndex.from_tuples(tuples, names=('Upper', 'Lower'))
pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)

Score
Upper Lower
A a perfect
b good
B a fair
b bad

③ Creating from arrays (here a list of [upper, lower] pairs passed to from_tuples)

arrays = [['A','a'],['A','b'],['B','a'],['B','b']]
mul_index = pd.MultiIndex.from_tuples(arrays, names=('Upper', 'Lower'))
pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)

Score
Upper Lower
A a perfect
b good
B a fair
b bad
mul_index
# This shows that the inner lists are converted to tuples automatically
MultiIndex([('A', 'a'),
            ('A', 'b'),
            ('B', 'a'),
            ('B', 'b')],
           names=['Upper', 'Lower'])

(b) With from_product

L1=['A','B']
L2=['a','b']
pd.MultiIndex.from_product([L1,L2],names=('Upper','lower'))
# Cartesian product: every element of L1 is paired with every element of L2
MultiIndex([('A', 'a'),
            ('A', 'b'),
            ('B', 'a'),
            ('B', 'b')],
           names=['Upper', 'lower'])
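For comparison, the three constructors can be shown side by side. Note that `from_arrays` expects one array per level, which differs from the list of pairs passed to `from_tuples` in ③ above. A minimal sketch:

```python
import pandas as pd

names = ["Upper", "Lower"]

# One tuple per row
from_tuples = pd.MultiIndex.from_tuples(
    [("A", "a"), ("A", "b"), ("B", "a"), ("B", "b")], names=names
)

# One array per level
from_arrays = pd.MultiIndex.from_arrays(
    [["A", "A", "B", "B"], ["a", "b", "a", "b"]], names=names
)

# Cartesian product of the level values
from_product = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]], names=names)

print(from_tuples.equals(from_arrays), from_tuples.equals(from_product))
```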

(c) From existing columns of df (the set_index method)

df_using_mul = df.set_index(['Class','Address'])
df_using_mul.head()

# df_using_mul=df.reset_index().set_index(['Height','Weight'])
# df_using_mul.head()
School Gender Height Weight Math Physics
Class Address
C_1 street_1 S_1 M 173 63 34.0 A+
street_2 S_1 F 192 73 32.5 B+
street_2 S_1 M 186 82 87.2 B+
street_2 S_1 F 167 81 80.4 B-
street_4 S_1 F 159 64 84.8 B+

2. Slicing a multi-level index

df_using_mul.head()
School Gender Height Weight Math Physics
Class Address
C_1 street_1 S_1 M 173 63 34.0 A+
street_2 S_1 F 192 73 32.5 B+
street_2 S_1 M 186 82 87.2 B+
street_2 S_1 F 167 81 80.4 B-
street_4 S_1 F 159 64 84.8 B+

(a) Ordinary slicing

display(df_using_mul.loc['C_2'])
display(df_using_mul.index.is_lexsorted())
# is_lexsorted checks whether the index is sorted (deprecated since pandas 1.3 in favor of is_monotonic_increasing)
display(df_using_mul.sort_index().loc['C_2'])
display(df_using_mul.sort_index().index.is_lexsorted())
School Gender Height Weight Math Physics
Address
street_5 S_1 M 188 68 97.0 A-
street_4 S_1 F 176 94 63.5 B-
street_6 S_1 M 160 53 58.8 A+
street_5 S_1 F 162 63 33.8 B
street_6 S_1 F 167 63 68.4 B-
street_5 S_2 M 193 100 39.1 B
street_7 S_2 F 194 77 68.5 B+
street_4 S_2 M 155 91 73.8 A+
street_1 S_2 M 175 74 47.2 B-
street_7 S_2 F 183 76 85.4 B
False
School Gender Height Weight Math Physics
Address
street_1 S_2 M 175 74 47.2 B-
street_4 S_1 F 176 94 63.5 B-
street_4 S_2 M 155 91 73.8 A+
street_5 S_1 M 188 68 97.0 A-
street_5 S_1 F 162 63 33.8 B
street_5 S_2 M 193 100 39.1 B
street_6 S_1 M 160 53 58.8 A+
street_6 S_1 F 167 63 68.4 B-
street_7 S_2 F 194 77 68.5 B+
street_7 S_2 F 183 76 85.4 B
True
#df_using_mul.loc['C_2','street_5']
# On an unsorted index, a lookup like this emits a performance warning
display(df_using_mul.sort_index().loc['C_2'])
df_using_mul.sort_index().loc['C_2','street_5']
School Gender Height Weight Math Physics
Address
street_1 S_2 M 175 74 47.2 B-
street_4 S_1 F 176 94 63.5 B-
street_4 S_2 M 155 91 73.8 A+
street_5 S_1 M 188 68 97.0 A-
street_5 S_1 F 162 63 33.8 B
street_5 S_2 M 193 100 39.1 B
street_6 S_1 M 160 53 58.8 A+
street_6 S_1 F 167 63 68.4 B-
street_7 S_2 F 194 77 68.5 B+
street_7 S_2 F 183 76 85.4 B
School Gender Height Weight Math Physics
Class Address
C_2 street_5 S_1 M 188 68 97.0 A-
street_5 S_1 F 162 63 33.8 B
street_5 S_2 M 193 100 39.1 B
#df_using_mul.loc[('C_2','street_5'):] raises an error
# Multi-level slicing is not possible on an unsorted index
df_using_mul.sort_index().loc[('C_2','street_6'):('C_3','street_4')]
# Because loc is used, the right endpoint is still included
School Gender Height Weight Math Physics
Class Address
C_2 street_6 S_1 M 160 53 58.8 A+
street_6 S_1 F 167 63 68.4 B-
street_7 S_2 F 194 77 68.5 B+
street_7 S_2 F 183 76 85.4 B
C_3 street_1 S_1 F 175 57 87.7 A-
street_2 S_1 M 195 70 85.2 A
street_4 S_1 M 161 68 31.5 B+
street_4 S_2 F 157 78 72.3 B+
street_4 S_2 M 187 73 48.9 B
df_using_mul.sort_index().loc[('C_2','street_7'):'C_3','School']
# A non-tuple endpoint is also valid; it selects every element under that first-level label
Class  Address 
C_2    street_7    S_2
       street_7    S_2
C_3    street_1    S_1
       street_2    S_1
       street_4    S_1
       street_4    S_2
       street_4    S_2
       street_5    S_1
       street_5    S_2
       street_6    S_2
       street_7    S_1
       street_7    S_2
Name: School, dtype: object
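The sorting requirement can be reproduced on a four-row Series. This is a sketch with hypothetical values; `is_monotonic_increasing` is used here as the sortedness check.

```python
import pandas as pd
from pandas.errors import UnsortedIndexError

# Hypothetical unsorted two-level index.
idx_unsorted = pd.MultiIndex.from_tuples(
    [("C_2", "street_7"), ("C_1", "street_2"), ("C_3", "street_1"), ("C_2", "street_4")],
    names=["Class", "Address"],
)
s = pd.Series([68.5, 32.5, 87.7, 63.5], index=idx_unsorted, name="Math")

try:
    s.loc[("C_2", "street_4"):]      # tuple slice on an unsorted index
    raised = False
except UnsortedIndexError:
    raised = True

s_sorted = s.sort_index()
# After sorting, the slice works and includes both endpoints.
tail = s_sorted.loc[("C_2", "street_4"):("C_3", "street_1")]

print(raised, s_sorted.index.is_monotonic_increasing, tail.tolist())
```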

(b) First special case: a list of tuples

df_using_mul.sort_index().loc[[('C_2','street_7'),('C_3','street_2')]]
# Selects specific entries, each pinned down to the innermost level
School Gender Height Weight Math Physics
Class Address
C_2 street_7 S_2 F 194 77 68.5 B+
street_7 S_2 F 183 76 85.4 B
C_3 street_2 S_1 M 195 70 85.2 A

(c) Second special case: a tuple of lists

df_using_mul.sort_index().loc[(['C_2','C_3'],['street_4','street_7']),:]
# Selects rows whose first level is in ['C_2','C_3'] and whose second level is in ['street_4','street_7']
School Gender Height Weight Math Physics
Class Address
C_2 street_4 S_1 F 176 94 63.5 B-
street_4 S_2 M 155 91 73.8 A+
street_7 S_2 F 194 77 68.5 B+
street_7 S_2 F 183 76 85.4 B
C_3 street_4 S_1 M 161 68 31.5 B+
street_4 S_2 F 157 78 72.3 B+
street_4 S_2 M 187 73 48.9 B
street_7 S_1 M 188 82 49.7 B
street_7 S_2 F 190 99 65.9 C

3. Slice objects in multi-level indexing

np.random.rand(9,9)
array([[0.00157292, 0.3521813 , 0.57414635, 0.7402888 , 0.04642124,
        0.66619584, 0.92281977, 0.03996369, 0.65053856],
       [0.08477172, 0.88126615, 0.99788337, 0.7568739 , 0.48090064,
        0.80362277, 0.3358786 , 0.37949268, 0.01001926],
       [0.59348037, 0.69988113, 0.17082218, 0.99141101, 0.12301197,
        0.54402624, 0.62715081, 0.31135379, 0.21957649],
       [0.66512945, 0.9106095 , 0.25563962, 0.07774486, 0.88233467,
        0.96112214, 0.94485486, 0.57277522, 0.54257246],
       [0.62443182, 0.20058009, 0.01188025, 0.22773041, 0.00204232,
        0.67665924, 0.32854489, 0.35794382, 0.6500615 ],
       [0.68680186, 0.97791972, 0.21070428, 0.40517635, 0.96220453,
        0.58241295, 0.77434508, 0.01228948, 0.02735097],
       [0.75966199, 0.79494387, 0.45020853, 0.93519435, 0.26549524,
        0.35259274, 0.04451145, 0.65311556, 0.65076007],
       [0.11867561, 0.78091157, 0.82840483, 0.36694628, 0.27188274,
        0.86574978, 0.49296668, 0.6349486 , 0.10321662],
       [0.31004969, 0.9974477 , 0.72932672, 0.84089302, 0.24069463,
        0.88026845, 0.65472614, 0.90057078, 0.94011198]])
L1,L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
L3,L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
df_s = pd.DataFrame(np.random.rand(9,9),index=mul_index1,columns=mul_index2)
df_s



Big D E F
Small d e f d e f d e f
Upper Lower
A a 0.727690 0.349177 0.462134 0.756151 0.595001 0.854182 0.564885 0.032514 0.584503
b 0.456652 0.064605 0.162614 0.943543 0.358184 0.320619 0.255107 0.848212 0.608838
c 0.527055 0.771120 0.354109 0.020767 0.305929 0.062331 0.315193 0.953083 0.354281
B a 0.810611 0.222145 0.167663 0.398043 0.533769 0.798719 0.942496 0.895687 0.200293
b 0.914262 0.014343 0.725297 0.145157 0.661077 0.998086 0.986214 0.350872 0.799577
c 0.657973 0.002625 0.593380 0.782111 0.023865 0.848250 0.163945 0.969662 0.919264
C a 0.378590 0.826283 0.422204 0.554652 0.843198 0.362767 0.726439 0.425965 0.914429
b 0.413699 0.189651 0.134559 0.165747 0.348509 0.922714 0.973793 0.348467 0.743193
c 0.552581 0.273073 0.620954 0.650415 0.235731 0.418648 0.170382 0.526475 0.677653
idx=pd.IndexSlice

IndexSlice is essentially a wrapper around several slice objects

idx[1:9:2,'A':'C','start':'end':2]
(slice(1, 9, 2), slice('A', 'C', None), slice('start', 'end', 2))
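Since IndexSlice only builds tuples of slice objects, the result can be compared with a hand-written tuple and then passed to loc. A sketch on a hypothetical 9 x 4 frame filled with the integers 0 to 35:

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice
as_tuple = idx[1:9:2, "A":"C"]   # the same as (slice(1, 9, 2), slice("A", "C", None))

rows = pd.MultiIndex.from_product([["A", "B", "C"], ["a", "b", "c"]], names=["Upper", "Lower"])
cols = pd.MultiIndex.from_product([["D", "E"], ["d", "e"]], names=["Big", "Small"])
df_demo = pd.DataFrame(np.arange(36).reshape(9, 4), index=rows, columns=cols)

# Rows: Upper up to 'B' and Lower from 'b' on. Columns: everything under 'E'.
picked = df_demo.loc[idx[:"B", "b":], idx["E", :]]

print(as_tuple)
print(picked.shape)
```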

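IndexSlice works together with loc to perform slicing. There are two main forms.

(a) The loc[idx[*,*]] form

The first * stands for rows and the second for columns. When a boolean indexer is used, its index must align with the axis it selects on.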

# Example 1
df_s.loc[idx['B':,df_s.iloc[0]>0.6]]
#df_s.loc[idx['B':,df_s.iloc[:,0]>0.6]] # raises an error because the index does not align
Big D E
Small d d f
Upper Lower
B a 0.810611 0.398043 0.798719
b 0.914262 0.145157 0.998086
c 0.657973 0.782111 0.848250
C a 0.378590 0.554652 0.362767
b 0.413699 0.165747 0.922714
c 0.552581 0.650415 0.418648
# Example 2
df_s.loc[idx[df_s.iloc[:,0]>0.6,:('E','f')]]
Big D E
Small d e f d e f
Upper Lower
A a 0.727690 0.349177 0.462134 0.756151 0.595001 0.854182
B a 0.810611 0.222145 0.167663 0.398043 0.533769 0.798719
b 0.914262 0.014343 0.725297 0.145157 0.661077 0.998086
c 0.657973 0.002625 0.593380 0.782111 0.023865 0.848250

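(b) The loc[idx[*,*],idx[*,*]] form

The difference from (a) is that loc in (a) receives a single idx with no comma outside it, whereas here two idx objects are separated by a comma: the first indexes rows and the second indexes columns.

This form is very flexible, so several examples follow.

# Example 1
df_s.loc[idx['A'],idx['D':]]
# If an inner level is specified, the outer levels before it must be specified too
#df_s.loc[idx['a'],idx['D':]] # raises an error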
Big D E F
Small d e f d e f d e f
Lower
a 0.727690 0.349177 0.462134 0.756151 0.595001 0.854182 0.564885 0.032514 0.584503
b 0.456652 0.064605 0.162614 0.943543 0.358184 0.320619 0.255107 0.848212 0.608838
c 0.527055 0.771120 0.354109 0.020767 0.305929 0.062331 0.315193 0.953083 0.354281
# Example 2
df_s.loc[idx[:'B','b':],:] # shows that (1) a slice can be used at each level and (2) a whole idx can be replaced by : to select everything
Big D E F
Small d e f d e f d e f
Upper Lower
A b 0.456652 0.064605 0.162614 0.943543 0.358184 0.320619 0.255107 0.848212 0.608838
c 0.527055 0.771120 0.354109 0.020767 0.305929 0.062331 0.315193 0.953083 0.354281
B b 0.914262 0.014343 0.725297 0.145157 0.661077 0.998086 0.986214 0.350872 0.799577
c 0.657973 0.002625 0.593380 0.782111 0.023865 0.848250 0.163945 0.969662 0.919264
# Example 3
df_s.iloc[:,0]>0.6
Upper  Lower
A      a         True
       b        False
       c        False
B      a         True
       b         True
       c         True
C      a        False
       b        False
       c        False
Name: (D, d), dtype: bool
df_s.loc[idx[:'B',df_s.iloc[:,0]>0.6],:] # shows that a boolean indexer can also be used in the corresponding position
Big D E F
Small d e f d e f d e f
Upper Lower
A a 0.727690 0.349177 0.462134 0.756151 0.595001 0.854182 0.564885 0.032514 0.584503
B a 0.810611 0.222145 0.167663 0.398043 0.533769 0.798719 0.942496 0.895687 0.200293
b 0.914262 0.014343 0.725297 0.145157 0.661077 0.998086 0.986214 0.350872 0.799577
c 0.657973 0.002625 0.593380 0.782111 0.023865 0.848250 0.163945 0.969662 0.919264
L1,L2 = ['A','B'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
L3,L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
df_s = pd.DataFrame(np.random.rand(6,9),index=mul_index1,columns=mul_index2)
df_s
Big D E F
Small d e f d e f d e f
Upper Lower
A a 0.582610 0.641514 0.560705 0.910591 0.599434 0.208767 0.515083 0.884898 0.403068
b 0.294533 0.924421 0.124904 0.880993 0.002462 0.785282 0.519103 0.719144 0.867035
c 0.904616 0.315742 0.313072 0.376997 0.474177 0.317675 0.591629 0.857103 0.345019
B a 0.372711 0.389083 0.756115 0.504690 0.380259 0.743078 0.235606 0.477790 0.864240
b 0.381907 0.088245 0.858773 0.801386 0.140712 0.363459 0.582477 0.592419 0.935077
c 0.764331 0.021464 0.543272 0.819539 0.334704 0.771924 0.766925 0.054824 0.016175
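# Example 4
# Note in particular that in form (b) the boolean indexer does not need an aligned index; only its length must match, as in this example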
df_s.loc[idx[:'B',(df_s.iloc[0]>0.6)[:6]],:]
Big D E F
Small d e f d e f d e f
Upper Lower
A b 0.294533 0.924421 0.124904 0.880993 0.002462 0.785282 0.519103 0.719144 0.867035
B a 0.372711 0.389083 0.756115 0.504690 0.380259 0.743078 0.235606 0.477790 0.864240
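# Example 5
df_s.loc[idx[:'B','c':,(df_s.iloc[:,0]>0.6)],:]
# When idx has more entries (k1) than df has levels (k2), the first k2 entries, if they are labels or label slices, filter the matching df levels; a boolean sequence of the same length may be used in their place
# The extra entries beyond k2 can only be boolean sequences of the same length. This design lets label filtering and conditional filtering be combined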
Big D E F
Small d e f d e f d e f
Upper Lower
A c 0.904616 0.315742 0.313072 0.376997 0.474177 0.317675 0.591629 0.857103 0.345019
B c 0.764331 0.021464 0.543272 0.819539 0.334704 0.771924 0.766925 0.054824 0.016175
# Example 6
df_s.loc[idx[:'B',(df_s.iloc[:,0]>0.6),(df_s.iloc[:,0]>0.6)],:] # conditional filtering here, not label filtering
#df_s.loc[idx[:'B',(df_s.iloc[:,0]>0.6),'c',:]] # raises an error
#df_s.loc[idx[:'c','B',(df_s.iloc[:,0]>0.6),:]] # raises an error
Big D E F
Small d e f d e f d e f
Upper Lower
A c 0.904616 0.315742 0.313072 0.376997 0.474177 0.317675 0.591629 0.857103 0.345019
B c 0.764331 0.021464 0.543272 0.819539 0.334704 0.771924 0.766925 0.054824 0.016175

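IndexSlice is very flexible to use:

With a comma (two idx objects), the first idx selects rows and the second selects columns, and boolean indexers do not need an aligned index.

Without a comma (a single idx), the first entry selects rows and the later entry selects columns, and boolean indexers must have an aligned index.

display(df_s.loc[idx['B':,df_s['D']['d']>0.3],idx[df_s.sum()>4]])
# df_s.sum() sums over each column by default, so it returns a Series of length 9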
# type(df_s.sum())
#df_s.loc['B':,idx[df_s['D']['d']>0.5]]#,idx[df_s.sum()>4]
display(df_s['D']['d']>0.3)
display(df_s.sum()>4)
# display(df_s.loc[idx['B':,df_s['D']['d']>0.3],idx[df_s.sum()>4]])#
a=df_s['D']['d']>0.3
df_s.loc[idx['B':,df_s.sum()>4]]
#df_s.loc[idx['B':,df_s['D']['d']>0.3]] # raises an error because the index does not align
Big D E F
Small d d f d e f
Upper Lower
B a 0.810611 0.398043 0.798719 0.942496 0.895687 0.200293
b 0.914262 0.145157 0.998086 0.986214 0.350872 0.799577
c 0.657973 0.782111 0.848250 0.163945 0.969662 0.919264
C a 0.378590 0.554652 0.362767 0.726439 0.425965 0.914429
b 0.413699 0.165747 0.922714 0.973793 0.348467 0.743193
c 0.552581 0.650415 0.418648 0.170382 0.526475 0.677653
Upper  Lower
A      a        True
       b        True
       c        True
B      a        True
       b        True
       c        True
C      a        True
       b        True
       c        True
Name: d, dtype: bool



Big  Small
D    d         True
     e        False
     f        False
E    d         True
     e        False
     f         True
F    d         True
     e         True
     f         True
dtype: bool
Big D E F
Small d d f d e f
Upper Lower
B a 0.810611 0.398043 0.798719 0.942496 0.895687 0.200293
b 0.914262 0.145157 0.998086 0.986214 0.350872 0.799577
c 0.657973 0.782111 0.848250 0.163945 0.969662 0.919264
C a 0.378590 0.554652 0.362767 0.726439 0.425965 0.914429
b 0.413699 0.165747 0.922714 0.973793 0.348467 0.743193
c 0.552581 0.650415 0.418648 0.170382 0.526475 0.677653

4. Swapping index levels

(a) The swaplevel method (swap two levels)

df_using_mul.head()
School Gender Height Weight Math Physics
Class Address
C_1 street_1 S_1 M 173 63 34.0 A+
street_2 S_1 F 192 73 32.5 B+
street_2 S_1 M 186 82 87.2 B+
street_2 S_1 F 167 81 80.4 B-
street_4 S_1 F 159 64 84.8 B+
df_using_mul.swaplevel(i=1,j=0,axis=0).sort_index().head()
School Gender Height Weight Math Physics
Address Class
street_1 C_1 S_1 M 173 63 34.0 A+
C_2 S_2 M 175 74 47.2 B-
C_3 S_1 F 175 57 87.7 A-
street_2 C_1 S_1 F 192 73 32.5 B+
C_1 S_1 M 186 82 87.2 B+

(b) The reorder_levels method (reorder several levels)

df_muls = df.set_index(['School','Class','Address'])
df_muls.head()
df_mul=df.set_index(['Physics','School','Class'])
df_mul.head()
Gender Address Height Weight Math
Physics School Class
A+ S_1 C_1 M street_1 173 63 34.0
B+ S_1 C_1 F street_2 192 73 32.5
C_1 M street_2 186 82 87.2
B- S_1 C_1 F street_2 167 81 80.4
B+ S_1 C_1 F street_4 159 64 84.8
df_muls.reorder_levels([2,0,1],axis=0).sort_index().head()
df_mul.reorder_levels([1,2,0],axis=0).sort_index().head()
Gender Address Height Weight Math
School Class Physics
S_1 C_1 A+ M street_1 173 63 34.0
B+ F street_2 192 73 32.5
B+ M street_2 186 82 87.2
B+ F street_4 159 64 84.8
B- F street_2 167 81 80.4
# If the levels have names, the names can be used directly
df_muls.reorder_levels(['Address','School','Class'],axis=0).sort_index().head()
df_mul.reorder_levels(['School','Class','Physics'],axis=0).sort_index().head()
Gender Address Height Weight Math
School Class Physics
S_1 C_1 A+ M street_1 173 63 34.0
B+ F street_2 192 73 32.5
B+ M street_2 186 82 87.2
B+ F street_4 159 64 84.8
B- F street_2 167 81 80.4
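
A small sketch, assuming a made-up three-level table (toy is a hypothetical name), suggests that reordering by position and reordering by name lead to the same index:

```python
import pandas as pd

# Toy data (hypothetical) with three named index levels
toy = pd.DataFrame(
    {"Math": [34.0, 32.5, 87.2]},
    index=pd.MultiIndex.from_tuples(
        [("A+", "S_1", "C_1"), ("B+", "S_1", "C_1"), ("B+", "S_2", "C_1")],
        names=["Physics", "School", "Class"],
    ),
)

# The same reordering, expressed once by level position and once by level name
by_position = toy.reorder_levels([1, 2, 0], axis=0)
by_name = toy.reorder_levels(["School", "Class", "Physics"], axis=0)
print(by_position.index.equals(by_name.index))
```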

III. Setting the index

1. The index_col parameter

index_col is a parameter of read_csv, not a method:

pd.read_csv('data/table.csv',index_col=['Address','School']).head()
Class ID Gender Height Weight Math Physics
Address School
street_1 S_1 C_1 1101 M 173 63 34.0 A+
street_2 S_1 C_1 1102 F 192 73 32.5 B+
S_1 C_1 1103 M 186 82 87.2 B+
S_1 C_1 1104 F 167 81 80.4 B-
street_4 S_1 C_1 1105 F 159 64 84.8 B+
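
To try index_col without the data file, here is a minimal sketch that reads an in-memory CSV (the text in csv_text is made up for illustration). A single name gives a single-level index, and a list of names gives a multi-level one:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data/table.csv
csv_text = """ID,School,Address,Height
1101,S_1,street_1,173
1102,S_1,street_2,192
1103,S_1,street_2,186
"""

by_id = pd.read_csv(io.StringIO(csv_text), index_col="ID")
by_two = pd.read_csv(io.StringIO(csv_text), index_col=["Address", "School"])

print(by_id.index.name)         # single-level index named ID
print(by_two.index.names)       # two index levels: Address, School
print(by_two.columns.tolist())  # the remaining columns, including ID
```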

2. reindex and reindex_like

reindex re-indexes the data. Its key property is index alignment, and it is often used for reordering.

df
# df[df['ID']==1106]
# df['ID']
#df.info()
df.loc[1105:2402]
School Class Gender Address Height Weight Math Physics
ID
1105 S_1 C_1 F street_4 159 64 84.8 B+
1201 S_1 C_2 M street_5 188 68 97.0 A-
1202 S_1 C_2 F street_4 176 94 63.5 B-
1203 S_1 C_2 M street_6 160 53 58.8 A+
1204 S_1 C_2 F street_5 162 63 33.8 B
1205 S_1 C_2 F street_6 167 63 68.4 B-
1301 S_1 C_3 M street_4 161 68 31.5 B+
1302 S_1 C_3 F street_1 175 57 87.7 A-
1303 S_1 C_3 M street_7 188 82 49.7 B
1304 S_1 C_3 M street_2 195 70 85.2 A
1305 S_1 C_3 F street_5 187 69 61.7 B-
2101 S_2 C_1 M street_7 174 84 83.3 C
2102 S_2 C_1 F street_6 161 61 50.6 B+
2103 S_2 C_1 M street_4 157 61 52.5 B-
2104 S_2 C_1 F street_5 159 97 72.2 B+
2105 S_2 C_1 M street_4 170 81 34.2 A
2201 S_2 C_2 M street_5 193 100 39.1 B
2202 S_2 C_2 F street_7 194 77 68.5 B+
2203 S_2 C_2 M street_4 155 91 73.8 A+
2204 S_2 C_2 M street_1 175 74 47.2 B-
2205 S_2 C_2 F street_7 183 76 85.4 B
2301 S_2 C_3 F street_4 157 78 72.3 B+
2302 S_2 C_3 M street_5 171 88 32.7 A
2303 S_2 C_3 F street_7 190 99 65.9 C
2304 S_2 C_3 F street_6 164 81 95.5 A-
2305 S_2 C_3 M street_4 187 73 48.9 B
2401 S_2 C_4 F street_2 192 62 45.3 A
2402 S_2 C_4 M street_7 166 82 48.7 B
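df.reindex(index=[1101,1203,1206,2402])  # 1206 is not in df, so its row is filled with NaN
df.reindex(index=[1101,1203,1205,2402])  # all four labels exist; this is the output shown below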
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1203 S_1 C_2 M street_6 160 53 58.8 A+
1205 S_1 C_2 F street_6 167 63 68.4 B-
2402 S_2 C_4 M street_7 166 82 48.7 B
df.reindex(columns=['Height','Gender','Average']).head()
Height Gender Average
ID
1101 173 M NaN
1102 192 F NaN
1103 186 M NaN
1104 167 F NaN
1105 159 F NaN
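
A minimal sketch on a made-up table (toy is a hypothetical name) of what happens to a label that does not exist: reindex keeps the requested label and fills it with NaN, or with fill_value when one is given:

```python
import pandas as pd

# Toy data (hypothetical); note that 1206 is not among the labels
toy = pd.DataFrame(
    {"Height": [173, 160, 167]},
    index=pd.Index([1101, 1203, 1205], name="ID"),
)

missing = toy.reindex(index=[1101, 1203, 1206])               # 1206 becomes NaN
filled = toy.reindex(index=[1101, 1203, 1206], fill_value=0)  # 1206 becomes 0
print(missing)
print(filled)
```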

Missing values can be filled via fill_value or method (bfill/ffill/nearest); the method parameter requires the index to be monotonic

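a=df.reindex(index=[1101,1203,1206,2402],method='bfill')
# bfill fills the missing label 1206 from the next valid row, ffill from the previous valid row, nearest from the closest label
# Check whether the index is monotonic (is_monotonic_increasing in current pandas; older versions also offered is_monotonic, and is_lexsorted for a MultiIndex)
display(a.index.is_monotonic_increasing)
display(a.sort_index().index.is_monotonic_increasing)
a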
True



True
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1203 S_1 C_2 M street_6 160 53 58.8 A+
1206 S_1 C_3 M street_4 161 68 31.5 B+
2402 S_2 C_4 M street_7 166 82 48.7 B
df.reindex(index=[1101,1203,1206,2402],method='nearest')
# Numerically, 1205 is closer to 1206 than 1301 is, so the former is used for filling
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1203 S_1 C_2 M street_6 160 53 58.8 A+
1206 S_1 C_2 F street_6 167 63 68.4 B-
2402 S_2 C_4 M street_7 166 82 48.7 B
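
The three fill methods can be compared side by side. This is a small sketch on two made-up rows (toy and results are hypothetical names), chosen so that the missing label 1206 sits between 1205 and 1301:

```python
import pandas as pd

# Toy data (hypothetical) with a monotonic index, as method requires
toy = pd.DataFrame(
    {"Height": [167, 161]},
    index=pd.Index([1205, 1301], name="ID"),
)

results = {}
for method in ["ffill", "bfill", "nearest"]:
    # Ask for a label that does not exist and let each method fill it
    value = toy.reindex(index=[1206], method=method)["Height"].iloc[0]
    results[method] = int(value)

print(results)  # ffill and nearest take the 1205 row, bfill takes the 1301 row
```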

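reindex_like returns a DataFrame whose row and column indexes exactly match those of the DataFrame passed as the argument; the data come from the DataFrame the method is called on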

df_temp = pd.DataFrame({'Weight':np.zeros(5),
                        'Height':np.zeros(5),
                        'ID':[1101,1104,1103,1106,1102]}).set_index('ID')
df_temp.reindex_like(df[0:5][['Weight','Height']])


df_temp1=pd.DataFrame({'Weight':np.zeros(5),
                     'Height':np.zeros(5),
                     'ID':[1101,1104,1103,1106,1102]}).set_index('ID')
display(df_temp1)
display(df[0:5][['Weight','Height']])
df_temp1.reindex_like(df[0:5][['Weight','Height']])
Weight Height
ID
1101 0.0 0.0
1104 0.0 0.0
1103 0.0 0.0
1106 0.0 0.0
1102 0.0 0.0
Weight Height
ID
1101 63 173
1102 73 192
1103 82 186
1104 81 167
1105 64 159
Weight Height
ID
1101 0.0 0.0
1102 0.0 0.0
1103 0.0 0.0
1104 0.0 0.0
1105 NaN NaN

If the index of df_temp is monotonic, the method parameter can also be used:

df_temp = pd.DataFrame({'Weight':range(5),
                        'Height':range(5),
                        'ID':[1101,1104,1103,1106,1102]}).set_index('ID').sort_index()
df_temp.reindex_like(df[0:5][['Weight','Height']],method='bfill')
# You can check for yourself whether the value at 1105 here was filled by the bfill rule

df_temp=pd.DataFrame({'Weight':range(5),
                     'Height':range(5),
                     'ID':[1101,1104,1103,1106,1102]}).set_index('ID').sort_index()
df_temp.reindex_like(df[:5][['Weight','Height']],method='bfill')
Weight Height
ID
1101 0 0
1102 4 4
1103 2 2
1104 1 1
1105 3 3
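
A minimal sketch of reindex_like on made-up tables (source and template are hypothetical names): the result takes its shape from the template and its values from the calling table, with NaN where the calling table has no matching label:

```python
import pandas as pd

# Hypothetical source table: provides the data
source = pd.DataFrame(
    {"Weight": [1, 2], "Height": [3, 4], "Extra": [5, 6]},
    index=[1101, 1102],
)
# Hypothetical template: provides only the row and column labels
template = pd.DataFrame({"Height": [0, 0], "Weight": [0, 0]}, index=[1102, 1105])

result = source.reindex_like(template)
print(result)  # columns Height, Weight; row 1102 from source, row 1105 all NaN
```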

3. set_index and reset_index

First, set_index: as the name suggests, it uses one or more columns as the index

Using a column of the table as the index:

df.head()
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
df.set_index('Class').head()
School Gender Address Height Weight Math Physics
Class
C_1 S_1 M street_1 173 63 34.0 A+
C_1 S_1 F street_2 192 73 32.5 B+
C_1 S_1 M street_2 186 82 87.2 B+
C_1 S_1 F street_2 167 81 80.4 B-
C_1 S_1 F street_4 159 64 84.8 B+

The append parameter keeps the current index and adds the new level to it

df.set_index('Class',append=True).head()
School Gender Address Height Weight Math Physics
ID Class
1101 C_1 S_1 M street_1 173 63 34.0 A+
1102 C_1 S_1 F street_2 192 73 32.5 B+
1103 C_1 S_1 M street_2 186 82 87.2 B+
1104 C_1 S_1 F street_2 167 81 80.4 B-
1105 C_1 S_1 F street_4 159 64 84.8 B+

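When using an external sequence of the same length as the table as the index (it must be converted to a Series first, otherwise an error is raised):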

df.set_index(pd.Series(range(df.shape[0]))).head()
# df.reset_index().head()
School Class Gender Address Height Weight Math Physics
0 S_1 C_1 M street_1 173 63 34.0 A+
1 S_1 C_1 F street_2 192 73 32.5 B+
2 S_1 C_1 M street_2 186 82 87.2 B+
3 S_1 C_1 F street_2 167 81 80.4 B-
4 S_1 C_1 F street_4 159 64 84.8 B+

A multi-level index can be set directly:

df.set_index([pd.Series(range(df.shape[0])),pd.Series(np.ones(df.shape[0]))]).head()
School Class Gender Address Height Weight Math Physics
0 1.0 S_1 C_1 M street_1 173 63 34.0 A+
1 1.0 S_1 C_1 F street_2 192 73 32.5 B+
2 1.0 S_1 C_1 M street_2 186 82 87.2 B+
3 1.0 S_1 C_1 F street_2 167 81 80.4 B-
4 1.0 S_1 C_1 F street_4 159 64 84.8 B+

Next, reset_index: its main purpose is to reset the index

By default it restores the natural-number index (0, 1, 2, ...):

df.reset_index().head()
ID School Class Gender Address Height Weight Math Physics
0 1101 S_1 C_1 M street_1 173 63 34.0 A+
1 1102 S_1 C_1 F street_2 192 73 32.5 B+
2 1103 S_1 C_1 M street_2 186 82 87.2 B+
3 1104 S_1 C_1 F street_2 167 81 80.4 B-
4 1105 S_1 C_1 F street_4 159 64 84.8 B+
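
set_index and reset_index can be seen as inverse operations. A small sketch on a made-up table (toy is a hypothetical name) shows a column being appended to the index and then moved back:

```python
import pandas as pd

# Toy data (hypothetical), indexed by ID
toy = pd.DataFrame(
    {"Class": ["C_1", "C_1", "C_2"], "Height": [173, 192, 188]},
    index=pd.Index([1101, 1102, 1201], name="ID"),
)

replaced = toy.set_index("Class")               # ID is discarded from the index
appended = toy.set_index("Class", append=True)  # index levels are now ID, Class
restored = appended.reset_index("Class")        # Class is a column again

print(replaced.index.name)
print(appended.index.names)
print(restored.equals(toy))
```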

The level parameter specifies which index level is reset, and the col_level parameter specifies which column level it is inserted into:

L1,L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
L3,L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
df_temp = pd.DataFrame(np.random.rand(9,9),index=mul_index1,columns=mul_index2)
df_temp.head()

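# Variant with mul_index2 built from [L1,L2] instead of [L3,L4]; the outputs shown below (column labels A/B/C and a/b/c) match this variant
# L1,L2=['A','B','C'],['a','b','c']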
# mul_index1=pd.MultiIndex.from_product([L1,L2],names=['Upper','Lower'])
# L3,L4=['D','E','F'],['d','e','f']
# mul_index2=pd.MultiIndex.from_product([L1,L2],names=('Big','Small'))
# df_temp=pd.DataFrame(np.random.rand(9,9),index=mul_index1,columns=mul_index2)
# df_temp.head()
Big A B C
Small a b c a b c a b c
Upper Lower
A a 0.779649 0.850942 0.498576 0.384237 0.070980 0.154522 0.029674 0.566804 0.607491
b 0.192239 0.861111 0.473844 0.375134 0.071124 0.954574 0.811846 0.130554 0.621900
c 0.241607 0.969242 0.425643 0.745032 0.400758 0.316587 0.510980 0.099285 0.442423
B a 0.152618 0.079949 0.427708 0.091412 0.050334 0.504813 0.562932 0.033970 0.098759
b 0.722760 0.914694 0.061265 0.946915 0.699833 0.858146 0.235059 0.637032 0.672221
df_temp1 = df_temp.reset_index(level=1,col_level=1)
display(df_temp1.head())


df_temp2=df_temp.reset_index(level=1,col_level=0)
display(df_temp2.head())

Big A B C
Small Lower a b c a b c a b c
Upper
A a 0.779649 0.850942 0.498576 0.384237 0.070980 0.154522 0.029674 0.566804 0.607491
A b 0.192239 0.861111 0.473844 0.375134 0.071124 0.954574 0.811846 0.130554 0.621900
A c 0.241607 0.969242 0.425643 0.745032 0.400758 0.316587 0.510980 0.099285 0.442423
B a 0.152618 0.079949 0.427708 0.091412 0.050334 0.504813 0.562932 0.033970 0.098759
B b 0.722760 0.914694 0.061265 0.946915 0.699833 0.858146 0.235059 0.637032 0.672221
Big Lower A B C
Small a b c a b c a b c
Upper
A a 0.779649 0.850942 0.498576 0.384237 0.070980 0.154522 0.029674 0.566804 0.607491
A b 0.192239 0.861111 0.473844 0.375134 0.071124 0.954574 0.811846 0.130554 0.621900
A c 0.241607 0.969242 0.425643 0.745032 0.400758 0.316587 0.510980 0.099285 0.442423
B a 0.152618 0.079949 0.427708 0.091412 0.050334 0.504813 0.562932 0.033970 0.098759
B b 0.722760 0.914694 0.061265 0.946915 0.699833 0.858146 0.235059 0.637032 0.672221
display(df_temp1.columns)
# The label was indeed inserted into the second column level
display(df_temp2.columns)
MultiIndex([( '', 'Lower'),
            ('A',     'a'),
            ('A',     'b'),
            ('A',     'c'),
            ('B',     'a'),
            ('B',     'b'),
            ('B',     'c'),
            ('C',     'a'),
            ('C',     'b'),
            ('C',     'c')],
           names=['Big', 'Small'])



MultiIndex([('Lower',  ''),
            (    'A', 'a'),
            (    'A', 'b'),
            (    'A', 'c'),
            (    'B', 'a'),
            (    'B', 'b'),
            (    'B', 'c'),
            (    'C', 'a'),
            (    'C', 'b'),
            (    'C', 'c')],
           names=['Big', 'Small'])
df_temp1.index
# The innermost index level has been moved out
Index(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'], dtype='object', name='Upper')
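
The effect of col_level can be checked on a smaller table. This is a minimal sketch with made-up labels (toy, low and high are hypothetical names) that inspects where the label Lower ends up in the column MultiIndex:

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]], names=("Upper", "Lower"))
cols = pd.MultiIndex.from_product([["D", "E"], ["d", "e"]], names=("Big", "Small"))
toy = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# Move index level 1 (Lower) into the columns, at column level 1 or 0
low = toy.reset_index(level=1, col_level=1)
high = toy.reset_index(level=1, col_level=0)

print(low.columns[0])   # Lower sits in the second column level
print(high.columns[0])  # Lower sits in the first column level
print(low.index.name)   # only Upper remains in the row index
```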

4. rename_axis and rename

rename_axis is mainly useful for multi-level indexes: it changes the name of an index level, not the index labels

df_temp.rename_axis(index={'Lower':'LowerLower'},columns={'Big':'BigBig'})
BigBig A B C
Small a b c a b c a b c
Upper LowerLower
A a 0.779649 0.850942 0.498576 0.384237 0.070980 0.154522 0.029674 0.566804 0.607491
b 0.192239 0.861111 0.473844 0.375134 0.071124 0.954574 0.811846 0.130554 0.621900
c 0.241607 0.969242 0.425643 0.745032 0.400758 0.316587 0.510980 0.099285 0.442423
B a 0.152618 0.079949 0.427708 0.091412 0.050334 0.504813 0.562932 0.033970 0.098759
b 0.722760 0.914694 0.061265 0.946915 0.699833 0.858146 0.235059 0.637032 0.672221
c 0.300491 0.622385 0.441322 0.039829 0.643909 0.055289 0.783965 0.311105 0.030763
C a 0.203551 0.101941 0.954500 0.845587 0.690844 0.801937 0.290936 0.780660 0.475275
b 0.320449 0.726794 0.613314 0.827977 0.646678 0.059252 0.422571 0.942435 0.692632
c 0.511147 0.142134 0.861073 0.667099 0.307494 0.327941 0.858620 0.464931 0.394346

rename changes the column or row index labels, not the index names:

df_temp.rename(index={'A':'T'},columns={'e':'changed_e'}).head()
Big A B C
Small a b c a b c a b c
Upper Lower
T a 0.779649 0.850942 0.498576 0.384237 0.070980 0.154522 0.029674 0.566804 0.607491
b 0.192239 0.861111 0.473844 0.375134 0.071124 0.954574 0.811846 0.130554 0.621900
c 0.241607 0.969242 0.425643 0.745032 0.400758 0.316587 0.510980 0.099285 0.442423
B a 0.152618 0.079949 0.427708 0.091412 0.050334 0.504813 0.562932 0.033970 0.098759
b 0.722760 0.914694 0.061265 0.946915 0.699833 0.858146 0.235059 0.637032 0.672221
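
The difference between the two methods can be sketched on a made-up table (toy is a hypothetical name): rename_axis touches only the level names, while rename touches only the labels:

```python
import pandas as pd

toy = pd.DataFrame(
    {"x": [1, 2, 3]},
    index=pd.MultiIndex.from_tuples(
        [("A", "a"), ("A", "b"), ("B", "a")], names=["Upper", "Lower"]
    ),
)

renamed_axis = toy.rename_axis(index={"Lower": "LowerLower"})  # level name changes
renamed_labels = toy.rename(index={"A": "T"})                  # label A becomes T

print(renamed_axis.index.names)
print(renamed_labels.index.tolist())
```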

IV. Common indexing functions

1. The where function

It fills the cells where the condition is False:

df.head()
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
df.where(df['Gender']=='M').head()
# Rows that do not satisfy the condition are set entirely to NaN
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173.0 63.0 34.0 A+
1102 NaN NaN NaN NaN NaN NaN NaN NaN
1103 S_1 C_1 M street_2 186.0 82.0 87.2 B+
1104 NaN NaN NaN NaN NaN NaN NaN NaN
1105 NaN NaN NaN NaN NaN NaN NaN NaN

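Filtering this way and then dropping the NaN rows selects the same rows as the [] operator (note that where upcasts integer columns to float because of the intermediate NaN values):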

df.where(df['Gender']=='M').dropna().head()
df[df['Gender']=='M'].head()
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1201 S_1 C_2 M street_5 188 68 97.0 A-
1203 S_1 C_2 M street_6 160 53 58.8 A+
1301 S_1 C_3 M street_4 161 68 31.5 B+
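
A small sketch on a made-up table (toy is a hypothetical name) that compares the two routes. The selected rows agree, while the dtype of the integer column differs because where passes through NaN:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Gender": ["M", "F", "M"], "Height": [173, 192, 186]},
    index=pd.Index([1101, 1102, 1103], name="ID"),
)
cond = toy["Gender"] == "M"

via_where = toy.where(cond).dropna()  # NaN rows are introduced, then dropped
via_brackets = toy[cond]              # rows are selected directly

print(via_where.index.equals(via_brackets.index))
print(via_where["Height"].dtype, via_brackets["Height"].dtype)
```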

The first argument is the boolean condition and the second is the fill value:

a=np.random.rand(df.shape[0],df.shape[1])
print(a)
df.where(df['Gender']=='M',a).head()
[[0.7161581  0.10276669 0.40238542 0.08958898 0.81354208 0.93618131
  0.07624998 0.76951042]
 [0.5989863  0.33505363 0.02580452 0.97710051 0.66610509 0.87578929
  0.10579943 0.32587477]
 [0.72323026 0.54804755 0.6574625  0.29000164 0.56393993 0.60193394
  0.13987508 0.66423202]
 [0.18904524 0.72877605 0.6523432  0.36142768 0.69567451 0.39747439
  0.30469684 0.37125112]
 [0.00551586 0.88153884 0.54566294 0.97107711 0.48850192 0.10998194
  0.90682808 0.41378463]
 [0.50033165 0.7738362  0.40988636 0.5300674  0.48312333 0.8504542
  0.04959469 0.09917488]
 [0.02734842 0.96474427 0.2802373  0.51467892 0.28559232 0.31821509
  0.66709526 0.27763044]
 [0.51644131 0.32083044 0.64969486 0.89021328 0.67345517 0.04336458
  0.09367256 0.75744686]
 [0.22493445 0.00813298 0.22216915 0.86470618 0.43776919 0.28244148
  0.30980727 0.58315116]
 [0.48258786 0.84567059 0.69134007 0.34076216 0.62695586 0.63140201
  0.84808381 0.37466675]
 [0.56076384 0.58164287 0.46504719 0.47365642 0.19272048 0.58133442
  0.24137399 0.05412933]
 [0.86008694 0.47449294 0.99349784 0.41395056 0.21880338 0.57870774
  0.08266912 0.61736266]
 [0.92363794 0.65247997 0.23419656 0.49394314 0.39020457 0.9076912
  0.1787334  0.44567342]
 [0.75460692 0.87818376 0.9506545  0.0726399  0.19348714 0.99663363
  0.15222477 0.45247023]
 [0.61724088 0.93745891 0.27255907 0.39972334 0.426046   0.32858071
  0.76331933 0.89152397]
 [0.34580174 0.80287412 0.43827906 0.80077107 0.04982336 0.20480553
  0.76536398 0.96840615]
 [0.46040455 0.75916605 0.07480359 0.22483045 0.51732291 0.38761611
  0.89792331 0.21509696]
 [0.06037899 0.54175144 0.69026867 0.37691624 0.63080779 0.47401334
  0.29484837 0.60367724]
 [0.51877573 0.13529959 0.03847122 0.56332936 0.43954205 0.14320149
  0.53932022 0.94058724]
 [0.48122222 0.70688747 0.33347702 0.22922328 0.63617473 0.12761284
  0.01558933 0.42228307]
 [0.30437386 0.97890045 0.87982765 0.43573368 0.96842477 0.7905138
  0.78644784 0.65063758]
 [0.82491619 0.84071325 0.45118883 0.52083531 0.61488637 0.61067191
  0.49898609 0.28876092]
 [0.22509524 0.28846881 0.46573631 0.47336345 0.92948157 0.77863736
  0.39102284 0.50983845]
 [0.52390029 0.09511125 0.37396754 0.90084271 0.68217899 0.64254994
  0.74261885 0.58969606]
 [0.16566242 0.07908791 0.75588982 0.57909687 0.73887746 0.10960639
  0.27713074 0.41753722]
 [0.07614926 0.10933826 0.94807557 0.29773441 0.01351866 0.39391314
  0.22296291 0.38958623]
 [0.01210252 0.1967451  0.80815367 0.17967125 0.54335179 0.8655415
  0.07346332 0.69454703]
 [0.5345165  0.16660525 0.40307645 0.52590277 0.86212669 0.03425073
  0.65680325 0.14333681]
 [0.58388521 0.58904123 0.47712613 0.27945041 0.06372401 0.75644174
  0.14670172 0.5711495 ]
 [0.43603091 0.2733335  0.10495188 0.67239001 0.57908462 0.15874856
  0.42475036 0.90589905]
 [0.8144674  0.9944558  0.68966555 0.09715326 0.24935761 0.95897933
  0.30676529 0.31195192]
 [0.64092077 0.27853017 0.76853506 0.37189139 0.57513649 0.7483777
  0.66384054 0.19440796]
 [0.87374265 0.19244049 0.52948927 0.91302972 0.45222102 0.08765404
  0.12991008 0.50151027]
 [0.17282369 0.57290435 0.48890098 0.94771301 0.12483773 0.10294902
  0.2430428  0.92204093]
 [0.47430317 0.49779933 0.68934569 0.7553649  0.82349556 0.96427937
  0.17411753 0.8015759 ]]
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173.000000 63.000000 34.000000 A+
1102 0.598986 0.335054 0.0258045 0.977101 0.666105 0.875789 0.105799 0.325875
1103 S_1 C_1 M street_2 186.000000 82.000000 87.200000 B+
1104 0.189045 0.728776 0.652343 0.361428 0.695675 0.397474 0.304697 0.371251
1105 0.00551586 0.881539 0.545663 0.971077 0.488502 0.109982 0.906828 0.413785

2. The mask function

mask is the opposite of where and otherwise identical: it fills the cells where the condition is True

df.mask(df['Gender']=='M').dropna().head()
df[df["Gender"]!='M'].head()
df.where(df['Gender']!='M').dropna().head()
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192.0 73.0 32.5 B+
1104 S_1 C_1 F street_2 167.0 81.0 80.4 B-
1105 S_1 C_1 F street_4 159.0 64.0 84.8 B+
1202 S_1 C_2 F street_4 176.0 94.0 63.5 B-
1204 S_1 C_2 F street_5 162.0 63.0 33.8 B
df.mask(df['Gender']=='M',np.random.rand(df.shape[0],df.shape[1])).head()
df.where(df['Gender']!='M',np.random.rand(df.shape[0],df.shape[1])).head()

School Class Gender Address Height Weight Math Physics
ID
1101 0.0524246 0.778497 0.906421 0.693452 0.196364 0.155836 0.594141 0.337383
1102 S_1 C_1 F street_2 192.000000 73.000000 32.500000 B+
1103 0.848831 0.444878 0.914591 0.36934 0.332417 0.605245 0.649709 0.613518
1104 S_1 C_1 F street_2 167.000000 81.000000 80.400000 B-
1105 S_1 C_1 F street_4 159.000000 64.000000 84.800000 B+
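
The relation between mask and where can be sketched on a single made-up column (toy is a hypothetical name, and -1 is an arbitrary fill value): masking a condition gives the same result as where applied to the negated condition:

```python
import pandas as pd

toy = pd.DataFrame({"Gender": ["M", "F", "M"], "Height": [173, 192, 186]})
cond = toy["Gender"] == "M"

masked = toy["Height"].mask(cond, -1)           # fill where cond is True
where_negated = toy["Height"].where(~cond, -1)  # fill where ~cond is False

print(masked.tolist())
print(masked.equals(where_negated))
```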

3. The query function

df.head()
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+

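In the boolean expression passed to query, the following are all valid: row and column index names, strings, and/not/or/&/|/~/not in/in/==/!=, and the four arithmetic operators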

df.query('(Address in ["street_6","street_7"])&(Weight>(70+10))&(ID in [1303,2304,2402])')
School Class Gender Address Height Weight Math Physics
ID
1303 S_1 C_3 M street_7 188 82 49.7 B
2304 S_2 C_3 F street_6 164 81 95.5 A-
2402 S_2 C_4 M street_7 166 82 48.7 B
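
For comparison, here is a minimal sketch on a made-up table (toy is a hypothetical name) that writes the same kind of filter once as a query string and once with ordinary boolean indexing:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Address": ["street_7", "street_6", "street_1"], "Weight": [82, 81, 90]},
    index=pd.Index([1303, 2304, 1101], name="ID"),
)

# Column names, the index name (ID), in, arithmetic and & inside one expression
via_query = toy.query(
    '(Address in ["street_6","street_7"]) & (Weight > 70 + 10) & (ID in [1303, 2304])'
)

# The equivalent boolean-indexing form
via_mask = toy[
    toy["Address"].isin(["street_6", "street_7"])
    & (toy["Weight"] > 80)
    & toy.index.isin([1303, 2304])
]
print(via_query.equals(via_mask))
```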

V. Handling duplicates

1. The duplicated method

This method returns a boolean Series indicating whether each row is a duplicate

df.duplicated('Class').head()
ID
1101    False
1102     True
1103     True
1104     True
1105     True
dtype: bool

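The optional keep parameter defaults to 'first', which marks the first occurrence as not duplicated; 'last' marks the last occurrence as not duplicated; False marks every duplicated item as True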

df.duplicated('Class',keep='last').tail()
ID
2401     True
2402     True
2403     True
2404     True
2405    False
dtype: bool
df.duplicated('Class',keep=False).tail()
ID
2401    True
2402    True
2403    True
2404    True
2405    True
dtype: bool
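
The three settings of keep can be compared on a made-up column (toy is a hypothetical name) in which C_1 occurs three times and C_2 once:

```python
import pandas as pd

toy = pd.DataFrame({"Class": ["C_1", "C_1", "C_2", "C_1"]})

first = toy.duplicated("Class")               # keep='first' is the default
last = toy.duplicated("Class", keep="last")   # the last occurrence is not flagged
every = toy.duplicated("Class", keep=False)   # every repeated value is flagged

print(first.tolist())
print(last.tolist())
print(every.tolist())
```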

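2. The drop_duplicates method

As the name suggests, it drops duplicates. This may be useful for the grouping operations in later chapters, for example when the first row of each group needs to be kept: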

df.drop_duplicates('Class')
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1201 S_1 C_2 M street_5 188 68 97.0 A-
1301 S_1 C_3 M street_4 161 68 31.5 B+
2401 S_2 C_4 F street_2 192 62 45.3 A

Its parameters are similar to those of duplicated:

df.drop_duplicates('Class',keep='last')
School Class Gender Address Height Weight Math Physics
ID
2105 S_2 C_1 M street_4 170 81 34.2 A
2205 S_2 C_2 F street_7 183 76 85.4 B
2305 S_2 C_3 M street_4 187 73 48.9 B
2405 S_2 C_4 F street_6 193 54 47.6 B

Passing several columns is equivalent to treating them jointly as a multi-level index and comparing duplicates on it:

df.drop_duplicates(['School','Class'])
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1201 S_1 C_2 M street_5 188 68 97.0 A-
1301 S_1 C_3 M street_4 161 68 31.5 B+
2101 S_2 C_1 M street_7 174 84 83.3 C
2201 S_2 C_2 M street_5 193 100 39.1 B
2301 S_2 C_3 F street_4 157 78 72.3 B+
2401 S_2 C_4 F street_2 192 62 45.3 A

VI. Sampling

The sampling function here is simply sample

(a) n is the sample size

df.sample(n=5)
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
1201 S_1 C_2 M street_5 188 68 97.0 A-
1101 S_1 C_1 M street_1 173 63 34.0 A+
2204 S_2 C_2 M street_1 175 74 47.2 B-
2405 S_2 C_4 F street_6 193 54 47.6 B

(b) frac is the sampling fraction

df.sample(frac=0.05)
School Class Gender Address Height Weight Math Physics
ID
1105 S_1 C_1 F street_4 159 64 84.8 B+
2402 S_2 C_4 M street_7 166 82 48.7 B

(c) replace specifies whether to sample with replacement

df.sample(n=df.shape[0],replace=True).head()
School Class Gender Address Height Weight Math Physics
ID
2402 S_2 C_4 M street_7 166 82 48.7 B
2101 S_2 C_1 M street_7 174 84 83.3 C
2302 S_2 C_3 M street_5 171 88 32.7 A
2401 S_2 C_4 F street_2 192 62 45.3 A
2405 S_2 C_4 F street_6 193 54 47.6 B
df.sample(n=35,replace=True).index.is_unique

False

(d) axis is the sampling dimension; the default is 0, i.e. rows are sampled

df.sample(n=3,axis=1).head()
Math Weight School
ID
1101 34.0 63 S_1
1102 32.5 73 S_1
1103 87.2 82 S_1
1104 80.4 81 S_1
1105 84.8 64 S_1

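(e) weights gives the sample weights, which are normalized automatically

For example, with five samples whose weights are 1, 2, 3, 4, 5,
the probabilities of being drawn are 1/15, 2/15, 3/15, and so on.
The weights parameter of sample is a convenience: you pass the absolute quantities 1, 2, 3, 4, 5 and pandas normalizes them into probabilities before sampling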

df.sample(n=3,weights=np.random.rand(df.shape[0])).head()
df.sample(n=3,weights=list(range(df.shape[0]))).head()
# list(range(5))
# np.random.rand(df.shape[0])
array([0.42302361, 0.48301209, 0.07106683, 0.20566321, 0.04810366,
       0.24624529, 0.68941873, 0.3931274 , 0.86344223, 0.5209619 ,
       0.70827892, 0.60694938, 0.55769355, 0.8964586 , 0.58146323,
       0.32480769, 0.40078874, 0.24710269, 0.13704178, 0.23662958,
       0.289439  , 0.08274463, 0.1015726 , 0.3779845 , 0.76932914,
       0.69591814, 0.08994818, 0.87528269, 0.24571314, 0.0634573 ,
       0.81463873, 0.67125488, 0.65952737, 0.34445748, 0.72742361])
# Use a column as the weights, which is common in sampling theory
# The probability of being drawn is proportional to the Math value
df.sample(n=3,weights=df['Math']).head()
School Class Gender Address Height Weight Math Physics
ID
1305 S_1 C_3 F street_5 187 69 61.7 B-
2103 S_2 C_1 M street_4 157 61 52.5 B-
2105 S_2 C_1 M street_4 170 81 34.2 A
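
The normalization can be written out by hand. This is a minimal sketch on made-up data (toy, probs and picked are hypothetical names); it also uses zero weights, an assumption added here to make the effect visible, since rows with weight 0 cannot be drawn:

```python
import pandas as pd

toy = pd.DataFrame({"value": list("abcde")})

# Absolute weights 1..5 correspond to probabilities 1/15, 2/15, ..., 5/15
weights = [1, 2, 3, 4, 5]
probs = [w / sum(weights) for w in weights]
print(probs)

# Only rows 2 and 3 have non-zero weight, so they are the only ones that can be drawn
picked = toy.sample(n=2, weights=[0, 0, 1, 1, 0], random_state=0)
print(sorted(picked.index.tolist()))
```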

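VII. Questions and exercises

1. Questions

[Question 1] How can the order of columns or rows be changed? How can odd and even rows (columns) be swapped?

[Question 2] To select a subset of a DataFrame, give as many ways of doing so as you can.

[Question 3] Is query slower than the other indexing methods? Which kind of indexing is most efficient in which situation?

[Question 4] Can a Slice object be used with a single-level index? If so, how? Give an example.

[Question 5] How can the index labels of the missing values in a column be found quickly?

[Question 6] Which situations is each of the index-setting methods suited to? How can the index of a DataFrame be replaced directly with an arbitrary given index of the same length?

[Question 7] In which situations are multi-level indexes appropriate?

[Question 8] When is duplicate handling needed?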

2. Exercises

[Exercise 1] Given a dataset about UFO sightings, answer the following questions:

pd.read_csv('data/UFO.csv').head()
datetime shape duration (seconds) latitude longitude
0 10/10/1949 20:30 cylinder 2700.0 29.883056 -97.941111
1 10/10/1949 21:00 light 7200.0 29.384210 -98.581082
2 10/10/1955 17:00 circle 20.0 53.200000 -2.916667
3 10/10/1956 21:00 circle 20.0 28.978333 -96.645833
4 10/10/1960 20:00 light 900.0 21.418056 -157.803611

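(a) Among all sightings observed for more than 60 s, which shape appears most often?

(b) Partition longitude and latitude: split -180° to 180° into intervals of 30°, and -90° to 90° into intervals of 18°. Which region has the most reported UFO sightings?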

(a)

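df=pd.read_csv('data/UFO.csv')
# Rename the column so that it can be referenced by a plain name inside query
df.rename(columns={'duration (seconds)':'duration'},inplace=True)
# Assign the result back: astype returns a new Series and does not modify df in place
df['duration']=df['duration'].astype('float')
df.query('duration>60')['shape'].value_counts().nlargest(1)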
light    10713
Name: shape, dtype: int64

(b)

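# Note: the exercise asks for 30° longitude intervals, but the step used here is 13, and the outputs below reflect that;
# range(-180,181,30) and range(-90,91,18) would produce the intervals described in the exercise, including the upper endpoints
bins_long=list(range(-180,180,13))
bins_la=list(range(-90,90,18))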
cuts_long=pd.cut(df['longitude'],bins=bins_long)
df['cuts_long']=cuts_long
cuts_la=pd.cut(df['latitude'],bins=bins_la)
df['cuts_la']=cuts_la
df.head()

datetime shape duration latitude longitude cuts_long cuts_la
0 10/10/1949 20:30 cylinder 2700.0 29.883056 -97.941111 (-102, -89] (18, 36]
1 10/10/1949 21:00 light 7200.0 29.384210 -98.581082 (-102, -89] (18, 36]
2 10/10/1955 17:00 circle 20.0 53.200000 -2.916667 (-11, 2] (36, 54]
3 10/10/1956 21:00 circle 20.0 28.978333 -96.645833 (-102, -89] (18, 36]
4 10/10/1960 20:00 light 900.0 21.418056 -157.803611 (-167, -154] (18, 36]
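# df[['cuts_long','cuts_la']].value_counts()  # wrong: this failed for the author (note that DataFrame.value_counts only exists in pandas 1.1 and later)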
pd.Series(list(zip(df['cuts_long'],df['cuts_la']))).value_counts().head()
((-89.0, -76.0], (36.0, 54.0])      16685
((-128.0, -115.0], (36.0, 54.0])    12109
((-76.0, -63.0], (36.0, 54.0])      10188
((-89.0, -76.0], (18.0, 36.0])       9499
((-102.0, -89.0], (36.0, 54.0])      6663
dtype: int64
df.set_index(['cuts_long','cuts_la']).index.value_counts().head()
((-89.0, -76.0], (36.0, 54.0])      16685
((-128.0, -115.0], (36.0, 54.0])    12109
((-76.0, -63.0], (36.0, 54.0])      10188
((-89.0, -76.0], (18.0, 36.0])       9499
((-102.0, -89.0], (36.0, 54.0])      6663
dtype: int64
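
The exercise describes 30° longitude and 18° latitude intervals. Below is a minimal sketch of bins with exactly those widths, applied to a few made-up rows rather than the real file (toy, lon_bins, lat_bins and counts are hypothetical names). It counts with groupby, which belongs to a later chapter, as one alternative to the two approaches above:

```python
import numpy as np
import pandas as pd

# A few hypothetical sightings; the real exercise reads data/UFO.csv
toy = pd.DataFrame({
    "latitude": [29.88, 29.38, 53.20, 28.98, 21.42],
    "longitude": [-97.94, -98.58, -2.92, -96.65, -157.80],
})

# 181 and 91 as stop values so that the upper endpoints 180 and 90 are included
lon_bins = np.arange(-180, 181, 30)  # 12 intervals of 30 degrees
lat_bins = np.arange(-90, 91, 18)    # 10 intervals of 18 degrees

toy["cuts_long"] = pd.cut(toy["longitude"], bins=lon_bins)
toy["cuts_la"] = pd.cut(toy["latitude"], bins=lat_bins)

# Count rows per (longitude interval, latitude interval) pair
counts = toy.groupby(["cuts_long", "cuts_la"], observed=True).size()
top = counts.idxmax()
print(top, counts.max())
```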

[Exercise 2] Given a dataset about Pokemon, answer the following questions:

pd.read_csv('data/Pokemon.csv').head()
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False

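(a) What proportion of all Pokemon have two types?

(b) Among all Pokemon whose base stat total (Total) is at least 580, what proportion are non-legendary (Legendary=False)?

(c) Among Pokemon whose first type is Fighting, which three have the highest Attack?

(d) Which type (considering only the first type) has the largest mean range across the six base stats (HP, Attack, Sp. Atk, Defense, Sp. Def, Speed), where the mean is taken over each type?

(e) Which type (considering only the first type) has the highest proportion of legendaries? Do the legendaries of that type also have the highest base stat total?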

(a)

df=pd.read_csv('data/Pokemon.csv')
df['Type 2'].count()/df.shape[0]
0.5175

(b)

n1=df.query('(Total>=580)&(Legendary==False)')['Legendary'].count()#/df.shape[0]
n2=df.query('(Total>=580)&(Legendary==True)')['Legendary'].count()
n1/(n1+n2)
0.4247787610619469
df.query('Total>=580')['Legendary'].value_counts(normalize=True)
True     0.575221
False    0.424779
Name: Legendary, dtype: float64

(c)

df[df['Type 1']=='Fighting'].sort_values(by='Attack',ascending=False).iloc[:3]
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
498 448 LucarioMega Lucario Fighting Steel 625 70 145 88 140 70 112 4 False
594 534 Conkeldurr Fighting NaN 505 105 140 95 55 65 45 5 False
74 68 Machamp Fighting NaN 505 90 130 80 65 85 55 1 False

(d)

df['range']=df.iloc[:,5:11].max(axis=1)-df.iloc[:,5:11].min(axis=1)
# df.iloc[:,5:11].max(axis=1)
attribute=df[['Type 1','range']].set_index('Type 1')
max_range=0
result=''
for i in attribute.index.unique():
    temp=attribute.loc[i].mean()
    if temp.iloc[0]>max_range:  # positional access via .iloc; temp is a Series labelled 'range'
        max_range=temp.iloc[0]
        result=i
# attribute.loc['Grass'].mean()
result

'Steel'
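
The loop above can also be expressed with groupby, which is covered in later chapters. This is a sketch on made-up rows with only three stat columns (toy, stats and mean_range are hypothetical names), not the author's solution:

```python
import pandas as pd

# Hypothetical rows; the real data has six stat columns
toy = pd.DataFrame({
    "Type 1": ["Grass", "Grass", "Steel", "Steel"],
    "HP": [45, 60, 60, 65],
    "Attack": [49, 62, 80, 120],
    "Speed": [65, 80, 140, 100],
})
stats = ["HP", "Attack", "Speed"]

# Range (max minus min) across the stat columns, computed row by row
toy["range"] = toy[stats].max(axis=1) - toy[stats].min(axis=1)

# Mean range per type, then the type with the largest mean
mean_range = toy.groupby("Type 1")["range"].mean()
print(mean_range.idxmax())
```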

(e)

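# Note: this returns the type with the largest number of legendaries, which need not be the type with the highest proportion
df.query('Legendary==True')['Type 1'].value_counts().index[0]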
'Psychic'
attribute=df.query('Legendary==True')[['Type 1','Total']].set_index('Type 1')
max_range=0
result=''
for i in attribute.index.unique()[:-1]:
    temp=attribute.loc[i].mean()
    if temp.iloc[0]>max_range:  # positional access via .iloc; temp is a Series labelled 'Total'
        max_range=temp.iloc[0]
        result=i
attribute.loc['Grass'].mean()
result
# attribute.index.unique()
# attribute.tail()
# attribute.index
# attribute.loc['Fairy',:]
'Normal'
attribute=df.query('Legendary==True')[['Type 1','Total']].set_index('Type 1')
max_range=0
result=''
for i in attribute.index.unique():
    temp=float(attribute.loc[i].mean())
    if temp>max_range:
        max_range=temp
        result=i
result
'Normal'
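
Exercise (e) asks about the proportion of legendaries per type, which can differ from the raw count. Here is a minimal sketch on made-up rows (toy, counts and ratio are hypothetical names, and the values are not taken from the Pokemon file); it relies on the fact that the mean of a boolean column is the share of True values:

```python
import pandas as pd

# Hypothetical data: Psychic has more legendaries, Flying has a higher share
toy = pd.DataFrame({
    "Type 1": ["Psychic"] * 5 + ["Flying"] * 2,
    "Legendary": [True, True, False, False, False, True, False],
})

# Number of legendaries per type
counts = toy.query("Legendary == True")["Type 1"].value_counts()

# Share of legendaries per type
ratio = toy.groupby("Type 1")["Legendary"].mean()

print(counts.idxmax())  # type with the most legendaries
print(ratio.idxmax())   # type with the highest legendary proportion
```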

Reposted from blog.csdn.net/weixin_45569785/article/details/105711019