Pandas study notes (2) - Pandas index

Preface


For more article code details, please check the blogger’s personal website: https://www.iwtmbtly.com/


Import the required libraries and files:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv('data/table.csv',index_col='ID')
>>> df.head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+

1. Single-level index

(1) loc method, iloc method, [] operator

The three most commonly used indexing tools are loc, iloc, and the [] operator: iloc indexes by position, loc indexes by label, and [] is a convenient shorthand. Each has its own characteristics.

1. loc method

(a) Single-row indexing:

>>> df.loc[1103]
School          S_1
Class           C_1
Gender            M
Address    street_2
Height          186
Weight           82
Math           87.2
Physics          B+
Name: 1103, dtype: object

(b) Multi-row index:

>>> df.loc[[1102, 2304]]
     School Class Gender   Address  Height  Weight  Math Physics
ID
1102    S_1   C_1      F  street_2     192      73  32.5      B+
2304    S_2   C_3      F  street_6     164      81  95.5      A-

Note: every slice used in loc includes the right endpoint. With label-based slicing you usually know the last label you want, but not which label comes after it, so a half-open slice would force you to look up the next label first. For that reason, loc slices in Pandas are closed on both ends.

>>> df.loc[1304:2103].head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1304    S_1   C_3      M  street_2     195      70  85.2       A
1305    S_1   C_3      F  street_5     187      69  61.7      B-
2101    S_2   C_1      M  street_7     174      84  83.3       C
2102    S_2   C_1      F  street_6     161      61  50.6      B+
2103    S_2   C_1      M  street_4     157      61  52.5      B-
>>> df.loc[2402::-1].head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
2402    S_2   C_4      M  street_7     166      82  48.7       B
2401    S_2   C_4      F  street_2     192      62  45.3       A
2305    S_2   C_3      M  street_4     187      73  48.9       B
2304    S_2   C_3      F  street_6     164      81  95.5      A-
2303    S_2   C_3      F  street_7     190      99  65.9       C
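The inclusive right endpoint is easiest to see next to a position slice. Below is a minimal sketch on a hypothetical five-row table (df_demo is made up for illustration and is not the notes' data/table.csv):

```python
import pandas as pd

# Hypothetical mini table; the IDs only mimic the style of the notes' data.
df_demo = pd.DataFrame(
    {"Height": [173, 192, 186, 167, 159]},
    index=pd.Index([1101, 1102, 1103, 1104, 1105], name="ID"),
)

by_label = df_demo.loc[1102:1104]   # label slice: both endpoints are included
by_position = df_demo.iloc[1:3]     # position slice: the right endpoint is excluded

print(by_label.index.tolist())      # [1102, 1103, 1104]
print(by_position.index.tolist())   # [1102, 1103]
```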

(c) Single-column index

>>> df.loc[:,'Height'].head()
ID
1101    173
1102    192
1103    186
1104    167
1105    159
Name: Height, dtype: int64

(d) Multi-column index

>>> df.loc[:,['Height', 'Math']].head()
      Height  Math
ID
1101     173  34.0
1102     192  32.5
1103     186  87.2
1104     167  80.4
1105     159  84.8
>>> df.loc[:,'Height':'Math'].head()
      Height  Weight  Math
ID
1101     173      63  34.0
1102     192      73  32.5
1103     186      82  87.2
1104     167      81  80.4
1105     159      64  84.8

(e) Joint index

>>> df.loc[1102:2401:3,'Height':'Math'].head()
      Height  Weight  Math
ID
1102     192      73  32.5
1105     159      64  84.8
1203     160      53  58.8
1301     161      68  31.5
1304     195      70  85.2

(f) Functional indexing

>>> df.loc[lambda x:x['Gender']=='M'].head() # the function passed to loc receives the preceding df as its argument
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1201    S_1   C_2      M  street_5     188      68  97.0      A-
1203    S_1   C_2      M  street_6     160      53  58.8      A+
1301    S_1   C_3      M  street_4     161      68  31.5      B+
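# This example shows that loc accepts a function whose input is the whole table
# and whose output is a scalar, a slice, a valid list (elements present in the index), or a valid index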
>>> def f(x):
...     return [1101, 1105]
>>> df.loc[f]
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1105    S_1   C_1      F  street_4     159      64  84.8      B+

(g) Boolean indexing

>>> df.loc[df['Address'].isin(['street_7', 'street_4'])].head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1105    S_1   C_1      F  street_4     159      64  84.8      B+
1202    S_1   C_2      F  street_4     176      94  63.5      B-
1301    S_1   C_3      M  street_4     161      68  31.5      B+
1303    S_1   C_3      M  street_7     188      82  49.7       B
2101    S_2   C_1      M  street_7     174      84  83.3       C
>>> df.loc[[True if i[-1]=='4' or i[-1]=='7' else False for i in df['Address'].values]].head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1105    S_1   C_1      F  street_4     159      64  84.8      B+
1202    S_1   C_2      F  street_4     176      94  63.5      B-
1301    S_1   C_3      M  street_4     161      68  31.5      B+
1303    S_1   C_3      M  street_7     188      82  49.7       B
2101    S_2   C_1      M  street_7     174      84  83.3       C

Summary: essentially, loc accepts only Boolean lists and subsets of the index. Once you grasp this principle, the operations above are easy to understand.

2. iloc method (note that unlike loc, the right endpoint of the slice is not included)

(a) Single row index

>>> df.iloc[3]
School          S_1
Class           C_1
Gender            F
Address    street_2
Height          167
Weight           81
Math           80.4
Physics          B-
Name: 1104, dtype: object

(b) Multi-row index

>>> df.iloc[3:5]
     School Class Gender   Address  Height  Weight  Math Physics
ID
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+

(c) Single-column index

>>> df.iloc[:,3].head()
ID
1101    street_1
1102    street_2
1103    street_2
1104    street_2
1105    street_4
Name: Address, dtype: object

(d) Multi-column index

>>> df.iloc[:,7::-2].head()
     Physics  Weight   Address Class
ID
1101      A+      63  street_1   C_1
1102      B+      73  street_2   C_1
1103      B+      82  street_2   C_1
1104      B-      81  street_2   C_1
1105      B+      64  street_4   C_1

(e) Hybrid index

>>> df.iloc[3::4,7::-2].head()
     Physics  Weight   Address Class
ID
1104      B-      81  street_2   C_1
1203      A+      53  street_6   C_2
1302      A-      57  street_1   C_3
2101       C      84  street_7   C_1
2105       A      81  street_4   C_1

(f) Functional indexing

>>> df.iloc[lambda x:[3]].head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1104    S_1   C_1      F  street_2     167      81  80.4      B-

Summary: iloc accepts only integers, integer lists, integer slices, or Boolean lists; a Boolean Series cannot be used directly. To use one, extract its values first, as follows:

# df.iloc[df['School']=='S_1'].head() # raises an error
>>> df.iloc[(df['School']=='S_1').values].head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+
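Here is a small sketch of that restriction, again on a hypothetical table (df_demo is an assumption for illustration). It uses to_numpy(), which plays the same role as .values above, and it catches the error types that iloc is expected to raise for a Boolean Series, so the snippet runs either way:

```python
import pandas as pd

# Hypothetical mini table for illustration only.
df_demo = pd.DataFrame(
    {"School": ["S_1", "S_1", "S_2"], "Math": [34.0, 32.5, 83.3]},
    index=pd.Index([1101, 1102, 2101], name="ID"),
)
mask = df_demo["School"] == "S_1"   # Boolean Series indexed by ID

try:
    df_demo.iloc[mask]              # a Boolean Series is normally rejected by iloc
    series_mask_accepted = True
except (NotImplementedError, ValueError):
    series_mask_accepted = False

picked = df_demo.iloc[mask.to_numpy()]   # a plain Boolean array works with iloc
same_via_loc = df_demo.loc[mask]         # loc accepts the Series itself

print(series_mask_accepted)
print(picked.equals(same_via_loc))
```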

3. [] operator

3.1 [] operation of Series

(a) Single-element indexing:

>>> s = pd.Series(df['Math'], index=df.index)
>>> s[1101] # uses the index label
34.0

(b) Multi-row index

>>> s[0:4]	# integer slice by absolute position, independent of the labels; easy to confuse
ID
1101    34.0
1102    32.5
1103    87.2
1104    80.4
Name: Math, dtype: float64

(c) Functional indexing

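# Note: with a lambda, returning a bare slice (e.g. s[lambda x: 16::-6]) raises an error; the lookup here is by label rather than by absolute position, which is very error-prone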
>>> s[lambda x: x.index[16::-6]]
ID
2102    50.6
1301    31.5
1105    84.8
Name: Math, dtype: float64

(d) Boolean index

>>> s[s > 80]
ID
1103    87.2
1104    80.4
1105    84.8
1201    97.0
1302    87.7
1304    85.2
2101    83.3
2205    85.4
2304    95.5
Name: Math, dtype: float64

Note: to stay out of trouble, avoid the [] operator when the row index holds floating-point values. A slice passed to [] on a Series with a float index is compared against the label values, not against positions, which is a special case.

>>> s_int = pd.Series([1, 2, 3, 4], index=[1, 3, 5, 6])
>>> s_float = pd.Series([1,2,3,4],index=[1.,3.,5.,6.])
>>> s_int
1    1
3    2
5    3
6    4
dtype: int64
>>> s_int[2:]
5    3
6    4
dtype: int64
>>> s_float
1.0    1
3.0    2
5.0    3
6.0    4
dtype: int64
# Note that the result differs from s_int[2:], because 2 is treated as a label value here, not a position
>>> s_float[2:]
3.0    2
5.0    3
6.0    4
dtype: int64
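One way to sidestep this ambiguity is to state the intent explicitly with loc (labels) or iloc (positions). A minimal sketch reusing the same s_float values:

```python
import pandas as pd

s_float = pd.Series([1, 2, 3, 4], index=[1.0, 3.0, 5.0, 6.0])

by_label = s_float.loc[2:]      # labels >= 2, i.e. 3.0, 5.0, 6.0
by_position = s_float.iloc[2:]  # positions 2 onward, i.e. 5.0, 6.0

print(by_label.tolist())        # [2, 3, 4]
print(by_position.tolist())     # [3, 4]
```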

3.2 [] operation of DataFrame

(a) Single-row indexing:

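# It is very easy to write df['label'] here, which raises an error
# As with Series, this uses absolute-position slicing
# To fetch a specific row by its label, use get_loc as follows: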
>>> df[1:2]
     School Class Gender   Address  Height  Weight  Math Physics
ID
1102    S_1   C_1      F  street_2     192      73  32.5      B+
>>> row = df.index.get_loc(1102)
>>> df[row:row+1]
     School Class Gender   Address  Height  Weight  Math Physics
ID
1102    S_1   C_1      F  street_2     192      73  32.5      B+

(b) Multi-row index

# Uses a slice; to select specific rows by label, loc is recommended, otherwise errors are likely
>>> df[3:5]
     School Class Gender   Address  Height  Weight  Math Physics
ID
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+

(c) Single-column index

>>> df['School'].head()
ID
1101    S_1
1102    S_1
1103    S_1
1104    S_1
1105    S_1
Name: School, dtype: object

(d) Multi-column index

>>> df[['School', 'Math']].head()
     School  Math
ID
1101    S_1  34.0
1102    S_1  32.5
1103    S_1  87.2
1104    S_1  80.4
1105    S_1  84.8

(e) Functional indexing

>>> df[lambda x: ['Math', 'Physics']].head()
      Math Physics
ID
1101  34.0      A+
1102  32.5      B+
1103  87.2      B+
1104  80.4      B-
1105  84.8      B+

(f) Boolean index

>>> df[df['Gender']=='F'].head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+
1202    S_1   C_2      F  street_4     176      94  63.5      B-
1204    S_1   C_2      F  street_5     162      63  33.8       B

Summary: in general, use the [] operator for column selection or Boolean selection, and avoid it for row selection.

(2) Boolean index

(a) Boolean operators: '&' (and), '|' (or), '~' (not)

>>> df[(df['Gender']=='F')&(df['Address']=='street_2')].head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
2401    S_2   C_4      F  street_2     192      62  45.3       A
2404    S_2   C_4      F  street_2     160      84  67.7       B
>>> df[(df['Math']>85) | (df['Address']=='street_2')].head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1201    S_1   C_2      M  street_5     188      68  97.0      A-
1302    S_1   C_3      F  street_1     175      57  87.7      A-
>>> df[~((df['Math']>75)|(df['Address']=='street_1'))].head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1202    S_1   C_2      F  street_4     176      94  63.5      B-
1203    S_1   C_2      M  street_6     160      53  58.8      A+
1204    S_1   C_2      F  street_5     162      63  33.8       B
1205    S_1   C_2      F  street_6     167      63  68.4      B-

The corresponding positions in loc and [] can be selected using a Boolean list:

# Exercise: why does df.loc[df['Math']>60,(df[:8]['Address']=='street_6').values].head() give the same result as below?
# Can .values be removed?
>>> df.loc[df['Math']>60, df.columns=='Physics'].head()
     Physics
ID
1103      B+
1104      B-
1105      B+
1201      A-
1202      B-

(b) isin method

>>> df[df['Address'].isin(['street_1', 'street_4']) & df['Physics'].isin(['A', 'A+'])]
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
2105    S_2   C_1      M  street_4     170      81  34.2       A
2203    S_2   C_2      M  street_4     155      91  73.8      A+
# The above can also be written with a dict:
# all works like &; axis=1 means checking across the columns whether every value in a row is True
>>> df[df[['Address','Physics']].isin({'Address':['street_1','street_4'],'Physics':['A','A+']}).all(axis=1)]
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
2105    S_2   C_1      M  street_4     170      81  34.2       A
2203    S_2   C_2      M  street_4     155      91  73.8      A+
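The equivalence between the chained isin calls and the dict form can be checked on a small hypothetical table (df_demo below is an assumption for illustration):

```python
import pandas as pd

# Hypothetical data for illustration only.
df_demo = pd.DataFrame({
    "Address": ["street_1", "street_4", "street_2", "street_4"],
    "Physics": ["A+", "B", "A", "A"],
})

# Two Series.isin masks combined with &.
chained = (df_demo["Address"].isin(["street_1", "street_4"])
           & df_demo["Physics"].isin(["A", "A+"]))

# DataFrame.isin with a dict checks each column against its own list,
# then all(axis=1) requires every column in the row to match.
per_column = df_demo[["Address", "Physics"]].isin(
    {"Address": ["street_1", "street_4"], "Physics": ["A", "A+"]}
)
combined = per_column.all(axis=1)

print(chained.tolist())    # [True, False, False, True]
print(combined.tolist())   # [True, False, False, True]
```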

(3) Fast scalar index

When only one element needs to be fetched, the at and iat methods can provide a faster implementation:

>>> print(df.at[1101, 'School'])
S_1
>>> print(df.loc[1101, 'School'])
S_1
>>> print(df.iat[0, 0])
S_1
>>> print(df.iloc[0, 0])
S_1
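As a quick sketch on a hypothetical two-row table (df_demo is made up for illustration), at takes a row label and a column label, iat takes two integer positions, and both can also be used for assignment:

```python
import pandas as pd

# Hypothetical data for illustration only.
df_demo = pd.DataFrame(
    {"School": ["S_1", "S_2"], "Math": [34.0, 83.3]},
    index=pd.Index([1101, 2101], name="ID"),
)

by_label = df_demo.at[1101, "School"]   # label-based scalar access
by_position = df_demo.iat[0, 0]         # position-based scalar access

df_demo.at[2101, "Math"] = 90.0         # at also supports setting one cell

print(by_label, by_position)            # S_1 S_1
print(df_demo.loc[2101, "Math"])        # 90.0
```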

(4) Interval index

Interval indexes are not limited to single-level indexes; they are introduced here first as a special kind of index.

(a) Using the interval_range method

# closed accepts 'left', 'right', 'both', or 'neither'; the default is open on the left and closed on the right
>>> pd.interval_range(start=0, end=5)
IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')

(b) Use cut to convert a numeric column into a categorical variable whose elements are intervals, for example to bin the Math scores:

# Note: without a type conversion, the result has category dtype, not interval dtype
>>> math_interval = pd.cut(df['Math'], bins=[0, 40, 60, 80, 100])
>>> math_interval.head()
ID
1101      (0, 40]
1102      (0, 40]
1103    (80, 100]
1104    (80, 100]
1105    (80, 100]
Name: Math, dtype: category
Categories (4, interval[int64, right]): [(0, 40] < (40, 60] < (60, 80] < (80, 100]]
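To see how the choice of closed side changes membership, here is a small sketch. The scores Series is hypothetical, and right=False is the cut parameter that switches the bins to left-closed:

```python
import pandas as pd

left_closed = pd.interval_range(start=0, end=5, closed="left")
first = left_closed[0]               # the interval [0, 1)
print(0 in first, 1 in first)        # True False

# Hypothetical scores; 60.0 sits exactly on a bin edge.
scores = pd.Series([34.0, 60.0, 87.2])
bins_right = pd.cut(scores, bins=[0, 40, 60, 80, 100])              # (40, 60] holds 60.0
bins_left = pd.cut(scores, bins=[0, 40, 60, 80, 100], right=False)  # [60, 80) holds 60.0

print(bins_right.iloc[1], bins_left.iloc[1])
print(bins_right.dtype)              # category, as noted above
```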

(c) Selection of interval index

>>> df_i = df.join(math_interval, rsuffix='_interval')[['Math', 'Math_interval']].reset_index().set_index('Math_interval')
>>> df_i.head()
                 ID  Math
Math_interval
(0, 40]        1101  34.0
(0, 40]        1102  32.5
(80, 100]      1103  87.2
(80, 100]      1104  80.4
(80, 100]      1105  84.8
>>> df_i.loc[65].head()    # rows whose interval contains the value are selected
                 ID  Math
Math_interval
(60, 80]       1202  63.5
(60, 80]       1205  68.4
(60, 80]       1305  61.7
(60, 80]       2104  72.2
(60, 80]       2202  68.5
>>> df_i.loc[[65, 90]].head()
                 ID  Math
Math_interval
(60, 80]       1202  63.5
(60, 80]       1205  68.4
(60, 80]       1305  61.7
(60, 80]       2104  72.2
(60, 80]       2202  68.5

To select by an interval, first convert the categorical index to an interval index, and then use the overlaps method:

# df_i.loc[pd.Interval(70,75)].head() raises an error
>>> df_i[df_i.index.astype('interval').overlaps(pd.Interval(70, 85))].head()
                 ID  Math
Math_interval
(80, 100]      1103  87.2
(80, 100]      1104  80.4
(80, 100]      1105  84.8
(80, 100]      1201  97.0
(60, 80]       1202  63.5
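The two kinds of test, containing a value and overlapping an interval, can be sketched directly on an IntervalIndex. The breaks below match the bins used above; the index is built with from_breaks purely for illustration:

```python
import pandas as pd

# (0, 40], (40, 60], (60, 80], (80, 100]
intervals = pd.IntervalIndex.from_breaks([0, 40, 60, 80, 100])

contains_65 = intervals.contains(65)                 # which intervals hold the value 65
hits = intervals.overlaps(pd.Interval(70, 85))       # which intervals share points with (70, 85]

print(contains_65.tolist())   # [False, False, True, False]
print(hits.tolist())          # [False, False, True, True]
```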

2. Multi-level index

(1) Create a multi-level index

1. By from_tuples or from_arrays

(a) Create tuples directly

>>> tuples = [('A','a'),('A','b'),('B','a'),('B','b')]
>>> mul_index = pd.MultiIndex.from_tuples(tuples, names=('Upper', 'Lower'))
>>> mul_index
MultiIndex([('A', 'a'),
            ('A', 'b'),
            ('B', 'a'),
            ('B', 'b')],
           names=['Upper', 'Lower'])
>>> pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)
               Score
Upper Lower
A     a      perfect
      b         good
B     a         fair
      b          bad

(b) Use zip to create tuples

>>> L1 = list('AABB')
>>> L2 = list('abab')
>>> tuples = list(zip(L1,L2))
>>> mul_index = pd.MultiIndex.from_tuples(tuples, names=('Upper', 'Lower'))
>>> pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)
               Score
Upper Lower
A     a      perfect
      b         good
B     a         fair
      b          bad

(c) Create from a list of lists

>>> arrays = [['A','a'],['A','b'],['B','a'],['B','b']]
>>> mul_index = pd.MultiIndex.from_tuples(arrays, names=('Upper', 'Lower'))
>>> pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)
               Score
Upper Lower
A     a      perfect
      b         good
B     a         fair
      b          bad
# This shows that the lists are converted to tuples internally
>>> mul_index
MultiIndex([('A', 'a'),
            ('A', 'b'),
            ('B', 'a'),
            ('B', 'b')],
           names=['Upper', 'Lower'])
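The heading of this part also mentions from_arrays, which none of the examples above actually call. Unlike from_tuples, it expects one list per level rather than one pair per row. A minimal sketch, with the same labels as above:

```python
import pandas as pd

# One list per level: all first-level labels, then all second-level labels.
arrays = [["A", "A", "B", "B"], ["a", "b", "a", "b"]]

from_arrays = pd.MultiIndex.from_arrays(arrays, names=("Upper", "Lower"))
from_tuples = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=("Upper", "Lower"))

print(from_arrays.equals(from_tuples))   # True
```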

2. By from_product

>>> L1 = ['A','B']
>>> L2 = ['a','b']
>>> pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))	# Cartesian product of the two lists
MultiIndex([('A', 'a'),
            ('A', 'b'),
            ('B', 'a'),
            ('B', 'b')],
           names=['Upper', 'Lower'])

3. Create from columns of df (set_index method)

>>> df_using_mul = df.set_index(['Class','Address'])
>>> df_using_mul.head()
               School Gender  Height  Weight  Math Physics
Class Address
C_1   street_1    S_1      M     173      63  34.0      A+
      street_2    S_1      F     192      73  32.5      B+
      street_2    S_1      M     186      82  87.2      B+
      street_2    S_1      F     167      81  80.4      B-
      street_4    S_1      F     159      64  84.8      B+

(2) Multi-layer index slicing

>>> df_using_mul.head()
               School Gender  Height  Weight  Math Physics
Class Address
C_1   street_1    S_1      M     173      63  34.0      A+
      street_2    S_1      F     192      73  32.5      B+
      street_2    S_1      M     186      82  87.2      B+
      street_2    S_1      F     167      81  80.4      B-
      street_4    S_1      F     159      64  84.8      B+

1. General slice

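# df_using_mul.loc['C_2','street_5']
# When the index is not sorted, selecting a single key raises a performance warning
# df_using_mul.index.is_lexsorted()
# This method checks whether the index is sorted (it is deprecated in newer pandas versions, where index.is_monotonic_increasing can be used instead)
# df_using_mul.sort_index().index.is_lexsorted()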
>>> df_using_mul.sort_index().loc['C_2','street_5']
               School Gender  Height  Weight  Math Physics
Class Address
C_2   street_5    S_1      M     188      68  97.0      A-
      street_5    S_1      F     162      63  33.8       B
      street_5    S_2      M     193     100  39.1       B
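# df_using_mul.loc[('C_2','street_5'):] raises an error
# Multi-level slicing cannot be used when the index is not sorted
# Note that because loc is used here, the right endpoint is still included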
>>> df_using_mul.sort_index().loc[('C_2','street_6'):('C_3','street_4')]
               School Gender  Height  Weight  Math Physics
Class Address
C_2   street_6    S_1      M     160      53  58.8      A+
      street_6    S_1      F     167      63  68.4      B-
      street_7    S_2      F     194      77  68.5      B+
      street_7    S_2      F     183      76  85.4       B
C_3   street_1    S_1      F     175      57  87.7      A-
      street_2    S_1      M     195      70  85.2       A
      street_4    S_1      M     161      68  31.5      B+
      street_4    S_2      F     157      78  72.3      B+
      street_4    S_2      M     187      73  48.9       B
# A non-tuple endpoint is also valid and selects all elements under that level
>>> df_using_mul.sort_index().loc[('C_2','street_7'):'C_3'].head()
               School Gender  Height  Weight  Math Physics
Class Address
C_2   street_7    S_2      F     194      77  68.5      B+
      street_7    S_2      F     183      76  85.4       B
C_3   street_1    S_1      F     175      57  87.7      A-
      street_2    S_1      M     195      70  85.2       A
      street_4    S_1      M     161      68  31.5      B+

2. The first type of special case: a list made of tuples

# Selects specific elements, identified down to the innermost index level
>>> df_using_mul.sort_index().loc[[('C_2','street_7'),('C_3','street_2')]]
               School Gender  Height  Weight  Math Physics
Class Address
C_2   street_7    S_2      F     194      77  68.5      B+
      street_7    S_2      F     183      76  85.4       B
C_3   street_2    S_1      M     195      70  85.2       A

3. The second special case: a tuple formed from a list

# Selects rows whose first level is in 'C_2' or 'C_3' and whose second level is in 'street_4' or 'street_7'
>>> df_using_mul.sort_index().loc[(['C_2','C_3'],['street_4','street_7']),:]
               School Gender  Height  Weight  Math Physics
Class Address
C_2   street_4    S_1      F     176      94  63.5      B-
      street_4    S_2      M     155      91  73.8      A+
      street_7    S_2      F     194      77  68.5      B+
      street_7    S_2      F     183      76  85.4       B
C_3   street_4    S_1      M     161      68  31.5      B+
      street_4    S_2      F     157      78  72.3      B+
      street_4    S_2      M     187      73  48.9       B
      street_7    S_1      M     188      82  49.7       B
      street_7    S_2      F     190      99  65.9       C

(3) slice object in multi-layer index

>>> L1,L2 = ['A','B','C'],['a','b','c']
>>> mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
>>> L3,L4 = ['D','E','F'],['d','e','f']
>>> mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
>>> df_s = pd.DataFrame(np.random.rand(9,9),index=mul_index1,columns=mul_index2)
>>> df_s
Big                 D                             E                             F
Small               d         e         f         d         e         f         d         e         f
Upper Lower
A     a      0.865934  0.752678  0.992263  0.471948  0.101374  0.750520  0.029240  0.841838  0.202736
      b      0.358436  0.315506  0.141048  0.179118  0.579804  0.387298  0.731970  0.504881  0.664886
      c      0.688227  0.076362  0.447927  0.897414  0.990657  0.577089  0.885058  0.242146  0.551289
B     a      0.622251  0.583529  0.970421  0.798430  0.075585  0.453897  0.196744  0.243493  0.407374
      b      0.660225  0.383329  0.884619  0.646215  0.251076  0.753128  0.857983  0.240076  0.391556
      c      0.336650  0.051452  0.472089  0.750627  0.920971  0.131141  0.160800  0.567003  0.608006
C     a      0.875975  0.545787  0.449710  0.062922  0.931482  0.595037  0.124742  0.016393  0.221201
      b      0.608213  0.789482  0.744773  0.768816  0.364518  0.787751  0.536297  0.282383  0.828840
      c      0.972491  0.477164  0.000541  0.236157  0.951343  0.572702  0.270309  0.225364  0.027862

The use of index slices is very flexible:

>>> idx=pd.IndexSlice
# df_s.sum() sums each column by default, so it returns a Series of length 9
>>> df_s.loc[idx['B':,df_s['D']['d']>0.3],idx[df_s.sum()>4]]
Big                 D                   E
Small               d         f         d         e         f
Upper Lower
B     a      0.622251  0.970421  0.798430  0.075585  0.453897
      b      0.660225  0.884619  0.646215  0.251076  0.753128
      c      0.336650  0.472089  0.750627  0.920971  0.131141
C     a      0.875975  0.449710  0.062922  0.931482  0.595037
      b      0.608213  0.744773  0.768816  0.364518  0.787751
      c      0.972491  0.000541  0.236157  0.951343  0.572702
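Because df_s holds random numbers, the result above changes on every run. A deterministic sketch with np.arange (a smaller, hypothetical 4x4 table) makes it easier to see which cells an IndexSlice picks:

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]], names=("Upper", "Lower"))
cols = pd.MultiIndex.from_product([["D", "E"], ["d", "e"]], names=("Big", "Small"))
df_demo = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

idx = pd.IndexSlice
# Rows: everything under 'B'. Columns: every 'e' in the inner level.
block = df_demo.loc[idx["B", :], idx[:, "e"]]

print(block.to_numpy().tolist())   # [[9, 11], [13, 15]]
```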

(4) Exchange of index layer

1. The swaplevel method (two-layer exchange)

>>> df_using_mul.head()
               School Gender  Height  Weight  Math Physics
Class Address
C_1   street_1    S_1      M     173      63  34.0      A+
      street_2    S_1      F     192      73  32.5      B+
      street_2    S_1      M     186      82  87.2      B+
      street_2    S_1      F     167      81  80.4      B-
      street_4    S_1      F     159      64  84.8      B+
>>> df_using_mul.swaplevel(i=1,j=0,axis=0).sort_index().head()
               School Gender  Height  Weight  Math Physics
Address  Class
street_1 C_1      S_1      M     173      63  34.0      A+
         C_2      S_2      M     175      74  47.2      B-
         C_3      S_1      F     175      57  87.7      A-
street_2 C_1      S_1      F     192      73  32.5      B+
         C_1      S_1      M     186      82  87.2      B+

2. reorder_levels method (multi-layer exchange)

>>> df_muls = df.set_index(['School','Class','Address'])
>>> df_muls.head()
                      Gender  Height  Weight  Math Physics
School Class Address
S_1    C_1   street_1      M     173      63  34.0      A+
             street_2      F     192      73  32.5      B+
             street_2      M     186      82  87.2      B+
             street_2      F     167      81  80.4      B-
             street_4      F     159      64  84.8      B+
>>> df_muls.reorder_levels([2,0,1],axis=0).sort_index().head()
                      Gender  Height  Weight  Math Physics
Address  School Class
street_1 S_1    C_1        M     173      63  34.0      A+
                C_3        F     175      57  87.7      A-
         S_2    C_2        M     175      74  47.2      B-
street_2 S_1    C_1        F     192      73  32.5      B+
                C_1        M     186      82  87.2      B+
# If the index levels have names, the names can be used directly
>>> df_muls.reorder_levels(['Address','School','Class'],axis=0).sort_index().head()
                      Gender  Height  Weight  Math Physics
Address  School Class
street_1 S_1    C_1        M     173      63  34.0      A+
                C_3        F     175      57  87.7      A-
         S_2    C_2        M     175      74  47.2      B-
street_2 S_1    C_1        F     192      73  32.5      B+
                C_1        M     186      82  87.2      B+
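Both methods only rearrange the index levels; the data stays where it is. A minimal sketch on a hypothetical two-row table, checking the level names after each call:

```python
import pandas as pd

# Hypothetical rows for illustration only.
index = pd.MultiIndex.from_tuples(
    [("S_1", "C_1", "street_1"), ("S_2", "C_2", "street_1")],
    names=("School", "Class", "Address"),
)
df_demo = pd.DataFrame({"Math": [34.0, 47.2]}, index=index)

swapped = df_demo.swaplevel("School", "Address")                   # exchange two levels
reordered = df_demo.reorder_levels(["Address", "School", "Class"])  # give the full new order

print(list(swapped.index.names))     # ['Address', 'Class', 'School']
print(list(reordered.index.names))   # ['Address', 'School', 'Class']
```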

3. Index setting

(1) index_col parameter

index_col is a parameter in read_csv, not a method:

>>> pd.read_csv('data/table.csv',index_col=['Address','School']).head()
                Class    ID Gender  Height  Weight  Math Physics
Address  School
street_1 S_1      C_1  1101      M     173      63  34.0      A+
street_2 S_1      C_1  1102      F     192      73  32.5      B+
         S_1      C_1  1103      M     186      82  87.2      B+
         S_1      C_1  1104      F     167      81  80.4      B-
street_4 S_1      C_1  1105      F     159      64  84.8      B+

(2) reindex and reindex_like

reindex means re-indexing. Its key feature is index alignment, and it is often used for reordering:

>>> df.head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+
>>> df.reindex(index=[1101,1203,1206,2402])
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1   173.0    63.0  34.0      A+
1203    S_1   C_2      M  street_6   160.0    53.0  58.8      A+
1206    NaN   NaN    NaN       NaN     NaN     NaN   NaN     NaN
2402    S_2   C_4      M  street_7   166.0    82.0  48.7       B
>>> df.reindex(columns=['Height','Gender','Average']).head()
      Height Gender  Average
ID
1101     173      M      NaN
1102     192      F      NaN
1103     186      M      NaN
1104     167      F      NaN
1105     159      F      NaN

You can choose how missing values are filled with fill_value or method (bfill/ffill/nearest); the method parameter requires the index to be monotonic:

# bfill fills index 1206 from the next valid row, ffill from the previous valid row, and nearest from the closest one
>>> df.reindex(index=[1101,1203,1206,2402],method='bfill')
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1203    S_1   C_2      M  street_6     160      53  58.8      A+
1206    S_1   C_3      M  street_4     161      68  31.5      B+
2402    S_2   C_4      M  street_7     166      82  48.7       B
# Numerically, 1205 is closer to 1206 than 1301 is, so the former is used
>>> df.reindex(index=[1101,1203,1206,2402],method='nearest')
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1203    S_1   C_2      M  street_6     160      53  58.8      A+
1206    S_1   C_2      F  street_6     167      63  68.4      B-
2402    S_2   C_4      M  street_7     166      82  48.7       B
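The fill options are easier to compare side by side on a tiny Series. The values and labels below are hypothetical and chosen so that no label is equally close to two neighbours:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[1, 4, 9])   # monotonic index, as method requires
target = [1, 3, 9, 10]                         # 3 and 10 are not in the index

plain = s.reindex(target)                      # missing labels become NaN
filled = s.reindex(target, fill_value=0)       # missing labels become 0
forward = s.reindex(target, method="ffill")    # take the previous valid label
backward = s.reindex(target, method="bfill")   # take the next valid label
closest = s.reindex(target, method="nearest")  # take the numerically closest label

print(filled.tolist())    # [10, 0, 30, 0]
print(forward.tolist())   # [10, 10, 30, 30]
print(closest.tolist())   # [10, 20, 30, 30]
```

With bfill, label 10 has no later label to borrow from, so it stays NaN.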

reindex_like produces a DataFrame whose row and column indexes match those of the DataFrame passed as the argument, with the data taken from the calling table:

>>> df_temp = pd.DataFrame({'Weight':np.zeros(5),
...                         'Height':np.zeros(5),
...                         'ID':[1101,1104,1103,1106,1102]}).set_index('ID')
>>> df_temp.reindex_like(df[0:5][['Weight','Height']])
      Weight  Height
ID
1101     0.0     0.0
1102     0.0     0.0
1103     0.0     0.0
1104     0.0     0.0
1105     NaN     NaN

If the index of df_temp is monotonic, you can also use the method parameter:

>>> df_temp = pd.DataFrame({'Weight':range(5),
...                         'Height':range(5),
...                         'ID':[1101,1104,1103,1106,1102]}).set_index('ID').sort_index()
# You can verify for yourself whether the value at 1105 is filled by the bfill rule
>>> df_temp.reindex_like(df[0:5][['Weight','Height']],method='bfill') 
      Weight  Height
ID
1101       0       0
1102       4       4
1103       2       2
1104       1       1
1105       3       3
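One way to think about reindex_like is as a reindex call that borrows both the row index and the columns from another table. The sketch below checks that reading on two hypothetical tables (source and template are made-up names):

```python
import pandas as pd

# Hypothetical tables for illustration only.
source = pd.DataFrame(
    {"Weight": [63, 73], "Height": [173, 192]},
    index=pd.Index([1101, 1102], name="ID"),
)
template = pd.DataFrame(
    {"Height": [0, 0]},
    index=pd.Index([1102, 1105], name="ID"),
)

shaped = source.reindex_like(template)   # shape from template, data from source
manual = source.reindex(index=template.index, columns=template.columns)

print(shaped.index.tolist(), list(shaped.columns))   # [1102, 1105] ['Height']
print(shaped.equals(manual))                         # True
```

Label 1105 does not exist in source, so its Height is NaN in the result.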

(3) set_index and reset_index

First, set_index: as the name suggests, it uses certain columns as the index

Use table columns as indexes:

>>> df.head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+
>>> df.set_index('Class').head()
      School Gender   Address  Height  Weight  Math Physics
Class
C_1      S_1      M  street_1     173      63  34.0      A+
C_1      S_1      F  street_2     192      73  32.5      B+
C_1      S_1      M  street_2     186      82  87.2      B+
C_1      S_1      F  street_2     167      81  80.4      B-
C_1      S_1      F  street_4     159      64  84.8      B+

Use the append parameter to keep the current index unchanged:

>>> df.set_index('Class', append=True).head()
           School Gender   Address  Height  Weight  Math Physics
ID   Class
1101 C_1      S_1      M  street_1     173      63  34.0      A+
1102 C_1      S_1      F  street_2     192      73  32.5      B+
1103 C_1      S_1      M  street_2     186      82  87.2      B+
1104 C_1      S_1      F  street_2     167      81  80.4      B-
1105 C_1      S_1      F  street_4     159      64  84.8      B+

To use an external sequence of the same length as the table as the index (it must be converted to a Series first, otherwise an error is raised):

>>> df.set_index(pd.Series(range(df.shape[0]))).head()
  School Class Gender   Address  Height  Weight  Math Physics
0    S_1   C_1      M  street_1     173      63  34.0      A+
1    S_1   C_1      F  street_2     192      73  32.5      B+
2    S_1   C_1      M  street_2     186      82  87.2      B+
3    S_1   C_1      F  street_2     167      81  80.4      B-
4    S_1   C_1      F  street_4     159      64  84.8      B+

You can add multi-level indexes directly:

>>> df.set_index([pd.Series(range(df.shape[0])),pd.Series(np.ones(df.shape[0]))]).head()
      School Class Gender   Address  Height  Weight  Math Physics
0 1.0    S_1   C_1      M  street_1     173      63  34.0      A+
1 1.0    S_1   C_1      F  street_2     192      73  32.5      B+
2 1.0    S_1   C_1      M  street_2     186      82  87.2      B+
3 1.0    S_1   C_1      F  street_2     167      81  80.4      B-
4 1.0    S_1   C_1      F  street_4     159      64  84.8      B+

Next, reset_index: its main purpose is to reset the index.

By default, it reverts directly to the natural (integer) index:

>>> df.reset_index().head()
     ID School Class Gender   Address  Height  Weight  Math Physics
0  1101    S_1   C_1      M  street_1     173      63  34.0      A+
1  1102    S_1   C_1      F  street_2     192      73  32.5      B+
2  1103    S_1   C_1      M  street_2     186      82  87.2      B+
3  1104    S_1   C_1      F  street_2     167      81  80.4      B-
4  1105    S_1   C_1      F  street_4     159      64  84.8      B+

Use the level parameter to specify which index level is reset, and the col_level parameter to specify which column level the reset label is inserted into:

>>> L1,L2 = ['A','B','C'],['a','b','c']
>>> mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
>>> L3,L4 = ['D','E','F'],['d','e','f']
>>> mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
>>> df_temp = pd.DataFrame(np.random.rand(9,9),index=mul_index1,columns=mul_index2)
>>> df_temp.head()
Big                 D                             E                             F
Small               d         e         f         d         e         f         d         e         f
Upper Lower
A     a      0.996924  0.779796  0.198003  0.876215  0.801679  0.740366  0.072776  0.172737  0.103133
      b      0.856929  0.384369  0.988760  0.300426  0.109809  0.445339  0.735657  0.109474  0.632733
      c      0.631834  0.748637  0.378666  0.696078  0.404629  0.747714  0.237205  0.988239  0.260963
B     a      0.740106  0.995469  0.005640  0.204483  0.958359  0.737188  0.696751  0.900894  0.275091
      b      0.026315  0.251426  0.594558  0.313601  0.145479  0.433199  0.704520  0.366411  0.473218
>>> df_temp1 = df_temp.reset_index(level=1,col_level=1)
>>> df_temp1.head()
Big                 D                             E                             F
Small Lower         d         e         f         d         e         f         d         e         f
Upper
A         a  0.996924  0.779796  0.198003  0.876215  0.801679  0.740366  0.072776  0.172737  0.103133
A         b  0.856929  0.384369  0.988760  0.300426  0.109809  0.445339  0.735657  0.109474  0.632733
A         c  0.631834  0.748637  0.378666  0.696078  0.404629  0.747714  0.237205  0.988239  0.260963
B         a  0.740106  0.995469  0.005640  0.204483  0.958359  0.737188  0.696751  0.900894  0.275091
B         b  0.026315  0.251426  0.594558  0.313601  0.145479  0.433199  0.704520  0.366411  0.473218
# 'Lower' was indeed inserted into the second column level
>>> df_temp1.columns
MultiIndex([( '', 'Lower'),
            ('D',     'd'),
            ('D',     'e'),
            ('D',     'f'),
            ('E',     'd'),
            ('E',     'e'),
            ('E',     'f'),
            ('F',     'd'),
            ('F',     'e'),
            ('F',     'f')],
           names=['Big', 'Small'])
# The innermost index level has been moved out
>>> df_temp1.index
Index(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'], dtype='object', name='Upper')
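
As a compact sketch of the same idea, the example below uses a small deterministic frame (the name `toy` and its values are assumptions for illustration). It also shows the col_fill parameter, which is not used in the notes above and sets the label placed in the remaining column level:

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]], names=("Upper", "Lower"))
cols = pd.MultiIndex.from_product([["D", "E"], ["d", "e"]], names=("Big", "Small"))
toy = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# Move the 'Lower' row level into the columns; its label goes into
# column level 1 ('Small') and column level 0 is filled with ''
moved = toy.reset_index(level="Lower", col_level=1)
print(moved.columns[0])   # ('', 'Lower')
print(moved.index.name)   # Upper

# col_fill sets the label used for the other column level
filled = toy.reset_index(level="Lower", col_level=1, col_fill="D")
print(filled.columns[0])  # ('D', 'Lower')
```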

(4) rename_axis and rename

rename_axis is typically used with multi-level indexes. It modifies the name of an index level, not the index labels:

>>> df_temp.rename_axis(index={'Lower':'LowerLower'},columns={'Big':'BigBig'})[['D', 'E']]
BigBig                   D                             E
Small                    d         e         f         d         e         f
Upper LowerLower
A     a           0.996924  0.779796  0.198003  0.876215  0.801679  0.740366
      b           0.856929  0.384369  0.988760  0.300426  0.109809  0.445339
      c           0.631834  0.748637  0.378666  0.696078  0.404629  0.747714
B     a           0.740106  0.995469  0.005640  0.204483  0.958359  0.737188
      b           0.026315  0.251426  0.594558  0.313601  0.145479  0.433199
      c           0.642152  0.803119  0.869278  0.643841  0.933842  0.373142
C     a           0.419632  0.187484  0.420311  0.136625  0.512117  0.167024
      b           0.123571  0.571580  0.201483  0.788676  0.067141  0.955275
      c           0.075575  0.832965  0.934871  0.549695  0.511443  0.286503

The rename method is used to modify column or row index labels, not index names:

>>> df_temp.rename(index={'A':'T'},columns={'e':'changed_e'}).head()
Big                 D                             E                             F
Small               d changed_e         f         d changed_e         f         d changed_e         f
Upper Lower
T     a      0.996924  0.779796  0.198003  0.876215  0.801679  0.740366  0.072776  0.172737  0.103133
      b      0.856929  0.384369  0.988760  0.300426  0.109809  0.445339  0.735657  0.109474  0.632733
      c      0.631834  0.748637  0.378666  0.696078  0.404629  0.747714  0.237205  0.988239  0.260963
B     a      0.740106  0.995469  0.005640  0.204483  0.958359  0.737188  0.696751  0.900894  0.275091
      b      0.026315  0.251426  0.594558  0.313601  0.145479  0.433199  0.704520  0.366411  0.473218
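
The difference between the two methods can be checked directly on a small frame. This is a minimal sketch, with `toy` being a made-up example rather than df_temp:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]], names=("Upper", "Lower"))
toy = pd.DataFrame({"x": [1, 2, 3, 4]}, index=idx)

# rename_axis changes the level name only; the labels stay the same
renamed_axis = toy.rename_axis(index={"Lower": "LowerLower"})

# rename changes the labels only; the level names stay the same
renamed_labels = toy.rename(index={"A": "T"})

print(list(renamed_axis.index.names))                        # ['Upper', 'LowerLower']
print(list(renamed_labels.index.get_level_values("Upper")))  # ['T', 'T', 'B', 'B']
print(list(renamed_labels.index.names))                      # ['Upper', 'Lower']
```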

4. Commonly used index functions

(1) where function

The where function fills the cells for which the condition is False (with NaN by default):

>>> df.head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+
# All rows that do not satisfy the condition are set to NaN
>>> df.where(df['Gender']=='M').head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1   173.0    63.0  34.0      A+
1102    NaN   NaN    NaN       NaN     NaN     NaN   NaN     NaN
1103    S_1   C_1      M  street_2   186.0    82.0  87.2      B+
1104    NaN   NaN    NaN       NaN     NaN     NaN   NaN     NaN
1105    NaN   NaN    NaN       NaN     NaN     NaN   NaN     NaN

After dropna, this selects the same rows as the [] operator with a boolean condition, although the integer columns are upcast to float (note 173.0 instead of 173):

>>> df.where(df['Gender']=='M').dropna().head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1   173.0    63.0  34.0      A+
1103    S_1   C_1      M  street_2   186.0    82.0  87.2      B+
1201    S_1   C_2      M  street_5   188.0    68.0  97.0      A-
1203    S_1   C_2      M  street_6   160.0    53.0  58.8      A+
1301    S_1   C_3      M  street_4   161.0    68.0  31.5      B+

The first parameter is the boolean condition and the second parameter is the fill value:

>>> df.where(df['Gender']=='M',np.random.rand(df.shape[0],df.shape[1])).head()
        School     Class    Gender   Address      Height     Weight       Math   Physics
ID
1101       S_1       C_1         M  street_1  173.000000  63.000000  34.000000        A+
1102  0.147943  0.670993   0.93367   0.16424    0.314864   0.121429   0.433781  0.074907
1103       S_1       C_1         M  street_2  186.000000  82.000000  87.200000        B+
1104  0.749106    0.9844  0.184485  0.674807    0.738321   0.525289   0.019779   0.19905
1105  0.534726  0.657987  0.370359   0.89066    0.613029   0.456765   0.389943  0.756956
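
A minimal sketch of these points on a made-up frame (`toy` and its values are assumptions): where plus dropna selects the same rows as boolean indexing, the NaN fill upcasts integer columns, and a scalar can be given as the fill value.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Gender": ["M", "F", "M"], "Height": [173, 192, 186]},
    index=pd.Index([1101, 1102, 1103], name="ID"),
)
cond = toy["Gender"] == "M"

# Rows where cond is False become NaN, so Height is upcast to float
masked = toy.where(cond)

# Dropping the NaN rows selects the same rows as boolean indexing
via_where = masked.dropna()
via_brackets = toy[cond]
print(list(via_where.index) == list(via_brackets.index))  # True

# A scalar fill value keeps the integer dtype intact
filled = toy["Height"].where(cond, 0)
print(filled.tolist())  # [173, 0, 186]
```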

(2) mask function

The mask function is the opposite of where: it fills the cells for which the condition is True. Everything else works the same way:

>>> df.mask(df['Gender']=='M').dropna().head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1102    S_1   C_1      F  street_2   192.0    73.0  32.5      B+
1104    S_1   C_1      F  street_2   167.0    81.0  80.4      B-
1105    S_1   C_1      F  street_4   159.0    64.0  84.8      B+
1202    S_1   C_2      F  street_4   176.0    94.0  63.5      B-
1204    S_1   C_2      F  street_5   162.0    63.0  33.8       B
>>> df.mask(df['Gender']=='M',np.random.rand(df.shape[0],df.shape[1])).head()
        School     Class    Gender   Address      Height     Weight       Math   Physics
ID
1101  0.532138  0.841576  0.885163  0.169569    0.983056   0.714640   0.820599  0.012835
1102       S_1       C_1         F  street_2  192.000000  73.000000  32.500000        B+
1103  0.538961  0.155097  0.401648  0.283565    0.617196   0.260921   0.395324  0.478259
1104       S_1       C_1         F  street_2  167.000000  81.000000  80.400000        B-
1105       S_1       C_1         F  street_4  159.000000  64.000000  84.800000        B+
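
Put differently, mask with a condition should match where with the negated condition. A short check on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({"Gender": ["M", "F", "M"], "Math": [34.0, 32.5, 87.2]})
cond = toy["Gender"] == "M"

# mask fills where cond is True; where fills where cond is False
masked = toy.mask(cond)
inverted = toy.where(~cond)

print(masked.equals(inverted))  # True
```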

(3) query function

>>> df.head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+

In the Boolean expression passed to query, the following are legal: row and column index names, strings, the operators and/not/or/&/|/~/not in/in/==/!=, comparisons such as >, and the four arithmetic operators:

>>> df.query('(Address in ["street_6","street_7"])&(Weight>(70+10))&(ID in [1303,2304,2402])')
     School Class Gender   Address  Height  Weight  Math Physics
ID
1303    S_1   C_3      M  street_7     188      82  49.7       B
2304    S_2   C_3      F  street_6     164      81  95.5      A-
2402    S_2   C_4      M  street_7     166      82  48.7       B
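
For comparison, the same kind of filter can be written with ordinary boolean masks. The sketch below uses a made-up frame and assumes, as in the example above, that the index is named ID so that query can refer to it by name:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Address": ["street_6", "street_7", "street_2", "street_7"],
        "Weight": [81, 82, 90, 75],
    },
    index=pd.Index([2304, 2402, 1103, 1303], name="ID"),
)

# Column names and the index name ID are used directly in the expression
by_query = toy.query(
    '(Address in ["street_6","street_7"])&(Weight>(70+10))&(ID in [2304,2402])'
)

# The same conditions as explicit boolean masks
by_mask = toy[
    toy["Address"].isin(["street_6", "street_7"])
    & (toy["Weight"] > 80)
    & toy.index.isin([2304, 2402])
]

print(by_query.equals(by_mask))  # True
print(list(by_query.index))      # [2304, 2402]
```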

5. Repeated element processing

(1) duplicated method

This method returns a boolean Series that marks duplicated rows:

>>> df.duplicated('Class').head()
ID
1101    False
1102     True
1103     True
1104     True
1105     True
dtype: bool

The optional parameter keep defaults to 'first', which marks every occurrence except the first as a duplicate. With 'last', every occurrence except the last is marked. With False, all occurrences of a repeated value are marked True:

>>> df.duplicated('Class',keep='last').tail()
ID
2401     True
2402     True
2403     True
2404     True
2405    False
dtype: bool
>>> df.duplicated('Class',keep=False).head()
ID
1101    True
1102    True
1103    True
1104    True
1105    True
dtype: bool
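
The three keep settings are easier to compare side by side on a tiny made-up column:

```python
import pandas as pd

toy = pd.DataFrame({"Class": ["C_1", "C_1", "C_2", "C_1"]})

first = toy.duplicated("Class").tolist()               # keep='first' is the default
last = toy.duplicated("Class", keep="last").tolist()
every = toy.duplicated("Class", keep=False).tolist()

print(first)  # [False, True, False, True]
print(last)   # [True, True, False, False]
print(every)  # [True, True, False, True]
```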

(2) drop_duplicates method

As the name suggests, this method drops duplicate rows. It may be useful for the grouping operations in later chapters, for example when only the first row of each group needs to be kept:

>>> df.drop_duplicates('Class')
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1201    S_1   C_2      M  street_5     188      68  97.0      A-
1301    S_1   C_3      M  street_4     161      68  31.5      B+
2401    S_2   C_4      F  street_2     192      62  45.3       A

The parameters are similar to those of the duplicated method:

>>> df.drop_duplicates('Class',keep='last')
     School Class Gender   Address  Height  Weight  Math Physics
ID
2105    S_2   C_1      M  street_4     170      81  34.2       A
2205    S_2   C_2      F  street_7     183      76  85.4       B
2305    S_2   C_3      M  street_4     187      73  48.9       B
2405    S_2   C_4      F  street_6     193      54  47.6       B

When multiple columns are passed in, rows are compared on the combination of those columns, as if the columns formed a multi-level index:

>>> df.drop_duplicates(['School','Class'])
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1201    S_1   C_2      M  street_5     188      68  97.0      A-
1301    S_1   C_3      M  street_4     161      68  31.5      B+
2101    S_2   C_1      M  street_7     174      84  83.3       C
2201    S_2   C_2      M  street_5     193     100  39.1       B
2301    S_2   C_3      F  street_4     157      78  72.3      B+
2401    S_2   C_4      F  street_2     192      62  45.3       A
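
One way to read drop_duplicates is as boolean filtering on the negated result of duplicated. A small sketch with made-up rows:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "School": ["S_1", "S_1", "S_2", "S_2"],
        "Class": ["C_1", "C_1", "C_1", "C_2"],
        "Math": [34.0, 32.5, 83.3, 39.1],
    }
)

# Keep the first row of each (School, Class) combination
dropped = toy.drop_duplicates(["School", "Class"])

# Equivalent filter built from duplicated
filtered = toy[~toy.duplicated(["School", "Class"])]

print(dropped.equals(filtered))  # True
print(list(dropped.index))       # [0, 2, 3]
```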

6. Sampling function

Sampling here refers to the sample method.

(1) n is the sample size

>>> df.sample(n=5)
     School Class Gender   Address  Height  Weight  Math Physics
ID
2205    S_2   C_2      F  street_7     183      76  85.4       B
2202    S_2   C_2      F  street_7     194      77  68.5      B+
2101    S_2   C_1      M  street_7     174      84  83.3       C
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1301    S_1   C_3      M  street_4     161      68  31.5      B+

(2) frac is the sample ratio

>>> df.sample(frac=0.05)
     School Class Gender   Address  Height  Weight  Math Physics
ID
1104    S_1   C_1      F  street_2     167      81  80.4      B-
2105    S_2   C_1      M  street_4     170      81  34.2       A

(3) replace specifies whether to sample with replacement

>>> df.sample(n=df.shape[0],replace=True).head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
2302    S_2   C_3      M  street_5     171      88  32.7       A
1101    S_1   C_1      M  street_1     173      63  34.0      A+
2305    S_2   C_3      M  street_4     187      73  48.9       B
2101    S_2   C_1      M  street_7     174      84  83.3       C
1304    S_1   C_3      M  street_2     195      70  85.2       A
>>> df.sample(n=35,replace=True).index.is_unique
False

(4) axis is the sampling dimension; the default is 0, that is, rows are sampled

>>> df.sample(n=3,axis=1).head()  # Here axis is 1, so 3 columns are sampled
       Address Class  Weight
ID
1101  street_1   C_1      63
1102  street_2   C_1      73
1103  street_2   C_1      82
1104  street_2   C_1      81
1105  street_4   C_1      64

(5) weights gives the sampling weights, which are automatically normalized to sum to 1

>>> df.sample(n=3,weights=np.random.rand(df.shape[0])).head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
2302    S_2   C_3      M  street_5     171      88  32.7       A
1105    S_1   C_1      F  street_4     159      64  84.8      B+
# Use a column as the weights, which is common in sampling theory
# The probability of being drawn is proportional to the Math value
>>> df.sample(n=3,weights=df['Math']).head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
2304    S_2   C_3      F  street_6     164      81  95.5      A-
1201    S_1   C_2      M  street_5     188      68  97.0      A-
2205    S_2   C_2      F  street_7     183      76  85.4       B
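
Because sample is random, the outputs above change on every run. The sketch below uses the random_state parameter, which is not used in the notes above, to make draws reproducible. It also checks that a row with weight 0 is never drawn. The frame is made up, and sampling with replacement is chosen here only so that many draws can be taken from a tiny table:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Math": [34.0, 0.0, 87.2, 80.4]},
    index=pd.Index([1101, 1102, 1103, 1104], name="ID"),
)

# The same random_state gives the same sample
run_a = toy.sample(n=2, random_state=0)
run_b = toy.sample(n=2, random_state=0)
print(run_a.equals(run_b))  # True

# Weights are proportional to Math, so ID 1102 (weight 0) cannot be drawn
drawn = toy.sample(n=50, weights=toy["Math"], replace=True, random_state=0)
print(1102 in drawn.index)  # False
```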

Origin blog.csdn.net/qq_43300880/article/details/124985663