Introduction
For more article code details, please check the blogger’s personal website: https://www.iwtmbtly.com/
Import the required libraries and files:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv('data/table.csv',index_col='ID')
>>> df.head()
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
1. Single-level index
(1) loc method, iloc method, [] operator
The three most commonly used indexing methods are loc, iloc and the [] operator. iloc indexes by position, loc indexes by label, and [] is a convenient shorthand; each has its own characteristics.
1. loc method
(a) Single-row indexing:
>>> df.loc[1103]
School S_1
Class C_1
Gender M
Address street_2
Height 186
Weight 82
Math 87.2
Physics B+
Name: 1103, dtype: object
(b) Multi-row index:
>>> df.loc[[1102, 2304]]
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
2304 S_2 C_3 F street_6 164 81 95.5 A-
Note: All slices used in loc include the right endpoint. A Pandas user normally knows the last label they want, but not necessarily which label comes after it. With a left-closed, right-open slice you would first have to look up the name of the next label, which is inconvenient, so loc slices are closed on both ends.
>>> df.loc[1304:2103].head()
School Class Gender Address Height Weight Math Physics
ID
1304 S_1 C_3 M street_2 195 70 85.2 A
1305 S_1 C_3 F street_5 187 69 61.7 B-
2101 S_2 C_1 M street_7 174 84 83.3 C
2102 S_2 C_1 F street_6 161 61 50.6 B+
2103 S_2 C_1 M street_4 157 61 52.5 B-
>>> df.loc[2402::-1].head()
School Class Gender Address Height Weight Math Physics
ID
2402 S_2 C_4 M street_7 166 82 48.7 B
2401 S_2 C_4 F street_2 192 62 45.3 A
2305 S_2 C_3 M street_4 187 73 48.9 B
2304 S_2 C_3 F street_6 164 81 95.5 A-
2303 S_2 C_3 F street_7 190 99 65.9 C
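The endpoint rule is easy to check on a small table. The sketch below uses a hypothetical four-row frame (toy is an assumption for illustration, not the article's data) and compares a label slice with a position slice over the same rows.

```python
import pandas as pd

# Hypothetical toy data; the IDs only mimic the style used in this article.
toy = pd.DataFrame({'Math': [34.0, 32.5, 87.2, 80.4]},
                   index=[1101, 1102, 1103, 1104])

by_label = toy.loc[1102:1104]    # label slice: both endpoints are included
by_position = toy.iloc[1:3]      # position slice: the right endpoint is excluded

print(list(by_label.index))
print(list(by_position.index))
```

The label slice returns three rows while the position slice over the "same" range returns two, which is the difference described in the note above.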
(c) Single-column index
>>> df.loc[:,'Height'].head()
ID
1101 173
1102 192
1103 186
1104 167
1105 159
Name: Height, dtype: int64
(d) Multi-column index
>>> df.loc[:,['Height', 'Math']].head()
Height Math
ID
1101 173 34.0
1102 192 32.5
1103 186 87.2
1104 167 80.4
1105 159 84.8
>>> df.loc[:,'Height':'Math'].head()
Height Weight Math
ID
1101 173 63 34.0
1102 192 73 32.5
1103 186 82 87.2
1104 167 81 80.4
1105 159 64 84.8
(e) Joint index
>>> df.loc[1102:2401:3,'Height':'Math'].head()
Height Weight Math
ID
1102 192 73 32.5
1105 159 64 84.8
1203 160 53 58.8
1301 161 68 31.5
1304 195 70 85.2
(f) Functional indexing
>>> df.loc[lambda x:x['Gender']=='M'].head() # the function used in loc receives the preceding df as its argument
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1201 S_1 C_2 M street_5 188 68 97.0 A-
1203 S_1 C_2 M street_6 160 53 58.8 A+
1301 S_1 C_3 M street_4 161 68 31.5 B+
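# This example shows that loc accepts a function whose input is the whole table
# and whose output is a scalar, a slice, a valid list (all elements appear in the index) or a valid index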
>>> def f(x):
... return [1101, 1105]
>>> df.loc[f]
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1105 S_1 C_1 F street_4 159 64 84.8 B+
(g) Boolean indexing
>>> df.loc[df['Address'].isin(['street_7', 'street_4'])].head()
School Class Gender Address Height Weight Math Physics
ID
1105 S_1 C_1 F street_4 159 64 84.8 B+
1202 S_1 C_2 F street_4 176 94 63.5 B-
1301 S_1 C_3 M street_4 161 68 31.5 B+
1303 S_1 C_3 M street_7 188 82 49.7 B
2101 S_2 C_1 M street_7 174 84 83.3 C
>>> df.loc[[True if i[-1]=='4' or i[-1]=='7' else False for i in df['Address'].values]].head()
School Class Gender Address Height Weight Math Physics
ID
1105 S_1 C_1 F street_4 159 64 84.8 B+
1202 S_1 C_2 F street_4 176 94 63.5 B-
1301 S_1 C_3 M street_4 161 68 31.5 B+
1303 S_1 C_3 M street_7 188 82 49.7 B
2101 S_2 C_1 M street_7 174 84 83.3 C
Summary: In essence, loc only accepts Boolean lists and subsets of the index. Once you grasp this principle, the operations above are easy to understand.
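As a rough illustration of that principle, the sketch below selects the same rows in three ways: by a list of labels, by a Boolean Series, and by a callable that returns a Boolean Series. The frame toy is a hypothetical stand-in, not the article's table.

```python
import pandas as pd

# Hypothetical toy data, not the article's table.
toy = pd.DataFrame({'Gender': ['M', 'F', 'M'], 'Math': [34.0, 32.5, 87.2]},
                   index=[1101, 1102, 1103])

by_labels = toy.loc[[1101, 1103]]                      # subset of index labels
by_mask = toy.loc[toy['Gender'] == 'M']                # Boolean Series
by_callable = toy.loc[lambda d: d['Gender'] == 'M']    # callable returning a mask

print(by_labels.equals(by_mask), by_mask.equals(by_callable))
```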
2. iloc method (note that unlike loc, the right endpoint of the slice is not included)
(a) Single row index
>>> df.iloc[3]
School S_1
Class C_1
Gender F
Address street_2
Height 167
Weight 81
Math 80.4
Physics B-
Name: 1104, dtype: object
(b) Multi-row index
>>> df.iloc[3:5]
School Class Gender Address Height Weight Math Physics
ID
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
(c) Single-column index
>>> df.iloc[:,3].head()
ID
1101 street_1
1102 street_2
1103 street_2
1104 street_2
1105 street_4
Name: Address, dtype: object
(d) Multi-column index
>>> df.iloc[:,7::-2].head()
Physics Weight Address Class
ID
1101 A+ 63 street_1 C_1
1102 B+ 73 street_2 C_1
1103 B+ 82 street_2 C_1
1104 B- 81 street_2 C_1
1105 B+ 64 street_4 C_1
(e) Hybrid index
>>> df.iloc[3::4,7::-2].head()
Physics Weight Address Class
ID
1104 B- 81 street_2 C_1
1203 A+ 53 street_6 C_2
1302 A- 57 street_1 C_3
2101 C 84 street_7 C_1
2105 A 81 street_4 C_1
(f) Functional indexing
>>> df.iloc[lambda x:[3]].head()
School Class Gender Address Height Weight Math Physics
ID
1104 S_1 C_1 F street_2 167 81 80.4 B-
Summary: iloc only accepts integers, integer lists, integer slices and Boolean lists. A Boolean Series cannot be used directly; to use one, extract its values first, as follows:
# df.iloc[df['School']=='S_1'].head() # raises an error
>>> df.iloc[(df['School']=='S_1').values].head()
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
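The restriction can be reproduced on a small frame. This is a minimal sketch on hypothetical data; the exact exception type is an assumption that may differ between pandas versions, so both likely types are caught.

```python
import pandas as pd

# Hypothetical toy data, not the article's table.
toy = pd.DataFrame({'School': ['S_1', 'S_2', 'S_1']}, index=[1101, 2101, 1102])
mask = toy['School'] == 'S_1'          # Boolean Series carrying the ID index

try:
    toy.iloc[mask]
    series_mask_rejected = False
except (ValueError, NotImplementedError):
    # iloc is purely positional, so it refuses a mask that carries an index
    series_mask_rejected = True

selected = toy.iloc[mask.to_numpy()]   # a plain Boolean array is accepted
print(series_mask_rejected, list(selected.index))
```

Here to_numpy() plays the same role as .values in the example above.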
3. [] operator
3.1 [] operation of Series
(a) Single-element indexing:
>>> s = pd.Series(df['Math'], index=df.index)
>>> s[1101] # uses the index label
34.0
(b) Multi-row index
>>> s[0:4] # an integer slice by absolute position, independent of the labels; easy to confuse
ID
1101 34.0
1102 32.5
1103 87.2
1104 80.4
Name: Math, dtype: float64
(c) Functional indexing
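# Note: with a lambda, slicing directly (e.g. s[lambda x: 16::-6]) raises an error; what is used in that case is a label slice rather than an absolute-position slice, which is very error-prone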
>>> s[lambda x: x.index[16::-6]]
ID
2102 50.6
1301 31.5
1105 84.8
Name: Math, dtype: float64
(d) Boolean index
>>> s[s > 80]
ID
1103 87.2
1104 80.4
1105 84.8
1201 97.0
1302 87.7
1304 85.2
2101 83.3
2205 85.4
2304 95.5
Name: Math, dtype: float64
Note: To stay out of trouble, avoid the [] operator when the row index holds floating-point values. A slice inside [] on a Series with a float index is compared against the label values, not the positions, which is an unusual special case.
>>> s_int = pd.Series([1, 2, 3, 4], index=[1, 3, 5, 6])
>>> s_float = pd.Series([1,2,3,4],index=[1.,3.,5.,6.])
>>> s_int
1 1
3 2
5 3
6 4
dtype: int64
>>> s_int[2:]
5 3
6 4
dtype: int64
>>> s_float
1.0 1
3.0 2
5.0 3
6.0 4
dtype: int64
# Note that the result differs from s_int[2:], because here 2 is treated as a label value rather than a position
>>> s_float[2:]
3.0 2
5.0 3
6.0 4
dtype: int64
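One way to sidestep this ambiguity is to state the intent explicitly with loc (label values) or iloc (positions). The sketch below reuses the s_float Series defined above.

```python
import pandas as pd

s_float = pd.Series([1, 2, 3, 4], index=[1., 3., 5., 6.])

by_value = s_float.loc[2:]      # labels >= 2, i.e. from label 3.0 onwards
by_position = s_float.iloc[2:]  # everything from the third element onwards

print(by_value.tolist())
print(by_position.tolist())
```

With the accessor spelled out, the result no longer depends on whether the index happens to be integer or float.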
3.2 [] operation of DataFrame
(a) Single-row indexing:
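# It is very easy to write df['label'] here by mistake, which raises an error
# As with Series, an absolute-position slice is used
# To fetch a particular row by its label, the get_loc method can be used as follows: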
>>> df[1:2]
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
>>> row = df.index.get_loc(1102)
>>> df[row:row+1]
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
(b) Multi-row index
# Use a slice; to select several specific rows, loc is recommended, otherwise an error is likely
>>> df[3:5]
School Class Gender Address Height Weight Math Physics
ID
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
(c) Single-column index
>>> df['School'].head()
ID
1101 S_1
1102 S_1
1103 S_1
1104 S_1
1105 S_1
Name: School, dtype: object
(d) Multi-column index
>>> df[['School', 'Math']].head()
School Math
ID
1101 S_1 34.0
1102 S_1 32.5
1103 S_1 87.2
1104 S_1 80.4
1105 S_1 84.8
(e) Functional indexing
>>> df[lambda x: ['Math', 'Physics']].head()
Math Physics
ID
1101 34.0 A+
1102 32.5 B+
1103 87.2 B+
1104 80.4 B-
1105 84.8 B+
(f) Boolean index
>>> df[df['Gender']=='F'].head()
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
1202 S_1 C_2 F street_4 176 94 63.5 B-
1204 S_1 C_2 F street_5 162 63 33.8 B
Summary: In general, the [] operator is mostly used for column selection or Boolean selection; try to avoid using it for row selection.
(2) Boolean index
(a) Boolean operators: '&', '|' and '~' stand for and, or and not respectively
>>> df[(df['Gender']=='F')&(df['Address']=='street_2')].head()
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
2401 S_2 C_4 F street_2 192 62 45.3 A
2404 S_2 C_4 F street_2 160 84 67.7 B
>>> df[(df['Math']>85) | (df['Address']=='street_2')].head()
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1201 S_1 C_2 M street_5 188 68 97.0 A-
1302 S_1 C_3 F street_1 175 57 87.7 A-
>>> df[~((df['Math']>75)|(df['Address']=='street_1'))].head()
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192 73 32.5 B+
1202 S_1 C_2 F street_4 176 94 63.5 B-
1203 S_1 C_2 M street_6 160 53 58.8 A+
1204 S_1 C_2 F street_5 162 63 33.8 B
1205 S_1 C_2 F street_6 167 63 68.4 B-
A Boolean list can also be used in the corresponding positions of loc and []:
# Question: why does df.loc[df['Math']>60,(df[:8]['Address']=='street_6').values].head() give the same result as the following?
# Can .values be removed?
>>> df.loc[df['Math']>60, df.columns=='Physics'].head()
Physics
ID
1103 B+
1104 B-
1105 B+
1201 A-
1202 B-
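Because & and | bind more tightly than comparison operators such as > and ==, each comparison should be wrapped in its own parentheses before the conditions are combined. The following sketch, on a hypothetical toy frame, shows the parenthesized form together with negation.

```python
import pandas as pd

# Hypothetical toy data, not the article's table.
toy = pd.DataFrame({'Math': [34.0, 32.5, 87.2, 90.1],
                    'Address': ['street_1', 'street_2', 'street_2', 'street_4']},
                   index=[1101, 1102, 1103, 1104])

high = toy['Math'] > 85
on_street_2 = toy['Address'] == 'street_2'

# Each comparison sits in its own parentheses before being combined with |
either = toy[(toy['Math'] > 85) | (toy['Address'] == 'street_2')]
neither = toy[~(high | on_street_2)]
# De Morgan: not (a or b) is the same as (not a) and (not b)
neither_alt = toy[~high & ~on_street_2]

print(list(either.index), list(neither.index))
```

Naming the intermediate masks, as with high and on_street_2, is another way to avoid precedence mistakes.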
(b) isin method
>>> df[df['Address'].isin(['street_1', 'street_4']) & df['Physics'].isin(['A', 'A+'])]
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
2105 S_2 C_1 M street_4 170 81 34.2 A
2203 S_2 C_2 M street_4 155 91 73.8 A+
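# The above can also be written with a dictionary:
# all works along the same lines as &; axis=1 means checking across the columns of each row whether all values are True
>>> df[df[['Address','Physics']].isin({'Address':['street_1','street_4'],
... 'Physics':['A','A+']}).all(axis=1)]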
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
2105 S_2 C_1 M street_4 170 81 34.2 A
2203 S_2 C_2 M street_4 155 91 73.8 A+
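The equivalence of the two spellings can be checked directly. This is a small sketch on hypothetical data; allowed is an illustrative name, not something from the article.

```python
import pandas as pd

# Hypothetical toy data, not the article's table.
toy = pd.DataFrame({'Address': ['street_1', 'street_2', 'street_4', 'street_4'],
                    'Physics': ['A+', 'A', 'B', 'A']},
                   index=[1101, 1102, 1103, 1104])

allowed = {'Address': ['street_1', 'street_4'], 'Physics': ['A', 'A+']}

# One isin per column, combined with &
mask_and = toy['Address'].isin(allowed['Address']) & toy['Physics'].isin(allowed['Physics'])
# Dictionary form: one Boolean column per key, then require every column to be True
mask_dict = toy[list(allowed)].isin(allowed).all(axis=1)

print(list(toy[mask_dict].index))
```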
(3) Fast scalar index
When only one element needs to be fetched, the at and iat methods can provide a faster implementation:
>>> print(df.at[1101, 'School'])
S_1
>>> print(df.loc[1101, 'School'])
S_1
>>> print(df.iat[0, 0])
S_1
>>> print(df.iloc[0, 0])
S_1
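A short sketch of scalar access on a hypothetical two-row frame follows; besides reading, at and iat can also assign a single cell.

```python
import pandas as pd

# Hypothetical toy data, not the article's table.
toy = pd.DataFrame({'School': ['S_1', 'S_2'], 'Math': [34.0, 83.3]},
                   index=[1101, 2101])

by_label = toy.at[2101, 'Math']   # label-based scalar access
by_position = toy.iat[1, 1]       # position-based scalar access

toy.at[1101, 'Math'] = 35.0       # assign a single cell by label
print(by_label, by_position, toy.loc[1101, 'Math'])
```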
(4) Interval index
Interval indexes are not limited to single-level indexes; they are introduced here first as a special kind of index.
(a) Using the interval_range method
# closed can be 'left', 'right', 'both' or 'neither'; the default is 'right' (open on the left, closed on the right)
>>> pd.interval_range(start=0, end=5)
IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')
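The effect of the closed parameter can be seen by testing which endpoints belong to the first interval. A minimal sketch:

```python
import pandas as pd

right_closed = pd.interval_range(start=0, end=5)                # default: (a, b]
left_closed = pd.interval_range(start=0, end=5, closed='left')  # [a, b)

first_right = right_closed[0]   # (0, 1]
first_left = left_closed[0]     # [0, 1)

print(0 in first_right, 1 in first_right)
print(0 in first_left, 1 in first_left)
```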
(b) Use cut to convert a numerical column into a categorical variable whose elements are intervals, for example to bin the Math scores:
# Note: without a type conversion, the result is of category dtype, not interval dtype
>>> math_interval = pd.cut(df['Math'], bins=[0, 40, 60, 80, 100])
>>> math_interval.head()
ID
1101 (0, 40]
1102 (0, 40]
1103 (80, 100]
1104 (80, 100]
1105 (80, 100]
Name: Math, dtype: category
Categories (4, interval[int64, right]): [(0, 40] < (40, 60] < (60, 80] < (80, 100]]
(c) Selection of interval index
>>> df_i = df.join(math_interval, rsuffix='_interval')[['Math', 'Math_interval']].reset_index().set_index('Math_interval')
>>> df_i.head()
ID Math
Math_interval
(0, 40] 1101 34.0
(0, 40] 1102 32.5
(80, 100] 1103 87.2
(80, 100] 1104 80.4
(80, 100] 1105 84.8
>>> df_i.loc[65].head() # rows whose interval contains this value are selected
ID Math
Math_interval
(60, 80] 1202 63.5
(60, 80] 1205 68.4
(60, 80] 1305 61.7
(60, 80] 2104 72.2
(60, 80] 2202 68.5
>>> df_i.loc[[65, 90]].head()
ID Math
Math_interval
(60, 80] 1202 63.5
(60, 80] 1205 68.4
(60, 80] 1305 61.7
(60, 80] 2104 72.2
(60, 80] 2202 68.5
To select by an interval, first convert the categorical index into an interval index, and then use the overlaps method:
# df_i.loc[pd.Interval(70,75)].head() raises an error
>>> df_i[df_i.index.astype('interval').overlaps(pd.Interval(70, 85))].head()
ID Math
Math_interval
(80, 100] 1103 87.2
(80, 100] 1104 80.4
(80, 100] 1105 84.8
(80, 100] 1201 97.0
(60, 80] 1202 63.5
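To see what overlaps computes on its own, the sketch below builds an IntervalIndex from the same bin edges used in the cut call and tests it against one interval.

```python
import pandas as pd

# The same bin edges as in the pd.cut call; intervals are right-closed by default
bins = pd.IntervalIndex.from_breaks([0, 40, 60, 80, 100])

# Interval(70, 85) is (70, 85]; it shares points with (60, 80] and (80, 100]
hits = bins.overlaps(pd.Interval(70, 85))

print(hits.tolist())
```

The result is one Boolean per bin, which is why it can be used directly as a row mask.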
2. Multi-level index
(1) Create a multi-level index
1. By from_tuples or from_arrays
(a) Create tuples directly
>>> tuples = [('A','a'),('A','b'),('B','a'),('B','b')]
>>> mul_index = pd.MultiIndex.from_tuples(tuples, names=('Upper', 'Lower'))
>>> mul_index
MultiIndex([('A', 'a'),
('A', 'b'),
('B', 'a'),
('B', 'b')],
names=['Upper', 'Lower'])
>>> pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)
Score
Upper Lower
A a perfect
b good
B a fair
b bad
(b) Use zip to create tuples
>>> L1 = list('AABB')
>>> L2 = list('abab')
>>> tuples = list(zip(L1,L2))
>>> mul_index = pd.MultiIndex.from_tuples(tuples, names=('Upper', 'Lower'))
>>> pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)
Score
Upper Lower
A a perfect
b good
B a fair
b bad
(c) Created by Array
>>> arrays = [['A','a'],['A','b'],['B','a'],['B','b']]
>>> mul_index = pd.MultiIndex.from_tuples(arrays, names=('Upper', 'Lower'))
>>> pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)
Score
Upper Lower
A a perfect
b good
B a fair
b bad
# This shows that the inner lists are converted to tuples automatically
>>> mul_index
MultiIndex([('A', 'a'),
('A', 'b'),
('B', 'a'),
('B', 'b')],
names=['Upper', 'Lower'])
2. By from_product
>>> L1 = ['A','B']
>>> L2 = ['a','b']
>>> pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower')) # Cartesian product of the two lists
MultiIndex([('A', 'a'),
('A', 'b'),
('B', 'a'),
('B', 'b')],
names=['Upper', 'Lower'])
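The examples above pass row-wise pairs to from_tuples. For comparison, here is a minimal sketch of MultiIndex.from_arrays, which takes one list per level, and a check that all three constructors build the same index.

```python
import pandas as pd

names = ('Upper', 'Lower')

# from_arrays takes one list per level (column-wise), not one list per row
from_arrays = pd.MultiIndex.from_arrays([['A', 'A', 'B', 'B'],
                                         ['a', 'b', 'a', 'b']], names=names)
from_tuples = pd.MultiIndex.from_tuples([('A', 'a'), ('A', 'b'),
                                         ('B', 'a'), ('B', 'b')], names=names)
from_product = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']], names=names)

print(from_arrays.equals(from_tuples), from_arrays.equals(from_product))
```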
3. Create from columns of df (set_index method)
>>> df_using_mul = df.set_index(['Class','Address'])
>>> df_using_mul.head()
School Gender Height Weight Math Physics
Class Address
C_1 street_1 S_1 M 173 63 34.0 A+
street_2 S_1 F 192 73 32.5 B+
street_2 S_1 M 186 82 87.2 B+
street_2 S_1 F 167 81 80.4 B-
street_4 S_1 F 159 64 84.8 B+
(2) Multi-layer index slicing
>>> df_using_mul.head()
School Gender Height Weight Math Physics
Class Address
C_1 street_1 S_1 M 173 63 34.0 A+
street_2 S_1 F 192 73 32.5 B+
street_2 S_1 M 186 82 87.2 B+
street_2 S_1 F 167 81 80.4 B-
street_4 S_1 F 159 64 84.8 B+
1. General slice
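# df_using_mul.loc['C_2','street_5']
# When the index is not sorted, selecting a single key raises a performance warning
# df_using_mul.index.is_lexsorted()
# This method checks whether the index is sorted (it exists only in older pandas versions; newer versions offer index.is_monotonic_increasing instead)
# df_using_mul.sort_index().index.is_lexsorted()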
>>> df_using_mul.sort_index().loc['C_2','street_5']
School Gender Height Weight Math Physics
Class Address
C_2 street_5 S_1 M 188 68 97.0 A-
street_5 S_1 F 162 63 33.8 B
street_5 S_2 M 193 100 39.1 B
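# df_using_mul.loc[('C_2','street_5'):] raises an error
# Multi-level slicing cannot be used when the index is not sorted
# Note that because loc is used here, the right endpoint is still included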
>>> df_using_mul.sort_index().loc[('C_2','street_6'):('C_3','street_4')]
School Gender Height Weight Math Physics
Class Address
C_2 street_6 S_1 M 160 53 58.8 A+
street_6 S_1 F 167 63 68.4 B-
street_7 S_2 F 194 77 68.5 B+
street_7 S_2 F 183 76 85.4 B
C_3 street_1 S_1 F 175 57 87.7 A-
street_2 S_1 M 195 70 85.2 A
street_4 S_1 M 161 68 31.5 B+
street_4 S_2 F 157 78 72.3 B+
street_4 S_2 M 187 73 48.9 B
# A non-tuple endpoint is also valid; it selects all elements of that level
>>> df_using_mul.sort_index().loc[('C_2','street_7'):'C_3'].head()
School Gender Height Weight Math Physics
Class Address
C_2 street_7 S_2 F 194 77 68.5 B+
street_7 S_2 F 183 76 85.4 B
C_3 street_1 S_1 F 175 57 87.7 A-
street_2 S_1 M 195 70 85.2 A
street_4 S_1 M 161 68 31.5 B+
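The sorting requirement can be reproduced on a tiny frame. The sketch below uses hypothetical data with a deliberately unordered MultiIndex; it catches KeyError because pandas.errors.UnsortedIndexError is a subclass of it.

```python
import pandas as pd

# Hypothetical toy data whose MultiIndex is deliberately out of order.
index = pd.MultiIndex.from_tuples(
    [('C_2', 'street_5'), ('C_1', 'street_1'), ('C_2', 'street_4')],
    names=('Class', 'Address'))
toy = pd.DataFrame({'Math': [97.0, 34.0, 63.5]}, index=index)

try:
    toy.loc[('C_2', 'street_4'):]
    unsorted_slice_failed = False
except KeyError:
    # pandas raises UnsortedIndexError, a subclass of KeyError
    unsorted_slice_failed = True

ordered = toy.sort_index()
tail = ordered.loc[('C_2', 'street_4'):]   # works once the index is sorted

print(toy.index.is_monotonic_increasing, ordered.index.is_monotonic_increasing)
print(unsorted_slice_failed, tail['Math'].tolist())
```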
2. First special case: a list of tuples
# Selects specific elements, precise down to the innermost index level
>>> df_using_mul.sort_index().loc[[('C_2','street_7'),('C_3','street_2')]]
School Gender Height Weight Math Physics
Class Address
C_2 street_7 S_2 F 194 77 68.5 B+
street_7 S_2 F 183 76 85.4 B
C_3 street_2 S_1 M 195 70 85.2 A
3. Second special case: a tuple of lists
# Selects the rows whose first level is in 'C_2' and 'C_3' and whose second level is in 'street_4' and 'street_7'
>>> df_using_mul.sort_index().loc[(['C_2','C_3'],['street_4','street_7']),:]
School Gender Height Weight Math Physics
Class Address
C_2 street_4 S_1 F 176 94 63.5 B-
street_4 S_2 M 155 91 73.8 A+
street_7 S_2 F 194 77 68.5 B+
street_7 S_2 F 183 76 85.4 B
C_3 street_4 S_1 M 161 68 31.5 B+
street_4 S_2 F 157 78 72.3 B+
street_4 S_2 M 187 73 48.9 B
street_7 S_1 M 188 82 49.7 B
street_7 S_2 F 190 99 65.9 C
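The difference between the two special cases is easiest to see side by side. In this sketch (hypothetical values), the list of tuples picks exact pairs, while the tuple of lists picks every combination of the listed labels.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['C_2', 'C_3'], ['street_4', 'street_7']], names=('Class', 'Address'))
toy = pd.DataFrame({'Math': [63.5, 68.5, 31.5, 49.7]}, index=index)  # hypothetical values

# List of tuples: exactly these (Class, Address) pairs
pairs = toy.loc[[('C_2', 'street_7'), ('C_3', 'street_4')]]

# Tuple of lists: every combination of the listed Class and Address labels
combos = toy.loc[(['C_2', 'C_3'], ['street_4', 'street_7']), :]

print(len(pairs), len(combos))
```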
(3) slice object in multi-layer index
>>> L1,L2 = ['A','B','C'],['a','b','c']
>>> mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
>>> L3,L4 = ['D','E','F'],['d','e','f']
>>> mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
>>> df_s = pd.DataFrame(np.random.rand(9,9),index=mul_index1,columns=mul_index2)
>>> df_s
Big D E F
Small d e f d e f d e f
Upper Lower
A a 0.865934 0.752678 0.992263 0.471948 0.101374 0.750520 0.029240 0.841838 0.202736
b 0.358436 0.315506 0.141048 0.179118 0.579804 0.387298 0.731970 0.504881 0.664886
c 0.688227 0.076362 0.447927 0.897414 0.990657 0.577089 0.885058 0.242146 0.551289
B a 0.622251 0.583529 0.970421 0.798430 0.075585 0.453897 0.196744 0.243493 0.407374
b 0.660225 0.383329 0.884619 0.646215 0.251076 0.753128 0.857983 0.240076 0.391556
c 0.336650 0.051452 0.472089 0.750627 0.920971 0.131141 0.160800 0.567003 0.608006
C a 0.875975 0.545787 0.449710 0.062922 0.931482 0.595037 0.124742 0.016393 0.221201
b 0.608213 0.789482 0.744773 0.768816 0.364518 0.787751 0.536297 0.282383 0.828840
c 0.972491 0.477164 0.000541 0.236157 0.951343 0.572702 0.270309 0.225364 0.027862
The use of index slices is very flexible:
>>> idx=pd.IndexSlice
# df_s.sum() sums over each column by default, so it returns a numeric Series of length 9
>>> df_s.loc[idx['B':,df_s['D']['d']>0.3],idx[df_s.sum()>4]]
Big D E
Small d f d e f
Upper Lower
B a 0.622251 0.970421 0.798430 0.075585 0.453897
b 0.660225 0.884619 0.646215 0.251076 0.753128
c 0.336650 0.472089 0.750627 0.920971 0.131141
C a 0.875975 0.449710 0.062922 0.931482 0.595037
b 0.608213 0.744773 0.768816 0.364518 0.787751
c 0.972491 0.000541 0.236157 0.951343 0.572702
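Since the table above is random, the result changes from run to run. The following sketch uses a smaller, deterministic frame (an assumption made for illustration) so the effect of pd.IndexSlice on both axes can be verified by hand.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']], names=('Upper', 'Lower'))
cols = pd.MultiIndex.from_product([['D', 'E'], ['d', 'e']], names=('Big', 'Small'))
toy = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

idx = pd.IndexSlice
# Rows: Upper from 'B' onwards, any Lower. Columns: any Big, Small equal to 'e'.
picked = toy.loc[idx['B':, :], idx[:, 'e']]

print(picked.to_numpy().tolist())
```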
(4) Exchange of index layer
1. The swaplevel method (two-layer exchange)
>>> df_using_mul.head()
School Gender Height Weight Math Physics
Class Address
C_1 street_1 S_1 M 173 63 34.0 A+
street_2 S_1 F 192 73 32.5 B+
street_2 S_1 M 186 82 87.2 B+
street_2 S_1 F 167 81 80.4 B-
street_4 S_1 F 159 64 84.8 B+
>>> df_using_mul.swaplevel(i=1,j=0,axis=0).sort_index().head()
School Gender Height Weight Math Physics
Address Class
street_1 C_1 S_1 M 173 63 34.0 A+
C_2 S_2 M 175 74 47.2 B-
C_3 S_1 F 175 57 87.7 A-
street_2 C_1 S_1 F 192 73 32.5 B+
C_1 S_1 M 186 82 87.2 B+
2. reorder_levels method (multi-layer exchange)
>>> df_muls = df.set_index(['School','Class','Address'])
>>> df_muls.head()
Gender Height Weight Math Physics
School Class Address
S_1 C_1 street_1 M 173 63 34.0 A+
street_2 F 192 73 32.5 B+
street_2 M 186 82 87.2 B+
street_2 F 167 81 80.4 B-
street_4 F 159 64 84.8 B+
>>> df_muls.reorder_levels([2,0,1],axis=0).sort_index().head()
Gender Height Weight Math Physics
Address School Class
street_1 S_1 C_1 M 173 63 34.0 A+
C_3 F 175 57 87.7 A-
S_2 C_2 M 175 74 47.2 B-
street_2 S_1 C_1 F 192 73 32.5 B+
C_1 M 186 82 87.2 B+
# If the index levels have names, the names can be used directly
>>> df_muls.reorder_levels(['Address','School','Class'],axis=0).sort_index().head()
Gender Height Weight Math Physics
Address School Class
street_1 S_1 C_1 M 173 63 34.0 A+
C_3 F 175 57 87.7 A-
S_2 C_2 M 175 74 47.2 B-
street_2 S_1 C_1 F 192 73 32.5 B+
C_1 M 186 82 87.2 B+
3. Index setting
(1) index_col parameter
index_col is a parameter in read_csv, not a method:
>>> pd.read_csv('data/table.csv',index_col=['Address','School']).head()
Class ID Gender Height Weight Math Physics
Address School
street_1 S_1 C_1 1101 M 173 63 34.0 A+
street_2 S_1 C_1 1102 F 192 73 32.5 B+
S_1 C_1 1103 M 186 82 87.2 B+
S_1 C_1 1104 F 167 81 80.4 B-
street_4 S_1 C_1 1105 F 159 64 84.8 B+
(2) reindex and reindex_like
reindex conforms the table to a new index. Its key feature is index alignment, and it is often used for reordering:
>>> df.head()
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
>>> df.reindex(index=[1101,1203,1206,2402])
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173.0 63.0 34.0 A+
1203 S_1 C_2 M street_6 160.0 53.0 58.8 A+
1206 NaN NaN NaN NaN NaN NaN NaN NaN
2402 S_2 C_4 M street_7 166.0 82.0 48.7 B
>>> df.reindex(columns=['Height','Gender','Average']).head()
Height Gender Average
ID
1101 173 M NaN
1102 192 F NaN
1103 186 M NaN
1104 167 F NaN
1105 159 F NaN
Missing values can be filled using fill_value or method (bfill/ffill/nearest); the method parameter requires the index to be monotonic:
# bfill fills with the next valid row after index 1206, ffill with the previous valid row, and nearest with the closest one
>>> df.reindex(index=[1101,1203,1206,2402],method='bfill')
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1203 S_1 C_2 M street_6 160 53 58.8 A+
1206 S_1 C_3 M street_4 161 68 31.5 B+
2402 S_2 C_4 M street_7 166 82 48.7 B
# Numerically, 1205 is closer to 1206 than 1301 is, so the former is used for filling
>>> df.reindex(index=[1101,1203,1206,2402],method='nearest')
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1203 S_1 C_2 M street_6 160 53 58.8 A+
1206 S_1 C_2 F street_6 167 63 68.4 B-
2402 S_2 C_4 M street_7 166 82 48.7 B
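The four filling options can be compared on a short Series. The values and labels below are hypothetical and chosen so that no label is equally close to two neighbours.

```python
import pandas as pd

s = pd.Series([10, 40, 70], index=[1, 4, 7])   # hypothetical values, monotonic index
new_index = [1, 2, 6, 7]

forward = s.reindex(new_index, method='ffill')     # take the previous existing label
backward = s.reindex(new_index, method='bfill')    # take the next existing label
nearest = s.reindex(new_index, method='nearest')   # take the numerically closest label
constant = s.reindex(new_index, fill_value=0)      # fill missing labels with a constant

print(forward.tolist(), backward.tolist(), nearest.tolist(), constant.tolist())
```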
reindex_like produces a DataFrame whose row and column indexes exactly match those of the DataFrame passed as the argument, with the data taken from the calling table:
>>> df_temp = pd.DataFrame({'Weight':np.zeros(5),
... 'Height':np.zeros(5),
... 'ID':[1101,1104,1103,1106,1102]}).set_index('ID')
>>> df_temp.reindex_like(df[0:5][['Weight','Height']])
Weight Height
ID
1101 0.0 0.0
1102 0.0 0.0
1103 0.0 0.0
1104 0.0 0.0
1105 NaN NaN
If the index of df_temp is monotonic, you can also use the method parameter:
>>> df_temp = pd.DataFrame({'Weight':range(5),
... 'Height':range(5),
... 'ID':[1101,1104,1103,1106,1102]}).set_index('ID').sort_index()
# You can verify for yourself whether the value at 1105 is filled by the bfill rule
>>> df_temp.reindex_like(df[0:5][['Weight','Height']],method='bfill')
Weight Height
ID
1101 0 0
1102 4 4
1103 2 2
1104 1 1
1105 3 3
(3) set_index and reset_index
First, set_index: as the name suggests, it turns certain columns into the index.
Use a column of the table as the index:
>>> df.head()
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
>>> df.set_index('Class').head()
School Gender Address Height Weight Math Physics
Class
C_1 S_1 M street_1 173 63 34.0 A+
C_1 S_1 F street_2 192 73 32.5 B+
C_1 S_1 M street_2 186 82 87.2 B+
C_1 S_1 F street_2 167 81 80.4 B-
C_1 S_1 F street_4 159 64 84.8 B+
Use the append parameter to keep the current index unchanged:
>>> df.set_index('Class', append=True).head()
School Gender Address Height Weight Math Physics
ID Class
1101 C_1 S_1 M street_1 173 63 34.0 A+
1102 C_1 S_1 F street_2 192 73 32.5 B+
1103 C_1 S_1 M street_2 186 82 87.2 B+
1104 C_1 S_1 F street_2 167 81 80.4 B-
1105 C_1 S_1 F street_4 159 64 84.8 B+
To use a sequence with the same length as the table as the index, convert it to a Series first, otherwise an error is raised:
>>> df.set_index(pd.Series(range(df.shape[0]))).head()
School Class Gender Address Height Weight Math Physics
0 S_1 C_1 M street_1 173 63 34.0 A+
1 S_1 C_1 F street_2 192 73 32.5 B+
2 S_1 C_1 M street_2 186 82 87.2 B+
3 S_1 C_1 F street_2 167 81 80.4 B-
4 S_1 C_1 F street_4 159 64 84.8 B+
You can add multi-level indexes directly:
>>> df.set_index([pd.Series(range(df.shape[0])),pd.Series(np.ones(df.shape[0]))]).head()
School Class Gender Address Height Weight Math Physics
0 1.0 S_1 C_1 M street_1 173 63 34.0 A+
1 1.0 S_1 C_1 F street_2 192 73 32.5 B+
2 1.0 S_1 C_1 M street_2 186 82 87.2 B+
3 1.0 S_1 C_1 F street_2 167 81 80.4 B-
4 1.0 S_1 C_1 F street_4 159 64 84.8 B+
Next is the reset_index method, whose main function is to reset the index.
By default it reverts directly to the natural integer index:
>>> df.reset_index().head()
ID School Class Gender Address Height Weight Math Physics
0 1101 S_1 C_1 M street_1 173 63 34.0 A+
1 1102 S_1 C_1 F street_2 192 73 32.5 B+
2 1103 S_1 C_1 M street_2 186 82 87.2 B+
3 1104 S_1 C_1 F street_2 167 81 80.4 B-
4 1105 S_1 C_1 F street_4 159 64 84.8 B+
Use the level parameter to specify which index level is reset, and the col_level parameter to specify which column level it is inserted into:
>>> L1,L2 = ['A','B','C'],['a','b','c']
>>> mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
>>> L3,L4 = ['D','E','F'],['d','e','f']
>>> mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
>>> df_temp = pd.DataFrame(np.random.rand(9,9),index=mul_index1,columns=mul_index2)
>>> df_temp.head()
Big D E F
Small d e f d e f d e f
Upper Lower
A a 0.996924 0.779796 0.198003 0.876215 0.801679 0.740366 0.072776 0.172737 0.103133
b 0.856929 0.384369 0.988760 0.300426 0.109809 0.445339 0.735657 0.109474 0.632733
c 0.631834 0.748637 0.378666 0.696078 0.404629 0.747714 0.237205 0.988239 0.260963
B a 0.740106 0.995469 0.005640 0.204483 0.958359 0.737188 0.696751 0.900894 0.275091
b 0.026315 0.251426 0.594558 0.313601 0.145479 0.433199 0.704520 0.366411 0.473218
>>> df_temp1 = df_temp.reset_index(level=1,col_level=1)
>>> df_temp1.head()
Big D E F
Small Lower d e f d e f d e f
Upper
A a 0.996924 0.779796 0.198003 0.876215 0.801679 0.740366 0.072776 0.172737 0.103133
A b 0.856929 0.384369 0.988760 0.300426 0.109809 0.445339 0.735657 0.109474 0.632733
A c 0.631834 0.748637 0.378666 0.696078 0.404629 0.747714 0.237205 0.988239 0.260963
B a 0.740106 0.995469 0.005640 0.204483 0.958359 0.737188 0.696751 0.900894 0.275091
B b 0.026315 0.251426 0.594558 0.313601 0.145479 0.433199 0.704520 0.366411 0.473218
# 'Lower' has indeed been inserted into the second column level
>>> df_temp1.columns
MultiIndex([( '', 'Lower'),
('D', 'd'),
('D', 'e'),
('D', 'f'),
('E', 'd'),
('E', 'e'),
('E', 'f'),
('F', 'd'),
('F', 'e'),
('F', 'f')],
names=['Big', 'Small'])
# The innermost index level has been moved out
>>> df_temp1.index
Index(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'], dtype='object', name='Upper')
(4) rename_axis and rename
rename_axis is typically used with multi-level indexes. It modifies the name of an index level, not the index labels:
>>> df_temp.rename_axis(index={'Lower':'LowerLower'},columns={'Big':'BigBig'})[['D', 'E']]
BigBig D E
Small d e f d e f
Upper LowerLower
A a 0.996924 0.779796 0.198003 0.876215 0.801679 0.740366
b 0.856929 0.384369 0.988760 0.300426 0.109809 0.445339
c 0.631834 0.748637 0.378666 0.696078 0.404629 0.747714
B a 0.740106 0.995469 0.005640 0.204483 0.958359 0.737188
b 0.026315 0.251426 0.594558 0.313601 0.145479 0.433199
c 0.642152 0.803119 0.869278 0.643841 0.933842 0.373142
C a 0.419632 0.187484 0.420311 0.136625 0.512117 0.167024
b 0.123571 0.571580 0.201483 0.788676 0.067141 0.955275
c 0.075575 0.832965 0.934871 0.549695 0.511443 0.286503
The rename method is used to modify column or row index labels, not index names:
>>> df_temp.rename(index={'A':'T'},columns={'e':'changed_e'}).head()
Big D E F
Small d changed_e f d changed_e f d changed_e f
Upper Lower
T a 0.996924 0.779796 0.198003 0.876215 0.801679 0.740366 0.072776 0.172737 0.103133
b 0.856929 0.384369 0.988760 0.300426 0.109809 0.445339 0.735657 0.109474 0.632733
c 0.631834 0.748637 0.378666 0.696078 0.404629 0.747714 0.237205 0.988239 0.260963
B a 0.740106 0.995469 0.005640 0.204483 0.958359 0.737188 0.696751 0.900894 0.275091
b 0.026315 0.251426 0.594558 0.313601 0.145479 0.433199 0.704520 0.366411 0.473218
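The contrast between the two methods is easier to verify on a deterministic frame. A minimal sketch with hypothetical data:

```python
import pandas as pd

index = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']], names=('Upper', 'Lower'))
toy = pd.DataFrame({'x': range(4)}, index=index)   # hypothetical data

renamed_axis = toy.rename_axis(index={'Lower': 'LowerLower'})   # changes level names
renamed_labels = toy.rename(index={'A': 'T'})                   # changes labels

print(list(renamed_axis.index.names))
print(list(renamed_labels.index.get_level_values('Upper')))
```

rename_axis leaves the labels untouched, and rename leaves the level names untouched.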
4. Commonly used index functions
(1) where function
where fills the cells for which the condition is False:
>>> df.head()
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
# All rows that do not satisfy the condition are set to NaN
>>> df.where(df['Gender']=='M').head()
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173.0 63.0 34.0 A+
1102 NaN NaN NaN NaN NaN NaN NaN NaN
1103 S_1 C_1 M street_2 186.0 82.0 87.2 B+
1104 NaN NaN NaN NaN NaN NaN NaN NaN
1105 NaN NaN NaN NaN NaN NaN NaN NaN
Filtering in this way and then dropping the missing rows selects the same rows as the [] operator (although the integer columns are converted to float):
>>> df.where(df['Gender']=='M').dropna().head()
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173.0 63.0 34.0 A+
1103 S_1 C_1 M street_2 186.0 82.0 87.2 B+
1201 S_1 C_2 M street_5 188.0 68.0 97.0 A-
1203 S_1 C_2 M street_6 160.0 53.0 58.8 A+
1301 S_1 C_3 M street_4 161.0 68.0 31.5 B+
The first parameter is the boolean condition and the second parameter is the fill value:
>>> df.where(df['Gender']=='M',np.random.rand(df.shape[0],df.shape[1])).head()
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173.000000 63.000000 34.000000 A+
1102 0.147943 0.670993 0.93367 0.16424 0.314864 0.121429 0.433781 0.074907
1103 S_1 C_1 M street_2 186.000000 82.000000 87.200000 B+
1104 0.749106 0.9844 0.184485 0.674807 0.738321 0.525289 0.019779 0.19905
1105 0.534726 0.657987 0.370359 0.89066 0.613029 0.456765 0.389943 0.756956
(2) mask function
The mask function does the opposite of where and is otherwise identical: it fills the cells whose condition is True:
>>> df.mask(df['Gender']=='M').dropna().head()
School Class Gender Address Height Weight Math Physics
ID
1102 S_1 C_1 F street_2 192.0 73.0 32.5 B+
1104 S_1 C_1 F street_2 167.0 81.0 80.4 B-
1105 S_1 C_1 F street_4 159.0 64.0 84.8 B+
1202 S_1 C_2 F street_4 176.0 94.0 63.5 B-
1204 S_1 C_2 F street_5 162.0 63.0 33.8 B
>>> df.mask(df['Gender']=='M',np.random.rand(df.shape[0],df.shape[1])).head()
School Class Gender Address Height Weight Math Physics
ID
1101 0.532138 0.841576 0.885163 0.169569 0.983056 0.714640 0.820599 0.012835
1102 S_1 C_1 F street_2 192.000000 73.000000 32.500000 B+
1103 0.538961 0.155097 0.401648 0.283565 0.617196 0.260921 0.395324 0.478259
1104 S_1 C_1 F street_2 167.000000 81.000000 80.400000 B-
1105 S_1 C_1 F street_4 159.000000 64.000000 84.800000 B+
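The relationship between where and mask can be checked on a short Series. The values below are hypothetical, and a constant fill value is used instead of random numbers so the result is reproducible.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])   # hypothetical values
cond = s > 2

kept = s.where(cond, 0)       # keep values where cond is True, fill the rest with 0
hidden = s.mask(cond, 0)      # fill values where cond is True, keep the rest

# where(cond) and mask(~cond) describe the same operation
same = s.where(cond).equals(s.mask(~cond))

print(kept.tolist(), hidden.tolist(), same)
```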
(3) query function
>>> df.head()
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
Within the Boolean expression passed to query, the following are all legal: row and column index names, strings, and/not/or/&/|/~/not in/in/==/!=, and the four arithmetic operators:
>>> df.query('(Address in ["street_6","street_7"])&(Weight>(70+10))&(ID in [1303,2304,2402])')
School Class Gender Address Height Weight Math Physics
ID
1303 S_1 C_3 M street_7 188 82 49.7 B
2304 S_2 C_3 F street_6 164 81 95.5 A-
2402 S_2 C_4 M street_7 166 82 48.7 B
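A query string is a compact spelling of an ordinary Boolean mask. The sketch below, on a hypothetical three-row frame, writes the same filter both ways and compares the results; it uses a column name, the index name ID, in, and arithmetic inside the expression.

```python
import pandas as pd

# Hypothetical toy data, not the article's table.
toy = pd.DataFrame({'Address': ['street_6', 'street_7', 'street_1'],
                    'Weight': [81, 70, 90]},
                   index=pd.Index([2304, 1303, 1101], name='ID'))

by_query = toy.query(
    '(Address in ["street_6", "street_7"]) & (Weight > 70 + 10) & (ID != 1101)')
by_mask = toy[toy['Address'].isin(['street_6', 'street_7'])
              & (toy['Weight'] > 80)
              & (toy.index != 1101)]

print(by_query.equals(by_mask), list(by_query.index))
```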
5. Handling duplicate elements
(1) duplicated method
This method returns a Boolean Series marking the duplicates:
>>> df.duplicated('Class').head()
ID
1101 False
1102 True
1103 True
1104 True
1105 True
dtype: bool
The optional parameter keep defaults to 'first', which marks the first occurrence as not duplicated; 'last' marks the last occurrence as not duplicated; False marks every repeated value as True:
>>> df.duplicated('Class',keep='last').tail()
ID
2401 True
2402 True
2403 True
2404 True
2405 False
dtype: bool
>>> df.duplicated('Class',keep=False).head()
ID
1101 True
1102 True
1103 True
1104 True
1105 True
dtype: bool
(2) drop_duplicates method
As the name suggests, this method removes duplicates. It can be useful in the grouping operations of later chapters, for example when the first value of each group needs to be retained:
>>> df.drop_duplicates('Class')
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1201 S_1 C_2 M street_5 188 68 97.0 A-
1301 S_1 C_3 M street_4 161 68 31.5 B+
2401 S_2 C_4 F street_2 192 62 45.3 A
The parameters are similar to those of the duplicated method:
>>> df.drop_duplicates('Class',keep='last')
School Class Gender Address Height Weight Math Physics
ID
2105 S_2 C_1 M street_4 170 81 34.2 A
2205 S_2 C_2 F street_7 183 76 85.4 B
2305 S_2 C_3 M street_4 187 73 48.9 B
2405 S_2 C_4 F street_6 193 54 47.6 B
When multiple columns are passed in, they are treated together like a multi-level index when comparing duplicates:
>>> df.drop_duplicates(['School','Class'])
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1201 S_1 C_2 M street_5 188 68 97.0 A-
1301 S_1 C_3 M street_4 161 68 31.5 B+
2101 S_2 C_1 M street_7 174 84 83.3 C
2201 S_2 C_2 M street_5 193 100 39.1 B
2301 S_2 C_3 F street_4 157 78 72.3 B+
2401 S_2 C_4 F street_2 192 62 45.3 A
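The two methods are closely related: dropping duplicates amounts to keeping the rows that duplicated marks as False. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data, not the article's table.
toy = pd.DataFrame({'Class': ['C_1', 'C_1', 'C_2', 'C_2'],
                    'Math': [34.0, 32.5, 97.0, 63.5]},
                   index=[1101, 1102, 1201, 1202])

first_kept = toy.drop_duplicates('Class')
via_mask = toy[~toy.duplicated('Class')]               # same rows, built by hand
last_kept = toy.drop_duplicates('Class', keep='last')
none_kept = toy.drop_duplicates('Class', keep=False)   # every repeated Class is dropped

print(list(first_kept.index), list(last_kept.index), len(none_kept))
```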
6. Sampling function
The sampling function here refers to the sample function
(1) n is the sample size
>>> df.sample(n=5)
School Class Gender Address Height Weight Math Physics
ID
2205 S_2 C_2 F street_7 183 76 85.4 B
2202 S_2 C_2 F street_7 194 77 68.5 B+
2101 S_2 C_1 M street_7 174 84 83.3 C
1103 S_1 C_1 M street_2 186 82 87.2 B+
1301 S_1 C_3 M street_4 161 68 31.5 B+
(2) frac is the sample ratio
>>> df.sample(frac=0.05)
School Class Gender Address Height Weight Math Physics
ID
1104 S_1 C_1 F street_2 167 81 80.4 B-
2105 S_2 C_1 M street_4 170 81 34.2 A
(3) replace controls whether sampling is done with replacement
>>> df.sample(n=df.shape[0],replace=True).head()
School Class Gender Address Height Weight Math Physics
ID
2302 S_2 C_3 M street_5 171 88 32.7 A
1101 S_1 C_1 M street_1 173 63 34.0 A+
2305 S_2 C_3 M street_4 187 73 48.9 B
2101 S_2 C_1 M street_7 174 84 83.3 C
1304 S_1 C_3 M street_2 195 70 85.2 A
>>> df.sample(n=35,replace=True).index.is_unique
False
(4) axis is the sampling dimension; the default is 0, that is, rows are sampled
>>> df.sample(n=3,axis=1).head() # here axis is 1, so 3 columns are sampled
Address Class Weight
ID
1101 street_1 C_1 63
1102 street_2 C_1 73
1103 street_2 C_1 82
1104 street_2 C_1 81
1105 street_4 C_1 64
(5) weights is the sample weight, which is automatically normalized
>>> df.sample(n=3,weights=np.random.rand(df.shape[0])).head()
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
2302 S_2 C_3 M street_5 171 88 32.7 A
1105 S_1 C_1 F street_4 159 64 84.8 B+
# Use a column as the weights, which is common in sampling theory
# The probability of being drawn is proportional to the Math value
>>> df.sample(n=3,weights=df['Math']).head()
School Class Gender Address Height Weight Math Physics
ID
2304 S_2 C_3 F street_6 164 81 95.5 A-
1201 S_1 C_2 M street_5 188 68 97.0 A-
2205 S_2 C_2 F street_7 183 76 85.4 B
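Because sample is random, the outputs above differ on every run. The sketch below uses hypothetical data and the random_state parameter, which is not covered in this article but is part of the sample signature, to make the draw repeatable. It also shows that rows with zero weight are never drawn.

```python
import pandas as pd

# Hypothetical toy data, not the article's table.
toy = pd.DataFrame({'Math': [34.0, 32.5, 87.2, 80.4]},
                   index=[1101, 1102, 1103, 1104])

half = toy.sample(frac=0.5, random_state=0)                       # 2 of the 4 rows
weighted = toy.sample(n=2, weights=[0, 1, 1, 0], random_state=0)  # zero weight: never drawn
repeat = toy.sample(frac=0.5, random_state=0)                     # same seed, same rows

print(len(half), sorted(weighted.index), half.equals(repeat))
```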