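3.4 Numerical Operations in Pandas
Ufuncs: Index Preservation
NumPy's universal functions (ufuncs) also work on pandas Series and DataFrame objects.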
import numpy as np
import pandas as pd
mg = np.random.RandomState(42)
ser = pd.Series(mg.randint(0, 10, 4))
ser
0 6
1 3
2 7
3 4
dtype: int32
df = pd.DataFrame(mg.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])
df
   A  B  C  D
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4
Applying a NumPy ufunc to either of these objects produces another pandas object with the index preserved.
np.exp(ser)
0 403.428793
1 20.085537
2 1096.633158
3 54.598150
dtype: float64
np.sin(df*np.pi/4)
          A             B         C             D
0 -1.000000  7.071068e-01  1.000000 -1.000000e+00
1 -0.707107  1.224647e-16  0.707107 -7.071068e-01
2 -0.707107  1.000000e+00 -0.707107  1.224647e-16
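One way to confirm that the labels survive a ufunc call is to compare the index and columns before and after it. The following is a minimal sketch; the string labels ('a' to 'c', 'x', 'y') and the values are made up for illustration and do not come from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with non-default labels so preservation is easy to see
ser = pd.Series([0.0, 1.0, 2.0], index=['a', 'b', 'c'])
df = pd.DataFrame({'A': [0.0, 1.0], 'B': [2.0, 3.0]}, index=['x', 'y'])

exp_ser = np.exp(ser)   # result is still a Series
sin_df = np.sin(df)     # result is still a DataFrame

print(type(exp_ser).__name__, list(exp_ser.index))
print(type(sin_df).__name__, list(sin_df.index), list(sin_df.columns))
```

Only the values change; the row and column labels are carried over unchanged.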
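Ufuncs: Index Alignment
For binary operations on two objects, pandas aligns their indexes while computing the result, which is very convenient when working with incomplete data.
Index Alignment in Series
The resulting index is the union of the two input indexes; any label missing from either input gets NaN.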
area = pd.Series({'Jinan': 720, 'Qingdao': 882, 'Linyi': 1021}, name='area')
popu = pd.Series({'Weihai': 554, 'Jinan': 800, 'Linyi': 998}, name='popu')
popu / area
Jinan 1.111111
Linyi 0.977473
Qingdao NaN
Weihai NaN
dtype: float64
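# The index of the result above is the union of the two indexes; it can also be computed directly
# (Index.union is used here because `|` between Index objects no longer means set union in recent pandas versions)
area.index.union(popu.index)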
Index(['Jinan', 'Linyi', 'Qingdao', 'Weihai'], dtype='object')
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B
0 NaN
1 5.0
2 9.0
3 NaN
dtype: float64
# Equivalent method call
A.add(B)
0 NaN
1 5.0
2 9.0
3 NaN
dtype: float64
# To avoid NaN, pass fill_value to substitute a default for entries missing from one side
A.add(B, fill_value=0)
0 2.0
1 5.0
2 9.0
3 5.0
dtype: float64
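The same pattern applies to the other arithmetic operators, each of which has a method form that accepts fill_value. The sketch below reuses the A and B Series from above and checks each operator against its method; treat it as an illustration rather than an exhaustive list.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Operator on the left, equivalent method call on the right
pairs = {
    '+': (A + B, A.add(B)),
    '-': (A - B, A.sub(B)),
    '*': (A * B, A.mul(B)),
    '/': (A / B, A.truediv(B)),
    '//': (A // B, A.floordiv(B)),
    '%': (A % B, A.mod(B)),
    '**': (A ** B, A.pow(B)),
}
for symbol, (by_operator, by_method) in pairs.items():
    # Series.equals treats NaN values in the same position as equal
    print(symbol, by_operator.equals(by_method))

# With fill_value, a label missing from one side is treated as that value
filled = A.sub(B, fill_value=0)
print(filled)
```

The operator form is shorter, while the method form is the one to reach for when a fill value is needed.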
Index Alignment in DataFrame
The row index and the columns of the result are likewise the unions of those of the two inputs.
A = pd.DataFrame(mg.randint(0, 20, (2, 2)), columns=list('AB'))
A
    A   B
0   1  11
1   5   1
B = pd.DataFrame(mg.randint(0, 10, (3, 3)), columns=list('BAC'))
B
   B  A  C
0  4  0  9
1  5  8  0
2  9  2  6
A + B
      A     B   C
0   1.0  15.0 NaN
1  13.0   6.0 NaN
2   NaN   NaN NaN
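# Use the mean of all values in A as the fill value when combining with B
fill = A.stack().mean()  # stack moves the columns into the index, giving a single Series of all values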
A.add(B, fill_value=fill)
      A     B     C
0   1.0  15.0  13.5
1  13.0   6.0   4.5
2   6.5  13.5  10.5
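It may help to see exactly which cells fill_value affects. In the sketch below, left and right are hypothetical frames (not the A and B above) that overlap only in row 1 and column 'B'. A cell present in just one frame is combined with the fill value, while a cell absent from both frames stays NaN.

```python
import pandas as pd

# Hypothetical frames that overlap only in row 1 and column 'B'
left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=[0, 1])
right = pd.DataFrame({'B': [10, 20], 'C': [30, 40]}, index=[1, 2])

result = left.add(right, fill_value=0)
print(result)

# (1, 'B') exists in both frames, so it is a plain sum
print(result.loc[1, 'B'])
# (0, 'C') exists in neither frame, so there is nothing to fill
print(pd.isna(result.loc[0, 'C']))
```

In the A.add(B, fill_value=fill) example above no NaN remains, because B supplies a value for every cell of the combined shape.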
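Ufuncs: Operations Between DataFrame and Series
These follow rules similar to those for operations between a two-dimensional and a one-dimensional NumPy array.
# In NumPy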
A = mg.randint(10, size=(3, 4))
A
array([[3, 8, 2, 4],
[2, 6, 4, 8],
[6, 1, 3, 8]])
A - A[0]  # by the broadcasting rules, the first row is subtracted from every row
array([[ 0, 0, 0, 0],
[-1, -2, 2, 4],
[ 3, -7, 1, 4]])
# pandas also operates row-wise by default
df = pd.DataFrame(A, columns=list('QRST'))
df
   Q  R  S  T
0  3  8  2  4
1  2  6  4  8
2  6  1  3  8
df - df.iloc[0]
   Q  R  S  T
0  0  0  0  0
1 -1 -2  2  4
2  3 -7  1  4
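The results match NumPy here, but the matching works differently: pandas pairs values by label, NumPy by position. The sketch below makes the difference visible by reversing the order of the first row's labels; the name shuffled is introduced only for this illustration.

```python
import numpy as np
import pandas as pd

values = np.array([[3, 8, 2, 4],
                   [2, 6, 4, 8],
                   [6, 1, 3, 8]])
df = pd.DataFrame(values, columns=list('QRST'))

# Same first row, but with its labels in reverse order (T, S, R, Q)
shuffled = df.iloc[0][::-1]

by_label = df - shuffled                     # pandas matches Q with Q, R with R, ...
by_position = values - shuffled.to_numpy()   # NumPy subtracts element by element

print(by_label)
print(by_position)
```

The pandas result is the same as df - df.iloc[0], while the NumPy result differs because the reversed values are subtracted in the order they appear.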
# To operate column-wise instead, use the operator method with the axis parameter
df.subtract(df['R'], axis=0)
   Q  R  S  T
0 -5  0 -6 -4
1 -4  0 -2  2
2  5  0  2  7
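A common use of the axis parameter is centering data. As a sketch, using the same values as df above: subtracting df.mean() relies on the default and matches the Series against the column labels, while subtracting the row means needs axis=0 so that the Series is matched against the row labels.

```python
import pandas as pd

df = pd.DataFrame([[3, 8, 2, 4],
                   [2, 6, 4, 8],
                   [6, 1, 3, 8]], columns=list('QRST'))

# Default: the Series index is matched against the column labels,
# so each column has its own mean subtracted
col_centered = df - df.mean()

# axis=0: the Series index is matched against the row labels,
# so each row has its own mean subtracted
row_centered = df.sub(df.mean(axis=1), axis=0)

print(col_centered.mean())         # per-column means, approximately 0
print(row_centered.mean(axis=1))   # per-row means, approximately 0
```

Without axis=0, the row means (labeled 0, 1, 2) would be matched against the columns Q, R, S, T, no labels would coincide, and the result would be all NaN.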
# These operations also align on index labels
halfrow = df.iloc[0, ::2]  # row 0, every other column
halfrow
Q 3
S 2
Name: 0, dtype: int32
df - halfrow
     Q   R    S   T
0  0.0 NaN  0.0 NaN
1 -1.0 NaN  2.0 NaN
2  3.0 NaN  1.0 NaN
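If the NaN columns are unwanted, one possible approach (a choice made for this sketch, not something prescribed above) is to give the Series every column label before subtracting, using Series.reindex with fill_value. The name full_row is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame([[3, 8, 2, 4],
                   [2, 6, 4, 8],
                   [6, 1, 3, 8]], columns=list('QRST'))
halfrow = df.iloc[0, ::2]   # labels Q and S only

# Give the Series every column label first, using 0 where it had no value
full_row = halfrow.reindex(df.columns, fill_value=0)
result = df - full_row
print(result)
```

Columns R and T are then left unchanged, because 0 is subtracted from them, and Q and S are shifted as before.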