[DA] pandas Arithmetic and Data Alignment


1 Arithmetic and Data Alignment

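One of the most important features of pandas is that it can perform arithmetic between objects with different indexes. When objects are added, if any index labels are not shared by both, the index of the result is the union of the two indexes.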

import pandas as pd

s1=pd.Series([7.3,-2.5,3.4,1.5],index=list('acde'))
s2=pd.Series([-2.1,3.6,-1.5,4,3.1],index=list('acefg'))

s1
# a    7.3
# c   -2.5
# d    3.4
# e    1.5
# dtype: float64

s2
# a   -2.1
# c    3.6
# e   -1.5
# f    4.0
# g    3.1
# dtype: float64

s1+s2
# a    5.2
# c    1.1
# d    NaN
# e    0.0
# f    NaN
# g    NaN
# dtype: float64

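Automatic data alignment introduces NaN values at the index labels that do not overlap, and missing values then propagate through any further arithmetic.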
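To make the propagation concrete, here is a minimal sketch that reuses s1 and s2 from above. The names total and scaled are introduced only for this illustration.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=list('acde'))
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=list('acefg'))

total = s1 + s2

# The result index is the sorted union of both indexes.
print(list(total.index))                 # ['a', 'c', 'd', 'e', 'f', 'g']

# Labels present in only one operand come out as NaN.
print(list(total[total.isna()].index))   # ['d', 'f', 'g']

# NaN propagates: further arithmetic on a missing value stays missing.
scaled = total * 10
print(scaled.isna().equals(total.isna()))  # True

# dropna() keeps only the labels shared by both operands.
print(list(total.dropna().index))        # ['a', 'c', 'e']
```

Note that 'e' survives dropna(): its sum is 0.0, which is a real value and not a missing one.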

For a DataFrame, alignment happens on both the rows and the columns:

import pandas as pd
import numpy as np

df1=pd.DataFrame(np.arange(9).reshape(3,3),columns=list('bcd'),index=['ohio','texas','colorado'])
df2=pd.DataFrame(np.arange(12.).reshape(4,3),columns=list('bde'),index=['utah','ohio','texas','oregon'])
df1
#           b  c  d
# ohio      0  1  2
# texas     3  4  5
# colorado  6  7  8

df2
#           b     d     e
# utah    0.0   1.0   2.0
# ohio    3.0   4.0   5.0
# texas   6.0   7.0   8.0
# oregon  9.0  10.0  11.0

df1+df2
#             b   c     d   e
# colorado  NaN NaN   NaN NaN
# ohio      3.0 NaN   6.0 NaN
# oregon    NaN NaN   NaN NaN
# texas     9.0 NaN  12.0 NaN
# utah      NaN NaN   NaN NaN
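One way to picture what df1+df2 does is to align the two frames explicitly first. The sketch below uses DataFrame.align, which by default performs an outer join on both axes. The names left, right and total are introduced here for illustration, and this is meant as a conceptual picture, not a description of pandas internals.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('bcd'),
                   index=['ohio', 'texas', 'colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape(4, 3), columns=list('bde'),
                   index=['utah', 'ohio', 'texas', 'oregon'])

# Reindex both frames to the union of row labels and column labels.
left, right = df1.align(df2)
print(list(left.index))    # ['colorado', 'ohio', 'oregon', 'texas', 'utah']
print(list(left.columns))  # ['b', 'c', 'd', 'e']

# Adding the aligned frames gives the same result as df1 + df2.
total = df1 + df2
print((left + right).equals(total))  # True

# Only cells whose row AND column exist in both frames hold a number.
print(int(total.notna().sum().sum()))  # 4
```

The four surviving cells are rows ohio and texas in columns b and d, which is exactly the overlap of the two frames.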

2 Arithmetic Methods with Fill Values

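When doing arithmetic between objects with different indexes, we may want to substitute 0 or some other special value wherever a label is missing from one of the objects, so that the calculation can still be carried out.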

df1.add(df2,fill_value=0)
#             b    c     d     e
# colorado  6.0  7.0   8.0   NaN
# ohio      3.0  1.0   6.0   5.0
# oregon    9.0  NaN  10.0  11.0
# texas     9.0  4.0  12.0   8.0
# utah      0.0  NaN   1.0   2.0

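Addition rules with fill_value=0

  • number + NaN = number + 0 = number (a value missing from only one object is replaced by the fill value)
  • NaN + NaN = NaN (a value missing from both objects stays NaN)

Method  Description
add     Method for addition (+)
sub     Method for subtraction (-)
div     Method for division (/)
mul     Method for multiplication (*)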
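A short sketch of the rules above, reusing df1 and df2 together with two tiny Series (a and b, made up for this illustration). It also shows that the other methods in the table accept fill_value in the same way.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('bcd'),
                   index=['ohio', 'texas', 'colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape(4, 3), columns=list('bde'),
                   index=['utah', 'ohio', 'texas', 'oregon'])

filled = df1.add(df2, fill_value=0)

# Present in df1 only: treated as 6 + 0.
print(filled.loc['colorado', 'b'])          # 6.0
# Missing from both frames: stays NaN even with fill_value.
print(np.isnan(filled.loc['oregon', 'c']))  # True

# Without fill_value, the method form matches the + operator.
print(df1.add(df2).equals(df1 + df2))       # True

# Other methods take fill_value too; 1 is the neutral fill for mul.
a = pd.Series([2.0, 3.0], index=['x', 'y'])
b = pd.Series([10.0, 20.0], index=['y', 'z'])
print(a.mul(b, fill_value=1).tolist())  # [2.0, 30.0, 20.0]
print(a.sub(b, fill_value=0).tolist())  # [2.0, -7.0, -20.0]
```

The fill value should suit the operation: 0 leaves sums and differences unchanged, while 1 plays that role for products.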

3 Operations Between DataFrame and Series

arr=np.arange(12.).reshape(3,4) # array([[ 0.,  1.,  2.,  3.],
                                #        [ 4.,  5.,  6.,  7.],
                                #        [ 8.,  9., 10., 11.]])
arr[0] # array([0., 1., 2., 3.])
arr-arr[0] # array([[0., 0., 0., 0.],
            #        [4., 4., 4., 4.],
            #        [8., 8., 8., 8.]])

This is called broadcasting, which Chapter 12 explains in detail. Operations between a DataFrame and a Series work in much the same way:

frame=pd.DataFrame(np.arange(12.).reshape(4,3),
                   columns=list('bde'),
                   index=['utah','ohio','texas','oregon'])
series=frame.iloc[0]  # first row, selected by position
# b    0.0
# d    1.0
# e    2.0
# Name: utah, dtype: float64

frame
#           b     d     e
# utah    0.0   1.0   2.0
# ohio    3.0   4.0   5.0
# texas   6.0   7.0   8.0
# oregon  9.0  10.0  11.0

frame-series
#           b    d    e
# utah    0.0  0.0  0.0
# ohio    3.0  3.0  3.0
# texas   6.0  6.0  6.0
# oregon  9.0  9.0  9.0

series2=pd.Series(range(3),index=['b','e','f'])
# b    0
# e    1
# f    2
# dtype: int64

frame+series2
#           b   d     e   f
# utah    0.0 NaN   3.0 NaN
# ohio    3.0 NaN   6.0 NaN
# texas   6.0 NaN   9.0 NaN
# oregon  9.0 NaN  12.0 NaN
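The default behaviour shown above can be summarized as follows: the Series index is matched against the DataFrame's columns, and the operation is repeated for every row. Here is a minimal sketch that checks this against the NumPy broadcast and also covers the case where labels do not match. Names such as by_operator and by_method are introduced for illustration only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=list('bde'),
                     index=['utah', 'ohio', 'texas', 'oregon'])
series = frame.iloc[0]

by_operator = frame - series
by_method = frame.sub(series, axis='columns')  # explicit form of the default
print(by_operator.equals(by_method))  # True

# Same numbers as subtracting the first row with NumPy broadcasting.
print(np.array_equal(by_operator.to_numpy(),
                     frame.to_numpy() - frame.to_numpy()[0]))  # True

# Labels found on only one side produce all-NaN columns in the union.
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
result = frame + series2
print(list(result.columns))                   # ['b', 'd', 'e', 'f']
print(result[['d', 'f']].isna().all().all())  # True
```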

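If you instead want to match on the rows and broadcast across the columns, you must use one of the arithmetic methods and pass the axis to match on: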

series3=frame['d']
# utah       1.0
# ohio       4.0
# texas      7.0
# oregon    10.0
# Name: d, dtype: float64

frame.sub(series3,axis=0)
#           b    d    e
# utah   -1.0  0.0  1.0
# ohio   -1.0  0.0  1.0
# texas  -1.0  0.0  1.0
# oregon -1.0  0.0  1.0
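A common use of the two matching directions is centering data. The sketch below subtracts column means with the plain operator and row means with sub(..., axis=0). The names col_centered, row_centered and wrong are made up for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=list('bde'),
                     index=['utah', 'ohio', 'texas', 'oregon'])

# Column means are indexed by column label, so the default matching works.
col_centered = frame - frame.mean()
print(col_centered.mean().abs().max() < 1e-12)        # True

# Row means are indexed by row label, so match on axis=0 (the row index).
row_centered = frame.sub(frame.mean(axis=1), axis=0)
print(row_centered.mean(axis=1).abs().max() < 1e-12)  # True

# The plain operator with row means misaligns: row labels are compared
# with column labels, nothing matches, and every cell becomes NaN.
wrong = frame - frame.mean(axis=1)
print(wrong.isna().all().all())                       # True
```

The last case is worth remembering: a result full of NaN after a DataFrame and Series operation usually means the Series was matched against the wrong axis.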

Reposted from blog.csdn.net/qq_36056219/article/details/113182411