The difference between the transform and apply functions in pandas

There are two major differences between the transform and apply groupby methods.

  • apply implicitly passes all the columns for each group as a DataFrame to the custom function, while transform passes each column for each group as a Series to the custom function.
  • The custom function passed to apply can return a scalar, a Series or a DataFrame (or a numpy array or even a list). The custom function passed to transform must return a sequence (a one-dimensional Series, array or list) the same length as the group.

So, transform works on just one Series at a time and apply works on the entire DataFrame at once.

Source: https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object#

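transform function:

    1. It works on only one Series at a time. If the custom function is defined as column 'a' minus column 'b', an exception is raised.

    2. It must return a one-dimensional sequence with the same length (number of rows) as the group.

    3. Returning a single scalar also works, as in .transform(sum); the scalar is broadcast to every row of the group.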

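apply function:

    1. Unlike transform, which handles only one Series at a time, apply operates on the entire DataFrame of each group.

    2. apply implicitly passes all the columns of each group to the custom function as a DataFrame.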
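
To make the flexibility of apply concrete, here is a minimal sketch. The DataFrame and its column names (key, a, b) are made up for illustration and do not come from the original post. It shows a custom function that returns a scalar per group and one that returns a Series per group.

```python
import pandas as pd

# Illustrative data; the names "key", "a" and "b" are assumptions
df = pd.DataFrame({"key": ["x", "x", "y"],
                   "a": [1, 2, 3],
                   "b": [10, 20, 30]})
grouped = df.groupby("key")[["a", "b"]]

# The function sees the whole group as a DataFrame, so it can combine columns.
# Returning a scalar gives one value per group (a Series indexed by key).
scalar_result = grouped.apply(lambda g: (g["a"] * g["b"]).sum())

# Returning a Series (one entry per column) gives one row per group.
series_result = grouped.apply(lambda g: g.max() - g.min())

print(scalar_result)
print(series_result)
```

Selecting the numeric columns before calling apply keeps the grouping column out of the function's input, which sidesteps version-dependent behaviour around grouping columns.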

Example:

import numpy as np
import pandas as pd

# Small DataFrame with one grouping column (state) and two numeric columns (a, b)
data  = pd.DataFrame({'state':['Florida','Florida','Texas','Texas'],
                      'a':[4,5,1,3],
                      'b':[6,10,3,11]
                      })
print(data)
#    a   b    state
# 0  4   6  Florida
# 1  5  10  Florida
# 2  1   3    Texas
# 3  3  11    Texas
def sub_two(X):
    return X['a'] - X['b']
data1 = data.groupby(data['state']).apply(sub_two)  # using transform here would raise an error
print(data1)
# state     
# Florida  0   -2
#          1   -5
# Texas    2   -2
#          3   -8
# dtype: int64
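
The failure mentioned in the comment above can be reproduced with a small sketch. It reuses the same data and sub_two function; the try/except wrapper and the variable names transform_failed and diff are added here for illustration. The exact exception type and message may differ between pandas versions, so the sketch only reports that something was raised.

```python
import pandas as pd

data = pd.DataFrame({"state": ["Florida", "Florida", "Texas", "Texas"],
                     "a": [4, 5, 1, 3],
                     "b": [6, 10, 3, 11]})

def sub_two(X):
    return X["a"] - X["b"]

# transform hands sub_two one column (a Series) at a time,
# so the lookup X["a"] has no column "a" to find.
try:
    data.groupby("state")[["a", "b"]].transform(sub_two)
    transform_failed = False
except Exception as exc:  # exception type may vary by pandas version
    transform_failed = True
    print("transform raised:", type(exc).__name__)

# A cross-column calculation that does not depend on the group
# can simply be done on the whole DataFrame.
diff = data["a"] - data["b"]
print(diff.tolist())
```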

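transform can also be used when the function returns a single scalar:

The outputs of transform and apply take different forms here: transform returns as many rows as the original data, while apply aggregates to one row per group.

In this case, the output of apply conveys the information more clearly.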

def group_sum(x):
    return x.sum()
data3 = data.groupby(data['state']).transform(group_sum)    # returns as many rows as the original data
print(data3)
#    a   b
# 0  9  16
# 1  9  16
# 2  4  14
# 3  4  14
# With apply, however, the result is aggregated to one row per group
data4 = data.groupby(data['state']).apply(group_sum)
print(data4)
#          a   b           state
# state                         
# Florida  9  16  FloridaFlorida
# Texas    4  14      TexasTexas

The other difference is that transform must return a one-dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not, an error is raised.
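
A minimal sketch of that length requirement follows. The helper return_three is hypothetical and deliberately returns three values for groups of two rows; the exact exception type and message depend on the pandas version, so the sketch catches a generic Exception.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"state": ["Florida", "Florida", "Texas", "Texas"],
                     "a": [4, 5, 1, 3]})

def return_three(x):
    # Hypothetical helper: always three values, although each group has two rows
    return np.array([1, 2, 3])

try:
    data.groupby("state")["a"].transform(return_three)
    raised = False
except Exception as exc:  # exception type may vary by pandas version
    raised = True
    print("transform raised:", type(exc).__name__)

# A function that returns one value per row of the group is accepted.
doubled = data.groupby("state")["a"].transform(lambda x: x * 2)
print(doubled.tolist())
```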

Example 2:

np.random.seed(666)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8), 'D' : np.random.randn(8)})
print(df)
#      A      B         C         D
# 0  foo    one  0.824188  0.640573
# 1  bar    one  0.479966 -0.786443
# 2  foo    two  1.173468  0.608870
# 3  bar  three  0.909048 -0.931012
# 4  foo    two -0.571721  0.978222
# 5  bar    two -0.109497 -0.736918
# 6  foo    one  0.019028 -0.298733
# 7  foo  three -0.943761 -0.460587
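def zscore(x):
    # z-score: subtract the group mean, then divide by the group standard deviation
    return (x - x.mean()) / x.std()
# Select the numeric columns C and D explicitly; older pandas versions dropped the string column B automatically
print(df.groupby('A')[['C', 'D']].transform(zscore))
# In this form apply yields the same values as transform (newer pandas versions may add the group key to the index)
print(df.groupby('A')[['C', 'D']].apply(zscore))
# df.groupby('A').apply(zscore) raises an error, because apply passes the whole DataFrame, including the string column B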

df['sum_c'] = df.groupby('A')['C'].transform(sum)   # group by column A, then broadcast each group's sum of C to its rows
df = df.sort_values('A')
print(df)
#      A      B         C         D     sum_c
# 1  bar    one  0.479966 -0.786443  1.279517
# 3  bar  three  0.909048 -0.931012  1.279517
# 5  bar    two -0.109497 -0.736918  1.279517
# 0  foo    one  0.824188  0.640573  0.501202
# 2  foo    two  1.173468  0.608870  0.501202
# 4  foo    two -0.571721  0.978222  0.501202
# 6  foo    one  0.019028 -0.298733  0.501202
# 7  foo  three -0.943761 -0.460587  0.501202
print(df.groupby('A')['C'].apply(sum))
# A
# bar    1.279517
# foo    0.501202
# Name: C, dtype: float64

The function passed to transform must return a number, a row, or a result with the same shape as its argument. If it returns a number, that number is assigned to every element in the group; if it returns a row, the row is broadcast to all the rows in the group.
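
As a rough illustration of these return shapes, the sketch below uses a small made-up DataFrame (the values are assumptions chosen to be easy to check by hand). It covers the scalar case and the same-shape case.

```python
import pandas as pd

# Illustrative data, not taken from the original post
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "C": [1.0, 2.0, 3.0, 4.0],
                   "D": [10.0, 20.0, 30.0, 40.0]})
grouped = df.groupby("A")[["C", "D"]]

# Scalar per column and group: the value is broadcast to every row of the group.
per_group_max = grouped.transform(lambda x: x.max())

# Same shape as the input: each value is replaced by its deviation from the group mean.
demeaned = grouped.transform(lambda x: x - x.mean())

print(per_group_max)
print(demeaned)
```

In both cases the result keeps the original row index, which is why a transform result can be assigned straight back as a new column.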

Reference: https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object#


Reposted from blog.csdn.net/qq_40587575/article/details/81204514