Efficiently handling DataFrame columns with combine_first, combine, and update

Introduction

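When processing data with pandas, we often need to replace values in one column with values from another. Take columns A and B: where A is not null we leave it alone, and where A is null we fill it with the corresponding value from B. Plenty of people have probably run into this kind of requirement, and there are more complicated variants too.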

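There are many ways to handle this, such as the not-so-efficient apply, or vectorized assignment with loc. This time, let's look at a few methods that are very concise and efficient as well.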
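For reference, here is a minimal sketch of the two baseline approaches just mentioned, applied to the A/B example. The names via_apply, via_loc, and mask are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["001", None, "003", None, "005"],
     "B": ["1", "2", "3", "4", "5"]}
)

# Row-wise apply: works, but calls a Python function once per row
via_apply = df.apply(
    lambda row: row["B"] if pd.isna(row["A"]) else row["A"], axis=1
)

# Vectorized loc: copy B into A only on the rows where A is missing
via_loc = df.copy()
mask = via_loc["A"].isna()
via_loc.loc[mask, "A"] = via_loc.loc[mask, "B"]

print(via_apply.tolist())     # ['001', '2', '003', '4', '005']
print(via_loc["A"].tolist())  # ['001', '2', '003', '4', '005']
```

Both give the same column; the methods below express the same intent in a single call.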

combine_first

This method is designed specifically for handling missing values. Let's see how it's used.

import pandas as pd

df = pd.DataFrame(
    {"A": ["001", None, "003", None, "005"],
     "B": ["1", "2", "3", "4", "5"]}
)
print(df)
"""
      A  B
0   001  1
1  None  2
2   003  3
3  None  4
4   005  5
"""

# The requirement: where column A is not null, leave it alone.
# Where it is null, replace it with the corresponding value from column B.
df["A"] = df["A"].combine_first(df["B"])
print(df)
"""
     A  B
0  001  1
1    2  2
2  003  3
3    4  4
4  005  5
"""

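Usage is simple. Given two Series objects, say s1 and s2, s1.combine_first(s2) fills the missing values in s1 with values from s2. If the values at a given index are missing in both s1 and s2, the result is still missing there. Note that this method does not operate in place; it returns a new Series.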
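A small sketch of those two points, with made-up values: a position that is missing in both Series stays missing, and s1 itself is left untouched.

```python
import pandas as pd

s1 = pd.Series(["x", None, None])
s2 = pd.Series(["a", "b", None])

result = s1.combine_first(s2)

print(result.tolist()[:2])      # ['x', 'b']
print(pd.isna(result.iloc[2]))  # True: missing in both, so still missing
print(s1.isna().tolist())       # [False, True, True]: s1 is unchanged
```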

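The ideal precondition for this method is that the two Series share the same index, because the position to replace is determined by index label.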

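For example, if the value at index 1 in s1 is missing, it is replaced with the value at index 1 in s2. If s2 has no entry at index 1, nothing is replaced. And if s2 has an entry at index 100 that s1 lacks, the result gains an extra entry at index 100.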

Here's a demonstration.

import pandas as pd

s1 = pd.Series(["001", None, None, "004"], index=['a', 'b', 'c', 'd'])
s2 = pd.Series(["2", "3", "4"], index=['b', 'd', "e"])

print(s1)
"""
a     001
b    None
c    None
d     004
dtype: object
"""
print(s2)
"""
b    2
d    3
e    4
dtype: object
"""

print(s1.combine_first(s2))
"""
a    001
b      2
c    NaN
d    004
e      4
dtype: object
"""

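To explain: only the missing values in s1 get replaced; values that are present are left untouched. s1 has two missing values, at indexes "b" and "c", so pandas looks for entries at "b" and "c" in s2. s2 has an entry at "b" but none at "c", so only one value can be replaced.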

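We also see an extra entry with index e at the end. As mentioned, entries that exist in s2 but not in s1 are added to the result directly. Another example: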

import pandas as pd

s1 = pd.Series(['1', '2', '3', '4'], index=['a', 'b', 'c', 'd'])
s2 = pd.Series(['11', '22', '33'], index=['d', 'e', 'f'])

print(s1.combine_first(s2))
"""
a     1
b     2
c     3
d     4
e    22
f    33
dtype: object
"""

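s2 contains entries at indexes 'e' and 'f' that s1 lacks, so they are added directly. Likewise, entries in s1 with no counterpart in s2 are kept as they are. Where both have an entry, it depends on whether s1's value is missing: if it is, the value at the same index in s2 is used; otherwise s1's value is kept and nothing is replaced.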

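Most of the time, though, we're working with two columns of the same DataFrame. Their indexes are obviously identical, so the operation simply runs top to bottom with nothing fancy going on.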
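For this same-index case, fillna with a Series argument gives the same column, since fillna also aligns on index labels. The sketch below is for comparison only, and the names filled, combined, left, and right are illustrative. The second half shows where the two differ: fillna keeps only the caller's index, while combine_first returns the union.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["001", None, "003", None, "005"],
     "B": ["1", "2", "3", "4", "5"]}
)

filled = df["A"].fillna(df["B"])
combined = df["A"].combine_first(df["B"])

print(filled.tolist())                       # ['001', '2', '003', '4', '005']
print(filled.tolist() == combined.tolist())  # True

# With different indexes the two methods diverge
left = pd.Series(["001", None], index=["a", "b"])
right = pd.Series(["2", "3"], index=["b", "c"])

print(left.fillna(right).index.tolist())         # ['a', 'b']
print(left.combine_first(right).index.tolist())  # ['a', 'b', 'c']
```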

combine

combine is similar to combine_first, except that you have to supply a function.

import pandas as pd

df = pd.DataFrame(
    {"A": ["001", None, "003", None, "005"],
     "B": ["1", "2", "3", "4", "5"]}
)
print(df)
"""
      A  B
0   001  1
1  None  2
2   003  3
3  None  4
4   005  5
"""

df["A"] = df["A"].combine(df["B"], lambda a, b: a if pd.notna(a) else b)
print(df)
"""
     A  B
0  001  1
1    2  2
2  003  3
3    4  4
4  005  5
"""

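We passed an anonymous function whose parameters a and b are the corresponding elements of df["A"] and df["B"]. If a is not null it returns a; otherwise it returns b.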

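So combine gives us the same effect as combine_first here. combine_first is dedicated to replacing missing values, whereas combine lets us specify the logic ourselves. We can reproduce combine_first with it, and we can implement other behavior as well: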

import pandas as pd

s1 = pd.Series([1, 22, 3, 44])
s2 = pd.Series([11, 2, 33, 4])

# Keep whichever element is larger
print(s1.combine(s2, lambda a, b: a if a > b else b))
"""
0    11
1    22
2    33
3    44
dtype: int64
"""

# Multiply the two elements
print(s1.combine(s2, lambda a, b: a * b))
"""
0     11
1     44
2     99
3    176
dtype: int64
"""

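One thing worth keeping in mind: Series.combine calls the supplied function once per pair of elements, in Python. For simple logic like the two examples above, a vectorized expression gives the same values when the indexes already match. A rough sketch for comparison (the names larger and product are illustrative):

```python
import pandas as pd

s1 = pd.Series([1, 22, 3, 44])
s2 = pd.Series([11, 2, 33, 4])

# Keep s1 where it is larger, otherwise take s2
larger = s1.where(s1 > s2, s2)

# Element-wise product
product = s1 * s2

print(larger.tolist())   # [11, 22, 33, 44]
print(product.tolist())  # [11, 44, 99, 176]
```

combine remains the more flexible choice when the logic cannot be expressed as a vectorized operation.
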
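combine is quite powerful, and like combine_first it operates by index. In effect, combine and combine_first first reconcile the indexes: if the two Series have different indexes, they are brought onto a common index before anything else happens.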

import pandas as pd

s1 = pd.Series([1, 22, 3, 44], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([11, 2, 33, 4], index=['c', 'd', 'e', 'f'])

# First take the union of the two indexes
index = s1.index.union(s2.index)
print(index)  # Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

# Then reindex to that union; labels missing from a Series are filled with NaN
s1 = s1.reindex(index)
s2 = s2.reindex(index)
print(s1)
"""
a     1.0
b    22.0
c     3.0
d    44.0
e     NaN
f     NaN
dtype: float64
"""
print(s2)
"""
a     NaN
b     NaN
c    11.0
d     2.0
e    33.0
f     4.0
dtype: float64
"""
# With s1 and s2 on the same index, the operation is applied element by element.

Now let's look back at combine_first.

import pandas as pd

s1 = pd.Series([1, 22, 3, 44], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([11, 2, 33, 4], index=['a', 'b', 'c', 'e'])
print(s1.combine_first(s2))
"""
a     1.0
b    22.0
c     3.0
d    44.0
e     4.0
dtype: float64
"""
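# At first you might wonder why the dtype changed, but by now it should be clear.
# The indexes of s1 and s2 differ: 'e' is not in s1, and 'd' is not in s2.
# reindex fills labels that don't exist with NaN,
# and once NaN appears, the dtype changes from integer to float.
# If the two Series had identical indexes, reindex would leave them unchanged; with no NaN, the dtype would stay the same.
# Note: this is the output of the pandas version used here; newer releases may preserve int64 when the combined result has no NaN.

# So we can implement combine_first ourselves; conceptually, this is what pandas does too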
s1 = s1.reindex(['a', 'b', 'c', 'd', 'e'])
s2 = s2.reindex(['a', 'b', 'c', 'd', 'e'])
print(s1)
"""
a     1.0
b    22.0
c     3.0
d    44.0
e     NaN
dtype: float64
"""
print(s2)
"""
a    11.0
b     2.0
c    33.0
d     NaN
e     4.0
dtype: float64
"""

# Keep s1 where it is not null; otherwise take the value from s2
print(s1.where(pd.notna(s1), s2))
"""
a     1.0
b    22.0
c     3.0
d    44.0
e     4.0
dtype: float64
"""

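The manual steps above can be wrapped into one function. This is only a sketch of the idea, not pandas' actual implementation, and combine_first_sketch is a hypothetical name. The comparison casts to float because the dtype that combine_first returns for integer input can differ between pandas versions.

```python
import pandas as pd

def combine_first_sketch(left: pd.Series, right: pd.Series) -> pd.Series:
    """Hypothetical helper: fill gaps in `left` from `right` on the index union."""
    index = left.index.union(right.index)
    left = left.reindex(index)
    right = right.reindex(index)
    # Keep left where it has a value, otherwise fall back to right
    return left.where(left.notna(), right)

s1 = pd.Series([1, 22, 3, 44], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([11, 2, 33, 4], index=['a', 'b', 'c', 'e'])

sketch = combine_first_sketch(s1, s2)
builtin = s1.combine_first(s2)

print(sketch.index.tolist())  # ['a', 'b', 'c', 'd', 'e']
print(sketch.tolist())        # [1.0, 22.0, 3.0, 44.0, 4.0]
print(sketch.tolist() == builtin.astype(float).tolist())  # True
```
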
And now let's look back at combine once more.

import pandas as pd

s1 = pd.Series([1, 22, 3, 44], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([11, 2, 33, 4], index=['c', 'd', 'e', 'f'])

print(s1.combine(s2, lambda a, b: a if a > b else b))
"""
a     NaN
b     NaN
c    11.0
d    44.0
e    33.0
f     4.0
dtype: float64
"""
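# Why this result? It should be easy to work out.
# After reindexing:
# s1 becomes [1.0, 22.0, 3.0, 44.0, NaN, NaN]
# s2 becomes [NaN, NaN, 11.0, 2.0, 33.0, 4.0]
# Then the pairs are compared in order. 1.0 > NaN is False, so b is kept, and b is NaN, so the first element of the result is NaN. The same goes for the second.
# Likewise for the last two elements: any comparison with NaN is False, so b is kept, giving 33.0 and 4.0.
# The elements at indexes c and d need no analysis: the larger one is kept.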
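If those NaN results are unwanted, Series.combine accepts a fill_value argument that is used in place of a value whose label is absent from one of the two Series. A short sketch on the same data, where the choice of 0 as the filler is just an assumption for illustration:

```python
import pandas as pd

s1 = pd.Series([1, 22, 3, 44], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([11, 2, 33, 4], index=['c', 'd', 'e', 'f'])

# A label missing on one side is treated as 0 before max is applied
result = s1.combine(s2, max, fill_value=0)

print(result.index.tolist())  # ['a', 'b', 'c', 'd', 'e', 'f']
print(result.tolist())        # a=1, b=22, c=11, d=44, e=33, f=4
```

Pick a fill value that is neutral for the function in use; 0 works for max on positive numbers but would be wrong for a product.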

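So, once again: combine and combine_first operate on elements that share an index label. If the two Series have different indexes, the union is taken first, both are reindexed to it, and only then are the elements compared. One more example: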

import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([1, 2, 3, 4])

# Are the two elements equal? Return True if so, otherwise False
print(s1.combine(s2, lambda a, b: True if a == b else False))
"""
0    True
1    True
2    True
3    True
dtype: bool
"""

s2.index = [0, 1, 3, 2]
print(s1.combine(s2, lambda a, b: True if a == b else False))
"""
0     True
1     True
2    False
3    False
dtype: bool
"""

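# After changing s2's index to [0, 1, 3, 2], the result is no longer all True
print(s1.index.union(s2.index))  # Int64Index([0, 1, 2, 3], dtype='int64') (the exact repr varies by pandas version)
# After reindexing, s1 is still 1, 2, 3, 4, but s2 becomes 1, 2, 4, 3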

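So when using combine and combine_first, always keep the index in mind, or you may fall into a trap. In fact, many other pandas operations work the same way: they are index-based, not simply left to right or top to bottom.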

But again, most of the time we're operating on two columns of a DataFrame, whose indexes are identical, so there's no need to overthink it.
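When a positional comparison really is what you want and the labels don't matter, one option is to drop the labels first. A minimal sketch using reset_index(drop=True); the names by_label and by_position are illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([1, 2, 3, 4], index=[0, 1, 3, 2])

# Label-aligned: label 2 pairs 3 with 4, label 3 pairs 4 with 3
by_label = s1.combine(s2, lambda a, b: bool(a == b))

# Positional: discard the labels, then compare element by element
by_position = s1.reset_index(drop=True).combine(
    s2.reset_index(drop=True), lambda a, b: bool(a == b)
)

print(by_label.tolist())     # [True, True, False, False]
print(by_position.tolist())  # [True, True, True, True]
```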

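These two methods also work on DataFrame objects, not just Series. For example, df1.combine(df2, func) combines columns with the same name. It isn't used very often, so explore it yourself if you're interested. We mainly use these methods on Series.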
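As a starting point for that exploration, here is a brief sketch with made-up frames. Note that for DataFrame.combine the function receives two whole columns (Series) at a time, not two scalars.

```python
import pandas as pd

# combine_first: fill the gaps in df1 from df2, column by column
df1 = pd.DataFrame({"A": [1.0, None], "B": [None, 4.0]})
df2 = pd.DataFrame({"A": [10.0, 20.0], "B": [30.0, 40.0]})

filled = df1.combine_first(df2)
print(filled["A"].tolist())  # [1.0, 20.0]
print(filled["B"].tolist())  # [30.0, 4.0]

# combine: func is called once per column with a pair of Series
dfa = pd.DataFrame({"A": [1, 22], "B": [3, 44]})
dfb = pd.DataFrame({"A": [11, 2], "B": [33, 4]})

larger = dfa.combine(dfb, lambda c1, c2: c1.where(c1 > c2, c2))
print(larger["A"].tolist())  # [11, 22]
print(larger["B"].tolist())  # [33, 44]
```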

update

update is rather blunt. Let's take a look.

import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([11, 22, 33, 44])

s1.update(s2)
print(s1)
"""
0    11
1    22
2    33
3    44
dtype: int64
"""

First, notice that this method operates in place. It still replaces elements of s1 with elements of s2, and it does so wherever the element in s2 is not null.

import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([11, 22, None, 44])

s1.update(s2)
print(s1)
"""
0    11
1    22
2     3
3    44
dtype: int64
"""

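That's why the method is called update. It swaps elements of s1 for elements of s2, but if an element in s2 is missing, you can think of the 'new version' as not yet released, so the old version stays in use. That's why the 3 in s1 was not replaced.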

So update and combine_first are quite similar. The difference is:

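combine_first: if the value in s1 is missing, replace it with the value from s2; otherwise keep s1's value
update: if the value in s2 is not missing, it replaces the value in s1; otherwise s1's value is kept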
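The two rules side by side on the same made-up data, as a small sketch. update is run on a copy here so that s1 stays available for comparison.

```python
import pandas as pd

s1 = pd.Series([1.0, None, 3.0])
s2 = pd.Series([10.0, 20.0, None])

# Returns a new Series; s1 wins wherever it has a value
filled = s1.combine_first(s2)

# Modifies the caller in place; s2 wins wherever it has a value
updated = s1.copy()
updated.update(s2)

print(filled.tolist())   # [1.0, 20.0, 3.0]
print(updated.tolist())  # [10.0, 20.0, 3.0]
```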

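Also, with combine_first we kept stressing the index issue: if s1 and s2 have different indexes, the result can have more elements. update is different. Because it operates in place, modifying s1 directly, the number of elements in s1 never changes.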

import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([11, 22, 33, 44], index=['c', 'd', 'e', 'f'])

s1.update(s2)
print(s1)
"""
a     1
b     2
c    11
d    22
dtype: int64
"""

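s2 has no elements at indexes 'a' and 'b', so no 'new version' exists and the original values are kept. s2 does have elements at 'c' and 'd', so a 'new version' exists and those values are updated. s1 therefore goes from [1 2 3 4] to [1 2 11 22]. As for the elements at 'e' and 'f' in s2, they have nothing to do with s1: s1 has no such indexes, so the 'new versions' that s2 offers go unused. update works on s1 in place, and s1's index and element count are the same before and after.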

update also works on two DataFrames. Look into it yourself if you're interested.
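A short sketch of the DataFrame version, with made-up frames. Columns that exist only in the other frame are ignored, missing values in it are skipped, and the call modifies df in place and returns None. The overwrite=False option restricts the update to positions that are missing in the original.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})
other = pd.DataFrame({"B": [40.0, None, 60.0], "C": [7.0, 8.0, 9.0]})

df.update(other)
print(df.columns.tolist())  # ['A', 'B']
print(df["B"].tolist())     # [40.0, 5.0, 60.0]

# overwrite=False: only fill the positions that are missing in the original
gaps = pd.DataFrame({"A": [1.0, None, 3.0]})
gaps.update(pd.DataFrame({"A": [10.0, 20.0, 30.0]}), overwrite=False)
print(gaps["A"].tolist())   # [1.0, 20.0, 3.0]
```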