Using combine_first, combine, and update to efficiently process DataFrame columns

Preface

When we use pandas to process data, we often need to replace the values in one column with those from another. For example, given columns A and B: leave the non-null values in column A untouched, and replace the null values in column A with the corresponding values from column B. Many people have probably run into this kind of requirement, and of course there are more complex variations.

There are many ways to handle this kind of requirement, for example the relatively inefficient apply, or vectorized indexing with loc, and so on. This time, let's look at a few methods that are both very simple and efficient.
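
For comparison, here is a minimal sketch of the vectorized loc approach mentioned above, using the same kind of data as the next example (this is just one possible way to write it, not the focus of this article):

import pandas as pd

df = pd.DataFrame(
    {"A": ["001", None, "003", None, "005"],
     "B": ["1", "2", "3", "4", "5"]}
)

# boolean mask of the rows where A is null; the assignment aligns on the index,
# so only those rows receive the corresponding values from B
df.loc[df["A"].isna(), "A"] = df["B"]
print(df)
"""
     A  B
0  001  1
1    2  2
2  003  3
3    4  4
4  005  5
"""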

combine_first

This method is designed specifically for dealing with null values. Let's look at how it is used.

import pandas as pd

df = pd.DataFrame(
    {"A": ["001", None, "003", None, "005"],
     "B": ["1", "2", "3", "4", "5"]}
)
print(df)
"""
      A  B
0   001  1
1  None  2
2   003  3
3  None  4
4   005  5
"""

# Our requirement: if a value in column A is not null, leave it alone;
# if it is null, replace it with the corresponding value from column B
df["A"] = df["A"].combine_first(df["B"])
print(df)
"""
     A  B
0  001  1
1    2  2
2  003  3
3    4  4
4  005  5
"""

The usage is very simple. Given two Series objects, say s1 and s2, s1.combine_first(s2) replaces the null values in s1 with the corresponding values from s2. If both are null, the result can only be null. Note that this method does not operate in place; it returns a new Series object.

In addition, this method works best when the two Series objects share the same index, because the replacement locations are determined by the index.

For example, if the value at index 1 in s1 is null, then the value at index 1 in s2 is used to replace it; but if s2 has no value at index 1, no replacement happens. And if s2 has a value at, say, index 100 that does not exist in s1, the result gains an extra row with index 100.

Let's demonstrate

import pandas as pd

s1 = pd.Series(["001", None, None, "004"], index=['a', 'b', 'c', 'd'])
s2 = pd.Series(["2", "3", "4"], index=['b', 'd', "e"])

print(s1)
"""
a     001
b    None
c    None
d     004
dtype: object
"""
print(s2)
"""
b    2
d    3
e    4
dtype: object
"""

print(s1.combine_first(s2))
"""
a    001
b      2
c    NaN
d    004
e      4
dtype: object
"""

To explain: only the null values in s1 are replaced; non-null values are left alone. There are two null values in s1, at indexes 'b' and 'c', so the values at indexes 'b' and 'c' in s2 are used for the replacement. However, s2 has a value at index 'b' but none at index 'c', so only one value actually gets replaced.

In addition, we see an extra row with index 'e' at the end. Indeed, as we said: if a value in s2 has an index that is not in s1, it is added directly.

import pandas as pd

s1 = pd.Series(['1', '2', '3', '4'], index=['a', 'b', 'c', 'd'])
s2 = pd.Series(['11', '22', '33'], index=['d', 'e', 'f'])

print(s1.combine_first(s2))
"""
a     1
b     2
c     3
d     4
e    22
f    33
dtype: object
"""

We see that s2 has values at indexes 'e' and 'f' that do not exist in s1, so they are added directly. Conversely, values in s1 whose index does not appear in s2 are kept as they are. When an index exists in both, we check whether the value in s1 is null: if it is, it is replaced by the value at the same index in s2; if not, the value from s1 is kept and no replacement happens.

Of course, in most cases we are dealing with two columns of the same DataFrame. Two columns of the same DataFrame obviously share the same index, so the operation simply proceeds from top to bottom, with no surprises.

combine

combine is similar to combine_first, except that you need to supply a function yourself.

import pandas as pd

df = pd.DataFrame(
    {"A": ["001", None, "003", None, "005"],
     "B": ["1", "2", "3", "4", "5"]}
)
print(df)
"""
      A  B
0   001  1
1  None  2
2   003  3
3  None  4
4   005  5
"""

df["A"] = df["A"].combine(df["B"], lambda a, b: a if pd.notna(a) else b)
print(df)
"""
     A  B
0  001  1
1    2  2
2  003  3
3    4  4
4  005  5
"""

We specified an anonymous function whose parameters a and b stand for each pair of corresponding values in df["A"] and df["B"]: if a is not null, a is returned; otherwise b is returned.

So we have used combine to achieve the effect of combine_first. combine_first is specialized for replacing null values, while combine lets us specify the logic ourselves: it can reproduce what combine_first does, but it can also do other things.

import pandas as pd

s1 = pd.Series([1, 22, 3, 44])
s2 = pd.Series([11, 2, 33, 4])

# keep whichever element is larger
print(s1.combine(s2, lambda a, b: a if a > b else b))
"""
0    11
1    22
2    33
3    44
dtype: int64
"""

# multiply the two elements
print(s1.combine(s2, lambda a, b: a * b))
"""
0     11
1     44
2     99
3    176
dtype: int64
"""

combine is quite powerful, and like combine_first it operates by index. In fact, both combine and combine_first process the indexes internally first: if the indexes of the two Series objects are not the same, they are first made consistent.

import pandas as pd

s1 = pd.Series([1, 22, 3, 44], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([11, 2, 33, 4], index=['c', 'd', 'e', 'f'])

# first take the union of the two indexes
index = s1.index.union(s2.index)
print(index)  # Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

# then use reindex to pull out the elements at those indexes; indexes that do not exist are filled with NaN
s1 = s1.reindex(index)
s2 = s2.reindex(index)
print(s1)
"""
a     1.0
b    22.0
c     3.0
d    44.0
e     NaN
f     NaN
dtype: float64
"""
print(s2)
"""
a     NaN
b     NaN
c    11.0
d     2.0
e    33.0
f     4.0
dtype: float64
"""
# Once the indexes of s1 and s2 have been aligned, the operation proceeds element by element.

Let's look back at combine_first:

import pandas as pd

s1 = pd.Series([1, 22, 3, 44], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([11, 2, 33, 4], index=['a', 'b', 'c', 'e'])
print(s1.combine_first(s2))
"""
a     1.0
b    22.0
c     3.0
d    44.0
e     4.0
dtype: float64
"""
# At first some readers may wonder why the dtype changed, but by now there should be no doubt:
# the indexes of s1 and s2 are not the same; index='e' does not exist in s1 and index='d' does not exist in s2,
# and reindex fills indexes that do not exist with NaN.
# Once NaN appears, the dtype is promoted from integer to float.
# If the two Series objects share the same index, reindex leaves them unchanged; with no NaN introduced, the dtype does not change.

# So we can implement a combine_first ourselves; pandas does essentially the same thing internally
s1 = s1.reindex(['a', 'b', 'c', 'd', 'e'])
s2 = s2.reindex(['a', 'b', 'c', 'd', 'e'])
print(s1)
"""
a     1.0
b    22.0
c     3.0
d    44.0
e     NaN
dtype: float64
"""
print(s2)
"""
a    11.0
b     2.0
c    33.0
d     NaN
e     4.0
dtype: float64
"""

# keep s1 where it is not null, otherwise replace with s2
print(s1.where(pd.notna(s1), s2))
"""
a     1.0
b    22.0
c     3.0
d    44.0
e     4.0
dtype: float64
"""

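Putting those pieces together, here is a rough, simplified sketch of a hand-rolled combine_first (the name my_combine_first is made up for illustration; the real pandas implementation handles more edge cases):

import pandas as pd


def my_combine_first(s1, s2):
    # align both Series on the union of their indexes
    index = s1.index.union(s2.index)
    s1, s2 = s1.reindex(index), s2.reindex(index)
    # keep s1 where it is not null, otherwise fall back to s2
    return s1.where(pd.notna(s1), s2)


s1 = pd.Series([1, 22, 3, 44], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([11, 2, 33, 4], index=['a', 'b', 'c', 'e'])
print(my_combine_first(s1, s2))
"""
a     1.0
b    22.0
c     3.0
d    44.0
e     4.0
dtype: float64
"""
# the same result as s1.combine_first(s2) above
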
Now let's look back at combine:

import pandas as pd

s1 = pd.Series([1, 22, 3, 44], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([11, 2, 33, 4], index=['c', 'd', 'e', 'f'])

print(s1.combine(s2, lambda a, b: a if a > b else b))
"""
a     NaN
b     NaN
c    11.0
d    44.0
e    33.0
f     4.0
dtype: float64
"""
# It should be easy to work out why we get this result.
# After reindexing:
# s1's data becomes [1.0, 22.0, 3.0, 44.0, NaN, NaN]
# s2's data becomes [NaN, NaN, 11.0, 2.0, 33.0, 4.0]
# Then compare element by element: 1.0 > NaN is False, so b is kept, and b is NaN,
# so the first element of the result is NaN; the same goes for the second.
# Likewise for the last two elements: comparing against NaN is also False, so b is kept, giving 33.0 and 4.0.
# The elements at indexes 'c' and 'd' need no analysis: the larger value is obviously kept.

So, to repeat: combine and combine_first compare elements that share the same index. If the indexes of the two Series objects differ, the union is taken first, both are reindexed, and only then are the elements compared. Here is another example:

import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([1, 2, 3, 4])

# are the two elements equal? return True if so, otherwise False
print(s1.combine(s2, lambda a, b: True if a == b else False))
"""
0    True
1    True
2    True
3    True
dtype: bool
"""

s2.index = [0, 1, 3, 2]
print(s1.combine(s2, lambda a, b: True if a == b else False))
"""
0     True
1     True
2    False
3    False
dtype: bool
"""

# once we change s2's index to [0, 1, 3, 2], the result is no longer what we expect
print(s1.index.union(s2.index))  # Int64Index([0, 1, 2, 3], dtype='int64')
# after reindexing, s1 is still 1, 2, 3, 4, but s2 becomes 1, 2, 4, 3

So when using combine and combine_first, we must keep the index in mind, otherwise it can become a trap. In fact, many other pandas operations work the same way: they are based on the index, not simply left-to-right or top-to-bottom.
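
If what you actually care about is position rather than labels, one possible workaround (an assumption about intent, not something from the example above) is to drop the labels before combining:

import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([1, 2, 3, 4], index=[0, 1, 3, 2])

# reset_index(drop=True) discards the old labels, so the comparison
# becomes purely positional again and all four pairs match
print(s1.combine(s2.reset_index(drop=True), lambda a, b: a == b))
"""
0    True
1    True
2    True
3    True
dtype: bool
"""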

But as we said before, we usually operate on two columns of the same DataFrame, whose indexes are identical, so there is no need to worry too much.

Of course, besides Series objects, these two methods can also be applied to DataFrame objects, e.g. df1.combine(df2, func), which combines columns with the same name. This is not very common; if you are interested, you can study it yourself. Here we mainly use them on Series objects.
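
As a quick, hedged sketch of the DataFrame-level call (made-up data, just to show the shape of the API): DataFrame.combine applies the function column by column, so the function receives a pair of columns (Series objects) rather than a pair of scalars.

import pandas as pd

df1 = pd.DataFrame({"A": [1.0, None], "B": [None, 4.0]})
df2 = pd.DataFrame({"A": [10.0, 20.0], "B": [30.0, 40.0]})

# the lambda gets one column from each frame at a time,
# keeping df1's value where it is not null and falling back to df2
print(df1.combine(df2, lambda c1, c2: c1.where(c1.notna(), c2)))
"""
      A     B
0   1.0  30.0
1  20.0   4.0
"""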

update

update is more blunt. Let's take a look.

import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([11, 22, 33, 44])

s1.update(s2)
print(s1)
"""
0    11
1    22
2    33
3    44
dtype: int64
"""

First of all, note that this method operates in place. It replaces the elements of s1 with the elements of s2: as long as an element in s2 is not null, it replaces the corresponding element in s1.

import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([11, 22, None, 44])

s1.update(s2)
print(s1)
"""
0    11
1    22
2     3
3    44
dtype: int64
"""

That is why the function is called update. It replaces elements of s1 with elements of s2, but if an element in s2 is null, you can think of it as the 'new version' not having been released yet, so the old value is kept; that is why the 3 in s1 was not replaced.

So update and combine_first are similar, but they differ as follows (the short sketch after this list makes the contrast concrete):

  • combine_first: if a value in s1 is null, it is replaced by the value from s2; otherwise s1's value is kept
  • update: if a value in s2 is not null, it replaces the value in s1; otherwise s1's value is kept
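
Here is that tiny sketch, with made-up values:

import pandas as pd

s1 = pd.Series([1.0, None, 3.0])
s2 = pd.Series([10.0, 20.0, None])

# combine_first only fills the nulls in s1; everything else in s1 survives
print(s1.combine_first(s2).tolist())  # [1.0, 20.0, 3.0]

# update overwrites s1 in place wherever s2 is not null
s1.update(s2)
print(s1.tolist())  # [10.0, 20.0, 3.0]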

In addition, with combine_first we repeatedly stressed the index issue: if the indexes of s1 and s2 differ, the result ends up with more elements. update is different, because it operates in place, modifying s1 directly, so the number of elements in s1 does not change.

import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([11, 22, 33, 44], index=['c', 'd', 'e', 'f'])

s1.update(s2)
print(s1)
"""
a     1
b     2
c    11
d    22
dtype: int64
"""

There are no elements with index 'a' or 'b' in s2, so you can think of it as no 'new version' being available: the original values are kept. There are elements with index 'c' and 'd' in s2, so those 'new versions' are applied, and s1 changes from [1 2 3 4] to [1 2 11 22]. As for the elements with index 'e' and 'f' in s2, they have nothing to do with s1: s1 has no such indexes at all, so providing 'new versions' for them in s2 is useless. In short, update works in place on s1, and neither the index nor the number of elements of s1 changes before and after the operation.

Of course, update also works on two DataFrame objects; if you are interested, you can explore that yourself.
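
For reference, here is a minimal sketch with made-up data (DataFrame.update aligns on both the index and the column labels, and modifies the caller in place):

import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})
df2 = pd.DataFrame({"A": [10, None, 30]})

# non-null values in df2 overwrite the matching cells of df1;
# column B and the row where df2 holds None are left untouched
df1.update(df2)
print(df1)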
