[Repost] Pandas (3): the merge method explained in detail

The merge() function joins two DataFrame objects (or named Series) on one or more keys, much like a database join, and is used constantly in data processing. The official documentation gives its signature as follows:

pandas.merge(left, right, how: str = 'inner', on=None, left_on=None, right_on=None, left_index: bool = False, right_index: bool = False, sort: bool = False, suffixes=('_x', '_y'), copy: bool = True, indicator: bool = False, validate=None)

First, the meaning and role of each parameter:

left	DataFrame on the left

That is a brief introduction to the parameters. The rest of this article works through examples built around them to explain how merge() is used in practice, in several parts:

When the left and right DataFrames share the same key column

Set the on parameter to perform a simple merge of two DataFrames:

In [1]: import pandas as pd

In [2]: data1 =pd.DataFrame({'key':['K0','K1','K2','K3'],
   ...:                 'A':['A0','A1','A2','A3'],
   ...:                 'B':['B0','B1','B2','B3']})

In [4]: data2 = pd.DataFrame({'key':['K0','K1','K2','K3'],
   ...:                         'C':['C0','C1','C2','C3'],
   ...:                         'D':['D0','D1','D2','D3']})

In [5]: result = pd.merge(data1,data2,on = 'key')

In [6]: result
Out[6]:
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3
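
As a hedged aside on the on parameter: when on is omitted, merge() falls back to the columns the two DataFrames have in common. The minimal sketch below rebuilds the same sample frames and compares the two calls; the names explicit and implicit are illustrative only and are not part of the original session.

```python
import pandas as pd

data1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})
data2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# Explicit key column.
explicit = pd.merge(data1, data2, on='key')

# No `on`: pandas joins on the columns common to both frames ('key' here).
implicit = pd.merge(data1, data2)

print(explicit.equals(implicit))
```

Naming the key explicitly is usually the safer choice, because an unexpected shared column would otherwise silently become part of the join key.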

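The four merge methods of merge()

how = "left"	Use only the keys from the left DataFrame; right-side values with no match are set to NaN
how = "right"	Use only the keys from the right DataFrame; left-side values with no match are set to NaN
how = "inner"	Use the intersection of the keys from both DataFrames (the default)
how = "outer"	Use the union of the keys from both DataFrames; values with no match are set to NaN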

In [7]: data1 =pd.DataFrame({'a':['a1','a2','a3'],
   ...:                     'b':['b1','b2','b3'],
   ...:                     'key':['a','b','c'],
   ...:                     'key1':['d','e','f']})
   ...:
   ...:

In [8]: data2 = pd.DataFrame({'c':['c1','c2','c3'],
   ...:                         'd':['d1','d2','d3'],
   ...:                         'key':['a','b','a'],
   ...:                         'key1':['d','e','e']})

how = "left" merge

The keys of the left DataFrame are the basis. Where the right side has no match, its columns are filled with NaN, and right-side rows whose keys do not appear on the left are dropped.

Chart interpretation:
- red: the row is removed;
- blue: the row is kept;
- green: unmatched values are replaced with NaN.

In [9]: # how = left: use the left keys as the basis

In [10]: pd.merge(data1,data2,how ="left",on = ['key','key1'])
Out[10]:
    a   b key key1    c    d
0  a1  b1   a    d   c1   d1
1  a2  b2   b    e   c2   d2
2  a3  b3   c    f  NaN  NaN

how ="right"

The keys of the right DataFrame are the basis. Usage mirrors how="left", with the direction reversed.

In [11]: # how = right: use the right keys as the basis

In [12]: pd.merge(data1,data2,how = 'right',on =['key','key1'])
Out[12]:
     a    b key key1   c   d
0   a1   b1   a    d  c1  d1
1   a2   b2   b    e  c2  d2
2  NaN  NaN   a    e  c3  d3

how ="inner"

This is the most frequently used method and the default. It is based on the keys shared by the left and right DataFrames: rows that match are kept, and all rows that fail to match are dropped.

In [16]: # how = inner: take the intersection of left and right

In [17]: pd.merge(data1,data2,how ='inner',on = ['key','key1'])
Out[17]:
    a   b key key1   c   d
0  a1  b1   a    d  c1  d1
1  a2  b2   b    e  c2  d2

how ="outer"

The counterpart of how="inner": it is based on the union of the keys from both DataFrames. Rows that match are merged, and rows that exist on only one side are kept, with the missing values filled with NaN.

In [13]: # how = outer: take the union of left and right
    
In [15]: pd.merge(data1,data2,how ='outer',on = ['key','key1'])
Out[15]:
     a    b key key1    c    d
0   a1   b1   a    d   c1   d1
1   a2   b2   b    e   c2   d2
2   a3   b3   c    f  NaN  NaN
3  NaN  NaN   a    e   c3   d3
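
To compare the four methods side by side, here is a minimal sketch that runs the same merge with each value of how and records the number of rows. It rebuilds the data1 and data2 frames from above; the row_counts dictionary is an illustrative name, not something from the original session.

```python
import pandas as pd

data1 = pd.DataFrame({'a': ['a1', 'a2', 'a3'],
                      'b': ['b1', 'b2', 'b3'],
                      'key': ['a', 'b', 'c'],
                      'key1': ['d', 'e', 'f']})
data2 = pd.DataFrame({'c': ['c1', 'c2', 'c3'],
                      'd': ['d1', 'd2', 'd3'],
                      'key': ['a', 'b', 'a'],
                      'key1': ['d', 'e', 'e']})

# Run the same two-key merge once per join type and count the rows.
row_counts = {}
for how in ['left', 'right', 'inner', 'outer']:
    merged = pd.merge(data1, data2, how=how, on=['key', 'key1'])
    row_counts[how] = len(merged)

print(row_counts)  # {'left': 3, 'right': 3, 'inner': 2, 'outer': 4}
```

The counts match the outputs above: two key pairs are shared, each side has one pair of its own, so inner keeps 2 rows and outer keeps 4. Row order in the outer result can differ between pandas versions, which is why the sketch only compares counts.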

Merging DataFrames whose key columns have different names

When the key columns of the two DataFrames have different names, use the left_on and right_on parameters to specify the key column names of the left and right DataFrames respectively.

left_on and right_on as merge keys

When you choose key columns for left_on, you must give the corresponding key columns for right_on, and the two must have the same length: len(left_on) == len(right_on).

The suffixes parameter matters here because the left and right DataFrames share a non-key column name (value); the suffixes keep the column names in the merged result distinct. The defaults are _x and _y.

In [18]: df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
    ...:                     'value': [1, 2, 3, 5]})

In [19]: df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
    ...:                     'value': [5, 6, 7, 8]})

In [20]: df1
Out[20]:
  lkey  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5

In [21]: df2
Out[21]:
  rkey  value
0  foo      5
1  bar      6
2  baz      7
3  foo      8

In [22]: pd.merge(df1,df2,left_on ='lkey')   # right_on is missing, so this raises the error shown further down


In [23]: pd.merge(df1,df2,left_on ='lkey',right_on ='rkey')
Out[23]:
  lkey  value_x rkey  value_y
0  foo        1  foo        5
1  foo        1  foo        8
2  foo        5  foo        5
3  foo        5  foo        8
4  bar        2  bar        6
5  baz        3  baz        7

# After setting the suffixes parameter
In [24]: pd.merge(df1,df2,left_on ='lkey',right_on ='rkey',suffixes=("_lf","_rf"))
Out[24]:
  lkey  value_lf rkey  value_rf
0  foo         1  foo         5
1  foo         1  foo         8
2  foo         5  foo         5
3  foo         5  foo         8
4  bar         2  bar         6
5  baz         3  baz         7
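
Merging with left_on and right_on keeps both key columns (lkey and rkey) in the result. One possible alternative, sketched here as an assumption rather than as the article's method, is to rename the right key first so that on can be used and only one key column remains. The names df2_renamed and merged are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})

# Rename the right key so both frames share one key column name.
df2_renamed = df2.rename(columns={'rkey': 'lkey'})

# With a shared key name, `on` works and the key appears only once.
merged = pd.merge(df1, df2_renamed, on='lkey', suffixes=('_lf', '_rf'))

print(merged.columns.tolist())  # ['lkey', 'value_lf', 'value_rf']
```

The row count is the same as with left_on/right_on (six rows, because foo appears twice on each side); only the redundant key column disappears.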

Before merging, make sure that left_on and right_on are both given and name the same number of columns, i.e. len(left_on) == len(right_on). Otherwise an error occurs; for example, In [22] above omits right_on and fails as follows:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-22-0660dac837b1> in <module>
----> 1 pd.merge(df1,df2,left_on ='lkey')

~\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     79         copy=copy,
     80         indicator=indicator,
---> 81         validate=validate,
     82     )
     83     return op.get_result()

~\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    617             warnings.warn(msg, UserWarning)
    618
--> 619         self._validate_specification()
    620
    621         # note this function has side effects

~\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in _validate_specification(self)
   1221                     )
   1222                 self.left_on = [None] * n
-> 1223         if len(self.right_on) != len(self.left_on):
   1224             raise ValueError("len(right_on) must equal len(left_on)")
   1225

TypeError: object of type 'NoneType' has no len()
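
The length rule also applies when several key columns are used. The following is a minimal sketch with two hypothetical frames (left_df, right_df and their column names are invented for illustration): equal-length lists merge normally, while mismatched lengths are rejected with a ValueError. The exact exception type and message for other mistakes, such as omitting right_on entirely, vary between pandas versions.

```python
import pandas as pd

# Hypothetical sample frames with two key columns on each side.
left_df = pd.DataFrame({'k1': ['a', 'b'], 'k2': ['x', 'y'], 'lval': [1, 2]})
right_df = pd.DataFrame({'r1': ['a', 'b'], 'r2': ['x', 'z'], 'rval': [10, 20]})

# Two key columns on each side: the lists have the same length.
ok = pd.merge(left_df, right_df, left_on=['k1', 'k2'], right_on=['r1', 'r2'])

# Mismatched lengths: two left keys but only one right key.
try:
    pd.merge(left_df, right_df, left_on=['k1', 'k2'], right_on=['r1'])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(len(ok))
print(mismatch_error)
```

Only the pair ('a', 'x') exists on both sides, so the successful merge has a single row.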

Using the index as the merge key

merge() can also use the index as the merge key. In this case the two parameters left_index and right_index are used, and both are set to True:

import numpy as np

np.random.seed([3, 14])
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'

left
           value
idxkey          
A      -0.602923
B      -0.402655
C       0.302329
D      -0.524349

right

           value
idxkey          
B       0.543843
D       0.013135
E      -0.326498
F       1.385076



left.merge(right, left_index=True, right_index=True)


         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135
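
For index-on-index merges, DataFrame.join is a commonly used shorthand. The sketch below is an illustration under stated assumptions: it uses fixed values instead of random ones so the result is reproducible, and the names by_merge and by_join are invented here. It shows that an inner join with matching suffixes selects the same rows and columns as the merge call above.

```python
import pandas as pd

left = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0]},
                    index=['A', 'B', 'C', 'D'])
right = pd.DataFrame({'value': [5.0, 6.0, 7.0, 8.0]},
                     index=['B', 'D', 'E', 'F'])

# Index-on-index merge, as in the example above (inner join by default).
by_merge = left.merge(right, left_index=True, right_index=True)

# DataFrame.join aligns on the index by default (as a left join), so
# how='inner' is passed to match merge; suffixes are required because
# both frames have a column named 'value'.
by_join = left.join(right, how='inner', lsuffix='_x', rsuffix='_y')

print(by_merge)
print(by_join)
```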

Combine multiple DataFrames at the same time

There are several ways to merge more than two DataFrames. Two of them are:

Chained merge() calls (inefficient)

df1.merge(df2, ...).merge(df3, ...)

Each call in the chain needs its own parameters, and merging one pair at a time is relatively inefficient.
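
When every merge in the chain uses the same key and the same join type, the chain can be written once with functools.reduce. This is a minimal sketch of that common pattern, not a faster algorithm: it still merges pairwise. The frames use fixed illustrative values, and merged_all is an invented name.

```python
from functools import reduce

import pandas as pd

A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': [1, 2, 3, 4]})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': [5, 6, 7, 8]})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': [9, 10, 11, 12]})
dfs = [A, B, C]

# Fold the list pairwise: ((A merge B) merge C), all on the same key.
merged_all = reduce(
    lambda acc, df: pd.merge(acc, df, on='key', how='inner'),
    dfs,
)

print(merged_all)
```

Only the key D appears in all three frames, so the inner merge leaves a single row.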

pd.concat() to merge

pd.concat() can combine several DataFrames in one call. With axis=1 it aligns them on their index rather than on a key column, so the key is first moved into the index. The join type is chosen with the join parameter instead of how, and only 'inner' and 'outer' are supported:

np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})    
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C] 

# Note: the "key" column values are unique, so the resulting index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]

pd.concat(dfs2, axis=1, sort=False, join='inner')

       valueA    valueB  valueC
key                            
D    2.240893 -0.977278     1.0
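
For contrast with the inner result above, here is a hedged sketch of the same concatenation with join='outer', which keeps the union of the index values and fills the gaps with NaN. Fixed values replace the random ones so the shape is easy to verify; outer is an illustrative name.

```python
import pandas as pd

A2 = pd.DataFrame({'valueA': [1.0, 2.0, 3.0, 4.0]},
                  index=pd.Index(['A', 'B', 'C', 'D'], name='key'))
B2 = pd.DataFrame({'valueB': [5.0, 6.0, 7.0, 8.0]},
                  index=pd.Index(['B', 'D', 'E', 'F'], name='key'))
C2 = pd.DataFrame({'valueC': [1.0, 1.0, 1.0, 1.0]},
                  index=pd.Index(['D', 'E', 'J', 'C'], name='key'))

# join='outer' keeps every key that appears in any frame.
outer = pd.concat([A2, B2, C2], axis=1, join='outer')

print(outer.shape)  # (7, 3)
print(outer)
```

The seven rows are the keys A, B, C, D, E, F and J; only D has a value in all three columns.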

The indicator parameter

When indicator is set to True, a new column named _merge is added to the result. It records where each row came from: both, left_only or right_only.

In [25]: pd.merge(data1,data2,how ='outer',on = ['key','key1'],indicator = True)
Out[25]:
     a    b key key1    c    d      _merge
0   a1   b1   a    d   c1   d1        both
1   a2   b2   b    e   c2   d2        both
2   a3   b3   c    f  NaN  NaN   left_only
3  NaN  NaN   a    e   c3   d3  right_only

indicator can also be set to a string, which is then used as the name of the new column:

In [27]: pd.merge(data1,data2,how ='outer',on = ['key','key1'],indicator ="col_info")
Out[27]:
     a    b key key1    c    d    col_info
0   a1   b1   a    d   c1   d1        both
1   a2   b2   b    e   c2   d2        both
2   a3   b3   c    f  NaN  NaN   left_only
3  NaN  NaN   a    e   c3   d3  right_only
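
One practical use of the indicator column is to keep only the rows that exist on one side, sometimes called an anti-join. The following is a minimal sketch of that idea using the data1 and data2 frames from above; the name left_only is illustrative.

```python
import pandas as pd

data1 = pd.DataFrame({'a': ['a1', 'a2', 'a3'],
                      'b': ['b1', 'b2', 'b3'],
                      'key': ['a', 'b', 'c'],
                      'key1': ['d', 'e', 'f']})
data2 = pd.DataFrame({'c': ['c1', 'c2', 'c3'],
                      'd': ['d1', 'd2', 'd3'],
                      'key': ['a', 'b', 'a'],
                      'key1': ['d', 'e', 'e']})

merged = pd.merge(data1, data2, how='outer', on=['key', 'key1'],
                  indicator=True)

# Keep the rows whose keys were found only in data1, then drop the marker.
left_only = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

print(left_only['a'].tolist())  # ['a3']
```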

That covers the usage of merge() introduced in this article. More in-depth and comprehensive usage will be discussed in a later post.

Finally, thank you all for reading!

Reference:

1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

2. https://stackoverflow.com/quest