The merge() function joins two DataFrame objects (or named Series) on one or more keys and is used constantly in data processing. The official documentation gives its signature as follows:
pandas.merge(left, right, how: str = 'inner', on=None, left_on=None, right_on=None, left_index: bool = False, right_index: bool = False, sort: bool = False, suffixes=('_x', '_y'), copy: bool = True, indicator: bool = False, validate=None)
First, the meaning and role of each parameter:
| Parameter | Meaning |
|---|---|
| left | The DataFrame on the left |
With the parameters introduced, the following examples walk through the specific use of merge(), in several parts:
When the left and right DataFrames share the same key column
Set the on parameter for a simple merge of two DataFrames
In [1]: import pandas as pd
In [2]: data1 =pd.DataFrame({'key':['K0','K1','K2','K3'],
...: 'A':['A0','A1','A2','A3'],
...: 'B':['B0','B1','B2','B3']})
In [4]: data2 = pd.DataFrame({'key':['K0','K1','K2','K3'],
...: 'C':['C0','C1','C2','C3'],
...: 'D':['D0','D1','D2','D3']})
In [5]: result = pd.merge(data1,data2,on = 'key')
In [6]: result
Out[6]:
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
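As a minimal sketch, the same merge can be written as a standalone script. It also shows that when on is omitted, merge() falls back to the column names the two DataFrames have in common, which here is only key. The names explicit and inferred are illustrative and not part of the session above.

```python
import pandas as pd

data1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})
data2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# Explicit key column, as in the session above.
explicit = pd.merge(data1, data2, on='key')

# No `on`: pandas uses the column names shared by both frames ('key').
inferred = pd.merge(data1, data2)

print(explicit.equals(inferred))
```

Naming the key explicitly is usually the safer habit, since an accidental shared column would otherwise silently become part of the join key.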
The four merge methods of merge()
| how | Meaning |
|---|---|
| how = "left" | Only the keys on the left are used as the basis; elements on the right that fail to match are set to NaN |
In [7]: data1 =pd.DataFrame({'a':['a1','a2','a3'],
...: 'b':['b1','b2','b3'],
...: 'key':['a','b','c'],
...: 'key1':['d','e','f']})
...:
...:
In [8]: data2 = pd.DataFrame({'c':['c1','c2','c3'],
...: 'd':['d1','d2','d3'],
...: 'key':['a','b','a'],
...: 'key1':['d','e','e']})
how="left" merge
Use the keys of the left DataFrame as the basis. Every left row is kept; where no right row matches, the right-hand columns are filled with NaN, and right rows whose keys do not appear on the left are dropped.
- Chart interpretation:
- red: the row is removed;
- blue: the row is kept;
- green: the unmatched values are replaced by NaN;
In [9]: # how = left, use the left keys as the basis
In [10]: pd.merge(data1,data2,how ="left",on = ['key','key1'])
Out[10]:
a b key key1 c d
0 a1 b1 a d c1 d1
1 a2 b2 b e c2 d2
2 a3 b3 c f NaN NaN
how="right"
Use the keys of the right DataFrame as the basis. The usage mirrors how="left", with the direction reversed:
In [11]: # how = right, use the right keys as the basis
In [12]: pd.merge(data1,data2,how = 'right',on =['key','key1'])
Out[12]:
a b key key1 c d
0 a1 b1 a d c1 d1
1 a2 b2 b e c2 d2
2 NaN NaN a e c3 d3
how="inner"
This is the most frequently used method and the default. Only keys present in both DataFrames are kept; rows that fail to match on either side are dropped:
In [16]: # how = inner, take the intersection of left and right
In [17]: pd.merge(data1,data2,how ='inner',on = ['key','key1'])
Out[17]:
a b key key1 c d
0 a1 b1 a d c1 d1
1 a2 b2 b e c2 d2
how="outer"
The counterpart of how="inner": the union of the keys from both DataFrames is used. Matching rows are combined, and rows that match on only one side are kept, with the missing values filled with NaN:
In [13]: # how = outer, take the union of left and right
In [15]: pd.merge(data1,data2,how ='outer',on = ['key','key1'])
Out[15]:
a b key key1 c d
0 a1 b1 a d c1 d1
1 a2 b2 b e c2 d2
2 a3 b3 c f NaN NaN
3 NaN NaN a e c3 d3
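To compare the four methods side by side, the sketch below rebuilds data1 and data2 from the session above and counts the rows each value of how produces. The dictionary name row_counts is only illustrative.

```python
import pandas as pd

data1 = pd.DataFrame({'a': ['a1', 'a2', 'a3'],
                      'b': ['b1', 'b2', 'b3'],
                      'key': ['a', 'b', 'c'],
                      'key1': ['d', 'e', 'f']})
data2 = pd.DataFrame({'c': ['c1', 'c2', 'c3'],
                      'd': ['d1', 'd2', 'd3'],
                      'key': ['a', 'b', 'a'],
                      'key1': ['d', 'e', 'e']})

# Count the rows each join type produces for the same pair of key columns.
row_counts = {}
for how in ['left', 'right', 'inner', 'outer']:
    merged = pd.merge(data1, data2, how=how, on=['key', 'key1'])
    row_counts[how] = len(merged)

print(row_counts)  # {'left': 3, 'right': 3, 'inner': 2, 'outer': 4}
```

Two key pairs match on both sides, each side has one unmatched pair, so the outer join has 2 + 1 + 1 = 4 rows.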
Merging DataFrames with different key column names
When the key columns of the two DataFrames have different names, use the left_on and right_on parameters to name the key column(s) of the left and right DataFrame respectively.
left_on and right_on as the merge keys
Whenever left_on is set, right_on must be set to the corresponding key name(s), and both must list the same number of keys: len(left_on) == len(right_on).
The suffixes parameter matters here because both DataFrames have a non-key column with the same name (value); the suffixes keep the column names distinct after the merge.
In [18]: df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
...: 'value': [1, 2, 3, 5]})
In [19]: df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
...: 'value': [5, 6, 7, 8]})
In [20]: df1
Out[20]:
lkey value
0 foo 1
1 bar 2
2 baz 3
3 foo 5
In [21]: df2
Out[21]:
rkey value
0 foo 5
1 bar 6
2 baz 7
3 foo 8
In [22]: pd.merge(df1,df2,left_on ='lkey')  # right_on is missing, so this call fails; the traceback is shown further down
In [23]: pd.merge(df1,df2,left_on ='lkey',right_on ='rkey')
Out[23]:
lkey value_x rkey value_y
0 foo 1 foo 5
1 foo 1 foo 8
2 foo 5 foo 5
3 foo 5 foo 8
4 bar 2 bar 6
5 baz 3 baz 7
# After setting the suffixes parameter
In [24]: pd.merge(df1,df2,left_on ='lkey',right_on ='rkey',suffixes=("_lf","_rf"))
Out[24]:
lkey value_lf rkey value_rf
0 foo 1 foo 5
1 foo 1 foo 8
2 foo 5 foo 5
3 foo 5 foo 8
4 bar 2 bar 6
5 baz 3 baz 7
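The result has six rows rather than four because foo appears twice in each DataFrame, so every left foo row pairs with every right foo row (2 x 2 = 4), plus one row each for bar and baz. The sketch below checks that count and, as one possible guard, uses the validate parameter from the signature to ask pandas to reject non-unique keys. The names foo_rows and unique_keys are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'], 'value': [5, 6, 7, 8]})

merged = pd.merge(df1, df2, left_on='lkey', right_on='rkey',
                  suffixes=('_lf', '_rf'))
foo_rows = merged[merged['lkey'] == 'foo']
print(len(merged), len(foo_rows))  # 6 4

# validate='one_to_one' asks pandas to check that the keys are unique on
# both sides. The error it raises (MergeError) is a ValueError subclass.
try:
    pd.merge(df1, df2, left_on='lkey', right_on='rkey', validate='one_to_one')
    unique_keys = True
except ValueError:
    unique_keys = False
print(unique_keys)  # False
```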
Before merging, make sure left_on and right_on name the same number of keys, len(left_on) == len(right_on). In the session above, In [22] passes left_on without right_on, which fails with the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-22-0660dac837b1> in <module>
----> 1 pd.merge(df1,df2,left_on ='lkey')
~\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
79 copy=copy,
80 indicator=indicator,
---> 81 validate=validate,
82 )
83 return op.get_result()
~\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
617 warnings.warn(msg, UserWarning)
618
--> 619 self._validate_specification()
620
621 # note this function has side effects
~\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in _validate_specification(self)
1221 )
1222 self.left_on = [None] * n
-> 1223 if len(self.right_on) != len(self.left_on):
1224 raise ValueError("len(right_on) must equal len(left_on)")
1225
TypeError: object of type 'NoneType' has no len()
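The exact exception depends on the pandas version, so one way to get a clearer message is to check the key lists before calling merge(). The following is a minimal sketch; safe_merge is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

df1 = pd.DataFrame({'lkey': ['foo', 'bar'], 'value': [1, 2]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar'], 'value': [5, 6]})


def safe_merge(left, right, left_on, right_on):
    """Hypothetical helper: compare key-list lengths, then call merge()."""
    left_keys = [left_on] if isinstance(left_on, str) else list(left_on)
    right_keys = [right_on] if isinstance(right_on, str) else list(right_on)
    if len(left_keys) != len(right_keys):
        raise ValueError(
            f"left_on has {len(left_keys)} key(s) "
            f"but right_on has {len(right_keys)}"
        )
    return pd.merge(left, right, left_on=left_keys, right_on=right_keys)


# Matching key lists: the merge goes through.
ok = safe_merge(df1, df2, 'lkey', 'rkey')

# Mismatched key lists: the helper raises before pandas is called.
try:
    safe_merge(df1, df2, ['lkey', 'value'], ['rkey'])
    message = None
except ValueError as exc:
    message = str(exc)
print(message)
```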
Using the index as the merge key
merge() can also use the index as the merge key. In this case the two parameters left_index and right_index are used, both set to True:
import numpy as np
np.random.seed([3, 14])
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A -0.602923
B -0.402655
C 0.302329
D -0.524349
right
value
idxkey
B 0.543843
D 0.013135
E -0.326498
F 1.385076
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
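The same index-based merge can be sketched with fixed values instead of random ones, which makes the result easy to check by hand. The numbers below are made up for illustration, and the how='outer' variant is an addition to show that the how parameter applies to index merges as well.

```python
import pandas as pd

# Fixed values instead of random ones so the result is easy to verify.
left = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0]},
                    index=['A', 'B', 'C', 'D'])
right = pd.DataFrame({'value': [20.0, 40.0, 50.0, 60.0]},
                     index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'

# Default inner join on the index: only labels B and D exist on both sides.
inner = left.merge(right, left_index=True, right_index=True)

# Outer join on the index: the union of the labels, with NaN for the gaps.
outer = left.merge(right, left_index=True, right_index=True, how='outer')

print(inner)
print(sorted(outer.index))
```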
Combine multiple DataFrames at the same time
There are several ways to merge more than two DataFrames; two of them are shown here:
Chained merge() calls (inefficient)
df1.merge(df2, ...).merge(df3, ...)
Each call in the chain needs its own parameters, and chaining is relatively inefficient because every step builds an intermediate DataFrame;
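When every step uses the same key, the chain can be written as a fold over a list of DataFrames. This is a sketch using functools.reduce from the standard library; it performs the same pairwise merges, so it is tidier rather than faster. The integer values are made up for illustration.

```python
from functools import reduce

import pandas as pd

A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': [1, 2, 3, 4]})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': [5, 6, 7, 8]})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': [9, 10, 11, 12]})

# Fold the list pairwise: (A merge B) merge C, all inner joins on 'key'.
merged = reduce(lambda l, r: pd.merge(l, r, on='key'), [A, B, C])
print(merged)
```

Only key D appears in all three DataFrames, so the inner joins leave a single row.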
Merging with pd.concat()
pd.concat() can combine several DataFrames in one call. With axis=1 it aligns them on their index rather than on a key column, which is why the key column is moved into the index below. Instead of how, the join type is given by the join parameter, which accepts only 'inner' or 'outer':
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 2.240893 -0.977278 1.0
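For comparison, here is a sketch of the same pd.concat() call with join='inner' and join='outer' on small fixed-value frames. The values are made up for illustration, and the index labels match those used for the key column above.

```python
import numpy as np
import pandas as pd

A2 = pd.DataFrame({'valueA': [1.0, 2.0, 3.0, 4.0]}, index=['A', 'B', 'C', 'D'])
B2 = pd.DataFrame({'valueB': [5.0, 6.0, 7.0, 8.0]}, index=['B', 'D', 'E', 'F'])
C2 = pd.DataFrame({'valueC': np.ones(4)}, index=['D', 'E', 'J', 'C'])

# join='inner' keeps only the labels present in every frame (just D).
inner = pd.concat([A2, B2, C2], axis=1, join='inner')

# join='outer' keeps the union of the labels and fills the gaps with NaN.
outer = pd.concat([A2, B2, C2], axis=1, join='outer')

print(inner)
print(outer)
```

Aligning on the index this way expects the index labels within each frame to be unique, which is the point of the note in the code above.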
The indicator parameter
Setting indicator=True adds a column named _merge to the result, recording for each row where its key came from: left_only, right_only, or both.
In [25]: pd.merge(data1,data2,how ='outer',on = ['key','key1'],indicator = True)
Out[25]:
a b key key1 c d _merge
0 a1 b1 a d c1 d1 both
1 a2 b2 b e c2 d2 both
2 a3 b3 c f NaN NaN left_only
3 NaN NaN a e c3 d3 right_only
indicator can also be set to a string, which is then used as the name of that column:
In [27]: pd.merge(data1,data2,how ='outer',on = ['key','key1'],indicator ="col_info")
Out[27]:
a b key key1 c d col_info
0 a1 b1 a d c1 d1 both
1 a2 b2 b e c2 d2 both
2 a3 b3 c f NaN NaN left_only
3 NaN NaN a e c3 d3 right_only
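One common use of the indicator column is to filter on it. The sketch below keeps only the rows whose key pair exists in data1 but not in data2 (sometimes called an anti-join). The variable name left_only is illustrative.

```python
import pandas as pd

data1 = pd.DataFrame({'a': ['a1', 'a2', 'a3'],
                      'b': ['b1', 'b2', 'b3'],
                      'key': ['a', 'b', 'c'],
                      'key1': ['d', 'e', 'f']})
data2 = pd.DataFrame({'c': ['c1', 'c2', 'c3'],
                      'd': ['d1', 'd2', 'd3'],
                      'key': ['a', 'b', 'a'],
                      'key1': ['d', 'e', 'e']})

merged = pd.merge(data1, data2, how='outer', on=['key', 'key1'],
                  indicator=True)

# Keep only rows whose key pair exists in data1 but not in data2,
# then drop the helper column.
left_only = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(left_only)
```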
That concludes this introduction to the usage of merge(). More in-depth and comprehensive usage will be discussed later.
Finally, thank you all for reading!
Reference:
1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html