pandas summary: merging two data tables with merge()
The merge() function merges two DataFrame objects (or named Series) and is used constantly in data processing. The official documentation defines it as follows:
pandas.merge(left, right, how: str = 'inner', on=None, left_on=None, right_on=None, left_index: bool = False, right_index: bool = False, sort: bool = False, suffixes=('_x', '_y'), copy: bool = True, indicator: bool = False, validate=None)
First, the meaning and purpose of each parameter:
Parameter | Description |
---|---|
left | The DataFrame on the left |
right | The DataFrame on the right to merge with |
how | One of four merge methods: left, right, inner, outer; the default is inner. left uses only the keys from the left frame; right uses only the keys from the right frame; outer takes the union of the keys from both frames and fills values that have no match with NaN; inner takes the intersection of the keys from both frames and drops rows that have no match |
on | Label or list: the column name(s) to join on. The names must exist in both DataFrames. If on is not set (and neither index flag is used), the default is the intersection of the column names of the two DataFrames |
left_on | Label, list, or array-like: column or index level names in the left DataFrame to use as keys |
right_on | Label, list, or array-like: column or index level names in the right DataFrame to use as keys |
left_index | bool, default False. If True, use the index of the left DataFrame as the join key(s). With a MultiIndex, the number of keys in the right DataFrame must match the number of index levels |
right_index | bool, default False. If True, use the index of the right DataFrame as the join key(s). With a MultiIndex, the number of keys in the left DataFrame must match the number of index levels |
sort | bool, default False. If True, sort the join keys lexicographically in the merged DataFrame |
suffixes | Tuple of (str, str), default ('_x', '_y'). Suffixes appended to overlapping column names from the left and right sides to tell them apart. In the pandas version documented here, passing (False, False) raises an exception when column names overlap |
copy | bool, default True. If False, avoid copying data where possible |
indicator | bool or str, default False. If True, a column named "_merge" is appended to the output DataFrame describing the source of each row: left_only when the key appears only in the left frame, right_only when it appears only in the right frame, and both when it appears in both. If a string is given, it is used as the column name |
validate | str, optional. If specified, checks that the merge is of the given type. 1:1 checks that the merge keys are unique in both the left and right datasets; 1:m checks that they are unique in the left dataset; m:1 checks that they are unique in the right dataset |
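The validate parameter is not exercised in the examples of this article, so here is a minimal sketch of how it might be used. The frames and their values are made up for illustration, and the exact wording of the error message depends on the installed pandas version.

```python
import pandas as pd

# Hypothetical data: "K1" appears twice on the left, every key is unique on the right.
left = pd.DataFrame({"key": ["K1", "K0", "K1"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["K0", "K1"], "C": [10, 20]})

# m:1 passes because each key appears at most once in the right frame.
ok = pd.merge(left, right, on="key", validate="m:1")

# 1:1 is rejected because "K1" is duplicated in the left frame.
try:
    pd.merge(left, right, on="key", validate="1:1")
    one_to_one_failed = False
except pd.errors.MergeError as err:
    one_to_one_failed = True
    print("validate='1:1' rejected the merge:", err)
```

Checking the relationship at merge time is a cheap way to catch unexpected duplicate keys before they silently multiply rows.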
That covers the parameters. The sections below work through examples built around them to show how merge() is used in practice:
When the left and right DataFrames share the same key column
**Set the parameter on to achieve a simple merge of two DataFrames**
In [1]: import pandas as pd
In [2]: data1 = pd.DataFrame({
   ...:     'key': ['K0', 'K1', 'K2', 'K3'],
   ...:     'A': ['A0', 'A1', 'A2', 'A3'],
   ...:     'B': ['B0', 'B1', 'B2', 'B3']})
In [4]: data2 = pd.DataFrame({
   ...:     'key': ['K0', 'K1', 'K2', 'K3'],
   ...:     'C': ['C0', 'C1', 'C2', 'C3'],
   ...:     'D': ['D0', 'D1', 'D2', 'D3']})
In [5]: result = pd.merge(data1, data2, on='key')
In [6]: result
Out[6]:
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
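As the parameter table notes, on defaults to the column names shared by both frames. A small sketch, reusing the same data as above, suggests that leaving on out gives the same result here because key is the only common column:

```python
import pandas as pd

data1 = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
})
data2 = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],
})

explicit = pd.merge(data1, data2, on="key")
# With on omitted, merge() joins on the intersection of the column names ("key").
implicit = pd.merge(data1, data2)

print(explicit.equals(implicit))
```

Naming the key explicitly is usually the safer habit, since an unrelated column that happens to share a name would otherwise become part of the join key.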
The four merge methods of merge()
how | Behavior |
---|---|
how="left" | Uses only the left keys as the basis; values with no match on the right are set to NaN |
how="right" | Uses only the right keys as the basis; values with no match on the left are set to NaN |
how="outer" | Uses the union of the left and right keys; values with no match are set to NaN |
how="inner" | Uses the intersection of the left and right keys; rows with no match are dropped |
In [7]: data1 = pd.DataFrame({
   ...:     'a': ['a1', 'a2', 'a3'],
   ...:     'b': ['b1', 'b2', 'b3'],
   ...:     'key': ['a', 'b', 'c'],
   ...:     'key1': ['d', 'e', 'f']})
In [8]: data2 = pd.DataFrame({
   ...:     'c': ['c1', 'c2', 'c3'],
   ...:     'd': ['d1', 'd2', 'd3'],
   ...:     'key': ['a', 'b', 'a'],
   ...:     'key1': ['d', 'e', 'e']})
**how="left" merge**
The keys of the left DataFrame are the basis. Where the right side has no match, its values are filled with NaN, and rows whose keys exist only on the right are dropped.
- Chart interpretation:
- red: the row is removed;
- blue: the row is kept;
- green: the unmatched values are replaced by NaN;
In [9]: # how='left': use the left keys as the basis
In [10]: pd.merge(data1,data2,how ="left",on = ['key','key1'])
Out[10]:
a b key key1 c d
0 a1 b1 a d c1 d1
1 a2 b2 b e c2 d2
2 a3 b3 c f NaN NaN
**how="right" merge**
The keys of the right DataFrame are the basis; usage mirrors how="left" with the sides swapped.
In [11]: # how='right': use the right keys as the basis
In [12]: pd.merge(data1,data2,how = 'right',on =['key','key1'])
Out[12]:
a b key key1 c d
0 a1 b1 a d c1 d1
1 a2 b2 b e c2 d2
2 NaN NaN a e c3 d3
**how="inner" merge**
This method is used very often (it is the default). It is based on the keys shared by the left and right DataFrames: rows that match are kept, and all rows that fail to match are dropped.
In [16]: # how='inner': take the intersection of the left and right keys
In [17]: pd.merge(data1,data2,how ='inner',on = ['key','key1'])
Out[17]:
a b key key1 c d
0 a1 b1 a d c1 d1
1 a2 b2 b e c2 d2
**how="outer" merge**
The counterpart of how="inner": it is based on the union of the left and right keys. Rows that match are kept as they are, and where a key exists on only one side, the columns from the other side are filled with NaN.
In [13]: # how='outer': take the union of the left and right keys
In [15]: pd.merge(data1,data2,how ='outer',on = ['key','key1'])
Out[15]:
a b key key1 c d
0 a1 b1 a d c1 d1
1 a2 b2 b e c2 d2
2 a3 b3 c f NaN NaN
3 NaN NaN a e c3 d3
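To compare the four methods side by side, the sketch below reruns the same merge with each value of how and records the number of rows in the result. It reuses data1 and data2 from above; the row_counts dictionary is just an illustrative helper.

```python
import pandas as pd

data1 = pd.DataFrame({
    "a": ["a1", "a2", "a3"],
    "b": ["b1", "b2", "b3"],
    "key": ["a", "b", "c"],
    "key1": ["d", "e", "f"],
})
data2 = pd.DataFrame({
    "c": ["c1", "c2", "c3"],
    "d": ["d1", "d2", "d3"],
    "key": ["a", "b", "a"],
    "key1": ["d", "e", "e"],
})

# Count the rows produced by each merge method on the same pair of keys.
row_counts = {}
for how in ["left", "right", "inner", "outer"]:
    merged = pd.merge(data1, data2, how=how, on=["key", "key1"])
    row_counts[how] = len(merged)

print(row_counts)  # {'left': 3, 'right': 3, 'inner': 2, 'outer': 4}
```

Two key pairs, (a, d) and (b, e), exist on both sides, and each side has one key pair of its own, which accounts for the counts 3, 3, 2 and 4.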
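Merging DataFrames with different key names
When the two DataFrames to be merged use different names for their key columns, use the left_on and right_on parameters to name the key column(s) of the left and right DataFrame respectively.
Using left_on and right_on as the merge keys
When left_on names a key, right_on must name the corresponding key, and len(left_on) == len(right_on) must hold.
The suffixes parameter is added because both sides have a column with the same name (value); it guarantees that every column name in the merged result is distinct.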
In [18]: df1 = pd.DataFrame({
    ...:     'lkey': ['foo', 'bar', 'baz', 'foo'],
    ...:     'value': [1, 2, 3, 5]})
In [19]: df2 = pd.DataFrame({
    ...:     'rkey': ['foo', 'bar', 'baz', 'foo'],
    ...:     'value': [5, 6, 7, 8]})
In [20]: df1
Out[20]:
lkey value
0 foo 1
1 bar 2
2 baz 3
3 foo 5
In [21]: df2
Out[21]:
rkey value
0 foo 5
1 bar 6
2 baz 7
3 foo 8
In [22]: pd.merge(df1,df2,left_on ='lkey')  # fails: right_on is missing
In [23]: pd.merge(df1,df2,left_on ='lkey',right_on ='rkey')
Out[23]:
lkey value_x rkey value_y
0 foo 1 foo 5
1 foo 1 foo 8
2 foo 5 foo 5
3 foo 5 foo 8
4 bar 2 bar 6
5 baz 3 baz 7
# After setting the suffixes parameter
In [24]: pd.merge(df1,df2,left_on ='lkey',right_on ='rkey',suffixes=("_lf","_rf"))
Out[24]:
lkey value_lf rkey value_rf
0 foo 1 foo 5
1 foo 1 foo 8
2 foo 5 foo 5
3 foo 5 foo 8
4 bar 2 bar 6
5 baz 3 baz 7
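The result has six rows even though each input has only four, because foo appears twice on both sides and every left foo row is paired with every right foo row. A rough sketch of that arithmetic, using the same df1 and df2 (the variable expected_rows is only an illustrative name):

```python
import pandas as pd

df1 = pd.DataFrame({"lkey": ["foo", "bar", "baz", "foo"], "value": [1, 2, 3, 5]})
df2 = pd.DataFrame({"rkey": ["foo", "bar", "baz", "foo"], "value": [5, 6, 7, 8]})

merged = pd.merge(df1, df2, left_on="lkey", right_on="rkey",
                  suffixes=("_lf", "_rf"))

# For an inner merge, each key contributes (left count) x (right count) rows.
left_counts = df1["lkey"].value_counts()
right_counts = df2["rkey"].value_counts()
expected_rows = int((left_counts * right_counts).dropna().sum())

print(len(merged), expected_rows)  # 6 6
```

Here foo contributes 2 x 2 = 4 rows, while bar and baz contribute one row each.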
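Before merging, make sure the key lists have the same length, len(left_on) == len(right_on). Otherwise, for example when left_on is passed without right_on, an error like the following is raised: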
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-22-0660dac837b1> in <module>
----> 1 pd.merge(df1,df2,left_on ='lkey')
~\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
79 copy=copy,
80 indicator=indicator,
---> 81 validate=validate,
82 )
83 return op.get_result()
~\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
617 warnings.warn(msg, UserWarning)
618
--> 619 self._validate_specification()
620
621 # note this function has side effects
~\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in _validate_specification(self)
1221 )
1222 self.left_on = [None] * n
-> 1223 if len(self.right_on) != len(self.left_on):
1224 raise ValueError("len(right_on) must equal len(left_on)")
1225
TypeError: object of type 'NoneType' has no len()
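The exception type seems to depend on the pandas version: the traceback above shows a TypeError, while more recent releases are expected to raise pandas.errors.MergeError (a ValueError subclass) with a clearer message. A defensive sketch that assumes only one of those two families is raised:

```python
import pandas as pd

df1 = pd.DataFrame({"lkey": ["foo", "bar", "baz", "foo"], "value": [1, 2, 3, 5]})
df2 = pd.DataFrame({"rkey": ["foo", "bar", "baz", "foo"], "value": [5, 6, 7, 8]})

try:
    # left_on without right_on (or right_index) is an incomplete specification.
    pd.merge(df1, df2, left_on="lkey")
    failed = False
except (TypeError, ValueError) as err:
    # Assumption: older versions raise TypeError, newer ones a ValueError subclass.
    failed = True
    print(type(err).__name__, err)
```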
Merging on the index
merge() can also use the index as the merge key. This relies on two parameters, left_index and right_index, both set to True:
import numpy as np
import pandas as pd

np.random.seed([3, 14])
left = pd.DataFrame({
'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])
right = pd.DataFrame({
'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
# Give both indexes the same name so the merged index is labelled
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A -0.602923
B -0.402655
C 0.302329
D -0.524349
right
value
idxkey
B 0.543843
D 0.013135
E -0.326498
F 1.385076
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
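The index of one frame can also be matched against an ordinary column of the other by combining left_index with right_on. The sketch below uses made-up integer values and a hypothetical column name colkey so that the result is easy to check by eye:

```python
import pandas as pd

# Left frame keyed by its index, right frame keyed by a regular column.
left = pd.DataFrame({"value": [1, 2, 3, 4]}, index=["A", "B", "C", "D"])
left.index.name = "idxkey"
right = pd.DataFrame({"colkey": ["B", "D", "E", "F"], "value": [10, 20, 30, 40]})

# Match the left index against the right frame's "colkey" column (inner join).
merged = left.merge(right, left_index=True, right_on="colkey")
print(merged)
```

Only B and D exist on both sides, so the inner join keeps those two rows, with the overlapping value columns distinguished by the default suffixes.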
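Merging multiple DataFrames at once
There are many ways to merge several DataFrames; a few of them are listed here:
Chained merge() (inefficient)
df1.merge(df2, ...).merge(df3, ...)
This approach requires the parameters to be set for every call and is comparatively inefficient;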
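When every frame shares the same key column, the chain can be folded with functools.reduce instead of being written out by hand. This is only a sketch with hypothetical frames df_a, df_b and df_c; it still performs one merge per frame, so it tidies the code rather than making it faster.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames that all share a "key" column.
df_a = pd.DataFrame({"key": ["A", "B", "C", "D"], "valueA": [1, 2, 3, 4]})
df_b = pd.DataFrame({"key": ["B", "D", "E", "F"], "valueB": [5, 6, 7, 8]})
df_c = pd.DataFrame({"key": ["D", "E", "J", "C"], "valueC": [9, 10, 11, 12]})

# Fold the list pairwise: ((df_a merge df_b) merge df_c).
merged = reduce(
    lambda acc, nxt: pd.merge(acc, nxt, on="key", how="inner"),
    [df_a, df_b, df_c],
)
print(merged)
```

Only key D is present in all three frames, so the inner merge leaves a single row.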
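Merging with pd.concat()
pd.concat() can merge several DataFrames in a single call. With axis=1 it aligns the frames on their index, which is why the key column is set as the index in the code below. Where merge() selects the method with the how keyword, concat() uses the join parameter, which accepts only 'inner' and 'outer':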
np.random.seed(0)
A = pd.DataFrame({
'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({
'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({
'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
# Collect the re-indexed frames; concat aligns them on the shared index
dfs2 = [A2, B2, C2]
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 2.240893 -0.977278 1.0
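For contrast, a small sketch of the same call with join='outer', which keeps the union of the index labels and fills the gaps with NaN. The two frames here are made up and deliberately tiny:

```python
import pandas as pd

# Hypothetical frames indexed by "key"; only "D" is shared.
A2 = pd.DataFrame({"valueA": [1.0, 2.0]}, index=pd.Index(["A", "D"], name="key"))
B2 = pd.DataFrame({"valueB": [3.0, 4.0]}, index=pd.Index(["D", "E"], name="key"))

# Union of the index labels: missing cells become NaN.
outer = pd.concat([A2, B2], axis=1, join="outer")
# Intersection of the index labels: only "D" survives.
inner = pd.concat([A2, B2], axis=1, join="inner")

print(outer)
print(inner)
```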
The indicator parameter
When indicator is set to True, a new column named _merge is added to the resulting DataFrame; it records how each row was matched during the merge.
In [25]: pd.merge(data1,data2,how ='outer',on = ['key','key1'],indicator = True)
Out[25]:
a b key key1 c d _merge
0 a1 b1 a d c1 d1 both
1 a2 b2 b e c2 d2 both
2 a3 b3 c f NaN NaN left_only
3 NaN NaN a e c3 d3 right_only
indicator can also be set to a string, which is then used as the name of the new column
In [27]: pd.merge(data1,data2,how ='outer',on = ['key','key1'],indicator ="col_info")
Out[27]:
a b key key1 c d col_info
0 a1 b1 a d c1 d1 both
1 a2 b2 b e c2 d2 both
2 a3 b3 c f NaN NaN left_only
3 NaN NaN a e c3 d3 right_only
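One practical use of the indicator column is an "anti-join": keeping only the rows of the left frame that have no partner on the right. A minimal sketch with the same data1 and data2 (the names flagged and left_only_rows are illustrative):

```python
import pandas as pd

data1 = pd.DataFrame({
    "a": ["a1", "a2", "a3"],
    "b": ["b1", "b2", "b3"],
    "key": ["a", "b", "c"],
    "key1": ["d", "e", "f"],
})
data2 = pd.DataFrame({
    "c": ["c1", "c2", "c3"],
    "d": ["d1", "d2", "d3"],
    "key": ["a", "b", "a"],
    "key1": ["d", "e", "e"],
})

# Left merge with the indicator column, then keep only unmatched left rows.
flagged = pd.merge(data1, data2, how="left", on=["key", "key1"], indicator=True)
left_only_rows = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")

print(left_only_rows)
```

Only the row with key pair (c, f) has no match in data2, so it is the only row that remains.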
That is all for this introduction to merge(). Deeper and more comprehensive uses of merge() will be discussed in a later article.
Finally, thank you for reading!
Reference:
1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
2. https://stackoverflow.com/questions/53645882/pandas-merging-101
This article was first published on the WeChat official account (Z先生点记)