Pandas Data Processing | Do You Know How to Merge Multiple Data Tables?

A summary of the pandas methods for merging two data tables.

The merge() function merges two DataFrame objects or Series, and it is used frequently in data processing. The official documentation defines the function as follows:

pandas.merge(left, right, how: str = 'inner', on=None, left_on=None, right_on=None, left_index: bool = False, right_index: bool = False, sort: bool = False, suffixes=('_x', '_y'), copy: bool = True, indicator: bool = False, validate=None)

First, here is the meaning and purpose of each parameter:

left
The DataFrame on the left.
right
The DataFrame on the right, to be merged with left.
how
The merge method, one of left, right, inner, or outer; the default is inner.
  • left: use only the keys from the left frame; right-side values with no match are set to NaN;
  • right: use only the keys from the right frame; left-side values with no match are set to NaN;
  • outer: use the union of the keys from both frames; values with no match are set to NaN;
  • inner: use the intersection of the keys from both frames; rows with no match are dropped.
on
label or list. The column name(s) to merge on. The names must exist in both DataFrames. If on is not set (and the index is not used as the key), the default is the intersection of the column names of the two DataFrames.
left_on
label, list, or array-like. The column or index level names in the left DataFrame to use as keys.
right_on
label, list, or array-like. The column or index level names in the right DataFrame to use as keys.
left_index
bool, default False. Use the index of the left DataFrame as the join key. If it is a MultiIndex, the number of keys in the right DataFrame must match the number of levels.
right_index
bool, default False. Use the index of the right DataFrame as the join key. If it is a MultiIndex, the number of keys in the left DataFrame must match the number of levels.
sort
bool, default False. If True, the join keys are sorted lexicographically in the merged DataFrame.
suffixes
tuple of (str, str), default ('_x', '_y'). The suffixes appended to overlapping column names from the left and right sides so that they can be told apart. If set to (False, False), an exception is raised when column names overlap.
copy
bool, default True. If False, avoid copying where possible.
indicator
bool or str, default False. If True, a column named "_merge" is added to the output DataFrame with information about the source of each row: left_only when the key appears only on the left, right_only when it appears only on the right, and both when it appears on both sides.
validate
str, optional. If specified, checks that the merge is of the given type:
  • 1:1: check that the merge keys are unique in both the left and right datasets;
  • 1:m: check that the merge keys are unique in the left dataset;
  • m:1: check that the merge keys are unique in the right dataset.
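The validate checks described above can be illustrated with a minimal sketch. The two small frames here are hypothetical and are not part of the article's examples; the right frame deliberately repeats a key so that a one-to-one check fails.

```python
import pandas as pd

# Hypothetical data: "key" is unique on the left but repeated on the right.
left = pd.DataFrame({"key": ["K0", "K1"], "A": ["A0", "A1"]})
right = pd.DataFrame({"key": ["K0", "K0", "K1"], "C": ["C0", "C1", "C2"]})

# "1:m" only requires unique keys on the left, so this merge passes.
one_to_many = pd.merge(left, right, on="key", validate="1:m")

# "1:1" also requires unique keys on the right, so this merge is rejected.
try:
    pd.merge(left, right, on="key", validate="1:1")
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

print(len(one_to_many), one_to_one_ok)
```

Using validate this way is a cheap guard against accidental row duplication when a key that was assumed to be unique is not.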

That covers the parameters. The examples below use them to show in detail how merge() works in practice, in several parts:

When the left and right DataFrames share the same key column name:

**Set the parameter on to achieve a simple merge of two DataFrames**

In [1]: import pandas as pd

In [2]: data1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
   ...:                       'A': ['A0', 'A1', 'A2', 'A3'],
   ...:                       'B': ['B0', 'B1', 'B2', 'B3']})

In [4]: data2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
   ...:                       'C': ['C0', 'C1', 'C2', 'C3'],
   ...:                       'D': ['D0', 'D1', 'D2', 'D3']})

In [5]: result = pd.merge(data1,data2,on = 'key')

In [6]: result
Out[6]:
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3
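As noted in the parameter list, when on is omitted merge() falls back to the columns the two frames have in common. The sketch below uses trimmed copies of data1 and data2 (fewer columns, an assumption made only to keep it short) to compare the explicit and implicit forms.

```python
import pandas as pd

data1 = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"],
                      "A": ["A0", "A1", "A2", "A3"]})
data2 = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"],
                      "C": ["C0", "C1", "C2", "C3"]})

# Explicit key column.
explicit = pd.merge(data1, data2, on="key")

# No "on": the only shared column name, "key", is used as the merge key.
implicit = pd.merge(data1, data2)

print(explicit.equals(implicit))
```

Spelling out on is still the safer habit, because an unexpected shared column name would silently become part of the key.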

The four merge methods of merge()

how = "left": only the keys on the left are used as the reference; right-side values with no match are set to NaN
how = "right": only the keys on the right are used as the reference; left-side values with no match are set to NaN
how = "outer": the union of the left and right keys is used as the reference; values with no match are set to NaN
how = "inner": the intersection of the left and right keys is used as the reference; rows with no match are dropped

In [7]: data1 = pd.DataFrame({'a': ['a1', 'a2', 'a3'],
   ...:                       'b': ['b1', 'b2', 'b3'],
   ...:                       'key': ['a', 'b', 'c'],
   ...:                       'key1': ['d', 'e', 'f']})

In [8]: data2 = pd.DataFrame({'c': ['c1', 'c2', 'c3'],
   ...:                       'd': ['d1', 'd2', 'd3'],
   ...:                       'key': ['a', 'b', 'a'],
   ...:                       'key1': ['d', 'e', 'e']})

**how= "left" merge**

The keys of the left DataFrame are used as the reference. Where the right side has no match, its columns are filled with NaN, and rows of the right DataFrame whose keys do not appear on the left are dropped.

[Figure: diagram of the how="left" merge]

  • Chart interpretation:
    • red: the row is removed;
    • blue: the row is kept;
    • green: the unmatched value is replaced by NaN.

In [9]: # how = left: use the keys on the left as the reference

In [10]: pd.merge(data1,data2,how ="left",on = ['key','key1'])
Out[10]:
    a   b key key1    c    d
0  a1  b1   a    d   c1   d1
1  a2  b2   b    e   c2   d2
2  a3  b3   c    f  NaN  NaN

**how="right" merge**

The keys of the right DataFrame are used as the reference. Usage is the same as how="left", but in the opposite direction.

[Figure: diagram of the how="right" merge]

In [11]: # how = right: use the keys on the right as the reference

In [12]: pd.merge(data1,data2,how = 'right',on =['key','key1'])
Out[12]:
     a    b key key1   c   d
0   a1   b1   a    d  c1  d1
1   a2   b2   b    e  c2  d2
2  NaN  NaN   a    e  c3  d3

**how="inner" merge**

This merge method is used frequently. It uses the keys shared by the left and right DataFrames: rows that match are kept, and all rows that fail to match are dropped.

[Figure: diagram of the how="inner" merge]

In [16]: # how = inner: take the intersection of the left and right keys

In [17]: pd.merge(data1,data2,how ='inner',on = ['key','key1'])
Out[17]:
    a   b key key1   c   d
0  a1  b1   a    d  c1  d1
1  a2  b2   b    e  c2  d2

**how="outer" merge**

The counterpart of how="inner": it uses the union of the keys from the left and right DataFrames. Rows that match are kept, and where a key is missing on one side, that side's columns are filled with NaN.

[Figure: diagram of the how="outer" merge]

In [13]: # how = outer: take the union of the left and right keys
    
In [15]: pd.merge(data1,data2,how ='outer',on = ['key','key1'])
Out[15]:
     a    b key key1    c    d
0   a1   b1   a    d   c1   d1
1   a2   b2   b    e   c2   d2
2   a3   b3   c    f  NaN  NaN
3  NaN  NaN   a    e   c3   d3
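The four methods can be compared side by side by counting the rows each one returns. This is a small sketch that rebuilds data1 and data2 from the example above; the counts variable is only an illustrative helper.

```python
import pandas as pd

data1 = pd.DataFrame({"a": ["a1", "a2", "a3"],
                      "b": ["b1", "b2", "b3"],
                      "key": ["a", "b", "c"],
                      "key1": ["d", "e", "f"]})
data2 = pd.DataFrame({"c": ["c1", "c2", "c3"],
                      "d": ["d1", "d2", "d3"],
                      "key": ["a", "b", "a"],
                      "key1": ["d", "e", "e"]})

# Number of rows produced by each merge method on the same key pair.
counts = {
    how: len(pd.merge(data1, data2, how=how, on=["key", "key1"]))
    for how in ("left", "right", "inner", "outer")
}
print(counts)
```

Two key pairs appear on both sides, one only on the left and one only on the right, which is why inner gives the fewest rows and outer the most.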

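Merging DataFrames with different key names

When the two DataFrames to be merged use different key column names, the left_on and right_on parameters are needed to specify the key columns of the left and right DataFrames respectively.

Using left_on and right_on as the merge keys

When left_on is given, right_on must be set to the corresponding key name(s), and len(left_on) == len(right_on) must hold.

The suffixes parameter is used here because both sides have a column with the same name (value); it ensures that all column names are distinct after the merge.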

In [18]: df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
    ...:                     'value': [1, 2, 3, 5]})

In [19]: df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
    ...:                     'value': [5, 6, 7, 8]})

In [20]: df1
Out[20]:
  lkey  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5

In [21]: df2
Out[21]:
  rkey  value
0  foo      5
1  bar      6
2  baz      7
3  foo      8

In [22]: pd.merge(df1,df2,left_on ='lkey')


In [23]: pd.merge(df1,df2,left_on ='lkey',right_on ='rkey')
Out[23]:
  lkey  value_x rkey  value_y
0  foo        1  foo        5
1  foo        1  foo        8
2  foo        5  foo        5
3  foo        5  foo        8
4  bar        2  bar        6
5  baz        3  baz        7

# After setting the suffixes parameter
In [24]: pd.merge(df1,df2,left_on ='lkey',right_on ='rkey',suffixes=("_lf","_rf"))
Out[24]:
  lkey  value_lf rkey  value_rf
0  foo         1  foo         5
1  foo         1  foo         8
2  foo         5  foo         5
3  foo         5  foo         8
4  bar         2  bar         6
5  baz         3  baz         7

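Before merging, make sure left_on and right_on are both given and have the same length, len(left_on) == len(right_on). Otherwise an error like the following is raised (this one comes from In [22] above, where right_on was omitted):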

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-22-0660dac837b1> in <module>
----> 1 pd.merge(df1,df2,left_on ='lkey')

~\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     79         copy=copy,
     80         indicator=indicator,
---> 81         validate=validate,
     82     )
     83     return op.get_result()

~\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    617             warnings.warn(msg, UserWarning)
    618
--> 619         self._validate_specification()
    620
    621         # note this function has side effects

~\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in _validate_specification(self)
   1221                     )
   1222                 self.left_on = [None] * n
-> 1223         if len(self.right_on) != len(self.left_on):
   1224             raise ValueError("len(right_on) must equal len(left_on)")
   1225

TypeError: object of type 'NoneType' has no len()
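The pairing rule can be sketched with two keys per side. The orders and targets frames and their column names are hypothetical. The exception type raised when right_on is missing appears to differ between pandas versions (the traceback above shows TypeError, while other versions may raise a ValueError subclass), so the sketch catches both as an assumption.

```python
import pandas as pd

# Hypothetical frames whose key columns are named differently on each side.
orders = pd.DataFrame({"cust": ["a", "a", "b"],
                       "yr": [2019, 2020, 2020],
                       "amount": [10, 20, 30]})
targets = pd.DataFrame({"customer": ["a", "b"],
                        "year": [2020, 2020],
                        "target": [15, 25]})

# left_on and right_on are paired by position: cust with customer, yr with year.
matched = pd.merge(orders, targets,
                   left_on=["cust", "yr"],
                   right_on=["customer", "year"])

# Giving left_on without right_on is rejected.
try:
    pd.merge(orders, targets, left_on="cust")
    failed = False
except (TypeError, ValueError):
    failed = True

print(matched)
print(failed)
```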

Using the index as the merge key

merge() can also use the index as the merge key. This takes two parameters, left_index and right_index, both set to True:

import numpy as np

np.random.seed([3, 14])
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
# Give both indexes the same name so it carries over to the merged result.
left.index.name = right.index.name = 'idxkey'

left
           value
idxkey          
A      -0.602923
B      -0.402655
C       0.302329
D      -0.524349

right

           value
idxkey          
B       0.543843
D       0.013135
E      -0.326498
F       1.385076



left.merge(right, left_index=True, right_index=True)


         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135
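The index parameters can also be mixed with the column parameters. The sketch below is a variation on the example above, assuming hypothetical data in which the key is an ordinary column on the left and the index on the right, so left_on is combined with right_index.

```python
import pandas as pd

# Hypothetical data: the key is a column on the left, the index on the right.
left = pd.DataFrame({"idxkey": ["B", "D", "X"], "value": [1, 2, 3]})
right = pd.DataFrame({"score": [10, 20]}, index=["B", "D"])

# Match the left column "idxkey" against the index of the right frame.
result = pd.merge(left, right, left_on="idxkey", right_index=True, how="inner")
print(result)
```

The row with key "X" has no counterpart in the right index, so the inner merge drops it.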

Merging multiple DataFrames at once

There are many ways to merge multiple DataFrames; a few are listed here:

Chained merge() (inefficient)

df1.merge(df2, ...).merge(df3, ...)

This approach requires setting the parameters for every call, and it is relatively inefficient.
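When every step uses the same key and method, the chain can be written once with functools.reduce. This is only a sketch of one possible approach, using hypothetical integer values instead of random ones; it performs the same pairwise merges as the chain, so it saves typing rather than time.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames sharing a "key" column.
A = pd.DataFrame({"key": ["A", "B", "C", "D"], "valueA": [1, 2, 3, 4]})
B = pd.DataFrame({"key": ["B", "D", "E", "F"], "valueB": [5, 6, 7, 8]})
C = pd.DataFrame({"key": ["D", "E", "J", "C"], "valueC": [9, 10, 11, 12]})
dfs = [A, B, C]

# Equivalent to A.merge(B, on="key").merge(C, on="key").
merged = reduce(lambda l, r: pd.merge(l, r, on="key", how="inner"), dfs)
print(merged)
```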

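Merging with pd.concat()

pd.concat() can combine several DataFrames at once by aligning them on their index. The idea is similar to the merge() methods described earlier, but the method is chosen with the join parameter instead of how, and join accepts only 'inner' and 'outer':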

np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]

# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]

pd.concat(dfs2, axis=1, sort=False, join='inner')

       valueA    valueB  valueC
key                            
D    2.240893 -0.977278     1.0
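For comparison, the same kind of call with join='outer' keeps every key that appears in any of the frames. The sketch below uses hypothetical integer values in place of the random ones so that the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical frames indexed by key, mirroring A2, B2 and C2 above.
A2 = pd.DataFrame({"valueA": [1, 2, 3, 4]}, index=["A", "B", "C", "D"])
B2 = pd.DataFrame({"valueB": [5, 6, 7, 8]}, index=["B", "D", "E", "F"])
C2 = pd.DataFrame({"valueC": [9, 10, 11, 12]}, index=["D", "E", "J", "C"])

# join="outer" takes the union of the indexes; missing cells become NaN.
outer = pd.concat([A2, B2, C2], axis=1, sort=False, join="outer")
print(outer)
```

Only the row for D is complete, which matches the single row returned by join='inner'.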

The indicator parameter

When indicator is set to True, a new column named _merge is added to the resulting DataFrame to record how each row was matched.

In [25]: pd.merge(data1,data2,how ='outer',on = ['key','key1'],indicator = True)
Out[25]:
     a    b key key1    c    d      _merge
0   a1   b1   a    d   c1   d1        both
1   a2   b2   b    e   c2   d2        both
2   a3   b3   c    f  NaN  NaN   left_only
3  NaN  NaN   a    e   c3   d3  right_only

indicator can also be set to a string, which is then used as the name of the new column:

In [27]: pd.merge(data1,data2,how ='outer',on = ['key','key1'],indicator ="col_info")
Out[27]:
     a    b key key1    c    d    col_info
0   a1   b1   a    d   c1   d1        both
1   a2   b2   b    e   c2   d2        both
2   a3   b3   c    f  NaN  NaN   left_only
3  NaN  NaN   a    e   c3   d3  right_only
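One common use of the indicator column is to pick out the rows that exist on only one side. The following sketch rebuilds data1 and data2 from earlier and filters on _merge; the names left_only and right_only are just illustrative variables.

```python
import pandas as pd

data1 = pd.DataFrame({"a": ["a1", "a2", "a3"],
                      "b": ["b1", "b2", "b3"],
                      "key": ["a", "b", "c"],
                      "key1": ["d", "e", "f"]})
data2 = pd.DataFrame({"c": ["c1", "c2", "c3"],
                      "d": ["d1", "d2", "d3"],
                      "key": ["a", "b", "a"],
                      "key1": ["d", "e", "e"]})

merged = pd.merge(data1, data2, how="outer", on=["key", "key1"], indicator=True)

# Rows whose key pair appears only in data1, and only in data2.
left_only = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
right_only = merged[merged["_merge"] == "right_only"].drop(columns="_merge")
print(left_only)
print(right_only)
```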

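That concludes this article's introduction to merge(). More in-depth and complete usage of merge() will be discussed in a later post.

Finally, thank you for reading!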

References:

1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

2. https://stackoverflow.com/questions/53645882/pandas-merging-101

This article was first published on the WeChat official account Z先生点记.

Origin blog.csdn.net/weixin_42512684/article/details/107461520