[Pandas] [1] DataFrame data merge, join (merge, join, concat)

merge to join columns by key

Pandas provides a merge function, similar to the join operation of a relational database, that connects rows of different DataFrames according to one or more keys. The syntax is as follows:

 
    merge(left, right, how='inner', on=None, left_on=None, right_on=None,
          left_index=False, right_index=False, sort=True,
          suffixes=('_x', '_y'), copy=True, indicator=False)

merge joins the rows of two data sets through one or more keys, similar to JOIN in SQL. The typical scenario is two tables that hold different fields for the same primary key, which we want to combine into one table. In this typical case the number of rows in the result does not increase, and the number of columns is the total number of columns of the two source tables minus the number of join keys.
on=None explicitly specifies the column name(s) to use as the key. If the key column is named differently in the two objects, specify the names separately with left_on and right_on. To use the row index as the join key instead, set left_index and/or right_index to True (both default to False).
how='inner' controls which keys are kept when the left and right objects have non-overlapping keys: inner takes the intersection, outer takes the union, and left and right keep the keys of one side only.
suffixes=('_x','_y') applies when the left and right objects share column names other than the join key: each side's column gets its suffix appended so the two can be told apart in the result.
For many-to-many joins, the result contains the Cartesian product of the matching rows for each key value.
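The many-to-many behaviour described above can be checked with a small sketch. The frames and the column names lval and rval are made up for illustration; they are not from the original post.

```python
import pandas as pd

# Hypothetical data: key "b" appears twice on the left and three times on the right
left = pd.DataFrame({"key": ["b", "b", "a"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "b", "b", "a"], "rval": [10, 20, 30, 40]})

merged = pd.merge(left, right, on="key")

# For each key, every matching left row is paired with every matching right row:
# "b" gives 2 x 3 = 6 rows, "a" gives 1 x 1 = 1 row
rows_per_key = merged.groupby("key").size().to_dict()
print(rows_per_key)
print(len(merged))
```

This is why a merge on a key that is not unique on either side can return more rows than either input.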

Parameter description:
left and right: the two DataFrames to merge
how: the join type: inner (inner join), left (left outer join), right (right outer join), outer (full outer join); the default is inner
on: the column name(s) used as the join key. They must exist in both the left and right DataFrame objects. If on is not specified and no other key parameter is given, the intersection of the column names of the two DataFrames is used as the join key
left_on: the column name used as the join key in the left DataFrame; useful together with right_on when the key columns have different names in left and right but the same meaning
right_on: the column name used as the join key in the right DataFrame
left_index: use the row index of the left DataFrame as the join key
right_index: use the row index of the right DataFrame as the join key
sort: sort the merged data by the join keys. The signature above lists True as the default; recent pandas releases default to False. Setting it to False usually improves performance
suffixes: a tuple of strings appended to column names that exist in both the left and right DataFrames; the default is ('_x','_y')
copy: the default is True, which always copies the data into the resulting data structure; setting it to False can improve performance in some cases
indicator: added in 0.17.0; when True, adds a column (named _merge) showing the source of each merged row: only the left side (left_only), only the right side (right_only), or both sides (both)
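As a rough illustration of the how, suffixes and indicator parameters listed above, here is a minimal sketch. The frames and the shared column name value are assumptions chosen so that the suffixes come into play.

```python
import pandas as pd

# Hypothetical frames that share the non-key column "value"
df1 = pd.DataFrame({"key": ["a", "b", "b"], "value": [0, 1, 2]})
df2 = pd.DataFrame({"key": ["a", "b", "c"], "value": [10, 11, 12]})

# Row counts for each join type on the same inputs
sizes = {
    how: len(pd.merge(df1, df2, on="key", how=how))
    for how in ("inner", "left", "right", "outer")
}
print(sizes)

# Custom suffixes for the overlapping column, plus the _merge source column
outer = pd.merge(
    df1, df2, on="key", how="outer",
    suffixes=("_left", "_right"), indicator=True,
)
print(outer)
```

Key c exists only in df2, so it appears in the right and outer results and is marked right_only in the _merge column.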
 

Some examples of merge features:

1. By default, overlapping column names are used as connection keys.

 

 
    # assumes: import pandas as pd; from pandas import DataFrame
    In [16]: df1=DataFrame({'key':['a','b','b'],'data1':range(3)})

    In [17]: df2=DataFrame({'key':['a','b','c'],'data2':range(3)})

    In [18]: pd.merge(df1,df2)  # no join key or join type given: overlapping column names are used
    Out[18]:
       data1 key  data2
    0      0   a      0
    1      1   b      1
    2      2   b      1

 

 

2. An inner join is performed by default (the intersection of the keys is taken). The other join types are left, right and outer; select one with the how parameter.

 

 
    In [19]: pd.merge(df2,df1)
    Out[19]:
       data2 key  data1
    0      0   a      0
    1      1   b      1
    2      1   b      2  # default inner join: 'c' has no match and is dropped

    In [20]: pd.merge(df2,df1,how='left')  # how selects the join type
    Out[20]:
       data2 key  data1
    0      0   a      0
    1      1   b      1
    2      1   b      2
    3      2   c    NaN

 

 

3. For a multi-key join, pass the join keys as a list, for example: pd.merge(df1,df2,on=['key1','key2'])

 

 
    In [23]: right=DataFrame({'key1':['foo','foo','bar','bar'],
        ...:                  'key2':['one','one','one','two'],
        ...:                  'lval':[4,5,6,7]})

    In [24]: left=DataFrame({'key1':['foo','foo','bar'],
        ...:                 'key2':['one','two','one'],
        ...:                 'lval':[1,2,3]})

    In [26]: pd.merge(left,right,on=['key1','key2'],how='outer')  # keys passed as a list
    Out[26]:
      key1 key2  lval_x  lval_y
    0  foo  one       1       4
    1  foo  one       1       5
    2  foo  two       2     NaN
    3  bar  one       3       6
    4  bar  two     NaN       7

 

 

4. If the key columns are named differently in the two objects, specify them separately, for example: pd.merge(df1,df2,left_on='lkey',right_on='rkey')

 

 
    In [31]: df3=DataFrame({'key3':['foo','foo','bar','bar'],  # same as right above, with the key columns renamed
        ...:                'key4':['one','one','one','two'],
        ...:                'lval':[4,5,6,7]})

    In [32]: pd.merge(left,df3,left_on='key1',right_on='key3')  # join on keys with different names
    Out[32]:
      key1 key2  lval_x key3 key4  lval_y
    0  foo  one       1  foo  one       4
    1  foo  one       1  foo  one       5
    2  foo  two       2  foo  one       4
    3  foo  two       2  foo  one       5
    4  bar  one       3  bar  one       6
    5  bar  one       3  bar  two       7

 

 

5. To use the index as the join key, pass left_index=True, right_index=True (join is usually more convenient for this).
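The original gives no example for the index case, so here is a minimal sketch. The frames and the column name group_val are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical data: the right frame carries its key in the row index
left = pd.DataFrame({"key": ["a", "b", "a", "c"], "value": range(4)})
right = pd.DataFrame({"group_val": [3.5, 7.0]}, index=["a", "b"])

# Left side joins on the column "key", right side on its row index
by_column_and_index = pd.merge(left, right, left_on="key", right_index=True)

# Both sides join on their row index
left_indexed = left.set_index("key")
by_index = pd.merge(left_indexed, right, left_index=True, right_index=True)

print(by_column_and_index)
print(by_index)
```

Both calls use the default inner join, so the row with key c, which has no entry in the right index, is dropped.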

join combines columns, mainly for merging on the index


The join method provides a convenient way to combine the columns of two DataFrames that have different column names into one DataFrame, aligned on the index.

join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

The parameters have basically the same meaning as in merge, but join defaults to a left outer join (how='left').
 

1. By default the join is done on the index, so frames with the same or similar indexes can be combined; unlike merge, overlapping column names are not used as keys.

2. Multiple DataFrames can be joined at once.

3. A column other than the index can also be used as the join key (the on parameter).

4. The join type is controlled by the parameter how.

5. Use lsuffix='' and rsuffix='' to distinguish columns that have the same name.
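The five points above can be sketched as follows. All frames and column names here are hypothetical and chosen only to keep the results small.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3]}, index=["x", "y", "z"])
right = pd.DataFrame({"B": [10, 20]}, index=["x", "y"])
other = pd.DataFrame({"C": [100, 300]}, index=["x", "z"])

# 1. Default: left join on the index; "z" has no match, so B is NaN there
joined = left.join(right)

# 2. Several frames can be joined in one call by passing a list
joined_many = left.join([right, other])

# 3. on= matches a column of the caller against the index of the other frame
lookup = pd.DataFrame({"key": ["x", "y", "x"], "val": [1, 2, 3]})
joined_on = lookup.join(right, on="key")

# 4. how= selects the join type
inner = left.join(right, how="inner")

# 5. Overlapping column names need lsuffix / rsuffix
dup = pd.DataFrame({"A": [7, 8]}, index=["x", "y"])
suffixed = left.join(dup, lsuffix="_left", rsuffix="_right")

print(joined, joined_many, joined_on, inner, suffixed, sep="\n\n")
```

Without the suffixes in step 5, pandas raises an error because both frames have a column named A.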

concat can stack multiple objects together along an axis

The concat method is equivalent to UNION ALL in a database. You can choose the axis to concatenate along, and the join parameter controls how the other axis is handled (only outer and inner are available). Unlike a database UNION, concat does not remove duplicates; to de-duplicate, use the drop_duplicates method.
 

 
    concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
           keys=None, levels=None, names=None, verify_integrity=False, copy=True)
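The UNION ALL comparison can be illustrated with a short sketch. The two frames are made-up data with one row in common.

```python
import pandas as pd

# Hypothetical tables; the row (2, "y") exists in both
a = pd.DataFrame({"id": [1, 2], "name": ["x", "y"]})
b = pd.DataFrame({"id": [2, 3], "name": ["y", "z"]})

# Like UNION ALL: rows are stacked and the duplicate is kept
union_all = pd.concat([a, b], ignore_index=True)

# Like UNION: drop the duplicate rows afterwards
union = union_all.drop_duplicates().reset_index(drop=True)

print(union_all)
print(union)
```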

Axis-wise concatenation with pd.concat() simply puts two tables together; the process is also called concatenation, binding or stacking. The key parameter of this function is therefore axis, which specifies the axis to concatenate along.

With the default axis=0, pd.concat([obj1,obj2]) has the same effect as obj1.append(obj2) (the append method was removed in pandas 2.0, so use concat there).

With axis=1, pd.concat([df1,df2],axis=1) has the same effect as pd.merge(df1,df2,left_index=True,right_index=True,how='outer'), apart possibly from the row order.

In other words, the concat function can be understood as using the index as the "join key".
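A minimal sketch of the axis=1 equivalence, using two hypothetical frames whose indexes are unique and only partly overlap:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
df2 = pd.DataFrame({"b": [10, 20]}, index=["y", "w"])

# Column-wise concatenation aligns rows on the index (outer join by default)
side_by_side = pd.concat([df1, df2], axis=1)

# The same alignment expressed as an outer merge on both indexes
merged = pd.merge(df1, df2, left_index=True, right_index=True, how="outer")

# The two results may order their rows differently, so compare after sorting
same = side_by_side.sort_index().equals(merged.sort_index())
print(side_by_side)
print(same)
```

The comparison assumes unique index values on both sides; with duplicated index values the two calls are not interchangeable.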
All parameters of this function are:

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False)

objs is the collection of objects to concatenate, usually a list or dictionary;

axis=0 is the axis to concatenate along;

join='outer' takes effect when the indexes on the other axis do not fully overlap; only 'inner' and 'outer' are available (the example below also shows ignore_index=True).

 

Some features of concat:

1. Applied to Series with axis=0, the result is similar to a union (the values are stacked). With axis=1 a DataFrame is formed: the index is the union of the indexes, and the columns are similar to the result of a join.

2. Specify a custom index with the parameter join_axes=[] (older pandas only; the parameter was removed in pandas 1.0, where reindex can be used on the result instead).

3. Create a hierarchical index with the parameter keys=[].

4. Rebuild the index with the parameter ignore_index=True.

 

 
    # assumes: import numpy as np; import pandas as pd; from pandas import DataFrame
    In [5]: df1=DataFrame(np.random.randn(3,4),columns=['a','b','c','d'])

    In [6]: df2=DataFrame(np.random.randn(2,3),columns=['b','d','a'])

    In [7]: pd.concat([df1,df2])
    Out[7]:
              a         b         c         d
    0 -0.848557 -1.163877 -0.306148 -1.163944
    1  1.358759  1.159369 -0.532110  2.183934
    2  0.532117  0.788350  0.703752 -2.620643
    0 -0.316156 -0.707832       NaN -0.416589
    1  0.406830  1.345932       NaN -1.874817

 
    In [8]: pd.concat([df1,df2],ignore_index=True)
    Out[8]:
              a         b         c         d
    0 -0.848557 -1.163877 -0.306148 -1.163944
    1  1.358759  1.159369 -0.532110  2.183934
    2  0.532117  0.788350  0.703752 -2.620643
    3 -0.316156 -0.707832       NaN -0.416589
    4  0.406830  1.345932       NaN -1.874817
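The remaining features in the list (Series input, keys, and join='inner') are not shown in the transcript, so the following is a rough sketch with made-up Series and frames. The reindex call is one possible stand-in for the removed join_axes parameter, not the original author's method.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"])

# axis=0: the Series are stacked end to end
stacked = pd.concat([s1, s2])

# axis=1: a DataFrame whose index is the union of both indexes
table = pd.concat([s1, s2], axis=1)

# keys= builds a hierarchical index (axis=0) or names the columns (axis=1)
keyed = pd.concat([s1, s2], keys=["one", "two"])
named_cols = pd.concat([s1, s2], axis=1, keys=["one", "two"])

# Selecting specific index labels afterwards, in place of join_axes
subset = table.reindex(["a", "c"])

# join='inner' keeps only the columns present in every frame
df1 = pd.DataFrame({"a": [1], "b": [2]})
df2 = pd.DataFrame({"b": [3], "c": [4]})
common = pd.concat([df1, df2], join="inner", ignore_index=True)

print(keyed)
print(common)
```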

 

Reprinted from: https://blog.csdn.net/zutsoft/article/details/51498026


Origin blog.csdn.net/xiezhen_zheng/article/details/82250657