Basic Python tutorial: merging and joining DataFrame data in Pandas (concat, merge, join), with examples

Source: public account csdn2299.
This article shares examples of merging and joining DataFrame data in Pandas with concat, merge, and join, which should serve as a useful reference.
I recently ran into problems with merging and joining data at work, so I have organized my notes below for anyone who needs them.

1. Concat: Stack multiple objects together along an axis

The concat method is comparable to UNION ALL in a database: it stacks the objects together and, like UNION ALL, does not remove duplicate rows. You can specify both the join method for the other axis (outer or inner) and the axis to concatenate along. To get the deduplicating behavior of a plain UNION, call the drop_duplicates method on the result.
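The deduplication step mentioned above can be sketched as follows. This is a minimal illustration; the frames df_a and df_b and their contents are made up for the example.

```python
import pandas as pd

# Hypothetical sample data: the row (2, "bob") appears in both frames.
df_a = pd.DataFrame({"id": [1, 2], "name": ["ann", "bob"]})
df_b = pd.DataFrame({"id": [2, 3], "name": ["bob", "cat"]})

# Like UNION ALL: every row is kept, including the repeated one.
stacked = pd.concat([df_a, df_b], ignore_index=True)

# Like UNION: drop fully identical rows, then renumber the index.
deduped = stacked.drop_duplicates().reset_index(drop=True)

print(len(stacked))  # 4
print(len(deduped))  # 3
```

By default drop_duplicates compares all columns; its subset parameter restricts the comparison to selected columns.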

concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
       keys=None, levels=None, names=None, verify_integrity=False, copy=True)

pd.concat() simply splices the objects together. The key parameter is axis, which specifies whether to stack along rows (axis=0, the default) or along columns (axis=1).

When axis=0, pd.concat([obj1, obj2]) gives the same result as obj1.append(obj2). Note that DataFrame.append was removed in pandas 2.0, so use pd.concat in current versions. When axis=1, pd.concat([obj1, obj2], axis=1) gives the same result as pd.merge(obj1, obj2, left_index=True, right_index=True, how='outer'), apart from row order and assuming the two objects have no column names in common.

The merge method is introduced in the next section.
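As a rough check of the axis=1 equivalence described above, the sketch below compares the two calls on a pair of small frames. The data is invented for illustration, and both results are sorted by index before comparing because their row order is not guaranteed to match.

```python
import pandas as pd

# Hypothetical frames with partially overlapping row labels.
obj1 = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
obj2 = pd.DataFrame({"y": [10, 20]}, index=["b", "d"])

# Place the frames side by side, aligning rows on the index (outer join by default).
side_by_side = pd.concat([obj1, obj2], axis=1)

# The same alignment expressed as an index-on-index outer merge.
merged = pd.merge(obj1, obj2, left_index=True, right_index=True, how="outer")

# Compare after sorting by index, since row order may differ between the two.
same = side_by_side.sort_index().equals(merged.sort_index())
print(side_by_side)
print(same)
```

Labels that exist in only one frame ("a", "c", and "d" here) get NaN in the columns that come from the other frame.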

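Parameters:

objs: the collection of objects to concatenate, usually a list or a dictionary;

axis: the axis to concatenate along (0 for rows, 1 for columns);

join: how to handle the other axis, either 'outer' (union, the default) or 'inner' (intersection);

join_axes=[]: specify a custom index for the other axis (this parameter was removed in pandas 1.0);

keys=[]: create a hierarchical index whose outermost level labels each input object;

ignore_index=True: discard the original index and rebuild it as 0, 1, ..., n-1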

Examples:

import numpy as np
import pandas as pd
from pandas import DataFrame

df1 = DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])

# Columns are aligned by name; df2 has no column 'c', so it is filled with NaN
pd.concat([df1, df2])

          a         b         c         d
0 -0.848557 -1.163877 -0.306148 -1.163944
1  1.358759  1.159369 -0.532110  2.183934
2  0.532117  0.788350  0.703752 -2.620643
0 -0.316156 -0.707832       NaN -0.416589
1  0.406830  1.345932       NaN -1.874817

# ignore_index=True renumbers the rows instead of keeping the original labels
pd.concat([df1, df2], ignore_index=True)

          a         b         c         d
0 -0.848557 -1.163877 -0.306148 -1.163944
1  1.358759  1.159369 -0.532110  2.183934
2  0.532117  0.788350  0.703752 -2.620643
3 -0.316156 -0.707832       NaN -0.416589
4  0.406830  1.345932       NaN -1.874817
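The join and keys parameters listed earlier are not covered by the examples above, so here is a minimal sketch of both. It uses fixed integer data instead of random numbers (an assumption made only so the result is reproducible), with the same column layout as df1 and df2.

```python
import numpy as np
import pandas as pd

# Fixed data with the same column layout as the example above.
df1 = pd.DataFrame(np.arange(12).reshape(3, 4), columns=["a", "b", "c", "d"])
df2 = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["b", "d", "a"])

# join='inner' keeps only the columns present in every input (a, b, d here),
# so no NaN-filled column 'c' appears.
inner = pd.concat([df1, df2], join="inner", ignore_index=True)
print(inner)

# keys= labels each input, producing a two-level (hierarchical) row index.
labeled = pd.concat([df1, df2], keys=["first", "second"])
print(labeled)

# The outer level can then be used to pull one input back out.
print(labeled.loc["second"])
```

The labels "first" and "second" are arbitrary; any hashable values can be used as keys.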

2. merge: join on key columns

Similar to joins in a relational database, merge connects different DataFrames on one or more keys. A typical use case is two tables that share the same primary key but hold different fields; merge combines them into a single table keyed on that primary key.

merge(left, right, how='inner', on=None, left_on=None, right_on=None,
      left_index=False, right_index=False, sort=False,
      suffixes=('_x', '_y'), copy=True, indicator=False)

Parameters:

left and right: the two DataFrames to merge;

how: the join type, one of 'inner', 'left', 'right', 'outer'; the default is 'inner';

on: the column name(s) to join on, which must exist in both the left and right DataFrame. If on is not given and no other key parameter is specified, the intersection of the two DataFrames' column names is used as the join key;

left_on: the column(s) in the left DataFrame to use as the join key. This is useful when the left and right column names differ but mean the same thing;

right_on: the column(s) in the right DataFrame to use as the join key;

left_index: use the row index of the left DataFrame as the join key;

right_index: use the row index of the right DataFrame as the join key;

sort: sort the merged data by the join keys. The default is False in current pandas, and leaving it off is faster;

suffixes: a tuple of strings giving the suffixes appended to column names that exist in both the left and right DataFrame; the default is ('_x', '_y');

copy: True by default, which always copies the data into the new structure; setting it to False may improve performance;

indicator: when True, add a _merge column to the result showing whether each row came from the left DataFrame only, the right DataFrame only, or both

Examples:

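# 1. By default, the overlapping column names are used as the join key.
df1 = DataFrame({'key': ['a', 'b', 'b'], 'data1': range(3)})
df2 = DataFrame({'key': ['a', 'b', 'c'], 'data2': range(3)})
pd.merge(df1, df2)  # no join key or join type given: joins on the shared column 'key'

  key  data1  data2
0   a      0      0
1   b      1      1
2   b      2      1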
  
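# 2. The default is an inner join (the intersection of the keys). The other join types are left, right, and outer, selected with the how parameter.
pd.merge(df2, df1)

  key  data2  data1
0   a      0      0
1   b      1      1
2   b      1      2          # inner join by default: key 'c' has no match and is dropped

pd.merge(df2, df1, how='left')  # choose the join type with how

  key  data2  data1
0   a      0    0.0
1   b      1    1.0
2   b      1    2.0
3   c      2    NaN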
  
# 3. For a multi-key join, pass the keys as a list, e.g. pd.merge(df1, df2, on=['key1', 'key2'])
left = DataFrame({'key1': ['foo', 'foo', 'bar'],
                  'key2': ['one', 'two', 'one'],
                  'lval': [1, 2, 3]})
right = DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                   'key2': ['one', 'one', 'one', 'two'],
                   'lval': [4, 5, 6, 7]})
pd.merge(left, right, on=['key1', 'key2'], how='outer')  # keys passed as a list

# Both frames have an 'lval' column, so the default suffixes _x and _y are appended.
# The row order of an outer join can differ between pandas versions.
  key1 key2  lval_x  lval_y
0  foo  one     1.0     4.0
1  foo  one     1.0     5.0
2  foo  two     2.0     NaN
3  bar  one     3.0     6.0
4  bar  two     NaN     7.0
  
# 4. If the key columns have different names in the two objects, specify them separately, e.g. pd.merge(df1, df2, left_on='lkey', right_on='rkey')
df3 = DataFrame({'key3': ['foo', 'foo', 'bar', 'bar'],  # same data as right above, with the key columns renamed
                 'key4': ['one', 'one', 'one', 'two'],
                 'lval': [4, 5, 6, 7]})
pd.merge(left, df3, left_on='key1', right_on='key3')  # join on differently named keys

  key1 key2  lval_x key3 key4  lval_y
0  foo  one       1  foo  one       4
1  foo  one       1  foo  one       5
2  foo  two       2  foo  one       4
3  foo  two       2  foo  one       5
4  bar  one       3  bar  one       6
5  bar  one       3  bar  two       7
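The suffixes and indicator parameters from the parameter list are not used in the examples above. The following sketch shows both on a pair of small frames; the data and the suffix names "_left" and "_right" are chosen purely for illustration.

```python
import pandas as pd

# Hypothetical frames that share the key column and also a 'val' column.
left = pd.DataFrame({"key": ["a", "b"], "val": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "val": [3, 4]})

result = pd.merge(
    left,
    right,
    on="key",
    how="outer",
    suffixes=("_left", "_right"),  # rename the clashing 'val' columns
    indicator=True,                # add a '_merge' column with each row's origin
)
print(result)

# Map each key to the origin reported in the '_merge' column.
origin = dict(zip(result["key"], result["_merge"].astype(str)))
print(origin)
```

The _merge column takes the values left_only, right_only, and both, which makes it easy to find rows that failed to match.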

3. join: mainly used for merging on the index

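join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

Its parameters have essentially the same meaning as those of the merge method. The main differences are that join aligns on the row index by default, that how defaults to 'left', and that overlapping column names are renamed with lsuffix and rsuffix instead of suffixes.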
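Since this section has no example, here is a minimal sketch of an index-based join alongside the corresponding merge call. The frames and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical frames that share some row labels.
left = pd.DataFrame({"A": [1, 2, 3]}, index=["x", "y", "z"])
right = pd.DataFrame({"B": [10, 30]}, index=["x", "z"])

# Default how='left': keep every row of `left`, align `right` by index label.
joined = left.join(right)

# The same operation written with merge.
merged = pd.merge(left, right, left_index=True, right_index=True, how="left")

print(joined)
print(joined.equals(merged))
```

Row "y" has no counterpart in right, so its B value is NaN. If the two frames shared a column name, lsuffix or rsuffix would have to be set to avoid an error.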

Origin blog.csdn.net/chengxun03/article/details/105521957