pandas combines data mainly through the concat, merge, and join functions.
- concat stitches rows or columns together based on the index, keeping either the intersection or the union of the labels on the other axis.
- merge combines data primarily on common columns or on the index, and supports inner, left, right, and outer joins.
- join works much like merge and is omitted here.
import pandas as pd
from pandas import Series, DataFrame
# Define a helper that sets each element's value from its column name and row label
def make_df(cols, inds):
    data = {c: [c + str(i) for i in inds] for c in cols}
    return DataFrame(data, index=inds)
df1 = make_df(list("abc"),[1,2,4])
df1
|   | a  | b  | c  |
|---|----|----|----|
| 1 | a1 | b1 | c1 |
| 2 | a2 | b2 | c2 |
| 4 | a4 | b4 | c4 |
df2 = make_df(list("abcd"),[2,4,6])
df2
|   | a  | b  | c  | d  |
|---|----|----|----|----|
| 2 | a2 | b2 | c2 | d2 |
| 4 | a4 | b4 | c4 | d4 |
| 6 | a6 | b6 | c6 | d6 |
df11=df1.set_index('a')
df22=df2.set_index('a')
1. concat function
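- axis: defaults to 0, which stacks the inputs row-wise; 1 places them side by side, column-wise.
- ignore_index: defaults to False, which keeps the original index labels in the result; True discards them and builds a new 0..n-1 index along the concatenation axis.
- join: how to handle the labels on the other axis, either 'inner' (intersection) or 'outer' (union, the default).
- sort: True sorts the labels of the non-concatenation axis when the inputs are not already aligned.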
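The sort parameter is the only one in the list above that the examples in this section do not exercise directly, so here is a minimal sketch of its effect. The frames `left` and `right` and their column order are made up for illustration; they are not part of the original walkthrough.

```python
import pandas as pd

# Hypothetical frames whose columns are deliberately out of alphabetical order
left = pd.DataFrame({"b": ["b1"], "a": ["a1"]}, index=[1])
right = pd.DataFrame({"a": ["a2"], "c": ["c2"]}, index=[2])

# sort=False keeps the column labels in order of first appearance
unsorted_cols = pd.concat([left, right], sort=False)

# sort=True sorts the labels of the non-concatenation axis (here, the columns)
sorted_cols = pd.concat([left, right], sort=True)

print(list(unsorted_cols.columns))
print(list(sorted_cols.columns))
```

Only the column order differs between the two results; the rows and the NaN padding are the same.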
(1) Simple concatenation along rows or columns
# Concatenate along rows
pd.concat([df1,df2],sort=False)
|   | a  | b  | c  | d   |
|---|----|----|----|-----|
| 1 | a1 | b1 | c1 | NaN |
| 2 | a2 | b2 | c2 | NaN |
| 4 | a4 | b4 | c4 | NaN |
| 2 | a2 | b2 | c2 | d2  |
| 4 | a4 | b4 | c4 | d4  |
| 6 | a6 | b6 | c6 | d6  |
# Concatenate along columns
pd.concat([df1,df2],axis=1)
|   | a   | b   | c   | a   | b   | c   | d   |
|---|-----|-----|-----|-----|-----|-----|-----|
| 1 | a1  | b1  | c1  | NaN | NaN | NaN | NaN |
| 2 | a2  | b2  | c2  | a2  | b2  | c2  | d2  |
| 4 | a4  | b4  | c4  | a4  | b4  | c4  | d4  |
| 6 | NaN | NaN | NaN | a6  | b6  | c6  | d6  |
(2) Concatenation that discards the original index
# Concatenate along rows, dropping the original row index and renumbering
pd.concat([df1,df2],sort=False,ignore_index=True)
|   | a  | b  | c  | d   |
|---|----|----|----|-----|
| 0 | a1 | b1 | c1 | NaN |
| 1 | a2 | b2 | c2 | NaN |
| 2 | a4 | b4 | c4 | NaN |
| 3 | a2 | b2 | c2 | d2  |
| 4 | a4 | b4 | c4 | d4  |
| 5 | a6 | b6 | c6 | d6  |
# Concatenate along columns, dropping the original column labels and renumbering
pd.concat([df1,df2],axis=1,ignore_index=True)
|   | 0   | 1   | 2   | 3   | 4   | 5   | 6   |
|---|-----|-----|-----|-----|-----|-----|-----|
| 1 | a1  | b1  | c1  | NaN | NaN | NaN | NaN |
| 2 | a2  | b2  | c2  | a2  | b2  | c2  | d2  |
| 4 | a4  | b4  | c4  | a4  | b4  | c4  | d4  |
| 6 | NaN | NaN | NaN | a6  | b6  | c6  | d6  |
(3) Concatenation with a specified join mode
- For concat, join accepts only 'inner' and 'outer'; left and right joins are available through merge.
# Intersection of columns, inner join
pd.concat([df1,df2],sort=False,join='inner')
|   | a  | b  | c  |
|---|----|----|----|
| 1 | a1 | b1 | c1 |
| 2 | a2 | b2 | c2 |
| 4 | a4 | b4 | c4 |
| 2 | a2 | b2 | c2 |
| 4 | a4 | b4 | c4 |
| 6 | a6 | b6 | c6 |
# Union of columns, outer join
pd.concat([df1,df2],sort=False,join='outer')
|   | a  | b  | c  | d   |
|---|----|----|----|-----|
| 1 | a1 | b1 | c1 | NaN |
| 2 | a2 | b2 | c2 | NaN |
| 4 | a4 | b4 | c4 | NaN |
| 2 | a2 | b2 | c2 | d2  |
| 4 | a4 | b4 | c4 | d4  |
| 6 | a6 | b6 | c6 | d6  |
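The two examples above apply join while concatenating along rows, where it selects columns. As a complementary sketch, the same parameter with axis=1 selects row labels instead. The frames below are cut-down stand-ins for df1 and df2 (one column each), chosen only to keep the output small.

```python
import pandas as pd

# Reduced stand-ins for df1 and df2: same row labels, one column each
df1 = pd.DataFrame({"a": ["a1", "a2", "a4"]}, index=[1, 2, 4])
df2 = pd.DataFrame({"d": ["d2", "d4", "d6"]}, index=[2, 4, 6])

# Column-wise concat: join now decides which row labels survive
inner_rows = pd.concat([df1, df2], axis=1, join="inner")  # labels found in both
outer_rows = pd.concat([df1, df2], axis=1, join="outer")  # labels found in either

print(inner_rows)
print(outer_rows)
```

With join='inner' only the row labels shared by both frames remain, so no NaN padding is needed; with join='outer' every label is kept and the gaps are filled with NaN.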
2. merge function
- how: the join type. left: keep every key from the left DataFrame; right: keep every key from the right DataFrame; outer: keep the union of keys from both; inner: keep only the intersection of keys. The default is 'inner'.
- on: the column name or names to merge on; they must exist in both DataFrames.
- left_on / right_on: the key column names in the left / right DataFrame, for use when the names differ.
- left_index / right_index: whether to use the index of the left / right DataFrame as the merge key (True means yes). Each can be paired with right_on / left_on on the other side.
- sort: whether to sort the result by the merge keys; the default is False in current pandas.
- suffixes: when both DataFrames contain a same-named column that is not a merge key, these suffixes are appended to tell the two apart; usually a tuple or list of two strings, with default ('_x', '_y').
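To make the four values of how concrete, the following sketch merges two small frames on a shared key column and collects the keys that survive each join type. The frames `left` and `right` and the dictionary `keys_by_how` are illustrative names, not part of the original walkthrough.

```python
import pandas as pd

# Illustrative frames that share the key column "a"
left = pd.DataFrame({"a": ["a1", "a2", "a4"], "b": ["b1", "b2", "b4"]})
right = pd.DataFrame({"a": ["a2", "a4", "a6"], "d": ["d2", "d4", "d6"]})

# Collect the surviving key values for each join type
keys_by_how = {
    how: pd.merge(left, right, how=how, on="a")["a"].tolist()
    for how in ("inner", "left", "right", "outer")
}

for how, keys in keys_by_how.items():
    print(how, keys)
```

The inner join keeps only the keys present in both frames, left and right keep every key of the corresponding frame, and outer keeps all keys from either side.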
(1) Merging on columns with the same name
df3 = pd.merge(df1,df2,how='inner',on='a') # merge on a single column
df4 = pd.merge(df1,df2,how='inner',on=['a','b']) # merge on multiple columns
df5 = pd.merge(df1,df2,how='left',on='a',suffixes=['_1','_2']) # left join with explicit suffixes
df5
|   | a  | b_1 | c_1 | b_2 | c_2 | d   |
|---|----|-----|-----|-----|-----|-----|
| 0 | a1 | b1  | c1  | NaN | NaN | NaN |
| 1 | a2 | b2  | c2  | b2  | c2  | d2  |
| 2 | a4 | b4  | c4  | b4  | c4  | d4  |
(2) Merging on differently named columns, on a column and an index, or on two indexes
df6 = pd.merge(df1,df2,how='inner',left_on='a',right_on='b') # keys are columns with different names
df7 = pd.merge(df1,df22,how='inner',left_on='a',right_index=True) # left key is a column, right key is the index
df8 = pd.merge(df1,df2,how='inner',left_index=True,right_index=True) # both keys are indexes
df8
|   | a_x | b_x | c_x | a_y | b_y | c_y | d  |
|---|-----|-----|-----|-----|-----|-----|----|
| 2 | a2  | b2  | c2  | a2  | b2  | c2  | d2 |
| 4 | a4  | b4  | c4  | a4  | b4  | c4  | d4 |
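The join method was set aside at the start as being similar to merge. As a rough sketch of that similarity, an index-on-index merge like df8 can also be written with DataFrame.join. The frames below are reduced versions of df1 and df2 (two columns each), used here only for illustration.

```python
import pandas as pd

# Reduced versions of df1 and df2 that share the column name "a"
df1 = pd.DataFrame({"a": ["a1", "a2", "a4"], "b": ["b1", "b2", "b4"]}, index=[1, 2, 4])
df2 = pd.DataFrame({"a": ["a2", "a4", "a6"], "d": ["d2", "d4", "d6"]}, index=[2, 4, 6])

# Index-on-index inner merge, as in df8
merged = pd.merge(df1, df2, how="inner", left_index=True, right_index=True)

# The same operation with join; join defaults to how="left" and needs
# explicit suffixes when column names overlap
joined = df1.join(df2, how="inner", lsuffix="_x", rsuffix="_y")

print(merged)
print(joined)
```

Note the two differences in defaults: join uses how='left' unless told otherwise, and it has no default suffixes, so overlapping column names must be disambiguated with lsuffix and rsuffix.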