pandas can combine multiple DataFrames in several ways:
- merge: combines rows whose values in a key column match
- join: combines multiple DataFrames side by side, which amounts to bringing their columns together (rows are matched by index)
- concat: stacks DataFrames that share column names row by row, or can be set to place their columns side by side
1. merge (merging on matching values in a key column)
how = join type (for example 'inner', 'left', 'right', 'outer')
on = name of the column(s) to merge on
Inner join (the default)
Only key values common to both DataFrames are kept
import pandas as pd
df1 = pd.DataFrame({'id':[1,2,3,4,5,6],'city':['wuhan','newyork','shanghai','paris','losangeles','london'],'country':['china','usa','china','france','usa','england'],'visits':[2,1,2,1,2,1]})
df2 = pd.DataFrame({'id':[1,2,3,4,5],'country':['china','france','usa','germany','japan'],'visits':[2,2,2,2,2]})
print(df1)
print(df2)
print('inner join')
df3 = pd.merge(df1,df2,on='country')
print(df3)
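To make the inner-join rule easier to see, here is a minimal sketch on two tiny tables. The names `left`, `right`, `inner` and `kept` are illustrative and not part of the original example.

```python
import pandas as pd

# Hypothetical miniature tables, used only for illustration.
left = pd.DataFrame({'country': ['china', 'usa', 'england'],
                     'city': ['wuhan', 'newyork', 'london']})
right = pd.DataFrame({'country': ['china', 'usa', 'japan'],
                      'visits': [2, 2, 2]})

# how='inner' is the default, so only keys present in both tables survive.
inner = pd.merge(left, right, on='country')
print(inner)

kept = sorted(inner['country'])
print(kept)
```

Here 'england' and 'japan' each appear in only one table, so neither shows up in the result.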
Left join
All key values in df1 are kept; where df2 has no matching key value, the result is filled with NaN
print('left join')
df3 = pd.merge(df1,df2,on='country',how='left')
print(df3)
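One way to see which rows were filled with NaN is the `indicator` parameter of `pd.merge`. The sketch below uses small made-up tables; `left`, `right`, `merged` and `unmatched` are illustrative names.

```python
import pandas as pd

# Hypothetical miniature tables, used only for illustration.
left = pd.DataFrame({'country': ['china', 'england'],
                     'city': ['wuhan', 'london']})
right = pd.DataFrame({'country': ['china', 'japan'],
                      'visits': [2, 2]})

# indicator=True adds a `_merge` column with 'both', 'left_only' or 'right_only'.
merged = pd.merge(left, right, on='country', how='left', indicator=True)
print(merged)

# Rows that exist only in the left table have NaN in the right-hand columns.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched)
```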
Right join
All key values in df2 are kept; where df1 has no matching key value, the result is filled with NaN
print('right join')
df3 = pd.merge(df1,df2,on='country',how='right')
print(df3)
Multiple columns as join keys
To merge on several columns, pass on=['column 1', 'column 2', ...]
print('multiple columns as join keys')
df4 = pd.merge(df1,df2,on=['country','visits'])
print(df4)
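With several keys, a row matches only when every key column agrees. A minimal sketch with made-up data (the names `left`, `right` and `both_keys` are illustrative):

```python
import pandas as pd

# Hypothetical miniature tables, used only for illustration.
left = pd.DataFrame({'country': ['china', 'usa', 'usa'],
                     'visits': [2, 1, 2],
                     'city': ['wuhan', 'newyork', 'losangeles']})
right = pd.DataFrame({'country': ['china', 'usa'],
                      'visits': [2, 2],
                      'id': [1, 3]})

# ('usa', 1) has no partner in `right`, so the newyork row is dropped.
both_keys = pd.merge(left, right, on=['country', 'visits'])
print(both_keys)
```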
Setting column-name suffixes after the merge
If df1 and df2 share other column names besides the key, you can set the suffixes added to the duplicated names:
suffixes=('left suffix', 'right suffix')
If not set, the default suffixes are _x and _y
print('set column-name suffixes after the merge')
df4 = pd.merge(df1,df2,on='country',suffixes=('_city','_country'))
print(df4)
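The difference between the default and custom suffixes can be checked by comparing the resulting column names. This is a small sketch with made-up one-row tables; `left`, `right`, `default_cols` and `custom_cols` are illustrative names.

```python
import pandas as pd

# Hypothetical one-row tables that share the non-key column 'visits'.
left = pd.DataFrame({'country': ['china'], 'visits': [2]})
right = pd.DataFrame({'country': ['china'], 'visits': [5]})

default_cols = list(pd.merge(left, right, on='country').columns)
custom_cols = list(pd.merge(left, right, on='country',
                            suffixes=('_city', '_country')).columns)
print(default_cols)
print(custom_cols)
```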
Merging tables with different key column names
The left and right tables do not necessarily use the same name for the key column, for example merging country in df1 with countryname in df2
print('merge tables with different key column names')
df22 = df2.copy()  # work on a copy so that df2 keeps its original column names
df22.columns=['id','countryname','visits']
df5 = pd.merge(df1,df22,left_on=['country'],right_on=['countryname'],suffixes=('_city','_country'))
print(df5)
Deleting the extra column
The redundant key column (countryname) can simply be dropped
print('drop column')
df5.drop(columns=['countryname'],inplace=True)
print(df5)
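The two steps above (merging on differently named keys, then dropping the redundant key) can be sketched as follows. The data and the names `left`, `right`, `merged` and `cleaned` are illustrative; this version returns a new DataFrame from `drop` rather than using `inplace=True`, which is a style choice and not a requirement.

```python
import pandas as pd

# Hypothetical miniature tables whose key columns have different names.
left = pd.DataFrame({'country': ['china', 'usa'],
                     'city': ['wuhan', 'newyork']})
right = pd.DataFrame({'countryname': ['china', 'usa'],
                      'visits': [2, 2]})

# With left_on/right_on both key columns are kept in the result.
merged = pd.merge(left, right, left_on='country', right_on='countryname')
print(list(merged.columns))

# Drop the redundant key column.
cleaned = merged.drop(columns=['countryname'])
print(list(cleaned.columns))
```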
2. join (merging two DataFrames with different column names)
Default: left join
If how is not set, the default is a left join: rows are matched by index, all rows of df1 are kept, and where df2 has no matching row the values are filled with NaN
df1 = pd.DataFrame({'id':[1,2,3,4,5,6],'city':['wuhan','newyork','shanghai','paris','losangeles','london'],'country':['china','usa','china','france','usa','england'],'visits':[2,1,2,1,2,1]})
df2 = pd.DataFrame({'idy':[1,2,3,4,6],'cityy':['wuhan','newyork','shanghai','paris','losangeles'],'countryy':['china','usa','china','france','usa']})
print(df1)
print(df2)
print('default how=left')
df3 = df1.join(df2)
print(df3)
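Because `join` matches rows by index label rather than by position, it may help to see it with non-default labels. A minimal sketch with made-up data (`left`, `right` and `joined` are illustrative names):

```python
import pandas as pd

# Hypothetical tables with string index labels instead of 0, 1, 2, ...
left = pd.DataFrame({'city': ['wuhan', 'paris']}, index=['a', 'b'])
right = pd.DataFrame({'countryy': ['france', 'japan']}, index=['b', 'c'])

# Default how='left': keep the labels of `left`, look up each one in `right`.
joined = left.join(right)
print(joined)
```

Label 'a' has no partner in `right`, so its countryy value is NaN; label 'c' exists only in `right` and is left out.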
Right join
Set how='right': all rows of df2 are kept, and where df1 has no matching row the values are filled with NaN
print('right join')
df3 = df1.join(df2,how='right')
print(df3)
Inner join
how='inner': only index labels present in both df1 and df2 are kept; with the default integer index this is the smaller of the two row counts
print('inner join')
df3 = df1.join(df2,how='inner')
print(df3)
Outer join
how='outer': every index label from df1 or df2 is kept; with the default integer index this is the larger of the two row counts
print('outer join')
df3 = df1.join(df2,how='outer')
print(df3)
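The four values of how can be compared side by side by counting result rows. This is a sketch on made-up frames with the default integer index; `left`, `right` and `sizes` are illustrative names.

```python
import pandas as pd

# Hypothetical frames: `left` has index 0..2, `right` has index 0..1.
left = pd.DataFrame({'id': [1, 2, 3]})
right = pd.DataFrame({'idy': [1, 2]})

# Number of result rows for each join type.
sizes = {how: len(left.join(right, how=how))
         for how in ['left', 'right', 'inner', 'outer']}
print(sizes)
```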
3. concat (merging DataFrames along a specified axis)
Column-wise concatenation
Set axis=1 to place all columns side by side; number of columns = columns of df1 + columns of df2
from pandas import concat
df1 = pd.DataFrame({'id':[1,2,3,4,5,6],'city':['wuhan','newyork','shanghai','paris','losangeles','london'],'country':['china','usa','china','france','usa','england'],'visits':[2,1,2,1,2,1]})
df2 = pd.DataFrame({'id':[1,2,3,4,6],'city':['wuhan','newyork','shanghai','paris','losangeles'],'country':['china','usa','china','france','usa'],'visits':[2,1,2,1,2]})
print(df1)
print(df2)
print('column-wise concatenation')
df3 = concat([df1,df2],join="inner",axis=1)
print(df3)
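With axis=1 the `join` argument decides which index labels are kept, and identical column names are simply repeated. A small sketch under made-up data (`a`, `b`, `wide_inner` and `wide_outer` are illustrative names):

```python
import pandas as pd

# Hypothetical frames with the same column names but different lengths.
a = pd.DataFrame({'id': [1, 2, 3], 'city': ['wuhan', 'newyork', 'shanghai']})
b = pd.DataFrame({'id': [1, 2], 'city': ['wuhan', 'newyork']})

# join='inner' keeps only index labels found in both frames (0 and 1).
wide_inner = pd.concat([a, b], axis=1, join='inner')
# The default join='outer' keeps every label and fills gaps with NaN.
wide_outer = pd.concat([a, b], axis=1)

print(wide_inner.shape, wide_outer.shape)
print(list(wide_inner.columns))
```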
Row-wise concatenation
If axis is not given, axis=0 is used: columns with the same name are aligned and the rows of df2 are stacked below those of df1, so number of rows = rows of df1 + rows of df2
print('row-wise concatenation')
df3 = concat([df1,df2])
print(df3)
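Row-wise concatenation keeps the original index labels, so labels can repeat. The sketch below, on made-up data, contrasts that with `ignore_index=True`; `a`, `b`, `stacked` and `renumbered` are illustrative names.

```python
import pandas as pd

# Hypothetical frames with identical column names.
a = pd.DataFrame({'id': [1, 2], 'city': ['wuhan', 'newyork']})
b = pd.DataFrame({'id': [3], 'city': ['paris']})

stacked = pd.concat([a, b])                        # labels are kept as they were
renumbered = pd.concat([a, b], ignore_index=True)  # labels are renumbered from 0

print(stacked.index.tolist())
print(renumbered.index.tolist())
```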
Row-wise concatenation with a specified row index
When concatenating rows, you can pass keys to label the rows coming from each DataFrame
print('row-wise concatenation with a specified row index')
df3 = concat([df1,df2],keys=['a','b'])
print(df3)
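Passing keys produces a two-level row index, which makes it possible to pull the rows of one input back out. A minimal sketch with made-up frames (`a`, `b`, `labeled` and `from_b` are illustrative names):

```python
import pandas as pd

# Hypothetical frames, used only for illustration.
a = pd.DataFrame({'id': [1, 2]})
b = pd.DataFrame({'id': [3]})

# Each key becomes the outer level of the row index.
labeled = pd.concat([a, b], keys=['a', 'b'])
print(labeled.index.tolist())

# Select every row that came from the frame labeled 'b'.
from_b = labeled.loc['b']
print(from_b)
```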
Deduplication
Row-wise concatenation can leave identical rows in the result; drop_duplicates deletes the duplicate rows
print("deduplicate")
df3 = concat([df1,df2],ignore_index=True).drop_duplicates()
print(df3)
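`drop_duplicates` compares column values and keeps the first occurrence, so the surviving rows retain their earlier index labels. A small sketch on made-up data (`a`, `b`, `combined` and `deduped` are illustrative names):

```python
import pandas as pd

# Hypothetical frames that share one identical row (id 2, newyork).
a = pd.DataFrame({'id': [1, 2], 'city': ['wuhan', 'newyork']})
b = pd.DataFrame({'id': [2, 3], 'city': ['newyork', 'paris']})

combined = pd.concat([a, b], ignore_index=True)
deduped = combined.drop_duplicates()

print(deduped)
print(deduped.index.tolist())
```

If consecutive labels are wanted afterwards, `deduped.reset_index(drop=True)` renumbers the rows.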