To make maintenance easier, a typical company stores its data in separate tables in the database. For example, one table stores the basic information of all users and another stores their consumption records. As a result, daily data processing often requires joining two tables together. This operation corresponds to join in SQL, and in Pandas it is done with merge. This article explains the main principles of merge.
As mentioned in the introduction, merge is used to join two tables. Because user information has to be matched row by row, the two tables being joined need a common key that identifies each user. In short, the whole merge process is a process of matching information one by one. merge has four types, introduced below: 'inner', 'left', 'right' and 'outer'.
1. inner
The 'inner' type of merge is called an inner join. During joining it takes the intersection of the keys of the two tables. What does that mean? The following breaks it down step by step with diagrams.
First, we have the following data. The left and right tables hold the users' basic information and consumption information respectively. The key connecting the two tables is userid.
Now merge them the 'inner' way:
In [6]: df_1.merge(df_2,how='inner',on='userid')
Out[6]:
  userid  age  payment
0      a   23     2000
1      c   32     3500
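The input tables appear only as figures in the original, so the following is a minimal sketch that assumes demo data consistent with the output above. The contents of df_1 and df_2 here are an assumption; only the column names and the merge call come from the article.

```python
import pandas as pd

# Assumed demo data: the article shows the tables only as images.
# df_1 holds basic user information, df_2 holds consumption records.
df_1 = pd.DataFrame({"userid": ["a", "b", "c", "d"], "age": [23, 46, 32, 19]})
df_2 = pd.DataFrame({"userid": ["a", "c", "e"], "payment": [2000, 3500, 600]})

# Inner join: only userids present in both tables are kept.
inner = df_1.merge(df_2, how="inner", on="userid")
print(inner)
```

With this assumed data, only users a and c appear in both tables, so only those two rows should survive the merge.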
Process Diagram:
① Take the intersection of the keys of the two tables. Here the intersection of userid in df_1 and df_2 is {a,c}
② Match the corresponding rows
③ Result
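The three steps above can be sketched by hand, without calling merge, to make the matching explicit. This is only an illustration under the assumption that each userid appears at most once per table; the data and the helper names (common_keys, payment_by_user, manual) are hypothetical and not from the original article.

```python
import pandas as pd

# Assumed demo data, chosen to be consistent with the article's output.
df_1 = pd.DataFrame({"userid": ["a", "b", "c", "d"], "age": [23, 46, 32, 19]})
df_2 = pd.DataFrame({"userid": ["a", "c", "e"], "payment": [2000, 3500, 600]})

# Step 1: take the intersection of the keys of the two tables.
common_keys = set(df_1["userid"]) & set(df_2["userid"])

# Step 2: keep the left rows whose key is in the intersection and look up
# the matching payment for each of them (valid because keys are unique here).
payment_by_user = df_2.set_index("userid")["payment"]
matched = df_1[df_1["userid"].isin(common_keys)]
manual = matched.assign(payment=matched["userid"].map(payment_by_user))

# Step 3: the result, with a fresh 0..n-1 index like merge produces.
manual = manual.reset_index(drop=True)
print(sorted(common_keys))
print(manual)
```

The hand-built table should contain the same rows as df_1.merge(df_2, how='inner', on='userid'); in practice merge is the right tool, since it also handles duplicated keys.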
Process summary:
The whole process should not be hard to follow. The demonstration above covers the case where each key corresponds to exactly one row in each table (one user has one consumption record). So how are the tables joined if one user has multiple consumption records?
Suppose the data now looks like this: in df_2 there are two records corresponding to a:
Merge in the same 'inner' way:
In [12]: df_1.merge(df_2,how='inner',on='userid')
Out[12]:
  userid  age  payment
0      a   23     2000
1      a   23      500
2      b   46     1000
3      c   32     3500
Apart from the matching stage, the whole process is basically the same as above.
Process Diagram:
① Take the intersection of the keys of the two tables. Here the intersection of userid in df_1 and df_2 is {a,b,c}
② When matching, since a has two corresponding consumption records, the row for a in the basic user information table is duplicated once so that it can be matched with both rows on the right.
③ Result
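The one-to-many case can be reproduced with a short sketch. The table contents are an assumption reconstructed from the output shown earlier; validate is an optional merge parameter, used here only to state the expected key relationship explicitly.

```python
import pandas as pd

# Assumed data: user a has two consumption records in df_2.
df_1 = pd.DataFrame({"userid": ["a", "b", "c"], "age": [23, 46, 32]})
df_2 = pd.DataFrame(
    {"userid": ["a", "a", "b", "c"], "payment": [2000, 500, 1000, 3500]}
)

# validate="one_to_many" makes pandas raise an error if userid is
# duplicated in the left table, i.e. if the join is not one-to-many.
result = df_1.merge(df_2, how="inner", on="userid", validate="one_to_many")
print(result)

# Each left row is repeated once per matching right row.
rows_per_key = result["userid"].value_counts().to_dict()
print(rows_per_key)
```

The row for a from df_1 appears twice in the result, once for each of its payments, which is the "copied by one more row" behaviour described in step ②.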
2. left and right
The 'left' and 'right' types of merge are actually similar; they are called left join and right join respectively. The two can be converted into each other, so they are introduced together here.
With a 'left' merge, pairing is based on the keys in the left table. If a key in the left table does not exist in the right table, the missing value NaN is filled in.
With a 'right' merge, pairing is based on the keys in the right table. If a key in the right table does not exist in the left table, the missing value NaN is filled in.
What does that mean? Here is a concrete example, using this demo data:
Now merge them the 'left' way:
In [21]: df_1.merge(df_2,how='left',on='userid')
Out[21]:
  userid  age  payment
0      a   23   2000.0
1      b   46      NaN
2      c   32   3500.0
3      d   19      NaN
Process Diagram:
① Pair according to all the keys in the left table. In the figure, e in the right table is not in the left table, so it is not paired.
② Merge the payment column of the right table into the left table, filling in the missing value NaN for keys that have no matching value.
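The two steps above can be checked with a small sketch. The data is assumed, reconstructed from the outputs in this article. The indicator parameter of merge adds a _merge column that records where each row's key was found.

```python
import pandas as pd

# Assumed demo data, consistent with the outputs shown in the article.
df_1 = pd.DataFrame({"userid": ["a", "b", "c", "d"], "age": [23, 46, 32, 19]})
df_2 = pd.DataFrame({"userid": ["a", "c", "e"], "payment": [2000, 3500, 600]})

# Left join: every key of df_1 is kept; e (only in df_2) is dropped.
# indicator=True adds a "_merge" column: "both" or "left_only" here.
left = df_1.merge(df_2, how="left", on="userid", indicator=True)
print(left)

# Keys without a match on the right get NaN in the payment column.
unmatched = left.loc[left["payment"].isna(), "userid"].tolist()
print(unmatched)
```

Because NaN is a floating-point value, the payment column is converted to float, which is why the article's output shows 2000.0 rather than 2000.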
Process summary:
The 'right' type of merge works almost the same way as 'left': as long as the positions of the two tables are swapped, the two methods return the same result (apart from the order of rows and columns), as follows:
In [22]: df_2.merge(df_1,how='right',on='userid')
Out[22]:
  userid  payment  age
0      a   2000.0   23
1      c   3500.0   32
2      b      NaN   46
3      d      NaN   19
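The equivalence of the two calls can be checked with a short sketch. The data is assumed, and since row and column order may differ between the two results (as in the output above), both are put into the same order before comparing.

```python
import pandas as pd

# Assumed demo data, consistent with the outputs shown in the article.
df_1 = pd.DataFrame({"userid": ["a", "b", "c", "d"], "age": [23, 46, 32, 19]})
df_2 = pd.DataFrame({"userid": ["a", "c", "e"], "payment": [2000, 3500, 600]})

left = df_1.merge(df_2, how="left", on="userid")
right = df_2.merge(df_1, how="right", on="userid")

# Put both results into the same column and row order before comparing.
left_aligned = left.sort_values("userid").reset_index(drop=True)
right_aligned = right[left.columns].sort_values("userid").reset_index(drop=True)

# DataFrame.equals treats NaN values in the same position as equal.
same = left_aligned.equals(right_aligned)
print(same)
```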
As for the case where the join key is one-to-many in 'left' and 'right' (and even in 'outer', introduced below), the principle is similar to the 'inner' case above, so I won't repeat it here.
3. outer
'outer' is an outer join. During joining it takes the union of the keys of the two tables. Text alone is not intuitive enough, so let's look at an example!
We again use the demo data from above.
This time merge them the 'outer' way:
In [24]: df_1.merge(df_2,how='outer',on='userid')
Out[24]:
  userid   age  payment
0      a  23.0   2000.0
1      b  46.0      NaN
2      c  32.0   3500.0
3      d  19.0      NaN
4      e   NaN    600.0
The diagram is as follows:
① Take the union of the keys of the two tables, here {a,b,c,d,e}
② Put the data columns of the two tables together, filling in the missing value NaN wherever there is no match
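The union behaviour can be verified with a small sketch. As before, the data is assumed, reconstructed from the outputs in this article. The indicator parameter of merge marks whether each key came from the left table, the right table, or both.

```python
import pandas as pd

# Assumed demo data, consistent with the outputs shown in the article.
df_1 = pd.DataFrame({"userid": ["a", "b", "c", "d"], "age": [23, 46, 32, 19]})
df_2 = pd.DataFrame({"userid": ["a", "c", "e"], "payment": [2000, 3500, 600]})

# Outer join: keys from both tables are kept.
outer = df_1.merge(df_2, how="outer", on="userid", indicator=True)
print(outer)

# The keys of the result are the union of the keys of both tables.
union_keys = set(df_1["userid"]) | set(df_2["userid"])
print(set(outer["userid"]) == union_keys)

# Where each key was found: "both", "left_only" or "right_only".
origin = dict(zip(outer["userid"], outer["_merge"].astype(str)))
print(origin)
```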
Readers who have made it this far should now have a basic understanding of the whole merge process. In summary, the difference between the merge types lies in which set of keys from the two tables is used when joining. This concludes the introduction to Pandas merge!
Reposted from https://zhuanlan.zhihu.com/p/102274476