Pandas Tutorial: Merging Data with the merge Function, Illustrated Step by Step

For ease of maintenance, a company's data is generally stored in separate tables in the database. For example, one table stores the basic information of all users, and another stores the users' consumption records. As a result, day-to-day data processing often requires joining two tables together. This operation corresponds to a join in SQL, and in Pandas it is done with merge. This article covers the main principles of merge.

As mentioned in the introduction, merge is used to join two tables. Since each user's information must be matched up one by one, the two tables need a common key that identifies the user. In short, the whole merge process is one of matching records one by one on that key. The four merge types introduced below are 'inner', 'left', 'right' and 'outer'.

1. inner

The 'inner' type of merge is called an inner join. It takes the intersection of the keys of the two tables when joining. What does that mean? The following breaks it down step by step with diagrams.

First, we have the following data. The left and right tables hold the users' basic information and consumption information respectively. The key connecting the two tables is userid.
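The two tables were originally shown as an image. As a minimal sketch, they can be rebuilt as follows, assuming the values below, which are inferred from the merge outputs printed in this article rather than taken from the original figure:

```python
import pandas as pd

# Assumed reconstruction of the demo tables (values inferred from the outputs
# in this article; the original figure is not available).

# Left table: basic user information
df_1 = pd.DataFrame({"userid": ["a", "b", "c", "d"],
                     "age": [23, 46, 32, 19]})

# Right table: consumption records
df_2 = pd.DataFrame({"userid": ["a", "c", "e"],
                     "payment": [2000, 3500, 600]})

print(df_1)
print(df_2)
```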

Now merge using the 'inner' method:

In [6]: df_1.merge(df_2,how='inner',on='userid')
Out[6]:
  userid  age  payment
0      a   23     2000
1      c   32     3500

Process Diagram:

① Take the intersection of the keys of the two tables. Here the intersection of userid in df_1 and df_2 is {a, c}.

② Match the rows that share the same key

③ Result
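The three steps above can be traced by hand in code. This is only a sketch of the idea, not how pandas implements merge internally, and the demo tables are assumed values inferred from the outputs in this article:

```python
import pandas as pd

# Assumed demo data, inferred from the outputs shown in this article
df_1 = pd.DataFrame({"userid": ["a", "b", "c", "d"], "age": [23, 46, 32, 19]})
df_2 = pd.DataFrame({"userid": ["a", "c", "e"], "payment": [2000, 3500, 600]})

# Step 1: intersection of the keys of the two tables
common = set(df_1["userid"]) & set(df_2["userid"])
print(sorted(common))  # ['a', 'c']

# Step 2: the rows that take part in the matching
left_part = df_1[df_1["userid"].isin(common)]
right_part = df_2[df_2["userid"].isin(common)]

# Step 3: the result of the inner join contains exactly those keys
result = df_1.merge(df_2, how="inner", on="userid")
print(result)
```

Keys b and d (left only) and e (right only) drop out because they are not in the intersection.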

Process summary:

I believe the whole process is not hard to follow. The demonstration above covers the case where each key corresponds to only one row in each table (one user corresponds to one consumption record). So how are the tables joined if a user has multiple consumption records?

Suppose the data now looks like this, where df_2 contains two records for a:

Merge in the same 'inner' way:

In [12]: df_1.merge(df_2,how='inner',on='userid')
Out[12]:
  userid  age  payment
0      a   23     2000
1      a   23      500
2      b   46     1000
3      c   32     3500

Except for the matching stage, the whole process is basically the same as above.

Process Diagram:

① Take the intersection of the keys of the two tables. Here the intersection of userid in df_1 and df_2 is {a, b, c}.

② During matching, since a has two consumption records, the row for a in the user basic information table is duplicated once so that it can pair with both rows on the right.

③ Result
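The one-to-many case can be reproduced with a small sketch. The tables are assumed values inferred from the output above, and the validate argument is an optional extra (not used in the original article) that asks pandas to check that keys are unique on the left side:

```python
import pandas as pd

# Assumed demo data, inferred from the one-to-many output in this article
df_1 = pd.DataFrame({"userid": ["a", "b", "c", "d"], "age": [23, 46, 32, 19]})
df_2 = pd.DataFrame({"userid": ["a", "a", "b", "c"],
                     "payment": [2000, 500, 1000, 3500]})

# validate="one_to_many" raises MergeError if userid is duplicated in df_1
result = df_1.merge(df_2, how="inner", on="userid", validate="one_to_many")
print(result)

# The single left row for user a appears once per matching right row
print((result["userid"] == "a").sum())  # 2
```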



2. left and right

The 'left' and 'right' types of merge are very similar; they are called left join and right join respectively. Each can be converted into the other, so they are introduced together here.

  • 'left'

When merging, rows are paired based on the keys in the left table. If a key in the left table does not exist in the right table, the missing values are filled with NaN.

  • 'right'

When merging, rows are paired based on the keys in the right table. If a key in the right table does not exist in the left table, the missing values are filled with NaN.

What does that mean? Here is a concrete example, using this demo data:

Now merge using the 'left' method:

In [21]: df_1.merge(df_2,how='left',on='userid')
Out[21]:
  userid  age  payment
0      a   23   2000.0
1      b   46      NaN
2      c   32   3500.0
3      d   19      NaN

Process Diagram:

① Pair based on all the keys in the left table. In the figure, because e in the right table is not in the left table, it is not paired.

② Merge the payment column from the right table into the left table, filling in NaN for rows that have no matching value.
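To see which left-table rows found a partner, the optional indicator argument of merge can be switched on. This is a sketch with assumed demo data (inferred from the output above); the original article does not use indicator:

```python
import pandas as pd

# Assumed demo data, inferred from the outputs shown in this article
df_1 = pd.DataFrame({"userid": ["a", "b", "c", "d"], "age": [23, 46, 32, 19]})
df_2 = pd.DataFrame({"userid": ["a", "c", "e"], "payment": [2000, 3500, 600]})

# indicator=True adds a _merge column: "both" for matched rows,
# "left_only" for left keys that have no match on the right
result = df_1.merge(df_2, how="left", on="userid", indicator=True)
print(result)

# Rows without a match are exactly the ones whose payment is NaN
unmatched = result[result["payment"].isna()]
print(unmatched["userid"].tolist())  # ['b', 'd']
```

Note that payment is displayed as a float (2000.0) because the column now has to hold NaN.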

Process summary:

The 'right' type is almost the same as 'left'. As long as the positions of the two tables are swapped, the two methods return the same data (only the order of rows and columns may differ), as follows:

In [22]: df_2.merge(df_1,how='right',on='userid')
Out[22]:
  userid  payment  age
0      a   2000.0   23
1      c   3500.0   32
2      b      NaN   46
3      d      NaN   19
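The equivalence can be checked programmatically. In this sketch (demo data assumed, inferred from the outputs above) the columns and rows are put into the same order before comparing, since the row order of a right join may vary between pandas versions:

```python
import pandas as pd

# Assumed demo data, inferred from the outputs shown in this article
df_1 = pd.DataFrame({"userid": ["a", "b", "c", "d"], "age": [23, 46, 32, 19]})
df_2 = pd.DataFrame({"userid": ["a", "c", "e"], "payment": [2000, 3500, 600]})

left = df_1.merge(df_2, how="left", on="userid")
right = df_2.merge(df_1, how="right", on="userid")

# Align column order and row order so that only the content is compared
left_aligned = left.sort_values("userid").reset_index(drop=True)
right_aligned = right[left.columns].sort_values("userid").reset_index(drop=True)

# DataFrame.equals treats NaN values in the same position as equal
print(left_aligned.equals(right_aligned))
```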

As for the case where the join key is one-to-many in 'left' and 'right' (and even in 'outer', introduced below), the principle is similar to that of 'inner' above, so I won't repeat it here.

3. outer

'outer' is an outer join. It takes the union of the keys of the two tables when joining. Words alone are not very intuitive, so let's look at an example!

Again, use the demo data from above.

This time, merge using 'outer':

In [24]: df_1.merge(df_2,how='outer',on='userid')
Out[24]:
  userid   age  payment
0      a  23.0   2000.0
1      b  46.0      NaN
2      c  32.0   3500.0
3      d  19.0      NaN
4      e   NaN    600.0

The diagram is as follows:

① Take the union of the keys of the two tables, which here is {a, b, c, d, e}.

② Put the data columns of the two tables together, filling in NaN wherever there is no match.
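The two steps above can be sketched as follows, again with assumed demo data inferred from the outputs in this article. The sketch also shows why age is printed as 23.0 in the outer result: an integer column that has to hold NaN is converted to float.

```python
import pandas as pd

# Assumed demo data, inferred from the outputs shown in this article
df_1 = pd.DataFrame({"userid": ["a", "b", "c", "d"], "age": [23, 46, 32, 19]})
df_2 = pd.DataFrame({"userid": ["a", "c", "e"], "payment": [2000, 3500, 600]})

# Step 1: union of the keys of the two tables
union = set(df_1["userid"]) | set(df_2["userid"])
print(sorted(union))  # ['a', 'b', 'c', 'd', 'e']

# Step 2: the outer join keeps every key and fills the gaps with NaN
result = df_1.merge(df_2, how="outer", on="userid")
print(result)

# age was an integer column in df_1, but user e has no age, so it becomes float
print(df_1["age"].dtype, "->", result["age"].dtype)
```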

If you have read this far, you should have a basic understanding of the whole merge process. In summary, the merge types differ in which set of keys from the two tables is kept when joining. That concludes this introduction to merge in Pandas!
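As a compact recap, the key set kept by each type can be compared against plain set operations. This is a sketch with assumed demo data (inferred from the outputs in this article), not part of the original post:

```python
import pandas as pd

# Assumed demo data, inferred from the outputs shown in this article
df_1 = pd.DataFrame({"userid": ["a", "b", "c", "d"], "age": [23, 46, 32, 19]})
df_2 = pd.DataFrame({"userid": ["a", "c", "e"], "payment": [2000, 3500, 600]})

left_keys = set(df_1["userid"])
right_keys = set(df_2["userid"])

# Key set each merge type is expected to keep
expected = {
    "inner": left_keys & right_keys,  # intersection
    "left": left_keys,                # all keys of the left table
    "right": right_keys,              # all keys of the right table
    "outer": left_keys | right_keys,  # union
}

results = {}
for how, keys in expected.items():
    merged = df_1.merge(df_2, how=how, on="userid")
    results[how] = set(merged["userid"])
    print(how, sorted(results[how]), results[how] == keys)
```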

Reposted from https://zhuanlan.zhihu.com/p/102274476

Origin blog.csdn.net/toby001111/article/details/127180457