Article Directory

1. Introduction
2. Syntax comparison
3. Summary

1. Introduction

Environment:
Windows 11, 64-bit
Python 3.9
MySQL 8
pandas 1.4.2

This article shows how to implement MySQL's union and join with pandas, and what the differences between the two are.

Note: Python is a very flexible language, and there are often several ways to achieve the same goal. What I provide here is only one of them. If you know other methods, please leave a comment to discuss.

2. Syntax comparison

Data tables

The data used in this article is as follows.
The Python code for constructing the dataset is:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'col1': list(range(1, 7)),
    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'col3': ['X', np.nan, 'Da', 'Xi', 'Xa', 'xa'],
    'col4': [10, 5, 3, 5, 2, None],
    'col5': [90, 60, 60, 80, 50, 50],
    'col6': ['Abc', 'Abc', 'bbb', 'Cac', 'Abc', 'bbb'],
})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3], 'col4': [5, 6, 7]})
df3 = pd.DataFrame({'col2': ['AA', 'DD', 'CC'], 'col8': [50, 70, 90]})

Note: Just paste the code into a Jupyter cell and run it. In the rest of this article, df1, df2, and df3 refer directly to the corresponding datasets.

The syntax for constructing this dataset using MySQL is as follows:

with t1 as(
  select  1 as col1, 'AA' as col2, 'X' as col3, 10.0 as col4, 90 as col5, 'Abc' as col6 union all
  select  2 as col1, 'AA' as col2, null as col3, 5.0 as col4, 60 as col5, 'Abc' as col6 union all
  select  3 as col1, 'AA' as col2, 'Da' as col3, 3.0 as col4, 60 as col5, 'bbb' as col6 union all
  select  4 as col1, 'BB' as col2, 'Xi' as col3, 5.0 as col4, 80 as col5, 'Cac' as col6 union all
  select  5 as col1, 'BB' as col2, 'Xa' as col3, 2.0 as col4, 50 as col5, 'Abc' as col6 union all
  select  6 as col1, 'BB' as col2, 'xa' as col3, null as col4, 50 as col5, 'bbb' as col6 
)
,t2 as(
  select  'AA' as col2, 1 as col7, 5 as col4 union all
  select  'BB' as col2, 2 as col7, 6 as col4 union all
  select  'CC' as col2, 3 as col7, 7 as col4 
)
,t3 as(
  select  'AA' as col2, 50 as col8 union all
  select  'DD' as col2, 70 as col8 union all
  select  'CC' as col2, 90 as col8 
)
select * from t1;

Note: Just paste the code into a MySQL query window and run it. When SQL is shown later in this article, the dataset definition (lines 1 to 18 of the code above) is assumed to be included, and only the query statement, such as line 19, is shown.

The Python and MySQL datasets correspond as follows:

Python dataset	MySQL dataset
df1	t1
df2	t2
df3	t3

union

union is relatively simple and comes in two forms: union all, which does not deduplicate, and union, which does.

1. Without deduplication: union all
MySQL's union all can be implemented with the pandas pd.concat() function.
Note that pd.concat() takes a list whose elements are the datasets to combine. To union three tables, put the three corresponding datasets in the list, and so on for more. reset_index() resets the index, and drop=True discards the old index; without it, the old index is kept as an extra column named index. If you do not need to reset the index, omit .reset_index(drop=True).

Language	Code
Python	pd.concat([df1.col2, df2.col2]).reset_index(drop=True)
MySQL	select col2 from t1 union all select col2 from t2;
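As a minimal sketch of the union all equivalent described above, the snippet below rebuilds only the col2 columns of df1 and df2 (a trimmed copy of the article's data) and shows what reset_index(drop=True) changes. The ignore_index=True argument of pd.concat() is not used in the article; it is included here as an assumption-free alternative that should give the same fresh index.

```python
import pandas as pd

# Trimmed copies of the article's df1 and df2: only col2 is needed here.
df1 = pd.DataFrame({'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB']})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC']})

# union all: stack the two columns, keeping every row.
stacked = pd.concat([df1.col2, df2.col2])
print(stacked.index.tolist())    # old labels repeat: [0, 1, 2, 3, 4, 5, 0, 1, 2]

# Rebuild a clean 0..n-1 index and discard the old one.
union_all = stacked.reset_index(drop=True)
print(union_all.tolist())

# Alternative: let concat renumber the rows itself.
union_all_alt = pd.concat([df1.col2, df2.col2], ignore_index=True)
print(union_all.equals(union_all_alt))
```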

When the tables on either side of the union carry additional restrictions, as in the following example, filter each table first, following the handling of where conditions described in the previous article, "Python and MySQL Comparison (1): Using Pandas to Realize MySQL Grammatical Effects".

Language	Code
Python	df1_1 = df1.col2[df1.col2 == 'BB']
	df2_1 = df2.col2[df2.col2 == 'AA']
	pd.concat([df1_1, df2_1]).reset_index(drop=True)
MySQL	select col2 from t1 where col2='BB' union all select col2 from t2 where col2='AA';
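The filter-then-concatenate pattern can be tried in isolation with the sketch below. It uses trimmed copies of df1 and df2 and selects the column with .loc, which is one of several equivalent ways to write the filter.

```python
import pandas as pd

# Trimmed copies of the article's df1 and df2: only col2 is needed here.
df1 = pd.DataFrame({'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB']})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC']})

# where col2='BB' on the first table, where col2='AA' on the second.
df1_1 = df1.loc[df1.col2 == 'BB', 'col2']
df2_1 = df2.loc[df2.col2 == 'AA', 'col2']

# union all of the two filtered columns.
filtered_union = pd.concat([df1_1, df2_1]).reset_index(drop=True)
print(filtered_union.tolist())   # ['BB', 'BB', 'BB', 'AA']
```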

2. With deduplication: union
In MySQL, union without all deduplicates the merged result. In pandas, df.drop_duplicates() achieves the same effect.

Language	Code
Python	pd.concat([df1.col2, df2.col2]).drop_duplicates().reset_index(drop=True)
MySQL	select col2 from t1 union select col2 from t2;
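A short sketch of the deduplicating union follows. The single-column case mirrors the article's example; the two-column case is an extra illustration (not from the article) showing that drop_duplicates() compares whole rows, as union does.

```python
import numpy as np
import pandas as pd

# Trimmed copies of the article's df1 and df2 (col2 and col4 only).
df1 = pd.DataFrame({'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                    'col4': [10, 5, 3, 5, 2, np.nan]})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col4': [5, 6, 7]})

# Single column: same as "select col2 from t1 union select col2 from t2".
union_one = pd.concat([df1.col2, df2.col2]).drop_duplicates().reset_index(drop=True)
print(union_one.tolist())   # ['AA', 'BB', 'CC']

# Two columns: a row is dropped only if every column matches an earlier row.
union_two = (
    pd.concat([df1[['col2', 'col4']], df2[['col2', 'col4']]])
    .drop_duplicates()
    .reset_index(drop=True)
)
print(len(union_two))       # 8: only the repeated ('AA', 5) row is removed
```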

join

join covers more ground and comes in four forms: inner join, left join, right join, and outer join. In pandas they are generally implemented with merge(). pandas also offers the Cartesian product, cross join, but it is rarely needed. The two most basic forms, left join and inner join, are used the most. left join and right join work in a similar, easy-to-understand way, and a right join can generally be expressed as a left join. Each of these joins can use a single condition (on) or multiple conditions (and), and can relate two tables or several. The following sections introduce the four join types with two tables (covering multiple conditions along the way) and then use inner join and left join for multi-table joins.
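For completeness, here is a minimal sketch of the Cartesian product mentioned above, assuming pandas 1.2 or later, where merge() accepts how='cross'. It uses trimmed copies of df2 and df3.

```python
import pandas as pd

# Trimmed copies of the article's df2 and df3.
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3]})
df3 = pd.DataFrame({'col2': ['AA', 'DD', 'CC'], 'col8': [50, 70, 90]})

# Cartesian product: every row of df2 paired with every row of df3.
# how='cross' takes no on/left_on/right_on arguments.
cross = pd.merge(df2, df3, how='cross')
print(cross.shape)              # (9, 4)
print(cross.columns.tolist())   # col2 exists in both tables, so it is suffixed
```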

1. Inner join of two tables

Single condition: generally a single on condition associates one foreign key shared by the two tables. When the key has the same name in both tables, you can use on directly, as in [Code 1]; when the key names differ, use left_on and right_on to specify the key name in each table, as in [Code 2]. The suffixes parameter takes two suffixes: when the two tables share column names other than the join key, the suffixes are appended to the same-named columns of the left and right tables respectively. The default is ('_x', '_y'), so same-named columns from the left table get _x and those from the right table get _y.
[Code 1] and [Code 2] use pd.merge(df1, df2), whose first parameter, left, is the left table and whose second parameter, right, is the right table. Alternatively you can write df1.merge(df2), where df1 is the left table and df2 is the right table; the other parameters are the same as described above. [Code 3] and [Code 4] show this form.
Note: the MySQL result contains two col2 columns and two col4 columns. In the result display used here, the data in the first same-named column is overwritten by the second, so the two columns appear identical. pandas, by contrast, keeps a single key column and adds suffixes to the other same-named columns to tell them apart.

Language	Code
Python [Code 1]	pd.merge(df1, df2, on='col2', how='inner', suffixes=('_left', '_right'))
Python [Code 2]	pd.merge(df1, df2, left_on='col2', right_on='col2', how='inner', suffixes=('_left', '_right'))
Python [Code 3]	df1.merge(df2, on='col2', how='inner', suffixes=('_left', '_right'))
Python [Code 4]	df1.merge(df2, left_on='col2', right_on='col2', how='inner', suffixes=('_left', '_right'))
MySQL	select * from t1 join t2 on t1.col2=t2.col2;
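The effect of on, left_on/right_on, and suffixes on the output columns can be checked with the sketch below. It uses trimmed copies of df1 and df2; the column name key2 is hypothetical and only serves to create a case where the key names differ.

```python
import numpy as np
import pandas as pd

# Trimmed copies of the article's df1 and df2 (only the relevant columns).
df1 = pd.DataFrame({'col1': list(range(1, 7)),
                    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                    'col4': [10, 5, 3, 5, 2, np.nan]})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3], 'col4': [5, 6, 7]})

# Same key name in both tables: on= is enough.
# col4 exists in both tables and is not a key, so it receives the suffixes.
same_key = pd.merge(df1, df2, on='col2', how='inner', suffixes=('_left', '_right'))
print(same_key.columns.tolist())

# Different key names (key2 is a hypothetical name): use left_on/right_on.
df2_renamed = df2.rename(columns={'col2': 'key2'})
diff_key = pd.merge(df1, df2_renamed, left_on='col2', right_on='key2', how='inner')
print(diff_key.columns.tolist())   # col2 and key2 are both kept; col4 gets _x/_y
```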

Multiple conditions - multiple keys
To join on multiple keys, start from the single-key form and pass a list as the value of on, or of left_on and right_on. Each element of the list is the name of a foreign key, and the keys in the left_on and right_on lists must correspond one-to-one by position.
Note: a single key can also be passed as a list, in which case the list has only one element.

Language	Code
Python [Code 1]	pd.merge(df1, df2, left_on=['col2', 'col4'], right_on=['col2', 'col4'], how='inner', suffixes=('_left', '_right'))
Python [Code 2]	pd.merge(df1, df2, on=['col2', 'col4'], how='inner', suffixes=('_left', '_right'))
Python [Code 3]	df1.merge(df2, left_on=['col2', 'col4'], right_on=['col2', 'col4'], how='inner', suffixes=('_left', '_right'))
Python [Code 4]	df1.merge(df2, on=['col2', 'col4'], how='inner', suffixes=('_left', '_right'))
MySQL	select * from t1 join t2 on t1.col2=t2.col2 and t1.col4=t2.col4;
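The positional pairing of multi-key lists can be illustrated as follows. This is a sketch on trimmed copies of df1 and df2; the names k2 and k4 are hypothetical and exist only to show left_on and right_on with differently named keys.

```python
import numpy as np
import pandas as pd

# Trimmed copies of the article's df1 and df2.
df1 = pd.DataFrame({'col1': list(range(1, 7)),
                    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                    'col4': [10, 5, 3, 5, 2, np.nan]})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3], 'col4': [5, 6, 7]})

# Both keys share their names: a list passed to on= is enough.
both_keys = pd.merge(df1, df2, on=['col2', 'col4'], how='inner')
print(both_keys[['col1', 'col2', 'col7']])   # only the pair (AA, 5) is in both tables

# Differently named keys (k2, k4 are hypothetical): the lists pair up by position,
# so col2 is matched with k2 and col4 with k4.
df2_renamed = df2.rename(columns={'col2': 'k2', 'col4': 'k4'})
paired = pd.merge(df1, df2_renamed,
                  left_on=['col2', 'col4'], right_on=['k2', 'k4'], how='inner')
print(paired[['col1', 'k2', 'k4', 'col7']])
```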

Multiple conditions - single-table restriction
The following four snippets return the same result.
Of the two MySQL statements, [MySQL1] applies the filter during the join, while [MySQL2] completes the join first and then filters with where.
With pandas, you can filter before merge(), as in [Code 1], or filter after merge(), as in [Code 2].
Note: df1.merge(df2) works the same way as pd.merge(df1, df2), so the df1.merge(df2) variants are not repeated here.

Language	Code
Python [Code 1]	pd.merge(df1, df2[df2.col2 == 'AA'], left_on=['col2'], right_on=['col2'], how='inner', suffixes=('_left', '_right'))
Python [Code 2]	df1_1 = pd.merge(df1, df2, left_on=['col2'], right_on=['col2'], how='inner', suffixes=('_left', '_right'))
	df1_1[df1_1.col2 == 'AA']
MySQL [MySQL1]	select * from t1 join t2 on t1.col2=t2.col2 and t2.col2='AA';
MySQL [MySQL2]	select * from t1 join t2 on t1.col2=t2.col2 where t2.col2='AA';
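The claim that filtering before or after an inner join gives the same rows can be checked with a small sketch on trimmed copies of df1 and df2:

```python
import pandas as pd

# Trimmed copies of the article's df1 and df2.
df1 = pd.DataFrame({'col1': list(range(1, 7)),
                    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB']})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3]})

# Filter the right table first, then join (like the condition inside on ... and ...).
before = pd.merge(df1, df2[df2.col2 == 'AA'], on='col2', how='inner')

# Join first, then filter the joined rows (like where).
joined = pd.merge(df1, df2, on='col2', how='inner')
after = joined[joined.col2 == 'AA']

# With an inner join both orders keep the same three AA rows.
print(before['col1'].tolist(), after['col1'].tolist())
```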

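2. Left join of two tables

Multiple conditions - multiple keys
The single-condition case is simple, so let's go straight to a multi-key example.
The pandas syntax for left join is almost the same as for inner join: just change the how argument from inner to left. The results differ, of course. A left join takes the left table as the base and attaches to it the right table's columns that can be matched through the foreign key.
Difference: the example below shows some differences. The Python result keeps only the left table's key. If you need the right table's key, you cannot reference it directly and have to infer it from other columns of the right table. MySQL can still refer to the right table's keys through t2.col2 and t2.col4.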

Language	Code
Python	pd.merge(df1, df2, left_on=['col2', 'col4'], right_on=['col2', 'col4'], how='left', suffixes=('_left', '_right'))
MySQL	select * from t1 left join t2 on t1.col2=t2.col2 and t1.col4=t2.col4;
Note: because the MySQL result contains same-named columns, the data displayed for those columns is not accurate.
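Because the merged result has a single copy of each key, telling matched rows from unmatched ones needs some other signal. The sketch below shows two options on trimmed copies of df1 and df2: checking a non-key column of the right table, as the text suggests, and the indicator=True argument of merge(), which the article does not use and which is offered here only as an alternative.

```python
import numpy as np
import pandas as pd

# Trimmed copies of the article's df1 and df2.
df1 = pd.DataFrame({'col1': list(range(1, 7)),
                    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                    'col4': [10, 5, 3, 5, 2, np.nan]})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3], 'col4': [5, 6, 7]})

# Left join on two keys; indicator=True adds a _merge column to the result.
left = pd.merge(df1, df2, on=['col2', 'col4'], how='left', indicator=True)
print(left.columns.tolist())    # one col2 and one col4: the keys are merged

# Option 1: a non-key right-table column is non-null only for matched rows
# (this relies on col7 having no nulls of its own in df2).
matched_by_col7 = left['col7'].notna()

# Option 2: _merge is 'both' for matched rows and 'left_only' otherwise.
matched_by_flag = left['_merge'] == 'both'

print(left.loc[matched_by_flag, 'col1'].tolist())   # [2]
```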

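Multiple conditions - single-table restriction
When there are multiple conditions and one of them restricts a single table, left join behaves differently from inner join. With inner join, the four snippets in the earlier "Multiple conditions - single-table restriction" section return the same result.
With left join, the results are very different.

As shown below, [MySQL1] and [MySQL2] give very different results. [MySQL1] takes t1 as the main table; the condition and t2.col2='AA' filters t2 during the join, and the filtered t2 is then joined to t1. Every returned record therefore carries t1's data, plus t2's columns where t1.col2 finds a match. [MySQL2] also takes t1 as the main table, but joins t2 to t1 directly and then filters the joined data with where t2.col2='AA'.
In Python, [Code 1] behaves like [MySQL1] and [Code 2] behaves like [MySQL2].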

Language	Code
Python [Code 1]	pd.merge(df1, df2[df2.col2 == 'AA'], left_on=['col2'], right_on=['col2'], how='left', suffixes=('_left', '_right'))
Python [Code 2]	df1_1 = pd.merge(df1, df2, left_on=['col2'], right_on=['col2'], how='left', suffixes=('_left', '_right'))
	df1_1[df1_1.col2 == 'AA']
MySQL [MySQL1]	select * from t1 left join t2 on t1.col2=t2.col2 and t2.col2='AA';
MySQL [MySQL2]	select * from t1 left join t2 on t1.col2=t2.col2 where t2.col2='AA';
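The difference between filtering before and after a left join shows up directly in the row counts. A minimal sketch on trimmed copies of df1 and df2:

```python
import pandas as pd

# Trimmed copies of the article's df1 and df2.
df1 = pd.DataFrame({'col1': list(range(1, 7)),
                    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB']})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3]})

# Filter the right table first, then left join: every df1 row survives,
# and the BB rows simply find no partner (col7 is NaN).
filter_then_join = pd.merge(df1, df2[df2.col2 == 'AA'], on='col2', how='left')
print(len(filter_then_join), filter_then_join['col7'].isna().sum())   # 6 3

# Left join first, then filter: the BB rows are removed afterwards.
joined = pd.merge(df1, df2, on='col2', how='left')
join_then_filter = joined[joined.col2 == 'AA']
print(len(join_then_filter))                                          # 3
```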

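3. Right join of two tables

right join is similar to left join in syntax and meaning. If you swap the positions of the left and right tables, right join can replace left join. See the example below: its effect is the same as that of the left join example under "Multiple conditions - multiple keys".

The following example shows some differences between the two. The Python result keeps only the right table's key. If you need the left table's key, you cannot reference it directly and have to infer it from other columns of the left table. MySQL can still refer to the left table's keys through t2.col2 and t2.col4.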

Language	Code
Python	pd.merge(df2, df1, left_on=['col2', 'col4'], right_on=['col2', 'col4'], how='right', suffixes=('_left', '_right'))
MySQL	select * from t2 right join t1 on t1.col2=t2.col2 and t1.col4=t2.col4;
Note: because the MySQL result contains same-named columns, the data displayed for those columns is not accurate.
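The equivalence between a right join with swapped tables and the earlier left join can be checked with the sketch below, again on trimmed copies of df1 and df2. Only the column order is expected to differ.

```python
import numpy as np
import pandas as pd

# Trimmed copies of the article's df1 and df2.
df1 = pd.DataFrame({'col1': list(range(1, 7)),
                    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                    'col4': [10, 5, 3, 5, 2, np.nan]})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3], 'col4': [5, 6, 7]})

# df1 as the base table, written both ways.
left_join = pd.merge(df1, df2, on=['col2', 'col4'], how='left')
right_join = pd.merge(df2, df1, on=['col2', 'col4'], how='right')

# The column order follows the order of the tables in the call.
print(left_join.columns.tolist())
print(right_join.columns.tolist())

# Reordering the columns makes the two results directly comparable.
print(right_join[left_join.columns])
```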

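4. Outer join of two tables

outer join is also called full join. MySQL 8 generally does not support full join. [MySQL1] in the table below uses full join; where that syntax is unavailable, use [MySQL2] or [MySQL3] instead. (Note that because there are several col2 columns, the first is overwritten by the second in the display, so rename one of them to see the difference.)
Python places no such restriction on outer join, and it can be done in one step, as in [Code 1]. The result differs somewhat from MySQL's, mainly because pandas merges the col2 columns of the left and right tables, so the col2 column shows four values. [Code 2] and [Code 3] achieve the same effect with logic similar to [MySQL2] and [MySQL3]. However, because the key columns are merged, the other table's key is no longer available, so [Code 3] uses another column (col4, which comes from df2) to pick out the unmatched rows. That column must not contain nulls of its own, otherwise the final result will be wrong.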

Language	Code
Python [Code 1]	pd.merge(df2, df3, left_on=['col2'], right_on=['col2'], how='outer', suffixes=('_left', '_right'))
Python [Code 2]	df2_1 = pd.merge(df2, df3, left_on=['col2'], right_on=['col2'], how='left', suffixes=('_left', '_right'))
	df3_1 = pd.merge(df2, df3, left_on=['col2'], right_on=['col2'], how='right', suffixes=('_left', '_right'))
	pd.concat([df2_1, df3_1]).drop_duplicates().reset_index(drop=True)  # recommended
Python [Code 3]	df2_1 = pd.merge(df2, df3, left_on=['col2'], right_on=['col2'], how='left', suffixes=('_left', '_right'))
	df3_1 = pd.merge(df2, df3, left_on=['col2'], right_on=['col2'], how='right', suffixes=('_left', '_right'))
	pd.concat([df2_1, df3_1[df3_1.col4.isna()]]).reset_index(drop=True)  # if df3_1.col4 itself contains nulls, the final result is affected
MySQL [MySQL1]	select * from t2 full join t3 on t3.col2=t2.col2;
MySQL [MySQL2]	select * from t2 left join t3 on t3.col2=t2.col2 union all select t2.*, t3.* from t3 left join t2 on t3.col2=t2.col2 where t2.col2 is null;
MySQL [MySQL3]	select * from t2 left join t3 on t3.col2=t2.col2 union all select * from t2 right join t3 on t3.col2=t2.col2 where t2.col2 is null;
Note: because there are several same-named columns, the data in the first column is overwritten by the second. Aliasing the columns avoids this problem.
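The sketch below compares the one-step outer merge with a left-plus-right emulation on copies of df2 and df3. Instead of testing a data column for nulls, it picks the unmatched right-hand rows with merge()'s indicator=True argument. That substitution is an assumption of this sketch rather than the article's method, chosen because it does not depend on any column being free of nulls.

```python
import pandas as pd

# Copies of the article's df2 and df3.
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3], 'col4': [5, 6, 7]})
df3 = pd.DataFrame({'col2': ['AA', 'DD', 'CC'], 'col8': [50, 70, 90]})

# One step: keys from both tables end up in a single col2 column.
outer = pd.merge(df2, df3, on='col2', how='outer')
print(sorted(outer['col2']))        # ['AA', 'BB', 'CC', 'DD']

# Emulation: all rows of the left join ...
left_part = pd.merge(df2, df3, on='col2', how='left')

# ... plus the rows of the right join that found no partner in df2.
right_all = pd.merge(df2, df3, on='col2', how='right', indicator=True)
right_only = right_all[right_all['_merge'] == 'right_only'].drop(columns='_merge')

emulated = pd.concat([left_part, right_only]).reset_index(drop=True)
print(sorted(emulated['col2']))
```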

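Strictly speaking, the result should keep the key column of each table separately (t2_col2 and t3_col2). Because of the limitations above, the results returned by both MySQL and Python deviate from that somewhat.

The MySQL code for this result is as follows: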

select t2.col2 t2_col2,t2.col7,t2.col4,t3.col2 t3_col2,t3.col8 
from t2 full join t3 on t3.col2=t2.col2;
-- or
select t2.col2 t2_col2,t2.col7,t2.col4,t3.col2 t3_col2,t3.col8
from t2 left join t3 on t3.col2=t2.col2
union all
select t2.col2 t2_col2,t2.col7,t2.col4,t3.col2 t3_col2,t3.col8
from t3 left join t2 on t3.col2=t2.col2 where t2.col2 is null;
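A pandas counterpart to the aliased query above could rename the key columns before merging, so that each table's key survives as its own column. This is a sketch under that assumption; the names t2_col2 and t3_col2 are taken from the aliases in the SQL.

```python
import pandas as pd

# Copies of the article's df2 and df3.
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3], 'col4': [5, 6, 7]})
df3 = pd.DataFrame({'col2': ['AA', 'DD', 'CC'], 'col8': [50, 70, 90]})

# Give each key its own name, mirroring the SQL aliases.
t2 = df2.rename(columns={'col2': 't2_col2'})
t3 = df3.rename(columns={'col2': 't3_col2'})

# With different key names, merge keeps both key columns side by side.
full = pd.merge(t2, t3, left_on='t2_col2', right_on='t3_col2', how='outer')
print(full.columns.tolist())   # ['t2_col2', 'col7', 'col4', 't3_col2', 'col8']
print(full)
```

Rows that exist in only one table show NaN in the other table's key column, which is what the aliased SQL returns as null.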

5. Join of multiple tables

When joining multiple tables, chaining the joins in the form df1.merge(df2) is recommended, as in the following example.

Language	Code
Python	df1.merge(df2, left_on='col2', right_on='col2', how='left').merge(df3, left_on='col2', right_on='col2', how='left')
MySQL	select * from t1 left join t2 on t2.col2=t1.col2 left join t3 on t3.col2=t1.col2
Note: because there are several same-named columns, the data in the first column is overwritten by the second. Aliasing the columns avoids this problem.
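The chained form can be sketched as below on trimmed copies of the three tables. The functools.reduce variant is not in the article; it is shown as one possible way to apply the same left join to a longer list of tables that share the key.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Trimmed copies of the article's df1, df2 and df3.
df1 = pd.DataFrame({'col1': list(range(1, 7)),
                    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                    'col4': [10, 5, 3, 5, 2, np.nan]})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3], 'col4': [5, 6, 7]})
df3 = pd.DataFrame({'col2': ['AA', 'DD', 'CC'], 'col8': [50, 70, 90]})

# Chain two left joins on the shared key col2.
chained = df1.merge(df2, on='col2', how='left').merge(df3, on='col2', how='left')
print(chained.columns.tolist())   # col4 clashes in the first merge: col4_x, col4_y
print(chained[['col1', 'col2', 'col7', 'col8']])

# The same chain written as a fold over a list of right-hand tables.
reduced = reduce(lambda left, right: left.merge(right, on='col2', how='left'),
                 [df2, df3], df1)
```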

pandas merges the key columns here as well. Besides that, note that when non-key columns share a name, pandas adds suffixes to them. If such a column is used as the key of a later join, remember to reference it by its suffixed name (col4_y in the example below).

Language	Code
Python	df1.merge(df2, left_on='col2', right_on='col2', how='left', suffixes=('_x', '_y')).merge(df3, left_on='col4_y', right_on='col8', how='left', suffixes=('_left', '_right'))
MySQL	select * from t1 left join t2 on t2.col2=t1.col2 left join t3 on t3.col8=t2.col4;
Note: because there are several same-named columns, the data in the first column is overwritten by the second. Aliasing the columns avoids this problem.
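Splitting the chain into two steps makes the suffixed names visible before they are used as keys. The sketch below does this on trimmed copies of the three tables; the intermediate names step1 and step2 are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Trimmed copies of the article's df1, df2 and df3.
df1 = pd.DataFrame({'col1': list(range(1, 7)),
                    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                    'col4': [10, 5, 3, 5, 2, np.nan]})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3], 'col4': [5, 6, 7]})
df3 = pd.DataFrame({'col2': ['AA', 'DD', 'CC'], 'col8': [50, 70, 90]})

# First join: col4 exists in both tables, so it becomes col4_x (df1) and col4_y (df2).
step1 = df1.merge(df2, on='col2', how='left', suffixes=('_x', '_y'))
print(step1.columns.tolist())   # check the suffixed names before using them as keys

# Second join: t2.col4 is now called col4_y, so that is the name to join on.
# col2 is no longer a key here, so it is suffixed as col2_left / col2_right.
step2 = step1.merge(df3, left_on='col4_y', right_on='col8', how='left',
                    suffixes=('_left', '_right'))
print(step2[['col1', 'col4_y', 'col8']])   # no col4_y value equals any col8 value
```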

3. Summary

When Python joins tables, it merges the foreign keys. With inner join this makes no difference, but with left join, right join, or outer join you need to pay attention to the difference to avoid unnecessary mistakes.

Comparison between Python and MySQL (2): Using Pandas to realize the syntactic effect of union and join of MySQL