1. Introduction
Environment:
Windows 11 64-bit
Python 3.9
MySQL 8
pandas 1.4.2
This article shows how to implement MySQL's union and join with pandas, and where the two approaches differ.
Note: Python is a very flexible language. There may be multiple ways to achieve the same goal. What I provide is only one of the solutions. If you have other methods, please leave a message to discuss.
2. Syntax comparison
Dataset
The data used this time are as follows.
The code for constructing this dataset in Python is as follows:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'col1': list(range(1, 7)),
    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'col3': ['X', np.nan, 'Da', 'Xi', 'Xa', 'xa'],
    'col4': [10, 5, 3, 5, 2, None],
    'col5': [90, 60, 60, 80, 50, 50],
    'col6': ['Abc', 'Abc', 'bbb', 'Cac', 'Abc', 'bbb'],
})
df2 = pd.DataFrame({
    'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3], 'col4': [5, 6, 7]})
df3 = pd.DataFrame({
    'col2': ['AA', 'DD', 'CC'], 'col8': [50, 70, 90]})
Note: Paste the code into a Jupyter cell and run it. In the rest of the article, df1, df2, and df3 refer directly to the corresponding data.
The code for constructing this dataset in MySQL is as follows:
with t1 as(
select 1 as col1, 'AA' as col2, 'X' as col3, 10.0 as col4, 90 as col5, 'Abc' as col6 union all
select 2 as col1, 'AA' as col2, null as col3, 5.0 as col4, 60 as col5, 'Abc' as col6 union all
select 3 as col1, 'AA' as col2, 'Da' as col3, 3.0 as col4, 60 as col5, 'bbb' as col6 union all
select 4 as col1, 'BB' as col2, 'Xi' as col3, 5.0 as col4, 80 as col5, 'Cac' as col6 union all
select 5 as col1, 'BB' as col2, 'Xa' as col3, 2.0 as col4, 50 as col5, 'Abc' as col6 union all
select 6 as col1, 'BB' as col2, 'xa' as col3, null as col4, 50 as col5, 'bbb' as col6
)
,t2 as(
select 'AA' as col2, 1 as col7, 5 as col4 union all
select 'BB' as col2, 2 as col7, 6 as col4 union all
select 'CC' as col2, 3 as col7, 7 as col4
)
,t3 as(
select 'AA' as col2, 50 as col8 union all
select 'DD' as col2, 70 as col8 union all
select 'CC' as col2, 90 as col8
)
select * from t1;
Note: Paste the code into a MySQL query window and run it. In the SQL examples that follow, the dataset definition (lines 1 to 18 of the code above) is assumed to be included, and only the query statement (such as line 19) is shown.
The datasets correspond as follows:
Python dataset | MySQL dataset |
---|---|
df1 | t1 |
df2 | t2 |
df3 | t3 |
union
The union case is relatively simple and comes in two forms: union all, which keeps duplicates, and union, which removes them.
1. Without deduplication: union all
The Pandas function pd.concat() can implement MySQL's union all.
Note that pd.concat() takes a list whose elements are the datasets to combine. To union three tables, pass a list with the three corresponding datasets, and so on for any number of tables. reset_index() resets the index, and the parameter drop=True discards the old index; without it, the old index is kept as an extra column named index. If you don't need to reset the index, leave out .reset_index(drop=True).
language | Python | MySQL |
---|---|---|
the code | pd.concat([df1.col2,df2.col2]).reset_index(drop=True) | select col2 from t1 union all select col2 from t2; |
result |
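To make the index behaviour described above concrete, here is a minimal sketch. The two Series are stand-ins for df1.col2 and df2.col2 from the dataset, and ignore_index=True is shown only as an alternative that pandas also offers, not as something the article itself uses.

```python
import pandas as pd

# Stand-ins for df1.col2 and df2.col2 from the dataset above.
s1 = pd.Series(['AA', 'AA', 'AA', 'BB', 'BB', 'BB'], name='col2')
s2 = pd.Series(['AA', 'BB', 'CC'], name='col2')

# Without reset_index, each input keeps its own labels, so 0, 1, 2 appear twice.
raw = pd.concat([s1, s2])
print(raw.index.tolist())        # [0, 1, 2, 3, 4, 5, 0, 1, 2]

# drop=True throws the old labels away and renumbers the rows from 0.
union_all = pd.concat([s1, s2]).reset_index(drop=True)
print(union_all.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8]

# Passing ignore_index=True to concat renumbers the rows in the same way.
same = pd.concat([s1, s2], ignore_index=True)
print(same.tolist() == union_all.tolist())  # True
```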
When the tables on either side of the union have additional restrictions, as in the following example, filter them first in the way described for where conditions in the previous article, "Python and MySQL Comparison (1): Using Pandas to Realize MySQL Grammatical Effects".
language | Python | MySQL |
---|---|---|
the code | df1_1 = df1.col2[df1.col2=='BB']; df2_1 = df2.col2[df2.col2=='AA']; pd.concat([df1_1, df2_1]).reset_index(drop=True) | select col2 from t1 where col2='BB' union all select col2 from t2 where col2='AA'; |
result |
2. With deduplication: union
When MySQL uses union without all, the merged result is deduplicated. In Pandas, df.drop_duplicates() achieves the same effect.
language | Python | MySQL |
---|---|---|
the code | pd.concat([df1.col2,df2.col2]).drop_duplicates().reset_index(drop=True) | select col2 from t1 union select col2 from t2; |
result |
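The example in the table uses a single column. As a small sketch of the multi-column case, the frames below are hypothetical (they are not the article's dataset) and are chosen so that exactly one whole row repeats. drop_duplicates() compares entire rows by default, which matches how union treats a row as a duplicate only when every selected column is equal.

```python
import pandas as pd

# Hypothetical two-column frames; only the row ('AA', 90) occurs in both.
a = pd.DataFrame({'col2': ['AA', 'AA', 'BB'], 'col5': [90, 60, 60]})
b = pd.DataFrame({'col2': ['AA', 'BB'], 'col5': [90, 70]})

# union all: keep every row from both frames.
union_all = pd.concat([a, b]).reset_index(drop=True)

# union: drop rows whose values match an earlier row in every column.
union = pd.concat([a, b]).drop_duplicates().reset_index(drop=True)

print(len(union_all))  # 5
print(len(union))      # 4
```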
join
join covers more ground and is divided into inner join, left join, right join, and outer join. In Pandas, these are generally implemented with merge(). Pandas also supports the Cartesian product, cross join, but it is rarely used. The two basic forms, inner join and left join, are used the most. right join works the same way as left join, so it is easy to understand and can generally be achieved with left join. Each join type can use a single condition (on) or multiple conditions (and), and can associate two tables or multiple tables. The following sections introduce the four join types using two tables (covering multiple conditions along the way) and then use inner join and left join for multi-table association.
1. Inner join of two tables
Single condition
With a single condition, on generally associates one foreign key shared by the two tables. When the key has the same name in both tables, pass it directly with on, as in [Code 1]. When the key names differ, use left_on and right_on to specify the respective key names, as in [Code 2]. The suffixes parameter takes two suffixes. When the two tables share column names other than the join key, these suffixes are appended to the same-named columns. The default is ('_x','_y'): _x is appended to the same-named columns of the left table and _y to those of the right table.
[Code 1] and [Code 2] use pd.merge(df1,df2), whose first parameter, left, is the left table and whose second parameter, right, is the right table. You can also use df1.merge(df2), where df1 is the left table and df2 is the right table; the other parameters are the same as described above. See [Code 3] and [Code 4].
Note: After the MySQL join, the result contains two col2 columns and two col4 columns. In the displayed result, the data of the first column of each pair is overwritten by the data of the second, so the duplicated columns show identical data. Pandas, by contrast, keeps a single key column and adds suffixes to the other same-named columns to tell them apart.
language | Python | MySQL |
---|---|---|
the code | [Code 1] pd.merge(df1,df2,on='col2',how='inner',suffixes=('_left','_right')) [Code 2] pd.merge(df1,df2,left_on='col2',right_on='col2',how='inner',suffixes=('_left','_right')) [Code 3] df1.merge(df2,on='col2',how='inner',suffixes=('_left','_right')) [Code 4] df1.merge(df2,left_on='col2',right_on='col2',how='inner',suffixes=('_left','_right')) | select * from t1 join t2 on t1.col2=t2.col2; |
result |
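The following sketch illustrates the suffixes behaviour and the left_on/right_on case described above. The frame named left is a trimmed, hypothetical stand-in for df1, right mirrors df2, and the renamed key key2 exists only for this illustration.

```python
import pandas as pd

# Trimmed stand-in for df1 and a copy of df2; both carry a non-key column col4.
left = pd.DataFrame({'col2': ['AA', 'BB'], 'col4': [10, 5]})
right = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3], 'col4': [5, 6, 7]})

# Default suffixes: the clashing col4 columns become col4_x and col4_y.
default = pd.merge(left, right, on='col2', how='inner')
print(default.columns.tolist())  # ['col2', 'col4_x', 'col7', 'col4_y']

# Custom suffixes, using the DataFrame.merge form.
named = left.merge(right, on='col2', how='inner', suffixes=('_left', '_right'))
print(named.columns.tolist())    # ['col2', 'col4_left', 'col7', 'col4_right']

# Differently named keys (key2 is a made-up name): both key columns are kept.
right_renamed = right.rename(columns={'col2': 'key2'})
both_keys = pd.merge(left, right_renamed, left_on='col2', right_on='key2', how='inner')
print('col2' in both_keys.columns and 'key2' in both_keys.columns)  # True
```

When the key names differ, pandas keeps both key columns in the result, which is closer to what the MySQL query returns.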
Multiple conditions - multiple keys
To join on multiple keys, start from the single-key form and pass a list as the value of on, or of left_on and right_on. Each element of the list is the name of a foreign key, and the keys in the left_on and right_on lists must correspond one-to-one by position.
Note: A single key can also be passed as a list; in that case, the list has only one element.
language | Python | MySQL |
---|---|---|
the code | [Code 1] pd.merge(df1,df2,left_on=['col2','col4'],right_on=['col2','col4'],how='inner',suffixes=('_left','_right')) [Code 2] pd.merge(df1,df2,on=['col2','col4'],how='inner',suffixes=('_left','_right')) [Code 3] df1.merge(df2,left_on=['col2','col4'],right_on=['col2','col4'],how='inner',suffixes=('_left','_right')) [Code 4] df1.merge(df2,on=['col2','col4'],how='inner',suffixes=('_left','_right')) | select * from t1 join t2 on t1.col2=t2.col2 and t1.col4=t2.col4; |
result |
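As a rough illustration of how the key list changes the match, the sketch below joins on one key and then on two. It assumes a trimmed copy of df1 (the first five rows, key columns only, so no nulls are involved) together with df2 from the dataset.

```python
import pandas as pd

# First five rows of df1, restricted to the columns needed here.
df1 = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': ['AA', 'AA', 'AA', 'BB', 'BB'],
    'col4': [10, 5, 3, 5, 2],
})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3], 'col4': [5, 6, 7]})

# One key passed as a one-element list: every AA/BB row of df1 finds a partner.
single = pd.merge(df1, df2, on=['col2'], how='inner')
print(len(single))             # 5

# Two keys: a row matches only if col2 AND col4 are both equal.
multi = pd.merge(df1, df2, on=['col2', 'col4'], how='inner')
print(multi['col1'].tolist())  # [2]
```

With two keys, col4 is a join key rather than a clashing column, so no suffix is added to it.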
Multiple conditions - single-table restriction
The following 4 pieces of code return the same result.
Of the two MySQL queries, [MySQL1] filters while joining, whereas [MySQL2] completes the join first and then filters with where.
In pandas, you can filter before merge(), as in [Code 1], or filter after merge(), as in [Code 2].
Note: df1.merge(df2) is used in the same way as pd.merge(df1,df2), so the df1.merge(df2) versions are not repeated here.
language | Python | MySQL |
---|---|---|
the code | [Code 1] pd.merge(df1,df2[df2.col2=='AA'],left_on=['col2'],right_on=['col2'],how='inner',suffixes=('_left','_right')) [Code 2] df1_1 = pd.merge(df1,df2,left_on=['col2'],right_on=['col2'],how='inner',suffixes=('_left','_right')); df1_1[df1_1.col2=='AA'] | [MySQL1] select * from t1 join t2 on t1.col2=t2.col2 and t2.col2='AA'; [MySQL2] select * from t1 join t2 on t1.col2=t2.col2 where t2.col2='AA'; |
result |
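A quick way to convince yourself that the two orders agree for an inner join is to compare them directly. This is a minimal sketch that assumes trimmed copies of df1 and df2 containing only the columns involved.

```python
import pandas as pd

# Trimmed copies of df1 and df2 from the dataset.
df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6],
                    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB']})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3]})

# Filter the right table first, then join (like [MySQL1]).
before = pd.merge(df1, df2[df2.col2 == 'AA'], on='col2', how='inner')

# Join first, then filter the joined rows (like [MySQL2]).
merged = pd.merge(df1, df2, on='col2', how='inner')
after = merged[merged.col2 == 'AA'].reset_index(drop=True)

print(before.equals(after))  # True
```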
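2. Left join of two tables
Multiple conditions - multiple keys
The single-condition case is simple, so let's go straight to a multi-key case.
The Python syntax for left join is much the same as for inner join: just change the how parameter from inner to left. The returned results do differ, of course. left join takes the left table as the base and attaches the columns of the right table that can be matched through the foreign key.
Difference: The example below shows some differences. The Python result keeps only the left table's keys. If you need the right table's keys, you cannot use them directly and have to rely on other columns of the right table to judge. In MySQL, you can still reference the right table's keys through t2.col2 and t2.col4.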
language | Python | MySQL |
---|---|---|
the code | pd.merge(df1,df2,left_on=['col2','col4'],right_on=['col2','col4'],how='left',suffixes=('_left','_right')) | select * from t1 left join t2 on t1.col2=t2.col2 and t1.col4=t2.col4; |
result | Note: In the screenshot, columns with the same name do not display their data accurately. |
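Since the merged result no longer carries the right table's keys, one option that the article does not use, but that merge() provides, is the indicator parameter. It adds a _merge column telling whether each row found a partner. The sketch below assumes a trimmed, null-free copy of df1 and the article's df2.

```python
import pandas as pd

# Trimmed copy of df1 (no nulls in the keys) and df2 from the dataset.
df1 = pd.DataFrame({'col1': [1, 2, 3, 4],
                    'col2': ['AA', 'AA', 'BB', 'BB'],
                    'col4': [10, 5, 5, 2]})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3], 'col4': [5, 6, 7]})

# indicator=True adds a _merge column: 'both' for matched rows, 'left_only' otherwise.
left = pd.merge(df1, df2, on=['col2', 'col4'], how='left', indicator=True)
print(left['_merge'].tolist())  # ['left_only', 'both', 'left_only', 'left_only']

# Rows that actually found a partner in df2, without relying on a nullable column.
matched = left[left['_merge'] == 'both']
print(matched['col1'].tolist())  # [2]
```

This avoids depending on another right-table column that might itself contain nulls.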
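Multiple conditions - single-table restriction
When there are multiple conditions that restrict a single table, left join behaves differently from inner join. The figure below shows the 4 pieces of code from the inner join "Multiple conditions - single-table restriction" case, whose results are consistent.
Changing them to left join makes the results very different.
As shown below, [MySQL1] and [MySQL2] give very different results. [MySQL1] uses t1 as the main table; the condition and t2.col2='AA' filters t2 during the join, and the filtered t2 is then joined to t1. Every returned record therefore carries t1's data, plus the columns of the t2 rows that match on t1.col2. [MySQL2] also uses t1 as the main table, but it joins t2 to t1 directly and then filters the joined data with where t2.col2='AA'.
Python [Code 1] behaves like [MySQL1], and [Code 2] behaves like [MySQL2].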
language | Python | MySQL |
---|---|---|
the code | [Code 1] pd.merge(df1,df2[df2.col2=='AA'],left_on=['col2'],right_on=['col2'],how='left',suffixes=('_left','_right')) [Code 2] df1_1 = pd.merge(df1,df2,left_on=['col2'],right_on=['col2'],how='left',suffixes=('_left','_right')); df1_1[df1_1.col2=='AA'] | [MySQL1] select * from t1 left join t2 on t1.col2=t2.col2 and t2.col2='AA'; [MySQL2] select * from t1 left join t2 on t1.col2=t2.col2 where t2.col2='AA'; |
result |
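The difference between the two orders under left join shows up in the row counts. This is a minimal sketch assuming trimmed copies of df1 and df2 with only the relevant columns.

```python
import pandas as pd

# Trimmed copies of df1 and df2 from the dataset.
df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6],
                    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB']})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3]})

# Like [MySQL1]: filter df2 first, then left join. All six df1 rows survive,
# and the BB rows simply get no partner (col7 is NaN).
code1 = pd.merge(df1, df2[df2.col2 == 'AA'], on='col2', how='left')
print(len(code1), int(code1['col7'].isna().sum()))  # 6 3

# Like [MySQL2]: left join first, then filter the joined rows. Only AA rows remain.
joined = pd.merge(df1, df2, on='col2', how='left')
code2 = joined[joined.col2 == 'AA']
print(len(code2))  # 3
```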
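3. Right join of two tables
right join is close to left join in syntax and meaning. If you swap the positions of the left and right tables, right join can replace left join. The example below has the same effect as the left join "Multiple conditions - multiple keys" example (figure below).
The example also shows some differences between the two languages. The Python result keeps only the right table's keys. If you need the left table's keys, you cannot use them directly and have to rely on other columns of the left table to judge. In MySQL, you can still reference those keys through t2.col2 and t2.col4 (the left table of this merge is df2, which corresponds to t2).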
language | Python | MySQL |
---|---|---|
the code | pd.merge(df2,df1,left_on=['col2','col4'],right_on=['col2','col4'],how='right',suffixes=('_left','_right')) | select * from t1 left join t2 on t1.col2=t2.col2 and t1.col4=t2.col4; |
result | Note: In the screenshot, columns with the same name do not display their data accurately. |
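The claim that swapping the tables turns a left join into a right join can be checked with a small sketch. It assumes trimmed copies of df1 and df2 joined on the single key col2, which is simpler than the two-key example in the table.

```python
import pandas as pd

# Trimmed copies of df1 and df2 from the dataset.
df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6],
                    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB']})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3]})

# df1 is the base table in both calls; only its position and `how` change.
as_left = pd.merge(df1, df2, on='col2', how='left')
as_right = pd.merge(df2, df1, on='col2', how='right')

# The column order differs, so compare the same columns after sorting by col1.
cols = ['col1', 'col2', 'col7']
left_rows = as_left[cols].sort_values('col1').values.tolist()
right_rows = as_right[cols].sort_values('col1').values.tolist()
print(left_rows == right_rows)  # True
```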
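4. Outer join of two tables
outer join is also called full join. MySQL 8 does not support the full join syntax. [MySQL1] in the table below shows the full join form for databases that support it; in MySQL, use [MySQL2] or [MySQL3] instead. (Note: because there are multiple col2 columns, the first is overwritten by the second in the displayed result, so rename one of them to see the difference.)
In Python, outer join has no such restriction and can be done in one step, as in [Code 1] below. The result still differs somewhat from MySQL's, mainly because Python merges the col2 columns of the left and right tables, so the col2 column shows 4 values. Python's [Code 2] and [Code 3] achieve the same effect, with logic similar to [MySQL2] and [MySQL3]. However, because the keys are merged, the other table's key is unavailable, so [Code 3] filters on the nulls of another column from that table instead. That column must have no nulls of its own; otherwise the filter no longer separates matched rows from unmatched ones and the final result is affected.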
language | Python | MySQL |
---|---|---|
the code | [Code 1] pd.merge(df2,df3,left_on=['col2'],right_on=['col2'],how='outer',suffixes=('_left','_right')) [Code 2] (recommended) df2_1 = pd.merge(df2,df3,left_on=['col2'],right_on=['col2'],how='left',suffixes=('_left','_right')); df3_1 = pd.merge(df2,df3,left_on=['col2'],right_on=['col2'],how='right',suffixes=('_left','_right')); pd.concat([df2_1, df3_1]).drop_duplicates().reset_index(drop=True) [Code 3] (the final result is affected if df3_1.col4 itself contains nulls) df2_1 = pd.merge(df2,df3,left_on=['col2'],right_on=['col2'],how='left',suffixes=('_left','_right')); df3_1 = pd.merge(df2,df3,left_on=['col2'],right_on=['col2'],how='right',suffixes=('_left','_right')); pd.concat([df2_1, df3_1[df3_1.col4.isna()]]).reset_index(drop=True) | [MySQL1] select * from t2 full join t3 on t3.col2=t2.col2; [MySQL2] select * from t2 left join t3 on t3.col2=t2.col2 union all select * from t3 left join t2 on t3.col2=t2.col2 where t2.col2 is null; [MySQL3] select * from t2 left join t3 on t3.col2=t2.col2 union all select * from t2 right join t3 on t3.col2=t2.col2 where t2.col2 is null; |
result | Note: Since there are multiple identical columns, the data in the first column will be overwritten by the second column, which can be avoided by aliasing. |
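In fact, the result should look like the figure below. Because of certain limitations, the results returned by MySQL and Python both deviate from it somewhat.
The MySQL code for this figure is as follows: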
select t2.col2 t2_col2,t2.col7,t2.col4,t3.col2 t3_col2,t3.col8
from t2 full join t3 on t3.col2=t2.col2;
-- or
select t2.col2 t2_col2,t2.col7,t2.col4,t3.col2 t3_col2,t3.col8
from t2 left join t3 on t3.col2=t2.col2
union all
select t2.col2 t2_col2,t2.col7,t2.col4,t3.col2 t3_col2,t3.col8
from t3 left join t2 on t3.col2=t2.col2 where t2.col2 is null;
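On the pandas side, a result shaped like the aliased SQL above can be approximated with the indicator parameter of merge(). This is only a sketch: the column names t2_col2 and t3_col2 are borrowed from the SQL aliases, and rebuilding them from _merge is one possible approach, not the article's own method.

```python
import pandas as pd

df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3], 'col4': [5, 6, 7]})
df3 = pd.DataFrame({'col2': ['AA', 'DD', 'CC'], 'col8': [50, 70, 90]})

# Outer join; _merge records which side(s) each merged key came from.
outer = pd.merge(df2, df3, on='col2', how='outer', indicator=True)
print(sorted(outer['col2']))  # ['AA', 'BB', 'CC', 'DD']

# Rebuild one key column per table: blank the key where that table had no row.
outer = outer.assign(
    t2_col2=outer['col2'].where(outer['_merge'] != 'right_only'),
    t3_col2=outer['col2'].where(outer['_merge'] != 'left_only'),
)

# Map each key to its origin for a quick check.
status = dict(zip(outer['col2'], outer['_merge'].astype(str)))
print(status['BB'], status['DD'])  # left_only right_only
```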
5. Multi-table join
When joining multiple tables, the df1.merge(df2) form is recommended because the joins can be chained, as in the following example.
language | Python | MySQL |
---|---|---|
the code | df1.merge(df2,left_on='col2',right_on='col2',how='left').merge(df3,left_on='col2',right_on='col2',how='left') | select * from t1 left join t2 on t2.col2=t1.col2 left join t3 on t3.col2=t1.col2 |
result | Note: Since there are multiple identical columns, the data in the first column will be overwritten by the second column, which can be avoided by aliasing. |
Python merges the foreign keys here as well. Beyond that, note that when non-key columns share a name, Python adds suffixes to them. If such a suffixed column serves as the foreign key of a later join, remember to use the suffixed column name as the key.
language | Python | MySQL |
---|---|---|
the code | df1.merge(df2,left_on='col2',right_on='col2',how='left',suffixes=('_x','_y')).merge(df3,left_on='col4_y',right_on='col8',how='left',suffixes=('_left','_right')) | select * from t1 left join t2 on t2.col2=t1.col2 left join t3 on t3.col8=t2.col4; |
result | Note: Since there are multiple identical columns, the data in the first column will be overwritten by the second column, which can be avoided by aliasing. |
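To see where the suffixed key name comes from, it can help to split the chain into two steps and inspect the columns in between. The sketch below assumes a trimmed, null-free copy of df1 together with df2 and df3 from the dataset.

```python
import pandas as pd

# Trimmed copy of df1, plus df2 and df3 from the dataset.
df1 = pd.DataFrame({'col1': [1, 2, 3, 4],
                    'col2': ['AA', 'AA', 'BB', 'BB'],
                    'col4': [10, 5, 5, 2]})
df2 = pd.DataFrame({'col2': ['AA', 'BB', 'CC'], 'col7': [1, 2, 3], 'col4': [5, 6, 7]})
df3 = pd.DataFrame({'col2': ['AA', 'DD', 'CC'], 'col8': [50, 70, 90]})

# Step 1: col4 exists in both tables, so it is split into col4_x (df1) and col4_y (df2).
step1 = df1.merge(df2, on='col2', how='left', suffixes=('_x', '_y'))
print(step1.columns.tolist())  # ['col1', 'col2', 'col4_x', 'col7', 'col4_y']

# Step 2: t2.col4 is now called col4_y, so that is the name to pass as the key.
step2 = step1.merge(df3, left_on='col4_y', right_on='col8', how='left',
                    suffixes=('_left', '_right'))
print(step2.columns.tolist())
# ['col1', 'col2_left', 'col4_x', 'col7', 'col4_y', 'col2_right', 'col8']
```

In this data no col4_y value equals any col8 value, so col8 and col2_right are empty for every row; the point of the sketch is the column naming, not the matches.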
3. Summary
When Python joins tables, it merges the foreign keys. With inner join this makes no difference, but with left join, right join, or outer join, you need to keep the difference in mind to avoid unnecessary mistakes.