I. Introduction
Environment:
Windows 11 64-bit
Python 3.9
MySQL 8
pandas 1.4.2
This article shows how to reproduce the MySQL window functions row_number(), lead()/lag(), rank()/dense_rank(), first_value(), count(), and sum() with pandas, and how the two approaches differ.
Note: Python is a very flexible language. There may be multiple ways to achieve the same goal. What I provide is only one of the solutions. If you have other methods, please leave a message to discuss.
II. Syntax comparison
Data tables
The data used this time are as follows.
The syntax for constructing this dataset using Python is as follows:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'col1': list(range(1, 7)),
    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'col3': ['X', np.nan, 'Da', 'Xi', 'Xa', 'xa'],
    'col4': [10, 5, 3, 5, 2, None],
    'col5': [90, 60, 60, 80, 50, 50],
    'col6': ['Abc', 'Abc', 'bbb', 'Cac', 'Abc', 'bbb'],
})
df2 = pd.DataFrame({
    'col2': ['AA', 'BB', 'CC'],
    'col7': [1, 2, 3],
    'col4': [5, 6, 7],
})
df3 = pd.DataFrame({
    'col2': ['AA', 'DD', 'CC'],
    'col8': [5, 7, 9],
    'col9': ['abc,bcd,fgh', 'rst,xyy,ijk', 'nml,opq,wer'],
})
Note: Paste the code into a Jupyter cell and run it. The examples below refer to the data directly as df1, df2, and df3.
The syntax for constructing this dataset using MySQL is as follows:
with t1 as(
select 1 as col1, 'AA' as col2, 'X' as col3, 10.0 as col4, 90 as col5, 'Abc' as col6 union all
select 2 as col1, 'AA' as col2, null as col3, 5.0 as col4, 60 as col5, 'Abc' as col6 union all
select 3 as col1, 'AA' as col2, 'Da' as col3, 3.0 as col4, 60 as col5, 'bbb' as col6 union all
select 4 as col1, 'BB' as col2, 'Xi' as col3, 5.0 as col4, 80 as col5, 'Cac' as col6 union all
select 5 as col1, 'BB' as col2, 'Xa' as col3, 2.0 as col4, 50 as col5, 'Abc' as col6 union all
select 6 as col1, 'BB' as col2, 'xa' as col3, null as col4, 50 as col5, 'bbb' as col6
)
,t2 as(
select 'AA' as col2, 1 as col7, 5 as col4 union all
select 'BB' as col2, 2 as col7, 6 as col4 union all
select 'CC' as col2, 3 as col7, 7 as col4
)
,t3 as(
select 'AA' as col2, 5 as col8, 'abc,bcd,fgh' as col9 union all
select 'DD' as col2, 7 as col8, 'rst,xyy,ijk' as col9 union all
select 'CC' as col2, 9 as col8, 'nml,opq,wer' as col9
)
select * from t1;
Note: Paste the code into a MySQL query window and run it. The SQL examples below assume that the dataset definition (lines 1 to 18 of the code above) is prepended, and show only the query statement, such as line 19.
The datasets correspond as follows:
Python dataset | MySQL dataset |
---|---|
df1 | t1 |
df2 | t2 |
df3 | t3 |
row_number()
row_number() numbers the retrieved rows, starting at 1 and incrementing by 1. It usually involves grouping fields and sorting fields, and the row number is unique within each group.
The MySQL row_number() function can be reproduced in Python with groupby() + rank().
groupby(): to group by a single column, pass the column name directly, for example groupby('col2'); to group by several columns, pass a list, for example groupby(['col2','col6']).
rank(): ranks on one column only, for example df.col2.rank(). To sort by several columns, first sort with sort_values(['col6','col5']), then group, and then number the rows with the cumulative counter cumcount() or the ranking function rank().
Also note that when the sorting field contains duplicate values, MySQL returns the tied rows in no guaranteed order, whereas pandas with method='first' numbers them in the order they appear, which is index order by default.
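The tie-breaking rule can be checked with a minimal sketch. The frame `demo` and its columns `grp` and `score` are made up for illustration (they are not part of the article's dataset), and the default integer index is assumed.

```python
import pandas as pd

# Hypothetical toy data: two rows in group 'AA' tie on score 60.
demo = pd.DataFrame({
    'grp': ['AA', 'AA', 'AA'],
    'score': [90, 60, 60],
})

# method='first' gives tied values distinct ranks in the order they appear,
# which mimics row_number() with a deterministic tie-break.
demo['rn'] = (
    demo.groupby('grp')['score']
        .rank(ascending=True, method='first')
        .astype(int)
)
print(demo)
```

On the MySQL side, adding a unique column such as col1 to the order by clause should give the same determinism.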
Specific examples are as follows:
1. Single-column grouping, single-column sorting
When both grouping and sorting use a single column, group on that column with groupby() and rank the sort column with rank() in Python.
Python:
df1_1 = df1.copy()
df1_1['label'] = df1_1.groupby('col2')['col5'].rank(ascending=False, method='first')
df1_1[['col2', 'col5', 'label']]
MySQL:
select col2, col5, row_number() over(partition by col2 order by col5 desc) label from t1;
2. Multi-column grouping, single-column sorting
When grouping by several columns, pass a list to groupby().
Python:
df1_1 = df1.copy()
df1_1['label'] = df1_1.groupby(['col2', 'col6'])['col5'].rank(ascending=True, method='first')
df1_1[['col2', 'col6', 'col5', 'label']]
MySQL:
select col2, col6, col5, row_number() over(partition by col2, col6 order by col5) label from t1;
3. Single-column grouping, multi-column sorting
Multi-column sorting is more involved. [Python1] first sorts with sort_values(), then groups with groupby(), and finally adds the row number with rank(). [Python2] shares the first two steps with [Python1] and uses cumcount() for the numbering in the last step.
[Python1]:
df1_1 = df1.copy()
df1_1['label'] = df1_1.sort_values(['col6', 'col5'], ascending=[False, True]).groupby(['col2'])['col2'].rank(ascending=False, method='first')
df1_1[['col2', 'col6', 'col5', 'label']]
[Python2]:
df1_1 = df1.copy()
df1_1['label'] = df1_1.sort_values(['col6', 'col5'], ascending=[False, True]).groupby(['col2']).cumcount() + 1
df1_1[['col2', 'col6', 'col5', 'label']]
MySQL:
select col2, col6, col5, row_number() over(partition by col2 order by col6 desc, col5) label from t1;
4. Multi-column grouping, multi-column sorting
For multi-column grouping with multi-column sorting, start from "3. Single-column grouping, multi-column sorting" and add the extra grouping fields to the list passed to groupby([]). The details are not repeated here.
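For completeness, the combination can be sketched as follows. This is only one possible form, not the author's code: the frame `df` repeats the relevant df1 columns so the block runs on its own, and the sort columns (col5 descending, then col1) are an arbitrary choice for illustration. It is meant to correspond to row_number() over(partition by col2, col6 order by col5 desc, col1).

```python
import pandas as pd

# Same values as the relevant columns of df1, repeated here for self-containment.
df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5, 6],
    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'col5': [90, 60, 60, 80, 50, 50],
    'col6': ['Abc', 'Abc', 'bbb', 'Cac', 'Abc', 'bbb'],
})

# Sort by several columns, group by several columns, then number the rows.
# The result keeps the original index, so the assignment realigns it with df.
df['label'] = (
    df.sort_values(['col5', 'col1'], ascending=[False, True])
      .groupby(['col2', 'col6'])
      .cumcount() + 1
)
print(df[['col2', 'col6', 'col5', 'label']])
```

Only the group (AA, Abc) holds more than one row here, so it is the only group where a label of 2 appears.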
lead()/lag()
lead() takes the column value from a row after the current row, which can also be seen as shifting the column up; lag() does the opposite and takes the column value from a row before the current row, which can be seen as shifting the column down.
Combined with sorting, the two are interchangeable:
- ascending lead() == descending lag()
- descending lead() == ascending lag()
In Python, shift() moves column values up or down: a positive argument moves the values down, and a negative argument moves them up.
Note: for single-column/multi-column grouping and single-column/multi-column sorting, refer to the row_number() section; it is not repeated here.
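The interchangeability of lead() and lag() under reversed sorting can be illustrated with a short sketch. The frame `df` repeats col1 and col2 of df1 so the block is self-contained; the variable names are chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5, 6],
    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
})

# Like lead(col1) over(partition by col2 order by col1):
# the next value when rows are in ascending order.
asc = df.sort_values('col1', ascending=True)
lead_asc = asc.groupby('col2')['col1'].shift(-1)

# Like lag(col1) over(partition by col2 order by col1 desc):
# the previous value when rows are in descending order.
desc = df.sort_values('col1', ascending=False)
lag_desc = desc.groupby('col2')['col1'].shift(1)

# Both Series keep the original index, so they can be compared row by row.
same = lead_asc.sort_index().equals(lag_desc.sort_index())
print(same)
```

Rows without a following value in their group come back as NaN, which plays the role of NULL in the MySQL result.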
1. Moving by 1 row
To move by one row in MySQL, use lead(col1) / lag(col1) directly; lead(col1,1) / lag(col1,1) work just as well. Combine them with ascending or descending order to move the column values up or down. In Python, shift(-1) or shift(1) achieves the same effect. The following example moves col1 up, so it uses shift(-1).
Python:
df1_1 = df1.copy()
df1_1['col1_2'] = df1_1.groupby(['col2']).col1.shift(-1)
df1_1[['col2', 'col1', 'col1_2']].sort_values(['col2', 'col1'], ascending=[True, True])
[MySQL1]:
select col2, col1, lead(col1) over(partition by col2 order by col1) col1_2 from t1;
[MySQL2]:
select col2, col1, lag(col1) over(partition by col2 order by col1 desc) col1_2 from t1;
2. Moving by multiple rows
To move by several rows, MySQL needs the number of rows as an argument. The following example moves by 2 rows, using lead(col1,2) or lag(col1,2) combined with ascending or descending order to move the column values up or down.
In Python, just change the argument passed to shift(); the following example uses shift(2) to move the values down by 2 rows.
Python:
df1_1 = df1.copy()
df1_1['col1_2'] = df1_1.groupby(['col2']).col1.shift(2)  # the shift argument controls the offset
df1_1[['col2', 'col1', 'col1_2']].sort_values(['col2', 'col1'], ascending=[True, True])
[MySQL1]:
select col2, col1, lead(col1,2) over(partition by col2 order by col1 desc) col1_2 from t1;
[MySQL2]:
select col2, col1, lag(col1,2) over(partition by col2 order by col1) col1_2 from t1;
rank()/dense_rank()
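rank() and dense_rank() compute rankings. rank() can leave gaps: tied values share the smallest rank of the tie, and the ranks after them skip ahead by the number of tied values. For example, the values (10, 20, 20, 30) rank in ascending order as (1, 2, 2, 4). dense_rank() produces consecutive ranks; for the same example the ascending ranks are (1, 2, 2, 3).
In Python, ranking is also done with rank(), but with a different method than the one used for row_number(). Use method='min' to reproduce rank() and method='dense' to reproduce dense_rank(). Besides these two and the method='first' used for row_number(), there are also average and max. average first ranks all values consecutively without ties and then gives each tied value the mean of the ranks in its tie; for the same example the ascending ranks are (1, 2.5, 2.5, 4). max is the opposite of min: tied values take the largest rank of the tie, so the same example gives (1, 3, 3, 4).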
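The five methods can be compared side by side on the example values (10, 20, 20, 30). This is a small illustrative sketch; the names `s` and `ranks` are chosen only for this block.

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30])

# Rank the same values with each tie-handling method (ascending order).
ranks = {
    method: s.rank(method=method).tolist()
    for method in ['min', 'dense', 'average', 'max', 'first']
}
for method, values in ranks.items():
    print(method, values)
# min     -> [1.0, 2.0, 2.0, 4.0]   (like rank())
# dense   -> [1.0, 2.0, 2.0, 3.0]   (like dense_rank())
# average -> [1.0, 2.5, 2.5, 4.0]
# max     -> [1.0, 3.0, 3.0, 4.0]
# first   -> [1.0, 2.0, 3.0, 4.0]   (like row_number())
```

rank() returns floats because the average method can produce fractional ranks; append .astype(int) if integer labels are preferred and the column has no missing values.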
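As before, when the sorting field contains duplicate values, MySQL returns the tied rows in no guaranteed order, while Python falls back on the index for further ordering by default.
Note: for single-column/multi-column grouping and single-column/multi-column sorting, refer to the row_number() section; it is not repeated here.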
1. rank()
In Python, rank(method='min') reproduces the MySQL rank() window function.
Python:
df1_1 = df1.copy()
df1_1['label'] = df1_1.groupby(['col2'])['col5'].rank(ascending=True, method='min')
df1_1[['col2', 'col5', 'label']]
MySQL:
select col2, col5, rank() over(partition by col2 order by col5) label from t1;
2. dense_rank()
In Python, rank(method='dense') reproduces the MySQL dense_rank() window function.
Python:
df1_1 = df1.copy()
df1_1['label'] = df1_1.groupby(['col2'])['col5'].rank(ascending=True, method='dense')
df1_1[['col2', 'col5', 'label']]
MySQL:
select col2, col5, dense_rank() over(partition by col2 order by col5) label from t1;
first_value()
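The MySQL window function first_value() takes the first value. It can return the first value in the default order of the data, or, combined with sorting, the maximum or minimum of a column.
pandas has a function with the same purpose, first().
However, first_value() is a window function and leaves the other rows and fields of the table untouched, whereas first() is an ordinary aggregation that returns only one row per group. To get the same result as the first_value() window function in Python, the result of first() therefore has to be joined back to the original table (see the examples below). Python also has a last() function, the opposite of first(); combined with sorting it achieves the same effect and is interchangeable with first(). Readers can test this themselves; it is not covered here.
Note: for single-column/multi-column grouping and single-column/multi-column sorting, refer to the row_number() section; it is not repeated here.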
1. Taking the maximum
In MySQL, sorting col5 in descending order lets first_value() return the maximum. Likewise in Python, sort col5 in descending order with sort_values(), take the maximum with first(), and then merge() the result back into the original table.
Python:
df1_1 = df1.copy()
df1_2 = df1_1.sort_values(['col5'], ascending=[False]).groupby(['col2']).first().reset_index()[['col2', 'col5']]  # best to add a sort
df1[['col2', 'col5']].merge(df1_2, on='col2', how='left', suffixes=('', '_2'))
MySQL:
select col2, col5, first_value(col5) over(partition by col2 order by col5 desc) col5_2 from t1;
2. Taking the minimum
To take the minimum, start from the maximum example and change the sort order of col5 from descending to ascending.
Python:
df1_1 = df1.copy()
df1_2 = df1_1.sort_values(['col5'], ascending=[True]).groupby(['col2']).first().reset_index()[['col2', 'col5']]
df1[['col2', 'col5']].merge(df1_2, on='col2', how='left', suffixes=('', '_2'))
MySQL:
select col2, col5, first_value(col5) over(partition by col2 order by col5) col5_2 from t1;
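As an alternative to first() followed by merge(), the two examples above can probably be written in one step with groupby().transform('first'), which broadcasts the group's first value back onto every row. This is a sketch of a different technique than the one the article uses; the frame `df` repeats col2 and col5 of df1, and the output column names col5_max and col5_min are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'col5': [90, 60, 60, 80, 50, 50],
})

# Like first_value(col5) over(partition by col2 order by col5 desc)
df['col5_max'] = (
    df.sort_values('col5', ascending=False)
      .groupby('col2')['col5']
      .transform('first')
)

# Like first_value(col5) over(partition by col2 order by col5)
df['col5_min'] = (
    df.sort_values('col5', ascending=True)
      .groupby('col2')['col5']
      .transform('first')
)
print(df)
```

One caveat: first() in pandas picks the first non-null value of the group, so the result can differ from first_value() when the leading value in the sort order is NULL.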
count()/sum()
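MySQL aggregate functions such as count() and sum() can also act as window functions when combined with over().
count() can count the rows in each group, or keep a running count of a column's values within the group. sum() can total a column within each group, or keep a running total of a column's values within the group.
In Python, the running count and running total can be done with groupby() + cumcount() and groupby() + cumsum() (examples 1 and 2 below), while the per-group count and sum can be done with groupby() + count() and groupby() + sum() (examples 3 and 4 below).
Note: for single-column/multi-column grouping and single-column/multi-column sorting, refer to the row_number() section; it is not repeated here.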
1. Running count in ascending order
In Python, sort_values() + groupby() + cumcount() reproduces MySQL's count(<col_name>) over(partition by <col_name> order by <col_name>).
Python:
df1_1 = df1.copy()
df1_1['col5_2'] = df1_1.sort_values(['col5', 'col1'], ascending=[True, False]).groupby('col2').col5.cumcount() + 1
df1_1[['col2', 'col5', 'col5_2']]
MySQL:
select col2, col5, count(col5) over(partition by col2 order by col5, col1 desc) col5_2 from t1;
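The cumcount() approach matches count(col5) here because col5 has no missing values. MySQL's count(<col_name>) counts only non-NULL values, while cumcount() numbers every row, so for a column with NULLs, such as col4, the two diverge. The sketch below shows one possible way to count only non-null values; the frame `df` repeats three columns of df1, and the output column names row_count and col4_count are made up for illustration.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5, 6],
    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'col4': [10, 5, 3, 5, 2, np.nan],
})

ordered = df.sort_values(['col2', 'col1'])

# cumcount() numbers every row in the group, null or not.
df['row_count'] = ordered.groupby('col2').cumcount() + 1

# Running count of non-null col4 values only, closer to
# count(col4) over(partition by col2 order by col1).
df['col4_count'] = (
    ordered['col4'].notna().astype(int)
           .groupby(ordered['col2'])
           .cumsum()
)
print(df)
```

The last row of group BB has a missing col4, so its row_count is 3 while its col4_count stays at 2.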
2. Running total in ascending order
In Python, sort_values() + groupby() + cumsum() reproduces MySQL's sum(<col_name>) over(partition by <col_name> order by <col_name>).
Python:
df1_1 = df1.copy()
df1_1['col5_2'] = df1_1.sort_values(['col5', 'col1'], ascending=[True, False]).groupby('col2').col5.cumsum()
df1_1[['col2', 'col5', 'col5_2']]
MySQL:
select col2, col5, sum(col5) over(partition by col2 order by col5, col1 desc) col5_2 from t1;
3. Count per group
In Python, sort_values() + groupby() + count() reproduces MySQL's count(<col_name>) over(partition by <col_name>).
Python:
df1_1 = df1.copy()
df1_2 = df1_1.sort_values(['col5', 'col1'], ascending=[True, False]).groupby('col2').col5.count().reset_index()
df1_1[['col2', 'col5']].merge(df1_2, how='left', on='col2', suffixes=('', '_2'))
MySQL:
select col2, col5, count(col5) over(partition by col2) col5_2 from t1;
4. Sum per group
In Python, sort_values() + groupby() + sum() reproduces MySQL's sum(<col_name>) over(partition by <col_name>).
Python:
df1_1 = df1.copy()
df1_2 = df1_1.sort_values(['col5', 'col1'], ascending=[True, False]).groupby('col2').col5.sum().reset_index()
df1_1[['col2', 'col5']].merge(df1_2, how='left', on='col2', suffixes=('', '_2'))
MySQL:
select col2, col5, sum(col5) over(partition by col2) col5_2 from t1;
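The per-group count and sum can also be attached to every row without a merge, using groupby().transform(). This is a sketch of an alternative to the merge-based code above, not the article's method; the frame `df` repeats col2 and col5 of df1, and the column names col5_cnt and col5_sum are made up for illustration. Since no ordering is involved, sort_values() is not needed here.

```python
import pandas as pd

df = pd.DataFrame({
    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'col5': [90, 60, 60, 80, 50, 50],
})

grouped = df.groupby('col2')['col5']

# transform() returns one value per original row, like a window function.
df['col5_cnt'] = grouped.transform('count')  # count(col5) over(partition by col2)
df['col5_sum'] = grouped.transform('sum')    # sum(col5) over(partition by col2)
print(df)
```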
III. Summary
Reproducing a MySQL window function in Python generally takes several steps and a combination of functions. Window functions involve grouping fields and sorting fields, which correspond to groupby() and sort_values() in Python, so these two functions are needed in nearly every case. The remaining step depends on the particular window function; the correspondence is as follows:
MySQL window functions | Python counterpart function |
---|---|
row_number() | rank() |
lead()/lag() | shift() |
rank()/dense_rank() | rank() |
first_value() | first() |
count() | count(), cumcount() |
sum() | sum(), cumsum() |