Python vs. MySQL (5): Reproducing MySQL window functions with pandas

I. Introduction

Environment:
Windows 11 64-bit
Python 3.9
MySQL 8
pandas 1.4.2

This article shows how to use pandas to reproduce the MySQL window functions row_number(), lead()/lag(), rank()/dense_rank(), first_value(), count(), and sum(), and points out where the two differ.

Note: Python is a flexible language, and there is usually more than one way to reach the same result. The approaches shown here are only one option; if you know others, feel free to leave a comment.

II. Syntax comparison

Dataset

The data used in this article is shown below.
The Python code for constructing the dataset is as follows:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'col1': list(range(1, 7)),
    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'col3': ['X', np.nan, 'Da', 'Xi', 'Xa', 'xa'],
    'col4': [10, 5, 3, 5, 2, None],
    'col5': [90, 60, 60, 80, 50, 50],
    'col6': ['Abc', 'Abc', 'bbb', 'Cac', 'Abc', 'bbb'],
})
df2 = pd.DataFrame({
    'col2': ['AA', 'BB', 'CC'],
    'col7': [1, 2, 3],
    'col4': [5, 6, 7],
})
df3 = pd.DataFrame({
    'col2': ['AA', 'DD', 'CC'],
    'col8': [5, 7, 9],
    'col9': ['abc,bcd,fgh', 'rst,xyy,ijk', 'nml,opq,wer'],
})

Note: Paste the code into a Jupyter cell and run it. The rest of the article uses df1, df2, and df3 directly to refer to the corresponding data.

The syntax for constructing this dataset using MySQL is as follows:

with t1 as(
  select  1 as col1, 'AA' as col2, 'X' as col3, 10.0 as col4, 90 as col5, 'Abc' as col6 union all
  select  2 as col1, 'AA' as col2, null as col3, 5.0 as col4, 60 as col5, 'Abc' as col6 union all
  select  3 as col1, 'AA' as col2, 'Da' as col3, 3.0 as col4, 60 as col5, 'bbb' as col6 union all
  select  4 as col1, 'BB' as col2, 'Xi' as col3, 5.0 as col4, 80 as col5, 'Cac' as col6 union all
  select  5 as col1, 'BB' as col2, 'Xa' as col3, 2.0 as col4, 50 as col5, 'Abc' as col6 union all
  select  6 as col1, 'BB' as col2, 'xa' as col3, null as col4, 50 as col5, 'bbb' as col6 
)
,t2 as(
  select  'AA' as col2, 1 as col7, 5 as col4 union all
  select  'BB' as col2, 2 as col7, 6 as col4 union all
  select  'CC' as col2, 3 as col7, 7 as col4 
)
,t3 as(
  select  'AA' as col2, 5 as col8, 'abc,bcd,fgh' as col9 union all
  select  'DD' as col2, 7 as col8, 'rst,xyy,ijk' as col9 union all
  select  'CC' as col2, 9 as col8, 'nml,opq,wer' as col9 
)
select * from t1;

Note: Paste the code into a MySQL query window and run it. In the SQL examples that follow, the dataset definition (lines 1 to 18 of the code above) is assumed to be included, and only the query statement, such as line 19, is shown.

The correspondence between the datasets is as follows:

Python dataset    MySQL dataset
df1               t1
df2               t2
df3               t3

row_number()

row_number() returns the sequence number of each retrieved row, starting at 1 and increasing by 1. It is normally used with a grouping field and a sorting field, and the number is unique within each group.
The effect of MySQL's row_number() can be reproduced in Python with groupby() + rank().

  • groupby(): to group by a single column, pass the column name directly, for example groupby('col2'); to group by several columns, pass a list, for example groupby(['col2','col6']).
  • rank(): ranks only one column, for example df.col2.rank(). To sort by several columns, first sort with sort_values(['col6','col5']), then group, and finally number the rows with the cumulative-count function cumcount() or with rank().

Note also that when the sorting field contains duplicate values, MySQL returns the tied rows in no guaranteed order, whereas in Python rank(method='first') numbers ties in the order in which they appear in the data, which is the index order for a default index.
Specific examples are as follows:

1. Single-column grouping and single-column sorting
When both the grouping and the sorting use a single column, use groupby() on one column and rank() on one column in Python.

Python:
df1_1 = df1.copy()
df1_1['label'] = df1_1.groupby('col2')['col5'].rank(ascending=False,method='first')
df1_1[['col2','col5','label']]

MySQL:
select col2,col5,row_number()over(partition by col2 order by col5 desc) label from t1;
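Because rank() returns floating-point values, it can help to cast the result when integer row numbers are wanted. The following is a minimal sketch on a three-column subset of df1; the name df and the cast to int are illustrative choices, not part of the original example.

```python
import pandas as pd

# Subset of df1 with only the columns needed for this example
df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5, 6],
    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'col5': [90, 60, 60, 80, 50, 50],
})

# Equivalent in spirit to:
#   row_number() over(partition by col2 order by col5 desc)
# method='first' gives every row in a group a distinct number;
# rank() returns floats, so cast to int for row-number style output.
df['label'] = (
    df.groupby('col2')['col5']
      .rank(ascending=False, method='first')
      .astype(int)
)
print(df[['col2', 'col5', 'label']])
```

In group AA the row with col5 = 90 receives label 1, and the two rows with col5 = 60 take the remaining numbers 2 and 3.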

2. Multi-column grouping, single-column sorting
When grouping by several columns, pass a list to groupby().

Python:
df1_1 = df1.copy()
df1_1['label'] = df1_1.groupby(['col2','col6'])['col5'].rank(ascending=True,method='first')
df1_1[['col2','col6','col5','label']]

MySQL:
select col2,col6,col5,row_number()over(partition by col2,col6 order by col5) label from t1;

3. Single-column grouping, multi-column sorting
Sorting by several columns is a little more involved. [Python1] first sorts with sort_values(), then groups with groupby(), and finally uses rank() to add the row number. [Python2] shares the first two steps with [Python1] but uses cumcount() in the last step. In [Python1] the ranked column is the grouping column itself, so every value in a group ties and method='first' simply numbers the rows in their sorted order.

[Python1]
df1_1 = df1.copy()
df1_1['label'] = df1_1.sort_values(['col6','col5'],ascending=[False,True]).groupby(['col2'])['col2'].rank(ascending=False,method='first')
df1_1[['col2','col6','col5','label']]

[Python2]
df1_1 = df1.copy()
df1_1['label'] = df1_1.sort_values(['col6','col5'],ascending=[False,True]).groupby(['col2']).cumcount()+1
df1_1[['col2','col6','col5','label']]

MySQL:
select col2,col6,col5,row_number()over(partition by col2 order by col6 desc,col5) label from t1;

4. Multi-column grouping and multi-column sorting
For multi-column grouping with multi-column sorting, start from [3. Single-column grouping, multi-column sorting] and add the extra grouping fields to the list passed to groupby([]). The details are not repeated here.
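As a minimal sketch of that combination, the snippet below groups by col2 and col6 and sorts by col5 ascending, then col1 descending. The choice of columns and sort directions is an assumption made for illustration; it corresponds to row_number() over(partition by col2,col6 order by col5,col1 desc).

```python
import pandas as pd

# Subset of df1 with only the columns needed for this example
df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5, 6],
    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'col5': [90, 60, 60, 80, 50, 50],
    'col6': ['Abc', 'Abc', 'bbb', 'Cac', 'Abc', 'bbb'],
})

# 1) sort by the ordering columns, 2) group by the partition columns,
# 3) number the rows inside each group; the assignment aligns on the index,
#    so the labels land on the original rows.
df['label'] = (
    df.sort_values(['col5', 'col1'], ascending=[True, False])
      .groupby(['col2', 'col6'])
      .cumcount() + 1
)
print(df[['col2', 'col6', 'col5', 'col1', 'label']])
# Only the group (AA, Abc) has two rows: col5=60 gets 1 and col5=90 gets 2.
```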

lead()/lag()

lead() takes the value of a column from a row after the current row, which can also be seen as shifting the column up; lag() does the opposite and takes the value from a row before the current row, which can be seen as shifting the column down.
Combined with the sort direction, the two are interchangeable:

  • lead() in ascending order == lag() in descending order
  • lead() in descending order == lag() in ascending order

In Python, the shift() function moves column values up or down: a positive argument moves the values down, and a negative argument moves them up.
Note: For single-column/multi-column grouping and single-column/multi-column sorting, refer to the row_number() section; the details are not repeated here.
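The interchangeability of lead() and lag() can be checked directly with shift(). The following is a minimal sketch on two columns of df1; the column names lead_1, lag_1, and lag_desc are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5, 6],
    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
})

asc = df.sort_values('col1').groupby('col2')['col1']
df['lead_1'] = asc.shift(-1)  # like lead(col1) over(partition by col2 order by col1)
df['lag_1'] = asc.shift(1)    # like lag(col1)  over(partition by col2 order by col1)

# lag() over a descending order returns the same values as lead() over an ascending one
desc = df.sort_values('col1', ascending=False).groupby('col2')['col1']
df['lag_desc'] = desc.shift(1)

print(df)
# lead_1 and lag_desc: 2, 3, NaN, 5, 6, NaN
# lag_1:               NaN, 1, 2, NaN, 4, 5
```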

1. Moving by 1 row
To move by 1 row in MySQL, use lead(col1)/lag(col1) directly; lead(col1,1)/lag(col1,1) works as well. Combine either with ascending or descending order to move the column values up or down. In Python, use shift(-1) or shift(1) for the same effect. The following example moves col1 up, so it uses shift(-1).

Python:
df1_1 = df1.copy()
df1_1['col1_2'] = df1_1.groupby(['col2']).col1.shift(-1)
df1_1[['col2','col1','col1_2']].sort_values(['col2','col1'],ascending=[True,True])

[MySQL1]
select col2,col1,lead(col1)over(partition by col2 order by col1) col1_2 from t1;

[MySQL2]
select col2,col1,lag(col1)over(partition by col2 order by col1 desc) col1_2 from t1;

2. Moving by multiple rows
To move by more than one row in MySQL, specify the number of rows. The following example moves 2 rows with lead(col1,2) or lag(col1,2), combined with ascending or descending order to move the column values up or down.
In Python, just change the argument passed to shift(); the following example uses shift(2) to move the values down 2 rows.

Python:
df1_1 = df1.copy()
df1_1['col1_2'] = df1_1.groupby(['col2']).col1.shift(2)  # the number of rows is controlled by shift()
df1_1[['col2','col1','col1_2']].sort_values(['col2','col1'],ascending=[True,True])

[MySQL1]
select col2,col1,lead(col1,2)over(partition by col2 order by col1 desc) col1_2 from t1;

[MySQL2]
select col2,col1,lag(col1,2)over(partition by col2 order by col1) col1_2 from t1;
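MySQL's lead()/lag() also accept an optional third argument that supplies a default when no row exists at the requested offset, for example lag(col1,2,0). A rough pandas counterpart is the fill_value parameter of shift(); the sketch below assumes a default of 0 and uses illustrative column names.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5, 6],
    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
})

grouped = df.sort_values('col1').groupby('col2')['col1']

# like lag(col1,2): missing where the group has no row two positions back
df['lag_2'] = grouped.shift(2)

# like lag(col1,2,0): the gap is filled with the default instead
df['lag_2_default'] = grouped.shift(2, fill_value=0)

print(df)
# lag_2:         NaN, NaN, 1, NaN, NaN, 4
# lag_2_default: 0, 0, 1, 0, 0, 4
```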

rank()/dense_rank()

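rank() and dense_rank() compute rankings. rank() may leave gaps: tied values share the smallest rank, and the ranks after the tie are pushed back by the number of tied values. For example, the values (10,20,20,30) rank as (1,2,2,4) in ascending order. dense_rank() leaves no gaps: the same values rank as (1,2,2,3).
In Python, ranking is also done with rank(); only the method argument differs from the one used for row_number(). Use method='min' for the effect of rank() and method='dense' for the effect of dense_rank(). Besides these two and the method='first' used for row_number(), there are also average and max. average gives tied values the mean of the consecutive ranks they would otherwise occupy, so the example above ranks as (1,2.5,2.5,4) in ascending order. max is the opposite of min and gives tied values the largest of those ranks, so the example ranks as (1,3,3,4).
Likewise, when the sorting field contains duplicate values, MySQL returns the tied rows in no guaranteed order, while in Python the rows keep their index order.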
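The five method values can be compared side by side on the example values (10,20,20,30). This is a small sketch; the names s, methods, and ranks are chosen only for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30])
methods = ['first', 'min', 'dense', 'average', 'max']

# One column per tie-breaking method, ascending order
ranks = pd.DataFrame({m: s.rank(method=m) for m in methods})
print(ranks)
# first   -> 1, 2, 3, 4        (row_number)
# min     -> 1, 2, 2, 4        (rank)
# dense   -> 1, 2, 2, 3        (dense_rank)
# average -> 1, 2.5, 2.5, 4
# max     -> 1, 3, 3, 4
```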

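Note: For single-column/multi-column grouping and single-column/multi-column sorting, refer to the row_number() section; the details are not repeated here.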
1. rank()
Use rank(method='min') in Python to reproduce MySQL's rank() window function.

Python:
df1_1 = df1.copy()
df1_1['label'] = df1_1.groupby(['col2'])['col5'].rank(ascending=True,method='min')
df1_1[['col2','col5','label']]

MySQL:
select col2,col5,rank()over(partition by col2 order by col5) label from t1;

2. dense_rank()
Use rank(method='dense') in Python to reproduce MySQL's dense_rank() window function.

Python:
df1_1 = df1.copy()
df1_1['label'] = df1_1.groupby(['col2'])['col5'].rank(ascending=True,method='dense')
df1_1[['col2','col5','label']]

MySQL:
select col2,col5,dense_rank()over(partition by col2 order by col5) label from t1;

first_value()

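MySQL's window function first_value() takes the first value. It can return the first value in the default order of the data or, combined with sorting, the maximum or minimum of a column.
pandas has a function with the same purpose, first().
However, first_value() is a window function and leaves the other fields of the table untouched, whereas first() is an ordinary aggregate function that returns only the row holding the first value of each group. To get the same result as the first_value() window function in Python, the result of first() therefore has to be joined back to the original table (see the examples below). Python also has a last() function, the opposite of first(); combined with sorting it achieves the same effect and is interchangeable with first(). Readers can try it themselves; it is not covered further here.

Note: For single-column/multi-column grouping and single-column/multi-column sorting, refer to the row_number() section; the details are not repeated here.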
1. Taking the maximum
In MySQL, sort col5 in descending order and first_value() returns the maximum. Likewise in Python, sort col5 in descending order with sort_values(), take the maximum with first(), and then merge() it back into the original table.

Python:
df1_1 = df1.copy()
df1_2 = df1_1.sort_values(['col5'],ascending=[False]).groupby(['col2']).first().reset_index()[['col2','col5']]  # sorting explicitly first is recommended
df1[['col2','col5']].merge(df1_2,on = 'col2',how = 'left',suffixes=('','_2'))

MySQL:
select col2,col5,first_value(col5)over(partition by col2 order by col5 desc) col5_2 from t1;

2. Taking the minimum
To take the minimum, start from the maximum example and change the sort order of col5 from descending to ascending.

Python:
df1_1 = df1.copy()
df1_2 = df1_1.sort_values(['col5'],ascending=[True]).groupby(['col2']).first().reset_index()[['col2','col5']]
df1[['col2','col5']].merge(df1_2,on = 'col2',how = 'left',suffixes=('','_2'))

MySQL:
select col2,col5,first_value(col5)over(partition by col2 order by col5) col5_2 from t1;
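As an alternative to first() followed by merge(), groupby().transform('first') broadcasts the first value of each group back to every row in one step. This is a different technique from the one used in the examples above, sketched here on a subset of df1 with illustrative column names.

```python
import pandas as pd

df = pd.DataFrame({
    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'col5': [90, 60, 60, 80, 50, 50],
})

# like first_value(col5) over(partition by col2 order by col5 desc)
df['col5_max'] = (
    df.sort_values('col5', ascending=False)
      .groupby('col2')['col5']
      .transform('first')
)

# like first_value(col5) over(partition by col2 order by col5)
df['col5_min'] = (
    df.sort_values('col5', ascending=True)
      .groupby('col2')['col5']
      .transform('first')
)

print(df)
# col5_max: 90 for every AA row, 80 for every BB row
# col5_min: 60 for every AA row, 50 for every BB row
```

The result of transform() keeps the index of the sorted frame, so assigning it to a column places each value on the original row without a join.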

count()/sum()

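MySQL aggregate functions such as count() and sum() can also be combined with over() to act as window functions.

  • count() can count the rows in each group, or keep a running count of a column's values within the group.
  • sum() can total a column's values in each group, or keep a running sum of a column's values within the group.

In Python, the running count and running sum can be implemented with groupby()+cumcount() and groupby()+cumsum() (examples 1 and 2 below), while the per-group count and sum can be implemented with groupby()+count() and groupby()+sum() (examples 3 and 4 below).

Note: For single-column/multi-column grouping and single-column/multi-column sorting, refer to the row_number() section; the details are not repeated here.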
1. Running count in ascending order
Use sort_values()+groupby()+cumcount() in Python to reproduce MySQL's count(<col_name>)over(partition by <col_name> order by <col_name>).

Python:
df1_1 = df1.copy()
df1_1['col5_2'] = df1_1.sort_values(['col5','col1'],ascending=[True,True]).groupby('col2').col5.cumcount()+1
df1_1[['col2','col5','col5_2']]

MySQL:
select col2,col5,count(col5)over(partition by col2 order by col5,col1) col5_2 from t1;

2. Running sum in ascending order
Use sort_values()+groupby()+cumsum() in Python to reproduce MySQL's sum(<col_name>)over(partition by <col_name> order by <col_name>).

Python:
df1_1 = df1.copy()
df1_1['col5_2'] = df1_1.sort_values(['col5','col1'],ascending=[True,True]).groupby('col2').col5.cumsum()
df1_1[['col2','col5','col5_2']]

MySQL:
select col2,col5,sum(col5)over(partition by col2 order by col5,col1) col5_2 from t1;
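One difference shows up when the counted or summed column contains nulls, as col4 does in df1. MySQL's count(<col_name>) skips NULL values and sum() ignores them, while cumcount() numbers every row and cumsum() leaves NaN at the null position. The sketch below is one possible way to mirror the MySQL behaviour; the column names row_cnt, col4_cnt, and col4_sum are illustrative, and filling nulls with 0 is an assumption that only holds when a group has at least one non-null value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5, 6],
    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'col4': [10, 5, 3, 5, 2, np.nan],
})

ordered = df.sort_values('col1')

# cumcount() numbers rows, so the row with a null col4 is still counted
df['row_cnt'] = ordered.groupby('col2').cumcount() + 1

# running count of non-null values,
# like count(col4) over(partition by col2 order by col1)
df['col4_cnt'] = ordered['col4'].notna().astype(int).groupby(ordered['col2']).cumsum()

# running sum that treats null as 0,
# like sum(col4) over(partition by col2 order by col1)
df['col4_sum'] = ordered['col4'].fillna(0).groupby(ordered['col2']).cumsum()

print(df)
# row_cnt:  1, 2, 3, 1, 2, 3
# col4_cnt: 1, 2, 3, 1, 2, 2
# col4_sum: 10, 15, 18, 5, 7, 7
```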

3. Count per group
Use sort_values()+groupby()+count() in Python to reproduce MySQL's count(<col_name>)over(partition by <col_name>) (the sort does not affect the aggregated result).

Python:
df1_1 = df1.copy()
df1_2 = df1_1.sort_values(['col5','col1'],ascending=[True,False]).groupby('col2').col5.count().reset_index()
df1_1[['col2','col5']].merge(df1_2,how='left',on='col2',suffixes=('','_2'))

MySQL:
select col2,col5,count(col5)over(partition by col2) col5_2 from t1;

4. Sum per group
Use sort_values()+groupby()+sum() in Python to reproduce MySQL's sum(<col_name>)over(partition by <col_name>) (the sort does not affect the aggregated result).

Python:
df1_1 = df1.copy()
df1_2 = df1_1.sort_values(['col5','col1'],ascending=[True,False]).groupby('col2').col5.sum().reset_index()
df1_1[['col2','col5']].merge(df1_2,how='left',on='col2',suffixes=('','_2'))

MySQL:
select col2,col5,sum(col5)over(partition by col2) col5_2 from t1;
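The per-group count and sum can also be attached to every row without a merge by using groupby().transform(). This is an alternative to the aggregate-then-merge approach, sketched on a subset of df1 with illustrative column names.

```python
import pandas as pd

df = pd.DataFrame({
    'col2': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'col5': [90, 60, 60, 80, 50, 50],
})

grouped = df.groupby('col2')['col5']

# like count(col5) over(partition by col2)
df['col5_cnt'] = grouped.transform('count')

# like sum(col5) over(partition by col2)
df['col5_sum'] = grouped.transform('sum')

print(df)
# col5_cnt: 3 for every row
# col5_sum: 210 for every AA row, 180 for every BB row
```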

III. Summary

Reproducing a MySQL window function in Python usually takes several steps and a combination of functions. Window functions involve a grouping field and a sorting field, which correspond to groupby() and sort_values() in Python, so these two functions appear in nearly every case. The remaining step depends on the specific window function; the correspondence is as follows:

MySQL window function    Python counterpart
row_number()             rank()
lead()/lag()             shift()
rank()/dense_rank()      rank()
first_value()            first()
count()                  count(), cumcount()
sum()                    sum(), cumsum()

Origin blog.csdn.net/qq_45476428/article/details/128731019