SQL post-calculation script

In most cases, we can use SQL (stored procedures) to complete database calculations, but if we encounter some complex operations that SQL is not good at, we can only use other programming languages ​​to read the data out of the database, and then complete it outside the database Calculations, such programming languages ​​often appear in the form of simple scripts, which we call SQL post-calculation scripts here.

The operations that SQL is not good at mainly include complex set calculations, ordered calculations, associative calculations, and multi-step calculations. SQL collection is not thorough enough, and there is no explicit collection data type, which makes it difficult to reuse the collection generated in the calculation process. For example, after grouping, it must be mandatory to summarize, and the subset after grouping cannot be calculated; SQL is designed based on the theory of unordered collections , It is very troublesome to deal with ordered operations such as inter-row groups and rankings. JOINs or sub-queries are often used to temporarily generate serial numbers, which is not only difficult to write but also very inefficient. SQL does not yet support record references. You can only use sub-queries or JOIN statements to describe the relationship. Once you encounter more hierarchical or self-associated situations, the code will be extremely complicated; SQL itself does not advocate multi-step code, which often forces the program The staff wrote many levels of nested long sentences, although the use of stored procedures can solve this problem to a certain extent, but sometimes the actual environment does not allow us to use stored procedures, such as DBA strictly control the permissions of stored procedures, old databases and small databases do not support storage Processes, etc., and the debugging of stored procedures is also very inconvenient, not very suitable for writing calculations with procedures.

In addition to the above complex calculations, there are also some cases where SQL post-calculation scripts are used. For example, calculation logic needs to be migrated between different types of databases, involving non-relational databases; the input source or output target is not only a database, but Excel, text and other files; it may also perform mixed calculations among multiple databases. These will involve calculations outside the library, using SQL post-calculation scripts.

For SQL post-calculation scripts, the most important function is of course to implement complex operations that SQL is not good at. In addition, it is best to have some more advanced features, such as computing files, non-relational databases and other diverse data, being able to handle large amounts of data, and computing performance should not be too slow. Of course, the most basic thing is to conveniently support reading and writing databases, so that post-calculation of SQL can be realized.

Commonly used scripts for post-SQL calculations include JAVA, Python pandas, and esProc. Let us learn more about these scripts and see their differences in post-SQL calculations.

 

JAVA

High-level languages ​​such as C++ and JAVA are theoretically omnipotent, and naturally they can also implement operations that SQL is not good at. JAVA supports generics, the collection is more thorough, and complex collection operations can be realized. JAVA arrays originally have serial numbers, which can realize ordered operations. JAVA supports object references, which can be used to express relationships, and there is no problem with associative operations. JAVA supports procedural grammars such as branch and loop, which can easily implement multi-step complex operations.

However, JAVA lacks a structured class library. Even the simplest structured calculations must be hard-coded, and the most basic structured data types must be created manually, which will lead to tedious and cumbersome code.

Give an example of orderly calculation: find the longest consecutive rising days for a stock. The database table AAPL stores the stock price information of a certain stock. The main fields are transaction date and closing price. Please calculate the longest consecutive rising days for the stock.

Realize this task in a natural way: loop the stock records with orderly dates. If this record is up compared with the previous record, the number of consecutive rising days (initially 0) will be increased by 1, and if it is down, it will be Comparing the number of consecutive rising days with the current maximum consecutive rising days (initially 0), select the new current maximum consecutive rising days, and then reset the consecutive rising days to 0. The cycle continues until the end, and the current maximum consecutive rising days is the final maximum consecutive rising days.

SQL is not good at orderly calculations, and cannot be realized with the above natural ideas. It can only use some weird and incomprehensible techniques: divide the stock records ordered by date into several groups, and divide the continuously rising records into the same group, that is, a certain day If the share price of is up from the previous day, it will be assigned to the same group as the previous day’s record. If it has fallen, a new group will be started. Finally, look at the largest number of members in all groups, that is, the largest number of consecutive rising days.

The specific SQL is as follows:

select max(continue_inc_days)

from (select count(*) continue_inc_days

      from   (select sum(inc_de_flag) over(order by transDate) continue_de_days

              from (select transDate,

                      case when

                          price>LAG(price)   over(order by transDate)

                     then 0 else 1 end inc_de_flag

                    from AAPL) )

group by continue_de_days)

 

This piece of SQL is not very long, but it is nested four levels, and the techniques used are weird and difficult to understand. It is difficult for ordinary people to come up with such code.

When implemented in JAVA, you can return to natural thinking:

package stock;

 

import java.sql.*;

 

public class APP {

       public static void main(String[] args) throws SQLException,

                     ClassNotFoundException   {

              Connection con = null;

              Class.forName("com.mysql.cj.jdbc.Driver");

              with = DriverManager

                           .getConnection(

                                         "jdbc:mysql://127.0.0.1:3306/mysql?&useSSL=false&serverTimezone=UTC",

                                         "root", "");

              String dql = "select   * from AAPL order by transDate";

              PreparedStatement   stmt = con.prepareStatement(dql,

                           ResultSet.TYPE_SCROLL_INSENSITIVE,   ResultSet.CONCUR_READ_ONLY);

              ResultSet aapl = stmt.executeQuery();

              int continue_inc_days = 0;

              int max_continue_inc_days = 0;

              float last_Price = 0;

              while (aapl.next()) {

                     float price = aapl.getFloat("price");

                     if (price >= last_Price) {

                           continue_inc_days++;

                     } else {

                           if (continue_inc_days >= max_continue_inc_days) {

                                  max_continue_inc_days = continue_inc_days;

                           }

                           continue_inc_days = 0;

                     }

                     last_Price = price;

              }

              System.out.println("max_continue_inc_days=" + max_continue_inc_days);

              if (con != null)

                     con.close();

       }

}

后面那段代码就是前面讲述的思路,只要一层循环就可以完成了。

然而,我们也发现,Java写出的这段代码,虽然思路简单,难度不大,但显然代码很冗长。

这个问题的复杂度并不高,还没涉及到常见的分组、连接等结构化数据计算,否则代码量将更为惊人,限于篇幅,就不再用JAVA举例了。

在多样性数据、优化性能、处理大数据等高级功能方面,JAVA的特点同样是“能实现,但太繁琐”,这里也不再赘述。

JAVA是个优秀的企业级通用语言,但通用的另一层意思往往是不专业,换句话说,JAVA缺乏专业的结构化计算类库,代码冗长繁琐,算不上理想的SQL后计算脚本。

 

Python pandas

Python有简捷的语法,还拥有众多的第三方函数库,其中就有服务于结构化计算的Pandas。也正因为如此,Pandas常被用作SQL的后计算脚本。

作为结构化计算函数库,Pandas简化SQL复杂运算的能力要比JAVA强很多。

比如,同样的有序运算 “求最长连续上涨天数”,Pandas代码是这样的:

import pymysql

import pandas as pd

conn =   pymysql.connect(

    host = "127.0.0.1",

    port = 3306,   

    user = "root",

    password = "",

    database = "mysql",

)

aapl = pd.read_sql_query("select price from AAPL order by   transDate", conn)

continue_inc_days=0 ; max_continue_inc_days=0

for i in aapl['price'].shift(0)>aapl['price'].shift(1):

    continue_inc_days   =0 if i==False else continue_inc_days +1

    max_continue_inc_days   = continue_inc_days if max_continue_inc_days   < continue_inc_days else max_continue_inc_days

print(max_continue_inc_days)

conn.close()

上述代码中,Pandas提供了用于结构化计算的数据结构dataFrame,这种数据结构天然带序号,在有序运算中可以简化代码,比JAVA更容易进行跨行取数。此外,Pandas对SQL取数的封装也很紧凑,比JAVA代码更加简短。

再比如集合计算例子:一行拆分为多行。库表tb有2个字段,其中ANOMALIES存储以空格为分隔符的字符串,需将ANOMALIES按空格拆分,使每个ID字段对应一个成员。

处理前的数据                     

ID ANOMALIES
1 A1 B1 C1 D1
2 A2
3 A3 B3 C3
4 A3 B4 D4

处理后的数据:

ID ANOMALIES
1 A1
1 B1
1 C1
1 D1
2 A2

Pandas核心代码如下(省略数据库输入输出,下同):

split_dict = pd.read_sql_query("select * from tb", conn)

split_list = []

for key,value in split_dict.items():

    anomalies = value[0].split(' ')

    key_array = np.tile(key,len(anomalies))

    split_df =   pd.DataFrame(np.array([key_array,anomalies]).T,columns=['ID','ANOMALIES'])

    split_list.append(split_df)

df = pd.concat(split_list,ignore_index=True)

上述代码中,Pandas用集合函数将字符串直接拆分为dataFrame,再用集合函数将多个dataFrame直接合并,代码非常简练。JAVA虽然可以实现类似的功能,但都要手工实现,代码要繁琐得多。

作为结构化计算函数库,Pandas代码的确比JAVA简练,但这仅限于复杂度有限的情况下,如果复杂度进一步提高,Pandas代码也会变得冗长难懂。

比如这个涉及集合计算+有序计算的例子:连续值班情况。库表duty记录着每日值班情况,一个人通常会持续值班几个工作日,之后再换人,现在请根据duty依次计算出每个人连续的值班情况。数据结构示意如下:

处理前(duty)

Date Name
2018-03-01 Emily
2018-03-02 Emily
2018-03-04 Emily
2018-03-04 Johnson
2018-04-05 Ashley
2018-03-06 Emily
2018-03-07 Emily

处理后

Name Begin End
Emily 2018-03-01 2018-03-03
Johnson 2018-03-04 2018-03-04
Ashley 2018-03-05 2018-03-05
Emily 2018-03-06 2018-03-07


核心的pandas代码如下:

……

duty = pd.read_sql_query("select date,name from duty order by   date", conn)

name_rec = ''

start = 0

duty_list = []

for i in range(len(duty)):

    if name_rec == '':

        name_rec = duty['name'][i]

    if name_rec != duty['name'][i]:

        begin =   duty['date'].loc[start:i-1].values[0]

        end =   duty['date'].loc[start:i-1].values[-1]

          duty_list.append([name_rec,begin,end])

        start = i

        name_rec = duty['name'][i]

begin = duty['date'].loc[start:i].values[0]

end = duty['date'].loc[start:i].values[-1]

duty_list.append([name_rec,begin,end])

duty_b_e = pd.DataFrame(duty_list,columns=['name','begin','end'])

上面已经省略了数据库输出输出的过程,可以看到代码还是有点繁琐。

 

再比如集合计算+多步骤运算的例子:计算分期贷款明细。库表loan记录着贷款信息,包括贷款 ID,贷款总额、期数、年利率,示意如下:

LoanID LoanAmt Term Rate
L01 100000 5 4.8
L02 20000 2 5.0
L03 500000 12 4.5

需要计算出各期明细,包括:当期还款额、当期利息、当期本金、剩余本金。计算结果如下:

LoanID LoanAmt Payment Term Rate interest princepal princeplebalance
L01 100000 20238.13 5 4.75 395.83 19842.29 80159.71
L01 100000 20238.13 5 4.75 317.29 19920.83 60236.87
L01 100000 20238.13 5 4.75 238.44 19999.69 40237.18
L01 100000 20238.13 5 4.75 159.27 20078.85 20158.33







实现上述运算的Pandas核心代码如下:

loan_data = pd.read_sql_query("select loanID,LoanAmt,Term,Rate from   loan", conn)

loan_data['mrate'] = loan_data['Rate']/(100*12)

loan_data['mpayment'] =   loan_data['LoanAmt']*loan_data['mrate']*np.power(1+loan_data['mrate'],loan_data['Term'])   \

                          /(np.power(1+loan_data['mrate'],loan_data['Term'])-1)

loan_term_list = []

for i in range(len(loan_data)):

    loanid =   np.tile(loan_data.loc[i]['LoanID'],loan_data.loc[i]['Term'])

    loanamt = np.tile(loan_data.loc[i]['LoanAmt'],loan_data.loc[i]['Term'])

    term =   np.tile(loan_data.loc[i]['Term'],loan_data.loc[i]['Term'])

    rate =   np.tile(loan_data.loc[i]['Rate'],loan_data.loc[i]['Term'])

    payment =   np.tile(np.array(loan_data.loc[i]['mpayment']),loan_data.loc[i]['Term'])

    interest = np.zeros(len(loanamt))

    principal = np.zeros(len(loanamt))

    principalbalance  = np.zeros(len(loanamt))

    loan_amt = loanamt[0]

    for j in range(len(loanamt)):

        interest[j] =   loan_amt*loan_data.loc[i]['mrate']

        principal[j] = payment[j]   - interest[j]

        principalbalance[j] =   loan_amt - principal[j]

        loan_amt =   principalbalance[j]

    loan_data_df =   pd.DataFrame(np.transpose(np.array([loanid,loanamt,term,rate,payment,interest,principal,principalbalance])),

                  columns =   ['loanid','loanamt','term','rate','payment','interest','principal','principalbalance'])

    loan_term_list.append(loan_data_df)

loan_term_pay = pd.concat(loan_term_list,ignore_index=True)

print(loan_term_pay)

可以看到,在简化SQL复杂运算方面Python虽然比JAVA强很多,但只限于简单情况,如果需求再复杂些,代码也会变得冗长难懂。之所以出现这种现象,可能因为Pandas只是第三方函数库,不能得到Python从语法层面的底层支撑,设计的专业性也不足。

 

Pandas的专业性不足,还体现在多样性数据上。Pandas没有为各类数据源开发统一接口,只支持常见的本地文件,但不支持复杂的数据源,比如Hadoop、MongoDB,用户还要自己寻找第三方(实际是第四方)函数库,并编写复杂的访问代码。Pandas甚至没有统一数据库接口,比如MySQL就有好几种第三方函数库,常见的有PyMySQL、sqlalchemy、MySQLdb。不过,这个问题对于大多数桌面应用场景还不严重,常见的数据库Python 基本上都能简单地支持。

对于多源混合关联问题,只要能读出各种数据源的数据,基本上也就能实现了,Pandas在这方面的表现基本令人满意。不过,还是上面的说法,对于简单的混合关联关系,Pandas都容易实现,而一旦出现较复杂的关联运算,实现过程就会变得困难起来。

在大数据量方面,Pandas的表现就不尽如人意了。Pandas没有游标数据类型,这导致解决较大数据量的计算时,必须硬编码实现循环取数,而不能自动进行内外存交换,代码因此异常繁琐。详情可参考《How Python Handles Big Files

Pandas的运算性能也一般,但基本够用。令人经常诟病的主要是多线程并行,Python下很难实现此类运算。比如数据库IO一般都较慢,但可以在数据库不忙时使用并行取数的办法来提高取数性能。而Python要借助其他第三方函数库才能实现并行,代码异常繁琐,且在表达效率、执行效率、稳定性等方便均缺乏保障。

Pandas虽然是结构化计算函数库,但仍不够好用。

 

esProc

与Pandas类似,esProc也具有丰富的结构化计算函数,与Pandas不同的是, esProc是由商业公司支持的产品,是专业的结构化计算语言,而不是开源社区的第三方库函数,也不存在一个松散的上级组织。esProc可以从全局角度设计一致的结构化计算语法,可以自底向上设计统一的结构化数据类型,使函数之间以最大的灵活度搭配组合,从而快捷方便地解决SQL后计算中遇到的问题。

作为专业的结构化计算语言,esProc擅长简化SQL复杂运算,比如,求最长连续上涨天数,实现前面说过的自然思路,esProc只需短短2行:


A
1 =mysqlDB.query@x("select   price from AAPL order by transDate")
2 =a=0,A1.max(a=if(price>price[-1],a+1,0))

上述代码使用了序表和循环函数,序表是专用于结构化计算的数据结构,可以比Pandas更容易进行跨行取数,可以更方便地实现有序计算,循环函数可以避免大部分的for语句(复杂情况下还是应该用for),可以大幅简化代码。此外,esProc对SQL取数的封装更紧凑,比Pandas代码更加简短。

再比如一行拆分为多行,esProc代码依然简短:


A
1 =orcl.query@x("select   * from tb")
2 =A1.news(ANOMALIES.split(" ");ID,~: ANOMALIES)

 

即使需求复杂度进一步提高,esProc仍然可以轻松实现。

比如连续值班情况,esProc代码要比Pandas简短很多:


A
1 =orcl.query("select   date,name from duty order by date")
2 =A1.group@o(name)
3 =A2.new(name,~.m(1).date:begin,~.m(-1).date:end)

再比如计算分期贷款明细,esProc同样比Pandas简短:


A
1 =orcl.query@x("select   loanID,LoanAmt,Term,Rate from loan")
2 =A1.derive(Rate/100/12:mRate,LoanAmt*mRate*power((1+mRate),Term)/(power((1+mRate),Term)-1):mPayment)
3 =A2.news((t=LoanAmt,Term);LoanID,     LoanAmt, mPayment:payment, Term, Rate, t* mRate:interest,     payment-interest:principal, t=t-principal:principlebalance)

 

对于Pandas很难实现的复杂运算,esProc通常也能轻松实现,而且代码不难。比如涉及多步骤算法+集合运算+动态表结构的任务:将子表横向插入子表。

源表关系

Order(主表)
OrderDetail(子表)
ID(pk)    ß- OrderID(PK)
Customer
Number(pk)
Date
Product


Amount

目标结果

ID Customer Date Product1 Amount1 Product2 Amount2 Product3 Amount3
1 3 2019-01-01 Apple 5 Milk 3 Salt 1
2 5 2019-01-02 Beef 2 Pork 4

3 2 2019-01-02 Pizza 3



 

esProc可以大幅简化这段代码:


A B
1 =orcl.query@x("select   * from OrderDetail left join Order on Order.ID=OrderDetail.OrderID")
2 =A1.group(ID) =A2.max(~.count()).("Product"+string(~)+","+"Amount"+string(~)).concat@c()
3 =create(ID,Customer,Date,${B2}) >A2.run(A3.record([ID,Customer,Date]|~.([Product,Amount]).conj()))

 

作为专业的结构化计算语言,esProc不仅可以大幅简化SQL不擅长的复杂运算,还具备更高级的能力去解决一些特殊情况。

在多样性数据方面,esProc支持多种文件格式和复杂的数据源,比如Hadoop、MongoDB等。更进一步,只需使用相同的代码,数据分析师就能计算来源各异的数据,既包括数据库,也包括非数据库。

在大数据量方面,esProc从底层提供了游标机制,对上层隐藏了内外存交换细节,允许数据分析师用类似处理小数据量的语法,直观地处理较大的数据量。

比如,库表orders记录着电商的大量订单,全部读出会超出内存,现在需要在库外计算每个销售员销售额最大的3笔订单。esProc代码如下:


A
1 =my.cursor@x("select   sellerid,amount from orders")
2 =A1.groups(sellerid;top(3;   -amount):top3)
3 =A2.conj(top3)

esProc也提供了很多简单易用的方法进行性能优化。比如:orders表每月的数据大致相等,请按月份进行多线程并行查询,从而大幅提高查询性能。esProc只需如下代码:


A B
1 fork to(1,12) /12线程并行
2
=connect(“my”)
3
=B2.query@x(“select *   from orders where month(orderdate)=?”,A1)
4 =A1.conj() / 合并查询结果

 

经过前面的比较我们可以发现,esProc具备丰富的结构化函数,是专业的结构化计算语言,可以大幅简化SQL不擅长的复杂运算逻辑,是更加理想的SQL后计算脚本。


Guess you like

Origin blog.51cto.com/12749034/2551776