Calculation scripts suitable for time series data

Time series data here refers to day-to-day business data ordered by time. Computing on time series data involves not only routine calculations by quarter, month, weekday, or weekend, but often more complex order-dependent operations as well, so the scripting language needs matching computing capabilities. The scripts commonly used to process time series data include SQL, Python Pandas, and esProc. Let us look at each of them and compare their capabilities.

SQL

SQL has a long history and a large user base, and within its model framework it has long since been pushed to its limit. A SQL solution can be found for almost every simple operation, ordered operations included.

Take period-over-period comparison as an example: the table stock1001 stores the trading information of a stock. The main fields are the trading date transDate and the closing price price. Calculate the growth rate of each trading day's closing price over the previous trading day.

This is a relative-position calculation. With window functions, the SQL is fairly easy to write:

select transDate, price,
       price / lag(price) over (order by transDate) - 1 comp
from stock1001

But some SQL dialects do not support window functions, and then the same comparison is much more troublesome to implement:

with A as (select t1.transDate, t1.price, count(*) as rk
           from stock1001 as t1,
                stock1001 as t2
           where t1.transDate > t2.transDate
              or (t1.transDate = t2.transDate and t1.price <= t2.price)
           group by t1.transDate, t1.price
           order by rk)
select t1.transDate, (t1.price / t2.price - 1) comp
from A as t1 left join A as t2
  on t1.rk = t2.rk + 1

This code is cumbersome for two reasons. First, SQL is built on unordered sets: rows have no sequence numbers, which makes ordered operations awkward. To perform one, a sequence number column has to be manufactured for the unordered set, and that takes a self-join plus grouping, which complicates the code. Second, period-over-period comparison is a relative-position calculation. If SQL had relative sequence numbers the calculation would be much simpler, but it does not, so the previous row has to be joined onto the current row to emulate adjacent-position access, which complicates the code further.
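To make the two steps concrete, here is a minimal plain-Python sketch of what the self-join does: it manufactures a sequence number by counting, then joins each row to the row whose number is one less. The sample rows and the function names are hypothetical, and the sketch assumes one row per trading date.

```python
from datetime import date

# Hypothetical sample rows (transDate, price), deliberately stored out of order
# to mimic an unordered set.
rows = [
    (date(2020, 1, 3), 10.5),
    (date(2020, 1, 2), 10.0),
    (date(2020, 1, 6), 10.29),
]

def add_sequence_numbers(rows):
    """Number each row by counting the rows dated on or before it.

    This plays the role of the self-join with COUNT(*).
    Assumption: one row per date, so the counts are unique.
    """
    return {
        sum(1 for other_date, _ in rows if other_date <= trans_date): (trans_date, price)
        for trans_date, price in rows
    }

def growth_rates(rows):
    """Join each row to the row whose sequence number is one less."""
    by_rk = add_sequence_numbers(rows)
    result = []
    for rk in sorted(by_rk):
        trans_date, price = by_rk[rk]
        previous = by_rk.get(rk - 1)  # missing for the first day, like a LEFT JOIN
        comp = price / previous[1] - 1 if previous else None
        result.append((trans_date, comp))
    return result

rates = growth_rates(rows)
```

Counting compares every row with every other row, which is why this route is both verbose and slow compared with simply walking an ordered list.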

Because SQL is based on unordered sets, ordered operations are inconvenient to implement. Window functions ease the situation, but complex operations remain troublesome.

Take the median as an example: the scores table stores student scores. The main fields are the student number studentdid and the math score math. Calculate the median of the math scores. The median is defined as follows: if the total number of records L is even, return the average of the two middle values (sequence numbers L/2 and L/2+1); if L is odd, return the single middle value (sequence number (L+1)/2).

SQL code for calculating the median:

with A as (select studentdid, math, row_number() over (order by math) rk
           from scores),
     B as (select count(1) L from scores)
select avg(math) from A where rk in (
    select case when mod(L,2)=1 then ((L+1) div 2) else (L div 2) end no from B
    union
    select case when mod(L,2)=1 then ((L+1) div 2) else (L div 2)+1 end from B
)

As you can see, the SQL is still complicated even with window functions. Generating sequence numbers would be redundant for an ordered set, but in SQL it is an indispensable step, and because this example must use the sequence numbers explicitly, the code grows more complex. Branching is also awkward in SQL, so for odd L the query does not return the single middle value directly but averages two identical middle values. The trick simplifies the branching, but it makes the code slightly harder to understand.

A rounding function would let the query skip the branch altogether and shorten the code. But that form departs from the original definition of the median and is harder to follow, so it is not used here.
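The rounding variant is not shown in the article. One plausible form, assumed here, averages the values at positions floor((L+1)/2) and ceil((L+1)/2). The plain-Python sketch below places it next to the definition so the two can be compared; both function names are hypothetical, and the input is assumed to be sorted and non-empty.

```python
import math

def median_by_definition(sorted_values):
    """Median with an explicit odd/even branch, using 1-based sequence numbers."""
    L = len(sorted_values)
    if L % 2 == 1:
        # Odd: the single middle value at sequence number (L+1)/2.
        return sorted_values[(L + 1) // 2 - 1]
    # Even: the average of sequence numbers L/2 and L/2+1.
    return (sorted_values[L // 2 - 1] + sorted_values[L // 2]) / 2

def median_by_rounding(sorted_values):
    """Median without a branch: average positions floor((L+1)/2) and ceil((L+1)/2)."""
    L = len(sorted_values)
    low = math.floor((L + 1) / 2)
    high = math.ceil((L + 1) / 2)
    # For odd L the two positions coincide, so the same value is averaged with itself.
    return (sorted_values[low - 1] + sorted_values[high - 1]) / 2

even_scores = sorted([70, 85, 90, 60])
odd_scores = sorted([70, 85, 60])
```

Both functions agree on odd and even inputs, but the second one only makes sense to a reader who works out why the two rounded positions coincide when L is odd.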

Now a slightly more complicated example: the longest run of consecutive rising days. The database table AAPL stores the price information of a stock. The main fields are the trading date transDate and the closing price price. Calculate the longest run of consecutive rising days for the stock. The SQL is as follows:

select max(continue_inc_days)
from (select count(*) - 1 continue_inc_days  -- each group opens with one non-rising row
      from (select sum(inc_de_flag) over (order by transDate) continue_de_days
            from (select transDate,
                         case when price > lag(price) over (order by transDate)
                              then 0 else 1 end inc_de_flag
                  from AAPL))
      group by continue_de_days)

 

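Following natural thinking, the task would be done by looping over the stock records in date order: if the current record is higher than the previous one, add 1 to the count of consecutive rising days (initially 0); if it is not, compare the count with the current maximum (initially 0), keep the larger as the new maximum, and reset the count to 0. When the loop ends, the current maximum is the final answer.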

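But SQL is not good at ordered calculation and cannot follow this natural approach; it has to resort to a trick that is hard to understand. The date-ordered stock records are divided into groups so that consecutively rising records fall into the same group: if a day's price is higher than the previous day's, its record joins the previous day's group; if the price fell, it starts a new group. The longest run of consecutive rising days is then read off the largest group.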
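The grouping trick is easier to follow when each intermediate step is visible. The plain-Python sketch below mirrors it under stated assumptions: prices are already in date order, the function name is hypothetical, and because every group opens with the one record that did not rise, the sketch subtracts that record to count rising days only.

```python
from collections import Counter
from itertools import accumulate

def longest_rising_run(prices):
    """Longest run of consecutive rising days, computed with the grouping trick."""
    if not prices:
        return 0
    # Flag is 0 when the price rose against the previous day, otherwise 1.
    # The first day has no previous day, so it gets 1 (as when LAG returns NULL).
    flags = [1] + [0 if cur > prev else 1 for prev, cur in zip(prices, prices[1:])]
    # A running sum of the flags gives every record in one rising run the same group id.
    group_ids = list(accumulate(flags))
    sizes = Counter(group_ids)
    # Each group holds one opening record plus the rises that follow it.
    return max(sizes.values()) - 1

sample = [10, 11, 12, 11, 12, 13, 14, 13]
# flags     -> 1, 0, 0, 1, 0, 0, 0, 1
# group ids -> 1, 1, 1, 2, 2, 2, 2, 3
```

The running sum is the step that turns "same run" into "same key", which is what allows an unordered GROUP BY to do the work of an ordered loop.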

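SQL already struggles with these two slightly complicated ordered operations, and with anything more complex it becomes almost unworkable. The reason is that SQL's theoretical foundation is the unordered set, an inherent limitation that no amount of patching can fundamentally remove.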

 

Python Pandas

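Pandas is Python's structured computing library and is often used as a calculation script for time series data.

As a structured computing library, Pandas handles simple ordered calculations with ease. For the same period-over-period comparison, the Pandas code looks like this: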

import pandas as pd
stock1001 = pd.read_csv('d:/stock1001.csv')  # returns a DataFrame
stock1001['comp'] = stock1001.price / stock1001.shift(1).price - 1

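The first two statements only load the data from a file; the core code is a single line. Note that Pandas has no way to refer to the previous row and so cannot express relative position directly. Instead, shift(1) moves the whole column down one row, which achieves the relative-position calculation indirectly. In such code, rows and columns, and previous row and next row, look very much alike, and beginners easily confuse them.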
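A small illustration may help with the direction of the shift. The sketch below uses a hypothetical three-row DataFrame instead of the csv file and keeps the shifted column visible (the prev_price column name is an assumption) so that it can be inspected next to the original.

```python
import pandas as pd

# Hypothetical closing prices, already sorted by trading date.
stock = pd.DataFrame({
    "transDate": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
    "price": [10.0, 10.5, 10.29],
})

# shift(1) moves the whole column down one row, so each row sees the previous
# row's price. The first row has no predecessor and becomes NaN.
stock["prev_price"] = stock["price"].shift(1)

# Growth rate against the previous trading day; NaN on the first row.
stock["comp"] = stock["price"] / stock["prev_price"] - 1

print(stock)
```

Printing the frame shows prev_price lagging price by one row, which is the whole-column view of what "the previous row" means for each record.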

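As part of a modern programming language, Pandas is more advanced than SQL at ordered calculation, mainly because it is built on ordered sets: the DataFrame data type has sequence numbers by nature, which suits ordered calculation. The slightly complicated ordered calculations above are very hard in SQL and comparatively easy in Pandas.

For the same median calculation, the core Pandas code is as follows: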


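import pandas as pd
df = pd.read_csv('d:/scores.csv')  # returns a DataFrame
# Sort first: the median is defined on ordered values.
# reset_index makes [N] follow the sorted order.
math = df['math'].sort_values().reset_index(drop=True)
L = len(df)
if L % 2 == 1:
    result = math[L // 2]
else:
    result = (math[L // 2 - 1] + math[L // 2]) / 2
print(result)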

 

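In this code, Pandas uses [N] directly as the sequence number instead of manufacturing one, which simplifies the code. In addition, Pandas code is procedural, so the branch is easier to understand than in SQL and needs no trick to simplify it.

The slightly more complicated example, the longest run of consecutive rising days, is also easier in Pandas than in SQL. The core code is as follows: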

import pandas as pd
# Core code only: conn is assumed to be an already opened database connection.
aapl = pd.read_sql_query("select price from AAPL order by transDate", conn)
continue_inc_days = 0
max_continue_inc_days = 0
# Each element is True when the day's price is above the previous day's.
for rose in aapl['price'] > aapl['price'].shift(1):
    continue_inc_days = continue_inc_days + 1 if rose else 0
    max_continue_inc_days = max(max_continue_inc_days, continue_inc_days)
print(max_continue_inc_days)
conn.close()

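Here Pandas follows the natural approach and needs no obscure trick, so the code is far more expressive than the SQL.

Somewhat regrettably, ordered calculation often involves relative position, and Pandas cannot express relative position directly. It can only shift a column down one row to stand in for the previous row, which takes some effort to understand.

Pandas is indeed easier than SQL for ordered calculation, but in more complex cases it becomes cumbersome too. Here are two examples.

First, filtering by cumulative value: the table sales stores customers' sales data. The main fields are the customer client and the sales amount amount. Find the top n customers whose cumulative sales amount reaches half of the total, sorted by sales amount in descending order. The Pandas code is as follows: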

import pandas as pd
sale_info = pd.read_csv("d:/sales.csv")
sale_info.sort_values(by='Amount', inplace=True, ascending=False)
half_amount = sale_info['Amount'].sum() / 2
vip_list = []
amount = 0
for client_info in sale_info.itertuples():
    amount += getattr(client_info, 'Amount')
    # The client that pushes the running total to half is included as well.
    vip_list.append(getattr(client_info, 'Client'))
    if amount >= half_amount:
        break
print(vip_list)

 

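Second, the growth rate on the three days with the highest price: the table stock1001 stores the daily prices of a stock. The main fields are the trading date transDate and the closing price price. Take the three days with the highest price, ordered from highest to lowest, and calculate each day's growth rate over the previous day. The Pandas code is as follows: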

import pandas as pd
stock1001 = pd.read_csv("d:/stock1001_price.txt", sep='\t')
CL = stock1001['CL']  # the price column is named CL in this file
# Positions of the three highest prices, highest first.
CL_psort = CL.argsort()[::-1].iloc[:3].values
# Positions of the rows just before them.
CL_psort_shift1 = CL_psort - 1
CL_rise = CL[CL_psort].values / CL[CL_psort_shift1].values - 1
max_3 = stock1001.loc[CL_psort].reset_index(drop=True)
max_3['RISE'] = CL_rise
print(max_3)

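These more complex examples also rely on tricks that are hard to understand; the code is hard to write and hard to read, so it is not explained in detail here.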

 

esProc

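Like Pandas, esProc has a rich set of structured computing functions. Unlike Pandas, esProc is not only built on ordered sets with support for sequence numbers, but also provides a convenient mechanism for referencing adjacent records and a rich set of position functions, so ordered calculations can be implemented quickly and conveniently.

For simple ordered calculations, esProc is as easy as the other scripts. Here is the esProc code for the same period-over-period comparison: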


    A                                        B
1   =file("d:/stock1001.csv").import@tc()    /Read the csv file
2   =A1.derive(price/price[-1]-1:comp)       /Compute the period-over-period change by relative position

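In the code above, A1 loads the data from the csv file and A2 is the core code. esProc uses the intuitive [-1] to refer to the row before the current one, a feature that neither Pandas nor SQL has, and a sign that esProc is the more specialized tool.

For the same median calculation, the core esProc code is as follows: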


    A
1
2   =L=A1.len()
3   =if(A2%2==0,A1([L/2,L/2+1]).avg(math),A1((L+1)/2).math)

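In this code, esProc references records directly by sequence number, as in A1(N), without manufacturing one, which keeps the code concise. esProc syntax is procedural as well: large branches can be written with if/else statements, and a short test can use the if function, as in this example.

The slightly more complicated example, the longest run of consecutive rising days, is also easier in esProc than in SQL or Pandas. The core code is as follows: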


    A
1
2   =a=0,A1.max(a=if(price>price[-1],a+1,0))

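Here esProc follows the natural approach and needs no special trick, so the code is more expressive than the SQL. Moreover, esProc can write large loops with loop statements, or, as in this example, use the loop function max for a concise loop-and-aggregate.

esProc is a more specialized structured computing language, and even more complex ordered calculations can be implemented fairly easily.

For the cumulative-value filtering example, esProc needs only the following code: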


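    A                                                                   B
1   =demo.query("select client,amount from sales").sort(amount:-1)     /Load the data and sort in descending order
2   =A1.cumulate(amount)                                                /Compute the cumulative sequence
3   =A2.m(-1)/2                                                         /The last cumulative value is the total
4   =A2.pselect(~>=A3)                                                  /Position where the cumulative value reaches half
5   =A1(to(A4))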

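This example follows natural thinking: A2 computes the cumulative value from the largest customer down to each customer, A3 computes half of the largest cumulative value, A4 finds the position where the cumulative value reaches or exceeds A3, and finally the records up to that position are the result. Two features here show how specialized esProc is. One is the m function in A3, which can fetch members counting from the end; -1 means the last one. The other is pselect in A4, which returns sequence numbers by condition. Both functions effectively simplify ordered calculation.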
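For readers who do not know esProc, the five cells can be followed step by step in a plain-Python sketch. It is only an approximation of the logic described above, not esProc's implementation; the sample rows and the function name are hypothetical, and the input is assumed to be non-empty.

```python
from itertools import accumulate

# Hypothetical (client, amount) rows.
sales = [("A", 300), ("B", 100), ("C", 400), ("D", 200)]

def top_clients_covering_half(sales):
    """Largest clients whose running total first reaches half of all sales."""
    # A1: sort by amount, largest first.
    ranked = sorted(sales, key=lambda row: row[1], reverse=True)
    # A2: running totals in ranked order.
    running = list(accumulate(amount for _, amount in ranked))
    # A3: the last running total is the grand total; take half of it.
    half = running[-1] / 2
    # A4: first position (0-based here) where the running total reaches half.
    position = next(i for i, total in enumerate(running) if total >= half)
    # A5: every record up to and including that position.
    return ranked[: position + 1]

vip = top_clients_covering_half(sales)
```

Indexing with -1 and searching for a position are the two steps that correspond to m(-1) and pselect. Python happens to make both short, while the earlier Pandas version spelled them out as an explicit loop.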

For the growth rate on the three days with the highest price, esProc needs only the following code:


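    A                                          B
1   =file("d:/stock1001.csv").import@tc()      /Load the data
2   =A1.ptop(-3,price)                         /Positions of the 3 days with the highest price
3   =A1.calc(A2,price/price[-1]-1)             /Compute the growth rate on those three days
4   =A1(A2).new(transDate,price,A3(#):comp)    /Assemble a two-dimensional table from the columns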

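In this code, ptop in A2 returns the positions of the top N records. Like pselect above, it returns a set of sequence numbers rather than a set of records. esProc has many functions of this kind, all intended to simplify ordered calculation. The # in A4 is another esProc feature: it stands directly for the sequence number field, which is very convenient, with no need to manufacture one as in SQL or to set an index as in Pandas.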
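The idea of working with positions rather than records can be sketched in plain Python. The two helpers below are hypothetical stand-ins that only approximate what the text says ptop and calc do; the price list is made-up sample data in trading-date order.

```python
# Hypothetical closing prices in trading-date order.
prices = [10.0, 10.5, 10.29, 11.0, 10.8, 11.5]

def top_n_positions(values, n):
    """Positions (0-based) of the n largest values, largest first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:n]

def rise_at(values, positions):
    """Growth rate against the previous row at each position.

    Returns None where there is no previous row (an assumption of this sketch).
    """
    return [values[i] / values[i - 1] - 1 if i > 0 else None for i in positions]

positions = top_n_positions(prices, 3)   # where the three highest prices sit
rises = rise_at(prices, positions)       # growth rate on each of those days

# Assemble (price, growth rate) pairs, highest price first.
table = [(prices[i], rise) for i, rise in zip(positions, rises)]
```

Keeping positions instead of records is what makes the previous row reachable: a position still points into the full, ordered series, whereas a record picked out of it has lost its neighbours.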

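The comparison shows that esProc has a rich set of structured functions and is a specialized structured computing language. It handles common ordered calculations with ease and effectively simplifies even the more complex ones, which makes it the more suitable calculation script for time series data.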



Origin blog.51cto.com/12749034/2547130