Pandas Time Series: Resampling and Frequency Conversion

import pandas as pd
import numpy as np

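I. Introduction

Resampling is the process of converting a time series from one frequency to another.
Aggregating high-frequency (short-interval) data to a lower frequency (longer interval) is called downsampling.
Converting low-frequency data to a higher frequency is called upsampling.
Some resampling is neither downsampling nor upsampling, for example converting W-WED (weekly, on Wednesdays) to W-FRI.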
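
The three cases above can be sketched on a small deterministic series. This is only an illustration: the 14-day series and the variable names are made up for the example, and sum() is just one possible aggregation.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 14 daily values starting on Saturday 2000-01-01
daily = pd.Series(np.arange(14),
                  index=pd.date_range('2000-01-01', periods=14, freq='D'))

# Downsampling: daily values are aggregated into weeks ending on Wednesday
weekly = daily.resample('W-WED').sum()

# Upsampling: the weekly values are placed on a daily grid; new slots are NaN
back_to_daily = weekly.resample('D').asfreq()

# Neither: still weekly, but the weeks now end on Friday
shifted = weekly.resample('W-FRI').sum()

print(weekly)
print(back_to_daily)
print(shifted)
```

The weekly result has fewer rows than the input, the daily result has more (mostly NaN), and the W-FRI result has the same number of rows with different labels.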

II. The resample method: the workhorse for frequency conversion

rng = pd.date_range('1/1/2000',periods=100,freq='D')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts.resample('M').mean() # downsample the 100 daily values to monthly means (aggregation)

2000-01-31   -0.156092
2000-02-29    0.060607
2000-03-31   -0.039608
2000-04-30   -0.154838
Freq: M, dtype: float64

ts.resample('M',kind='period').mean()

2000-01   -0.156092
2000-02    0.060607
2000-03   -0.039608
2000-04   -0.154838
Freq: M, dtype: float64
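
The kind keyword is marked as deprecated in newer pandas releases (from 2.2 on, as far as the documentation indicates). A minimal sketch of one alternative is to resample first and convert the index to periods afterwards. The 'MS' (month start) alias and the seeded random data are assumptions made for this example, chosen so that it runs on both older and newer versions.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=100, freq='D')
# Seeded generator so the example is reproducible (values differ from the text above)
ts = pd.Series(np.random.default_rng(0).standard_normal(len(rng)), index=rng)

# Monthly means labeled by month start, then relabeled as monthly periods
monthly = ts.resample('MS').mean().to_period('M')
print(monthly)
```

The result is indexed by 2000-01 through 2000-04, the same labels that kind='period' produces above.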

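III. Downsampling (aggregation)

1. By default, downsampling bins (intervals) are closed on the left and open on the right, and each aggregated value is labeled with the left edge of its bin (this holds for most frequencies; M, A, Q, W and their business variants default to the right edge)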

rng = pd.date_range('1/1/2000',periods=12,freq='T')
ts = pd.Series(np.arange(12),index=rng)
ts

2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32

ts.resample('5min').sum()

2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int32
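
One way to check the left-closed, left-labeled default is to build the bins by hand. In this sketch each timestamp is floored to a multiple of five minutes and the series is grouped by that key; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

by_resample = ts.resample('5min').sum()

# Flooring assigns every timestamp t0 <= t < t0 + 5min to the label t0,
# which is exactly a left-closed bin labeled by its left edge
by_floor = ts.groupby(ts.index.floor('5min')).sum()

print(by_resample)
print(by_floor)
```

Both results contain 10, 35 and 21 under the labels 00:00, 00:05 and 00:10.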

2. Passing closed='right' makes the bins open on the left and closed on the right

ts.resample('5min',closed='right').sum()

1999-12-31 23:55:00     0
2000-01-01 00:00:00    15
2000-01-01 00:05:00    40
2000-01-01 00:10:00    11
Freq: 5T, dtype: int32

3. Passing label='right' labels each aggregated value with the right edge of its bin

ts.resample('5min',closed='right',label='right').sum()

2000-01-01 00:00:00     0
2000-01-01 00:05:00    15
2000-01-01 00:10:00    40
2000-01-01 00:15:00    11
Freq: 5T, dtype: int32
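
Since closed and label are independent, it can help to see all four combinations side by side. The loop below is a small sketch on the same 12-minute series; the results dictionary is only a convenience for the example.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

results = {}
for closed in ('left', 'right'):
    for label in ('left', 'right'):
        # closed picks which bin edge is inclusive, label picks which edge names the bin
        results[(closed, label)] = ts.resample('5min', closed=closed, label=label).sum()
        print('closed=%s, label=%s' % (closed, label))
        print(results[(closed, label)])
```

Changing closed changes the sums, because values on a bin edge move to the neighboring bin. Changing label only moves the timestamps attached to those sums.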

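4. The loffset parameter shifts the labels by an exact amount

ts.resample('5min',closed='right',loffset='-1s').sum() # requires pandas < 2.0; loffset was deprecated in 1.1 and removed in 2.0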

1999-12-31 23:54:59     0
1999-12-31 23:59:59    15
2000-01-01 00:04:59    40
2000-01-01 00:09:59    11
Freq: 5T, dtype: int32
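
On pandas versions without loffset, the same labels can be obtained by shifting the index of the result after aggregating. This is a minimal sketch of that approach; the name result is made up for the example.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

result = ts.resample('5min', closed='right').sum()
# Move every label back by one second, which is what loffset='-1s' used to do
result.index = result.index + pd.Timedelta(seconds=-1)
print(result)
```

The sums are unchanged (0, 15, 40, 11); only the labels move to 23:54:59, 23:59:59, 00:04:59 and 00:09:59.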

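IV. OHLC resampling

OHLC is a common aggregation in finance. For each bin it computes the first value (open), the last value (close), the maximum (high), and the minimum (low), and returns a DataFrame.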

print(ts.resample('5min').ohlc())

                     open  high  low  close
2000-01-01 00:00:00     0     4    0      4
2000-01-01 00:05:00     5     9    5      9
2000-01-01 00:10:00    10    11   10     11
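
To make the definition concrete, the four columns can also be produced with a generic agg call. This sketch compares the two; the column renaming is only there so that the frames line up.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

bars = ts.resample('5min').ohlc()

# Same four statistics computed one by one, in open/high/low/close order
manual = ts.resample('5min').agg(['first', 'max', 'min', 'last'])
manual.columns = ['open', 'high', 'low', 'close']

print(bars)
print(manual)
```

Because this series is strictly increasing, open equals low and close equals high in every bin.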

V. Resampling with groupby

rng = pd.date_range('1/1/2000',periods=100,freq='D')
ts = pd.Series(np.arange(100),index=rng)
ts.groupby(lambda x:x.month).mean() # equivalent to ts.groupby(rng.month).mean()

1    15
2    45
3    75
4    95
dtype: int32

ts.groupby(lambda x:x.weekday).mean() # aggregate by day of the week (0 = Monday, 6 = Sunday)

0    47.5
1    48.5
2    49.5
3    50.5
4    51.5
5    49.0
6    50.0
dtype: float64
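
Grouping by index.month is not the same as resampling to monthly frequency once the data spans more than one year, because the month number ignores the year. The sketch below uses a made-up two-year series of ones, and the 'MS' (month start) alias is an assumption chosen so the code runs across pandas versions.

```python
import pandas as pd

# Hypothetical data: one value of 1.0 per day for 2000 (leap year) and 2001
days = pd.date_range('2000-01-01', '2001-12-31', freq='D')
ts2 = pd.Series(1.0, index=days)

# Groups by month number: February 2000 and February 2001 fall in the same group
by_month_number = ts2.groupby(ts2.index.month).sum()

# Resamples by calendar month: every month of every year gets its own bin
by_calendar_month = ts2.resample('MS').sum()

print(by_month_number)
print(by_calendar_month.head())
```

The first result has 12 rows (February counts 29 + 28 = 57 days), while the second has 24 rows.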

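VI. Upsampling and interpolation

Upsampling converts from a lower frequency to a higher one, which introduces missing values.
When upsampling, you have to decide which position in the resampled result holds each original value.
Once those positions are fixed, the entries between them are added according to the new frequency.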

frame = pd.DataFrame(np.random.randn(2,4),
                    index = pd.date_range('1/1/2000',periods=2,freq='W-WED'),
                    columns = ['Colorado','Texas','New York','Ohio'])
print(frame)

            Colorado     Texas  New York      Ohio
2000-01-05 -0.078765  1.389417  0.732726  0.816723
2000-01-12 -0.663686  0.744384  1.395332 -0.031715

1. Upsampling with forward fill

df_daily = frame.resample('D')
print(df_daily.ffill())

            Colorado     Texas  New York      Ohio
2000-01-05 -0.078765  1.389417  0.732726  0.816723
2000-01-06 -0.078765  1.389417  0.732726  0.816723
2000-01-07 -0.078765  1.389417  0.732726  0.816723
2000-01-08 -0.078765  1.389417  0.732726  0.816723
2000-01-09 -0.078765  1.389417  0.732726  0.816723
2000-01-10 -0.078765  1.389417  0.732726  0.816723
2000-01-11 -0.078765  1.389417  0.732726  0.816723
2000-01-12 -0.663686  0.744384  1.395332 -0.031715

print(df_daily.ffill(limit=2))

            Colorado     Texas  New York      Ohio
2000-01-05 -0.078765  1.389417  0.732726  0.816723
2000-01-06 -0.078765  1.389417  0.732726  0.816723
2000-01-07 -0.078765  1.389417  0.732726  0.816723
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12 -0.663686  0.744384  1.395332 -0.031715
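
Forward fill repeats the last known value. If the gaps should instead be filled with values between the two known points, the resampler's interpolate method can be used. This sketch uses a made-up one-column frame with round numbers so the result is easy to read; linear interpolation is the default and is assumed here.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 0.0 on Wednesday 2000-01-05 and 7.0 one week later
frame = pd.DataFrame({'value': [0.0, 7.0]},
                     index=pd.date_range('2000-01-05', periods=2, freq='W-WED'))

filled = frame.resample('D').ffill()              # repeats 0.0 until 2000-01-12
interpolated = frame.resample('D').interpolate()  # linear steps between 0.0 and 7.0

print(filled)
print(interpolated)
```

With these inputs the interpolated column increases by one per day, while the forward-filled column stays at 0.0 until the last row.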

2. The resampled dates do not have to overlap the original dates

print(frame)

            Colorado     Texas  New York      Ohio
2000-01-05 -0.078765  1.389417  0.732726  0.816723
2000-01-12 -0.663686  0.744384  1.395332 -0.031715

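print(frame.resample('W-THU').ffill()) # none of the new Thursday labels matches an original date, so without filling every row would be NaN; ffill carries the 2000-01-05 and 2000-01-12 values forward onto them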

            Colorado     Texas  New York      Ohio
2000-01-06 -0.078765  1.389417  0.732726  0.816723
2000-01-13 -0.663686  0.744384  1.395332 -0.031715
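
The behavior above matches a reindex onto the new dates with forward filling. The sketch below compares the two on a made-up one-column frame; the explicit Thursday range is written out by hand for the example.

```python
import pandas as pd

# Hypothetical frame with values on two consecutive Wednesdays
frame = pd.DataFrame({'value': [1.0, 2.0]},
                     index=pd.date_range('2000-01-05', periods=2, freq='W-WED'))

resampled = frame.resample('W-THU').ffill()

# The same target dates built explicitly, filled from the most recent earlier row
thursdays = pd.date_range('2000-01-06', periods=2, freq='W-THU')
reindexed = frame.reindex(thursdays, method='ffill')

print(resampled)
print(reindexed)
```

Each Thursday takes the value of the Wednesday just before it.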

VII. Resampling with periods

1. Downsampling

frame = pd.DataFrame(np.random.randn(24,4),
                    index = pd.period_range('1-2000','12-2001',freq='M'),
                    columns = ['Colorado','Texas','New York','Ohio'])
print(frame[:5])

         Colorado     Texas  New York      Ohio
2000-01 -1.956495 -0.689508  0.057439 -0.655832
2000-02 -0.491443 -1.731887  1.336801  0.659877
2000-03 -0.139601 -1.310386 -0.299205  1.194269
2000-04  0.431474 -1.312518  1.880223  0.379421
2000-05 -0.674796  0.471018  0.132998  0.509761

annual_frame = frame.resample('A-DEC').mean()
print(annual_frame)

      Colorado     Texas  New York      Ohio
2000 -0.332076 -0.762599  0.046917  0.224908
2001 -0.152922  0.168667 -0.326439 -0.052034

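2. The convention parameter decides which end of the new frequency's range holds the original value after upsampling

# Q-DEC: quarterly periods for a year that ends in December, so Jan-Mar is Q1, Apr-Jun is Q2, Jul-Sep is Q3, and Oct-Dec is Q4; the default convention is 'start', so each year lands on its Q1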
print(annual_frame.resample('Q-DEC').ffill())

        Colorado     Texas  New York      Ohio
2000Q1 -0.332076 -0.762599  0.046917  0.224908
2000Q2 -0.332076 -0.762599  0.046917  0.224908
2000Q3 -0.332076 -0.762599  0.046917  0.224908
2000Q4 -0.332076 -0.762599  0.046917  0.224908
2001Q1 -0.152922  0.168667 -0.326439 -0.052034
2001Q2 -0.152922  0.168667 -0.326439 -0.052034
2001Q3 -0.152922  0.168667 -0.326439 -0.052034
2001Q4 -0.152922  0.168667 -0.326439 -0.052034

# 2000 is placed at 2000Q4 and 2001 at 2001Q4; the quarters between 2000Q4 and 2001Q4 are the rows added by upsampling
print(annual_frame.resample('Q-DEC',convention='end').ffill())

        Colorado     Texas  New York      Ohio
2000Q4 -0.332076 -0.762599  0.046917  0.224908
2001Q1 -0.332076 -0.762599  0.046917  0.224908
2001Q2 -0.332076 -0.762599  0.046917  0.224908
2001Q3 -0.332076 -0.762599  0.046917  0.224908
2001Q4 -0.152922  0.168667 -0.326439 -0.052034
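
The effect of convention can be checked on a single period without any data. This sketch uses Period.asfreq rather than resample, on the assumption that its how argument plays the same role as convention; the 'Y' annual alias is used here so the code runs across pandas versions.

```python
import pandas as pd

year = pd.Period('2000', freq='Y')  # calendar year 2000, ending in December

# Quarter that holds the annual value under each convention
first_quarter = year.asfreq('Q-DEC', how='start')
last_quarter = year.asfreq('Q-DEC', how='end')

print(first_quarter)  # 2000Q1
print(last_quarter)   # 2000Q4
```

With 'start' the year maps to its first quarter, which is why the default output begins at 2000Q1. With 'end' it maps to its last quarter, which is why the second output begins at 2000Q4.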

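3. A worked example

Q-MAR: the fiscal year ends in March, so Apr-Jun is Q1, Jul-Sep is Q2, Oct-Dec is Q3, and Jan-Mar is Q4.
2000-01 through 2000-03 is 2000Q4, 2000-04 through 2000-06 is 2001Q1, and so on.
Calendar year 2000 therefore covers [2000Q4, 2001Q1, 2001Q2, 2001Q3], and 2001 covers [2001Q4, 2002Q1, 2002Q2, 2002Q3].
With convention='end', the original 2000 value is placed at 2001Q3 and the 2001 value at 2002Q3; the quarters in between are added automatically.
The resulting index is [2001Q3, 2001Q4, 2002Q1, 2002Q2, 2002Q3].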

print(annual_frame.resample('Q-MAR',convention='end').ffill())

        Colorado     Texas  New York      Ohio
2001Q3 -0.332076 -0.762599  0.046917  0.224908
2001Q4 -0.332076 -0.762599  0.046917  0.224908
2002Q1 -0.332076 -0.762599  0.046917  0.224908
2002Q2 -0.332076 -0.762599  0.046917  0.224908
2002Q3 -0.152922  0.168667 -0.326439 -0.052034
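
The month-to-quarter mapping described above can be verified directly with period conversions. This is a small sketch using PeriodIndex.asfreq and Period.asfreq; the variable names are made up, and 'Y' is used as the annual alias so the code runs across pandas versions.

```python
import pandas as pd

# Every month of calendar year 2000, converted to Q-MAR fiscal quarters
months = pd.period_range('2000-01', '2000-12', freq='M')
quarters = months.asfreq('Q-MAR')
for month, quarter in zip(months, quarters):
    print(month, '->', quarter)

# The quarter that receives the 2000 value when convention='end'
year_end = pd.Period('2000', freq='Y').asfreq('Q-MAR', how='end')
print(year_end)
```

January through March map to 2000Q4, April through June to 2001Q1, and the year as a whole ends in 2001Q3, which is the first label in the output above.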
