【python】详解pandas.DataFrame.resample根据时间聚合采样（一）

首先我们直接看官方的文档：

DataFrame.resample(rule, how=None, axis=0, fill_method=None, closed=None,
				    label=None, convention='start', kind=None, loffset=None,
				    limit=None, base=0, on=None, level=None)

聚合的时间参数rule
参数如下：

rule : 表示目标转换的偏移字符串或对象，一般是时间参数，比如“M”，“A”，“Q”，“BM”，“BA”，“BQ”和“W”；
axis : int, optional, default 0
closed : {‘right’, ‘left’}；间隔的哪一侧是关闭的，对于除“M”，“A”，“Q”，“BM”，“BA”，“BQ”和“W”之外的所有频率偏移，默认值为“左”，其默认值均为“右”
label : {‘right’, ‘left’}；用于标记bins，间隔的哪一侧是关闭的，对于除“M”，“A”，“Q”，“BM”，“BA”，“BQ”和“W”之外的所有频率偏移，默认值为“左”，其默认值均为“右”
convention : {‘start’, ‘end’, ‘s’, ‘e’}：For PeriodIndex only, controls whether to use the start or end of rule
kind: {‘timestamp’, ‘period’}, optional；Pass ‘timestamp’ to convert the resulting index to a DateTimeIndex or ‘period’ to convert it to a PeriodIndex. By default the input representation is retained.
loffset : 调整重新采样的时间标签
on : 对于DataFrame，要使用的列而不是索引进行重新采样。列必须与日期时间相似的数据。

实例

1.1 创建时间索引，并查看按时间合并后进行求和计算

#首先创建一个包含9个一分钟时间戳的系列
index = pd.date_range('1/1/2000', periods=9, freq='T')  # 生成频率为每分钟的数据
series = pd.Series(range(9), index=index)

series
Out[31]: 
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

series.resample('3T')  
# 3T就是rule，合并的方向默认是axis = 0；
Out[34]: DatetimeIndexResampler [freq=<3 * Minutes>, axis=0, closed=left, label=left, convention=start, base=0]

series.resample('3T').sum()
Out[35]: 
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64

1.2 close和left的左右


# 间隔的哪一侧是关闭的，对于除“M”，“A”，“Q”，“BM”，“BA”，“BQ”和“W”之外的所有频率偏移，默认值为“左”，其默认值均为“右”

series.resample('W')
Out[37]: DatetimeIndexResampler [freq=<Week: weekday=6>, axis=0, closed=right, label=right, convention=start, base=0]
series.resample('T')
Out[38]: DatetimeIndexResampler [freq=<Minute>, axis=0, closed=left, label=left, convention=start, base=0]

# 如果是T分钟频率，合并后的K线是从合并的时间开始计算，如果是2T，就从开始的第一根往后数2根；
series.resample('T').sum()
Out[40]: 
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

# 对比可以看出是左侧关闭的
series.resample('2T').sum()
Out[42]: 
2000-01-01 00:00:00     1
2000-01-01 00:02:00     5
2000-01-01 00:04:00     9
2000-01-01 00:06:00    13
2000-01-01 00:08:00     8
Freq: 2T, dtype: int64

# 如果是W周频率，那么整个合并的结果归集到最后一天
series = pd.Series(range(9), index=index)
series.resample('W').sum()
Out[53]: 
2000-01-02     0
2000-01-09     3
2000-01-16    12
2000-01-23    13
2000-01-30     8
Freq: W-SUN, dtype: int64

series
Out[54]: 
2000-01-01    0
2000-01-04    1
2000-01-07    2
2000-01-10    3
2000-01-13    4
2000-01-16    5
2000-01-19    6
2000-01-22    7
2000-01-25    8
Freq: 3D, dtype: int64

1.3 自定义label的方向,也就是数据归集的位置

index = pd.date_range('1/1/2000', periods=9, freq='T')  # 生成频率为每分钟的数据
series = pd.Series(range(9), index=index)
series.resample('3T', label='right').sum()

Out[57]: 
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64

1.4 自定义close的方向

index = pd.date_range('1/1/2000', periods=9, freq='T')  # 生成频率为每分钟的数据
series = pd.Series(range(9), index=index)


series
Out[63]: 
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64
# 被合并的数据的最右不显示，关闭箱间隔的右侧
series.resample('3T', closed='right').sum()
Out[64]: 
1999-12-31 23:57:00     0
2000-01-01 00:00:00     6
2000-01-01 00:03:00    15
2000-01-01 00:06:00    15
Freq: 3T, dtype: int64

1.5 上采用成30s的数据，加上asfreq()

series.resample('30S').asfreq()
Out[66]: 
2000-01-01 00:00:00    0.0
2000-01-01 00:00:30    NaN
2000-01-01 00:01:00    1.0
2000-01-01 00:01:30    NaN
2000-01-01 00:02:00    2.0
2000-01-01 00:02:30    NaN
2000-01-01 00:03:00    3.0
2000-01-01 00:03:30    NaN
2000-01-01 00:04:00    4.0
2000-01-01 00:04:30    NaN
2000-01-01 00:05:00    5.0
2000-01-01 00:05:30    NaN
2000-01-01 00:06:00    6.0
2000-01-01 00:06:30    NaN
2000-01-01 00:07:00    7.0
2000-01-01 00:07:30    NaN
2000-01-01 00:08:00    8.0
Freq: 30S, dtype: float64

1.6 将系列上采样到30秒的箱中，并使用pad方法填充NaN值

series.resample('30S').pad()[0:5]
Out[67]: 
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

1.7 将系列上上采样30秒的箱中，并使用bfill方法填充NaN值

series.resample('30S').bfill()[0:5]
Out[68]: 
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

1.8 通过apply传递自定义功能

def custom_resampler(array_like):
    return np.sum(array_like)+5


import numpy as np

series.resample('3T').apply(custom_resampler)
Out[74]: 
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64

1.9 对于具有PeriodIndex的Series，关键字约定可用于控制是否使用规则的开头或结尾。

s = pd.Series([1, 2], index=pd.period_range('2012-01-01',
                                                freq='A',
                                                periods=2))
                                                
# 上采样需要asfreq( )
s.resample('M', convention='start').asfreq().head()
Out[76]: 
2012-01    1.0
2012-02    NaN
2012-03    NaN
2012-04    NaN
2012-05    NaN
Freq: M, dtype: float64

s.resample('M', convention='end').asfreq()
Out[78]: 
2012-12    1.0
2013-01    NaN
2013-02    NaN
2013-03    NaN
2013-04    NaN
2013-05    NaN
2013-06    NaN
2013-07    NaN
2013-08    NaN
2013-09    NaN
2013-10    NaN
2013-11    NaN
2013-12    2.0
Freq: M, dtype: float64

1.10 对于DataFrame对象，关键字on可用于指定列而不是重新取样的索引

df = pd.DataFrame(data=9*[range(4)], columns=['a', 'b', 'c', 'd'])
df['time'] = pd.date_range('1/1/2000', periods=9, freq='T')
df.resample('3T', on='time').sum()

Out[81]: 
                     a  b  c  d
time                           
2000-01-01 00:00:00  0  3  6  9
2000-01-01 00:03:00  0  3  6  9
2000-01-01 00:06:00  0  3  6  9

1.11 how，未来不会用的功能

index = pd.date_range('1/1/2000', periods=9, freq='T')  # 生成频率为每分钟的数据
series = pd.Series(range(9), index=index)


series.resample('2T',how='last')
Out[83]: 
2000-01-01 00:00:00    1
2000-01-01 00:02:00    3
2000-01-01 00:04:00    5
2000-01-01 00:06:00    7
2000-01-01 00:08:00    8
Freq: 2T, dtype: int64

series.resample('2T',how='mean')
Out[84]: 
2000-01-01 00:00:00    0.5
2000-01-01 00:02:00    2.5
2000-01-01 00:04:00    4.5
2000-01-01 00:06:00    6.5
2000-01-01 00:08:00    8.0
Freq: 2T, dtype: float64

【python】详解pandas.DataFrame.resample根据时间聚合采样（一）

首先我们直接看官方的文档：

实例

1.1 创建时间索引，并查看按时间合并后进行求和计算

1.2 close和left的左右

1.3 自定义label的方向,也就是数据归集的位置

1.4 自定义close的方向

1.5 上采用成30s的数据，加上asfreq()

1.6 将系列上采样到30秒的箱中，并使用pad方法填充NaN值

1.7 将系列上上采样30秒的箱中，并使用bfill方法填充NaN值

1.8 通过apply传递自定义功能

1.9 对于具有PeriodIndex的Series，关键字约定可用于控制是否使用规则的开头或结尾。

1.10 对于DataFrame对象，关键字on可用于指定列而不是重新取样的索引

1.11 how，未来不会用的功能

猜你喜欢