Datawhale Team Study: Pandas (Part 2), Time Series Data (Check-in)

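Pandas can handle time series data from any domain. Building on NumPy's datetime64 and timedelta64 dtypes, pandas consolidates a large number of features from other Python libraries such as scikits.timeseries and adds many new features for working with time series data.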

I. Creating time series

1. Four kinds of time variables

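Name	Description	Element type	How to create
Datetimes (time points / instants)	A specific date or point in time	Timestamp	to_datetime or date_range
Timespans (time spans / periods)	A period of time defined by a time point	Period	Period or period_range
Dateoffsets (relative time differences)	The relative size of a span of time (calendar-based; the wall-clock result does not shift with daylight saving time)	DateOffset	DateOffset
Timedeltas (absolute time differences)	The absolute size of a span of time (fixed duration; the wall-clock result shifts with daylight saving time)	Timedelta	to_timedelta or timedelta_range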

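For time series data, the conventional approach is to represent the time component in the index of a Series or DataFrame, so that operations can be performed on the time elements. However, Series and DataFrame can also hold time components directly as data. When passed to these constructors, datetime, timedelta, and period data receive dedicated dtype support and functionality. DateOffset data, however, is stored as object data.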

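import numpy as np
import pandas as pd
#time component in the index; the values have dtype int64
pd.Series(range(3), index=pd.date_range('2000', freq='D', periods=3))
#time component as the data itself; dtype is datetime64[ns]
pd.Series(pd.date_range('2000', freq='D', periods=3))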
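As a rough check of the dtype behavior described above, the sketch below builds one Series for each kind of time data and prints its dtype. The variable names are illustrative only; the exact datetime and timedelta resolution printed may depend on the pandas version.

```python
import pandas as pd

# Time component in the index: the values keep their own (integer) dtype
by_index = pd.Series(range(3), index=pd.date_range('2000', freq='D', periods=3))

# Time components stored as the data itself
as_datetime = pd.Series(pd.date_range('2000', freq='D', periods=3))
as_timedelta = pd.Series(pd.timedelta_range('1 day', periods=3))
as_period = pd.Series(pd.period_range('2000-01', freq='M', periods=3))

# DateOffset has no dedicated dtype, so the Series falls back to object
as_offset = pd.Series([pd.DateOffset(days=1), pd.DateOffset(days=2)])

for name, s in [('by_index', by_index), ('as_datetime', as_datetime),
                ('as_timedelta', as_timedelta), ('as_period', as_period),
                ('as_offset', as_offset)]:
    print(name, s.dtype)
```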

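2. Creating time points

Timestamped data is the most basic type of time series data: it associates values with points in time. For pandas objects, this means using time points.

(a) The to_datetime method

Pandas allows a great deal of freedom in the input format used to create a time point; the statements below all build the same time point correctly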

print(pd.to_datetime('2020.1.1'))
print(pd.to_datetime('2020 1.1'))
print(pd.to_datetime('2020 1 1'))
print(pd.to_datetime('2020 1-1'))
print(pd.to_datetime('2020-1 1'))
print(pd.to_datetime('2020-1-1'))
print(pd.to_datetime('2020/1/1'))
print(pd.to_datetime('1.1.2020'))
print(pd.to_datetime('1.1 2020'))
print(pd.to_datetime('1 1 2020'))
print(pd.to_datetime('1 1-2020'))
print(pd.to_datetime('1-1 2020'))
print(pd.to_datetime('1-1-2020'))
print(pd.to_datetime('1/1/2020'))
print(pd.to_datetime('20200101'))
print(pd.to_datetime('2020.0101'))

#pd.to_datetime('2020\\1\\1') #raises an error
#pd.to_datetime('2020`1`1') #raises an error
#pd.to_datetime('2020.1 1') #raises an error
#pd.to_datetime('1 1.2020') #raises an error

Use the format parameter to force a match

print(pd.to_datetime('2020\\1\\1',format='%Y\\%m\\%d'))
print(pd.to_datetime('2020`1`1',format='%Y`%m`%d'))
print(pd.to_datetime('2020.1 1',format='%Y.%m %d'))
print(pd.to_datetime('1 1.2020',format='%d %m.%Y'))
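A small sketch of the same idea with explicit checks: a few common spellings are parsed automatically, and the two unusual spellings are parsed with an explicit format. The names target, auto, explicit, and day_first are illustrative only.

```python
import pandas as pd

target = pd.Timestamp(2020, 1, 1)

# Common spellings are recognized without any hint
auto = [pd.to_datetime(text) for text in ('2020-1-1', '2020/1/1', '20200101')]

# Unusual separators need an explicit strptime-style format
explicit = pd.to_datetime('2020.1 1', format='%Y.%m %d')
day_first = pd.to_datetime('1 1.2020', format='%d %m.%Y')

print(auto, explicit, day_first)
```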

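A list can also be passed, which converts it to a time point index

pd.Series(range(2),index=pd.to_datetime(['2020/1/1','2020/1/2']))

Check the type

type(pd.to_datetime(['2020/1/1','2020/1/2']))

For a DataFrame whose columns hold the date components (named year, month, day, and so on), to_datetime assembles them into datetimes automatically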

df = pd.DataFrame({'year': [2020, 2020],'month': [1, 1], 'day': [1, 2]})
pd.to_datetime(df)

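(b) Time precision and range limits

Timestamp precision goes far beyond the day level and can be as fine as nanoseconds (ns); its representable range is bounded by Timestamp.min and Timestamp.max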

pd.to_datetime('2020/1/1 00:00:00.123456789')

#lower bound
print(pd.Timestamp.min)  #output:Timestamp('1677-09-21 00:12:43.145225')
#upper bound
print(pd.Timestamp.max)  #output:Timestamp('2262-04-11 23:47:16.854775807')
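To make the nanosecond precision concrete, the following sketch reads the sub-second fields of a Timestamp and adds a single nanosecond to it. The variable names are illustrative only.

```python
import pandas as pd

ts = pd.Timestamp('2020-01-01 00:00:00.123456789')

# The fractional second is split into microsecond and nanosecond fields
print(ts.microsecond, ts.nanosecond)

# Arithmetic keeps the nanosecond resolution
bumped = ts + pd.Timedelta(1, unit='ns')
print(bumped)

# The value lies inside the representable range
in_range = pd.Timestamp.min < ts < pd.Timestamp.max
```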

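(c) The date_range method

start, end, periods (the number of time points), and freq (the spacing) are the most important parameters of this method; once three of them are given, the remaining one is determined

The freq values are as follows:

Symbol	D/B	W	M/Q/Y	BM/BQ/BY	MS/QS/YS	BMS/BQS/BYS	H	T	S
Description	Day / business day	Week	Month / quarter / year end	Business month / quarter / year end	Month / quarter / year start	Business month / quarter / year start	Hour	Minute	Second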
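The "three out of four" rule can be sketched as follows, using only the day and week frequencies. The variable names are illustrative only.

```python
import pandas as pd

# start + periods + freq: end is derived
a = pd.date_range(start='2020-01-01', periods=3, freq='D')

# end + periods + freq: start is derived
b = pd.date_range(end='2020-01-10', periods=3, freq='D')

# start + end + periods: the points are evenly spaced, freq is derived
c = pd.date_range(start='2020-01-01', end='2020-01-10', periods=4)

# start + end + freq: the number of points is derived ('W' anchors on Sundays)
d = pd.date_range(start='2020-01-01', end='2020-01-31', freq='W')

for rng in (a, b, c, d):
    print(list(rng.strftime('%Y-%m-%d')))
```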

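3. DateOffset objects

(a) The difference between DateOffset and Timedelta

Timedelta is an absolute time difference: whether in winter time or daylight saving time, adding or subtracting 1 day always means exactly 24 hours

DateOffset is a relative time difference: whether a day lasts 23, 24, or 25 hours, adding or subtracting 1 day lands on the same wall-clock time as the original day

For example, daylight saving time began on 29 March 2020. In the UK, clocks moved forward one hour from 01:00:00 to 02:00:00 local time; in the Europe/Helsinki zone used below, the same change happened at 03:00:00 local time, so the day starting at the example timestamp lasts only 23 hours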

ts = pd.Timestamp('2020-3-29 01:00:00', tz='Europe/Helsinki')
ts + pd.Timedelta(days=1)

ts = pd.Timestamp('2020-3-29 01:00:00', tz='Europe/Helsinki')
ts + pd.DateOffset(days=1)

Removing the tz attribute makes the two results agree.

(b) Adding or subtracting a span of time
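The difference can be checked directly. This sketch compares the local hour after adding one day with each type, and then repeats the comparison on a timezone-naive timestamp. The variable names are illustrative only.

```python
import pandas as pd

ts = pd.Timestamp('2020-03-29 01:00:00', tz='Europe/Helsinki')

# Timedelta: exactly 24 elapsed hours, so the local clock reads one hour later
absolute = ts + pd.Timedelta(days=1)

# DateOffset: same wall-clock time on the next calendar day
relative = ts + pd.DateOffset(days=1)

print(absolute)
print(relative)

# Without a timezone there is no DST transition, so both agree
naive = pd.Timestamp('2020-03-29 01:00:00')
same_without_tz = (naive + pd.Timedelta(days=1)) == (naive + pd.DateOffset(days=1))
```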

pd.Timestamp('2020-01-01') + pd.DateOffset(minutes=20) - pd.DateOffset(weeks=2)

(c) Commonly used offset objects

pd.Timestamp('2020-01-01') + pd.offsets.Week(2)  #add two weeks
pd.Timestamp('2020-01-01') + pd.offsets.BQuarterBegin(1)  #start of the next business quarter
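A few more offsets with easily verified results, as a minimal sketch. MonthEnd and BDay are other classes in pd.offsets, chosen here for illustration; 2020-01-03 was a Friday.

```python
import pandas as pd

# Two calendar weeks later
two_weeks = pd.Timestamp('2020-01-01') + pd.offsets.Week(2)

# Roll forward to the end of the current month
month_end = pd.Timestamp('2020-01-01') + pd.offsets.MonthEnd(1)

# One business day after a Friday skips the weekend
next_bday = pd.Timestamp('2020-01-03') + pd.offsets.BDay(1)

print(two_weeks, month_end, next_bday)
```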

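(d) Offset operations on a sequence

Element by element (older pandas versions provided offset.apply(timestamp) for this; recent versions have removed it in favor of timestamp + offset)

pd.Series(i + pd.offsets.BYearBegin(3) for i in pd.date_range('20200101',periods=3,freq='Y'))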

Adding or subtracting the offset object directly

pd.date_range('20200101',periods=3,freq='Y') + pd.offsets.BYearBegin(3)

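Custom offsets can be defined by specifying the weekmask and holidays parameters

#holidays normally takes one or more complete dates; the value below is kept as given
pd.Series(i + pd.offsets.CDay(3,weekmask='Wed Fri',holidays='2020010')
                                  for i in pd.date_range('20200105',periods=3,freq='D'))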
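A minimal sketch of a custom business day, assuming a calendar where only Wednesdays and Fridays count as working days. The holiday date 2020-01-08 is a hypothetical value chosen for illustration, and CustomBusinessDay is the full name of CDay.

```python
import pandas as pd

# Only Wednesday and Friday are working days
wed_fri = pd.offsets.CustomBusinessDay(n=1, weekmask='Wed Fri')

# Same calendar, but Wednesday 2020-01-08 is declared a holiday (hypothetical)
wed_fri_holiday = pd.offsets.CustomBusinessDay(
    n=1, weekmask='Wed Fri', holidays=['2020-01-08'])

sunday = pd.Timestamp('2020-01-05')

# Next working day after the Sunday
plain = sunday + wed_fri

# The holiday is skipped, so the next working day is the Friday
skipping = sunday + wed_fri_holiday

print(plain, skipping)
```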

II. Indexing and attributes of time series

1. Index slicing

rng = pd.date_range('2020','2021', freq='W')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts['2020-01-26':'20200726'].head() #from 01-26 to 07-26; the strings are parsed into proper dates automatically

2. Subset indexing

#select only the data for July
ts['2020-7'].head()
#slicing with mixed date formats is supported
ts['2011-1':'20200726'].head()
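The same selections on a deterministic series, so that the results can be checked. This is a sketch that uses .loc for the partial-string lookups; the variable names are illustrative only.

```python
import pandas as pd

rng = pd.date_range('2020', '2021', freq='W')   # every Sunday of 2020
ts = pd.Series(range(len(rng)), index=rng)

# Partial string: every point that falls in July 2020
july = ts.loc['2020-07']

# Slice with two differently formatted date strings (both ends inclusive)
window = ts.loc['2020-01-26':'20200726']

print(july)
print(window.index[0], window.index[-1])
```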

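3. Attributes of time points

The dt accessor makes it easy to extract information about the time

#ISO week number of each point (the series holds the 52 weekly points of 2020)
pd.Series(ts.index).dt.isocalendar().week
#day of the month on which each weekly point falls
pd.Series(ts.index).dt.day

Use strftime to change the time format

pd.Series(ts.index).dt.strftime('%Y-sep1-%m-sep2-%d').head()

For a DatetimeIndex, the same information is available directly through attributes

#month in which each weekly point falls
pd.date_range('2020','2021', freq='W').month
#day of the week of each weekly point
pd.date_range('2020','2021', freq='W').weekday #The number of the day of the week with Monday=0, Sunday=6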
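The attributes above can be verified on the weekly index itself. This sketch assumes a pandas version that provides isocalendar() on DatetimeIndex; the variable names are illustrative only.

```python
import pandas as pd

idx = pd.date_range('2020', '2021', freq='W')

# Weekly points generated with freq='W' fall on Sundays (weekday 6)
all_sundays = bool((idx.weekday == 6).all())

# ISO calendar information comes back as a DataFrame with year/week/day columns
weeks = idx.isocalendar().week

# Formatting is available directly on the index as well
labels = idx.strftime('%Y/%m/%d')

print(all_sundays, int(weeks.iloc[0]), int(weeks.iloc[-1]), labels[0])
```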

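III. Resampling

Resampling refers to the resample function, which can be viewed as the time series version of groupby

1. Basic operations on the resample object

The sampling frequency is usually set with the offset strings mentioned above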

df_r = pd.DataFrame(np.random.randn(1000, 3),index=pd.date_range('1/1/2020', freq='S', periods=1000),
                  columns=['A', 'B', 'C'])
r = df_r.resample('3min')
r.sum()

2. Aggregation after resampling

df_r = pd.DataFrame(np.random.randn(1000, 3),index=pd.date_range('1/1/2020', freq='S', periods=1000),
                  columns=['A', 'B', 'C'])
r = df_r.resample('3T')

#a single aggregate
r['A'].mean()
#several aggregates at once
r['A'].agg([np.sum, np.mean, np.std])
#a different function per column, including a lambda
r.agg({'A': np.sum,'B': lambda x: max(x)-min(x)})
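With deterministic data the bin contents become easy to reason about: 1000 one-second observations fall into five full 3-minute bins of 180 rows and a last bin of 100 rows. The sketch below assumes a constant column A and a running counter B, both chosen for illustration, and passes aggregation names as strings.

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=1000, freq=pd.Timedelta(seconds=1))
df = pd.DataFrame({'A': 1, 'B': range(1000)}, index=idx)

r = df.resample('3min')

# A: number of rows per bin (sum of ones); B: value range within each bin
out = r.agg({'A': 'sum', 'B': lambda x: x.max() - x.min()})
print(out)
```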

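3. Iterating over resampled groups

Iterating over resampled groups works just like iterating over groupby groups: each group can be processed separately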

small = pd.Series(range(6),index=pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 00:30:00'
                                                 , '2020-01-01 00:31:00','2020-01-01 01:00:00'
                                                 ,'2020-01-01 03:00:00','2020-01-01 03:05:00']))
resampled = small.resample('H')
for name, group in resampled:
    print("Group: ", name)
    print("-" * 27)
    print(group, end="\n\n")

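IV. Window functions

1. Rolling

(a) Common aggregations

s = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2020', periods=1000))
#the Rolling object itself (nothing is computed yet)
s.rolling(window=50)
#rolling mean over a 50-point window
s.rolling(window=50).mean()
#min_periods is the minimum number of non-missing observations required in a window
s.rolling(window=50,min_periods=3).mean()

In addition, count/sum/mean/median/min/max/std/var/skew/kurt/quantile/cov/corr are all commonly used aggregation functions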
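The effect of min_periods is easiest to see on a tiny series. A minimal sketch with a window of 3; the variable names are illustrative only.

```python
import pandas as pd

s_small = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: a result only appears once the window holds 3 observations
full = s_small.rolling(window=3).mean()

# min_periods=1: partial windows at the start are averaged as well
partial = s_small.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({'full': full, 'partial': partial}))
```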

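(b) Aggregating with apply on a rolling window

When aggregating with apply, just remember that the function receives a Series holding the current window and must return a scalar

#compute the coefficient of variation
s.rolling(window=50,min_periods=3).apply(lambda x:x.std()/x.mean()).head()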
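On a small series the coefficient of variation can be checked by hand: for the window [1, 2, 3] the sample standard deviation is 1 and the mean is 2. This sketch passes raw=False explicitly so that each window arrives as a Series; the variable names are illustrative only.

```python
import pandas as pd

s_small = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each call receives one window as a Series and returns a single number
cv = s_small.rolling(window=3).apply(lambda x: x.std() / x.mean(), raw=False)
print(cv)
```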

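(c) Time-based rolling

The optional closed parameter, one of 'right' (default), 'left', 'both', or 'neither', determines which endpoints of the window are included

s.rolling('15D').mean().head()
#with closed specified explicitly
s.rolling('15D', closed='right').sum().head()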
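To see what closed changes, the sketch below sums a daily series of ones over a 2-day window, so each result is simply the number of points inside the window. The variable names are illustrative only.

```python
import pandas as pd

ones = pd.Series(1.0, index=pd.date_range('2020-01-01', periods=5, freq='D'))

# (t - 2D, t]: the current day and the previous day
right = ones.rolling('2D', closed='right').sum()

# [t - 2D, t]: the left endpoint is included too, so one more day can fit
both = ones.rolling('2D', closed='both').sum()

print(pd.DataFrame({'right': right, 'both': both}))
```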

2. Expanding

(a) The expanding function

A plain expanding call is equivalent to rolling(window=len(s),min_periods=1): it computes cumulative statistics over the series, and apply works as well

#rolling
s.rolling(window=len(s),min_periods=1).sum().head()
#expanding
s.expanding().sum().head()
#apply
s.expanding().apply(lambda x:sum(x)).head()
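The equivalence can be confirmed on a small series; the cumulative sum from cumsum gives the same numbers. A minimal sketch with illustrative variable names.

```python
import pandas as pd

s_small = pd.Series([1.0, 2.0, 3.0, 4.0])

via_rolling = s_small.rolling(window=len(s_small), min_periods=1).sum()
via_expanding = s_small.expanding().sum()
via_cumsum = s_small.cumsum()

print(pd.DataFrame({'rolling': via_rolling,
                    'expanding': via_expanding,
                    'cumsum': via_cumsum}))
```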

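(b) Some special expanding-style functions

cumsum/cumprod/cummax/cummin are all special cumulative (expanding) computations

shift/diff/pct_change all relate an element to its neighbors

① shift keeps the index unchanged but moves the values toward later positions

② diff is the difference between an element and an earlier one; the periods parameter sets the gap, defaults to 1, and may be negative

③ pct_change is the percentage change between an element and an earlier one; its periods parameter works like that of diff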
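A short sketch of the three functions on a series that doubles at every step, which makes the expected values easy to read off. The variable names are illustrative only.

```python
import pandas as pd

s_small = pd.Series([1.0, 2.0, 4.0, 8.0])

shifted = s_small.shift(1)          # values move one position later
delta = s_small.diff()              # current minus previous
delta_back = s_small.diff(-1)       # current minus next (negative periods)
growth = s_small.pct_change()       # (current - previous) / previous

print(pd.DataFrame({'shift': shifted, 'diff': delta,
                    'diff(-1)': delta_back, 'pct_change': growth}))
```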
