Pandas Study Notes (9): Pandas Time Series Data

Preface


For more articles and code details, see the blogger's personal website: https://www.iwtmbtly.com/


Import the required libraries and files:

>>> import pandas as pd
>>> import numpy as np

1. Creation of time series

(1) Four types of time variables

Types ③ and ④ may be confusing at first; some explanation follows later in this article:

[Image: table of the four types of time variables (image-20220529104826101)]

(2) Creation of time points

1. to_datetime method

Pandas allows great freedom in the input format of time points. The following statements all correctly create the same time point:

>>> pd.to_datetime('2020.1.1')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('2020 1.1')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('2020 1 1')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('2020 1-1')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('2020-1 1')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('2020-1-1')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('2020/1/1')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('1.1.2020')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('1.1 2020')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('1 1 2020')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('1 1-2020')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('1-1 2020')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('1-1-2020')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('1/1/2020')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('20200101')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('2020.0101')
Timestamp('2020-01-01 00:00:00')

The following statements will report an error:

# pd.to_datetime('2020\\1\\1')
# pd.to_datetime('2020`1`1')
# pd.to_datetime('2020.1 1')
# pd.to_datetime('1 1.2020')

In such cases, the format parameter can be used to force a match:

>>> pd.to_datetime('2020\\1\\1',format='%Y\\%m\\%d')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('2020`1`1',format='%Y`%m`%d')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('2020.1 1',format='%Y.%m %d')
Timestamp('2020-01-01 00:00:00')
>>> pd.to_datetime('1 1.2020',format='%d %m.%Y')
Timestamp('2020-01-01 00:00:00')

Passing a list produces a point-in-time index (DatetimeIndex):

>>> pd.Series(range(2),index=pd.to_datetime(['2020/1/1','2020/1/2']))
2020-01-01    0
2020-01-02    1
dtype: int64
>>> type(pd.to_datetime(['2020/1/1','2020/1/2']))
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>

For a DataFrame whose columns are named year, month, day, and so on, to_datetime can assemble them into timestamps automatically:

>>> df = pd.DataFrame({'year': [2020, 2020], 'month': [1, 1], 'day': [1, 2]})
>>> pd.to_datetime(df)
0   2020-01-01
1   2020-01-02
dtype: datetime64[ns]

2. Time precision and range limitation

In fact, the precision of Timestamp goes far beyond days and can be as fine as nanoseconds (ns):

>>> pd.to_datetime('2020/1/1 00:00:00.123456789')
Timestamp('2020-01-01 00:00:00.123456789')

The cost of this precision is a limited range: only a span of about 584 years is representable:

>>> pd.Timestamp.min
Timestamp('1677-09-21 00:12:43.145224193')
>>> pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')

3. date_range method

Generally speaking, start, end, periods (number of time points), and freq (frequency) are the most important parameters of this method; given any three of them, the remaining one is determined:

>>> pd.date_range(start='2020/1/1',end='2020/1/10',periods=3)
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-05 12:00:00',
               '2020-01-10 00:00:00'],
              dtype='datetime64[ns]', freq=None)
>>> pd.date_range(start='2020/1/1',end='2020/1/10',freq='D')
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', freq='D')
>>> pd.date_range(start='2020/1/1',periods=3,freq='D')
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')
>>> pd.date_range(end='2020/1/3',periods=3,freq='D')
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')

Among them, the freq parameter has many options; the commonly used ones are shown below, and more can be found in the pandas documentation on frequency aliases:

[Image: table of common freq options (image-20220529110254477)]

>>> pd.date_range(start='2020/1/1',periods=3,freq='T')
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 00:01:00',
               '2020-01-01 00:02:00'],
              dtype='datetime64[ns]', freq='T')
>>> pd.date_range(start='2020/1/1',periods=3,freq='M')
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31'], dtype='datetime64[ns]', freq='M')
>>> pd.date_range(start='2020/1/1',periods=3,freq='BYS')
DatetimeIndex(['2020-01-01', '2021-01-01', '2022-01-03'], dtype='datetime64[ns]', freq='BAS-JAN')

bdate_range is a method similar to date_range; its distinguishing feature is support for the weekmask and holidays parameters on top of built-in business-day logic.

Its freq accepts the special options 'C'/'CBM'/'CBMS', meaning custom frequencies, which must be used together with the weekmask and holidays parameters.

For example, suppose only Monday, Tuesday, and Friday count as working days, and some holidays must also be removed:

>>> weekmask = 'Mon Tue Fri'
>>> holidays = [pd.Timestamp('2020/1/%s'%i) for i in range(7,13)]
>>> # note the holidays parameter
>>> pd.bdate_range(start='2020-1-1',end='2020-1-15',freq='C',weekmask=weekmask,holidays=holidays)
DatetimeIndex(['2020-01-03', '2020-01-06', '2020-01-13', '2020-01-14'], dtype='datetime64[ns]', freq='C')

(3) DateOffset object

1. The difference between DateOffset and Timedelta

Timedelta represents an absolute time difference: whether standard time or daylight saving time is in effect, adding or subtracting 1 day always means exactly 24 hours.

DateOffset represents a relative time difference: no matter whether a particular day has 23, 24, or 25 hours, adding or subtracting 1 day lands on the same clock time of the adjacent day.

For example, in the Europe/Helsinki time zone used below, clocks are moved forward one hour on March 29, 2020, when daylight saving time begins:

>>> ts = pd.Timestamp('2020-3-29 01:00:00', tz='Europe/Helsinki')

>>> ts + pd.Timedelta(days=1)
Timestamp('2020-03-30 02:00:00+0300', tz='Europe/Helsinki')
>>>
>>> ts + pd.DateOffset(days=1)
Timestamp('2020-03-30 01:00:00+0300', tz='Europe/Helsinki')

This may seem like a headache, but as long as tz (the time zone) is removed, the two behave identically; the distinction only matters when time zone conversion is involved:

>>> ts = pd.Timestamp('2020-3-29 01:00:00')
>>> ts + pd.Timedelta(days=1)
Timestamp('2020-03-30 01:00:00')
>>> ts + pd.DateOffset(days=1)
Timestamp('2020-03-30 01:00:00')

2. Add or subtract a period of time

Optional parameters for DateOffset include years/months/weeks/days/hours/minutes/seconds:

>>> pd.Timestamp('2020-01-01') + pd.DateOffset(minutes=20) - pd.DateOffset(weeks=2)
Timestamp('2019-12-18 00:20:00')

3. Various commonly used offset objects

[Image: table of commonly used offset objects (image-20220529111041065)]

>>> pd.Timestamp('2020-01-01') + pd.offsets.Week(2)
Timestamp('2020-01-15 00:00:00')
>>> pd.Timestamp('2020-01-01') + pd.offsets.BQuarterBegin(1)
Timestamp('2020-03-02 00:00:00')
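The image above listed the common offset objects; as a partial reconstruction (not the full table), a few frequently used ones behave as follows:

```python
import pandas as pd

ts = pd.Timestamp('2020-01-01')
print(ts + pd.offsets.MonthEnd())    # rolls to the end of the current month
print(ts + pd.offsets.MonthBegin())  # already a month begin, so moves to the next one
print(ts + pd.offsets.BDay())        # next business day
print(ts + pd.offsets.QuarterEnd())  # end of the current quarter
```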

4. Sequence offset operation

Use the apply function:

>>> pd.Series(pd.offsets.BYearBegin(3).apply(i) for i in pd.date_range('20200101',periods=3,freq='Y'))
0   2023-01-02
1   2024-01-01
2   2025-01-01
dtype: datetime64[ns]

Offsets can also be added or subtracted directly:

>>> pd.date_range('20200101',periods=3,freq='Y') + pd.offsets.BYearBegin(3)
DatetimeIndex(['2023-01-02', '2024-01-01', '2025-01-01'], dtype='datetime64[ns]', freq=None)

To customize the offset, the weekmask and holidays parameters can be specified. (Consider why all three results below are the same: none of the three start dates is a valid custom business day, so after adjustment they all step through the same sequence of valid days.)

>>> pd.Series(pd.offsets.CDay(3,weekmask='Wed Fri',holidays=['2020-01-01']).apply(i)
...                                   for i in pd.date_range('20200105',periods=3,freq='D'))
0   2020-01-15
1   2020-01-15
2   2020-01-15
dtype: datetime64[ns]

2. Time series indexing and attributes

(1) Index slicing

>>> rng = pd.date_range('2020','2021', freq='W')
>>> ts = pd.Series(np.random.randn(len(rng)), index=rng)
>>> ts.head()
2020-01-05   -0.748400
2020-01-12    0.486114
2020-01-19    0.510675
2020-01-26    0.757519
2020-02-02   -0.839067
Freq: W-SUN, dtype: float64
>>> ts['2020-01-26']
0.757519483225889

Valid date strings are automatically converted to time points, so slices can mix formats:

>>> ts['2020-01-26':'20200726'].head()
2020-01-26    0.757519
2020-02-02   -0.839067
2020-02-09    0.448796
2020-02-16    0.420513
2020-02-23   -1.340417
Freq: W-SUN, dtype: float64

(2) Subset indexing

>>> ts['2020-7'].head()
2020-07-05   -0.887375
2020-07-12    0.068180
2020-07-19   -0.000156
2020-07-26    1.562112
Freq: W-SUN, dtype: float64

Mixed-format slices are also supported:

>>> ts['2011-1':'20200726'].head()
2020-01-05   -0.748400
2020-01-12    0.486114
2020-01-19    0.510675
2020-01-26    0.757519
2020-02-02   -0.839067
Freq: W-SUN, dtype: float64

(3) Attributes of time points

Information about time can be easily obtained using the dt object:

>>> pd.Series(ts.index).dt.isocalendar().week.head()
0    1
1    2
2    3
3    4
4    5
Name: week, dtype: UInt32
>>> pd.Series(ts.index).dt.day.head()
0     5
1    12
2    19
3    26
4     2
dtype: int64
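Beyond week and day, the dt accessor exposes many other attributes; a brief sketch on two sample dates:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2020-01-05', '2020-02-29']))
print(s.dt.month.tolist())         # calendar month
print(s.dt.dayofweek.tolist())     # Monday=0 ... Sunday=6
print(s.dt.is_leap_year.tolist())  # 2020 is a leap year
```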

Use strftime to reformat the displayed time (the format string may contain arbitrary literal text; the Chinese 间隔1/间隔2 below simply mean "separator 1"/"separator 2"):

>>> pd.Series(ts.index).dt.strftime('%Y-间隔1-%m-间隔2-%d').head()
0    2020-间隔1-01-间隔2-05
1    2020-间隔1-01-间隔2-12
2    2020-间隔1-01-间隔2-19
3    2020-间隔1-01-间隔2-26
4    2020-间隔1-02-间隔2-02
dtype: object

For datetime objects, information can be obtained directly through attributes:

>>> pd.date_range('2020','2021', freq='W').month
Int64Index([ 1,  1,  1,  1,  2,  2,  2,  2,  3,  3,  3,  3,  3,  4,  4,  4,  4,
             5,  5,  5,  5,  5,  6,  6,  6,  6,  7,  7,  7,  7,  8,  8,  8,  8,
             8,  9,  9,  9,  9, 10, 10, 10, 10, 11, 11, 11, 11, 11, 12, 12, 12,
            12],
           dtype='int64')
>>> pd.date_range('2020','2021', freq='W').weekday
Int64Index([6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 6, 6],
           dtype='int64')

3. Resampling

So-called resampling refers to the resample function, which can be regarded as the time-series version of groupby.

(1) Basic operations of the resample object

The sampling frequency is generally given as one of the offset aliases introduced above:

>>> df_r = pd.DataFrame(np.random.randn(1000, 3),index=pd.date_range('1/1/2020', freq='S', periods=1000),
...                   columns=['A', 'B', 'C'])
>>> r = df_r.resample('3min')
>>> r
<pandas.core.resample.DatetimeIndexResampler object at 0x7f82387e4340>
>>> r.sum()
                             A          B          C
2020-01-01 00:00:00  -7.516439 -27.783036 -11.448831
2020-01-01 00:03:00  -9.991624   7.390296   8.338640
2020-01-01 00:06:00   7.468198 -22.687593  10.293133
2020-01-01 00:09:00 -26.955084 -23.255671 -10.254862
2020-01-01 00:12:00   9.351612 -16.941258   9.323046
2020-01-01 00:15:00  -5.380861  -0.258748  -9.376369
>>> df_r2 = pd.DataFrame(np.random.randn(200, 3),index=pd.date_range('1/1/2020', freq='D', periods=200),
...                   columns=['A', 'B', 'C'])
>>> r = df_r2.resample('CBMS')
>>> r.sum()
                   A         B         C
2020-01-01 -2.941740  5.320574 -6.844297
2020-02-03  5.239486 -8.492715  3.398018
2020-03-02  5.122721 -6.177475  1.329978
2020-04-01 -3.582743  0.851905 -2.708295
2020-05-01  1.538799  0.209188  7.031907
2020-06-01  8.507732 -0.766705 -1.486927
2020-07-01 -2.576345  2.197384 -3.776819

(2) Aggregation after sampling

>>> r = df_r.resample('3T')
>>> r['A'].mean()
2020-01-01 00:00:00   -0.041758
2020-01-01 00:03:00   -0.055509
2020-01-01 00:06:00    0.041490
2020-01-01 00:09:00   -0.149750
2020-01-01 00:12:00    0.051953
2020-01-01 00:15:00   -0.053809
Freq: 3T, Name: A, dtype: float64
>>> r['A'].agg([np.sum, np.mean, np.std])
                           sum      mean       std
2020-01-01 00:00:00  -7.516439 -0.041758  1.031633
2020-01-01 00:03:00  -9.991624 -0.055509  1.058948
2020-01-01 00:06:00   7.468198  0.041490  0.985695
2020-01-01 00:09:00 -26.955084 -0.149750  0.942381
2020-01-01 00:12:00   9.351612  0.051953  0.933944
2020-01-01 00:15:00  -5.380861 -0.053809  1.033877

Similarly, functions/lambda expressions can be used:

>>> r.agg({'A': np.sum, 'B': lambda x: max(x)-min(x)})
                             A         B
2020-01-01 00:00:00  -7.516439  5.848965
2020-01-01 00:03:00  -9.991624  5.735483
2020-01-01 00:06:00   7.468198  5.503003
2020-01-01 00:09:00 -26.955084  5.264593
2020-01-01 00:12:00   9.351612  5.774718
2020-01-01 00:15:00  -5.380861  4.630647

(3) Iteration of sampling groups

Iterating over the sampled groups works exactly like groupby iteration, and each group can be processed individually:

>>> small = pd.Series(range(6),index=pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 00:30:00'
...                                                  , '2020-01-01 00:31:00','2020-01-01 01:00:00'
...                                                  ,'2020-01-01 03:00:00','2020-01-01 03:05:00']))
>>> resampled = small.resample('H')
>>> for name, group in resampled:
...     print("Group: ", name)
...     print("-" * 27)
...     print(group, end="\n\n")
...
Group:  2020-01-01 00:00:00
---------------------------
2020-01-01 00:00:00    0
2020-01-01 00:30:00    1
2020-01-01 00:31:00    2
dtype: int64

Group:  2020-01-01 01:00:00
---------------------------
2020-01-01 01:00:00    3
dtype: int64

Group:  2020-01-01 02:00:00
---------------------------
Series([], dtype: int64)

Group:  2020-01-01 03:00:00
---------------------------
2020-01-01 03:00:00    4
2020-01-01 03:05:00    5
dtype: int64

4. Window functions

This section introduces the two main types of window functions in pandas: rolling and expanding:

>>> s = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2020', periods=1000))
>>> s.head()
2020-01-01   -0.504213
2020-01-02   -0.481141
2020-01-03   -0.799043
2020-01-04    0.382436
2020-01-05   -1.933380
Freq: D, dtype: float64

(1) Rolling

1. Common aggregation

The rolling method specifies a window; like a groupby object, it performs no computation by itself and must be combined with an aggregation function to produce a result:

>>> s.rolling(window=50)
Rolling [window=50,center=False,axis=0,method=single]
>>> s.rolling(window=50).mean()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04         NaN
2020-01-05         NaN
                ...
2022-09-22    0.061305
2022-09-23    0.006119
2022-09-24    0.020960
2022-09-25   -0.004617
2022-09-26   -0.000460
Freq: D, Length: 1000, dtype: float64

The min_periods parameter sets the minimum number of non-missing observations required before a result is produced:

>>> s.rolling(window=50,min_periods=3).mean().head()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03   -0.594799
2020-01-04   -0.350490
2020-01-05   -0.667068
Freq: D, dtype: float64

Commonly used aggregation functions include count/sum/mean/median/min/max/std/var/skew/kurt/quantile/cov/corr.
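As a quick sketch of a few of these aggregations on a simple integer sequence (the data here is made up for illustration):

```python
import numpy as np
import pandas as pd

s2 = pd.Series(np.arange(10, dtype=float))
r2 = s2.rolling(window=3, min_periods=1)
print(r2.max().tolist())          # running window maximum
print(r2.quantile(0.5).tolist())  # rolling median via quantile
# correlation with a linearly shifted copy is 1.0 once the window fills
print(s2.rolling(3).corr(s2 + 1).tolist())
```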

2. Rolling apply aggregation

When aggregating with apply, just remember that the input is a Series of window length and the output must be a scalar. For example, the coefficient of variation (std/mean) is computed as follows:

>>> s.rolling(window=50,min_periods=3).apply(lambda x:x.std()/x.mean()).head()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03   -0.298010
2020-01-04   -1.453968
2020-01-05   -1.250537
Freq: D, dtype: float64

3. Time-based rolling

>>> s.rolling('15D').mean().head()
2020-01-01   -0.504213
2020-01-02   -0.492677
2020-01-03   -0.594799
2020-01-04   -0.350490
2020-01-05   -0.667068
Freq: D, dtype: float64

The closed parameter, one of 'right' (default), 'left', 'both', or 'neither', determines which window endpoints are included:

>>> s.rolling('15D', closed='right').sum().head()
2020-01-01   -0.504213
2020-01-02   -0.985354
2020-01-03   -1.784397
2020-01-04   -1.401961
2020-01-05   -3.335340
Freq: D, dtype: float64
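A minimal comparison of closed='right' and closed='left' on a constant series (all ones, daily index), to make the endpoint behavior visible:

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=5, freq='D')
ones = pd.Series(1.0, index=idx)

right = ones.rolling('2D', closed='right').sum()  # window (t-2D, t]: current point included
left = ones.rolling('2D', closed='left').sum()    # window [t-2D, t): current point excluded
print(right.tolist())
print(left.tolist())
```

With closed='left' the very first window contains no observations at all, so its result is NaN.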

(2) Expanding

1. The expanding function

A plain expanding window is equivalent to rolling(window=len(s), min_periods=1), i.e. a cumulative calculation over the sequence:

>>> s.rolling(window=len(s),min_periods=1).sum().head()
2020-01-01   -0.504213
2020-01-02   -0.985354
2020-01-03   -1.784397
2020-01-04   -1.401961
2020-01-05   -3.335340
Freq: D, dtype: float64
>>> s.expanding().sum().head()
2020-01-01   -0.504213
2020-01-02   -0.985354
2020-01-03   -1.784397
2020-01-04   -1.401961
2020-01-05   -3.335340
Freq: D, dtype: float64

The apply method is also available:

>>> s.expanding().apply(lambda x:sum(x)).head()
2020-01-01   -0.504213
2020-01-02   -0.985354
2020-01-03   -1.784397
2020-01-04   -1.401961
2020-01-05   -3.335340
Freq: D, dtype: float64

2. Several special Expanding type functions

>>> s.cumsum().head()
2020-01-01   -0.504213
2020-01-02   -0.985354
2020-01-03   -1.784397
2020-01-04   -1.401961
2020-01-05   -3.335340
Freq: D, dtype: float64
>>> s.cumprod().head()
2020-01-01   -0.504213
2020-01-02    0.242598
2020-01-03   -0.193846
2020-01-04   -0.074134
2020-01-05    0.143329
Freq: D, dtype: float64

shift/diff/pct_change all relate elements to their neighbors:

  • shift keeps the index unchanged and moves the values backward (forward for negative arguments)

  • diff computes the difference between each element and the one periods positions earlier; the periods parameter defaults to 1 and can be negative

  • pct_change computes the percentage change relative to the element periods positions earlier; its periods parameter works like diff's

>>> s.shift(2).head()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03   -0.504213
2020-01-04   -0.481141
2020-01-05   -0.799043
Freq: D, dtype: float64
>>> s.diff(3).head()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04    0.886649
2020-01-05   -1.452239
Freq: D, dtype: float64
>>> s.pct_change(3).head()
2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04   -1.758481
2020-01-05    3.018323
Freq: D, dtype: float64
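The three are related: diff is a subtraction against a shifted copy, and pct_change is the same comparison expressed as a ratio. A quick check on a small made-up series:

```python
import pandas as pd

x = pd.Series([10.0, 12.0, 15.0, 30.0])

# diff(k) == x - x.shift(k); Series.equals treats NaN positions as equal
print(x.diff(1).equals(x - x.shift(1)))

# pct_change(k) == x / x.shift(k) - 1
print(x.pct_change(1).equals(x / x.shift(1) - 1))
```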

Origin blog.csdn.net/qq_43300880/article/details/125029031