Pandas window function rolling and expanding usage instructions

Pandas window function rolling and expanding

1. rolling mobile window

rolling() is a moving window function that can be used with aggregate functions such as mean, count, sum, median, std, etc. For ease of use, Pandas defines special method aggregation methods for moving functions, such as rolling_mean(), rolling_count(), rolling_sum(), etc.
Its syntax format is as follows:

rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None)

(1) Parameters:

window: Indicates the size of the time window, and has two forms:
1) Use the value int to indicate the number of observations, that is, the number of data forward;
2) You can also use the offset type, which is more complicated and uses more scenarios Less, no introduction here;
min_periods: The minimum number of observations contained in each window, and the result of the window smaller than this value is NA. Value can be int, default None. In the case of offset, the default is 1;
center: set the label of the window to center, Boolean, default False, right
win_type: The type of window. Various functions for intercepting windows. String type, default is None;
on: optional parameter. For a dataframe, specifies the columns over which to calculate the tumbling window. The value is the column name.
axis: The default is 0, that is, calculate the column
closed: defines the opening and closing of the interval, and supports int type window. For the offset type, the default is left open and right closed, that is, the default is right. It can be specified as left, both, etc. according to the situation.

(2) Use case:

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np

seed = 8
s = pd.Series(range(seed), index=pd.date_range('2023-01-01', periods=seed, freq='1D'))
print(s)

print(s.rolling(window=3).sum())
print(s.rolling(window=3, min_periods=1).sum())
print(s.rolling(window='3D').sum())

The result is as follows:

2023-01-01    0
2023-01-02    1
2023-01-03    2
2023-01-04    3
2023-01-05    4
2023-01-06    5
2023-01-07    6
2023-01-08    7
Freq: D, dtype: int64
2023-01-01     NaN
2023-01-02     NaN
2023-01-03     3.0
2023-01-04     6.0
2023-01-05     9.0
2023-01-06    12.0
2023-01-07    15.0
2023-01-08    18.0
Freq: D, dtype: float64
2023-01-01     0.0
2023-01-02     1.0
2023-01-03     3.0
2023-01-04     6.0
2023-01-05     9.0
2023-01-06    12.0
2023-01-07    15.0
2023-01-08    18.0
Freq: D, dtype: float64
2023-01-01     0.0
2023-01-02     1.0
2023-01-03     3.0
2023-01-04     6.0
2023-01-05     9.0
2023-01-06    12.0
2023-01-07    15.0
2023-01-08    18.0
Freq: D, dtype: float64

print(s.rolling(window=3).sum())
window width is 3, there is no data for two days before January 1st, it is a null value, so the results on the 1st and 2nd are NaN

2023-01-01     NaN
2023-01-02     NaN
2023-01-03     3.0

print(s.rolling(window=3, min_periods=1).sum())
In order to solve the above problem, use min_periods, if there is no data, the minimum window is 1, so there are data on the 1st and 2nd.

2023-01-01     0.0
2023-01-02     1.0
2023-01-03     3.0

print(s.rolling(window='3D').sum())
The index is the date, and the index date is 3 days as the window width, which has the same effect as using the min_periods=1 parameter.

(3) Usage expansion

In addition to sum(), the rolling() function also supports many functions, such as:
count() the number of non-null observations
mean() the average value of values
median() the arithmetic median of values
min() minimum value
max() maximum
std( ) Bessel corrected sample standard deviation
var() unbiased variance
skew() sample skewness (third moment)
kurt() sample kurtosis (fourth moment)
quantile() sample quantile (percentile value)
cov() unbiased covariance (binary)
corr() correlation (binary)

Can also be used with the agg aggregate function

df = pd.DataFrame({"A": range(5), "B": range(10, 15)})

df.rolling(window=2).agg([np.sum, np.mean, np.std])

The result is as follows:
insert image description here

2, expanding extended window function

expanding() expands the window function. Expansion refers to starting from the first element of the sequence and calculating the aggregate value of the elements backward one by one.

(1) Parameters:

The parameters of the expanding() function are used in the same way as the parameters of the rolling() function; there is no window parameter . The window size is not fixed, and the cumulative calculation is realized, that is, continuous expansion.
The rolling() function is a fixed window size for sliding calculations, and the expanding() function only sets the minimum number of observations.

The expanding() function, similar to the cumulative summation of the cumsum() function, has the advantage that more clustering calculations can be performed;

(2) Use case:

seed =21
df = pd.DataFrame({'a':np.random.randint(20,500,seed),'b':np.random.randint(20,100,seed)}, index=pd.date_range('2020-01-01', periods=seed, freq='1D'))
df.index.name='testdate'

print(df)
print(df.expanding(min_periods=7).max())
print(df.expanding(min_periods=7).min())

The minimum window is 7 days, so the maximum and minimum values, 1st to 6th, are all NaN.
The window is continuously extended from 7 days to all dataframe records.
column a:

2020-01-01 492,
until the last day, exceeding the maximum value of the first day, and the max value changed to
2020-01-21 499.0
The same is true for the maximum value of column b.

The result is as follows:

              a   b
testdate           
2020-01-01  492  80
2020-01-02  399  73
2020-01-03  309  58
2020-01-04   66  38
2020-01-05  318  20
2020-01-06  488  29
2020-01-07  328  33
2020-01-08  347  25
2020-01-09  240  47
2020-01-10   80  71
2020-01-11  265  54
2020-01-12   33  65
2020-01-13  396  61
2020-01-14   71  54
2020-01-15   39  49
2020-01-16  189  48
2020-01-17   79  67
2020-01-18  303  62
2020-01-19  482  98
2020-01-20  208  27
2020-01-21  499  43
                a     b
testdate               
2020-01-01    NaN   NaN
2020-01-02    NaN   NaN
2020-01-03    NaN   NaN
2020-01-04    NaN   NaN
2020-01-05    NaN   NaN
2020-01-06    NaN   NaN
2020-01-07  492.0  80.0
2020-01-08  492.0  80.0
2020-01-09  492.0  80.0
2020-01-10  492.0  80.0
2020-01-11  492.0  80.0
2020-01-12  492.0  80.0
2020-01-13  492.0  80.0
2020-01-14  492.0  80.0
2020-01-15  492.0  80.0
2020-01-16  492.0  80.0
2020-01-17  492.0  80.0
2020-01-18  492.0  80.0
2020-01-19  492.0  98.0
2020-01-20  492.0  98.0
2020-01-21  499.0  98.0
               a     b
testdate              
2020-01-01   NaN   NaN
2020-01-02   NaN   NaN
2020-01-03   NaN   NaN
2020-01-04   NaN   NaN
2020-01-05   NaN   NaN
2020-01-06   NaN   NaN
2020-01-07  66.0  20.0
2020-01-08  66.0  20.0
2020-01-09  66.0  20.0
2020-01-10  66.0  20.0
2020-01-11  66.0  20.0
2020-01-12  33.0  20.0
2020-01-13  33.0  20.0
2020-01-14  33.0  20.0
2020-01-15  33.0  20.0
2020-01-16  33.0  20.0
2020-01-17  33.0  20.0
2020-01-18  33.0  20.0
2020-01-19  33.0  20.0
2020-01-20  33.0  20.0
2020-01-21  33.0  20.0

(3) Usage expansion:

df = pd.DataFrame(
    np.random.rand(6, 4),
    index=pd.date_range("2022-01-01", periods=6),
    columns=["A", "B", "C", "D"],
)
print(df)
print (df.expanding(min_periods=3).mean())

print('mini window 3 mean result:')
print(df.iloc[0:3].mean())

print('last mean result')
print(df['A'].mean(),df['B'].mean(),df['C'].mean(),df['D'].mean())

The average value of 3 days, equivalent to df.iloc[0:3].mean().
6-day mean, with df['A'].mean(), df['B'].mean(), df['C'].mean(), df['D'].mean() Equivalent, or df.mean() equivalent.
The result is as follows:

                   A         B         C         D
2022-01-01  0.697834  0.100287  0.652869  0.207896
2022-01-02  0.495769  0.010228  0.033768  0.311194
2022-01-03  0.404906  0.814433  0.447700  0.369165
2022-01-04  0.148014  0.980413  0.869525  0.760739
2022-01-05  0.925820  0.322119  0.363028  0.927978
2022-01-06  0.988882  0.997867  0.419070  0.276822
                   A         B         C         D
2022-01-01       NaN       NaN       NaN       NaN
2022-01-02       NaN       NaN       NaN       NaN
2022-01-03  0.532836  0.308316  0.378112  0.296085
2022-01-04  0.436631  0.476340  0.500966  0.412248
2022-01-05  0.534469  0.445496  0.473378  0.515394
2022-01-06  0.610204  0.537558  0.464327  0.475632
mini window 3 mean result:
A    0.532836
B    0.308316
C    0.378112
D    0.296085
dtype: float64
last mean result
0.6102041426142952 0.5375578633503776 0.4643265893696882 0.47563227350705956

Understand the meaning of the extended window again:
it means to calculate the average of at least 3 numbers, the calculation method is
A2=(A0+A1+A2)/3,
A3=(A0+A1+A2+A3)/4
A4=(A0+ A1+A2+A3+A4)/5
A5 A6 and so on.

df = pd.DataFrame({'a':range(5)})
print(df)
print(df.rolling(window=len(df), min_periods=1).mean())
print(df.expanding(min_periods=1).mean())

When the parameter window=len(df) of the rolling() function, the effect is the same as that of the expanding() function. The result is as follows: