xiaodai :
Say I have dataset that is index on date
id, date, col1, col2
1, 4, 1, 12
1, 5, 2, 13
1, 6, 6, 14
2, 4, 20, 16
2, 5, 8, 17
2, 6, 11, 18
...
and I wish to compute the rolling mean, sum, min, max
for col1
and col2
grouped by id
, with window size 2 and 3. I can do that in a loop like so
def multi_rolling(df, winsize, column):
[df.groupby("id")[column].rolling(winsize).mean(),
df.groupby("id")[column].rolling(winsize).sum(),
df.groupby("id")[column].rolling(winsize).min(),
df.groupby("id")[column].rolling(winsize).max(),
df.groupby("id")[column].rolling(winsize).count()]
Then I just have to call the above in a loop. But this feels inefficient. Is there a way to call it on all combinations of all functions and all columns and all window size more efficiently? E.g. run them in parallel?
Chris :
Use pandas.DataFrame.agg
:
new_df = df.groupby("id").rolling(2)[["col1","col2"]].agg(['mean','sum','min','max','count'])
print(new_df)
Output:
col1 col2 \
mean sum min max count mean
col1 col2 col1 col2 col1 col2 col1 col2 col1 col2 col1 col2
id
1 0 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 NaN NaN
1 1.5 12.5 3.0 25.0 1.0 12.0 2.0 13.0 2.0 2.0 1.5 12.5
2 4.0 13.5 8.0 27.0 2.0 13.0 6.0 14.0 2.0 2.0 4.0 13.5
2 3 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 NaN NaN
4 14.0 16.5 28.0 33.0 8.0 16.0 20.0 17.0 2.0 2.0 14.0 16.5
5 9.5 17.5 19.0 35.0 8.0 17.0 11.0 18.0 2.0 2.0 9.5 17.5
sum min max count
col1 col2 col1 col2 col1 col2 col1 col2
id
1 0 NaN NaN NaN NaN NaN NaN 1.0 1.0
1 3.0 25.0 1.0 12.0 2.0 13.0 2.0 2.0
2 8.0 27.0 2.0 13.0 6.0 14.0 2.0 2.0
2 3 NaN NaN NaN NaN NaN NaN 1.0 1.0
4 28.0 33.0 8.0 16.0 20.0 17.0 2.0 2.0
5 19.0 35.0 8.0 17.0 11.0 18.0 2.0 2.0