Blair :
Starting from this table, I want to fill in the missing weekly dates for every category, using the min/max weekly dates available in the dataframe, and then count the occurrences of 0 sales for each category.
import pandas as pd

df = pd.DataFrame({'category_id': ['aaa','aaa','aaa','aaa','bbb','bbb','bbb','ccc','ccc'],
                   'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26','2015-01-12', '2015-01-19', '2015-01-26','2015-01-05', '2015-01-12'],
                   'sales': [0,20,30,10,45,0,47,0,10]})
First step: add the missing weekly dates to every category and fill the sales for those dates with 0 (Q1: I'm not sure how to get this df_add_missing_dates result)
# expected dates interpolation output
df_add_missing_dates=pd.DataFrame({'category_id': ['aaa','aaa','aaa','aaa','bbb','bbb','bbb','bbb','ccc','ccc','ccc','ccc'],
'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
'2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
'2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26'],
'sales': [0,20,30,10,
0,45,0,47,
0,10,0,0]})
Second step: count the occurrences of 0 weekly sales (Q2: how do I count the sales == 0 rows for each category?)
# expected final output
category_id | sales_0_count
aaa | 1
bbb | 2
ccc | 3
Current code and logic:
# convert string to datetime
df['week'] = pd.to_datetime(df['week'], format='%Y-%m-%d')
# find min/max weekly dates in the dataframe --> I couldn't add missing dates with 0 sales though
idx = pd.period_range(start=df.week.min(),end=df.week.max(),freq='W')
df = df.reindex(idx, fill_value=0).reset_index(drop=True)
df_add_missing_dates = df
# group by category to count how many times weekly sales is 0
Scott Boston :
IIUC, you can use pd.MultiIndex.from_product with reindex and fill_value=0, then use a boolean matrix and groupby with sum:
idx = pd.MultiIndex.from_product([df['category_id'].unique(),
df['week'].unique()],
names=['category_id', 'week'])
df_missing = (df.set_index(['category_id', 'week'])
.reindex(idx, fill_value=0)
.reset_index())
df_missing
Output:
category_id week sales
0 aaa 2015-01-05 0
1 aaa 2015-01-12 20
2 aaa 2015-01-19 30
3 aaa 2015-01-26 10
4 bbb 2015-01-05 0
5 bbb 2015-01-12 45
6 bbb 2015-01-19 0
7 bbb 2015-01-26 47
8 ccc 2015-01-05 0
9 ccc 2015-01-12 10
10 ccc 2015-01-19 0
11 ccc 2015-01-26 0
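One caveat: df['week'].unique() only contains weeks that appear for at least one category, so a week missing from every category would not be added. The question asks for the full min/max range, so here is a minimal sketch of that variant. It assumes the weeks are always exactly 7 days apart (the sample dates are all Mondays); the names all_weeks, full_idx and df_full are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'category_id': ['aaa', 'aaa', 'aaa', 'aaa', 'bbb', 'bbb', 'bbb', 'ccc', 'ccc'],
    'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
             '2015-01-12', '2015-01-19', '2015-01-26',
             '2015-01-05', '2015-01-12'],
    'sales': [0, 20, 30, 10, 45, 0, 47, 0, 10],
})
df['week'] = pd.to_datetime(df['week'], format='%Y-%m-%d')

# Assumption: weeks are spaced exactly 7 days apart, so step from the
# earliest to the latest week in 7-day increments.
all_weeks = pd.date_range(df['week'].min(), df['week'].max(), freq='7D')

# Every (category, week) combination over the full range
full_idx = pd.MultiIndex.from_product(
    [df['category_id'].unique(), all_weeks],
    names=['category_id', 'week'],
)

# Combinations absent from the original data get sales = 0
df_full = (df.set_index(['category_id', 'week'])
             .reindex(full_idx, fill_value=0)
             .reset_index())
```

For the sample data this should give the same 12 rows as above, since every week in the range already appears for at least one category.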
Now, group and sum:
(df_missing == 0).groupby(df_missing['category_id'])['sales'].sum()
Output:
category_id
aaa 1.0
bbb 2.0
ccc 3.0
Name: sales, dtype: float64
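If you want the result in the category_id | sales_0_count shape from the question, one possible variant is to compare only the sales column and rename the result. This is a sketch, not the only way to do it; df_missing is rebuilt here by hand so the snippet runs on its own, and sales_0_count is the column name taken from the question's expected output.

```python
import pandas as pd

# Same content as df_missing above, rebuilt so the snippet is self-contained
df_missing = pd.DataFrame({
    'category_id': ['aaa'] * 4 + ['bbb'] * 4 + ['ccc'] * 4,
    'week': pd.to_datetime(['2015-01-05', '2015-01-12',
                            '2015-01-19', '2015-01-26'] * 3),
    'sales': [0, 20, 30, 10, 0, 45, 0, 47, 0, 10, 0, 0],
})

# Boolean Series (True where sales is 0), summed per category
sales_0_count = (df_missing['sales'].eq(0)
                 .groupby(df_missing['category_id'])
                 .sum()
                 .astype(int)              # force integer counts
                 .rename('sales_0_count')
                 .reset_index())
```

Comparing just the sales column avoids evaluating == 0 against the category_id and week columns, and astype(int) keeps the counts as integers regardless of pandas version.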