Pandas groupby and then count the occurrence of 0

Blair :

Starting from this table, I am trying to fill in the missing weekly dates for each category, using the min/max weekly dates available in the dataframe. Then I want to count the weeks with 0 sales for each category.

import pandas as pd

df=pd.DataFrame({'category_id': ['aaa','aaa','aaa','aaa','bbb','bbb','bbb','ccc','ccc'],
                 'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26','2015-01-12', '2015-01-19', '2015-01-26','2015-01-05', '2015-01-12'],
                 'sales': [0,20,30,10,45,0,47,0,10]})

First step: Add the missing weekly dates to every category and fill the missing dates with 0 sales (Q1: I'm not sure how to get this df_add_missing_dates result)

# expected dates interpolation output
df_add_missing_dates=pd.DataFrame({'category_id': ['aaa','aaa','aaa','aaa','bbb','bbb','bbb','bbb','ccc','ccc','ccc','ccc'],
                                   'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
                                            '2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
                                            '2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26'],
                                   'sales': [0,20,30,10,
                                             0,45,0,47,
                                             0,10,0,0]})

Second step: Count the occurrences of 0 weekly sales (Q2: How do I aggregate sales=0 for each category?)

# expected final output
category_id | sales_0_count
aaa         | 1
bbb         | 2
ccc         | 3

Current code and logic:

# convert string to datetime and set as index
df['week'] = pd.to_datetime(df['week'], format='%Y-%m-%d')
# find min/max weekly dates in the dataframe --> I couldn't add missing dates with 0 sales though
idx = pd.period_range(start=df.week.min(),end=df.week.max(),freq='W')
df = df.reindex(idx, fill_value=0).reset_index(drop=True)
df_add_missing_dates = df
# group by category to count how many times weekly sales is 0

Scott Boston :

IIUC, you can use pd.MultiIndex.from_product with reindex and fill_value=0, then compare the result against 0 to get a boolean frame and use groupby with sum:

idx = pd.MultiIndex.from_product([df['category_id'].unique(), 
                                  df['week'].unique()], 
                                 names=['category_id', 'week'])
df_missing = (df.set_index(['category_id', 'week'])
                .reindex(idx, fill_value=0)
                .reset_index())
df_missing

Output:

   category_id        week  sales
0          aaa  2015-01-05      0
1          aaa  2015-01-12     20
2          aaa  2015-01-19     30
3          aaa  2015-01-26     10
4          bbb  2015-01-05      0
5          bbb  2015-01-12     45
6          bbb  2015-01-19      0
7          bbb  2015-01-26     47
8          ccc  2015-01-05      0
9          ccc  2015-01-12     10
10         ccc  2015-01-19      0
11         ccc  2015-01-26      0
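
Note that the index above is built from the weeks that already appear in df, so a week that is missing from every category would not be added. A minimal sketch of one way around that, assuming the weeks fall on a regular 7-day step: build the full range between the min and max week with pd.date_range. The names all_weeks and df_full are illustrative and not from the original post.

```python
import pandas as pd

df = pd.DataFrame({'category_id': ['aaa','aaa','aaa','aaa','bbb','bbb','bbb','ccc','ccc'],
                   'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26',
                            '2015-01-12', '2015-01-19', '2015-01-26',
                            '2015-01-05', '2015-01-12'],
                   'sales': [0,20,30,10,45,0,47,0,10]})
df['week'] = pd.to_datetime(df['week'], format='%Y-%m-%d')

# Every 7th day from the earliest to the latest week
# (assumption: the data uses a regular weekly step)
all_weeks = pd.date_range(df['week'].min(), df['week'].max(), freq='7D')

# Cartesian product of categories and weeks, then fill the gaps with 0 sales
idx = pd.MultiIndex.from_product([df['category_id'].unique(), all_weeks],
                                 names=['category_id', 'week'])
df_full = (df.set_index(['category_id', 'week'])
             .reindex(idx, fill_value=0)
             .reset_index())
```

For the sample data this should give the same 12 rows as above, with week as datetime values rather than strings.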

Now, group and sum:

(df_missing == 0).groupby(df_missing['category_id'])['sales'].sum()

Output:

category_id
aaa    1.0
bbb    2.0
ccc    3.0
Name: sales, dtype: float64
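
To get a result shaped like the expected output in the question, one option is to compare only the sales column and name the resulting column. This is a sketch, assuming df_missing looks like the table above (it is rebuilt here so the snippet runs on its own); zero_counts is an illustrative name, and sales_0_count is the column name from the question.

```python
import pandas as pd

# Rebuild the filled-in frame shown above so this snippet is self-contained
df_missing = pd.DataFrame({
    'category_id': ['aaa'] * 4 + ['bbb'] * 4 + ['ccc'] * 4,
    'week': ['2015-01-05', '2015-01-12', '2015-01-19', '2015-01-26'] * 3,
    'sales': [0, 20, 30, 10,
              0, 45, 0, 47,
              0, 10, 0, 0]})

# True where sales is 0; summing the booleans per category counts the zeros
zero_counts = (df_missing['sales'].eq(0)
               .groupby(df_missing['category_id'])
               .sum()
               .astype(int)
               .reset_index(name='sales_0_count'))
```

Comparing only the sales column avoids testing category_id and week against 0, and astype(int) keeps the counts as integers.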
