Grouping by key and aggregating with custom criteria

Michał Przybylak :

I have a data frame as follows:

import pandas as pd

df = pd.DataFrame([['A', 'a', 'web'],
                   ['A', 'b', 'mobile'],
                   ['B', 'c', 'web'],
                   ['C', 'd', 'web'],
                   ['D', 'e', 'mobile'],
                   ['D', 'f', 'web'],
                   ['D', 'g', 'web'],
                   ['D', 'g', 'web']],
                  columns=['seller_id', 'item_id', 'selling_channel'])

It shows sold items, with information about who the seller was and which selling channel was used to sell each item (in the example above it can be web or mobile, but the real data has more potential channels).

I would like to determine which selling channel is the main one for a given seller_id, subject to these additional constraints:

  1. if one of the channels was used for 75% or more of the seller's sales, that channel is the main one
  2. if no channel reaches 75%, the name of the main channel should be mixed

So for the input above I am expecting the following output:

expected = pd.DataFrame([['A', 'mixed'],
                         ['B', 'web'],
                         ['C', 'web'],
                         ['D', 'web']],
                        columns=['seller_id', 'main_selling_channel'])
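
To make the two rules concrete, here is a minimal sketch that applies them to a single seller's channels. The helper name main_channel and its threshold parameter are made up for illustration; this only demonstrates the rule and is not meant as an efficient solution for the whole frame.

```python
import pandas as pd


def main_channel(channels, threshold=0.75):
    """Return the dominant channel, or 'mixed' if none reaches the threshold.

    Hypothetical helper, used only to illustrate the rule for one seller.
    """
    # Share of each channel among this seller's sales.
    shares = channels.value_counts(normalize=True)
    top = shares.idxmax()
    return top if shares[top] >= threshold else 'mixed'


# Seller D from the sample data: one mobile sale and three web sales.
result_d = main_channel(pd.Series(['mobile', 'web', 'web', 'web']))

# Seller A from the sample data: one web sale and one mobile sale.
result_a = main_channel(pd.Series(['web', 'mobile']))
```

Seller D has web in 3 of 4 sales, which is exactly 75%, so rule 1 applies; seller A is split 50/50, so rule 2 applies.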

Right now I do this by manually iterating over every row of the dataframe to build a map that lists, per seller_id, each channel and how many times it occurred. Then I iterate over that data again to determine which channel is the main one. But this manual iteration already takes a lot of time with 10k lines of input, and the actual data contains a couple of million entries.

I was wondering if there is an efficient way of doing that with the pandas API instead of manual iteration?

anky_91 :

Here is one way: use df.groupby with value_counts(normalize=True) to get each channel's share within each seller's group, then check whether that share is greater than or equal to 0.75. Next, use np.where to set the channels for which that check returns False to mixed. Finally, df.groupby() with idxmax keeps one row per seller: the channel that passed the check if there is one, else a mixed row.

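import numpy as np

# True where a channel's share of the seller's sales is at least 75%
a = (df.groupby('seller_id')['selling_channel'].value_counts(normalize=True).ge(0.75)
       .rename('Pct').reset_index())

# Label channels below the threshold as 'mixed', then keep one row per seller
out = (a.assign(selling_channel=np.where(a['Pct'], a['selling_channel'], 'mixed'))
       .loc[lambda x: x.groupby('seller_id')['Pct'].idxmax()].drop(columns='Pct'))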

print(out)

  seller_id selling_channel
0         A           mixed
2         B             web
3         C             web
4         D             web
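
As a possible alternative, the same rule can be sketched with pd.crosstab, which builds a seller-by-channel table of shares in one step. This is not the approach used above, just another way to express it; the 0.75 threshold and the output column name main_selling_channel are taken from the question, and the intermediate names shares, top_share and top_channel are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame([['A', 'a', 'web'],
                   ['A', 'b', 'mobile'],
                   ['B', 'c', 'web'],
                   ['C', 'd', 'web'],
                   ['D', 'e', 'mobile'],
                   ['D', 'f', 'web'],
                   ['D', 'g', 'web'],
                   ['D', 'g', 'web']],
                  columns=['seller_id', 'item_id', 'selling_channel'])

# One row per seller, one column per channel; each row sums to 1.
shares = pd.crosstab(df['seller_id'], df['selling_channel'], normalize='index')

# Largest share per seller and the channel that holds it.
top_share = shares.max(axis=1)
top_channel = shares.idxmax(axis=1)

# Keep the top channel where it reaches 75%, otherwise label the seller 'mixed'.
out = (top_channel.where(top_share >= 0.75, 'mixed')
       .rename('main_selling_channel')
       .reset_index())

print(out)
```

The intermediate table has one column per distinct channel, so this sketch assumes the number of channels is small compared with the number of sellers.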
