I have a data frame as follows:
import pandas as pd

df = pd.DataFrame([['A', 'a', 'web'],
['A', 'b', 'mobile'],
['B', 'c', 'web'],
['C', 'd', 'web'],
['D', 'e', 'mobile'],
['D', 'f', 'web'],
['D', 'g', 'web'],
['D', 'g', 'web']],
columns=['seller_id', 'item_id', 'selling_channel'])
It shows sold items, together with the seller and the selling channel used for each sale (in the example above the channel is either web or mobile, but the real data has more channels).
I would like to determine the main selling channel for each seller_id, subject to these constraints:
- if one channel was used for 75% or more of the seller's sales, that channel is the main one
- if no channel reaches 75%, the main channel should be mixed
So for the input above I expect the following output:
df = pd.DataFrame([['A', 'mixed'],
['B', 'web'],
['C', 'web'],
['D', 'web']],
columns=['seller_id', 'main_selling_channel'])
Right now I do this by manually iterating over every row of the data frame to build a map that lists, per seller_id, each channel and its number of occurrences. Then I iterate over that map to determine which channel is the main one. This manual iteration is already slow with 10k lines of input, and the actual data contains a couple of million entries.
Is there an efficient way to do this with the pandas API instead of manual iteration?
Here is one way: use df.groupby with value_counts(normalize=True) to get each channel's share within each seller's group, then check whether that share is greater than or equal to 0.75. np.where keeps the channel name where the check returns True and sets it to mixed otherwise. Finally, df.groupby() with idxmax picks one row per seller: the row of the channel that passed the check if there is one, otherwise a mixed row.
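import numpy as np

# True where a channel accounts for at least 75% of a seller's rows
a = (df.groupby('seller_id')['selling_channel'].value_counts(normalize=True).ge(0.75)
.rename('Pct').reset_index())
# keep the channel name where the check passed, else 'mixed'; then take one row per seller
out = (a.assign(selling_channel=np.where(a['Pct'],a['selling_channel'],'mixed'))
.loc[lambda x: x.groupby('seller_id')['Pct'].idxmax()].drop(columns='Pct'))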
print(out)
seller_id selling_channel
0 A mixed
2 B web
3 C web
4 D web
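For comparison, the same 75% rule can also be sketched with pd.crosstab. This is an alternative formulation, not the approach above: it builds a sellers-by-channels table of shares, takes the largest share in each row, and falls back to mixed when that share is below the threshold. The variable names (share, top_share, top_channel, main) are only illustrative, and the sample data is repeated so the snippet runs on its own.

```python
import pandas as pd

# Sample data from the question, repeated so the snippet is self-contained.
df = pd.DataFrame(
    [['A', 'a', 'web'], ['A', 'b', 'mobile'], ['B', 'c', 'web'],
     ['C', 'd', 'web'], ['D', 'e', 'mobile'], ['D', 'f', 'web'],
     ['D', 'g', 'web'], ['D', 'g', 'web']],
    columns=['seller_id', 'item_id', 'selling_channel'])

# Share of each channel per seller: one row per seller, one column per channel,
# each row summing to 1.
share = pd.crosstab(df['seller_id'], df['selling_channel'], normalize='index')

# Largest share per seller and the channel that holds it.
top_share = share.max(axis=1)
top_channel = share.idxmax(axis=1)

# Keep the top channel only where it reaches the 75% threshold, else 'mixed'.
main = top_channel.where(top_share >= 0.75, 'mixed')

out = main.rename('main_selling_channel').reset_index()
print(out)
```

Because crosstab materialises a dense sellers-by-channels table, this is likely to suit data with a modest number of distinct channels; with very many channels the groupby and value_counts route may use less memory.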