I have a dataframe like below.
data
Index ID AA BB CC BIN
0 Z1 10 11 12 1
1 Z1 0 12 13 1
2 Z1 20 13 14 2
3 Z1 34 14 15 3
4 Z1 54 52 16 3
5 Z1 67 53 17 3
6 Z7 45 54 18 1
7 Z7 34 55 19 2
8 Z7 45 56 57 2
9 Z7 45 56 58 3
10 Z7 67 67 59 3
I want to get a dataframe that looks like below
data2
ID AA_SUM_12 AA_MEAN_12 BB_SUM_12 BB_MEAN_12 CC_SUM_12 CC_MEAN_12
Z1 30 10 36 12 39 13
Z7 124 41.33 165 55 94 31.33
Where SUM_12 gives a sum where 'BIN' = 1 and 2; the concept is the same for MEAN_12.
In the real dataset, there are more than 3000 different IDs, and 'BIN' ranges from 1 to 5.
I want to pick 'BIN' values randomly, for example taking the mean where 'BIN' is 1, 3, 5 or taking the sum where 'BIN' is 4, 5, and so on, with the result in the form of a dataframe.
How can I do that?
As I understand it, the question needs random unique BINs of length 2 or 3:
print (df)
ID AA BB CC BIN
0 Z1 10 11 12 1
1 Z1 0 12 13 1
2 Z1 20 13 14 2
3 Z1 34 14 15 4
4 Z1 54 52 16 5
5 Z1 67 53 17 3
6 Z7 45 54 18 4
7 Z7 34 55 19 2
8 Z7 45 56 57 4
9 Z7 45 56 58 3
10 Z7 67 67 59 3
So first get all unique values:
v = df['BIN'].unique()
print (v)
[1 2 4 5 3]
And pass them to numpy.random.choice with a randomly generated length of 2 or 3:
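import numpy as np

# replace=False keeps the sampled BINs unique (the default samples with replacement)
r = np.random.choice(v, size=np.random.choice([2,3]), replace=False)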
print (r)
[3 5 1]
new = ''.join((str(x) for x in r))
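The sampling step can also be written with NumPy's Generator API. This is only a sketch: the seed value is an arbitrary assumption, used so the draw can be repeated, and sorting the BINs before joining is an optional choice so that the same set of BINs always produces the same column suffix. Note that sampling is with replacement unless replace=False is passed.

```python
import numpy as np

# Hypothetical seed, chosen only to make the draw repeatable.
rng = np.random.default_rng(0)

# Unique BIN values from the sample df above.
v = np.array([1, 2, 4, 5, 3])

# Random length, either 2 or 3.
size = int(rng.choice([2, 3]))

# replace=False guarantees that no BIN is picked twice.
r = rng.choice(v, size=size, replace=False)

# Optional: sort so that {1, 3, 5} always gives the suffix "135".
new = ''.join(str(x) for x in sorted(r))
print(r, new)
```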
Then filter the rows with Series.isin and boolean indexing, aggregate with sum and mean, and finally append the selected BINs, joined into a string, to the column names:
df1 = df[df['BIN'].isin(r)].groupby('ID')[['AA', 'BB', 'CC']].agg(['mean','sum'])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}_{new}')
print (df1)
AA_mean_351 AA_sum_351 BB_mean_351 BB_sum_351 CC_mean_351 CC_sum_351
ID
Z1 32.75 131 32.0 128 14.5 58
Z7 56.00 112 61.5 123 58.5 117
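If different BIN sets need different aggregations (for example mean over BINs 3, 5, 1 and sum over BINs 4, 5), the steps above could be wrapped in a small helper and the results joined on ID. This is a sketch under assumptions: agg_for_bins and its parameters are hypothetical names, not part of pandas, and the sample frame is retyped from the printed df.

```python
import pandas as pd

# Sample data retyped from the printed df above.
df = pd.DataFrame({
    'ID':  ['Z1'] * 6 + ['Z7'] * 5,
    'AA':  [10, 0, 20, 34, 54, 67, 45, 34, 45, 45, 67],
    'BB':  [11, 12, 13, 14, 52, 53, 54, 55, 56, 56, 67],
    'CC':  [12, 13, 14, 15, 16, 17, 18, 19, 57, 58, 59],
    'BIN': [1, 1, 2, 4, 5, 3, 4, 2, 4, 3, 3],
})

def agg_for_bins(df, bins, funcs=('mean', 'sum'), cols=('AA', 'BB', 'CC')):
    """Hypothetical helper: aggregate `cols` per ID over rows whose BIN is in `bins`."""
    suffix = ''.join(str(b) for b in bins)
    out = (df[df['BIN'].isin(bins)]
           .groupby('ID')[list(cols)]
           .agg(list(funcs)))
    # Flatten the (column, function) MultiIndex into names like AA_mean_351.
    out.columns = [f'{col}_{func}_{suffix}' for col, func in out.columns]
    return out

# Mean over BINs 3, 5, 1 and sum over BINs 4, 5, aligned side by side on ID.
result = pd.concat(
    [agg_for_bins(df, [3, 5, 1], funcs=['mean']),
     agg_for_bins(df, [4, 5], funcs=['sum'])],
    axis=1,
)
print(result)
```

An ID with no rows in one of the BIN sets would be missing from that piece, so after pd.concat it would show NaN in those columns.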