dh762 :
Consider the df1 pandas DataFrame. I would like to transform this DataFrame to have a count for each date and concept (see df2).
import pandas as pd
inp_data = [
{'date': '2020-02-01', 'concepts': [{'surfaceForm': 'ABC'}, {'surfaceForm': 'DEF'}]},
{'date': '2020-02-01', 'concepts': [{'surfaceForm': 'ABC'}, {'surfaceForm': 'XYZ'}]},
{'date': '2020-02-02', 'concepts': [{'surfaceForm': 'XYZ'}]}
]
df1 = pd.DataFrame(inp_data, columns=['date', 'concepts'])
# transform df1 into df2...
# goal
out_data = [
{'day': '2020-02-01', 'concept': 'ABC', 'count': 2},
{'day': '2020-02-01', 'concept': 'DEF', 'count': 1},
{'day': '2020-02-01', 'concept': 'XYZ', 'count': 1},
{'day': '2020-02-02', 'concept': 'XYZ', 'count': 1},
]
df2 = pd.DataFrame(out_data, columns=['day', 'concept', 'count'])
Note that the df1 date becomes day in df2, and each object in concepts in df1 is regarded as its own concept in df2. I could hack it together by iterating over the rows of df1, which obviously has lots of performance problems and isn't the pandas way. When I then tried to run it on a DataFrame an order of magnitude bigger, it didn't finish in a timely manner.
For reference, here's the hacky way:
import pandas as pd
columns = ['concept', 'day']
def concept_occurence(row, columns):
    # build one (concept, day) record per concept in this row
    insert_list = list()
    for c in row['concepts']:
        sf = c['surfaceForm']
        insert_list.append({'concept': sf, 'day': row['date']})
    return pd.DataFrame(insert_list, columns=columns)
df2 = pd.DataFrame(columns=columns)
for index, row in df1.iterrows():
    concept_map = concept_occurence(row, columns)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df2 = pd.concat([df2, concept_map], ignore_index=True)
anky_91 :
Here is a way using str.get and groupby after explode:
a = df1.explode('concepts')
out = (a.assign(concepts=a['concepts'].str.get('surfaceForm'))
        .groupby(['date', 'concepts'])['concepts'].count().reset_index(name='Count'))
print(out)
         date concepts  Count
0  2020-02-01      ABC      2
1  2020-02-01      DEF      1
2  2020-02-01      XYZ      1
3  2020-02-02      XYZ      1
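The output above keeps the date / concepts / Count column names, while the question asks for day / concept / count. A minimal sketch of the same explode, str.get and groupby approach with the columns renamed, assuming df1 is built exactly as in the question (the variable name exploded is only illustrative, and size() is used here in place of count(), which gives the same numbers as long as no concept is missing):

```python
import pandas as pd

# Input copied from the question.
inp_data = [
    {'date': '2020-02-01', 'concepts': [{'surfaceForm': 'ABC'}, {'surfaceForm': 'DEF'}]},
    {'date': '2020-02-01', 'concepts': [{'surfaceForm': 'ABC'}, {'surfaceForm': 'XYZ'}]},
    {'date': '2020-02-02', 'concepts': [{'surfaceForm': 'XYZ'}]},
]
df1 = pd.DataFrame(inp_data, columns=['date', 'concepts'])

# One row per (date, concept dict) pair.
exploded = df1.explode('concepts')

df2 = (
    exploded
    # Pull the 'surfaceForm' value out of each dict.
    .assign(concept=exploded['concepts'].str.get('surfaceForm'))
    # Match the column name requested in the question.
    .rename(columns={'date': 'day'})
    # Count rows per (day, concept) group.
    .groupby(['day', 'concept'])
    .size()
    .reset_index(name='count')
)
print(df2)
```

Note that explode turns an empty concepts list into a single row with a missing value, and groupby drops missing keys by default, so dates without any concepts would simply not appear in the result.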