Group by date and count value in nested dict value in a pandas dataframe

dh762 :

Consider the pandas DataFrame df1 below. I would like to transform it into a DataFrame with one row per date and concept, holding the number of occurrences (see df2).

import pandas as pd

inp_data = [
{'date': '2020-02-01', 'concepts': [{'surfaceForm': 'ABC'}, {'surfaceForm': 'DEF'}]},
{'date': '2020-02-01', 'concepts': [{'surfaceForm': 'ABC'}, {'surfaceForm': 'XYZ'}]},
{'date': '2020-02-02', 'concepts': [{'surfaceForm': 'XYZ'}]}
]

df1 = pd.DataFrame(inp_data, columns=['date', 'concepts'])

# transform df1 into df2...

# goal
out_data = [
 {'day': '2020-02-01', 'concept': 'ABC', 'count': 2},
 {'day': '2020-02-01', 'concept': 'DEF', 'count': 1},
 {'day': '2020-02-01', 'concept': 'XYZ', 'count': 1},
 {'day': '2020-02-02', 'concept': 'XYZ', 'count': 1},
]
df2 = pd.DataFrame(out_data, columns=['day', 'concept', 'count'])

Note that date in df1 becomes day in df2, and each object in concepts in df1 is treated as its own concept in df2. I could hack it together by iterating over the rows of df1, but that performs poorly and isn't the pandas way. When I ran it on a DataFrame an order of magnitude bigger, it did not finish in a timely manner.

For reference, here's the hacky way:

import pandas as pd

columns = ['concept', 'day']

def concept_occurence(row, columns):
    insert_list = list()
    for c in row['concepts']:
        sf = c['surfaceForm']
        insert_list.append({'concept': sf, 'day': row['date']})
    return pd.DataFrame(insert_list, columns=columns)

# DataFrame.append was removed in pandas 2.0, so collect the per-row frames
# and concatenate them once at the end.
frames = []

for index, row in df1.iterrows():
    frames.append(concept_occurence(row, columns))

df2 = pd.concat(frames, ignore_index=True)

anky_91 :

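Here is a way using explode, followed by str.get and groupby. explode gives each element of the concepts lists its own row, str.get('surfaceForm') pulls the value out of each dict, and groupby counts the rows for each (date, concepts) pair: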

a = df1.explode('concepts')
out = (a.assign(concepts = a['concepts'].str.get('surfaceForm'))
       .groupby(['date','concepts'])['concepts'].count().reset_index(name='Count'))
print(out)

         date concepts  Count
0  2020-02-01      ABC      2
1  2020-02-01      DEF      1
2  2020-02-01      XYZ      1
3  2020-02-02      XYZ      1
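
The output above uses the column names date, concepts and Count, while the question asks for day, concept and count. As a minimal sketch of the same explode-then-group idea with the requested names, something like the following should work. The intermediate name exploded is just illustrative, and size() is swapped in for count() here since it counts rows per group directly:

```python
import pandas as pd

inp_data = [
    {'date': '2020-02-01', 'concepts': [{'surfaceForm': 'ABC'}, {'surfaceForm': 'DEF'}]},
    {'date': '2020-02-01', 'concepts': [{'surfaceForm': 'ABC'}, {'surfaceForm': 'XYZ'}]},
    {'date': '2020-02-02', 'concepts': [{'surfaceForm': 'XYZ'}]},
]
df1 = pd.DataFrame(inp_data, columns=['date', 'concepts'])

# One row per (date, concept dict) pair.
exploded = df1.explode('concepts')

# Pull the 'surfaceForm' value out of each dict into a new column.
exploded['concept'] = exploded['concepts'].str.get('surfaceForm')

# Rename date -> day, then count the rows in each (day, concept) group.
df2 = (
    exploded.rename(columns={'date': 'day'})
    .groupby(['day', 'concept'])
    .size()
    .reset_index(name='count')
)
print(df2)
```

Rows whose concepts list is empty become NaN after explode and are dropped by groupby, so they would not show up in the counts.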
