Replace pandas groupby and apply to increase performance

Josef :

I am using pandas groupby and apply to go from a DataFrame containing 150 million rows with the following columns:

Id  Created     Item    Stock   Price
1   2019-01-01  Item 1  200     10
1   2019-01-01  Item 2  100     15
2   2019-01-01  Item 1  200     10

To a list of 2.2 million records that looks like this:

[{
  "Id": 1,
  "Created": "2019-01-01",
  "Items": [
    {"Item":"Item 1", "Stock": 200, "Price": 10},
    {"Item":"Item 2", "Stock": 100, "Price": 5}
    ]
},
{
  "Id": 2,
  "Created": "2019-01-01",
  "Items": [
    {"Item":"Item 1", "Stock": 200, "Price": 10}
    ]
}]

Mainly using this line of code:

df.groupby(['Id', 'Created']).apply(lambda x: x[['Item', 'Stock', 'Price']].to_dict(orient='records'))

This takes quite some time and, as I understand it, operations like this are expensive for pandas. Is there a non-pandas way to accomplish the same thing with better performance?
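For reference, the groupby/apply line above returns a pandas Series keyed by the (Id, Created) MultiIndex, not the final list. A minimal sketch of the remaining unpacking step, assuming df is the DataFrame from the question (variable names are illustrative):

# The apply() result is a Series whose MultiIndex is (Id, Created) and
# whose values are the per-group lists of record dicts.
s = df.groupby(['Id', 'Created']).apply(
    lambda x: x[['Item', 'Stock', 'Price']].to_dict(orient='records'))

# Unpack the MultiIndex back into fields to get the target list shape.
records = [{"Id": id_, "Created": created, "Items": items}
           for (id_, created), items in s.items()]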

Edit: The operation takes 55 minutes. I am using ScriptProcessor in AWS, which lets me specify the amount of compute I want.

Edit 2: With artona's solution I am getting close. This is what I manage to produce now:

defaultdict(<function __main__.<lambda>()>,
            {'1': defaultdict(list,
                              {'Id': '1',
                               'Created': '2019-01-01',
                               'Items': [{'Item': 'Item 2', 'Stock': 100, 'Price': 15},
                                         {'Item': 'Item 1', 'Stock': 200, 'Price': 10}]}),
             '2': defaultdict(list,
                              {'Id': '2',
                               'Created': '2019-01-01',
                               'Items': [{'Item': 'Item 1', 'Stock': 200, 'Price': 10}]})})

But how do I go from the above to this?

[{
  "Id": 1,
  "Created": "2019-01-01",
  "Items": [
    {"Item":"Item 1", "Stock": 200, "Price": 10},
    {"Item":"Item 2", "Stock": 100, "Price": 5}
    ]
},
{
  "Id": 2,
  "Created": "2019-01-01",
  "Items": [
    {"Item":"Item 1", "Stock": 200, "Price": 10}
    ]
}]

Basically I'm only interested in the part after "defaultdict(list, " for every record. I need it in a list that does not use the Id as a key.
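A minimal sketch of that last step, assuming dict_ids has exactly the structure shown above (the sample values are illustrative): the values of the outer dict are already the records you want, so it is enough to drop the Id keys and convert the inner defaultdicts to plain dicts.

from collections import defaultdict

# Rebuild the intermediate structure from Edit 2 (illustrative values).
dict_ids = defaultdict(lambda: defaultdict(list))
dict_ids['1'].update({'Id': 1, 'Created': '2019-01-01',
                      'Items': [{'Item': 'Item 1', 'Stock': 200, 'Price': 10},
                                {'Item': 'Item 2', 'Stock': 100, 'Price': 15}]})
dict_ids['2'].update({'Id': 2, 'Created': '2019-01-01',
                      'Items': [{'Item': 'Item 1', 'Stock': 200, 'Price': 10}]})

# The outer keys ('1', '2') are only lookup keys; the inner mappings are
# the records themselves. Collect them into a list, discarding the keys.
records = [dict(inner) for inner in dict_ids.values()]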

Edit 3: Last update, containing the results for my production dataset. With the accepted answer provided by artona I managed to go from 55 minutes to 7(!) minutes, and without any major changes to my code. The solution provided by Phung Duy Phong took me from 55 minutes to 17, not too bad either.

artona :

Use collections.defaultdict and itertuples. This iterates over the rows only once.

In [105]: %timeit df.groupby(['Id', 'Created']).apply(lambda x: x[['Item', 'Stock', 'Price']].to_dict(orient='records'))
10.1 s ± 44.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [107]: from collections import defaultdict
     ...: def create_dict():
     ...:     dict_ids = defaultdict(lambda: defaultdict(list))
     ...:     for row in df.itertuples():
     ...:         dict_ids[row.Id][row.Created].append({"Item": row.Item, "Stock": row.Stock, "Price": row.Price})
     ...:     list_of_dicts = [{"Id": key_id, "Created": key_created, "Items": values}
     ...:                      for key_id, value_id in dict_ids.items()
     ...:                      for key_created, values in value_id.items()]
     ...:     return list_of_dicts

In [108]: %timeit create_dict()
4.58 s ± 417 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
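If the end goal is JSON, the returned list can be serialized directly. One caveat, stated here as an assumption worth checking: itertuples may yield NumPy scalars (e.g. numpy.int64) for numeric columns, which the stdlib encoder rejects. A short sketch:

import json

list_of_dicts = create_dict()

# NumPy scalars raise TypeError in json.dumps; their .item() method
# converts them to native Python numbers, so use it as a fallback.
payload = json.dumps(list_of_dicts, default=lambda o: o.item())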
