Pandas aggregate grouping with the max of another column?

rb612 :

Currently, I'm working with the COVID dataset to do some insights.

The dataset is of this form:

    Country Province    Lat Lon         Date                    Cases   Status
0   Thailand        15.0000 101.0000    2020-01-22 00:00:00+00:00   2   confirmed
1   Thailand        15.0000 101.0000    2020-01-23 00:00:00+00:00   3   confirmed
2   Thailand        15.0000 101.0000    2020-01-24 00:00:00+00:00   5   confirmed
3   Thailand        15.0000 101.0000    2020-01-25 00:00:00+00:00   7   confirmed
4   Thailand        15.0000 101.0000    2020-01-26 00:00:00+00:00   8   confirmed

I want to group by country, summing over the "Cases" column (we'll call this case sum column), but I run into a problem with latitude and longitude: I want to take the lat/long of the max of the case column. In other words, I would like the latitude and longitude from the row with the largest number of cases. To clarify, the use case is that a country like France has rows with multiple latitude and longitudes (like French Polynesia for example) but I just want to take the lat/long in the grouping from the area which has the most cases.

I currently am running an aggregation as follows:

nonzero_cases[(nonzero_cases['Date'] == "03/13/2020")].groupby("Country").agg({"Lat":"first","Lon":"first","Cases":"sum"})

This yields:

Country     Lat     Lon     Cases
Afghanistan 33.0000 65.0000 7
Albania 41.1533 20.1683 33
Algeria 28.0339 1.6596  26
Andorra 42.5063 1.5218  1
...

But this is not quite what I want since it doesn't take into account the case numbers, and just picks the first Lat/Lon.

jezrael :

Add DataFrame.sort_values by column Cases, so now first values are rows with maximum of Cases per groups:

print (df)
    Country   Lat    Lon                       Date  Cases     Status
0  Thailand  15.0  101.0  2020-01-22 00:00:00+00:00      2  confirmed
1  Thailand  15.0  101.0  2020-01-23 00:00:00+00:00      3  confirmed
2  Thailand  15.0  101.0  2020-01-24 00:00:00+00:00      5  confirmed
3  Thailand  15.0  101.0  2020-01-25 00:00:00+00:00      7  confirmed
4  Thailand  14.0  103.0  2020-01-26 00:00:00+00:00      8  confirmed <- changed data

df1 = (df.sort_values('Cases', ascending=False)
         .groupby("Country")
         .agg({"Lat":"first","Lon":"first","Cases":"sum"}))

print (df1)
           Lat    Lon  Cases
Country                     
Thailand  14.0  103.0     25

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=297383&siteId=1