How to Aggregate and Sum in Python (or R) with a Specific Condition

Tanisha Hudson :

Objective

I have a dataset, df. I would like to group it by the length column, sum the duration for each group, and display the latest end time associated with that group:

 length start                      end                      duration
 6330   12/17/2019 10:34:23 AM     12/17/2019 10:34:31 AM   8
 57770  12/19/2019 5:19:56 PM      12/17/2019 5:24:19 PM    263
 6330   12/17/2019 10:34:54 AM     12/17/2019 10:35:00 AM   6
 6330   12/18/2019 4:36:44 PM      12/18/2019 4:37:13 PM    29
 57770  12/19/2019 5:24:47 PM      12/19/2019 5:26:44 PM    117

Desired Output

length  end                     total Duration
6330    12/18/2019 4:37:13 PM   43  
57770   12/19/2019 5:26:44 PM   380 

Dput

structure(list(length = c(6330L, 57770L, 6330L, 6330L, 57770L
), start = structure(c(1L, 4L, 2L, 3L, 5L), .Label = c("12/17/2019 10:34:23 AM", 
"12/17/2019 10:34:54 AM", "12/18/2019 4:36:44 PM", "12/19/2019 5:19:56 PM", 
"12/19/2019 5:24:47 PM"), class = "factor"), end = structure(c(1L, 
3L, 2L, 4L, 5L), .Label = c("12/17/2019 10:34:31 AM", "12/17/2019 10:35:00 AM", 
"12/17/2019 5:24:19 PM", "12/18/2019 4:37:13 PM", "12/19/2019 5:26:44 PM"
), class = "factor"), duration = c(8L, 263L, 6L, 29L, 117L)), class = "data.frame", row.names = c(NA, 
-5L))

How do I also display the latest end value for each length? For instance, length 6330 has 3 end values, each with a duration attached to it:

           12/17/2019 10:34:31 AM            8
           12/17/2019 10:35:00 AM            6
           12/18/2019 4:37:13 PM            29


12/18/2019 4:37:13 PM is the latest end time, so I would like to output that end time along with the sum of the durations for this particular length value.

This is what I have tried:

import pandas as pd
import numpy as np

df1 = df.groupby('length')['duration'].sum()

However, it only outputs the length and total duration. How would I output the length, the latest end as well as the total duration for that particular length?

Any help is appreciated.

akrun :

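In R, we can group by 'length', then use summarise to get the sum of 'duration' and extract the 'end' value with the latest time. The comparison is done after converting 'end' to a DateTime class with mdy_hms (from lubridate), so the latest value is chosen chronologically rather than by string or factor order: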

library(dplyr)
library(lubridate)
df %>%
   group_by(length) %>% 
   summarise(duration = sum(duration), end = end[which.max(mdy_hms(end))])
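Since the question also asks about Python, here is a rough pandas sketch of the same idea, not a definitive answer. It assumes the end strings follow the month/day/year 12-hour format shown in the sample data; the column names end_dt and total_duration are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "length": [6330, 57770, 6330, 6330, 57770],
    "start": ["12/17/2019 10:34:23 AM", "12/19/2019 5:19:56 PM",
              "12/17/2019 10:34:54 AM", "12/18/2019 4:36:44 PM",
              "12/19/2019 5:24:47 PM"],
    "end": ["12/17/2019 10:34:31 AM", "12/17/2019 5:24:19 PM",
            "12/17/2019 10:35:00 AM", "12/18/2019 4:37:13 PM",
            "12/19/2019 5:26:44 PM"],
    "duration": [8, 263, 6, 29, 117],
})

# Parse the end strings so "latest" means chronological order,
# not alphabetical order. The format string is an assumption
# based on the sample rows; end_dt is a hypothetical helper column.
df["end_dt"] = pd.to_datetime(df["end"], format="%m/%d/%Y %I:%M:%S %p")

# Named aggregation: latest end and summed duration per length.
result = (
    df.groupby("length", as_index=False)
      .agg(end=("end_dt", "max"), total_duration=("duration", "sum"))
)
print(result)
```

Here end comes back as a pandas Timestamp rather than the original string. If the original text format is needed, result["end"].dt.strftime("%m/%d/%Y %I:%M:%S %p") should get close, although it zero-pads the hour.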
