Pandas DatetimeIndex: number of concurrent events over time

user1680772 :

I'm analyzing a set of events, each of which has a type, a start timestamp, and an end timestamp. I'm trying to summarize how many events of each type are in progress at each point in the time range.

Consider the dataset below, listing events of types N1-N4 and N7, some with overlapping ranges:

>>> import pandas as pd
>>> data = {
...    'name' : [ 'N1', 'N2', 'N3', 'N4', 'N1',  'N2', 'N7'],
...    'start_dt_str' : ['01-01-2020', '01-03-2020', '01-01-2020', '01-01-2020', '01-03-2020', '01-04-2020','01-10-2020'],
...    'end_dt_str' : ['01-03-2020', '01-05-2020', '01-05-2020', '01-02-2020', '01-04-2020', '01-05-2020', '01-11-2020']
... }
>>> df = pd.DataFrame(data)
>>> df['start_dt'] = pd.to_datetime(df['start_dt_str'])
>>> df['end_dt'] = pd.to_datetime(df['end_dt_str'])
>>> del df['start_dt_str']
>>> del df['end_dt_str']
>>> df 
  name   start_dt     end_dt
0   N1 2020-01-01 2020-01-03
1   N2 2020-01-03 2020-01-05
2   N3 2020-01-01 2020-01-05
3   N4 2020-01-01 2020-01-02
4   N1 2020-01-03 2020-01-04
5   N2 2020-01-04 2020-01-05
6   N7 2020-01-10 2020-01-11

My goal is to produce the summary below: the number of concurrent events, by type, for each date in the range. This would be the right answer:

               N1 N2 N3 N4 N7
2020-01-01     1  0  1  1  0
2020-01-02     1  0  1  1  0 
2020-01-03     2  1  1  0  0
2020-01-04     1  2  1  0  0
2020-01-05     0  2  1  0  0
2020-01-06     0  0  0  0  0
2020-01-07     0  0  0  0  0
2020-01-08     0  0  0  0  0
2020-01-09     0  0  0  0  0
2020-01-10     0  0  0  0  1
2020-01-11     0  0  0  0  1

Note that there are duplicate dates in both the start_dt and end_dt columns.

Also note that the solution must be able to resample the data, filling missing dates with rows containing all zeros. In this example, date 01-09 does not appear as a start or end date, but it must be present in the output. In the general case, I want to be able to resample to any arbitrary interval.

To keep the explanation simple, both the reporting period and the data are at day precision in the dataset above. In the actual dataset, start_dt and end_dt are at millisecond precision (but still contain duplicates), and the reporting period could be hours, days, weeks, etc.

Also note that there are gaps in the data, so resampling is needed to produce the datetime series (i.e., even though the data is at millisecond precision, there are entire days missing).

I've tried several approaches that do NOT work. At first it seemed this would be simple, so I tried:

df.set_index(['name','start_dt']).groupby('name').resample('D',level='start_dt').ffill()

ValueError: Upsampling from level= or on= selection is not supported, use .set_index(...) to explicitly set index to datetime-like

This leads to this pandas issue about upsampling, which is still open and offers some workarounds. Unfortunately, we can't use only start_dt (or end_dt) as the index, because it is non-unique:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dcowden/envs/analysis-env/lib/python3.6/site-packages/pandas/core/resample.py", line 453, in pad
    return self._upsample("pad", limit=limit)
  File "/home/dcowden/envs/analysis-env/lib/python3.6/site-packages/pandas/core/resample.py", line 1095, in _upsample
    res_index, method=method, limit=limit, fill_value=fill_value
  File "/home/dcowden/envs/analysis-env/lib/python3.6/site-packages/pandas/util/_decorators.py", line 227, in wrapper
    return func(*args, **kwargs)
  File "/home/dcowden/envs/analysis-env/lib/python3.6/site-packages/pandas/core/frame.py", line 3856, in reindex
    return super().reindex(**kwargs)
  File "/home/dcowden/envs/analysis-env/lib/python3.6/site-packages/pandas/core/generic.py", line 4544, in reindex
    axes, level, limit, tolerance, method, fill_value, copy
  File "/home/dcowden/envs/analysis-env/lib/python3.6/site-packages/pandas/core/frame.py", line 3744, in _reindex_axes
    index, method, copy, level, fill_value, limit, tolerance
  File "/home/dcowden/envs/analysis-env/lib/python3.6/site-packages/pandas/core/frame.py", line 3760, in _reindex_index
    new_index, method=method, level=level, limit=limit, tolerance=tolerance
  File "/home/dcowden/envs/analysis-env/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3149, in reindex
    "cannot reindex a non-unique index "
ValueError: cannot reindex a non-unique index with a method or limit

This question seems similar to my problem, but its approach doesn't fill all of the dates in the range for each event type:

>>> df.set_index('start_dt').groupby('name').resample('D').asfreq()
                name     end_dt
name start_dt                  
N1   2020-01-01   N1 2020-01-03
     2020-01-02  NaN        NaT
     2020-01-03   N1 2020-01-04
N2   2020-01-03   N2 2020-01-05
     2020-01-04   N2 2020-01-05
N3   2020-01-01   N3 2020-01-05
N4   2020-01-01   N4 2020-01-02

This solution seemed promising, but isn't exactly what I need either. It essentially looks up a single event within a range, but doesn't count the total number in progress. Though using an IntervalIndex does seem like a good start.

I feel like this should be pretty easy, but clearly my pandas foo is woefully inadequate.

Help is much appreciated!

jezrael :

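The idea is to repeat each event's name for every date in its date_range, collect the results in a helper Series, and then use SeriesGroupBy.value_counts with Series.unstack:

# One Series per event: the index holds every day the event is active, the value is its name
L = [pd.Series(r.name, pd.date_range(r.start_dt, r.end_dt)) for r in df.itertuples()]
s = pd.concat(L)

# Count names per date, then pivot the names into columns
df1 = s.groupby(level=0).value_counts().unstack(fill_value=0)
print(df1)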
            N1  N2  N3  N4
2020-01-01   1   0   1   1
2020-01-02   1   0   1   1
2020-01-03   2   1   1   0
2020-01-04   1   2   1   0
2020-01-05   0   2   1   0
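The result above only has rows for dates on which some event is active, so gaps such as 01-06 to 01-09 in the question are not filled. One possible way to add them, sketched here under the assumption that a daily grid from the first to the last active date is wanted, is to reindex against a full pd.date_range. The names counts and full_range are only illustrative.

```python
import pandas as pd

# Sample data from the question, already parsed to datetimes
df = pd.DataFrame({
    'name': ['N1', 'N2', 'N3', 'N4', 'N1', 'N2', 'N7'],
    'start_dt': pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-01', '2020-01-01',
                                '2020-01-03', '2020-01-04', '2020-01-10']),
    'end_dt': pd.to_datetime(['2020-01-03', '2020-01-05', '2020-01-05', '2020-01-02',
                              '2020-01-04', '2020-01-05', '2020-01-11']),
})

# Same idea as above: one Series per event, indexed by its active days
L = [pd.Series(r.name, pd.date_range(r.start_dt, r.end_dt)) for r in df.itertuples()]
counts = pd.concat(L).groupby(level=0).value_counts().unstack(fill_value=0)

# Assumption: the report should cover every day between the first and last active date
full_range = pd.date_range(counts.index.min(), counts.index.max(), freq='D')

# The grouped index is unique, so reindex works; missing days become all-zero rows
counts = counts.reindex(full_range, fill_value=0)
print(counts)
```

The reindex is safe here because the groupby has already collapsed duplicate dates into a unique index, which avoids the "cannot reindex a non-unique index" error from the question.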

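Another solution reshapes the data with DataFrame.melt. First it is necessary to distinguish consecutive values with the Series.shift plus Series.cumsum trick; then use DataFrameGroupBy.resample, and finally crosstab:

# g changes whenever name differs from the previous row, so repeated names get separate groups
df['g'] = df['name'].ne(df['name'].shift()).cumsum()
# melt puts start_dt and end_dt into one 'value' column; resampling fills the days in between
df1 = (df.melt(['name','g'])
         .set_index('value')
         .groupby(['g','name'])['variable']
         .resample('D')
         .first()
         .reset_index())

# Count rows per date and name
df1 = pd.crosstab(df1['value'], df1['name'])
print(df1)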
name        N1  N2  N3  N4
value                     
2020-01-01   1   0   1   1
2020-01-02   1   0   1   1
2020-01-03   2   1   1   0
2020-01-04   1   2   1   0
2020-01-05   0   2   1   0
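Both solutions above build one row per event per day, which may get large with millisecond-precision data and long events. A different technique, not one of the two above, is to record +1 at the bin where each event starts and -1 at the bin after the one where it ends, and then take a cumulative sum. The sketch below assumes end timestamps are inclusive and that an event counts in every bin it overlaps; concurrent_counts and its freq parameter are hypothetical names. Because it relies on dt.floor, it only accepts fixed frequencies, so a weekly report would need '7D' rather than 'W'.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

def concurrent_counts(df, freq='D'):
    """Count, per name, the events overlapping each bin of width freq (hypothetical helper)."""
    step = to_offset(freq)
    first = df['start_dt'].dt.floor(freq)   # bin in which each event starts
    last = df['end_dt'].dt.floor(freq)      # bin in which each event ends (inclusive)
    opens = pd.DataFrame({'when': first, 'name': df['name'], 'delta': 1})
    closes = pd.DataFrame({'when': last + step, 'name': df['name'], 'delta': -1})
    # Net change in the number of active events per bin and name
    deltas = (pd.concat([opens, closes])
                .groupby(['when', 'name'])['delta'].sum()
                .unstack(fill_value=0))
    # Full grid of bins, so gaps in the data become explicit rows
    bins = pd.date_range(first.min(), last.max(), freq=freq)
    # Running total of the changes = number of events in progress
    return deltas.reindex(bins, fill_value=0).cumsum()

df = pd.DataFrame({
    'name': ['N1', 'N2', 'N3', 'N4', 'N1', 'N2', 'N7'],
    'start_dt': pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-01', '2020-01-01',
                                '2020-01-03', '2020-01-04', '2020-01-10']),
    'end_dt': pd.to_datetime(['2020-01-03', '2020-01-05', '2020-01-05', '2020-01-02',
                              '2020-01-04', '2020-01-05', '2020-01-11']),
})

daily = concurrent_counts(df, 'D')
two_day = concurrent_counts(df, '2D')
print(daily)
```

For the sample data with freq='D' this gives 11 rows, 2020-01-01 through 2020-01-11, with all-zero rows for 01-06 to 01-09. The work grows with the number of events and bins rather than with the total duration of the events.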
