Easy way to remove certain observations (time series) in a pandas dataframe?

whj123 :

I have a pandas dataframe with a column called "date", which is in YYQQ format, e.g. 1702.

The column is a time series. However, there are some missing observations. For example, the dataframe below is disrupted at date 1702. I have multiple dataframes like this, each disrupted at a different date. I only want to keep the dates after the disruption. In the example below, I would just want the rows on and after 1702. Is there an easy and pythonic way to do it?

date
1301
1302
1303
1304
1401
1702
1703
1704
1801
1802

rahlf23 :

First, let's recreate your dataframe and create new columns to represent the year (YY) and quarter (QQ) for each respective date:

import pandas as pd

df = pd.DataFrame({'date': [1301,1302,1303,1304,1401,1702,1703,1704,1801,1802]})

df[['YY','QQ']] = df['date'].astype(str).str.extract(r'(\d{2})(\d{2})')

Now we have:

   date  YY  QQ
0  1301  13  01
1  1302  13  02
2  1303  13  03
3  1304  13  04
4  1401  14  01
5  1702  17  02
6  1703  17  03
7  1704  17  04
8  1801  18  01
9  1802  18  02
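
If the dates are stored as integers, as they are in the example, the same split could also be sketched with integer arithmetic instead of a regex. This is only an alternative under that assumption; the column names `year` and `quarter` are made up for illustration and are not used in the rest of this answer.

```python
import pandas as pd

df = pd.DataFrame({'date': [1301, 1302, 1303, 1304, 1401, 1702, 1703, 1704, 1801, 1802]})

# Assumes 'date' holds integers in YYQQ form: 1702 -> year 17, quarter 2
parts = pd.DataFrame({
    'year': df['date'] // 100,    # leading two digits (YY)
    'quarter': df['date'] % 100,  # trailing two digits (QQ)
})

# Sanity check: every quarter should be between 1 and 4
assert parts['quarter'].between(1, 4).all()
```

Unlike `str.extract`, this yields integer columns, so no `astype(int)` is needed before doing arithmetic on them.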

Next, convert each quarter to its final month, build a datetime column from the year and month, and look for the first row whose gap from the previous row exceeds 3 months. That row marks the disruption, so slicing the dataframe from its index onward keeps only the rows on and after it:

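df['MM'] = df['QQ'].astype(int) * 3

# Zero-pad the month so that e.g. '17' + '06' parses unambiguously as %y%m
df['date_t'] = pd.to_datetime(df['YY'].astype(str) + df['MM'].astype(str).str.zfill(2), format='%y%m')

# Put every date on a single month count so consecutive rows can be compared
months = df['date_t'].dt.year * 12 + df['date_t'].dt.month

# Keep everything from the first row that is more than 3 months after the previous one
df.loc[df.index[months.diff() > 3][0]:]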

Yields:

   date  YY  QQ  MM     date_t
5  1702  17  02   6 2017-06-01
6  1703  17  03   9 2017-09-01
7  1704  17  04  12 2017-12-01
8  1801  18  01   3 2018-03-01
9  1802  18  02   6 2018-06-01
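
Since the question mentions several dataframes, the same idea could be wrapped in a small function. The following is a minimal sketch, not part of the answer above: `rows_from_first_gap` is a hypothetical helper name, and it assumes the column holds YYQQ values sorted in ascending order. It skips the datetime conversion and counts quarters directly.

```python
import pandas as pd

def rows_from_first_gap(df, col='date'):
    """Return the rows from the first gap in a YYQQ series onward.

    Hypothetical helper: assumes `col` holds YYQQ values in ascending order.
    """
    dates = df[col].astype(int)
    # Map each YYQQ value to a running quarter count, e.g. 1702 -> 17*4 + 2 = 70
    quarter_no = (dates // 100) * 4 + dates % 100
    # Consecutive quarters differ by exactly 1; anything larger is a gap
    gap = quarter_no.diff() > 1
    if not gap.any():
        return df  # no disruption found, keep everything
    # idxmax returns the index label of the first True value
    return df.loc[gap.idxmax():]

df = pd.DataFrame({'date': [1301, 1302, 1303, 1304, 1401, 1702, 1703, 1704, 1801, 1802]})
result = rows_from_first_gap(df)
print(result['date'].tolist())
```

With a list of dataframes (called `frames` here purely for illustration), this could be applied as `[rows_from_first_gap(d) for d in frames]`.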

