Finding monthly repeating values in dataframe (pandas)

Sascha Stenger :

So I have a dataset that includes dates and values, corresponding to those dates.

date  value  category
1951-07 199 1
1951-07 130 3
1951-07 50  5
1951-08 199 1
1951-08 50  5
1951-08 199 1
1951-09 184 2
1951-09 50  5
1951-09 13  13

Now my goal is to find the values, that repeat each month. Resulting in a frame like this:

date  value  category
1951-07 50  5
1951-08 50  5
1951-09 50  5

Also not regarding values that repeat inside a month, or that repeat only for a few months, but not all.

The categories do often pai with the value (like shown in the example), but sometimes they don't. So I tried doing it by category, but it didn't give me exact results.

My current approach is to filter for duplicates and then get those, that occure 12 times (as i'm searching per year). But it also gives me values, that repeat 12 sides inside a month.

df = df[df.duplicated(['value'],keep=False)]
v = df.value.value_counts()
df_12 = df[df.value.isin(v.index[v.gt(12)])]

Any help would be appreciated.

Serge Ballesta :

I would first group by values and remove duplicates on dates:

tmp = df.groupby('value')['date'].apply(lambda x: x.drop_duplicates())

Your sample would give:

value   
13     8    1951-09
50     2    1951-07
       4    1951-08
       7    1951-09
130    1    1951-07
184    6    1951-09
199    0    1951-07
       3    1951-08
Name: date, dtype: object

Then we can safely count the values and only keep the ones having the expected count::

total = tmp.groupby(level=0).count()
total = total[total == 3]

We get:

value
50    3
Name: date, dtype: int64

We can finaly filter the original dataframe:

df[df['value'].isin(total.index)]

giving the expected:

      date  value  category
2  1951-07     50         5
4  1951-08     50         5
7  1951-09     50         5

From Jezrael comment, the first steps to build total should become:

total = df.drop_duplicates(['date', 'value'])[['date', 'value']
                                              ].groupby('value').count()['date']
total = total[total == 3]

it is both simpler and faster...

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=25623&siteId=1