I have a dataset that includes dates and the values corresponding to those dates.
date value category
1951-07 199 1
1951-07 130 3
1951-07 50 5
1951-08 199 1
1951-08 50 5
1951-08 199 1
1951-09 184 2
1951-09 50 5
1951-09 13 13
My goal is to find the values that repeat every month, resulting in a frame like this:
date value category
1951-07 50 5
1951-08 50 5
1951-09 50 5
Values that repeat only within a single month, or that repeat for a few months but not all of them, should be ignored.
The categories often pair with the value (as shown in the example), but sometimes they don't. So I tried doing it by category, but it didn't give me exact results.
My current approach is to filter for duplicates and then keep those that occur 12 times (as I'm searching per year). But it also gives me values that repeat 12 times within a single month.
df = df[df.duplicated(['value'],keep=False)]
v = df.value.value_counts()
df_12 = df[df.value.isin(v.index[v.gt(12)])]
Any help would be appreciated.
I would first group by values and remove duplicates on dates:
tmp = df.groupby('value')['date'].apply(lambda x: x.drop_duplicates())
Your sample would give:
value
13     8    1951-09
50     2    1951-07
       4    1951-08
       7    1951-09
130    1    1951-07
184    6    1951-09
199    0    1951-07
       3    1951-08
Name: date, dtype: object
Then we can safely count the dates per value and keep only the values having the expected count (3 here, one per month in the sample):
total = tmp.groupby(level=0).count()
total = total[total == 3]
We get:
value
50 3
Name: date, dtype: int64
We can finally filter the original dataframe:
df[df['value'].isin(total.index)]
giving the expected:
date value category
2 1951-07 50 5
4 1951-08 50 5
7 1951-09 50 5
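For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample from the question and runs the three steps end to end. The DataFrame construction is my own reconstruction of the sample, `result` is just an illustrative name, and `group_keys=True` is spelled out only to make the (value, original index) layout explicit.

```python
import pandas as pd

# Sample data reconstructed from the question (an assumption about its layout).
df = pd.DataFrame({
    "date": ["1951-07", "1951-07", "1951-07",
             "1951-08", "1951-08", "1951-08",
             "1951-09", "1951-09", "1951-09"],
    "value": [199, 130, 50, 199, 50, 199, 184, 50, 13],
    "category": [1, 3, 5, 1, 5, 1, 2, 5, 13],
})

# Step 1: for each value, keep each date only once.
tmp = df.groupby("value", group_keys=True)["date"].apply(
    lambda x: x.drop_duplicates()
)

# Step 2: count the distinct dates per value and keep the full ones.
total = tmp.groupby(level=0).count()
total = total[total == 3]  # 3 months in the sample

# Step 3: filter the original rows on the surviving values.
result = df[df["value"].isin(total.index)]
print(result)
```

Note that 199 appears three times in the sample but only in two distinct months, which is why it is dropped after the de-duplication step.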
Following Jezrael's comment, the first steps to build total
should become:
total = (df.drop_duplicates(['date', 'value'])
           .groupby('value')['date']
           .count())
total = total[total == 3]
It is both simpler and faster.
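If you would rather not hard-code the expected count (3 for the sample, 12 for a full year), one possible variant is to compare the number of distinct months per value against the number of distinct months in the whole frame. This is only a sketch under the assumption that the frame covers exactly the period you care about; `n_months`, `months_per_value` and `every_month` are names I made up for illustration.

```python
import pandas as pd

# Sample data reconstructed from the question (an assumption about its layout).
df = pd.DataFrame({
    "date": ["1951-07", "1951-07", "1951-07",
             "1951-08", "1951-08", "1951-08",
             "1951-09", "1951-09", "1951-09"],
    "value": [199, 130, 50, 199, 50, 199, 184, 50, 13],
    "category": [1, 3, 5, 1, 5, 1, 2, 5, 13],
})

# Number of distinct months present in the data.
n_months = df["date"].nunique()

# Number of distinct months in which each value appears;
# repeats inside a single month are counted only once.
months_per_value = df.groupby("value")["date"].nunique()

# Values seen in every month, then the matching original rows.
every_month = months_per_value[months_per_value == n_months].index
result = df[df["value"].isin(every_month)]
print(result)
```

Because `nunique` counts each month once per value, a value repeated many times inside one month cannot reach the threshold on its own.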