The following is a sample dataframe. My actual dataset has 30k rows.
import pandas as pd

df = pd.DataFrame({'Account': [30, 30, 30, 30, 30, 30, 30, 40, 40, 40],
                   'Start': [2, 2, 2, 2, 2, 3, 3, 1, 1, 1],
                   'Amount': [500, 600, 800, 200, 700, 10, 800, 10, 50, 70]})
   Account  Start  Amount
0       30      2     500
1       30      2     600
2       30      2     800
3       30      2     200
4       30      2     700
5       30      3      10
6       30      3     800
7       40      1      10
8       40      1      50
9       40      1      70
I want to find all rows (grouped by Account and Start) where the Amount in one row is within ±50% of the Amount in the next row. I expect the result to look like this.
   Account  Start  Amount
0       30      2     500
1                     600
2                     800
8       40      1      50
9                      70
Row 3 is excluded because its amount, 200, is less than 50% of the amount in row 2 as well as the amount in row 4.
Row 4 is excluded because it is the last element in Start = 2 and the previous row is also excluded.
Similarly, rows 5 and 6 are excluded.
Row 7 is excluded as 10 is less than 50% of the amount in row 8.
PS: In the final dataset, each group of Account and Start should have at least 4 rows.
Is there a way to do this efficiently?
We use pct_change, checking if the result is between -50% and 50%. Because you want pairs of rows, we need to check this mask or the shifted mask (shifted in the opposite direction to the one used for pct_change). We'll apply this function to each group separately.
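def keep_within_pct(gp, shift=1, pcts=(-0.5, 0.5)):
    # True where a row's Amount is within pcts of the following row's Amount
    m = gp['Amount'].pct_change(-shift).between(*pcts)
    # Also keep the second row of each qualifying pair
    return gp[m | m.shift(shift, fill_value=False)]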
df.groupby(['Account', 'Start'], group_keys=False).apply(keep_within_pct)
   Account  Start  Amount
0       30      2     500
1       30      2     600
2       30      2     800
8       40      1      50
9       40      1      70
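Since the question asks about efficiency, here is a sketch of the same idea without apply, so the function is not called once per group in Python. It assumes the rows of each group are already in the order you want to compare, as in the sample. The names keys, m, m_next and result are just illustrative, and I have not timed this on the 30k-row dataset.

```python
import pandas as pd

df = pd.DataFrame({'Account': [30, 30, 30, 30, 30, 30, 30, 40, 40, 40],
                   'Start': [2, 2, 2, 2, 2, 3, 3, 1, 1, 1],
                   'Amount': [500, 600, 800, 200, 700, 10, 800, 10, 50, 70]})

keys = ['Account', 'Start']  # illustrative name for the grouping columns

# Percent change of each Amount relative to the next row in the same group.
# The last row of every group gets NaN, which between() treats as False.
m = df.groupby(keys)['Amount'].pct_change(-1).between(-0.5, 0.5)

# Shift the mask down one row within each group so the second row of
# every qualifying pair is kept as well.
m_next = m.groupby([df[k] for k in keys]).shift(1, fill_value=False)

result = df[m | m_next]
print(result)
```

On the sample data this selects the same rows as the apply version (index 0, 1, 2, 8 and 9), and the Account and Start columns stay in the result because the mask is applied to the original df.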