Reduce dataframe using a running flag

dokondr :

I have a dataframe with x, y coordinates of some points. Every point (coordinate pair) is also tagged with a True/False flag:

import pandas as pd

xs = [1,3,7,5,4,6,2,8,9,0]
ys = [0,7,4,5,2,6,9,1,3,8]
flags = [True,False,False,False,True,True,False,True,True,True]
df = pd.DataFrame({'x':xs, 'y':ys,'flag':flags})


    x   y   flag
0   1   0   True
1   3   7   False
2   7   4   False
3   5   5   False
4   4   2   True
5   6   6   True
6   2   9   False
7   8   1   True
8   9   3   True
9   0   8   True

What reduce function can be used to compute two total distances?

1) The total distance over routes that start at a True point (either the very first point or one that directly follows a False point), include all consecutive True points, and end at the next False point or at the end of the data.

2) The total distance over routes that start at a False point (either the very first point or one that directly follows a True point), include all consecutive False points, and end at the next True point or at the end of the data.

In this example, the following segments need to be summed up to get the total distances:

1) Routes built from True points:
(1,0) - (3,7)
---
(4,2) - (6,6)
(6,6) - (2,9)
---
(8,1) - (9,3)
(9,3) - (0,8)

2) Routes built from False points:
(3,7) - (7,4)
(7,4) - (5,5)
(5,5) - (4,2)
---
(2,9) - (8,1)
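
One way to read the two lists above is that every segment between consecutive points belongs to the flag of its starting point. A minimal sketch in plain Python under that assumption (the name `segments_by_flag` is hypothetical and only used for illustration):

```python
xs = [1, 3, 7, 5, 4, 6, 2, 8, 9, 0]
ys = [0, 7, 4, 5, 2, 6, 9, 1, 3, 8]
flags = [True, False, False, False, True, True, False, True, True, True]

points = list(zip(xs, ys))

# Hypothetical structure: segments keyed by the flag of their start point.
segments_by_flag = {True: [], False: []}

# zip stops at the shortest input, so the last point only ever
# appears as the end of a segment, never as a start.
for start, end, flag in zip(points, points[1:], flags):
    segments_by_flag[flag].append((start, end))

print(segments_by_flag[True])
print(segments_by_flag[False])
```

With the data from the question this reproduces the five True segments and the four False segments listed above.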

So, for example, with the segments of the True routes:

points = [((1,0),(3,7)), ((4,2),(6,6)), ((6,6),(2,9)), 
           ((8,1),(9,3)), ((9,3),(0,8))]

import math

# Compute the Euclidean distance between two points:
def distance(x1,y1,x2,y2):
    return math.sqrt((x2-x1)**2 + (y2-y1)**2)

Total distance:

total_distance = 0
for (x1, y1), (x2, y2) in points:
    total_distance += distance(x1, y1, x2, y2)

print(total_distance)

29.283943962766887

How can I calculate these distances with a reduce function, without using pandas.DataFrame.iterrows?

Ben.T :

First, you can calculate the distance from each point to the next one in a vectorized way with shift:

import numpy as np

df['dist'] = np.sqrt((df['x']-df['x'].shift(-1))**2 + (df['y']-df['y'].shift(-1))**2)
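
To make the role of shift(-1) concrete, here is a small sketch on the question's data. It pairs each row with the coordinates of the next row, so the value on row i is the length of the segment from point i to point i+1. The last row has no successor and ends up as NaN, which sum() skips by default. The names `next_x` and `next_y` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [1, 3, 7, 5, 4, 6, 2, 8, 9, 0],
    'y': [0, 7, 4, 5, 2, 6, 9, 1, 3, 8],
})

# shift(-1) moves each column up by one row, so row i sees point i+1
next_x = df['x'].shift(-1)
next_y = df['y'].shift(-1)
dist = np.sqrt((df['x'] - next_x) ** 2 + (df['y'] - next_y) ** 2)

print(dist.isna().tolist())  # only the last entry is True
print(dist.iloc[1])          # 5.0, the segment (3,7) - (7,4)
```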

Then you can create a mask for the True condition with cumsum and diff on the flag column:

mask_true = df['flag'].cumsum().diff().fillna(df['flag']).gt(0)
# use loc to select these rows in the dist column, then sum
print (df.loc[mask_true,'dist'].sum())
# 29.283943962766887

For the False condition, the mask is simply the complement, so you get:

print (df.loc[~mask_true,'dist'].sum())
# 20.39834563766817

EDIT: sometimes the easiest solution does not come first. mask_true is actually identical to df['flag'], so once you have created the dist column, you can sum directly:

print (df.loc[df['flag'],'dist'].sum())
# 29.283943962766887
print (df.loc[~df['flag'],'dist'].sum())
# 20.39834563766817
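
Since the mask is just the flag column, both totals can also be obtained in one pass by grouping on it. A minimal sketch, assuming the same data as in the question; using np.hypot with diff(-1) instead of the explicit square root with shift(-1) is a substitution made here, not part of the answer above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [1, 3, 7, 5, 4, 6, 2, 8, 9, 0],
    'y': [0, 7, 4, 5, 2, 6, 9, 1, 3, 8],
    'flag': [True, False, False, False, True, True, False, True, True, True],
})

# diff(-1) gives the difference between row i and row i+1;
# hypot turns the x and y differences into a Euclidean length.
df['dist'] = np.hypot(df['x'].diff(-1), df['y'].diff(-1))

# One sum per flag value; the NaN on the last row is skipped.
totals = df.groupby('flag')['dist'].sum().to_dict()

print(totals[True])   # total for the True routes
print(totals[False])  # total for the False routes
```

The two printed values should agree with the 29.28... and 20.39... totals shown above, up to floating-point rounding.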
