Efficiently select elements from an (x,y) field with a 2D mask in Python

Orca :

I have a large field of 2D-position data, given as two arrays x and y, where len(x) == len(y). I would like to return the array of indices idx_masked at which (x[idx_masked], y[idx_masked]) is masked by an N x N int array called mask. That is, mask[x[idx_masked], y[idx_masked]] == 1. The mask array consists of 0s and 1s only.

I have come up with the following solution, but it (specifically, the last line below) is very slow, given that I have N x N = 5000 x 5000, repeated 1000s of times:

import numpy as np
import matplotlib.pyplot as plt

# example mask of one corner of a square
N = 100
mask = np.zeros((N, N))
mask[0:10, 0:10] = 1

# example x and y position arrays in arbitrary units
x = np.random.uniform(0, 1, 1000)
y = np.random.uniform(0, 1, 1000)

x_bins = np.linspace(np.min(x), np.max(x), N)
y_bins = np.linspace(np.min(y), np.max(y), N)

x_bin_idx = np.digitize(x, x_bins)
y_bin_idx = np.digitize(y, y_bins)

idx_masked = np.ravel(np.where(mask[y_bin_idx - 1, x_bin_idx - 1] == 1))

plt.imshow(mask[::-1, :])

the mask itself

plt.scatter(x, y, color='red')
plt.scatter(x[idx_masked], y[idx_masked], color='blue')

the masked particle field

Is there a more efficient way of doing this?

Mad Physicist :

Given that mask overlays your field with identically-sized bins, you do not need to define the bins explicitly. *_bin_idx can be determined at each location from a simple floor division, since you know that each bin is 1 / N in size. I would recommend using 1 - 0 for the total width (what you passed into np.random.uniform) instead of x.max() - x.min(), if of course you know the expected size of the range.

x0 = 0   # or x.min()
x1 = 1   # or x.max()
x_bin = (x1 - x0) / N
x_bin_idx = ((x - x0) // x_bin).astype(int)

# ditto for y

This will be faster and simpler than digitizing, and avoids the extra bin at the beginning.

For most purposes, you do not need np.where. 90% of the questions asking about it (including this one) should not be using where. If you want a fast way to access the necessary elements of x and y, just use a boolean mask. The mask is simply

selction = mask[x_bin_idx, y_bin_idx].astype(bool)

If mask is already a boolean (which it should be anyway), the expression mask[x_bin_idx, y_bin_idx] is sufficient. It results in an array of the same size as x_bin_idx and y_bin_idx (which are the same size as x and y) containing the mask value for each of your points. You can use the mask as

x[selection]   # Elements of x in mask
y[selection]   # Elements of y in mask

If you absolutely need the integer indices, where is sill not your best option.

indices = np.flatnonzero(selection)

OR

indices = selection.nonzero()[0]

If your goal is simply to extract values from x and y, I would recommend stacking them together into a single array:

coords = np.stack((x, y), axis=1)

This way, instead of having to apply indices twice, you can extract the values with just

coords[selection, :]

OR

coords[indices, :]

Depending on the relative densities of mask and x and y, either the boolean masking or linear indexing may be faster. You will have to time some relevant cases to get a better intuition.

Guess you like

Origin http://10.200.1.11:23101/article/api/json?id=387023&siteId=1