NumPy apply function to groups of rows corresponding to another numpy array

cmed123 :

I have a NumPy array with each row representing some (x, y, z) coordinate like so:

a = array([[0, 0, 1],
           [1, 1, 2],
           [4, 5, 1],
           [4, 5, 2]])

I also have another NumPy array with unique values of the z-coordinates of that array like so:

b = array([1, 2])

How can I apply a function, let's call it "f", to each of the groups of rows in a which correspond to the values in b? For example, the first value of b is 1 so I would get all rows of a which have a 1 in the z-coordinate. Then, I apply a function to all those values.

In the end, the output would be an array the same shape as b.

I'm trying to vectorize this to make it as fast as possible. Thanks!

Example of an expected output (assuming that f is count()):

c = array([2, 2])

because 2 rows of a have a z value of 1 and 2 rows of a have a z value of 2, matching the two values in b.

A trivial solution would be to iterate over array b like so:

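import numpy as np
c = []
for val in b:
    rows = a[a[:, 2] == val]  # rows of a whose z-coordinate equals val
    c.append(f(rows))         # apply the function to that group
c = np.array(c)               # same shape as b when f returns a scalar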

My attempt:

I tried doing something like this, but it just returns an empty array.

func(a[a[:, 2]==b])

Andreas K. :

The problem is that the groups of rows with the same Z can have different sizes, so you cannot stack them into one 3D numpy array, which would let you easily apply a function along the third dimension. One solution is to use a for-loop; another is to use np.split:

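import numpy as np

a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2],
              [4, 3, 1]])

# sort the rows by z; a stable sort keeps the original row order within each group
a_sorted = a[a[:,2].argsort(kind='stable')]

# index of the first row of each group in the sorted array (the first index is always 0)
inds = np.unique(a_sorted[:,2], return_index=True)[1]

# splitting at index 0 yields an empty leading piece, which [1:] drops
a_split = np.split(a_sorted, inds)[1:]

# [array([[0, 0, 1],
#         [4, 5, 1],
#         [4, 3, 1]]),
#  array([[1, 1, 2],
#         [4, 5, 2]])]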

f = np.sum  # example of a function

result = list(map(f, a_split))
# [19, 15]
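
The steps above can be wrapped into a small helper that also returns the unique z values, so the result lines up with b from the question. This is only a sketch: the name apply_by_group and its col parameter are made up for illustration, and it assumes f returns a scalar for each group.

```python
import numpy as np

def apply_by_group(a, f, col=2):
    """Apply f to each block of rows of a that share a value in column col."""
    # Stable sort so rows keep their original order inside each group.
    order = a[:, col].argsort(kind='stable')
    a_sorted = a[order]
    # Unique keys and the index where each group starts in the sorted array.
    keys, starts = np.unique(a_sorted[:, col], return_index=True)
    # starts[0] is always 0, so split only at the remaining boundaries.
    groups = np.split(a_sorted, starts[1:])
    return keys, np.array([f(g) for g in groups])

a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2],
              [4, 3, 1]])

b, c = apply_by_group(a, np.sum)
# b -> array([1, 2]), c -> array([19, 15])

counts = apply_by_group(a, len)[1]
# counts -> array([3, 2])
```

Note that f is still called once per group in Python, so this is only as vectorized as the sorting and splitting steps.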

But imho the best solution is to use pandas and groupby as suggested by FBruzzesi. You can then convert the result to a numpy array.

EDIT: For completeness, here are the other two solutions

List comprehension:

b = np.unique(a[:,2])
result = [f(a[a[:,2] == z]) for z in b]
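
As a side note on the attempt in the question: a[:, 2] == b compares an array of shape (n,) with one of shape (len(b),), and those shapes do not line up. Adding an axis makes the comparison broadcast against every value of b at once. The sketch below assumes f is a simple reduction such as a count or a sum; for an arbitrary f you still need one of the per-group approaches.

```python
import numpy as np

a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2],
              [4, 3, 1]])
b = np.unique(a[:, 2])

# mask[i, j] is True when row i of a has z == b[j]; shape (len(a), len(b)).
mask = a[:, 2][:, None] == b

# Number of rows in each group.
counts = mask.sum(axis=0)

# Sum of all entries in each group: per-row totals, routed to groups by the mask.
row_totals = a.sum(axis=1)
sums = (row_totals[:, None] * mask).sum(axis=0)
```

The mask takes len(a) * len(b) entries, so this is only practical when b is small.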

Pandas:

import pandas as pd

df = pd.DataFrame(a, columns=list('XYZ'))
result = df.groupby(['Z']).apply(lambda x: f(x.values)).tolist()
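
To get the pandas result back as a numpy array with the same shape as b, something like the following should work. It is a sketch that differs slightly from the snippet above: the z values are passed to groupby as a separate key array instead of a column name, so each group keeps all three columns and the code does not depend on how the installed pandas version treats grouping columns inside apply.

```python
import numpy as np
import pandas as pd

a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2],
              [4, 3, 1]])
f = np.sum  # example function

# Group the rows by an external key array holding the z values.
grouped = pd.DataFrame(a, columns=list('XYZ')).groupby(a[:, 2])

# One scalar per group, returned as a Series indexed by the z value.
result = grouped.apply(lambda g: f(g.to_numpy()))

b = result.index.to_numpy()  # unique z values, sorted
c = result.to_numpy()        # f applied to each group, same shape as b
```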

This is the performance plot I got for a = np.random.randint(0, 100, (n, 3)):

[performance plot; image not available]

As you can see, up to approximately n = 10^5 the "split solution" is the fastest, but beyond that the pandas solution performs better.
