Detailed explanation: Calculation of v-mAP in action detection

I. Introduction

The calculation of v-mAP here follows the method from "Action Tubelet Detector for Spatio-Temporal Action Localization". The official implementation of that paper is based on Caffe; since I am more familiar with PyTorch, I eventually found a PyTorch project for similar action detection that uses the same evaluation method. The code can be downloaded here:

https://github.com/burhanmudassar/pytorch-action-detection#visualizations-generated-from-act-mobilenet-v2-after-post-processing-of-tubelets-using-incremental-linking-algorithm-1

Let's first look at how the data flows. The input to the model has shape [b_s, tb, c, h, w]. For example, (8, 2, 3, 300, 300) means 8 sampled frames; 2 means each sample contains the image data of the current frame and the next frame; 3 is the number of channels (RGB); and the last two dimensions are the height and width. In other words, each input to the model contains both the current frame and the next frame.
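
As a point of reference, here is a minimal sketch of this input layout with a dummy tensor (the variable names are illustrative only, not taken from the linked repository):

import torch

# 8 sampled frames, each paired with its next frame, RGB, 300x300
b_s, tb, c, h, w = 8, 2, 3, 300, 300
clip = torch.randn(b_s, tb, c, h, w)

current_frames = clip[:, 0]   # [8, 3, 300, 300] - the current frame of each pair
next_frames    = clip[:, 1]   # [8, 3, 300, 300] - the following frame of each pair
print(clip.shape, current_frames.shape, next_frames.shape)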

1. Model output

The model is ACT-VGG16, and its output consists of three parts (a small shape sketch follows the list):

1. Predicted bbox coordinates, e.g. [8, 8116, 2, 4]: 8 is the number of sampled frames, 8116 is the number of bboxes predicted per frame, 2 corresponds to the current frame and the next frame, and 4 is the box coordinates. That is, for each input frame the model predicts not only the bboxes of the current frame but also the bboxes of the next frame.

2. Predicted class scores for each bbox, e.g. [8, 8116, 22]: 22 is the number of classes, i.e. 21 action classes plus 1 background class.

3. Correction information for the predicted bboxes, with shape [8116, 4].
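
To make these shapes concrete, here is a minimal sketch with dummy tensors; the names loc, conf and prior are my own placeholders, not variable names taken from the repository:

import torch

num_frames, num_priors, num_classes = 8, 8116, 22

loc   = torch.randn(num_frames, num_priors, 2, 4)          # box coordinates for current + next frame
conf  = torch.randn(num_frames, num_priors, num_classes)   # per-class scores (21 actions + background)
prior = torch.randn(num_priors, 4)                         # per-box correction/reference information

# e.g. the boxes that frame 0 predicts for its *next* frame:
next_frame_boxes_from_frame0 = loc[0, :, 1, :]             # [8116, 4]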

2. How to generate tubes?

Let me first explain what tubes are: they can be intuitively understood as action sequences, as shown in Figure 1 below.

Suppose an action in the JHMDB dataset lasts 40 frames. The annotation file then labels, for every one of those frames, the spatial extent of the action (a bbox), shown as the green boxes in the figure. Linking the bboxes of the same action across the 40 frames forms a tube.

Figure 1

The above explains what tubes are. Figure 1 shows the ground-truth (labeled) tubes; the goal is to predict tubes that are as close to them as possible. So how do we build tubes from the predicted bboxes? Roughly speaking, we first filter the bboxes with NMS, then link frame by frame: starting from a box in the current frame, the box with the largest IoU in the next frame is taken as the tube's next element, and so on until the tube is complete (a simplified sketch of this linking is given below). The detailed generation process is explained in the next article, "TubeR: Tubelet Transformer for Video Action Detection".
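
Below is a minimal sketch of such greedy IoU-based linking, assuming the per-frame detections have already been filtered by NMS; it is a simplification for illustration, not the incremental linking algorithm used in the repository:

import numpy as np

def iou2d(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2)."""
    xx1 = np.maximum(box[0], boxes[:, 0])
    yy1 = np.maximum(box[1], boxes[:, 1])
    xx2 = np.minimum(box[2], boxes[:, 2])
    yy2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
    area1 = (box[2] - box[0]) * (box[3] - box[1])
    area2 = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area1 + area2 - inter)

def link_tube(detections_per_frame, start_box):
    """Greedily extend a tube: in each following frame pick the box with the largest IoU."""
    tube = [start_box]
    for dets in detections_per_frame[1:]:          # dets: [N, 5] -> x1, y1, x2, y2, score
        ious = iou2d(tube[-1][:4], dets[:, :4])
        tube.append(dets[np.argmax(ious)])
    return np.stack(tube)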

In the end, the data of a tube looks as follows: each row's x1, y1, x2, y2 is followed by a score for that box, and the tube as a whole also carries a score, obtained by averaging the scores of the boxes that make up the tube.
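
For illustration, a tube might look like this (the numbers are made up); each row is [frame, x1, y1, x2, y2, box_score]:

import numpy as np

tube = np.array([
    [1, 105.0, 40.0, 180.0, 220.0, 0.91],
    [2, 107.0, 42.0, 182.0, 221.0, 0.88],
    [3, 110.0, 44.0, 185.0, 223.0, 0.90],
])
tube_score = tube[:, 5].mean()   # score of the whole tube = average of the box scores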

At this point some readers may wonder: each model output predicts boxes for both the current frame and the next frame, so every frame except the first and the last receives two predictions. Which one should be chosen? Here the two predicted boxes are simply averaged to give the final box coordinates for that frame, and the scores are handled the same way.
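
A tiny sketch of this averaging step (purely illustrative values): a middle frame t has one prediction coming from the pair (t, t+1) and one from the pair (t-1, t).

import numpy as np

box_from_pair_t        = np.array([110.0, 44.0, 185.0, 223.0, 0.90])  # x1, y1, x2, y2, score
box_from_pair_t_minus1 = np.array([108.0, 46.0, 183.0, 225.0, 0.86])

final_box_for_frame_t = (box_from_pair_t + box_from_pair_t_minus1) / 2.0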

Finally, we obtain the predicted tubes, shown as the red boxes in Figure 2.

Figure 2

The video-mAP metric is actually computed on tubes. Before computing video-mAP, we also need to know how to compute iou3d. Most of us know how to compute a 2D IoU; the 3D version is quite similar.

3. Calculation method of iou3dt:

The following is the code of iou3dt, with each step commented. In essence, it computes the iou3d of the temporally overlapping part of two tubes and scales it by the temporal overlap ratio. The input is two tubes, and the output is a single value indicating how similar the two tubes are.

import numpy as np

def iou3dt(b1, b2, spatialonly=False):
    """Compute the spatio-temporal IoU between two tubes."""
    # ==== find the temporal range shared by the predicted tube and the GT tube ====
    tmin = max(b1[0, 0], b2[0, 0])
    tmax = min(b1[-1, 0], b2[-1, 0])

    if tmax < tmin: return 0.0
    # ==== temporal intersection: number of overlapping frames ====
    temporal_inter = tmax - tmin + 1
    # ==== temporal union: total number of frames covered by the two tubes ====
    temporal_union = max(b1[-1, 0], b2[-1, 0]) - min(b1[0, 0], b2[0, 0]) + 1
    # ==== slice out the overlapping frames of each tube ====
    tube1 = b1[int(np.where(b1[:, 0] == tmin)[0][0]) : int(np.where(b1[:, 0] == tmax)[0][0]) + 1, :]
    tube2 = b2[int(np.where(b2[:, 0] == tmin)[0][0]) : int(np.where(b2[:, 0] == tmax)[0][0]) + 1, :]
    # ==== iou3d of the overlapping part, scaled by the temporal intersection-over-union ====
    return iou3d(tube1, tube2) * (1. if spatialonly else temporal_inter / temporal_union)

The following is the code for the iou3d part. In essence, the IoUs of the corresponding boxes are averaged over all frames. The input of iou3d is two tubes with the same temporal extent, and the output is a single value.

def iou3d(b1, b2):
    """Compute the IoU between two tubes with the same temporal extent."""

    assert b1.shape[0] == b2.shape[0]
    assert np.all(b1[:, 0] == b2[:, 0])
    # intersection area of each pair of corresponding boxes
    ov = overlap2d(b1[:, 1:5], b2[:, 1:5])
    # average the per-frame IoUs over all boxes
    return np.mean(ov / (area2d(b1[:, 1:5]) + area2d(b2[:, 1:5]) - ov))
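
iou3d relies on two helpers from the same codebase, overlap2d and area2d. Below is a minimal re-implementation for illustration (my own sketch; the repository's versions may differ in details such as the +1 pixel convention), followed by a small usage example that assumes iou3dt and iou3d from above are in scope:

import numpy as np

def area2d(b):
    """Area of each box in an [N, 4] array of (x1, y1, x2, y2)."""
    return (b[:, 2] - b[:, 0] + 1) * (b[:, 3] - b[:, 1] + 1)

def overlap2d(b1, b2):
    """Intersection area between corresponding rows of two [N, 4] box arrays."""
    xmin = np.maximum(b1[:, 0], b2[:, 0])
    ymin = np.maximum(b1[:, 1], b2[:, 1])
    xmax = np.minimum(b1[:, 2], b2[:, 2])
    ymax = np.minimum(b1[:, 3], b2[:, 3])
    width  = np.maximum(0, xmax - xmin + 1)
    height = np.maximum(0, ymax - ymin + 1)
    return width * height

# Small usage example: two 3-frame tubes, rows = [frame, x1, y1, x2, y2]
t1 = np.array([[1, 10, 10, 50, 50], [2, 12, 10, 52, 50], [3, 14, 10, 54, 50]], dtype=float)
t2 = np.array([[2, 11, 11, 51, 51], [3, 13, 11, 53, 51], [4, 15, 11, 55, 51]], dtype=float)
print(iou3dt(t1, t2))   # spatial IoU of the overlapping frames * temporal IoU (2/4 here)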

4. Calculation method of video-mAP:

As with f-mAP, to compute mAP we first need to compute AP, and AP is defined per class.

So, for a given class:

We first collect the predicted tubes belonging to this class and the GT tubes belonging to this class, grouping them by the video they come from. An array pr is declared for precision and recall: the first column stores precision and the second column stores recall. Note that an extra first row is added with precision = 1 and recall = 0; this prepares for computing the area enclosed by the curve and the axes later. As mentioned in the previous article, fn and tn are not tracked separately: the fn here is effectively the number of GT tubes, which is what recall is computed against below.

Sort the predicted tubes by their scores from high to low. Then compute iou3dt between each predicted tube and the GT tubes: if it exceeds the threshold we set (and that GT tube has not already been matched), the prediction is counted as a TP; otherwise it is an FP.

The sum of all TPs and FPs is simply the number of predicted tubes, and the sum of TPs and FNs equals sum(gt), the number of GT tubes. This gives the precision and recall of a single class; repeating the above steps over the videos of every class gives the precision and recall of all classes.
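
Putting the above together, here is a minimal sketch of the per-class precision/recall computation; the data layout (a list of scored predicted tubes and a dict of GT tubes keyed by video) is my own simplification, and iou3dt is the function from the previous section:

import numpy as np

def class_precision_recall(pred_tubes, gt_tubes, iou_thresh=0.5):
    """pred_tubes: list of (video, score, tube); gt_tubes: dict video -> list of tubes."""
    # total number of GT tubes for this class: the quantity recall is measured against
    num_gt = sum(len(v) for v in gt_tubes.values())

    # sort predictions by score, highest first
    pred_tubes = sorted(pred_tubes, key=lambda p: -p[1])

    # first row: precision = 1, recall = 0, then one row per detection
    pr = np.empty((len(pred_tubes) + 1, 2))
    pr[0] = [1.0, 0.0]

    tp, fp = 0, 0
    matched = {v: [False] * len(t) for v, t in gt_tubes.items()}
    for i, (video, score, tube) in enumerate(pred_tubes):
        is_tp = False
        for j, gt in enumerate(gt_tubes.get(video, [])):
            if not matched[video][j] and iou3dt(tube, gt) >= iou_thresh:
                matched[video][j] = True
                is_tp = True
                break
        tp, fp = tp + is_tp, fp + (not is_tp)
        pr[i + 1] = [tp / (tp + fp), tp / num_gt]
    return pr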

AP is then the area enclosed by the precision-recall curve and the coordinate axes. The area calculation here differs slightly from the all-point interpolation method: in all-point interpolation, the height of each small interval is the maximum precision within it, whereas here it is the mean of the precision values at the two ends of the interval (i.e. the trapezoidal rule). The essence is still the area under the curve.
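
Given the pr array built above, that area can be computed as in this small sketch:

import numpy as np

def ap_from_pr(pr):
    recall_steps  = pr[1:, 1] - pr[:-1, 1]           # width of each recall interval
    precision_avg = (pr[1:, 0] + pr[:-1, 0]) / 2.0   # mean precision at the two ends
    return float(np.sum(recall_steps * precision_avg))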

Averaging the APs of all classes gives video-mAP, which concludes the explanation. It is not hard to see that the way AP is computed is exactly the same as in the 2D case; the only difference lies in how the IoU is computed.

In essence, video-mAP measures how similar the predicted and ground-truth ranges of motion are within a video clip.

 


Original post: blog.csdn.net/qq_58484580/article/details/131784103