[Paper Reading] Intensive Reading of Temporal Action Detection Papers (2019)

1. BMN: Boundary-Matching Network for Temporal Action Proposal Generation

Purpose of the paper - problem to solve

  • Current bottom-up proposal generation methods can generate proposals with precise boundaries, but cannot efficiently produce reliable confidence scores for retrieving those proposals.

Contribution - Innovation

  • The Boundary-Matching mechanism is proposed, using a 2D confidence map to represent the scores of densely distributed candidate proposals with continuously varying boundaries.
  • An efficient, end-to-end proposal generation network, BMN (Boundary-Matching Network), is proposed.

Implementation process

The BMN network simultaneously generates a boundary probability sequence and a Boundary-Matching (BM) confidence map.

BM confidence map: proposals in the same row share the same duration, and proposals in the same column share the same start time.
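The row/column layout of the BM confidence map can be sketched with a toy example (the sizes T and D below are made up for illustration; in BMN the map is filled in by the network, not by hand):

```python
import numpy as np

T = 8          # number of temporal locations in the (toy) video
D = 8          # maximum proposal duration considered

# BM confidence map: rows index duration, columns index start time.
# Cell (i, j) scores the proposal starting at t_j with duration i + 1,
# i.e. the interval [j, j + i + 1).
conf_map = np.zeros((D, T))

# A mask marks which cells are valid proposals (they must end inside the video).
valid = np.zeros((D, T), dtype=bool)
for i in range(D):          # duration index -> duration i + 1
    for j in range(T):      # start index
        if j + i + 1 <= T:
            valid[i, j] = True

# Proposals in one row share a duration; proposals in one column share a start.
row = [(j, j + 3) for j in range(T) if valid[2, j]]   # all duration-3 proposals
col = [(0, i + 1) for i in range(D) if valid[i, 0]]   # all proposals starting at t=0
print(row)
print(col)
```

Reading off one row or one column of the map therefore densely scores all proposals of one duration, or all proposals sharing one start time.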

Detailed method

  • Boundary-Matching Mechanism:
    First, a temporal proposal φ is represented as a matching pair of its start boundary t_s and end boundary t_e. The goal of the BM mechanism is to generate a two-dimensional BM confidence map M_c, which is constructed from BM pairs with different start boundaries and durations.
  • Boundary-Matching Network:
    The BMN model consists of three modules. The Base Module processes the input feature sequence; its output features are shared by the two following modules. The Temporal Evaluation Module evaluates the start and end probability of each temporal location in the video and produces the boundary probability sequence. The Proposal Evaluation Module contains the BM layer, which transfers the feature sequence to a BM feature map, followed by a series of 3D and 2D convolutional layers that generate the BM confidence map.
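As a rough sketch of what the BM layer does: for every candidate proposal it samples N points uniformly inside the interval with linear interpolation, turning a (C, T) feature sequence into a (C, N, D, T) feature map. The paper implements this as a single matrix product with a precomputed sampling mask; the loop version below (toy sizes) is only for readability:

```python
import numpy as np

def bm_layer(features, D, N):
    """Toy BM layer: expand a (C, T) feature sequence into a
    (C, N, D, T) map by sampling N points per candidate proposal
    with linear interpolation."""
    C, T = features.shape
    out = np.zeros((C, N, D, T))
    for i in range(D):                  # duration index -> duration i + 1
        for j in range(T):              # start index
            if j + i + 1 > T:
                continue                # proposal runs past the video end
            # N sample positions spread uniformly over [j, j + i + 1]
            pos = np.linspace(j, j + i + 1, N)
            for n, p in enumerate(pos):
                lo, hi = int(np.floor(p)), int(np.ceil(p))
                w = p - lo
                lo, hi = min(lo, T - 1), min(hi, T - 1)
                out[:, n, i, j] = (1 - w) * features[:, lo] + w * features[:, hi]
    return out

feats = np.arange(12, dtype=float).reshape(1, 12)   # C=1, T=12 ramp signal
bm = bm_layer(feats, D=4, N=3)
print(bm.shape)   # (1, 3, 4, 12)
```

The 3D/2D convolutions of the Proposal Evaluation Module then consume this map to produce the BM confidence map.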

Note to self: I am still in a fog after reading this; I went through some blog posts but still don't fully understand it, and will come back to this paper several more times.

2. MGG: Multi-granularity Generator for Temporal Action Proposal

Purpose of the paper - problem to solve

Both existing categories of proposal generation methods have their own advantages and disadvantages.

  • Segment proposals: since segments are regularly distributed or manually defined (fixed), the generated proposals naturally have imprecise boundaries.
  • Frame actionness: densely evaluates a confidence score for each frame and groups consecutive frames into proposals. However, this approach tends to yield low confidence for long video clips, causing real action segments to be missed and recall to drop.

Contribution - Innovation

  • An end-to-end MGG (Multi-granularity Generator) is proposed for temporal action proposal generation, using a new method of integrating video features with position embedding information.
  • A bilinear matching model is proposed to exploit the rich local information in video sequences; its output is consumed by the subsequent SPP and FAP.
  • SPP is implemented as a U-shaped structure with lateral connections to capture proposals of various spans with high recall, while FAP evaluates the probability of each frame being a starting point, ending point, or intermediate point.
  • Segment proposal boundaries are then temporally adjusted by exploiting the complementary information in frame actionness.

Implementation process


  • Video visual features are first combined with position embedding information to form video representations;
  • Use BaseNet to further extract video features;
  • Use the Segment Proposal Producer (SPP) to extract coarse candidate proposals;
  • Use the Frame Actionness Producer (FAP) to obtain the start/end/actionness score of each frame at a fine scale;
  • Finally, the temporal boundary adjustment module (Temporal Boundary Adjustment, TBA) is used to synthesize the information of the above two steps to obtain the final accurate action box output.

Detailed method

  • Use a ConvNet to convert the video sequence s into a visual feature sequence f_n. By computing sine and cosine functions of different wavelengths, the position information of visual feature f_n is embedded into a feature p_n. f_n and p_n are concatenated into a new feature vector l_n = [f_n, p_n] (dimension l_s × d_l, with d_l = d_f + d_p), which enters BaseNet. [Position information is embedded to explicitly describe the sequential order of each visual feature, which is considered beneficial for proposal generation.]
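The sine/cosine embedding above can be sketched as follows, assuming Transformer-style wavelengths (the exact wavelength schedule in MGG may differ; sizes l_s, d_f, d_p are toy values):

```python
import numpy as np

def position_embedding(length, d_p):
    """Sinusoidal position embedding p_n: even dims get sin, odd dims get cos,
    with wavelengths spread geometrically (d_p assumed even)."""
    pos = np.arange(length)[:, None]              # (l_s, 1)
    i = np.arange(d_p // 2)[None, :]              # (1, d_p/2)
    angles = pos / np.power(10000.0, 2 * i / d_p)
    emb = np.zeros((length, d_p))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb

l_s, d_f, d_p = 16, 8, 4
f = np.random.randn(l_s, d_f)          # visual features f_n from the ConvNet
p = position_embedding(l_s, d_p)       # position features p_n
l = np.concatenate([f, p], axis=1)     # l_n = [f_n, p_n], dim d_l = d_f + d_p
print(l.shape)    # (16, 12)
```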

  • The features H1 and H2 output by the two convolutional layers of BaseNet are fused into T using a bilinear model. In the implementation, factorization is used to speed up the computation: T_n denotes the n-th fused feature and serves as the input of the following SPP and FAP for proposal generation.
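A minimal sketch of the factorization trick (dimensions d and rank k are assumed toy values): instead of a full bilinear form, which would need a d×d tensor per output channel, each of H1 and H2 is projected with a small matrix and the projections are multiplied elementwise.

```python
import numpy as np

rng = np.random.default_rng(0)
l_s, d, k = 16, 32, 8          # sequence length, feature dim, factor rank (assumed)

H1 = rng.standard_normal((l_s, d))   # output of BaseNet conv layer 1
H2 = rng.standard_normal((l_s, d))   # output of BaseNet conv layer 2

# Full bilinear fusion T_n = H1_n^T W H2_n needs a d x d x k tensor W.
# The factorized form replaces W with two d x k matrices U and V:
# T_n = (H1_n U) * (H2_n V), an elementwise product of two cheap projections.
U = rng.standard_normal((d, k))
V = rng.standard_normal((d, k))
T = (H1 @ U) * (H2 @ V)       # (l_s, k), far cheaper than the full tensor
print(T.shape)   # (16, 8)
```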

  • Segment Proposal Producer: (SPP)
    Taking the fused video representation T as input, SPP first stacks a convolutional layer and two max-pooling downsampling layers to reduce the temporal dimension and correspondingly enlarge the receptive field. The resulting temporal feature Tc, with temporal length l_s/8, is the input of the U-shaped structure.
    The U-shaped structure consists of a contracting path, an expansive path, and lateral connections. In the contracting path, a feature pyramid (FP) is obtained by repeated convolution and downsampling with stride 2. In the expansive path, deconvolution with stride 2 is applied at multiple layers. Through lateral connections, high-level features from the expansive path are combined with the corresponding low-level features; pyramid levels of different scales have different receptive fields and are responsible for localizing proposals with different temporal spans.
    For the resulting pyramid features, anchors are applied at each scale to obtain candidate proposals, which then enter two branches for action classification and boundary regression respectively. The classification branch uses a cross-entropy loss; the boundary regression branch uses a smooth L1 loss.
    Experiments show that the U-shaped structure of SPP helps transfer high-level semantic information to lower layers, which is of great help in detecting actions with shorter durations.
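The shape bookkeeping of the U-shaped structure can be sketched as below. This is only a skeleton: max pooling stands in for the learned downsampling convolutions and nearest-neighbor repetition stands in for the stride-2 deconvolution, and all sizes are toy values.

```python
import numpy as np

def down(x):
    """Stride-2 max pooling along time (contracting path stand-in)."""
    return np.maximum(x[:, 0::2], x[:, 1::2])

def up(x):
    """Nearest-neighbor upsampling as a stand-in for stride-2 deconvolution."""
    return np.repeat(x, 2, axis=1)

# Toy temporal feature Tc: (channels, temporal length).
x0 = np.random.randn(4, 16)

# Contracting path: a small feature pyramid.
x1 = down(x0)        # (4, 8)
x2 = down(x1)        # (4, 4)

# Expansive path with lateral connections: upsampled high-level features
# are combined with the corresponding low-level features.
y1 = up(x2) + x1     # (4, 8)  lateral connection from x1
y0 = up(y1) + x0     # (4, 16) lateral connection from x0

# Each pyramid level (x2, y1, y0) has a different receptive field and is
# responsible for proposals of a different temporal span.
print(x2.shape, y1.shape, y0.shape)
```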

  • Frame Actionness Producer: (FAP)
    FAP uses three two-layer convolution branches with unshared weights to obtain the start/progress/end score of each frame, trained with a cross-entropy loss. Compared with the segment proposals generated by SPP, the frame actionness produced by FAP densely evaluates every frame at a finer granularity.

  • Temporal boundary adjustment: (TBA)
    The Temporal Boundary Adjustment (TBA) module is implemented as a two-stage fusion strategy that uses frame actionness to improve the boundary precision of segment proposals.
    Stage 1: apply NMS to the candidate proposals produced by SPP, then adjust each proposal's boundaries according to the FAP scores (move the start/end point to the time step with the largest start/end score in its neighborhood), yielding a refined proposal set φp.
    Stage 2: use the actionness scores with a TAG-like grouping scheme, merging consecutive frames with high intermediate probability into regions to form a proposal set φtag. For each proposal p in φp, compute its tIoU with every element of φtag; if any tIoU exceeds the threshold, replace p with the corresponding proposal from φtag.
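The stage-1 boundary snapping can be sketched with toy per-frame scores (the neighborhood radius and score values below are made up for illustration):

```python
import numpy as np

def adjust_boundary(t, scores, radius):
    """Move boundary t to the time step with the highest score within
    [t - radius, t + radius] (a sketch of TBA stage 1)."""
    lo = max(0, t - radius)
    hi = min(len(scores), t + radius + 1)
    return lo + int(np.argmax(scores[lo:hi]))

# Toy per-frame start/end scores as FAP might produce them.
start_scores = np.array([0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.2, 0.1])
end_scores   = np.array([0.1, 0.1, 0.2, 0.1, 0.2, 0.3, 0.8, 0.2])

proposal = (3, 5)   # a rough segment proposal from SPP
new_start = adjust_boundary(proposal[0], start_scores, radius=1)
new_end   = adjust_boundary(proposal[1], end_scores,   radius=1)
print((new_start, new_end))   # (2, 6)
```

The proposal's endpoints snap to the nearby frames where the FAP boundary probabilities peak, which is exactly the complementary fine-grained signal that the coarse segment proposals lack.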

3. P-GCN: Graph Convolutional Networks for Temporal Action Localization

Purpose of the paper - problem to solve

  • Existing TAD methods process each proposal independently during training, ignoring the relationships between proposals.

Contribution - Innovation

  • The first study to exploit the relationship between candidate boxes for temporal action localization in videos.
  • To model the interactions between proposals, a proposal graph is constructed by establishing edges (contextual edges and surrounding edges), and a GCN is then applied to propagate information between proposals.

Implementation process

In the paper's motivating example, the contextual features provided by proposals 2 and 3 benefit the boundary regression of proposal 1, while the background information provided by proposal 4 (e.g., the scene where the action occurs) helps the network recognize the specific action in proposal 1.
Applying a GCN directly in practice can be very inefficient when the graph grows large, so sampling strategies are commonly used to bound the computational cost. This paper adopts the node-wise neighbor-sampling approach of GraphSAGE.
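The idea of node-wise neighbor sampling can be sketched in a few lines (toy adjacency list; the actual P-GCN sampling details may differ):

```python
import random

def sample_neighbors(adj, node, k, rng=random.Random(0)):
    """SAGE-style neighbor sampling: keep at most k neighbors per node so
    that aggregation cost stays bounded even on a large proposal graph."""
    nbrs = adj[node]
    if len(nbrs) <= k:
        return list(nbrs)
    return rng.sample(nbrs, k)

# Toy adjacency list over five proposals.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
sampled = sample_neighbors(adj, 0, k=2)
print(sampled)   # two of node 0's four neighbors
```

During aggregation, each node then pools features only from its sampled neighbors instead of its full neighborhood.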

Detailed method


  • A graph is constructed over the candidate proposals: each proposal is a node, and a relation between two proposals is an edge.
    There are two types of edges: one captures the contextual information before and after each proposal (such as the relations among P1, P2, and P3 in the paper's motivating figure), called the contextual edge; the other captures the association between nearby but disjoint proposals (such as the relation between P1 and P4), called the surrounding edge.
    The core logic of the GCN is to exploit the connections between proposals, i.e., to use the contextual information provided by neighboring proposals to enrich the features of the current proposal. Two independent GCNs are used for proposal classification and boundary regression respectively; a sampling strategy is used during training, which significantly reduces the computational cost while maintaining performance. The core idea of P-GCN is to construct a graph that reasonably models the relationships between proposals.
  • I3D is used to extract video features, and candidate proposals are pre-extracted with the TAG method. The features and proposals serve as the input of the GCN, and the enhanced proposal features output by the GCN are used to predict action categories and boundaries. Throughout this process, the goal of the GCN is to learn the connections between proposals.
  • Simply connecting all candidate proposals to each other would not only add unnecessary computation but also introduce redundant information and noise. Therefore only two kinds of edges are built: contextual edges and surrounding edges.
  • Proposal Graph Construction:
    A contextual edge connects two proposals whose tIoU exceeds a threshold; such proposals are very likely to belong to the same action instance. Through this edge, overlapping proposals automatically share semantic information, which is further processed by the graph convolution.
    A surrounding edge connects two non-overlapping proposals whose relative distance is below a threshold (relative distance = distance between the two proposals' center points divided by the sum of their lengths); such proposals are likely to belong to different actions, or to an action and its background. Through this edge, non-overlapping but nearby proposals share information across action instances.
  • Graph Convolution for Action Localization:
    A GCN is used to learn the relations between proposals on the constructed graph and produce the TAD results. The paper stacks K GCN layers with ReLU; after each layer, the layer output is concatenated with the hidden features, and the combined features serve as the input of the next layer.
    Two GCN branches handle the classification and boundary regression tasks: one branch processes the original proposal features and outputs the action category through an FC layer with softmax; the other branch processes the extended proposal features (original feature plus context) and outputs the start boundary, end boundary, and action completeness through three separate FC layers.
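The two edge-construction rules above can be sketched on toy proposals. The thresholds (0.5 for tIoU, 1.0 for relative distance) and the normalization of the distance (center gap over summed lengths, as described in the text) are illustrative assumptions; the paper's exact values may differ.

```python
import numpy as np

def tiou(p, q):
    """Temporal IoU of two (start, end) proposals."""
    inter = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
    union = max(p[1], q[1]) - min(p[0], q[0])
    return inter / union if union > 0 else 0.0

def rel_dist(p, q):
    """Center-point distance normalized by the summed proposal lengths."""
    cp, cq = (p[0] + p[1]) / 2, (q[0] + q[1]) / 2
    return abs(cp - cq) / ((p[1] - p[0]) + (q[1] - q[0]))

proposals = [(0, 4), (1, 5), (6, 10)]
ctx_edges, sur_edges = [], []
for i in range(len(proposals)):
    for j in range(i + 1, len(proposals)):
        p, q = proposals[i], proposals[j]
        if tiou(p, q) > 0.5:                             # overlapping: contextual edge
            ctx_edges.append((i, j))
        elif tiou(p, q) == 0 and rel_dist(p, q) < 1.0:   # disjoint but nearby: surrounding edge
            sur_edges.append((i, j))
print(ctx_edges, sur_edges)
```

Here the two overlapping proposals get a contextual edge, while the disjoint third proposal is close enough to both to receive surrounding edges.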


Origin blog.csdn.net/weixin_45751396/article/details/127777735