[Paper reading] Intensive reading of temporal action detection papers (2018)

1. Precise Temporal Action Localization by Evolving Temporal Proposals

Purpose of the paper - the problem to solve

  • Existing action localization methods perform unsatisfactorily in precisely locating the start and end of actions
  • The changes between consecutive video frames are tiny, which can lead to unstable or even incorrect boundary regressions, especially at frame-level granularity.
  • Judging the completeness of an action from its context is highly subjective. Actions are complex and varied, which makes it difficult to distinguish action clips from the background.

Contribution - Innovation

  • A three-stage proposal generation network, ETP (Evolving Temporal Proposals), is proposed: based on features of different granularities (unit-based features and non-local pyramid features), it performs multi-stage proposal regression and obtains more accurate action boundaries.
  • Unit-based temporal coordinates are used to regress the proposal boundaries and precisely localize the action.
  • Actions are modeled effectively with a non-local pyramid feature, which can distinguish complete proposals from incomplete ones.

Implementation process


  • The first-stage Actionness Network generates initial proposals from frame-level actionness scores;
  • The second-stage Refinement Network cuts each proposal into units represented by unit-level features, which are fed into a GRU-based sequence encoder to produce more accurate proposals;
  • The third-stage Localization Network extracts non-local features from the second-stage proposals and outputs the final proposals with their corresponding scores.

Detailed method

  • Actionness Network: computes frame-level, class-specific actionness, i.e. the probability that the current frame belongs to an action.
    First, each frame of the video is fed into the AN to compute a frame-level score, and these scores are used to generate the initial proposals. For an input video of length T, the network outputs a score for every frame under each of the K action classes, T*K scores in total.
    The underlying assumption is that a segment containing an action should consist of frames whose actionness scores exceed a threshold. Since the duration of action clips is usually bounded, minimum and maximum segment lengths are also imposed. With these scores, a connected-component scheme merges adjacent high-score regions (lines 14-32 of the paper's algorithm), and non-maximum suppression (NMS) then eliminates redundant proposals (lines 3-13). A minimal sketch of this grouping step follows.
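A minimal sketch (not the authors' code) of this grouping-plus-NMS step, assuming per-frame actionness scores for one class are given; the threshold and length limits are illustrative:

```python
import numpy as np

def group_proposals(scores, thresh=0.5, min_len=8, max_len=512):
    """scores: (T,) per-frame actionness scores for one action class."""
    proposals, start = [], None
    for t, above in enumerate(scores > thresh):
        if above and start is None:
            start = t                        # a high-score segment opens
        elif not above and start is not None:
            if min_len <= t - start <= max_len:
                proposals.append((start, t, scores[start:t].mean()))
            start = None                     # the segment closes
    if start is not None and min_len <= len(scores) - start <= max_len:
        proposals.append((start, len(scores), scores[start:].mean()))
    return proposals

def temporal_nms(proposals, iou_thresh=0.7):
    """Greedy 1-D NMS over (start, end, score) triples."""
    kept = []
    for s, e, sc in sorted(proposals, key=lambda p: p[2], reverse=True):
        # keep a proposal only if its tIoU with every kept one is below thresh
        if all(min(e, e2) - max(s, s2) < iou_thresh * (max(e, e2) - min(s, s2))
               for s2, e2, _ in kept):
            kept.append((s, e, sc))
    return kept
```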

  • Refinement Network: based on an RNN (Bi-GRU), context information is used to refine the proposals output by the AN.
    For each proposal (s, s+d), first expand it to (s - d/2, s + 3d/2), then cut the expanded span into units with a fixed stride and duration. Borrowing the idea of "Non-local neural networks", a unit-level non-local pyramid feature is extracted for each unit. The unit features are then fed into a BiGRU network (the GRU, an RNN variant, can encode an input of any length into a fixed-length output; a BiGRU can therefore accept any number of units, and its state after processing the last unit is taken as the output). Finally, a fully connected layer maps the BiGRU output to offsets for the proposal's center and duration, i.e. a further regression of the AN's results.
    The loss function used by the Refinement Network is defined over c, the center coordinate of a proposal, and s, its length; the set N contains both positive and incomplete proposals. (The formula itself appears only as an image in the original post; a sketch with assumed targets follows.)
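A minimal sketch of the refinement step, assuming precomputed unit features; the regression targets (normalized center offset and log length ratio) are TURN-style assumptions rather than the paper's confirmed formula, and all names and dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinementNet(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.bigru = nn.GRU(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 2)   # (center offset, length offset)

    def forward(self, unit_feats):
        # unit_feats: (B, num_units, feat_dim), units cut with a fixed stride
        # from the proposal expanded from (s, s+d) to (s - d/2, s + 3d/2)
        _, h = self.bigru(unit_feats)        # h: (2, B, hidden)
        h = torch.cat([h[0], h[1]], dim=-1)  # final states of both directions
        return self.fc(h)

def refinement_loss(pred, c, s, c_gt, s_gt):
    # Assumed TURN-style targets over center c and length s: normalized
    # center offset and log length ratio, scored with a smooth-L1 loss.
    target = torch.stack([(c_gt - c) / s, torch.log(s_gt / s)], dim=-1)
    return F.smooth_l1_loss(pred, target)
```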

  • Localization Network: the LN uses SSN as its backbone and adds a non-local layer before the network's last layer. It takes the RN's proposals as input, and its classifier outputs the final proposals with their corresponding scores.

  • The classifier is trained with three loss functions (positive samples: IoU greater than 0.7; negative samples: IoU less than 0.1; incomplete samples: IoU between 0.3 and 0.7).
    Classification loss: positive and negative samples with a cross-entropy loss determine the action class (K actions + 1 background); positive and incomplete samples are used to classify whether a proposal is a complete foreground action (non-local features are strong at this); positive samples train the regression model.
    Completeness loss: only a few proposals match a ground-truth instance, so an online hard example mining strategy is used to overcome this imbalance and improve the classifier (see the sketch below).
    Localization loss: the same regression form as above, L_loc.
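A minimal sketch of the online hard example mining step, assuming an unreduced per-proposal completeness loss is available; the keep ratio is an illustrative choice:

```python
import torch

def ohem_completeness_loss(per_sample_loss, keep_ratio=0.25):
    """per_sample_loss: (N,) unreduced completeness losses; keep only the
    hardest keep_ratio fraction and average over those."""
    k = max(1, int(keep_ratio * per_sample_loss.numel()))
    hard, _ = torch.topk(per_sample_loss, k)
    return hard.mean()
```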


2. CTAP: Complementary Temporal Action Proposal Generation

Purpose of the paper - the problem to solve

  • The sliding window ranking method and the actionness score grouping method each have their own advantages and disadvantages.
    (The original post shows a figure comparing average recall curves.) The boundaries from the sliding-window pipeline (sliding windows + proposal ranking + boundary adjustment) are not accurate enough, and its high recall relies on a large number of retrieved proposals (SW+R&A in the figure). The unit-level actionness method produces relatively accurate boundaries, but it places extremely high demands on the quality of the actionness scores: when they are inaccurate, wrong proposals are generated and correct ones are missed, which also caps its achievable AR (TAG in the figure). One fusion method appends a window-level classifier for proposal ranking and boundary regression after the unit-level actionness method; this effectively reduces wrong proposals but cannot recover the ignored correct ones (TAG+R&A in the figure).
    The main idea of this paper is to collect, from the sliding-window proposals, the correct boxes that the actionness method may have ignored, and add them back.

Contribution - Innovation

  • A new complementary fusion method, CTAP (actionness proposals + sliding windows), is proposed to generate high-quality proposals.
  • A new boundary adjustment and proposal ranking network, TAR, is designed around temporal convolutions, which efficiently preserve the sequential information of proposal boundaries.

Implementation process


  • The first stage generates initial proposals from two sources: actionness scores grouped by TAG, and sliding windows.
  • The second stage is complementary filtering. When the actionness scores are of low quality (i.e. true action segments receive low scores), TAG misses some correct proposals, whereas sliding windows uniformly cover all segments of the video.
    Complementary filtering therefore collects high-quality supplementary proposals from the sliding windows to fill in the missed action proposals.
  • The third stage is boundary adjustment and proposal ranking, which consists of a temporal convolutional neural network.

Detailed method

  • Initial Proposal Generation:
    In the proposal generation stage, the video is first cut into many equal-length snippets, and unit-level features are extracted with a two-stream CNN. A binary classifier trained with a cross-entropy loss estimates the probability that a snippet belongs to an action, and the TAG (watershed) algorithm plus NMS then generates the feature-based proposals bj; together with the sliding-window proposals ak, these form the set of all candidate proposals.

  • Proposal Complementary Filtering:
    Exploiting the exhaustive coverage of the sliding-window search, correct boxes that the actionness method may have ignored are added back.
    PATE (Proposal-level Actionness Trustworthiness Estimator): the core idea is to train a binary classifier that takes the unit-level features of a proposal and outputs the probability that this proposal can be correctly detected from the unit-level actionness scores by TAG.
    Training the binary classifier: boxes bj that correspond to a ground-truth instance serve as positive samples, instances with no corresponding bj provide the negative samples, and the classifier is trained with a cross-entropy loss.
    Testing the binary classifier: the features of every box in ak are fed into the network; if the output probability falls below a threshold (meaning the box would likely be ignored by TAG), the box is collected. The collected subset pt(ak) is united with the bj set to obtain the final proposal set cm; a sketch follows.
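A minimal sketch of the complementary filtering step, assuming PATE probabilities for the sliding-window boxes are already computed; names and the threshold are illustrative:

```python
def complementary_filter(pate_prob, windows, tag_proposals, thresh=0.5):
    """pate_prob: per-window trustworthiness scores in [0, 1];
    windows: sliding-window proposals ak; tag_proposals: TAG proposals bj."""
    # Collect windows that TAG would likely miss (low trustworthiness) ...
    collected = [w for p, w in zip(pate_prob, windows) if p < thresh]
    # ... and unite them with the TAG proposals to form the set cm.
    return collected + list(tag_proposals)
```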

  • Proposal Ranking and Boundary Adjustment:
    This stage ranks the proposals and adjusts their temporal boundaries. (TURN does the same, but it aggregates temporal features with mean pooling, which discards ordering information.) The Temporal convolutional Adjustment and Ranking (TAR) network instead aggregates unit-level features with temporal convolutional layers.
    For each proposal in cm, the proposal units and the two boundary units are fed into three independent sub-networks: the proposal sub-network outputs the probability of an action, and the two boundary sub-networks output the offsets used to regress the boundaries.
    When training TAR, a sliding-window box ak is treated as a positive sample if its tIoU with a ground-truth instance exceeds 0.5, or if it has the largest tIoU with some ground-truth instance. The proposal sub-network is trained with softmax cross-entropy, and the boundary regression sub-networks with an L1 loss (see the sketch below).
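An illustrative sketch (not the authors' code) of a TAR-style network: three independent temporal-convolution sub-networks take the proposal units and the two boundary units; channel widths and layer counts are assumptions:

```python
import torch
import torch.nn as nn

class TARBranch(nn.Module):
    def __init__(self, feat_dim=1024, out_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=3, padding=1),  # temporal conv
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(256, out_dim)

    def forward(self, x):                     # x: (B, feat_dim, num_units)
        return self.fc(self.net(x).squeeze(-1))

class TAR(nn.Module):
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.proposal = TARBranch(feat_dim, out_dim=2)  # action vs background
        self.start = TARBranch(feat_dim, out_dim=1)     # start boundary offset
        self.end = TARBranch(feat_dim, out_dim=1)       # end boundary offset

    def forward(self, prop_units, start_units, end_units):
        # Proposal branch is trained with softmax cross-entropy,
        # boundary branches with an L1 regression loss.
        return (self.proposal(prop_units),
                self.start(start_units),
                self.end(end_units))
```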


3. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation

The original author's paper notes on Zhihu

Purpose of the paper - the problem to solve

  • Real-world videos vary in length, and the portions unrelated to any action make up the majority of a video's duration.
  • Requirements for high-quality proposals: proposal generation should produce proposals with flexible temporal lengths and precise temporal bounds, and then retrieve them with a reliable confidence score indicating the probability that a proposal contains an action instance.

Contribution - Innovation

  • A new "local to global" architecture (BSN) is proposed to generate high-quality temporal action proposals.
  • This method can be integrated into existing detection frameworks and greatly improves the performance in temporal action localization.

Implementation process


  • As local information, BSN evaluates, for every temporal location in the video, the probability that it is an action start, an action end, or inside an action.
  • BSN then directly combines temporal locations with high start probability and high end probability to generate proposals. With this bottom-up approach, BSN can generate proposals with flexible durations and precise boundaries.
  • Using features built from the action probabilities inside and around a proposal, BSN retrieves proposals by evaluating the confidence that each one contains an action. These proposal-level features provide global information for better evaluation.

Detailed method

Video Features Encoding:
Use a two-stream network to encode features in videos.

Boundary-Sensitive Network:

  • Network structure: three modules, namely temporal evaluation, proposal generation, and proposal evaluation.

  • Temporal evaluation: a three-layer temporal convolutional network that takes the two-stream feature sequence as input and evaluates, for each temporal location in the video, its starting, ending, and actionness probabilities.

  • Proposal generation: temporal locations with high starting/ending probabilities are taken as candidate boundaries, and a boundary-sensitive proposal (BSP) feature is then constructed for each candidate proposal from the actionness probability sequence.
    (a) Generate candidate proposals: select temporal locations whose boundary probability is high or forms a local peak; then pair candidate start and end locations into proposals whenever the resulting duration satisfies the length condition (a sketch follows).
    (b) Construct BSP features: given a proposal and the actionness probability sequence, sample the sequence over the start, center, and end regions of the proposal.
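A minimal sketch of this pairing scheme, assuming start/end probability sequences are given; the peak test and thresholds are illustrative assumptions:

```python
import numpy as np

def candidate_locations(prob, thresh=None):
    """Locations that are local probability peaks or exceed a threshold."""
    thresh = 0.5 * prob.max() if thresh is None else thresh
    locs = []
    for t in range(len(prob)):
        is_peak = (0 < t < len(prob) - 1
                   and prob[t] > prob[t - 1] and prob[t] > prob[t + 1])
        if is_peak or prob[t] > thresh:
            locs.append(t)
    return locs

def generate_proposals(start_prob, end_prob, min_dur=4, max_dur=256):
    """Pair every candidate start with every later end of valid duration."""
    starts = candidate_locations(start_prob)
    ends = candidate_locations(end_prob)
    return [(s, e) for s in starts for e in ends if min_dur <= e - s <= max_dur]
```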

  • Proposal evaluation: a multi-layer perceptron with one hidden layer evaluates a confidence score for each candidate proposal from its BSP feature. The confidence score and the boundary probabilities of each proposal are fused into a final confidence score used for retrieval.

  • Soft-NMS (post-processing): finally, non-maximum suppression is applied to remove overlapping results. Soft-NMS suppresses overlapping proposals by decaying their scores rather than discarding them; the processed results are the temporal action proposals ultimately produced by BSN (see the sketch below).
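A minimal sketch of Soft-NMS on 1-D proposals; the Gaussian decay variant and parameter values are assumptions about the exact post-processing used:

```python
import numpy as np

def soft_nms_1d(proposals, sigma=0.5, score_thresh=0.001):
    """proposals: list of [start, end, score]; returns rescored proposals."""
    props = [list(p) for p in proposals]
    kept = []
    while props:
        best = max(range(len(props)), key=lambda i: props[i][2])
        s, e, sc = props.pop(best)
        kept.append((s, e, sc))
        for p in props:
            # Decay the scores of proposals overlapping the selected one.
            inter = max(0.0, min(e, p[1]) - max(s, p[0]))
            union = (e - s) + (p[1] - p[0]) - inter
            iou = inter / union if union > 0 else 0.0
            p[2] *= np.exp(-(iou ** 2) / sigma)   # Gaussian penalty
    return [p for p in kept if p[2] >= score_thresh]
```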


4. Rethinking the Faster R-CNN Architecture for Temporal Action Localization

Purpose of the paper - the problem to solve

  • How to deal with different action durations
  • How to utilize temporal context information
  • How to best fuse multi-stream features

Contribution - Innovation

  • Dilated convolutions are used to align each anchor's receptive field with its duration, so that multi-scale anchors can adapt to the wide range of action-segment durations.
  • The receptive field is extended so that temporal context information helps both in judging the action type and in determining proposal boundaries.
  • Demonstrate the superiority of late fusion of optical flow and RGB information.

Implementation process

Action proposals can be regarded as line segments on a one-dimensional time axis, so all of these operations are performed on one-dimensional features.

Detailed method

  • Receptive Field Alignment:
    In Faster R-CNN, the anchors generated at each location share the same receptive field. This is reasonable in the 2D image case, but not in the 1D temporal case, where the variation in temporal length can be huge. To ensure high recall, the anchor segments must span a wide range of scales; however, if the receptive field is set too small (short in time), the extracted features may lack the information needed to classify long anchors, while if it is set too large, the features may be dominated by irrelevant information when classifying short anchors. (See the original paper for details.)
    This paper uses a multi-tower network and dilated temporal convolutions to make the receptive field of each anchor correspond to its duration. In addition, similar to RPN, two parallel 1x1 convolutional layers are used to classify whether an anchor contains an action and to perform boundary regression.
    How to design a temporal convolutional net whose receptive field size s is controllable?
    1) Stack more convolutional layers. Disadvantage: easily leads to overfitting.
    2) Add pooling layers. Disadvantage: exponentially reduces the resolution of the output feature map.
    To avoid increasing the model parameters while maintaining resolution, dilated temporal convolutions are used: to achieve a target receptive field of size s, the dilation rates of the two convolutional layers are set explicitly to r1 = s/6 and r2 = (s/6)x2, and a max pooling layer with kernel size s/6 is added before the first convolutional layer to smooth the input before sub-sampling (see the sketch after this list).
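A sketch of one tower of the multi-tower network for a single anchor scale s, assuming s is divisible by 6; channel widths and the exact layer arrangement are illustrative assumptions:

```python
import torch.nn as nn

def make_tower(s, in_ch=512, mid_ch=256):
    d = s // 6
    return nn.Sequential(
        # Smooth the input before the dilated (sub-sampled) convolutions;
        # with stride 1 the sequence length changes by at most one unit.
        nn.MaxPool1d(kernel_size=d, stride=1, padding=d // 2),
        # Two dilated temporal convs with rates r1 = s/6 and r2 = (s/6)*2:
        # receptive field is about 1 + 2*d + 2*(2*d) = 1 + s, aligned with s.
        nn.Conv1d(in_ch, mid_ch, kernel_size=3, dilation=d, padding=d),
        nn.ReLU(),
        nn.Conv1d(mid_ch, mid_ch, kernel_size=3, dilation=2 * d, padding=2 * d),
        nn.ReLU(),
        # Two parallel 1x1 convs (anchor classification / boundary regression)
        # would sit on top of this tower, as in RPN; omitted here for brevity.
    )
```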

  • Context Feature Extraction:
    Proposal network:
    To ensure that context features are used for anchor classification and boundary regression, the receptive field must cover the context region. For an anchor of size s, the receptive field is enforced to also cover the two segments of length s/2 immediately before and after the anchor.
    Action classification:
    In action classification, SoI pooling (i.e. 1-D RoI pooling) extracts a fixed-size feature map for each obtained proposal; Figure 5 (top) in the paper illustrates an SoI pooling output of size 7. As Figure 5 (bottom) shows, for a proposal of size s the pooled scope covers not only the proposal segment but also the two segments of size s/2 immediately before and after it, analogous to the anchor classification above (see the sketch below).
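A minimal sketch of the context-extended SoI pooling, assuming integer proposal bounds on a 1-D feature map; the bin count and names are illustrative:

```python
import torch
import torch.nn.functional as F

def soi_pool(features, start, end, out_size=7):
    """features: (C, T) 1-D feature map; start/end: proposal bounds (ints)."""
    length = end - start
    ctx_start = max(0, start - length // 2)               # s/2 before ...
    ctx_end = min(features.shape[1], end + length // 2)   # ... and s/2 after
    span = features[:, ctx_start:ctx_end]                 # (C, L)
    # Max-pool each of the out_size equally sized bins.
    return F.adaptive_max_pool1d(span.unsqueeze(0), out_size).squeeze(0)
```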

  • Late Feature Fusion:
    Two networks first extract 1-D RGB and optical-flow features. Each stream feeds the proposal network (the RPN part), and the two streams' proposal scores are averaged to generate proposals; the proposals are then classified against each stream's own features (the Fast R-CNN part), and the two classification results are averaged again (see the sketch below).
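A minimal sketch of the late-fusion scheme; select_top_proposals is a hypothetical helper standing in for proposal selection, and the callables are stand-ins for the two streams' RPN and classifier heads:

```python
def late_fusion(rgb_feat, flow_feat, rgb_rpn, flow_rpn, rgb_cls, flow_cls):
    # Stage 1: average the two streams' proposal scores, then pick proposals.
    prop_scores = 0.5 * (rgb_rpn(rgb_feat) + flow_rpn(flow_feat))
    proposals = select_top_proposals(prop_scores)     # hypothetical helper
    # Stage 2: classify with each stream's own features, then average again;
    # the streams share only scores, never features.
    rgb_scores = rgb_cls(rgb_feat, proposals)
    flow_scores = flow_cls(flow_feat, proposals)
    return proposals, 0.5 * (rgb_scores + flow_scores)
```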

Origin blog.csdn.net/weixin_45751396/article/details/127745204