[Paper reading] Intensive reading of temporal action detection papers (2017)

1. TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals

Purpose of the paper - the problem to solve

Temporal action proposal (TAP) generation is an important problem: extracting semantically important segments (e.g., human actions) from untrimmed videos quickly and accurately is a key step in large-scale video analysis.

① The computational efficiency of sliding-window methods is too low. ② Existing evaluation metrics are not accurate.

Contribution - Innovation

  • Proposed the Temporal Unit Regression Network (TURN), which jointly predicts action proposals and refines their temporal boundaries through temporal coordinate regression;

  • Proposed a new metric, Average Recall vs. Frequency of retrieved proposals (AR-F), for the temporal action proposal (TAP) task.

Implementation process

[Figure: TURN framework overview]
(1) A long video is decomposed into short video units (6/16/32 frames), and CNN features (C3D, two-stream CNN) are computed once for each unit.
(2) Sets of contiguous unit features, called clips, are pooled to create clip-level features.
(3) Multiple temporal scales are used to build a clip pyramid around each anchor unit.
(4) TURN takes a clip as input and outputs a confidence score indicating whether it is an action instance, plus two regression offsets that refine the start and end boundaries (a toy sketch of this output head follows below).
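
A toy sketch of the output head in step (4), assuming a pooled clip feature is already available; the dimensions, weights, and names here are illustrative, not from the released code:

```python
import numpy as np

rng = np.random.default_rng(0)

def turn_head(clip_feature, W_cls, W_reg):
    """Toy TURN output head: an action/background confidence and two
    unit-level regression offsets (o_s, o_e) for one clip feature."""
    logits = W_cls @ clip_feature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over {action, background}
    offsets = W_reg @ clip_feature            # (o_s, o_e)
    return probs[0], offsets

D = 3 * 2048                                  # context || internal || context
W_cls = rng.normal(scale=0.01, size=(2, D))
W_reg = rng.normal(scale=0.01, size=(2, D))
confidence, (o_s, o_e) = turn_head(rng.normal(size=D), W_cls, W_reg)
```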

Detailed method

Video Unit Processing: To avoid repeatedly extracting visual features of the same window or overlapping windows, we use video units as the basic processing unit in our framework.

Clip Pyramid Modeling:
For a clip with start unit $s$ and end unit $e$, an internal feature is pooled over the clip's own units and context features over $n_{ctx}$ units on each side:

$$f_c = P\big(\{u_j\}_{j=s-n_{ctx}}^{s-1}\big) \parallel P\big(\{u_j\}_{j=s}^{e}\big) \parallel P\big(\{u_j\}_{j=e+1}^{e+n_{ctx}}\big)$$

where $\parallel$ denotes vector concatenation and $P$ is mean pooling (the parameters are introduced in detail in the original paper). Note that although multi-resolution clips overlap temporally, clip-level features are computed from unit-level features, which are computed only once.
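
A minimal sketch of this construction, assuming precomputed unit features and illustrative sizes:

```python
import numpy as np

def clip_feature(units, s, e, n_ctx):
    """Clip-level feature from unit-level features (computed once).
    Mean-pools n_ctx context units before the clip, the internal units
    s..e, and n_ctx context units after, then concatenates (the '||')."""
    left = units[max(0, s - n_ctx):s].mean(axis=0)
    inside = units[s:e + 1].mean(axis=0)
    right = units[e + 1:e + 1 + n_ctx].mean(axis=0)
    return np.concatenate([left, inside, right])

units = np.random.randn(100, 2048)               # e.g. C3D unit features
f_c = clip_feature(units, s=40, e=47, n_ctx=4)   # 3 * 2048 = 6144-d
```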

Unit-level Temporal Coordinate Regression:
The network has two outputs: a confidence score that determines whether the clip contains an action, and temporal coordinate regression offsets, expressed as

$$o_s = s_{clip} - s_{gt}, \qquad o_e = e_{clip} - e_{gt}$$

where $s$ and $e$ denote the positions of the start unit and the end unit, respectively.
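
In code, the training targets and their use at inference look like this (a sketch; the sign convention follows the formula above):

```python
def regression_targets(s_clip, e_clip, s_gt, e_gt):
    """Offsets the network is trained to predict for a matched clip."""
    return s_clip - s_gt, e_clip - e_gt

def refine_boundaries(s_clip, e_clip, o_s, o_e):
    """Invert the predicted offsets at test time to sharpen boundaries."""
    return s_clip - o_s, e_clip - o_e
```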

Loss Function:
A positive sample is defined as (1) the sample with the highest tIoU against a ground-truth (GT) instance, or (2) any sample whose tIoU with a GT instance exceeds 0.5. A negative sample is one whose tIoU with every GT instance is 0.

$$L = L_{cls} + \lambda L_{reg}$$

The first term $L_{cls}$ is the classification loss, a softmax loss used to classify action vs. background. The second term is the regression loss, used to correct the proposal's location; $\lambda$ is a balancing hyperparameter.

$$L_{reg} = \frac{1}{N_{pos}} \sum_{i=1}^{N} l^*_i \big( \lvert o_{s,i} - o^*_{s,i} \rvert + \lvert o_{e,i} - o^*_{e,i} \rvert \big)$$

It uses the L1 distance; $l^*_i$ is the label, 1 for a positive sample and 0 for a background sample, so only positives contribute to the regression term.
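
A numpy sketch of this loss, assuming a batch of N clips (the value of λ and the batch layout are illustrative):

```python
import numpy as np

def turn_loss(cls_probs, labels, offsets, target_offsets, lam=2.0):
    """cls_probs: (N, 2) softmax outputs, column 1 = action class.
    labels: (N,) int array, 1 = positive, 0 = background.
    offsets, target_offsets: (N, 2) arrays of (o_s, o_e)."""
    # Softmax / cross-entropy loss over action vs. background.
    l_cls = -np.log(cls_probs[np.arange(len(labels)), labels]).mean()
    # L1 regression loss, averaged over positives only (l*_i gates it).
    n_pos = max(labels.sum(), 1)
    l_reg = (labels[:, None] * np.abs(offsets - target_offsets)).sum() / n_pos
    return l_cls + lam * l_reg
```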

2. Temporal Action Detection with Structured Segment Networks

Purpose of the paper - the problem to solve

  • A major challenge for precise temporal localization is the large number of incomplete action segments among the proposed temporal regions.
  • The sheer volume of visual data in videos limits the ability of existing methods to model long-term dependencies in an end-to-end manner.
  • Existing methods neither explicitly model the different stages of an activity (such as its start and end) nor provide a mechanism for assessing completeness.

Contribution - Innovation

  • Structured Segment Networks (SSN): the temporal structure of each action instance is modeled with a structured temporal pyramid. On top of the pyramid, a factorized discriminative model with two classifiers, one for action classification and one for completeness, lets the framework distinguish positive proposals from background or incomplete ones, enabling accurate recognition and localization.
  • Each action is modeled in three stages (start, course, and end), and this temporal structure is used to assess how completely a candidate covers the action, so a proposal only scores high when it is well aligned with an action instance.
  • A sparse snippet sampling strategy overcomes the computational cost of long-term modeling, yielding an efficient, lightweight end-to-end model.
  • A watershed-based proposal mechanism, temporal actionness grouping (TAG), is proposed: a simple and effective temporal action proposal scheme that generates high-quality proposals.

Specific method

[Figure: SSN architecture overview]
The augmented proposal is divided into three stages: start (yellow features), course (green features), and end (blue features), each processed separately by structured temporal pyramid pooling (STPP). The stage-wise STPP features are concatenated into a global representation and passed to an action category classifier and an action completeness classifier; the outputs of the two classifiers are combined to produce complete action instances. The whole ensemble is trained as one end-to-end model.
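
A rough numpy sketch of this stage-structured pooling (the half-span augmentation and the two-level pyramid on the course stage follow the paper's description; everything else here is simplified):

```python
import numpy as np

def stpp_feature(snippets, s, e):
    """Structured temporal pyramid pooling over an augmented proposal.
    snippets: (T, D) per-snippet features; [s, e) is the proposal span.
    The proposal is augmented with half its span on each side to form
    the start / course / end stages (a simplification of SSN's STPP;
    assumes the augmented span stays inside the video)."""
    half = max((e - s) // 2, 1)
    mid = (s + e) // 2
    parts = [
        snippets[s - half:s].mean(axis=0),   # start stage (one level)
        snippets[s:e].mean(axis=0),          # course stage, level 1: whole
        snippets[s:mid].mean(axis=0),        # course stage, level 2: 1st half
        snippets[mid:e].mean(axis=0),        # course stage, level 2: 2nd half
        snippets[e:e + half].mean(axis=0),   # end stage (one level)
    ]
    return np.concatenate(parts)             # input to the two classifiers

snippets = np.random.randn(200, 1024)
global_feature = stpp_feature(snippets, s=60, e=100)   # 5 * 1024 dims
```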

Detailed method

The detailed method is covered clearly elsewhere; a couple of points worth adding:

Location Regression and Multi-Task Loss:
[Equations: location regression parameterization (relative shift of the interval center and log-scale change of the span) and the multi-task loss; see the SSN paper]

The sparse snippet sampling strategy reduces computation (clearly a compromise so that end-to-end training does not exhaust GPU memory): a proposal can contain many snippets, so in practice each proposal is divided evenly into nine segments, and STPP processes only one snippet from each segment. This also fixes the dimensionality of the STPP feature for every proposal.
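
A small sketch of the sampling rule, assuming the snippet indices of one proposal (the function name is mine):

```python
import numpy as np

def sparse_sample(snippet_indices, num_segments=9, rng=None):
    """Split a proposal's snippets into num_segments equal parts and
    draw one snippet from each, so every proposal yields a fixed-size
    input regardless of its duration."""
    rng = rng or np.random.default_rng()
    bounds = np.linspace(0, len(snippet_indices), num_segments + 1).astype(int)
    return [snippet_indices[rng.integers(lo, max(lo + 1, hi))]
            for lo, hi in zip(bounds[:-1], bounds[1:])]

picked = sparse_sample(list(range(120)))   # 9 snippets from 120
```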
At test time, the linearity of the classifier is exploited to turn $W \cdot \mathrm{Pool}(V)$ into $\mathrm{Pool}(W \cdot V)$: per-snippet classifier responses are computed once and pooling is performed over scores, speeding up testing by roughly 20x.
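
The speedup comes from linearity: for a linear classifier, pooling features and then classifying equals classifying each snippet once and pooling the scores, so per-snippet responses can be shared across all overlapping proposals. A quick check of the identity (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 1024))        # linear classifier weights
V = rng.normal(size=(50, 1024))        # 50 snippet features

pooled_then_classified = W @ V.mean(axis=0)
classified_then_pooled = (V @ W.T).mean(axis=0)
assert np.allclose(pooled_then_classified, classified_then_pooled)
```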
In the figure below, the top row is the actionness score map, the middle row is the complemented (inverted) score map, and the bottom row shows the candidate proposals produced by the TAG algorithm. Several candidate sets are generated with different thresholds and, after NMS, fed into the SSN network.
[Figure: actionness scores, complemented scores, and TAG proposals]
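
Merging the candidate sets produced by different thresholds uses standard temporal (1-D) NMS; a minimal version (the threshold value is illustrative):

```python
def tiou(a, b):
    """Temporal IoU of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(proposals, thresh=0.6):
    """1-D NMS over (start, end, score) tuples, highest score first."""
    keep = []
    for s, e, score in sorted(proposals, key=lambda p: p[2], reverse=True):
        if all(tiou((s, e), k[:2]) < thresh for k in keep):
            keep.append((s, e, score))
    return keep
```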

3. A Pursuit of Temporal Accuracy in General Activity Detection

Purpose of the paper - the problem to solve

  • It is not easy to tell whether a video clip captures a whole action or only part of it. The issue is particularly salient for region-proposal-based detection methods because of a key difference between temporal and spatial proposals:
    in an image, the appearance of a whole object is usually very different from that of its local parts, so a visual detector can generally tell whether a window covers the whole object or just a part.
    In video, however, one can often recognize what the action is from a small segment (or even a single frame), while the full action may last much longer. This blurred distinction between an action and its parts makes pinpointing start and end points very difficult.
  • Action durations vary widely, so sliding-window methods must use many window scales and short strides, wasting computation; at the same time, convolution over long time spans is also too expensive.

Contribution - Innovation

  • A bottom-up proposal method, TAG (temporal actionness grouping), is proposed; it is more sensitive to boundary information, generates more accurate candidate regions, and handles a wide range of action durations.
  • Actionness and completeness are intrinsically different properties, so a two-step cascaded classification pipeline is designed: the first step removes background proposals, while the second step, completeness classification, identifies candidates that capture only an incomplete part of an action and discards them from the results.

Implementation process

[Figure: detection framework overview]
The proposed detection framework starts by evaluating the actionness of video snippets. TAG (temporal actionness grouping) then groups them into temporal action proposals (orange). Cascaded classifiers are applied to each proposal to verify its class and its completeness, so only proposals covering complete instances (the triple jump in the figure) are output; incomplete and background proposals are rejected.

Detailed method

Framework Overview:
The proposed framework consists of two stages: generating temporal proposals and classifying the proposed candidates. The former produces a set of class-agnostic temporal regions that may contain an action of interest; the latter determines whether each candidate region really corresponds to an action and, if so, which class it belongs to (addressing the challenges and contributions above).

Temporal Region Proposals:

Proposal generation is a bottom-up process with three main steps: extracting snippets, evaluating each snippet's actionness score, and grouping snippets into candidate proposals.
(1) Extract snippets: each snippet combines a video frame with the optical flow computed around it, conveying both the appearance of the scene at that time step and the motion at that moment.
(2) Evaluate the actionness score: the score is class-agnostic and only measures the probability that a snippet belongs to an action. A two-stream binary classifier built on the TSN (Temporal Segment Network) architecture is trained for this: snippets inside action annotations are positives, background snippets are negatives, and the data are rebalanced to a 1:1 ratio.
(3) Grouping: to be robust to noise, the model must tolerate occasional outliers (e.g., short runs of low-confidence snippets inside an action).

[Figure: TAG grouping process]
The grouping algorithm works as shown above. First, an actionness threshold selects snippets that are confidently action, and the regions seeded this way are expanded outward. A second, tolerance threshold controls the expansion: once the fraction of low-actionness snippets in a region exceeds the tolerance, expansion stops.
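
A simplified, single-threshold-pair sketch of the grouping idea (the actual TAG sweeps many actionness/tolerance threshold pairs and pools the resulting proposals after NMS; the expansion rule here is my approximation):

```python
def tag_group(actionness, tau, gamma):
    """actionness: per-snippet action scores in [0, 1].
    tau: actionness threshold that seeds action regions.
    gamma: tolerance - expansion stops once the fraction of
    low-score snippets inside the region would exceed it."""
    proposals, i, n = [], 0, len(actionness)
    while i < n:
        if actionness[i] < tau:
            i += 1
            continue
        start = end = i
        low = 0
        for j in range(i + 1, n):          # expand to the right
            if actionness[j] < tau:
                low += 1
                if low / (j - start + 1) > gamma:
                    break                  # too many low snippets: stop
            else:
                end = j
        proposals.append((start, end + 1))
        i = end + 1
    return proposals

scores = [0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.7]
print(tag_group(scores, tau=0.5, gamma=0.3))   # [(1, 3), (4, 5), (7, 8)]
```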

Detecting Action Instances:
With a set of candidate temporal regions, the next stage is to find complete action instances from them and classify them into specific action categories.
Candidate screening has two main steps: first, background proposals are filtered out while the action category of each candidate is determined; then incomplete proposals are filtered out based on class-specific completeness.

①Activity Classification:
Using the classifier proposed in TSN, candidates whose IoU with an instance of a specific action exceeds 0.7 are positives, and candidates that overlap any action annotation for less than 5% of their own span are negatives. The reasoning: candidates with small IoU that nonetheless intersect the ground truth may still contain salient action segments and would confuse the classifier; this sampling strategy keeps the classifier focused on whether a candidate is foreground or background.
At test time, the video is processed snippet by snippet to predict per-snippet categories, and the results are aggregated to the region level to decide which action (or background) each candidate is.

②Completeness Filtering:

To judge whether an action is complete, one should look not only at the proposal's internal content but also at the differences among its internal parts and at the video just before and after it. The paper builds the following model to judge completeness:
[Figure: two-level temporal pyramid for completeness filtering]
A two-level pyramid is used: the first level pools the per-snippet scores over the whole candidate (dark brown); the second level pools the scores over each half of the candidate (blue). The average action scores of the segments just before and after the candidate are also included (green). An SVM classifier is trained for each action class and applied one by one. (Although this feature construction looks simple, even clumsy, the underlying ideas - the pyramid structure, using the surrounding video to form the overall feature, and judging with a classifier - are still in use today.) With the activity classifier's confidence $P_a$ and the completeness score $S_c$, the final score of a candidate is $C = P_a \cdot \exp(S_c)$: the final confidence of a detection combines its class probability and its completeness score.
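
A sketch of that completeness feature and the final scoring rule (the per-class SVM is omitted; array layouts are illustrative):

```python
import numpy as np

def completeness_feature(scores, s, e):
    """Two-level pyramid over per-snippet class scores, plus the mean
    scores of the segments just before and after the proposal [s, e)."""
    mid, half = (s + e) // 2, max((e - s) // 2, 1)
    parts = [scores[s:e].mean(axis=0),                 # level 1: whole
             scores[s:mid].mean(axis=0),               # level 2: first half
             scores[mid:e].mean(axis=0),               # level 2: second half
             scores[max(0, s - half):s].mean(axis=0),  # preceding context
             scores[e:e + half].mean(axis=0)]          # following context
    return np.concatenate(parts)

def final_confidence(p_action, s_completeness):
    """C = P_a * exp(S_c): class probability combined with completeness."""
    return p_action * np.exp(s_completeness)

class_scores = np.random.randn(300, 20)    # per-snippet scores, 20 classes
feat = completeness_feature(class_scores, s=100, e=160)
```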

Source: blog.csdn.net/weixin_45751396/article/details/127682014