[Paper Reading] Temporal Action Localization under Weak Supervision

Read the paper: Adaptive Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization

1. Summary

Task Requirements: Weakly-Supervised Temporal Action Localization (W-TAL) aims to classify and localize all action instances in untrimmed videos with only video-level supervision.
Challenges: Without frame-level annotations, it is difficult for W-TAL methods to clearly distinguish actions from background.
How the paper addresses this:

  • Frame-level pseudo ground truths (pseudo-GTs) are generated and iteratively updated from the late-fused attention sequences, and are used to provide frame-level supervision that improves model training.
  • An adaptive attention normalization loss is introduced, which adaptively selects action and background segments according to the video attention distribution.
  • A video-level and a segment-level uncertainty estimator are proposed, which mitigate the adverse effects of learning from noisy pseudo-GTs.

2. Introduction

The task of Weakly Supervised Temporal Action Localization (W-TAL) is to simultaneously localize and classify all action instances in long untrimmed videos, given only video-level classification labels during training. Compared with its fully supervised counterpart, W-TAL greatly simplifies data collection and avoids the annotation bias of human annotators, so it has been extensively studied in recent years.

Several previous W-TAL methods adopt the Multiple Instance Learning (MIL) framework, where a video is regarded as a bag of clips for video-level action classification. During testing, the trained model slides over time and produces a Temporal Class Activation Map (T-CAM) (i.e., the sequence of action class probability distributions at each time step) together with an attention sequence that measures the relative importance of each segment. Action proposals are generated by thresholding the attention values and/or the T-CAM. This MIL framework is usually built on two feature modalities, RGB frames and optical flow, which are fused in two mainstream ways. Early fusion methods concatenate RGB and optical flow features before feeding them into the network, while late fusion methods compute a weighted sum of the two streams' outputs before generating action proposals.

Visualization of the two-stream output and its post-fusion result:
[Figure] The five rows show the input video, the ground-truth action instances, and the attention sequences (scaled from 0 to 1) predicted by the RGB stream, the flow stream, and their weighted sum (i.e., the fusion result), respectively. The horizontal and vertical axes represent time and attention strength, respectively. Green boxes mark localization results produced by thresholding the attention at 0.5. By properly combining the two different attention distributions predicted by the RGB and flow streams, the late fusion result achieves better localization performance.
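
To make the fusion-and-thresholding step concrete, here is a minimal NumPy sketch of how a fused attention sequence could be turned into localization intervals like the green boxes above. The fusion weight `lam`, the 0.5 threshold, and the toy attention values are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def fuse_and_threshold(att_rgb, att_flow, lam=0.5, thresh=0.5):
    """Late-fuse two attention sequences and extract (start, end) intervals.

    att_rgb, att_flow: 1-D arrays of per-segment attention values in [0, 1].
    lam: fusion weight of the RGB stream (illustrative value).
    """
    att_fused = lam * att_rgb + (1.0 - lam) * att_flow
    above = att_fused >= thresh
    proposals, start = [], None
    for t, flag in enumerate(above):
        if flag and start is None:
            start = t                        # an interval opens
        elif not flag and start is not None:
            proposals.append((start, t))     # interval [start, t) closes
            start = None
    if start is not None:                    # interval running until the end
        proposals.append((start, len(above)))
    return att_fused, proposals

att_rgb = np.array([0.1, 0.7, 0.8, 0.2, 0.1, 0.9])
att_flow = np.array([0.2, 0.6, 0.9, 0.3, 0.2, 0.8])
print(fuse_and_threshold(att_rgb, att_flow))
```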

Despite these recent developments, a major challenge remains: the lack of frame-level supervision makes it difficult for W-TAL methods to clearly distinguish actions from background. This problem degrades localization performance in two main ways. First, detected action instances do not necessarily correspond to the video-level labels; for example, a detector may mistakenly identify a frame showing a swimming pool as a swimming action. Second, the ambiguity between action and background affects the activation sequence. This not only makes thresholding produce incomplete or over-complete action proposals, but also leads to unreliable confidence scores for them. Therefore, finer-grained supervision is needed to guide the learning process.

The authors then explain how they solve these problems; the explanation covers the same points as the summary above, but in more detail.
In summary, the contributions are threefold, matching the abstract.
This work improves upon the previous Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization (TSCN).


3. Related work

We briefly review related work on action recognition, fully-supervised temporal action localization, weakly-supervised temporal action localization,
and self-training.

  • Action recognition. Traditional approaches model spatio-temporal information with hand-crafted features. More recently, the two-stream network uses two separate convolutional neural networks (CNNs) to exploit appearance and motion cues from RGB frames and optical flow, respectively, and applies late fusion to reconcile the outputs of the two streams. Convolutional Two-Stream Network Fusion for Video Action Recognition studies different approaches to fusing the two streams. The Inflated 3D ConvNet (I3D), introduced in Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, inflates the 2D CNNs of the two-stream network into 3D CNNs and further improves performance. Some recent methods focus on learning motion cues directly from RGB frames rather than computing optical flow. Besides, some works also try to model long-term temporal information in videos.

  • Fully supervised temporal action localization methods require frame-level annotations of all action instances during training. Several large-scale datasets have been created for this task, such as THUMOS, ActivityNet, and Charades. Many methods employ a two-stage pipeline, where action proposals are first generated and then classified. Some methods adopt the Faster R-CNN framework for TAL. Recently, some methods focus on generating action proposals with more flexible temporal boundaries. Other methods apply graph convolutional networks (GCNs) to TAL to incorporate more contextual information and exploit proposal-proposal relations. MS-TCN++ proposes a smoothing loss to address over-segmentation errors. Different from it, the smoothing loss proposed in this paper smooths the attention sequence and removes scattered action proposals (a minimal sketch of this idea follows this list).

  • Weakly supervised temporal action localization requires only video-level supervision during training, which greatly reduces the annotation workload and has attracted increasing attention. Hide-and-Seek randomly hides parts of the input video to guide the network to discover other relevant parts. UntrimmedNet consists of a selection module for selecting important segments and a classification module for per-segment classification.
    The Sparse Temporal Pooling Network (STPN) improves UntrimmedNet by adding a sparsity loss to enforce sparsity of the selected segments. W-TALC jointly optimizes a co-activity similarity loss and a multiple-instance learning loss to train the network. AutoLoc and CleanNet employ a two-stage pipeline: they first generate initial action proposals and then regress the proposal boundaries based on the prior that action regions should have higher activations than their surrounding background regions. Liu et al. propose a multi-branch network to model actions in different stages. In addition, some methods focus on modeling the background to better distinguish actions from background. DGAM proposes a conditional variational autoencoder to separate action and background. A2CL-PT uses two parallel branches trained in an adversarial manner to generate complete action proposals. EM-MIL also exploits pseudo-labels, where the attention and class-specific activation sequences are alternately trained to supervise each other.
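
The smoothing loss mentioned in the fully-supervised bullet above is only described at a high level here. A plausible minimal form, assuming the attention is a 1-D PyTorch tensor, is an L1 penalty on differences between adjacent attention values; this is a sketch of the idea, not the paper's exact definition.

```python
import torch

def attention_smoothing_loss(att: torch.Tensor) -> torch.Tensor:
    """L1 total-variation penalty on a 1-D attention sequence.

    Penalizing |att[t+1] - att[t]| encourages temporally smooth attention and
    suppresses isolated, scattered activations. This is a guess at the loss's
    form based on the description above, not the paper's exact definition.
    """
    return (att[1:] - att[:-1]).abs().mean()

att = torch.tensor([0.1, 0.9, 0.1, 0.8, 0.85, 0.9])
print(attention_smoothing_loss(att))   # large value because the attention oscillates
```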

This is a very broad overview; I have not yet studied many of these methods, so I will keep reading and follow up on them.


4. Technical details

In this section, we first formulate the task of Weakly Supervised Temporal Action Localization (W-TAL), and then describe the proposed Adaptive Two-Stream Consensus Network (A-TSCN) in detail.

[Figure: overview of the A-TSCN architecture]
As shown in the figure above, A-TSCN consists of two parts: a two-stream base model and a pseudo-GT generation module. Given an input video, the two-stream base model first performs action recognition on the RGB clips and the optical-flow clips separately, obtaining an initial attention sequence for each stream.
To facilitate action-background discrimination, an adaptive attention normalization loss forces the attention to act like a binary selector. Then, frame-level pseudo ground truth is generated from the late-fused attention sequence, which in turn provides frame-level supervision for the two-stream base model. Meanwhile, a video-level and a segment-level uncertainty estimator dynamically compute weights for pseudo-GT learning. Finally, the pseudo-GT is iteratively updated and refines the base model, i.e., it keeps providing frame-level supervision for the two-stream base model.
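
As a rough illustration of the adaptive attention normalization loss described above, the sketch below adaptively counts action-like and background-like segments from one stream's attention, enforces a lower bound of T // s′ segments per side, and pushes the selected action attentions toward 1 and background attentions toward 0. The 0.5 threshold, the exact selection rule, and the loss form are assumptions; the paper's formulation may differ.

```python
import torch

def adaptive_attention_norm_loss(att: torch.Tensor, s_prime: int = 4) -> torch.Tensor:
    """Hedged sketch of an adaptive attention normalization loss for one stream.

    att: (T,) attention values in [0, 1]. Segments with attention above 0.5 are
    counted as action candidates and those below 0.5 as background candidates,
    but at least T // s_prime segments are selected on each side (the lower
    bound discussed in the ablation). The selected action attentions are pushed
    toward 1 and the background attentions toward 0, so the attention behaves
    like a binary selector. Thresholds and the exact form are assumptions.
    """
    T = att.numel()
    lower_bound = max(T // s_prime, 1)
    k_act = min(max(int((att > 0.5).sum()), lower_bound), T)
    k_bkg = min(max(int((att < 0.5).sum()), lower_bound), T)
    top_vals, _ = att.topk(k_act)            # most action-like segments
    bot_vals, _ = (-att).topk(k_bkg)         # most background-like segments (negated)
    return (1.0 - top_vals).mean() + (-bot_vals).mean()

att = torch.tensor([0.9, 0.8, 0.55, 0.3, 0.2, 0.1, 0.05, 0.4])
print(adaptive_attention_norm_loss(att))
```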

4.1 Problem formulation

[Figure omitted: problem formulation]

4.2 Two-stream basic model

This subsection is worth reading carefully in the original paper, where the authors describe it in great detail; combining it with the figure above helps understanding.

Following recent W-TAL approaches, the two-stream base model is built on segment-level feature sequences extracted from the raw video volume. The base model is first trained for action classification using only video-level labels, and is then iteratively refined with frame-level pseudo ground truth.

Feature extraction:

[Figure omitted]

Feature embedding (can be understood as a further feature mapping applied to the extracted features):

[Figure omitted]
These two steps correspond to this part of the figure.
[Figures omitted]
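
Since the figures carrying the exact formulas are not reproduced here, the following minimal PyTorch sketch only illustrates the general shape of one stream of such an attention-based base model: segment features are embedded, mapped to a class-agnostic attention value and a per-segment T-CAM, and pooled with attention weighting into video-level class scores. Layer sizes and the pooling scheme are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class StreamBaseModel(nn.Module):
    """One stream (RGB or flow) of a generic attention-based MIL model."""

    def __init__(self, feat_dim: int = 1024, num_classes: int = 20):
        super().__init__()
        self.embed = nn.Sequential(nn.Conv1d(feat_dim, feat_dim, 3, padding=1), nn.ReLU())
        self.attention = nn.Conv1d(feat_dim, 1, 1)             # class-agnostic attention
        self.classifier = nn.Conv1d(feat_dim, num_classes, 1)  # per-segment class logits (T-CAM)

    def forward(self, feats):                                  # feats: (B, T, feat_dim)
        x = self.embed(feats.transpose(1, 2))                  # (B, feat_dim, T)
        att = torch.sigmoid(self.attention(x)).squeeze(1)      # (B, T), values in [0, 1]
        tcam = self.classifier(x).transpose(1, 2)              # (B, T, num_classes)
        # attention-weighted temporal pooling -> video-level class scores
        video_logits = (att.unsqueeze(-1) * tcam).sum(1) / (att.sum(1, keepdim=True) + 1e-6)
        return att, tcam, video_logits

att, tcam, logits = StreamBaseModel()(torch.randn(2, 750, 1024))
print(att.shape, tcam.shape, logits.shape)   # (2, 750) (2, 750, 20) (2, 20)
```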

4.3 Pseudo ground truth learning

After training the base model with only video-level labels, we iteratively refine the two-stream base model with frame-level pseudo ground truth.
Specifically, the whole training process is divided into several refinement iterations. At refinement iteration 0, only video-level labels are used for training, while at refinement iteration n+1, the frame-level pseudo-GT produced at refinement iteration n provides frame-level supervision for the current iteration. However, without true frame-level annotations, we can neither measure the quality of the pseudo-GT nor guarantee that it helps the base model achieve higher performance.

Late fusion can be viewed as a vote between the two streams: locations with high activations in both streams are likely to contain ground-truth action instances; locations with high activation in only one stream are likely to be either false-positive proposals or true action instances detected by only one stream; locations with low activation in both streams are likely background.
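
A hedged sketch of how this voting view could yield a frame-level hard pseudo-GT: threshold the late-fused attention so that each segment receives a binary action/background label. The fusion weight and threshold below are illustrative, not the paper's exact values.

```python
import numpy as np

def hard_pseudo_gt(att_rgb, att_flow, lam=0.5, thresh=0.5):
    """Binary frame-level pseudo-GT from the late-fused attention (a sketch).

    Segments whose fused attention is high get label 1 (action), the rest get
    label 0 (background); the fusion weight and threshold are illustrative.
    """
    att_fused = lam * att_rgb + (1.0 - lam) * att_flow
    return (att_fused >= thresh).astype(np.float32)   # one binary label per segment

att_rgb = np.array([0.9, 0.4, 0.1, 0.6])
att_flow = np.array([0.8, 0.3, 0.2, 0.7])
print(hard_pseudo_gt(att_rgb, att_flow))   # [1. 0. 0. 1.]
```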

[Figures omitted]

For the video-level uncertainty estimator, the authors consider two different implementations. (I do not fully understand this part yet.)
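
One of the implementations compared in the ablation (Section 5.4) uses a symmetric KL divergence. A plausible reading, sketched below at the segment level, treats each stream's attention as a per-segment Bernoulli distribution, measures the disagreement between the two streams with the symmetric KL divergence, and maps low disagreement to a high learning weight; a video-level weight could be obtained analogously by aggregating the disagreement over the whole video. The Bernoulli interpretation and the exp(-d) weighting are assumptions, not the paper's exact formulas.

```python
import torch

def symmetric_kl_bernoulli(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Symmetric KL divergence between per-segment Bernoulli attentions p and q."""
    p = p.clamp(eps, 1 - eps)
    q = q.clamp(eps, 1 - eps)
    kl_pq = p * (p / q).log() + (1 - p) * ((1 - p) / (1 - q)).log()
    kl_qp = q * (q / p).log() + (1 - q) * ((1 - q) / (1 - p)).log()
    return kl_pq + kl_qp                       # (T,) per-segment disagreement

def segment_weights(att_rgb: torch.Tensor, att_flow: torch.Tensor) -> torch.Tensor:
    """Down-weight segments where the two streams disagree (uncertain pseudo-GT)."""
    disagreement = symmetric_kl_bernoulli(att_rgb, att_flow)
    return torch.exp(-disagreement)            # weight in (0, 1], 1 when the streams agree

att_rgb = torch.tensor([0.95, 0.6, 0.1])
att_flow = torch.tensor([0.9, 0.2, 0.05])
print(segment_weights(att_rgb, att_flow))
```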

4.4 Action localization

[Figure omitted]
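
Since the localization procedure is only shown in the figure, here is a hedged sketch of a typical final step: given candidate intervals (e.g., obtained by thresholding the fused attention as in the earlier sketch), each proposal is assigned the class with the highest mean T-CAM activation inside the interval, and that mean activation serves as its confidence. The paper's exact scoring function may differ (e.g., it could use an outer-inner contrast term).

```python
import numpy as np

def score_proposals(proposals, tcam_fused):
    """Assign a class and a confidence score to each (start, end) proposal.

    Each proposal is labeled with the class having the highest mean T-CAM
    activation inside the interval, and that mean activation is used as the
    confidence. This is a generic sketch; the paper's exact scoring may differ.
    """
    scored = []
    for start, end in proposals:
        mean_act = tcam_fused[start:end].mean(axis=0)      # (num_classes,) mean activation
        cls = int(mean_act.argmax())
        scored.append((start, end, cls, float(mean_act[cls])))
    return scored

tcam = np.random.rand(6, 20)                               # 6 segments, 20 classes (toy values)
print(score_proposals([(1, 3), (4, 6)], tcam))
```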


5. Experimental Results and Discussion

5.1 Dataset and Evaluation

The THUMOS14 dataset contains 200 validation videos and 213 test videos with temporal annotations for 20 action categories. We use the 200 validation videos for training and the 213 test videos for evaluation. Following BaSNet, we remove test videos #270, #1292, and #1496 because they are incorrectly annotated. Each video in THUMOS14 contains 15.5 action instances on average.
The ActivityNet dataset has two released versions, ActivityNet v1.3 and ActivityNet v1.2. ActivityNet v1.2 is a subset of ActivityNet v1.3, covering 100 action categories, with 4819 and 2383 videos in the training and validation sets, respectively. We use the training set for training and the validation set for testing. Each video in ActivityNet contains 1.5 action instances on average.
The HACS dataset is a recently released dataset for the TAL task. To the best of our knowledge, it is the largest TAL benchmark so far, covering 200 action classes, including a training set of 37612 videos and a validation set of 5981 videos. We use HACS v1.1.1 for experiments. Each video in this dataset contains an average of 2.5 action instances.
Evaluation Metrics: Following the standard protocol for temporal action localization, we evaluate the method using mean average precision (mAP) at different intersection-over-union (IoU) thresholds. Performance is measured with the official evaluation code provided by ActivityNet.
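
For reference, the temporal IoU between a predicted interval and a ground-truth interval is the length of their intersection divided by the length of their union; a small self-contained example:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals (in seconds or segments)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as a true positive at threshold 0.5 only if its IoU with an
# unmatched ground-truth instance is at least 0.5; AP is computed per class and
# mAP averages AP over all classes.
print(temporal_iou((12.0, 20.0), (15.0, 25.0)))   # 5 / 13 ≈ 0.385
```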

5.2 Implementation Details

Optical flow is estimated by the TV-L1 algorithm.
Two off-the-shelf feature extraction backbones, UntrimmedNet and I3D, are used in the experiments with segment lengths of 15 and 16 frames, respectively.
These two backbone networks are pre-trained on ImageNet and Kinetics-400 respectively, without fine-tuning for fair comparison. RGB and optical flow segment-level features are extracted as 1024-D vectors at the global_pool layer.
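
As a small illustration of the segment-level setup, the helper below only shows how a video could be split into non-overlapping 16-frame segments; each segment would then be passed through the frozen backbone (e.g., I3D) to obtain one 1024-D feature vector per modality. The backbone call itself is omitted, and the function is an illustrative assumption rather than the paper's preprocessing code.

```python
import numpy as np

def make_segments(num_frames: int, seg_len: int = 16):
    """Split a video into non-overlapping segments of seg_len frames."""
    starts = np.arange(0, num_frames - seg_len + 1, seg_len)
    return [(int(s), int(s + seg_len)) for s in starts]

print(make_segments(100))   # 6 segments of 16 frames; the trailing 4 frames are dropped
```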

5.3 Comparison with other methods

Comparison of our method with state-of-the-art TAL methods on the THUMOS14 test set. Recent fully-supervised and weakly-supervised methods are reported. UNT and I3D are abbreviations for UntrimmedNet features and I3D features, respectively. The Average column reports the average mAP over IoU thresholds 0.3:0.1:0.7.
[Table omitted]
There are also comparative test data in ActivityNet v1.2 and ActivityNet v1.3.

5.4 Ablation experiment

In this subsection, to better analyze the contribution of each component, we perform ablation studies on the THUMOS14 test set. All ablation studies use I3D features.

Ablation study of the adaptive attention normalization loss La-norm:

[Table omitted]
(1) For the original attention normalization loss Lnorm, the performance first improves as s increases from 2 to 8, indicating that a manually fixed proportion of action and background segments does not match the true action/background distribution (e.g., s = 4 assumes the action and the background each occupy 25% of the whole video). The performance drops at s = 16, which may be due to the reduced number of training samples. (2) For the adaptive version without a lower bound, the performance degrades significantly as s′ increases. Moreover, the number of selected action clips decreases faster than the number of background clips. We conjecture that, without the lower-bound constraint, the model focuses only on the most discriminative parts of actions for classification, while ignoring the completeness of action instances. (3) With the lower-bound constraint (the last group), the performance improves significantly at s′ = 2 and s′ = 4, illustrating the effectiveness of the adaptive attention normalization loss. When s′ = 1, the performance drops slightly, indicating that it is unreliable for a single modality to determine the action and background assignment over the entire video.

Ablation study on the pseudo ground truth:
[Table omitted]
The hard pseudo-GT greatly improves the performance of the RGB stream and the fusion results. Although the performance of the flow stream drops slightly, the fusion result outperforms both streams after hard pseudo-label learning. In contrast, the soft pseudo-GT degrades the performance of the flow stream and the fusion results. As for the RGB stream, although soft labels improve its performance, the improvement requires more refinement iterations and is still lower than that obtained with hard labels. These results reveal the importance of removing ambiguity in the pseudo-GT.
[Figure omitted]
The pseudo-GT improves the localization performance of the RGB stream and the fusion results at all IoU thresholds, and improves the flow stream at high IoU thresholds. In addition, the pseudo-GT greatly improves the precision and recall of the RGB stream, and improves the precision of the flow stream and the fusion results at a slight cost in recall. The pseudo-GT improves the F-measure of all three results.

Ablation Study on Uncertainty Estimator:
To alleviate the adverse effect of noise in the pseudo-GT, a video-level uncertainty estimator and a segment-level uncertainty estimator are introduced. They estimate the reliability of the pseudo-GT across videos in a batch and across segments within a video, respectively, thereby down-weighting uncertain pseudo-GT and increasing the weight of confident pseudo-GT.
Table 7 summarizes the results, showing that using either of the uncertainty estimators improves performance, and their combination leads to even higher performance.
Specifically, the segment-level uncertainty estimator is more influential than the video-level one. Furthermore, uncertainty estimators based on the symmetric KL divergence perform better than those based on attention differences.
[Table omitted]

Sensitivity analysis to the threshold parameter θ:
[Figure omitted]
Sensitivity analysis to the fusion parameter λ:
[Figure omitted]
λ is an important hyperparameter that controls the relative importance of the RGB stream and the flow stream during late fusion, and thus affects both the fusion results and the pseudo-GT.
With only video-level supervision, the late fusion result is better than both individual streams only when the higher-performing stream dominates (e.g., λ = 0.2). With frame-level pseudo-supervision, the localization performance of the RGB stream and the fusion results is greatly improved compared with video-level supervision alone. However, when the noisier RGB stream dominates the pseudo-GT (i.e., λ > 0.5), the performance of the streams and the fusion results drops significantly.

Ablation study on the early fusion framework:
As we reviewed in Section 1, there are two mainstream two-stream fusion methods, namely early fusion and late fusion.
In the early fusion framework, the pseudo-GT requires the base model to reproduce its previous predictions under different random model noise (such as dropout), which improves the generalization ability and robustness of the base model. Thus, both the soft pseudo-GT and the hard pseudo-GT improve the performance of the early fusion framework, demonstrating their effectiveness. In addition, the hard pseudo-GT achieves higher performance than the soft pseudo-GT, which is consistent with the results in the late fusion framework.
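
For contrast with the late-fusion sketches above, early fusion simply concatenates the two modalities' features before the network. A minimal illustration with hypothetical tensor sizes:

```python
import torch

# Hypothetical sizes: one video with 750 segments and 1024-D features per stream.
feat_rgb = torch.randn(1, 750, 1024)
feat_flow = torch.randn(1, 750, 1024)
feat_early = torch.cat([feat_rgb, feat_flow], dim=-1)   # (1, 750, 2048), fed to a single model
print(feat_early.shape)
```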

Sensitivity to hyperparameters:
[Table omitted]
The results show that the method is robust to the loss weights of the adaptive attention normalization loss (Table 9(a)), the smoothing loss (Table 9(b)), and the adaptive pseudo-GT learning loss (Table 9(c)).

Qualitative analysis:

Green boxes indicate regions with attentional activations above 0.5. The horizontal and vertical axes are the time and intensity of attention, respectively.

[Figure omitted]
In the first example, with only video-level labels, the RGB stream provides worse localization results than the optical-flow stream, leading to a noisy fused attention sequence. The pseudo-GT-guided RGB stream identifies false-positive action proposals and discovers the true action instances. Furthermore, it also leads to a cleaner fused attention sequence, in which high activations correspond better to the ground truth.
[Figure omitted]
In the second example, with only video-level supervision, both streams have some non-overlapping false positive action proposals at the beginning of the video. In this case, pseudo-GT helps eliminate such false positives.
[Figure omitted]
In the third example, with only video-level supervision, the RGB stream can only distinguish certain scenes but cannot localize the approximate action instances. In contrast, the optical-flow stream can accurately detect the ground-truth action instances. The pseudo-GT therefore helps the RGB stream to separate consecutive action instances.
[Figure omitted]
The last example shows a typical case of performance degradation.


6 Conclusion

In this paper, we propose an Adaptive Two-Stream Consensus Network (A-TSCN) for W-TAL, which benefits from an adaptive attention normalization loss and an iterative refinement training scheme. The adaptive attention normalization loss dynamically selects action and background segments in a video and forces the attention to act as a binary selector, thereby reducing the ambiguity between foreground and background. The iterative refinement training scheme uses novel frame-level pseudo ground truth as fine-grained supervision and iteratively improves the two-stream base model. Meanwhile, a video-level uncertainty estimator and a segment-level uncertainty estimator dynamically determine the learning weight of each video and each segment, which mitigates the adverse effect of learning from noisy pseudo-labels. Experiments on four benchmarks demonstrate that the proposed A-TSCN outperforms current state-of-the-art methods and validate our design intuition.


7. Accuracy of mainstream datasets

[Tables omitted]

It can be seen that, for weakly supervised temporal action localization, the accuracy on mainstream datasets is still not very high, and much work remains to be done in this area.

Origin blog.csdn.net/weixin_45751396/article/details/127508677