[Spatial-Temporal Action Localization (1)] Understanding spatiotemporal action localization

Recommended blog posts and frameworks

Nanjing University has open-sourced MultiSports: a fine-grained, multi-person spatio-temporal action detection dataset for sports scenes...

Recommended paper reading: Video Understanding (3): Spatio-Temporal Action Localization

Task definition

Spatio-temporal action detection: given an untrimmed video, the task is not only to identify the start and end times of the actions in the video and their corresponding categories, but also to mark the spatial position of the actor with a bounding box in each frame.

[Figure: spatio-temporal localization of a "long jump" action instance]
Spatio-temporal action detection aims to localize action instances in both space and time and to recognize their action labels. In the fully supervised setting of this task, the temporal boundaries of action instances at the video level, the spatial bounding boxes of actions at the frame level, and the action labels are all provided during training and must be detected during inference. For example, the start and end of a "long jump" action are detected in the temporal domain, and the bounding box of the actor performing the action is detected in each frame in the spatial domain.
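To make the required output concrete, here is a minimal sketch of the data structure a detector has to produce; the `ActionTube` name and its fields are illustrative, not taken from any particular codebase.

```python
from dataclasses import dataclass, field

@dataclass
class ActionTube:
    """One detected action instance: a label plus a box in every frame it spans."""
    label: str                 # e.g. "long jump"
    start_frame: int           # temporal boundary (inclusive)
    end_frame: int             # temporal boundary (inclusive)
    # frame index -> (x1, y1, x2, y2) bounding box of the actor
    boxes: dict[int, tuple[float, float, float, float]] = field(default_factory=dict)
    score: float = 0.0         # detection confidence

# A fully supervised detector outputs a list of ActionTube per video;
# the ground truth used in training has the same shape.
tube = ActionTube(label="long jump", start_frame=120, end_frame=210)
tube.boxes[120] = (0.31, 0.40, 0.45, 0.92)  # normalized coordinates, one box per frame
```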

Task difficulty

Spatio-temporal modeling: one of the key challenges in this area is how to model the spatio-temporal information in videos. Generally, this means modeling the motion, poses, and scenes in a video so as to accurately capture the spatio-temporal characteristics of actions (a minimal 3D-convolution sketch follows the list below).

  • The action localization task also faces significant challenges, e.g. intra-class variability, cluttered backgrounds, low-quality video data, occlusion, and viewpoint changes.
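As a minimal illustration of spatio-temporal modeling, the sketch below applies PyTorch's `Conv3d` to a clip tensor. This is a generic, assumed example rather than the backbone of any particular method: because the kernel spans several frames, a single operator responds to motion patterns as well as static appearance.

```python
import torch
import torch.nn as nn

# A batch of video clips: (batch, channels, time, height, width)
clip = torch.randn(2, 3, 16, 112, 112)  # 16 RGB frames of 112x112

# 3D convolution: the 3x3x3 kernel spans 3 consecutive frames,
# so each filter can capture motion, not just appearance.
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=3, stride=1, padding=1)

features = conv3d(clip)
print(features.shape)  # torch.Size([2, 64, 16, 112, 112])
```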

Datasets

Datasets in the field of video understanding (including S-TAL)

Existing datasets fall mainly into two major categories:

  • Densely annotated datasets (25 FPS), represented by UCF101-24 and JHMDB. Each video in these datasets contains only one action; most videos show a single person performing repetitive actions with simple semantics, and the action categories are highly correlated with the background.

  • Sparsely annotated datasets (1 FPS), represented by AVA. Because the annotation is sparse, no clear action boundaries are given, so existing methods behave more like instance-level action recognition, weakening temporal localization. At the same time, the action categories are daily atomic actions: movement is slow, deformation is small, and tracking is easy, so classification does not require complex modeling of and reasoning about people, objects, and the environment.

Atomic Visual Actions: "atomic actions" are the basic, smallest-unit actions in an action dataset, usually the smallest identifiable units in action recognition tasks. They are common in daily life, short in duration, small in deformation, and slow in movement, which makes them easy to track. Such atomic actions are typical of sparsely annotated datasets because they are relatively easy to identify and classify, without requiring complex modeling of and reasoning about people, objects, and the environment.

  • AVA is designed for spatio-temporal action detection and consists of 437 videos, each a 15-minute segment taken from a movie. Every person appearing in a test video must be detected in every frame, and the multi-label actions of each detected person must be predicted correctly. The action label space contains 80 atomic action classes, but results are typically reported on the most frequent 60 classes.
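For concreteness, AVA distributes its annotations as CSV rows of the form `video_id, timestamp, x1, y1, x2, y2, action_id, person_id`, with box coordinates normalized to [0, 1] and one annotated keyframe per second. A minimal parsing sketch under that assumption:

```python
import csv
from collections import defaultdict

def load_ava_annotations(csv_path):
    """Group AVA box annotations by (video_id, timestamp) keyframe.

    Each row: video_id, timestamp_sec, x1, y1, x2, y2, action_id, person_id,
    with coordinates normalized to [0, 1]. The same person can carry several
    action_ids at one keyframe (multi-label).
    """
    keyframes = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            video_id, ts = row[0], float(row[1])
            box = tuple(float(v) for v in row[2:6])
            action_id, person_id = int(row[6]), int(row[7])
            keyframes[(video_id, ts)].append((person_id, box, action_id))
    return keyframes
```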

Task status


Evaluation metrics

  • frame-AP: measures the area under the precision-recall curve of the detections for each frame. A detection is correct if its intersection-over-union (IoU) with the ground truth at that frame is greater than a threshold and the action label is correctly predicted (a sketch follows this list).
  • video-AP: measures the area under the precision-recall curve of the action tube predictions. A tube is correct if the mean per-frame intersection-over-union with the ground truth across the frames of the video is greater than a threshold and the action label is correctly predicted.
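A minimal pure-Python sketch of the frame-level rule (illustrative, not the official evaluation code): a detection is a true positive when its IoU with a not-yet-matched ground-truth box in the same frame clears the threshold, and AP accumulates the area under the resulting precision-recall curve.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def frame_ap(detections, ground_truths, thresh=0.5):
    """Frame-level AP for one action class.

    detections: list of (score, frame_id, box), all predicted as this class.
    ground_truths: dict mapping frame_id -> list of ground-truth boxes.
    """
    detections = sorted(detections, key=lambda d: -d[0])   # high score first
    matched = {f: [False] * len(b) for f, b in ground_truths.items()}
    n_gt = sum(len(b) for b in ground_truths.values())
    tp = fp = 0
    prev_recall = ap = 0.0
    for score, fid, box in detections:
        # find the best unmatched ground-truth box in this frame
        best, best_j = 0.0, -1
        for j, gt in enumerate(ground_truths.get(fid, [])):
            o = iou(box, gt)
            if o > best and not matched[fid][j]:
                best, best_j = o, j
        if best >= thresh:                                 # true positive
            matched[fid][best_j] = True
            tp += 1
        else:                                              # false positive
            fp += 1
        recall = tp / max(n_gt, 1)
        ap += (tp / (tp + fp)) * (recall - prev_recall)    # step-wise PR area
        prev_recall = recall
    return ap
```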

"Action tubes predictions" refers to a series of temporal and spatial regions where detected action instances in a video are connected . This area represents the start and end time of the action and the spatial location where the action occurs. "video-AP measures the area under the precision-recall curve of the action tubes predictions" refers to evaluating the performance of the model by calculating the intersection ratio between the predicted area and the actual area for all action instances in the video. On each frame of the video, the average intersection ratio of the predicted action area and the real action area needs to be greater than a set threshold before it is considered the correct action area.

Perspectives for innovation

  • Multi-modal information: in addition to video frames, multi-modal information such as audio and text descriptions can be used to improve action detection performance, allowing a more comprehensive understanding of the video content.

  • Attention mechanisms: in spatio-temporal action detection, attention mechanisms are often introduced to help the model focus on the key moments and spatial regions in the video that are related to the actions (a minimal sketch follows).
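As a minimal sketch of the idea (generic scaled dot-product self-attention over space-time tokens, not any specific paper's design): flattening a clip's feature map into T·H·W tokens lets the model weight the moments and regions most relevant to the action.

```python
import torch
import torch.nn as nn

# Feature map from a video backbone: (batch, channels, time, height, width)
feats = torch.randn(2, 64, 8, 14, 14)
b, c, t, h, w = feats.shape

# Flatten space-time into a token sequence: (batch, T*H*W, channels)
tokens = feats.flatten(2).transpose(1, 2)

# Self-attention lets every space-time position attend to every other,
# so the model can emphasize the frames/regions where the action happens.
attn = nn.MultiheadAttention(embed_dim=c, num_heads=4, batch_first=True)
out, weights = attn(tokens, tokens, tokens)
print(out.shape)  # torch.Size([2, 1568, 64])
```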

Source: blog.csdn.net/weixin_45751396/article/details/132780883