Video Action Recognition (2): Hierarchical Compositional Representations for Few-Shot Action Recognition

Hierarchical compositional representations for few-shot action recognition

      The paper was published at CVPR 2023, a top conference in computer vision.
Paper address: https://arxiv.org/abs/2208.09424
Code: not yet open-sourced (focus on the idea)

Article innovation

1. Core work

      A novel Hierarchical Compositional Representation (HCR) learning method is proposed for few-shot action recognition. Specifically, hierarchical clustering divides an action into multiple sub-actions, which are further decomposed into fine-grained spatial attention sub-actions (SAS actions). The underlying assumption is that novel action categories and base action categories share similarities at the level of sub-actions and fine-grained SAS actions. In addition, the similarity between the sub-action sequences of video samples is measured with Earth Mover's Distance (EMD).
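
The post ships no code, but the segmentation idea is easy to sketch. Below is a minimal illustration of adjacency-constrained hierarchical clustering (our own construction, not the paper's implementation), assuming pre-extracted per-clip features and a plain dot-product similarity: only neighbouring segments may merge, so each sub-action remains a contiguous span of variable temporal length.

```python
import numpy as np

def segment_into_subactions(clip_feats: np.ndarray, num_subactions: int):
    """Greedy agglomerative clustering over *adjacent* clips only, so each
    sub-action is a contiguous, variable-length span of the video.

    clip_feats: (T, D) array of per-clip features (assumed given).
    Returns a list of (start, end) index ranges, one per sub-action.
    """
    # Start with every clip as its own segment.
    segments = [(t, t + 1) for t in range(clip_feats.shape[0])]

    def mean_feat(seg):
        s, e = seg
        return clip_feats[s:e].mean(axis=0)

    while len(segments) > num_subactions:
        # Similarity of each pair of *neighbouring* segments ...
        sims = [float(np.dot(mean_feat(segments[i]), mean_feat(segments[i + 1])))
                for i in range(len(segments) - 1)]
        i = int(np.argmax(sims))
        # ... merge the most similar pair, preserving temporal order.
        segments[i:i + 2] = [(segments[i][0], segments[i + 1][1])]
    return segments

# Toy usage: 8 clips with 16-dim features, grouped into 3 sub-actions.
rng = np.random.default_rng(0)
print(segment_into_subactions(rng.normal(size=(8, 16)), 3))
# -> three contiguous (start, end) spans that together cover all 8 clips
```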

2. Idea

      Although there is a large gap between the base actions seen during training and the novel actions seen at test time, they can share basic SAS actions. For example, almost all videos in the HMDB51 dataset contain arm movements. Therefore, the paper learns fine-grained patterns from the rich base action classes and transfers them to novel action categories. These fine-grained patterns provide discriminative, transferable information for classification across categories.

3. Contribution

(1) A hierarchical representation built from fine-grained sub-actions and SAS action components is proposed, which learns patterns shared between novel actions and base actions.
(2) A Parts Attention Module (PAM) is designed to focus on various regions of interest; in particular, explicit SAS actions capture predefined body parts, while implicit SAS actions capture other action-related cues such as context.
(3) To better compare fine-grained patterns, Earth Mover's Distance is employed as the distance metric; it handles actions that are insensitive to temporal order and matches these fine-grained, discriminative sub-action representations well.
(4) Extensive experiments show that the proposed method achieves state-of-the-art results on HMDB51, UCF101 and Kinetics datasets.

Model structure

[Figure: overall framework of the proposed HCR model]

  1. The figure above shows the overall framework of the few-shot video action recognition model proposed in this paper. The model first divides a complex action into several sub-actions through clustering, and then further decomposes these sub-actions into finer-grained SAS actions through the Parts Attention Module (PAM). The fine-grained SAS actions consist of explicit SAS actions, which correspond to predefined body parts, and implicit SAS actions, which correspond to other action-related cues such as contextual information. Furthermore, the paper modifies traditional hierarchical clustering so that a video is segmented into sub-actions of varying temporal lengths rather than split into equal-length clips; similar video frames are thereby grouped together and continuous semantics are preserved within each sub-action. Moreover, since directly aligning local representations along the temporal dimension cannot properly handle samples whose actions are insensitive to temporal order, the paper adopts Earth Mover's Distance (EMD) as the distance function for matching sub-action representations and comparing fine-grained patterns: the temporal order inside each clip is preserved within the clustered sub-actions, while the temporal order between clips is ignored by the EMD optimization. Finally, after the matching similarity is obtained from EMD, a softmax function computes the probability of each action class.
  2. Model structure: the figure shows the Hierarchical Compositional Representations (HCR) model proposed in this paper. Specifically, the video is first cut into multiple sub-action segments of different lengths, and spatio-temporal features are extracted for each segment. The Parts Attention Module (PAM) then treats each channel as a SAS action and divides the channels into explicit SAS actions (i.e., body parts) and implicit SAS actions (i.e., contextual information). Finally, the EMD distance function computes the similarity between the sub-action representation sequences of the support set and the query set; the similarity score is fed to a softmax layer and mapped to a probability distribution over action classes (a small numerical sketch follows the formula below). The calculation formula is as follows:
$$P(y = c \mid q) = \frac{\exp\big(s(q, S_c)\big)}{\sum_{c'} \exp\big(s(q, S_{c'})\big)}$$
where $s(q, S_c)$ denotes the EMD-based matching similarity between the query video $q$ and the support representation of class $c$.
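
As a quick numerical illustration of the formula above (a sketch, not the paper's code), per-class matching similarities, e.g., negative EMD costs, are mapped to class probabilities by a softmax; the temperature knob below is our own addition:

```python
import numpy as np

def class_probabilities(similarities: np.ndarray, temperature: float = 1.0):
    """Softmax over per-class EMD matching similarities s_c."""
    z = similarities / temperature
    z = z - z.max()          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Toy 5-way episode: the most similar class gets the largest probability.
sims = np.array([0.8, 0.1, 0.3, 0.05, 0.2])
print(class_probabilities(sims))  # largest mass on class 0
```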

Key Technology Analysis

1. Hierarchical compositional representation

      The method divides a video action twice: first, the complex action is divided into multiple sub-actions, following the principle that people usually decompose an action in order to recognize it; second, each sub-action is divided along the spatial dimension into multiple fine-grained SAS actions. After these two divisions, information can be transferred between the base actions (training) and the novel actions (testing). The backbone is an efficient R(2+1)D network, with the following structure:
[Figure: structure of the R(2+1)D backbone]
      Modifications are made to the original network: (1) the PAM module is added to help the model learn SAS actions (a toy sketch of such a module follows the figures below); its structure is shown here:
[Figure: structure of the PAM module]
(2) spatial downsampling (a max-pooling layer) is added; (3) temporal downsampling is removed. The following figure visualizes the predicted SAS action attention maps:
[Figure: visualization of predicted SAS action attention maps]
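
The post does not spell out PAM's internals, so the following is only a toy sketch of one plausible shape for such a module, with every layer choice being our own assumption: a 1x1 convolution produces one spatial attention map per SAS action, the backbone features are pooled under each map, and the resulting channels are split into an explicit group (which could be supervised with body-part cues) and an implicit group.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyPartsAttention(nn.Module):
    """Illustrative stand-in for a PAM-like module (not the paper's code)."""

    def __init__(self, in_channels: int, num_sas: int, num_explicit: int):
        super().__init__()
        assert num_explicit <= num_sas
        self.num_explicit = num_explicit
        # One 1x1-conv output channel per SAS action.
        self.to_sas = nn.Conv2d(in_channels, num_sas, kernel_size=1)

    def forward(self, feats: torch.Tensor):
        # feats: (B, C, H, W) frame-level features from the backbone.
        b, _, h, w = feats.shape
        attn = self.to_sas(feats)                  # (B, S, H, W)
        attn = F.softmax(attn.flatten(2), dim=-1)  # one spatial map per SAS
        attn = attn.view(b, -1, h, w)
        # Pool backbone features under each attention map: one vector per SAS.
        sas_feats = torch.einsum('bshw,bchw->bsc', attn, feats)
        explicit = sas_feats[:, :self.num_explicit]   # "body part" SAS actions
        implicit = sas_feats[:, self.num_explicit:]   # "context" SAS actions
        return explicit, implicit, attn

# Toy usage: 512-channel features, 16 SAS actions, 6 of them explicit.
pam = ToyPartsAttention(512, num_sas=16, num_explicit=6)
e, i, _ = pam(torch.randn(2, 512, 7, 7))
print(e.shape, i.shape)  # torch.Size([2, 6, 512]) torch.Size([2, 10, 512])
```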

2. EMD metric

      When measuring the distance between the support set and the query set, directly comparing global representations loses temporal information, while strictly aligning local representations cannot handle actions that are insensitive to temporal order. Therefore, Earth Mover's Distance (EMD) is used to measure the distance between two sub-action sequences; EMD evaluates the difference between two multidimensional distributions in a vector space. When computing the distance between the sub-action representation sequences of the support and query sets, the weight of each node is computed first, each sub-action feature is treated as a node (analogous to suppliers and demanders in the transportation problem), and the distance between two action videos u and v is then taken as the optimal matching cost between their corresponding sub-action representation sequences.
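
For concreteness, here is a self-contained sketch of EMD as the classic transportation linear program. It is our illustration only: a Euclidean ground cost and externally supplied node weights are assumed, whereas the paper derives the weights from the features themselves.

```python
import numpy as np
from scipy.optimize import linprog

def emd_distance(x, y, wx, wy):
    """EMD between two sub-action sequences, as a transportation LP.

    x: (m, d) sub-action features of video u (suppliers)
    y: (n, d) sub-action features of video v (demanders)
    wx, wy: node weights, each summing to 1.
    """
    m, n = len(x), len(y)
    # Ground cost: pairwise Euclidean distances, flattened row-major.
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1).ravel()

    A_eq = np.zeros((m + n, m * n))
    for i in range(m):                 # supplier i ships exactly wx[i]
        A_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):                 # demander j receives exactly wy[j]
        A_eq[m + j, j::n] = 1.0
    b_eq = np.concatenate([wx, wy])

    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return float(res.fun)              # optimal matching cost = EMD

# Toy usage: videos with 3 and 4 sub-actions, uniform node weights.
rng = np.random.default_rng(0)
u, v = rng.normal(size=(3, 8)), rng.normal(size=(4, 8))
print(emd_distance(u, v, np.full(3, 1 / 3), np.full(4, 1 / 4)))
```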

Experiment analysis

[Table: comparison with the state of the art on HMDB51 and UCF101 (5-way)]

      The datasets used in this paper are HMDB51, UCF101 and Kinetics, and the paper compares against the current SOTA under the 5-way setting; the results are shown in the table above. Without pre-training on any dataset, the method outperforms ARN by 3.1% and 5.5% on HMDB51 and UCF101, respectively, in the 1-shot setting. Furthermore, it outperforms MlSo, reaching 48.6% and 71.8% on HMDB51 and UCF101, respectively.
[Table: comparison on Kinetics under the CMN protocol]

In the table above, following the Kinetics-CMN protocol (Zhu and Yang, 2018), the method outperforms all recent approaches, e.g., TRX and HyRSM, and achieves state-of-the-art results on Kinetics, which again demonstrates its superiority.

Future work

  1. In future research, many applied problems can be approached by imitating human thinking. For example, this paper imitates how humans usually break an action down into small details when recognizing it, and accordingly the model divides the actions in a video twice to achieve fine-grained recognition.
  2. For computing the distance between actions, the paper uses EMD, an algorithm not commonly seen in this area of computer vision. In everyday research work, it is therefore worth broadening one's knowledge and borrowing identical or similar methods from other fields to break through the bottleneck of the current task.

Original post: blog.csdn.net/chenzhiwen1998/article/details/131649995