Action Recognition: A Comprehensive Study of Deep Video Action Recognition

0. Preface

  • Relevant information:
    • arxiv
    • github (GluonCV is also starting to support PyTorch, which is good news for me)
    • Interpretation of the paper
  • Basic information of the paper
    • Field: action recognition
    • Affiliation: Amazon
    • Publication date: 2020.12
  • One-sentence summary: the paper surveys the development of action recognition from the perspective of datasets and models, provides a code base, discusses current challenges, and looks ahead to future trends.

1. Data set overview

1.1. Reading the figure

  • Overview of action recognition datasets over the past 10 years

    • The x-axis is the year
    • The y-axis is the number of classes (log scale)
    • The marker size indicates the number of samples
  • [Figure: action recognition datasets over the past 10 years]

1.2. Overview of the data set

  • Dataset construction process

    • Define the action categories (reuse categories from earlier datasets and add new ones as needed)
    • Collect videos from various sources such as YouTube; the video title usually contains the name of the action category
    • Manually annotate the start and end time of each action
    • Finally, clean the data (remove duplicate labels, incorrect labels, etc.)
  • Challenges

    • Challenge 1: Defining the action categories is troublesome but very important.
      • The reason: human actions are a complex concept, and there is no good hierarchical structure for them.
      • My own understanding: the main difference between action categories and image categories is that action categories are verbs or gerunds.
        • Verbs themselves are complicated, and some are polysemous. For example, "do" can mean many different things, such as doing surgery or making a watch; words like "take" or "play" in English are similarly troublesome and hard to define clearly.
        • Gerunds are even more complicated. For example, for the action "pick up an item", do picking up a phone and picking up a water glass count as the same action? In some scenarios they do, and in others they do not.
        • Image classification targets nouns, which are generally easier to classify and have a better hierarchical structure.
    • Challenge 2: Annotating videos is quite troublesome
      • Annotators have to watch the entire video (unlike images, which can be labeled much faster)
      • Labels are often ambiguous; for example, different people may disagree on where an action starts and ends
    • Challenge 3: The datasets themselves are hard to obtain
      • Many datasets only provide links, so you have to download the videos yourself; different people may end up with slightly different data, which makes model comparisons not entirely fair
      • I am going through all of this pain myself right now; it was really rough when I first started
  • Data set classification

    • Scene-focused datasets: the videos are very short and the action can often be judged from static content alone, e.g., Kinetics-400/UCF101/HMDB51
    • Motion-focused datasets: the background provides little help for recognizing the action itself, and categories include things like "from left to right" and "from right to left", so strong motion information is required
    • Multi-label datasets: the annotations are richer, e.g., bounding boxes and object labels

1.3. Overview of specific data sets

  • I'm not going to cover them here; I keep a separate note describing these datasets:
    • HMDB51, UCF101, Sports1M, ActivityNet, YouTube8M, Charades, Kinetics, Something-Something, AVA, Moments in Time, HACS, HVU, AViD

2. Model development

2.1. Reading the figures

  • [Figures: overview of model development for deep video action recognition over time]

2.2. Model overview and challenges

  • Challenges of modeling video data
    • Challenge 1: Human actions have very large intra-class and inter-class variation
      • Videos of the same action can differ a lot, and so can videos of different actions
      • The same action can be performed from different viewpoints and at different speeds
      • Some actions have very similar motion patterns and are very hard to tell apart
    • Challenge 2: Action models must capture both short-term motion and long-range temporal structure
      • "both short-term action-specific motion information and long-range temporal information"
    • Challenge 3: The computational cost of training and inference is very high

2.3. Model development

  • Hand-crafted features: not my focus, so I did not look into this closely.

  • Two-stream networks

    • Optical flow is a motion representation that describes how objects and the scene move
      • It captures motion patterns very well
      • Compared with RGB images, it provides orthogonal information; my guess is that this means context/appearance is largely ignored and the motion itself is emphasized
    • The two-stream approach is inspired by the "two-streams hypothesis": the visual cortex contains two pathways, a ventral stream (for recognizing objects) and a dorsal stream (for recognizing motion); a minimal two-stream sketch is given after this list
    • RNN-based methods: basically a CNN backbone followed by an LSTM or one of its variants
    • Segment-based methods: e.g., TSN/TSM, and temporal relational reasoning methods such as TRN built on top of them
    • Multi-modal data: e.g., adding audio, depth, skeleton information, etc.
  • 3D models

    • Directly extend 2D models to 3D, e.g., C3D
    • Mix or factorize 2D and 3D convolutions, e.g., R(2+1)D (see the factorization sketch after this list)
    • Long-range temporal modeling: standard action recognition models short clips, while some methods explicitly model long-range temporal structure, e.g., T3D/LTC. The authors also place Non-local in this part; I did not fully understand it.
    • Efficient variants of 3D models, e.g., X3D/A3D
  • Exploring more efficient ways to model video

    • Problems:
      • For Kinetics-400, pre-computing optical flow takes about 4.5 TB of storage...
      • 3D models are harder to deploy (operator support is not as mature as for 2D)
      • 3D models also demand higher I/O performance
    • Methods that approximate optical flow
      • Two-stream methods require pre-computed optical flow, which is a big limitation
      • Methods such as MotionNet and PAN learn to approximate optical flow inside the network
    • Temporal modeling without 3D convolution, usually via new building blocks, e.g., TSM/TIN/STM/TEA/TEINet (see the temporal-shift sketch after this list)
  • Other research

    • Trajectory-based methods
    • Rank-pooling-based methods (modeling with an approach similar to learning-to-rank), although these papers seem to be early ones (before 2017)
    • Compressed-video action recognition: in video coding, I-frames are key frames while P/B frames are not, and some methods start from this observation; some also use knowledge distillation. Various sampler methods, such as SCSampler, may fall into this category as well
    • Frame/clip sampling: standard methods implicitly treat all input frames as equally important, which should not be the case; sampler-related work can also be put here (a segment-sampling sketch is given after this list)
    • Visual tempo: modeling how fast an action is performed, e.g., CIDC/TPN
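Below is a minimal PyTorch sketch of the two-stream idea referenced above: one 2D CNN on an RGB frame, a second 2D CNN on a stack of pre-computed optical flow fields, and late fusion by averaging class scores. This is my own illustration, not the paper's model or its code base; the ResNet-18 backbone, the 400 classes, and the 20 flow channels (10 flow fields x 2 components) are illustrative assumptions.

```python
# Minimal two-stream sketch (illustrative only).
import torch
import torch.nn as nn
import torchvision.models as models


class TwoStreamNet(nn.Module):
    def __init__(self, num_classes=400, flow_channels=20):
        super().__init__()
        # Spatial stream: a standard 2D CNN on a single RGB frame.
        self.rgb_stream = models.resnet18()
        self.rgb_stream.fc = nn.Linear(self.rgb_stream.fc.in_features, num_classes)

        # Temporal stream: same backbone, but the first conv takes stacked optical flow.
        self.flow_stream = models.resnet18()
        self.flow_stream.conv1 = nn.Conv2d(
            flow_channels, 64, kernel_size=7, stride=2, padding=3, bias=False
        )
        self.flow_stream.fc = nn.Linear(self.flow_stream.fc.in_features, num_classes)

    def forward(self, rgb, flow):
        # Late fusion: average the class scores of the two streams.
        return (self.rgb_stream(rgb) + self.flow_stream(flow)) / 2


if __name__ == "__main__":
    model = TwoStreamNet()
    rgb = torch.randn(2, 3, 224, 224)    # batch of RGB frames
    flow = torch.randn(2, 20, 224, 224)  # batch of stacked flow fields
    print(model(rgb, flow).shape)        # torch.Size([2, 400])
```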
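The "mix 2D and 3D" idea behind R(2+1)D can be shown with a tiny factorization sketch: a full 3x3x3 convolution is replaced by a 1x3x3 spatial convolution followed by a 3x1x1 temporal convolution. The channel sizes here are arbitrary, and this only illustrates the idea, not the official R(2+1)D block.

```python
# 3D convolution vs. a (2+1)D factorized convolution (illustrative shapes).
import torch
import torch.nn as nn

in_ch, out_ch, mid_ch = 64, 64, 96  # mid_ch is a free design choice

# Plain 3D convolution: mixes space and time in a single 3x3x3 kernel.
conv3d = nn.Conv3d(in_ch, out_ch, kernel_size=(3, 3, 3), padding=(1, 1, 1))

# (2+1)D factorization: spatial 1x3x3 conv, then temporal 3x1x1 conv.
conv2plus1d = nn.Sequential(
    nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
    nn.BatchNorm3d(mid_ch),
    nn.ReLU(inplace=True),
    nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
)

x = torch.randn(1, in_ch, 8, 56, 56)  # (batch, channels, time, height, width)
print(conv3d(x).shape, conv2plus1d(x).shape)  # both keep the input shape
```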
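The temporal-shift sketch mentioned for TSM-style methods: shift a fraction of the channels forward and backward along the time axis so that a plain 2D CNN can exchange information between neighboring frames at no extra FLOPs. This is a simplified version of the idea, not the official TSM code; the 1/8 fold ratio is just a common choice.

```python
# Temporal shift on a (batch, time, channels, height, width) tensor (illustrative).
import torch


def temporal_shift(x, fold_div=8):
    b, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # these channels are shifted forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # these channels are shifted backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # the rest stay untouched
    return out


x = torch.randn(2, 8, 64, 56, 56)  # 8 frames, 64 channels
print(temporal_shift(x).shape)     # same shape: torch.Size([2, 8, 64, 56, 56])
```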
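And the segment-sampling sketch for the frame/clip sampling point: TSN-style sparse sampling splits the video into N segments and picks one frame index per segment (random during training, the segment center at test time). The helper below is my own illustration of that scheme, not code from any of the cited methods.

```python
# TSN-style sparse segment sampling (illustrative helper).
import random


def sample_segment_indices(num_frames, num_segments=8, training=True):
    """Return one frame index per segment for a video with num_frames frames."""
    seg_len = num_frames / num_segments
    indices = []
    for i in range(num_segments):
        start = int(i * seg_len)
        end = max(int((i + 1) * seg_len), start + 1)
        if training:
            indices.append(random.randrange(start, min(end, num_frames)))  # random frame in the segment
        else:
            indices.append(min((start + end) // 2, num_frames - 1))        # center frame of the segment
    return indices


print(sample_segment_indices(300, num_segments=8, training=False))
```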

3. Evaluation metrics and results

  • Models are generally compared on accuracy and inference speed (FPS); a minimal FPS-timing sketch follows the figures below.
  • [Figures: accuracy and speed comparison of representative models]
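FPS numbers like those in the figures are usually measured with a simple timing loop. Below is a minimal sketch of such a benchmark; it is not the paper's protocol, and a 2D ResNet-18 stands in for a real video model.

```python
# Minimal inference-throughput (FPS) measurement sketch.
import time
import torch
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18().to(device).eval()        # stand-in for a video model
clip = torch.randn(1, 3, 224, 224, device=device)  # stand-in for one input clip

with torch.no_grad():
    for _ in range(10):                            # warm-up iterations
        model(clip)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    n = 100
    for _ in range(n):
        model(clip)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.time() - start

print(f"{n / elapsed:.1f} clips per second")
```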

4. Other research directions

  • Data augmentation: some papers report that color jitter and random flipping help, while other augmentations have not been well verified (a small augmentation sketch is given after this list).
  • domain adaptation (a type of transfer learning)
  • Neural architecture search (NAS): something for the big players with plenty of compute to worry about.
  • Efficient model deployment (deployment to real-world scenarios is harder; this likely refers to surveillance-type settings):
    • The main problems:
      • Most models are designed and trained for the offline setting, i.e., they take a whole video as input rather than an online video stream
      • Most models cannot run in real time
      • 3D convolutions and other non-standard ops are hard to deploy
    • Many techniques from 2D image models can be applied to action recognition, such as model compression, quantization, and pruning
    • Better datasets and more appropriate performance metrics may be needed
    • It may be possible to work directly on compressed video, since most videos are stored compressed anyway
  • New datasets:
    • Most existing datasets are biased towards spatial information, i.e., the action class can be judged from a single frame without any motion information
    • YouTube does not allow a single account to download large amounts of data... sigh
  • Video adversarial attack
  • Zero-shot learning
  • Weakly supervised learning
  • Fine-grained classification
  • Egocentric (first-person) action recognition
  • Multimodal
  • Self-supervised learning
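For the data augmentation point above (color jitter and random flip), here is a minimal torchvision sketch applied to a whole clip tensor at once so that every frame receives the same random parameters. The jitter strengths are arbitrary assumptions, not values from the paper.

```python
# Clip-level color jitter + random horizontal flip (illustrative).
import torch
from torchvision import transforms

clip_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandomHorizontalFlip(p=0.5),
])

clip = torch.rand(8, 3, 224, 224)  # a fake clip: 8 RGB frames with values in [0, 1]
augmented = clip_transform(clip)   # the same jitter/flip is applied to all 8 frames
print(augmented.shape)             # torch.Size([8, 3, 224, 224])
```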

Original post: blog.csdn.net/irving512/article/details/111478649