Behavior Recognition-TDN: Temporal Difference Networks for Efficient Action Recognition

0. Preface

  • Relevant information:
    • arXiv
    • GitHub: not open-sourced yet
    • Interpretation of the paper
  • Basic information of the paper
    • Field: Action recognition (behavior recognition)
    • Author affiliation: Nanjing University
    • Posted: 2020.12
  • One sentence summary: A new feature extraction structure is designed using RGB difference.

1. What problem does it solve

  • Explore efficient temporal modeling methods.
  • There are two common approaches to temporal modeling:
    • The two-stream method: an RGB stream extracts appearance information and an optical-flow stream extracts motion information.
      • This effectively improves recognition accuracy, but computing optical flow is expensive.
    • 3D models, or temporal convolutions, which implicitly learn motion features.
      • These do not treat the temporal dimension separately, and they also require a lot of computation.
  • Earlier work has also used RGB difference as an input, as a cheaper substitute for optical flow.
    • But those methods simply treated RGB difference as another input modality and fused the streams only at the output (late fusion).
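
To make "fused only at the output" concrete, here is a minimal sketch of score-level late fusion. It assumes two independently trained streams that each emit per-class logits. The function names, the 0.5 weight, and the example logits are all made up for illustration and do not come from the paper.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(rgb_logits, diff_logits, diff_weight=0.5):
    """Fuse two streams at the score level only (hypothetical helper).

    The streams never exchange features; only their final class
    probabilities are averaged, which is the limitation noted above.
    """
    rgb_probs = softmax(rgb_logits)
    diff_probs = softmax(diff_logits)
    return [(1.0 - diff_weight) * r + diff_weight * d
            for r, d in zip(rgb_probs, diff_probs)]

# Made-up logits for a 3-class problem.
fused = late_fusion([2.0, 1.0, 0.1], [0.5, 2.5, 0.3])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

TDN's point, as I read it, is to move the difference information inside the network instead of combining two separate predictions like this.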

2. What method is used

  • The Temporal Difference Network (TDN) is proposed to extract multi-scale temporal information.

    • Built on the TSN framework with its sparse, holistic sampling strategy, i.e. the 1x1x8 form.
    • The main contribution is the TDM structure, which comes in short-term and long-term variants.
    • The short-term TDM (S-TDM) supplies local motion information to enrich the frame-wise representation.
      • [Equation image not preserved]
      • In that equation, the first term is the final output, the second is the feature map produced by the ordinary 2D CNN, and the third is the S-TDM function, whose input is the sampled frame.
    • The long-term TDM (L-TDM) models the structure across segments, thereby enhancing the representation of each frame.
      • [Equation image not preserved]
      • The last function in that equation is the L-TDM; its input F should be the output of the S-TDM stage above.
      • The current model only considers adjacent pairs, i.e. L-TDM operates only between two adjacent sampled frames (adjacent segments).
    • [Figure not preserved]
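
The two equation images are missing, but the descriptions above suggest update rules of roughly the following form. The symbols are my assumptions, chosen to match those descriptions: I_i is the frame sampled from segment i, F_i is its 2D CNN feature map, H is the S-TDM, G is the L-TDM, and ⊙ is element-wise multiplication. Check them against the paper before relying on them.

```latex
\begin{aligned}
\text{short-term:}\quad & \hat{F}_i = F_i + \mathcal{H}(I_i) \\
\text{long-term:}\quad  & \hat{F}_i = F_i + F_i \odot \mathcal{G}(F_i, F_{i+1})
\end{aligned}
```

Both are residual-style updates: the original feature F_i is kept and the difference-based term is added on top.
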
  • The key to TDN is the introduction of temporal difference based module (TDM)

    • [Figure not preserved]
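
Since TDN builds on TSN's sparse, holistic sampling, here is a rough sketch of how one frame per segment might be picked. The function name and the details (segment center at test time, a random frame during training) are my assumptions about a typical TSN-style sampler, not code from the paper.

```python
import random

def sample_segment_indices(num_frames, num_segments=8, training=False, seed=None):
    """Pick one frame index per equal-length segment (hypothetical helper)."""
    if num_frames < num_segments:
        raise ValueError("video is shorter than the number of segments")
    rng = random.Random(seed)
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))  # exclusive upper bound
        if training:
            # Random frame inside the segment acts as temporal augmentation.
            indices.append(rng.randrange(start, end))
        else:
            # Deterministic choice: the middle frame of the segment.
            indices.append((start + end - 1) // 2)
    return indices

indices = sample_segment_indices(num_frames=64, num_segments=8)
```

The whole video is covered by only 8 frames, which is why the local motion around each sampled frame has to be recovered some other way.
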
  • S-TDM

    • The authors argue:
      • Adjacent frames within a small local temporal window are very similar, so directly stacking them and extracting features is wasteful.
      • On the other hand, sampling a single frame from each segment captures appearance well but misses local motion information.
      • S-TDM therefore uses temporal differences between adjacent frames to enrich the sampled frame with motion information.
    • The overall structure is shown in the figure above. As far as I can tell, 5 frames in total (the sampled frame and its neighbors) are used to compute difference information, which is then stacked.
    • In short, it extracts the local motion and appearance information within a segment.
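
A minimal NumPy sketch of the "5 frames, compute differences, stack them" idea as I understand it. This only covers the input preparation. The convolutional layers that S-TDM applies afterwards are not modeled, and the function name, the boundary clamping, and the random test video are my own assumptions.

```python
import numpy as np

def stacked_local_differences(video, center, window=5):
    """Stack RGB differences of consecutive frames in a window around `center`.

    video: array of shape (T, H, W, 3).
    Returns an array of shape (H, W, 3 * (window - 1)).
    """
    half = window // 2
    num_frames = video.shape[0]
    # Clamp indices so windows near the video boundary repeat the edge frame.
    idx = np.clip(np.arange(center - half, center + half + 1), 0, num_frames - 1)
    frames = video[idx].astype(np.float32)   # cast first to avoid uint8 wrap-around
    diffs = frames[1:] - frames[:-1]         # (window - 1, H, W, 3)
    # Concatenate the per-step differences along the channel axis.
    return np.concatenate(list(diffs), axis=-1)

rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(16, 4, 4, 3), dtype=np.uint8)  # fake video
appearance = video[8].astype(np.float32)     # the sampled frame itself
motion = stacked_local_differences(video, center=8)
```

Here `appearance` would go through the normal 2D CNN path, while `motion` is the cheap stand-in for optical flow that gets fused into it.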
  • L-TDM

    • In short, it extracts motion information across segments.
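
Here is a heavily simplified sketch of the cross-segment idea: the difference between adjacent segment features is turned into a channel attention that re-weights the current feature. The real L-TDM is certainly more elaborate. The pooling, the sigmoid gating, the zero difference for the last segment, and all names below are my assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def long_term_enhance(features):
    """Enhance segment features with adjacent-segment differences.

    features: array of shape (num_segments, C, H, W); returns the same shape.
    """
    # Difference to the next segment; the last segment has no successor,
    # so its difference is left at zero.
    diff = np.zeros_like(features)
    diff[:-1] = features[1:] - features[:-1]
    # Channel-wise attention from the spatially pooled difference: (S, C, 1, 1).
    attention = sigmoid(diff.mean(axis=(2, 3), keepdims=True))
    # Residual form: keep the original feature and add the re-weighted one.
    return features + features * attention

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 16, 7, 7)).astype(np.float32)
enhanced = long_term_enhance(features)
```

Note that with a zero difference the gate is 0.5, so in this toy version the last segment is simply scaled by 1.5.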

3. How effective is it

  • A detailed ablation experiment was done to prove the effectiveness of the proposed structure.

    • Put bluntly, the authors tried many S-TDM and L-TDM implementations and published the best one.
  • Reaches SOTA on Something-Something and comes close to SOTA on Kinetics-400.

    • [Figure not preserved]

4. What are the problems & what can be learned

  • Still waiting for the code to be released, so I don't know how efficient it is in practice.
    • For example, X3D looks great on paper, but I don't know how well it performs when deployed.
  • It looks very tempting.
  • But in principle, it may not help much in online tasks...
    • At least for my fall-detection task, S-TDM did not give very good results.

Origin blog.csdn.net/irving512/article/details/111488943