0. Preface
- Relevant information:
- Basic information of the paper
- Field: Action recognition
- Author affiliation: Nanjing University
- Publication date: 2020.12
- One-sentence summary: A new feature extraction structure is designed around RGB differences.
1. What problem to solve
- Explore efficient temporal modeling methods.
- There are two common temporal modeling approaches:
- Two-stream methods: an RGB stream extracts appearance information, and an optical-flow stream extracts motion information.
- This effectively improves recognition accuracy, but computing the optical flow takes a lot of computing power.
- 3D models, or temporal convolutions, which learn motion features implicitly.
- They do not treat the temporal dimension separately, and they also require a lot of computing power.
- Earlier work also used RGB differences as an input in place of optical flow.
- But those methods simply treated the RGB difference as one more input and only merged the streams at the result side (late fusion).
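To make the "RGB difference instead of optical flow" idea concrete, here is a minimal NumPy sketch. The function name `rgb_difference` and the toy clip are my own illustration, not code from the paper; the point is only that a frame-to-frame subtraction is far cheaper than optical flow and still highlights where things moved.

```python
import numpy as np

def rgb_difference(frames: np.ndarray) -> np.ndarray:
    """Return frame-to-frame differences for a clip shaped (T, H, W, 3)."""
    # Cast to a signed type first: subtracting uint8 arrays would wrap around.
    frames = frames.astype(np.int16)
    return frames[1:] - frames[:-1]

# Toy clip (hypothetical data): a bright 2x2 square moving one pixel right per frame.
clip = np.zeros((3, 4, 6, 3), dtype=np.uint8)
for t in range(3):
    clip[t, 1:3, t:t + 2, :] = 255

diff = rgb_difference(clip)
print(diff.shape)  # one difference map per pair of adjacent frames
```

Pixels the square leaves become negative, pixels it enters become positive, and static pixels stay at zero, so the difference map acts as a rough motion cue.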
2. What method was used
- The Temporal Difference Network (TDN) is proposed to extract multi-scale temporal information.
- It follows the TSN framework with its sparse and holistic sampling strategy, that is, the 1x1x8 form.
- The main contribution is the TDM structure, which comes in a short-term and a long-term variant.
- The role of the short-term TDM (S-TDM) is to enrich the frame-wise representation.
- In the S-TDM formula, the first term is the final result, the second term is the feature map produced by an ordinary 2D CNN, and the third term is a function representing the S-TDM structure, whose input is an image.
- The role of the long-term TDM (L-TDM) is to model the structure between segments, thereby enhancing the expressiveness of each frame.
- In the L-TDM formula, the last function is the L-TDM structure, and F should be the result of the S-TDM above.
- The current model only considers the relationship between two adjacent sampled frames (segments), that is, L-TDM is only applied between adjacent ones.
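The two formulas described term by term above are not reproduced in these notes. Going only by those descriptions, they plausibly have the following shape. The symbols are my assumption: $I_i$ is the sampled frame of segment $i$, $F_i$ its 2D CNN feature map, $\mathcal{H}$ the S-TDM, $\mathcal{G}$ the L-TDM, and $\odot$ an element-wise product. Check the paper for the exact form.

```latex
% S-TDM: frame feature plus a short-term motion term computed around frame I_i
\hat{F}_i = F_i + \mathcal{H}(I_i)

% L-TDM: feature of segment i re-weighted using its neighbour, segment i+1
\hat{F}_i = F_i + F_i \odot \mathcal{G}(F_i, F_{i+1})
```

Both are residual: the original feature $F_i$ is kept, and the temporal difference module only adds to it.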
- The key to TDN is the introduction of the temporal difference module (TDM).
- S-TDM
- The authors argue:
- Adjacent frames within a small local temporal window are very similar, so directly stacking them and extracting features is wasteful.
- On the other hand, extracting features from a single frame per segment captures appearance information well, but it cannot capture local motion information.
- Therefore, S-TDM uses temporal differences between adjacent frames to enrich the features.
- The overall structure is shown in the corresponding figure of the paper. As far as I can tell, a window of 5 frames around the sampled frame is used: the differences between these frames are computed and stacked.
- In short, it extracts the local motion and appearance information within a segment.
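As a rough illustration of the S-TDM input described above, the sketch below picks one frame per segment and stacks the differences inside a 5-frame window around it. The helper names (`segment_centers`, `stacked_local_differences`), the choice of the segment's middle frame, and the clamping at the video borders are my assumptions. The real module also passes the stacked differences through a small CNN and fuses the result with the frame features, which is not shown here.

```python
import numpy as np

def segment_centers(num_frames: int, num_segments: int) -> list:
    """Pick the middle frame of each equal-length segment (TSN-style sparse sampling)."""
    seg_len = num_frames / num_segments
    return [int(seg_len * (i + 0.5)) for i in range(num_segments)]

def stacked_local_differences(video: np.ndarray, center: int, radius: int = 2) -> np.ndarray:
    """Stack differences of consecutive frames in a window around `center`.

    video has shape (T, H, W, C). With radius=2 the window holds 5 frames,
    giving 4 difference maps that are concatenated along the channel axis.
    """
    num_frames = video.shape[0]
    # Clamp indices so windows near the start or end of the video stay valid.
    idx = np.clip(np.arange(center - radius, center + radius + 1), 0, num_frames - 1)
    window = video[idx].astype(np.float32)
    diffs = window[1:] - window[:-1]              # (2 * radius, H, W, C)
    return np.concatenate(list(diffs), axis=-1)   # (H, W, 2 * radius * C)

# Hypothetical 64-frame video of 2x2 RGB frames, sampled into 8 segments.
video = np.arange(64 * 2 * 2 * 3).reshape(64, 2, 2, 3)
centers = segment_centers(64, 8)
motion_input = stacked_local_differences(video, centers[0])
```

Stacking differences rather than raw frames avoids feeding the network five nearly identical images, which is the inefficiency the authors point out.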
- L-TDM
- In short, it extracts information across segments.
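The notes do not spell out how L-TDM works internally, so the following is only a simplified sketch of the general idea: use the difference between adjacent segment features to re-weight each segment's own feature. The sigmoid channel attention and the name `cross_segment_enhance` are my assumptions and are much simpler than the module in the paper.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def cross_segment_enhance(features: np.ndarray) -> np.ndarray:
    """Enhance each segment feature using its difference to the next segment.

    features has shape (S, C, H, W), one feature map per segment.
    The last segment has no successor and is returned unchanged.
    """
    enhanced = features.copy()
    diff = features[1:] - features[:-1]                     # (S-1, C, H, W)
    # Assumed attention form: one weight per channel from the averaged difference.
    attn = sigmoid(diff.mean(axis=(2, 3), keepdims=True))   # (S-1, C, 1, 1)
    enhanced[:-1] = features[:-1] + features[:-1] * attn    # residual re-weighting
    return enhanced

# Hypothetical features: 3 segments, 2 channels, 4x4 maps, all ones.
feats = np.ones((3, 2, 4, 4), dtype=np.float32)
out = cross_segment_enhance(feats)
```

Only adjacent segments interact here, matching the observation that the current model does not look beyond neighbouring segments.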
3. How effective is it
- Detailed ablation experiments were done to demonstrate the effectiveness of the proposed structure.
- To put it bluntly, the authors tried many S-TDM and L-TDM implementations and published the best one.
- Reaches SOTA on Something-Something and nearly SOTA on Kinetics-400.
4. What are the problems & what can be learned
- Waiting for the code to be open-sourced; I don't know how efficient it is to run.
- For example, X3D looks great on paper, but I don't know how well it performs when deployed.
- It looks very tempting.
- But judging from the principle, it may not help much in online tasks...
- At least for my fall detection task, S-TDM did not give very good results.