Paper reading: Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos

Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos

Abstract and contributions

This paper proposes an end-to-end deep convolutional network, T-CNN, for action detection in videos.

The pipeline is: a video is first divided into fixed-length clips (8 frames each) → each clip is fed to the Tube Proposal Network (TPN), which generates a set of tube proposals → tube proposals from adjacent clips are linked according to their actionness scores and the overlap between neighboring proposals, forming complete tube proposals for spatio-temporal action localization in the video → finally, Tube-of-Interest (ToI) pooling is applied to the linked action tube proposals to produce a fixed-length feature vector for action label prediction.


Contributions are as follows:

  1. An end-to-end deep learning method for action detection in videos is proposed. It operates directly on raw video, uses a single 3D network to capture spatio-temporal information, and performs action localization and recognition based on 3D convolution features. To the best of the authors' knowledge, this is the first work to leverage a 3D ConvNet for action detection.
  2. We introduce a Tube Proposal Network (TPN) that utilizes skip pooling in the temporal domain to preserve temporal information for action localization in a three-dimensional volume.
  3. We propose a new pooling layer, Tube-of-Interest (ToI) pooling, in T-CNN. ToI pooling is the 3D counterpart of the R-CNN Region-of-Interest (RoI) pooling layer. It effectively handles tube proposals with varying spatial and temporal sizes. We demonstrate that ToI pooling can significantly improve recognition results.

Related work

  1. CNN and 3D CNN methods for action detection

  2. Action detection methods

  3. Object detection pipeline

This paper generalizes R-CNN from 2D image regions to 3D video volumes for action detection.

Generalizing R-CNN from 2D to 3D

Unlike images, which can be cropped to a fixed size, videos vary greatly in the temporal dimension, so the input video is divided into fixed-length (8-frame) clips that can be processed by a fixed-size ConvNet architecture. Clip-based processing also reduces GPU memory cost.

One advantage of 3D CNNs over 2D CNNs is that they capture motion information by applying convolution in both space and time. Our method uses 3D convolution and 3D max pooling not only in the spatial dimensions but also in the temporal dimension, which shrinks the video clip while concentrating its discriminative information. Temporal pooling is important for recognition because it better models the spatio-temporal information of the video and suppresses some background noise. However, the temporal order is lost: if the frames of a clip are arbitrarily reordered, the 3D max-pooled features stay the same. This is problematic for action detection, which relies on the feature cube to recover bounding boxes for the original frames. Temporal information matters, so the frame order cannot be changed arbitrarily.

Since a video is processed clip by clip, tube proposals with different spatial and temporal sizes are generated for different clips. These clip-level proposals need to be linked into a tube proposal sequence, which is then used for action label prediction and localization. To produce a fixed-length feature vector, we propose a new pooling layer, Tube-of-Interest (ToI) pooling. The ToI pooling layer is a 3D generalization of the R-CNN Region-of-Interest (RoI) pooling layer. A classic max pooling layer specifies kernel size, stride, and padding, which determine the output shape; for RoI pooling, the output shape is fixed first and the kernel size and stride are derived from it. Whereas RoI pooling takes 2D feature maps and 2D regions as input, ToI pooling operates on feature cubes and 3D tubes. Let the size of a feature cube be d × h × w, where d, h, and w are its depth, height, and width. A ToI in the feature cube is defined by a d-by-4 matrix, i.e., d boxes distributed over all frames. Each box is a four-tuple (x1, y1, x2, y2) specifying the upper-left and lower-right corners in the i-th feature map. Since the d bounding boxes can have different sizes, aspect ratios, and positions, spatio-temporal pooling is performed in two steps: spatial pooling followed by temporal pooling. First, each h × w feature map is divided into H × W bins, where each bin covers roughly (h/H) × (w/W) cells, and max pooling selects the maximum value in each bin. Second, the d spatially pooled feature maps are divided temporally into D bins: groups of roughly d/D adjacent feature maps are max-pooled together. The ToI pooling layer therefore always produces a fixed output of size D × H × W.
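To make the two-step procedure concrete, here is a minimal PyTorch-style sketch of ToI pooling as described above. It is not the authors' implementation; the function name `toi_pool`, the unbatched tensor layout, and the use of adaptive max pooling are my own assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def toi_pool(feature_cube, boxes, D, H, W):
    """Tube-of-Interest pooling sketch (not the paper's code).

    feature_cube: tensor of shape (C, d, h, w) -- a 3D ConvNet feature cube.
    boxes: tensor of shape (d, 4) -- one (x1, y1, x2, y2) box per frame.
    Returns a tensor of fixed shape (C, D, H, W).
    """
    C, d, h, w = feature_cube.shape
    # Step 1: spatial pooling -- crop each frame's box and max-pool it to H x W.
    per_frame = []
    for i in range(d):
        x1, y1, x2, y2 = [int(v) for v in boxes[i]]
        crop = feature_cube[:, i, y1:y2 + 1, x1:x2 + 1]        # (C, bh, bw)
        per_frame.append(F.adaptive_max_pool2d(crop, (H, W)))  # (C, H, W)
    pooled = torch.stack(per_frame, dim=1)                     # (C, d, H, W)
    # Step 2: temporal pooling -- group roughly d/D adjacent maps and max over each group.
    return F.adaptive_max_pool3d(pooled, (D, H, W))            # (C, D, H, W)
```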


Framework

The core component is the TPN, which generates tube proposals for each clip.

Tube Proposal Network (TPN)

  • Goal: take an 8-frame clip as input and output 8 consecutive bounding boxes (one per frame) forming a tube proposal.


For each 8-frame video clip, 3D convolution and 3D pooling are used to extract a spatio-temporal feature cube. The 3D ConvNet consists of seven 3D convolution layers and four 3D max pooling layers. We use d × h × w to denote the kernel shape of a 3D convolution/pooling layer, where d, h, and w are depth, height, and width respectively. In all convolutional layers the kernel size is 3 × 3 × 3 with padding and stride 1. The number of filters is 64, 128 and 256 for the first three convolutional layers and 512 for the remaining ones. The kernel size of the first 3D max pooling layer is 1 × 2 × 2, and the kernel size of the remaining 3D max pooling layers is 2 × 2 × 2. The details of the network architecture are given in Table 1 of the paper. A pre-trained C3D model is used for initialization and fine-tuned on each dataset in the experiments.

After conv5, the temporal size is reduced to 1 frame (i.e., a feature cube with depth D = 1), and bounding box proposals are generated on the conv5 feature cube.
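A minimal PyTorch sketch of the kind of backbone described above: seven 3 × 3 × 3 conv layers with the stated filter counts and four max pooling layers (the first 1 × 2 × 2, the rest 2 × 2 × 2), so an 8-frame clip collapses to temporal depth 1 by conv5. The exact interleaving of conv and pooling layers follows Table 1 of the paper, which is not reproduced here, so the ordering below is an assumption.

```python
import torch
import torch.nn as nn

def conv3d(cin, cout):
    # 3x3x3 convolution with stride 1 and padding 1, as described in the text.
    return nn.Sequential(nn.Conv3d(cin, cout, kernel_size=3, stride=1, padding=1),
                         nn.ReLU(inplace=True))

# Hypothetical arrangement of the 7 conv / 4 pool layers.
backbone = nn.Sequential(
    conv3d(3, 64),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),   # first pooling keeps the temporal size
    conv3d(64, 128),
    nn.MaxPool3d(kernel_size=(2, 2, 2)),   # depth 8 -> 4
    conv3d(128, 256),
    conv3d(256, 512),
    nn.MaxPool3d(kernel_size=(2, 2, 2)),   # depth 4 -> 2
    conv3d(512, 512),
    conv3d(512, 512),
    nn.MaxPool3d(kernel_size=(2, 2, 2)),   # depth 2 -> 1 at conv5
    conv3d(512, 512),
)

x = torch.randn(1, 3, 8, 112, 112)          # an 8-frame RGB clip
print(backbone(x).shape)                    # temporal dimension collapsed to 1
```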


Anchor bounding box selection

In Faster R-CNN, the anchors are defined manually; for example, 9 anchors covering 3 scales and 3 aspect ratios. In this paper, instead of hand-picked anchor boxes, k-means clustering is applied to the bounding boxes of the training set to learn 12 anchor boxes (i.e., cluster centroids). This data-driven choice of anchor boxes can adapt to different datasets.
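The data-driven anchor selection can be sketched as follows: cluster the (width, height) pairs of the training-set ground-truth boxes with k-means and use the 12 centroids as anchors. The IoU-based distance used here is an assumption borrowed from common practice (as in YOLOv2); the paper's exact clustering setup may differ.

```python
import numpy as np

def kmeans_anchors(wh, k=12, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs into k anchor shapes with k-means.

    wh: float array of shape (N, 2) with widths and heights of training boxes.
    Distance is 1 - IoU between shapes aligned at a common corner (assumed choice).
    """
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        inter = np.minimum(wh[:, None, 0], centroids[None, :, 0]) * \
                np.minimum(wh[:, None, 1], centroids[None, :, 1])
        union = wh[:, 0:1] * wh[:, 1:2] + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
        assign = np.argmax(inter / union, axis=1)           # nearest = highest IoU
        new = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```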

Each bounding box is associated with an "actionness" score, which measures the probability that the content in the box corresponds to a valid action. Each bounding box is assigned a binary class label (action or not), and boxes whose actionness score falls below a threshold are discarded. During training, a bounding box is treated as a positive proposal if its IoU overlap with any ground-truth box is greater than 0.7, or if it has the highest IoU with a ground-truth box (the latter rule covers the case where no box satisfies the first condition).
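A small sketch of this positive-labeling rule (my own illustration, not the paper's code): a proposal is positive if its IoU with some ground-truth box exceeds 0.7, or if it is the best-overlapping proposal for that ground-truth box.

```python
import numpy as np

def label_positive_proposals(iou, pos_thresh=0.7):
    """iou: (num_proposals, num_gt) matrix of IoU overlaps with ground truth.
    Returns a boolean mask over proposals marking the positive ones."""
    positive = iou.max(axis=1) > pos_thresh
    # Fallback rule: for each GT box, the proposal with the highest IoU is also
    # positive, even when no proposal passes the 0.7 threshold.
    positive[iou.argmax(axis=0)] = True
    return positive
```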

Temporal skip pooling

  • Problem: the 3D CNN loses the temporal order of the frames through temporal pooling; temporal skip pooling is introduced to retain this order information.

The bounding boxes generated from the conv5 feature cube can be used for frame-level action detection via bounding box regression. However, because temporal max pooling collapses the temporal dimension (8 frames to 1), the temporal order of the original 8 frames is lost. We therefore use temporal skip pooling to reintroduce temporal order for frame-level detection.

  • Implementation:
    • When the 8 frames reach conv5, the temporal dimension has been reduced to 1, and bbox proposals are obtained with ordinary detection methods.
    • When extracting features for these proposals, features are also taken from conv2. Since conv2's pooling does not operate on the temporal dimension, conv2 can be considered to still retain the order information.
    • The conv5 proposals plus the conv2 features are combined, and through an operation similar to RoI pooling, fixed-length features are extracted for the subsequent steps.
    • The input to the subsequent bbox regression is extracted from the proposals together with the conv2 and conv5 features.
    • The bbox regression is thus applied consistently to each of the 8 frames.

For example, taking 5 bounding boxes in the conv5 feature cube, the 5 scaled bounding boxes are mapped to the corresponding positions in each conv2 feature slice. This yields 5 tube proposals, as shown in Figure 3 of the paper.
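A rough sketch of how the skip connection could combine conv2 and conv5 features for one proposal. This is my own illustration under assumed tensor shapes, not the paper's code; `scale` is the assumed spatial ratio between conv2 and conv5 feature maps, and the final concatenation is a simplification.

```python
import torch
import torch.nn.functional as F

def skip_pool_features(conv2_cube, conv5_map, box5, scale, H=4, W=4):
    """Temporal skip pooling sketch for one proposal (illustration only).

    conv2_cube: (C2, 8, h2, w2) conv2 feature cube (temporal size still 8).
    conv5_map:  (C5, h5, w5) conv5 feature map (temporal size collapsed to 1).
    box5: (x1, y1, x2, y2) proposal in conv5 coordinates.
    scale: assumed spatial ratio between conv2 and conv5 resolutions.
    """
    x1, y1, x2, y2 = [int(v) for v in box5]
    # Pool the proposal on the conv5 map (appearance feature, no frame order).
    f5 = F.adaptive_max_pool2d(conv5_map[:, y1:y2 + 1, x1:x2 + 1], (H, W))
    # Map the scaled box onto each of the 8 conv2 frames to keep temporal order.
    sx1, sy1, sx2, sy2 = [int(v * scale) for v in (x1, y1, x2, y2)]
    per_frame = [F.adaptive_max_pool2d(conv2_cube[:, t, sy1:sy2 + 1, sx1:sx2 + 1], (H, W))
                 for t in range(conv2_cube.shape[1])]
    f2 = torch.stack(per_frame, dim=1)          # (C2, 8, H, W), one slice per frame
    # Combine both so that frame-level bbox regression sees appearance and order.
    return torch.cat([f5.flatten(), f2.flatten()])
```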


Linking Tube Proposals

  • Goal: Connect tubes from different clips.

  • There are two criteria for linking: actionness (the action score of the tube in each clip; the higher the score, the more likely the content is an action) and overlap (the IoU between tubes of adjacent clips, computed between the last frame of the preceding tube and the first frame of the following tube).

  • A score combining these two terms is computed between the tube proposals of consecutive clips, and the proposals are linked according to this score (a reconstruction of the scoring formula is given below).

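As best I can reconstruct the scoring formula (the exact normalization should be checked against the paper), the linking score of a path of m tube proposals, one per clip, averages the actionness terms and the pairwise overlaps:

$$ S = \frac{1}{m}\sum_{i=1}^{m}\mathrm{Actionness}_i + \frac{1}{m-1}\sum_{j=1}^{m-1}\mathrm{Overlap}(j,\,j+1) $$

The path with the highest score S is kept as the final action tube for the video.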

Action Detection

  • Goal: take the linked tubes obtained in the previous step as input and classify the action of each tube.

  • Since the linked tubes have different lengths, the ToI pooling proposed in the paper is used to extract fixed-length features from all of them.

Results and discussion


  • The method cannot be used in an online (streaming) setting.
  • ToI pooling can be combined with a two-stream CNN.

Understanding Region Proposal Network (RPN) vs Tubelet Proposal Network (TPN)

First, let's review the mainstream object detection pipeline, which usually includes the following steps:

  • Generate a set of candidate boxes; these candidate boxes are called proposals.

  • Determine whether the content inside each candidate box is foreground or background, i.e., whether it contains an object to detect.

  • Use regression to fine-tune the candidate boxes so that they frame objects more accurately; this step is called bounding box regression.


  • Region proposal: a candidate box, a selected region of the image.

  • Anchor box: a reference box obtained by manual design or by clustering (the difference from a proposal is that anchor boxes are defined relative to a specific location).

  • Bounding box (bbox): the result of regressing the anchor boxes; it is a refined candidate box, closer to the answer, and to some extent it is also a proposal (candidate box).

RPN mainly includes the following steps:

  • Generate Anchor boxes.

  • Determine whether anchor boxes contain foreground or background.

  • Regression learns the position difference between anchor boxes and ground truth to accurately locate objects.


    Assume that each position generates k anchor boxes; each anchor box is fed into two branches, the cls layer and the reg layer. The training data for the RPN is generated by comparing anchor boxes with ground-truth (GT) boxes: anchor boxes are sampled in the image, the IoU between each anchor box and the GT boxes is computed to decide whether the box is foreground or background, and for each foreground box the coordinate offsets between it and the GT box are also computed (see the sketch below).
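For reference, here is a small sketch of the standard Faster R-CNN offset parameterization mentioned above (my own illustration): the regression targets between an anchor and its matched GT box, expressed in center/size form.

```python
import numpy as np

def regression_targets(anchor, gt):
    """anchor, gt: boxes as (x1, y1, x2, y2). Returns (tx, ty, tw, th)."""
    def to_cxcywh(b):
        w, h = b[2] - b[0], b[3] - b[1]
        return b[0] + w / 2, b[1] + h / 2, w, h
    ax, ay, aw, ah = to_cxcywh(anchor)
    gx, gy, gw, gh = to_cxcywh(gt)
    # Offsets the reg layer learns to predict for foreground anchors.
    return ((gx - ax) / aw, (gy - ay) / ah, np.log(gw / aw), np.log(gh / ah))
```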


Tubelet Proposal Network (TPN)

Similar to the bounding box proposals in static-image object detection, the analogous unit in video object detection is called a tube: a sequence of candidate bounding boxes over time, which the video detector uses to exploit temporal information. In the Tubelet Proposal Network (TPN), the feature map produced by the base 3D network in the previous step is taken as input, and tube anchors are designed either by hand or by clustering. Each tube anchor has two outputs: CLS, which indicates whether a foreground tube at this spatial location has high overlap with a proposal tube, and REG, which outputs a 4T-dimensional vector encoding the displacements of the tube bounding box relative to the coordinates of each box in the tube anchor.

The correspondence is:

  • region proposal = tube proposal
  • feature map = feature cube (a d × h × w volume, since the temporal dimension is added; a tube in it is described by d boxes)
  • anchor = tube anchor, designed by hand or by clustering (such as the 12 anchors learned by k-means clustering in this paper)
  • bounding box = tube bounding box, scored for the presence of an action against the tube anchors using IoU; each box is given by (x1, y1, x2, y2), the upper-left and lower-right corners, on the feature cube obtained from conv5

Related

Paper View (38) Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos - Likecs.com (likecs.com)

Detailed explanation of Region Proposal Network - Zhihu (zhihu.com)

[Target Detection] Concept understanding: region proposal, bounding box, anchor box, ground truth, IoU, NMS, RoI Pooling_kentocho's blog-CSDN blog_ground truth iou


Origin blog.csdn.net/qq_42740834/article/details/128948892