[Spatial-Temporal Action Localization (2)] Paper Reading 2017

1. ActionVLAD: Learning spatio-temporal aggregation for action classification code

Summary and conclusion

  • Action classification by aggregating local convolutional features across the entire spatiotemporal extent of the video
  • Combines two-stream networks with learnable spatio-temporal feature aggregation, trained end-to-end
  • Studies how to pool across space and time and how to combine signals from the different streams
  • Findings: (i) it is important to pool jointly across space and time, but (ii) the appearance and motion streams are best aggregated into their own separate representations

Introduction: Targeting pain points and contributions

Pain points:

  • What are the appropriate spatiotemporal representations for video modeling? Two families are dissected: 3D spatiotemporal convolution, which can potentially learn complex spatiotemporal dependencies but has so far been difficult to scale in terms of recognition performance; and the two-stream architecture, which decomposes the video into a motion stream and an appearance stream, trains a separate CNN for each, and fuses their outputs at the end. While both approaches have made rapid progress, two-stream architectures generally outperform spatiotemporal convolution because they can easily exploit new ultra-deep architectures and models pretrained for static image classification. However, two-stream architectures largely ignore the long-term temporal structure of the video: they essentially learn a classifier that operates on a single frame or a short block of a few (up to 10) frames, and then force the per-block classification scores to reach a consensus.
  • Classifying frames independently and then fusing the results cannot accurately model actions: can temporal averaging really capture the complex spatiotemporal structure of human behavior? The problem is exacerbated when the same sub-action is shared by multiple action classes. For example, consider the compound action "basketball shot" shown in Figure 1. Given only a few consecutive video frames, it can easily be confused with other actions such as "running", "dribbling", "jumping", and "throwing".

Contributions:

  • A trainable ActionVLAD CNN layer, which may also help related tasks such as (spatio-)temporal localization of human actions in long videos
  • (1) We develop powerful video-level representations by integrating trainable spatio-temporal aggregation with a state-of-the-art two-stream network. (2) We study different strategies for pooling and combining signals from different streams across space and time, providing insights and experimental evidence for different design choices.
  • Outperforms the baselines.

Related work

Trainable Spatio-Temporal Aggregation: This is achieved by dividing the descriptor space R^D into K cells using a vocabulary of K "action words" represented by anchor points {c_k} (Figure 3(c)).
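In ActionVLAD, each local descriptor is softly assigned to these cells and the residuals to the anchor points are summed over all spatial locations and frames. Below is a minimal NumPy sketch of this kind of soft-assignment aggregation; the function name, the value of alpha, and the normalization details are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def actionvlad_pool(features, anchors, alpha=1000.0):
    """Sketch of soft-assignment VLAD-style aggregation over a whole video.

    features: (N, D) local conv descriptors gathered over all spatial positions
              and all sampled frames of the video.
    anchors:  (K, D) "action word" anchor points c_k.
    Returns a (K * D,) video-level descriptor.
    """
    # Squared distances between every descriptor and every anchor: (N, K)
    d2 = ((features[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    # Soft assignment of each descriptor to the K cells (softmax over anchors)
    logits = -alpha * d2
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)               # (N, K)
    # Aggregate residuals (x - c_k) weighted by the soft assignment
    residuals = features[:, None, :] - anchors[None, :, :]    # (N, K, D)
    vlad = (assign[:, :, None] * residuals).sum(axis=0)       # (K, D)
    # Intra-normalize each anchor's residual sum, then L2-normalize globally
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + 1e-12)

# Example: 5000 conv descriptors of dimension 512 aggregated with 64 action words
video_descriptor = actionvlad_pool(np.random.randn(5000, 512), np.random.randn(64, 512))
```

In the actual model the anchors are parameters learned end-to-end with the classification loss, as described in the framework below.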

Model framework

Features are extracted from sampled appearance and motion frames of the video using a standard CNN architecture (VGG-16). These features are then pooled across space and time using an ActionVLAD pooling layer, which is trainable end-to-end with a classification loss.

Which layer to aggregate?
Evaluation of ActionVLAD at different positions in a VGG-16 network:
We use a two-stream network (pretrained at the frame level) as the feature generator that feeds frame-level features from different frames into the trainable ActionVLAD pooling layer. But which layer's activations should we pool? The best performance is obtained by pooling features at the highest convolutional layer (conv5_3 of VGG-16).

How to combine Flow and RGB streams?
Single ActionVLAD layer over concatenated appearance and motion features (Concat Fusion).
Single ActionVLAD layer over all appearance and motion features (Early Fusion).
Separate ActionVLAD layers for the appearance and motion streams, combined afterwards (Late Fusion); this performs best, consistent with the finding that the two streams are best aggregated into separate representations. The three wirings are sketched below.
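A rough NumPy sketch of how the three options differ in wiring; here `pool` is only a stand-in for an ActionVLAD layer (e.g., the function sketched earlier), and the feature shapes are made up for illustration.

```python
import numpy as np

def pool(local_features):
    """Stand-in for an ActionVLAD pooling layer (placeholder aggregation)."""
    return local_features.mean(axis=0)

rgb_feats = np.random.randn(100, 512)    # appearance-stream local features
flow_feats = np.random.randn(100, 512)   # motion-stream local features

# Concat Fusion: concatenate appearance and motion features per location,
# then run a single pooling layer over the concatenated descriptors.
concat_fusion = pool(np.concatenate([rgb_feats, flow_feats], axis=1))

# Early Fusion: pool the union of appearance and motion descriptors
# with one shared pooling layer (a single codebook for both streams).
early_fusion = pool(np.concatenate([rgb_feats, flow_feats], axis=0))

# Late Fusion: aggregate each stream with its own pooling layer and
# combine the two video-level representations afterwards.
late_fusion = np.concatenate([pool(rgb_feats), pool(flow_feats)])
```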

Thinking about shortcomings

  • The starting point is good: a single action may consist of several different sub-actions, so the method looks for a way to integrate the features of multiple sub-actions into a single whole-video representation. However, re-integrating judgments over multiple sub-actions also risks lowering the final accuracy.

2. Action Tubelet Detector for Spatio-Temporal Action Localization code

Terminology

Anchor cuboids: analogous to anchor boxes in 2D object detection, both are dense sampling schemes used to generate candidate boxes. The difference is that an anchor cuboid is a three-dimensional candidate box that extends along the time dimension, so it can handle moving objects in videos and exploit temporal information, whereas anchor boxes apply only to 2D images.
Tubelets: in this paper, a tubelet is a spatiotemporal detection unit composed of a sequence of bounding boxes over several consecutive frames, with one confidence score per action class indicating whether that action is present in the unit. It can be understood as a way to detect and localize an action in both space and time.
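As a concrete picture of the definition above, a tubelet can be held in a small data structure like the following (a hypothetical sketch, not the paper's code):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Tubelet:
    """A sequence of per-frame bounding boxes with one score per action class."""
    boxes: np.ndarray    # (K, 4) one (x1, y1, x2, y2) box per frame in the sequence
    scores: np.ndarray   # (C + 1,) one confidence per action class plus background

    @property
    def num_frames(self) -> int:
        return self.boxes.shape[0]

# Example: a tubelet over K = 6 frames for a detector with C = 24 action classes
tubelet = Tubelet(boxes=np.zeros((6, 4)), scores=np.full(25, 1 / 25))
```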

Summary and conclusion

  • Proposes the Action Tubelet detector (ACT-detector).
  • Per-frame features are stacked over time to form a temporal sequence of frame-level information.
  • It is built on SSD and introduces anchor cuboids, which are scored and regressed over sequences of frames.

Introduction: Targeting pain points and contributions

Pain points:

  • Action categories cannot be accurately identified from a single frame alone.

Contributions:

  • An Action Tubelet detector (ACT-detector) is proposed. It takes multiple consecutive video frames as input, outputs anchor cuboids consisting of one box per frame for a predicted action, and then regresses each box to obtain tubelets for the predicted action. Because the ACT-detector exploits the continuity across multiple video frames, it reduces the ambiguity of action prediction while improving localization accuracy.

Related work

  • Object detection: Faster R-CNN (RoIs and anchor boxes) –> YOLO and SSD. This paper extends them to anchor cuboids, thereby significantly improving action localization.
  • Action localization: (1) extending sliding windows to video, which requires assumptions such as a cuboid shape, i.e., the spatial extent of the actor or action stays fixed over time; (2) applying candidate boxes (proposals) to video, i.e., generating several proposals for the video and classifying them.

Model framework

Input: a sequence of K consecutive frames.
After the SSD backbone, there are K sets of features from different layers; the K per-frame features of each layer are stacked together (features from different layers occupy different rows in the figure).
The stacked features go through one convolution that outputs class scores (C + 1) and another that outputs regression coordinates (4 × K). Different layers output their own scores and coordinates, and different layers correspond to different anchors, so different image positions and different layers have different anchors. There can also be several initialized anchors at the same position, but all K frames share the same anchor, which is equivalent to extending the SSD anchors along the time dimension.
Each anchor ultimately corresponds to an anchor cuboid, which after NMS either yields an output tubelet or is discarded.
An initialized anchor cuboid has a fixed position across the K frames and is larger than the actor region; after regression the box position in each frame differs, and these per-frame boxes compose the tubelet.
Output: tubelets, i.e., a sequence of bounding boxes with one confidence score per action class.
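A minimal PyTorch sketch of such a prediction head on a single SSD feature map, assuming the K per-frame features are stacked along the channel axis; the class name, layer sizes, and shapes are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class ACTHead(nn.Module):
    """Per-feature-map head: stacked K-frame features -> class scores + 4*K offsets."""

    def __init__(self, in_channels: int, K: int, num_classes: int, num_anchors: int):
        super().__init__()
        stacked = in_channels * K
        # One conv predicts (C + 1) scores per anchor cuboid at every position...
        self.cls_conv = nn.Conv2d(stacked, num_anchors * (num_classes + 1), 3, padding=1)
        # ...and another predicts 4 coordinates per frame (4 * K) per anchor cuboid.
        self.reg_conv = nn.Conv2d(stacked, num_anchors * 4 * K, 3, padding=1)

    def forward(self, per_frame_feats):            # list of K tensors (B, C_in, H, W)
        x = torch.cat(per_frame_feats, dim=1)      # stack K frames: (B, C_in * K, H, W)
        scores = self.cls_conv(x)                  # (B, A * (C + 1), H, W)
        offsets = self.reg_conv(x)                 # (B, A * 4 * K, H, W)
        return scores, offsets

# Example: K = 6 frames, 512-channel conv features, 24 action classes, 4 anchors per position
head = ACTHead(in_channels=512, K=6, num_classes=24, num_anchors=4)
scores, offsets = head([torch.randn(1, 512, 38, 38) for _ in range(6)])
```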

Method details

Given a sequence of K frames, the ACT-detector computes convolutional features for each frame; the convolutional weights are shared across all input frames.

Specifically, when processing this frame sequence, the ACT-detector extracts the features of each frame with convolution operations, using the same convolution weights for every frame, i.e., the kernel parameters are shared across the whole sequence. This weight sharing helps the model capture the correlation of actions or objects across frames, because features are extracted consistently over all frames rather than each frame being handled by an independent network, which in turn improves the accuracy of action localization and detection.

The convolutional features of each of the K frames are stacked. The stacked features are the input to two convolutional layers: one for the action class scores and another for regressing the anchor cuboids.
The classification layer outputs C + 1 scores for each anchor cuboid: one score per action class plus a background score. This means that tubelet classification is based on the whole sequence of frames. The regression layer outputs 4 × K coordinates for each anchor cuboid (4 for each of the K frames). Note that although all boxes of a cuboid are regressed jointly, a different regression is produced for each frame.

The receptive field of the neurons used to score and regress an anchor cuboid is larger than the cuboid's spatial extent. This allows predictions to also exploit the context around the cuboid, i.e., to account for actors that may move outside it.
Training loss: following SSD, the objective is a multi-task loss

L = (1/N) · (L_conf + α · L_loc)

where N is the number of anchor cuboids matched to the ground truth, the confidence loss L_conf is defined with a softmax loss over the C + 1 class scores, and L_loc is the regression loss over the 4 × K coordinates of the matched cuboids.
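A minimal PyTorch sketch of such an SSD-style objective, assuming the anchor cuboids have already been matched to ground truth; SSD's hard-negative mining and the exact value of the weight alpha are omitted/assumed here.

```python
import torch
import torch.nn.functional as F

def act_loss(cls_logits, box_preds, cls_targets, box_targets, pos_mask, alpha=1.0):
    """Multi-task loss sketch: softmax confidence loss + smooth L1 regression loss.

    cls_logits:  (M, C + 1) class scores for M anchor cuboids
    box_preds:   (M, 4 * K) regressed coordinates over the K frames
    cls_targets: (M,) target class indices (0 = background)
    box_targets: (M, 4 * K) regression targets
    pos_mask:    (M,) bool, True for anchor cuboids matched to a ground truth
    """
    N = pos_mask.sum().clamp(min=1).float()       # number of matched anchor cuboids
    # Confidence loss: softmax / cross-entropy over action classes + background
    conf_loss = F.cross_entropy(cls_logits, cls_targets, reduction="sum")
    # Localization loss: smooth L1 on the 4*K coordinates of matched cuboids only
    loc_loss = F.smooth_l1_loss(box_preds[pos_mask], box_targets[pos_mask], reduction="sum")
    return (conf_loss + alpha * loc_loss) / N
```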

Thinking about shortcomings

  • Restrictions on anchor size and shape: the sizes and shapes of the anchors (anchor cuboids) used by the method are defined in advance, which can lead to poor results when handling actions or objects of widely varying sizes and shapes. If an action or object in the video does not match the predefined anchors, it may be missed or falsely detected.

3. Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos

Spatio-Temporal Vector of Locally Max Pooled Features (ST-VLMPF)

Summary and conclusion

  • Proposes the Spatio-Temporal Vector of Locally Max Pooled Features (ST-VLMPF), a super-vector-based encoding method specifically designed for encoding local deep features.
  • Feature assignment is performed at two levels, using feature similarity and spatio-temporal information.

Introduction: Targeting pain points and contributions

Pain points:

  • One of the shortcomings of current standard encodings is the lack of consideration of spatiotemporal information
  • Handcrafted features are designed by hand and often contain low-level information such as edges.
  • Given the current wide availability of pretrained neural networks, many researchers use them purely as feature extraction tools, because retraining or fine-tuning the networks is often difficult in practice. An efficient encoding method for deep features is therefore needed.

Contributions:

  • It solves an important problem in video understanding: how to construct a video representation containing CNN features on the entire video.

Model framework

[Figure 1: overview of the ST-VLMPF encoding pipeline]
A codebook C is learned with k-means from a large subset of randomly selected features extracted from a subset of the videos in the dataset. The result is k1 visual words, C = {c1, c2, …, ck1}, which are essentially the means of the feature clusters learned by k-means.

  • First, local features are extracted from the video: X = {x1, x2, …, xn} ∈ R^(n×d), where d is the feature dimension and n is the total number of local features of the video. Along with the local features, we retain their positions P = {p1, p2, …, pn} ∈ R^(n×3).
  • The proposed encoding performs two hard assignments using the learned codebooks: the first based on feature similarity and the second based on feature position. For the first assignment, each local video feature xj (j = 1, …, n) is assigned to the closest visual word of codebook C, and the features assigned to each cluster ci (i = 1, …, k1) are then grouped together.
    NN(xj) denotes the nearest centroid of codebook C for feature xj, and |·| denotes the absolute value. Essentially, Equation 2 takes the maximum absolute value while keeping the original sign in the returned result. In Figure 1 we refer to this as similarity feature max pooling, because features are grouped by their similarity and max pooling is then performed within each resulting group. The concatenation of all vectors [vc1, vc2, …, vck1] is the VLMPF (Vector of Locally Max Pooled Features) encoding, with a final vector size of k1 × d.
  • After the first assignment, we also retain the centroid membership of each feature, with the aim of preserving the similarity-based clustering information. For each feature, a vector m represents this membership; for example, m = [0 1 0 0 … 0 0] maps the feature to the second visual word of codebook C.
  • We perform a second assignment based on feature locations. The bottom part of Figure 1 shows this path. Each feature position pj in P is assigned to the nearest centroid in the codebook PC.
  • Features are grouped based on their spatio-temporal information, and the maximum absolute value is again taken while keeping the original sign of the feature. We also concatenate the similarity-membership information obtained from the first assignment (Equation 3), so that the spatio-temporal groups also encapsulate the similarity membership of their features. Concatenating all these vectors [vpc1, vpc2, …, vpck2] gives the ST (spatio-temporal) encoding, of size k2 × d + k2 × k1. Finally, the ST and VLMPF encodings are concatenated to create the final ST-VLMPF representation used as the classifier input, so its final size is (k1 × d) + (k2 × d + k2 × k1). A sketch of the whole encoding follows this list.
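The sketch below walks through the whole encoding on dummy data; the use of scikit-learn's KMeans, the function names, and the way the membership vectors are pooled inside each spatio-temporal group (averaged here) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def signed_max_pool(group):
    """Element-wise max of |x| over a group of features, keeping the original sign."""
    idx = np.abs(group).argmax(axis=0)              # row with the largest |value| per dimension
    return group[idx, np.arange(group.shape[1])]

def st_vlmpf(X, P, k1=64, k2=32):
    """Two-level ST-VLMPF-style encoding of one video (illustrative sketch).

    X: (n, d) local deep features; P: (n, 3) their (x, y, t) positions.
    Returns a vector of size k1*d + k2*d + k2*k1.
    """
    # Codebooks: k1 visual words on features, k2 centroids on positions.
    # (In the paper they are learned once from a large random subset of the dataset,
    #  not per video; fitting them here keeps the sketch self-contained.)
    C = KMeans(n_clusters=k1, n_init=4).fit(X)
    PC = KMeans(n_clusters=k2, n_init=4).fit(P)
    feat_assign, pos_assign = C.predict(X), PC.predict(P)
    membership = np.eye(k1)[feat_assign]            # (n, k1) one-hot similarity membership

    d = X.shape[1]
    # VLMPF branch: group by feature similarity, then sign-preserving max pooling
    vlmpf = np.zeros((k1, d))
    for i in range(k1):
        grp = X[feat_assign == i]
        if len(grp):
            vlmpf[i] = signed_max_pool(grp)

    # ST branch: group by spatio-temporal position, pool features the same way,
    # and append the similarity-membership information of each group
    # (averaged here -- how the membership vectors are pooled is an assumption).
    st = np.zeros((k2, d + k1))
    for i in range(k2):
        sel = pos_assign == i
        if sel.any():
            st[i, :d] = signed_max_pool(X[sel])
            st[i, d:] = membership[sel].mean(axis=0)

    return np.concatenate([vlmpf.ravel(), st.ravel()])

# Example on dummy data: 2000 features of dimension 512 with normalized (x, y, t) positions
code = st_vlmpf(np.random.randn(2000, 512), np.random.rand(2000, 3), k1=8, k2=4)
```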

Local deep features extraction
Three networks are used: a spatial stream to capture appearance, a temporal stream to capture motion, and a spatio-temporal stream to capture both appearance and motion information.
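As an illustration of using a pretrained network purely as a local feature extractor (the strategy discussed in the pain points above), the snippet below pulls conv feature maps from torchvision's ImageNet-pretrained VGG-16 and flattens them into local features X with positions P; the actual networks used in the paper differ, so this is only a stand-in.

```python
import torch
import torchvision

# An ImageNet-pretrained VGG-16 used purely as a feature extractor (no fine-tuning);
# the spatial/temporal/spatio-temporal networks in the paper are different backbones.
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()

frames = torch.randn(8, 3, 224, 224)            # 8 sampled RGB frames (dummy data)
with torch.no_grad():
    fmap = vgg(frames)                          # (8, 512, 7, 7) conv feature maps

# Flatten into local features X (n, d) and positions P (n, 3) = (x, y, t),
# matching the notation used in the encoding above.
n_frames, d, h, w = fmap.shape
X = fmap.permute(0, 2, 3, 1).reshape(-1, d)     # (8 * 7 * 7, 512)
ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
P = torch.stack([
    xs.flatten().repeat(n_frames),                      # x position in the feature map
    ys.flatten().repeat(n_frames),                      # y position in the feature map
    torch.arange(n_frames).repeat_interleave(h * w),    # frame index (time)
], dim=1)                                       # (8 * 7 * 7, 3)
```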

4. Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos code

Summary and conclusion

  • Proposes an end-to-end deep network called Tube Convolutional Neural Network (T-CNN) for action detection in videos. It uses 3D convolutional networks to extract effective spatiotemporal features and performs action localization and recognition in a unified framework. Coarse proposal boxes are densely sampled from the 3D convolutional feature cubes and linked for action recognition and localization.

Introduction: Targeting pain points and contributions

Pain points:

  • Previous CNN-based video action detection methods usually involve two main steps: frame-level action proposal generation and association of the proposals across frames. In addition, most of these methods adopt a two-stream CNN framework to handle spatial and temporal features separately.

Contributions:

  • An end-to-end deep learning based video action detection method is proposed. It operates directly on the raw video, uses a single 3D network to capture spatiotemporal information, and performs action localization and recognition based on 3D convolutional features. This is the first work to utilize 3D ConvNet for action detection.
  • A Tube Proposal Network is introduced, which uses skip pooling in the temporal domain to preserve the temporal information needed for action localization in the 3D volume.
  • A new pooling layer, the Tube-of-Interest (ToI) pooling layer, is proposed in T-CNN. The ToI pooling layer is a 3D generalization of the Region-of-Interest (RoI) pooling layer in R-CNN. It effectively handles the variable spatial and temporal sizes of tube proposals.

Related work

  • R-CNN: for object detection in images, Region-CNN (R-CNN) extracts region proposals using selective search. Each candidate region is then warped to a fixed size and fed into a ConvNet to extract CNN features. Finally, an SVM is trained for object classification.
  • Fast R-CNN. Compared with the multi-stage pipeline of R-CNN , Fast R-CNN adds an object classifier to the network and simultaneously trains the object classifier and the bounding box regressor. A Region-of-Interest (RoI) pooling layer is introduced to extract fixed-length feature vectors of bounding boxes of different sizes.
  • Faster R-CNN. It introduces a Region Proposal Network (RPN) to generate proposals instead of selective search. The RPN shares full-image convolutional features with the detection network, so proposal generation is nearly cost-free. Faster R-CNN achieves state-of-the-art object detection performance while remaining efficient at test time. Motivated by its high performance, this paper explores generalizing Faster R-CNN from 2D image regions to 3D video volumes for action detection.

Model framework


  • The input video is first divided into clips of equal length.
  • Each clip is fed to the Tube Proposal Network (TPN) to obtain a set of tube proposals.
  • Next, the tube proposals of each video clip are linked according to their actionness scores and the overlap between adjacent proposals to form a complete tube proposal for spatiotemporal action localization in the video.
  • Finally, Tube-of-Interest (ToI) pooling is applied to the linked action tube proposal to generate a fixed-length feature vector for action label prediction.

Tube-of-Interest (ToI) pooling

[Figure 2: diagram of Tube-of-Interest (ToI) pooling]

  • Spatial max pooling: first, the h × w feature map is divided into H × W bins, each corresponding to a cell of size approximately h/H × w/W; in each cell, max pooling selects the maximum value.
  • Temporal max pooling: second, the d spatially pooled feature maps are divided into D bins along time. Similar to the first step, about d/D adjacent feature maps are grouped together and standard temporal max pooling is applied. The fixed output size of the ToI pooling layer is therefore D × H × W. Figure 2 shows a diagram of ToI pooling (a code sketch follows below).
  • As shown in the figure above, the red region divides a 20 × 20 feature map into 4 bins.

Max Pooling divides the input image into several rectangular areas and outputs the maximum value for each sub-area.
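A compact PyTorch sketch of the two-step pooling described above; channels are omitted for clarity, and the adaptive-pooling calls are simply a convenient way to obtain the H × W and D binning, not necessarily how the original layer is implemented.

```python
import torch
import torch.nn.functional as F

def toi_pool(feature_cube, D=1, H=8, W=8):
    """Tube-of-Interest pooling sketch: spatial max pooling, then temporal max pooling.

    feature_cube: (d, h, w) tensor of d per-frame feature maps cropped to one
    tube proposal. Returns a fixed-size (D, H, W) tensor for any input d, h, w.
    """
    d = feature_cube.shape[0]
    # Step 1: spatial max pooling of each h x w map into H x W bins
    spatial = F.adaptive_max_pool2d(feature_cube.unsqueeze(1), (H, W)).squeeze(1)  # (d, H, W)
    # Step 2: temporal max pooling groups ~d/D adjacent maps into D bins
    temporal = F.adaptive_max_pool1d(
        spatial.reshape(d, H * W).t().unsqueeze(0), D     # (1, H*W, d): pool over the d axis
    ).squeeze(0).t().reshape(D, H, W)
    return temporal

# Example: a tube proposal spanning 8 frames with 20x20 feature maps
out = toi_pool(torch.randn(8, 20, 20), D=2, H=4, W=4)     # -> shape (2, 4, 4)
```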

Tube Proposal Network

[Figure: Tube Proposal Network]

Linking Tube Proposals

Tube proposals from different clips can be linked into a tube proposal sequence (i.e., a video tube proposal) for action detection. However, not every combination of tube proposals correctly captures the complete action. For example, a tube proposal in one clip may contain the action, while a tube proposal in the next clip may capture only background. Intuitively, the content of the selected tube proposals should capture the action, and tube proposals linked across any two consecutive clips should have a large temporal overlap. Therefore, two criteria are considered when linking tube proposals: actionness and overlap score. Each video proposal is then assigned a score

S = (1/m) · Σ_{i=1..m} Actionness_i + (1/(m−1)) · Σ_{j=1..m−1} Overlap_{j,j+1}

where Actionness_i is the actionness score of the tube proposal from the i-th clip, Overlap_{j,j+1} measures the overlap between the linked proposals from the j-th and (j+1)-th clips, and m is the total number of video clips. As shown in Figure 3, each bounding box proposal from the conv5 feature tube is associated with an actionness score, which is inherited by the corresponding tube proposal. The overlap between two tube proposals is computed as the IoU (intersection over union) between the last frame of the j-th tube proposal and the first frame of the (j+1)-th tube proposal. The first term of S is the average actionness score of all tube proposals in the video proposal, and the second term is the average overlap between the tube proposals of every two consecutive clips. This ensures that the linked tube proposals capture the action while being temporally consistent. Figure 4 shows an example of linking tube proposals and computing their scores; the linked proposal sequence with the highest score in the video is selected. A small code sketch of this scoring follows below.
[Figure 4: example of linking tube proposals and computing their scores]
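A small self-contained sketch of the linking score; the proposal representation (a dict with an actionness score and per-frame boxes) is hypothetical, and the search over proposal combinations described in the paper is only noted in a comment.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def link_score(tube_proposals):
    """S = mean actionness + mean overlap of consecutive per-clip tube proposals.

    tube_proposals: list of m dicts, one per clip, each with an 'actionness' float
    and 'boxes', an (n_frames, 4) array of per-frame boxes within that clip.
    The overlap term is the IoU between the last frame of clip j and the first
    frame of clip j + 1, as in the text above.
    """
    m = len(tube_proposals)
    actionness = np.mean([p["actionness"] for p in tube_proposals])
    overlaps = [
        iou(tube_proposals[j]["boxes"][-1], tube_proposals[j + 1]["boxes"][0])
        for j in range(m - 1)
    ]
    return actionness + (np.mean(overlaps) if overlaps else 0.0)

# Example with two clips; in the paper the per-clip proposals are combined so that
# the sequence with the highest S is kept as the video-level tube proposal.
clips = [
    {"actionness": 0.9, "boxes": np.array([[10, 10, 50, 80]] * 8)},
    {"actionness": 0.8, "boxes": np.array([[12, 11, 52, 82]] * 8)},
]
print(link_score(clips))
```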


Source: blog.csdn.net/weixin_45751396/article/details/132787749