[Video Understanding] 2018-CVPR-Non-local Neural Networks

non-local neural network

Paper address
code address

Summary

 Both convolution and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method in computer vision [4], our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete with or outperform current competition winners on both the Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code is available at https://github.com/facebookresearch/video-nonlocal-net.

1 Introduction

 Capturing long-range dependencies is of central importance in deep neural networks. For sequential data (e.g., speech, language), recurrent operations [38, 23] are the dominant solution for long-range dependency modeling. For image data, long-range dependencies are modeled by the large receptive fields formed by deep stacks of convolution operations [14, 30].

Both convolution and recurrent operations process a local neighborhood in space or time; long-range dependencies can therefore only be captured when these operations are applied repeatedly, propagating the signal progressively through the data. Repeating local operations has several limitations. First, it is computationally inefficient. Second, it causes optimization difficulties that need to be addressed carefully [23, 21]. Finally, these challenges make multi-hop dependency modeling difficult, for example, when messages need to be delivered back and forth between distant positions.

 In this paper, we present non-local operations as an efficient, simple, and generic component for capturing long-range dependencies with deep neural networks. Our proposed non-local operation is a generalization of the classical non-local mean operation [4] in computer vision. Intuitively, a non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature map (Figure 1). The set of positions can be in space, time, or spacetime, implying that our operations are applicable to image, sequence, and video problems.

figure 1

Figure 1. A spacetime non-local operation in our network trained for video classification on Kinetics. The response at a position xi is computed as the weighted average of the features of all positions xj (only the highest-weighted ones are shown here). In this example computed by our model, note how it relates the ball in the first frame to the ball in the last two frames. More examples are in Figure 3.

 There are several advantages of using non-local operations: (a) In contrast to the progressive behavior of recurrent and convolutional operations, non-local operations capture long-range dependencies directly by computing interactions between any two positions, regardless of their positional distance; (b) As we show in experiments, non-local operations are efficient and achieve their best results even with only a few layers (e.g., 5); (c) Finally, our non-local operations maintain the variable input sizes and can be easily combined with other operations (e.g., convolutions, as we will use).

 We demonstrate the effectiveness of non-local operations in the application of video classification. In videos, long-range interactions occur between pixels that are distant in space as well as time. A single non-local block, which is our basic unit, can directly capture these spacetime dependencies in a feedforward fashion. With a few non-local blocks, our architectures, which we call non-local neural networks, are more accurate for video classification than 2D and 3D convolutional networks [48] (including the inflated variant [7]). In addition, non-local neural networks are more computationally economical than their 3D convolutional counterparts. Comprehensive ablation studies are presented on the Kinetics [27] and Charades [44] datasets. Using RGB only and without any bells and whistles (e.g., optical flow, multi-scale testing), our method achieves results on par with or better than the latest competition winners on both datasets.

 To demonstrate the generality of non-local operations, we further present object detection/segmentation and pose estimation experiments on the COCO dataset [33]. On top of the strong Mask R-CNN baseline [19], our non-local blocks can increase accuracy on all three tasks at a small extra computational cost. Together with the evidence on videos, these image experiments show that non-local operations are generally useful and can become a basic building block in designing deep neural networks.

2. Related work

Non-local image processing. Non-local means [4] is a classical filtering algorithm that computes a weighted mean of all pixels in an image. It allows distant pixels to contribute to the filtered response at a location based on patch appearance similarity. This non-local filtering idea was later developed into BM3D (block-matching 3D) [10], which performs filtering on a group of similar, but non-local, patches. BM3D is a solid baseline for image denoising, even when compared with deep neural networks [5]. Block matching has also been used together with neural networks for image denoising [6, 31]. Non-local matching is also the essence of successful texture synthesis [12], super-resolution [16], and inpainting [1] algorithms.

Graphical models. Long-range dependencies can be modeled by graphical models such as conditional random fields (CRF) [29, 28]. In the context of deep neural networks, a CRF can be exploited to post-process the semantic segmentation predictions of a network [9]. The iterative mean-field inference of CRFs can be turned into a recurrent network and trained [56, 42, 8, 18, 34]. In contrast, our method is a simpler feedforward block for computing non-local filtering. Unlike these methods developed for segmentation, our general-purpose component is applied to both classification and detection. These methods and ours are also related to a more abstract model called graph neural networks [41].

Feedforward modeling for sequences. There is a recent trend of using feedforward (i.e., non-recurrent) networks for modeling speech and language sequences [36, 54, 15]. In these methods, long-term dependencies are captured by the large receptive fields contributed by very deep 1D convolutions. These feedforward models are amenable to parallelized implementations and can be more efficient than the widely used recurrent models.

Self-attention. Our work is related to the recent self-attention [49] method for machine translation. A self-attention module computes the response at a position in a sequence (e.g., a sentence) by attending to all positions and taking their weighted average in an embedding space. As we will discuss next, self-attention can be viewed as a form of the non-local mean [4], and in this sense our work bridges self-attention for machine translation to the more general class of non-local filtering operations that are applicable to image and video problems in computer vision.

Interaction networks. Interaction Networks (IN) [2, 52] were proposed recently for modeling physical systems. They operate on graphs of objects involved in pairwise interactions. Hoshen [24] presented the more efficient Vertex Attention IN (VAIN) in the context of multi-agent predictive modeling. Another variant, named Relation Networks [40], computes a function on the feature embeddings at all pairs of positions in its input. Our method also processes all pairs, as we will explain ($f(\mathbf{x}_i, \mathbf{x}_j)$ in Eq.(1)). While our non-local networks are connected to these approaches, our experiments indicate that the non-locality of the model, which is orthogonal to the ideas of attention/interaction/relations (e.g., a network can attend to a local region), is the key to their empirical success. Non-local modeling, a long-time crucial element of image processing (e.g., [12, 4]), has been largely overlooked in recent neural networks for computer vision.

Video classification architectures. A natural solution to video classification is to combine the success of CNNs for images and RNNs for sequences [55, 11]. In contrast, feedforward models are achieved by 3D convolutions in spacetime (C3D) [26, 48], and the 3D filters can be formed by “inflating” [13, 7] pre-trained 2D filters. In addition to end-to-end modeling on raw video inputs, optical flow [45] and trajectories [50, 51] have been found to be helpful. Both flow and trajectories are off-the-shelf modules that may find long-range, non-local dependencies. A systematic comparison of video architectures can be found in [7].

3. Non-local Neural Networks

 We first give a general definition of non-local operations, and then we provide several concrete examples of it.

figure 2

Figure 2. A spacetime non-local block. The feature maps are shown as the shape of their tensors, e.g., T×H×W×1024 for 1024 channels (proper reshaping is performed when noted). “⊗” denotes matrix multiplication and “⊕” denotes element-wise sum. The softmax operation is performed on each row. The blue boxes denote 1×1×1 convolutions. Here we show the embedded Gaussian version, with a bottleneck of 512 channels. The vanilla Gaussian version can be done by removing θ and φ, and the dot-product version can be done by replacing the softmax with scaling by 1/N.

3.1. Formulation

 Following the non-local mean operation [4], we define the general non-local operation in deep neural networks as:

$$\mathbf{y}_i = \frac{1}{\mathcal{C}\left(\mathbf{x}\right)} \sum_{\forall j} f\left(\mathbf{x}_i, \mathbf{x}_j\right) g\left(\mathbf{x}_j\right) \tag{1}$$

 Here $i$ is the index of an output position (in space, time, or spacetime) whose response is to be computed, and $j$ is the index that enumerates all possible positions. $\mathbf{x}$ is the input signal (image, sequence, video; often their features) and $\mathbf{y}$ is the output signal of the same size as $\mathbf{x}$. The pairwise function $f$ computes a scalar (representing a relationship such as affinity) between $i$ and all $j$. The unary function $g$ computes a representation of the input signal at position $j$. The response is normalized by the factor $\mathcal{C}\left(\mathbf{x}\right)$.

 The non-local behavior in Eq.(1) is due to the fact that all positions ($\forall j$) are considered in the operation. As a comparison, a convolution operation sums up the weighted input in a local neighborhood (e.g., $i-1 \le j \le i+1$ in the 1D case with kernel size 3), and a recurrent operation at time $i$ is often based only on the current and the latest time steps (e.g., $j = i$ or $i-1$).

Non-local operations are also different from a fully connected (fc) layer. Eq.(1) computes responses based on relationships between different positions, whereas fc uses learned weights. In other words, unlike in non-local layers, the relationship between $\mathbf{x}_j$ and $\mathbf{x}_i$ is not a function of the input data in fc. Furthermore, our formulation in Eq.(1) supports inputs of variable sizes and maintains the corresponding size in the output. In contrast, an fc layer requires a fixed-size input/output and loses positional correspondence (e.g., that from $\mathbf{x}_i$ to $\mathbf{y}_i$ at position $i$).

 A non-local operation is a flexible building block and can be easily used together with convolutional/recurrent layers. It can be added into the earlier parts of deep neural networks, unlike fc layers that are often used in the end. This allows us to build a richer hierarchy that combines both non-local and local information.
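
To make Eq.(1) concrete, below is a minimal sketch of the generic operation on a flattened set of positions. This is my own illustration (with pluggable `f` and `g`, and positions already flattened into one axis), not the authors' implementation.

```python
import torch

def nonlocal_op(x, f, g):
    """Generic non-local operation of Eq.(1) over a flattened set of positions.

    x: (N, C) tensor holding the features of all N positions (space, time,
       or spacetime positions flattened into a single axis).
    f: pairwise function returning an (N, N) affinity matrix f(x_i, x_j).
    g: unary function returning an (N, C') representation g(x_j).
    """
    affinity = f(x, x)                        # f(x_i, x_j) for all pairs, shape (N, N)
    norm = affinity.sum(dim=1, keepdim=True)  # C(x) = sum over all j of f(x_i, x_j)
    y = (affinity / norm) @ g(x)              # weighted sum of g(x_j) over all positions j
    return y

# Example: Gaussian pairwise function (Eq. (2)) with an identity g for simplicity.
x = 0.1 * torch.randn(16, 64)                 # 16 positions, 64 channels (scaled down so
gaussian_f = lambda a, b: torch.exp(a @ b.t())  # that e^{x_i^T x_j} stays well-behaved)
y = nonlocal_op(x, gaussian_f, lambda a: a)
print(y.shape)                                # torch.Size([16, 64])
```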

3.2. Instantiation

 Next we describe several versions of $f$ and $g$. Interestingly, we will show by experiments (Table 2a) that our non-local models are not sensitive to these choices, indicating that the generic non-local behavior is the main reason for the observed improvements.

 For simplicity, we only consider $g$ in the form of a linear embedding: $g\left(\mathbf{x}_j\right) = W_g \mathbf{x}_j$, where $W_g$ is a weight matrix to be learned. This is implemented as, e.g., a 1×1 convolution in space or a 1×1×1 convolution in spacetime.

 Next we discuss choices for the pairwise function $f$.

Gaussian. Following the non-local mean [4] and bilateral filters [47], a natural choice of $f$ is the Gaussian function. In this paper we consider:

$$f\left(\mathbf{x}_i, \mathbf{x}_j\right) = e^{\mathbf{x}_i^T \mathbf{x}_j} \tag{2}$$

 Here $\mathbf{x}_i^T \mathbf{x}_j$ is dot-product similarity. Euclidean distance as used in [4, 47] is also applicable, but the dot product is easier to implement in modern deep learning platforms. The normalization factor is set as $\mathcal{C}\left(\mathbf{x}\right) = \sum_{\forall j} f\left(\mathbf{x}_i, \mathbf{x}_j\right)$.

Embedded Gaussian. A simple extension of the Gaussian function is to compute similarity in an embedding space. In this paper we consider:

$$f\left(\mathbf{x}_i, \mathbf{x}_j\right) = e^{\theta\left(\mathbf{x}_i\right)^T \phi\left(\mathbf{x}_j\right)} \tag{3}$$

 Here $\theta\left(\mathbf{x}_i\right) = W_\theta \mathbf{x}_i$ and $\phi\left(\mathbf{x}_j\right) = W_\phi \mathbf{x}_j$ are two embeddings. As above, we set $\mathcal{C}\left(\mathbf{x}\right) = \sum_{\forall j} f\left(\mathbf{x}_i, \mathbf{x}_j\right)$.

 We note that the self-attention module [49] recently proposed for machine translation is a special case of non-local operations in the embedded Gaussian version. This can be seen from the fact that, for a given $i$, $\frac{1}{\mathcal{C}\left(\mathbf{x}\right)} f\left(\mathbf{x}_i, \mathbf{x}_j\right)$ becomes the softmax computation along the dimension $j$. So we have $\mathbf{y} = \mathrm{softmax}\left(\mathbf{x}^T W_\theta^T W_\phi \mathbf{x}\right) g\left(\mathbf{x}\right)$, which is the self-attention form in [49]. As such, our work provides insight by relating this recent self-attention model to the classic computer vision method of non-local means [4], and extends the sequential self-attention network in [49] to a generic space/spacetime non-local network for image/video recognition in computer vision.
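
This correspondence can be written out directly as tensor operations. The sketch below is my own illustration with assumed shapes (N positions, 1024 input channels, a 512-channel bottleneck as in Figure 2); it is not the released code.

```python
import torch
import torch.nn.functional as F

N, C, C_half = 32, 1024, 512                 # positions, channels, bottleneck (assumed)
x = torch.randn(N, C)

W_theta = 0.01 * torch.randn(C, C_half)
W_phi = 0.01 * torch.randn(C, C_half)
W_g = 0.01 * torch.randn(C, C_half)

theta, phi, g = x @ W_theta, x @ W_phi, x @ W_g   # theta(x), phi(x), g(x) embeddings

# Embedded Gaussian: f(x_i, x_j) = exp(theta(x_i)^T phi(x_j)); dividing by
# C(x) = sum_j f(x_i, x_j) is exactly a softmax along j, i.e. self-attention.
attn = F.softmax(theta @ phi.t(), dim=-1)         # (N, N), each row sums to 1
y = attn @ g                                      # y = softmax(x^T W_theta^T W_phi x) g(x)
print(y.shape)                                    # torch.Size([32, 512])
```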

 Although related to [49], we show that attentional behavior (due to softmax) is not essential in the application we study. To illustrate this point, we next describe two alternative versions of non-local operations.

Dot product. $f$ can be defined as a dot-product similarity:

$$f\left(\mathbf{x}_i, \mathbf{x}_j\right) = \theta\left(\mathbf{x}_i\right)^T \phi\left(\mathbf{x}_j\right) \tag{4}$$

 Here we adopt the embedded version. In this case, we set the normalization factor as $\mathcal{C}\left(\mathbf{x}\right) = N$, where $N$ is the number of positions in $\mathbf{x}$, rather than the sum of $f$, because it simplifies gradient computation. Normalization like this is necessary because the input can have variable size.

The main difference between the dot product and embedded Gaussian versions is the presence of softmax, which acts as an activation function .
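
The shapes from the previous sketch illustrate this variant; here `theta`, `phi`, and `g` stand for already-computed embeddings (random placeholders in this assumption-laden sketch), and the only change from the embedded Gaussian version is replacing the softmax with a 1/N scaling.

```python
import torch

N, C_half = 32, 512                      # number of positions and bottleneck channels (assumed)
theta = torch.randn(N, C_half)           # theta(x_i) embeddings for all positions
phi = torch.randn(N, C_half)             # phi(x_j) embeddings for all positions
g = torch.randn(N, C_half)               # g(x_j) representations for all positions

# Dot-product version (Eq. (4)) with C(x) = N: no exponential, no softmax.
affinity = theta @ phi.t()               # theta(x_i)^T phi(x_j) for all pairs, (N, N)
y = (affinity / N) @ g                   # scale by 1/N instead of normalizing by sum_j f
print(y.shape)                           # torch.Size([32, 512])
```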

Concatenation. Concatenation is used by the pairwise function in Relation Networks [40] for visual reasoning. We also evaluate a concatenation form of $f$:

$$f\left(\mathbf{x}_i, \mathbf{x}_j\right) = \mathrm{ReLU}\left(\mathbf{w}_f^T \left[\theta\left(\mathbf{x}_i\right), \phi\left(\mathbf{x}_j\right)\right]\right) \tag{5}$$

 Here $[\cdot, \cdot]$ denotes concatenation and $\mathbf{w}_f$ is a weight vector that projects the concatenated vector to a scalar. As above, we set $\mathcal{C}\left(\mathbf{x}\right) = N$. In this case we adopt ReLU [35] in $f$.
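
A sketch of this concatenation form (my own illustration; `w_f` is a hypothetical learned weight vector and the tiny shapes are for demonstration only):

```python
import torch
import torch.nn.functional as F

def concat_affinity(theta, phi, w_f):
    """f(x_i, x_j) = ReLU(w_f^T [theta(x_i), phi(x_j)]) evaluated for all pairs."""
    N, C = theta.shape
    ti = theta.unsqueeze(1).expand(N, N, C)     # theta(x_i), broadcast over j
    pj = phi.unsqueeze(0).expand(N, N, C)       # phi(x_j), broadcast over i
    pair = torch.cat([ti, pj], dim=-1)          # (N, N, 2C) concatenated pairs
    return F.relu(pair @ w_f)                   # (N, N) scalar affinities

theta = torch.randn(8, 4)                       # tiny shapes, for demonstration only
phi = torch.randn(8, 4)
w_f = torch.randn(8)                            # projects the 2C = 8-dim concat to a scalar
g = torch.randn(8, 4)                           # g(x_j) representations
aff = concat_affinity(theta, phi, w_f)
y = (aff / aff.shape[1]) @ g                    # C(x) = N normalization, then weight g(x_j)
print(aff.shape, y.shape)                       # (8, 8) and (8, 4)
```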

 Several variations above demonstrate the flexibility of our general non-local operations. We believe alternative versions are possible and may improve results.

3.3. Non-local Block

We wrap the non-local operations in equation (1) into a non-local block that can be incorporated into many existing architectures . We define non-local blocks as:

$$\mathbf{z}_i = W_z \mathbf{y}_i + \mathbf{x}_i \tag{6}$$

 where $\mathbf{y}_i$ is given in Eq.(1) and “$+\,\mathbf{x}_i$” denotes a residual connection [21]. The residual connection allows us to insert a new non-local block into any pre-trained model without breaking its initial behavior (e.g., if $W_z$ is initialized as zero). An example non-local block is illustrated in Figure 2. The pairwise computation in Eq.(2), (3), or (4) can be simply done by matrix multiplication as shown in Figure 2; the concatenation version in (5) is straightforward.

The pairwise computation of a non-local block is lightweight when it is used in high-level, sub-sampled feature maps. For example, typical values in Figure 2 are $T = 4$ and $H = W = 14$ or $7$. The pairwise computation done by matrix multiplication is comparable to a typical convolutional layer in standard networks. We further adopt the following implementations that make it more efficient.

 Implementation of non-local blocks. We set the number of channels represented by $W_g$, $W_\theta$, and $W_\phi$ to be half of the number of channels in $\mathbf{x}$. This follows the bottleneck design of [21] and reduces the computation of a block by about a half. The weight matrix $W_z$ in Eq.(6) computes a position-wise embedding on $\mathbf{y}_i$, matching the number of channels to that of $\mathbf{x}$. See Figure 2.

A subsampling trick can be used to further reduce computation. We modify Eq.(1) as: $\mathbf{y}_i = \frac{1}{\mathcal{C}\left(\hat{\mathbf{x}}\right)} \sum_{\forall j} f\left(\mathbf{x}_i, \hat{\mathbf{x}}_j\right) g\left(\hat{\mathbf{x}}_j\right)$, where $\hat{\mathbf{x}}$ is a subsampled version of $\mathbf{x}$ (e.g., by pooling). We perform this in the spatial domain, which reduces the amount of pairwise computation by 1/4. This trick does not alter the non-local behavior, but only makes the computation sparser. It can be done by adding a max pooling layer after $\phi$ and $g$ in Figure 2.

 We use these efficient modifications for all non-local blocks studied in this paper.
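
Putting Sections 3.2 and 3.3 together, the following is a minimal PyTorch sketch of an embedded Gaussian spacetime non-local block. It is my own re-implementation following Figure 2, not the authors' released code: a bottleneck to half the channels, optional max-pooling subsampling after φ and g, and a zero-initialized BN after W_z (as described in Section 4.1) so that the block starts as an identity mapping.

```python
import torch
import torch.nn as nn


class NonLocalBlock3D(nn.Module):
    """Embedded Gaussian spacetime non-local block (a sketch following Fig. 2)."""

    def __init__(self, channels, subsample=True):
        super().__init__()
        inter = channels // 2                               # bottleneck: half the channels
        self.theta = nn.Conv3d(channels, inter, kernel_size=1)
        self.phi = nn.Conv3d(channels, inter, kernel_size=1)
        self.g = nn.Conv3d(channels, inter, kernel_size=1)
        self.w_z = nn.Sequential(
            nn.Conv3d(inter, channels, kernel_size=1),
            nn.BatchNorm3d(channels),
        )
        nn.init.zeros_(self.w_z[1].weight)                  # zero-init BN scale => identity block
        nn.init.zeros_(self.w_z[1].bias)
        # Subsampling trick: spatial max pooling after phi and g to sparsify the pairs.
        self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2)) if subsample else nn.Identity()

    def forward(self, x):                                   # x: (B, C, T, H, W)
        b, c, *_ = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)    # (B, THW, C/2)
        phi = self.pool(self.phi(x)).flatten(2)             # (B, C/2, THW/4)
        g = self.pool(self.g(x)).flatten(2).transpose(1, 2) # (B, THW/4, C/2)

        attn = torch.softmax(theta @ phi, dim=-1)           # softmax along j (Eq. 3)
        y = (attn @ g).transpose(1, 2).reshape(b, c // 2, *x.shape[2:])
        return x + self.w_z(y)                              # residual connection z = W_z y + x


# Typical shapes from Figure 2: T=4, H=W=14, 1024 channels.
x = torch.randn(2, 1024, 4, 14, 14)
block = NonLocalBlock3D(1024)
out = block(x)
print(out.shape)                       # torch.Size([2, 1024, 4, 14, 14])
print(torch.allclose(out, x))          # True at initialization (identity mapping)
```

Because the block is an exact identity at initialization, it can be dropped into a pre-trained C2D/I3D network without changing its initial behavior, as the paper notes.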

4. Video classification model

 To understand the behavior of non-local networks, we conduct comprehensive ablation experiments on the video classification task . First, we describe baseline network architectures for this task, and then extend them to 3D ConvNets [48, 7] and our proposed non-local network.

2D ConvNet Baseline (C2D) . To isolate our non-local network from the temporal effects of 3D ConvNet, we constructed a simple 2D baseline architecture where the temporal dimension is simply handled (i.e., only through pooling) .

 Table 1 shows our C2D baseline under a ResNet-50 backbone. The input video clip has 32 frames, each with 224×224 pixels. All convolutions in Table 1 are in essence 2D kernels that process the input frame-by-frame (implemented as $1 \times k \times k$ kernels). This model can be directly initialized from the ResNet weights pre-trained on ImageNet. A ResNet-101 counterpart is built in the same way.

The only operation involving the time domain is the pooling layer. In other words, this baseline simply aggregates temporal information .

Table 1

Table 1. Our baseline ResNet-50 C2D model for video. The dimensions of 3D output maps and filter kernels are in T×H×W (2D kernels in H×W), with the number of channels following. The input is 32×224×224. Residual blocks are shown in brackets.

Inflated 3D ConvNet (I3D). As done in [13, 7], the C2D model in Table 1 can be turned into a 3D convolutional counterpart by “inflating” the kernels. For example, a 2D $k \times k$ kernel can be inflated into a 3D $t \times k \times k$ kernel that spans $t$ frames. This kernel can be initialized from 2D models (pre-trained on ImageNet): each of the $t$ planes in the $t \times k \times k$ kernel is initialized by the pre-trained $k \times k$ weights, rescaled by $1/t$. If a video consists of a single static frame repeated in time, this initialization produces the same results as the 2D pre-trained model run on the static frame.
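
A sketch of this inflation-based initialization (my own illustration of the paper's description, not the released code): each of the t temporal planes of the t×k×k kernel is a copy of the pretrained k×k kernel rescaled by 1/t.

```python
import torch
import torch.nn as nn

def inflate_conv2d_to_3d(conv2d: nn.Conv2d, t: int) -> nn.Conv3d:
    """Inflate a pretrained 2D k x k convolution into a t x k x k 3D convolution."""
    k_h, k_w = conv2d.kernel_size
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(t, k_h, k_w),
                       padding=(t // 2, conv2d.padding[0], conv2d.padding[1]),
                       bias=conv2d.bias is not None)
    # Repeat the 2D weights along the new temporal axis and rescale by 1/t.
    w2d = conv2d.weight.data                            # (out, in, k, k)
    conv3d.weight.data = w2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t
    if conv2d.bias is not None:
        conv3d.bias.data = conv2d.bias.data.clone()
    return conv3d

# Sanity check: on a video of identical frames, the inflated kernel reproduces the 2D output.
conv2d = nn.Conv2d(3, 8, kernel_size=3, padding=1)
conv3d = inflate_conv2d_to_3d(conv2d, t=3)
frame = torch.randn(1, 3, 16, 16)
video = frame.unsqueeze(2).repeat(1, 1, 5, 1, 1)        # 5 identical frames
out2d = conv2d(frame)
out3d = conv3d(video)
print(torch.allclose(out2d, out3d[:, :, 2], atol=1e-5)) # True for the center frame
```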

 We study two cases of inflation: we either inflate the 3×3 kernel in a residual block to 3×3×3 (similar to [7]), or the first 1×1 kernel in a residual block to 3×1×1 (similar to [13]). We denote these as I3D 3×3×3 and I3D 3×1×1. As 3D convolutions are computationally intensive, we only inflate one kernel for every 2 residual blocks; inflating more layers shows diminishing returns. We inflate conv1 to 5×7×7.

 The authors of [7] showed that the I3D model is more accurate than the corresponding CNN + LSTM model .

Non-local network. We insert non-local blocks into C2D or I3D to turn them into non-local nets. We investigate adding 1, 5, or 10 non-local blocks; the implementation details are described in context in the following sections.

4.1. Implementation details

Training. Our models are pre-trained on ImageNet [39]. Unless otherwise stated, we fine-tune our models using 32-frame input clips. These clips are formed by randomly cropping out 64 consecutive frames from the original full-length video and then dropping every other frame. The spatial size is 224×224 pixels, randomly cropped from a scaled video whose shorter side is randomly sampled in [256, 320] pixels, following [46]. We train on an 8-GPU machine with 8 clips per GPU in a mini-batch (so 64 clips in total per mini-batch). We train the models for 400k iterations in total, starting with a learning rate of 0.01 and reducing it by a factor of 10 every 150k iterations (see also Figure 4). We use a momentum of 0.9 and a weight decay of 0.0001. We adopt dropout [22] after the global pooling layer, with a dropout ratio of 0.5. We fine-tune our models with BatchNorm (BN) [25] enabled when it is applied. This is in contrast to the common practice of fine-tuning ResNets [21], where BN is frozen. We found that enabling BN in our application reduces overfitting.
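
For concreteness, these hyper-parameters translate roughly into the following optimizer and schedule setup. This is a sketch under the assumption of a standard PyTorch training loop, with a placeholder model and without the data loading and distributed-training details.

```python
import torch

model = torch.nn.Linear(10, 400)   # placeholder standing in for the non-local video network

optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01,             # initial learning rate
                            momentum=0.9,
                            weight_decay=1e-4)
# Reduce the learning rate 10x at 150k and 300k iterations (400k iterations total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[150_000, 300_000],
                                                 gamma=0.1)
dropout = torch.nn.Dropout(p=0.5)                # applied after the global pooling layer

for iteration in range(3):                       # 400_000 in the paper; truncated here
    # ... forward/backward on a mini-batch of 64 clips (8 clips x 8 GPUs) would go here ...
    optimizer.step()
    scheduler.step()
```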

 We adopt the method in [20] to initialize the weight layers introduced in the non-local blocks. We add a BN layer right after the last 1×1×1 layer that represents $W_z$; we do not add BN to the other layers in a non-local block. Following [17], the scale parameter of this BN layer is initialized as zero. This ensures that the initial state of the entire non-local block is an identity mapping, so it can be inserted into any pre-trained network while maintaining its initial behavior.

Inference. Following [46], we perform spatially fully-convolutional inference on videos whose shorter side is rescaled to 256. For the temporal domain, in our practice we sample 10 clips evenly from a full-length video and compute the softmax scores on them individually. The final prediction is the averaged softmax score of all clips.
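
A sketch of this clip-level inference protocol (placeholder model and tensors; the shapes are assumptions for illustration):

```python
import torch

def video_prediction(model, clips):
    """clips: (num_clips, C, T, H, W). Returns class scores averaged over clips."""
    with torch.no_grad():
        scores = torch.softmax(model(clips), dim=-1)   # per-clip softmax scores
    return scores.mean(dim=0)                          # final prediction: the average

# Placeholder "model" mapping each clip to 400 Kinetics logits (for illustration only).
dummy_model = lambda c: torch.randn(c.shape[0], 400)
clips = torch.randn(10, 3, 32, 256, 256)               # 10 clips, shorter side rescaled to 256
pred = video_prediction(dummy_model, clips)
print(pred.shape, float(pred.sum()))                   # (400,), sums to ~1.0
```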

5. Video classification experiment

 We conduct a comprehensive study on the challenging Kinetics dataset [27]. We also report results on the Charades dataset [44] to show the generality of our model.

5.1. Kinetics experiments

 Kinetics [27] contains approximately 246k training videos and 20k validation videos. This is a classification task involving 400 human behavior categories. We train all models on the training set and test on the validation set.

Figure 4 shows the training process curve of the ResNet-50 C2D baseline versus non-local C2D with 5 blocks (more details below) . Throughout the training process, our non-local C2D model consistently outperforms the C2D baseline in terms of training and validation errors.

Figures 1 and 3 visualize several examples of non-local block behavior computed by our model . Our networks can learn to find meaningful relationship cues regardless of distance in space and time.

image 3

Figure 3. Examples of the behavior of a non-local block in res3, computed by a 5-block non-local model trained on Kinetics. These examples are from held-out validation videos. The starting point of an arrow represents one xi and the ending points represent xj. The 20 highest-weighted arrows for each xi are visualized. The 4 frames are from a 32-frame input, shown with a stride of 8 frames. These visualizations show how the model finds related clues to support its prediction.

Figure 4

Figure 4. Curves of the training procedure on Kinetics for the ResNet-50 C2D baseline (blue) versus non-local C2D with 5 blocks (red). We show the top-1 training error (dashed) and validation error (solid). The validation error is computed in the same way as the training error (so it is 1-clip testing with the same random jittering as at training time); the final results are in Table 2c (R50, 5 blocks).

Table 2

Table 2. Ablation for Kinetics action classification. We show top-1 and top-5 classification accuracy (%).

 Table 2 shows the ablation results, analyzed as follows:

Instantiations . Table 2a compares different types of single non-local blocks added to the C2D baseline (just before the last residual block of res4). Even adding a non-local block can lead to about 1% improvement over the baseline .

 Interestingly, the embedded Gaussian, dot-product, and concatenation versions perform similarly, up to some random variations (72.7 to 72.9). As discussed in Section 3.2, the non-local operation with the Gaussian kernel becomes similar to the self-attention module [49]. However, our experiments show that the attentional (softmax) behavior of this module is not the key to the improvement in our applications; instead, it is more likely that the non-local behavior is important, and it is insensitive to the instantiations.

 In the rest of this paper, we use the embedded Gaussian version by default. This version is easier to visualize as its softmax scores are in the range of $[0, 1]$.

Which stage to add non-local blocks? Table 2b compares a single non-local block added to different stages of ResNet. The block is added right before the last residual block of a stage. The improvement of a non-local block on res2, res3, or res4 is similar, and on res5 it is slightly smaller. One possible explanation is that res5 has a small spatial size (7×7), which is insufficient to provide precise spatial information. More evidence of non-local blocks exploiting spatial information is investigated in Table 2d.

Going deeper with non-local blocks. Table 2c shows the results of more non-local blocks. We add 1 block (to res4), 5 blocks (3 to res4 and 2 to res3, to every other residual block), and 10 blocks (to every residual block in res3 and res4) in ResNet-50; in ResNet-101 we add them to the corresponding residual blocks. Table 2c shows that more non-local blocks in general lead to better results. We argue that multiple non-local blocks can perform long-range multi-hop communication. Messages can be delivered back and forth between distant positions in spacetime, which is hard to do with local models.

 It is noteworthy that the improvement of non-local blocks is not just because they add depth to the baseline model. To see this, we note in Table 2c that the non-local 5-block ResNet-50 model achieves 73.8 accuracy, higher than the 73.1 of the deeper ResNet-101 baseline. However, the 5-block ResNet-50 has only ~70% of the parameters and ~80% of the FLOPs of the ResNet-101 baseline, and is also shallower. This comparison shows that the improvement due to non-local blocks is complementary to going deeper in standard ways.

 We also tried adding standard residual blocks instead of non-local blocks to the baseline model. Accuracy is not improved. This again shows that the improvement of non-local blocks is not just due to their added depth .

Non-local in spacetime . Our method can handle spatiotemporal signals naturally. This is a nice property: related objects in the video can be presented in distant spaces and long time intervals, and their dependencies can be captured by our model.

In Table 2d, we study the effect of non-local blocks applied along space, time, or spacetime. For example, in the space-only version, the non-local dependency only happens within the same frame: i.e., in Eq.(1) it only sums over the index $j$ that is in the same frame as the index $i$. The time-only version can be set up similarly. Table 2d shows that both the space-only and time-only versions improve over the C2D baseline, but are inferior to the spacetime version.
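
One way to realize the space-only and time-only variants is to mask the (THW × THW) affinity matrix before the softmax, so that a position i only attends to positions j in the same frame (space-only) or at the same spatial location (time-only). The sketch below is my own formulation of this idea, not the paper's implementation.

```python
import torch

T, H, W = 4, 7, 7
N = T * H * W
frame_id = torch.arange(T).repeat_interleave(H * W)        # frame index of each position
spatial_id = torch.arange(H * W).repeat(T)                 # spatial index of each position

same_frame = frame_id[:, None] == frame_id[None, :]        # (N, N) space-only mask
same_location = spatial_id[:, None] == spatial_id[None, :] # (N, N) time-only mask

logits = torch.randn(N, N)                                 # theta(x_i)^T phi(x_j) affinities
space_only = torch.softmax(logits.masked_fill(~same_frame, float('-inf')), dim=-1)
time_only = torch.softmax(logits.masked_fill(~same_location, float('-inf')), dim=-1)
print(space_only.shape, time_only.shape)                   # both (N, N)
```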

Non-local net vs. 3D ConvNet. Table 2e compares our non-local C2D version with the inflated 3D ConvNets. Non-local operations and 3D convolutions can be seen as two ways of extending C2D to the temporal dimension.

 Table 2e also compares the parameters and number of FLOPs relative to the baseline. Our non-local C2D model is more accurate than the I3D model (eg, 75.1 vs. 74.4) while having fewer FLOPs (1.2× vs. 1.5×). This comparison shows that our method is more efficient than 3D convolution when used alone .

Non-local 3D ConvNet . Despite the above comparison, non-local operations and 3D convolutions can model different aspects of the problem: 3D convolutions can capture local dependencies. Table 2f shows the results of inserting 5 non-local blocks into the I3D 3×1×1 model. These non-local I3D (NL I3D) models improve upon their I3D counterparts (+1.6 points accuracy), demonstrating that non-local operations and 3D convolution are complementary .

Longer sequences. Finally, we investigate the generality of our models on longer input videos. We use input clips consisting of 128 consecutive frames without subsampling. The sequences throughout all layers in the network are thus 4× longer compared to the 32-frame counterparts. To fit this model into memory, we reduce the mini-batch size to 2 clips per GPU. As a result of the small mini-batches, we freeze all BN layers in this case. We initialize this model from the corresponding model trained with 32-frame inputs. We fine-tune on 128-frame inputs using the same number of iterations as the 32-frame case (though the mini-batch size is now smaller), starting with a learning rate of 0.0025. Other implementation details are the same as before.

 Table 2g shows the results for 128-frame clipping. All models have better results on longer inputs compared to their 32-frame counterparts in Table 2f . Our NL I3D can maintain its gain relative to I3D, indicating that our model works well on longer sequences.

Comparisons with state-of-the-art results. Table 3 shows the results from the I3D authors [7] and from the Kinetics 2017 competition winner [3]. We note that these are comparisons of systems which can differ in many aspects. Nevertheless, our method surpasses all existing RGB or RGB+flow based methods by a good margin. Without using optical flow and without any bells and whistles, our method is on par with the heavily engineered results of the 2017 competition winner.

table 3

Table 3. Comparisons with state-of-the-art results on Kinetics, reported on the validation and test sets. We include the Kinetics 2017 competition winner's results [3], but their best results exploited audio signals (marked in gray), so they were not vision-only solutions. †: “avg” is the average of top-1 and top-5 accuracy; individual top-1 or top-5 numbers were not available from the test server at the time of submitting this manuscript.

5.2. Charades Experiment

 Charades [44] is a video dataset with ~8k training, ~1.8k validation, and ~2k testing videos. It is a multi-label classification task with 157 action categories. We use a per-category sigmoid output to handle the multi-label property.
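
Concretely, a per-category sigmoid corresponds to a binary cross-entropy loss over the 157 classes; a small sketch with placeholder logits and labels:

```python
import torch

num_classes = 157
logits = torch.randn(8, num_classes)                      # per-clip class scores from the network
targets = torch.randint(0, 2, (8, num_classes)).float()   # multi-hot action labels

# Independent per-class sigmoid + binary cross-entropy (instead of a single softmax).
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)
probs = torch.sigmoid(logits)                             # per-class probabilities at test time
print(loss.item(), probs.shape)
```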

 We initialize our models from the models pre-trained on Kinetics (128-frame). The mini-batch size is set to 1 clip per GPU. We train our models for 200k iterations, starting from a learning rate of 0.00125 and reducing it by a factor of 10 every 75k iterations. We use a jittering strategy similar to that in Kinetics to determine the location of the 224×224 cropping window, but we rescale the video such that this cropping window outputs 288×288 pixels, on which we fine-tune our network. We test on a single scale of 320 pixels.

 Table 4 shows the comparison with previous results on Charades. The result of [7] is the 2017 Charades competition winner, which was also fine-tuned from models pre-trained on Kinetics. Our I3D baseline is higher than previous results. As a controlled comparison, our non-local net improves over our I3D baseline by 2.3% on the test set.

Table 4

Table 4. Classification mAP (%) for train/validation split and train/test split in the Charades dataset [44]. Our results are based on ResNet-101. Our NL I3D uses 5 non-local blocks.

6. Extension: COCO Experiment

 We also studied static image recognition models. We conduct COCO [33] object detection/segmentation and human pose estimation (keypoint detection) experiments on the Mask R-CNN baseline [19]. The models were trained on COCO train2017 (i.e. trainval35k in 2014) and tested on val2017 (i.e. minival in 2014).

Object detection and instance segmentation . We modify the Mask R-CNN backbone by adding a non-local block (just before the last residual block of res4). All models are pre-trained and fine-tuned on ImageNet. We evaluate on the standard baseline of ResNet-50/101 and the high baseline of ResNeXt-152 (X152) [53]. Unlike the original paper [19] which adopted stage-wise training on RPN, we use an improved implementation of end-to-end joint training similar to [37], which results in a higher baseline than [19].

 Table 5 shows the box and mask AP on COCO. We see that a single non-local block improves all the R50/101 and X152 baselines on all metrics involving detection and segmentation. APbox is increased by ~1 point in all cases (e.g., +1.3 points in R101). Our non-local block is complementary to increasing model capacity, even when the model is upgraded from R50/101 to X152. This comparison suggests that non-local dependency has not been sufficiently captured by existing models despite their increased depth/capacity.

 Furthermore, the above benefits are obtained at a very small cost. A single non-local block only adds <5% computational effort to the baseline model. We also tried using more non-local blocks to the backbone but found diminishing returns.

table 5

Table 5. Adding 1 non-local block to Mask R-CNN for COCO object detection and instance segmentation. The backbone is ResNet-50/101 or ResNeXt-152 [53], both with FPN [32].

Keypoint detection. Next we evaluate non-local blocks in Mask R-CNN for keypoint detection. In [19], Mask R-CNN used a stack of 8 convolutional layers for predicting the keypoints as 1-hot masks. These layers are local operations and may overlook the dependency among keypoints across long distances. Motivated by this, we insert 4 non-local blocks into the keypoint head (after every 2 convolutional layers).

 Table 6 shows the results on COCO. On a strong baseline of R101, adding 4 non-local blocks to the keypoint head leads to a ~1 point increase of keypoint AP. If we add one extra non-local block to the backbone as done for object detection, we observe in total a 1.4 point increase of keypoint AP over the baseline. In particular, we see that the stricter criterion of AP75 is boosted by 2.4 points, suggesting stronger localization performance.

Table 6

Table 6. Adding non-local blocks to Mask R-CNN for COCO keypoint detection. The backbone is ResNet-101 with FPN [32].

7. Conclusion

We presented a new class of neural networks that capture long-range dependencies via non-local operations. Our non-local blocks can be combined with any existing architecture. We demonstrated the significance of non-local modeling for the tasks of video classification, object detection and segmentation, and pose estimation. On all tasks, a simple addition of non-local blocks provides solid improvements over baselines. We hope non-local layers will become an important component of future network architectures.

References

[1] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. In Proceedings of SIGGRAPH, ACM Transactions on Graphics, 2009. 2
[2] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al Interaction networks for learning about objects, relations and physics. In Neural Information Processing Systems (NIPS), 2016. 2
[3] Y. Bian, C. Gan, X. Liu, F. Li, X. Long, Y. Li, H. Qi, J. Zhou, S. Wen, and Y. Lin. Revisiting the effectiveness of off-theshelf temporal modeling approaches for large-scale video classification. arXiv:1708.03805, 2017. 7
[4] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In Computer Vision and Pattern Recognition (CVPR), 2005. 1, 2, 3
[5] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In Computer Vision and Pattern Recognition (CVPR), 2012. 2
[6] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising with multi-layer perceptrons, part 2: training trade-offs and analysis of their mechanisms. arXiv:1211.1552, 2012. 2
[7] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Computer Vision and Pattern Recognition (CVPR), 2017. 1, 2, 4, 6, 7, 8
[8] S. Chandra, N. Usunier, and I. Kokkinos. Dense and low-rank Gaussian CRFs using deep embeddings. In International Conference on Computer Vision (ICCV), 2017. 2
[9] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv:1412.7062, 2014. 2
[10] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. Transactions on Image Processing (TIP), 2007. 2
[11] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Computer Vision and Pattern Recognition (CVPR), 2015. 2
[12] A. A. Efros and T. K. Leung. Texture synthesis by nonparametric sampling. In International Conference on Computer Vision (ICCV), 1999. 2
[13] C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In Neural Information Processing Systems (NIPS), 2016. 2, 4
[14] K. Fukushima and S. Miyake. Neocognitron: A selforganizing neural network model for a mechanism of visual pattern recognition. In Competition and cooperation in neural nets. Springer, 1982. 1
[15] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning. In International Conference on Machine Learning (ICML), 2017. 2
[16] D. Glasner, S. Bagon, and M. Irani. Super-resolution from a single image. In Computer Vision and Pattern Recognition (CVPR), 2009. 2
[17] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017. 5
[18] A. Harley, K. Derpanis, and I. Kokkinos. Segmentationaware convolutional networks using local attention masks. In International Conference on Computer Vision (ICCV), 2017. 2
[19] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In International Conference on Computer Vision (ICCV), 2017. 2, 8
[20] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In International Conference on Computer Vision (ICCV), 2015. 5
[21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016. 1, 4, 5
[22] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012. 5
[23] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 1997. 1
[24] Y. Hoshen. Multi-agent predictive modeling with attentional commnets. In Neural Information Processing Systems (NIPS), 2017. 2
[25] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015. 5
[26] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. In International Conference on Machine Learning (ICML), 2010. 2
[27] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The Kinetics human action video dataset. arXiv:1705.06950, 2017. 1, 5
[28] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Neural Information Processing Systems (NIPS), 2011. 2
[29] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), 2001. 2
[30] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989. 1
[31] S. Lefkimmiatis. Non-local color image denoising with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017. 2
[32] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Computer Vision and Pattern Recognition (CVPR), 2017. 8
[33] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014. 2, 8
[34] S. Liu, S. De Mello, J. Gu, G. Zhong, M.-H. Yang, and J. Kautz. Learning affinity via spatial propagation networks. In Neural Information Processing Systems (NIPS), 2017. 2
[35] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning (ICML), 2010. 3
[36] A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv:1609.03499, 2016. 2
[37] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017. 8
[38] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 1986. 1
[39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015. 5
[40] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Neural Information Processing Systems (NIPS), 2017. 2, 3
[41] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 2009. 2
[42] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv preprint arXiv:1503.02351, 2015. 2
[43] G. A. Sigurdsson, S. Divvala, A. Farhadi, and A. Gupta. Asynchronous temporal fields for action recognition. In Computer Vision and Pattern Recognition (CVPR), 2017. 8
[44] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision (ECCV), 2016. 1, 5, 8
[45] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Neural Information Processing Systems (NIPS), 2014. 2
[46] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015. 5
[47] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In International Conference on Computer Vision (ICCV), 1998. 3
[48] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In International Conference on Computer Vision (ICCV), 2015. 1, 2, 4
[49] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Neural Information Processing Systems (NIPS), 2017. 2, 3, 6
[50] H. Wang and C. Schmid. Action recognition with improved trajectories. In International Conference on Computer Vision (ICCV), 2013. 2
[51] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In Computer Vision and Pattern Recognition (CVPR), 2015. 2
[52] N. Watters, A. Tacchetti, T. Weber, R. Pascanu, P. Battaglia, and D. Zoran. Visual interaction networks. In Neural Information Processing Systems (NIPS), 2017. 2
[53] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017. 8
[54] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig. The Microsoft 2016 Conversational Speech Recognition System. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017. 2
[55] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Computer Vision and Pattern Recognition (CVPR), 2015. 2
[56] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision (ICCV), 2015. 2
