[Paper Notes] Attentional Pooling for Action Recognition

Brief introduction

This is a NIPS paper. Its highlight is a second-order pooling: the pooling is expressed through a weight matrix, which is then given a low-rank decomposition so that the two factors admit a bottom-up and a top-down interpretation, cleverly framed as an attention mechanism. I feel I learned a lot, especially the matrix-theory side of low-rank and tensor decomposition.

Basic concepts

Low rank decomposition

Objective: remove redundancy and reduce the number of weight parameters in the model.

Method: replace a single K × K convolution kernel with two kernels of sizes K × 1 and 1 × K.

Principle: the weight vectors mostly lie in a low-rank subspace, so a small number of basis vectors is enough to recover the weight matrix.

Mathematical formula (as used in the paper):
\[
W \in R^{f \times f} \\
W \text{ can be decomposed as } W = ab^{T} \\
\text{where } a, b \in R^{f \times 1}
\]
In this way, two f × 1 matrices represent the original f × f matrix. The paper performs a rank-1 decomposition, so the factors are the vectors a and b; of course, more experiments would be needed to determine which rank is most appropriate.
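As a quick illustration of the idea (the matrix size, rank, and variable names below are my own, not from the paper), the best rank-1 approximation of a weight matrix can be obtained with a truncated SVD:

import torch

# Hypothetical f x f weight matrix; the rank is chosen for illustration.
f, rank = 128, 1
W = torch.randn(f, f)

# Truncated SVD gives the best rank-r approximation in the Frobenius norm.
U, S, Vh = torch.linalg.svd(W)
a = U[:, :rank] * S[:rank]      # (f, rank)
b = Vh[:rank, :].T              # (f, rank)
W_approx = a @ b.T              # rank-r reconstruction of W

print(torch.linalg.matrix_rank(W_approx))        # tensor(1) when rank = 1
print(torch.norm(W - W_approx) / torch.norm(W))  # relative approximation error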

Ordinary pooling

Ordinary pooling can be written with the following formula (n = 16 × 16 = 256 is the product of the feature map's width and height, and f is the number of feature channels):
\[
score_{pool}(X) = 1^{T}Xw \\
\text{where } X \in R^{n \times f},\ 1 \in R^{n \times 1},\ w \in R^{f \times 1} \\
\text{Written out,} \\
1^{T}X =
\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}_{n \times 1}^{T}
\times
\begin{bmatrix}
x_{1,1} & \cdots & x_{1,f} \\
x_{2,1} & \cdots & x_{2,f} \\
\vdots & \vdots & \vdots \\
x_{n,1} & \cdots & x_{n,f}
\end{bmatrix}_{n \times f}
\]

This can be understood as first summing the features over the spatial dimension, which gives f values, and then taking a weighted sum of those values with the weight vector w to obtain the final pooling result, which is a scalar.

Average pooling is a special case of this general formula: X is the n × f feature matrix, the same operation is performed for each channel, and both w and 1 are constant matrices. For max pooling, the 1 vector is not all ones; instead, only the position corresponding to the maximum value is 1.
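A tiny numerical sketch of this formula (the sizes and tensors below are made up for illustration), checking that average pooling followed by a linear score matches 1^T X w up to the 1/n factor:

import torch

n, f = 16 * 16, 128            # spatial positions and feature channels
X = torch.randn(n, f)          # flattened feature map
w = torch.randn(f, 1)          # pooling / classifier weights
ones = torch.ones(n, 1)

score = ones.T @ X @ w         # 1^T X w, a 1x1 result (a scalar score)

# Average pooling followed by a linear score is the same up to the 1/n factor.
avg_then_score = X.mean(dim=0) @ w
print(torch.allclose(score.squeeze() / n, avg_then_score.squeeze(), atol=1e-5))  # True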

Second-order pooling

The paper proposes a second-order pooling method, as follows:

Here I copied the formula directly from the paper as a screenshot rather than typing it out myself.
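Since the screenshot did not carry over here, this is my reconstruction of the second-order score from the surrounding discussion (the exact notation and equation numbering in the paper may differ):
\[
score^{2}_{pool}(X) = Tr(XWX^{T}), \quad X \in R^{n \times f},\ W \in R^{f \times f}
\]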

The paper also points out that second-order pooling helps with fine-grained classification. Applying the low-rank decomposition to W, the formula becomes:
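Again reconstructing from the text: substituting the rank-1 decomposition W = ab^T and using the cyclic property of the trace,
\[
score^{2}_{pool}(X) = Tr(Xab^{T}X^{T}) = Tr(b^{T}X^{T}Xa) = (Xb)^{T}(Xa) = a^{T}\left(X^{T}(Xb)\right)
\]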

The properties of the trace Tr applied to the feature matrix are not explained here; essentially it is the cyclic property Tr(AB) = Tr(BA).

So what is the benefit of decomposing it this way? I think this is the essence of the paper: the formula can be interpreted both top-down and bottom-up, and this interpretability earns the paper a lot of extra points.

Bottom-up interpretation

Looking at equation (6), the first thing computed is Xb, which is an n × 1 matrix. This matrix can be viewed as an attention map: the author explains it as an attention map generated bottom-up from the features during feature mapping, used to assess how important each spatial position is. Moreover, b is shared across all classes, which is why it is interpreted bottom-up, whereas a is learned specifically for each class, which is why it is interpreted top-down.

Top-down interpretation

As mentioned above, the top-down interpretation mainly concerns a. The formula above can already be interpreted bottom-up, but the author goes one step further and simplifies it:

Here the formula is turned into equation (8), which is actually more intuitive: Xa gives the top-down, class-specific result, and Xb gives the bottom-up, class-agnostic result; multiplying the two gives the final score. I find this decomposition wonderful and very interpretable.
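A quick numerical sanity check (with made-up sizes and tensors) that the rank-1 second-order score equals this attention-style product:

import torch

n, f = 49, 64                  # illustrative sizes
X = torch.randn(n, f)
a = torch.randn(f, 1)          # top-down, class-specific
b = torch.randn(f, 1)          # bottom-up, class-agnostic
W = a @ b.T                    # rank-1 weight matrix

second_order = torch.trace(X @ W @ X.T)       # Tr(X W X^T), full second-order score
factored = ((X @ a).T @ (X @ b)).squeeze()    # (Xa)^T (Xb), attention-style computation
print(torch.allclose(second_order, factored, atol=1e-4))  # True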

Extension - tensor decomposition

See blog: http://www.xiongfuli.com/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/2016-06/tensor-decomposition-cp.html

CP decomposition

CP decomposition gives a low-rank approximation of a tensor, so a lot of work can be built on it.
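As a rough sketch of what CP decomposition does (this is my own minimal alternating-least-squares implementation, not taken from the paper or the blog above; function and variable names are made up):

import torch

def cp_als(X, rank, n_iter=100):
    # Minimal CP (PARAFAC) decomposition of a 3-way tensor by alternating
    # least squares: X[i, j, k] ~ sum_r A[i, r] * B[j, r] * C[k, r].
    # Bare-bones sketch: random init, fixed iteration count, no convergence check.
    I, J, K = X.shape
    A, B, C = (torch.randn(d, rank) for d in (I, J, K))
    for _ in range(n_iter):
        M = torch.einsum('jr,kr->jkr', B, C).reshape(J * K, rank)
        A = X.reshape(I, J * K) @ M @ torch.linalg.pinv((B.T @ B) * (C.T @ C))
        M = torch.einsum('ir,kr->ikr', A, C).reshape(I * K, rank)
        B = X.permute(1, 0, 2).reshape(J, I * K) @ M @ torch.linalg.pinv((A.T @ A) * (C.T @ C))
        M = torch.einsum('ir,jr->ijr', A, B).reshape(I * J, rank)
        C = X.permute(2, 0, 1).reshape(K, I * J) @ M @ torch.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Build a tensor that is exactly rank 2, then check the reconstruction error.
A0, B0, C0 = torch.randn(6, 2), torch.randn(7, 2), torch.randn(8, 2)
X = torch.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(X, rank=2)
X_hat = torch.einsum('ir,jr,kr->ijk', A, B, C)
print(torch.norm(X - X_hat) / torch.norm(X))  # should be close to 0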

Tucker decomposition

Tucker decomposition is similar; the blog post above explains it in more detail. It can also be used for low-rank approximation.

So this is a fascinating direction, and I think a lot can be built on it later, but it requires strong mathematical skills.

Network architecture

One figure from the paper is enough to explain it. Because the benchmark also involves human pose, the network structure is designed around pose. The structure is very simple: the paper gives two methods, and we look at method 2 here.

The features are mapped directly to 17 channels: the first 16 are pose maps used to predict the keypoints, and the last one is the attention map. Here the attention map is a by-product of the pose maps, i.e. the pose maps help the classification; this attention map is then multiplied with Xa, and the result is used for the final classification. That is the idea.
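Here is a minimal sketch of how I read method 2 (the channel counts, class count, and the exact way the attention channel is combined with Xa are my assumptions, not taken from the paper):

import torch
import torch.nn as nn

class PoseRegularizedHead(nn.Module):
    def __init__(self, in_channels=128, num_keypoints=16, num_classes=10):
        super().__init__()
        # 1x1 conv to 17 channels: 16 pose keypoint heatmaps + 1 attention map
        self.pose_conv = nn.Conv2d(in_channels, num_keypoints + 1, kernel_size=1)
        # a: one top-down weight vector per class
        self.a = nn.Parameter(torch.randn(num_classes, in_channels))

    def forward(self, x):                         # x: (B, C, H, W)
        B, C, H, W = x.shape
        maps = self.pose_conv(x)                  # (B, 17, H, W)
        pose_maps = maps[:, :-1]                  # (B, 16, H, W), supervised with keypoints
        attn = maps[:, -1:].reshape(B, 1, H * W)  # (B, 1, HW), plays the role of Xb
        xa = torch.einsum('bchw,kc->bkhw', x, self.a).reshape(B, -1, H * W)  # (B, K, HW)
        scores = (xa * attn).sum(dim=-1)          # (B, K), attention-weighted class scores
        return scores, pose_maps

head = PoseRegularizedHead()
scores, pose_maps = head(torch.randn(2, 128, 64, 64))
print(scores.shape, pose_maps.shape)  # torch.Size([2, 10]) torch.Size([2, 16, 64, 64])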

I did not read the experiments closely, since I do not work on this task.

Conclusion

Second-order pooling gives a richer description of local features.

Low-rank decomposition can be interpreted as attention.

Coding

Let me implement method 2 from the paper.

'''
@Descripttion: This is Aoru Xue's demo, which is only for reference.
@version: 
@Author: Aoru Xue
@Date: 2019-10-27 13:11:23
@LastEditors: Aoru Xue
@LastEditTime: 2019-10-27 13:18:40
'''
import torch
import torch.nn as nn
from torchsummary import summary
class AttentionalPooling(nn.Module):
    def __init__(self):
        super(AttentionalPooling, self).__init__()
        # 1x1 conv producing the 16 pose maps of method 2
        # (kept for the summary below; not used further in this simplified demo)
        self.conv = nn.Conv2d(128, 16, kernel_size=1)
        # a: top-down, class-specific weights (10 classes, 128 channels)
        # b: bottom-up, class-agnostic weights
        # Random tensors are enough for this shape demo; in a real model they
        # would be nn.Parameter so that they can be learned.
        self.a = torch.randn(1, 10, 128, 1)
        self.b = torch.randn(1, 128, 1)

    def forward(self, x):                 # x: (batch, 128, 64, 64)
        pose_maps = self.conv(x)          # (batch, 16, 64, 64), unused below
        # Xb: (batch, 64*64, 128) @ (1, 128, 1) -> (batch, 64*64, 1), the attention map
        xb = x.permute(0, 2, 3, 1).contiguous().view(-1, 64 * 64, 128) @ self.b
        # Xa: (batch, 1, 64*64, 128) @ (1, 10, 128, 1) -> (batch, 10, 64*64, 1), one map per class
        xa = x.permute(0, 2, 3, 1).contiguous().view(-1, 1, 64 * 64, 128) @ self.a
        xa = xa.permute(0, 1, 3, 2).contiguous().view(-1, 10, 1, 4096)
        xb = xb.view(-1, 1, 4096, 1)
        # Final per-class score: (Xa)^T (Xb)
        output = xa @ xb                  # (batch, 10, 1, 1)
        print(output.size())
        return output.view(-1, 10)

if __name__ == "__main__":
    net = AttentionalPooling()
    summary(net, (128, 64, 64), device="cpu")  # feature X of shape (128, 64, 64)

    
    '''
    torch.Size([2, 10, 1, 1])
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 16, 64, 64]           2,064
================================================================
Total params: 2,064
Trainable params: 2,064
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 2.00
Forward/backward pass size (MB): 0.50
Params size (MB): 0.01
Estimated Total Size (MB): 2.51
----------------------------------------------------------------
    '''

Original paper: https://arxiv.org/pdf/1711.01467v2.pdf

Origin: www.cnblogs.com/aoru45/p/11747275.html