[Share] Huawei Cloud: Multimodal Fusion Algorithm - Multimodal Compact Bilinear Pooling

Abstract: Many multimodal tasks require fusing the features of two modalities. Feature fusion takes the feature vectors of the two modalities as input and outputs a single fused vector. The most common methods are concatenation, element-wise product, and element-wise sum. The MCB authors argue that these simple operations are less expressive than the outer product and are not sufficient to model the complex relationships between the two modalities. The outer product, however, has a very high computational and memory cost.

Multimodal Compact Bilinear Pooling (MCB) comes from the EMNLP 2016 paper "Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding" [1].

Many multimodal tasks, such as visual question answering (VQA) and visual grounding, require fusing the features of two modalities. Feature fusion takes the feature vectors of the two modalities as input and outputs a single fused vector. The most common methods are concatenation, element-wise product, and element-wise sum. The MCB authors argue that these simple operations are less expressive than the outer product and are not sufficient to model the complex relationships between the two modalities. The outer product, however, has a very high cost: for two n-dimensional vectors, it produces an n^2-dimensional result. MCB was proposed to address this. It maps the result of the outer product into a low-dimensional space without ever computing the outer product explicitly.

MCB

"Bilinear" refers to computing the outer product of two vectors. Bilinear pooling pools the features produced by this bilinear fusion. In [2], bilinear pooling first computes the outer product of the feature vector at each location of the convolutional feature map, then sum-pools these outer products over all locations to obtain a single feature vector x. Finally, a signed square root and L2 normalization are applied to x to obtain the final feature.
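The steps described for [2] can be sketched in a few lines of NumPy. This is only a minimal illustration: the function name `bilinear_pool` and the feature-map size (a 7x7 grid with 8 channels) are assumptions made for the example, not values from the paper.

```python
import numpy as np

def bilinear_pool(features):
    """Bilinear pooling of a (num_locations, channels) feature map.

    Returns a vector of length channels * channels.
    """
    # Summing the per-location outer products x_s x_s^T equals X^T X.
    pooled = features.T @ features
    x = pooled.reshape(-1)
    # Signed square root, followed by L2 normalization.
    x = np.sign(x) * np.sqrt(np.abs(x))
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((49, 8))  # hypothetical: 7x7 locations, 8 channels
z = bilinear_pool(feature_map)
print(z.shape)  # (64,)
```

Even with only 8 channels the output already has 64 dimensions, which shows how quickly the quadratic growth becomes a problem for realistic channel counts.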

However, bilinear features are extremely high-dimensional. Compact Bilinear Pooling (CBP) [3] is a reduced-dimension approximation of bilinear pooling.

Bilinear pooling can be expressed as:

1583315685666799.png

With a linear kernel, this gives:

1583315685904924.png

This is because:

1583315685821705.png
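For readers without access to the images, the relations can be written out as follows. This is a reconstruction in the notation commonly used for compact bilinear pooling [3], not a transcription of the images; the symbols x_s, y_u (local feature vectors of two feature maps X and Y), φ (a low-dimensional feature map), and C (the compact pooled feature) are assumptions about the original notation.

```latex
% Bilinear pooling: sum of outer products over all locations s
B(\mathcal{X}) = \sum_{s \in \mathcal{S}} x_s x_s^{\top}

% Linear kernel between two bilinear-pooled features
\langle B(\mathcal{X}), B(\mathcal{Y}) \rangle
  = \Big\langle \sum_{s} x_s x_s^{\top},\; \sum_{u} y_u y_u^{\top} \Big\rangle
  = \sum_{s} \sum_{u} \langle x_s, y_u \rangle^{2}

% Step used above: \langle x x^{\top}, y y^{\top} \rangle
%   = \operatorname{tr}(x x^{\top} y y^{\top}) = (x^{\top} y)^{2}

% If \langle \phi(x), \phi(y) \rangle \approx \langle x, y \rangle^{2}, then
\langle B(\mathcal{X}), B(\mathcal{Y}) \rangle
  \approx \sum_{s} \sum_{u} \langle \phi(x_s), \phi(y_u) \rangle
  = \langle C(\mathcal{X}), C(\mathcal{Y}) \rangle,
\qquad C(\mathcal{X}) = \sum_{s} \phi(x_s)
```

In other words, comparing two bilinear features with a linear kernel amounts to a second-order polynomial kernel on the local features.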

Any low-dimensional feature map φ that approximates the polynomial kernel can therefore be used to compress bilinear pooling. Tensor Sketch [4] is one such polynomial kernel approximation algorithm, so it can be used for this compression. The algorithm for approximating bilinear pooling with Tensor Sketch is as follows:

1583371195910834.png

The Count Sketch function used here has the following useful properties:

1583372146933604.png1583372146899986.png

1583372146349632.png1583372147708476.png
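The property that MCB relies on most is that the Count Sketch of an outer product equals the circular convolution of the two individual Count Sketches, provided the hash and sign of the pair (i, j) are taken as (h1[i] + h2[j]) mod d and s1[i] * s2[j]. That this is one of the properties pictured above is an assumption; the check below is only a small numerical sketch, with all names and sizes chosen for illustration.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x to d dimensions: y[h[i]] += s[i] * x[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)  # unbuffered add, so repeated indices accumulate
    return y

def circular_convolution(a, b):
    """Direct O(d^2) circular convolution, kept explicit for clarity."""
    d = len(a)
    out = np.zeros(d)
    for i in range(d):
        for j in range(d):
            out[(i + j) % d] += a[i] * b[j]
    return out

rng = np.random.default_rng(0)
n1, n2, d = 6, 5, 8  # hypothetical sizes
x, q = rng.standard_normal(n1), rng.standard_normal(n2)
h1, h2 = rng.integers(0, d, n1), rng.integers(0, d, n2)
s1, s2 = rng.choice([-1.0, 1.0], n1), rng.choice([-1.0, 1.0], n2)

# Sketch of the explicit outer product, using the combined hash and sign.
outer = np.outer(x, q).reshape(-1)
h12 = ((h1[:, None] + h2[None, :]) % d).reshape(-1)
s12 = (s1[:, None] * s2[None, :]).reshape(-1)
lhs = count_sketch(outer, h12, s12, d)

# Convolution of the two individual sketches.
rhs = circular_convolution(count_sketch(x, h1, s1, d),
                           count_sketch(q, h2, s2, d))
print(np.allclose(lhs, rhs))  # True
```

Because the right-hand side never forms the n1 * n2 outer product, the same result can be obtained from two cheap sketches.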

MCB extends CBP so that it can fuse features from different modalities. The MCB computation is shown in Figure 2.

1583310843370247.png

First, the feature vector of each modality is projected by the Count Sketch function to obtain its Count Sketch. The two sketches are then transformed with the FFT, multiplied element-wise in the frequency domain, and transformed back with the inverse FFT to obtain the fused feature.

The detailed steps are given in Algorithm 1.

1583310853663840.png
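The procedure can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the feature sizes, and the output dimension d = 64 are assumptions, and the hash indices and signs are simply sampled once at random and then kept fixed.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x to d dimensions: y[h[i]] += s[i] * x[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def mcb(v1, v2, h1, s1, h2, s2, d):
    """Fuse two feature vectors without forming their outer product."""
    sketch1 = count_sketch(v1, h1, s1, d)
    sketch2 = count_sketch(v2, h2, s2, d)
    # An element-wise product in the frequency domain is a circular
    # convolution of the two sketches in the original domain.
    spectrum = np.fft.rfft(sketch1) * np.fft.rfft(sketch2)
    return np.fft.irfft(spectrum, n=d)

rng = np.random.default_rng(0)
n_img, n_txt, d = 32, 24, 64  # hypothetical sizes
h_img, h_txt = rng.integers(0, d, n_img), rng.integers(0, d, n_txt)
s_img, s_txt = rng.choice([-1.0, 1.0], n_img), rng.choice([-1.0, 1.0], n_txt)

image_feature = rng.standard_normal(n_img)
text_feature = rng.standard_normal(n_txt)
fused = mcb(image_feature, text_feature, h_img, s_img, h_txt, s_txt, d)
print(fused.shape)  # (64,)
```

Using the FFT brings the cost of the convolution down from O(d^2) to O(d log d), and the two inputs are free to have different dimensions.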

VQA

The architecture for applying MCB to VQA is shown below:

1583311878300259.png

Two MCB modules are used here. The first MCB fuses the image features with the text features to compute an attention weight for each spatial location of the image. The second MCB fuses the attended image features with the text features to produce the answer.
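The data flow through the two modules can be sketched roughly as below. This is a simplified illustration under several assumptions: the Count Sketch is written as a dense matrix for brevity, the scoring vector `w_att` and classifier `w_ans` are hypothetical stand-ins for the learned layers of the real model (random here), and all sizes are made up.

```python
import numpy as np

def sketch_matrix(h, s, d):
    """Dense (n, d) form of Count Sketch: row i holds s[i] in column h[i]."""
    m = np.zeros((len(h), d))
    m[np.arange(len(h)), h] = s
    return m

def mcb(a, b, proj_a, proj_b):
    """Fuse a and b (features on the last axis) via FFT of their sketches."""
    d = proj_a.shape[1]
    fa = np.fft.rfft(a @ proj_a, axis=-1)
    fb = np.fft.rfft(b @ proj_b, axis=-1)
    return np.fft.irfft(fa * fb, n=d, axis=-1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
locations, c_img, c_txt, d, n_answers = 9, 16, 12, 32, 5  # hypothetical sizes

def random_sketch(n):
    return sketch_matrix(rng.integers(0, d, n), rng.choice([-1.0, 1.0], n), d)

image = rng.standard_normal((locations, c_img))  # one vector per location
question = rng.standard_normal(c_txt)

# Hypothetical learned parameters, random here.
w_att = rng.standard_normal(d)
w_ans = rng.standard_normal((d, n_answers))

# First MCB: fuse every location with the question and score it.
att_in = mcb(image, question, random_sketch(c_img), random_sketch(c_txt))
weights = softmax(att_in @ w_att)  # one attention weight per location
attended = weights @ image         # attention-weighted image feature

# Second MCB: fuse the attended image feature with the question.
fused = mcb(attended, question, random_sketch(c_img), random_sketch(c_txt))
answer = int(np.argmax(fused @ w_ans))
print(weights.shape, fused.shape)  # (9,) (32,)
```

Each module draws its own hash indices and signs, so the two fusions are independent projections.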

The experimental results for different fusion methods on the VQA dataset are as follows:

1583311879491397.png

Visual Grounding

The architecture for visual grounding is as follows:

1583311878850097.png

Here MCB fuses the text features with the features of each image proposal.

The results on the Flickr30k Entities dataset are as follows:

1583311878854062.png

References

[1] Fukui A, Park D H, Yang D, et al. Multimodal compact bilinear pooling for visual question answering and visual grounding[J]. arXiv preprint arXiv:1606.01847, 2016.

[2] Lin T Y, RoyChowdhury A, Maji S. Bilinear cnn models for fine-grained visual recognition[C]//Proceedings of the IEEE international conference on computer vision. 2015: 1449-1457.

[3] Gao Y, Beijbom O, Zhang N, et al. Compact bilinear pooling[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 317-326.

[4] Pham N, Pagh R. Fast and scalable polynomial kernels via explicit feature maps[C]//Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. 2013: 239-247.

[5] https://zhuanlan.zhihu.com/p/62532887

Author: sugar Ning Meng

Origin www.cnblogs.com/huaweicloud/p/12523622.html