CNN: translation invariance, scale invariance, and absolute position

Does a CNN really have translation and scale invariance/equivariance?

References:

https://zhuanlan.zhihu.com/p/113443895

Paper: How Much Position Information Do Convolutional Neural Networks Encode?

The figure shows three groups of images, each consisting of an original image and a cropped version. Cropping visibly changes the saliency map, because the position of the image center moves and activations are strongest at the center.


This paper was accepted at ICLR 2020. Previously, positional information drew interest mainly in NLP tasks, because the same token carries different semantics at different positions in a text.

CV, however, seemed to have no such demand; it was commonly agreed that CNNs possess translation invariance. Of the three main perception tasks in CV: classification needs no position information; semantic segmentation is concerned with pixel-level semantic classification and seems not to need it either (position information is in fact useful there); and you might think object detection would use position information.

However, object detection models work by classifying features at anchors, with the final coordinates derived from the anchors. The key decoupling trick is to regress the offset of the box relative to the anchor, i.e., to turn absolute positions into positions relative to anchors. The network itself therefore never needs to know an object's absolute position; position information enters only as a hand-crafted prior in the coordinate transforms applied before and after the network.
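As a concrete illustration of this decoupling, here is a minimal sketch of anchor-box decoding in the Faster R-CNN style parameterization (my choice of example; the post does not name a specific detector). The network's regression output never contains absolute coordinates; they only reappear when the deltas are combined with the anchors:

```python
import numpy as np

def decode_boxes(anchors, deltas):
    """Convert anchor-relative offsets (dx, dy, dw, dh) back to absolute
    (x1, y1, x2, y2) boxes, in the Faster R-CNN style parameterization.
    anchors: (N, 4) absolute boxes; deltas: (N, 4) network outputs."""
    widths = anchors[:, 2] - anchors[:, 0]
    heights = anchors[:, 3] - anchors[:, 1]
    ctr_x = anchors[:, 0] + 0.5 * widths
    ctr_y = anchors[:, 1] + 0.5 * heights

    dx, dy, dw, dh = deltas.T
    # The network predicts center shifts in units of the anchor size and
    # log-scale changes of width/height: the regression target itself is
    # purely relative, so absolute position is never needed by the network.
    pred_ctr_x = ctr_x + dx * widths
    pred_ctr_y = ctr_y + dy * heights
    pred_w = widths * np.exp(dw)
    pred_h = heights * np.exp(dh)

    return np.stack([pred_ctr_x - 0.5 * pred_w,
                     pred_ctr_y - 0.5 * pred_h,
                     pred_ctr_x + 0.5 * pred_w,
                     pred_ctr_y + 0.5 * pred_h], axis=1)

anchor = np.array([[10., 10., 30., 30.]])   # one 20x20 anchor
delta = np.array([[0., 0., 0., 0.]])        # zero offset -> recovers the anchor
print(decode_boxes(anchor, delta))          # → [[10. 10. 30. 30.]]
```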

If we consider invariance and equivariance only in terms of input and output, it may be hard to grasp, because we tend to picture the mapping at the level of feature maps.

How is translation invariance obtained?

The effect of downsampling on translation invariance

The first is an ICML 2019 paper, "Making Convolutional Networks Shift-Invariant Again", on the loss of translation invariance in CNNs. It discusses how downsampling harms the translation invariance of CNN networks.

How to improve

The authors adopt a blurring (anti-aliasing) approach, proposing three different blur kernels:

Rectangle-2: [1, 1], similar to average pooling and nearest-neighbor interpolation;
Triangle-3: [1, 2, 1], similar to bilinear interpolation;
Binomial-5: [1, 4, 6, 4, 1], the kernel used in the Laplacian pyramid.
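The idea can be shown in a tiny 1-D sketch (my own toy example, not code from the paper): replace a plain stride-2 max pool with a dense max, a blur with one of the kernels above, then subsampling. Under a one-pixel shift of the input, the blurred version changes less:

```python
import numpy as np

def max_blur_pool(x, kernel):
    """MaxBlurPool in 1-D: dense max (stride 1), blur, then subsample by 2.
    This replaces a plain stride-2 max pool, as proposed by the paper."""
    k = np.asarray(kernel, dtype=float)
    k = k / k.sum()                          # normalize the blur kernel
    dense_max = np.maximum(x[:-1], x[1:])    # max pool, window 2, stride 1
    blurred = np.convolve(dense_max, k, mode='same')
    return blurred[::2]                      # subsample by 2

def plain_max_pool(x):
    return np.maximum(x[:-1], x[1:])[::2]    # window 2, stride 2

x = np.array([0., 0., 1., 1., 0., 0., 1., 1.])
shifted = np.roll(x, 1)
# A plain pool can change drastically under a 1-pixel shift...
print(plain_max_pool(x), plain_max_pool(shifted))
# ...while the blurred version (Triangle-3 kernel [1, 2, 1]) varies less.
print(max_blur_pool(x, [1, 2, 1]), max_blur_pool(shifted, [1, 2, 1]))
```

On this input, the plain pool flips from [0, 1, 0, 1] to all ones after the shift, while the blurred outputs stay much closer to each other.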

The impact of translation, scaling, and slight differences on the network's classification confidence

The second is a paper published in JMLR the same year, "Why do deep convolutional networks generalize so poorly to small image transformations?". The author first gives several groups of examples showing the effect of translation, scaling, and slight frame-to-frame differences on the network's classification confidence:

How is position information obtained?

In monocular depth estimation, a CNN may estimate the depth of an object from its vertical coordinate (ordinate) in the image.
https://zhuanlan.zhihu.com/p/95758284

Zero-padding leaks position information.

In fact, the CoordConv paper reported a similar finding: on a simple coordinate-mapping task, the gap between an ordinary conv and CoordConv is the difference between a score of 80 and 100, not between 0 and 100. A later discussion with teacher Chen Chunhua led to the guess that it is the zero-padding that leaks the position information, though this was not verified experimentally at the time. The conjecture is quite natural: during training and testing, the network has only two external inputs, the image and the padding. Since the image itself carries no position information, the padding must be responsible.

CNNs implicitly encode position information, and as the number of layers and the kernel size (i.e., the receptive field) increase, they encode it better. The position information comes from zero-padding: the zeros padded at the image edges tell the network where the image boundary is. By itself the network has no way to know the position of any pixel or feature point; zero-padding supplies relative position information, so that every feature point knows its distance to the boundary.
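This can be demonstrated with a naive 'same' convolution (a toy sketch of my own, not an experiment from the paper): even on a constant image, which carries no position information at all, the output already varies near the border, purely because of the padded zeros.

```python
import numpy as np

def conv2d_same(img, kernel):
    """Naive 'same' convolution (correlation, stride 1) with zero padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))   # zero padding at the edges
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (padded[i:i+kh, j:j+kw] * kernel).sum()
    return out

# A constant image carries no position information by itself...
img = np.ones((6, 6))
kernel = np.ones((3, 3)) / 9.0               # simple 3x3 averaging filter
out = conv2d_same(img, kernel)
# ...yet the output differs near the border: a corner window covers only
# 4 real pixels (4/9), an edge window 6 (6/9), an interior window all 9.
print(out)
```

Stacking such layers lets this border signal diffuse inward, which is exactly how a deep network can recover coarse absolute position.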

A sufficiently large network (or a stack of layers with large kernels) can diffuse the boundary information exposed by padding inward from the border and obtain coarse global position information.

Although current CNN models implicitly learn position information to some degree, it is clearly not sufficient. How to make better use of absolute position information is well worth further exploration; CoordConv [5] and semi-conv [6] are good first steps.

The most direct approach, of course, is to concat the coordinates of each pixel onto the input or an intermediate feature map; this simple trick alone brings a 3.6 AP gain to the segmentation results of SOLO [3]. But I suspect there are ways to exploit image position information more fully.
[3] Wang, X., Kong, T., Shen, C., Jiang, Y., & Li, L. (2019). SOLO: Segmenting Objects by Locations.
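The coordinate-concat trick is easy to sketch (a minimal NumPy version, assuming the common CoordConv convention of normalizing coordinates to [-1, 1]):

```python
import numpy as np

def add_coord_channels(feat):
    """CoordConv-style augmentation: append normalized y/x coordinate
    channels to a feature map of shape (C, H, W), giving (C + 2, H, W).
    An ordinary convolution applied afterwards can then read absolute
    position directly instead of inferring it from zero-padding."""
    _, h, w = feat.shape
    ys = np.linspace(-1.0, 1.0, h)            # normalized row coordinate
    xs = np.linspace(-1.0, 1.0, w)            # normalized column coordinate
    y_ch = np.repeat(ys[:, None], w, axis=1)  # (H, W), varies down rows
    x_ch = np.repeat(xs[None, :], h, axis=0)  # (H, W), varies across cols
    return np.concatenate([feat, y_ch[None], x_ch[None]], axis=0)

feat = np.zeros((8, 4, 4))                    # e.g. an 8-channel 4x4 map
aug = add_coord_channels(feat)
print(aug.shape)                              # → (10, 4, 4)
```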

The CVPR 2020 paper "On Translation Invariance in CNNs: Convolutional Layers Can Exploit Absolute Spatial Location" also addresses both the translation-invariance problem and the absolute-position-encoding problem in CNNs, taking the boundary issue in CNNs as its entry point.

Starting from the convolution operation



Origin blog.csdn.net/qq_35608277/article/details/105241864