CoordConv, Uber, 2018 [1] [2]

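In the paper An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution, the authors study a common failing of convolutional neural networks: they struggle to map between a spatial representation and coordinates in Cartesian space or coordinates in one-hot pixel space. This is surprising, because the task looks trivial. It also matters, because such coordinate transforms are needed for common problems such as object detection in images, training generative models, and training reinforcement learning agents. The authors find that these tasks have, to varying degrees, been held back by the convolutional structure. To improve performance they propose a fix called CoordConv and show results in several domains.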

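Supervised Coordinate Classification task: the input is (x, y), and the output is an image in which only pixel (x, y) is white (1) and every other pixel is black (0). A plain convolutional model reaches only about 80% accuracy, while adding CoordConv brings accuracy close to 100%.
Supervised Rendering task: the input is (x, y), and the output is an image with a white square centered at (x, y) and every other pixel black.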
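
To make the two toy tasks concrete, here is a minimal NumPy sketch of how the target images could be generated. The function names, the 64-pixel canvas, and the square half-width are assumptions chosen for illustration; the notes above do not state the sizes used, and this is not the authors' code.

```python
import numpy as np

def onehot_target(x, y, size=64):
    """Coordinate Classification target: a single white pixel at (x, y)."""
    img = np.zeros((size, size), dtype=np.float32)
    img[y, x] = 1.0  # row index is y, column index is x
    return img

def square_target(x, y, size=64, half=4):
    """Rendering target: a white square centered at (x, y).

    `half` is a hypothetical parameter: the square spans 2 * half + 1 pixels
    per side and is clipped at the image border.
    """
    img = np.zeros((size, size), dtype=np.float32)
    top, left = max(y - half, 0), max(x - half, 0)
    img[top:y + half + 1, left:x + half + 1] = 1.0
    return img
```

A model for either task takes the pair (x, y) as input and is trained to reproduce the corresponding image.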

In addition:

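Object Detection: on an MNIST detection dataset, adding CoordConv improved IoU by 24% compared with the same model without it.
Generative Model: the hypothesis here is that mode collapse arises because the latent space fails to learn spatial relationships well, in which case CoordConv might help. This feels like reasoning backwards from the result.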

The method: add two channels holding the (i, j) coordinates to the input layer, preprocessed to lie in [-1, 1]. Some experiments also add an r coordinate as a third channel.
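
The channel construction described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function name `add_coord_channels`, the (N, C, H, W) layout, and the definition of r as the distance from the image center in normalized coordinates are all assumptions.

```python
import numpy as np

def add_coord_channels(x, with_r=False):
    """Append normalized (i, j) coordinate channels to a float batch (N, C, H, W)."""
    n, _, h, w = x.shape
    # Row (i) and column (j) coordinates, each scaled linearly to [-1, 1].
    i = np.linspace(-1.0, 1.0, h, dtype=x.dtype)
    j = np.linspace(-1.0, 1.0, w, dtype=x.dtype)
    ii, jj = np.meshgrid(i, j, indexing="ij")  # both have shape (H, W)
    extra = [ii, jj]
    if with_r:
        # Assumed definition: distance from the center, which sits at (0, 0)
        # once the coordinates are normalized.
        extra.append(np.sqrt(ii ** 2 + jj ** 2))
    coords = np.stack(extra)[None]  # (1, K, H, W)
    coords = np.broadcast_to(coords, (n,) + coords.shape[1:])
    return np.concatenate([x, coords], axis=1)
```

The result is then fed to an ordinary convolution whose input channel count is increased by two (or three with r), so the filters can read the position of each pixel directly.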

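A simple and interesting piece of work. My understanding is that it amounts to adding a positional encoding to the (first) Conv layer, somewhat like the position embedding used in NLP language models. CoordConv is rather blunt, though: it is an absolute positional encoding and does not consider relative positional encoding, so it gives up the translation invariance of Conv. I think that with a relative positional encoding, and with the corresponding weights learned, translation invariance could still be achieved. (This Uber paper also drew criticism; see the article "Uber发布的CoordConv遭深度质疑，“翻译个坐标也需要训练？”", roughly "Uber's CoordConv comes under heavy scrutiny: 'You need training just to translate a coordinate?'".)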

Learning rate, batchsize and minima [3]

Stochastic gradient descent is no different, and recent work suggests that the procedure is really a Markov chain that, under certain assumptions, has a stationary distribution that can be seen as a sort of variational approximation to the posterior. So when you stop your SGD and take the final parameters, you’re basically sampling from this approximate distribution. I found this idea to be illuminating, because the optimizer’s parameters (in this case, the learning rate) make so much more sense that way.

As an example, as you increase the learning parameter (learning rate) of SGD, the Markov chain becomes unstable until it finds wide local minima, where it samples a large area; that is, you increase the variance of the procedure. On the other hand, if you decrease the learning parameter, the Markov chain slowly approximates narrower minima until it converges in a tight region; that is, you increase the bias toward a certain region.

Another parameter, the batch size in SGD, also controls what type of region the algorithm converges to: wider regions for small batches and sharper regions for larger batches.

SGD prefers wide or sharp minima depending on its learning rate or batch size. Wider minima: large learning rate, small batch size, i.e., large variance. Sharp minima: small learning rate, large batch size, i.e., small variance.
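
The variance half of this picture can be checked on a toy problem. The sketch below is an assumption-laden illustration, not something taken from [3]: it runs SGD on a one-dimensional quadratic loss 0.5 * a * x^2 with Gaussian gradient noise whose variance is divided by the batch size, and measures how widely the iterates spread once the chain has settled. It says nothing about wide versus sharp minima as such; it only shows how the learning rate and batch size set the variance of the stationary distribution. For this linear recursion the stationary variance works out to lr * sigma^2 / (B * a * (2 - lr * a)), so it grows with the learning rate and shrinks with the batch size.

```python
import numpy as np

def sgd_spread(lr, batch_size, steps=20000, curvature=1.0, noise_std=1.0, seed=0):
    """Empirical variance of SGD iterates on 0.5 * curvature * x**2 (toy model)."""
    rng = np.random.default_rng(seed)
    # Averaging per-example gradients over a batch divides the noise variance
    # by batch_size (assumes independent per-example noise).
    noise = rng.normal(0.0, noise_std / np.sqrt(batch_size), size=steps)
    x = 0.0
    xs = np.empty(steps)
    for t in range(steps):
        grad = curvature * x + noise[t]  # noisy gradient estimate
        x -= lr * grad
        xs[t] = x
    # Discard the first half as burn-in, then measure the spread.
    return xs[steps // 2:].var()

if __name__ == "__main__":
    for lr, batch in [(0.1, 1), (0.01, 1), (0.1, 16)]:
        print(f"lr={lr}, batch={batch}: variance={sgd_spread(lr, batch):.5f}")
```

Running the script prints the measured variance for three settings; the large learning rate with batch size 1 should give the widest spread of the three.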

An Empirical Model of Large-Batch Training

References

[1] https://xiaosean.github.io/deep%20learning/computer%20vision/2018-12-23-CoordConv/
[2] "Uber提出CoordConv：解决普通CNN坐标变换问题" (Uber proposes CoordConv: solving the coordinate transform problem of ordinary CNNs)
[3] http://hyperparameter.space/blog/when-not-to-use-deep-learning/