[Paper Translation] Network In Network

Paper (2014): link

Abstract

We propose a novel deep network structure called “Network In Network” (NIN) to enhance model discriminability for local patches within the receptive field. The conventional convolutional layer uses linear filters followed by a nonlinear activation function to scan the input. Instead, we build micro neural networks with more complex structures to abstract the data within the receptive field. We instantiate the micro neural network with a multilayer perceptron, which is a potent function approximator. The feature maps are obtained by sliding the micro networks over the input in a similar manner as CNN; they are then fed into the next layer. Deep NIN can be implemented by stacking multiple of the above described structures. With enhanced local modeling via the micro network, we are able to utilize global average pooling over feature maps in the classification layer, which is easier to interpret and less prone to overfitting than traditional fully connected layers. We demonstrated the state-of-the-art classification performances with NIN on CIFAR-10 and CIFAR-100, and reasonable performances on SVHN and MNIST datasets.

We propose a novel deep network structure called “Network In Network” (NIN), which enhances the abstraction ability of the network's local modules. The conventional convolutional layer scans the input with linear filters followed by a nonlinear activation function. Instead, we build micro neural networks with more complex structures to abstract the data within the receptive field. We instantiate this micro network with a multilayer perceptron (MLP), which is an effective function approximator. The feature maps are obtained by sliding the micro network over the input in a manner similar to CNN; they are then fed into the next layer. A deep NIN can be built by stacking multiple of the above structures. While the micro network strengthens local modeling, we apply global average pooling to the feature maps in the classification layer (the last layer), which is easier to interpret and less prone to overfitting than traditional fully connected layers. We demonstrate state-of-the-art classification performance with NIN on CIFAR-10 and CIFAR-100, and reasonable performance on the SVHN and MNIST datasets.

1 Introduction

Convolutional neural networks (CNNs) [1] consist of alternating convolutional layers and pooling layers. Convolution layers take inner product of the linear filter and the underlying receptive field followed by a nonlinear activation function at every local portion of the input. The resulting outputs are called feature maps.

Convolutional neural networks (CNNs) [1] consist of alternating convolutional layers and pooling layers. The convolutional layers take the inner product of a linear filter and the underlying receptive field, followed by a nonlinear activation function at every local portion of the input. The resulting outputs are called feature maps.


The convolution filter in CNN is a generalized linear model (GLM) for the underlying data patch, and we argue that the level of abstraction is low with GLM. By abstraction we mean that the feature is invariant to the variants of the same concept [2]. Replacing the GLM with a more potent nonlinear function approximator can enhance the abstraction ability of the local model. GLM can achieve a good extent of abstraction when the samples of the latent concepts are linearly separable, i.e. the variants of the concepts all live on one side of the separation plane defined by the GLM. Thus conventional CNN implicitly makes the assumption that the latent concepts are linearly separable. However, the data for the same concept often live on a nonlinear manifold, therefore the representations that capture these concepts are generally highly nonlinear functions of the input. In NIN, the GLM is replaced with a “micro network” structure which is a general nonlinear function approximator. In this work, we choose multilayer perceptron [3] as the instantiation of the micro network, which is a universal function approximator and a neural network trainable by back-propagation.

The convolution filter in a CNN is a generalized linear model (GLM) for the underlying data patch, and we argue that the level of abstraction achieved by a GLM is low. By abstraction we mean that the feature is invariant to variants of the same concept [2]. Replacing the GLM with a more potent nonlinear function approximator can enhance the abstraction ability of the local model. A GLM can achieve a good degree of abstraction when the samples of the latent concepts are linearly separable, i.e. the variants of the concepts all lie on one side of the separating plane defined by the GLM. Thus a conventional CNN implicitly assumes that the latent concepts are linearly separable. However, data for the same concept usually live on a nonlinear manifold, so the representations that capture these concepts are generally highly nonlinear functions of the input. In NIN, the GLM is replaced with a “micro network” structure, which is a general nonlinear function approximator. In this work we choose the multilayer perceptron [3] as the instantiation of the micro network; it is a universal function approximator and a neural network trainable by back-propagation.


The resulting structure which we call an mlpconv layer is compared with CNN in Figure 1. Both the linear convolutional layer and the mlpconv layer map the local receptive field to an output feature vector. The mlpconv maps the input local patch to the output feature vector with a multilayer perceptron (MLP) consisting of multiple fully connected layers with nonlinear activation functions. The MLP is shared among all local receptive fields. The feature maps are obtained by sliding the MLP over the input in a similar manner as CNN and are then fed into the next layer. The overall structure of the NIN is the stacking of multiple mlpconv layers. It is called “Network In Network” (NIN) as we have micro networks (MLP), which are composing elements of the overall deep network, within mlpconv layers.

We call the resulting structure an mlpconv layer and compare it with CNN in Figure 1. Both the linear convolutional layer and the mlpconv layer map the local receptive field to an output feature vector. The mlpconv layer maps the input local patch to the output feature vector with a multilayer perceptron (MLP) consisting of multiple fully connected layers with nonlinear activation functions. The MLP is shared among all local receptive fields. The feature maps are obtained by sliding the MLP over the input in a manner similar to CNN and are then fed into the next layer. The overall structure of NIN is a stack of multiple mlpconv layers. It is called “Network In Network” (NIN) because the micro networks (MLPs) inside the mlpconv layers are the composing elements of the overall deep network.
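As an illustration of this sliding-MLP view (a minimal PyTorch sketch with illustrative sizes, not code from the paper): every KxK patch is extracted, one shared MLP is applied to each patch vector, and the outputs are reshaped back into feature maps. In practice the same computation is expressed more efficiently with 1x1 convolutions, as discussed in Section 3.1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

k, stride = 3, 1
in_ch, hidden, out_ch = 3, 64, 96

# One MLP shared by every local receptive field.
mlp = nn.Sequential(
    nn.Linear(in_ch * k * k, hidden), nn.ReLU(),
    nn.Linear(hidden, out_ch), nn.ReLU(),
)

x = torch.randn(8, in_ch, 32, 32)
patches = F.unfold(x, kernel_size=k, stride=stride)  # (N, in_ch*k*k, L) patch vectors
patches = patches.transpose(1, 2)                    # (N, L, in_ch*k*k)
out = mlp(patches)                                   # shared MLP applied to every patch
h = w = (32 - k) // stride + 1
feature_maps = out.transpose(1, 2).reshape(8, out_ch, h, w)  # back to (N, out_ch, H, W)
```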


Instead of adopting the traditional fully connected layers for classification in CNN, we directly output the spatial average of the feature maps from the last mlpconv layer as the confidence of categories via a global average pooling layer, and then the resulting vector is fed into the softmax layer. In traditional CNN, it is difficult to interpret how the category level information from the objective cost layer is passed back to the previous convolution layer due to the fully connected layers which act as a black box in between. In contrast, global average pooling is more meaningful and interpretable as it enforces correspondence between feature maps and categories, which is made possible by a stronger local modeling using the micro network. Furthermore, the fully connected layers are prone to overfitting and heavily depend on dropout regularization [4] [5], while global average pooling is itself a structural regularizer, which natively prevents overfitting for the overall structure.

Instead of adopting the traditional fully connected layers for classification in CNN, we directly output the spatial average of the feature maps from the last mlpconv layer as the confidence of each category via a global average pooling layer, and the resulting vector is fed into the softmax layer. In a traditional CNN, it is difficult to interpret how the category-level information from the objective cost layer is passed back to the earlier convolutional layers, because the fully connected layers in between act as a black box. In contrast, global average pooling is more meaningful and interpretable, as it enforces correspondence between feature maps and categories, which is made possible by the stronger local modeling of the micro network. Furthermore, fully connected layers are prone to overfitting and depend heavily on dropout regularization [4] [5], whereas global average pooling is itself a structural regularizer that natively prevents overfitting of the overall structure.
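To make the classification head concrete, here is a minimal sketch (assuming PyTorch; the channel width of 192 and the class count of 10 are illustrative, not taken from the paper): the last mlpconv stage emits one feature map per category, each map is averaged to a single confidence value, and the resulting vector goes directly into softmax.

```python
import torch
import torch.nn as nn

num_classes = 10  # illustrative, e.g. CIFAR-10

# Classification head: the final 1x1 convolution of the last mlpconv layer
# produces one feature map per category; global average pooling reduces each
# map to a single confidence score, with no fully connected layers.
head = nn.Sequential(
    nn.Conv2d(192, num_classes, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),   # global average pooling: one value per feature map
    nn.Flatten(),              # (N, num_classes) confidence vector
)

feature_maps = torch.randn(8, 192, 8, 8)   # stand-in for earlier mlpconv outputs
logits = head(feature_maps)                # shape (8, 10)
probs = torch.softmax(logits, dim=1)       # fed into the softmax / cross-entropy loss
```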

2 Convolutional Neural Networks

Classic convolutional neural networks [1] consist of alternatively stacked convolutional layers and spatial pooling layers. The convolutional layers generate feature maps by linear convolutional filters followed by nonlinear activation functions (rectifier, sigmoid, tanh, etc.). Using the linear rectifier as an example, the feature map can be calculated as follows:

Classic convolutional neural networks [1] consist of alternately stacked convolutional layers and spatial pooling layers. The convolutional layers generate feature maps with linear convolutional filters followed by nonlinear activation functions (ReLU, sigmoid, tanh, etc.). Taking the ReLU as an example, a feature map is computed as follows:
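The formula referenced above is missing from this copy; it is reconstructed here from the paper's Eq. (1) as a best-effort transcription, where (i, j) indexes a pixel in the feature map, x_{i,j} is the input patch centered at location (i, j), and k indexes the channels of the feature map:

f_{i,j,k} = \max(w_k^{T} x_{i,j},\ 0)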

This linear convolution is sufficient for abstraction when the instances of the latent concepts are linearly separable. However, representations that achieve good abstraction are generally highly nonlinear functions of the input data. In conventional CNN, this might be compensated by utilizing an over-complete set of filters [6] to cover all variations of the latent concepts. Namely, individual linear filters can be learned to detect different variations of a same concept. However, having too many filters for a single concept imposes extra burden on the next layer, which needs to consider all combinations of variations from the previous layer [7]. As in CNN, filters from higher layers map to larger regions in the original input. It generates a higher level concept by combining the lower level concepts from the layer below. Therefore, we argue that it would be beneficial to do a better abstraction on each local patch, before combining them into higher level concepts.

This linear convolution is sufficient for abstraction when the instances of the latent concepts are linearly separable. However, representations that achieve good abstraction are generally highly nonlinear functions of the input data. In a conventional CNN, this can be compensated for by using an over-complete set of filters [6] to cover all variations of the latent concepts; that is, individual linear filters can be learned to detect different variations of the same concept. However, having too many filters for a single concept places an extra burden on the next layer, which has to consider all combinations of variations from the previous layer [7]. As in CNN, filters in higher layers map to larger regions of the original input, and a higher-level concept is generated by combining the lower-level concepts from the layer below. Therefore, we argue that it is beneficial to perform better abstraction on each local patch before combining them into higher-level concepts.

In the recent maxout network [8], the number of feature maps is reduced by maximum pooling over affine feature maps (affine feature maps are the direct results from linear convolution without applying the activation function). Maximization over linear functions makes a piecewise linear approximator which is capable of approximating any convex functions. Compared to conventional convolutional layers which perform linear separation, the maxout network is more potent as it can separate concepts that lie within convex sets. This improvement endows the maxout network with the best performances on several benchmark datasets.

In the recent maxout network [8], the number of feature maps is reduced by max pooling over affine feature maps (affine feature maps are the direct outputs of linear convolution without applying an activation function). Maximization over linear functions yields a piecewise linear approximator that can approximate any convex function. Compared with conventional convolutional layers, which perform linear separation, the maxout network is more potent because it can separate concepts that lie within convex sets. This improvement gives the maxout network the best performance on several benchmark datasets.
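A quick illustration of the piecewise-linear argument (our example, not the paper's): a maximum of affine functions \max_m (a_m x + b_m) is always convex and piecewise linear. For instance,

|x| = \max(x,\ -x), \qquad x^2 \approx \max_m \,(2 a_m x - a_m^2),

where the second expression is the maximum over tangent lines of x^2 at the points a_m and becomes arbitrarily accurate as more tangent points are added.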

However, maxout network imposes the prior that instances of a latent concept lie within a convex set in the input space, which does not necessarily hold. It would be necessary to employ a more general function approximator when the distributions of the latent concepts are more complex. We seek to achieve this by introducing the novel “Network In Network” structure, in which a micro network is introduced within each convolutional layer to compute more abstract features for local patches.

However, the maxout network imposes the prior that instances of a latent concept lie within a convex set in the input space, which does not necessarily hold. It becomes necessary to employ a more general function approximator when the distributions of the latent concepts are more complex. We seek to achieve this by introducing the novel “Network In Network” structure, in which a micro network is introduced within each convolutional layer to compute more abstract features for local patches.

Sliding a micro network over the input has been proposed in several previous works. For example, the Structured Multilayer Perceptron (SMLP) [9] applies a shared multilayer perceptron on different patches of the input image; in another work, a neural network based filter is trained for face detection [10]. However, they are both designed for specific problems and both contain only one layer of the sliding network structure. NIN is proposed from a more general perspective: the micro network is integrated into the CNN structure in pursuit of better abstractions for all levels of features.

Sliding a micro network over the input has been proposed in several previous works. For example, the Structured Multilayer Perceptron (SMLP) [9] applies a shared multilayer perceptron to different patches of the input image; in another work, a neural-network-based filter is trained for face detection [10]. However, both are designed for specific problems, and both contain only one layer of the sliding network structure. NIN is proposed from a more general perspective: the micro network is integrated into the CNN structure to obtain better abstractions at all levels of features.

3 Network In Network

We first highlight the key components of our proposed “Network In Network” structure: the MLP convolutional layer and the global average pooling layer in Sec. 3.1 and Sec. 3.2 respectively. Then we detail the overall NIN in Sec. 3.3.

We first highlight the key components of our proposed “Network In Network” structure: the MLP convolutional layer and the global average pooling layer, introduced in Sec. 3.1 and Sec. 3.2 respectively. We then describe the overall NIN in Sec. 3.3.

3.1 MLP Convolution Layers

Given no priors about the distributions of the latent concepts, it is desirable to use a universal function approximator for feature extraction of the local patches, as it is capable of approximating more abstract representations of the latent concepts. Radial basis network and multilayer perceptron are two well known universal function approximators. We choose multilayer perceptron in this work for two reasons. First, multilayer perceptron is compatible with the structure of convolutional neural networks and can be trained using back-propagation. Second, multilayer perceptron can be a deep model itself, which is consistent with the spirit of feature re-use [2]. This new type of layer is called mlpconv in this paper, in which MLP replaces the GLM to convolve over the input. Figure 1 illustrates the difference between linear convolutional layer and mlpconv layer. The calculation performed by mlpconv layer is shown as follows:

Since there is no prior knowledge about the distributions of the latent concepts, it is desirable to use a universal function approximator to extract features of the local patches, because it can approximate more abstract representations of the latent concepts. Radial basis networks and multilayer perceptrons are two well-known universal function approximators. We choose the multilayer perceptron in this work for two reasons. First, the multilayer perceptron is compatible with the structure of convolutional neural networks and can be trained by back-propagation. Second, the multilayer perceptron can itself be a deep model, which is consistent with the spirit of feature re-use [2]. This new type of layer is called mlpconv in this paper, in which an MLP replaces the GLM to convolve over the input. Figure 1 illustrates the difference between a linear convolutional layer and an mlpconv layer. The computation performed by the mlpconv layer is as follows:
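The equations referenced above are missing from this copy; they are reconstructed here from the paper's Eq. (2) as a best-effort transcription, where n is the number of layers in the multilayer perceptron and f^{n-1}_{i,j} collects the previous layer's outputs at position (i, j):

f^{1}_{i,j,k_1} = \max\big({w^{1}_{k_1}}^{T} x_{i,j} + b_{k_1},\ 0\big)
\vdots
f^{n}_{i,j,k_n} = \max\big({w^{n}_{k_n}}^{T} f^{\,n-1}_{i,j} + b_{k_n},\ 0\big)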

From cross channel (cross feature map) pooling point of view, Equation 2 is equivalent to cascaded cross channel parametric pooling on a normal convolution layer. Each pooling layer performs weighted linear recombination on the input feature maps, which then go through a rectified linear unit. The cross channel pooled feature maps are cross channel pooled again and again in the next layers. This cascaded cross channel parametric pooling structure allows complex and learnable interactions of cross channel information.

From the cross-channel (cross-feature-map) pooling point of view, Equation 2 is equivalent to cascaded cross channel parametric pooling on a normal convolutional layer. Each pooling layer performs a weighted linear recombination of the input feature maps, which then passes through a rectified linear unit. The cross-channel pooled feature maps are cross-channel pooled again and again in the subsequent layers. This cascaded cross channel parametric pooling structure allows complex and learnable interactions of cross-channel information.

The cross channel parametric pooling layer is also equivalent to a convolution layer with 1x1 convolution kernel. This interpretation makes it straightforward to understand the structure of NIN.

The cross channel parametric pooling layer is also equivalent to a convolutional layer with a 1x1 convolution kernel. This interpretation makes it straightforward to understand the structure of NIN.
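Based on this 1x1-convolution interpretation, an mlpconv block can be sketched as an ordinary KxK convolution followed by 1x1 convolutions, each with a ReLU (a minimal PyTorch sketch; the layer widths are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class MlpConv(nn.Module):
    """One mlpconv block: a KxK convolution followed by 1x1 convolutions.

    The 1x1 convolutions realize the cascaded cross channel parametric pooling:
    each output channel is a learned linear recombination of the input channels,
    followed by a ReLU.
    """
    def __init__(self, in_ch, mid_ch, out_ch, kernel_size, padding=0):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size, padding=padding),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=1),  # first cross-channel recombination
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1),  # second cross-channel recombination
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Example: one block applied to a batch of 32x32 RGB images.
x = torch.randn(8, 3, 32, 32)
y = MlpConv(3, 96, 96, kernel_size=5, padding=2)(x)
print(y.shape)  # torch.Size([8, 96, 32, 32])
```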

Comparison to maxout layers: the maxout layers in the maxout network perform max pooling across multiple affine feature maps [8]. The feature maps of maxout layers are calculated as follows:

Comparison to maxout layers: the maxout layers in the maxout network perform max pooling across multiple affine feature maps [8]. The feature maps of a maxout layer are computed as follows:
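The formula referenced above is missing from this copy; it is reconstructed here from the paper's Eq. (3) as a best-effort transcription, where m indexes the affine feature maps being pooled over:

f_{i,j,k} = \max_{m}\ w^{T}_{k_m} x_{i,j}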

Maxout over linear functions forms a piecewise linear function which is capable of modeling any convex function. For a convex function, samples with function values below a specific threshold form a convex set. Therefore, by approximating convex functions of the local patch, maxout has the capability of forming separation hyperplanes for concepts whose samples are within a convex set (i.e. l2 balls, convex cones). Mlpconv layer differs from maxout layer in that the convex function approximator is replaced by a universal function approximator, which has greater capability in modeling various distributions of latent concepts.

Maxout over linear functions forms a piecewise linear function that can model any convex function. For a convex function, the samples whose function values are below a given threshold form a convex set. Therefore, by approximating convex functions of the local patch, maxout can form separating hyperplanes for concepts whose samples lie within a convex set (e.g. l2 balls, convex cones). The mlpconv layer differs from the maxout layer in that the convex function approximator is replaced by a universal function approximator, which has greater capability in modeling various distributions of latent concepts.

...


Reposted from blog.csdn.net/weixin_40519315/article/details/105327155