CNN paper reading (1) LeNet: Gradient-based learning applied to document recognition

 

1. Historical map of CNN structure evolution

 

  This is the first classic CNN paper in this reading series: LeNet, the originator of convolutional neural networks and the classic handwriting recognition paper, "Gradient-Based Learning Applied to Document Recognition". Its authors include Yann LeCun, one of the three giants of deep learning, and Yoshua Bengio, one of the authors of the "flower book" Deep Learning.

  The original paper is very long, so these notes record only the most important parts: sections A and B, which introduce the CNN network structure.

2. Convolutional Neural Networks for Character Recognition

  Multilayer networks trained with gradient descent can learn complex, high-dimensional, nonlinear mappings from a large amount of data, which makes them natural candidates for image recognition tasks. In the traditional pattern recognition model, a hand-designed feature extractor gathers relevant features from the image and removes irrelevant information, and a classifier then categorizes these feature vectors. A fully connected multilayer network can be used as the classifier. A more interesting scheme is to rely as much as possible on learning in the feature extractor itself. For character recognition, the image can be fed into the network as a flattened vector of raw pixel values. Although such tasks (for example character recognition) can be handled by an ordinary fully connected feedforward network, several problems remain.

  First, images are large, consisting of many pixels, and a fully connected first layer with, say, 100 hidden units would already contain tens of thousands of weights. Such a large number of parameters increases the capacity of the system and its memory footprint, and therefore requires a larger training set. But the main deficiency of unstructured networks for image or speech applications is that they have no built-in invariance with respect to translation or to local distortion and deformation of the input. Before being sent to the fixed-size input layer, character images must be size-normalized and centered in the input field. Unfortunately, no such preprocessing can be perfect: handwriting is often normalized at the word level, which causes the size, slant, and position of individual characters to vary; together with differences in writing style, this causes the positions of distinctive features to shift. In principle, a fully connected network of sufficient size could learn to be robust to such variations, but achieving this would require many units placed at different positions of the input image so that the same features can be detected wherever they appear. Learning all of these weights requires a very large number of training samples to cover the space of possible variations. In the convolutional networks described below, shift invariance is obtained automatically through weight sharing.

  Second, another deficiency of fully connected networks is that they completely ignore the topology of the input. The input pixels could be presented in any fixed order without affecting the outcome of training. In contrast, images have a strong two-dimensional local structure: spatially adjacent pixels are highly correlated. Local correlation is a great advantage for extracting local features, because configurations of neighboring pixels can be classified into a small number of categories (edges, corners, and so on). CNNs extract local features by restricting the receptive fields of hidden units to be local.

A. Convolutional networks
  CNNs achieve invariance to shift, scale, and distortion through three architectural ideas: local receptive fields, shared weights, and sub-sampling. A typical network structure used for character recognition is shown in Figure 2; it is called LeNet-5. The input layer receives a character image whose size has been normalized and whose character is centered. Each unit in a layer receives inputs from a set of units located in a small neighborhood of the previous layer (its local receptive field). The idea of connecting units to local receptive fields goes back to the perceptron of the early 1960s, and is closely related to Hubel and Wiesel's discovery of locally sensitive, orientation-selective neurons in the cat's visual system. Local receptive fields have been used many times in neural models of visual learning. With local receptive fields, units can extract elementary visual features such as edges and corners; these features are then combined by subsequent layers to form higher-level features. As mentioned earlier, distortions and shifts cause the positions of salient features to vary. In addition, a local feature detector that is useful on one part of the image is likely to be useful across the entire image. This knowledge can be exploited by forcing a set of units, whose receptive fields are located at different positions of the image, to share the same weights (this is weight sharing). The units of a layer are organized in planes within which all units share the same weights; the set of outputs of such a plane is called a feature map. All units in a feature map perform the same operation at different positions of the image, so they detect the same feature at different positions in the input. A complete convolutional layer is composed of several feature maps (with different weight vectors), so that multiple features can be extracted at each location.

  A concrete example is the first layer of LeNet-5 in Figure 2. All units of the first hidden layer are organized in 6 planes, each of which is a feature map. A unit in a feature map has 25 inputs connected to a 5x5 area of the input layer; this area is its local receptive field. Each unit therefore has 25 trainable weights plus a trainable bias. Because adjacent units in a feature map are centered on adjacent units of the previous layer, the local receptive fields of neighboring units overlap. In LeNet-5, for example, the receptive fields of horizontally adjacent units overlap by 4 columns and 5 rows. As mentioned above, all units in a feature map share the same 25 weights and the same bias, so they detect the same feature at all locations of the input; the other feature maps of the layer use different sets of weights and biases to extract different types of local features. In LeNet-5, 6 different kinds of features are extracted at each input location. A sequential way to implement a feature map would be to scan the input image with a single unit that has a local receptive field and store the unit's output at the corresponding position of the feature map. This operation is equivalent to a convolution, followed by an additive bias and a squashing function; hence the name convolutional network, where the convolution kernel is the set of connection weights used by all the units of a feature map. An important property of convolutional layers is that if the input image is shifted, the feature maps shift by the same amount and are otherwise unchanged.
This property is the basis of the robustness of CNNs to shifts and distortions of the input.
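To make the weight-sharing idea concrete, here is a minimal NumPy sketch (not the paper's implementation) of how a single feature map is produced: one 5x5 kernel and one bias are reused at every position of the input, so every unit of the map shares the same 26 trainable parameters.

```python
# Minimal sketch of a single feature map produced by a shared 5x5 kernel.
import numpy as np

def feature_map(image, kernel, bias):
    """Valid 2D convolution (really cross-correlation, as in most CNN code)."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            receptive_field = image[i:i + kh, j:j + kw]   # local receptive field
            out[i, j] = np.sum(receptive_field * kernel) + bias
    return out

image = np.random.randn(32, 32)          # a 32x32 input, as in LeNet-5
kernel = np.random.randn(5, 5) * 0.1     # shared weights: 25 parameters
bias = 0.0                               # plus one shared bias
fmap = feature_map(image, kernel, bias)
print(fmap.shape)                        # (28, 28): one unit per position
```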

  Once a feature has been detected, its exact location becomes less important; only its approximate position relative to other features matters. For example, once we know that the upper left area contains the endpoint of a roughly horizontal segment, the upper right area contains a corner, and the lower portion contains the endpoint of a roughly vertical segment, we can tell that the character is a 7. The precise position of these features is not only unhelpful for recognition, it is potentially harmful, because the positions vary across different handwritten instances of the same character. A simple way to reduce the precision with which feature positions are encoded in a feature map is to reduce the spatial resolution of the feature map. This is achieved by a downsampling (sub-sampling) layer, which performs local averaging and thereby reduces the sensitivity of the output to shifts and distortions. The second hidden layer of LeNet-5 is such a downsampling layer. It contains 6 feature maps, one for each feature map of the previous layer. The receptive field of each unit is a 2x2 area. Each unit computes the average of its four inputs, multiplies it by a trainable coefficient, adds a trainable bias, and passes the result through a sigmoid function. The receptive fields of adjacent units do not overlap, so the feature maps of the downsampling layer have half as many rows and columns as those of the previous layer. The coefficient and bias control the effect of the sigmoid non-linearity: if the coefficient is small, the downsampling layer merely blurs its input; if the coefficient is large, the downsampling unit can be seen, depending on the bias, as performing a noisy "OR" or a noisy "AND". Convolutional and downsampling layers alternate, forming a pyramid: from layer to layer, the spatial resolution of the feature maps decreases while the number of feature maps increases. Each unit of the third hidden layer (C3) in LeNet-5 may take inputs from several feature maps of the previous layer (S2). The combination of convolution and downsampling was inspired by Hubel and Wiesel's notions of "simple" and "complex" cells, although at that time no globally supervised learning procedure such as back-propagation was available. Downsampling combined with an increasing number of feature maps greatly improves the network's invariance to geometric transformations of the input.
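Below is a small sketch, following the description above, of one LeNet-style downsampling unit: a non-overlapping 2x2 average, scaled by a trainable coefficient, shifted by a trainable bias, and passed through a sigmoid. The coefficient and bias values are placeholders.

```python
# Sketch of a LeNet-style sub-sampling ("downsampling") unit.
import numpy as np

def subsample(feature_map, coeff, bias):
    h, w = feature_map.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            avg = feature_map[i:i + 2, j:j + 2].mean()     # 2x2 average, non-overlapping
            out[i // 2, j // 2] = 1.0 / (1.0 + np.exp(-(coeff * avg + bias)))
    return out

fmap = np.random.randn(28, 28)
pooled = subsample(fmap, coeff=0.5, bias=0.0)   # placeholder coefficient and bias
print(pooled.shape)   # (14, 14): rows and columns are halved
```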

  Since all weights are learned by back-propagation, a convolutional network can be seen as synthesizing its own feature extractor. The weight-sharing technique greatly reduces the number of free parameters, which in turn reduces the gap between test error and training error. LeNet-5 contains 340,908 connections but, thanks to weight sharing, only about 60,000 trainable parameters. Convolutional networks have been applied in many fields, including handwriting recognition, machine-printed character recognition, online handwriting recognition, and face recognition. Convolutional networks that share weights along a single temporal dimension are known as time-delay neural networks (TDNNs). TDNNs have been used in phoneme recognition (without sub-sampling) [40], spoken word recognition (with sub-sampling), online recognition of isolated handwritten characters [44], and signature verification [45].

B. LeNet-5


  LeNet-5 comprises 7 layers, not counting the input, all of which contain trainable parameters (weights). The input image is 32x32 pixels. This is significantly larger than the largest character in the MNIST database (a widely used handwritten-digit database), whose characters fit in 28x28. The reason is that potential distinctive features, such as stroke endpoints or corners, should be able to appear in the center of the receptive field of the highest-level feature detectors. In LeNet-5, the centers of the receptive fields of the last convolutional layer form a 20x20 area in the middle of the 32x32 input. The input pixel values are normalized so that the background (white) corresponds to -0.1 and the foreground (black) corresponds to 1.175. This makes the mean of the input roughly 0 and the variance roughly 1, which accelerates learning [46].
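As a sketch of this normalization, assuming raw grey levels in [0, 1] with 0 for the white background and 1 for the black foreground, the linear map below sends background to -0.1 and foreground to 1.175.

```python
# Sketch of the input normalization: background -> -0.1, foreground -> 1.175.
import numpy as np

def normalize(raw):
    # Linear map: 0 -> -0.1, 1 -> 1.175 (slope 1.275).
    return -0.1 + 1.275 * raw

raw_digit = np.random.rand(32, 32)   # placeholder for a padded 32x32 character
x = normalize(raw_digit)
print(x.min(), x.max())              # values lie in [-0.1, 1.175]
```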

  In the following, convolutional layers are labeled Cx, downsampling layers Sx, and fully connected layers Fx, where x is the index of the layer.

  Layer C1 is a convolutional layer with 6 feature maps. Each unit in a feature map is connected to a 5x5 neighborhood in the input. The feature maps are 28x28, which prevents the connections from falling off the boundary of the input. C1 has 156 trainable parameters (each filter has 5*5 = 25 weights plus one bias, and there are 6 filters, giving (5*5+1)*6 = 156 parameters) and 122,304 connections (26*28*28*6: each unit has 26 connections, i.e. 25 inputs plus 1 bias, which together produce one pixel of the output feature map; each feature map has 28*28 units, and there are 6 feature maps).

  Layer S2 is a downsampling layer with 6 feature maps of size 14x14. Each unit in a feature map is connected to a 2x2 neighborhood in the corresponding feature map of C1. The 4 inputs of each S2 unit are added, multiplied by a trainable coefficient, and added to a trainable bias; the result is passed through a sigmoid function. The trainable coefficient and bias control the degree of non-linearity of the sigmoid. If the coefficient is small, the operation is approximately linear and the downsampling amounts to blurring the image. If the coefficient is large, the downsampling can be seen, depending on the bias, as a noisy "OR" or a noisy "AND". The 2x2 receptive fields do not overlap, so each feature map of S2 has half as many rows and columns as the corresponding feature map of C1. S2 has 12 trainable parameters (one coefficient and one bias per feature map; note that a modern pooling layer has no trainable parameters) and 5,880 connections (5*14*14*6).

  C3 is a convolutional layer with 16 feature maps. Its kernels are 5x5, and each unit in a C3 feature map is connected to several of the feature maps of S2. Table 1 shows which S2 feature maps each C3 feature map is connected to.
Why not connect every S2 feature map to every C3 feature map? There are two reasons.
  First, a non-complete connection scheme keeps the number of connections within reasonable bounds.
  Second, and more importantly, it breaks the symmetry of the network. Incomplete connectivity forces different C3 feature maps to extract different (and hopefully complementary) features, because their inputs differ.
Table 1 shows a reasonable connection scheme: the first 6 feature maps of C3 take 3 contiguous S2 feature maps as input; the next 6 take 4 contiguous S2 feature maps; the following 3 take 4 non-contiguous S2 feature maps; and the last one takes all 6 S2 feature maps as input. C3 therefore has 1,516 trainable parameters ((25*3+1)*6 + (25*4+1)*9 + (25*6+1)) and 151,600 connections (each C3 feature map is 10x10).

(In Table 1, the first column indicates that feature map 0 of C3 is connected to feature maps 0, 1, and 2 of S2.)
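The snippet below reconstructs one connection table consistent with the scheme described above (the exact layout is Table 1 of the paper) and checks the parameter and connection counts.

```python
# Sketch of the C3 <- S2 connection scheme and the resulting counts.
S2_MAPS = 6
table = (
    [[(i + k) % S2_MAPS for k in range(3)] for i in range(6)] +            # 6 maps, 3 contiguous inputs
    [[(i + k) % S2_MAPS for k in range(4)] for i in range(6)] +            # 6 maps, 4 contiguous inputs
    [[i % S2_MAPS, (i + 1) % S2_MAPS,
      (i + 3) % S2_MAPS, (i + 4) % S2_MAPS] for i in range(3)] +           # 3 maps, 4 non-contiguous inputs
    [list(range(S2_MAPS))]                                                 # 1 map, all 6 inputs
)

params = sum(5 * 5 * len(inputs) + 1 for inputs in table)   # weights + bias per C3 map
connections = params * 10 * 10                              # each C3 feature map is 10x10
print(len(table), params, connections)                      # 16 1516 151600
```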

  Layer S4 is a downsampling layer with 16 feature maps of size 5x5. Each unit in a feature map is connected to a 2x2 neighborhood in the corresponding feature map of C3, in the same way as between C1 and S2. S4 has 32 trainable parameters (one coefficient and one bias per feature map) and 2,000 connections (5*5*5*16: each S4 unit has 4 connections for its receptive field plus one for the bias).

  Layer C5 is a convolutional layer with 120 feature maps. Each unit is connected to a 5x5 neighborhood on all 16 feature maps of S4. Because the S4 feature maps are themselves 5x5 (the same size as the filters), the C5 feature maps are 1x1; this amounts to a full connection between S4 and C5. C5 is nevertheless labeled a convolutional layer rather than a fully connected layer because, if the input to LeNet-5 were made larger while everything else stayed the same, the C5 feature maps would be larger than 1x1. C5 has 48,120 trainable connections ((5*5*16+1)*120).

  Layer F6 has 84 units (the reason for this number comes from the design of the output layer, explained below) and is fully connected to C5. It has 10,164 trainable parameters.
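As a quick sanity check, the layer-by-layer trainable-parameter counts quoted above can be recomputed in a few lines; they sum to the roughly 60,000 parameters mentioned earlier.

```python
# Recompute the trainable-parameter counts quoted in the text for C1 through F6.
c1 = (5 * 5 * 1 + 1) * 6          # 156: six 5x5 kernels on one input plane, plus biases
s2 = 2 * 6                        # 12: one coefficient and one bias per feature map
c3 = (5 * 5 * 3 + 1) * 6 + (5 * 5 * 4 + 1) * 9 + (5 * 5 * 6 + 1) * 1   # 1516
s4 = 2 * 16                       # 32
c5 = (5 * 5 * 16 + 1) * 120       # 48120
f6 = (120 + 1) * 84               # 10164
print(c1, s2, c3, s4, c5, f6, c1 + s2 + c3 + s4 + c5 + f6)   # total: 60000
```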

As in a classical neural network, the units of F6 compute a dot product between their input vector and their weight vector, to which a bias is added. The weighted sum for unit i, denoted a_i, is then passed through a sigmoid squashing function to produce the state of unit i, denoted x_i:
    x_i = f(a_i)

The squashing function is a scaled hyperbolic tangent:

    f(a) = A tanh(S a)

  A is the amplitude of the function and S determines its slope at the origin. The function is odd, with horizontal asymptotes at +A and -A. The constant A is chosen to be 1.7159. The rationale for this choice of squashing function is given in Appendix A of the paper.
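A small sketch of this squashing function follows; A = 1.7159 is given above, while S = 2/3 is the slope used in the paper's appendix (treated here as an assumption).

```python
# Sketch of the F6 squashing function f(a) = A * tanh(S * a).
import numpy as np

A, S = 1.7159, 2.0 / 3.0   # S = 2/3 is assumed from the paper's appendix

def squash(a):
    return A * np.tanh(S * a)

print(squash(1.0), squash(-1.0))   # ~ +1 and -1: with these constants, f(+/-1) = +/-1
```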
  Finally, the output layer (which plays the role of the classification layer, analogous to a softmax output in modern networks) is composed of Euclidean radial basis function (RBF) units, one per class, each with 84 inputs. The output of RBF unit i, denoted y_i, is computed as:
                         y_i = ∑_j (x_j − w_ij)^2

  In other words, each output RBF unit computes the Euclidean distance between its input vector and its parameter vector. The farther the input is from the parameter vector, the larger the RBF output. The output of an RBF unit can be interpreted as a penalty term measuring how well the input pattern matches the model of the class associated with that unit. In probabilistic terms, the RBF output can be interpreted as the negative log-likelihood of a Gaussian distribution in the space of F6 configurations. Given an input pattern, the loss function should drive the F6 configuration close to the RBF parameter vector of the pattern's desired class. The parameters of these units were chosen by hand and kept fixed (at least initially). The components of the parameter vectors are set to -1 or +1. Although they could have been chosen at random with equal probabilities for -1 and +1, or chosen to form an error-correcting code, they were instead designed as stylized 7x12 (i.e. 84-pixel) bitmap images of the corresponding character class. Such a representation is not particularly useful for recognizing isolated digits, but it is quite useful for recognizing strings of characters drawn from the printable ASCII set. The rationale is that characters that are similar and easily confused, such as uppercase O, lowercase o, and the digit 0, or lowercase l, the digit 1, square brackets, and uppercase I, will have similar output codes. This is especially useful when the system is combined with a linguistic post-processor that can correct such confusions: because the codes of confusable classes are similar, the RBF outputs for ambiguous characters are similar, and the post-processor can pick the appropriate interpretation. Figure 3 shows the output codes for the full ASCII character set.
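A minimal sketch of the RBF output computation, with a random placeholder in place of the hand-designed 7x12 bitmap codes:

```python
# Sketch of the RBF output units: y_i is the squared Euclidean distance between
# the F6 activation vector x and the fixed parameter vector of class i.
import numpy as np

def rbf_outputs(x, W):
    """x: (84,) F6 output; W: (n_classes, 84) parameter vectors. Returns (n_classes,)."""
    return np.sum((x[None, :] - W) ** 2, axis=1)

W = np.sign(np.random.randn(10, 84))   # placeholder for the hand-designed +/-1 codes
x = np.random.randn(84)
y = rbf_outputs(x, W)
print(y.argmin())                      # the best-matching class has the smallest penalty
```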

  Another reason for using such a distributed code, rather than the more common "1 of N" code (also called place code or grandmother cell code), is that non-distributed codes tend to perform poorly when the number of classes is large. The reason is that the output units of a non-distributed code must be off most of the time, which is difficult to achieve with sigmoid units. Yet another reason is that the classifier is used not only to recognize characters but also to reject non-characters. Distributed-code RBFs are better suited to that purpose because, unlike sigmoids, they are activated within a well-bounded region of their input space, outside of which atypical patterns are more likely to fall.

  The RBF parameter vectors play the role of target vectors for layer F6. It is worth noting that their components are +1 or -1, which is well within the range of the F6 sigmoid and therefore prevents the sigmoid from saturating. In fact, +1 and -1 are the points of maximum curvature of the sigmoid, so the F6 units operate in their maximally nonlinear range. Saturation of the sigmoid must be avoided because it leads to slow convergence of the loss function and to ill-conditioning.
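For reference, here is a modern, deliberately simplified PyTorch approximation of the LeNet-5 pipeline described in this section. It is not the original model: it uses plain average pooling without trainable coefficients, full S2-to-C3 connectivity, tanh activations, and an ordinary linear output layer instead of RBF units.

```python
# A simplified modern approximation of LeNet-5 (see caveats in the text above).
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),     # C1: 6 maps, 28x28
            nn.AvgPool2d(2),                               # S2: 6 maps, 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),    # C3: 16 maps, 10x10 (full connectivity here)
            nn.AvgPool2d(2),                               # S4: 16 maps, 5x5
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),  # C5: 120 maps, 1x1
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84), nn.Tanh(),                 # F6
            nn.Linear(84, num_classes),                    # logits; softmax/CE applied outside
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
x = torch.randn(1, 1, 32, 32)            # one 32x32 grey-level image
print(model(x).shape)                    # torch.Size([1, 10])
```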

3. Important points

  A neuron is a computational model that covers the complete process from inputs to output.

  A neuron corresponds to one set of weights (a convolution kernel plus a bias in a convolutional network, or a weight vector w plus a bias in a fully connected network), performs one computation, y = f(∑ω_i x_i + b) (a convolution in a CNN, a weighted sum in a fully connected network), and produces one output (one pixel of a feature map in a CNN, or one unit of the next layer in a fully connected network).
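A one-line sketch of this neuron model:

```python
# Sketch of a single neuron: y = f(sum_i w_i * x_i + b).
import numpy as np

def neuron(x, w, b, f=np.tanh):
    return f(np.dot(w, x) + b)

print(neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, -0.2]), 0.05))
```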

 

————————————————

Reposted; original article: https://blog.csdn.net/qianqing13579/article/details/71076261

 
