A plain-language introduction to convolutional neural networks - getting started with deep learning and artificial intelligence

Guide: There is a road up the mountain of books, and hard work is the path; the sea of learning has no limit.

  • AI: Making machines exhibit human intelligence

  • Machine Learning: A Path to AI Goals

  • Deep Learning: Technologies That Enable Machine Learning


A convolutional neural network (CNN) is a type of feedforward neural network that includes convolution calculations and has a deep structure. It is one of the representative algorithms of deep learning. A CNN is capable of representation learning and can perform shift-invariant classification of its input according to its hierarchical structure, so it is also called a "shift-invariant artificial neural network" (SIANN).

The artificial neurons of a CNN respond to units within a limited surrounding region (their coverage area), which makes the network perform very well on large-scale image processing. A CNN includes convolutional layers and pooling layers.

structure

input layer

The input layer of a convolutional neural network can process multidimensional data. Typically, the input layer of a one-dimensional CNN receives a one- or two-dimensional array, where the one-dimensional array is usually a time or spectral sampling and the two-dimensional array may contain multiple channels; the input layer of a two-dimensional CNN receives a two- or three-dimensional array; and the input layer of a three-dimensional CNN receives a four-dimensional array. Because CNNs are widely used in computer vision, many studies assume three-dimensional input data when introducing the structure: two-dimensional pixels on a plane plus the RGB channels.

As with other neural network algorithms, the input features of a convolutional neural network need to be normalized because learning uses gradient descent. Specifically, before the training data are fed into the network, they are normalized along the channel or time/frequency dimension. If the input data are pixels, the raw pixel values distributed in [0, 255] can also be normalized to the [0, 1] interval. Standardizing the input features helps improve the efficiency and learning performance of the algorithm.
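The two normalization steps described above can be sketched in NumPy as follows. This is only an illustration, assuming 8-bit images stored as height × width × channel arrays; the function names `scale_pixels` and `standardize_channels` are hypothetical and not taken from any library.

```python
import numpy as np

def scale_pixels(image_uint8):
    """Map raw 8-bit pixel values from [0, 255] to the [0, 1] interval."""
    return image_uint8.astype(np.float32) / 255.0

def standardize_channels(image, eps=1e-8):
    """Give every channel (last axis) zero mean and unit variance."""
    mean = image.mean(axis=(0, 1), keepdims=True)
    std = image.std(axis=(0, 1), keepdims=True)
    return (image - mean) / (std + eps)

# Hypothetical 4x4 RGB image with random 8-bit pixel values.
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

scaled = scale_pixels(raw)                   # values now lie in [0, 1]
standardized = standardize_channels(scaled)  # per-channel mean is about 0
```

Which of the two steps to use (or both) depends on the data set; the text only requires that the inputs end up on a comparable scale.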

hidden layer

The hidden layers of a convolutional neural network include three common structures: the convolutional layer, the pooling layer, and the fully connected layer. Some more modern algorithms also contain complex structures such as Inception modules and residual blocks. Among the common structures, convolutional and pooling layers are specific to convolutional neural networks. The convolution kernels in a convolutional layer contain weight coefficients, whereas a pooling layer does not, so some of the literature does not count the pooling layer as an independent layer. Taking LeNet-5 as an example, the usual order of the three common structures is: input - convolutional layer - pooling layer - convolutional layer - pooling layer - fully connected layer - output.
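To make the layer order concrete, the short sketch below tracks the feature-map size through the convolution and pooling stages of LeNet-5. It assumes the commonly cited configuration (32×32 input, 5×5 kernels with unit stride and no padding, 2×2 pooling with stride 2, and 6 then 16 feature maps); the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width of a convolution on a square feature map."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window, stride):
    """Output width of a pooling layer without padding."""
    return (size - window) // stride + 1

size = 32                                   # assumed 32x32 single-channel input
shapes = [("input", size, 1)]

size = conv_out(size, kernel=5)             # convolutional layer, 6 kernels
shapes.append(("conv", size, 6))
size = pool_out(size, window=2, stride=2)   # pooling layer
shapes.append(("pool", size, 6))
size = conv_out(size, kernel=5)             # convolutional layer, 16 kernels
shapes.append(("conv", size, 16))
size = pool_out(size, window=2, stride=2)   # pooling layer
shapes.append(("pool", size, 16))

# Number of values handed to the fully connected layers after flattening.
flattened = size * size * 16

for name, width, channels in shapes:
    print(f"{name:5s} {width}x{width}x{channels}")
print("flattened length:", flattened)
```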

Convolutional layer 

1. Convolutional kernel

The function of the convolutional layer is to extract features from the input data. It contains multiple convolution kernels. Each element of a convolution kernel corresponds to a weight coefficient, and each kernel has a bias term, similar to a neuron in a feedforward neural network. Each neuron in the convolutional layer is connected to multiple neurons in a nearby region of the previous layer. The size of that region depends on the size of the convolution kernel and is called the "receptive field" in the literature, by analogy with the receptive field of visual cortex cells. When the convolution kernel works, it scans the input features at regular intervals, multiplies the input features inside the receptive field element-wise by the kernel weights, sums the products, and adds the bias:

$$Z^{l+1}(i,j) = [Z^{l} \otimes w^{l+1}](i,j) + b = \sum_{k=1}^{K} \sum_{x=0}^{f-1} \sum_{y=0}^{f-1} Z^{l}_{k}(s\,i + x,\; s\,j + y)\, w^{l+1}_{k}(x,y) + b$$

$$(i,j) \in \{0, 1, \dots, L_{l+1}-1\}, \qquad L_{l+1} = \frac{L_{l} + 2p - f}{s} + 1$$

The summation in the formula is equivalent to computing a cross-correlation. $b$ is the bias, and $Z^{l}$ and $Z^{l+1}$ denote the convolutional input and output of layer $l+1$, also known as feature maps. $L_{l+1}$ is the size of $Z^{l+1}$; here the feature map is assumed to have equal length and width. $Z(i,j)$ corresponds to a pixel of the feature map (of the padded map when $p > 0$), $K$ is the number of channels of the feature map, and $f$, $s$ and $p$ are the parameters of the convolutional layer: the kernel size, the stride, and the number of padding layers.

The above formula uses a two-dimensional convolution kernel as an example; one-dimensional and three-dimensional kernels work in the same way. In theory, the kernel could first be flipped by 180 degrees before the cross-correlation is computed. The result would be a true linear convolution, which is commutative, but this adds a computation step without making the parameters any easier to learn, so linear convolution kernels use cross-correlation instead of convolution.
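The scan-multiply-sum-add-bias procedure described above can be written as a direct (and deliberately slow) NumPy loop. This is a minimal sketch assuming a square, channel-last feature map; `cross_correlate2d` is a hypothetical name, and real libraries use far faster implementations.

```python
import numpy as np

def cross_correlate2d(feature_map, kernels, bias=0.0, stride=1, padding=0):
    """Cross-correlate an (L, L, K) feature map with one (f, f, K) kernel.

    Returns a single-channel output map of width (L + 2*padding - f) // stride + 1.
    """
    if padding > 0:
        # Zero padding on the two spatial axes only, not on the channel axis.
        feature_map = np.pad(
            feature_map, ((padding, padding), (padding, padding), (0, 0))
        )
    size, f = feature_map.shape[0], kernels.shape[0]
    out_size = (size - f) // stride + 1
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # Receptive field of output pixel (i, j).
            window = feature_map[i * stride:i * stride + f,
                                 j * stride:j * stride + f, :]
            out[i, j] = np.sum(window * kernels) + bias
    return out

# A 16x16x3 input of ones with a 5x5x3 kernel of ones: every output is 5*5*3 = 75.
ones_out = cross_correlate2d(np.ones((16, 16, 3)), np.ones((5, 5, 3)))

# Cross-correlation versus true convolution (kernel flipped by 180 degrees).
x = np.arange(16.0).reshape(4, 4, 1)
k = np.array([[1.0, 0.0], [0.0, -1.0]]).reshape(2, 2, 1)
corr = cross_correlate2d(x, k)                 # x[i, j] - x[i+1, j+1]
conv = cross_correlate2d(x, k[::-1, ::-1, :])  # flipped kernel
```

For this asymmetric kernel the two results differ only in sign, which illustrates why a network that learns its kernels loses nothing by skipping the flip.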

In particular, when the convolution kernel is a unit kernel, with size $f = 1$, stride $s = 1$ and no padding, the cross-correlation in the convolutional layer is equivalent to a matrix multiplication, which builds a fully connected network between the convolutional layers:

$$Z^{l+1}(i,j) = \sum_{k=1}^{K} Z^{l}_{k}(i,j)\, w^{l+1}_{k} + b = \left(w^{l+1}\right)^{\top} Z^{l}(i,j) + b, \qquad L_{l+1} = L_{l}$$

A convolutional layer consisting of unit convolution kernels is also called a Network-In-Network (NIN) or multilayer perceptron convolution layer (mlpconv). Unit convolution kernels can reduce the number of channels of a feature map while keeping its spatial size, which lowers the computational load of the convolutional layer. A convolutional neural network built entirely from unit convolution kernels is a multi-layer perceptron (MLP) with parameter sharing.
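The equivalence between unit (1×1) kernels and matrix multiplication can be checked numerically. The sketch below assumes a hypothetical 7×7×64 feature map reduced to 16 channels; the sizes and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(7, 7, 64))  # L x L x K input
weights = rng.normal(size=(64, 16))        # 16 unit kernels, each with K = 64 weights
bias = rng.normal(size=16)

# A 1x1 convolution is one matrix multiplication over the channel axis,
# applied identically at every pixel (parameter sharing).
out_matmul = feature_map @ weights + bias  # shape (7, 7, 16)

# The same result computed pixel by pixel, as a tiny fully connected layer.
out_loop = np.empty((7, 7, 16))
for i in range(7):
    for j in range(7):
        out_loop[i, j] = weights.T @ feature_map[i, j] + bias
```

The spatial size stays 7×7 while the channel count drops from 64 to 16, which is the channel-reduction effect the paragraph describes.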

Building on linear convolution, some convolutional neural networks use more complex convolutions, including tiled convolution, deconvolution, and dilated convolution. In tiled convolution, each kernel sweeps only part of the feature map and the rest is handled by other kernels of the same layer, so the parameters are only partially shared across the layer, which helps the network capture rotation-invariant features of the input image. Deconvolution, or transposed convolution, connects a single input activation to multiple output activations in order to enlarge the input image. Convolutional neural networks composed of deconvolution and unpooling layers have important applications in image semantic segmentation and are also used to build convolutional autoencoders (CAE). Dilated convolution introduces a dilation rate on top of linear convolution to enlarge the receptive field of the kernel and gather more information from the feature map; when used on sequence data, this helps capture long-range dependencies of the learning target. Convolutional neural networks with dilated convolutions are mainly used in natural language processing (NLP), for example in machine translation and speech recognition.
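How a dilation rate enlarges the receptive field without adding weights can be seen in one dimension. This is a minimal sketch; the symbol `dilation` and the span formula `dilation * (f - 1) + 1` follow the usual definition of dilated convolution, and the function name is hypothetical.

```python
import numpy as np

def dilated_correlate1d(signal, kernel, dilation=1):
    """1D cross-correlation whose kernel taps are `dilation` samples apart.

    Returns the output and the receptive-field span of one output value.
    """
    f = len(kernel)
    span = dilation * (f - 1) + 1          # samples covered by one kernel position
    out_len = len(signal) - span + 1
    out = np.empty(out_len)
    for i in range(out_len):
        taps = signal[i:i + span:dilation]  # every `dilation`-th sample
        out[i] = np.dot(taps, kernel)
    return out, span

signal = np.arange(10.0)
kernel = np.ones(3)

dense_out, dense_span = dilated_correlate1d(signal, kernel, dilation=1)
dilated_out, dilated_span = dilated_correlate1d(signal, kernel, dilation=2)
```

With three weights in both cases, the dense kernel covers 3 samples while the dilated one covers 5, which is why stacked dilated layers can reach long-range context in sequence data.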

2. Convolution layer parameters

Convolutional layer parameters include the kernel size, the stride, and the padding. Together the three determine the size of the layer's output feature map, and they are hyperparameters of the convolutional neural network. The kernel size can be set to any value smaller than the size of the input image. The larger the convolution kernel, the more complex the input features that can be extracted.

The stride defines the distance between two successive positions of the convolution kernel as it scans the feature map. With a stride of 1, the kernel sweeps the elements of the feature map one by one; with a stride of n, it skips n-1 pixels before the next scan position.

From the cross-correlation computed by the convolution kernel, it follows that the feature map shrinks as convolutional layers are stacked. For example, a 16×16 input image passed through a 5×5 convolution kernel with unit stride and no padding yields a 12×12 feature map. Padding counteracts this shrinkage by artificially enlarging the feature map before it passes through the convolution kernel. Common padding methods are zero padding and replication padding, which repeats the boundary values. Padding can be divided into four categories according to the number of padding layers and its purpose:

  • Valid padding: No padding is used at all, and the convolution kernel may only visit positions where the feature map contains the complete receptive field. Every output pixel is a function of the same number of input pixels. A convolution with valid padding is called a "narrow convolution", and the feature map it outputs has size (L-f)/s+1.
  • Same/half padding: Just enough padding is added to keep the output and input feature maps the same size. The feature map does not shrink under same padding, but input pixels near the boundary influence the output less than those in the middle, so boundary pixels are under-represented. Convolutions with same padding are called "equal-width convolutions".
  • Full padding: Enough padding is added so that every pixel is visited the same number of times in each direction. With a stride of 1, the output feature map has size L+f-1, which is larger than the input. Convolutions with full padding are called "wide convolutions".
  • Arbitrary padding: A manually chosen amount of padding between valid padding and full padding; rarely used.

Returning to the previous example, if the 16×16 input image is given same padding before it passes through the 5×5 convolution kernel with unit stride, two layers are padded in both the horizontal and vertical directions, that is, two pixels are added on each side (p = 2), so the image becomes 20×20. After passing through the convolution kernel, the output feature map is 16×16, the original size.
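The size arithmetic for the padding modes can be checked with a few lines of Python. The formula `(L + 2p - f) // s + 1` is the standard output-size rule and reduces to the valid and full padding sizes quoted in the list; the function names are hypothetical, and the same-padding helper assumes an odd kernel size and unit stride.

```python
def conv_output_size(L, f, s=1, p=0):
    """Width of the output feature map for input width L, kernel f, stride s, padding p."""
    return (L + 2 * p - f) // s + 1

def padding_layers(mode, f):
    """Padding layers per side for unit stride (same padding assumes odd f)."""
    if mode == "valid":
        return 0
    if mode == "same":
        return (f - 1) // 2
    if mode == "full":
        return f - 1
    raise ValueError(f"unknown padding mode: {mode}")

L, f = 16, 5
for mode in ("valid", "same", "full"):
    p = padding_layers(mode, f)
    print(mode, "p =", p, "output =", conv_output_size(L, f, s=1, p=p))
```

For the 16×16 image and 5×5 kernel used in the text, this prints output sizes 12 (valid), 16 (same, with p = 2) and 20 (full, equal to L+f-1).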

3. Activation function

The convolutional layer includes an activation function to help express complex features. It is applied element-wise to the feature map:

$$A^{l}_{k}(i,j) = \phi\left(Z^{l}_{k}(i,j)\right)$$

where $\phi$ denotes the activation function and $A^{l}$ the activated feature map.

Like other deep learning algorithms, convolutional neural networks usually use the rectified linear unit (ReLU). ReLU-like variants include leaky ReLU (LReLU), parametric ReLU (PReLU), randomized ReLU (RReLU), and the exponential linear unit (ELU). Before ReLU appeared, the sigmoid function and the hyperbolic tangent function were also used.

The activation function is usually applied after the convolution kernel, although some algorithms that use preactivation place it before the kernel. In some early convolutional neural networks, such as LeNet-5, the activation function comes after the pooling layer.
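For reference, three of the activation functions named above can be written in a few lines of NumPy. This is a sketch using their standard definitions; the default `slope` and `alpha` values are common choices assumed here, not values given in the text.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max(z, 0)."""
    return np.maximum(z, 0.0)

def leaky_relu(z, slope=0.01):
    """Leaky ReLU: a small fixed slope for negative inputs."""
    return np.where(z > 0, z, slope * z)

def elu(z, alpha=1.0):
    """Exponential linear unit: alpha * (exp(z) - 1) for negative inputs."""
    # expm1 is evaluated only on the non-positive part to avoid overflow.
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

z = np.array([-2.0, 0.0, 3.0])
relu_out = relu(z)         # negative inputs become 0
leaky_out = leaky_relu(z)  # negative inputs are scaled by the slope
elu_out = elu(z)           # negative inputs saturate towards -alpha
```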

 pooling layer

After feature extraction in the convolutional layer, the output feature map is passed to the pooling layer for feature selection and information filtering. The pooling layer contains a preset pooling function that replaces the value at a single point of the feature map with a statistic of the feature map over its neighbouring region. The pooling layer selects pooling regions in the same way that a convolution kernel scans the feature map, controlled by the pooling size, the stride, and the padding.

1. Lp pooling

Lp pooling is a class of pooling models inspired by the hierarchical structure of the visual cortex. Its general form is:

$$A^{l}_{k}(i,j) = \left[ \sum_{x=0}^{f-1} \sum_{y=0}^{f-1} A^{l}_{k}(s\,i + x,\; s\,j + y)^{P} \right]^{1/P}$$

where the stride $s$ and the pixel $(i,j)$ have the same meaning as in the convolutional layer, $f$ is the pooling size, and $P$ is a pre-specified parameter. When $P = 1$, Lp pooling takes the average over the pooling region (up to a constant factor), which is called average pooling; when $P \to \infty$, it takes the maximum over the region, which is called max pooling. Average pooling and max pooling are the most common pooling methods; both preserve the background and texture information of the image at the expense of the size of the feature map. L2 pooling, with $P = 2$, is also used in some work.
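The effect of the parameter P can be seen in a small NumPy sketch of Lp pooling on a single channel. Two assumptions are made here: the sum is normalized by the window size so that P = 1 returns the mean exactly, and absolute values are used so that negative activations do not break fractional powers. The function name is hypothetical.

```python
import numpy as np

def lp_pool2d(channel, size=2, stride=2, P=1.0):
    """Normalized Lp pooling over a square single-channel feature map."""
    out_size = (channel.shape[0] - size) // stride + 1
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            window = np.abs(channel[i * stride:i * stride + size,
                                    j * stride:j * stride + size])
            if np.isinf(P):
                out[i, j] = window.max()                      # max pooling
            else:
                out[i, j] = np.mean(window ** P) ** (1.0 / P)
    return out

channel = np.arange(1.0, 17.0).reshape(4, 4)
mean_pooled = lp_pool2d(channel, P=1.0)    # average pooling
l2_pooled = lp_pool2d(channel, P=2.0)      # lies between mean and max
max_pooled = lp_pool2d(channel, P=np.inf)  # max pooling
```

As P grows, each output moves from the window mean towards the window maximum, which is the sense in which average and max pooling are the two ends of the same family.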

2. Stochastic/mixed pooling

Mixed pooling and stochastic pooling are extensions of the Lp pooling concept. Stochastic pooling randomly selects one value from the pooling region according to a specific probability distribution, which ensures that some non-maximal activations can reach the next layer. Mixed pooling can be expressed as a linear combination of average pooling and max pooling:

$$A^{l}_{k} = \lambda\, L_{1}\left(A^{l}_{k}\right) + (1 - \lambda)\, L_{\infty}\left(A^{l}_{k}\right), \qquad \lambda \in [0, 1]$$

Some studies report that mixed pooling and stochastic pooling help prevent overfitting in convolutional neural networks and can perform better than average and max pooling.
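Both variants can be sketched for a single pooling window. The sketch assumes that stochastic pooling samples an activation with probability proportional to its value (which requires non-negative activations, for example after ReLU) and that mixed pooling uses one mixing weight `lam` in [0, 1]; both function names are hypothetical.

```python
import numpy as np

def mixed_pool(window, lam=0.5):
    """Linear combination of average pooling and max pooling over one window."""
    return lam * window.mean() + (1.0 - lam) * window.max()

def stochastic_pool(window, rng):
    """Sample one activation with probability proportional to its value.

    Assumes non-negative activations; an all-zero window returns 0.
    """
    values = window.ravel()
    total = values.sum()
    if total == 0:
        return 0.0
    return rng.choice(values, p=values / total)

window = np.array([[1.0, 2.0], [3.0, 4.0]])
mixed = mixed_pool(window, lam=0.5)  # halfway between mean 2.5 and max 4.0
sampled = stochastic_pool(window, np.random.default_rng(0))
```

Because the sampled value changes from call to call during training, non-maximal activations occasionally pass through, which is the regularizing effect the paragraph refers to.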

3. Spectral pooling

Spectral pooling is an FFT-based pooling method that can be used together with FFT convolution to build an FFT-based convolutional neural network. Given the size of the feature map and the output size of the pooling layer, spectral pooling applies a DFT to each channel of the feature map, crops a block of the output size from the center of the spectrum, and applies the inverse DFT to that block to obtain the pooling result. Spectral pooling acts as a filter: it preserves as much of the low-frequency variation as possible while keeping the size of the feature map under control. In addition, because it relies on mature FFT algorithms, spectral pooling requires little computation.
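The transform-crop-invert steps can be sketched with NumPy's FFT routines. This is a simplified illustration for one square channel: it keeps only the real part of the inverse transform and rescales by the ratio of the sizes so that the mean intensity is preserved, both of which are assumptions of this sketch rather than details given in the text.

```python
import numpy as np

def spectral_pool(channel, out_size):
    """Downsample an (m, m) channel to (out_size, out_size) in the frequency domain."""
    m = channel.shape[0]
    # DFT with the zero-frequency component shifted to the center.
    spectrum = np.fft.fftshift(np.fft.fft2(channel))
    # Crop the low-frequency block around the center of the spectrum.
    start = m // 2 - out_size // 2
    cropped = spectrum[start:start + out_size, start:start + out_size]
    # Undo the shift and transform back to the spatial domain.
    pooled = np.fft.ifft2(np.fft.ifftshift(cropped))
    # Rescale so that the mean of the channel is unchanged.
    return pooled.real * (out_size ** 2) / (m ** 2)

flat = np.full((8, 8), 3.0)
pooled_flat = spectral_pool(flat, 4)  # a constant image stays constant

rng = np.random.default_rng(0)
noisy = rng.normal(size=(8, 8))
pooled_noisy = spectral_pool(noisy, 4)
```

Unlike max or average pooling, the output size here can be any value up to the input size, not only an integer fraction of it.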

Inception module

The Inception module is a special hidden-layer structure obtained by stacking multiple convolutional and pooling layers. Specifically, an Inception module contains several different types of convolution and pooling operations at the same time, uses same padding so that these operations produce feature maps of the same size, and then stacks the channels of these feature maps in one array and passes them through the activation function. Because this introduces multiple convolution computations into a single structure, the computational cost increases significantly. To reduce it, the Inception module usually includes a bottleneck layer: unit convolution kernels (the NIN structure) first reduce the number of channels of the feature map, and the other convolution operations are performed afterwards. The Inception module was first applied in GoogLeNet with remarkable success, and it also inspired the idea of depthwise separable convolution in the Xception algorithm.
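The saving from a bottleneck layer can be illustrated by counting weights. The channel counts below (256 input channels, a 32-channel bottleneck, 64 output channels) are hypothetical numbers chosen for the example, not the GoogLeNet configuration, and biases are ignored.

```python
def conv_weights(f, c_in, c_out):
    """Number of weights in a layer of c_out kernels of size f x f x c_in."""
    return f * f * c_in * c_out

# A 5x5 convolution applied directly to a 256-channel feature map.
direct = conv_weights(5, 256, 64)

# The same 5x5 convolution placed after a 1x1 bottleneck that first
# reduces the feature map from 256 to 32 channels.
bottleneck = conv_weights(1, 256, 32) + conv_weights(5, 32, 64)

print("direct:", direct)
print("with bottleneck:", bottleneck)
print("reduction factor:", round(direct / bottleneck, 1))
```

Since the multiply-add count per output pixel is proportional to the weight count, the computation shrinks by the same factor in this example.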

fully-connected layer

The fully connected layer in a convolutional neural network is equivalent to the hidden layer in a traditional feedforward neural network. Fully connected layers are usually placed at the end of the hidden layers and pass signals only to other fully connected layers. In the fully connected layer the feature map loses its three-dimensional structure: it is flattened into a vector and passed to the next layer through the activation function.

In some convolutional neural networks, the function of the fully connected layer can be partly replaced by global average pooling, which averages all the values of each channel of the feature map. For example, given a 7×7×256 feature map, global average pooling returns a vector of length 256 in which each element is the result of a 7×7 average pooling with stride 7 and no padding.
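Global average pooling is a one-line reduction in NumPy. The sketch assumes a channel-last 7×7×256 feature map filled with random values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(7, 7, 256))  # height x width x channels

# Average over the two spatial axes, leaving one value per channel.
gap_vector = feature_map.mean(axis=(0, 1))

# For comparison: flattening the same map for a fully connected layer.
flattened = feature_map.reshape(-1)

print(gap_vector.shape)  # (256,)
print(flattened.shape)   # (12544,)
```

The pooled vector is 49 times shorter than the flattened one, which is why global average pooling reduces the number of weights needed in the layers that follow.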

output layer

The layer feeding the output layer of a convolutional neural network is usually a fully connected layer, so the output layer has the same structure and working principle as in a traditional feedforward neural network. For image classification, the output layer uses a logistic function or a normalized exponential function (softmax) to output class labels. For object detection, the output layer can be designed to output the center coordinates, size, and class of each object. For image semantic segmentation, the output layer directly outputs the classification result for each pixel.
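For the classification case, the softmax output can be sketched as follows. The logits are made-up scores for three hypothetical classes; subtracting the maximum before exponentiating is a standard numerical-stability step.

```python
import numpy as np

def softmax(logits):
    """Normalized exponential function over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # avoids overflow
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])       # hypothetical scores for three classes
probs = softmax(logits)                  # non-negative, sums to 1
predicted_class = int(np.argmax(probs))  # index of the most probable class
```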

properties

connectivity

The connections between convolutional layers in a convolutional neural network are called sparse connections: in contrast to the full connections of a feedforward neural network, a neuron in a convolutional layer is connected to only some, not all, of the neurons in the adjacent layer. Specifically, any pixel (neuron) in the feature map of layer l is a linear combination only of the pixels in layer l-1 that lie within the receptive field defined by the convolution kernel. Sparse connection has a regularizing effect, which improves the stability and generalization ability of the network and helps avoid overfitting. It also reduces the total number of weight parameters, which speeds up learning and reduces memory overhead during computation.

All pixels in the same channel of the feature map in the convolutional neural network share a set of convolution kernel weight coefficients, which is called weight sharing. Weight sharing distinguishes convolutional neural networks from other neural networks that contain locally connected structures, which use sparse connections but have different weights for different connections. Weight sharing, like sparse connections, reduces the total number of parameters in convolutional neural networks and has a regularizing effect.

From the perspective of a fully connected network, the sparse connection and weight sharing of a convolutional neural network can be regarded as two infinitely strong priors: first, all weight coefficients of a hidden-layer neuron outside its receptive field are fixed at 0 (although the receptive field can move in space); second, within a channel, the weight coefficients of all neurons are the same.
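The effect of the two priors on the parameter count can be worked out for a small case. The sketch assumes a hypothetical single-channel 16×16 input mapped to a 12×12 output with a 5×5 receptive field, and ignores biases.

```python
L_in, L_out, f = 16, 12, 5

# Fully connected: every output neuron has a weight for every input pixel.
fully_connected = (L_in * L_in) * (L_out * L_out)

# Sparse (locally connected): each output neuron only has weights inside
# its own f x f receptive field, but the weights differ between neurons.
locally_connected = (L_out * L_out) * (f * f)

# Convolutional: sparse connection plus weight sharing, so all output
# neurons in the channel reuse one f x f kernel.
convolutional = f * f

print(fully_connected, locally_connected, convolutional)
```

Sparse connection alone cuts the count by roughly a factor of ten here, and weight sharing removes the dependence on the output size altogether.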

representation learning

Figure: feature reconstruction of a convolutional neural network based on deconvolution and unpooling

As a representative deep learning algorithm, the convolutional neural network is capable of representation learning, that is, it can extract high-order features from its input. Specifically, the convolutional and pooling layers respond to features in a translation-invariant way: they can identify similar features located at different positions in space. The ability to extract translation-invariant features is one of the reasons why convolutional neural networks are used for computer vision problems.

The propagation of translation-invariant features through a convolutional neural network follows a general pattern. In image processing problems, the feature maps at the front of the network usually extract representative high-frequency and low-frequency features of the image; the pooled feature maps that follow show edge features of the input image (aliasing artifacts); and as the signal enters deeper hidden layers, more general and more complete features are extracted. Deconvolution and unpooling can be used to visualize the hidden-layer features of a convolutional neural network. In a successful network, the feature maps passed to the fully connected layers contain the same features as the learning target, for example the complete image of each class in image classification.

biological similarity

The sparse connection based on receptive fields in a convolutional neural network corresponds to a well-established process in neuroscience: the way the visual cortex organizes visual space. Cells in the visual cortex receive signals from photoreceptors on the retina, but an individual cell does not receive signals from all photoreceptors, only from those within the stimulus region it responds to, its receptive field. Only stimuli inside the receptive field can activate the neuron. Multiple visual cortex cells receive the signals transmitted by the retina and build up visual space by systematically overlapping their receptive fields. In fact, the term "receptive field" in machine learning comes from this biological research. Weight sharing has no clear biological counterpart, but in studies of target propagation (TP) and feedback alignment (FA), mechanisms closely related to learning in the brain, weight sharing improves learning.

application

Computer vision

Image recognition (image classification) 

object recognition

Action recognition

pose estimation

neural style transfer

natural language processing

In general, because the size of the window or convolution kernel is limited, convolutional neural networks cannot learn the long-distance dependencies and structured grammatical features of natural language data well. They therefore have fewer applications in natural language processing (NLP) than recurrent neural networks, and many problems are approached within the framework of recurrent neural networks. Even so, some convolutional neural network algorithms have been successful on multiple NLP topics.

In speech processing, convolutional neural networks have been reported to outperform the Hidden Markov Model (HMM), the Gaussian Mixture Model (GMM), and some other deep algorithms. Some studies use a hybrid model of a convolutional neural network and an HMM for speech processing; the model uses small convolution kernels and replaces the pooling layer with a fully connected layer to improve its learning ability. Convolutional neural networks can also be used for speech synthesis and language modeling. For example, WaveNet uses a generative model built from a convolutional neural network to output the conditional probability of speech and samples from it to synthesize speech. Combining a convolutional neural network with a long short-term memory (LSTM) model works well for completing input sentences. Other related work includes genCNN and ByteNet.

other

physics

 

remote sensing science

Convolutional neural networks are widely used in remote sensing science, especially satellite remote sensing. When analyzing the geometric, texture, and spatial distribution features of remote sensing images, they offer advantages in both computational efficiency and classification accuracy. Depending on the source and purpose of the images, convolutional neural networks are used to study land use/land cover change and to retrieve physical quantities such as sea-ice concentration from remote sensing data. They are also widely used for object recognition and semantic segmentation of remote sensing images; these two are direct computer vision problems and are not repeated here.

Atmospheric Science

Programming modules that include convolutional neural networks

Modern mainstream machine learning libraries and interfaces, including TensorFlow, Keras, Theano, and Microsoft CNTK, can run convolutional neural network algorithms. Some commercial numerical computing software, such as MATLAB, also provides tools for building convolutional neural networks.



Origin blog.csdn.net/Java_college/article/details/122045361