A Brief Introduction to Convolutional Neural Networks (CNNs)

Problems solved

Before the advent of convolutional neural networks (CNNs), images were a difficult problem for artificial intelligence, for two reasons:

  • The amount of data required to process an image is too large, resulting in high cost and low efficiency.
  • It is difficult to retain the original features of an image during digitization, resulting in low processing accuracy.

The amount of data to be processed is too large

Images are composed of pixels, and each pixel carries color information. For example, in a 1,000×1,000-pixel picture, each pixel has RGB values representing its color, so a single picture corresponds to 3×1,000×1,000 = 3,000,000 parameters, and the computing and storage resources required to process images become extremely large.

The first problem convolutional neural networks solve is to "simplify a complex problem": reduce the number of parameters while preserving the characteristics of the original data as much as possible (intuitively: the picture becomes blurrier, but not so blurry that we humans can no longer make it out).

The core of this step is actually dimensionality reduction/embedding of the data.

It is difficult to preserve image features

For example, consider two pictures whose content is mirror-symmetric: there is no difference in their content (essence); the only difference is that one has been flipped left to right. Yet in a linear vector representation (that is, flattening a 1,000×1,000-pixel picture into a [1, 1000000, 3] vector), the two differ considerably.
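
A small NumPy sketch of this (the random image is merely a stand-in for a real photo): a left-right flip leaves the content unchanged, yet the two flattened 3,000,000-element vectors are far apart numerically.

```python
import numpy as np

# Stand-in for a 1,000x1,000 RGB image (values in [0, 1]).
rng = np.random.default_rng(0)
image = rng.random((1000, 1000, 3))

# Flip left-right: same content to a human, different pixels to a machine.
flipped = image[:, ::-1, :]

# Flatten both into the linear vector representation described above.
v1 = image.reshape(-1)   # 3,000,000 values
v2 = flipped.reshape(-1)

print(v1.shape)                                       # (3000000,)
print(np.linalg.norm(v1 - v2) / np.linalg.norm(v1))   # large relative distance
```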

The second problem convolutional neural networks solve is "viewing image data the way vision does": when an image is flipped, rotated, translated, scaled, and so on, the machine can still recognize, as humans do, that these pictures all show the same content.
The core of this step is to use principles of vision to capture the characteristics hidden in the image data.

Fundamentals

A typical CNN consists of three parts: "convolution layer", "pooling layer" and "fully connected layer":

  • Convolution layer: responsible for extracting local features from the image
  • Pooling layer: drastically reduces the parameter count (dimensionality reduction)
  • Fully connected layer: the part that resembles a traditional neural network, used to output the desired result (a minimal sketch of all three parts follows this list)
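
Under illustrative assumptions (the layer sizes, the 10-class task, and the 32×32 RGB input are choices made here, not from the original text), a minimal PyTorch sketch of this three-part division of labor might look as follows:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """A toy CNN with the three typical parts: conv -> pool -> fully connected."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv = nn.Conv2d(3, 6, kernel_size=5)     # 6 kernels: 6 local feature maps
        self.pool = nn.MaxPool2d(kernel_size=2)        # downsample by 2 in each dimension
        self.fc = nn.Linear(6 * 14 * 14, num_classes)  # map features to label space

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.conv(x))   # extract local features
        x = self.pool(x)               # reduce dimensionality
        x = x.flatten(start_dim=1)     # flatten for the fully connected layer
        return self.fc(x)              # output class scores

# A 32x32 RGB input: conv(5x5) -> 28x28, pool(2) -> 14x14.
scores = SimpleCNN()(torch.randn(1, 3, 32, 32))
print(scores.shape)  # torch.Size([1, 10])
```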

Convolution layer - extract image features

Convolution kernels filter image data
Convolution can be understood as using a filter (the convolution kernel) to scan small regions of the image and obtain feature values for those regions.

In practice there are usually multiple convolution kernels. Each convolution kernel can be thought of as "representing one image pattern": if convolving an image block with a kernel yields a large value, the block is considered very close to that kernel's pattern.

If we design 6 convolution kernels, this can be read as: we believe the image contains 6 underlying texture patterns, that is, the image can be described in terms of 6 basic patterns.

Through the filtering of its convolution kernels, the convolutional layer extracts local features from the image, much like feature extraction in human vision.
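
As a sketch of this filtering idea (pure NumPy; the hand-written vertical-edge kernel stands in for a learned "image mode"), convolving a toy image that contains a vertical edge yields large responses exactly where the local patch matches the kernel:

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive valid 2D convolution: slide the kernel over every local patch.
    Like deep learning frameworks, this does not flip the kernel
    (technically cross-correlation)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # large value => patch matches the pattern
    return out

# A toy image: dark on the left, bright on the right (a vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A vertical-edge kernel: one possible "image mode" a CNN might learn.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

print(conv2d(image, kernel))  # responses peak where the edge is
```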

Pooling layer - data dimensionality reduction (avoiding overfitting)

The pooling layer is, simply put, downsampling, which can greatly reduce the dimensionality of the data.

Why a pooling layer is needed: even after convolution, the feature maps are still large (because convolution kernels are usually small), so downsampling is performed to reduce the data dimension.

The pooling function is essentially a statistical function, such as max pooling, average pooling, or sum pooling.

So: will the pooling function have side effects on the image data?

The answer is: generally not.

Regarding the pooling layer, there is a notion of invariance to local transformations: if the input data undergoes a small local transformation (such as a slight translation), then after the pooling operation the output changes little or not at all.

Local translation "invariance" is particularly useful when we care about whether a feature appears, not where it appears (for example, in a face-detection scenario, we only care about whether the image contains the features of a face, regardless of whether the face is in the upper left or lower right corner of the image).
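
A small NumPy sketch of this invariance (the 2×2 max-pooling window is an illustrative assumption): a feature activation shifted by one pixel within the same pooling window produces an identical pooled output.

```python
import numpy as np

def max_pool_2x2(x: np.ndarray) -> np.ndarray:
    """Max pooling with a 2x2 window and stride 2: keep the max of each block."""
    h, w = x.shape  # assumes even height and width for simplicity
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.zeros((4, 4))
x[1, 1] = 1.0              # a single "feature" activation
shifted = np.zeros((4, 4))
shifted[1, 0] = 1.0        # the same feature, translated one pixel to the left

print(max_pool_2x2(x))        # [[1. 0.] [0. 0.]]
print(max_pool_2x2(shifted))  # identical output: pooling absorbed the small shift
```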

Why can the pooling layer reduce the probability of overfitting?

Because the pooling function makes the model attend more to global features (rather than local ones), it helps keep the model from fixating on idiosyncratic details of an image (for example, attending to a face as a whole rather than to the size of its eyes).

Summary: the pooling layer reduces data dimensionality more effectively than the convolution layer. This not only greatly reduces the amount of computation but also helps avoid overfitting.

Fully connected layer - output results



The data processed by the convolutional layer and the pooling layer are input to the fully connected layer to obtain the final desired result.

The fully connected layer can only afford to "run" on data whose dimensionality has already been reduced by the convolutional and pooling layers; otherwise the amount of data would be too large, computation too costly, and efficiency too low.

"Fully connected" means that all neurons in the previous layer of the network are connected to all neurons in the next layer.

The purpose of the fully connected layer is to map the "distributed feature representation" learned by the preceding layers into the "sample label space", use a loss function to guide the learning process, and finally output the classification prediction.

Although the fully connected layer may seem the most inconspicuous step in the whole network structure, because it "connects" all parameters it introduces a large number of redundant parameters, and a poor design easily causes overfitting in the fully connected layer, which can be alleviated with the Dropout method. At the same time, its extremely high parameter count degrades performance. In this regard, Shuicheng Yan's team proposed in the paper Network in Network (NIN) [4] to replace the fully connected layer with a Global Average Pooling (GAP) strategy.
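
A hedged PyTorch sketch of the contrast (the 256-channel 7×7 feature map and the 10-class head are illustrative assumptions, not details from NIN): GAP collapses each feature map to a single value, shrinking the classifier head's parameter count by orders of magnitude.

```python
import torch
import torch.nn as nn

num_classes = 10
features = torch.randn(1, 256, 7, 7)  # assumed output of the last conv block

# Classic head: flatten, then fully connect (256*7*7*10 = 125,440 weights).
fc_head = nn.Linear(256 * 7 * 7, num_classes)
fc_out = fc_head(features.flatten(start_dim=1))

# GAP head: average each 7x7 map to one value, then a small linear layer
# (256*10 = 2,560 weights).
gap = nn.AdaptiveAvgPool2d(1)
gap_head = nn.Linear(256, num_classes)
gap_out = gap_head(gap(features).flatten(start_dim=1))

print(fc_out.shape, gap_out.shape)  # both torch.Size([1, 10])
```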

Practical applications

Image classification and retrieval

Image classification is a relatively basic application. It can save considerable labor cost by classifying images effectively. For pictures in some specific domains, classification accuracy can exceed 95%, which already makes it a highly usable application.

In the development of computer vision classification algorithms, MNIST was the first benchmark of general academic significance. It is a handwritten-digit classification benchmark containing 60,000 training images and 10,000 test images, all grayscale; the commonly used version is 28×28 pixels.

In the late 1990s and early 2000s, SVM and KNN methods were widely used. Methods represented by SVM reduced the MNIST classification error rate to 0.56%, surpassing the neural-network methods represented by the LeNet series (LeNet-5's error rate is about 0.7%).

In the early 2000s, neural networks began to show signs of revival. However, limited by dataset size and hardware, training and optimizing neural networks remained difficult: MNIST has only 60,000 images and poses only a 10-class task, far from sufficient for industrial deployment.

In 2009, after several years of curation by Fei-Fei Li and others, the ImageNet dataset was released. It contains 14,000,000+ images covering 20,000+ categories; the 1,000-category subset is the benchmark commonly used for comparing methods in papers.

In the early days after ImageNet's release, traditional machine learning methods such as SVM and boosting still dominated, until AlexNet appeared in 2012. AlexNet was the first truly deep network: compared with LeNet-5's 5 layers, it added 3 more, its parameter count grew greatly, and the input resolution rose from 28×28 to 224×224. At the same time, GPU training ushered deep learning into an era where the GPU is king.

The 2014 champion and runner-up networks were GoogLeNet and VGGNet, respectively.

In 2015, ResNet won the championship in classification tasks.

2016 still produced many classic models, including ResNeXt, which took second place in the classification competition; a 101-layer ResNeXt can match the accuracy of ResNet-152.

Object detection

Object detection locates objects in an image, determining their position and size.

Image segmentation

A simple way to understand it: pixel-level classification.

It can separate foreground from background at the pixel level and, at a more advanced level, also identify and classify the objects it finds.
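
A tiny PyTorch sketch of "pixel-level classification" (all shapes here are illustrative): a segmentation network emits one score map per class, and an argmax over the class dimension labels every pixel.

```python
import torch

num_classes, height, width = 3, 4, 4

# Assumed raw output of a segmentation network: one score map per class.
logits = torch.randn(1, num_classes, height, width)

# Pixel-level classification: pick the highest-scoring class at every pixel.
mask = logits.argmax(dim=1)

print(mask.shape)  # torch.Size([1, 4, 4]): one class label per pixel
print(mask)
```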

Typical scenarios: photo beautification, video post-processing, image generation, etc.

Natural language processing

Beyond image processing, CNNs have also achieved considerable success in the field of Natural Language Processing (NLP).

In the task of "sentence-level text classification", Zhang et al. [5] used a CNN to extract features from sentence vectors in the paper A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification.
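
A minimal sketch of the general TextCNN pattern (PyTorch; the vocabulary size, embedding width, filter widths, and two-class output are assumptions, not details from [5]): convolve filters of several widths over a sentence's word-embedding matrix, max-pool each feature map over time, and classify the concatenated features.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Sentence classification via 1D convolutions over word embeddings."""

    def __init__(self, vocab_size=5000, embed_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Filters of widths 3, 4, 5 read 3-, 4-, 5-word windows of the sentence.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, 100, kernel_size=k) for k in (3, 4, 5)]
        )
        self.fc = nn.Linear(3 * 100, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens).transpose(1, 2)   # (batch, embed_dim, seq_len)
        # Max-pool each feature map over the whole sentence ("max over time").
        feats = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(feats, dim=1))  # (batch, num_classes)

scores = TextCNN()(torch.randint(0, 5000, (1, 20)))  # a 20-token sentence
print(scores.shape)  # torch.Size([1, 2])
```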
In the "emotion-cause pair extraction" task, Chen et al. [6] used a CNN, in the paper A Unified Sequence Labeling Model for Emotion Cause Pair Extraction, to convolve a matrix composed of multiple sentence vectors and obtain paragraph-level context information.

In the "knowledge graph embedding" task, Dettmers et al. [7] proposed ConvE in the paper Convolutional 2D Knowledge Graph Embeddings, which applies a 2D convolution to the head entity and relation of a knowledge-graph triple (head entity, relation, tail entity), enhancing the model's representational power through interactions between the embeddings.

References

  1. Understand Convolutional Neural Networks - CNN (Basic Principles + Unique Value + Practical Applications) in One Article - Product Manager’s Artificial Intelligence Learning Library
  2. [Deep Learning] Activation introduces nonlinearity, and pooling prevents overfitting. CSDN blog.
  3. Basic components of neural networks: pooling layer, dropout layer, BN layer, fully connected layer
  4. Lin M, Chen Q, Yan S. Network in network[J]. arXiv preprint arXiv:1312.4400, 2013.
  5. Zhang Y, Wallace B. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification[J]. arXiv preprint arXiv:1510.03820, 2015.
  6. Xinhong Chen, Qing Li, and Jianping Wang. 2020. A Unified Sequence Labeling Model for Emotion Cause Pair Extraction. In Proceedings of the 28th International Conference on Computational Linguistics, pages 208–218, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  7. Dettmers T, Minervini P, Stenetorp P, et al. Convolutional 2d knowledge graph embeddings[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2018, 32(1).
  8. Long Peng (pen name Yan Yousan): [Technical Review] Do you really understand image classification?
  9. Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks[J]. Advances in neural information processing systems, 2012, 25.
