Summary of basic knowledge of deep learning

background

The Big Three of Deep Learning: On March 27, 2019, the ACM (Association for Computing Machinery) announced that Yoshua Bengio, Yann LeCun, and Geoffrey Hinton, known as the "Big Three of Deep Learning", had jointly won the 2018 Turing Award.

Related papers published by the "Three Giants of Deep Learning": https://github.com/longpeng2008/Awesome_DNN_Researchers

The relationship between deep learning/machine learning/artificial intelligence, computer vision/machine vision/image processing…

[Figure: diagram of the relationships among artificial intelligence, machine learning, deep learning, computer vision, machine vision, and image processing]
The diagram is not entirely accurate; for example, deep learning can also be used for unsupervised learning.

Artificial intelligence, machine learning, and deep learning are sciences (or disciplines); machine vision and image processing are technologies (or applications); computer vision can be called either a science or a technology.

Artificial Intelligence is a branch of computer science. It attempts to understand the essence of intelligence and to produce new intelligent machines that can respond in a manner similar to human intelligence. Research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. Artificial intelligence is a very broad science made up of different fields such as machine learning and computer vision. A major goal of artificial intelligence research is to enable machines to perform complex tasks that normally require human intelligence.

Machine Learning is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers can simulate or implement human learning behaviors in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their performance. It is the core of artificial intelligence and the fundamental way to make computers intelligent.
Traditional machine learning methods roughly include support vector machines, decision trees, random forests, hidden Markov models, principal component analysis, and so on. The differences between these methods can be considerable: for example, SVMs are discriminative models while hidden Markov models are generative models; principal component analysis is unsupervised learning while random forests are supervised learning.

Deep Learning (DL) is a newer research direction within the field of Machine Learning (ML). It was introduced into machine learning to bring machine learning closer to its original goal: Artificial Intelligence (AI).
Deep learning learns the internal regularities and representation hierarchies of sample data. The information obtained during this learning process is of great help in interpreting data such as text, images, and sound. Its ultimate goal is to give machines the ability to analyze and learn like humans, and to recognize data such as text, images, and sound. Deep learning is a complex family of machine learning algorithms, and it has achieved results in speech and image recognition that far exceed earlier techniques.
Deep learning has produced many results in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation and personalization, and other related fields. It enables machines to imitate human activities such as seeing, hearing, and thinking, solves many complex pattern recognition problems, and has driven great progress in artificial intelligence technologies.
Deep learning is a technique for implementing machine learning. It is not an independent learning method in itself: it uses both supervised and unsupervised learning to train deep neural networks. However, because the field has developed rapidly in recent years and some unique techniques have been proposed (such as residual networks), more and more people now treat it as a learning method in its own right.

Computer Vision is a comprehensive subject. It uses computers and related equipment to simulate biological vision; it is the science of making machines "see". More concretely, it means using cameras and computers instead of human eyes to identify, track, and measure targets, and to further process the resulting images so that they become more suitable for human observation or for transmission to instruments for detection.

Machine vision is a comprehensive technology that includes image processing, and it is a rapidly developing branch of artificial intelligence. Simply put, machine vision uses machines instead of human eyes to make measurements and judgments. A machine vision system captures the target with an image acquisition device (a CMOS or CCD camera), converts it into an image signal, and transmits it to a dedicated image processing system, which converts the pixel distribution, brightness, color, and other information into digital signals; the system then performs various operations on these signals to extract the target's features, and finally controls on-site equipment according to the results.

Image processing is the technique of analyzing images with a computer to achieve a desired result; it generally refers to digital image processing. A digital image is a large two-dimensional array obtained with industrial cameras, video cameras, scanners, and similar equipment; the elements of this array are called pixels, and their values are called grayscale values. Image processing technology generally consists of three parts: image compression; enhancement and restoration; and matching, description, and recognition.

Relationship:
Computer vision provides the theoretical and algorithmic basis for machine vision, and machine vision is part of the engineering implementation of computer vision. Computer vision involves more theory than machine vision, and the application scenarios of the two are different: machine vision leans toward industrial production, while computer vision leans more toward analyzing images acquired by computers.
Computer vision mainly does qualitative analysis, such as classification and recognition (this is a cup, that is a dog), identity verification (face recognition, license plate recognition), or behavior analysis (intrusion, loitering, abandoned objects, crowd gathering, and so on). Deep learning is better suited to computer vision. Moreover, conditions such as lighting, distance, and angle are often dynamic, so the accuracy requirements are generally lower.
Machine vision mainly focuses on quantitative analysis, such as measuring the diameter of a part by vision, and it generally requires high accuracy. The resolution used in machine vision is usually much higher than in computer vision, and real-time processing is often required, so processing speed is critical. Machine vision generally uses commercial libraries such as Halcon or VisionPro, while computer vision generally uses the OpenCV library.
Nowadays the distinction between computer vision and machine vision is becoming less and less clear.

We cannot simply say that deep learning is a more advanced stage of machine learning. Roughly speaking, machine learning handles structured data, while deep learning handles unstructured data.
Structured data: https://worktile.com/kb/ask/4895.html, https://baike.baidu.com/item/structured data/5910594?fr=aladdin

Image processing based on deep learning: using machine learning to solve image processing problems is called machine vision, and the current mainstream approach is deep learning.
Machine vision is one application of deep learning, machine learning is a discipline that includes deep learning, and image processing mainly applies machine vision methods.

Science solves theoretical problems (what, why), technology solves practical problems (what to do, how to do it).

Reference: Understand the difference and relationship between artificial intelligence, machine learning, deep learning and neural network

Supervised learning, unsupervised learning, semi-supervised learning

Supervised learning: the process of using a set of samples of known categories to adjust the parameters of a classifier so that it reaches the required performance.
https://baike.baidu.com/item/supervised learning/9820109?fr=aladdin
Supervised learning methods include support vector machines (SVM), decision trees, random forests, neural networks, linear regression, KNN (k-nearest neighbors, which can be used for regression or classification), and naive Bayes.
K-nearest neighbors (KNN) finds the objects most similar to a given sample within a certain range.
SVM is a classic generalized linear classifier for binary classification.
Understanding decision trees: https://zhuanlan.zhihu.com/p/197476119

Solving pattern recognition problems based on training samples of unknown category (unlabeled samples) is called unsupervised learning. This approach uses the data distribution of the training samples, or the relationships between samples, to divide the samples into different clusters or to find a corresponding low-dimensional structure for them.
Unsupervised learning methods mainly include principal component analysis (PCA), K-means, anomaly detection methods, autoencoders, deep belief networks, Hebbian learning, generative adversarial networks (GAN), and self-organizing maps (SOM).
https://baike.baidu.com/item/unsupervised learning/810193?fr=aladdin

Semi-supervised learning uses large amounts of unlabeled data, together with labeled data, for pattern recognition. It aims to require as little human labeling effort as possible while still achieving relatively high accuracy.
https://baike.baidu.com/item/semi-supervised learning/9075473?fr=aladdin
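
As a small illustration of the supervised/unsupervised distinction above, here is a minimal sketch using scikit-learn (not from the original article; the toy data and models are chosen only for illustration): the SVM needs the labels y, while K-means looks only at X.

import numpy as np
from sklearn.svm import SVC          # supervised: requires labels
from sklearn.cluster import KMeans   # unsupervised: no labels needed

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)    # labels, used only by the supervised model

clf = SVC(kernel="linear").fit(X, y)          # learns from (X, y)
print("SVM prediction:", clf.predict([[4.5, 5.0]]))

km = KMeans(n_clusters=2, n_init=10).fit(X)   # finds structure in X alone
print("KMeans cluster:", km.predict([[4.5, 5.0]]))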

Image classification, object detection, semantic segmentation, instance segmentation

reference link

basic knowledge

activation function

The role of the activation function

The activation function is an important part of the neural network, and its role has the following aspects:

  1. Introduce nonlinearity: Activation functions are the main means of introducing nonlinearity in neural networks, because the combination of linear functions is still linear. By using nonlinear activation functions, neural networks can learn more complex functions, thereby improving the expressive ability of the model and better adapting to complex data sets.

  2. Normalized output: The activation function can map the output of the neuron to a specific range, such as [0,1] or [-1,1], which can make the output of the neuron have a uniform scale for easy comparison and processing.

  3. Activate neurons: The name of the activation function comes from its main role, which is to activate neurons. By setting an appropriate activation function, the neuron can output 1 when the input signal reaches a certain threshold, otherwise it outputs 0, thereby realizing the classification of the input signal.

  4. Preventing vanishing gradients: Activation functions can also help with vanishing gradients. In the backpropagation algorithm, the gradient needs to be passed back continuously. If the derivative of the activation function is very small in some regions, the backpropagation algorithm will not be able to effectively update the weights. By using some activation functions with appropriate gradients, such as the ReLU function, the neural network can be easier to train and optimize.

To sum up, the activation function plays a key role in a neural network: it introduces nonlinearity, normalizes outputs, activates neurons, and helps prevent vanishing gradients, all of which have an important impact on the network's performance and training behavior.

Activation functions are generally non-linear

The activation function of a neural network must be nonlinear, because if a linear function were used as the activation function, the outputs of stacked neurons would only ever be a linear combination of the inputs plus a bias term; nonlinear features could not be expressed, which limits the fitting ability of the network. Using nonlinear activation functions makes the network more expressive and able to handle more complex problems.

Common Activation Functions

  1. Sigmoid function: the Sigmoid function maps the input to between 0 and 1 and has smooth nonlinear properties. Its gradient is largest when the input is 0, but for inputs beyond a certain range on either side the gradient approaches 0, which causes the vanishing gradient problem.

  2. ReLU function: the ReLU function outputs the input value when it is positive and 0 otherwise. Its range is 0 to positive infinity; it has simple nonlinear characteristics and largely avoids the vanishing gradient problem.

  3. Tanh function: the Tanh function maps the input to between -1 and 1 and has smooth nonlinear characteristics, but like the Sigmoid function it also suffers from vanishing gradients.

  4. Leaky ReLU function: the Leaky ReLU function outputs a value with a small slope when the input is negative, avoiding the dying-neuron problem of the ReLU function.

Choosing an appropriate activation function can significantly affect the performance of a neural network, and the choice usually depends on the specific task and data set. For example, for binary classification the Sigmoid function can be used at the output; for multi-class classification the Softmax function is used; for most other problems, including regression, ReLU or Leaky ReLU are common choices. The choice of activation function should also take into account vanishing and exploding gradients, as well as computational efficiency and numerical stability.
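
As a quick reference, the following minimal PyTorch sketch applies the activation functions discussed above to a few sample values (the inputs are arbitrary and only meant to show the output ranges):

import torch
import torch.nn.functional as F

z = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])

print("sigmoid   :", torch.sigmoid(z))         # maps to (0, 1)
print("tanh      :", torch.tanh(z))            # maps to (-1, 1)
print("relu      :", torch.relu(z))            # 0 for z < 0, z otherwise
print("leaky_relu:", F.leaky_relu(z, 0.1))     # small slope 0.1 for z < 0
print("softmax   :", torch.softmax(z, dim=0))  # typical multi-class output layer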

  • After the inputs are normalized, gradient descent converges faster. Normalization here means, for example, dividing each row of the input x by the norm of that row.
  • The sigmoid function is rarely used as an activation function, because the tanh function is almost always better than sigmoid, except at the output of binary classification problems. Rule of thumb: use the sigmoid function for binary classification problems or for outputs that must be 0 or 1, and use ReLU or Leaky ReLU otherwise. When z is very large or very small, the derivatives of sigmoid and tanh become very small, which makes gradient descent very slow.
  • If only linear activation functions are used, the neural network will simply output a linear combination of its inputs. In that case, no matter how many hidden layers there are, they have no effect: the model is no more expressive than standard logistic regression without any hidden layer (a small sketch after this list illustrates this). Hidden layers therefore do not normally use a linear activation function; the only layer where a linear activation function may be used is the output layer.
  • Random initialization: when training a neural network, the weights should be initialized randomly; otherwise all hidden units perform the same computation, produce the same result, and receive the same updates. After many iterations, all hidden units still compute the same function, which is meaningless unless there is only one hidden unit. The parameter b has no such symmetry-breaking problem, so it does not need to be initialized randomly and can simply be initialized to 0.
  • Why is it more efficient to use a deeper network (more hidden layers) rather than a wider network (more hidden units)?
    Circuit theory gives an intuition: which functions you can compute depends on how basic circuit elements are composed. Using basic logic gates such as AND, OR, and NOT, the same function (or expression) can be computed with a relatively small but deep network, much like connecting basic gates in series and in parallel; but if you are not allowed many hidden layers, you need an exponentially larger number of units to achieve the same result. Many functions are therefore much easier to compute with a deep network than with a shallow one.
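
The following small sketch (an assumed example, not from the original notes) shows the point about linear activations: two stacked Linear layers with no activation in between are exactly equivalent to a single linear layer whose weights are the product of the two.

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 3)

l1 = nn.Linear(3, 5)
l2 = nn.Linear(5, 2)
two_layer_out = l2(l1(x))            # no nonlinear activation in between

# equivalent single linear map: W = W2 @ W1, b = W2 @ b1 + b2
W = l2.weight @ l1.weight
b = l2.weight @ l1.bias + l2.bias
single_layer_out = x @ W.T + b

print(torch.allclose(two_layer_out, single_layer_out, atol=1e-6))  # True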

training set/validation set/test set, cross-validation...

Andrew Ng:

  • The training set participates in the training, and the model learns experience from the training set, thereby continuously reducing the training error.
  • The purpose of the verification set is to verify different algorithms and test which algorithm is more effective.
  • The purpose of the test set is to make an unbiased estimate of the final selected neural network system.

To make this easier to understand, these three data sets are often compared to a student's textbook, homework, and final exam:
Training set (textbook exercises): the student masters the knowledge from the textbook content.
Validation set (homework questions): homework reflects, in real time, how well and how quickly different students are learning.
Test set (exam questions): the exam questions have not been seen before, and they test the student's ability to generalize from what was learned.

Training set

The training set participates in training: the trainable parameters are fitted on the training set, their gradients are computed, and the parameters are continuously updated to reduce the value of the cost function.
"Parameters" here refers to the trainable parameters of the model (such as w and b).

Validation set (dev set)

The validation set does not participate in training. It is usually used for model selection and for tuning the hyperparameters: based on how several candidate models perform on the validation set, you decide which set of hyperparameters works best. At this stage, the trainable parameters have already been fitted on the training set.

Some understanding:

  • The validation set can also be used to monitor whether the model is overfitting during training. Generally, once validation performance has plateaued, continuing to train will keep improving training performance while validation performance starts to fall instead of rising; this usually means overfitting has set in. So the validation set is also used to decide when to stop training.
  • If you tune hyperparameters directly on the training set without a validation set, you cannot tell whether the model is overfitting or what its true performance is.
  • The purpose of the validation set is to simulate the test set: if your hyperparameters work on the validation set, they will presumably also work on the test set. If there is no validation set and only a test set, then after many rounds of tuning the model also overfits the test set to some degree, even if this overfitting is less severe than overfitting to the training data; at that point we can no longer objectively evaluate the model by the metric on a test set we have tuned against.

test set

The test set does not participate in training; it is used to evaluate the trained model and estimate its generalization ability.
It is only for evaluation (only for evaluation!), to estimate the generalization ability of the model in actual use.

Some understanding:

  • The test set is used to evaluate the final selected neural network model, and the model has never seen the test set before. Strictly speaking, the test set can only be used once, otherwise it is like cheating.
  • In the procedure above, the [validation set] is used to determine the [hyperparameters], the [training set] is used to fit the [trainable parameters], and finally a data set the model has never seen is used to judge its quality. Note that the test set is only used for the final evaluation of the model: the metrics obtained on the test set can be used for comparison with models trained by others, or to report how effective your model is. Never tune the model's hyperparameters based on its test-set metrics (that is the validation set's job); doing so would cause the model to overfit the test set, so the test set would lose its objectivity and accuracy as a measure of performance.

Cross-validation

The effect of cross-validation is more pronounced for small-scale datasets.

N-fold cross-validation has two purposes: model evaluation and model selection.
N-fold cross-validation is just a strategy for partitioning a data set. To see its advantages, compare it with a single fixed train/test split: it avoids the limitations and particularities of a fixed split, and this advantage is more obvious on small-scale data sets.
Using this strategy to divide the training set and test set can be used for model evaluation; using this strategy to divide the training set and validation set can be used for model selection.
Can model evaluation and model selection only be done with N-fold cross-validation? Of course not. As long as there is a test set, model evaluation can be performed; as long as there is a validation set, model selection can be performed. So N-fold cross-validation is just an optional refinement for these two tasks.
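
A minimal sketch of 5-fold cross-validation with scikit-learn (the toy data and model are chosen only for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kf)   # train/validate on 5 different splits
print("fold accuracies:", scores)
print("mean accuracy  :", scores.mean())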

Target Detection

  • If you want to see relevant classic and latest papers , check out github: https://github.com/amusi/awesome-object-detection

  • One-stage (single-stage target detection): e.g., SSD, YOLO.
    Classification and bounding-box regression are performed directly on the anchors.

  • Two-stage target detection: e.g., Faster R-CNN.
    A dedicated module (the RPN) first generates candidate boxes, separating foreground from background and adjusting the bounding boxes (based on anchors);
    the candidate boxes are then classified and the bounding boxes are further refined (based on the proposals).

YOLO algorithm

YOLO algorithm development process

YOLO (You Only Look Once) is a target detection algorithm proposed by Joseph Redmon et al. in 2015. Its main idea is to treat the target detection task as a regression problem, and predict the position and category of the target simultaneously in one neural network.

Since the first release of YOLO in 2015, the YOLO series has undergone many updates and improvements. The following is the development history of the YOLO series:

YOLO v1: In 2015, Joseph Redmon et al. first proposed YOLO. YOLO v1 uses a single convolutional neural network that divides the input image into a grid and predicts the class and location of objects in each grid cell; the paper was presented at CVPR in 2016.

YOLO v2: In 2016, YOLO v2 (YOLO9000, https://arxiv.org/pdf/1612.08242.pdf) was released. It adopted several improvements, including a deeper network structure, higher-resolution input images, and Batch Normalization, to improve detection accuracy and speed.

YOLO v3: In 2018, YOLO v3 was released. It adopts a deeper network structure and multi-scale detection strategy, which can detect targets of different scales, and its accuracy and speed are better than YOLO v2.

YOLO v4: In 2020, YOLO v4 was released. It incorporates more techniques, including SPP (Spatial Pyramid Pooling), CSP (Cross Stage Partial Network), Mosaic data augmentation, and DropBlock, which further improve detection accuracy and speed.

YOLO v5: Also in 2020, YOLO v5 was released. It uses a lightweight network structure and a new training strategy to achieve high-precision target detection at higher speed.

In general, the YOLO series of algorithms continue to develop, continuously improving the accuracy and speed of target detection, and becoming one of the most widely used target detection algorithms.
Original link: https://blog.csdn.net/qq_54372122/article/details/129537219

convolution

  • Convolution operation: sliding a fixed convolution kernel over different windows of the data and computing the elementwise product (and sum) with each window is called a convolution operation.
  • Local connection: unlike the full connections in a traditional neural network, each node of a convolutional layer is connected only to some nodes of the previous layer and only learns local features (the local correlation principle). This effectively reduces the number of weight parameters, speeds up model learning, and helps avoid overfitting to some extent.
  • Weight sharing: as the convolution kernel slides over the image, its parameters remain the same across the entire image (see the sketch below).
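
A rough sketch of what local connection and weight sharing buy you (the layer sizes are assumed for illustration): a 3x3 convolution reuses the same small kernel at every position, while a fully connected layer mapping the same 16x16x3 input to the same 16x16x16 output needs a separate weight for every input-output pair.

import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
fc = nn.Linear(16 * 16 * 3, 16 * 16 * 16)   # same input/output sizes, fully connected

conv_params = sum(p.numel() for p in conv.parameters())
fc_params = sum(p.numel() for p in fc.parameters())
print("conv params:", conv_params)   # 16*3*3*3 + 16 = 448
print("fc params  :", fc_params)     # 768*4096 + 4096 = 3,149,824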

dilated convolution

Dilated/Atrous Convolution is widely used in tasks such as semantic segmentation and target detection. The classic DeepLab series and DUC in semantic segmentation build heavily on atrous convolution, and SSD and RFBNet in target detection also use dilated convolution.
Atrous convolution: zeros are inserted into the 3*3 convolution kernel. There are two equivalent implementations: either the kernel is expanded by filling in zeros, or the input is sampled at equal intervals.
Function:
Expanding the receptive field: in deep networks, downsampling (pooling or stride-2 convolution) is used to enlarge the receptive field and reduce computation, but while the receptive field grows, the spatial resolution drops. Dilated convolution makes it possible to expand the receptive field without losing resolution. This is very useful in detection and segmentation tasks: a large receptive field helps detect and segment large targets, while high resolution helps locate targets precisely.
Capturing multi-scale context: atrous convolution has a dilation-rate parameter, which means (dilation rate - 1) zeros are inserted between the elements of the kernel. Different dilation rates therefore give different receptive fields, i.e., multi-scale information, which is very important in vision tasks.
In short, dilated convolution can expand the receptive field arbitrarily without introducing extra parameters, but since the resolution is kept high, the overall computation of the algorithm inevitably increases.
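
A minimal PyTorch sketch of dilated convolution: with dilation=2, a 3x3 kernel covers a 5x5 area, so padding=2 keeps the spatial size unchanged while the parameter count stays the same as an ordinary 3x3 convolution.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)

normal = nn.Conv2d(1, 1, kernel_size=3, padding=1, dilation=1)
atrous = nn.Conv2d(1, 1, kernel_size=3, padding=2, dilation=2)

print(normal(x).shape)  # torch.Size([1, 1, 32, 32]), 3x3 receptive field
print(atrous(x).shape)  # torch.Size([1, 1, 32, 32]), 5x5 receptive field, same parameters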

ps: Although dilated convolution has so many advantages, it is not easy to optimize in practice, and the speed will be greatly reduced.

Disadvantages:
Loss of local information: because dilated convolution samples the input in a checkerboard-like pattern, the result of a given layer comes from an independent subset of the previous layer, and the convolution outputs within the layer lack mutual dependence, i.e., local information is lost.
Long-range information may be irrelevant: because dilated convolution samples the input sparsely, information gathered over long distances may have no correlation, which affects classification results.
Reference:
Summary - Dilated/Atrous Convolution

receptive field

The receptive field is the region of the input image that a node in the output feature map responds to. In other words, the value of each output node of a conv layer depends only on a certain region of that layer's input; input values outside this region do not affect the output value, and this region is the receptive field.
Padding does not affect the receptive field; stride only affects the receptive fields of the following layers' feature maps; kernel size affects the receptive field of the current layer.

Receptive field of layer n = receptive field of layer n-1 + (kernel size of layer n - 1) × (product of the strides of all layers before layer n).
There is a position-to-receptive-field correspondence between any two layers, but most commonly we speak of the receptive field of a feature-map layer with respect to the input image.
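
A small sketch implementing the formula above (the layer stack in the example is hypothetical):

def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered from input to output."""
    rf, total_stride = 1, 1
    for k, s in layers:
        rf += (k - 1) * total_stride   # (kernel - 1) times product of earlier strides
        total_stride *= s
    return rf

# e.g. conv3x3/s1 -> conv3x3/s2 -> conv3x3/s1
print(receptive_field([(3, 1), (3, 2), (3, 1)]))  # 1 + 2*1 + 2*1 + 2*2 = 9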

overfitting

The essence of overfitting is that the model has learned characteristics specific to the training data itself, rather than the underlying general characteristics.

noise

Noise is data that misleads the model.
Example:
If a training image contains both a mouse and a cat, the image's label is "cat", and the object we want to recognize is also "cat", then the mouse in the image is noise. If the model mistakenly memorizes information corresponding to the mouse (or to Gaussian white noise) as belonging to the label, model training will fail.

IOU

Intersection over Union (IoU) generally refers to the ratio of the intersection to the union of the proposal box predicted by the model and the ground-truth box.
If A is the proposal box and B is the ground-truth box, the union of A and B contains all parts of both, and C is the intersection of A and B; IoU = area(A ∩ B) / area(A ∪ B).
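
A minimal sketch of the IoU computation for two axis-aligned boxes given as (x1, y1, x2, y2) (the example coordinates are made up):

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143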

Model building related

Dropout method

Dropout reduces the tendency to overfit.
Dropout is generally used between fully connected layers, in the order Linear -> nonlinear activation (such as ReLU) -> Dropout (see the sketch below).
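
A minimal sketch of this ordering between fully connected layers (the layer sizes are arbitrary):

import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),     # randomly zeroes activations, only while training
    nn.Linear(256, 10),
)
classifier.train()   # dropout active during training
classifier.eval()    # dropout disabled at inference time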

flatten

x = torch.flatten(x, start_dim=1)
This flattens the tensor starting from dimension 1. In PyTorch the layout is (N, C, H, W), so flattening starts from the channel dimension and produces a tensor of shape (N, C*H*W).
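
For example (shapes chosen arbitrarily):

import torch

x = torch.randn(8, 3, 4, 4)          # (N, C, H, W)
y = torch.flatten(x, start_dim=1)    # keep the batch dimension, flatten the rest
print(y.shape)                       # torch.Size([8, 48])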

PyTorch related

convolution operation

output H and W

According to the formula N = (W - F + 2P)/S + 1, N is sometimes not an integer (for example in the first layer of the AlexNet and GoogLeNet networks). For instance, with an input matrix of H = W = 5, kernel size F = 2, stride S = 2, and padding P = 1, we get N = (5 - 2 + 2*1)/2 + 1 = 3.5. In this case PyTorch simply ignores the last row and the last column during the convolution so that N becomes an integer: N = (5 - 2 + 2*1 - 1)/2 + 1 = 3. In other words, Conv2d rounds down; pooling layers also round down by default, although MaxPool2d can be made to round up with ceil_mode=True.
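
A small sketch checking this rounding behaviour on the 5x5 example above: Conv2d floors the output size, and MaxPool2d also floors by default but can round up with ceil_mode=True.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 5, 5)

conv = nn.Conv2d(1, 1, kernel_size=2, stride=2, padding=1)
print(conv(x).shape)         # torch.Size([1, 1, 3, 3]); (5-2+2*1)/2+1 = 3.5 -> 3

pool_floor = nn.MaxPool2d(kernel_size=2, stride=2)                  # (5-2)/2+1 = 2.5 -> 2
pool_ceil = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)   # 2.5 -> 3
print(pool_floor(x).shape)   # torch.Size([1, 1, 2, 2])
print(pool_ceil(x).shape)    # torch.Size([1, 1, 3, 3])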

nn.Conv2d

nn.Conv2d(128, 192, kernel_size=3, padding=1)

Here padding accepts an int or a tuple.
For example, with the tuple (1, 2):
1 means one row of zeros is added at the top and at the bottom,
and 2 means two columns of zeros are added on the left and on the right.
If you instead want to add, say, one column on the left and two columns on the right, use nn.ZeroPad2d, see below.

nn.ZeroPad2d

For example, nn.ZeroPad2d((1, 2, 1, 2)) adds one column on the left and one row on the top, and two columns on the right and two rows on the bottom; the argument order is (left, right, top, bottom).
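
A minimal sketch of both kinds of padding (tensor sizes are arbitrary): Conv2d's padding=(1, 2) pads symmetrically (1 row top and bottom, 2 columns left and right), while nn.ZeroPad2d takes (left, right, top, bottom) and can pad each side differently.

import torch
import torch.nn as nn

x = torch.randn(1, 3, 8, 8)

conv = nn.Conv2d(3, 16, kernel_size=3, padding=(1, 2))
print(conv(x).shape)                 # torch.Size([1, 16, 8, 10])

pad = nn.ZeroPad2d((1, 2, 1, 2))     # left=1, right=2, top=1, bottom=2
print(pad(x).shape)                  # torch.Size([1, 3, 11, 11])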

Some function usage in Python

import os
To get the root directory where the current file is located: data_root = os.getcwd()
or data_root = os.path.abspath(os.path.join(os.path.abspath(__file__), "../../.."))

Go up one directory level: ../
Go up two directory levels: ../..

Introduction to matplotlib.pyplot.imread function

matplotlib.pyplot.imread(path) is used to read a picture and turn the image data into an array.
Parameters:
the path of the image file to be read.

Return value:
If it is a grayscale image: return an array of (M,N) shape, where M represents the height and N represents the width.

If it is an RGB image, return an array of (M, N, 3) shape, where M represents the height and N represents the width.

If it is an RGBA image, return an array of (M, N, 4) shape, where M represents the height and N represents the width.

Also, PNG images are returned as an array of floats (0-1), all other formats are returned as an array of type int, and the bit depth depends on the specific image.

Therefore, if you want to feed the image into PyTorch, you need to convert it from HWC to CHW with .permute(2, 0, 1).
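
A minimal sketch of the whole step (the file name "test.png" is hypothetical): read an image with matplotlib, then convert it from HWC to CHW for PyTorch.

import matplotlib.pyplot as plt
import torch

img = plt.imread("test.png")                      # PNG -> float array in [0, 1], shape (H, W, C)
print(img.shape, img.dtype)

tensor = torch.from_numpy(img).permute(2, 0, 1)   # (H, W, C) -> (C, H, W)
print(tensor.shape)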

Reference: https://www.zhihu.com/question/265443164/answer/2417856431
Reference: https://zhuanlan.zhihu.com/p/113623623
Receptive field references: https://zhuanlan.zhihu.com/p/113487374, https://www.cnblogs.com/ocean1100/p/9864193.html


Source: blog.csdn.net/ThreeS_tones/article/details/129083413