Analysis of the application of convolutional neural networks in computer vision (with a YOLOv5 experiment summary)

A note before we begin: this article is a summary of the author's study notes.

Some of the text is excerpted from major websites, including Baidu, Zhihu, and CSDN, and is shared here as study notes for non-commercial purposes; infringing content will be removed on request.

The experimental part of the article contains the author's own experimental data, which readers may use as a reference.

1. Introduction

        Convolutional neural networks (CNNs) can process images and any data that can be transformed into an image-like structure. Compared with traditional algorithms and other neural networks, CNNs can efficiently process the two-dimensional local information of images, extract image features, and perform image classification. Given a large amount of labeled input data, the model is trained with gradient descent and error backpropagation.
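
        As a rough illustration of that training procedure, the sketch below shows one gradient-descent/backpropagation step in PyTorch; the model, the dummy batch, and the learning rate are placeholders, not the actual setup used later in this article.

```python
import torch
import torch.nn as nn

# Placeholder model and one dummy labeled batch, just to make the loop runnable
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10))
loader = [(torch.randn(16, 3, 224, 224), torch.randint(0, 10, (16,)))]

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # plain gradient descent

for images, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)  # compare predictions with the labels
    loss.backward()                          # error backpropagation
    optimizer.step()                         # gradient-descent parameter update
```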

2. The structure of a convolutional neural network

        The general structure of a convolutional neural network consists of convolutional layers, activation layers, pooling layers, and fully connected layers; some networks also contain other layers, such as regularization layers and other more specialized layers. The basic structure of a convolutional neural network is shown in the figure. The structure and principle of the common layers are described in detail below.

(Figure: the basic structure of a convolutional neural network)

2.1 Convolution layer

        Each convolutional layer in a convolutional neural network is composed of several convolutional units, and the parameters of each unit are optimized through the backpropagation algorithm. The purpose of the convolution operation is to extract different features of the input. The first convolutional layer may only extract low-level features such as edges, lines, and corners; deeper networks can iteratively extract more complex features from these low-level features.
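
        As a small sketch of what a convolutional layer looks like in code (PyTorch's nn.Conv2d; the channel counts and image size are arbitrary):

```python
import torch
import torch.nn as nn

# 3 input channels (RGB), 16 learned filters, 3x3 kernels; padding=1 keeps the spatial size
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(1, 3, 224, 224)   # one RGB image of size 224x224
feature_maps = conv(x)            # shape: (1, 16, 224, 224), one feature map per filter
print(feature_maps.shape)
```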

 

 2.2 Activation function

        Activation functions are essential for artificial neural network models to learn and represent very complex, nonlinear functions, because they introduce nonlinearity into the network. In a neuron, the inputs are weighted and summed, and the result is then passed through a function; that function is the activation function.
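
        A minimal sketch of that weighted sum followed by an activation function (the weights and inputs are made-up numbers; ReLU is used as the example activation):

```python
import torch

inputs = torch.tensor([0.5, -1.2, 3.0])
weights = torch.tensor([0.8, 0.1, -0.4])
bias = torch.tensor(0.2)

z = torch.dot(weights, inputs) + bias  # weighted sum of the inputs
a = torch.relu(z)                      # nonlinear activation: max(0, z)
print(z.item(), a.item())
```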


2.3 Pooling layer

        Pooling is another important concept in convolutional neural networks; it is essentially a form of downsampling. There are many different nonlinear pooling functions, of which max pooling is the most common. It divides the input image into several rectangular regions and outputs the maximum value for each sub-region. Intuitively, this mechanism works because, once a feature has been detected, its precise location is far less important than its location relative to other features. The pooling layer continuously reduces the spatial size of the data, so the number of parameters and the amount of computation also decrease, which also controls overfitting to some extent. Generally speaking, pooling layers are inserted periodically between the convolutional layers of a CNN.

        Pooling layers typically act on each input feature map separately and reduce its size. The most common form divides the image into 2×2 blocks and takes the maximum of the 4 values in each block, which reduces the data volume by 75%.
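
        A short sketch of 2×2 max pooling in PyTorch (the feature-map size is arbitrary); each spatial dimension is halved, so the amount of data drops by 75%:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 2x2 blocks, keep the maximum of each

x = torch.randn(1, 16, 224, 224)
y = pool(x)                                   # shape: (1, 16, 112, 112)
print(y.shape)
```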

  

 2.4 Fully connected layer

        The fully connected layer in a convolutional neural network is equivalent to the hidden layer in a traditional feedforward neural network. Fully connected layers are located in the last part of the hidden layers of the network and only pass signals to other fully connected layers. The feature map loses its spatial topology in the fully connected layer: it is flattened into a vector and passed through the activation function.

        From the perspective of representation learning, the convolutional layers and pooling layers in a convolutional neural network extract features from the input data, while the role of the fully connected layer is to combine the extracted features nonlinearly to obtain the output. In other words, the fully connected layer itself is not expected to have feature extraction capability; instead, it tries to use the existing high-level features to accomplish the learning goal.
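
        The sketch below puts the pieces together in a toy network (layer sizes chosen arbitrarily): convolution and pooling extract features, the feature map is flattened into a vector, and a fully connected layer combines the features into class scores.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 224 -> 112
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 112 -> 56
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                          # spatial topology is lost here
            nn.Linear(32 * 56 * 56, num_classes),  # fully connected combination of features
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = TinyCNN()
print(model(torch.randn(1, 3, 224, 224)).shape)  # -> torch.Size([1, 10])
```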


3. The YOLOv5 convolutional neural network

        There are two types of object detection architectures: two-stage and one-stage. The difference is that a two-stage detector has a region proposal step, similar to a preliminary screening process, in which the network generates locations and categories from candidate regions, whereas a one-stage detector generates positions and categories directly from the image. The YOLO family discussed here is a one-stage method. YOLO is short for You Only Look Once, meaning the neural network only needs to look at the picture once to output the result. YOLO has released five versions in total; YOLOv1 laid the foundation for the whole series, and the later versions are improvements on the first in order to increase performance.

        The model introduced here is YOLOv5, the latest YOLO network at the time of writing. On June 9, Ultralytics open-sourced YOLOv5, less than 50 days after the release of YOLOv4, and this time YOLOv5 is implemented entirely in PyTorch. While people were still marveling at the many refinements and rich experimental comparisons of YOLOv4, YOLOv5 brought even stronger real-time object detection. According to the official figures, the fastest inference time of the current YOLOv5 release is 0.007 seconds per image, i.e. about 140 frames per second (FPS), while its weight file is only about 1/9 the size of YOLOv4's.
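
        For reference, a pretrained YOLOv5 model can be loaded through PyTorch Hub; the sketch below follows the usage documented in the Ultralytics repository (the 'yolov5s' variant and the image path are just examples):

```python
import torch

# Download and load the small YOLOv5 model from the ultralytics/yolov5 repository
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

results = model('example.jpg')  # placeholder image path; URLs and arrays also work
results.print()                 # print detected classes and confidences
results.save()                  # save an annotated copy of the image
```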

The performance of the four YOLOv5 versions is shown in the figure:

  

4. Experimental training

I first trained YOLOv5 on a thin-and-light notebook, the HUAWEI MateBook 14.

Experimental platform: CPU: Intel Core i7 (10th gen); GPU: NVIDIA GeForce MX250

Training parameters: epochs: 100; img_size: 224×224; batch_size: 16

Experimental results: 

Comprehensive performance indicators:          

        A confusion matrix is a summary of the prediction results for a classification problem. Counting the numbers of correct and incorrect predictions, broken down by class, is the essence of the confusion matrix. It shows where the classification model gets confused when making predictions, and it lets us see not only what mistakes the model makes but, more importantly, which types of errors occur.
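
        As a small illustration (with made-up labels, not the data from this experiment), scikit-learn can compute a confusion matrix directly:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # predicted classes
print(confusion_matrix(y_true, y_pred))
# rows are the true class, columns the predicted class:
# [[3 1]
#  [1 3]]
```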

This experiment generates a confusion matrix:


        The labels plots generated by the experiment and their related curves:


        Precision is defined from the perspective of the prediction results: it describes how many of the examples predicted as positive by the binary classifier are real positive examples, i.e. how accurate the classifier's positive predictions are. It is "the proportion of samples that the classifier considers positive and that are indeed positive, out of all samples the classifier considers positive", and it measures the probability that a sample classified as positive really is positive.


        Recall is defined from the perspective of the ground truth: it describes how many of the real positive examples in the test set are selected by the binary classifier, i.e. how many real positive examples are recalled. It is "the proportion of samples that the classifier considers positive and that are indeed positive, out of all truly positive samples", and it measures the classifier's ability to find all the positive examples.
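
        In terms of true positives (TP), false positives (FP), and false negatives (FN), the two metrics can be written as:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}
```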

Experimental P curve and R curve:


        Precision and recall are usually a pair of conflicting performance metrics: generally speaking, the higher the precision, the lower the recall. The reason is that if we want to increase precision, i.e. make the examples the binary classifier predicts as positive as likely as possible to be real positives, we need to raise the threshold at which the classifier predicts a positive. For example, instead of marking a sample as positive when its predicted probability is at least 0.5, we now require the probability to be at least 0.7, so that the positives selected by the classifier are more likely to be real positives. This goal is exactly opposite to improving recall: if we want to improve recall, i.e. have the classifier pick out as many of the real positive examples as possible, we need to lower the threshold, for example marking a sample as positive as soon as its predicted probability reaches 0.3, so that the classifier selects as many real positives as possible.
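
        The small sketch below illustrates this trade-off on made-up prediction scores (not the data from this experiment): raising the decision threshold increases precision and lowers recall.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1])                       # made-up ground truth
scores = np.array([0.95, 0.85, 0.75, 0.65, 0.6, 0.45, 0.2, 0.4])  # predicted probabilities

for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# precision rises (0.71 -> 0.80 -> 1.00) while recall falls (1.00 -> 0.80 -> 0.60)
```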

Experimental PR curve:

   

        The F1 score (F1-score) is a measure for classification problems; many multi-class problems use the F1 score as the final evaluation metric. It is the harmonic mean of precision and recall, with a maximum of 1 and a minimum of 0.

        For a given class, the F1 score combines precision and recall into a single indicator; its value ranges from 0 to 1, where 1 is best and 0 is worst. In short, it evaluates the quality of the model while taking both recall and precision into account.
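
        The standard definition of the F1 score as the harmonic mean of precision and recall:

```latex
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```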


The figure shows the experimental F1 score curve:

The following shows the detection results of the final model on the validation set:

 

 

        Analysis of the experimental results shows that the trained model has achieved reasonable results. However, limited by the graphics card of a thin-and-light notebook, the next step is to use a server for more in-depth training: experimenting with different parameters and continuously adjusting the YOLOv5 network structure to select the optimal training parameters and obtain a better model.

Remark: the ongoing server training has already reached an mAP of 93.5% (training not yet complete).


Origin blog.csdn.net/ClintonCSDN/article/details/120606518