Paddle-based introductory computer vision tutorial, Lecture 4: the basic implementation process of deep learning

Bilibili tutorial video:

https://www.bilibili.com/video/BV18b4y1J7a6/

The basic implementation process of deep learning

Dataset preparation

Deep learning requires the support of a large amount of data. Only after a well-built model has learned from a large amount of data will it have strong generalization ability. The model keeps learning from all the images in the dataset, and once it has converged sufficiently, it will output a reasonably correct result even for a brand-new image that is not in the dataset. The more scenes the dataset covers and the more complex the backgrounds, the better the final result. The dataset should contain every situation that may appear in the project: adverse interference such as lighting changes, stickers, and so on all need to appear in the dataset.

But for a single, specific project, it is difficult to photograph the objects in many different scenes. If you set up your own camera, the backgrounds you capture are relatively uniform, which makes it hard to meet the requirements of richness, complexity, and randomness.

As shown in the picture above, in the hazardous-waste dataset I made myself, the background is very uniform because the camera is mounted on a fixed mechanical rig. The types of batteries are also very limited, and it is impossible to purchase every kind of battery. Taking too many similar pictures not only fails to meet the requirements of a good dataset, but also greatly hurts the final model accuracy. It is therefore recommended to add some existing datasets from the Internet, or to use crawlers to increase the complexity and randomness of the dataset. You can also deliberately create some unusual scenes in your own shots.

Taking object detection as an example, ideally each object class should appear in more than 1,500 images, with more than 10,000 instances of that class in total, all of which need to be annotated. Building a dataset is very time-consuming, and the annotation step in particular takes a great deal of effort.
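As a concrete reference (not from the original post), here is a minimal sketch of how a folder of labeled images could be wrapped as a Paddle dataset; the `labels.txt` list file, its "image_path class_id" line format, and the class ids are assumptions made purely for illustration.

```python
# Minimal sketch: wrapping labeled images as a paddle.io.Dataset.
# The "labels.txt" format (one "image_path class_id" pair per line) is assumed.
import numpy as np
import paddle
from PIL import Image


class GarbageDataset(paddle.io.Dataset):
    def __init__(self, list_file):
        super().__init__()
        with open(list_file) as f:
            # each line looks like: "images/battery_001.jpg 0"
            self.samples = [line.split() for line in f if line.strip()]

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        img = np.asarray(Image.open(path).convert("RGB"), dtype="float32") / 255.0
        return img.transpose(2, 0, 1), int(label)   # CHW image, class id

    def __len__(self):
        return len(self.samples)


# loader = paddle.io.DataLoader(GarbageDataset("labels.txt"), batch_size=16, shuffle=True)
```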

Data augmentation

Besides increasing the number and variety of images as much as possible, data augmentation is the most commonly used way to expand a dataset, and it also helps greatly in improving model accuracy. It is fair to say that data augmentation is a very practical technique.

(Figure: examples of data augmentation applied to an original image)

As the figure shows, compared with the original image, randomly changing the contrast and brightness, cutting out some pixels, or even overlapping and stitching several images together are all means of data augmentation. Data augmentation raises the difficulty of the task and can also expand the dataset many times over, which is effective for most scenarios; especially when the dataset is not sufficient, data augmentation can greatly improve accuracy and generalization ability. Note, however, that for most lightweight models data augmentation should be used sparingly: the fitting ability of a small model is limited, and making the images excessively complex is not helpful for the final convergence.
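As a rough sketch (not part of the original post), the random brightness/contrast changes and pixel cropping mentioned above can be written with `paddle.vision.transforms`; the parameter ranges and crop size below are arbitrary placeholder values.

```python
# A minimal augmentation pipeline sketch; the parameter values are illustrative only.
from PIL import Image
from paddle.vision import transforms as T

train_transforms = T.Compose([
    T.RandomHorizontalFlip(prob=0.5),             # random mirroring
    T.ColorJitter(brightness=0.4, contrast=0.4),  # random brightness / contrast
    T.RandomCrop(224, pad_if_needed=True),        # random crop drops some pixels
    T.ToTensor(),                                 # HWC image -> CHW float tensor
])

img = Image.open("sample.jpg")       # placeholder path
augmented = train_transforms(img)    # each call yields a different random variant
```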

Model building

The model is the most critical link in deep learning. For computer vision, the model is generally a convolutional neural network built from convolutional structures.

(Figure: overall structure of the YOLOv3 model)

The figure above shows the overall structure of YOLOv3, a classic object detection algorithm. As you can see, the whole model is built from convolutional structures. Given an original image of shape (416, 416, 3), the model outputs three results; after filtering these three outputs, the positions of the predicted boxes can be obtained, finally producing the detection results.
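To make "built from convolutional structures" concrete, here is a sketch (under my own assumptions, not the real YOLOv3 configuration) of the basic Conv + BatchNorm + LeakyReLU block that YOLOv3-style networks stack many times:

```python
# Sketch of a basic Conv + BN + LeakyReLU block; channel numbers are illustrative.
import paddle
import paddle.nn as nn


def conv_bn_leaky(in_channels, out_channels, kernel_size, stride=1):
    return nn.Sequential(
        nn.Conv2D(in_channels, out_channels, kernel_size,
                  stride=stride, padding=kernel_size // 2, bias_attr=False),
        nn.BatchNorm2D(out_channels),
        nn.LeakyReLU(0.1),
    )


block = conv_bn_leaky(3, 32, 3)
x = paddle.randn([1, 3, 416, 416])   # a fake 416x416 RGB input batch in NCHW layout
print(block(x).shape)                # [1, 32, 416, 416]
```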

A model generally contains a very large number of parameters to be learned. Unlike the polynomial fitting methods used in basic mathematics courses, a computer vision model has an enormous number of parameters, which means its fitting ability is very strong. The model is also very large, can find a suitable fit for general image tasks, and has strong generalization ability.
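To get a feel for how large the parameter count is, you can sum the sizes of a model's parameters in Paddle; the built-in ResNet-50 below is just one convenient example, not the model used in this tutorial.

```python
# Counting trainable parameters of a built-in backbone (illustrative choice of model).
import numpy as np
import paddle

model = paddle.vision.models.resnet50()
total = sum(int(np.prod(p.shape)) for p in model.parameters())
print(f"ResNet-50 has about {total / 1e6:.1f} million parameters")  # roughly 25.6M
```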

Model training

Obviously, a model with its initial parameters cannot immediately produce the results we want. For a specific project requirement, such as detecting cans or detecting pedestrians, we need to train the model so that it learns what we want it to recognize. In other words, we keep feeding the dataset we just prepared into the model; the model outputs a result (pred), the predicted value, and we compare this predicted value with the labeled ground truth (target) to obtain the deviation (loss). Using mathematical methods to reduce this deviation makes the predicted value get closer and closer to the true value, that is, the model's output increasingly matches our requirements.

To give a simple example: suppose the input is x, the true value is y, and the model is F(x). Then the predicted value pred is:
$$pred = F(x)$$
and use the simplest formula to find the error:
$$loss = \frac{1}{2}\times(pred - y)^2 = \frac{1}{2}\times(F(x)-y)^2$$
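As a quick numeric check with made-up values, suppose F(x) = 2.5 and y = 2; then
$$loss = \frac{1}{2}\times(2.5-2)^2 = \frac{1}{2}\times 0.25 = 0.125$$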
It can be seen that the loss is a function of all the model's parameters. Suppose the curve of the loss is as follows:

Our goal is to make the loss as small as possible, that is, to get close to point C. For points A, B, and D, we want to adjust the model parameter x to reduce the loss; this is the optimizer's job. If gradient descent (GD) is used, the parameter is updated like this:
$$x_i = x_i - \alpha \times \frac{\partial\, loss}{\partial x_i}$$
where α is the step size, also called the learning rate. Gradient descent can make the other points converge toward point C, but it has two main problems:

  1. The gradients of all parameters have to be computed over the whole dataset, which is computationally very expensive.
  2. It easily falls into a local optimum, such as the first minimum in the figure.
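As a toy illustration of this update rule (my own sketch, not from the original post), here is plain gradient descent on the squared-error loss above, assuming a one-parameter model F(x) = w·x:

```python
# Toy gradient descent sketch: model F(x) = w * x with a single parameter w,
# loss = 0.5 * (F(x) - y) ** 2, update w <- w - lr * dloss/dw.
x, y = 2.0, 6.0          # one training sample; the underlying relation is y = 3 * x
w = 0.0                  # initial parameter value
lr = 0.1                 # learning rate (the step size alpha)

for step in range(20):
    pred = w * x                     # forward pass: pred = F(x)
    loss = 0.5 * (pred - y) ** 2     # squared-error loss
    grad = (pred - y) * x            # dloss/dw by the chain rule
    w -= lr * grad                   # gradient descent update
    if step % 5 == 0:
        print(f"step {step:2d}  w = {w:.4f}  loss = {loss:.4f}")
```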

The optimizers we commonly use today include SGD, Adam, and others. SGD is stochastic gradient descent: although it introduces a certain amount of randomness, in expectation it equals the true gradient. It only needs to compute the gradient on a randomly sampled portion of the data (a mini-batch) at each step, which greatly reduces the amount of computation. At the same time, thanks to the randomness, it can effectively help escape local optima and saddle points. It addresses the two major problems of GD and is widely used in the field of computer vision.
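Putting the pieces together, here is a minimal Paddle training-loop sketch using the SGD optimizer; the tiny fully connected network, the randomly generated data, and the hyperparameters are all placeholders for illustration.

```python
# Minimal Paddle training loop sketch with the SGD optimizer (illustrative only).
import paddle
import paddle.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
opt = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())

x = paddle.randn([256, 10])          # fake inputs
y = paddle.randn([256, 1])           # fake ground-truth targets

for epoch in range(5):
    pred = model(x)                  # forward pass: the predicted value
    loss = loss_fn(pred, y)          # deviation between pred and target
    loss.backward()                  # compute gradients of the loss
    opt.step()                       # SGD parameter update
    opt.clear_grad()                 # reset gradients for the next step
    print(f"epoch {epoch}  loss = {float(loss):.4f}")
```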

Model deployment and prediction

For most projects, the model generally needs to be deployed to specific hardware, such as an ARM development board or an FPGA. Deployment is also a very important step.

It is almost impossible for our terminal devices to carry a desktop host, which means they have neither strong computing power nor large storage capacity. The key to deployment is to compress the model size and improve inference speed as much as possible while maintaining accuracy. You can use open-source tools such as ncnn, TensorRT, or Paddle-Lite for the corresponding deployment.
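For Paddle models specifically, a common first step before handing the model to a tool such as Paddle-Lite is to export the trained dynamic-graph model to a static inference model; the stand-in MobileNetV2, the input shape, and the save path below are assumptions for illustration.

```python
# Sketch: exporting a trained Paddle model to a static inference format.
import paddle
from paddle.static import InputSpec

model = paddle.vision.models.mobilenet_v2(num_classes=2)   # stand-in "trained" model
model.eval()

paddle.jit.save(
    model,
    path="inference/model",                                 # placeholder output path
    input_spec=[InputSpec(shape=[1, 3, 224, 224], dtype="float32")],
)
```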


Original post: blog.csdn.net/weixin_45747759/article/details/122590962