Deep Learning 03: CNN Applications

Note: This series contains my study notes from the July Algorithm deep learning course.

1 Overview

The main tasks CNNs are used for include image classification, classification + localization, object detection, and image segmentation.

Image classification: classify the picture, assuming it contains one main object.
Classification + localization: additionally draw a rectangular bounding box around the object in the picture.
Object detection: a real picture usually contains multiple objects; mark all of them with rectangular boxes.
Image segmentation: on top of detection, also find the exact outline (edges) of each object.

2 Classification + localization

Classification: input: image; output: class label; evaluation metric: accuracy.
Localization: input: image; output: object bounding box (x, y, w, h); evaluation metric: Intersection over Union (IoU).
(x, y) is the top-left corner, w is the width, and h is the height.
IoU: the area of the intersection of the two rectangles divided by the area of their union; a predicted box usually counts as correct only when IoU >= 0.5.
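As a quick illustration, here is a minimal IoU sketch for two boxes in the (x, y, w, h) convention above (pure Python; the example box values are made up):

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x, y, w, h), (x, y) = top-left."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Intersection rectangle (may be empty).
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union

print(iou((10, 10, 100, 80), (30, 20, 100, 80)))  # ~0.54, above the 0.5 threshold
```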

Classification is a classification problem, so it uses a cross-entropy loss.
Localization is a regression problem, so it uses an L2 (Euclidean distance) loss.

For the classification part, fine-tune a well-known pretrained model.

At the end of the well-known model, new heads can be attached either after or before the fully connected layers. For the classification problem, attach a classification head and define a cross-entropy loss; for the regression problem, attach a regression head and define a Euclidean distance loss.
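A minimal sketch of this idea, assuming a torchvision ResNet-18 backbone (the class count and input size here are illustrative, not from the course):

```python
import torch
import torch.nn as nn
import torchvision

# Pretrained backbone with its original fully connected layer removed.
backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
feat_dim = backbone.fc.in_features      # 512 for ResNet-18
backbone.fc = nn.Identity()             # keep only the feature extractor

cls_head = nn.Linear(feat_dim, 21)      # classification head: e.g. 20 classes + background
box_head = nn.Linear(feat_dim, 4)       # regression head: (x, y, w, h)

cls_loss_fn = nn.CrossEntropyLoss()     # cross-entropy for the classification head
box_loss_fn = nn.MSELoss()              # L2 / Euclidean loss for the regression head

images = torch.randn(8, 3, 224, 224)    # dummy batch of images
features = backbone(images)
class_logits, boxes = cls_head(features), box_head(features)
```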

Idea 1: Think of it as a regression problem.
Can we localize finer parts of the object?
For example, locate a cat's 2 eyes, 2 ears, and 1 nose. In the previous problem one object meant 4 numbers; now there are 5 parts, so 5 x 4 = 20 numbers, and the network regresses a vector of length 20. The premise is that every object has the same structure; when a cat shows only one eye, the model makes mistakes.

The same idea extends to human pose estimation: the person is reduced to a stick figure, and each keypoint is a coordinate to regress. Paper: Toshev and Szegedy, "DeepPose: Human Pose Estimation via Deep Neural Networks", CVPR 2014.

Idea 2: sliding window + recognition and aggregation.
Take boxes of different sizes and slide them to different positions in the picture. Run classification on each crop to get a class score, run regression on each crop to get a refined box, then pick and merge the resulting boxes according to their scores.
Problems to overcome: many windows, so many parameters and slow computation.
1 Replace the fully connected layers with convolutions so that the parameters can be shared (a small sketch follows the paper reference below).
Receptive field: each position in a convolutional feature map effectively "sees" only a certain region of the input image.
This way, you no longer need to crop the image at every window position and send each crop through the network separately.
2 Reduce the number of parameters.
Paper: Sermanet et al, "Integrated Recognition, Localization and Detection using Convolutional Networks", ICLR 2014
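A small sketch of point 1 above: a fully connected layer over a fixed-size feature map is equivalent to a convolution whose kernel spans that map, and on larger inputs the convolution slides and reuses the same parameters (PyTorch; the 512 x 7 x 7 size is an assumed example):

```python
import torch
import torch.nn as nn

fc   = nn.Linear(512 * 7 * 7, 4096)
conv = nn.Conv2d(512, 4096, kernel_size=7)
conv.weight.data.copy_(fc.weight.data.view(4096, 512, 7, 7))  # same parameters
conv.bias.data.copy_(fc.bias.data)

feat = torch.randn(1, 512, 7, 7)
out_fc   = fc(feat.flatten(1))                                 # shape (1, 4096)
out_conv = conv(feat)                                          # shape (1, 4096, 1, 1)
print(torch.allclose(out_fc, out_conv.flatten(1), atol=1e-5))  # True: identical result

bigger = torch.randn(1, 512, 9, 9)      # a larger image gives a larger feature map
print(conv(bigger).shape)               # (1, 4096, 3, 3): one "FC output" per window
```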

3 Object detection

Here we do not know in advance how many objects the picture contains.

In the approach above there is a sliding window, and each time the window slides, a classification is run to decide whether the window contains an object (e.g. an animal).
Problem: the windows differ in position, size, and aspect ratio, which leads to a huge amount of computation.
Solution 1: a candidate-region proposal strategy (e.g. based on edges and region grouping).

3.1 Selective search

Similar to clustering: pixels that belong to the same object are usually similar, so they are merged into regions bottom-up.
Each region is then expanded into a bounding box.
Paper: Uijlings et al, "Selective Search for Object Recognition", IJCV 2013
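If you want to try it quickly, OpenCV's contrib module ships a selective search implementation; a minimal sketch (assuming opencv-contrib-python is installed and "image.jpg" is a placeholder path):

```python
import cv2

img = cv2.imread("image.jpg")
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()      # or switchToSelectiveSearchQuality()
rects = ss.process()                  # array of (x, y, w, h) candidate boxes
print(len(rects), "proposals; first:", rects[0])
```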

(Figure omitted: a table comparing candidate-box selection algorithms.)

3.2 R-CNN


Step 1: The input is an image. A candidate-box selection algorithm picks about 2000 candidate boxes; each box is scaled to the same size and handed to the convolutional layers as input.

Fine-tune a well-known pretrained model, adapting its parameters to your own task. For example, with 20 object categories plus one background category (meaning the region belongs to no object class), the classifier has 21 outputs.

Step 2: Use the candidate-box algorithm to select the boxes and crop them out of the image, then scale them all to the same size. This step runs on the CPU.
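A minimal sketch of this crop-and-warp step (NumPy/OpenCV; the 224x224 target size is an assumption, and `rects` is the (x, y, w, h) proposal list from the previous step):

```python
import cv2
import numpy as np

def warp_proposals(img, rects, size=224, max_boxes=2000):
    """Crop each (x, y, w, h) proposal and warp it to a fixed size (CPU-side)."""
    crops = []
    for (x, y, w, h) in rects[:max_boxes]:
        patch = img[y:y + h, x:x + w]
        crops.append(cv2.resize(patch, (size, size)))
    return np.stack(crops)          # shape: (num_boxes, size, size, 3)
```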

Step 3: Convolutional layers learn the parameters and extract features.

Each cropped box from the previous step is forwarded through the CNN; the output of the fifth pooling layer (pool5) is taken as the feature and saved to disk.

Step 4: SVM classification
Use an SVM to classify the features extracted above.
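A minimal sketch of this step with scikit-learn (the feature and label files are hypothetical placeholders for the pool5 features saved in step 3):

```python
import numpy as np
from sklearn.svm import LinearSVC

feats = np.load("feats.npy")      # shape (N, feature_dim), saved pool5 features
labels = np.load("labels.npy")    # shape (N,), 0 = background, 1..20 = object classes
svm = LinearSVC(C=1.0)
svm.fit(feats, labels)
print(svm.predict(feats[:5]))
```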

Step 5: bbox regression

Regress the above features against the ground-truth box to decide how the candidate box should move up, down, left, or right (and how it should be resized).
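A sketch of one common parameterization of these regression targets, written in the (x, y, w, h) convention used above (the numeric boxes are made-up examples):

```python
import numpy as np

def bbox_targets(P, G):
    """Targets for moving proposal P toward ground-truth box G, both (x, y, w, h)."""
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    tx = (gx - px) / pw          # horizontal shift, in units of proposal width
    ty = (gy - py) / ph          # vertical shift, in units of proposal height
    tw = np.log(gw / pw)         # log width ratio
    th = np.log(gh / ph)         # log height ratio
    return tx, ty, tw, th

print(bbox_targets((50, 40, 100, 80), (60, 44, 110, 84)))
```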

Paper: Girshick et al, "Rich feature hierarchies for accurate object detection and semantic segmentation", CVPR 2014

3.3 Fast R-CNN

1 Extracting features for 2000 candidate boxes by running each through the network takes a lot of time.
Use the receptive field of the convolutional layers to share the feature computation over the whole image, thereby speeding things up.
The image is no longer cut into 2000 sub-images; instead, since a point in a convolutional feature map corresponds to a region of the original image, the candidate boxes are read off a single shared feature map, and the shared computation greatly reduces the cost.

2 Make an end-to-end system

3 How do feature maps of different sizes connect to the fully connected layer?

This is handled by RoI (Region of Interest) Pooling.
If the region's feature map is, say, 500x300 but the desired output is 100x100, divide the region into a 100x100 grid of cells and max-pool each cell down to a single value.

Generally speaking, a candidate region will not be smaller than the target output size.
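A minimal NumPy sketch of RoI max pooling on a single-channel feature map (grid boundaries are computed naively here; real implementations also handle multiple channels and map box coordinates onto the feature map):

```python
import numpy as np

def roi_max_pool(feat, out_h, out_w):
    """Max-pool a 2D region feature map down to a fixed (out_h, out_w) grid."""
    H, W = feat.shape
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r0, r1 = i * H // out_h, max((i + 1) * H // out_h, i * H // out_h + 1)
            c0, c1 = j * W // out_w, max((j + 1) * W // out_w, j * W // out_w + 1)
            pooled[i, j] = feat[r0:r1, c0:c1].max()
    return pooled

print(roi_max_pool(np.random.rand(500, 300), 100, 100).shape)  # (100, 100)
```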

3.4 Faster R-CNN

Does region proposal (candidate-box generation) have to be done as a separate step?
In Fast R-CNN, candidate-box selection still runs on the CPU.

The Region Proposal Network (RPN) slides a 3x3 window over the feature map and, around each window center, places candidate boxes with aspect ratios 1:2, 2:1, and 1:1.
How the center points are chosen is a hyperparameter you can set yourself; for example, divide the original image into blocks and use the center of each block.
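A small sketch of generating such anchor boxes on a regular grid of centers (the stride and base area here are assumed values, not from the course):

```python
import numpy as np

def make_anchors(img_h, img_w, stride=16, base_area=128 * 128,
                 ratios=(0.5, 1.0, 2.0)):
    """Place (x, y, w, h) anchors with h/w in {1:2, 1:1, 2:1} around grid centers."""
    anchors = []
    for cy in range(stride // 2, img_h, stride):
        for cx in range(stride // 2, img_w, stride):
            for r in ratios:
                w = np.sqrt(base_area / r)     # width
                h = w * r                      # height, so h / w = r
                anchors.append((cx - w / 2, cy - h / 2, w, h))
    return np.array(anchors)

print(make_anchors(224, 224).shape)    # (14 * 14 * 3, 4) anchors
```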

Paper: Ren et al, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS 2015

One-stage vs two-stage
Faster R-CNN is two-stage: the first stage finds candidate boxes, and the second stage decides whether each candidate contains an object and refines its boundary.
YOLO/SSD are one-stage: the image is divided into a grid; boxes of different aspect ratios are placed around each grid-cell center, and the network directly predicts (dx, dy, dw, dh, confidence) plus a class for each box. The first four values fine-tune the box, confidence indicates whether there is an object, and class indicates which kind of object it is.
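A sketch of how one grid cell's prediction could be decoded under a common parameterization (the exact transform differs between YOLO and SSD; this is only an illustration with made-up numbers):

```python
import numpy as np

def decode_cell(prior, pred, class_scores):
    """Refine a prior box (center x, center y, w, h) with (dx, dy, dw, dh, conf)."""
    px, py, pw, ph = prior
    dx, dy, dw, dh, conf = pred
    cx, cy = px + dx * pw, py + dy * ph          # shift the prior's center
    w,  h  = pw * np.exp(dw), ph * np.exp(dh)    # rescale the prior
    return (cx, cy, w, h), conf, int(np.argmax(class_scores))

print(decode_cell((112, 112, 64, 64), (0.1, -0.2, 0.0, 0.3, 0.9), [0.1, 0.7, 0.2]))
```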

4 Semantic segmentation

Semantic segmentation is to classify each pixel.

4.1 Sliding-window approach

Problem: too many parameters and far too much computation.

4.2 Fully Convolutional Neural Network

Insert picture description here

The input is 3 x H x W; the network outputs a C x H x W score map, where C is the number of categories, so every pixel gets a classification.
Keeping the output (and the intermediate feature maps) at the full H x W resolution can be very expensive.
The training data is annotated down to the object boundaries, i.e. with a class label for every pixel.
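A minimal sketch of per-pixel classification and the corresponding loss (PyTorch; the 21 classes and 256x256 size are assumed examples):

```python
import torch

scores = torch.randn(1, 21, 256, 256)          # C x H x W score map, C = 21 classes
labels = torch.randint(0, 21, (1, 256, 256))   # ground-truth class for every pixel
loss = torch.nn.functional.cross_entropy(scores, labels)  # per-pixel cross-entropy
pred = scores.argmax(dim=1)                    # (1, 256, 256) predicted class map
```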

Do downsampling first, then upsampling.
How to do upsampling?
The technical name is transposed convolution (sometimes called deconvolution).
For example, with a 3x3 kernel, stride 2, and padding 1, a 2x2 input is upsampled to a 4x4 output.

For example, suppose the pink input cell has value 3 (kernel 3x3, stride 2, pad 1): that single cell is expanded into a 3x3 patch of 9 output cells, each receiving 3a, where a is the corresponding kernel weight.

Similarly, if the blue input cell has value 2 and the corresponding kernel weight is b, each cell of its patch receives 2b. Where the two patches cross, the contributions add: the overlapping cells hold 3a + 2b.

Transposed convolution keeps restoring the spatial resolution in this way.
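A minimal PyTorch sketch of the 2x2 to 4x4 example above (the all-ones kernel and the output_padding=1 argument are choices made here so the 4x4 shape comes out and the overlap-add is easy to see):

```python
import torch
import torch.nn as nn

# Kernel 3, stride 2, padding 1; output_padding=1 makes a 2x2 input give a 4x4 output.
up = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2,
                        padding=1, output_padding=1, bias=False)
nn.init.ones_(up.weight)             # set every kernel weight to 1 to show the sums

x = torch.tensor([[[[3.0, 2.0],
                    [1.0, 4.0]]]])   # 2x2 input
y = up(x)                            # each input cell paints a patch; overlaps are summed
print(y.shape)                       # torch.Size([1, 1, 4, 4])
print(y.squeeze())                   # the cell where the 3- and 2-patches overlap holds 3 + 2
```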

5 Code

TensorFlow Object Detection API: https://github.com/tensorflow/models/tree/master/research/object_detection

You can use TensorFlow for object detection; start with the installation instructions.

Faster R-CNN: https://github.com/rbgirshick/py-faster-rcnn
I found the object detection material above a bit foggy, so here is a well-written blog summary, recorded below.

Finally, summarize the steps of the major algorithms:
R-CNN
1. Determine about 1000-2000 candidate boxes in the image (using Selective Search).
2. Scale the image patch inside each candidate box to the same size and feed it into the CNN for feature extraction.
3. Use a classifier to decide whether the feature extracted from a candidate box belongs to a specific class.
4. For candidate boxes assigned to a class, use a regressor to further adjust their position.

Fast R-CNN
1. Determine about 1000-2000 candidate boxes in the image (using Selective Search).
2. Feed the entire image into the CNN to get a feature map.
3. Map each candidate box onto the feature map to find its patch, and feed that patch, as the box's convolutional feature, into the SPP/RoI pooling layer and subsequent layers.
4. Use a classifier on the features extracted for each candidate box to decide whether it belongs to a specific class.
5. For candidate boxes assigned to a class, use a regressor to further adjust their position.

Faster R-CNN
1. Feed the entire image into the CNN to get a feature map.
2. Feed the convolutional features into the RPN to get the candidate boxes and their features.
3. Use a classifier to decide whether the feature extracted from a candidate box belongs to a specific class.
4. For candidate boxes assigned to a class, use a regressor to further adjust their position.

(Original article by the CSDN blogger "Go back to Mars"; original link: https://blog.csdn.net/H_hei/article/details/87298097)

Origin: blog.csdn.net/flying_all/article/details/114092649