Andrew Ng [Deep Learning] 04. Convolutional Neural Networks, Object Detection (1): Basic Object Detection Algorithms

This note covers the third week of the "Convolutional Neural Networks" course: Object Detection

The main contents are:

1. Object localization

2. Landmark (feature point) detection

3. Object detection

 


 

Object Localization

The task is to use an algorithm to determine whether the target object appears in the picture and, if it does, to mark its position in the picture with a bounding box.

    

Among the problems we have studied, the ideas of image classification carry over to classification with localization, and the ideas of classification with localization carry over to object detection:

    a. In classification with localization, there is usually a single, relatively large object in the middle of the picture.

    b. In object detection, the picture can contain multiple objects, possibly of several different classes.

How image classification helps classification with localization:

    For a common image classification network:

    

Classification with localization adds four outputs b_x, b_y, b_h, b_w and a class label (c_1, c_2, c_3) to the fully connected layer of the image classification network.

       

 

Define the target label y

The label y is:

    y = [P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]^T

Element meaning:

    a. Element P_c: the probability that the image contains an object of interest. In the example from the video, the network must handle cars, motorcycles, pedestrians, and scenery, but only the first three are objects we care about. If the picture contains one of them, P_c is 1; if the picture is scenery or anything else we do not care about, P_c is 0.

    b. Elements b_x, b_y: the coordinates of the center of the bounding box, generally written (b_x, b_y). The upper-left corner of the picture is (0, 0) and the lower-right corner is (1, 1).

    c. Elements b_h, b_w: the height and width of the bounding box; b_w is the width and b_h is the height.

    d. Elements c_1, c_2, c_3, ..., c_n: the class labels, where n is the actual number of classes. Exactly one of c_1, ..., c_n is 1. The only classes considered in the video are cars, motorcycles, and pedestrians, so n is 3.
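The label layout above can be sketched as a small helper. This is only an illustration: the class names, their ordering, and the zero-filling of "don't care" elements are assumptions, not part of the course code.

```python
import numpy as np

# Hypothetical class order, assumed for illustration only.
CLASSES = ["car", "motorcycle", "pedestrian"]

def make_label(obj_class=None, bx=0.0, by=0.0, bh=0.0, bw=0.0):
    """Build y = [P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3].

    obj_class=None means the image is background: P_c = 0 and the
    remaining elements are "don't care" (set to 0 here).
    """
    y = np.zeros(5 + len(CLASSES))
    if obj_class is not None:
        y[0] = 1.0                              # P_c: object of interest present
        y[1:5] = [bx, by, bh, bw]               # box center and size in [0, 1] coords
        y[5 + CLASSES.index(obj_class)] = 1.0   # one-hot class label
    return y

# A car centered slightly left of middle:
y_car = make_label("car", bx=0.4, by=0.5, bh=0.3, bw=0.6)
# Pure background:
y_bg = make_label()
```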

Loss function:

    When P_c is 1, the loss is the sum of the squared differences of all elements:

        L(ŷ, y) = (ŷ_1 - y_1)² + (ŷ_2 - y_2)² + ... + (ŷ_8 - y_8)²

    When P_c is 0, only the accuracy of the network's P_c output matters (y_1 is P_c):

        L(ŷ, y) = (ŷ_1 - y_1)²
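The two-branch loss above can be written as a short function. This is a sketch of the lecture's squared-error form for a single example, not a full training loss:

```python
import numpy as np

def detection_loss(y_true, y_pred):
    """Squared-error loss from the lecture: if P_c = 1, sum the squared
    differences over every element; if P_c = 0, only the P_c term counts."""
    if y_true[0] == 1:
        return float(np.sum((y_true - y_pred) ** 2))
    return float((y_true[0] - y_pred[0]) ** 2)
```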

      

Landmark Detection

A neural network can recognize target features by outputting the (x, y) coordinates of landmark (feature) points on the image.

To build such a network, choose the number of landmark points, build a training set of images X with labels Y containing those landmarks (the landmarks are labeled by hand), and then train a neural network that outputs the locations of the landmark points in the image.
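As a sketch, a label vector for landmark detection might be assembled as below: one unit for "is the object present" followed by an (x, y) pair per landmark. The 64-landmark count (giving 129 outputs) follows the face example in the lecture; the helper itself is illustrative, not course code.

```python
import numpy as np

def landmark_label(points, is_present=1):
    """Label vector [present?, l1x, l1y, ..., lkx, lky].

    `points` is a sequence of (x, y) coordinates in [0, 1] image coords.
    With k = 64 landmarks this gives 1 + 2*64 = 129 output units.
    """
    y = np.empty(1 + 2 * len(points))
    y[0] = is_present
    y[1:] = np.asarray(points, dtype=float).ravel()  # interleave x1, y1, x2, y2, ...
    return y
```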

                                  

 

 

Object Detection

 

Object Detection Algorithm Based on Sliding Window

 

Step 1: Create a labeled training set. The training images are closely cropped samples, so that the detection object sits in the center and fills almost the entire image.

    

Step 2: Train a convolutional network on these crops; this network is then used to implement sliding-window detection.

         

Step 3: Sliding window detection

First select a window of a fixed size, slide it across the image to crop out every region, feed each cropped patch into the convolutional network trained above, and classify each position as 0 or 1 (does the crop contain the object to be detected?). Then select a larger window and repeat the whole procedure.
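The steps above can be sketched as a naive loop. Here `classifier` is a stand-in for the trained ConvNet (it returns 1 if the crop contains the object, else 0), and the window sizes and stride are illustrative choices, not values from the course:

```python
import numpy as np

def sliding_window_detect(image, classifier, window_sizes=(64, 96, 128), stride=16):
    """Naive sliding-window detection sketch.

    Slides square windows of each size over the image with a fixed stride,
    classifies every crop, and collects (top, left, size) for positive hits.
    """
    H, W = image.shape[:2]
    detections = []
    for size in window_sizes:                          # repeat with larger windows
        for top in range(0, H - size + 1, stride):
            for left in range(0, W - size + 1, stride):
                crop = image[top:top + size, left:left + size]
                if classifier(crop) == 1:
                    detections.append((top, left, size))
    return detections
```

Note the cost: every crop is a separate forward pass, which is what the convolutional implementation below avoids.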

                      

 

Convolutional Sliding Windows (an improvement on the algorithm above)

Convert the fully connected layer to a convolutional layer:

Principle: mathematically, the converted convolutional layer is identical to the fully connected layer. Each of the 400 nodes has a filter of dimension 5x5x16, so each output value is a linear function of all the activation values of the previous layer, which is exactly what a fully connected unit computes.

Fully connected model:

Converting to convolutional layers: implement the first FC layer with 400 filters of size 5x5 (each spanning all 16 input channels), then implement the next FC layer with 400 1x1 filters, and finally replace the output layer with 1x1 filters, one per class (this last 1x1 convolution plays the role of FC's softmax output).
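The equivalence can be checked numerically: a fully connected layer over a 5x5x16 activation volume computes exactly the same numbers as 400 filters of size 5x5x16 applied at a single position. This is a small numpy sanity check, not part of the course code:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5, 16))        # activation volume entering the FC layer
W = rng.standard_normal((400, 5, 5, 16))   # 400 filters, each 5x5x16

# Fully connected view: flatten x and multiply by a (400, 2000) weight matrix.
fc_out = W.reshape(400, -1) @ x.reshape(-1)

# Convolutional view: each 5x5x16 filter covers the whole input exactly once,
# producing a 1x1x400 output volume (here flattened to 400 values).
conv_out = np.array([np.sum(x * W[k]) for k in range(400)])

# Same computation, just arranged differently.
assert np.allclose(fc_out, conv_out)
```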

 

The sliding window object detection algorithm is implemented by convolution:

 For a single convolution implementation:

Convolutional sliding-window implementation: the sliding window can be regarded as a filter of the convolutional network, with the sliding step size playing the role of the filter's stride. This way we do not need to crop the input image into separate windows; instead the whole image is fed into the convolutional network in a single pass, and the overlapping regions of neighboring windows share much of the computation.
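The effective stride of the convolutional sliding window comes from the network's pooling layers: with a single 2x2 max-pool, each cell of the output grid corresponds to the window shifted by 2 pixels. A small helper, assuming that one-pool architecture from the lecture, gives the size of the prediction grid:

```python
def conv_sliding_output_size(image_size, window_size, effective_stride=2):
    """Spatial size of the prediction grid when the sliding window is run
    convolutionally. `effective_stride` is the product of the network's
    pooling strides (2 for the single 2x2 max-pool in the lecture example):
    each output cell corresponds to the window shifted by that many pixels.
    """
    return (image_size - window_size) // effective_stride + 1
```

For example, a 14x14 window over a 16x16 test image yields a 2x2 grid of predictions, and over a 28x28 image an 8x8 grid, matching the figures in the lecture.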

  

      
