Some basic understanding of image recognition with convolutional neural networks

 

Notes from Andrew Ng's (Wu Enda's) deep learning course: some takeaways about convolutional neural networks in image processing.

Compared with a fully connected network, a convolutional neural network uses three strategies: local receptive fields, weight sharing, and downsampling (pooling). These reduce the complexity of the network model and at the same time give it a degree of invariance to translation, rotation, and scale changes. Unlike a fully connected network, it can also take spatial location into account, which is why it is so widely used in image classification.
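To make the complexity reduction concrete, here is a back-of-the-envelope comparison (a minimal sketch with illustrative layer sizes, not figures from the course): a fully connected layer needs one weight per pixel-unit pair, while a convolutional layer shares one small filter across the whole image.

```python
# Illustrative parameter count: fully connected vs. convolutional layer.
# The sizes below are assumptions chosen just for the comparison.
h, w = 32, 32                    # a 32x32 grayscale input image
hidden_units = 1024              # width of a fully connected hidden layer
fc_params = h * w * hidden_units # every pixel connects to every unit
conv_params = 3 * 3 * 16         # sixteen 3x3 filters, weights shared everywhere
print(fc_params)    # 1048576
print(conv_params)  # 144
```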

Take detecting the vertical edges of a picture as an example. How do we detect edges in an image? Consider a 6×6 grayscale image, i.e., a 6×6×1 matrix. To detect its edges, we construct a 3×3 matrix, also called a filter or kernel. Reading down its columns, the entries are 1, 1, 1, 0, 0, 0, −1, −1, −1; that is, each row is 1, 0, −1. As shown below:

For this 6×6 image, we perform a convolution operation on it with the 3×3 filter and get a (6 − 3 + 1) × (6 − 3 + 1) = 4×4 image.

The picture and the filter are matrices of different dimensions: the matrix on the left is the grayscale picture, the one in the middle is the filter, and the matrix on the right can be understood as another picture. This filter is a vertical edge detector. To give a simple example, here is a simple 6×6 image in which the three columns on the left are all 10 and the three columns on the right are all 0.

If you think of it as an image, a pixel value of 10 is relatively bright and a pixel value of 0 is relatively dark, so there is a fairly obvious vertical edge down the middle of the picture. If the 3×3 filter is visualized, it has bright pixels on the left, a transition band of 0s in the middle, and dark pixels on the right. After the convolution operation, the matrix on the right is obtained. Viewed as an image, it has a bright band in the middle, corresponding to the vertical edge in the middle of the original 6×6 image. So the filter detects the vertical edges in the image very well.
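Here is a minimal sketch of this exact example in NumPy/SciPy (note that the "convolution" used in CNNs is technically cross-correlation, i.e., the filter is not flipped):

```python
import numpy as np
from scipy.signal import correlate2d  # CNN "convolution" is cross-correlation

# 6x6 grayscale image: bright (10) on the left, dark (0) on the right
image = np.array([
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
])

# 3x3 vertical edge detector: columns of 1s, 0s, and -1s
kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
])

# "valid" mode gives the (6-3+1) x (6-3+1) = 4x4 output
output = correlate2d(image, kernel, mode="valid")
print(output)
# [[ 0 30 30  0]
#  [ 0 30 30  0]
#  [ 0 30 30  0]
#  [ 0 30 30  0]]
```

The 4×4 output has a bright band of 30s down the middle, exactly where the vertical edge sits in the input.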

Usually, when we perform supervised image classification with a neural network, we have a softmax layer output the class label. If we also need to localize the target in the image, we can have the network output a few more units that describe a bounding box. Specifically, the network outputs 4 more numbers, labeled bx, by, bh, and bw. These 4 numbers are a parameterized representation of the bounding box of the detected object. As shown below:

In the convolutional network, we make the output y of the neural network as shown in the figure below: pc is a flag for whether there is a target in the picture; bx, by, bh, and bw are the position parameters of the target; and c1, c2, c3 are the class labels. For the loss function, we simplify it to the squared error between the prediction and the label,

and the network is then trained to regress these parameters.
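As a sketch, assuming the 8-component layout y = [pc, bx, by, bh, bw, c1, c2, c3] described above (the course's convention; the values below are illustrative), the simplified squared-error loss might look like this:

```python
import numpy as np

# Assumed layout: y = [pc, bx, by, bh, bw, c1, c2, c3]
y_true = np.array([1.0, 0.5, 0.7, 0.3, 0.4, 0.0, 1.0, 0.0])    # object present, class c2
y_pred = np.array([0.9, 0.45, 0.68, 0.33, 0.38, 0.1, 0.8, 0.1])

def detection_loss(y_pred, y_true):
    """Simplified squared-error loss from the text.

    If an object is present (pc = 1), every component is penalized;
    if not (pc = 0), only the pc prediction itself matters.
    """
    if y_true[0] == 1:
        return float(np.sum((y_pred - y_true) ** 2))
    return float((y_pred[0] - y_true[0]) ** 2)

print(detection_loss(y_pred, y_true))
```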

Practical project experience

  1. Choice of algorithm

After introducing the principle, here is the algorithm implementation process. An ordinary sliding-window target detection algorithm slides a window of fixed size across the picture with a fixed stride, cutting out many small squares that the convolutional network must process one by one, so the computational cost is very high. If a larger stride is chosen, the number of windows fed to the convolutional network obviously decreases, but performance suffers seriously. To solve this problem, we use the YOLO algorithm (You Only Look Once). The core of this algorithm is that it divides the picture into X × X grid cells. Note that this does not cut the picture into X × X pieces; rather, each grid cell outputs its own corresponding result, as shown in the figure below. Deep features are extracted and upsampled so that their dimensions match those of the feature layer to be fused (with different channel counts), and then thresholding and non-maximum suppression are applied to decide whether a grid cell contains an object of the class to be detected, giving us the classification information we want.
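Thresholding and non-maximum suppression are standard post-processing steps, so here is a generic sketch (the box format, scores, and thresholds are illustrative assumptions, not the original project's code). Each grid cell's predicted box is kept only if its confidence pc passes the threshold and it is not a duplicate of a stronger overlapping box:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Keep the highest-scoring boxes, discarding overlapping duplicates."""
    # 1. Threshold: drop boxes whose confidence (pc) is too low.
    keep = [i for i in range(len(boxes)) if scores[i] >= score_threshold]
    # 2. Greedily keep the best remaining box, then suppress any box
    #    that overlaps it beyond the IoU threshold.
    keep.sort(key=lambda i: scores[i], reverse=True)
    result = []
    while keep:
        best = keep.pop(0)
        result.append(best)
        keep = [i for i in keep if iou(boxes[best], boxes[i]) < iou_threshold]
    return result

boxes = [(0.1, 0.1, 0.4, 0.4), (0.12, 0.11, 0.42, 0.38), (0.6, 0.6, 0.9, 0.9)]
scores = [0.9, 0.75, 0.8]
print(non_max_suppression(boxes, scores))  # -> [0, 2]: duplicate box 1 is suppressed
```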

 

 

Source: blog.csdn.net/qq_44665418/article/details/106432736