Learning the YOLO Series: YOLO v1

YOLO v1 was published at CVPR 2016 and is a classic one-stage detection algorithm. Before I had come across YOLO, a friend and I had debated whether a classification-style network could directly regress the coordinates, width, and height of a bounding box; YOLO v1 proved that guess right.

Paper title: "You Only Look Once: Unified, Real-Time Object Detection"
Paper address: https://arxiv.org/pdf/1506.02640.pdf

v1 is the pioneer of the YOLO series. With its simple network structure and straightforward reproduction process (the author gives a detailed tutorial), it has been popular among CVers.
YOLO v1 set the **"divide and conquer"** tone for the whole YOLO family: in YOLO v1, the input image is divided into a 7x7 grid, as shown below:

(Figure: the input image divided into a 7x7 grid)
As shown above, the input image is divided into 7x7 cells, and each cell performs detection independently.

A point that easily misleads: each grid cell seems to have a limited field of view covering only a local feature, so it is hard to understand how YOLO can detect objects much larger than a grid cell. In fact, YOLO does not feed each grid cell to the model as a separate input. During inference, the grid is only used to assign responsibility for an object's center point; the image is never sliced, so no cell is cut off from the global context.
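In other words, the grid only decides which cell "owns" an object's center. A minimal sketch of that assignment (the function name is my own, for illustration):

```python
def responsible_cell(cx, cy, S=7):
    """Map a box center (cx, cy), normalized to [0, 1), to the grid cell
    responsible for it. YOLO v1 uses S = 7. Note the full image is still
    fed to the network; only responsibility is assigned per cell."""
    col = int(cx * S)  # column index, 0..S-1
    row = int(cy * S)  # row index, 0..S-1
    return row, col
```

A large object whose center lies in cell (3, 3) is predicted by that single cell, even though the box may span many cells.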

The structure of YOLO v1 can be further understood by comparison: next to a complex two-stage network like Faster R-CNN, YOLO v1's network structure appears much more approachable. The basic idea is this: box location, box size, and object class are all predicted by a single CNN, by brute force.
(Figure: the YOLO v1 network architecture)
The above block diagram of YOLO v1 makes the forward-propagation computation easy to follow. The output of v1 is a 7x7x30 tensor: 7x7 corresponds to the 7x7 grid over the input image, and the depth of 30 = 2 * 5 + 20 means each cell predicts two boxes with five parameters each (x, y, w, h, score) plus 20 class probabilities.
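The 2 * 5 + 20 layout can be made concrete by splitting the output tensor. The channel order below (boxes first, classes after) is one common convention, not something the paper mandates:

```python
import numpy as np

# Stand-in for the raw network output for one image: a 7x7x30 tensor.
out = np.random.rand(7, 7, 30)

# Per cell: 2 boxes * 5 values (x, y, w, h, score), then 20 class scores.
boxes = out[..., :10].reshape(7, 7, 2, 5)
class_probs = out[..., 10:]

print(boxes.shape)        # (7, 7, 2, 5)
print(class_probs.shape)  # (7, 7, 20)
```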
(Figure: composition of the 7x7x30 output tensor)
The depth of the output tensor determines which targets YOLO v1 can detect. v1's output depth is only 30, meaning each cell predicts just two boxes (and only 20 object classes are recognized), so it is not well suited to dense object detection or small-object detection.

Training

As mentioned earlier, YOLO is trained end to end: the predicted box position, box size, class, and confidence (score) are all trained through a single loss function.
(Figure: the YOLO v1 loss function)
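For reference, the paper's loss is a weighted sum-squared error with lambda_coord = 5 and lambda_noobj = 0.5, using sqrt(w) and sqrt(h) so large boxes do not dominate. A simplified sketch (it charges both boxes in a responsible cell rather than only the higher-IoU box, as the paper does):

```python
import numpy as np

LAMBDA_COORD = 5.0   # paper's weight for localization error
LAMBDA_NOOBJ = 0.5   # paper's weight for confidence in empty cells

def yolo_v1_loss(pred, target, obj_mask):
    """Simplified YOLO v1 loss for one image.

    pred, target: (7, 7, 30) arrays laid out as 2 boxes of
        (x, y, w, h, conf) followed by 20 class probabilities.
    obj_mask: (7, 7) bool array, True where a cell holds an object center.
    """
    pb = pred[..., :10].reshape(7, 7, 2, 5)
    tb = target[..., :10].reshape(7, 7, 2, 5)
    pc, tc = pred[..., 10:], target[..., 10:]

    obj = obj_mask.astype(float)[..., None]   # (7, 7, 1), broadcasts over boxes
    noobj = 1.0 - obj

    # Localization: squared error on x, y and on sqrt(w), sqrt(h).
    xy_err = np.sum((pb[..., :2] - tb[..., :2]) ** 2, axis=-1)
    wh_err = np.sum((np.sqrt(pb[..., 2:4]) - np.sqrt(tb[..., 2:4])) ** 2, axis=-1)
    loc = LAMBDA_COORD * np.sum(obj * (xy_err + wh_err))

    # Confidence: full weight in object cells, down-weighted in empty cells.
    conf_err = (pb[..., 4] - tb[..., 4]) ** 2
    conf = np.sum(obj * conf_err) + LAMBDA_NOOBJ * np.sum(noobj * conf_err)

    # Classification: squared error over the 20 class probabilities.
    cls = np.sum(obj * (pc - tc) ** 2)

    return loc + conf + cls
```

Because every term is a plain squared error, the whole loss is differentiable and can be backpropagated through the entire network at once.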

To sum up

The valuable characteristics of YOLO v1 that are preserved across the series (v2/v3 as well) can be summarized in three points:

  1. Leaky ReLU. Compared with ordinary ReLU, Leaky ReLU does not zero out negative inputs directly but multiplies them by a small constant factor, so negative outputs are retained but attenuated. The formula:
    f(x) = x if x > 0, else 0.1 * x
  2. Divide and conquer. The image is partitioned by a grid, and each region detects targets independently;
  3. End-to-end training. The loss function can be backpropagated through the whole network, which is the advantage of one-stage detection algorithms.
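The Leaky ReLU from point 1 is a one-liner (the 0.1 slope is the value used in the YOLO v1 paper):

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Pass positive values through unchanged; scale negatives by alpha."""
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-10.0, 0.0, 5.0])))  # [-1.  0.  5.]
```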


Origin blog.csdn.net/czp_374/article/details/91883969