An Introduction to the YOLO Object Detection Algorithm

The YOLO algorithm (You Only Look Once)

Suppose your input image is 100x100 and you lay a grid over it. To keep the discussion simple we use a 3x3 grid here; an actual implementation would use a finer one (e.g., 19x19). The basic idea is to take the image classification and localization algorithm and apply it to each of the nine grid cells. More specifically, you define the training labels as follows: each of the nine cells is assigned a label y, where y is an 8-dimensional vector with the same layout as described earlier, [Pc, bx, by, bh, bw, c1, c2, c3]. Pc = 1 indicates that the cell contains a target and Pc = 0 indicates background; c1, c2, c3 are the three classes to be recognized, such as pedestrians, cars, and motorcycles (background is not one of them). Each of the nine cells has such an 8-dimensional vector. For a cell that contains no target, the label is y = [0, ?, ?, ?, ?, ?, ?, ?], where ? means "don't care": any value is acceptable. For cells with targets, YOLO takes the center point of each object and assigns the object to the cell that contains that center point. In the figure, the car on the left is therefore assigned to cell 4 (cells are numbered from left to right and top to bottom; this cell is outlined in green), and the middle cell, cell 5, is labeled as containing no target. The target label for cell 4 is y = [1, bx, by, bh, bw, 0, 1, 0]. Every cell thus gets an 8-dimensional output vector, and since the 3x3 grid has nine cells, the total output size is 3x3x8.
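The labeling rule above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `build_labels`, the class order, the pixel coordinates of the two cars, and the use of `None` for the "don't care" entries are all made up for the example and do not come from the original figure.

```python
# Assumed class order for this sketch: pedestrian, car, motorcycle.
CLASSES = ["pedestrian", "car", "motorcycle"]

def build_labels(objects, image_size=100, grid=3):
    """Return a grid x grid nested list of 8-dim labels [Pc, bx, by, bh, bw, c1, c2, c3].

    objects: list of (class_name, cx, cy, w, h) in pixels, (cx, cy) = box center.
    """
    cell = image_size / grid
    # Cells without a target: Pc = 0, the other 7 values are "don't care" (None).
    labels = [[[0] + [None] * 7 for _ in range(grid)] for _ in range(grid)]
    for name, cx, cy, w, h in objects:
        # The object belongs to the single cell that contains its center point.
        col = min(int(cx // cell), grid - 1)
        row = min(int(cy // cell), grid - 1)
        bx = cx / cell - col   # center offset inside the cell, between 0 and 1
        by = cy / cell - row
        bh = h / cell          # box size in units of the cell size
        bw = w / cell
        one_hot = [1 if name == c else 0 for c in CLASSES]
        labels[row][col] = [1, bx, by, bh, bw] + one_hot
    return labels

# Two hypothetical cars: one centered in cell 4, one in cell 6 (1-indexed, row-major).
labels = build_labels([("car", 15, 55, 28, 16), ("car", 85, 50, 30, 18)])
print(labels[1][0])  # label of cell 4
print(labels[1][1])  # label of the middle cell: Pc = 0
```

The result has 3x3 cells with 8 values each, matching the 3x3x8 target described above.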

Now suppose you want to train a neural network whose input is 100x100x3 and whose output is 3x3x8. Each training example is an input image x together with its 3x3x8 target label y. Training the network with backpropagation teaches it to map any input x to an output volume y of that shape.

The advantage of this algorithm is that the network can output accurate bounding boxes. At test time, you feed in the input image x and run forward propagation until you get the output y. Then, for each of the nine positions in the 3x3 output, you can read off whether Pc is 1 or 0, that is, whether an object is associated with that cell. As long as each cell contains no more than one object, the algorithm works fine; the case of multiple objects in one cell is discussed later. In practice you might use a finer 19x19 grid, giving a 19x19x8 output; with a grid that fine, the probability of several objects being assigned to the same cell is much smaller. One more point worth repeating: an object is assigned to a cell by looking at the object's center point and choosing the cell that contains it, so even if the object spans several cells, it is assigned to only one of the nine.

1) The network can output bounding boxes of any aspect ratio and with more precise coordinates, because it is not constrained by the stride of a sliding-window classifier.

2) It is a convolutional implementation. You do not run the algorithm nine separate times on the nine cells of the 3x3 grid; instead, a single pass through one convolutional network produces all outputs, and much of the computation is shared between cells, so the algorithm is very efficient.

In fact, because YOLO is a convolutional implementation, it runs very fast and can achieve real-time detection.

One more detail: how are the bounding box values (bx, by, bh, bw) encoded?

In the figure above there are two cars and a 3x3 grid. Take the car on the right as an example: its center falls in the sixth cell, so in that cell's target label y we have Pc = 1, followed by the box values and then c1, c2, c3 = 0, 1, 0 (assuming the three classes are pedestrian, car, and motorcycle).

In YOLO, bounding boxes are expressed relative to the grid cell: by convention, the upper-left corner of each cell is (0, 0) and the lower-right corner is (1, 1). To specify the position of the car's center point (the orange dot in the figure), bx is about 0.4 and by is about 0.3. The width and height of the bounding box are expressed as fractions of the cell's width and height: the width of the red box is about 90% of the cell width, so bw = 0.9, and its height is about half the cell height, so bh = 0.5. In other words, bx, by, bh, and bw are all measured in units of the cell size, so bx and by must lie between 0 and 1, whereas bh and bw may be greater than 1. Other conventions are possible as well.
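To make the convention concrete, here is a small sketch that converts cell-relative values back to pixel coordinates. The helper name `decode_box` and the 100x100 image size with a 3x3 grid are assumptions taken from the running example, not part of any library.

```python
def decode_box(row, col, bx, by, bh, bw, image_size=100, grid=3):
    """Convert a cell-relative box to pixel values (cx, cy, w, h).

    row, col: 0-indexed position of the cell that owns the object.
    bx, by:   center offset inside the cell, (0, 0) = upper-left, (1, 1) = lower-right.
    bh, bw:   height and width as multiples of the cell size (may exceed 1).
    """
    cell = image_size / grid
    cx = (col + bx) * cell
    cy = (row + by) * cell
    return cx, cy, bw * cell, bh * cell

# The car on the right: cell 6 is row 1, col 2 when counting from zero.
cx, cy, w, h = decode_box(1, 2, bx=0.4, by=0.3, bh=0.5, bw=0.9)
print(cx, cy, w, h)

# A box larger than its cell is perfectly legal: bw = 1.5 gives a 50-pixel-wide box.
print(decode_box(1, 1, bx=0.5, by=0.5, bh=0.5, bw=1.5))
```

With these numbers the car's center lands at roughly (80, 43) pixels and the box is about 30 pixels wide and 17 pixels high.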

 

How do you tell whether an object detection algorithm is working well?

  Intersection over union (IoU) can be used to evaluate an object detection algorithm.

 

  IoU is computed from two bounding boxes as the ratio of their intersection (the orange hatched area in the figure) to their union (the green shaded area), i.e., the area of the intersection divided by the area of the union.

 

  By convention in computer vision, a detection is considered correct if its IoU with the ground-truth box is greater than or equal to 0.5; if the predicted and ground-truth bounding boxes overlap perfectly, the IoU is 1. The IoU threshold can be set differently depending on the task.
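A minimal sketch of the IoU computation, assuming boxes are given by their corner coordinates (x1, y1, x2, y2); that representation and the example boxes are choices made for this illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so that non-overlapping boxes get an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

ground_truth = (10, 10, 50, 50)
prediction = (20, 20, 60, 60)
score = iou(ground_truth, prediction)
is_correct = score >= 0.5  # the conventional threshold mentioned above
print(round(score, 3), is_correct)
```

Here the intersection is 30x30 = 900 and the union is 1600 + 1600 - 900 = 2300, so the IoU is about 0.39 and this prediction would not count as correct at the 0.5 threshold.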

 

Non-maximum suppression (NMS)

 

  One problem with object detection as described so far is that the algorithm may detect the same object several times. Non-maximum suppression ensures that each object is detected only once.

 

For example, suppose you need to detect pedestrians and cars in this picture and you lay a 19x19 grid over it. In theory each car has only one center point, so it should be assigned to only one cell. In practice, however, the classification and localization algorithm is run for every cell, and several cells may each conclude that the center of the object lies inside them.

 

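Because the classification and localization algorithm runs on all 361 cells, many of them may report a high probability of containing a car, so you can end up with multiple detections of the same object, as shown in the figure below. Non-maximum suppression cleans up these detections so that each car is detected once rather than triggering several detections.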

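Concretely, the algorithm works like this. First look at the probability Pc associated with each detection. Take the one with the highest probability, 0.9 in this example, and declare it the most reliable detection. Non-maximum suppression then examines the remaining rectangles one by one, and every box that has a high IoU with this box, i.e., overlaps it heavily, is suppressed. Here the two rectangles with Pc of 0.6 and 0.7 overlap the 0.9 rectangle heavily, so they are suppressed. Next, among the rectangles that are left, find the one with the highest probability, which is the 0.8 rectangle on the left, and again remove the other rectangles that have a high IoU with it. The rectangles that remain are the final result.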

Non-maximum suppression means that you output only the detections with maximal probability and suppress those that are close by but not maximal.

Algorithm details

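First run the algorithm on the 19x19 grid, which gives an output of size 19x19x8. To simplify this example we detect cars only, so each of the 19x19 = 361 cells outputs the prediction [Pc, bx, by, bh, bw], where Pc is the probability that an object is present.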

 

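To implement non-maximum suppression, first discard every bounding box whose Pc is less than or equal to some threshold (e.g., 0.6), i.e., throw away all low-probability boxes. Then, while any boxes remain: (1) pick the box with the highest Pc and output it as a prediction; (2) discard every remaining box that has a high IoU with the box just output (on the slide, these are the darkened boxes that overlap heavily with the highlighted one). Repeat until every box has been dealt with: each one is either output as a result or discarded.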
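The loop described above can be sketched as follows for the single-class case. The box coordinates are invented to mimic the figure (three overlapping boxes on the right car, two on the left one), and the IoU threshold of 0.5 for "high overlap" is an assumption, since the text does not give a value.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(detections, pc_threshold=0.6, iou_threshold=0.5):
    """detections: list of (pc, (x1, y1, x2, y2)). Returns the detections to output."""
    # Step 0: throw away every box with Pc <= threshold.
    remaining = [d for d in detections if d[0] > pc_threshold]
    remaining.sort(key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        # Step 1: output the most confident remaining box.
        best = remaining.pop(0)
        kept.append(best)
        # Step 2: discard boxes that overlap heavily with the box just output.
        remaining = [d for d in remaining if iou(best[1], d[1]) < iou_threshold]
    return kept

detections = [
    (0.9, (60, 40, 95, 60)), (0.7, (62, 42, 97, 62)), (0.6, (58, 38, 93, 58)),  # right car
    (0.8, (5, 45, 35, 62)), (0.7, (7, 46, 37, 63)),                             # left car
]
kept = non_max_suppression(detections)
print(kept)
```

With these made-up boxes, only the 0.9 box and the 0.8 box survive: one detection per car.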

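The algorithm above handles a single class. If you are detecting three classes at once, say pedestrians, cars, and motorcycles, the output vector has three additional components, and the right thing to do is to run non-maximum suppression three times independently, once per class.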

 

Anchor Boxes

 

So far, each grid cell can detect only one object. If you want a cell to detect multiple objects, you can use anchor boxes.

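Suppose you have an image like this one, and we again use a 3x3 grid. Notice that the center of the pedestrian and the center of the car are at almost the same place, so both fall into the same cell. If the label y for that cell is the 8-dimensional vector above, which can represent one object from three classes (pedestrian, car, motorcycle), it cannot output both detections, and you would have to pick one of the two.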

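The idea of anchor boxes is this: predefine two anchor boxes of different shapes and associate each prediction with one of them. In practice you might use more anchor boxes, perhaps five or more, but two are enough for this explanation. The label is then no longer the vector above but that vector repeated twice, i.e., each anchor box gets its own y = [Pc, bx, by, bh, bw, c1, c2, c3]. Because the pedestrian's shape is more similar to anchor box 1 than to anchor box 2, the first eight values are used to encode the pedestrian.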

 

 

Let's look at a concrete example.

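Because the pedestrian is closer in shape to anchor box 1, the pedestrian is assigned to the first half of the vector; likewise, the car is assigned to the second half. If a cell contains a car but no pedestrian, the anchor box 1 half has Pc = 0 (its other values are don't-cares) and the anchor box 2 half describes the car. With two anchor boxes, the algorithm does not handle a cell that contains three objects well; nor does it handle two objects in the same cell that have the same anchor box shape.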
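The assignment of objects to anchor boxes within one cell might look like the sketch below. Two things here are assumptions rather than statements from the text: "more similar in shape" is measured as the IoU of the two shapes when centered on the same point, and the anchor sizes in `ANCHORS` are made-up values (a tall box and a wide box).

```python
CLASSES = ["pedestrian", "car", "motorcycle"]
# Hypothetical anchor shapes as (width, height) in cell units: tall, then wide.
ANCHORS = [(0.4, 1.2), (1.2, 0.5)]

def shape_iou(w1, h1, w2, h2):
    """IoU of two boxes sharing the same center, so only their shapes matter."""
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)

def cell_label(objects):
    """Build the 16-dim label of one cell.

    objects: list of (class_name, bx, by, bh, bw) whose centers fall in this cell.
    """
    label = []
    for _ in ANCHORS:
        label += [0] + [None] * 7  # Pc = 0, the rest are "don't care"
    for name, bx, by, bh, bw in objects:
        scores = [shape_iou(bw, bh, aw, ah) for aw, ah in ANCHORS]
        k = scores.index(max(scores))  # anchor whose shape matches best
        one_hot = [1 if name == c else 0 for c in CLASSES]
        label[8 * k:8 * k + 8] = [1, bx, by, bh, bw] + one_hot
    return label

both = cell_label([("pedestrian", 0.5, 0.6, 1.1, 0.35), ("car", 0.5, 0.7, 0.45, 1.3)])
car_only = cell_label([("car", 0.5, 0.7, 0.45, 1.3)])
print(both)
print(car_only)
```

Note that this sketch inherits the limitation described above: if two objects in the same cell match the same anchor box, the second one simply overwrites the first.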

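Finally, how do you choose the anchor boxes? People usually specify the shapes by hand: you might choose 5 to 10 anchor box shapes that cover the variety of shapes of the objects you want to detect. A more advanced, automatic approach is to use k-means to cluster the shapes of the objects you are trying to detect (possibly a dozen or so classes) and take the most representative shapes as the set of anchor boxes.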
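As a rough illustration of the k-means idea, the sketch below clusters (width, height) pairs and returns the cluster centers as anchor shapes. It is deliberately simplified and rests on assumptions: it uses plain Euclidean distance on width and height, it initializes the centroids with the first k shapes so that the result is deterministic, and the six shapes are invented.

```python
def kmeans_anchors(shapes, k, iterations=20):
    """Cluster (width, height) pairs with plain k-means and return k centroids."""
    centroids = list(shapes[:k])  # deterministic init: the first k shapes (assumption)
    for _ in range(iterations):
        # Assignment step: attach each shape to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for w, h in shapes:
            dists = [(w - cw) ** 2 + (h - ch) ** 2 for cw, ch in centroids]
            clusters[dists.index(min(dists))].append((w, h))
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(m[0] for m in members) / len(members)
                mean_h = sum(m[1] for m in members) / len(members)
                new_centroids.append((mean_w, mean_h))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

# Made-up box shapes in cell units: three tall (pedestrian-like), three wide (car-like).
shapes = [(0.3, 1.0), (1.2, 0.5), (0.4, 1.2), (1.4, 0.6), (0.35, 1.1), (1.3, 0.55)]
anchors = kmeans_anchors(shapes, k=2)
print(anchors)
```

On this toy data the two centroids settle near (0.35, 1.1) and (1.3, 0.55), i.e., one tall and one wide anchor box.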

 

The YOLO algorithm

 

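First, how is the training set constructed? Suppose you are training an algorithm to detect three classes: pedestrians, cars, and motorcycles; the background is handled explicitly through Pc rather than as a fourth class. With two anchor boxes, the output y is 3x3 (for the 3x3 grid) x 2 (the number of anchor boxes) x 8. To build the training set, you go through the nine cells and form the target vector y for each. The first cell contains none of the objects to be detected, so its target is y = [0, ?, ?, ?, ?, ?, ?, ?, 0, ?, ?, ?, ?, ?, ?, ?]. For the cell that contains the car in the figure below, the target is y = [0, ?, ?, ?, ?, ?, ?, ?, 1, bx, by, bh, bw, 0, 1, 0], because the car's shape is closer to the second anchor box. The final output size is 3x3x16 (in practice it could be 19x19x16, or 19x19x5x8 = 19x19x40 with five anchor boxes). That is the training set; you then train a convolutional network whose input is an image, perhaps 100x100x3, and whose output is 3x3x16, or equivalently 3x3x2x8.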

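Next, how does the algorithm make predictions? Given an input image, the network outputs a 3x3x2x8 volume: each of the nine cells has its corresponding vector. Finally, you run non-maximum suppression on these predictions.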

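To make this more interesting, let's look at a new test image and walk through non-maximum suppression. With two anchor boxes, each of the nine cells produces two predicted bounding boxes, some of them with very low Pc. Suppose the boxes we get look like those below; note that some boxes extend beyond the height and width of their cell.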

Next, discard the low-probability predictions, i.e., the boxes for which even the network says there is probably nothing there.

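Finally, if you have three classes to detect (pedestrians, cars, and motorcycles), run non-maximum suppression separately for each class, on the boxes predicted to belong to that class.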
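The whole prediction step (decode the 3x3x2x8 output, drop low-probability boxes, then run non-maximum suppression once per class) can be sketched as below. The network output is faked by hand, and several details are assumptions made for the illustration: the thresholds of 0.6 and 0.5, picking each box's class as the largest of c1, c2, c3, and the helper names.

```python
CLASSES = ["pedestrian", "car", "motorcycle"]

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def decode(output, image_size=100):
    """output[row][col][anchor] = [Pc, bx, by, bh, bw, c1, c2, c3].

    Returns a list of (pc, class_index, (x1, y1, x2, y2)) in pixels.
    """
    grid = len(output)
    cell = image_size / grid
    detections = []
    for row in range(grid):
        for col in range(grid):
            for pc, bx, by, bh, bw, *class_scores in output[row][col]:
                cx, cy = (col + bx) * cell, (row + by) * cell
                w, h = bw * cell, bh * cell
                cls = class_scores.index(max(class_scores))
                box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
                detections.append((pc, cls, box))
    return detections

def predict(output, pc_threshold=0.6, iou_threshold=0.5):
    # Discard boxes the network itself considers unlikely.
    detections = [d for d in decode(output) if d[0] > pc_threshold]
    results = []
    for cls in range(len(CLASSES)):  # independent NMS for each class
        remaining = sorted((d for d in detections if d[1] == cls),
                           key=lambda d: d[0], reverse=True)
        while remaining:
            best = remaining.pop(0)
            results.append(best)
            remaining = [d for d in remaining if iou(best[2], d[2]) < iou_threshold]
    return results

# Hand-made stand-in for a network output: every box starts with a very low Pc.
output = [[[[0.05, 0.5, 0.5, 0.1, 0.1, 0, 0, 0] for _ in range(2)]
           for _ in range(3)] for _ in range(3)]
output[1][2][1] = [0.9, 0.10, 0.3, 0.5, 0.9, 0.1, 0.8, 0.1]    # car, anchor box 2
output[1][1][1] = [0.7, 0.95, 0.3, 0.5, 0.9, 0.1, 0.8, 0.1]    # same car, neighboring cell
output[1][2][0] = [0.8, 0.15, 0.5, 1.2, 0.3, 0.9, 0.05, 0.05]  # pedestrian, anchor box 1

for pc, cls, box in predict(output):
    print(CLASSES[cls], pc, [round(v, 1) for v in box])
```

In this toy output the duplicate car detection from the neighboring cell is suppressed, while the pedestrian survives even though it overlaps the car, because the two classes are processed separately.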

 

Origin blog.csdn.net/qq_24946843/article/details/90516844