Deep Learning Application - Computer Vision - Object Detection [4]: Overview, Bounding Box, Anchor Box, Intersection over Union (IoU), Non-Maximum Suppression (NMS), Soft NMS

This column introduces in detail the [Introduction to Advanced Deep Learning] must-see series, covering activation functions, optimization strategies, loss functions, model tuning, normalization algorithms, convolutional models, sequence models, pre-trained models, adversarial neural networks, and more.

This column is mainly intended to help beginners grasp the relevant knowledge quickly. In follow-up articles we will continue to analyze the principles behind deep learning, so that readers can build up knowledge while practicing on projects, understanding not only what works but also why it works.

Disclaimer: Some projects are classic online projects provided for quick learning; practical material (competitions, papers, real-world applications, etc.) will be added later.


1. Overview of Object Detection

For a computer, what it can "see" are only the numbers an image is encoded into. It is difficult for it to understand high-level semantic concepts, such as whether the target appearing in an image or video frame is a person or an object, let alone locate the region where the target appears. The main purpose of object detection is to let the computer automatically identify the categories of all targets in an image or video frame and draw a bounding box around each target to mark its position, as shown in Figure 1.

Figure 1 Schematic diagram of image classification and object detection

  • Figure 1(a) is an image classification task, only need to identify the category of this picture.
  • Figure 1(b) is a target detection task. It is not only necessary to identify the category in this picture as a zebra, but also to mark the position of the zebra in the picture.

1.1 Application scenarios

As shown in Figure 2, object detection today has many application scenarios in both daily life and industrial production.

  • Consumer entertainment: face unlocking of smartphones and face payment in payment apps; product detection for vending machines; picture and video review on video websites, etc.;

  • Smart transportation: pedestrian detection, vehicle detection, traffic light detection, etc. in automatic driving;

  • Industrial production: parts counting and defect detection in industrial production; equipment status monitoring in inspection scenarios; smoke and fire detection on factory premises, safety helmet detection, etc.;

  • Smart medical care: detection of lesions in the fundus, lungs, and other organs; mask detection during the COVID-19 epidemic, etc.

Figure 2 Object detection application scenarios

1.2 Development History of Object Detection

In the image classification task, we first use a convolutional neural network to extract image features, then use these features to predict classification probabilities, build a classification loss function from the training labels, and train end to end, as shown in Figure 3.

Figure 3 Schematic diagram of image classification process

For the object detection problem, however, the pipeline of Figure 3 does not work: features extracted from the whole image cannot reflect the differences between individual targets, so in the end it is impossible to mark the position of each object separately.

To solve this problem, drawing on the successful experience of image classification, we can split the detection task. Suppose some method generates, on the input image, a series of regions that may contain objects; these are called candidate regions (proposals). Each candidate region can then be treated as a separate image, and an image classification model can be used to decide which category it belongs to, or whether it is background (i.e., a category containing no object). We learned how to solve image classification in the previous section, so classifying an image with a convolutional neural network is no longer difficult.

The key problem now is how to generate candidate regions. For example, we could generate them exhaustively, as shown in Figure 4.

Figure 4 Candidate area

Let A be a pixel of the image and B another pixel to the lower right of A; A and B define a rectangular box, denoted AB.

  • As shown in Figure 4(a): A is at the top-left corner of the image, and B traverses every other position, generating rectangular boxes $A_1B_1, \ldots, A_1B_n, \ldots$
  • As shown in Figure 4(b): A is at some position in the middle of the image, and B traverses every position to the lower right of A, generating rectangular boxes $A_kB_1, \ldots, A_kB_n, \ldots$

When A traverses all pixels of the image and B traverses all pixels to the lower right of A, the resulting set of rectangular boxes $\{A_iB_j\}$ contains every selectable region on the image.

As long as each candidate region is classified accurately enough, we will eventually find a region that is close enough to the actual object. The exhaustive method may yield correct predictions, but its computational cost is enormous: the total number of candidate regions is roughly $\frac{W^2 H^2}{4}$. For $H = W = 100$, this already reaches $2.5 \times 10^{7}$, which makes the method impractical. Still, it shows that if the classification task were solved perfectly, the detection task could in principle be solved as well. The pressing problem is therefore how to design a suitable way of generating candidate regions.
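To see where this estimate comes from (a quick back-of-the-envelope check, not part of the original text): a box AB is determined by choosing two x-coordinates with $x_A \le x_B$ and two y-coordinates with $y_A \le y_B$, so the count and its approximation are

$$\frac{W(W+1)}{2} \cdot \frac{H(H+1)}{2} \approx \frac{W^2 H^2}{4},$$

and for $W = H = 100$ this gives $5050 \times 5050 \approx 2.55 \times 10^{7}$, matching the estimate above.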

Researchers then began to ask: would it be possible to first generate candidate regions with traditional image algorithms, and then use a convolutional neural network to classify those regions?

  • In 2013, Ross Girshick et al. applied CNNs to the object detection task for the first time, using the traditional Selective Search algorithm to generate candidate regions, and achieved great success. This is the Region-based Convolutional Neural Network (R-CNN [1]) model, which has had a profound impact on the object detection field.
  • In 2015, Ross Girshick improved this method and proposed the Fast R-CNN [2] model. By sharing the convolutional computation across candidate regions, the amount of computation is greatly reduced and processing speed is improved. A regression method for adjusting the position of the target object is also introduced, further improving the accuracy of position prediction.
  • In 2015, Shaoqing Ren et al. proposed the Faster R-CNN [3] model, introducing the Region Proposal Network (RPN) to generate candidate regions. This removes the need for traditional image processing algorithms to generate candidate regions and further improves processing speed.
  • In 2017, Kaiming He et al. proposed the Mask R-CNN [4] model, which adds only a relatively small amount of computation to Faster R-CNN while simultaneously handling object detection and instance segmentation.

The above are all well-known models of the R-CNN series, which have had a great influence on the development of object detection. There are also other models, such as SSD [5], YOLO [6,7,8], and R-FCN [9], which are likewise popular architectures in the object detection field. Figure 5, taken from the object detection survey [10], summarizes the development of object detection algorithms in recent years.

Figure 5 Development of object detection algorithms

Because the R-CNN series of algorithms splits the detection task into two stages, first generating candidate regions on the image and then classifying them and regressing the object positions, they are usually called two-stage detection algorithms. The SSD and YOLO series use a single network to generate candidate regions and predict object categories and locations at the same time, so they are usually called single-stage detection algorithms.

As mentioned above, obtaining candidate regions exhaustively is unrealistic. A common idea in the later classic algorithms is therefore to use anchors to extract candidate boxes: an anchor is one of a set of candidate boxes with preset sizes and aspect ratios, and candidate regions are obtained by sliding these boxes over the image.

Since these algorithms use anchors to extract candidate boxes and classify and regress the anchors at every point of the feature map, they are collectively referred to as anchor-based algorithms.

However, this Anchor-based method has some problems in practical applications:

  • Anchors are designed manually: when the dataset changes, how many should be used, how large should they be, and what aspect ratios should they have?
  • Anchors form a large, dense set of boxes: how should positive and negative samples be chosen during training?
  • Anchor settings introduce additional hyperparameters, which makes extending to new applications more cumbersome.

Because of these shortcomings, researchers have in recent years proposed another class of algorithms with excellent results. These algorithms no longer regress prediction boxes from anchors, so they are called anchor-free algorithms, for example CornerNet [11] and CenterNet [12]. Figure 6 lists the classic anchor-based and anchor-free algorithms.

Figure 6 Development of deep-learning-based object detection algorithms

Anchor-based and anchor-free algorithms each have their own advantages. The following table briefly compares the strengths and weaknesses of the main types.

|  | Anchor-Based single-stage | Anchor-Based two-stage | Anchor-Free |
| --- | --- | --- | --- |
| Network structure | simple | complex | simple |
| Accuracy | excellent | better | better |
| Prediction speed | fast | slightly slower | fast |
| Hyperparameters | more | many | relatively few |
| Scalability | average | average | better |

1.3 Common Datasets

In the object detection field, the commonly used open-source datasets are mainly the following four: Pascal VOC [13], COCO [14], Object365 [15], and OpenImages [16]. These datasets differ in the number of categories, the number of images, and the total number of object boxes, and therefore in difficulty. Their details are summarized in the table below.

| Dataset | Categories | Train images, boxes | Val images, boxes | Boxes/image |
| --- | --- | --- | --- | --- |
| Pascal VOC-2012 | 20 | 5717, 13,000+ | 5823, 13,000+ | 2.4 |
| COCO | 80 | 118287, 40,000+ | 5000, 36,000+ | 7.3 |
| Object365 | 365 | 600k, 9623k | 38k, 479k | 16 |
| OpenImages18 | 500 | 1643042, 860,000+ | 100,000, 696,000+ | 7.0 |
  • Pascal VOC-2012: The VOC dataset is the dataset used in the PASCAL VOC Challenge. It contains 20 common categories and is one of the classic academic datasets in object detection.
  • COCO: The COCO dataset is a classic large-scale dataset for object detection, segmentation, and pose estimation. Its images are mainly drawn from complex everyday scenes and cover 80 categories. Current academic papers commonly report accuracy on COCO.
  • Object365: A large-scale general-purpose object detection dataset released by Megvii Technology, with 365 categories.
  • OpenImages18: A very large-scale dataset released by Google, with 500 categories.

References

[1] Rich feature hierarchies for accurate object detection and semantic segmentation

[2] Fast R-CNN

[3] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

[4] Mask R-CNN

[5] SSD: Single Shot MultiBox Detector

[6] You Only Look Once: Unified, Real-Time Object Detection

[7] YOLO9000: Better, Faster, Stronger

[8] YOLOv3: An Incremental Improvement

[9] R-FCN: Object Detection via Region-based Fully Convolutional Networks

[10] Object Detection in 20 Years: A Survey

[11] CornerNet: Detecting Objects as Paired Keypoints

[12] Objects as Points

[13] Pascal VOC

[14] COCO

[15] Object365

[16] OpenImages

2. Bounding box

In the detection task, we need to predict both the category and the location of the object, so we need to introduce some location-related concepts. Usually, a bounding box (bbox) is used to represent the position of an object: a rectangular box that just contains the object. As shown in Figure 1, the three people in the image correspond to three bounding boxes.

Figure 1 Bounding box

There are usually two ways to express the position of the bounding box:

  1. $xyxy$, i.e. $(x_1, y_1, x_2, y_2)$, where $(x_1, y_1)$ is the coordinate of the top-left corner of the rectangular box and $(x_2, y_2)$ is the coordinate of the bottom-right corner. The three red rectangles in Figure 1 are expressed in the $xyxy$ format as follows:
  • Left: $(40.93, 141.1, 226.99, 515.73)$
  • Middle: $(214.29, 325.03, 399.82, 631.37)$
  • Right: $(247.2, 131.62, 480.0, 639.32)$
  2. $xywh$, i.e. $(x, y, w, h)$, where $(x, y)$ is the coordinate of the center of the rectangular box, $w$ is its width, and $h$ is its height.

In the detection task, the labels of the training dataset give the $(x_1, y_1, x_2, y_2)$ of the real bounding box of each target object; such a bounding box is called a ground-truth box. Figure 1 shows the ground-truth boxes of the three people. The model predicts where target objects may appear, and a bounding box predicted by the model is called a prediction box.

To complete a detection task, we usually expect the model to output, for an input image, a set of predicted bounding boxes together with the category of the object in each box or the probability of it belonging to a certain category, for example in the format $[L, P, x_1, y_1, x_2, y_2]$, where $L$ is the predicted class label and $P$ is the probability that the predicted object belongs to that class. One input image may generate multiple prediction boxes. Let's now look at how to accomplish this task.


Notice:

  1. When reading the code, pay attention to which format representation is used.
  2. The origin of image coordinates is at the top-left corner; the positive direction of the $x$ axis points right, and the positive direction of the $y$ axis points down.
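Since the two formats appear interchangeably in the code below, here is a minimal conversion sketch; the helper names xyxy_to_xywh and xywh_to_xyxy are ours (not from the original tutorial), and it uses the continuous-coordinate convention without the +1 pixel offset used in some of the area computations later on.

def xyxy_to_xywh(box):
    # [x1, y1, x2, y2] -> [center_x, center_y, width, height]
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return [x1 + w / 2.0, y1 + h / 2.0, w, h]

def xywh_to_xyxy(box):
    # [center_x, center_y, width, height] -> [x1, y1, x2, y2]
    cx, cy, w, h = box
    return [cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0]

# the left portrait box from Figure 1, converted and converted back
print(xyxy_to_xywh([40.93, 141.1, 226.99, 515.73]))
print(xywh_to_xyxy(xyxy_to_xywh([40.93, 141.1, 226.99, 515.73])))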

3. Anchor box

Object detection algorithms usually sample a large number of regions from the input image, determine whether each region contains an object of interest, and then adjust the region's edges to predict the ground-truth bounding box more accurately. Different models may use different region-sampling methods. Here we introduce one of them: generating multiple bounding boxes with different sizes and aspect ratios centered on each pixel. These bounding boxes are called anchor boxes.

In an object detection task, we first set the sizes and shapes of the anchor boxes, then place them centered on points of the image and treat them as possible candidate regions.


At present, the commonly used ways of choosing anchor box sizes are:

  1. manual selection based on experience
  2. k-means clustering over the ground-truth box sizes (a small sketch follows after this list)
  3. learning them as hyperparameters
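As an illustration of option 2, below is a minimal sketch of clustering ground-truth box sizes with k-means using 1 − IoU as the distance, in the spirit of YOLOv2; the function names (wh_iou, kmeans_anchors) and the random toy data are ours, not part of the original tutorial.

import numpy as np

def wh_iou(wh, anchors):
    # IoU between boxes and anchors, assuming all boxes share the same top-left corner
    # wh: [N, 2] array of (width, height); anchors: [k, 2]
    inter = np.minimum(wh[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(wh[:, None, 1], anchors[None, :, 1])
    union = wh[:, None, 0] * wh[:, None, 1] + \
            anchors[None, :, 0] * anchors[None, :, 1] - inter
    return inter / union  # shape [N, k]

def kmeans_anchors(wh, k=3, iters=100, seed=0):
    # wh: [N, 2] ground-truth (width, height) pairs; returns k clustered anchor sizes
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each box to the anchor with the smallest 1 - IoU distance
        assign = np.argmax(wh_iou(wh, anchors), axis=1)
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = wh[assign == j].mean(axis=0)  # update the cluster center
    return anchors

# toy example: 200 random box sizes
wh = np.abs(np.random.randn(200, 2)) * 50 + 20
print(kmeans_anchors(wh, k=3))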

The model predicts whether these candidate regions contain objects and, if so, which category each object belongs to. More importantly, the model needs to predict a fine-tuning offset: since the positions of the anchor boxes are fixed, they are unlikely to coincide exactly with the object bounding boxes, so the anchors must be adjusted to form prediction boxes that accurately describe the object positions.

During training, by continuously adjusting its parameters the model learns to judge whether the candidate region represented by an anchor box contains an object, which category the object belongs to if it does, and how much the object bounding box is offset from the anchor box. Different models often generate anchor boxes in different ways.

The following program generates three anchor boxes centered at the pixel [300, 500], shown as the blue boxes in Figure 2; anchor box A1 is already very close to the portrait region.

Figure 2 Anchor box

# Plot: drawing bounding boxes and anchor boxes
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib.image import imread
import math

# Helper for drawing a rectangular box
def draw_rectangle(currentAxis, bbox, edgecolor = 'k', facecolor = 'y', fill=False, linestyle='-'):
    # currentAxis: the axes object, obtained via plt.gca()
    # bbox: bounding box, a list of four values [x1, y1, x2, y2]
    # edgecolor: border color
    # facecolor: fill color
    # fill: whether to fill the rectangle
    # linestyle: border line style

    # patches.Rectangle(xy, width, height, linewidth, edgecolor, facecolor, fill, linestyle)
    # xy: anchor corner coordinates; width/height: size of the rectangle; linewidth: line width;
    # edgecolor: border color; facecolor: fill color; fill: whether to fill; linestyle: line style
    rect=patches.Rectangle((bbox[0], bbox[1]), bbox[2]-bbox[0]+1, bbox[3]-bbox[1]+1, linewidth=1,
                           edgecolor=edgecolor,facecolor=facecolor,fill=fill, linestyle=linestyle)
    currentAxis.add_patch(rect)

    
plt.figure(figsize=(10, 10))
# Path to the input image
filename = '/home/aistudio/work/images/section3/000000086956.jpg'
im = imread(filename)
plt.imshow(im)

# Ground-truth boxes in xyxy format
bbox1 = [214.29, 325.03, 399.82, 631.37]
bbox2 = [40.93, 141.1, 226.99, 515.73]
bbox3 = [247.2, 131.62, 480.0, 639.32]

currentAxis=plt.gca()
# Draw the 3 ground-truth boxes
draw_rectangle(currentAxis, bbox1, edgecolor='r')
draw_rectangle(currentAxis, bbox2, edgecolor='r')
draw_rectangle(currentAxis, bbox3,edgecolor='r')

# Draw anchor boxes
def draw_anchor_box(center, length, scales, ratios, img_height, img_width):
    """
    Generate a series of anchor boxes centered at center.
    length is a base length; scales is a list of size scales; ratios is a list of aspect ratios.
    img_height and img_width give the image size; generated anchor boxes are clipped to the image.
    """
    bboxes = []
    for scale in scales:
        for ratio in ratios:
            h = length*scale*math.sqrt(ratio)
            w = length*scale/math.sqrt(ratio) 
            x1 = max(center[0] - w/2., 0.)
            y1 = max(center[1] - h/2., 0.)
            x2 = min(center[0] + w/2. - 1.0, img_width - 1.0)
            y2 = min(center[1] + h/2. - 1.0, img_height - 1.0)
            print(center[0], center[1], w, h)
            bboxes.append([x1, y1, x2, y2])

    for bbox in bboxes:
        draw_rectangle(currentAxis, bbox, edgecolor = 'b')

img_height = im.shape[0]
img_width = im.shape[1] 
# Draw the anchor boxes
draw_anchor_box([300., 500.], 100., [2.0], [0.5, 1.0, 2.0], img_height, img_width)

################# The following adds the text labels and arrows shown in the figure above ###############################
plt.text(285, 285, 'G1', color='red', fontsize=20)
plt.arrow(300, 288, 30, 40, color='red', width=0.001, length_includes_head=True, \
         head_width=5, head_length=10, shape='full')

plt.text(190, 320, 'A1', color='blue', fontsize=20)
plt.arrow(200, 320, 30, 40, color='blue', width=0.001, length_includes_head=True, \
         head_width=5, head_length=10, shape='full')

plt.text(160, 370, 'A2', color='blue', fontsize=20)
plt.arrow(170, 370, 30, 40, color='blue', width=0.001, length_includes_head=True, \
         head_width=5, head_length=10, shape='full')

plt.text(115, 420, 'A3', color='blue', fontsize=20)
plt.arrow(127, 420, 30, 40, color='blue', width=0.001, length_includes_head=True, \
         head_width=5, head_length=10, shape='full')

plt.show()

The concept of the anchor box was first proposed in the Faster R-CNN [1] object detection algorithm and was later adopted by various detectors such as YOLOv2 [2]. Compared with the sliding-window or Selective Search approaches used in early detection algorithms, extracting candidate regions with anchor boxes greatly reduces the time cost. Compared with YOLOv1 [3], which regresses box coordinates directly, using anchor boxes simplifies the detection problem: the network only needs to learn position offsets relative to the anchors, which makes the model easier to train.
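To make "learning offsets relative to an anchor" concrete, here is a small sketch of the offset parameterization popularized by Faster R-CNN and the YOLO family (center shifts plus log-scale width/height factors); the function decode_anchor and the sample offset numbers are ours for illustration.

import numpy as np

def decode_anchor(anchor_xywh, offsets):
    # anchor_xywh: [cx, cy, w, h] of the anchor box
    # offsets: [tx, ty, tw, th] predicted by the network
    cx, cy, w, h = anchor_xywh
    tx, ty, tw, th = offsets
    px = cx + tx * w          # shift the center by a fraction of the anchor size
    py = cy + ty * h
    pw = w * np.exp(tw)       # rescale width and height in log space
    ph = h * np.exp(th)
    return [px, py, pw, ph]

# the 200x200 anchor centered at pixel (300, 500) from the example above, plus a small offset
print(decode_anchor([300., 500., 200., 200.], [0.1, -0.05, 0.2, 0.3]))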

[1] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

[2] YOLO9000: Better, Faster, Stronger

[3] You Only Look Once: Unified, Real-Time Object Detection

4. Intersection over Union (IoU)

In object detection tasks, Intersection over Union (IoU) is usually used to measure the relationship between two rectangular boxes. For example, in anchor-based detection algorithms, when an anchor box contains an object we need to predict the object category and fine-tune the anchor coordinates to obtain the final prediction box. To judge whether an anchor box contains an object, we use IoU: when the IoU between the anchor box and the ground-truth box is large enough, we consider the object to be contained in the anchor box; when it is very small, we consider the object not to be contained. IoU is also used later, in the NMS step, to decide whether different rectangular boxes overlap.

The concept of IoU comes from sets in mathematics. It describes the relationship between two sets $A$ and $B$: the number of elements in their intersection divided by the number of elements in their union. The formula is:

$$IoU = \frac{A\cap B}{A \cup B}$$

We use this concept to describe how much two boxes coincide. Each box can be regarded as a set of pixels, and their IoU equals the area of the overlapping part of the two boxes divided by the area of their union. In Figure 1, the cyan area under "Intersection" is the overlapping area of the two boxes, and the blue area under "Union" is their combined area; dividing the two gives the IoU.

Figure 1 Intersection over Union

Suppose the positions of two rectangular boxes A and B are:

$$A: [x_{a1},\ y_{a1},\ x_{a2},\ y_{a2}]$$

$$B: [x_{b1},\ y_{b1},\ x_{b2},\ y_{b2}]$$

If the positional relationship is shown in Figure 2:

Figure 2 IoU calculation

If there is an intersection between the two, the coordinates of the upper left corner of the intersection are:

$$x_1 = \max(x_{a1}, x_{b1}), \qquad y_1 = \max(y_{a1}, y_{b1})$$

The coordinates of the lower right corner of the intersecting part are:

$$x_2 = \min(x_{a2}, x_{b2}), \qquad y_2 = \min(y_{a2}, y_{b2})$$

Calculate the area of the intersection:

$$intersection = \max(x_2 - x_1 + 1.0,\ 0) \cdot \max(y_2 - y_1 + 1.0,\ 0)$$

The areas of rectangles A and B are respectively:

$$S_A = (x_{a2} - x_{a1} + 1.0) \cdot (y_{a2} - y_{a1} + 1.0)$$

$$S_B = (x_{b2} - x_{b1} + 1.0) \cdot (y_{b2} - y_{b1} + 1.0)$$

Compute the area of the union:

$$union = S_A + S_B - intersection$$

Finally, compute the IoU:

$$IoU = \frac{intersection}{union}$$

The implementation code for IoU is as follows:

  • When the boxes are given in xyxy format:
import numpy as np

# Compute IoU; box coordinates are in xyxy format
def box_iou_xyxy(box1, box2):
    # Top-left and bottom-right coordinates of box1
    x1min, y1min, x1max, y1max = box1[0], box1[1], box1[2], box1[3]
    # Area of box1
    s1 = (y1max - y1min + 1.) * (x1max - x1min + 1.)
    # Top-left and bottom-right coordinates of box2
    x2min, y2min, x2max, y2max = box2[0], box2[1], box2[2], box2[3]
    # Area of box2
    s2 = (y2max - y2min + 1.) * (x2max - x2min + 1.)

    # Coordinates of the intersection rectangle
    xmin = np.maximum(x1min, x2min)
    ymin = np.maximum(y1min, y2min)
    xmax = np.minimum(x1max, x2max)
    ymax = np.minimum(y1max, y2max)
    # Height, width and area of the intersection
    inter_h = np.maximum(ymax - ymin + 1., 0.)
    inter_w = np.maximum(xmax - xmin + 1., 0.)
    intersection = inter_h * inter_w
    # Union area
    union = s1 + s2 - intersection
    # IoU
    iou = intersection / union
    return iou


bbox1 = [100., 100., 200., 200.]
bbox2 = [120., 120., 220., 220.]
iou = box_iou_xyxy(bbox1, bbox2)
print('IoU is {}'.format(iou))  
  • When the boxes are given in xywh format:
import numpy as np

# Compute IoU; box coordinates are in xywh format
def box_iou_xywh(box1, box2):
    x1min, y1min = box1[0] - box1[2]/2.0, box1[1] - box1[3]/2.0
    x1max, y1max = box1[0] + box1[2]/2.0, box1[1] + box1[3]/2.0
    s1 = box1[2] * box1[3]

    x2min, y2min = box2[0] - box2[2]/2.0, box2[1] - box2[3]/2.0
    x2max, y2max = box2[0] + box2[2]/2.0, box2[1] + box2[3]/2.0
    s2 = box2[2] * box2[3]

    xmin = np.maximum(x1min, x2min)
    ymin = np.maximum(y1min, y2min)
    xmax = np.minimum(x1max, x2max)
    ymax = np.minimum(y1max, y2max)
    inter_h = np.maximum(ymax - ymin, 0.)
    inter_w = np.maximum(xmax - xmin, 0.)
    intersection = inter_h * inter_w

    union = s1 + s2 - intersection
    iou = intersection / union
    return iou

bbox1 = [100., 100., 200., 200.]
bbox2 = [120., 120., 220., 220.]
iou = box_iou_xywh(bbox1, bbox2)
print('IoU is {}'.format(iou))  

To show intuitively how the IoU value relates to the degree of overlap, Figure 3 illustrates the relative positions of two boxes at different IoU values, from IoU = 0.95 down to IoU = 0.

Figure 3 Relative positions of two boxes at different IoU values


Questions:

  1. Under what circumstances is the IoU of two rectangular boxes equal to 1?

    Answer: The two rectangles are completely coincident.

  2. Under what circumstances is the IoU of two rectangular boxes equal to 0?

    Answer: The two rectangles do not intersect at all.

5. Non-Maximum Suppression (NMS)

In practice, no matter which method is used to obtain candidate regions, there is a common problem: the network may detect the same target multiple times, producing several prediction boxes for the same object. It is therefore necessary to eliminate redundant, heavily overlapping prediction boxes, which is done with non-maximum suppression (NMS).

Suppose the model is run on the image and outputs 11 prediction boxes together with their scores; drawing them on the image gives Figure 1. Multiple prediction boxes appear around each person, and the redundant ones must be eliminated to obtain the final result.

Figure 1 Schematic diagram of prediction box

The code to output 11 prediction boxes and their scores is as follows:

# Plot the predicted bounding boxes
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib.image import imread
import math

# Helper for drawing a rectangular box
def draw_rectangle(currentAxis, bbox, edgecolor = 'k', facecolor = 'y', fill=False, linestyle='-'):
    # currentAxis: the axes object, obtained via plt.gca()
    # bbox: bounding box, a list of four values [x1, y1, x2, y2]
    # edgecolor: border color
    # facecolor: fill color
    # fill: whether to fill the rectangle
    # linestyle: border line style

    # patches.Rectangle(xy, width, height, linewidth, edgecolor, facecolor, fill, linestyle)
    # xy: anchor corner coordinates; width/height: size of the rectangle; linewidth: line width;
    # edgecolor: border color; facecolor: fill color; fill: whether to fill; linestyle: line style
    rect=patches.Rectangle((bbox[0], bbox[1]), bbox[2]-bbox[0]+1, bbox[3]-bbox[1]+1, linewidth=1,
                           edgecolor=edgecolor,facecolor=facecolor,fill=fill, linestyle=linestyle)
    currentAxis.add_patch(rect)

    
plt.figure(figsize=(10, 10))
# Path to the input image
filename = '/home/aistudio/work/images/section3/000000086956.jpg'
im = imread(filename)
plt.imshow(im)

currentAxis=plt.gca()

# Predicted box coordinates, produced by the network
boxes = np.array([[4.21716537e+01, 1.28230896e+02, 2.26547668e+02, 6.00434631e+02],
       [3.18562988e+02, 1.23168472e+02, 4.79000000e+02, 6.05688416e+02],
       [2.62704697e+01, 1.39430557e+02, 2.20587097e+02, 6.38959656e+02],
       [4.24965363e+01, 1.42706665e+02, 2.25955185e+02, 6.35671204e+02],
       [2.37462646e+02, 1.35731537e+02, 4.79000000e+02, 6.31451294e+02],
       [3.19390472e+02, 1.29295090e+02, 4.79000000e+02, 6.33003845e+02],
       [3.28933838e+02, 1.22736115e+02, 4.79000000e+02, 6.39000000e+02],
       [4.44292603e+01, 1.70438187e+02, 2.26841858e+02, 6.39000000e+02],
       [2.17988785e+02, 3.02472412e+02, 4.06062927e+02, 6.29106628e+02],
       [2.00241089e+02, 3.23755096e+02, 3.96929321e+02, 6.36386108e+02],
       [2.14310303e+02, 3.23443665e+02, 4.06732849e+02, 6.35775269e+02]])

# Predicted box scores, produced by the network
scores = np.array([0.5247661 , 0.51759845, 0.86075854, 0.9910175 , 0.39170712,
       0.9297706 , 0.5115228 , 0.270992  , 0.19087596, 0.64201415, 0.879036])

# Draw all prediction boxes
for box in boxes:
    draw_rectangle(currentAxis, box)

Here non-maximum suppression (NMS) is used to eliminate the redundant boxes. The basic idea: if multiple prediction boxes correspond to the same object, keep only the one with the highest score and discard the rest.

How do we judge that two prediction boxes correspond to the same object, and what criterion should be used?

If two prediction boxes have the same category and their positions overlap substantially, they can be considered predictions of the same target. NMS proceeds by selecting the highest-scoring prediction box of a category, checking which remaining boxes have an IoU with it greater than a threshold, and discarding those boxes. The IoU threshold is a hyperparameter that must be set in advance; following the YOLOv3 algorithm, it is set to 0.5 here.

For example, in the program above, boxes contains 11 prediction boxes and scores gives their predicted scores for the category "person". NMS then works as follows.

  • Step0: Create a selected list, keep_list = []
  • Step1: Sort the scores, remain_list = [ 3, 5, 10, 2, 9, 0, 1, 6, 4, 7, 8],
  • Step2: Select boxes[3], at this time keep_list is empty, no need to calculate IoU, put it directly into keep_list, keep_list = [3], remain_list=[5, 10, 2, 9, 0, 1, 6, 4, 7, 8]
  • Step3: Select boxes[5], boxes[3] already exists in the keep_list at this time, calculate IoU(boxes[3], boxes[5]) = 0.0, obviously less than the threshold, then keep_list=[3, 5], remain_list = [10, 2, 9, 0, 1, 6, 4, 7, 8]
  • Step4: Select boxes[10]; keep_list = [3, 5]. Compute IoU(boxes[3], boxes[10]) = 0.0268 and IoU(boxes[5], boxes[10]) = 0.24, both less than the threshold, so keep_list = [3, 5, 10], remain_list = [2, 9, 0, 1, 6, 4, 7, 8]
  • Step5: Select boxes[2], at this time keep_list = [3, 5, 10], calculate IoU(boxes[3], boxes[2]) = 0.88, which exceeds the threshold, directly discard boxes[2], keep_list =[3, 5, 10], remain_list=[9, 0, 1, 6, 4, 7, 8]
  • Step6: Select boxes[9], at this time keep_list = [3, 5, 10], calculate IoU(boxes[3], boxes[9]) = 0.0577, IoU(boxes[5], boxes[9]) = 0.205, IoU(boxes[10], boxes[9]) = 0.88, exceeding the threshold, discarding boxes[9]. keep_list=[3, 5, 10], remain_list=[0, 1, 6, 4, 7, 8]
  • Step7: Repeat the above Step6 until the remain_list is empty.

The implementation of non-maximum suppression is given in the following nms function.

# Non-maximum suppression
def nms(bboxes, scores, score_thresh, nms_thresh):
    """
    nms
    """
    inds = np.argsort(scores)
    inds = inds[::-1]
    keep_inds = []
    while(len(inds) > 0):
        cur_ind = inds[0]
        cur_score = scores[cur_ind]
        # if score of the box is less than score_thresh, just drop it
        if cur_score < score_thresh:
            break

        keep = True
        for ind in keep_inds:
            current_box = bboxes[cur_ind]
            remain_box = bboxes[ind]
            iou = box_iou_xyxy(current_box, remain_box)
            if iou > nms_thresh:
                keep = False
                break
        if keep:
            keep_inds.append(cur_ind)
        inds = inds[1:]

    return np.array(keep_inds)

Finally, keep_list=[3, 5, 10] is obtained, that is, the prediction boxes 3, 5, and 10 are finally selected, as shown in Figure 2.

Figure 2 Schematic diagram of NMS results

The implementation code of the whole process is as follows:

# Plot the predicted bounding boxes
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib.image import imread
import math

# Helper for drawing a rectangular box
def draw_rectangle(currentAxis, bbox, edgecolor = 'k', facecolor = 'y', fill=False, linestyle='-'):
    # currentAxis: the axes object, obtained via plt.gca()
    # bbox: bounding box, a list of four values [x1, y1, x2, y2]
    # edgecolor: border color
    # facecolor: fill color
    # fill: whether to fill the rectangle
    # linestyle: border line style
    # patches.Rectangle takes the top-left corner coordinates, the width and height of the rectangle, and other style parameters
    rect=patches.Rectangle((bbox[0], bbox[1]), bbox[2]-bbox[0]+1, bbox[3]-bbox[1]+1, linewidth=1,
                           edgecolor=edgecolor,facecolor=facecolor,fill=fill, linestyle=linestyle)
    currentAxis.add_patch(rect)

    
plt.figure(figsize=(10, 10))

filename = '/home/aistudio/work/images/section3/000000086956.jpg'
im = imread(filename)
plt.imshow(im)

currentAxis=plt.gca()

boxes = np.array([[4.21716537e+01, 1.28230896e+02, 2.26547668e+02, 6.00434631e+02],
       [3.18562988e+02, 1.23168472e+02, 4.79000000e+02, 6.05688416e+02],
       [2.62704697e+01, 1.39430557e+02, 2.20587097e+02, 6.38959656e+02],
       [4.24965363e+01, 1.42706665e+02, 2.25955185e+02, 6.35671204e+02],
       [2.37462646e+02, 1.35731537e+02, 4.79000000e+02, 6.31451294e+02],
       [3.19390472e+02, 1.29295090e+02, 4.79000000e+02, 6.33003845e+02],
       [3.28933838e+02, 1.22736115e+02, 4.79000000e+02, 6.39000000e+02],
       [4.44292603e+01, 1.70438187e+02, 2.26841858e+02, 6.39000000e+02],
       [2.17988785e+02, 3.02472412e+02, 4.06062927e+02, 6.29106628e+02],
       [2.00241089e+02, 3.23755096e+02, 3.96929321e+02, 6.36386108e+02],
       [2.14310303e+02, 3.23443665e+02, 4.06732849e+02, 6.35775269e+02]])
 
scores = np.array([0.5247661 , 0.51759845, 0.86075854, 0.9910175 , 0.39170712,
       0.9297706 , 0.5115228 , 0.270992  , 0.19087596, 0.64201415, 0.879036])

left_ind = np.where((boxes[:, 0]<60) * (boxes[:, 0]>20))
left_boxes = boxes[left_ind]
left_scores = scores[left_ind]

colors = ['r', 'g', 'b', 'k']

# Draw the prediction boxes that are finally kept
inds = nms(boxes, scores, score_thresh=0.01, nms_thresh=0.5)
# Print the indices of the prediction boxes that are kept
print(inds)
for i in range(len(inds)):
    box = boxes[inds[i]]
    draw_rectangle(currentAxis, box, edgecolor=colors[i])

Note that when the dataset contains objects of multiple categories, multi-class non-maximum suppression is required. The principle is the same as ordinary NMS; the difference is that NMS is performed separately for each category. The implementation is the multiclass_nms function below.

# Multi-class non-maximum suppression
def multiclass_nms(bboxes, scores, score_thresh=0.01, nms_thresh=0.45, pre_nms_topk=1000, pos_nms_topk=100):
    """
    This is for multiclass_nms
    """
    batch_size = bboxes.shape[0]
    class_num = scores.shape[1]
    rets = []
    for i in range(batch_size):
        bboxes_i = bboxes[i]
        scores_i = scores[i]
        ret = []
        # Run NMS separately for each class
        for c in range(class_num):
            scores_i_c = scores_i[c]
            keep_inds = nms(bboxes_i, scores_i_c, score_thresh, nms_thresh)
            if len(keep_inds) < 1:
                continue
            keep_bboxes = bboxes_i[keep_inds]
            keep_scores = scores_i_c[keep_inds]
            keep_results = np.zeros([keep_scores.shape[0], 6])
            keep_results[:, 0] = c
            keep_results[:, 1] = keep_scores[:]
            keep_results[:, 2:6] = keep_bboxes[:, :]
            ret.append(keep_results)
        if len(ret) < 1:
            rets.append(ret)
            continue
        ret_i = np.concatenate(ret, axis=0)
        scores_i = ret_i[:, 1]
        if len(scores_i) > pos_nms_topk:
            inds = np.argsort(scores_i)[::-1]
            inds = inds[:pos_nms_topk]
            ret_i = ret_i[inds]

        rets.append(ret_i)

    return rets
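A minimal usage sketch with toy data (ours, not from the original tutorial), assuming bboxes has shape [batch, N, 4] in xyxy format and scores has shape [batch, num_classes, N], which is how the function indexes its inputs, and that nms and box_iou_xyxy are defined as above:

import numpy as np

# 1 image, 4 candidate boxes (xyxy), 3 classes
bboxes = np.array([[[ 40., 140., 225., 515.],
                    [ 45., 150., 230., 520.],
                    [240., 130., 479., 630.],
                    [250., 140., 470., 620.]]])   # shape [1, 4, 4]
scores = np.random.rand(1, 3, 4)                  # shape [1, 3, 4]

results = multiclass_nms(bboxes, scores, score_thresh=0.01, nms_thresh=0.45)
# each row of results[0] is [class_id, score, x1, y1, x2, y2]
print(results[0])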

6. Soft NMS

6.1 Background of Soft NMS

NMS (Non-Maximum Suppression) is a commonly used post-processing method in object detection. Its basic idea: if multiple prediction boxes correspond to the same object, keep only the one with the highest score and discard the rest. This effectively reduces redundant detection boxes. However, traditional NMS has a drawback: the IoU threshold is hard to choose. If the threshold is too small, missed detections occur easily: when two objects of the same category overlap heavily, the object with the lower category score is discarded. If the threshold is too large, most redundant boxes cannot be eliminated.

Therefore, in the paper "Improving Object Detection With One Line of Code" [1], the authors proposed the Soft NMS method to alleviate these problems.

6.2 Soft NMS Algorithm Process

Suppose the detection box with the highest score is $M$. For another detection box $b_i$ with class score $s_i$, the traditional NMS algorithm can be expressed as:

$$s_i = \begin{cases} s_i, & iou(M, b_i) < N_t \\ 0, & iou(M, b_i) \ge N_t \end{cases}$$

where $N_t$ is the preset IoU threshold.

The Soft NMS algorithm can instead be expressed as:

$$s_i = \begin{cases} s_i, & iou(M, b_i) < N_t \\ s_i \, (1 - iou(M, b_i)), & iou(M, b_i) \ge N_t \end{cases}$$
Here we can see the difference between the two methods. In traditional NMS, if the IoU between a lower-scoring detection box and the highest-scoring box exceeds the threshold, the lower-scoring box is discarded outright. In Soft NMS, its score is not set directly to 0 but merely reduced: the final box score is determined jointly by the original score and the IoU, with the original score attenuated linearly.

However, with the formula above, the box score changes abruptly once the IoU crosses the threshold, which can strongly affect the detection results. The Soft NMS paper therefore proposes another way of computing the box score, a continuous Gaussian penalty:

$$s_i = s_i \, e^{-\frac{iou(M, b_i)^2}{\sigma}}, \quad \forall b_i \notin D$$

With this form, the box scores change smoothly, and suppressed boxes still have a chance to be kept as correct detections in the subsequent computation.
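For reference, below is a minimal sketch of Soft NMS with the Gaussian decay above; it is our own illustrative implementation (not the paper's code), reuses box_iou_xyxy from earlier, and the sigma and score_thresh values are just example settings.

import numpy as np

def soft_nms(bboxes, scores, sigma=0.5, score_thresh=0.001):
    # bboxes: [N, 4] boxes in xyxy format; scores: [N]; returns indices of kept boxes
    bboxes = np.array(bboxes, dtype=np.float64)
    scores = np.array(scores, dtype=np.float64)
    inds = list(range(len(scores)))
    keep = []
    while inds:
        # pick the box with the highest (possibly already decayed) score
        m = max(inds, key=lambda i: scores[i])
        if scores[m] < score_thresh:
            break
        keep.append(m)
        inds.remove(m)
        # decay the scores of the remaining boxes instead of discarding them
        for i in inds:
            iou = box_iou_xyxy(bboxes[m], bboxes[i])
            scores[i] = scores[i] * np.exp(-(iou ** 2) / sigma)
    return keep

# applied to the 11 boxes and scores from the NMS example above
print(soft_nms(boxes, scores))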

6.3 Example of Soft NMS Algorithm

A simple example is used here to illustrate the calculation process of the Soft NMS algorithm and its difference from the standard NMS algorithm.

Figure 1 Example of SoftNMS algorithm

Assume a horse detector is run on the image above and produces the two detections shown: the horse score of the red box is 0.95 and that of the green dashed box is 0.8. The horse closer to the camera almost completely covers the horse farther away, so the IoU of the two detection boxes is very large.

In traditional NMS, when the IoU of the two boxes exceeds the preset threshold, only the highest-scoring box is kept and the score of the other is set to 0, so the horse in the green dashed box is discarded. But the two boxes actually correspond to two different horses, so this NMS leads to a missed detection.

In Soft NMS, the score of the green dashed box is not set to 0 but recomputed with one of the two formulas above. The horse in the green dashed box is therefore not discarded outright; its category score is merely reduced and it continues to participate in subsequent computation. In this image, both horses will most likely be kept in the end, avoiding the missed detection.

References

[1] Improving Object Detection With One Line of Code
