Anchor Boxes and Bounding Boxes

This article mainly follows Mr. Li Mu's tutorial. Anchor boxes and bounding boxes are two of the most important tools in object detection; they are used to frame the objects in an image.

1. Bounding boxes

A bounding box describes the true position and extent of an object in an image. There are two common representations: the corner representation, which specifies a rectangle by the coordinates of its upper-left and lower-right corners, and the center representation, which specifies a rectangle by its center coordinates, width, and height. The code is as follows:

import torch
from d2l import torch as d2l
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
img = mpimg.imread("./catdog.jpg")
plt.figure(figsize=(5,5))

# Two ways to describe a bounding box:
# 1. corner representation, 2. center representation
def box_corner_to_center(boxes):
    """Convert (x1, y1, x2, y2) corner boxes to (cx, cy, w, h) center boxes."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    boxes = torch.stack((cx, cy, w, h), dim=-1)
    return boxes

def box_center_to_corner(boxes):
    """Convert (cx, cy, w, h) center boxes to (x1, y1, x2, y2) corner boxes."""
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1 = cx - w / 2
    y1 = cy - h / 2
    x2 = cx + w / 2
    y2 = cy + h / 2
    boxes = torch.stack((x1, y1, x2, y2), dim=-1)
    return boxes

dog_bbox, cat_bbox = [60.0, 45.0, 378.0, 516.0], [400.0, 112.0, 655.0, 493.0]
boxes = torch.tensor((dog_bbox, cat_bbox))
box_center_to_corner(box_corner_to_center(boxes)) == boxes

The two representations can be converted into each other, as the round-trip check above confirms.
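For example, converting the dog box by hand gives cx = (60 + 378) / 2 = 219, cy = (45 + 516) / 2 = 280.5, w = 378 - 60 = 318, and h = 516 - 45 = 471, which the function should reproduce (a minimal sketch reusing the tensors defined above):

print(box_corner_to_center(torch.tensor([dog_bbox])))
# expected: tensor([[219.0000, 280.5000, 318.0000, 471.0000]])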
The code to draw the bounding boxes is as follows:

def bbox_to_rect(bbox, color):
    """Convert a corner-format bounding box to a matplotlib Rectangle."""
    return plt.Rectangle(
        xy=(bbox[0], bbox[1]),
        width=bbox[2] - bbox[0],
        height=bbox[3] - bbox[1],
        fill=False,
        edgecolor=color,
        linewidth=2
    )

fig = plt.imshow(img)
fig.axes.add_patch(bbox_to_rect(dog_bbox, "blue"))
fig.axes.add_patch(bbox_to_rect(cat_bbox, "red"))
plt.show()  # needed when running as a plain script

The result is the image with the dog box drawn in blue and the cat box in red.

2. Anchor boxes

Anchor boxes are a series of candidate boxes sampled over the image in object detection. A deep learning model then performs two tasks on them: a classification task, which judges what kind of object an anchor box contains, and a regression task, which adjusts the anchor box so that it coincides with the ground-truth bounding box as closely as possible.

A series of anchor boxes is usually generated centered on each pixel, so two sets of parameters need to be defined: $sizes = \{s_1, s_2, s_3, \ldots, s_n\}$ and $ratios = \{r_1, r_2, r_3, \ldots, r_m\}$, which control the fraction of the image an anchor box occupies and its aspect ratio, respectively. To avoid too many combinations, only the $n + m - 1$ combinations that contain $s_1$ or $r_1$ are used:

$\{(s_1, r_1), (s_2, r_1), (s_3, r_1), \ldots, (s_n, r_1), (s_1, r_2), (s_1, r_3), \ldots, (s_1, r_m)\}$
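As an illustration (a minimal sketch, not part of the original tutorial; size_ratio_pairs is a hypothetical helper), these combinations can be enumerated directly:

# Hypothetical helper: pair every size with ratios[0],
# then sizes[0] with each remaining ratio
def size_ratio_pairs(sizes, ratios):
    pairs = [(s, ratios[0]) for s in sizes]
    pairs += [(sizes[0], r) for r in ratios[1:]]
    return pairs

print(size_ratio_pairs([0.75, 0.5, 0.25], [1, 2, 0.5]))
# [(0.75, 1), (0.5, 1), (0.25, 1), (0.75, 2), (0.75, 0.5)]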
Because of how convolutional neural networks extract features, the spatial size of the feature map changes every time it passes through a convolutional layer. Anchor-box coordinates are therefore expressed as ratios of the image width and height rather than in absolute pixels.

The calculation is as follows:
Let $w$, $h$ be the actual width and height of the anchor box, and $W$, $H$ the width and height of the image. Define $\frac{wh}{WH} = s^2$ and $\frac{w}{h} = r$.

Substituting $w = rh$ into the first equation gives $w = s\sqrt{WHr}$ and $h = s\sqrt{\frac{WH}{r}}$. Normalizing by the image size then gives $w_0 = \frac{w}{W} = s\sqrt{\frac{Hr}{W}}$ and $h_0 = \frac{h}{H} = s\sqrt{\frac{W}{Hr}}$.
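A quick numerical check of these formulas (a minimal sketch; the values of s, r, W, and H are chosen arbitrarily):

import math

s, r, W, H = 0.5, 2.0, 700, 500
w = s * math.sqrt(W * H * r)   # anchor width in pixels
h = s * math.sqrt(W * H / r)   # anchor height in pixels
print(w * h / (W * H))         # should equal s**2 = 0.25
print(w / h)                   # should equal r = 2.0
print(w / W, s * math.sqrt(H * r / W))    # normalized width, both forms agree
print(h / H, s * math.sqrt(W / (H * r)))  # normalized height, both forms agree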
The anchor-box generation code is as follows:

def multibox_prior(data, sizes, ratios):
    """Generate anchor boxes with different shapes centered on each pixel."""
    # data: [batch_size, channels, H, W]
    in_height, in_width = data.shape[-2:]
    device, num_sizes, num_ratios = data.device, len(sizes), len(ratios)
    boxes_per_pixel = (num_sizes + num_ratios - 1)
    size_tensor = torch.tensor(sizes, device=device)
    ratio_tensor = torch.tensor(ratios, device=device)

    # To place the anchors at the centers of pixels, set an offset of 0.5,
    # since each pixel has height 1 and width 1.
    # All coordinates are normalized to [0, 1].
    offset_h, offset_w = 0.5, 0.5
    steps_h = 1.0 / in_height  # scaled step size along the y-axis
    steps_w = 1.0 / in_width   # scaled step size along the x-axis

    # Generate all anchor center points
    center_h = (torch.arange(in_height, device=device) + offset_h) * steps_h
    center_w = (torch.arange(in_width, device=device) + offset_w) * steps_w
    shift_y, shift_x = torch.meshgrid(center_h, center_w, indexing='ij')
    shift_y, shift_x = shift_y.reshape(-1), shift_x.reshape(-1)
    # shift_x, shift_y: [in_height * in_width]

    # Generate "boxes_per_pixel" widths and heights, later used to build
    # the anchor corner coordinates (xmin, ymin, xmax, ymax).
    # The order (s1,r1),(s2,r1),...,(sn,r1),(s1,r2),...,(s1,rm) matches the
    # combination list above and the labels used when displaying the boxes.
    w = torch.cat((size_tensor * torch.sqrt((in_height * ratio_tensor[0]) / in_width),
                   size_tensor[0] * torch.sqrt((in_height * ratio_tensor[1:]) / in_width)))
    h = torch.cat((size_tensor * torch.sqrt(in_width / (in_height * ratio_tensor[0])),
                   size_tensor[0] * torch.sqrt(in_width / (in_height * ratio_tensor[1:]))))
    # w: [n + m - 1], h: [n + m - 1]

    # Divide by 2 to get half widths and half heights
    anchor_manipulations = torch.stack((-w, -h, w, h)).T.repeat(
                                        in_height * in_width, 1) / 2
    # anchor_manipulations: [(n + m - 1) * in_height * in_width, 4]

    # Each center point has "boxes_per_pixel" anchor boxes, so the grid of
    # all anchor centers is repeated "boxes_per_pixel" times
    out_grid = torch.stack([shift_x, shift_y, shift_x, shift_y],
                dim=1).repeat_interleave(boxes_per_pixel, dim=0)
    # out_grid: [(n + m - 1) * in_height * in_width, 4]
    output = out_grid + anchor_manipulations
    return output.unsqueeze(0)  # [1, (n + m - 1) * in_height * in_width, 4]

Next, read the image; in total $H \times W \times (n + m - 1)$ anchor boxes are generated.

img = mpimg.imread("./catdog.jpg")
h, w = img.shape[:2]

print(h, w)
X = torch.rand(size=(1, 3, h, w))
Y = multibox_prior(X, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
print(Y.shape)  # [1, h * w * 5, 4]
boxes = Y.reshape(h, w, 5, 4)  # 5 = num_sizes + num_ratios - 1 boxes per pixel
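To inspect a single anchor box (a minimal sketch; pixel (250, 250) is an arbitrary choice and the printed values depend on the image's actual dimensions):

# First anchor centered on pixel (250, 250): s=0.75, r=1,
# in normalized (xmin, ymin, xmax, ymax) form
b = boxes[250, 250, 0, :]
print(b)
# Its normalized width times height should equal s**2 = 0.5625
print((b[2] - b[0]) * (b[3] - b[1]))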

Finally, the anchor boxes are displayed. A single pixel is selected here, and because the anchor coordinates are normalized, they are scaled by the image width and height before drawing:

def show_bboxes(axes, bboxes, labels=None, colors=None):
    """显示所有边界框"""
    def _make_list(obj, default_values=None):
        if obj is None:
            obj = default_values
        elif not isinstance(obj, (list, tuple)):
            obj = [obj]
        return obj

    labels = _make_list(labels)
    colors = _make_list(colors, ['b', 'g', 'r', 'm', 'c'])
    for i, bbox in enumerate(bboxes):
        color = colors[i % len(colors)]
        rect = bbox_to_rect(bbox.detach().numpy(), color)
        axes.add_patch(rect)
        if labels and len(labels) > i:
            text_color = 'k' if color == 'w' else 'w'
            axes.text(rect.xy[0], rect.xy[1], labels[i],
                      va='center', ha='center', fontsize=9, color=text_color,
                      bbox=dict(facecolor=color, lw=0))
plt.figure(figsize=(5, 5))
bbox_scale = torch.tensor((w, h, w, h))
fig = plt.imshow(img)
show_bboxes(fig.axes, boxes[250, 250, :, :] * bbox_scale,
            ['s=0.75, r=1', 's=0.5, r=1', 's=0.25, r=1', 's=0.75, r=2',
             's=0.75, r=0.5'])

The display shows the five anchor boxes, all centered on the same pixel, with their scale and ratio labels.
