Computer Vision Basics: Anchor Boxes

Anchor box

Object detection algorithms usually sample a large number of regions in the input image, determine whether these regions contain objects of interest, and adjust the region boundaries so as to predict the targets' ground-truth bounding boxes more accurately. Different models may use different region sampling methods. Here we introduce one of them: centered on each pixel, generate multiple bounding boxes with different scales and aspect ratios. These bounding boxes are called anchor boxes.



Generate multiple anchor boxes

Theory

Suppose the input image has height $h$ and width $w$. We generate anchor boxes of different shapes centered on each pixel of the image: the scale is $s \in (0, 1]$ and the aspect ratio is $r > 0$. Then the width and height of the anchor box are $ws\sqrt{r}$ and $hs/\sqrt{r}$, respectively. Note that when the center position is given, an anchor box with known width and height is determined.

To generate multiple anchor boxes of different shapes, let us set a series of scales $s_1, \ldots, s_n$ and a series of aspect ratios $r_1, \ldots, r_m$. If all combinations of these scales and aspect ratios were used with each pixel as the center, the input image would have a total of $hwnm$ anchor boxes. Although these anchor boxes may cover all ground-truth bounding boxes, the computational complexity easily becomes too high. In practice, we only consider combinations containing $s_1$ or $r_1$:
$(s_1, r_1), (s_1, r_2), \ldots, (s_1, r_m), (s_2, r_1), (s_3, r_1), \ldots, (s_n, r_1)$

That is, the number of anchor boxes centered on the same pixel is $n + m - 1$. For the entire input image, a total of $wh(n + m - 1)$ anchor boxes are generated. For example, with $n = 3$ scales and $m = 3$ aspect ratios, each pixel is the center of $3 + 3 - 1 = 5$ anchor boxes.

Implementation

Generate anchor boxes

#@save
def multibox_prior(data, sizes, ratios):
    """生成以每个像素为中心具有不同形状的锚框"""
    in_height, in_width = data.shape[-2:]
    device, num_sizes, num_ratios = data.device, len(sizes), len(ratios)
    boxes_per_pixel = (num_sizes + num_ratios - 1)
    size_tensor = torch.tensor(sizes, device=device)
    ratio_tensor = torch.tensor(ratios, device=device)

    # 为了将锚点移动到像素的中心,需要设置偏移量。
    # 因为一个像素的高为1且宽为1,我们选择偏移我们的中心0.5
    offset_h, offset_w = 0.5, 0.5
    steps_h = 1.0 / in_height  # 在y轴上缩放步长
    steps_w = 1.0 / in_width  # 在x轴上缩放步长

    # 生成锚框的所有中心点
    center_h = (torch.arange(in_height, device=device) + offset_h) * steps_h
    center_w = (torch.arange(in_width, device=device) + offset_w) * steps_w
    shift_y, shift_x = torch.meshgrid(center_h, center_w, indexing='ij')
    shift_y, shift_x = shift_y.reshape(-1), shift_x.reshape(-1)

    # 生成“boxes_per_pixel”个高和宽,
    # 之后用于创建锚框的四角坐标(xmin,xmax,ymin,ymax)
    w = torch.cat((size_tensor * torch.sqrt(ratio_tensor[0]),
                   sizes[0] * torch.sqrt(ratio_tensor[1:])))\
                   * in_height / in_width  # 处理矩形输入
    h = torch.cat((size_tensor / torch.sqrt(ratio_tensor[0]),
                   sizes[0] / torch.sqrt(ratio_tensor[1:])))
    # 除以2来获得半高和半宽
    anchor_manipulations = torch.stack((-w, -h, w, h)).T.repeat(
                                        in_height * in_width, 1) / 2

    # 每个中心点都将有“boxes_per_pixel”个锚框,
    # 所以生成含所有锚框中心的网格,重复了“boxes_per_pixel”次
    out_grid = torch.stack([shift_x, shift_y, shift_x, shift_y],
                dim=1).repeat_interleave(boxes_per_pixel, dim=0)
    output = out_grid + anchor_manipulations
    return output.unsqueeze(0)

Function flow:

  1. First get the height and width of the input data, as well as the device type device, the number of scales num_sizes and the number of aspect ratios num_ratios.
  2. Calculate the number of anchor boxes generated per pixel, boxes_per_pixel, which equals the number of scales plus the number of aspect ratios minus one.
  3. Define size_tensor and ratio_tensor to represent the size and aspect ratio of the anchor box, respectively.
  4. Defines the offsets offset_h and offset_w to move the center point of the anchor box to the center of the pixel. Since a pixel has a height of 1 and a width of 1, an offset of 0.5 is chosen.
  5. Calculate the steps steps_h and steps_w that need to be scaled on the y-axis and x-axis.
  6. Generate the coordinates of the center points of all anchor boxes, where center_h and center_w represent the y-coordinate and x-coordinate of the center point of each pixel respectively.
  7. Calculate the width and height of each anchor box.
  8. Divide width and height by 2 to get half height and half width.
  9. Generate a grid out_grid containing the center points of all anchor boxes, each center point will have boxes_per_pixel anchor boxes.
  10. Combine the coordinates of the center points of all anchor boxes with the width and height of the anchor boxes to obtain the four coordinates (xmin, ymin, xmax, ymax) of all anchor boxes that are finally generated.
  11. Finally, the dimensions of the coordinate tensor of all anchor boxes are converted from (num_boxes, 4) to the form of (1, num_boxes, 4), where num_boxes represents the number of anchor boxes.
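
For context, here is a minimal usage sketch in the spirit of the d2l examples (the file name catdog.jpg is a placeholder; any test image works). It also defines the img, h, w and boxes objects referenced by the display example further below.

import torch
from d2l import torch as d2l

img = d2l.plt.imread('catdog.jpg')  # placeholder path for a sample image
h, w = img.shape[:2]

X = torch.rand(size=(1, 3, h, w))   # dummy input; only its shape is used
Y = multibox_prior(X, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
print(Y.shape)                      # (1, h * w * 5, 4)

# Reshape so that boxes[i, j, :, :] holds the 5 anchors centered on pixel (i, j)
boxes = Y.reshape(h, w, 5, 4)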

Display anchor boxes

#@save
def show_bboxes(axes, bboxes, labels=None, colors=None):
    """显示所有边界框"""
    def _make_list(obj, default_values=None):
        if obj is None:
            obj = default_values
        elif not isinstance(obj, (list, tuple)):
            obj = [obj]
        return obj

    labels = _make_list(labels)
    colors = _make_list(colors, ['b', 'g', 'r', 'm', 'c'])
    for i, bbox in enumerate(bboxes):
        color = colors[i % len(colors)]
        rect = d2l.bbox_to_rect(bbox.detach().numpy(), color)
        axes.add_patch(rect)
        if labels and len(labels) > i:
            text_color = 'k' if color == 'w' else 'w'
            axes.text(rect.xy[0], rect.xy[1], labels[i],
                      va='center', ha='center', fontsize=9, color=text_color,
                      bbox=dict(facecolor=color, lw=0))

The function name is show_bboxes, and the input parameters are axes (image coordinate axis object), bboxes (bounding box coordinates), labels (bounding box label, optional), colors (bounding box color, optional).

_make_list is an internal function that converts arguments to lists or tuples. If the input argument is None, the default value is used; if the argument is not a list or tuple, it is converted to a one-element list.

labels and colors represent the bounding box labels and colors, respectively. If not specified, labels defaults to None and colors defaults to the list ['b', 'g', 'r', 'm', 'c'].

Iterate over all bounding boxes and do the following in order:

  1. Choose a color from the color list.
  2. Convert the bounding box coordinates to a Rectangle object, that is, a rectangular box.
  3. Add a rectangle to the image axis object.
  4. If a bounding box label is specified, a text label is added at the center of the rectangle.

Among them, bbox_to_rect is a helper function used to convert bounding box coordinates into a matplotlib.patches.Rectangle object with the specified edge color.
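
For reference, a minimal sketch of what such a helper can look like, assuming the (x_min, y_min, x_max, y_max) corner convention used throughout this post (the actual d2l implementation may differ in details):

import matplotlib.pyplot as plt

def bbox_to_rect(bbox, color):
    # Convert a (x_min, y_min, x_max, y_max) bounding box into a matplotlib Rectangle
    return plt.Rectangle(xy=(bbox[0], bbox[1]),
                         width=bbox[2] - bbox[0], height=bbox[3] - bbox[1],
                         fill=False, edgecolor=color, linewidth=2)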

d2l.set_figsize()
bbox_scale = torch.tensor((w, h, w, h))
fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, boxes[250, 250, :, :] * bbox_scale,
            ['s=0.75, r=1', 's=0.5, r=1', 's=0.25, r=1', 's=0.75, r=2',
             's=0.75, r=0.5'])


Intersection over union (IoU)

Theory

IoU (Intersection over Union), also known as the Jaccard coefficient, is an indicator used to measure the degree of overlap between two sets. In computer vision, IoU is often used to evaluate the performance of models in tasks such as object detection and semantic segmentation.

Specifically, suppose there are two sets $A$ and $B$, which correspond to two bounding boxes or two image segmentation results. The intersection $A \cap B$ represents the overlapping part of the two sets, and the union $A \cup B$ represents all parts of both sets. Then the intersection over union is defined as:

$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$

The value range of IoU is $[0, 1]$, where $0$ means the two sets do not overlap at all and $1$ means the two sets are exactly the same.

In tasks such as object detection and semantic segmentation, IoU is often used as an evaluation metric for model performance. For example, in object detection, a predicted bounding box is considered correct if and only if its IoU with the ground-truth bounding box is greater than a certain threshold; in semantic segmentation, the predicted region of a class is typically evaluated by its IoU with the ground-truth region of that class.

In the following, we use IoU to measure the similarity between anchor boxes and ground-truth bounding boxes, as well as between different anchor boxes. Given two lists of anchor boxes or bounding boxes, the following box_iou function computes their pairwise IoU across these two lists.

Implementation

#@save
def box_iou(boxes1, boxes2):
    """计算两个锚框或边界框列表中成对的交并比"""
    box_area = lambda boxes: ((boxes[:, 2] - boxes[:, 0]) *
                              (boxes[:, 3] - boxes[:, 1]))
    # boxes1,boxes2,areas1,areas2的形状:
    # boxes1:(boxes1的数量,4),
    # boxes2:(boxes2的数量,4),
    # areas1:(boxes1的数量,),
    # areas2:(boxes2的数量,)
    areas1 = box_area(boxes1)
    areas2 = box_area(boxes2)
    # inter_upperlefts,inter_lowerrights,inters的形状:
    # (boxes1的数量,boxes2的数量,2)
    inter_upperlefts = torch.max(boxes1[:, None, :2], boxes2[:, :2])
    inter_lowerrights = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])
    inters = (inter_lowerrights - inter_upperlefts).clamp(min=0)
    # inter_areasandunion_areas的形状:(boxes1的数量,boxes2的数量)
    inter_areas = inters[:, :, 0] * inters[:, :, 1]
    union_areas = areas1[:, None] + areas2 - inter_areas
    return inter_areas / union_areas

The function is called box_iou and the input parameters are boxes1 (the first list of bounding boxes) and boxes2 (the second list of bounding boxes).

A box_area helper is defined inside the function to calculate the area of a bounding box. It takes a list of bounding boxes as input and returns a vector with the area of each bounding box.

Compute the area of each bounding box in boxes1 and boxes2, and store them in areas1 and areas2 respectively.

Compute the intersections. For each pair of bounding boxes, take the element-wise maximum of their upper-left corners and the element-wise minimum of their lower-right corners to obtain the upper-left and lower-right corners of the intersection. The width and height of the intersection are then computed and truncated to non-negative values with the clamp function. Finally, the intersection areas are computed and stored in inter_areas.

Compute the unions. For each pair of bounding boxes, add their areas and subtract the intersection area to obtain the union area, stored in union_areas.

Compute the IoU by dividing the intersection area by the union area. The result is a matrix whose element in row $i$, column $j$ is the IoU of the $i$-th box in boxes1 and the $j$-th box in boxes2.

Here, torch.max and torch.min compute the element-wise maximum and minimum of two tensors. The clamp(min=0) call truncates negative values to zero, so that non-overlapping boxes yield an intersection area of zero.
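
A quick sanity check with two hypothetical boxes: each covers a 2x2 area and they overlap in a 1x1 square, so the expected IoU is 1 / (4 + 4 - 1) = 1/7.

boxes1 = torch.tensor([[0.0, 0.0, 2.0, 2.0]])
boxes2 = torch.tensor([[1.0, 1.0, 3.0, 3.0]])
print(box_iou(boxes1, boxes2))  # tensor([[0.1429]])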

Assign ground-truth bounding boxes to anchor boxes

Theory

After generating a series of anchor boxes, how do we match ground-truth bounding boxes to them?

For a given image, suppose the anchor boxes are $A_1, \ldots, A_{n_a}$ and the ground-truth bounding boxes are $B_1, \ldots, B_{n_b}$, where $n_a \geq n_b$. Let us define a matrix $X \in \mathbb{R}^{n_a \times n_b}$ whose element $x_{ij}$ in row $i$ and column $j$ is the IoU of anchor box $A_i$ and ground-truth bounding box $B_j$. The algorithm contains the following steps.

  1. Find the largest element in matrix $X$ and denote its row and column indices as $i_1$ and $j_1$, respectively. Then the ground-truth bounding box $B_{j_1}$ is assigned to the anchor box $A_{i_1}$. This is intuitive, since $A_{i_1}$ and $B_{j_1}$ are the closest pair among all anchor box and ground-truth bounding box pairs. After the first assignment, discard all elements in row $i_1$ and column $j_1$ of the matrix.

  2. Find the largest of the remaining elements in matrix $X$ and denote its row and column indices as $i_2$ and $j_2$. Assign the ground-truth bounding box $B_{j_2}$ to the anchor box $A_{i_2}$, and discard all elements in row $i_2$ and column $j_2$ of the matrix.

  3. At this point, elements in two rows and two columns of matrix $X$ have been discarded. Continue until all elements in the $n_b$ columns of $X$ have been discarded. At that point, each of these $n_b$ anchor boxes has been assigned a ground-truth bounding box.

  4. Only the remaining $n_a - n_b$ anchor boxes are traversed. For example, given any anchor box $A_i$, find the ground-truth bounding box $B_j$ with the largest IoU with $A_i$ in row $i$ of matrix $X$, and assign $B_j$ to $A_i$ only when this IoU is greater than a predefined threshold.

A specific example illustrates the above algorithm. As shown in the left figure, suppose the largest value in matrix $X$ is $x_{23}$; we assign the ground-truth bounding box $B_3$ to the anchor box $A_2$. We then discard all elements in row 2 and column 3 of the matrix, find the largest $x_{71}$ among the remaining (shaded) elements, and assign the ground-truth bounding box $B_1$ to the anchor box $A_7$. Next, as shown in the middle figure, we discard all elements in row 7 and column 1, find the largest remaining (shaded) element $x_{54}$, and assign the ground-truth bounding box $B_4$ to the anchor box $A_5$. Finally, as shown in the right figure, we discard all elements in row 5 and column 4, find the largest remaining (shaded) element $x_{92}$, and assign the ground-truth bounding box $B_2$ to the anchor box $A_9$. After that, we only need to traverse the remaining anchor boxes $A_1, A_3, A_4, A_6, A_8$ and decide whether to assign them ground-truth bounding boxes based on the threshold.

(Figure: the three panels, referenced above as left, middle, and right, show successive steps of assigning ground-truth bounding boxes to anchor boxes.)

Implementation

#@save
def assign_anchor_to_bbox(ground_truth, anchors, device, iou_threshold=0.5):
    """将最接近的真实边界框分配给锚框"""
    num_anchors, num_gt_boxes = anchors.shape[0], ground_truth.shape[0]
    # 位于第i行和第j列的元素x_ij是锚框i和真实边界框j的IoU
    jaccard = box_iou(anchors, ground_truth)
    # 对于每个锚框,分配的真实边界框的张量
    anchors_bbox_map = torch.full((num_anchors,), -1, dtype=torch.long,
                                  device=device)
    # 根据阈值,决定是否分配真实边界框
    max_ious, indices = torch.max(jaccard, dim=1)
    anc_i = torch.nonzero(max_ious >= iou_threshold).reshape(-1)
    box_j = indices[max_ious >= iou_threshold]
    anchors_bbox_map[anc_i] = box_j
    col_discard = torch.full((num_anchors,), -1)
    row_discard = torch.full((num_gt_boxes,), -1)
    for _ in range(num_gt_boxes):
        max_idx = torch.argmax(jaccard)
        box_idx = (max_idx % num_gt_boxes).long()
        anc_idx = (max_idx / num_gt_boxes).long()
        anchors_bbox_map[anc_idx] = box_idx
        jaccard[:, box_idx] = col_discard
        jaccard[anc_idx, :] = row_discard
    return anchors_bbox_map

What this code does is:

The closest ground truth boxes are assigned to anchors. Its steps are:

  1. Calculate the IoU (intersection over union ratio) between the anchor box and the ground truth bounding box, and store the result in the jaccard matrix. The element x_ij in row i and column j is the IoU of anchor box i and ground truth bounding box j.

  2. A tensor anchors_bbox_map with a default value of -1 is initialized for each anchor box, to store the index of the ground-truth bounding box assigned to it.

  3. Depending on the IoU threshold (0.5 by default), decide whether to assign ground-truth bounding boxes to anchor boxes. The indices of the anchor boxes whose maximum IoU exceeds the threshold are stored in anc_i, and the corresponding ground-truth bounding box indices are stored in box_j. The entries of anchors_bbox_map for these anchor boxes are updated with the corresponding ground-truth bounding box indices.

  4. Then greedily assign the ground-truth bounding boxes. Each time, the element with the largest IoU in the jaccard matrix is selected, its anchor box index anc_idx and ground-truth bounding box index box_idx are obtained, anchors_bbox_map is updated, and the corresponding row and column of the jaccard matrix are discarded (set to -1).

  5. Step 4 is repeated num_gt_boxes times, so that every ground-truth bounding box is assigned to some anchor box.

  6. Return anchors_bbox_map, which stores the index of the ground-truth bounding box corresponding to each anchor box (-1 if none is assigned).

The purpose of this function is to assign a ground-truth bounding box to each anchor box, which helps the anchor box predict the position of the ground-truth bounding box, thereby improving the accuracy of object detection.
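
A minimal sketch of calling the function directly, using the same ground-truth boxes (without the leading class column) and anchor boxes as the example further below; the expected output assumes the default IoU threshold of 0.5.

gt = torch.tensor([[0.1, 0.08, 0.52, 0.92],   # dog
                   [0.55, 0.2, 0.9, 0.88]])   # cat
anchors = torch.tensor([[0, 0.1, 0.2, 0.3], [0.15, 0.2, 0.4, 0.4],
                        [0.63, 0.05, 0.88, 0.98], [0.66, 0.45, 0.8, 0.8],
                        [0.57, 0.3, 0.92, 0.9]])
print(assign_anchor_to_bbox(gt, anchors, device=torch.device('cpu')))
# tensor([-1,  0,  1, -1,  1]): anchors 2 and 4 pass the 0.5 threshold,
# anchor 1 receives the dog box in the greedy step, anchors 0 and 3 stay unassigned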

Label categories and offsets

Theory

Now we can label the category and offset for each anchor box. Suppose an anchor box $A$ is assigned a ground-truth bounding box $B$. On the one hand, the category of $A$ is labeled the same as that of $B$. On the other hand, the offset of $A$ is labeled according to the relative position of the center coordinates of $B$ and $A$, together with the relative sizes of the two boxes. Given that different boxes in the dataset have different positions and sizes, we can apply transformations to those relative positions and sizes to obtain offsets that are more evenly distributed and easier to fit. Here is a common transformation. Given boxes $A$ and $B$ with center coordinates $(x_a, y_a)$ and $(x_b, y_b)$, widths $w_a$ and $w_b$, and heights $h_a$ and $h_b$, the offset of $A$ is labeled as:

$\left(\frac{\frac{x_b - x_a}{w_a} - \mu_x}{\sigma_x}, \frac{\frac{y_b - y_a}{h_a} - \mu_y}{\sigma_y}, \frac{\log\frac{w_b}{w_a} - \mu_w}{\sigma_w}, \frac{\log\frac{h_b}{h_a} - \mu_h}{\sigma_h}\right)$

The default values of the constants are $\mu_x = \mu_y = \mu_w = \mu_h = 0$, $\sigma_x = \sigma_y = 0.1$, and $\sigma_w = \sigma_h = 0.2$. Note that $1/\sigma_x = 1/\sigma_y = 10$ and $1/\sigma_w = 1/\sigma_h = 5$, which is where the factors 10 and 5 in the code come from. This transformation is implemented in the offset_boxes function below.

#@save
def offset_boxes(anchors, assigned_bb, eps=1e-6):
    """对锚框偏移量的转换"""
    c_anc = d2l.box_corner_to_center(anchors)
    c_assigned_bb = d2l.box_corner_to_center(assigned_bb)
    offset_xy = 10 * (c_assigned_bb[:, :2] - c_anc[:, :2]) / c_anc[:, 2:]
    offset_wh = 5 * torch.log(eps + c_assigned_bb[:, 2:] / c_anc[:, 2:])
    offset = torch.cat([offset_xy, offset_wh], axis=1)
    return offset

What this function does is calculate the offset between the anchor box and the ground truth bounding box assigned to it. Its steps are:

  1. Convert the corner form of the anchor boxes (anchors) and the assigned ground-truth bounding boxes (assigned_bb) to the center-width-height form, stored in c_anc and c_assigned_bb respectively.

  2. Compute the offset offset_xy of the center of the ground-truth bounding box relative to the center of the anchor box. Since the ground-truth bounding box and the anchor box may have different scales, the offset is divided by the width and height of the anchor box and scaled by a factor of 10.

  3. Compute the offset offset_wh of the width and height of the ground-truth bounding box relative to those of the anchor box. Since width and height ratios live on a logarithmic scale, the log of the ratio is taken and scaled by a factor of 5, and a small number eps is added to prevent log(0).

  4. Concatenate offset_xy and offset_wh along axis=1 to get the final offset.

  5. return offset.

The offsets computed by this function serve as the regression targets for predicting ground-truth bounding boxes from anchor boxes. By applying the predicted offsets, the position and shape of an anchor box can be adjusted to bring it closer to the ground-truth bounding box, thus achieving object detection.
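
A small numeric sketch with hypothetical boxes: the ground-truth box is the anchor box shifted by 0.1 in both directions with identical width and height, so only the center offsets are non-zero.

anchor = torch.tensor([[0.0, 0.0, 0.4, 0.4]])
gt_box = torch.tensor([[0.1, 0.1, 0.5, 0.5]])
print(offset_boxes(anchor, gt_box))
# approximately [[2.5, 2.5, 0, 0]]: xy offsets are 10 * 0.1 / 0.4 = 2.5,
# wh offsets are ~5e-6 (only the eps term) because widths and heights match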

If an anchor box is not assigned a ground-truth bounding box, we simply label the category of the anchor box as background. The anchor boxes of the background category are usually called negative anchor boxes, and the rest are called positive anchor boxes. We implement the following multibox_target function using the ground truth bounding box (labels parameter) to label the category and offset of the anchor box (anchors parameter). This function sets the index of the background category to zero, then increments the integer index of the new category by one.

Implementation

#@save
def multibox_target(anchors, labels):
    """使用真实边界框标记锚框"""
    batch_size, anchors = labels.shape[0], anchors.squeeze(0)
    batch_offset, batch_mask, batch_class_labels = [], [], []
    device, num_anchors = anchors.device, anchors.shape[0]
    for i in range(batch_size):
        label = labels[i, :, :]
        anchors_bbox_map = assign_anchor_to_bbox(
            label[:, 1:], anchors, device)
        bbox_mask = ((anchors_bbox_map >= 0).float().unsqueeze(-1)).repeat(
            1, 4)
        # Initialize class labels and assigned bounding box coordinates with zeros
        class_labels = torch.zeros(num_anchors, dtype=torch.long,
                                   device=device)
        assigned_bb = torch.zeros((num_anchors, 4), dtype=torch.float32,
                                  device=device)
        # Label classes of anchor boxes using their assigned ground-truth boxes.
        # If an anchor box is not assigned any, label its class as background (zero)
        indices_true = torch.nonzero(anchors_bbox_map >= 0)
        bb_idx = anchors_bbox_map[indices_true]
        class_labels[indices_true] = label[bb_idx, 0].long() + 1
        assigned_bb[indices_true] = label[bb_idx, 1:]
        # Offset transformation
        offset = offset_boxes(anchors, assigned_bb) * bbox_mask
        batch_offset.append(offset.reshape(-1))
        batch_mask.append(bbox_mask.reshape(-1))
        batch_class_labels.append(class_labels)
    bbox_offset = torch.stack(batch_offset)
    bbox_mask = torch.stack(batch_mask)
    class_labels = torch.stack(batch_class_labels)
    return (bbox_offset, bbox_mask, class_labels)

This code defines a function called multibox_target whose main purpose is to assign ground truth bounding boxes and corresponding class labels to anchor boxes. This is a critical step in object detection tasks and is often used to train detectors. This function accepts two input parameters: anchors and labels.

Input parameter description:

  1. anchors: A tensor representing anchor boxes, with shape (1, num_anchors, 4). Anchor boxes are predefined bounding boxes used to predict the location of objects in object detection tasks.
  2. labels: A Tensor representing the ground-truth bounding boxes and their classes, of shape (batch_size, num_labels, 5). Each ground-truth bounding box contains category information and coordinate information.

The function returns three tensors: bbox_offset (bounding box offset), bbox_mask (used to filter unassigned anchor boxes) and class_labels (class labels for each anchor box).

The main steps of the function are as follows:

  1. Initialize some variables like batch_size, device and num_anchors.
  2. Iterate through each sample (iteration batch_size):
    a. Assign each anchor box to the closest ground-truth bounding box using the assign_anchor_to_bbox function.
    b. Create a mask for filtering unassigned anchor boxes.
    c. Initialize class labels and assign bounding box coordinates to zero.
    d. Set the category label of the anchor box to the category corresponding to the ground truth bounding box. Anchor boxes that are not assigned will be marked as background (category zero).
    e. Assign the assigned bounding box coordinates to the corresponding anchor boxes.
    f. Calculate the offset between the anchor box and the assigned ground truth bounding box and multiply it by the mask.
    g. Add offsets, masks, and class labels to the corresponding batch lists.
  3. Convert the batched list to tensor and return.

The role of bbox_mask is to zero out the offsets of unassigned (background) anchor boxes, so that they do not contribute to the objective function during training.

The output of this function can be used to train an object detector to learn how to predict the location and class of objects.

Example

ground_truth = torch.tensor([[0, 0.1, 0.08, 0.52, 0.92],
                         [1, 0.55, 0.2, 0.9, 0.88]])
anchors = torch.tensor([[0, 0.1, 0.2, 0.3], [0.15, 0.2, 0.4, 0.4],
                    [0.63, 0.05, 0.88, 0.98], [0.66, 0.45, 0.8, 0.8],
                    [0.57, 0.3, 0.92, 0.9]])

fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, ground_truth[:, 1:] * bbox_scale, ['dog', 'cat'], 'k')
show_bboxes(fig.axes, anchors * bbox_scale, ['0', '1', '2', '3', '4']);

Using the multibox_target function defined above, we can annotate the classes and offsets of these anchor boxes from the ground-truth bounding boxes of dogs and cats. In this example, the class indices for background, dog, and cat are 0, 1, and 2, respectively. Below we add a dimension to the anchor box and ground truth bounding box samples.

labels = multibox_target(anchors.unsqueeze(dim=0),
                         ground_truth.unsqueeze(dim=0))

There are three elements in the returned result, all in tensor format. The third element contains the labeled classes of the input anchor boxes.

Let us analyze the returned class labels based on the positions of the anchor boxes and the ground-truth bounding boxes in the image. First, among all anchor box and ground-truth bounding box pairs, the anchor box $A_4$ has the largest IoU with the ground-truth bounding box of the cat, so the class of $A_4$ is labeled as cat. Removing the pairs containing $A_4$ or the ground-truth bounding box of the cat, among the remaining pairs the anchor box $A_1$ has the largest IoU with the ground-truth bounding box of the dog, so the class of $A_1$ is labeled as dog. Next, we traverse the remaining three unlabeled anchor boxes $A_0$, $A_2$ and $A_3$. For $A_0$, the class of the ground-truth bounding box with the largest IoU is dog, but the IoU is below the predefined threshold (0.5), so the class is labeled as background; for $A_2$, the class of the ground-truth bounding box with the largest IoU is cat and the IoU exceeds the threshold, so the class is labeled as cat; for $A_3$, the class of the ground-truth bounding box with the largest IoU is cat, but the value is below the threshold, so the class is labeled as background.

labels[2]

The second element returned is the mask variable with shape (batch size, four times the number of anchor boxes). The elements in the mask variable are in one-to-one correspondence with the 4 offsets of each anchor box. Since we do not care about the detection of the background, the offset of the negative class should not affect the objective function. With element-wise multiplication, zeros in the mask variable will filter out negative class offsets before computing the objective function.

labels[1]

The first element returned contains the four offset values ​​marked for each anchor box. Note that the offsets of negative anchor boxes are marked as zero.

labels[0]


We now know how to compute offsets, but what are offsets actually used for?

The offset of anchor boxes plays an important role in object detection tasks. It is used to represent the position and scale difference between anchor boxes and ground-truth bounding boxes, thus helping the model to localize objects accurately.
Specifically, the offset of the anchor box has the following purposes:

Position localization: By calculating the offset between the anchor box and the ground-truth bounding box, the precise position of the target object in the image can be determined. The offset indicates how much the anchor box needs to be moved to align with the ground truth bounding box.

Object Classification: The offset can help the model for object classification. In object detection tasks, each anchor box is associated with a category. By comparing the location of the anchor box with the corresponding ground-truth bounding box, the correct category can be assigned to the anchor box aligned with the ground-truth bounding box.

Scale adjustment: The offset of the anchor box can also help the model to perform object scale adjustment. By computing the scale difference between the anchor box and the ground-truth bounding box, the size of the anchor box can be adjusted to better adapt to objects of different scales.

Predict bounding boxes using non-maximum suppression

When predicting, we first generate multiple anchor boxes for the image, and then predict categories and offsets for these anchor boxes one by one. A predicted bounding box is generated from one of the anchor boxes with the predicted offset. Below we implement the offset_inverse function, which takes an anchor box and an offset prediction as input, and applies the inverse offset transformation to return the predicted bounding box coordinates.

#@save
def offset_inverse(anchors, offset_preds):
    """根据带有预测偏移量的锚框来预测边界框"""
    anc = d2l.box_corner_to_center(anchors)
    pred_bbox_xy = (offset_preds[:, :2] * anc[:, 2:] / 10) + anc[:, :2]
    pred_bbox_wh = torch.exp(offset_preds[:, 2:] / 5) * anc[:, 2:]
    pred_bbox = torch.cat((pred_bbox_xy, pred_bbox_wh), axis=1)
    predicted_bbox = d2l.box_center_to_corner(pred_bbox)
    return predicted_bbox

This code implements the prediction of the bounding box based on the anchor box and the predicted offset. It does the following steps:

Convert the four-corner coordinates of the anchor boxes to center-width-height form anc, using the d2l.box_corner_to_center() function.

The first two values of the predicted offset offset_preds represent the offset of the center coordinates. The predicted bounding box centers pred_bbox_xy are obtained by multiplying the offsets by the anchor widths and heights, dividing by 10, and adding the anchor centers from anc.

The last two values of offset_preds represent the logarithm of the width and height ratios. The predicted widths and heights pred_bbox_wh are obtained by applying the exponential function to the offsets divided by 5 and multiplying by the widths and heights in anc.

Concatenate the predicted center coordinates and widths and heights into pred_bbox.

Convert pred_bbox from center form back to four-corner coordinates using the d2l.box_center_to_corner() function; the result is predicted_bbox.

So the whole process is: recover the center coordinates, width and height from the anchor box and the predicted offset, then convert back to four-corner coordinates, thereby predicting the bounding box.

This process is very common in object detection. Predicting a bounding box through an anchor box plus offsets is more accurate than simply sliding a fixed-size window over the image.
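
A quick round-trip check with the same hypothetical boxes as above: up to the eps term, offset_inverse is the inverse of offset_boxes, so feeding the labeled offsets back in recovers the ground-truth box.

anchor = torch.tensor([[0.0, 0.0, 0.4, 0.4]])
gt_box = torch.tensor([[0.1, 0.1, 0.5, 0.5]])
offsets = offset_boxes(anchor, gt_box)
print(offset_inverse(anchor, offsets))  # approximately [[0.1, 0.1, 0.5, 0.5]]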

When there are many anchor boxes, many similar predicted bounding boxes with significant overlap may be output, all around the same object. To simplify the output, we can use non-maximum suppression (NMS) to merge similar predicted bounding boxes belonging to the same object.

Here is how non-maximum suppression works. For a predicted bounding box $B$, the object detection model computes the predicted probability of each class. Let the largest predicted probability be $p$; the class corresponding to this probability is the predicted class of $B$, and $p$ is called the confidence of the predicted bounding box $B$. On the same image, all predicted non-background bounding boxes are sorted by confidence in descending order to produce a list $L$. We then manipulate the sorted list $L$ as follows:

  1. Select the predicted bounding box $B_1$ with the highest confidence from $L$ as a basis, and remove from $L$ all non-basis predicted bounding boxes whose IoU with $B_1$ exceeds a predefined threshold $c$. At this point, $L$ keeps the predicted bounding box with the highest confidence and removes other predicted bounding boxes that are too similar to it. In short, boxes with non-maximum confidence are suppressed.
  2. Select the predicted bounding box $B_2$ with the second-highest confidence from $L$ as another basis, and remove from $L$ all non-basis predicted bounding boxes whose IoU with $B_2$ exceeds $c$.
  3. Repeat this process until every predicted bounding box in $L$ has been used as a basis. At this point, the IoU of any pair of predicted bounding boxes in $L$ is below the threshold $c$; thus, no pair of bounding boxes is too similar.
  4. Output all predicted bounding boxes in $L$.

The following nms function sorts the confidences in descending order and returns their indices.

#@save
def nms(boxes, scores, iou_threshold):
    """对预测边界框的置信度进行排序"""
    B = torch.argsort(scores, dim=-1, descending=True)
    keep = []  # 保留预测边界框的指标
    while B.numel() > 0:
        i = B[0]
        keep.append(i)
        if B.numel() == 1: break
        iou = box_iou(boxes[i, :].reshape(-1, 4),
                      boxes[B[1:], :].reshape(-1, 4)).reshape(-1)
        inds = torch.nonzero(iou <= iou_threshold).reshape(-1)
        B = B[inds + 1]
    return torch.tensor(keep, device=boxes.device)
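
A small sketch with three hypothetical boxes: the first two overlap heavily (IoU about 0.9), so the lower-scoring one is suppressed, while the third box is far away and kept.

boxes = torch.tensor([[0.10, 0.1, 0.50, 0.5],
                      [0.12, 0.1, 0.52, 0.5],   # overlaps box 0 heavily
                      [0.60, 0.6, 0.90, 0.9]])  # far from the other two
scores = torch.tensor([0.9, 0.8, 0.7])
print(nms(boxes, scores, iou_threshold=0.5))  # tensor([0, 2])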

The multibox_detection function below implements bounding box prediction with non-maximum suppression:

#@save
def multibox_detection(cls_probs, offset_preds, anchors, nms_threshold=0.5,
                       pos_threshold=0.009999999):
    """使用非极大值抑制来预测边界框"""
    device, batch_size = cls_probs.device, cls_probs.shape[0]
    anchors = anchors.squeeze(0)
    num_classes, num_anchors = cls_probs.shape[1], cls_probs.shape[2]
    out = []
    for i in range(batch_size):
        cls_prob, offset_pred = cls_probs[i], offset_preds[i].reshape(-1, 4)
        conf, class_id = torch.max(cls_prob[1:], 0)
        predicted_bb = offset_inverse(anchors, offset_pred)
        keep = nms(predicted_bb, conf, nms_threshold)

        # Find all non_keep indices and set their classes to background
        all_idx = torch.arange(num_anchors, dtype=torch.long, device=device)
        combined = torch.cat((keep, all_idx))
        uniques, counts = combined.unique(return_counts=True)
        non_keep = uniques[counts == 1]
        all_id_sorted = torch.cat((keep, non_keep))
        class_id[non_keep] = -1
        class_id = class_id[all_id_sorted]
        conf, predicted_bb = conf[all_id_sorted], predicted_bb[all_id_sorted]
        # pos_threshold is a threshold for non-background predictions
        below_min_idx = (conf < pos_threshold)
        class_id[below_min_idx] = -1
        conf[below_min_idx] = 1 - conf[below_min_idx]
        pred_info = torch.cat((class_id.unsqueeze(1),
                               conf.unsqueeze(1),
                               predicted_bb), dim=1)
        out.append(pred_info)
    return torch.stack(out)

This code implements the prediction of bounding boxes using non-maximum suppression. The main steps are as follows:

  1. Obtain the predicted class confidences cls_probs, the predicted offsets offset_preds and the anchor boxes anchors.

  2. Make predictions for each image. For each image i, obtain its class confidences cls_prob, offset predictions offset_pred and the anchor boxes anchors.

  3. Find the maximum class confidence and class for each anchor box. Use torch.max() over the non-background rows to find the maximum class confidence conf and the corresponding class class_id of each anchor box.

  4. Decode the predicted bounding boxes predicted_bb from the anchor boxes and offset predictions, using the offset_inverse() function.

  5. Perform non-maximum suppression on the predicted bounding boxes to obtain the indices keep of the retained boxes, using the nms() function.

  6. Find the indices non_keep of the boxes that are not kept and set their class to background (-1).

  7. Obtain the sorted indices all_id_sorted by concatenating keep and non_keep, and reorder class_id, conf and predicted_bb according to these indices.

  8. For prediction boxes whose confidence is less than pos_threshold, set their class to background (-1) and replace their confidence with 1 - conf.

  9. Concatenate the class, confidence and predicted box into pred_info as the prediction result for the image.

  10. Stack and return the prediction results pred_info of all images.

Now we can call the multibox_detection function to perform non-maximum suppression, where the threshold is set to 0.5. Note that we added dimensions to the example's tensor input.
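
A sketch of such a call, following the d2l example: four hypothetical anchor boxes, zero predicted offsets, and made-up class probabilities for background, dog, and cat.

anchors = torch.tensor([[0.1, 0.08, 0.52, 0.92], [0.08, 0.2, 0.56, 0.95],
                        [0.15, 0.3, 0.62, 0.91], [0.55, 0.2, 0.9, 0.88]])
offset_preds = torch.tensor([0.0] * anchors.numel())
cls_probs = torch.tensor([[0.0, 0.0, 0.0, 0.0],   # background
                          [0.9, 0.8, 0.7, 0.1],   # dog
                          [0.1, 0.2, 0.3, 0.9]])  # cat
output = multibox_detection(cls_probs.unsqueeze(dim=0),
                            offset_preds.unsqueeze(dim=0),
                            anchors.unsqueeze(dim=0),
                            nms_threshold=0.5)
print(output.shape)  # torch.Size([1, 4, 6])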

We can see that the shape of the returned result is (batch size, number of anchor boxes, 6). The six elements in the innermost dimension provide the output information for one predicted bounding box. The first element is the predicted class index, starting from 0 (0 for dog and 1 for cat); a value of -1 indicates background or a box removed by non-maximum suppression. The second element is the confidence of the predicted bounding box. The remaining four elements are the $(x, y)$ coordinates of the upper-left and lower-right corners of the predicted bounding box (ranging between 0 and 1).

Finally, we can keep only the predicted bounding boxes that survive non-maximum suppression and draw them on the image.
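
A sketch of that filtering step, assuming the img, bbox_scale and show_bboxes objects from the earlier examples and the convention above that a class of -1 marks background or suppressed boxes:

fig = d2l.plt.imshow(img)
for i in output[0].detach().numpy():
    if i[0] == -1:
        continue
    label = ('dog=', 'cat=')[int(i[0])] + str(i[1])
    show_bboxes(fig.axes, [torch.tensor(i[2:]) * bbox_scale], label)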


Origin blog.csdn.net/qq_51957239/article/details/130914308