[SSD code intensive reading] model (Backbone) & loss


1、Backbone

Here we use ResNet-50 as the backbone (the original paper uses VGG-16).

1) ResNet-50

https://blog.csdn.net/weixin_37804469/article/details/111773914

2) Take the first half of ResNet-50 as the backbone

  • The network is kept up to and including layer 3 (conv1, bn1, relu, maxpool, layer1, layer2, layer3); layer 4, the average pooling and the fully connected layer are discarded
  • Block 1 of layer 3 needs a small modification: in ResNet-50 it normally downsamples (halves the feature-map size), so it is changed to not downsample, i.e. its stride goes from 2 to 1, as in the code below
  • With the (3, 300, 300) input from the original paper, the output of layer 3 is then a (1024, 38, 38) feature map: conv1 takes 300 → 150, the max pooling → 75, layer 2 → 38, and layer 3, now with stride 1, keeps 38. This feature map is the first of the six feature layers.


import torch
import torch.nn as nn
from torchvision.models import resnet50  # any standard ResNet-50 implementation with the same layout works


class Backbone(nn.Module):
    def __init__(self, pretrain_path=None):
        super(Backbone, self).__init__()
        net = resnet50()
        self.out_channels = [1024, 512, 512, 256, 256, 256]

        if pretrain_path is not None:
            net.load_state_dict(torch.load(pretrain_path))

        # keep conv1, bn1, relu, maxpool, layer1, layer2, layer3
        self.feature_extractor = nn.Sequential(*list(net.children())[:7])

        conv4_block1 = self.feature_extractor[-1][0]

        # change the stride of conv4_block1 from 2 to 1 so layer3 keeps the 38x38 resolution
        conv4_block1.conv2.stride = (1, 1)
        conv4_block1.downsample[0].stride = (1, 1)

    def forward(self, x):
        x = self.feature_extractor(x)
        return x

At this point, our backbone is built: it is the first half of ResNet-50. Next, we continue to build the model on top of this backbone.
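
As a quick sanity check (a minimal sketch, assuming the 300×300 input from the paper; this snippet is not from the original post), the backbone should produce the 38×38 feature map directly:

# hypothetical sanity check of the modified backbone
backbone = Backbone()
x = torch.randn(1, 3, 300, 300)
print(backbone(x).shape)  # expected: torch.Size([1, 1024, 38, 38])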


2、Module

Building on the backbone, we redesign the second half of the network so that the complete model outputs six feature layers. The outputs of the remaining layers are (512, 19, 19), (512, 10, 10), (256, 5, 5), (256, 3, 3) and (256, 1, 1); a sketch of how these blocks are chained with the backbone follows the code below.

def _build_additional_features(self, channels):
    additional_blocks = []

    # channels = [1024, 512, 512, 256, 256, 256]
    middle_channels = [256, 256, 128, 128, 128]

    for i, (input_ch, output_ch, middle_ch) in enumerate(zip(channels[:-1], channels[1:], middle_channels)):
        # the first three blocks downsample (stride 2, padding 1); the last two use 3x3 convs without padding
        padding, stride = (1, 2) if i < 3 else (0, 1)
        layer = nn.Sequential(
            nn.Conv2d(input_ch, middle_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(middle_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(middle_ch, output_ch, kernel_size=3, padding=padding, stride=stride, bias=False),
            nn.BatchNorm2d(output_ch),
            nn.ReLU(inplace=True)
        )
        additional_blocks.append(layer)

    self.additional_blocks = nn.ModuleList(additional_blocks)
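
In the SSD model, the backbone output and these additional blocks are chained to collect the six feature maps. A minimal sketch of that forward logic, assuming the model stores the backbone as self.feature_extractor and the extra layers as self.additional_blocks:

# hypothetical forward logic of the detector body (a method of the SSD model class)
def _collect_detection_features(self, image):
    x = self.feature_extractor(image)     # (N, 1024, 38, 38) from the ResNet-50 backbone
    detection_features = [x]
    for block in self.additional_blocks:
        x = block(x)                       # (N, 512, 19, 19), (N, 512, 10, 10), (N, 256, 5, 5), ...
        detection_features.append(x)
    return detection_features              # 6 feature maps in total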


The 6 feature maps are then passed through a location extractor and a confidence extractor.

1) location extractor

Extract the location parameters of the corresponding default boxes from the 6 feature layers, where:

  • 5776, 2166, 600, 150, 36 and 4 are the numbers of default boxes produced by each feature layer
  • 4 is the number of coordinate parameters per box (ctr_x, ctr_y, width, height)
# location extractor
(1) Conv2d(1024, 4 * 4, kernel_size=3, padding=1)  ===>> (16, 38, 38)   ==view==>> (4, 5776)
(2) Conv2d(512, 6 * 4, kernel_size=3, padding=1)   ===>> (24, 19, 19)   ==view==>> (4, 2166)
(3) Conv2d(512, 6 * 4, kernel_size=3, padding=1)   ===>> (24, 10, 10)   ==view==>> (4, 600)
(4) Conv2d(256, 6 * 4, kernel_size=3, padding=1)   ===>> (24, 5, 5)     ==view==>> (4, 150)
(5) Conv2d(256, 4 * 4, kernel_size=3, padding=1)   ===>> (16, 3, 3)     ==view==>> (4, 36)
(6) Conv2d(256, 4 * 4, kernel_size=3, padding=1)   ===>> (16, 1, 1)     ==view==>> (4, 4)

====>> concatenate ==>> (4, 8732): the location parameters of all 8732 default boxes
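
A minimal sketch of how such a head can be built (the names out_channels, num_defaults and loc_extractor are illustrative, not taken from the original code):

# hypothetical construction of the location heads: one 3x3 conv per feature layer
out_channels = [1024, 512, 512, 256, 256, 256]   # channels of the 6 feature maps
num_defaults = [4, 6, 6, 6, 4, 4]                # default boxes per spatial position
loc_extractor = nn.ModuleList(
    nn.Conv2d(oc, nd * 4, kernel_size=3, padding=1)
    for oc, nd in zip(out_channels, num_defaults)
)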

2) confidence extractor

Extract the per-class confidence of the object in each default box from the 6 feature layers, where:

  • 5776, 2166, 600, 150, 36 and 4 are the numbers of default boxes produced by each feature layer
  • 21 is the number of classes, so each box gets 21 confidence scores
# confidence extractor
(0): Conv2d(1024, 84, kernel_size=3, stride=1, padding=1) ===>> (84, 38, 38)   ==view==>> (21, 5776)
(1): Conv2d(512, 126, kernel_size=3, stride=1, padding=1) ===>> (126, 19, 19)  ==view==>> (21, 2166)
(2): Conv2d(512, 126, kernel_size=3, stride=1, padding=1) ===>> (126, 10, 10)  ==view==>> (21, 600)
(3): Conv2d(256, 126, kernel_size=3, stride=1, padding=1) ===>> (126, 5, 5)    ==view==>> (21, 150)
(4): Conv2d(256, 84, kernel_size=3, stride=1, padding=1)  ===>> (84, 3, 3)     ==view==>> (21, 36)
(5): Conv2d(256, 84, kernel_size=3, stride=1, padding=1)  ===>> (84, 1, 1)     ==view==>> (21, 4)

====>> concatenate ==>> (21, 8732): the probabilities of the 21 classes for each of the 8732 default boxes
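
How the per-layer outputs are reshaped and concatenated can be sketched like this (a hypothetical helper; the function name and the loc_extractor / conf_extractor arguments are assumptions):

# hypothetical helper: flatten each head output and concatenate over all feature maps
def flatten_predictions(features, loc_extractor, conf_extractor):
    locs, confs = [], []
    for f, l, c in zip(features, loc_extractor, conf_extractor):
        locs.append(l(f).view(f.size(0), 4, -1))    # (N, nd*4, H, W)  -> (N, 4, nd*H*W)
        confs.append(c(f).view(f.size(0), 21, -1))  # (N, nd*21, H, W) -> (N, 21, nd*H*W)
    # concatenate along the box dimension: (N, 4, 8732) and (N, 21, 8732)
    return torch.cat(locs, dim=2).contiguous(), torch.cat(confs, dim=2).contiguous()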

3、Loss Function

After obtaining the coordinate regression parameters and the confidences (classification probabilities) of the predicted boxes, we compute the loss. It is split into two parts (see the formula below):

  • Coordinate regression parameters: SmoothL1Loss
  • Classification: CrossEntropy
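
For reference, the overall objective in the SSD paper combines the two terms, normalized by the number N of matched default boxes (the weight α is set to 1):

L(x, c, l, g) = \frac{1}{N} \left( L_{conf}(x, c) + \alpha \, L_{loc}(x, l, g) \right)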

1) location loss

1. Convert the matched gt boxes, given as (ctr_x, ctr_y, w, h), into regression parameters relative to the default boxes (the formula is written out after the code)

def _location_vec(self, loc):
    # self.scale_xy = 10.0, self.scale_wh = 5.0
    gxy = self.scale_xy * (loc[:, :2, :] - self.dboxes[:, :2, :]) / self.dboxes[:, 2:, :]  # Nx2x8732
    gwh = self.scale_wh * (loc[:, 2:, :] / self.dboxes[:, 2:, :]).log()  # Nx2x8732
    return torch.cat((gxy, gwh), dim=1).contiguous()

vec_gd = self._location_vec(gloc)
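
Written out, this is the standard SSD box encoding with the two scale factors from the code (scale_xy = 10, scale_wh = 5), where (cx_g, cy_g, w_g, h_g) is the matched gt box and (cx_d, cy_d, w_d, h_d) the default box:

\hat{g}_{cx} = 10 \cdot \frac{cx_g - cx_d}{w_d}, \quad \hat{g}_{cy} = 10 \cdot \frac{cy_g - cy_d}{h_d}, \quad \hat{g}_{w} = 5 \cdot \log\frac{w_g}{w_d}, \quad \hat{g}_{h} = 5 \cdot \log\frac{h_g}{h_d}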

2. Compute the SmoothL1 loss between the predicted regression parameters ploc and the ground-truth regression parameters vec_gd obtained in the previous step

vec_gd = self._location_vec(gloc)   # vec_gd shape=[N, 4, 8732]
loc_loss = nn.SmoothL1Loss(reduction='none')(ploc, vec_gd)   # loc_loss shape=[N, 4, 8732]

3. Sum over the 4 coordinate parameters, i.e. add up the losses for ctr_x, ctr_y, w and h of each box

loc_loss = loc_loss.sum(dim=1)   # loc_loss shape=[N, 8732]

4. Keep only the location loss of the positive samples, i.e. only the default boxes matched to foreground objects

loc_loss = (mask.float() * loc_loss).sum(dim=1)   # loc_loss shape=[N]


2) confidence loss

Select the top-k negative samples with the largest confidence loss:

  • Negative sample: background, i.e. label = 0
  • Top-k: k is determined by the number of positive samples in the image; we select 3 times as many negatives as positives
  • Choosing the negatives with the largest confidence loss is hard negative mining: we mine the negatives that are hardest to classify
        # hard negative mining, Tensor: [N, 8732]
        con = self.confidence_loss(plabel, glabel)

        # positive samples must never be selected as negatives
        # zero out the loss of the positives to get the negative candidates
        con_neg = con.clone()
        con_neg[mask] = 0.0
        # sort by confidence loss in descending order, con_idx (Tensor: [N, 8732])
        _, con_idx = con_neg.sort(dim=1, descending=True)
        _, con_rank = con_idx.sort(dim=1)  # the second argsort turns indices into per-box ranks (see the worked example below)

        # number of negatives is three times the number of positives
        # (hard negative mining in the original paper), but capped at the total number of boxes (8732)
        neg_num = torch.clamp(3 * pos_num, max=mask.size(1)).unsqueeze(-1)
        neg_mask = torch.lt(con_rank, neg_num)  # (lt: <) Tensor [N, 8732]

        # the final confidence loss sums the selected positive and negative losses
        con_loss = (con * (mask.float() + neg_mask.float())).sum(dim=1)  # Tensor [N]

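The double sort that produces con_rank is worth a tiny worked example (hypothetical numbers, one image with 5 boxes):

# hypothetical illustration of the rank trick used above
con_neg = torch.tensor([[0.2, 0.9, 0.0, 0.5, 0.1]])
_, con_idx = con_neg.sort(dim=1, descending=True)  # [[1, 3, 0, 4, 2]]: box indices, largest loss first
_, con_rank = con_idx.sort(dim=1)                  # [[2, 0, 4, 1, 3]]: rank of each box in that ordering
neg_mask = torch.lt(con_rank, 2)                   # keeps the 2 hardest negatives: boxes 1 and 3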

3) overall loss

overall loss = location loss + confidence loss

total_loss = loc_loss + con_loss

Compute the mean loss over the N images in the batch, counting only images that contain positive samples
(if an image in the batch has no positive samples, it does not contribute to the loss).

num_mask = torch.gt(pos_num, 0).float()  # flags which images in the batch contain positive samples
pos_num = pos_num.float().clamp(min=1e-6)  # avoid division by zero
ret = (total_loss * num_mask / pos_num).mean(dim=0)  # average only over images that have positive samples

4) loss code

class Loss(nn.Module):
    """
        Implements the loss as the sum of the followings:
        1. Confidence Loss: All labels, with hard negative mining
        2. Localization Loss: Only on positive labels
        Suppose input dboxes has the shape 8732x4
    """
    def __init__(self, dboxes):
        super(Loss, self).__init__()
        # The two scale factors come from the following link
        # http://jany.st/post/2017-11-05-single-shot-detector-ssd-from-scratch-in-tensorflow.html
        self.scale_xy = 1.0 / dboxes.scale_xy  # scale_xy = 1 / 0.1 = 10
        self.scale_wh = 1.0 / dboxes.scale_wh  # scale_wh = 1 / 0.2 = 5

        self.location_loss = nn.SmoothL1Loss(reduction='none')
        # [num_anchors, 4] -> [4, num_anchors] -> [1, 4, num_anchors]
        self.dboxes = nn.Parameter(dboxes(order="xywh").transpose(0, 1).unsqueeze(dim=0),
                                   requires_grad=False)

        self.confidence_loss = nn.CrossEntropyLoss(reduction='none')

    def _location_vec(self, loc):
        # type: (Tensor) -> Tensor
        """
        Generate Location Vectors
        :param :
            (1) self.scale_xy = 10.0  ,  self.scale_wh = 5.0
            (2) default 匹配到的 gt box,    self.dboxes 就是row default box
        :return: ground truth相对anchors的回归参数
        """
        gxy = self.scale_xy * (loc[:, :2, :] - self.dboxes[:, :2, :]) / self.dboxes[:, 2:, :]  # Nx2x8732
        gwh = self.scale_wh * (loc[:, 2:, :] / self.dboxes[:, 2:, :]).log()  # Nx2x8732
        return torch.cat((gxy, gwh), dim=1).contiguous()

    def forward(self, ploc, plabel, gloc, glabel):
        # type: (Tensor, Tensor, Tensor, Tensor) -> Tensor
        """
            ploc, plabel: Nx4x8732, Nxlabel_numx8732
                predicted location and labels

            gloc, glabel: Nx4x8732, Nx8732
                ground truth location and labels
        """
        # mask of positive samples, Tensor: [N, 8732]
        mask = torch.gt(glabel, 0)  # (gt: >)
        # mask1 = torch.nonzero(glabel)
        # number of positive samples in each image of the batch, Tensor: [N]
        pos_num = mask.sum(dim=1)

        # compute the gt location regression parameters, Tensor: [N, 4, 8732]
        vec_gd = self._location_vec(gloc)

        # sum over the four coordinates, then mask
        # compute the localization loss (positive samples only)
        loc_loss = self.location_loss(ploc, vec_gd).sum(dim=1)  # Tensor: [N, 8732]
        loc_loss = (mask.float() * loc_loss).sum(dim=1)  # Tensor: [N]

        # hard negative mining, Tensor: [N, 8732]
        con = self.confidence_loss(plabel, glabel)

        # positive samples must never be selected as negatives
        # zero out the loss of the positives to get the negative candidates
        con_neg = con.clone()
        con_neg[mask] = 0.0
        # sort by confidence loss in descending order, con_idx (Tensor: [N, 8732])
        _, con_idx = con_neg.sort(dim=1, descending=True)
        _, con_rank = con_idx.sort(dim=1)  # the second argsort turns indices into per-box ranks

        # number of negatives is three times the number of positives
        # (hard negative mining in the original paper), but capped at the total number of boxes (8732)
        neg_num = torch.clamp(3 * pos_num, max=mask.size(1)).unsqueeze(-1)
        neg_mask = torch.lt(con_rank, neg_num)  # (lt: <) Tensor [N, 8732]

        # the final confidence loss sums the selected positive and negative losses
        con_loss = (con * (mask.float() + neg_mask.float())).sum(dim=1)  # Tensor [N]

        # avoid the case where an image contains no GT boxes
        total_loss = loc_loss + con_loss
        # e.g. pos_num = [15, 3, 5, 0] -> num_mask = [1.0, 1.0, 1.0, 0.0]
        num_mask = torch.gt(pos_num, 0).float()  # flags which images in the batch contain positive samples
        pos_num = pos_num.float().clamp(min=1e-6)  # avoid division by zero
        ret = (total_loss * num_mask / pos_num).mean(dim=0)  # average only over images that have positive samples
        return ret
        
compute_loss = Loss(default_box)
loss = compute_loss(locs, confs, bboxes_out, labels_out)

Origin blog.csdn.net/weixin_37804469/article/details/128967356