Insightful Object Detection 23: Building an SSD Object Detection Platform with PyTorch - https://blog.csdn.net/weixin_44791964/article/details/104981486

What is the SSD object detection algorithm

SSD is a very good one-stage object detection method. A one-stage algorithm performs object detection and classification at the same time. Its main idea is to use a CNN to extract features and then sample densely and uniformly at different positions of the image, with different scales and aspect ratios, performing object classification and prediction-box regression simultaneously. The whole process takes only a single step, so its main advantage is speed.

However, an important drawback of uniform dense sampling is that training becomes harder: the positive samples (objects) and negative samples (background) are heavily imbalanced, which makes the model's accuracy slightly lower.

The full English name of SSD is Single Shot MultiBox Detector. "Single shot" means that SSD is a one-stage method, and "MultiBox" means that SSD is based on predicting multiple boxes.

SSD implementation approach

First, the prediction part

1. Introduction to the backbone network

 

(Figure: the SSD network structure, from the 300x300 input image through the modified VGG backbone and the extra layers to the final detections.)

The VGG network used here has been modified; the main changes are:

1. The FC6 and FC7 layers of VGG16 are converted into convolutional layers.

2. All Dropout layers and the FC8 layer are removed.

3. The extra blocks conv6, conv7, conv8 and conv9 are added.

As shown in the figure, the input image passes through the modified VGG network (Conv1->fc7) and several additional convolutional layers (conv6->conv9) for feature extraction:

a. An image is input and resized to 300x300.

b. conv1: two [3,3] convolutions with 64 output channels give (300, 300, 64); a 2x2 max pooling then gives (150, 150, 64).

c. conv2: two [3,3] convolutions with 128 output channels give (150, 150, 128); a 2x2 max pooling then gives (75, 75, 128).

d. conv3: three [3,3] convolutions with 256 output channels give (75, 75, 256); a 2x2 max pooling (ceil mode) then gives (38, 38, 256).

e. conv4: three [3,3] convolutions with 512 output channels give (38, 38, 512); a 2x2 max pooling then gives (19, 19, 512).

f. conv5: three [3,3] convolutions with 512 output channels give (19, 19, 512); a 3x3 max pooling with stride 1 keeps the output at (19, 19, 512).

g. The fully connected layers are replaced with convolutions: a [3,3] (dilated) convolution and a [1,1] convolution with 1024 output channels give (19, 19, 1024).

(Everything up to this point is the VGG structure; the layers below are the extra layers.)

h. conv6: a [1,1] convolution adjusts the number of channels, then a [3,3] convolution with stride 2 and 512 output channels gives (10, 10, 512).

i. conv7: a [1,1] convolution adjusts the number of channels, then a [3,3] convolution with stride 2 and 256 output channels gives (5, 5, 256).

j. conv8: a [1,1] convolution adjusts the number of channels, then a [3,3] convolution with valid padding and 256 output channels gives (3, 3, 256).

k. conv9: a [1,1] convolution adjusts the number of channels, then a [3,3] convolution with valid padding and 256 output channels gives (1, 1, 256).

Implementation code:

import torch.nn as nn

base = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'C', 512, 512, 512, 'M', 512, 512, 512]


def add_vgg(i):
    layers = []
    in_channels = i
    for v in base:
        if v == 'M':
            # standard 2x2 max pooling
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        elif v == 'C':
            # 2x2 max pooling with ceil_mode, so 75x75 -> 38x38
            layers += [nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    # pool5 keeps the spatial size at 19x19
    pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)  # (19,19,512)
    # FC6 and FC7 rewritten as (dilated) convolutions -> (19,19,1024)
    conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
    conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
    layers += [pool5, conv6,
               nn.ReLU(inplace=True), conv7, nn.ReLU(inplace=True)]
    return layers


def add_extras(i, batch_norm=False):
    # Extra layers added to VGG for feature scaling
    layers = []
    in_channels = i

    # Block6
    # 19,19,1024->10,10,512
    layers += [nn.Conv2d(in_channels, 256, kernel_size=1, stride=1)]
    layers += [nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1)]

    # block7
    # 10,10,512->5,5,256
    layers += [nn.Conv2d(512, 128, kernel_size=1, stride=1)]
    layers += [nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)]

    # Block 8
    # 5,5,256 ->3,3,256
    layers += [nn.Conv2d(256, 128, kernel_size=1, stride=1)]
    layers += [nn.Conv2d(128, 256, kernel_size=3, stride=1)]

    # Block 9
    # 3,3,256 ->1,1,256
    layers += [nn.Conv2d(256, 128, kernel_size=1, stride=1)]
    layers += [nn.Conv2d(128, 256, kernel_size=3, stride=1)]
    return layers
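As a quick sanity check of the shapes listed above, the two helpers can be chained on a dummy 300x300 input. This snippet is only illustrative and is not part of the original code:

import torch
import torch.nn as nn

# Build the backbone and the extra layers defined above
backbone = nn.Sequential(*add_vgg(3))
extras = add_extras(1024)

x = torch.randn(1, 3, 300, 300)
x = backbone(x)
print(x.shape)  # torch.Size([1, 1024, 19, 19]) -- the fc7 feature map

# The extra blocks shrink the feature map step by step:
# (19,19) -> (10,10) -> (5,5) -> (3,3) -> (1,1)
for k, layer in enumerate(extras):
    x = torch.relu(layer(x))
    if k % 2 == 1:
        print(x.shape)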

2. Obtaining prediction results from the features

From the figure above, we take the features of the third convolution of conv4 (conv4_3), the features of fc7, and the features of the second convolution of conv6, conv7, conv8 and conv9. To distinguish them from ordinary feature layers, we call these the effective feature layers, and they are used to obtain the prediction results.

For each effective feature layer obtained, we apply one convolution with num_priors*4 output channels and one convolution with num_priors*num_classes output channels, and we also need to compute the prior boxes corresponding to each effective feature layer. Here num_priors is the number of prior boxes owned by that feature layer.

Specifically:

The num_priors*4 convolution is used to predict the adjustment of each prior box at each grid point of the feature layer. (Why an adjustment? Because the SSD prediction has to be combined with the prior boxes to obtain the final prediction boxes.)

The num_priors*num_classes convolution is used to predict the category corresponding to each prediction box at each grid point of the feature layer.

The prior boxes corresponding to each effective feature layer are several preset boxes centred on each grid point of that feature layer.

The shapes of the prediction results for the effective feature layers follow from the grid sizes 38x38, 19x19, 10x10, 5x5, 3x3 and 1x1, with 4, 6, 6, 6, 4 and 4 prior boxes per grid point respectively: for each layer the loc prediction has shape (grid, grid, num_priors*4) and the conf prediction has shape (grid, grid, num_priors*num_classes).

The implementation code is:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

# Config, L2Norm, PriorBox and Detect come from the repository's other modules
class SSD(nn.Module):
    def __init__(self, phase, base, extras, head, num_classes):
        super(SSD, self).__init__()
        self.phase = phase
        self.num_classes = num_classes
        self.cfg = Config
        self.vgg = nn.ModuleList(base)
        self.L2Norm = L2Norm(512, 20)
        self.extras = nn.ModuleList(extras)
        self.priorbox = PriorBox(self.cfg)
        with torch.no_grad():
            self.priors = Variable(self.priorbox.forward())
        self.loc = nn.ModuleList(head[0])
        self.conf = nn.ModuleList(head[1])
        if phase == 'test':
            self.softmax = nn.Softmax(dim=-1)
            self.detect = Detect(num_classes, 0, 200, 0.01, 0.45)

    def forward(self, x):
        sources = list()
        loc = list()
        conf = list()

        # Obtain the output of conv4_3
        for k in range(23):
            x = self.vgg[k](x)

        s = self.L2Norm(x)
        sources.append(s)

        # Obtain the output of fc7
        for k in range(23, len(self.vgg)):
            x = self.vgg[k](x)
        sources.append(x)

        for k, v in enumerate(self.extras):
            x = F.relu(v(x), inplace=True)
            if k % 2 == 1:
                sources.append(x)

        # Apply the regression (loc) and classification (conf) heads
        for (x, l, c) in zip(sources, self.loc, self.conf):
            loc.append(l(x).permute(0, 2, 3, 1).contiguous())
            conf.append(c(x).permute(0, 2, 3, 1).contiguous())

        # Flatten each prediction and concatenate
        loc = torch.cat([o.view(o.size(0), -1) for o in loc], 1)
        conf = torch.cat([o.view(o.size(0), -1) for o in conf], 1)
        if self.phase == "test":
            # loc is reshaped to (batch_size, num_anchors, 4)
            # conf is reshaped to (batch_size, num_anchors, num_classes)
            output = self.detect(
                loc.view(loc.size(0), -1, 4),  # loc preds
                self.softmax(conf.view(conf.size(0), -1,
                                       self.num_classes)),  # conf preds
                self.priors
            )
        else:
            output = (
                loc.view(loc.size(0), -1, 4),
                conf.view(conf.size(0), -1, self.num_classes),
                self.priors
            )
        return output


mbox = [4, 6, 6, 6, 4, 4]


def get_ssd(phase, num_classes):
    vgg, extra_layers = add_vgg(3), add_extras(1024)

    loc_layers = []
    conf_layers = []
    vgg_source = [21, -2]
    for k, v in enumerate(vgg_source):
        loc_layers += [nn.Conv2d(vgg[v].out_channels,
                                 mbox[k] * 4, kernel_size=3, padding=1)]
        conf_layers += [nn.Conv2d(vgg[v].out_channels,
                                  mbox[k] * num_classes, kernel_size=3, padding=1)]
    for k, v in enumerate(extra_layers[1::2], 2):
        loc_layers += [nn.Conv2d(v.out_channels, mbox[k]
                                 * 4, kernel_size=3, padding=1)]
        conf_layers += [nn.Conv2d(v.out_channels, mbox[k]
                                  * num_classes, kernel_size=3, padding=1)]

    SSD_MODEL = SSD(phase, vgg, extra_layers, (loc_layers, conf_layers), num_classes)
    return SSD_MODEL
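A usage sketch, assuming the repository's Config, L2Norm, PriorBox and Detect modules are available (num_classes = 21 corresponds to the 20 VOC classes plus background); in training mode the network returns the raw loc/conf tensors together with the prior boxes:

num_classes = 21  # 20 VOC classes + 1 background class (assumed setting)
net = get_ssd("train", num_classes)

x = torch.randn(1, 3, 300, 300)
loc, conf, priors = net(x)
print(loc.shape)     # torch.Size([1, 8732, 4])
print(conf.shape)    # torch.Size([1, 8732, 21])
print(priors.shape)  # torch.Size([8732, 4])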

3. Decoding of prediction results

By processing each effective feature layer, we obtain three things:

The output of the num_priors*4 convolution, which predicts the adjustment of each prior box at each grid point of the feature layer.

The output of the num_priors*num_classes convolution, which predicts the category corresponding to each prediction box at each grid point of the feature layer.

The prior boxes corresponding to each effective feature layer, i.e. the preset boxes at each grid point of that feature layer.

We combine the output of the num_priors*4 convolution with the prior boxes of each effective feature layer to obtain the real position of each box.

The prior boxes corresponding to each effective feature layer are as shown in the figure:

Each effective feature layer divides the whole image into a grid matching its own width and height. For example, the conv4_3 feature layer divides the image into a 38x38 grid; then several prior boxes are built around the centre of each grid cell. For conv4_3 there are 4 prior boxes per cell, so this layer alone contributes 38*38*4 = 5776 prior boxes.
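A short sketch of the per-layer counts (the feature-map sizes and the boxes-per-point follow the walkthrough above and the mbox list in the code):

# Number of prior boxes contributed by each effective feature layer
feature_maps = [38, 19, 10, 5, 3, 1]
mbox = [4, 6, 6, 6, 4, 4]

for size, n in zip(feature_maps, mbox):
    print(f"{size}x{size} grid, {n} priors per point -> {size * size * n} priors")
print("total:", sum(s * s * n for s, n in zip(feature_maps, mbox)))  # 8732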


Although a prior box can represent the position and size of a certain box, it is limited and cannot represent every possible box, so it needs to be adjusted. SSD uses the output of the num_priors*4 convolution to adjust the prior boxes.

The num_priors in num_priors*4 is the number of prior boxes at each grid point, and the 4 corresponds to the adjustments of x_offset, y_offset, h and w.

x_offset and y_offset represent the offset of the real box relative to the centre of the prior box along the x and y axes.

h and w represent the change in width and height of the real box relative to the prior box.

The decoding process of SSD adds x_offset and y_offset to the centre of each prior box; the result is the centre of the prediction box. The prior box's width and height are then combined with h and w to obtain the width and height of the prediction box.
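As a small worked example of this decoding (the values below are made up, and the variances [0.1, 0.2] are the usual SSD defaults assumed here; the actual values come from Config in the code that follows):

import torch

prior = torch.tensor([[0.5, 0.5, 0.2, 0.2]])   # (cx, cy, w, h), normalized to [0, 1]
loc   = torch.tensor([[1.0, 0.5, 0.2, -0.1]])  # predicted offsets for this prior
variances = [0.1, 0.2]

cxcy = prior[:, :2] + loc[:, :2] * variances[0] * prior[:, 2:]  # shifted box centre
wh   = prior[:, 2:] * torch.exp(loc[:, 2:] * variances[1])      # rescaled width/height
box  = torch.cat([cxcy - wh / 2, cxcy + wh / 2], 1)             # (xmin, ymin, xmax, ymax)
print(box)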

Of course, the final predictions obtained this way still need to be sorted by score and filtered with non-maximum suppression. This part is common to essentially all object detectors:

1. For each class, take out the boxes and scores whose score is greater than the confidence threshold.

2. Perform non-maximum suppression using those boxes and scores.
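Before the full implementation, here is a minimal stand-alone illustration of what non-maximum suppression does, using torchvision.ops.nms in place of the repository's own nms helper:

import torch
from torchvision.ops import nms

boxes = torch.tensor([[0.0, 0.0, 10.0, 10.0],
                      [1.0, 1.0, 11.0, 11.0],     # heavily overlaps the first box
                      [20.0, 20.0, 30.0, 30.0]])
scores = torch.tensor([0.9, 0.8, 0.7])

keep = nms(boxes, scores, iou_threshold=0.45)
print(keep)  # tensor([0, 2]) -- the overlapping, lower-scoring box is suppressed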

The implemented code is as follows:

import torch
from torch.autograd import Function

# nms and Config come from the repository's utility modules

# Adapted from https://github.com/Hakuyume/chainer-ssd
def decode(loc, priors, variances):
    boxes = torch.cat((
        priors[:, :2] + loc[:, :2] * variances[0] * priors[:, 2:],
        priors[:, 2:] * torch.exp(loc[:, 2:] * variances[1])), 1)
    boxes[:, :2] -= boxes[:, 2:] / 2
    boxes[:, 2:] += boxes[:, :2]
    return boxes


class Detect(Function):
    def __init__(self, num_classes, bkg_label, top_k, conf_thresh, nms_thresh):
        self.num_classes = num_classes
        self.background_label = bkg_label
        self.top_k = top_k
        self.nms_thresh = nms_thresh
        if nms_thresh <= 0:
            raise ValueError('nms_threshold must be non negative.')
        self.conf_thresh = conf_thresh
        self.variance = Config['variance']

    def forward(self, loc_data, conf_data, prior_data):
        loc_data = loc_data.cpu()
        conf_data = conf_data.cpu()
        num = loc_data.size(0)  # batchsize
        num_priors = prior_data.size(0)
        output = torch.zeros(num, self.num_classes, self.top_k, 5)
        conf_preds = conf_data.view(num, num_priors,
                                    self.num_classes).transpose(2, 1)
        # Process each image in the batch
        for i in range(num):
            # Decode the prior boxes to obtain the prediction boxes
            decoded_boxes = decode(loc_data[i], prior_data, self.variance)
            conf_scores = conf_preds[i].clone()

            for cl in range(1, self.num_classes):
                # Perform non-maximum suppression for each class
                c_mask = conf_scores[cl].gt(self.conf_thresh)
                scores = conf_scores[cl][c_mask]
                if scores.size(0) == 0:
                    continue
                l_mask = c_mask.unsqueeze(1).expand_as(decoded_boxes)
                boxes = decoded_boxes[l_mask].view(-1, 4)
                # Apply non-maximum suppression
                ids, count = nms(boxes, scores, self.nms_thresh, self.top_k)
                output[i, cl, :count] = \
                    torch.cat((scores[ids[:count]].unsqueeze(1),
                               boxes[ids[:count]]), 1)
        # Keep only the top_k highest-scoring detections over all classes
        flt = output.contiguous().view(num, -1, 5)
        _, idx = flt[:, :, 0].sort(1, descending=True)
        _, rank = idx.sort(1)
        flt[(rank < self.top_k).unsqueeze(-1).expand_as(flt)].fill_(0)
        return output

4. Draw on the original image

Through the third step, we obtain the positions of the prediction boxes on the original image, and these prediction boxes have already been filtered. The filtered boxes can be drawn directly on the image to produce the final result.
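A minimal sketch of this drawing step with PIL (the file name, class label and box values here are made up; the decoded boxes are normalized to [0, 1], so they are scaled back to the original image size first):

from PIL import Image, ImageDraw

image = Image.open("street.jpg")            # hypothetical input image
draw = ImageDraw.Draw(image)

box = [0.1, 0.2, 0.5, 0.7]                  # (xmin, ymin, xmax, ymax), normalized
w, h = image.size
draw.rectangle([box[0] * w, box[1] * h, box[2] * w, box[3] * h],
               outline="red", width=2)
draw.text((box[0] * w, box[1] * h), "car 0.92", fill="red")
image.save("street_result.jpg")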

Second, the training part

1. Processing the real (ground-truth) boxes

From the prediction part we know that, for each feature layer, the num_priors*4 convolution predicts the adjustment of each prior box at each grid point of that feature layer.

In other words, what the SSD network predicts directly is not the real position of a prediction box on the image; it has to be decoded to obtain the real position.

During training we need to compute the loss function, which is defined with respect to the SSD network's prediction results. So we feed the image into the current SSD network to obtain the predictions; at the same time, we need to convert the position information of the real boxes into the same format as the SSD predictions.

That is, for each real box of each training image, we need to find its corresponding prior box and work out what the prediction result would have to be in order to reproduce that real box.

The process of obtaining the real boxes from the prediction results is called decoding, and the process of obtaining the prediction results from the real boxes is the encoding process.

Therefore, we only need to reverse the decoding process to obtain the encoding process.

The implementation code is as follows:
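A minimal sketch of the encode function, written as the inverse of the decode function shown earlier (this follows the common ssd.pytorch convention; the repository's own helper may differ slightly):

def encode(matched, priors, variances):
    # matched: ground-truth boxes in (xmin, ymin, xmax, ymax) form, one per prior
    # priors:  prior boxes in (cx, cy, w, h) form
    # Offset of the ground-truth centre from the prior centre
    g_cxcy = (matched[:, :2] + matched[:, 2:]) / 2 - priors[:, :2]
    g_cxcy /= (variances[0] * priors[:, 2:])
    # Log-scale change of width and height relative to the prior box
    g_wh = (matched[:, 2:] - matched[:, :2]) / priors[:, 2:]
    g_wh = torch.log(g_wh) / variances[1]
    return torch.cat([g_cxcy, g_wh], 1)  # [num_priors, 4]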

During training, for each real box we only need to select the prior box with the largest IoU; that is the prior box used to predict this real box.

Therefore, we go through a screening step: among the prior boxes overlapping a real box, we keep the one with the largest IoU and use the encoding above to turn that real box into its regression target.

The implemented code is as follows:

def match(threshold, truths, priors, variances, labels, loc_t, conf_t, idx):
    # Compute the overlap (IoU) between every ground-truth box and every prior box
    overlaps = jaccard(
        truths,
        point_form(priors)
    )
    # Best overlapping prior for each ground-truth box
    # shape: [num_truths, 1]
    best_prior_overlap, best_prior_idx = overlaps.max(1, keepdim=True)
    best_prior_idx.squeeze_(1)
    best_prior_overlap.squeeze_(1)
    # Best overlapping ground-truth box for each prior
    # shape: [1, num_priors]
    best_truth_overlap, best_truth_idx = overlaps.max(0, keepdim=True)
    best_truth_idx.squeeze_(0)
    best_truth_overlap.squeeze_(0)

    # Make sure every ground-truth box keeps its best matching prior
    best_truth_overlap.index_fill_(0, best_prior_idx, 2)

    # Point each ground-truth box's best prior back to that box
    for j in range(best_prior_idx.size(0)):
        best_truth_idx[best_prior_idx[j]] = j

    # For every prior, the ground-truth box it matches best
    matches = truths[best_truth_idx]   # Shape: [num_priors, 4]
    conf = labels[best_truth_idx] + 1  # Shape: [num_priors]

    # Priors whose overlap is below the threshold are treated as background
    conf[best_truth_overlap < threshold] = 0  # label as background
    loc = encode(matches, priors, variances)
    loc_t[idx] = loc    # [num_priors, 4] encoded offsets to learn
    conf_t[idx] = conf  # [num_priors] top class label for each prior

2. Use the processed real boxes and the prediction results of the corresponding images to calculate the loss

The loss is divided into three parts:

1. The regression loss of the predictions for all positive samples (prior boxes matched to a real box).

2. The cross-entropy loss of the class predictions for all positive samples.

3. The cross-entropy loss of the class predictions for a certain number of negative samples (background).

Because the positive and negative samples are extremely unbalanced during SSD training (there may be only a dozen prior boxes matched to real boxes but thousands of negative prior boxes that match nothing), the loss from negative samples would dominate. We therefore reduce the number of negative samples used: for SSD training it is common to keep three times as many negative samples as positive samples. This ratio of three can also be modified to any value you prefer.
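The selection of the hardest negatives in the code below relies on a double-sort trick; here is a tiny stand-alone illustration (with made-up loss values) of how it keeps only the highest-loss negative priors:

import torch

loss_c = torch.tensor([[0.9, 0.1, 0.5, 0.3, 0.7]])  # per-prior conf loss, positives already zeroed
num_neg = torch.tensor([[2]])                        # keep the 2 hardest negatives

_, loss_idx = loss_c.sort(1, descending=True)  # priors ordered from hardest to easiest
_, idx_rank = loss_idx.sort(1)                 # rank of each prior within that ordering
neg = idx_rank < num_neg.expand_as(idx_rank)
print(neg)  # tensor([[ True, False, False, False,  True]]) -> priors 0 and 4 are selected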

The implementation code is as follows:

class MultiBoxLoss(nn.Module):
    def __init__(self, num_classes, overlap_thresh, prior_for_matching,
                 bkg_label, neg_mining, neg_pos, neg_overlap, encode_target,
                 use_gpu=True):
        super(MultiBoxLoss, self).__init__()
        self.use_gpu = use_gpu
        self.num_classes = num_classes
        self.threshold = overlap_thresh
        self.background_label = bkg_label
        self.encode_target = encode_target
        self.use_prior_for_matching = prior_for_matching
        self.do_neg_mining = neg_mining
        self.negpos_ratio = neg_pos
        self.neg_overlap = neg_overlap
        self.variance = Config['variance']

    def forward(self, predictions, targets):
        # Regression predictions, confidence predictions and prior boxes
        loc_data, conf_data, priors = predictions
        # Batch size
        num = loc_data.size(0)
        # Take out all the prior boxes
        priors = priors[:loc_data.size(1), :]
        # Number of prior boxes
        num_priors = (priors.size(0))
        num_classes = self.num_classes
        # Create the target tensors
        loc_t = torch.Tensor(num, num_priors, 4)
        conf_t = torch.LongTensor(num, num_priors)
        for idx in range(num):
            # Ground-truth boxes
            truths = targets[idx][:, :-1].data
            # Ground-truth labels
            labels = targets[idx][:, -1].data
            # Prior boxes
            defaults = priors.data
            # Match each ground-truth box to its prior boxes
            match(self.threshold, truths, defaults, self.variance, labels,
                  loc_t, conf_t, idx)
        if self.use_gpu:
            loc_t = loc_t.cuda()
            conf_t = conf_t.cuda()

        # Wrap as Variables (no gradient needed for the targets)
        loc_t = Variable(loc_t, requires_grad=False)
        conf_t = Variable(conf_t, requires_grad=False)

        # Every position with conf > 0 contains an object (positive sample)
        pos = conf_t > 0
        # Number of positive samples in each image
        num_pos = pos.sum(dim=1, keepdim=True)
        # Regression loss on the positive samples
        pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)
        loc_p = loc_data[pos_idx].view(-1, 4)
        loc_t = loc_t[pos_idx].view(-1, 4)
        loss_l = F.smooth_l1_loss(loc_p, loc_t, size_average=False)

        # Reshape the confidence predictions
        batch_conf = conf_data.view(-1, self.num_classes)
        # softmax turns arbitrary scores into a probability distribution;
        # here we compute, for every prior box, the loss of its ground-truth class

        loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))
        loss_c = loss_c.view(num, -1)

        loss_c[pos] = 0
        # Sort the losses of the negative samples (hard negative mining)
        _, loss_idx = loss_c.sort(1, descending=True)
        _, idx_rank = loss_idx.sort(1)
        # Number of positive samples in each image
        num_pos = pos.long().sum(1, keepdim=True)
        # Limit the number of negative samples
        num_neg = torch.clamp(self.negpos_ratio * num_pos, max=pos.size(1) - 1)
        neg = idx_rank < num_neg.expand_as(idx_rank)

        # Classification loss on the positive and the selected negative samples
        pos_idx = pos.unsqueeze(2).expand_as(conf_data)
        neg_idx = neg.unsqueeze(2).expand_as(conf_data)
        conf_p = conf_data[(pos_idx + neg_idx).gt(0)].view(-1, self.num_classes)
        targets_weighted = conf_t[(pos + neg).gt(0)]
        loss_c = F.cross_entropy(conf_p, targets_weighted, size_average=False)

        # Sum of losses: L(x,c,l,g) = (Lconf(x,c) + alpha * Lloc(x,l,g)) / N

        N = num_pos.data.sum()
        loss_l /= N
        loss_c /= N
        return loss_l, loss_c

 
