SSD principles and annotated code implementation

This post reads the code at https://github.com/amdegroot/ssd.pytorch together with the paper https://arxiv.org/abs/1512.02325 to understand SSD.

SSD consists of three parts:

  • base
  • extra
  • predict

In the original paper the base is VGG16 with its fully connected layers removed.
(Figure: SSD structure)
Base + extra together perform feature extraction and produce feature maps of different sizes. Separate convolution kernels are then applied to these feature maps to predict the class and the box coordinates, respectively.

Basic feature extraction network

The feature extraction network consists of two parts:

  • vgg16
  • extra layer

vgg16 variant

(Figure: VGG16 structure)
The fully connected layers of VGG16 are replaced with convolutional layers.

Code

In ssd.py:

base = {
    '300': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'C', 512, 512, 512, 'M',
            512, 512, 512],
    '512': [],
}
extras = {
    '300': [256, 'S', 512, 128, 'S', 256, 128, 256, 128, 256],
    '512': [],
}

These lists define the number of output channels of each convolutional layer. 'M' and 'C' both stand for a max-pooling layer; the only difference is that 'C' uses ceil instead of floor to compute the output shape.
See https://pytorch.org/docs/stable/nn.html#maxpool2d
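To see why the 'C' entry matters, here is a small sketch that tracks the spatial size of a 300 x 300 input through the four pooling entries of base['300']. The helper pool_out_size is a hypothetical name and uses the simplified output-size formula (it ignores the extra boundary rule PyTorch applies in ceil mode, which does not trigger here). The 3 x 3 convolutions with padding 1 keep the size unchanged, so only the pooling layers are tracked.

```python
import math

def pool_out_size(n, kernel_size=2, stride=2, padding=0, ceil_mode=False):
    """Output length of a pooling layer along one dimension (simplified formula)."""
    frac = (n + 2 * padding - kernel_size) / stride
    return (math.ceil(frac) if ceil_mode else math.floor(frac)) + 1

# Spatial size after each pooling entry of base['300'] for a 300 x 300 input.
size = 300
sizes = []
for v in ['M', 'M', 'C', 'M']:
    size = pool_out_size(size, ceil_mode=(v == 'C'))
    sizes.append(size)

print(sizes)                               # [150, 75, 38, 19]
print(pool_out_size(75, ceil_mode=False))  # 37
```

With floor the third pooling would give 37 x 37; with ceil it gives the 38 x 38 map that conv4_3 works on.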

def vgg(cfg, i, batch_norm=False):
    layers = []
    in_channels = i
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        elif v == 'C':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
    conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
    conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
    layers += [pool5, conv6,
               nn.ReLU(inplace=True), conv7, nn.ReLU(inplace=True)]
    return layers

This forms the base feature extraction network. The front part is the same as VGG16; the fully connected layers are replaced by conv6 + relu + conv7 + relu.

extra layer

Further convolutions are applied to the output of the base network to obtain feature maps at more scales.
(Figure: extra layers)

Code
extras = {
    '300': [256, 'S', 512, 128, 'S', 256, 128, 256, 128, 256],
    '512': [],
}

def add_extras(cfg, i, batch_norm=False):
    # Extra layers added to VGG for feature scaling
    layers = []
    in_channels = i
    flag = False
    for k, v in enumerate(cfg):
        if in_channels != 'S':
            if v == 'S':
                layers += [nn.Conv2d(in_channels, cfg[k + 1],
                           kernel_size=(1, 3)[flag], stride=2, padding=1)]
            else:
                layers += [nn.Conv2d(in_channels, v, kernel_size=(1, 3)[flag])]
            flag = not flag
        in_channels = v
    return layers
    
add_extras(extras[str(size)], 1024)

The layers are created from [256, 'S', 512, 128, 'S', 256, 128, 256, 128, 256].
An 'S' entry means the convolution uses stride 2, and its number of output channels is the number that follows the 'S'. Kernel sizes alternate between 1 x 1 and 3 x 3 (the flag variable); with this configuration every 'S' convolution is 3 x 3.

This builds the extra layers.
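The loop in add_extras is easier to follow when traced by hand. The sketch below mirrors that loop but records (in_channels, out_channels, kernel, stride, output size) tuples instead of building nn.Conv2d layers. trace_extras is a hypothetical helper, and the 19 x 19 input size is an assumption that holds for a 300 x 300 image.

```python
def trace_extras(cfg, in_channels, in_size):
    """Mirror the loop of add_extras, recording layer specs instead of building layers."""
    specs, size, flag = [], in_size, False
    for k, v in enumerate(cfg):
        if in_channels != 'S':
            kernel = (1, 3)[flag]  # alternates 1, 3, 1, 3, ...
            if v == 'S':
                out_channels, stride, padding = cfg[k + 1], 2, 1
            else:
                out_channels, stride, padding = v, 1, 0
            size = (size + 2 * padding - kernel) // stride + 1
            specs.append((in_channels, out_channels, kernel, stride, size))
            flag = not flag
        in_channels = v
    return specs

specs = trace_extras([256, 'S', 512, 128, 'S', 256, 128, 256, 128, 256], 1024, 19)
for s in specs:
    print(s)
```

Eight layers come out. Every second one (specs[1::2]) is a 3 x 3 convolution, and their output sizes are 10, 5, 3 and 1, which are the extra feature maps used for prediction.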

Multi-scale detection multibox

We now have the outputs of many layers (feature maps) of different sizes. We convolve the feature maps of selected layers (conv4_3, conv7, conv8_2, conv9_2, conv10_2, conv11_2) to obtain location and class information.
Each of these feature maps goes through two groups of 3 x 3 convolutions, one predicting the class and one predicting the position. The numbers of kernels are boxnum x num_classes and boxnum x 4 respectively (a box is determined by four parameters: the center coordinates, the width and the height).

That is, convolving an m x m feature map yields one m x m x (boxnum x num_classes) tensor and one m x m x (boxnum x 4) tensor, used to compute the class probabilities and the box positions respectively.

Code

def multibox(vgg, extra_layers, cfg, num_classes):
    loc_layers = []
    conf_layers = []
    vgg_source = [21, -2]
    for k, v in enumerate(vgg_source):
        loc_layers += [nn.Conv2d(vgg[v].out_channels,
                                 cfg[k] * 4, kernel_size=3, padding=1)]
        conf_layers += [nn.Conv2d(vgg[v].out_channels,
                        cfg[k] * num_classes, kernel_size=3, padding=1)]
    for k, v in enumerate(extra_layers[1::2], 2):
        loc_layers += [nn.Conv2d(v.out_channels, cfg[k]
                                 * 4, kernel_size=3, padding=1)]
        conf_layers += [nn.Conv2d(v.out_channels, cfg[k]
                                  * num_classes, kernel_size=3, padding=1)]
    return vgg, extra_layers, (loc_layers, conf_layers)

The number of boxes predicted per location of each feature map is given by the following variable.

mbox = {
    '300': [4, 6, 6, 6, 4, 4],  # number of boxes per feature map location
    '512': [],
}
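As a quick check of the shapes involved, the sketch below computes the number of output channels of each loc and conf head from the mbox values. The value num_classes = 21 is an assumption (20 VOC classes plus background).

```python
mbox = [4, 6, 6, 6, 4, 4]             # boxes per location, from the mbox config
feature_maps = [38, 19, 10, 5, 3, 1]  # spatial sizes of the six source feature maps
num_classes = 21                      # assumption: 20 VOC classes + background

loc_channels = [n * 4 for n in mbox]
conf_channels = [n * num_classes for n in mbox]

for m, lc, cc in zip(feature_maps, loc_channels, conf_channels):
    print(f"{m} x {m}: loc head {lc} channels, conf head {cc} channels")
```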

Which feature maps are used for prediction is fixed by the paper; see the SSD structure diagram at the beginning. In the code this is reflected in

vgg_source = [21, -2]
extra_layers[1::2]

That is, the feature maps of the six layers conv4_3, conv7, conv8_2, conv9_2, conv10_2 and conv11_2.

Generating prior boxes

Prior box, default box and anchor box all mean the same thing.
Let's look at the principle behind prior boxes. They are similar to the anchor boxes of YOLOv3: predictions are made relative to the shapes of these boxes.

Making predictions with prior boxes on different feature maps addresses the problem of detecting objects of different sizes. Different feature maps are responsible for targets of different sizes, and each feature map cell is in turn responsible for targets of different aspect ratios at that size.

First, the scale that each feature map is responsible for:
\[ s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m] \]
With \( s_{min} = 0.2 \), \( s_{max} = 0.9 \) and \( m = 6 \) (we detect on six feature maps), s = {0.2, 0.34, 0.48, 0.62, 0.76, 0.9}.
Suppose the aspect ratios are \( a_r \in \{1, 2, 3, 1/2, 1/3\} \). The width and height of a box are then \[ w_k^a = s_k \sqrt{a_r}, \quad h_k^a = s_k / \sqrt{a_r} \] For the second feature map (the 19 x 19 one, conv7), whose scale is 0.2 in the code's configuration (a min_size of 60 on a 300 input), the box with aspect ratio 1 is (0.2, 0.2). The model's input size is (300, 300), so this box corresponds to (60, 60) pixels. The remaining default box shapes are obtained the same way, six in total (for aspect ratio 1, one extra box with scale \( s_k' = \sqrt{s_k s_{k+1}} \) is added). This gives the different box shapes that each feature map is responsible for predicting.
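The two formulas above can be checked numerically. This is only a sketch of the paper's formulas; the function names scales and box_shape are made up for illustration.

```python
from math import sqrt

def scales(s_min=0.2, s_max=0.9, m=6):
    """s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1) for k = 1..m."""
    return [s_min + (s_max - s_min) / (m - 1) * (k - 1) for k in range(1, m + 1)]

def box_shape(s_k, a_r):
    """Relative width and height of a default box with scale s_k and aspect ratio a_r."""
    return s_k * sqrt(a_r), s_k / sqrt(a_r)

s = scales()
print([round(x, 2) for x in s])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]

w, h = box_shape(0.2, 1)
print(round(w * 300), round(h * 300))  # 60 60

# Aspect ratio 2 keeps the area s_k^2 but makes the box twice as wide as tall.
w2, h2 = box_shape(0.2, 2)
print(round(w2, 4), round(h2, 4))  # 0.2828 0.1414
```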


For the conv4_3 layer, the number of default boxes per location is set to 4, so it has 38 x 38 x 4 boxes in total. Our predicted boxes are regressed from these boxes.

The numbers of default boxes for the six layers are (4, 6, 6, 6, 4, 4), so the final total is \( 38^2 \times 4 + 19^2 \times 6 + 10^2 \times 6 + 5^2 \times 6 + 3^2 \times 4 + 1^2 \times 4 = 8732 \) boxes.

In practice, parameter tuning focuses on adjusting these default boxes so that they fit the targets to be detected, similar to adjusting the anchor sizes when tuning YOLOv3.

Code

prior_box.py defines the PriorBox class; its forward function computes the default boxes.
The configuration lives in config.py
(Figure: SSD config)
where

    'min_sizes': [30, 60, 111, 162, 213, 264],
    'max_sizes': [60, 111, 162, 213, 264, 315],
    'aspect_ratios': [[2], [2, 3], [2, 3], [2, 3], [2], [2]],

These settings are used to compute the default boxes of each feature map. The way the configuration is defined is somewhat confusing: min_sizes and max_sizes are both used for boxes with aspect ratio 1, while [2] produces boxes with aspect ratios 2:1 and 1:2.

    def forward(self):
        mean = []
        for k, f in enumerate(self.feature_maps):  # in config.py, 'feature_maps': [38, 19, 10, 5, 3, 1]
            for i, j in product(range(f), repeat=2):
                f_k = self.image_size / self.steps[k]  # roughly equal to the feature map size; using f instead of f_k makes little difference
                # unit center x,y            # center of each feature map cell
                cx = (j + 0.5) / f_k
                cy = (i + 0.5) / f_k

                # aspect_ratio: 1
                # rel size: min_size
                s_k = self.min_sizes[k]/self.image_size  # min_sizes gives one shape with aspect ratio 1
                mean += [cx, cy, s_k, s_k]

                # aspect_ratio: 1
                # rel size: sqrt(s_k * s_(k+1))
                s_k_prime = sqrt(s_k * (self.max_sizes[k]/self.image_size))  # max_sizes gives another shape with aspect ratio 1
                mean += [cx, cy, s_k_prime, s_k_prime]

                # rest of aspect ratios
                for ar in self.aspect_ratios[k]:  # e.g. [2, 3] gives 4 shapes: 2:1, 1:2, 3:1, 1:3
                    mean += [cx, cy, s_k*sqrt(ar), s_k/sqrt(ar)]
                    mean += [cx, cy, s_k/sqrt(ar), s_k*sqrt(ar)]

Taking the first cell of the 38 x 38 feature map as an example, four default boxes are computed. The first two values are the box center, followed by the width and height. All values are relative to the image size.

tensor([[0.0133, 0.0133, 0.1000, 0.1000],
        [0.0133, 0.0133, 0.1414, 0.1414],
        [0.0133, 0.0133, 0.1414, 0.0707],
        [0.0133, 0.0133, 0.0707, 0.1414]])
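The same computation can be reproduced without torch. The sketch below is a plain-Python re-implementation of the loop above, not the repository's class. The steps values are an assumption taken from the repository's config.py, since they are not listed in this post.

```python
from itertools import product
from math import sqrt

image_size = 300
feature_maps = [38, 19, 10, 5, 3, 1]
steps = [8, 16, 32, 64, 100, 300]  # assumption: values from the repository's config.py
min_sizes = [30, 60, 111, 162, 213, 264]
max_sizes = [60, 111, 162, 213, 264, 315]
aspect_ratios = [[2], [2, 3], [2, 3], [2, 3], [2], [2]]

def prior_boxes():
    """Return all default boxes as (cx, cy, w, h) tuples in relative coordinates."""
    boxes = []
    for k, f in enumerate(feature_maps):
        f_k = image_size / steps[k]
        for i, j in product(range(f), repeat=2):
            cx, cy = (j + 0.5) / f_k, (i + 0.5) / f_k
            s_k = min_sizes[k] / image_size
            s_k_prime = sqrt(s_k * (max_sizes[k] / image_size))
            boxes.append((cx, cy, s_k, s_k))
            boxes.append((cx, cy, s_k_prime, s_k_prime))
            for ar in aspect_ratios[k]:
                boxes.append((cx, cy, s_k * sqrt(ar), s_k / sqrt(ar)))
                boxes.append((cx, cy, s_k / sqrt(ar), s_k * sqrt(ar)))
    return boxes

boxes = prior_boxes()
print(len(boxes))  # 8732
for b in boxes[:4]:
    print([round(v, 4) for v in b])
```

The first four rows match the tensor shown above, and the total agrees with the 8732 boxes counted earlier.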

Generating predicted boxes

Meaning of the tensor produced by convolving a feature map

Convolving each feature map gives a tensor that holds 4 values (t_x, t_y, t_w, t_h) for every box. We use these numbers, together with the default box, to obtain the coordinates of the predicted box: the network predicts an offset relative to the reference box, which is why coordinate prediction is also called box regression. Put differently, predicted box = anchor box x transformation, and what we regress is the transformation, namely (t_x, t_y, t_w, t_h).
That is,

               b_center_x = t_x *prior_variance[0]* p_width + p_center_x
               b_center_y = t_y *prior_variance[1] * p_height + p_center_y
               b_width = exp(prior_variance[2] * t_w) * p_width
               b_height = exp(prior_variance[3] * t_h) * p_height
or, without the variance terms,

               b_center_x = t_x * p_width + p_center_x
               b_center_y = t_y * p_height + p_center_y
               b_width = exp(t_w) * p_width
               b_height = exp(t_h) * p_height

Here p_* denotes the default box and b_* the coordinates of the final predicted box.
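A minimal sketch of these formulas for a single box follows. decode_box is a hypothetical helper. It uses two variance values, one for the center terms and one for the size terms, the same way the repository's decode function indexes them; the values (0.1, 0.2) are an assumption based on the repository's config.

```python
from math import exp

def decode_box(t, p, variances=(0.1, 0.2)):
    """t = (t_x, t_y, t_w, t_h); p = (p_center_x, p_center_y, p_width, p_height)."""
    b_center_x = t[0] * variances[0] * p[2] + p[0]
    b_center_y = t[1] * variances[0] * p[3] + p[1]
    b_width = exp(variances[1] * t[2]) * p[2]
    b_height = exp(variances[1] * t[3]) * p[3]
    return b_center_x, b_center_y, b_width, b_height

prior = (0.5, 0.5, 0.2, 0.2)

# Zero offsets return the default box unchanged.
print(decode_box((0.0, 0.0, 0.0, 0.0), prior))

# A positive t_x shifts the center to the right, scaled by the prior width.
print(decode_box((1.0, 0.0, 0.0, 0.0), prior))
```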

At this point we have a lot of boxes (8732). We filter them to get the boxes we finally output.
Pseudocode

for every conv box:
    for every class :
        if class_prob < threshold:
            continue
        predict_box = decode(convbox)

        nms(predict_box)  # remove boxes that are very close to each other

Code

detection.py

class Detect(Function):
    def forward(self, loc_data, conf_data, prior_data):
        ##loc_data [batch,8732,4]
        ##conf_data [batch,8732,1+class]
        ##prior_data [8732,4]

        num = loc_data.size(0)  # batch size
        num_priors = prior_data.size(0)
        output = torch.zeros(num, self.num_classes, self.top_k, 5)
        conf_preds = conf_data.view(num, num_priors,
                                    self.num_classes).transpose(2, 1)

        # Decode predictions into bboxes.
        for i in range(num):
            decoded_boxes = decode(loc_data[i], prior_data, self.variance)
            # For each class, perform nms
            conf_scores = conf_preds[i].clone()

            for cl in range(1, self.num_classes):
                c_mask = conf_scores[cl].gt(self.conf_thresh)
                scores = conf_scores[cl][c_mask]
                if scores.size(0) == 0:
                    continue
                l_mask = c_mask.unsqueeze(1).expand_as(decoded_boxes)
                boxes = decoded_boxes[l_mask].view(-1, 4)
                # idx of highest scoring and non-overlapping boxes per class
                ids, count = nms(boxes, scores, self.nms_thresh, self.top_k)
                output[i, cl, :count] = \
                    torch.cat((scores[ids[:count]].unsqueeze(1),
                               boxes[ids[:count]]), 1)
        flt = output.contiguous().view(num, -1, 5)
        _, idx = flt[:, :, 0].sort(1, descending=True)
        _, rank = idx.sort(1)
        flt[(rank < self.top_k).unsqueeze(-1).expand_as(flt)].fill_(0)
        return output
    

The core logic is in box_utils.py

  • decode: computes the box coordinates from the convolution output
def decode(loc, priors, variances):
    boxes = torch.cat((
        priors[:, :2] + loc[:, :2] * variances[0] * priors[:, 2:],
        priors[:, 2:] * torch.exp(loc[:, 2:] * variances[1])), 1)
    boxes[:, :2] -= boxes[:, 2:] / 2
    boxes[:, 2:] += boxes[:, :2]
    return boxes

These two lines perform the (center_x, center_y, w, h) -> (xmin, ymin, xmax, ymax) conversion.

    boxes[:, :2] -= boxes[:, 2:] / 2
    boxes[:, 2:] += boxes[:, :2]

The returned boxes are already in (xmin, ymin, xmax, ymax) form.

  • nms: if two boxes overlap by more than 0.5, they are considered to be the same object and only the box with the higher probability is kept
def nms(boxes, scores, overlap=0.5, top_k=200):
    """Apply non-maximum suppression at test time to avoid detecting too many
    overlapping bounding boxes for a given object.
    Args:
        boxes: (tensor) The location preds for the img, Shape: [num_priors,4].
        scores: (tensor) The class predscores for the img, Shape:[num_priors].
        overlap: (float) The overlap thresh for suppressing unnecessary boxes.
        top_k: (int) The Maximum number of box preds to consider.
    Return:
        The indices of the kept boxes with respect to num_priors.
    """

    keep = scores.new(scores.size(0)).zero_().long()
    if boxes.numel() == 0:
        return keep, 0
    x1 = boxes[:, 0]
    y1 = boxes[:, 1]
    x2 = boxes[:, 2]
    y2 = boxes[:, 3]
    area = torch.mul(x2 - x1, y2 - y1)
    v, idx = scores.sort(0)  # sort in ascending order
    # I = I[v >= 0.01]
    idx = idx[-top_k:]  # indices of the top-k largest vals
    xx1 = boxes.new()
    yy1 = boxes.new()
    xx2 = boxes.new()
    yy2 = boxes.new()
    w = boxes.new()
    h = boxes.new()

    # keep = torch.Tensor()
    count = 0
    while idx.numel() > 0:
        i = idx[-1]  # index of current largest val
        # keep.append(i)
        keep[count] = i
        count += 1
        if idx.size(0) == 1:
            break
        idx = idx[:-1]  # remove kept element from view
        # load bboxes of next highest vals
        torch.index_select(x1, 0, idx, out=xx1)
        torch.index_select(y1, 0, idx, out=yy1)
        torch.index_select(x2, 0, idx, out=xx2)
        torch.index_select(y2, 0, idx, out=yy2)
        # store element-wise max with next highest score
        xx1 = torch.clamp(xx1, min=x1[i])
        yy1 = torch.clamp(yy1, min=y1[i])
        xx2 = torch.clamp(xx2, max=x2[i])
        yy2 = torch.clamp(yy2, max=y2[i])
        w.resize_as_(xx2)
        h.resize_as_(yy2)
        w = xx2 - xx1
        h = yy2 - yy1
        # check sizes of xx1 and xx2.. after each iteration
        w = torch.clamp(w, min=0.0)
        h = torch.clamp(h, min=0.0)
        inter = w*h
        # IoU = i / (area(a) + area(b) - i)
        rem_areas = torch.index_select(area, 0, idx)  # load remaining areas)
        union = (rem_areas - inter) + area[i]
        IoU = inter/union  # store result in iou
        # keep only elements with an IoU <= overlap
        idx = idx[IoU.le(overlap)]
    return keep, count
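The tensor bookkeeping above hides a simple greedy loop. Here is a plain-Python sketch of the same idea; it is not the repository's implementation, and nms_simple and iou are illustrative names.

```python
def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms_simple(boxes, scores, overlap=0.5, top_k=200):
    """Greedy NMS: repeatedly keep the best box, drop boxes overlapping it too much."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= overlap]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms_simple(boxes, scores))  # [0, 2]
```

Box 1 overlaps box 0 with IoU 81 / 119, about 0.68, so it is suppressed; box 2 does not overlap and is kept.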

That covers the structure of the SSD network and the meaning of each layer's output. It is enough to understand inference, i.e. how the model predicts box locations for a given image. Next we turn to the training process.

loss calculation

The first problem to solve is box matching: during training, which ground truth should each predicted box be compared against? We need a rule for this before we can compute the difference between the model's predicted boxes and the ground truth boxes.
(Figure: default boxes matched to ground truth boxes)
As shown in the figure, the ground truth box of the cat matches two default boxes and the ground truth box of the dog matches one default box.

Matching strategy

(Figure: matching strategy)
The matching strategy is:

  • Match each gt box against the prior boxes: the prior box with the highest IoU with a gt box is selected as a positive sample
  • Any prior box whose IoU with a gt box is greater than 0.5 is also selected as a positive sample

One question bothered me for a long time: doesn't the second step already include the first? Then one day it clicked: it is possible that every prior box has IoU < threshold with a gt box, and the first step guarantees that each gt box gets at least one prior box.

box_utils.py

def match(threshold, truths, priors, variances, labels, loc_t, conf_t, idx):
    # jaccard index     #[objects_num,priorbox_num]
    overlaps = jaccard(
        truths,
        point_form(priors)
    )
    # (Bipartite Matching)
    # [num_objects,1] best prior for each ground truth
    best_prior_overlap, best_prior_idx = overlaps.max(1, keepdim=True)  # max of each row, i.e. which prior box has the highest IoU with each gt box
    # [1,num_priors] best ground truth for each prior
    best_truth_overlap, best_truth_idx = overlaps.max(0, keepdim=True)  # max of each column, i.e. which gt box has the highest IoU with each prior box
    best_truth_idx.squeeze_(0)  # shape [1,num_priors] -> [num_priors]
    best_truth_overlap.squeeze_(0)  # same as above
    best_prior_idx.squeeze_(1)  # shape [num_objects,1] -> [num_objects]
    best_prior_overlap.squeeze_(1)
    best_truth_overlap.index_fill_(0, best_prior_idx, 2)  # ensure best prior: set the overlap at the best_prior_idx positions to 2 so that it is certainly > threshold
    # TODO refactor: index  best_prior_idx with long tensor
    # ensure every gt matches with its prior of max overlap
    for j in range(best_prior_idx.size(0)):
        best_truth_idx[best_prior_idx[j]] = j
    matches = truths[best_truth_idx]          # Shape: [num_priors,4]
    conf = labels[best_truth_idx] + 1         # Shape: [num_priors]
    conf[best_truth_overlap < threshold] = 0  # label as background
    loc = encode(matches, priors, variances)
    loc_t[idx] = loc    # [num_priors,4] encoded offsets to learn
    conf_t[idx] = conf  # [num_priors] top class label for each prior 

The logic here is a bit convoluted; a concrete example makes it easier to follow.
Suppose an image contains two objects, hence two gt boxes, and suppose three prior boxes are computed (8732 in reality). Computing the IoU of every gt box with every prior box gives overlaps, a matrix with two rows and three columns.

import torch
# Suppose an image has 2 objects and 3 prior boxes, with the IoUs given in overlaps
truths = torch.Tensor([[1,2,3,4],[5,6,7,8]])  # 2 gt boxes, each defined by four coordinates
labels = torch.Tensor([[5],[6]])  # the 2 objects belong to class 5 and class 6
overlaps = torch.Tensor([[0.1,0.4,0.3],[0.5,0.2,0.6]])
#overlaps = torch.Tensor([[0.9,0.9,0.9],[0.8,0.8,0.8]])

best_prior_overlap, best_prior_idx = overlaps.max(1, keepdim=True)  # [2,1]
#print(best_prior_overlap)
#print(best_prior_idx)  # index of the prior box with the highest IoU for each gt box

best_truth_overlap, best_truth_idx = overlaps.max(0, keepdim=True)  # max of each column, i.e. which gt box has the highest IoU with each prior box
#print(best_truth_overlap)  # [1,3]
#print(best_truth_idx)  # index of the gt box with the highest IoU for each prior box

best_truth_idx.squeeze_(0)  # shape [1,num_priors] -> [num_priors]
best_truth_overlap.squeeze_(0)  # same as above

best_prior_idx.squeeze_(1)  # shape [num_objects,1] -> [num_objects]
best_prior_overlap.squeeze_(1)

print(best_prior_idx)
print(best_truth_idx)

# Set the IoU of the prior box that best matches each gt box to 2 (anything above the threshold works) so that this prior box is certain to be kept.
best_truth_overlap.index_fill_(0, best_prior_idx, 2)

# For example, if every prior box has IoU 0.9 with gt box 1 and IoU 0.8 with gt box 2, we must make sure the best prior box of gt box 2 is matched to gt box 2 rather than gt box 1.
# Try overlaps = torch.Tensor([[0.9,0.9,0.9],[0.8,0.8,0.8]]) to see this
for j in range(best_prior_idx.size(0)):
    print(j)
    best_truth_idx[best_prior_idx[j]] = j

print(best_truth_idx)

matches = truths[best_truth_idx]  # [3,4] each row holds the coordinates of the gt box matched to a prior box
print(matches)

print(best_truth_overlap)

conf = labels[best_truth_idx] + 1  # [3,1] each row is the class of the gt box matched to that prior box
print(conf.shape)
#conf[best_truth_overlap < threshold] = 0  # filter out matches with too low an IoU, mark them as background

At this point we have matches, i.e. the gt box corresponding to every prior box, and conf, i.e. the class each prior box belongs to. If the IoU is too low, the class is marked as background.
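For readers who prefer lists to tensors, the whole matching step can be sketched in plain Python. match_simple is a hypothetical helper that follows the same two rules; it is an illustration, not the repository's match function.

```python
def match_simple(overlaps, labels, threshold=0.5):
    """overlaps[g][p] is the IoU of gt box g and prior box p.

    Returns the matched gt index and the class (0 = background) of every prior box.
    """
    num_gt, num_priors = len(overlaps), len(overlaps[0])
    # Best gt box for every prior box (column-wise max).
    best_truth_idx = [max(range(num_gt), key=lambda g: overlaps[g][p])
                      for p in range(num_priors)]
    best_truth_overlap = [overlaps[best_truth_idx[p]][p] for p in range(num_priors)]
    # Best prior box for every gt box (row-wise max), forced to be a match.
    for g in range(num_gt):
        p = max(range(num_priors), key=lambda q: overlaps[g][q])
        best_truth_idx[p] = g
        best_truth_overlap[p] = 2.0
    conf = [labels[g] + 1 if o >= threshold else 0
            for g, o in zip(best_truth_idx, best_truth_overlap)]
    return best_truth_idx, conf

idx, conf = match_simple([[0.1, 0.4, 0.3], [0.5, 0.2, 0.6]], labels=[5, 6])
print(idx, conf)  # [1, 0, 1] [7, 6, 7]
```

Prior box 1 only has IoU 0.4 with gt box 0, below the threshold, yet it is still a positive sample because it is the best prior box for that gt box.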

Next comes encode:

def encode(matched, priors, variances):
    # dist b/t match center and prior's center
    g_cxcy = (matched[:, :2] + matched[:, 2:])/2 - priors[:, :2]
    # encode variance
    g_cxcy /= (variances[0] * priors[:, 2:])
    # match wh / prior wh
    g_wh = (matched[:, 2:] - matched[:, :2]) / priors[:, 2:]
    g_wh = torch.log(g_wh) / variances[1]
    # return target for smooth_l1_loss
    return torch.cat([g_cxcy, g_wh], 1)  # [num_priors,4]

This measures the difference between a gt box and its matched prior box. Note that matched is in (lefttop_x, lefttop_y, rightbottom_x, rightbottom_y) format.
So what we obtain here is the offset between the gt box and the prior box.
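Since encode is the inverse of decode, a round trip should give back the gt box. The single-box sketch below checks that. The helper names are hypothetical and the variances (0.1, 0.2) are an assumption based on the repository's config.

```python
from math import exp, log

def encode_box(gt, prior, variances=(0.1, 0.2)):
    """gt is (xmin, ymin, xmax, ymax); prior is (cx, cy, w, h). Returns (t_x, t_y, t_w, t_h)."""
    g_cx = (gt[0] + gt[2]) / 2
    g_cy = (gt[1] + gt[3]) / 2
    g_w, g_h = gt[2] - gt[0], gt[3] - gt[1]
    return ((g_cx - prior[0]) / (variances[0] * prior[2]),
            (g_cy - prior[1]) / (variances[0] * prior[3]),
            log(g_w / prior[2]) / variances[1],
            log(g_h / prior[3]) / variances[1])

def decode_to_corners(t, prior, variances=(0.1, 0.2)):
    """Inverse of encode_box; returns (xmin, ymin, xmax, ymax)."""
    cx = prior[0] + t[0] * variances[0] * prior[2]
    cy = prior[1] + t[1] * variances[0] * prior[3]
    w = prior[2] * exp(t[2] * variances[1])
    h = prior[3] * exp(t[3] * variances[1])
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

gt = (0.30, 0.35, 0.60, 0.75)
prior = (0.5, 0.5, 0.2, 0.3)
t = encode_box(gt, prior)
print([round(v, 4) for v in t])
print([round(v, 4) for v in decode_to_corners(t, prior)])  # recovers gt
```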

Calculation of loss

All prior boxes fall into three groups:

  • Positive samples
  • The top xx negative samples ranked by loss
  • The remaining negative samples

A positive sample is a prior box whose IoU with a ground truth box exceeds the threshold, or which has the highest IoU with a ground truth box.
A negative sample is any prior box that is not a positive sample.

The loss function has two parts: the loss on the coordinate offsets and the loss on the class information.

The loc loss considers only positive samples. The conf loss considers both positive and negative samples, keeping negative : positive = 3 : 1.

The code is in multibox_loss.py

class MultiBoxLoss(nn.Module):
    def forward(self, predictions, targets):
        ...

It can be expressed as pseudocode

# Use the matching strategy to get the gt box for each prior box
# Select the positive prior boxes by IoU
# Compute the conf loss
# Select the top xx negative prior boxes by loss, keeping neg:pos = 3:1
# Compute the cross entropy
# Normalize

  • Coordinate offset loss
        pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)
        loc_p = loc_data[pos_idx].view(-1, 4)  # predicted offsets
        loc_t = loc_t[pos_idx].view(-1, 4)     # ground truth offsets
        loss_l = F.smooth_l1_loss(loc_p, loc_t, size_average=False)  # what we regress is the offset relative to the default box

smooth_l1_loss is used. The code is fairly simple, so there is not much more to say about it.
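For reference, smooth L1 is quadratic for small errors and linear for large ones. Below is a plain-Python sketch of the summed form (matching size_average=False), assuming the default transition point of 1.

```python
def smooth_l1(pred, target):
    """Sum of smooth L1 terms: 0.5 * d^2 if |d| < 1, else |d| - 0.5."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1 else d - 0.5
    return total

# One small error (quadratic branch) and one large error (linear branch).
print(smooth_l1([0.5, 3.0], [0.0, 0.0]))  # 0.125 + 2.5 = 2.625
```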

Hard negative mining
After matching default boxes to gt boxes, many default boxes are inevitably left unmatched: there are only a few positive samples and a large number of negative samples. Instead of using all the negatives, we sort the default boxes by confidence loss in descending order and take only the top ones, so that negative : positive is 3 : 1. This makes optimization faster and training more stable.

For more on imbalance in object detection, see https://zhuanlan.zhihu.com/p/60612064

Put simply, negative samples let the model learn background information and positive samples let it learn target information, so both are needed in a suitable ratio; the paper uses 3 : 1.
In the code this corresponds to MultiBoxLoss.negpos_ratio

        # Compute max conf across batch for hard negative mining
        batch_conf = conf_data.view(-1, self.num_classes)  #[batch*8732,21] 
        loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))  # conf_t holds the class label of each prior box

        # Hard Negative Mining
        loss_c[pos] = 0  # filter out pos boxes for now
        loss_c = loss_c.view(num, -1)
        _, loss_idx = loss_c.sort(1, descending=True)
        _, idx_rank = loss_idx.sort(1)
        num_pos = pos.long().sum(1, keepdim=True)
        num_neg = torch.clamp(self.negpos_ratio*num_pos, max=pos.size(1)-1)
        # mask of the selected negative samples
        neg = idx_rank < num_neg.expand_as(idx_rank)

Note that the loss_c computed here is not the network's final conf loss, i.e. not the L_conf of the paper; it is only used to rank the negative samples.

def log_sum_exp(x):
    """Utility function for computing log_sum_exp while determining
    This will be used to determine unaveraged confidence loss across
    all examples in a batch.
    Args:
        x (Variable(tensor)): conf_preds from conf layers
    """
    x_max = x.data.max()  
    return torch.log(torch.sum(torch.exp(x-x_max), 1, keepdim=True)) + x_max

This uses a trick; see https://github.com/amdegroot/ssd.pytorch/issues/203 and https://stackoverflow.com/questions/42599498/numercially-stable-softmax
To keep e^n from becoming too large or too small to compute, this trick is commonly used when computing softmax.

This function seriously got in the way of my understanding of loss_c. Mathematically, x_max can be dropped from the function above, and loss_c
then becomes

loss_c = torch.log(torch.sum(torch.exp(batch_conf), 1, keepdim=True)) - batch_conf.gather(1, conf_t.view(-1, 1))

which is much easier to understand.

conf_t holds the label index of each prior box. batch_conf.gather(1, conf_t.view(-1, 1)) gives a [batch * 8732, 1] tensor that keeps, for each prior box, only the predicted score of its label.

So the loss is the log of the summed exponentials over all classes minus the score of the label the prior box is responsible for, which is exactly the cross entropy -log(softmax) of that label.
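Both claims, that subtracting the maximum does not change the result and that the difference equals the cross entropy, are easy to check in plain Python. Note one deliberate difference: this sketch subtracts the maximum of each score vector, whereas the repository's function subtracts a single global maximum. The function conf_loss is a made-up name.

```python
from math import exp, log

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x))) for one vector of class scores."""
    x_max = max(xs)
    return log(sum(exp(x - x_max) for x in xs)) + x_max

def conf_loss(scores, label):
    """log-sum-exp over all classes minus the score of the true label."""
    return log_sum_exp(scores) - scores[label]

scores = [2.0, 0.5, -1.0]
softmax = [exp(s) / sum(exp(t) for t in scores) for s in scores]

# The two numbers agree: the loss is -log(softmax[label]).
print(conf_loss(scores, 0), -log(softmax[0]))

# With large scores the naive form would overflow in exp, the stable one does not.
print(log_sum_exp([1000.0, 1000.0]))  # 1000 + log(2)
```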

After getting loss_c, we obtain the indices of the positive / negative samples

        # Select the negatives with the largest loss; negative : positive = 3 : 1
        # Hard Negative Mining
        loss_c = loss_c.view(num, -1)  # [batch,8732]
        loss_c[pos] = 0  # filter out pos boxes for now
        _, loss_idx = loss_c.sort(1, descending=True)  # sort each image's prior-box conf loss in descending order
        print(_[0,:],loss_idx[0])  # [batch,8732] each entry is a prior box index
        _, idx_rank = loss_idx.sort(1)
        print(_[0,:],idx_rank[0,:])  # [batch,8732] each entry is the position of that prior box in loss_idx; we keep the first xx (xx = 3 x number of positives)
        num_pos = pos.long().sum(1, keepdim=True)
        print(num_pos)  # [batch,1] number of positive samples per image
        # number of negatives: 3 x the number of positives, capped at the number of prior boxes - 1
        num_neg = torch.clamp(self.negpos_ratio*num_pos, max=pos.size(1)-1)
        print(num_neg)
        # select the num_neg negatives with the highest loss
        neg = idx_rank < num_neg.expand_as(idx_rank)
        print(neg)

At this point we have the indices of the positive and negative samples, and can compute the difference between the predicted and true values.
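The double sort is the least obvious part: sorting the sorted indices turns "which prior box is at position r" into "what is the rank of prior box i". Here is a plain-Python sketch for a single image; hard_negatives is a hypothetical helper, not the repository's code.

```python
def hard_negatives(loss, pos, negpos_ratio=3):
    """loss and pos are per-prior lists for one image; returns a mask of chosen negatives."""
    # Zero out positives so that they sort to the end.
    masked = [0.0 if is_pos else l for l, is_pos in zip(loss, pos)]
    # loss_idx[r] = index of the prior box with the r-th largest loss.
    loss_idx = sorted(range(len(masked)), key=lambda i: masked[i], reverse=True)
    # idx_rank[i] = rank of prior box i (what the second sort computes).
    idx_rank = [0] * len(masked)
    for rank, i in enumerate(loss_idx):
        idx_rank[i] = rank
    num_neg = min(negpos_ratio * sum(pos), len(pos) - 1)
    return [r < num_neg for r in idx_rank]

loss = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.05, 0.6]
pos = [True, False, False, False, False, False, False, False]
neg = hard_negatives(loss, pos)
print(neg)  # prior boxes 2, 4 and 7 are selected
```

With one positive sample, the three negatives with the largest loss (0.8, 0.7 and 0.6) are kept.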

        loss_c = F.cross_entropy(conf_p, targets_weighted, size_average=False)

        # Sum of losses: L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N

        N = num_pos.data.sum()
        loss_l /= N
        loss_c /= N
        return loss_l, loss_c

The loss is measured with cross entropy and finally divided by the number of positive samples as normalization.
https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss
CrossEntropyLoss applies log-softmax internally, so there is no need to convert the scores to probabilities with softmax by hand before computing the loss.

training

With the network structure built and the loss implemented, we can move on to training.
It is implemented in train.py.
The main logic, simplified, is as follows:

    ssd_net = build_ssd('train', cfg['min_dim'], cfg['num_classes'])
    net = ssd_net
    
    optimizer = optim.SGD(net.parameters(), lr=args.lr, momentum=args.momentum,
                          weight_decay=args.weight_decay)
    criterion = MultiBoxLoss(cfg['num_classes'], 0.5, True, 0, True, 3, 0.5,
                             False, args.cuda)
    
    for iteration in range(args.start_iter, cfg['max_iter']):
        # load train data
        images, targets = next(batch_iterator)

        # forward
        out = net(images)

        # backprop
        optimizer.zero_grad()
        loss_l, loss_c = criterion(out, targets)
        loss = loss_l + loss_c
        loss.backward()
        optimizer.step()
    

That is:

  • Define the network structure
  • Define the loss function and the optimizer (the gradient method used with back-propagation)
  • Load the training set
  • Run the forward pass to obtain the predictions
  • Compute the loss
  • Back-propagate and update the network weights

For the usage of some of the torch functions involved, see: https://www.cnblogs.com/sdu20112013/p/11731741.html


Origin www.cnblogs.com/jwcz/p/11759802.html