These notes read the code of https://github.com/amdegroot/ssd.pytorch alongside the paper https://arxiv.org/abs/1512.02325 in order to understand SSD.
SSD consists of three parts:
- base
- extra
- predict
In the original paper, the base is vgg16 with its fully connected layers removed.
Base + extra together perform feature extraction and yield feature maps of different sizes. On each of these feature maps, separate convolutions then predict the classes and the box coordinates.
Basic feature extraction network
The feature extraction network consists of two parts:
- vgg16
- extra layer
vgg16 variant
vgg16 structure:
The fully connected layers of vgg16 are replaced with convolutional layers.
Code
In ssd.py:
base = {
    '300': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'C', 512, 512, 512, 'M',
            512, 512, 512],
    '512': [],
}
extras = {
    '300': [256, 'S', 512, 128, 'S', 256, 128, 256, 128, 256],
    '512': [],
}
Each number is the number of output channels of a convolutional layer. 'M' and 'C' both denote a max-pooling layer; the only difference is that 'C' uses ceil instead of floor to compute the output shape.
See https://pytorch.org/docs/stable/nn.html#maxpool2d
def vgg(cfg, i, batch_norm=False):
    layers = []
    in_channels = i
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        elif v == 'C':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
    conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
    conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
    layers += [pool5, conv6,
               nn.ReLU(inplace=True), conv7, nn.ReLU(inplace=True)]
    return layers
This forms the base feature extraction network. The front portion is the same as vgg16; the fully connected layers are replaced with conv6 + relu + conv7 + relu.
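To see what the 'C' (ceil) pooling layer changes, it can help to trace the spatial size of the feature map through the pooling layers. The sketch below is plain Python rather than code from the repository: pool_out is a hypothetical helper that applies the usual pooling output-size formula, and a 300 x 300 input is assumed. The 3 x 3 convolutions with padding 1 keep the size unchanged, so only the pooling layers are traced.

```python
import math

def pool_out(size, kernel, stride, padding=0, ceil_mode=False):
    """Output size of a pooling layer along one dimension (hypothetical helper)."""
    rounding = math.ceil if ceil_mode else math.floor
    return rounding((size + 2 * padding - kernel) / stride) + 1

size = 300  # assumed input resolution
sizes = []
for v in ['M', 'M', 'C', 'M']:  # the pooling entries of base['300'], in order
    size = pool_out(size, kernel=2, stride=2, ceil_mode=(v == 'C'))
    sizes.append(size)

# pool5 (kernel 3, stride 1, padding 1) keeps the size unchanged
pool5_size = pool_out(sizes[-1], kernel=3, stride=1, padding=1)
# what the third pool would give with floor rounding instead of ceil
floor_size = pool_out(75, kernel=2, stride=2)
print(sizes, pool5_size, floor_size)
```

Under these assumptions the ceil rounding turns the 75 x 75 map into 38 x 38 instead of 37 x 37, which is where the 38 and 19 feature map sizes used later come from.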
extra layer
Further convolutions are applied on top of the base network's output to obtain more feature maps at different scales.
Code
extras = {
    '300': [256, 'S', 512, 128, 'S', 256, 128, 256, 128, 256],
    '512': [],
}
def add_extras(cfg, i, batch_norm=False):
    # Extra layers added to VGG for feature scaling
    layers = []
    in_channels = i
    flag = False
    for k, v in enumerate(cfg):
        if in_channels != 'S':
            if v == 'S':
                layers += [nn.Conv2d(in_channels, cfg[k + 1],
                           kernel_size=(1, 3)[flag], stride=2, padding=1)]
            else:
                layers += [nn.Conv2d(in_channels, v, kernel_size=(1, 3)[flag])]
            flag = not flag
        in_channels = v
    return layers
add_extras(extras[str(size)], 1024)
The layers are created from [256, 'S', 512, 128, 'S', 256, 128, 256, 128, 256].
The kernel size alternates between 1 x 1 and 3 x 3 (toggled by flag). An 'S' marks a 3 x 3 convolution with stride 2 whose number of kernels is the number that follows the 'S'; every other number is the number of kernels of the layer it creates.
This builds the extra layers.
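The loop in add_extras is easier to follow when its result is written out. The following sketch mirrors that loop in plain Python and records a (in_channels, out_channels, kernel, stride, padding) tuple instead of building nn.Conv2d layers. extras_spec and conv_out are hypothetical helper names, and the starting size of 19 assumes a 300 x 300 input.

```python
def extras_spec(cfg, in_channels):
    """Mirror the add_extras loop, returning (in, out, kernel, stride, padding) tuples."""
    specs = []
    flag = False  # False selects a 1x1 kernel, True a 3x3 kernel
    for k, v in enumerate(cfg):
        if in_channels != 'S':
            kernel = (1, 3)[flag]
            if v == 'S':
                specs.append((in_channels, cfg[k + 1], kernel, 2, 1))
            else:
                specs.append((in_channels, v, kernel, 1, 0))
            flag = not flag
        in_channels = v
    return specs

def conv_out(size, kernel, stride, padding):
    """Standard convolution output-size formula along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

specs = extras_spec([256, 'S', 512, 128, 'S', 256, 128, 256, 128, 256], 1024)
size = 19  # assumed spatial size of the conv7 output
sizes = []
for _, _, kernel, stride, padding in specs:
    size = conv_out(size, kernel, stride, padding)
    sizes.append(size)
for spec, s in zip(specs, sizes):
    print(spec, '->', s)
```

Eight layers come out of the ten config entries, and every second layer (the 3 x 3 ones) shrinks the feature map: 10, 5, 3 and finally 1.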
Multi-scale detection multibox
We now have the outputs (feature maps) of many layers, in various sizes. We apply convolutions to the feature maps of certain layers (conv4_3, conv7, conv8_2, conv9_2, conv10_2, conv11_2) to obtain location and class information.
Each of these feature maps goes through two 3 x 3 convolutions, one predicting the class and one predicting the position. Their numbers of kernels are boxnum x classes_num and boxnum x 4 respectively (a box is determined by four parameters: the center coordinates, width and height).
That is, convolving an m x m feature map yields an m x m x (boxnum x classes_num) tensor and an m x m x (boxnum x 4) tensor, used to compute the class probabilities and the box positions respectively.
Code
def multibox(vgg, extra_layers, cfg, num_classes):
    loc_layers = []
    conf_layers = []
    vgg_source = [21, -2]
    for k, v in enumerate(vgg_source):
        loc_layers += [nn.Conv2d(vgg[v].out_channels,
                                 cfg[k] * 4, kernel_size=3, padding=1)]
        conf_layers += [nn.Conv2d(vgg[v].out_channels,
                                  cfg[k] * num_classes, kernel_size=3, padding=1)]
    for k, v in enumerate(extra_layers[1::2], 2):
        loc_layers += [nn.Conv2d(v.out_channels, cfg[k]
                                 * 4, kernel_size=3, padding=1)]
        conf_layers += [nn.Conv2d(v.out_channels, cfg[k]
                                  * num_classes, kernel_size=3, padding=1)]
    return vgg, extra_layers, (loc_layers, conf_layers)
The number of boxes predicted at each location of each feature map is given by the following variable.
mbox = {
    '300': [4, 6, 6, 6, 4, 4],  # number of boxes per feature map location
    '512': [],
}
The layers whose feature maps are used for prediction are fixed by the paper; see the SSD block diagram at the beginning. In the code this shows up as
vgg_source = [21, -2]
extra_layers[1::2]
That is, the feature maps of the six layers conv4_3, conv7, conv8_2, conv9_2, conv10_2 and conv11_2.
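As a rough check of the shapes described above, the snippet below lists the output shape of the loc and conf heads for each of the six source layers. It is only a sketch: num_classes = 21 is assumed (it matches the [batch*8732, 21] shape that appears later in these notes), and the feature map sizes and channel counts are taken from the configs for a 300 x 300 input.

```python
num_classes = 21  # assumption: 20 object classes + background
mbox = [4, 6, 6, 6, 4, 4]
# (layer name, feature map size m, input channels) of the six source layers
sources = [('conv4_3', 38, 512), ('conv7', 19, 1024), ('conv8_2', 10, 512),
           ('conv9_2', 5, 256), ('conv10_2', 3, 256), ('conv11_2', 1, 256)]

head_shapes = []
for (name, m, in_ch), boxes in zip(sources, mbox):
    loc_shape = (m, m, boxes * 4)             # 4 offsets per box
    conf_shape = (m, m, boxes * num_classes)  # one score per class per box
    head_shapes.append((name, loc_shape, conf_shape))
    print(name, 'in_channels', in_ch, 'loc', loc_shape, 'conf', conf_shape)
```

For conv4_3, for instance, the loc head has 4 x 4 = 16 output channels and the conf head 4 x 21 = 84.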
Generating prior boxes
Prior box, default box and anchor box all mean the same thing.
First, the principle behind prior boxes. They are similar to the anchor boxes of yolov3: predictions are made relative to the shapes of these boxes.
Prior boxes, together with predicting on several feature maps, address the problem of detecting objects of different sizes. Different feature maps are responsible for targets of different sizes, and each feature map cell is additionally responsible for targets of different aspect ratios at that size.
First, the scale each feature map is responsible for:
\[ s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m] \]
where \(s_{min} = 0.2\), \(s_{max} = 0.9\) and \(m = 6\) (we detect on 6 feature maps), hence s = {0.2, 0.34, 0.48, 0.62, 0.76, 0.9}.
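The scale values can be reproduced directly from the formula. A minimal sketch with the values quoted above (0.2, 0.9 and 6 feature maps):

```python
s_min, s_max, m = 0.2, 0.9, 6

# s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1), for k = 1..m
scales = [s_min + (s_max - s_min) / (m - 1) * (k - 1) for k in range(1, m + 1)]
rounded = [round(s, 2) for s in scales]
print(rounded)
```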
Suppose the aspect ratios are \(a_r \in \{1, 2, 3, 1/2, 1/3\}\). For the second feature map (19 x 19, conv7), the width and height are \[ w_k^a = s_k \sqrt{a_r}, \quad h_k^a = s_k / \sqrt{a_r} \] With \(s_k = 0.2\) and aspect ratio 1 this gives the box (0.2, 0.2). The model's input image size is (300, 300), so the box corresponds to (60, 60) pixels. The remaining default box shapes are obtained in the same way, six in total (for aspect ratio 1, one extra box is computed with scale \(s_k^\prime = \sqrt{s_k s_{k+1}}\)). This gives the differently shaped boxes this feature map is responsible for predicting.
Figure:
For the conv4_3 layer, for example, we set the number of default boxes to 4, so it ends up with 38 x 38 x 4 boxes. Our predicted boxes are built on top of these default boxes.
The number of default boxes differs per layer (4, 6, 6, 6, 4, 4), so in total we predict \(38^2 \times 4 + 19^2 \times 6 + 10^2 \times 6 + 5^2 \times 6 + 3^2 \times 4 + 1^2 \times 4 = 8732\) boxes.
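The total is easy to verify with a couple of lines of Python, using the feature map sizes and box counts from the configs:

```python
feature_maps = [38, 19, 10, 5, 3, 1]
mbox = [4, 6, 6, 6, 4, 4]

# boxes per layer = cells in the feature map x boxes per cell
per_layer = [f * f * b for f, b in zip(feature_maps, mbox)]
total = sum(per_layer)
print(per_layer, total)
```

Note how the 38 x 38 layer alone contributes about two thirds of all default boxes.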
In practice, parameter tuning therefore focuses on adjusting these default boxes so that they fit the targets to be detected as well as possible, much like tuning the anchor sizes in yolov3.
Code
prior_box.py defines the PriorBox class; its forward function computes the default boxes.
The relevant entries of the config file config.py are
'min_sizes': [30, 60, 111, 162, 213, 264],
'max_sizes': [60, 111, 162, 213, 264, 315],
'aspect_ratios': [[2], [2, 3], [2, 3], [2, 3], [2], [2]],
These are used to compute the default boxes of each feature map. The way the config is defined is somewhat confusing: min_sizes and max_sizes are each used for a box with aspect ratio 1, while an entry such as [2] produces boxes with aspect ratios 2:1 and 1:2.
def forward(self):
    mean = []
    for k, f in enumerate(self.feature_maps):  # in config.py, 'feature_maps': [38, 19, 10, 5, 3, 1]
        for i, j in product(range(f), repeat=2):
            f_k = self.image_size / self.steps[k]  # roughly the feature map size; using f instead of f_k makes little difference
            # unit center x,y: the center of each feature map cell
            cx = (j + 0.5) / f_k
            cy = (i + 0.5) / f_k
            # aspect_ratio: 1
            # rel size: min_size
            s_k = self.min_sizes[k]/self.image_size  # min_sizes gives one shape with aspect ratio 1
            mean += [cx, cy, s_k, s_k]
            # aspect_ratio: 1
            # rel size: sqrt(s_k * s_(k+1))
            s_k_prime = sqrt(s_k * (self.max_sizes[k]/self.image_size))  # max_sizes gives another shape with aspect ratio 1
            mean += [cx, cy, s_k_prime, s_k_prime]
            # rest of aspect ratios
            for ar in self.aspect_ratios[k]:  # e.g. [2, 3] gives 4 shapes: 2:1, 1:2, 3:1, 1:3
                mean += [cx, cy, s_k*sqrt(ar), s_k/sqrt(ar)]
                mean += [cx, cy, s_k/sqrt(ar), s_k*sqrt(ar)]
Taking the first cell of the 38 x 38 feature map as an example, four default boxes are computed in total. The first two values are the box center, followed by the width and height, all relative to the size of the input image.
tensor([[0.0133, 0.0133, 0.1000, 0.1000],
[0.0133, 0.0133, 0.1414, 0.1414],
[0.0133, 0.0133, 0.1414, 0.0707],
[0.0133, 0.0133, 0.0707, 0.1414]])
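The four rows above can be reproduced without torch. The sketch below recomputes the default boxes of a single cell from the config values; cell_priors is a hypothetical helper, and the steps list is an assumption, since it is not shown in these notes (only its first entry matters for this cell).

```python
from math import sqrt

image_size = 300
steps = [8, 16, 32, 64, 100, 300]  # assumption: not listed in these notes
min_sizes = [30, 60, 111, 162, 213, 264]
max_sizes = [60, 111, 162, 213, 264, 315]
aspect_ratios = [[2], [2, 3], [2, 3], [2, 3], [2], [2]]

def cell_priors(k, i, j):
    """Default boxes (cx, cy, w, h) of cell (row i, column j) on feature map k."""
    f_k = image_size / steps[k]
    cx, cy = (j + 0.5) / f_k, (i + 0.5) / f_k
    s_k = min_sizes[k] / image_size
    s_k_prime = sqrt(s_k * (max_sizes[k] / image_size))
    boxes = [(cx, cy, s_k, s_k), (cx, cy, s_k_prime, s_k_prime)]
    for ar in aspect_ratios[k]:
        boxes.append((cx, cy, s_k * sqrt(ar), s_k / sqrt(ar)))
        boxes.append((cx, cy, s_k / sqrt(ar), s_k * sqrt(ar)))
    return boxes

first_cell = [tuple(round(v, 4) for v in box) for box in cell_priors(0, 0, 0)]
for box in first_cell:
    print(box)
```

Calling cell_priors(1, 0, 0) instead returns six boxes, because the second feature map uses the aspect ratios [2, 3].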
Generating predicted boxes
Meaning of the tensor produced by convolving a feature map
Convolving each feature map yields, for every default box, 4 values (t_x, t_y, t_w, t_h). We use these numbers on top of the default box to obtain the coordinates of the predicted box; in other words, the network predicts an offset relative to the reference box. This is why coordinate prediction is called box regression: predicted box = anchor box x a deformation, and what we regress is that deformation, namely (t_x, t_y, t_w, t_h)
that is
b_center_x = t_x * prior_variance[0] * p_width + p_center_x
b_center_y = t_y * prior_variance[1] * p_height + p_center_y
b_width = exp(prior_variance[2] * t_w) * p_width
b_height = exp(prior_variance[3] * t_h) * p_height
or, without the variance terms,
b_center_x = t_x * p_width + p_center_x
b_center_y = t_y * p_height + p_center_y
b_width = exp(t_w) * p_width
b_height = exp(t_h) * p_height
Here p_* denotes the default box and b_* the coordinates of our final predicted box.
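A tiny numeric example may make the formulas more concrete. The sketch below decodes one box in plain Python. decode_box is a hypothetical helper, and the variances (0.1 for the center, 0.2 for the size) are assumed values that are not stated in these notes.

```python
from math import exp

def decode_box(t, prior, variances=(0.1, 0.2)):
    """t = (t_x, t_y, t_w, t_h), prior = (cx, cy, w, h); returns (cx, cy, w, h)."""
    t_x, t_y, t_w, t_h = t
    p_cx, p_cy, p_w, p_h = prior
    b_cx = t_x * variances[0] * p_w + p_cx
    b_cy = t_y * variances[0] * p_h + p_cy
    b_w = exp(variances[1] * t_w) * p_w
    b_h = exp(variances[1] * t_h) * p_h
    return b_cx, b_cy, b_w, b_h

prior = (0.5, 0.5, 0.2, 0.2)
unchanged = decode_box((0.0, 0.0, 0.0, 0.0), prior)  # zero offsets return the prior itself
shifted = decode_box((1.0, 0.0, 0.0, 0.0), prior)    # t_x = 1 moves the center by 0.1 * p_w
print(unchanged, shifted)
```

With all offsets at zero the predicted box is exactly the default box, which is why the network only has to learn small corrections.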
At this point we have a lot of boxes (8732). We filter them to obtain the boxes we finally output.
Pseudo code
for every conv box:
    for every class:
        if class_prob < threshold:
            continue
        predict_box = decode(convbox)
        nms(predict_box)  # remove boxes that are very close to each other
Code
detection.py
class Detect(Function):
    def forward(self, loc_data, conf_data, prior_data):
        ## loc_data [batch,8732,4]
        ## conf_data [batch,8732,1+class]
        ## prior_data [8732,4]
        num = loc_data.size(0)  # batch size
        num_priors = prior_data.size(0)
        output = torch.zeros(num, self.num_classes, self.top_k, 5)
        conf_preds = conf_data.view(num, num_priors,
                                    self.num_classes).transpose(2, 1)
        # Decode predictions into bboxes.
        for i in range(num):
            decoded_boxes = decode(loc_data[i], prior_data, self.variance)
            # For each class, perform nms
            conf_scores = conf_preds[i].clone()
            for cl in range(1, self.num_classes):
                c_mask = conf_scores[cl].gt(self.conf_thresh)
                scores = conf_scores[cl][c_mask]
                if scores.size(0) == 0:
                    continue
                l_mask = c_mask.unsqueeze(1).expand_as(decoded_boxes)
                boxes = decoded_boxes[l_mask].view(-1, 4)
                # idx of highest scoring and non-overlapping boxes per class
                ids, count = nms(boxes, scores, self.nms_thresh, self.top_k)
                output[i, cl, :count] = \
                    torch.cat((scores[ids[:count]].unsqueeze(1),
                               boxes[ids[:count]]), 1)
        flt = output.contiguous().view(num, -1, 5)
        _, idx = flt[:, :, 0].sort(1, descending=True)
        _, rank = idx.sort(1)
        flt[(rank < self.top_k).unsqueeze(-1).expand_as(flt)].fill_(0)
        return output
The core logic is in box_utils.py
- decode: converts the convolution output into box coordinates
def decode(loc, priors, variances):
    boxes = torch.cat((
        priors[:, :2] + loc[:, :2] * variances[0] * priors[:, 2:],
        priors[:, 2:] * torch.exp(loc[:, 2:] * variances[1])), 1)
    boxes[:, :2] -= boxes[:, 2:] / 2
    boxes[:, 2:] += boxes[:, :2]
    return boxes
The last two statements convert (center_x, center_y, w, h) to (xmin, ymin, xmax, ymax).
boxes[:, :2] -= boxes[:, 2:] / 2
boxes[:, 2:] += boxes[:, :2]
The returned boxes are therefore already in (xmin, ymin, xmax, ymax) form.
- nms: if the overlap (IoU) of two boxes exceeds 0.5, they are considered to be boxes for the same object, and only the box with the higher probability is kept
def nms(boxes, scores, overlap=0.5, top_k=200):
    """Apply non-maximum suppression at test time to avoid detecting too many
    overlapping bounding boxes for a given object.
    Args:
        boxes: (tensor) The location preds for the img, Shape: [num_priors,4].
        scores: (tensor) The class predscores for the img, Shape:[num_priors].
        overlap: (float) The overlap thresh for suppressing unnecessary boxes.
        top_k: (int) The Maximum number of box preds to consider.
    Return:
        The indices of the kept boxes with respect to num_priors.
    """
    keep = scores.new(scores.size(0)).zero_().long()
    if boxes.numel() == 0:
        # return a (keep, count) pair here too, matching the final return
        return keep, 0
    x1 = boxes[:, 0]
    y1 = boxes[:, 1]
    x2 = boxes[:, 2]
    y2 = boxes[:, 3]
    area = torch.mul(x2 - x1, y2 - y1)
    v, idx = scores.sort(0)  # sort in ascending order
    # I = I[v >= 0.01]
    idx = idx[-top_k:]  # indices of the top-k largest vals
    xx1 = boxes.new()
    yy1 = boxes.new()
    xx2 = boxes.new()
    yy2 = boxes.new()
    w = boxes.new()
    h = boxes.new()
    # keep = torch.Tensor()
    count = 0
    while idx.numel() > 0:
        i = idx[-1]  # index of current largest val
        # keep.append(i)
        keep[count] = i
        count += 1
        if idx.size(0) == 1:
            break
        idx = idx[:-1]  # remove kept element from view
        # load bboxes of next highest vals
        torch.index_select(x1, 0, idx, out=xx1)
        torch.index_select(y1, 0, idx, out=yy1)
        torch.index_select(x2, 0, idx, out=xx2)
        torch.index_select(y2, 0, idx, out=yy2)
        # store element-wise max with next highest score
        xx1 = torch.clamp(xx1, min=x1[i])
        yy1 = torch.clamp(yy1, min=y1[i])
        xx2 = torch.clamp(xx2, max=x2[i])
        yy2 = torch.clamp(yy2, max=y2[i])
        w.resize_as_(xx2)
        h.resize_as_(yy2)
        w = xx2 - xx1
        h = yy2 - yy1
        # check sizes of xx1 and xx2.. after each iteration
        w = torch.clamp(w, min=0.0)
        h = torch.clamp(h, min=0.0)
        inter = w*h
        # IoU = i / (area(a) + area(b) - i)
        rem_areas = torch.index_select(area, 0, idx)  # load remaining areas
        union = (rem_areas - inter) + area[i]
        IoU = inter/union  # store result in iou
        # keep only elements with an IoU <= overlap
        idx = idx[IoU.le(overlap)]
    return keep, count
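The tensor bookkeeping above hides a fairly simple greedy procedure. Here is a minimal pure-Python sketch of the same idea, not the repository's implementation: iou and simple_nms are hypothetical names, and boxes are assumed to be (xmin, ymin, xmax, ymax) tuples.

```python
def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def simple_nms(boxes, scores, overlap=0.5, top_k=200):
    """Greedy NMS: repeatedly keep the best box and drop boxes overlapping it too much."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # keep only candidates whose IoU with the chosen box is <= overlap
        order = [i for i in order if iou(boxes[best], boxes[i]) <= overlap]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = simple_nms(boxes, scores)
print(kept)
```

In this toy input the first two boxes overlap with an IoU of 81/119 (about 0.68), so the lower-scoring one is suppressed, while the third box does not overlap anything and survives.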
That covers the structure of the ssd network and the meaning of the output of each layer, which is enough to understand inference: given an image, how the model predicts the box locations. Next we turn to the training process.
loss calculation
The first problem to solve is box matching: during training, how do we tell whether a predicted box is correct? We need a way to compute the difference between the boxes predicted by the model and the real ground truth boxes.
In the example from the paper, the ground truth box of the cat is matched to two default boxes and the ground truth box of the dog to one default box.
Matching strategy
The matching strategy is
- Match each gt box against the prior boxes; the prior box with the highest IOU with that gt box is selected as a positive sample
- Any prior box whose IOU with a gt box is greater than 0.5 is also selected as a positive sample
One question bothered me for a long time: doesn't the second step already include the first? Then it dawned on me: it is possible that the iou of every prior box with a gt box is below the threshold, and the first step guarantees that each gt box has at least one prior box assigned to it.
box_utils.py
def match(threshold, truths, priors, variances, labels, loc_t, conf_t, idx):
    # jaccard index  # [objects_num, priorbox_num]
    overlaps = jaccard(
        truths,
        point_form(priors)
    )
    # (Bipartite Matching)
    # [num_objects,1] best prior for each ground truth
    best_prior_overlap, best_prior_idx = overlaps.max(1, keepdim=True)  # max of each row: which prior box has the largest IOU with each gt box
    # [1,num_priors] best ground truth for each prior
    best_truth_overlap, best_truth_idx = overlaps.max(0, keepdim=True)  # max of each column: which gt box has the largest IOU with each prior box
    best_truth_idx.squeeze_(0)  # shape [1, num_priors] -> [num_priors]
    best_truth_overlap.squeeze_(0)  # same as above
    best_prior_idx.squeeze_(1)  # shape [num_objects, 1] -> [num_objects]
    best_prior_overlap.squeeze_(1)
    best_truth_overlap.index_fill_(0, best_prior_idx, 2)  # ensure best prior: set the overlap at the best_prior_idx positions to 2 so it is certainly > threshold
    # TODO refactor: index best_prior_idx with long tensor
    # ensure every gt matches with its prior of max overlap
    for j in range(best_prior_idx.size(0)):
        best_truth_idx[best_prior_idx[j]] = j
    matches = truths[best_truth_idx]  # Shape: [num_priors,4]
    conf = labels[best_truth_idx] + 1  # Shape: [num_priors]
    conf[best_truth_overlap < threshold] = 0  # label as background
    loc = encode(matches, priors, variances)
    loc_t[idx] = loc  # [num_priors,4] encoded offsets to learn
    conf_t[idx] = conf  # [num_priors] top class label for each prior
The logic here is a bit convoluted; a concrete example helps.
Suppose an image contains two objects, hence two gt boxes, and suppose three prior boxes were computed (8732 in reality). Computing the iou of each gt box with each prior box gives overlaps, a matrix with two rows and three columns.
import torch
# suppose an image has 2 objects and 3 prior boxes, with the ious given by overlaps
truths = torch.Tensor([[1,2,3,4],[5,6,7,8]])  # 2 gt boxes, each determined by four coordinates
labels = torch.Tensor([[5],[6]])  # the 2 objects belong to class 5 and class 6
overlaps = torch.Tensor([[0.1,0.4,0.3],[0.5,0.2,0.6]])
#overlaps = torch.Tensor([[0.9,0.9,0.9],[0.8,0.8,0.8]])
best_prior_overlap, best_prior_idx = overlaps.max(1, keepdim=True)  # [2,1]
#print(best_prior_overlap)
#print(best_prior_idx)  # index of the prior box with the largest iou for each gt box
best_truth_overlap, best_truth_idx = overlaps.max(0, keepdim=True)  # max of each column: which gt box has the largest IOU with each prior box
#print(best_truth_overlap)  # [1,3]
#print(best_truth_idx)  # index of the gt box with the largest iou for each prior box
best_truth_idx.squeeze_(0)  # shape [1, num_priors] -> [num_priors]
best_truth_overlap.squeeze_(0)  # same as above
best_prior_idx.squeeze_(1)  # shape [num_objects, 1] -> [num_objects]
best_prior_overlap.squeeze_(1)
print(best_prior_idx)
print(best_truth_idx)
# set the iou of each gt box's best prior box to 2 (anything above the threshold works) so that this prior box is certain to be kept
best_truth_overlap.index_fill_(0, best_prior_idx, 2)
# e.g. if every prior box has iou 0.9 with gt box 1 and iou 0.8 with gt box 2, we must make sure the best prior box of gt box 2 is matched to gt box 2 rather than gt box 1
# try overlaps = torch.Tensor([[0.9,0.9,0.9],[0.8,0.8,0.8]]) to see this
for j in range(best_prior_idx.size(0)):
    print(j)
    best_truth_idx[best_prior_idx[j]] = j
print(best_truth_idx)
matches = truths[best_truth_idx]  # [3,4]: each row holds the coordinates of the gt box matched to that prior box
print(matches)
print(best_truth_overlap)
conf = labels[best_truth_idx] + 1  # [3,1]: class of the gt box matched to each prior box
print(conf.shape)
#conf[best_truth_overlap < threshold] = 0  # filter out matches whose iou is too low and label them as background
At this point we have matches, i.e. the gt box corresponding to every prior box, and conf, i.e. the class each prior box belongs to. If the iou is too low, the class is marked as background.
Next, encode:
def encode(matched, priors, variances):
    # dist b/t match center and prior's center
    g_cxcy = (matched[:, :2] + matched[:, 2:])/2 - priors[:, :2]
    # encode variance
    g_cxcy /= (variances[0] * priors[:, 2:])
    # match wh / prior wh
    g_wh = (matched[:, 2:] - matched[:, :2]) / priors[:, 2:]
    g_wh = torch.log(g_wh) / variances[1]
    # return target for smooth_l1_loss
    return torch.cat([g_cxcy, g_wh], 1)  # [num_priors,4]
Here we compute the difference between each gt box and its corresponding prior box. Note that matched is in (lefttop_x, lefttop_y, rightbottom_x, rightbottom_y) format.
So what we obtain here is the offset between the gt box and the prior box.
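Since encode is meant to be the inverse of decode, a quick round trip is a useful sanity check. The sketch below does this for a single box in plain Python. encode_box and decode_to_corners are hypothetical helpers, and the variances (0.1, 0.2) are assumed values not given in these notes.

```python
from math import exp, log

VARIANCES = (0.1, 0.2)  # assumption: (center variance, size variance)

def encode_box(gt, prior, variances=VARIANCES):
    """gt is (xmin, ymin, xmax, ymax); prior is (cx, cy, w, h). Returns (g_cx, g_cy, g_w, g_h)."""
    g_cx = ((gt[0] + gt[2]) / 2 - prior[0]) / (variances[0] * prior[2])
    g_cy = ((gt[1] + gt[3]) / 2 - prior[1]) / (variances[0] * prior[3])
    g_w = log((gt[2] - gt[0]) / prior[2]) / variances[1]
    g_h = log((gt[3] - gt[1]) / prior[3]) / variances[1]
    return g_cx, g_cy, g_w, g_h

def decode_to_corners(t, prior, variances=VARIANCES):
    """Inverse of encode_box: returns (xmin, ymin, xmax, ymax)."""
    cx = prior[0] + t[0] * variances[0] * prior[2]
    cy = prior[1] + t[1] * variances[0] * prior[3]
    w = prior[2] * exp(t[2] * variances[1])
    h = prior[3] * exp(t[3] * variances[1])
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

gt = (0.3, 0.35, 0.7, 0.65)
prior = (0.5, 0.5, 0.2, 0.2)
offsets = encode_box(gt, prior)
recovered = decode_to_corners(offsets, prior)
print(offsets)
print(recovered)
```

The gt box here shares its center with the prior but is twice as wide, so the center offsets are zero and g_w equals log(2) / 0.2.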
Calculation of loss
The prior boxes fall into three groups
- Positive samples
- The xx negative samples with the highest loss
- The remaining negative samples
A positive sample is a prior box whose iou with a ground truth box exceeds the threshold, or which has the largest iou with a ground truth box.
A negative sample is any prior box that is not a positive sample.
The loss function has two parts: the loss on the coordinate offsets and the loss on the class information.
The loc loss only considers positive samples. The conf loss considers both positive and negative samples, keeping negative samples : positive samples = 3 : 1.
Code implementation:
multibox_loss.py
class MultiBoxLoss(nn.Module):
    def forward(self, predictions, targets):
It can be expressed as pseudocode
# get the gt box of each prior box according to the matching strategy
# select the positive prior boxes by iou
# compute the conf loss
# select the xx negative prior boxes with the highest loss, keeping neg:pos = 3:1
# compute the cross entropy
# normalize
- Coordinate offset loss
pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)
loc_p = loc_data[pos_idx].view(-1, 4)  # predicted offsets
loc_t = loc_t[pos_idx].view(-1, 4)  # true offsets
loss_l = F.smooth_l1_loss(loc_p, loc_t, size_average=False)  # what we regress is the offset relative to the default box
This uses smooth_l1_loss. The code is fairly simple, so there is not much to say about it.
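For reference, smooth L1 is quadratic for small errors and linear for large ones. The sketch below spells out the summed version in plain Python, assuming the usual threshold of 1 between the two regimes; smooth_l1 and smooth_l1_sum are hypothetical names, not functions from the repository.

```python
def smooth_l1(x):
    """0.5 * x^2 when |x| < 1, otherwise |x| - 0.5."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def smooth_l1_sum(pred, target):
    """Summed (not averaged) smooth L1 loss over paired values."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))

loc_p = [0.5, -2.0, 0.0, 1.0]  # made-up predicted offsets
loc_t = [0.0, 0.0, 0.0, 0.0]   # made-up target offsets
loss = smooth_l1_sum(loc_p, loc_t)
print(loss)
```

The linear part keeps a single badly predicted offset (such as the -2.0 above) from dominating the loss the way a squared error would.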
Hard negative mining
After matching default boxes to gt boxes, a large number of default boxes remain unmatched: there are only a few positive samples and a great many negative samples. We sort the default boxes by confidence loss in descending order and keep only the top ones for computing the loss, so that negative samples : positive samples is 3 : 1. This makes optimization faster and training more stable.
For more on class imbalance in object detection see https://zhuanlan.zhihu.com/p/60612064
Put simply, negative samples let the model learn background information and positive samples let it learn target information, so both are needed in a suitable ratio; the paper uses 3 : 1.
In the code this corresponds to MultiBoxLoss.negpos_ratio
# Compute max conf across batch for hard negative mining
batch_conf = conf_data.view(-1, self.num_classes)  # [batch*8732, 21]
loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))  # conf_t holds the class label of each prior box
# Hard Negative Mining
loss_c = loss_c.view(num, -1)  # reshape to [batch, 8732] first so that the pos mask lines up
loss_c[pos] = 0  # filter out pos boxes for now
_, loss_idx = loss_c.sort(1, descending=True)
_, idx_rank = loss_idx.sort(1)
num_pos = pos.long().sum(1, keepdim=True)
num_neg = torch.clamp(self.negpos_ratio*num_pos, max=pos.size(1)-1)
# boolean mask of the selected negative samples
neg = idx_rank < num_neg.expand_as(idx_rank)
Note that this loss_c is not the network's conf loss, i.e. not the L_conf of the paper; it is only used to rank the negative samples.
def log_sum_exp(x):
    """Utility function for computing log_sum_exp while determining
    This will be used to determine unaveraged confidence loss across
    all examples in a batch.
    Args:
        x (Variable(tensor)): conf_preds from conf layers
    """
    x_max = x.data.max()
    return torch.log(torch.sum(torch.exp(x-x_max), 1, keepdim=True)) + x_max
This uses a trick. See https://github.com/amdegroot/ssd.pytorch/issues/203 and https://stackoverflow.com/questions/42599498/numercially-stable-softmax
To avoid e^n becoming too large or too small to compute, this trick is often used when computing softmax.
This function seriously got in the way of my understanding of loss_c. Mathematically, x_max can be dropped from the function above, and loss_c
then becomes
loss_c = torch.log(torch.sum(torch.exp(batch_conf), 1, keepdim=True)) - batch_conf.gather(1, conf_t.view(-1, 1))
which is easier to understand.
conf_t holds the label index of each prior box. batch_conf.gather(1, conf_t.view(-1, 1)) gives a [batch * 8732, 1] tensor that keeps, for each prior box, only the predicted score of its label.
So loss_c is the log of the summed exponentials of the scores of all classes minus the score of the label the prior box is responsible for, which is exactly the cross entropy of that prior box.
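Both claims, that the expression equals the cross entropy and that subtracting the maximum only matters numerically, can be checked on a few numbers. A minimal sketch in plain Python; log_sum_exp_stable and cross_entropy are hypothetical names and the scores are made up.

```python
from math import exp, log

def log_sum_exp_stable(scores):
    """log(sum(exp(s))) computed after subtracting the maximum score."""
    m = max(scores)
    return log(sum(exp(s - m) for s in scores)) + m

def cross_entropy(scores, label):
    """Per-box conf loss: log-sum-exp over all classes minus the label's score."""
    return log_sum_exp_stable(scores) - scores[label]

scores = [2.0, 1.0, 0.1]
label = 0
softmax_prob = exp(scores[label]) / sum(exp(s) for s in scores)
loss = cross_entropy(scores, label)  # equals -log(softmax_prob)

big = [1000.0, 1001.0]
try:
    naive = log(sum(exp(s) for s in big))  # exp(1000) overflows a float
except OverflowError:
    naive = None
stable = log_sum_exp_stable(big)
print(loss, -log(softmax_prob), naive, stable)
```

With ordinary scores the two forms agree, but with scores around 1000 the naive version overflows while the shifted one still returns a finite value.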
After getting loss_c, we obtain the indices of the positive / negative samples
# select the negative samples with the largest loss, negatives : positives = 3 : 1
# Hard Negative Mining
loss_c = loss_c.view(num, -1)  # [batch, 8732]
loss_c[pos] = 0  # filter out pos boxes for now
_, loss_idx = loss_c.sort(1, descending=True)  # sort the conf loss of each image's prior boxes in descending order
print(_[0,:],loss_idx[0])  # [batch, 8732]: each entry is the index of a prior box
_, idx_rank = loss_idx.sort(1)
print(_[0,:],idx_rank[0,:])  # [batch, 8732]: each entry is the position of a prior box in loss_idx; we select the first xx of loss_idx (xx = 3 x the number of positives)
num_pos = pos.long().sum(1, keepdim=True)
print(num_pos)  # [batch, 1]: number of positive samples per image
# number of negatives: 3 x the positives, capped at the number of prior boxes minus 1
num_neg = torch.clamp(self.negpos_ratio*num_pos, max=pos.size(1)-1)
print(num_neg)
# select the num_neg negative samples with the highest loss
neg = idx_rank < num_neg.expand_as(idx_rank)
print(neg)
At this point we have the indices of the positive and negative samples and can compute the difference between predicted and true values.
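The double sort is the least obvious part: sorting the losses gives the order of the prior boxes, and sorting that order again gives each prior box its rank. The sketch below reproduces the selection for a single image in plain Python; hard_negative_mask is a hypothetical helper and the loss values are made up.

```python
def hard_negative_mask(loss, pos, negpos_ratio=3):
    """loss, pos: per-prior lists for one image. Returns a boolean list marking the chosen negatives."""
    masked = [0.0 if p else l for l, p in zip(loss, pos)]  # ignore positives for now
    # first sort: prior indices ordered by descending loss
    loss_idx = sorted(range(len(masked)), key=lambda i: masked[i], reverse=True)
    # second sort: rank of each prior box within that ordering
    idx_rank = [0] * len(masked)
    for rank, i in enumerate(loss_idx):
        idx_rank[i] = rank
    num_pos = sum(pos)
    num_neg = min(negpos_ratio * num_pos, len(pos) - 1)
    return [r < num_neg for r in idx_rank]

loss = [0.1, 0.9, 2.0, 0.5, 0.3, 0.05, 0.7]
pos = [False, False, True, False, False, False, False]
neg = hard_negative_mask(loss, pos)
print(neg)
```

With one positive sample, the three negatives with the largest loss (0.9, 0.7 and 0.5) are selected and the rest are ignored when the conf loss is computed.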
loss_c = F.cross_entropy(conf_p, targets_weighted, size_average=False)
# Sum of losses: L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
N = num_pos.data.sum()
loss_l /= N
loss_c /= N
return loss_l, loss_c
The conf loss is measured with cross entropy. Finally both losses are divided by the number of positive samples as normalization.
https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss
There is no need to apply softmax manually to turn the scores into probabilities before computing this loss.
training
Now that we can build the network structure and compute the loss, the next step is training.
This is implemented in train.py
The main logic, simplified, is as follows:
ssd_net = build_ssd('train', cfg['min_dim'], cfg['num_classes'])
net = ssd_net
optimizer = optim.SGD(net.parameters(), lr=args.lr, momentum=args.momentum,
                      weight_decay=args.weight_decay)
criterion = MultiBoxLoss(cfg['num_classes'], 0.5, True, 0, True, 3, 0.5,
                         False, args.cuda)
for iteration in range(args.start_iter, cfg['max_iter']):
    # load train data
    images, targets = next(batch_iterator)
    # forward
    out = net(images)
    # backprop
    optimizer.zero_grad()
    loss_l, loss_c = criterion(out, targets)
    loss = loss_l + loss_c
    loss.backward()
    optimizer.step()
which is
- Define the network structure
- Define the loss function and the gradient-based optimization method
- Load the training set
- Forward propagation to obtain the predictions
- Compute the loss
- Back-propagate and update the network weights
For the torch functions used here, see: https://www.cnblogs.com/sdu20112013/p/11731741.html