SSD源码阅读
一、关键参数解释
feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11'],
feat_shapes=[(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
anchor_size_bounds=[0.15, 0.90],
anchor_sizes=[(21., 45.),
(45., 99.),
(99., 153.),
(153., 207.),
(207., 261.),
(261., 315.)],
anchor_ratios=[[2, .5],
[2, .5, 3, 1./3],
[2, .5, 3, 1./3],
[2, .5, 3, 1./3],
[2, .5],
[2, .5]],
anchor_steps=[8, 16, 32, 64, 100, 300],
anchor_offset=0.5,
normalizations=[20, -1, -1, -1, -1, -1],
prior_scaling=[0.1, 0.1, 0.2, 0.2]
以ssd-300为例,feat_layers代表生成cls和box得分的层共6层;feat_map值得是当前层的featurmap的大小;anchor_size_bounds代表
anchor_size_rate的最大最小值,以此生成anchor_size,计算如下:
由于block_4是单独设置的大小为21,所以k=5,block_7的大小为300*0.15=45,步长为(0.9-0.15)×100/(5-1)=0.18,0.18*300=54,因此特征图大小为45,99(45+54),153(99+54),207,261,315。计算315是由于ssd在设置1:1的正方形框时,要设置两个比例的正方形框。
anchor_ratios是anchor的长宽比,anchor的h,w计算为
,h计算为
,
是anchor_rate。除了上述设置的这些还包括两个1:1的正方框,其中一个anchor的
等于
。anchor_steps为了方便预测的box到原图坐标的转换。
综上可得下表
anchor的生成
整个训练过程都在train_ssd_network.py中,从这个文件中可以看出,第一步anchors的获取是通过ssd_anchors = ssd_net.anchors(ssd_shape)这句代码来获取的,这个被调用的函数可以在ssd_vgg_300.py中找到。而这个函数也只有一句话,那就是
return ssd_anchors_all_layers(img_shape,
self.params.feat_shapes,
self.params.anchor_sizes,
self.params.anchor_ratios,
self.params.anchor_steps,
self.params.anchor_offset,
dtype)
最终跳转到
def ssd_anchor_one_layer(img_shape,
feat_shape,
sizes,
ratios,
step,
offset=0.5,
dtype=np.float32):
"""Computer SSD default anchor boxes for one feature layer.
Determine the relative position grid of the centers, and the relative
width and height.
Arguments:
feat_shape: Feature shape, used for computing relative position grids;
size: Absolute reference sizes;
ratios: Ratios to use on these features;
img_shape: Image shape, used for computing height, width relatively to the
former;
offset: Grid offset.
Return:
y, x, h, w: Relative x and y grids, and height and width.
"""
# Compute the position grid: simple way.
# y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
# y = (y.astype(dtype) + offset) / feat_shape[0]
# x = (x.astype(dtype) + offset) / feat_shape[1]
# Weird SSD-Caffe computation using steps values...
y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
y = (y.astype(dtype) + offset) * step / img_shape[0]
x = (x.astype(dtype) + offset) * step / img_shape[1]
# Expand dims to support easy broadcasting.
y = np.expand_dims(y, axis=-1)
x = np.expand_dims(x, axis=-1)
# Compute relative height and width.
# Tries to follow the original implementation of SSD for the order.
num_anchors = len(sizes) + len(ratios)
h = np.zeros((num_anchors, ), dtype=dtype)
w = np.zeros((num_anchors, ), dtype=dtype)
# Add first anchor boxes with ratio=1.
h[0] = sizes[0] / img_shape[0]
w[0] = sizes[0] / img_shape[1]
di = 1
if len(sizes) > 1:
h[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[0]
w[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[1]
di += 1
for i, r in enumerate(ratios):
h[i+di] = sizes[0] / img_shape[0] / math.sqrt(r)
w[i+di] = sizes[0] / img_shape[1] * math.sqrt(r)
return y, x, h, w
这个函数就是给’block4’, ‘block7’, ‘block8’, ‘block9’, ‘block10’, ‘block11’的特征图在每个位置上生成anchor。最终的输出是y,x,h,w。y,x的大小为feature map的大小。h,w大小则为anchor ratio的长度,在源码中’block4’, ‘block10’, ‘block11’有四种比率,block7’, ‘block8’, ‘block9’,有六种比率,因此anchor的数量为 。
三 、ground truth bboxes encodes
要明白问什么要对ground truth编码先了解下什么是边框回归,推荐阅读边框回归,我的一点思考是,如果不对ground truth进行box encod的话,正样本会很少,学习也学习不到东西,更不会使bbox回归到正确的位置。encode的代码通过一个叫tf_ssd_bboxes_encode_layer的函数获得的,其定义于ssd_common.py
def tf_ssd_bboxes_encode_layer(labels,
bboxes,
anchors_layer,
num_classes,
no_annotation_label,
ignore_threshold=0.5,
prior_scaling=[0.1, 0.1, 0.2, 0.2],
dtype=tf.float32):
"""Encode groundtruth labels and bounding boxes using SSD anchors from
one layer.
Arguments:
labels: 1D Tensor(int64) containing groundtruth labels;
bboxes: Nx4 Tensor(float) with bboxes relative coordinates;
anchors_layer: Numpy array with layer anchors;
matching_threshold: Threshold for positive match with groundtruth bboxes;
prior_scaling: Scaling of encoded coordinates.
Return:
(target_labels, target_localizations, target_scores): Target Tensors.
"""
# Anchors coordinates and volume.
yref, xref, href, wref = anchors_layer
ymin = yref - href / 2.
xmin = xref - wref / 2.
ymax = yref + href / 2.
xmax = xref + wref / 2.
vol_anchors = (xmax - xmin) * (ymax - ymin)
# Initialize tensors...
shape = (yref.shape[0], yref.shape[1], href.size)
feat_labels = tf.zeros(shape, dtype=tf.int64)
feat_scores = tf.zeros(shape, dtype=dtype)
feat_ymin = tf.zeros(shape, dtype=dtype)
feat_xmin = tf.zeros(shape, dtype=dtype)
feat_ymax = tf.ones(shape, dtype=dtype)
feat_xmax = tf.ones(shape, dtype=dtype)
def jaccard_with_anchors(bbox):
"""Compute jaccard score between a box and the anchors.
"""
int_ymin = tf.maximum(ymin, bbox[0])
int_xmin = tf.maximum(xmin, bbox[1])
int_ymax = tf.minimum(ymax, bbox[2])
int_xmax = tf.minimum(xmax, bbox[3])
h = tf.maximum(int_ymax - int_ymin, 0.)
w = tf.maximum(int_xmax - int_xmin, 0.)
# Volumes.
inter_vol = h * w
union_vol = vol_anchors - inter_vol \
+ (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
jaccard = tf.div(inter_vol, union_vol)
return jaccard
def intersection_with_anchors(bbox):
"""Compute intersection between score a box and the anchors.
"""
int_ymin = tf.maximum(ymin, bbox[0])
int_xmin = tf.maximum(xmin, bbox[1])
int_ymax = tf.minimum(ymax, bbox[2])
int_xmax = tf.minimum(xmax, bbox[3])
h = tf.maximum(int_ymax - int_ymin, 0.)
w = tf.maximum(int_xmax - int_xmin, 0.)
inter_vol = h * w
scores = tf.div(inter_vol, vol_anchors)
return scores
def condition(i, feat_labels, feat_scores,
feat_ymin, feat_xmin, feat_ymax, feat_xmax):
"""Condition: check label index.
"""
r = tf.less(i, tf.shape(labels))
return r[0]
def body(i, feat_labels, feat_scores,
feat_ymin, feat_xmin, feat_ymax, feat_xmax):
"""Body: update feature labels, scores and bboxes.
Follow the original SSD paper for that purpose:
- assign values when jaccard > 0.5;
- only update if beat the score of other bboxes.
"""
# Jaccard score.
label = labels[i]
bbox = bboxes[i]
jaccard = jaccard_with_anchors(bbox)
# Mask: check threshold + scores + no annotations + num_classes.
mask = tf.greater(jaccard, feat_scores)
# mask = tf.logical_and(mask, tf.greater(jaccard, matching_threshold))
mask = tf.logical_and(mask, feat_scores > -0.5)
mask = tf.logical_and(mask, label < num_classes)
imask = tf.cast(mask, tf.int64)
fmask = tf.cast(mask, dtype)
# Update values using mask.
feat_labels = imask * label + (1 - imask) * feat_labels
feat_scores = tf.where(mask, jaccard, feat_scores)
feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin
feat_xmin = fmask * bbox[1] + (1 - fmask) * feat_xmin
feat_ymax = fmask * bbox[2] + (1 - fmask) * feat_ymax
feat_xmax = fmask * bbox[3] + (1 - fmask) * feat_xmax
# Check no annotation label: ignore these anchors...
# interscts = intersection_with_anchors(bbox)
# mask = tf.logical_and(interscts > ignore_threshold,
# label == no_annotation_label)
# # Replace scores by -1.
# feat_scores = tf.where(mask, -tf.cast(mask, dtype), feat_scores)
return [i+1, feat_labels, feat_scores,
feat_ymin, feat_xmin, feat_ymax, feat_xmax]
# Main loop definition.
i = 0
[i, feat_labels, feat_scores,
feat_ymin, feat_xmin,
feat_ymax, feat_xmax] = tf.while_loop(condition, body,
[i, feat_labels, feat_scores,
feat_ymin, feat_xmin,
feat_ymax, feat_xmax])
# Transform to center / size.
feat_cy = (feat_ymax + feat_ymin) / 2.
feat_cx = (feat_xmax + feat_xmin) / 2.
feat_h = feat_ymax - feat_ymin
feat_w = feat_xmax - feat_xmin
# Encode features.
feat_cy = (feat_cy - yref) / href / prior_scaling[0]
feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
feat_h = tf.log(feat_h / href) / prior_scaling[2]
feat_w = tf.log(feat_w / wref) / prior_scaling[3]
# Use SSD ordering: x / y / w / h instead of ours.
feat_localizations = tf.stack([feat_cx, feat_cy, feat_w, feat_h], axis=-1)
return feat_labels, feat_localizations, feat_scores
函数的返回值是标签,位置和得分。主要有以下函数构成jaccard_with_anchors(bbox)用于计算bbox与所有default box的IOU;intersection_with_anchors(bbox)用于计算bbox与所有default box的交比上default box的面积的值,在此处没有用到这个函数;condition,body要和下面的tf.while_loop连起来看,tf.while_loop的执行逻辑是,若condition为真,执行body,condition只有一句话,其实就是判断i的值是否小于GT box的数目,也就是说整个循环逻辑是以每个GT box为单位进行的,body就是对每个GT box进行的操作。
jaccard_with_anchors(bbox)函数的意义在于对每个ground truth的box计算anchor与其的IOU大于0.5则真,这样一个ground truth就会有多个正样本的anchor,但anchor有其匹配机制,共有两个原则,第一个原则是每个GT box与其IOU最大的defaut box匹配,这样就能保证每个GT box都有一个与其匹配的default box。但是这样的话正负样本严重不平衡,因此还需要第二个原则,那就是对于未匹配的default box,若与某个GT box的IOU大于一个阈值(SSD中取0.5),那么也将其进行匹配,此处有一个问题就是若某个default box与几个GT box的IOU都大于阈值,选哪个与其匹配?显然,选与其IOU最大的那个GT box与之匹配。这样就大大增加了正样本的个数。还会有一个矛盾在于,假如一个GT box 1与其IOU最大的default box小于阈值,而该default box与某一个GT box 2的IOU大于阈值,如何选择。此处应该按照第一个原则,选择GT box 1,因为必须要保证每个GT box要有一个default box与之匹配。(实际情况中该矛盾发生可能较低,所以该tensorflow实现中仅实施了第二个原则。)
参考文中提到的博客边框回归,可知道要保持尺度不变形,需要对box的center和w,h做特殊处理。假设default box位置为
,则偏移的计算如下:
对应的解码为
网络结构
普通的网络层不在讨论,只讨论[‘block4’, ‘block7’, ‘block8’, ‘block9’, ‘block10’, ‘block11’],函数可以在ssd_300.py中找到。
def ssd_multibox_layer(inputs,
num_classes,
sizes,
ratios=[1],
normalization=-1,
bn_normalization=False):
"""Construct a multibox layer, return a class and localization predictions.
"""
net = inputs
if normalization > 0:
net = custom_layers.l2_normalization(net, scaling=True)
# Number of anchors.
num_anchors = len(sizes) + len(ratios)
# Location.
num_loc_pred = num_anchors * 4
loc_pred = slim.conv2d(net, num_loc_pred, [3, 3], activation_fn=None,
scope='conv_loc')
loc_pred = custom_layers.channel_to_last(loc_pred)
loc_pred = tf.reshape(loc_pred,
tensor_shape(loc_pred, 4)[:-1]+[num_anchors, 4])
# Class prediction.
num_cls_pred = num_anchors * num_classes
cls_pred = slim.conv2d(net, num_cls_pred, [3, 3], activation_fn=None,
scope='conv_cls')
cls_pred = custom_layers.channel_to_last(cls_pred)
cls_pred = tf.reshape(cls_pred,
tensor_shape(cls_pred, 4)[:-1]+[num_anchors, num_classes])
return cls_pred, loc_pred
损失函数
ssd的损失函数包括定位损失和分类损失,定位损失为smooth L1 loss,分类损失为交叉熵损失。
def ssd_losses(logits, localisations,
gclasses, glocalisations, gscores,
match_threshold=0.5,
negative_ratio=3.,
alpha=1.,
label_smoothing=0.,
scope=None):
"""Loss functions for training the SSD 300 VGG network.
This function defines the different loss components of the SSD, and
adds them to the TF loss collection.
Arguments:
logits: (list of) predictions logits Tensors;
localisations: (list of) localisations Tensors;
gclasses: (list of) groundtruth labels Tensors;
glocalisations: (list of) groundtruth localisations Tensors;
gscores: (list of) groundtruth score Tensors;
"""
with tf.name_scope(scope, 'ssd_losses'):
l_cross_pos = []
l_cross_neg = []
l_loc = []
for i in range(len(logits)):
dtype = logits[i].dtype
with tf.name_scope('block_%i' % i):
# Determine weights Tensor.
pmask = gscores[i] > match_threshold
fpmask = tf.cast(pmask, dtype)
n_positives = tf.reduce_sum(fpmask)
# Select some random negative entries.
# n_entries = np.prod(gclasses[i].get_shape().as_list())
# r_positive = n_positives / n_entries
# r_negative = negative_ratio * n_positives / (n_entries - n_positives)
# Negative mask.
no_classes = tf.cast(pmask, tf.int32)
predictions = slim.softmax(logits[i])
nmask = tf.logical_and(tf.logical_not(pmask),
gscores[i] > -0.5)
fnmask = tf.cast(nmask, dtype)
nvalues = tf.where(nmask,
predictions[:, :, :, :, 0],
1. - fnmask)
nvalues_flat = tf.reshape(nvalues, [-1])
# Number of negative entries to select.
n_neg = tf.cast(negative_ratio * n_positives, tf.int32)
n_neg = tf.maximum(n_neg, tf.size(nvalues_flat) // 8)
n_neg = tf.maximum(n_neg, tf.shape(nvalues)[0] * 4)
max_neg_entries = 1 + tf.cast(tf.reduce_sum(fnmask), tf.int32)
n_neg = tf.minimum(n_neg, max_neg_entries)
val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg)
minval = val[-1]
# Final negative mask.
nmask = tf.logical_and(nmask, -nvalues > minval)
fnmask = tf.cast(nmask, dtype)
# Add cross-entropy loss.
with tf.name_scope('cross_entropy_pos'):
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
labels=gclasses[i])
loss = tf.losses.compute_weighted_loss(loss, fpmask)
l_cross_pos.append(loss)
with tf.name_scope('cross_entropy_neg'):
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
labels=no_classes)
loss = tf.losses.compute_weighted_loss(loss, fnmask)
l_cross_neg.append(loss)
# Add localization loss: smooth L1, L2, ...
with tf.name_scope('localization'):
# Weights Tensor: positive mask + random negative.
weights = tf.expand_dims(alpha * fpmask, axis=-1)
loss = custom_layers.abs_smooth(localisations[i] - glocalisations[i])
loss = tf.losses.compute_weighted_loss(loss, weights)
l_loc.append(loss)
# Additional total losses...
with tf.name_scope('total'):
total_cross_pos = tf.add_n(l_cross_pos, 'cross_entropy_pos')
total_cross_neg = tf.add_n(l_cross_neg, 'cross_entropy_neg')
total_cross = tf.add(total_cross_pos, total_cross_neg, 'cross_entropy')
total_loc = tf.add_n(l_loc, 'localization')
# Add to EXTRA LOSSES TF.collection
tf.add_to_collection('EXTRA_LOSSES', total_cross_pos)
tf.add_to_collection('EXTRA_LOSSES', total_cross_neg)
tf.add_to_collection('EXTRA_LOSSES', total_cross)
tf.add_to_collection('EXTRA_LOSSES', total_loc)
Smooth L1 Loss计算如下,其中
代表预测框和defaul box的变换关系,
代表ground truth box 和 default box 的变换关系