SSD (2): Code Analysis

Contents

Running the Code

Code Analysis

Running the Code

As before, we analyze the TensorFlow implementation. Code repository: SSD-Tensorflow

1. Prediction

unzip ssd_300_vgg.ckpt.zip
jupyter notebook notebooks/ssd_notebook.ipynb

After starting the Jupyter notebook, run the cells inside it to perform prediction.
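In essence, the notebook builds the 300x300 SSD graph, restores the checkpoint, and then feeds images through it (followed by score selection and NMS). A simplified sketch of that flow, assuming the repository layout used throughout this post; the placeholder shape and session setup are illustrative, and the real notebook also applies the evaluation preprocessing:

import tensorflow as tf
from nets import ssd_vgg_300

slim = tf.contrib.slim

# Build the SSD-300 graph on an already-preprocessed image batch.
img_input = tf.placeholder(tf.float32, shape=(1, 300, 300, 3))
ssd_net = ssd_vgg_300.SSDNet()
with slim.arg_scope(ssd_net.arg_scope(data_format='NHWC')):
    predictions, localisations, _, _ = ssd_net.net(img_input, is_training=False)

# Restore the pre-trained SSD weights.
sess = tf.Session()
saver = tf.train.Saver()
saver.restore(sess, './checkpoints/ssd_300_vgg.ckpt')

# The anchors are needed later to decode localisations back into boxes.
ssd_anchors = ssd_net.anchors((300, 300))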

2. Training

Preparing the data

First download the VOC data; see https://blog.csdn.net/stesha_chen/article/details/82965283 for how to download it.

Create a folder named tfrecords, then run tf_convert_data.py to convert the VOC data into TFRecord format. We will walk through this code in detail later.

python tf_convert_data.py \
    --dataset_name=pascalvoc \
    --dataset_dir=./VOCdevkit/VOC2007/ \
    --output_name=voc_2007_train \
    --output_dir=./tfrecords

Training from a pre-trained VGG

Since SSD's base network is a classification network, even when training SSD from scratch we first initialize its early layers with a pre-trained classification network. Here we use VGG as an example.

First download the pre-trained VGG weights (see the VGG download link).

python train_ssd_network.py \
    --train_dir=./log \
    --dataset_dir=./tfrecords \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --checkpoint_path=./checkpoints/vgg_16.ckpt \
    --checkpoint_model_scope=vgg_16 \
    --checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --trainable_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --save_summaries_secs=60 \
    --save_interval_secs=600 \
    --weight_decay=0.0005 \
    --optimizer=adam \
    --learning_rate=0.001 \
    --learning_rate_decay_factor=0.94 \
    --batch_size=32

--train_dir is where the training checkpoints, TensorBoard summaries, etc. are saved

--dataset_dir is where the converted tfrecords are stored

--dataset_name says which dataset we are training on, since different datasets are parsed slightly differently

--dataset_split_name says whether we use the train or the test split of pascalvoc_2007

--model_name is the name of the model: the detector is SSD with a 300x300 input and a VGG base network

--checkpoint_path is where the pre-trained VGG weights are stored

--checkpoint_exclude_scopes lists the scopes whose variables are not restored from the checkpoint

--trainable_scopes lists the scopes whose variables are the only ones updated during training (a sketch of how this scope-based filtering works follows)
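Both flags work by filtering variables by name prefix. A minimal sketch of the idea, assuming comma-separated scope strings like the flags above (the repo implements this in tf_utils.get_init_fn and train_ssd_network.py, with more bookkeeping than shown here):

import tensorflow as tf

slim = tf.contrib.slim

def get_variables_to_restore(checkpoint_exclude_scopes):
    # Keep every model variable whose name does not start with an excluded scope.
    exclusions = [s.strip() for s in checkpoint_exclude_scopes.split(',')]
    to_restore = []
    for var in slim.get_model_variables():
        if not any(var.op.name.startswith(ex) for ex in exclusions):
            to_restore.append(var)
    return to_restore

def get_variables_to_train(trainable_scopes):
    # Gradients are only computed and applied for variables under these scopes.
    variables = []
    for scope in [s.strip() for s in trainable_scopes.split(',')]:
        variables.extend(tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope))
    return variables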

Training from a pre-trained SSD

We already downloaded the pre-trained SSD weights when running the Jupyter notebook; running the command below continues training from them.

python train_ssd_network.py \
    --train_dir=./log/ \
    --dataset_dir=./tfrecords \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --checkpoint_path=./checkpoints/ssd_300_vgg.ckpt \
    --save_summaries_secs=60 \
    --save_interval_secs=600 \
    --weight_decay=0.0005 \
    --optimizer=adam \
    --learning_rate=0.001 \
    --batch_size=32

3. Evaluation

python eval_ssd_network.py \
    --eval_dir=./log/eval \
    --dataset_dir=./tfrecords \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=test \
    --model_name=ssd_300_vgg \
    --checkpoint_path=./checkpoints/ssd_300_vgg.ckpt \
    --wait_for_checkpoints=True \
    --batch_size=1 \
    --max_num_batches=500

Code Analysis

Data preparation

tf_convert_data.py mainly calls pascalvoc_to_tfrecords.run. The first argument dataset_dir is ./VOCdevkit/VOC2007, the second argument output_dir is ./tfrecords, and the third argument is voc_2007_train.

    path = os.path.join(dataset_dir, DIRECTORY_ANNOTATIONS)
    filenames = sorted(os.listdir(path))

path is ./VOCdevkit/VOC2007/Annotations; listing all the files in it gives filenames, where each file is an XML annotation file for one image.

It then iterates over all the annotation files, writing 200 samples into each TFRecord file.

        # Create a TFRecordWriter
        with tf.python_io.TFRecordWriter(tf_filename) as tfrecord_writer:
            j = 0
            # Loop up to SAMPLES_PER_FILES (200) times, writing into the same TFRecord file
            while i < len(filenames) and j < SAMPLES_PER_FILES:
                filename = filenames[i]
                img_name = filename[:-4]
                # Write the sample into the TFRecord file
                _add_to_tfrecord(dataset_dir, img_name, tfrecord_writer)

_add_to_tfrecord first calls _process_image to parse the data. Its return values are as follows:

image_data is the image buffer read via tf.gfile.FastGFile(filename, 'r').read()

    filename = directory + DIRECTORY_IMAGES + name + '.jpg'
    image_data = tf.gfile.FastGFile(filename, 'r').read()

shape is the size information read from the XML

    shape = [int(size.find('height').text),
             int(size.find('width').text),
             int(size.find('depth').text)]

bboxes holds the bounding box of every object read from the XML; if there are two objects in the image, bboxes has length 2 and contains two bbox entries

        bbox = obj.find('bndbox')
        bboxes.append((float(bbox.find('ymin').text) / shape[0],
                       float(bbox.find('xmin').text) / shape[1],
                       float(bbox.find('ymax').text) / shape[0],
                       float(bbox.find('xmax').text) / shape[1]
                       ))

labels stores the class ids of all objects in an image; for example, if the image contains a dog and a bird, then labels = [12, 3]

        label = obj.find('name').text
        labels.append(int(VOC_LABELS[label][0]))
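VOC_LABELS lives in datasets/pascalvoc_common.py and maps each class name to an (id, category) pair, with ids in the usual alphabetical VOC order and 0 reserved for background. An abridged excerpt to show its shape (the category strings are from memory and may differ slightly):

VOC_LABELS = {
    'none': (0, 'Background'),
    'aeroplane': (1, 'Vehicle'),
    'bicycle': (2, 'Vehicle'),
    'bird': (3, 'Animal'),
    # ...
    'dog': (12, 'Animal'),
    # ...
}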

labels_text stores the names of all objects in the image

        label = obj.find('name').text
        labels_text.append(label.encode('ascii'))

difficult stores each object's difficult flag: if the XML has it, its value is used, otherwise it defaults to 0. The flag marks whether the object was hard to annotate; later processing of the dataset can choose whether to filter such samples out.

        if obj.find('difficult'):
            difficult.append(int(obj.find('difficult').text))
        else:
            difficult.append(0)

truncated stores each object's truncated flag: if the XML has it, its value is used, otherwise it defaults to 0

        if obj.find('truncated'):
            truncated.append(int(obj.find('truncated').text))
        else:
            truncated.append(0)

Next, _add_to_tfrecord calls _convert_to_example to build the example from the values obtained above; xmin is the collection of all objects' xmin values in the image, and likewise for xmax, ymin and ymax.

    example = tf.train.Example(features=tf.train.Features(feature={
            'image/height': int64_feature(shape[0]),
            'image/width': int64_feature(shape[1]),
            'image/channels': int64_feature(shape[2]),
            'image/shape': int64_feature(shape),
            'image/object/bbox/xmin': float_feature(xmin),
            'image/object/bbox/xmax': float_feature(xmax),
            'image/object/bbox/ymin': float_feature(ymin),
            'image/object/bbox/ymax': float_feature(ymax),
            'image/object/bbox/label': int64_feature(labels),
            'image/object/bbox/label_text': bytes_feature(labels_text),
            'image/object/bbox/difficult': int64_feature(difficult),
            'image/object/bbox/truncated': int64_feature(truncated),
            'image/format': bytes_feature(image_format),
            'image/encoded': bytes_feature(image_data)}))

Finally, _add_to_tfrecord writes the example into the TFRecord file

tfrecord_writer.write(example.SerializeToString())

To summarize, generating a TFRecord file takes only the following steps (a minimal end-to-end sketch follows the list):

  1. tfrecord_writer = tf.python_io.TFRecordWriter(...)
  2. Parse the XML annotation and the image file
  3. example = tf.train.Example(...)
  4. tfrecord_writer.write(example.SerializeToString())
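Putting the four steps together, here is a minimal, self-contained sketch of writing one sample to a TFRecord file. The feature helpers are simplified versions of the repo's int64_feature/bytes_feature, and the image path is hypothetical:

import tensorflow as tf

def int64_feature(value):
    value = value if isinstance(value, list) else [value]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# 1. Create the writer.
with tf.python_io.TFRecordWriter('./tfrecords/toy_000.tfrecord') as writer:
    # 2. Normally: parse the XML annotation and read the raw JPEG bytes.
    image_data = tf.gfile.FastGFile('000001.jpg', 'rb').read()  # hypothetical image
    shape = [375, 500, 3]
    # 3. Build the Example with the same kind of feature dict as _convert_to_example.
    example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': int64_feature(shape[0]),
        'image/width': int64_feature(shape[1]),
        'image/format': bytes_feature(b'JPEG'),
        'image/encoded': bytes_feature(image_data)}))
    # 4. Serialize and write.
    writer.write(example.SerializeToString())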

Training from scratch on top of VGG

The training entry point is train_ssd_network.py, starting from its main function.

1. Setup

        # Initialize a DeploymentConfig object that stores these settings;
        # it mainly configures multi-GPU training and whether clones are placed on the CPU
        # Args:
        #   num_clones: Number of model clones to deploy in each replica.
        #   clone_on_cpu: If True clones would be placed on CPU.
        #   replica_id: Integer.  Index of the replica for which the model is
        #       deployed.  Usually 0 for the chief replica.
        #   num_replicas: Number of replicas to use.
        #   num_ps_tasks: Number of tasks for the `ps` job. 0 to not use replicas.
        deploy_config = model_deploy.DeploymentConfig(
            num_clones=FLAGS.num_clones,
            clone_on_cpu=FLAGS.clone_on_cpu,
            replica_id=0,
            num_replicas=1,
            num_ps_tasks=0)
        # deploy_config.variables_device() is /device:CPU:0, i.e. the variables below are placed on the CPU
        with tf.device(deploy_config.variables_device()):
            # Printing global_step gives <tf.Variable 'global_step:0' shape=() dtype=int64_ref>
            # It is the global step counter and is used later
            global_step = slim.create_global_step()

2. Getting the dataset

        dataset = dataset_factory.get_dataset(
            FLAGS.dataset_name, FLAGS.dataset_split_name, FLAGS.dataset_dir)

get_dataset eventually calls get_split in pascalvoc_common.py to read the tfrecords and build the dataset.

It first creates the reader

    if reader is None:
        reader = tf.TFRecordReader

Then it builds the decoder

    decoder = slim.tfexample_decoder.TFExampleDecoder(
        keys_to_features, items_to_handlers)

Finally it returns a dataset assembled from these pieces

    return slim.dataset.Dataset(
            data_sources=file_pattern,
            reader=reader,
            decoder=decoder,
            num_samples=split_to_sizes[split_name],
            items_to_descriptions=items_to_descriptions,
            num_classes=num_classes,
            labels_to_names=labels_to_names)
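keys_to_features tells the decoder how to parse each serialized Example (the keys match what _convert_to_example wrote), and items_to_handlers maps those features to the items the provider exposes. Roughly, they look like this (abridged; the exact definitions in pascalvoc_common.py may differ slightly):

keys_to_features = {
    'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
    'image/format': tf.FixedLenFeature((), tf.string, default_value='jpeg'),
    'image/shape': tf.FixedLenFeature([3], tf.int64),
    'image/object/bbox/ymin': tf.VarLenFeature(dtype=tf.float32),
    'image/object/bbox/xmin': tf.VarLenFeature(dtype=tf.float32),
    'image/object/bbox/ymax': tf.VarLenFeature(dtype=tf.float32),
    'image/object/bbox/xmax': tf.VarLenFeature(dtype=tf.float32),
    'image/object/bbox/label': tf.VarLenFeature(dtype=tf.int64),
}
items_to_handlers = {
    'image': slim.tfexample_decoder.Image('image/encoded', 'image/format'),
    'shape': slim.tfexample_decoder.Tensor('image/shape'),
    'object/bbox': slim.tfexample_decoder.BoundingBox(
        ['ymin', 'xmin', 'ymax', 'xmax'], 'image/object/bbox/'),
    'object/label': slim.tfexample_decoder.Tensor('image/object/bbox/label'),
}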

3. Generating anchors

# model_name is ssd_300_vgg, so ssd_class is ssd_vgg_300.SSDNet
ssd_class = nets_factory.get_network(FLAGS.model_name)
# Equivalent to calling ssd_vgg_300.SSDNet.default_params._replace,
# i.e. SSDParams._replace,
# which replaces num_classes in SSDParams with 21 (it is already 21 anyway)
ssd_params = ssd_class.default_params._replace(num_classes=FLAGS.num_classes)
# Calls ssd_vgg_300.SSDNet(ssd_params) to create the SSDNet object
ssd_net = ssd_class(ssd_params)
# ssd_shape is (300, 300)
ssd_shape = ssd_net.params.img_shape
# Calls ssd_anchors_all_layers to generate the anchors for each layer
ssd_anchors = ssd_net.anchors(ssd_shape)

So let us look next at the implementation of ssd_anchors_all_layers

def ssd_anchors_all_layers(img_shape,
                           layers_shape,
                           anchor_sizes,
                           anchor_ratios,
                           anchor_steps,
                           offset=0.5,
                           dtype=np.float32):
    # img_shape:(300, 300)
    # layers_shape:[(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)]
    # anchor_sizes:[(21., 45.),(45., 99.),(99., 153.),(153., 207.),(207., 261.),(261., 315.)]
    # anchor_ratios:[[2, .5],[2, .5, 3, 1./3],[2, .5, 3, 1./3],[2, .5, 3, 1./3],[2, .5],[2, .5]]
    # anchor_steps:[8, 16, 32, 64, 100, 300]
    # offset:0.5
    layers_anchors = []
    # As in the paper, 6 feature map layers generate anchors (the default boxes)
    # Generate anchors for each feature map layer
    for i, s in enumerate(layers_shape):
        anchor_bboxes = ssd_anchor_one_layer(img_shape, s,
                                             anchor_sizes[i],
                                             anchor_ratios[i],
                                             anchor_steps[i],
                                             offset=offset, dtype=dtype)
        layers_anchors.append(anchor_bboxes)
    return layers_anchors

Finally, look at the implementation of ssd_anchor_one_layer, taking the first layer (38, 38) as an example

def ssd_anchor_one_layer(img_shape,
                         feat_shape,
                         sizes,
                         ratios,
                         step,
                         offset=0.5,
                         dtype=np.float32):
    # img_shape: (300, 300)
    # feat_shape: (38, 38)
    # sizes: (21., 45.)
    # ratios: [2, .5]
    # step: 8
    # y, x are the coordinates of every cell on the feature map, shape=(38, 38)
    y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    # Map the cell coordinates back to the original image (via step and offset) and normalize by the input image height/width
    y = (y.astype(dtype) + offset) * step / img_shape[0]
    x = (x.astype(dtype) + offset) * step / img_shape[1]

    # Expand dims to support easy broadcasting.
    # shape=(38, 38, 1)
    y = np.expand_dims(y, axis=-1)
    x = np.expand_dims(x, axis=-1)

    num_anchors = len(sizes) + len(ratios)
    # Generate num_anchors (4 here) pairs of h and w
    h = np.zeros((num_anchors, ), dtype=dtype)
    w = np.zeros((num_anchors, ), dtype=dtype)
    # Add first anchor boxes with ratio=1.
    h[0] = sizes[0] / img_shape[0]
    w[0] = sizes[0] / img_shape[1]
    di = 1
    if len(sizes) > 1:
        h[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[0]
        w[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[1]
        di += 1
    for i, r in enumerate(ratios):
        h[i+di] = sizes[0] / img_shape[0] / math.sqrt(r)
        w[i+di] = sizes[0] / img_shape[1] * math.sqrt(r)
    return y, x, h, w

The other feature maps are handled in the same way, except that the (19, 19), (10, 10) and (5, 5) feature maps have 6 anchors per cell. For now all the anchors of a layer share the same set of widths and heights; scaling is applied later.
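To make the numbers concrete, here is a small standalone reproduction of the h/w computation for the first layer (sizes (21., 45.), ratios [2, .5], input size 300), giving the 4 anchor shapes shared by every cell of the 38x38 map:

import math

img_h = img_w = 300.
sizes = (21., 45.)
ratios = [2, .5]

num_anchors = len(sizes) + len(ratios)       # 4
h = [0.] * num_anchors
w = [0.] * num_anchors
# Square anchor with side sizes[0].
h[0] = sizes[0] / img_h
w[0] = sizes[0] / img_w
# Square anchor with side sqrt(sizes[0] * sizes[1]).
h[1] = math.sqrt(sizes[0] * sizes[1]) / img_h
w[1] = math.sqrt(sizes[0] * sizes[1]) / img_w
# One anchor per aspect ratio.
for i, r in enumerate(ratios):
    h[i + 2] = sizes[0] / img_h / math.sqrt(r)
    w[i + 2] = sizes[0] / img_w * math.sqrt(r)

print(list(zip(h, w)))
# approx. [(0.070, 0.070), (0.102, 0.102), (0.049, 0.099), (0.099, 0.049)]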

4. Fetching data through a provider

with tf.name_scope(FLAGS.dataset_name + '_data_provider'):
    # Create the data provider; reading data this way is much faster than a plain reader,
    # because several readers are used and the next queue is filled while the current one is being consumed
    provider = slim.dataset_data_provider.DatasetDataProvider(
        dataset,
        num_readers=FLAGS.num_readers,
        common_queue_capacity=20 * FLAGS.batch_size,
        common_queue_min=10 * FLAGS.batch_size,
        shuffle=True)
# Get for SSD network: image, labels, bboxes.
# The provider lets us fetch items from the dataset
[image, shape, glabels, gbboxes] = provider.get(['image', 'shape',
                                                 'object/label',
                                                 'object/bbox'])

image: the image data and its format information

shape: the image height, width and number of channels

glabels: the label ids of all objects in the image

gbboxes: the bounding boxes of all objects in the image

5. Preprocessing the fetched data

preprocessing_name = FLAGS.preprocessing_name or FLAGS.model_name
image_preprocessing_fn = preprocessing_factory.get_preprocessing(
    preprocessing_name, is_training=True)
......
# ssd_shape is (300, 300)
# DATA_FORMAT in the code is NCHW (channels first); I changed it to NHWC here, which makes little difference
image, glabels, gbboxes = \
    image_preprocessing_fn(image, glabels, gbboxes,
                           out_shape=ssd_shape,
                           data_format=DATA_FORMAT)

This in fact calls the preprocess_for_train function in ssd_vgg_preprocessing.py

# Distort image and bounding boxes.
dst_image = image
# Call tf.image.sample_distorted_bounding_box to generate a randomly distorted bounding box for data augmentation
# dst_image is the image cropped out by the generated box, shape (None, None, 3)
# distort_bbox is the new image's bounding box within the original image, shape (4,)
# bboxes is produced in the following steps:
# 1. Adjust the original bboxes into coordinates relative to the new image
# 2. Since the crop may cut away most of a box, boxes that lose more than 0.5 of their area are dropped, giving the new bboxes
# 3. If a box is dropped from bboxes, the corresponding label is dropped as well, giving the new labels
dst_image, labels, bboxes, distort_bbox = \
    distorted_bounding_box_crop(image, labels, bboxes,
                                min_object_covered=MIN_OBJECT_COVERED,
                                aspect_ratio_range=CROP_RATIO_RANGE)

# Resize the image to (300, 300)
dst_image = tf_image.resize_image(dst_image, out_shape,
                                  method=tf.image.ResizeMethod.BILINEAR,
                                  align_corners=False)

# Randomly flip the image left-right; the bboxes are transformed accordingly
dst_image, bboxes = tf_image.random_flip_left_right(dst_image, bboxes)
# Randomly pick one of 4 ways to distort the color: brightness, contrast, hue, saturation
dst_image = apply_with_random_selector(
        dst_image,
        lambda x, ordering: distort_color(x, ordering, fast_mode),
        num_cases=4)

image = dst_image * 255.
# Common practice: subtract the RGB means
image = tf_image_whitened(image, [_R_MEAN, _G_MEAN, _B_MEAN])
# Return the new image
return image, labels, bboxes

6. Processing the gt boxes and anchors

gclasses, glocalisations, gscores = \
    ssd_net.bboxes_encode(glabels, gbboxes, ssd_anchors)
batch_shape = [1] + [len(ssd_anchors)] * 3

Then, for each feature map layer, the anchors are taken from ssd_anchors and tf_ssd_bboxes_encode_layer in ssd_common.py is called

def tf_ssd_bboxes_encode_layer(labels,
                               bboxes,
                               anchors_layer,
                               num_classes,
                               no_annotation_label,
                               ignore_threshold=0.5,
                               prior_scaling=[0.1, 0.1, 0.2, 0.2],
                               dtype=tf.float32):
    # Taking the (38, 38) feature map as an example: yref shape (38, 38, 1), xref shape (38, 38, 1)
    # href shape(4,), wref shape(4,)
    yref, xref, href, wref = anchors_layer
    # The computation below uses (xref, yref) as centers and (wref, href) as widths/heights to get each box's top-left and bottom-right corners
    # ymin,xmin,ymax,xmax shape (38, 38, 4)
    ymin = yref - href / 2.
    xmin = xref - wref / 2.
    ymax = yref + href / 2.
    xmax = xref + wref / 2.
    # vol_anchors shape (38, 38, 4)
    vol_anchors = (xmax - xmin) * (ymax - ymin)

    # shape (38, 38, 4)
    shape = (yref.shape[0], yref.shape[1], href.size)
    feat_labels = tf.zeros(shape, dtype=tf.int64)
    feat_scores = tf.zeros(shape, dtype=dtype)

    feat_ymin = tf.zeros(shape, dtype=dtype)
    feat_xmin = tf.zeros(shape, dtype=dtype)
    feat_ymax = tf.ones(shape, dtype=dtype)
    feat_xmax = tf.ones(shape, dtype=dtype)

    i = 0
    # Loop over all labels; in the loop body:
    # 1. Get the current label and the current bbox
    # 2. Compute the IoU between every anchor and this bbox, giving jaccard, shape (38, 38, 4)
    # 3. Find the positions where jaccard exceeds the anchor's IoU with any previous bbox (so each anchor is matched to the bbox it overlaps most), giving a boolean mask of shape (38, 38, 4)
    # 4. Remove some abnormal entries from the mask
    # 5. feat_labels holds each anchor's label, shape (38, 38, 4): where the mask is 1 the current label is written, otherwise the label from a previous iteration (or 0) is kept
    # 6. feat_scores records the anchor-box IoU, updated only where the mask is true, shape (38, 38, 4)
    # 7. feat_ymin, feat_xmin, feat_ymax, feat_xmax hold the coordinates of the current box or of a previously matched box
    # In short, the anchor-box IoU is used to produce the following:
    # feat_labels: each anchor gets the label of the gt box it overlaps most, or 0 if it overlaps none
    # feat_scores: the largest IoU between the anchor and any gt box
    # feat_ymin, feat_xmin, feat_ymax, feat_xmax: the coordinates of the gt box with which the anchor has the largest IoU
    [i, feat_labels, feat_scores,
     feat_ymin, feat_xmin,
     feat_ymax, feat_xmax] = tf.while_loop(condition, body,
                                           [i, feat_labels, feat_scores,
                                            feat_ymin, feat_xmin,
                                            feat_ymax, feat_xmax])
    # Center point and height/width
    feat_cy = (feat_ymax + feat_ymin) / 2.
    feat_cx = (feat_xmax + feat_xmin) / 2.
    feat_h = feat_ymax - feat_ymin
    feat_w = feat_xmax - feat_xmin
    # Compute the offsets of the center and height/width relative to the anchor
    feat_cy = (feat_cy - yref) / href / prior_scaling[0]
    feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
    feat_h = tf.log(feat_h / href) / prior_scaling[2]
    feat_w = tf.log(feat_w / wref) / prior_scaling[3]
    # feat_localizations, shape (38, 38, 4, 4), holds every anchor's offsets relative to its matched gt box
    feat_localizations = tf.stack([feat_cx, feat_cy, feat_w, feat_h], axis=-1)
    return feat_labels, feat_localizations, feat_scores

The returned gclasses, glocalisations and gscores each have length 6, holding feat_labels, feat_localizations and feat_scores for every layer. (The IoU computation used inside the loop is sketched below.)
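The IoU computation itself is not quoted above; it is simply the intersection-over-union between every anchor box and the current ground-truth box, broadcast over the (38, 38, 4) grid. A sketch of that helper, reading the ymin/xmin/ymax/xmax and vol_anchors tensors defined earlier (the repo defines it as a closure inside tf_ssd_bboxes_encode_layer):

def jaccard_with_anchors(bbox):
    """IoU between every anchor (shape (38, 38, 4)) and one gt bbox [ymin, xmin, ymax, xmax]."""
    int_ymin = tf.maximum(ymin, bbox[0])
    int_xmin = tf.maximum(xmin, bbox[1])
    int_ymax = tf.minimum(ymax, bbox[2])
    int_xmax = tf.minimum(xmax, bbox[3])
    h = tf.maximum(int_ymax - int_ymin, 0.)
    w = tf.maximum(int_xmax - int_xmin, 0.)
    inter_vol = h * w
    union_vol = vol_anchors - inter_vol \
        + (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
    return tf.div(inter_vol, union_vol)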

7. Preparing the batch queue

r = tf.train.batch(
    tf_utils.reshape_list([image, gclasses, glocalisations, gscores]),
    batch_size=FLAGS.batch_size,
    num_threads=FLAGS.num_preprocessing_threads,
    capacity=5 * FLAGS.batch_size)
b_image, b_gclasses, b_glocalisations, b_gscores = \
    tf_utils.reshape_list(r, batch_shape)

# Intermediate queueing: unique batch computation pipeline for all
# GPUs running the training.
batch_queue = slim.prefetch_queue.prefetch_queue(
    tf_utils.reshape_list([b_image, b_gclasses, b_glocalisations, b_gscores]),
    capacity=2 * deploy_config.num_clones)

The assembled batches are pushed into a prefetch queue so that training can pull batches from the queue; this is faster because the queue is filled by multiple threads.
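tf_utils.reshape_list is just a small helper for flattening a nested list of tensors and regrouping the flat list according to batch_shape (here [1, 6, 6, 6]: one image, then 6 tensors each for gclasses, glocalisations and gscores). A rough equivalent, assuming that behaviour:

def reshape_list(l, shape=None):
    """Flatten a nested list, or regroup a flat list according to `shape`."""
    if shape is None:
        # Flatten: [image, [c1..c6], [l1..l6], [s1..s6]] -> [image, c1, ..., s6]
        flat = []
        for item in l:
            flat += list(item) if isinstance(item, (list, tuple)) else [item]
        return flat
    # Regroup: shape=[1, 6, 6, 6] -> [image, [c1..c6], [l1..l6], [s1..s6]]
    grouped, idx = [], 0
    for size in shape:
        grouped.append(l[idx] if size == 1 else list(l[idx:idx + size]))
        idx += size
    return grouped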

8. Building the network

clones = model_deploy.create_clones(deploy_config, clone_fn, [batch_queue])

This in fact calls the clone_fn function

def clone_fn(batch_queue):
    # Dequeue a batch from the queue
    b_image, b_gclasses, b_glocalisations, b_gscores = \
        tf_utils.reshape_list(batch_queue.dequeue(), batch_shape)

    # Prepare the network defaults:
    # activation_fn=tf.nn.relu
    # weights_regularizer=slim.l2_regularizer(weight_decay)
    # weights_initializer=tf.contrib.layers.xavier_initializer()
    # biases_initializer=tf.zeros_initializer()
    # padding='SAME'
    arg_scope = ssd_net.arg_scope(weight_decay=FLAGS.weight_decay,
                                  data_format=DATA_FORMAT)
    with slim.arg_scope(arg_scope):
        # This in fact calls the ssd_net function in ssd_vgg_300.py
        predictions, localisations, logits, end_points = \
            ssd_net.net(b_image, is_training=True)
    # This in fact calls the ssd_losses function in ssd_vgg_300.py
    ssd_net.losses(logits, localisations,
                   b_gclasses, b_glocalisations, b_gscores,
                   match_threshold=FLAGS.match_threshold,
                   negative_ratio=FLAGS.negative_ratio,
                   alpha=FLAGS.loss_alpha,
                   label_smoothing=FLAGS.label_smoothing)
    return end_points

ssd_net

The function first builds the convolutional network, from block1 through block11

It then takes the 6 layers specified by feat_layers (block4, block7, block8, block9, block10, block11) and calls ssd_multibox_layer on each of them

predictions = []
logits = []
localisations = []
for i, layer in enumerate(feat_layers):
    with tf.variable_scope(layer + '_box'):
        # Using the 38x38 layer as an example, this function mainly does the following:
        # 1. A convolution produces loc_pred (38, 38, 16), reshaped to (38, 38, 4, 4): 4 anchors, 4 numbers per anchor
        # 2. A convolution produces cls_pred (38, 38, 84), reshaped to (38, 38, 4, 21): 4 anchors, 21 numbers per anchor, one per class
        p, l = ssd_multibox_layer(end_points[layer],
                                  num_classes,
                                  anchor_sizes[i],
                                  anchor_ratios[i],
                                  normalizations[i])
    predictions.append(prediction_fn(p))
    logits.append(p)
    localisations.append(l)

predictions collects each layer's cls_pred after a softmax

logits collects each layer's raw cls_pred

localisations collects each layer's loc_pred
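The per-layer prediction head itself is just two convolutions plus reshapes. A condensed sketch of what ssd_multibox_layer does for one feature map (the real function also applies L2 normalization on block4 and handles the data format; the names here are illustrative):

import tensorflow as tf

slim = tf.contrib.slim

def multibox_head(net, num_classes, sizes, ratios):
    """Condensed per-layer head returning (cls_pred, loc_pred) for one feature map."""
    num_anchors = len(sizes) + len(ratios)
    fh, fw = net.get_shape().as_list()[1:3]       # static feature map size, e.g. 38, 38
    # Location head: 4 offsets per anchor, e.g. (38, 38, 16) -> (38, 38, 4, 4).
    loc_pred = slim.conv2d(net, num_anchors * 4, [3, 3],
                           activation_fn=None, scope='conv_loc')
    loc_pred = tf.reshape(loc_pred, [-1, fh, fw, num_anchors, 4])
    # Class head: num_classes scores per anchor, e.g. (38, 38, 84) -> (38, 38, 4, 21).
    cls_pred = slim.conv2d(net, num_anchors * num_classes, [3, 3],
                           activation_fn=None, scope='conv_cls')
    cls_pred = tf.reshape(cls_pred, [-1, fh, fw, num_anchors, num_classes])
    return cls_pred, loc_pred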

ssd_losses

Finally ssd_losses is called to build the loss, which consists of three parts: the cross-entropy loss on positive samples, the cross-entropy loss on negative samples, and the localization loss

def ssd_losses(logits, localisations,
               gclasses, glocalisations, gscores,
               match_threshold=0.5,
               negative_ratio=3.,
               alpha=1.,
               label_smoothing=0.,
               device='/cpu:0',
               scope=None):
    with tf.name_scope(scope, 'ssd_losses'):
        lshape = tfe.get_shape(logits[0], 5)
        num_classes = lshape[-1]
        batch_size = lshape[0]

        # Flatten out all vectors!
        flogits = []
        fgclasses = []
        fgscores = []
        flocalisations = []
        fglocalisations = []
        # Process each layer in turn, using the 38x38 layer as an example
        for i in range(len(logits)):
            # after reshape: shape=(184832, 21); 184832 = 32*38*38*4, where 32 is the batch size
            flogits.append(tf.reshape(logits[i], [-1, num_classes]))
            # after reshape: shape=(184832,)
            fgclasses.append(tf.reshape(gclasses[i], [-1]))
            # after reshape: shape=(184832,)
            fgscores.append(tf.reshape(gscores[i], [-1]))
            # after reshape: shape=(184832, 4)
            flocalisations.append(tf.reshape(localisations[i], [-1, 4]))
            # after reshape: shape=(184832, 4)
            fglocalisations.append(tf.reshape(glocalisations[i], [-1, 4]))
        # After concatenating everything there are 279424 = 8732*32 anchors; 8732 is the per-image count derived in the paper
        logits = tf.concat(flogits, axis=0)
        gclasses = tf.concat(fgclasses, axis=0)
        gscores = tf.concat(fgscores, axis=0)
        localisations = tf.concat(flocalisations, axis=0)
        glocalisations = tf.concat(fglocalisations, axis=0)
        dtype = logits.dtype

        # Select the positive samples using the match threshold
        pmask = gscores > match_threshold
        fpmask = tf.cast(pmask, dtype)
        n_positives = tf.reduce_sum(fpmask)

        # Select the negative samples
        no_classes = tf.cast(pmask, tf.int32)
        predictions = slim.softmax(logits)
        # From the anchors that are not positives, pick those with gscores > -0.5 as negative candidates
        nmask = tf.logical_and(tf.logical_not(pmask),
                               gscores > -0.5)
        fnmask = tf.cast(nmask, dtype)
        # Take their predicted background scores; the remaining entries are filled with 1
        nvalues = tf.where(nmask,
                           predictions[:, 0],
                           1. - fnmask)
        nvalues_flat = tf.reshape(nvalues, [-1])
        # Number of negative candidates
        max_neg_entries = tf.cast(tf.reduce_sum(fnmask), tf.int32)
        # Keep the number of negatives at most about three times the number of positives to avoid class imbalance
        n_neg = tf.cast(negative_ratio * n_positives, tf.int32) + batch_size
        n_neg = tf.minimum(n_neg, max_neg_entries)

        val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg)
        max_hard_pred = -val[-1]
        # This gives the final mask of (hard) negative samples
        nmask = tf.logical_and(nmask, nvalues < max_hard_pred)
        fnmask = tf.cast(nmask, dtype)

        # Cross-entropy losses for the positive and negative samples.
        with tf.name_scope('cross_entropy_pos'):
            loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                                  labels=gclasses)
            loss = tf.div(tf.reduce_sum(loss * fpmask), batch_size, name='value')
            tf.losses.add_loss(loss)

        with tf.name_scope('cross_entropy_neg'):
            loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                                  labels=no_classes)
            loss = tf.div(tf.reduce_sum(loss * fnmask), batch_size, name='value')
            tf.losses.add_loss(loss)

        # Add localization loss: smooth L1, L2, ...
        with tf.name_scope('localization'):
            # Weights Tensor: positive mask + random negative.
            weights = tf.expand_dims(alpha * fpmask, axis=-1)
            loss = custom_layers.abs_smooth(localisations - glocalisations)
            loss = tf.div(tf.reduce_sum(loss * weights), batch_size, name='value')
            tf.losses.add_loss(loss)
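For reference, these three terms implement the overall SSD objective from the paper (with the confidence loss split into its positive and hard-negative parts), except that the code above divides by the batch size rather than by the number of matched boxes N:

L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha\, L_{loc}(x, l, g)\right)

with \alpha = 1 corresponding to FLAGS.loss_alpha.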

9. Starting training

Fetch the total loss, create the optimizer, create the saver that stores the weights, and then training can begin.

Before training, tf_utils.get_init_fn(FLAGS) is called to restore weights from the checkpoint

        slim.learning.train(
            train_tensor,
            logdir=FLAGS.train_dir,
            master='',
            is_chief=True,
            init_fn=tf_utils.get_init_fn(FLAGS),
            summary_op=summary_op,
            number_of_steps=FLAGS.max_number_of_steps,
            log_every_n_steps=FLAGS.log_every_n_steps,
            save_summaries_secs=FLAGS.save_summaries_secs,
            saver=saver,
            save_interval_secs=FLAGS.save_interval_secs,
            session_config=config,
            sync_optimizer=None)

You can see that the whole pipeline is end-to-end: anchors are generated from the feature maps and fed straight into the classification heads, unlike Faster R-CNN, which first computes proposals from an RPN and then feeds them into a classification network.

Prediction pipeline

For prediction, refer to the implementation in notebooks/ssd_tests.ipynb: NMS is applied to the predicted boxes to produce the final detections.

TensorFlow now supports NMS directly via tf.image.non_max_suppression

def bboxes_nms(scores, bboxes, nms_threshold=0.5, keep_top_k=200, scope=None):
    with tf.name_scope(scope, 'bboxes_nms_single', [scores, bboxes]):
        # Apply NMS algorithm.
        idxes = tf.image.non_max_suppression(bboxes, scores,
                                             keep_top_k, nms_threshold)
        scores = tf.gather(scores, idxes)
        bboxes = tf.gather(bboxes, idxes)
        # Pad results.
        scores = tfe_tensors.pad_axis(scores, 0, keep_top_k, axis=0)
        bboxes = tfe_tensors.pad_axis(bboxes, 0, keep_top_k, axis=0)
        return scores, bboxes
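A tiny standalone example of tf.image.non_max_suppression, to illustrate the argument order used above (boxes, scores, max output size, IoU threshold); the box values are made up:

import tensorflow as tf

boxes = tf.constant([[0.0, 0.0, 0.5, 0.5],
                     [0.0, 0.0, 0.45, 0.5],   # heavily overlaps the first box
                     [0.5, 0.5, 1.0, 1.0]])
scores = tf.constant([0.9, 0.8, 0.7])

keep = tf.image.non_max_suppression(boxes, scores,
                                    max_output_size=200, iou_threshold=0.5)
with tf.Session() as sess:
    print(sess.run(keep))   # [0 2]: the second box is suppressed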

That wraps up the code analysis. There are still many details worth digging into; feel free to leave a comment if you would like to discuss them.

Thanks for reading.

 
