Cascaded Pyramid Network for Multi-Person Pose Estimation

Cascaded Pyramid Network (CPN) was proposed for the 2017 COCO human keypoint detection challenge, where it won first place.
The CPN network architecture is shown below.
[figure: CPN architecture diagram]

GlobalNet extracts features from four different blocks of a ResNet backbone and uses the semantic information at these different scales to make predictions; RefineNet then focuses on localizing the hard keypoints.

The main structure, in code:

resnet_fms = resnet101(image, is_train, bn_trainable=True)
# extract feature maps from four different blocks of ResNet-101
global_fms, global_outs = create_global_net(resnet_fms, is_train)
# feed the ResNet feature maps into GlobalNet, producing global feature maps
# (fed into RefineNet) and global outputs (used to compute the loss)
refine_out = create_refine_net(global_fms, is_train)
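In the paper, the global outputs are supervised with an L2 loss against ground-truth heatmaps, while RefineNet is trained with L2 loss plus online hard keypoint mining: only the hardest keypoints contribute to its loss. A minimal pure-Python sketch of that hard-keypoint selection (the function name and the top-k value here are illustrative, not taken from the repo):

```python
def ohkm_loss(per_joint_losses, top_k=8):
    """Online hard keypoint mining: average only the top_k largest
    per-joint losses, so easy keypoints do not dilute the gradient.

    per_joint_losses: list of scalar L2 losses, one per keypoint.
    """
    hardest = sorted(per_joint_losses, reverse=True)[:top_k]
    return sum(hardest) / len(hardest)

# With 17 COCO keypoints, only the 8 hardest contribute:
losses = [0.9, 0.1, 0.5, 0.2, 0.8, 0.05, 0.3, 0.7, 0.6, 0.4,
          0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75]
hard_loss = ohkm_loss(losses)
```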

The GlobalNet code is as follows:

def create_global_net(blocks, is_training, trainable=True):
    global_fms = []
    global_outs = []
    last_fm = None
    initializer = tf.contrib.layers.xavier_initializer()
    for i, block in enumerate(reversed(blocks)):
        with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
            lateral = slim.conv2d(block, 256, [1, 1],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=tf.nn.relu,
                scope='lateral/res{}'.format(5-i))

        if last_fm is not None:
            sz = tf.shape(lateral)
            upsample = tf.image.resize_bilinear(last_fm, (sz[1], sz[2]),
                name='upsample/res{}'.format(5-i))
            upsample = slim.conv2d(upsample, 256, [1, 1],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=None,
                scope='merge/res{}'.format(5-i))
            last_fm = upsample + lateral
        else:
            last_fm = lateral

        with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
            tmp = slim.conv2d(last_fm, 256, [1, 1],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=tf.nn.relu,
                scope='tmp/res{}'.format(5-i))
            out = slim.conv2d(tmp, cfg.nr_skeleton, [3, 3],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=None,
                scope='pyramid/res{}'.format(5-i))
        global_fms.append(last_fm)
        global_outs.append(tf.image.resize_bilinear(out, (cfg.output_shape[0], cfg.output_shape[1])))
    global_fms.reverse()
    global_outs.reverse()
    return global_fms, global_outs
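Since the function only reshapes and merges feature maps, its bookkeeping can be traced with shapes alone. A pure-Python sketch of the loop above (shapes only, assuming 17 COCO keypoints and a 96 x 72 output size; not part of the repo):

```python
def simulate_global_net(block_shapes, output_shape=(96, 72), num_joints=17):
    """Trace the shapes produced by create_global_net.

    block_shapes: (h, w, c) of the four ResNet block outputs, shallow to deep.
    Returns (global_fm_shapes, global_out_shapes), both shallow to deep,
    mirroring the reversed()/.reverse() bookkeeping in the real code.
    """
    global_fms, global_outs = [], []
    for h, w, _ in reversed(block_shapes):          # deepest block first
        # 1x1 lateral conv maps to 256 channels; the upsample-and-add
        # step changes neither the spatial size nor the channel count
        global_fms.append((h, w, 256))
        # tmp 1x1 conv, then 3x3 conv to num_joints, resized to output_shape
        global_outs.append((output_shape[0], output_shape[1], num_joints))
    global_fms.reverse()
    global_outs.reverse()
    return global_fms, global_outs

blocks = [(96, 72, 256), (48, 36, 512), (24, 18, 1024), (12, 9, 2048)]
fms, outs = simulate_global_net(blocks)
```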

Following the code, and taking an input image of size 384x288 as an example, the data flow is as follows (batch size omitted):

input: 384 x 288 x 3

ResNet-101 block outputs (overall strides /4, /8, /16, /32 relative to the input):
- resnet_v1_101_1/block1/unit_3/bottleneck_v1/Relu: 96 x 72 x 256
- resnet_v1_101_2/block2/unit_4/bottleneck_v1/Relu: 48 x 36 x 512
- resnet_v1_101_3/block3/unit_23/bottleneck_v1/Relu: 24 x 18 x 1024
- resnet_v1_101_4/block4/unit_3/bottleneck_v1/Relu: 12 x 9 x 2048

Lateral 256-channel 1 x 1 convs on each block output:
- lateral/res2/Relu: 96 x 72 x 256
- lateral/res3/Relu: 48 x 36 x 256
- lateral/res4/Relu: 24 x 18 x 256
- lateral/res5/Relu: 12 x 9 x 256

Top-down path: each coarser map is bilinearly upsampled to the next finer resolution, passed through a 256-channel 1 x 1 conv, and added to the lateral map:
- upsample/res4: 24 x 18 x 256
- upsample/res3: 48 x 36 x 256
- upsample/res2: 96 x 72 x 256

Per-level heads: a 256-channel 1 x 1 conv (tmp) followed by a 3 x 3 conv with joint-num channels (pyramid):
- tmp/res5/Relu: 12 x 9 x 256, then pyramid/res5: 12 x 9 x joint num
- tmp/res4/Relu: 24 x 18 x 256, then pyramid/res4: 24 x 18 x joint num
- tmp/res3/Relu: 48 x 36 x 256, then pyramid/res3: 48 x 36 x joint num
- tmp/res2/Relu: 96 x 72 x 256, then pyramid/res2: 96 x 72 x joint num

Each pyramid output is finally bilinearly resized to 96 x 72 and collected in the global outs list.

This flow produces the two returned lists: the global feature maps (last_fm at each level, fed into RefineNet) and the global outputs (the resized predictions, used for the loss).
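The block resolutions above follow directly from ResNet's cumulative strides of 4, 8, 16, and 32; a one-line sanity check (a small sketch, not part of the repo):

```python
def pyramid_resolutions(h, w, strides=(4, 8, 16, 32)):
    # each ResNet stage downsamples the input by its cumulative stride
    return [(h // s, w // s) for s in strides]

# the 384 x 288 input yields exactly the four resolutions in the flow above
levels = pyramid_resolutions(384, 288)
```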

The RefineNet code is as follows:

def create_refine_net(blocks, is_training, trainable=True):
    initializer = tf.contrib.layers.xavier_initializer()
    bottleneck = resnet_v1.bottleneck
    refine_fms = []
    for i, block in enumerate(blocks):
        mid_fm = block
        with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
            for j in range(i):
                mid_fm = bottleneck(mid_fm, 256, 128, stride=1, scope='res{}/refine_conv{}'.format(2+i, j)) # no projection
        mid_fm = tf.image.resize_bilinear(mid_fm, (cfg.output_shape[0], cfg.output_shape[1]),
            name='upsample_conv/res{}'.format(2+i))
        refine_fms.append(mid_fm)
    refine_fm = tf.concat(refine_fms, axis=3)
    with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
        refine_fm = bottleneck(refine_fm, 256, 128, stride=1, scope='final_bottleneck')
        res = slim.conv2d(refine_fm, cfg.nr_skeleton, [3, 3],
            trainable=trainable, weights_initializer=initializer,
            padding='SAME', activation_fn=None,
            scope='refine_out')
    return res
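The concat step is why final_bottleneck sees 1024 channels: each of the four levels contributes a 256-channel map at 96 x 72, and level i first passes through i bottleneck blocks. A shapes-only sketch of that plan (pure Python, not part of the repo):

```python
def refine_net_plan(num_levels=4, channels=256, output_shape=(96, 72)):
    """Per-level bottleneck counts and the channel total after concat."""
    plan = []
    for i in range(num_levels):
        # level i passes through i bottleneck blocks, then is resized
        resized_shape = (output_shape[0], output_shape[1], channels)
        plan.append((i, i, resized_shape))          # (level, n_bottlenecks, shape)
    concat_channels = num_levels * channels         # 4 * 256 = 1024
    return plan, concat_channels

plan, concat_channels = refine_net_plan()
```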

Following the code, the data flow is as follows:

Each GlobalNet feature map passes through a number of bottleneck blocks equal to its level index, then is bilinearly resized to 96 x 72:
- lateral/res2/Relu: 96 x 72 x 256, then upsample_conv/res2: 96 x 72 x 256 (no bottlenecks)
- lateral/res3/Relu: 48 x 36 x 256, then res3/refine_conv0/Relu: 48 x 36 x 256, then upsample_conv/res3: 96 x 72 x 256
- lateral/res4/Relu: 24 x 18 x 256, then res4/refine_conv0/Relu and res4/refine_conv1/Relu: 24 x 18 x 256, then upsample_conv/res4: 96 x 72 x 256
- lateral/res5/Relu: 12 x 9 x 256, then res5/refine_conv0/Relu, res5/refine_conv1/Relu, and res5/refine_conv2/Relu: 12 x 9 x 256, then upsample_conv/res5: 96 x 72 x 256

The four resized maps are concatenated and passed through a final bottleneck and a 3 x 3 output conv:
- concat: 96 x 72 x 1024
- final_bottleneck/Relu: 96 x 72 x 256
- refine_out: 96 x 72 x joint num

The bottleneck here is resnet_v1.bottleneck, the standard ResNet-v1 bottleneck residual block.
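With depth=256 and depth_bottleneck=128 as in the calls above, the bottleneck's channel flow can be sketched as follows. One assumption, consistent with the "no projection" comment in create_refine_net: the shortcut is an identity when the input already has `depth` channels (the 256-channel refine convs), and a 1 x 1 projection otherwise (the 1024-channel input to final_bottleneck).

```python
def bottleneck_channels(depth_in, depth=256, depth_bottleneck=128):
    """Channel flow of a ResNet-v1 bottleneck residual block."""
    path = [depth_bottleneck,   # 1x1 reduce
            depth_bottleneck,   # 3x3
            depth]              # 1x1 expand
    shortcut = 'identity' if depth_in == depth else '1x1 projection'
    return path, shortcut

# refine_conv blocks act on 256-channel maps: identity shortcut ("no projection");
# final_bottleneck acts on the 1024-channel concat: needs a projection shortcut.
refine_path, refine_shortcut = bottleneck_channels(256)
final_path, final_shortcut = bottleneck_channels(1024)
```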

References:
Author's open-source code: https://github.com/chenyilun95/tf-cpn
Megvii Research's walkthrough of the COCO 2017 pose-estimation winning paper
Transcript | Megvii Research's walkthrough of the COCO 2017 pose-estimation winning paper (slides + video)
Other articles:
CPN (Cascaded Pyramid Network for Multi-Person Pose Estimation) pose estimation

Reposted from blog.csdn.net/yeahDeDiQiZhang/article/details/83061284