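Cascaded Pyramid Network (CPN) is a network proposed for the 2017 COCO human keypoint detection challenge, where it won first place.
The network structure of CPN is as follows:
GlobalNet extracts features from four different blocks of ResNet and uses semantic information at different scales to make predictions, while RefineNet mainly localizes the hard keypoints.
The top-level structure in code is as follows: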
# Extract the feature maps from four different blocks of ResNet-101
resnet_fms = resnet101(image, is_train, bn_trainable=True)
# Feed the ResNet feature maps into GlobalNet, which returns the global feature maps
# (the input to RefineNet) and the global outputs (used to compute the loss)
global_fms, global_outs = create_global_net(resnet_fms, is_train)
# Refine the global feature maps into the final keypoint heatmaps
refine_out = create_refine_net(global_fms, is_train)
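The text says RefineNet targets hard keypoints but does not show how they are selected. The CPN paper describes online hard keypoint mining, in which only the keypoints with the largest loss contribute to the RefineNet loss. Below is a minimal pure-Python sketch of that idea, assuming the per-keypoint losses of one sample have already been computed. The function name `ohkm_loss` and the `top_k` value are illustrative assumptions, not taken from the code above.

```python
def ohkm_loss(per_keypoint_losses, top_k):
    """Average the top_k largest per-keypoint losses of one sample.

    Hypothetical helper for illustration only; it is not the repository's
    implementation.
    """
    if not 0 < top_k <= len(per_keypoint_losses):
        raise ValueError("top_k must be between 1 and the number of keypoints")
    # Keep only the hardest keypoints, i.e. those with the largest loss
    hardest = sorted(per_keypoint_losses, reverse=True)[:top_k]
    return sum(hardest) / top_k


# Six made-up per-keypoint losses; only the three largest are averaged
losses = [0.10, 0.90, 0.05, 0.70, 0.20, 0.60]
hard_loss = ohkm_loss(losses, top_k=3)
print(hard_loss)
```

Easy keypoints (small loss) are ignored here, so the gradient concentrates on the keypoints the network currently gets wrong.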
The GlobalNet code is as follows:
def create_global_net(blocks, is_training, trainable=True):
    global_fms = []
    global_outs = []
    last_fm = None
    initializer = tf.contrib.layers.xavier_initializer()
    # Walk the ResNet blocks from the coarsest (res5) to the finest (res2)
    for i, block in enumerate(reversed(blocks)):
        # 1x1 lateral convolution: project the block to 256 channels
        with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
            lateral = slim.conv2d(block, 256, [1, 1],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=tf.nn.relu,
                scope='lateral/res{}'.format(5-i))

        if last_fm is not None:
            # Upsample the coarser level to this resolution, then merge by addition
            sz = tf.shape(lateral)
            upsample = tf.image.resize_bilinear(last_fm, (sz[1], sz[2]),
                name='upsample/res{}'.format(5-i))
            upsample = slim.conv2d(upsample, 256, [1, 1],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=None,
                scope='merge/res{}'.format(5-i))
            last_fm = upsample + lateral
        else:
            last_fm = lateral

        # Per-level prediction head: 1x1 conv, then 3x3 conv to one heatmap per keypoint
        with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
            tmp = slim.conv2d(last_fm, 256, [1, 1],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=tf.nn.relu,
                scope='tmp/res{}'.format(5-i))
            out = slim.conv2d(tmp, cfg.nr_skeleton, [3, 3],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=None,
                scope='pyramid/res{}'.format(5-i))
        global_fms.append(last_fm)
        # Resize every level's heatmaps to the common output resolution
        global_outs.append(tf.image.resize_bilinear(out, (cfg.output_shape[0], cfg.output_shape[1])))
    # Restore fine-to-coarse order (res2 first)
    global_fms.reverse()
    global_outs.reverse()
    return global_fms, global_outs
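The traversal order in `create_global_net` is easy to misread, so here is a small pure-Python sketch that replays the loop with names instead of tensors. It assumes `blocks` is ordered from res2 to res5, which the `5-i` scope names suggest. The labels `C2` to `C5` and the helper `global_net_plan` are hypothetical and exist only for illustration.

```python
def global_net_plan(block_names):
    """Replay the loop order of create_global_net using names, not tensors."""
    plan = []
    last = None
    for i, name in enumerate(reversed(block_names)):
        scope = "res{}".format(5 - i)
        plan.append({
            "input": name,
            "scope": scope,
            # None for the coarsest level: it has no upsample branch
            "upsampled_from": last,
        })
        last = scope
    # Mirrors global_fms.reverse(): the finest level comes first
    plan.reverse()
    return plan


plan = global_net_plan(["C2", "C3", "C4", "C5"])
for step in plan:
    print(step)
```

The coarsest level (res5) is processed first and receives no top-down input. Every finer level adds its lateral feature to the upsampled result of the level above it.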
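Following the code, and taking an input image of size 384x288 as an example, the flow chart is as follows (batch size omitted):
The diamonds in the flow chart mark the two outputs: the global feature maps and the global outputs.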
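The spatial sizes for the 384x288 example can be worked out by hand. The sketch below assumes the four ResNet blocks have the usual strides of 4, 8, 16 and 32 relative to the input; this is an assumption, not something stated in the code above. The 256 channels come from the lateral convolutions in `create_global_net`. The helper `pyramid_shapes` is hypothetical.

```python
def pyramid_shapes(height, width, strides=(4, 8, 16, 32), channels=256):
    """Return (H, W, C) of each global feature map, finest level first.

    The strides are an assumption about the ResNet backbone.
    """
    return [(height // s, width // s, channels) for s in strides]


shapes = pyramid_shapes(384, 288)
for level, shape in zip(("res2", "res3", "res4", "res5"), shapes):
    print(level, shape)
# res2 (96, 72, 256)
# res3 (48, 36, 256)
# res4 (24, 18, 256)
# res5 (12, 9, 256)
```

The global outputs are a different matter: each one is resized to `cfg.output_shape` and has `cfg.nr_skeleton` channels, so all four share the same shape regardless of level.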
The RefineNet code is as follows:
def create_refine_net(blocks, is_training, trainable=True):
    initializer = tf.contrib.layers.xavier_initializer()
    bottleneck = resnet_v1.bottleneck
    refine_fms = []
    for i, block in enumerate(blocks):
        mid_fm = block
        # Level i passes through i bottleneck blocks (none for the first level)
        with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
            for j in range(i):
                mid_fm = bottleneck(mid_fm, 256, 128, stride=1, scope='res{}/refine_conv{}'.format(2+i, j)) # no projection
        # Upsample every level to the common output resolution
        mid_fm = tf.image.resize_bilinear(mid_fm, (cfg.output_shape[0], cfg.output_shape[1]),
            name='upsample_conv/res{}'.format(2+i))
        refine_fms.append(mid_fm)
    # Concatenate all levels along the last (channel) axis
    refine_fm = tf.concat(refine_fms, axis=3)
    with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
        refine_fm = bottleneck(refine_fm, 256, 128, stride=1, scope='final_bottleneck')
    # Final 3x3 convolution: one heatmap per keypoint
    res = slim.conv2d(refine_fm, cfg.nr_skeleton, [3, 3],
        trainable=trainable, weights_initializer=initializer,
        padding='SAME', activation_fn=None,
        scope='refine_out')
    return res
Following the code, the flow chart is as follows:
The bottleneck here is resnet_v1.bottleneck.
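To make the per-level structure of `create_refine_net` concrete, this pure-Python sketch lists the bottleneck scopes each level passes through and the channel count after concatenation. It assumes four input levels of 256 channels each, matching the GlobalNet feature maps. The helper `refine_net_plan` is hypothetical.

```python
def refine_net_plan(num_levels=4, channels_per_level=256):
    """List the bottleneck scopes per level and the concatenated channel count."""
    plan = []
    for i in range(num_levels):
        # Level i (scope res{2+i}) passes through i bottleneck blocks
        scopes = ["res{}/refine_conv{}".format(2 + i, j) for j in range(i)]
        plan.append({"level": "res{}".format(2 + i), "bottlenecks": scopes})
    # Each bottleneck keeps 256 channels, so concatenation stacks them
    concat_channels = num_levels * channels_per_level
    return plan, concat_channels


plan, concat_channels = refine_net_plan()
for entry in plan:
    print(entry["level"], len(entry["bottlenecks"]), entry["bottlenecks"])
print("channels after concat:", concat_channels)
```

Coarser levels receive more bottleneck blocks before upsampling. The concatenated tensor has 1024 channels under these assumptions, and `final_bottleneck` reduces it back to 256 before the output convolution.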
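References:
Authors' open-source code: https://github.com/chenyilun95/tf-cpn
Megvii Research explains the COCO 2017 human pose estimation champion paper
Transcript | Megvii Research explains the COCO 2017 human pose estimation champion paper (slides + video)
Other articles:
CPN (Cascaded Pyramid Network for Multi-Person Pose Estimation) pose estimation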