CPN (Cascaded Pyramid Network for Multi-Person Pose Estimation): Pose Estimation

This post is my personal reading of the paper "Cascaded Pyramid Network for Multi-Person Pose Estimation", together with a detailed look at the code (TensorFlow version).


Preface

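Current multi-person keypoint detection algorithms fall into two broad categories:

Bottom-up methods: predict all keypoints directly, then decide which person each keypoint belongs to. OpenPose is a representative example.
Top-down methods: a two-stage approach. First detect the people, then run a single-person pose estimator (SPPE) on each detected box to locate that person's keypoints, and finally map the results back to the original image.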

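Differences between the two approaches:

Top-down methods depend on the person detector and are easily affected by its results. For example, when two people overlap heavily, the detector may return only one box, which lowers accuracy, as shown in Figure 1.
The runtime of top-down methods grows with the number of people in the image, since each additional person costs one more SPPE pass.
Bottom-up methods have neither of these two problems.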

Figure 1. Overlapping people can lead the detector to return a single box, which hurts keypoint detection.
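
The top-down flow described above can be sketched as a loop over detected boxes. This is only an illustration: `detect_people` and `sppe` are hypothetical stand-ins (not functions from the CPN repository), and the SPPE is assumed to return keypoints relative to the top-left corner of its box.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]   # (x, y, w, h) in image coordinates
Point = Tuple[float, float]


def top_down_pose(image,
                  detect_people: Callable[[object], List[Box]],
                  sppe: Callable[[object, Box], List[Point]]) -> List[List[Point]]:
    """Run a person detector, then one SPPE call per detected box.

    Both callables are hypothetical stand-ins; `sppe` is assumed to return
    keypoints relative to the top-left corner of the box.
    """
    poses = []
    for box in detect_people(image):          # cost grows with the number of people
        x0, y0, _, _ = box
        local = sppe(image, box)
        # Map box-relative keypoints back into original image coordinates.
        poses.append([(x0 + px, y0 + py) for px, py in local])
    return poses


# Toy usage with fake components: two boxes, one keypoint at each box centre.
fake_detector = lambda img: [(10.0, 20.0, 40.0, 80.0), (100.0, 30.0, 50.0, 90.0)]
fake_sppe = lambda img, box: [(box[2] / 2.0, box[3] / 2.0)]
poses = top_down_pose(None, fake_detector, fake_sppe)
print(poses)
```

If the detector merges two overlapping people into one box, the loop simply runs once instead of twice, which is the failure mode shown in Figure 1.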

Summary

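The keypoint detector proposed in the paper is an SPPE. Its network structure is designed to handle invisible keypoints, overlapping keypoints, and keypoints that are blurred or hard to recognize, and to cope with complex backgrounds. The network has two parts:

GlobalNet: an FPN-style network that detects the relatively easy keypoints, such as the eyes and hands; it does not do well on harder ones such as invisible keypoints.
RefineNet: mainly targets the keypoints that are very hard to distinguish. Its input is the feature maps from several levels of GlobalNet, and it mines the hard keypoints online (how this is done online is explained later).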

The method won the COCO 2017 keypoint challenge, with an average precision (AP) of 73.0 on COCO test-dev and 72.1 on COCO test-challenge, a 19% relative improvement over the 60.5 of the 2016 winner.
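
The 19% figure is a relative gain, not an absolute difference in AP points. A quick check using only the two test-challenge numbers quoted above:

```python
ap_2016 = 60.5   # 2016 winner, as quoted above
ap_cpn = 72.1    # CPN on COCO test-challenge, as quoted above

absolute_gain = ap_cpn - ap_2016
relative_gain = absolute_gain / ap_2016
print(f"absolute: {absolute_gain:.1f} AP, relative: {relative_gain:.1%}")
# prints: absolute: 11.6 AP, relative: 19.2%
```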

Network architecture

Figure 2. SPPE network architecture

GlobalNet

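Figure 3. GlobalNet
First, the input to GlobalNet is not an image but the feature maps extracted by the four ResNet blocks, which the paper denotes C2, C3, C4 and C5. C2 and C3 come from shallower layers, so they have high spatial resolution and localize well, but carry little semantic information. C4 and C5, by contrast, carry more semantic information but have low spatial resolution, which is not enough for precise localization. GlobalNet therefore uses an FPN structure to combine the information from all levels when predicting the keypoint heatmaps.
Note: GlobalNet differs slightly from FPN. After upsampling and before the element-wise sum of two levels, it applies one more 1×1 convolution.
Operations applied at each level fed into GlobalNet: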

net1 (from resnet) -> 1×1 conv -> upsampling -> 1×1 conv -> elem-sum (with the next level) -> predict -> L2 loss
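
To make the shape flow of this top-down merge concrete, here is a small NumPy sketch. It is not the TensorFlow implementation: the weights are random, upsampling is nearest-neighbour instead of bilinear, batch norm is omitted, and the helper names `conv1x1` and `upsample2x` are made up for the illustration. The feature map sizes assume a 256×192 input crop with the usual ResNet strides of 4, 8, 16 and 32.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv1x1(x, out_ch):
    """1x1 convolution on an (H, W, C) map with random weights (illustration only)."""
    w = rng.standard_normal((x.shape[-1], out_ch)) * 0.01
    return x @ w


def upsample2x(x):
    """Nearest-neighbour 2x upsampling; the TensorFlow code uses bilinear resizing."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


# C2..C5 for a 256x192 crop: strides 4, 8, 16, 32 and ResNet bottleneck widths.
shapes = [(64, 48, 256), (32, 24, 512), (16, 12, 1024), (8, 6, 2048)]
blocks = [rng.standard_normal(s) for s in shapes]

merged = []
last = None
for block in reversed(blocks):                           # start from C5, the coarsest level
    lateral = np.maximum(conv1x1(block, 256), 0.0)       # 1x1 conv + ReLU
    if last is not None:
        last = conv1x1(upsample2x(last), 256) + lateral  # upsample, 1x1 conv, add
    else:
        last = lateral
    merged.append(last)
merged.reverse()                                         # back to C2..C5 order

print([m.shape for m in merged])
# prints: [(64, 48, 256), (32, 24, 256), (16, 12, 256), (8, 6, 256)]
```

Every level ends up with 256 channels while keeping its own spatial resolution, which is what lets the coarse, semantic levels be added into the fine ones.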

Code walkthrough:

def create_global_net(blocks, is_training, trainable=True):
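    '''
    blocks: C2, C3, C4, C5
    '''
    global_fms = []   # merged feature map of each level; these are the input to RefineNet
    global_outs = []  # per-level heatmap predictions; used for the GlobalNet loss
    last_fm = None    # set on the first (coarsest) level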
    initializer = tf.contrib.layers.xavier_initializer()
    for i, block in enumerate(reversed(blocks)):
        with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
            lateral = slim.conv2d(block, 256, [1, 1],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=tf.nn.relu,
                scope='lateral/res{}'.format(5-i))
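        # If last_fm is not None, bilinearly upsample the previous (coarser) level to the current size, apply a 1*1 conv, and add it to the current lateral feature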
        if last_fm is not None:
            sz = tf.shape(lateral)
            upsample = tf.image.resize_bilinear(last_fm, (sz[1], sz[2]),
                name='upsample/res{}'.format(5-i))
            upsample = slim.conv2d(upsample, 256, [1, 1],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=None,
                scope='merge/res{}'.format(5-i))
            last_fm = upsample + lateral #两层相加
        else:
            last_fm = lateral  # coarsest level (C5), strongest semantics: no upsampling, use it as-is

        # After merging, each level's last_fm goes through a 1*1 conv and then a 3*3 conv that outputs 17 feature maps (one heatmap per keypoint), used as the prediction during training
        with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
            tmp = slim.conv2d(last_fm, 256, [1, 1],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=tf.nn.relu,
                scope='tmp/res{}'.format(5-i))
            out = slim.conv2d(tmp, cfg.nr_skeleton, [3, 3],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=None,
                scope='pyramid/res{}'.format(5-i))
        global_fms.append(last_fm)
        global_outs.append(tf.image.resize_bilinear(out, (cfg.output_shape[0], cfg.output_shape[1])))
    global_fms.reverse()
    global_outs.reverse()
    return global_fms, global_outs

RefineNet

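RefineNet localizes the hard keypoints. Which keypoints count as hard is decided during training, from the size of the per-keypoint loss (in the code further down, ohkm is applied to the RefineNet loss), not by hand.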

def create_refine_net(blocks, is_training, trainable=True):
    # blocks here are the global_fms returned by create_global_net
    initializer = tf.contrib.layers.xavier_initializer()
    bottleneck = resnet_v1.bottleneck
    refine_fms = []
    for i, block in enumerate(blocks):
        mid_fm = block
        with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
            for j in range(i):
                mid_fm = bottleneck(mid_fm, 256, 128, stride=1, scope='res{}/refine_conv{}'.format(2+i, j)) # no projection
        mid_fm = tf.image.resize_bilinear(mid_fm, (cfg.output_shape[0], cfg.output_shape[1]),
            name='upsample_conv/res{}'.format(2+i))
        refine_fms.append(mid_fm)
    refine_fm = tf.concat(refine_fms, axis=3)  # note: the levels are concatenated here, not summed
    with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
        refine_fm = bottleneck(refine_fm, 256, 128, stride=1, scope='final_bottleneck')
        res = slim.conv2d(refine_fm, cfg.nr_skeleton, [3, 3],
            trainable=trainable, weights_initializer=initializer,
            padding='SAME', activation_fn=None,
            scope='refine_out')
    return res

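Open question: networks often combine features from different levels, and there are two ways to do it. GlobalNet here combines levels with elem_sum, i.e. element-wise addition, similar to ResNet; RefineNet instead concatenates them, similar to Inception. What is the difference between the two, and when should one use concat rather than elem_sum?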
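
One observation that bears on this question, offered as a partial answer only: an element-wise sum is a special case of concatenation followed by a 1×1 convolution whose weights are fixed to two stacked identity matrices. Concat therefore lets the following layer learn how to mix the levels, at the cost of more channels and parameters, while sum is cheaper but requires identical shapes. The NumPy check below uses made-up shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 3, 8))   # two (H, W, C) feature maps of the same shape
b = rng.standard_normal((4, 3, 8))

summed = a + b                                   # element-wise sum: C channels out
concat = np.concatenate([a, b], axis=-1)         # concat: 2C channels out

# A 1x1 conv over the concatenation with weights [I; I] reproduces the sum.
w = np.vstack([np.eye(8), np.eye(8)])            # shape (2C, C)
recovered = concat @ w

print(concat.shape, np.allclose(recovered, summed))
```

This does not say which choice works better in a given network; that remains an empirical matter.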

Overall network architecture:
resnet101 + create_global_net + create_refine_net

resnet_fms = resnet101(image, is_train, bn_trainable=True)
global_fms, global_outs = create_global_net(resnet_fms, is_train)
#global_outs: 17*64*48
refine_out = create_refine_net(global_fms, is_train)
#refine_outs: 17*
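
The shapes that pass between the three stages can be tabulated with a few lines of plain Python. The input and heatmap sizes are the 256×192 and 64×48 values mentioned in this post; the strides are the standard ResNet stage strides, and the 256-channel width comes from the `bottleneck(mid_fm, 256, 128, ...)` and lateral conv calls above. Treat it as bookkeeping, not as output captured from the real graph.

```python
data_shape = (256, 192)            # network input (height, width), from cfg.data_shape
output_shape = (64, 48)            # heatmap size, from cfg.output_shape
strides = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}   # standard ResNet stage strides

for i, (name, s) in enumerate(strides.items()):
    h, w = data_shape[0] // s, data_shape[1] // s
    # create_refine_net applies i bottlenecks to level i, then resizes to output_shape.
    print(f"{name}: global_fm {h}x{w}x256, {i} bottleneck(s), resized to "
          f"{output_shape[0]}x{output_shape[1]}x256")

concat_channels = 256 * len(strides)   # four 256-channel maps concatenated on axis 3
print("channels after concat:", concat_channels)
```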

Choosing, by per-keypoint loss, which keypoints are mined as hard keypoints in RefineNet:

def ohkm(loss, top_k):
    ohkm_loss = 0.
    for i in range(cfg.batch_size):
        sub_loss = loss[i]
        topk_val, topk_idx = tf.nn.top_k(sub_loss, k=top_k, sorted=False, name='ohkm{}'.format(i))
        tmp_loss = tf.gather(sub_loss, topk_idx, name='ohkm_loss{}'.format(i)) # can be ignore ???
        ohkm_loss += tf.reduce_sum(tmp_loss) / top_k
    ohkm_loss /= cfg.batch_size
    return ohkm_loss



refine_loss = tf.reduce_mean(tf.square(refine_out - label7), (1,2)) * tf.to_float((tf.greater(valids, 0.1)))
refine_loss = ohkm(refine_loss, 8)  # mine the 8 keypoints with the largest loss as hard keypoints; the count is configurable
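
The same online hard keypoint mining can be written in a few lines of NumPy, which makes the arithmetic easy to check by hand. This is a sketch of the computation only (no gradients), with made-up loss values and the hypothetical name `ohkm_numpy`.

```python
import numpy as np


def ohkm_numpy(loss, top_k):
    """loss: (batch, num_keypoints) per-keypoint losses; returns a scalar."""
    # Sort each row and keep the top_k largest losses (the "hard" keypoints).
    hardest = np.sort(loss, axis=1)[:, -top_k:]
    # Mean over the kept keypoints, then mean over the batch.
    return hardest.mean(axis=1).mean()


loss = np.array([[0.1, 0.9, 0.3, 0.5],
                 [0.2, 0.2, 0.8, 0.4]])
print(ohkm_numpy(loss, top_k=2))
```

For the first sample the two largest losses are 0.9 and 0.5, for the second 0.8 and 0.4, so the result is the mean of 0.7 and 0.6. Easy keypoints contribute nothing to this loss.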

Loading the training data (network.py)

    def make_data(self):
        from COCOAllJoints import COCOJoints
        from dataset import Preprocessing
        d = COCOJoints() 
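        # Each record loaded from d looks like:
        #humanData = dict(aid = aid,joints=joints, imgpath=imgname, headRect=rect, bbox=bbox, imgid = ann['image_id'], segmentation = ann['segmentation'])
        # Each record is a dict with 7 fields, including the image id and the keypoint coordinates. To train on your own dataset, build records of the same form.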
        train_data, _ = d.load_data(cfg.min_kps)

        from tfflat.data_provider import DataFromList, MultiProcessMapDataZMQ, BatchData, MapData
        dp = DataFromList(train_data)  # data flow over the list of records
        if cfg.dpflow_enable:  #True
            dp = MultiProcessMapDataZMQ(dp, cfg.nr_dpflows, Preprocessing)
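            ## Preprocessing is passed in as the map function: it crops each person to a standard box, transforms the keypoint coordinates, and generates the keypoint heatmap labels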
        else:
            dp = MapData(dp, Preprocessing)
        dp = BatchData(dp, cfg.batch_size // cfg.nr_aug) ##nr_aug = 4
        dp.reset_state()
        dataiter = dp.get_data()
        return dataiter
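
The `cfg.batch_size // cfg.nr_aug` in `make_data` presumably exists because each source record is expanded into `nr_aug` augmented samples by `Preprocessing`, so fewer records are needed to fill a batch. A plain-Python sketch of that arithmetic, with a hypothetical `batches` helper and a made-up batch size of 8:

```python
def batches(samples, preprocess, batch_size, nr_aug):
    """Group preprocessed samples so that each batch holds batch_size images.

    `preprocess` is assumed to return a list of nr_aug augmented items per
    source sample, as Preprocessing does in training mode.
    """
    per_batch = batch_size // nr_aug       # source samples needed per batch
    batch = []
    for i, sample in enumerate(samples, start=1):
        batch.extend(preprocess(sample))
        if i % per_batch == 0:
            yield batch
            batch = []


# Toy usage: batch_size=8 is a made-up value; nr_aug=4 follows the comment above.
fake_preprocess = lambda s: [f"{s}-aug{k}" for k in range(4)]
out = list(batches(range(6), fake_preprocess, batch_size=8, nr_aug=4))
print([len(b) for b in out])
# prints: [8, 8, 8]
```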

Image preprocessing and label generation (dataset.py)

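When cropping the standard box, the image is first padded with the image mean and then cropped. The cropped image is the standard box, and the keypoint coordinates are transformed accordingly.
pad -> crop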
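
Padding first means a crop window that extends past the image border still yields a full-size patch, with the overhang filled by the mean colour. A NumPy-only sketch of the idea follows; the image, the mean values and the box are made up, and the real code uses `cv2.copyMakeBorder` rather than manual assignment.

```python
import numpy as np

img = np.full((100, 60, 3), 200, dtype=np.uint8)    # toy 100x60 image
mean = np.array([103, 116, 123], dtype=np.uint8)    # made-up per-channel mean

add = max(img.shape[0], img.shape[1])               # pad by the longer side
padded = np.empty((img.shape[0] + 2 * add, img.shape[1] + 2 * add, 3), dtype=np.uint8)
padded[:] = mean                                    # constant mean-colour border
padded[add:add + img.shape[0], add:add + img.shape[1]] = img

# A box that sticks out past the left edge of the original image.
x0, y0, w, h = -20, 10, 50, 80
x0p, y0p = x0 + add, y0 + add                       # shift into padded coordinates
crop = padded[y0p:y0p + h, x0p:x0p + w]

print(crop.shape)                                   # (80, 50, 3)
```

The box and the joints are shifted by the same `add` offset, which is what `bbox[:2] += add` and `joints[:, :2] += add` do in `Preprocessing`.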

def Preprocessing(d, stage='train'):
    height, width = cfg.data_shape #256 192
    imgs = []
    labels = []
    valids = []
    if cfg.use_seg:
        segms = []

    vis = False
    img = cv2.imread(os.path.join(cfg.img_path, d['imgpath']))  # read the image
    #hack(multiprocessing data provider)
    while img is None:
        print('read none image')
        time.sleep(np.random.rand() * 5)
        img = cv2.imread(os.path.join(cfg.img_path, d['imgpath']))
    add = max(img.shape[0], img.shape[1])
    bimg = cv2.copyMakeBorder(img, add, add, add, add, borderType=cv2.BORDER_CONSTANT,
                              value=cfg.pixel_means.reshape(-1))  # pad with the mean pixel value

    bbox = np.array(d['bbox']).reshape(4, ).astype(np.float32)
    bbox[:2] += add
    if 'joints' in d:
        joints = np.array(d['joints']).reshape(cfg.nr_skeleton, 3).astype(np.float32)
        joints[:, :2] += add
        inds = np.where(joints[:, -1] == 0)
        joints[inds, :2] = -1000000

    crop_width = bbox[2] * (1 + cfg.imgExtXBorder * 2)
    crop_height = bbox[3] * (1 + cfg.imgExtYBorder * 2)
    objcenter = np.array([bbox[0] + bbox[2] / 2., bbox[1] + bbox[3] / 2.])

    if stage == 'train':
        crop_width = crop_width * (1 + 0.25)
        crop_height = crop_height * (1 + 0.25)

    if crop_height / height > crop_width / width:
        crop_size = crop_height
        min_shape = height
    else:
        crop_size = crop_width
        min_shape = width

    # Clamp crop_size so that the crop window stays inside the padded image on all four sides
    crop_size = min(crop_size, objcenter[0] / width * min_shape * 2. - 1.)                     # left
    crop_size = min(crop_size, (bimg.shape[1] - objcenter[0]) / width * min_shape * 2. - 1)    # right
    crop_size = min(crop_size, objcenter[1] / height * min_shape * 2. - 1.)                    # top
    crop_size = min(crop_size, (bimg.shape[0] - objcenter[1]) / height * min_shape * 2. - 1)   # bottom

    min_x = int(objcenter[0] - crop_size / 2. / min_shape * width)   ##
    max_x = int(objcenter[0] + crop_size / 2. / min_shape * width)   ##
    min_y = int(objcenter[1] - crop_size / 2. / min_shape * height)  ##
    max_y = int(objcenter[1] + crop_size / 2. / min_shape * height)  ##

    x_ratio = float(width) / (max_x - min_x)
    y_ratio = float(height) / (max_y - min_y)

    if 'joints' in d:
        joints[:, 0] = joints[:, 0] - min_x
        joints[:, 1] = joints[:, 1] - min_y  # coordinates relative to the cropped standard box

        joints[:, 0] *= x_ratio
        joints[:, 1] *= y_ratio    # scale to the 256*192 network input
        label = joints[:, :2].copy()
        valid = joints[:, 2].copy()

    img = cv2.resize(bimg[min_y:max_y, min_x:max_x, :], (width, height))
    # img is now the cropped standard box, and the keypoints have been transformed to match

    if stage != 'train':
        # crop window in original (unpadded) image coordinates; returned in the non-training branch below
        details = np.asarray([min_x - add, min_y - add, max_x - add, max_y - add])
    # if cfg.use_seg is True and 'segmentation' in d:
    #     seg = get_seg(ori_img.shape[0], ori_img.shape[1], d['segmentation'])
    #     add = max(seg.shape[0], seg.shape[1])
    #     bimg = cv2.copyMakeBorder(seg, add, add, add, add, borderType=cv2.BORDER_CONSTANT, value=(0, 0, 0))
    #     seg = cv2.resize(bimg[min_y:max_y, min_x:max_x], (width, height))
    #     segms.append(seg)

    # if vis:
    #     tmpimg = img.copy()
    #     from utils.visualize import draw_skeleton
    #     draw_skeleton(tmpimg, label.astype(int))
    #     cv2.imwrite('vis.jpg', tmpimg)
    #     from IPython import embed; embed()

    img = img - cfg.pixel_means  # subtract the image mean
    if cfg.pixel_norm:  #True
        img = img / 255.
    img = img.transpose(2, 0, 1)  # HWC -> CHW
    imgs.append(img)
    if 'joints' in d:
        labels.append(label.reshape(-1))
        valids.append(valid.reshape(-1))

    if stage == 'train':  # augment the images and convert the labels to heatmaps
        imgs, labels, valids = data_augmentation(imgs, labels, valids)
        heatmaps15 = joints_heatmap_gen(imgs, labels, cfg.output_shape, cfg.data_shape, return_valid=False,
                                        gaussian_kernel=cfg.gk15)
        heatmaps11 = joints_heatmap_gen(imgs, labels, cfg.output_shape, cfg.data_shape, return_valid=False,
                                        gaussian_kernel=cfg.gk11)
        heatmaps9 = joints_heatmap_gen(imgs, labels, cfg.output_shape, cfg.data_shape, return_valid=False,
                                       gaussian_kernel=cfg.gk9)
        heatmaps7 = joints_heatmap_gen(imgs, labels, cfg.output_shape, cfg.data_shape, return_valid=False,
                                       gaussian_kernel=cfg.gk7)

        return [imgs.astype(np.float32).transpose(0, 2, 3, 1),
                heatmaps15.astype(np.float32).transpose(0, 2, 3, 1),
                heatmaps11.astype(np.float32).transpose(0, 2, 3, 1),
                heatmaps9.astype(np.float32).transpose(0, 2, 3, 1),
                heatmaps7.astype(np.float32).transpose(0, 2, 3, 1),
                valids.astype(np.float32)]
    else:
        return [np.asarray(imgs).astype(np.float32), details]
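
The keypoint bookkeeping in `Preprocessing` is a shift followed by a scale, and it is invertible. The sketch below isolates that mapping and its inverse; the helper names `to_crop` and `from_crop` and the crop window are made up, and using the inverse to project predictions back at test time is my assumption about how the returned `details` would be used.

```python
import numpy as np


def to_crop(joints_xy, min_x, min_y, max_x, max_y, width=192, height=256):
    """Map padded-image coordinates into the resized (height, width) crop."""
    scale = np.array([width / (max_x - min_x), height / (max_y - min_y)])
    return (joints_xy - np.array([min_x, min_y])) * scale


def from_crop(crop_xy, min_x, min_y, max_x, max_y, width=192, height=256):
    """Inverse mapping, e.g. for projecting predictions back at test time."""
    scale = np.array([(max_x - min_x) / width, (max_y - min_y) / height])
    return crop_xy * scale + np.array([min_x, min_y])


window = dict(min_x=100, min_y=50, max_x=196, max_y=178)   # a 96x128 crop window
joints = np.array([[148.0, 114.0], [100.0, 50.0]])
in_crop = to_crop(joints, **window)
print(in_crop)
```

With a 96×128 window both ratios are 2, so the window centre (148, 114) lands at the crop centre (96, 128) and the window corner lands at (0, 0).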

Generating the label heatmaps

def joints_heatmap_gen(data, label, tar_size=cfg.output_shape, ori_size=cfg.data_shape, points=cfg.nr_skeleton,
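                       return_valid=False, gaussian_kernel=cfg.gaussain_kernel):  # cfg.output_shape = (64, 48); labels go from (256, 192) to (64, 48)
    # Note: label holds keypoint coordinates on the (256, 192) crop; they must be converted to coordinates on (64, 48)
    # A Gaussian blur is applied at each keypoint location; several kernel sizes are used for training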
    if return_valid: ##False
        valid = np.ones((len(data), points), dtype=np.float32)

    ret = np.zeros((len(data), points, tar_size[0], tar_size[1]), dtype='float32')
    for i in range(len(ret)):
        for j in range(points):  ##points = 17
            if label[i][j << 1] < 0 or label[i][j << 1 | 1] < 0:  # label is flat [x0, y0, x1, y1, ...]: j << 1 == 2*j is x, j << 1 | 1 == 2*j + 1 is y
                continue
            label[i][j << 1 | 1] = min(label[i][j << 1 | 1], ori_size[0] - 1)
            label[i][j << 1] = min(label[i][j << 1], ori_size[1] - 1)
            ret[i][j][int(label[i][j << 1 | 1] * tar_size[0] / ori_size[0])][
                int(label[i][j << 1] * tar_size[1] / ori_size[1])] = 1

    for i in range(len(ret)):
        for j in range(points):
            ret[i, j] = cv2.GaussianBlur(ret[i, j], gaussian_kernel, 0)
    for i in range(len(ret)):
        for j in range(cfg.nr_skeleton):
            am = np.amax(ret[i][j])
            if am <= 1e-8:
                if return_valid: #False
                    valid[i][j] = 0.
                continue
            ret[i][j] /= am / 255  # rescale so that the peak value is 255
    if return_valid:
        return ret, valid
    else:
        return ret
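
A single-keypoint version of the same idea, without OpenCV, may make the steps easier to follow: index the flat label, place a one-hot peak on the 64×48 grid, blur it, and rescale the peak to 255. The separable blur with `np.convolve` and the `sigma=1.5` value are my substitutions for `cv2.GaussianBlur`, so the numbers will not match the original exactly.

```python
import numpy as np


def gaussian_heatmap(x, y, ori_size=(256, 192), tar_size=(64, 48), ksize=7, sigma=1.5):
    """One keypoint at (x, y) in crop coordinates -> (tar_h, tar_w) heatmap."""
    heat = np.zeros(tar_size, dtype=np.float32)
    row = int(min(y, ori_size[0] - 1) * tar_size[0] / ori_size[0])
    col = int(min(x, ori_size[1] - 1) * tar_size[1] / ori_size[1])
    heat[row, col] = 1.0                                   # one-hot peak

    # Separable Gaussian blur with a normalised 1-D kernel.
    t = np.arange(ksize) - ksize // 2
    kernel = np.exp(-t ** 2 / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    blur = lambda v: np.convolve(v, kernel, mode="same")
    heat = np.apply_along_axis(blur, 0, heat)
    heat = np.apply_along_axis(blur, 1, heat)

    return heat / heat.max() * 255.0                       # rescale the peak to 255


label = np.array([96.0, 128.0, -1e6, -1e6])   # flat layout: [x0, y0, x1, y1, ...]
j = 0
x, y = label[j << 1], label[j << 1 | 1]       # j << 1 == 2*j, j << 1 | 1 == 2*j + 1
heat = gaussian_heatmap(x, y)
print(heat.shape, np.unravel_index(heat.argmax(), heat.shape))
```

The keypoint at (96, 128) on the 256×192 crop falls on row 32, column 24 of the 64×48 heatmap, since both axes shrink by a factor of 4.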

Data augmentation

crop augmentation
random scales
rotation
flip
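
Of the augmentations listed, the flip is the one where labels need the most care: mirroring the image turns a left joint into a right joint, so the x coordinates must be mirrored and the symmetric joint indices swapped. A minimal sketch with a made-up three-point skeleton follows; `flip_keypoints` is a hypothetical helper, and unlike the code in this post it does not special-case invalid joints.

```python
import numpy as np


def flip_keypoints(xy, width, symmetry):
    """xy: (n, 2) keypoints in pixel coordinates; returns the mirrored copy."""
    flipped = xy.copy()
    flipped[:, 0] = width - 1 - flipped[:, 0]     # mirror x around the image centre
    for left, right in symmetry:                  # swap left/right joint labels
        flipped[[left, right]] = flipped[[right, left]]
    return flipped


# Toy skeleton: index 0 = nose, 1 = left eye, 2 = right eye (hypothetical layout).
xy = np.array([[50.0, 10.0], [60.0, 8.0], [40.0, 8.0]])
print(flip_keypoints(xy, width=101, symmetry=[(1, 2)]))
```

Because this toy face is symmetric about the image centre, the flipped labels equal the originals. Without the index swap, the left and right eye labels would end up exchanged.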

def data_augmentation(trainData, trainLabel, trainValids, segms=None):
    trainSegms = segms
    tremNum = cfg.nr_aug - 1
    gotData = trainData.copy()
    trainData = np.append(trainData, [trainData[0] for i in range(tremNum * len(trainData))], axis=0)
    if trainSegms is not None:
        gotSegm = trainSegms.copy()
        trainSegms = np.append(trainSegms, [trainSegms[0] for i in range(tremNum * len(trainSegms))], axis=0)
    trainLabel = np.append(trainLabel, [trainLabel[0] for i in range(tremNum * len(trainLabel))], axis=0)
    trainValids = np.append(trainValids, [trainValids[0] for i in range(tremNum * len(trainValids))], axis=0)
    counter = len(gotData)
    for lab in range(len(gotData)):
        ori_img = gotData[lab].transpose(1, 2, 0)
        if trainSegms is not None:
            ori_segm = gotSegm[lab].copy()
        annot = trainLabel[lab].copy()
        annot_valid = trainValids[lab].copy()
        height, width = ori_img.shape[0], ori_img.shape[1]
        center = (width / 2., height / 2.)
        n = cfg.nr_skeleton

        # affrat = random.uniform(0.75, 1.25)
        affrat = random.uniform(0.7, 1.35)
        halfl_w = min(width - center[0], (width - center[0]) / 1.25 * affrat)
        halfl_h = min(height - center[1], (height - center[1]) / 1.25 * affrat)
        # img = cv2.resize(ori_img[int(center[0] - halfl_w) : int(center[0] + halfl_w + 1), int(center[1] - halfl_h) : int(center[1] + halfl_h + 1)], (width, height))
        img = cv2.resize(ori_img[int(center[1] - halfl_h): int(center[1] + halfl_h + 1),
                         int(center[0] - halfl_w): int(center[0] + halfl_w + 1)], (width, height))
        if trainSegms is not None:
            segm = cv2.resize(ori_segm[int(center[1] - halfl_h): int(center[1] + halfl_h + 1),
                              int(center[0] - halfl_w): int(center[0] + halfl_w + 1)], (width, height))
        for i in range(n):
            annot[i << 1] = (annot[i << 1] - center[0]) / halfl_w * (width - center[0]) + center[0]
            annot[i << 1 | 1] = (annot[i << 1 | 1] - center[1]) / halfl_h * (height - center[1]) + center[1]
            annot_valid[i] *= (
            (annot[i << 1] >= 0) & (annot[i << 1] < width) & (annot[i << 1 | 1] >= 0) & (annot[i << 1 | 1] < height))

        trainData[lab] = img.transpose(2, 0, 1)
        if trainSegms is not None:
            trainSegms[lab] = segm
        trainLabel[lab] = annot
        trainValids[lab] = annot_valid

        # flip augmentation
        newimg = cv2.flip(img, 1)
        if trainSegms is not None:
            newsegm = cv2.flip(segm, 1)
        cod = []
        allc = []
        for i in range(n):
            x, y = annot[i << 1], annot[i << 1 | 1]
            if x >= 0:
                x = width - 1 - x
            cod.append((x, y))
        if trainSegms is not None:
            trainSegms[counter] = newsegm
        trainData[counter] = newimg.transpose(2, 0, 1)

        # **** the joint index depends on the dataset ****
        for (q, w) in cfg.symmetry:
            cod[q], cod[w] = cod[w], cod[q]
        for i in range(n):
            allc.append(cod[i][0])
            allc.append(cod[i][1])
        trainLabel[counter] = np.array(allc)
        allc_valid = annot_valid.copy()
        for (q, w) in cfg.symmetry:
            allc_valid[q], allc_valid[w] = allc_valid[w], allc_valid[q]
        trainValids[counter] = np.array(allc_valid)
        counter += 1

        # rotated augmentation
        for times in range(tremNum - 1):
            angle = random.uniform(0, 45)
            if random.randint(0, 1):
                angle *= -1
            rotMat = cv2.getRotationMatrix2D(center, angle, 1.0)
            newimg = cv2.warpAffine(img, rotMat, (width, height))
            if trainSegms is not None:
                newsegm = cv2.warpAffine(segm, rotMat, (width, height))

            allc = []
            allc_valid = []
            for i in range(n):
                x, y = annot[i << 1], annot[i << 1 | 1]
                coor = np.array([x, y])
                if x >= 0 and y >= 0:
                    R = rotMat[:, : 2]
                    W = np.array([rotMat[0][2], rotMat[1][2]])
                    coor = np.dot(R, coor) + W
                allc.append(coor[0])
                allc.append(coor[1])
                allc_valid.append(
                    annot_valid[i] * ((coor[0] >= 0) & (coor[0] < width) & (coor[1] >= 0) & (coor[1] < height)))

            newimg = newimg.transpose(2, 0, 1)
            trainData[counter] = newimg
            if trainSegms is not None:
                trainSegms[counter] = newsegm
            trainLabel[counter] = np.array(allc)
            trainValids[counter] = np.array(allc_valid)
            counter += 1
    if trainSegms is not None:
        return trainData, trainLabel, trainSegms
    else:
        return trainData, trainLabel, trainValids
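
The rotation branch applies the same 2×3 affine matrix to the image and to the keypoints (`np.dot(R, coor) + W`). The sketch below rebuilds that matrix in NumPy using the layout documented for `cv2.getRotationMatrix2D`, so the keypoint transform can be checked without OpenCV. The helper names and the example points are made up.

```python
import math
import numpy as np


def rotation_matrix(center, angle_deg, scale=1.0):
    """2x3 affine matrix in the layout documented for cv2.getRotationMatrix2D."""
    a = scale * math.cos(math.radians(angle_deg))
    b = scale * math.sin(math.radians(angle_deg))
    cx, cy = center
    return np.array([[a, b, (1 - a) * cx - b * cy],
                     [-b, a, b * cx + (1 - a) * cy]])


def rotate_points(xy, rot):
    """Apply the 2x3 matrix to (n, 2) points: R @ p + t for each point."""
    return xy @ rot[:, :2].T + rot[:, 2]


rot = rotation_matrix(center=(96.0, 128.0), angle_deg=90.0)
pts = np.array([[96.0, 128.0], [106.0, 128.0]])
print(np.round(rotate_points(pts, rot), 6))
```

The centre maps to itself, and the point 10 pixels to its right ends up 10 pixels above it (image y grows downward), i.e. a counter-clockwise rotation for a positive angle.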

Model training

CPN project layout

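The way this TensorFlow project is organized is also worth studying. The directory structure of the cpn folder is shown below.
[Figure: directory structure of the cpn folder]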