CPN (Cascaded Pyramid Network for Multi-Person Pose Estimation)

This post is my personal reading of the paper "Cascaded Pyramid Network for Multi-Person Pose Estimation", together with a detailed analysis of its code (TensorFlow version).

Other resources:

  1. Megvii Research explains the COCO 2017 human pose estimation champion paper
  2. Transcript | Megvii Research explains the COCO 2017 human pose estimation champion paper (slides + video)

Preface

Current multi-person keypoint detection algorithms fall into two broad categories:

  1. Bottom-up methods: predict all keypoints directly, then decide which person each keypoint belongs to; OpenPose is a representative example.
  2. Top-down methods: a two-stage approach that first detects people, then applies a single-person pose estimator (SPPE) to each detected person box, and finally maps the results back to the original image.

Differences between the two approaches:

  1. Top-down methods depend on the person detection boxes and are easily affected by them. For example, when people overlap heavily, the detector may return only one box, which lowers accuracy, as shown in Figure 1.
  2. The runtime of top-down methods grows with the number of people in the image, since every additional person requires another SPPE pass.
  3. Bottom-up methods do not have these two problems.
    Figure 1: Heavily overlapping people can cause the detector to output only one box, which hurts keypoint detection.

Abstract

The keypoint detector proposed in the paper is an SPPE used in a top-down pipeline. It introduces a network architecture that can detect invisible keypoints, occluded keypoints, and keypoints that are hard to recognize, and that is robust to complex backgrounds. The network has two parts:

  1. GlobalNet: an FPN-style network that detects the relatively easy keypoints, such as eyes and hands, but does not handle harder cases such as invisible keypoints very well.
  2. RefineNet: dedicated to the keypoints that are hard to localize. Its inputs are feature maps from several levels of GlobalNet, and it mines the hard keypoints online (how this works is explained below).

The method won the COCO keypoint challenge in 2017, with an average precision of 73.0 on COCO test-dev and 72.1 on COCO test-challenge, a 19% relative improvement over the 2016 winner's 60.5.

Network architecture

Figure 2: SPPE (CPN) network architecture

GlobalNet

Figure 3: GlobalNet
The input to GlobalNet is not an image but the feature maps extracted from four ResNet blocks, denoted C2, C3, C4, and C5 in the paper. C2 and C3 come from shallow layers, so they have high spatial resolution and localize well, but their semantic information is weak; C4 and C5, on the contrary, carry strong semantics but have low spatial resolution and cannot localize precisely. GlobalNet therefore adopts an FPN-style structure that exploits the complementary information at every level to predict the keypoint heatmaps.
Note: GlobalNet differs slightly from FPN: after upsampling and before the element-wise sum with the next level, an extra 1x1 convolution is applied.
The operations applied to each level fed into GlobalNet are:

feature map (from ResNet) -> 1x1 conv -> upsampling -> 1x1 conv -> elem-sum (with the next level) -> predict -> L2 loss

Code walkthrough:

def create_global_net(blocks, is_training, trainable=True):
    '''
    blocks: the ResNet feature maps C2, C3, C4, C5
    '''
    global_fms = []   #feature maps passed on as the input of RefineNet
    global_outs = []  #heatmap predictions, used for the GlobalNet loss
    last_fm = None    #running top-down feature map, initialized to None
    initializer = tf.contrib.layers.xavier_initializer()
    for i, block in enumerate(reversed(blocks)):
        with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
            lateral = slim.conv2d(block, 256, [1, 1],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=tf.nn.relu,
                scope='lateral/res{}'.format(5-i))
        #if last_fm is not None, bilinearly upsample it to the current level's size, 1x1-conv it, and add it to the current lateral feature map
        if last_fm is not None:
            sz = tf.shape(lateral)
            upsample = tf.image.resize_bilinear(last_fm, (sz[1], sz[2]),
                name='upsample/res{}'.format(5-i))
            upsample = slim.conv2d(upsample, 256, [1, 1],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=None,
                scope='merge/res{}'.format(5-i))
            last_fm = upsample + lateral #element-wise sum of the two levels
        else:
            last_fm = lateral #topmost (last) level: strongest semantics, no upsampling, used directly

        #apply a 1x1 conv to the merged feature map, then a 3x3 conv producing cfg.nr_skeleton feature maps (one heatmap per keypoint) that serve as this level's prediction in the loss
        with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
            tmp = slim.conv2d(last_fm, 256, [1, 1],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=tf.nn.relu,
                scope='tmp/res{}'.format(5-i))
            out = slim.conv2d(tmp, cfg.nr_skeleton, [3, 3],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=None,
                scope='pyramid/res{}'.format(5-i))
        global_fms.append(last_fm)
        global_outs.append(tf.image.resize_bilinear(out, (cfg.output_shape[0], cfg.output_shape[1])))
    global_fms.reverse()
    global_outs.reverse()
    return global_fms, global_outs

RefineNet

RefineNet localizes the hard keypoints. Which keypoints count as hard is decided during training, according to the magnitude of their losses (online hard keypoint mining); it is not decided by hand!

def create_refine_net(blocks, is_training, trainable=True):
    #blocks here is the global_fms returned by GlobalNet
    initializer = tf.contrib.layers.xavier_initializer()
    bottleneck = resnet_v1.bottleneck
    refine_fms = []
    for i, block in enumerate(blocks):
        mid_fm = block
        with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
            for j in range(i):
                mid_fm = bottleneck(mid_fm, 256, 128, stride=1, scope='res{}/refine_conv{}'.format(2+i, j)) # no projection
        mid_fm = tf.image.resize_bilinear(mid_fm, (cfg.output_shape[0], cfg.output_shape[1]),
            name='upsample_conv/res{}'.format(2+i))
        refine_fms.append(mid_fm)
    refine_fm = tf.concat(refine_fms, axis=3)  ##note: the levels are concatenated here, not summed
    with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
        refine_fm = bottleneck(refine_fm, 256, 128, stride=1, scope='final_bottleneck')
        res = slim.conv2d(refine_fm, cfg.nr_skeleton, [3, 3],
            trainable=trainable, weights_initializer=initializer,
            padding='SAME', activation_fn=None,
            scope='refine_out')
    return res

Question: neural networks often combine features from different levels, but there are two common ways to do it. GlobalNet here fuses the levels with an element-wise sum (elem-sum), as in ResNet; RefineNet instead concatenates them, as in Inception. What is the difference between the two, and when should one use concat rather than elem-sum?
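
For reference, a minimal sketch of the two fusion styles (plain TensorFlow 1.x; the tensor names and shapes are illustrative, not taken from the repo):

import tensorflow as tf

# two feature maps with identical spatial size and channel count (NHWC)
f_low  = tf.placeholder(tf.float32, [None, 64, 48, 256])
f_high = tf.placeholder(tf.float32, [None, 64, 48, 256])

# elem-sum (ResNet/FPN style): the channel count stays 256 and the features are
# blended additively, which implicitly assumes both inputs live in the same feature space
fused_sum = f_low + f_high

# concat (Inception style): the channel count grows to 512; both inputs are preserved
# and a following convolution learns how to mix them
fused_cat = tf.concat([f_low, f_high], axis=3)

Elem-sum is cheaper and keeps the dimensionality fixed; concat is more flexible but increases the channel count, so it is usually followed by a convolution (here, the final bottleneck in create_refine_net) that reduces it again.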

Overall network architecture:
resnet101 + create_global_net + create_refine_net

resnet_fms = resnet101(image, is_train, bn_trainable=True)
global_fms, global_outs = create_global_net(resnet_fms, is_train)
#global_outs: 17*64*48
refine_out = create_refine_net(global_fms, is_train)
#refine_out: 17*64*48
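
For orientation, the tensor shapes through the pipeline on a 256x192 crop (the C2-C5 resolutions and channel counts below are the standard ResNet-101 values, stated here as an assumption rather than read from the repo):

# input crop:      256 x 192 x 3
# C2, C3, C4, C5:  64x48x256, 32x24x512, 16x12x1024, 8x6x2048
# global_fms:      four 256-channel feature maps at the C2..C5 resolutions
# global_outs:     four predictions, each resized to 64 x 48 x 17
# refine_out:      64 x 48 x 17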

Based on each keypoint's loss, the hardest keypoints are selected for mining (online hard keypoint mining, OHKM):

def ohkm(loss, top_k):
    ohkm_loss = 0.
    for i in range(cfg.batch_size):
        sub_loss = loss[i]
        topk_val, topk_idx = tf.nn.top_k(sub_loss, k=top_k, sorted=False, name='ohkm{}'.format(i))
        tmp_loss = tf.gather(sub_loss, topk_idx, name='ohkm_loss{}'.format(i)) # equals topk_val, so this gather is redundant
        ohkm_loss += tf.reduce_sum(tmp_loss) / top_k
    ohkm_loss /= cfg.batch_size
    return ohkm_loss



refine_loss = tf.reduce_mean(tf.square(refine_out - label7), (1,2)) * tf.to_float((tf.greater(valids, 0.1)))
refine_loss = ohkm(refine_loss, 8) #mine the 8 hardest keypoints; this number is configurable.
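
The GlobalNet branch is supervised in the same way, with an L2 loss between each level's prediction in global_outs and a label heatmap. The exact global-loss code is not shown in this post; below is a hedged sketch of how it could be assembled, where label15/label11/label9/label7 are the heatmap labels produced later by Preprocessing with decreasing Gaussian kernels, and both the level-to-kernel pairing and the averaging scheme are assumptions:

global_loss = 0.
for out, label in zip(global_outs, [label15, label11, label9, label7]):
    loss = tf.reduce_mean(tf.square(out - label), (1, 2))   # per-sample, per-keypoint L2
    loss = loss * tf.to_float(tf.greater(valids, 0.1))      # mask out invalid/unlabeled keypoints
    global_loss += tf.reduce_mean(loss)
global_loss /= len(global_outs)

total_loss = global_loss + refine_loss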

Loading the training data (network.py)

    def make_data(self):
        from COCOAllJoints import COCOJoints
        from dataset import Preprocessing
        d = COCOJoints() 
        #the object d holds one dict per person instance:
        #humanData = dict(aid = aid, joints=joints, imgpath=imgname, headRect=rect, bbox=bbox, imgid = ann['image_id'], segmentation = ann['segmentation'])
        #each dict has 7 fields, including the image id and the keypoint coordinates; to train on your own dataset, build objects with the same structure
        train_data, _ = d.load_data(cfg.min_kps)

        from tfflat.data_provider import DataFromList, MultiProcessMapDataZMQ, BatchData, MapData
        dp = DataFromList(train_data) #wrap the list as a dataflow
        if cfg.dpflow_enable:  #True
            dp = MultiProcessMapDataZMQ(dp, cfg.nr_dpflows, Preprocessing)
            ##the mapping function Preprocessing crops the image to the standard box, transforms the keypoint coordinates, and generates the keypoint heatmap labels
        else:
            dp = MapData(dp, Preprocessing)
        dp = BatchData(dp, cfg.batch_size // cfg.nr_aug) ##nr_aug = 4
        dp.reset_state()
        dataiter = dp.get_data()
        return dataiter
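
A short usage sketch of the returned iterator (the unpacking below assumes the order of the list returned by Preprocessing in train mode; variable names are illustrative):

dataiter = self.make_data()  # inside the trainer class
# every next() call yields one batch: images, the four heatmap labels, and the valid mask
imgs, heat15, heat11, heat9, heat7, valids = next(dataiter)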

Image preprocessing and label generation (dataset.py)

When cropping the image to the standard box, the image is first padded with the per-channel image mean and then cropped; the crop corresponds to the (extended) standard box, and the keypoint coordinates are transformed accordingly.
pad --> crop
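
The coordinate bookkeeping is straightforward: padding by add pixels on every side shifts all coordinates by +add, and cropping at (min_x, min_y) followed by a resize to (width, height) rescales them. A minimal standalone sketch with made-up numbers (not taken from the repo):

import numpy as np

add = 300                       # padding on each side (max of the image sides)
joint = np.array([120., 80.])   # a keypoint in original-image coordinates
joint += add                    # the same keypoint in the padded image

min_x, min_y = 250, 300                    # top-left corner of the crop window
x_ratio, y_ratio = 192. / 400, 256. / 500  # a 400x500 crop window resized to 192x256

joint[0] = (joint[0] - min_x) * x_ratio    # keypoint in the final 256x192 crop
joint[1] = (joint[1] - min_y) * y_ratio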

def Preprocessing(d, stage='train'):
    height, width = cfg.data_shape #256 192
    imgs = []
    labels = []
    valids = []
    if cfg.use_seg:
        segms = []

    vis = False
    img = cv2.imread(os.path.join(cfg.img_path, d['imgpath'])) #read the image
    #hack(multiprocessing data provider)
    while img is None:
        print('read none image')
        time.sleep(np.random.rand() * 5)
        img = cv2.imread(os.path.join(cfg.img_path, d['imgpath']))
    add = max(img.shape[0], img.shape[1])
    bimg = cv2.copyMakeBorder(img, add, add, add, add, borderType=cv2.BORDER_CONSTANT,
                              value=cfg.pixel_means.reshape(-1)) #pad every side with the image mean

    bbox = np.array(d['bbox']).reshape(4, ).astype(np.float32)
    bbox[:2] += add
    if 'joints' in d:
        joints = np.array(d['joints']).reshape(cfg.nr_skeleton, 3).astype(np.float32)
        joints[:, :2] += add
        inds = np.where(joints[:, -1] == 0)
        joints[inds, :2] = -1000000

    crop_width = bbox[2] * (1 + cfg.imgExtXBorder * 2)
    crop_height = bbox[3] * (1 + cfg.imgExtYBorder * 2)
    objcenter = np.array([bbox[0] + bbox[2] / 2., bbox[1] + bbox[3] / 2.])

    if stage == 'train':
        crop_width = crop_width * (1 + 0.25)
        crop_height = crop_height * (1 + 0.25)

    if crop_height / height > crop_width / width:
        crop_size = crop_height
        min_shape = height
    else:
        crop_size = crop_width
        min_shape = width

    crop_size = min(crop_size, objcenter[0] / width * min_shape * 2. - 1.)                     ##clamp so the crop stays inside the padded image: left edge
    crop_size = min(crop_size, (bimg.shape[1] - objcenter[0]) / width * min_shape * 2. - 1)    ##right edge
    crop_size = min(crop_size, objcenter[1] / height * min_shape * 2. - 1.)                    ##top edge
    crop_size = min(crop_size, (bimg.shape[0] - objcenter[1]) / height * min_shape * 2. - 1)   ##bottom edge

    #crop window corners in padded-image coordinates
    min_x = int(objcenter[0] - crop_size / 2. / min_shape * width)
    max_x = int(objcenter[0] + crop_size / 2. / min_shape * width)
    min_y = int(objcenter[1] - crop_size / 2. / min_shape * height)
    max_y = int(objcenter[1] + crop_size / 2. / min_shape * height)

    x_ratio = float(width) / (max_x - min_x)
    y_ratio = float(height) / (max_y - min_y)

    if 'joints' in d:
        joints[:, 0] = joints[:, 0] - min_x
        joints[:, 1] = joints[:, 1] - min_y  ##convert to coordinates inside the crop window

        joints[:, 0] *= x_ratio
        joints[:, 1] *= y_ratio    ##scale to the 256*192 input size
        label = joints[:, :2].copy()
        valid = joints[:, 2].copy()

    img = cv2.resize(bimg[min_y:max_y, min_x:max_x, :], (width, height))
    #img is now the cropped standard-box image, and the keypoints have been transformed to match

    if stage != 'train':
        details = np.asarray([min_x - add, min_y - add, max_x - add, max_y - add])  #needed by the test-stage return below
    # if cfg.use_seg is True and 'segmentation' in d:
    #     seg = get_seg(ori_img.shape[0], ori_img.shape[1], d['segmentation'])
    #     add = max(seg.shape[0], seg.shape[1])
    #     bimg = cv2.copyMakeBorder(seg, add, add, add, add, borderType=cv2.BORDER_CONSTANT, value=(0, 0, 0))
    #     seg = cv2.resize(bimg[min_y:max_y, min_x:max_x], (width, height))
    #     segms.append(seg)

    # if vis:
    #     tmpimg = img.copy()
    #     from utils.visualize import draw_skeleton
    #     draw_skeleton(tmpimg, label.astype(int))
    #     cv2.imwrite('vis.jpg', tmpimg)
    #     from IPython import embed; embed()

    img = img - cfg.pixel_means  ##subtract the image mean
    if cfg.pixel_norm:  #True
        img = img / 255.
    img = img.transpose(2, 0, 1) ##HWC -> CHW
    imgs.append(img)
    if 'joints' in d:
        labels.append(label.reshape(-1))
        valids.append(valid.reshape(-1))

    if stage == 'train':  ##apply data augmentation and convert the keypoint labels to heatmaps
        imgs, labels, valids = data_augmentation(imgs, labels, valids)
        heatmaps15 = joints_heatmap_gen(imgs, labels, cfg.output_shape, cfg.data_shape, return_valid=False,
                                        gaussian_kernel=cfg.gk15)
        heatmaps11 = joints_heatmap_gen(imgs, labels, cfg.output_shape, cfg.data_shape, return_valid=False,
                                        gaussian_kernel=cfg.gk11)
        heatmaps9 = joints_heatmap_gen(imgs, labels, cfg.output_shape, cfg.data_shape, return_valid=False,
                                       gaussian_kernel=cfg.gk9)
        heatmaps7 = joints_heatmap_gen(imgs, labels, cfg.output_shape, cfg.data_shape, return_valid=False,
                                       gaussian_kernel=cfg.gk7)

        return [imgs.astype(np.float32).transpose(0, 2, 3, 1),
                heatmaps15.astype(np.float32).transpose(0, 2, 3, 1),
                heatmaps11.astype(np.float32).transpose(0, 2, 3, 1),
                heatmaps9.astype(np.float32).transpose(0, 2, 3, 1),
                heatmaps7.astype(np.float32).transpose(0, 2, 3, 1),
                valids.astype(np.float32)]
    else:
        return [np.asarray(imgs).astype(np.float32), details]

Generating the heatmap labels

def joints_heatmap_gen(data, label, tar_size=cfg.output_shape, ori_size=cfg.data_shape, points=cfg.nr_skeleton,
                       return_valid=False, gaussian_kernel=cfg.gaussain_kernel):  ##cfg.output_shape = (64,48); labels go from (256,192) ---> (64,48)
    #note: label holds keypoint coordinates on the (256,192) input; they must be mapped to (64,48) output coordinates
    #each keypoint location is blurred with a Gaussian kernel; several kernel sizes are used during training
    if return_valid: ##False
        valid = np.ones((len(data), points), dtype=np.float32)

    ret = np.zeros((len(data), points, tar_size[0], tar_size[1]), dtype='float32')
    for i in range(len(ret)):
        for j in range(points):  ##points = 17
            if label[i][j << 1] < 0 or label[i][j << 1 | 1] < 0:  ##j << 1 is 2*j (the x index); j << 1 | 1 is 2*j + 1 (the y index)
                continue
            label[i][j << 1 | 1] = min(label[i][j << 1 | 1], ori_size[0] - 1)
            label[i][j << 1] = min(label[i][j << 1], ori_size[1] - 1)
            ret[i][j][int(label[i][j << 1 | 1] * tar_size[0] / ori_size[0])][
                int(label[i][j << 1] * tar_size[1] / ori_size[1])] = 1

    for i in range(len(ret)):
        for j in range(points):
            ret[i, j] = cv2.GaussianBlur(ret[i, j], gaussian_kernel, 0)
    for i in range(len(ret)):
        for j in range(cfg.nr_skeleton):
            am = np.amax(ret[i][j])
            if am <= 1e-8:
                if return_valid: #False
                    valid[i][j] = 0.
                continue
            ret[i][j] /= am / 255  #rescale so the peak value is 255
    if return_valid:
        return ret, valid
    else:
        return ret
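
A minimal standalone sketch of what this label encoding produces for a single keypoint (plain NumPy/OpenCV; values are illustrative only):

import numpy as np
import cv2

tar_size, ori_size = (64, 48), (256, 192)
heatmap = np.zeros(tar_size, dtype=np.float32)

x, y = 100., 130.                                 # keypoint on the 256x192 crop
heatmap[int(y * tar_size[0] / ori_size[0]),
        int(x * tar_size[1] / ori_size[1])] = 1.  # single hot pixel at the downscaled location

heatmap = cv2.GaussianBlur(heatmap, (7, 7), 0)    # spread it with a Gaussian kernel (cf. cfg.gk7)
heatmap /= heatmap.max() / 255                    # rescale so the peak is 255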

Data augmentation

  1. crop augmentation
  2. random scales
  3. rotation
  4. flip

def data_augmentation(trainData, trainLabel, trainValids, segms=None):
    trainSegms = segms
    tremNum = cfg.nr_aug - 1
    gotData = trainData.copy()
    trainData = np.append(trainData, [trainData[0] for i in range(tremNum * len(trainData))], axis=0)
    if trainSegms is not None:
        gotSegm = trainSegms.copy()
        trainSegms = np.append(trainSegms, [trainSegms[0] for i in range(tremNum * len(trainSegms))], axis=0)
    trainLabel = np.append(trainLabel, [trainLabel[0] for i in range(tremNum * len(trainLabel))], axis=0)
    trainValids = np.append(trainValids, [trainValids[0] for i in range(tremNum * len(trainValids))], axis=0)
    counter = len(gotData)
    for lab in range(len(gotData)):
        ori_img = gotData[lab].transpose(1, 2, 0)
        if trainSegms is not None:
            ori_segm = gotSegm[lab].copy()
        annot = trainLabel[lab].copy()
        annot_valid = trainValids[lab].copy()
        height, width = ori_img.shape[0], ori_img.shape[1]
        center = (width / 2., height / 2.)
        n = cfg.nr_skeleton

        # affrat = random.uniform(0.75, 1.25)
        affrat = random.uniform(0.7, 1.35)
        halfl_w = min(width - center[0], (width - center[0]) / 1.25 * affrat)
        halfl_h = min(height - center[1], (height - center[1]) / 1.25 * affrat)
        # img = cv2.resize(ori_img[int(center[0] - halfl_w) : int(center[0] + halfl_w + 1), int(center[1] - halfl_h) : int(center[1] + halfl_h + 1)], (width, height))
        img = cv2.resize(ori_img[int(center[1] - halfl_h): int(center[1] + halfl_h + 1),
                         int(center[0] - halfl_w): int(center[0] + halfl_w + 1)], (width, height))
        if trainSegms is not None:
            segm = cv2.resize(ori_segm[int(center[1] - halfl_h): int(center[1] + halfl_h + 1),
                              int(center[0] - halfl_w): int(center[0] + halfl_w + 1)], (width, height))
        for i in range(n):
            annot[i << 1] = (annot[i << 1] - center[0]) / halfl_w * (width - center[0]) + center[0]
            annot[i << 1 | 1] = (annot[i << 1 | 1] - center[1]) / halfl_h * (height - center[1]) + center[1]
            annot_valid[i] *= (
            (annot[i << 1] >= 0) & (annot[i << 1] < width) & (annot[i << 1 | 1] >= 0) & (annot[i << 1 | 1] < height))

        trainData[lab] = img.transpose(2, 0, 1)
        if trainSegms is not None:
            trainSegms[lab] = segm
        trainLabel[lab] = annot
        trainValids[lab] = annot_valid

        # flip augmentation
        newimg = cv2.flip(img, 1)
        if trainSegms is not None:
            newsegm = cv2.flip(segm, 1)
        cod = []
        allc = []
        for i in range(n):
            x, y = annot[i << 1], annot[i << 1 | 1]
            if x >= 0:
                x = width - 1 - x
            cod.append((x, y))
        if trainSegms is not None:
            trainSegms[counter] = newsegm
        trainData[counter] = newimg.transpose(2, 0, 1)

        # **** the joint index depends on the dataset ****
        for (q, w) in cfg.symmetry:
            cod[q], cod[w] = cod[w], cod[q]
        for i in range(n):
            allc.append(cod[i][0])
            allc.append(cod[i][1])
        trainLabel[counter] = np.array(allc)
        allc_valid = annot_valid.copy()
        for (q, w) in cfg.symmetry:
            allc_valid[q], allc_valid[w] = allc_valid[w], allc_valid[q]
        trainValids[counter] = np.array(allc_valid)
        counter += 1

        # rotated augmentation
        for times in range(tremNum - 1):
            angle = random.uniform(0, 45)
            if random.randint(0, 1):
                angle *= -1
            rotMat = cv2.getRotationMatrix2D(center, angle, 1.0)
            newimg = cv2.warpAffine(img, rotMat, (width, height))
            if trainSegms is not None:
                newsegm = cv2.warpAffine(segm, rotMat, (width, height))

            allc = []
            allc_valid = []
            for i in range(n):
                x, y = annot[i << 1], annot[i << 1 | 1]
                coor = np.array([x, y])
                if x >= 0 and y >= 0:
                    R = rotMat[:, : 2]
                    W = np.array([rotMat[0][2], rotMat[1][2]])
                    coor = np.dot(R, coor) + W
                allc.append(coor[0])
                allc.append(coor[1])
                allc_valid.append(
                    annot_valid[i] * ((coor[0] >= 0) & (coor[0] < width) & (coor[1] >= 0) & (coor[1] < height)))

            newimg = newimg.transpose(2, 0, 1)
            trainData[counter] = newimg
            if trainSegms is not None:
                trainSegms[counter] = newsegm
            trainLabel[counter] = np.array(allc)
            trainValids[counter] = np.array(allc_valid)
            counter += 1
    if trainSegms is not None:
        return trainData, trainLabel, trainSegms
    else:
        return trainData, trainLabel, trainValids

Model training

CPN project layout

The way this TensorFlow project is organized is also worth studying; below is the directory layout of the cpn folder.
(figure: cpn project directory tree)

Source: blog.csdn.net/m0_37477175/article/details/80999124