This post is my personal reading of the paper "Cascaded Pyramid Network for Multi-Person Pose Estimation", together with a detailed walkthrough of the code (TensorFlow version).
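Introduction
Current multi-person keypoint detection algorithms fall into two broad classes:
- bottom-up methods: predict all keypoints directly, then decide which person each keypoint belongs to. OpenPose is a representative example.
- top-down methods: a two-stage approach. First detect the people, then run a single-person pose estimator (SPPE) on each detected box to locate that person's keypoints, and finally map the results back to the original image.
Differences between the two approaches:
- Top-down methods depend on the person detector's boxes and are easily affected by their quality. For example, when two people overlap heavily, the detector may return only one box, which lowers accuracy, as shown in Figure 1:
- The runtime of a top-down algorithm grows with the number of people in the image, since the SPPE runs once per person.
- Bottom-up methods do not have these two problems.
Figure 1: Overlapping people can cause the detector to find only one person, which in turn hurts keypoint detection.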
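The top-down control flow described above can be sketched in a few lines. This is only an illustration: `detect_people`, `sppe`, and the toy stand-ins are hypothetical names, not functions from the CPN code base, and boxes are assumed to be `(x, y, w, h)`.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # assumed layout: (x, y, w, h) in image coordinates
Point = Tuple[float, float]


def top_down_pose(
    image,
    detect_people: Callable[[object], List[Box]],
    sppe: Callable[[object, Box], List[Point]],
) -> List[List[Point]]:
    """Run a detector once, then one SPPE call per detected box."""
    poses = []
    for box in detect_people(image):
        x0, y0, _, _ = box
        # The SPPE is assumed to return keypoints relative to the crop,
        # so they are shifted back into the original image.
        local = sppe(image, box)
        poses.append([(x + x0, y + y0) for x, y in local])
    return poses


# Toy stand-ins that only demonstrate the per-person cost.
calls = []


def fake_detector(image):
    return [(10, 20, 50, 100), (200, 30, 40, 90)]


def fake_sppe(image, box):
    calls.append(box)
    return [(1.0, 2.0)]


poses = top_down_pose(None, fake_detector, fake_sppe)
print(len(calls), poses)
```

The SPPE is called once per box, which is why the cost grows with the number of people, and a missed box means a missed person.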
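Abstract
The keypoint detector proposed in the paper is an SPPE. It introduces a network structure that can handle invisible keypoints, overlapping keypoints, and blurry, hard-to-identify keypoints, and that copes with complex backgrounds. The network has two parts:
- GlobalNet: an FPN-style network that localizes the relatively easy keypoints such as eyes and hands, but does not do well on harder ones such as invisible keypoints.
- RefineNet: mainly handles keypoints that are very hard to distinguish. Its input is the GlobalNet features from several levels, and the hard keypoints are selected online (how this works is covered later).
The method took first place in the COCO 2017 keypoint challenge, with an average precision of 73.0 on COCO test-dev and 72.1 on COCO test-challenge, a 19% relative improvement over the 60.5 of the 2016 winner.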
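The 19% figure is a relative improvement, not a difference in AP points. A quick check of the arithmetic, using only the two numbers quoted above (the helper name is mine):

```python
def relative_improvement(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


gain = relative_improvement(72.1, 60.5)
print(round(gain, 1))  # roughly 19 percent
```

The absolute gap is 11.6 AP points; dividing by the 2016 score of 60.5 gives the relative figure.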
Network structure
Figure 2: SPPE network structure
GlobalNet
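Figure 3: GlobalNet
First, the input to GlobalNet is not an image but the feature maps extracted by the four ResNet blocks, denoted C2, C3, C4, and C5 in the paper. C2 and C3 come from shallow layers, so they have high spatial resolution and localize well, but carry little semantic information. C4 and C5, in contrast, carry more semantic information but have low spatial resolution, which is not enough for precise localization. GlobalNet therefore uses an FPN structure to combine the information from all levels when predicting the keypoint heatmaps.
Note: GlobalNet differs slightly from FPN: after upsampling and before the two levels are added, an extra 1×1 convolution is applied.
Operations applied to each level fed into GlobalNet:
net1 (from ResNet) -> 1*1 conv -> upsampling -> 1*1 conv -> elem-sum (with the next, finer level) -> predict -> L2 loss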
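To keep the levels straight, it helps to write down the spatial size of each feature map. The sketch below assumes the usual ResNet strides of 4, 8, 16, and 32 for C2 to C5 and the 256×192 input used later in this post; the function name is mine.

```python
def pyramid_shapes(input_hw=(256, 192), strides=(4, 8, 16, 32)):
    """Spatial size (height, width) of C2..C5 for a given input size."""
    h, w = input_hw
    return {"C{}".format(i + 2): (h // s, w // s) for i, s in enumerate(strides)}


shapes = pyramid_shapes()
# The top-down pathway starts at the coarsest level and walks to the finest,
# upsampling the running feature map to the next level's size at each step.
merge_order = sorted(shapes, reverse=True)

for name in merge_order:
    print(name, shapes[name])
```

Under these assumptions C2 is 64×48, which matches the heatmap resolution every level is resized to before the loss.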
Code walkthrough:
def create_global_net(blocks, is_training, trainable=True):
    '''
    blocks: the ResNet feature maps C2, C3, C4, C5
    '''
    global_fms = []   # merged feature maps, later fed to RefineNet
    global_outs = []  # GlobalNet heatmap predictions, used for the GlobalNet loss
    last_fm = None    # running top-down feature map
    initializer = tf.contrib.layers.xavier_initializer()
    for i, block in enumerate(reversed(blocks)):
        # lateral 1x1 conv on the current level
        with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
            lateral = slim.conv2d(block, 256, [1, 1],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=tf.nn.relu,
                scope='lateral/res{}'.format(5-i))
        # if last_fm is not None, upsample the coarser level (bilinear),
        # apply a 1x1 conv, and add it to the current level
        if last_fm is not None:
            sz = tf.shape(lateral)
            upsample = tf.image.resize_bilinear(last_fm, (sz[1], sz[2]),
                name='upsample/res{}'.format(5-i))
            upsample = slim.conv2d(upsample, 256, [1, 1],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=None,
                scope='merge/res{}'.format(5-i))
            last_fm = upsample + lateral  # element-wise sum of the two levels
        else:
            # coarsest level (C5): strongest semantics, nothing to upsample yet
            last_fm = lateral
        # on each merged last_fm: 1x1 conv, then 3x3 conv producing 17 feature maps
        # (one heatmap per keypoint), used as the prediction during training
        with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
            tmp = slim.conv2d(last_fm, 256, [1, 1],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=tf.nn.relu,
                scope='tmp/res{}'.format(5-i))
            out = slim.conv2d(tmp, cfg.nr_skeleton, [3, 3],
                trainable=trainable, weights_initializer=initializer,
                padding='SAME', activation_fn=None,
                scope='pyramid/res{}'.format(5-i))
        global_fms.append(last_fm)
        global_outs.append(tf.image.resize_bilinear(out, (cfg.output_shape[0], cfg.output_shape[1])))
    global_fms.reverse()
    global_outs.reverse()
    return global_fms, global_outs
RefineNet
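RefineNet localizes the hard keypoints. Which keypoints count as hard is decided during training rather than by hand: in the loss code shown later, the keypoints with the largest per-keypoint loss on the RefineNet output are the ones selected.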
def create_refine_net(blocks, is_training, trainable=True):
    # blocks are the global_fms returned by create_global_net
    initializer = tf.contrib.layers.xavier_initializer()
    bottleneck = resnet_v1.bottleneck
    refine_fms = []
    for i, block in enumerate(blocks):
        mid_fm = block
        with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
            # level i passes through i bottleneck blocks
            for j in range(i):
                mid_fm = bottleneck(mid_fm, 256, 128, stride=1, scope='res{}/refine_conv{}'.format(2+i, j)) # no projection
        mid_fm = tf.image.resize_bilinear(mid_fm, (cfg.output_shape[0], cfg.output_shape[1]),
            name='upsample_conv/res{}'.format(2+i))
        refine_fms.append(mid_fm)
    refine_fm = tf.concat(refine_fms, axis=3)  # note: the levels are combined by concat here
    with slim.arg_scope(resnet_arg_scope(bn_is_training=is_training)):
        refine_fm = bottleneck(refine_fm, 256, 128, stride=1, scope='final_bottleneck')
        res = slim.conv2d(refine_fm, cfg.nr_skeleton, [3, 3],
            trainable=trainable, weights_initializer=initializer,
            padding='SAME', activation_fn=None,
            scope='refine_out')
    return res
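Reading the loops in `create_refine_net`, level `i` goes through `range(i)` bottleneck blocks, and the four 256-channel maps are then concatenated. A small bookkeeping sketch of that reading (the helper name and the level labels are mine):

```python
def refine_plan(levels=("C2", "C3", "C4", "C5"), channels=256):
    """Number of bottleneck blocks per level and channel count after concat."""
    plan = [(name, i) for i, name in enumerate(levels)]  # range(i) blocks on level i
    concat_channels = channels * len(levels)
    return plan, concat_channels


plan, concat_channels = refine_plan()
print(plan, concat_channels)
```

So the coarser a level is, the more bottleneck blocks it receives before upsampling, and the final bottleneck sees all four levels side by side along the channel axis.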
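Question: networks often combine features from different levels, and there are two ways of doing so. GlobalNet here uses elem_sum, i.e. element-wise addition, as in ResNet. RefineNet instead concatenates the features from different levels, as in Inception. What is the difference between the two, and when should one use concat rather than elem_sum?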
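One way to look at this question (my own observation, not something stated in the paper): an element-wise sum equals a concat followed by a linear layer whose weights are fixed to two stacked identity matrices. Concat followed by a learned layer is therefore the more general form, at the cost of more channels and parameters, while the sum is cheaper but requires matching shapes and mixes the inputs immediately. A minimal pure-Python check of the identity:

```python
def elem_sum(a, b):
    return [x + y for x, y in zip(a, b)]


def concat(a, b):
    return list(a) + list(b)


def linear(weights, x):
    """Apply a weight matrix (list of rows) to the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def stacked_identity(n):
    """n x 2n matrix [I | I]: the fixed weights that turn concat into a sum."""
    return [[1.0 if j in (i, i + n) else 0.0 for j in range(2 * n)]
            for i in range(n)]


a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]
summed = elem_sum(a, b)
via_concat = linear(stacked_identity(3), concat(a, b))
print(summed, via_concat)
```

With learned weights in place of `[I | I]`, the layer after a concat can weight each level differently, which is one plausible reason to concat in RefineNet, where all levels are fused in one step.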
Overall network architecture:
resnet101 + create_global_net + create_refine_net
resnet_fms = resnet101(image, is_train, bn_trainable=True)
global_fms, global_outs = create_global_net(resnet_fms, is_train)
#global_outs: 17*64*48 (17 heatmaps of 64*48 per pyramid level)
refine_out = create_refine_net(global_fms, is_train)
#refine_outs: 17*
Online hard keypoint mining (OHKM): for each sample, only the top_k largest per-keypoint losses on the RefineNet output are kept and averaged:
def ohkm(loss, top_k):
    ohkm_loss = 0.
    for i in range(cfg.batch_size):
        sub_loss = loss[i]
        topk_val, topk_idx = tf.nn.top_k(sub_loss, k=top_k, sorted=False, name='ohkm{}'.format(i))
        tmp_loss = tf.gather(sub_loss, topk_idx, name='ohkm_loss{}'.format(i)) # can be ignore ???
        ohkm_loss += tf.reduce_sum(tmp_loss) / top_k
    ohkm_loss /= cfg.batch_size
    return ohkm_loss
# per-keypoint mean squared error, zeroed for keypoints whose valid flag is not above 0.1
refine_loss = tf.reduce_mean(tf.square(refine_out - label7), (1,2)) * tf.to_float((tf.greater(valids, 0.1)))
refine_loss = ohkm(refine_loss, 8) # mine the 8 hardest keypoints; the number is configurable
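The same selection rule is easy to follow in plain Python. This is a sketch of the logic only, with lists standing in for tensors, not the TensorFlow implementation:

```python
def ohkm(loss, top_k):
    """loss: one list of per-keypoint losses per sample in the batch."""
    total = 0.0
    for sub_loss in loss:
        # keep only the top_k largest keypoint losses of this sample
        hardest = sorted(sub_loss, reverse=True)[:top_k]
        total += sum(hardest) / top_k
    return total / len(loss)


batch = [
    [0.1, 0.9, 0.3, 0.5],   # hardest two: 0.9 and 0.5
    [0.2, 0.2, 0.8, 0.0],   # hardest two: 0.8 and 0.2
]
value = ohkm(batch, top_k=2)
print(value)
```

Keypoints outside the top_k contribute nothing to the RefineNet loss for that sample, so their gradients are dropped. Which keypoints are "hard" can change from sample to sample and from step to step, which is what "online" refers to.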
Loading the training data (network.py)
def make_data(self):
    from COCOAllJoints import COCOJoints
    from dataset import Preprocessing
    d = COCOJoints()
    # each record produced by d looks like:
    # humanData = dict(aid = aid,joints=joints, imgpath=imgname, headRect=rect, bbox=bbox, imgid = ann['image_id'], segmentation = ann['segmentation'])
    # The input is a list of dicts with 7 fields each, including the image id and
    # the keypoint coordinates. To train on your own dataset, build records of the same form.
    train_data, _ = d.load_data(cfg.min_kps)
    from tfflat.data_provider import DataFromList, MultiProcessMapDataZMQ, BatchData, MapData
    dp = DataFromList(train_data)  # wrap the list as a data source
    if cfg.dpflow_enable:  # True
        dp = MultiProcessMapDataZMQ(dp, cfg.nr_dpflows, Preprocessing)
        # Preprocessing crops the image to the standard box, transforms the keypoint
        # coordinates, and generates the keypoint heatmap labels
    else:
        dp = MapData(dp, Preprocessing)
    dp = BatchData(dp, cfg.batch_size // cfg.nr_aug)  # nr_aug = 4
    dp.reset_state()
    dataiter = dp.get_data()
    return dataiter
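For a custom dataset, a record with the same seven fields might be assembled as below. This is a sketch under assumptions: `make_record` is a hypothetical helper, the `headRect` default is a placeholder, and the conventions are inferred from how `Preprocessing` reads the record (`joints` reshaped to 17×3 with a zero flag meaning unlabeled, `bbox` as x, y, width, height).

```python
def make_record(aid, image_id, imgpath, bbox, keypoints,
                head_rect=(0, 0, 0, 0), segmentation=None):
    """Build one training record shaped like the humanData dict."""
    joints = []
    for x, y, v in keypoints:          # v == 0 is treated as "not labeled"
        joints.extend([x, y, v])
    return dict(
        aid=aid,
        joints=joints,                 # flat list: x0, y0, v0, x1, y1, v1, ...
        imgpath=imgpath,
        headRect=list(head_rect),      # placeholder value, an assumption
        bbox=list(bbox),               # assumed layout: x, y, width, height
        imgid=image_id,
        segmentation=segmentation or [],
    )


record = make_record(
    aid=1, image_id=42, imgpath="images/000042.jpg",
    bbox=(50, 30, 120, 260),
    keypoints=[(0, 0, 0)] * 17,
)
print(sorted(record))
```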
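Image preprocessing and label generation (dataset.py)
When cropping the image to the standard box, the image is first padded with the pixel means and then cropped. The crop is the (enlarged) standard box, and the keypoint coordinates are transformed accordingly.
pad -> crop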
# excerpt from dataset.py; relies on os, time, cv2, numpy as np and the project's cfg
def Preprocessing(d, stage='train'):
    height, width = cfg.data_shape  # 256, 192
    imgs = []
    labels = []
    valids = []
    if cfg.use_seg:
        segms = []
    vis = False
    img = cv2.imread(os.path.join(cfg.img_path, d['imgpath']))  # read the image
    #hack(multiprocessing data provider)
    while img is None:
        print('read none image')
        time.sleep(np.random.rand() * 5)
        img = cv2.imread(os.path.join(cfg.img_path, d['imgpath']))
    add = max(img.shape[0], img.shape[1])
    bimg = cv2.copyMakeBorder(img, add, add, add, add, borderType=cv2.BORDER_CONSTANT,
                              value=cfg.pixel_means.reshape(-1))  # pad with the pixel means
    bbox = np.array(d['bbox']).reshape(4, ).astype(np.float32)
    bbox[:2] += add  # shift the box into padded coordinates
    if 'joints' in d:
        joints = np.array(d['joints']).reshape(cfg.nr_skeleton, 3).astype(np.float32)
        joints[:, :2] += add
        inds = np.where(joints[:, -1] == 0)
        joints[inds, :2] = -1000000  # push unlabeled joints far outside the image
    crop_width = bbox[2] * (1 + cfg.imgExtXBorder * 2)
    crop_height = bbox[3] * (1 + cfg.imgExtYBorder * 2)
    objcenter = np.array([bbox[0] + bbox[2] / 2., bbox[1] + bbox[3] / 2.])
    if stage == 'train':
        crop_width = crop_width * (1 + 0.25)
        crop_height = crop_height * (1 + 0.25)
    # pick the side that limits the crop so the aspect ratio matches height:width
    if crop_height / height > crop_width / width:
        crop_size = crop_height
        min_shape = height
    else:
        crop_size = crop_width
        min_shape = width
    # clamp crop_size so the crop around objcenter stays inside the padded image
    crop_size = min(crop_size, objcenter[0] / width * min_shape * 2. - 1.)
    crop_size = min(crop_size, (bimg.shape[1] - objcenter[0]) / width * min_shape * 2. - 1)
    crop_size = min(crop_size, objcenter[1] / height * min_shape * 2. - 1.)
    crop_size = min(crop_size, (bimg.shape[0] - objcenter[1]) / height * min_shape * 2. - 1)
    # crop corners in padded coordinates
    min_x = int(objcenter[0] - crop_size / 2. / min_shape * width)
    max_x = int(objcenter[0] + crop_size / 2. / min_shape * width)
    min_y = int(objcenter[1] - crop_size / 2. / min_shape * height)
    max_y = int(objcenter[1] + crop_size / 2. / min_shape * height)
    x_ratio = float(width) / (max_x - min_x)
    y_ratio = float(height) / (max_y - min_y)
    if 'joints' in d:
        joints[:, 0] = joints[:, 0] - min_x
        joints[:, 1] = joints[:, 1] - min_y  # coordinates relative to the cropped box
        joints[:, 0] *= x_ratio
        joints[:, 1] *= y_ratio  # rescale to 256*192
        label = joints[:, :2].copy()
        valid = joints[:, 2].copy()
    img = cv2.resize(bimg[min_y:max_y, min_x:max_x, :], (width, height))
    # img is now the cropped standard box; the keypoints have been transformed to match
    if stage != 'train':
        # crop corners in original image coordinates; needed by the non-training return below
        details = np.asarray([min_x - add, min_y - add, max_x - add, max_y - add])
    # if cfg.use_seg is True and 'segmentation' in d:
    #     seg = get_seg(ori_img.shape[0], ori_img.shape[1], d['segmentation'])
    #     add = max(seg.shape[0], seg.shape[1])
    #     bimg = cv2.copyMakeBorder(seg, add, add, add, add, borderType=cv2.BORDER_CONSTANT, value=(0, 0, 0))
    #     seg = cv2.resize(bimg[min_y:max_y, min_x:max_x], (width, height))
    #     segms.append(seg)
    # if vis:
    #     tmpimg = img.copy()
    #     from utils.visualize import draw_skeleton
    #     draw_skeleton(tmpimg, label.astype(int))
    #     cv2.imwrite('vis.jpg', tmpimg)
    #     from IPython import embed; embed()
    img = img - cfg.pixel_means  # subtract the pixel means
    if cfg.pixel_norm:  # True
        img = img / 255.
    img = img.transpose(2, 0, 1)  # HWC -> CHW
    imgs.append(img)
    if 'joints' in d:
        labels.append(label.reshape(-1))
        valids.append(valid.reshape(-1))
    if stage == 'train':  # augment the data and convert the labels to heatmaps
        imgs, labels, valids = data_augmentation(imgs, labels, valids)
        heatmaps15 = joints_heatmap_gen(imgs, labels, cfg.output_shape, cfg.data_shape, return_valid=False,
                                        gaussian_kernel=cfg.gk15)
        heatmaps11 = joints_heatmap_gen(imgs, labels, cfg.output_shape, cfg.data_shape, return_valid=False,
                                        gaussian_kernel=cfg.gk11)
        heatmaps9 = joints_heatmap_gen(imgs, labels, cfg.output_shape, cfg.data_shape, return_valid=False,
                                       gaussian_kernel=cfg.gk9)
        heatmaps7 = joints_heatmap_gen(imgs, labels, cfg.output_shape, cfg.data_shape, return_valid=False,
                                       gaussian_kernel=cfg.gk7)
        return [imgs.astype(np.float32).transpose(0, 2, 3, 1),
                heatmaps15.astype(np.float32).transpose(0, 2, 3, 1),
                heatmaps11.astype(np.float32).transpose(0, 2, 3, 1),
                heatmaps9.astype(np.float32).transpose(0, 2, 3, 1),
                heatmaps7.astype(np.float32).transpose(0, 2, 3, 1),
                valids.astype(np.float32)]
    else:
        return [np.asarray(imgs).astype(np.float32), details]
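The coordinate bookkeeping in `Preprocessing` is three steps: shift by the padding, shift by the crop origin, then scale to the network input. The sketch below isolates that mapping and its inverse for a single point. The function names and the example numbers are mine; the inverse is what the crop corners stored in `details` would be needed for when mapping predictions back to the image.

```python
def to_crop_coords(x, y, add, min_x, min_y, max_x, max_y, out_w=192, out_h=256):
    """Original image coordinates -> coordinates in the resized crop."""
    xp, yp = x + add, y + add            # 1) padding shifts both axes by `add`
    xc, yc = xp - min_x, yp - min_y      # 2) crop origin, in padded coordinates
    return (xc * out_w / (max_x - min_x),  # 3) resize to out_w x out_h
            yc * out_h / (max_y - min_y))


def to_image_coords(u, v, add, min_x, min_y, max_x, max_y, out_w=192, out_h=256):
    """Inverse mapping: resized crop coordinates -> original image coordinates."""
    x = u * (max_x - min_x) / out_w + min_x - add
    y = v * (max_y - min_y) / out_h + min_y - add
    return x, y


# Example with made-up numbers: a 96x128 crop scaled up by 2 in both directions.
params = dict(add=100, min_x=150, min_y=120, max_x=246, max_y=248)
u, v = to_crop_coords(98, 84, **params)
x, y = to_image_coords(u, v, **params)
print((u, v), (x, y))
```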
Generating the label heatmaps
def joints_heatmap_gen(data, label, tar_size=cfg.output_shape, ori_size=cfg.data_shape, points=cfg.nr_skeleton,
                       return_valid=False, gaussian_kernel=cfg.gaussain_kernel):
    # cfg.output_shape = (64, 48): labels are mapped from (256, 192) to (64, 48)
    # Note that label holds keypoint coordinates on the (256, 192) image; they must be
    # converted to coordinates on the (64, 48) heatmap.
    # A Gaussian blur is applied at each keypoint position; different kernels are used in training.
    if return_valid:  # False
        valid = np.ones((len(data), points), dtype=np.float32)
    ret = np.zeros((len(data), points, tar_size[0], tar_size[1]), dtype='float32')
    for i in range(len(ret)):
        for j in range(points):  # points = 17
            # label is flat: [x0, y0, x1, y1, ...], so j << 1 == 2*j is x and j << 1 | 1 == 2*j + 1 is y
            if label[i][j << 1] < 0 or label[i][j << 1 | 1] < 0:
                continue
            label[i][j << 1 | 1] = min(label[i][j << 1 | 1], ori_size[0] - 1)
            label[i][j << 1] = min(label[i][j << 1], ori_size[1] - 1)
            ret[i][j][int(label[i][j << 1 | 1] * tar_size[0] / ori_size[0])][
                int(label[i][j << 1] * tar_size[1] / ori_size[1])] = 1
    for i in range(len(ret)):
        for j in range(points):
            ret[i, j] = cv2.GaussianBlur(ret[i, j], gaussian_kernel, 0)
    for i in range(len(ret)):
        for j in range(cfg.nr_skeleton):
            am = np.amax(ret[i][j])
            if am <= 1e-8:
                if return_valid:  # False
                    valid[i][j] = 0.
                continue
            ret[i][j] /= am / 255  # rescale so the peak value is 255
    if return_valid:
        return ret, valid
    else:
        return ret
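The `j << 1` and `j << 1 | 1` expressions are just indices into a flat `[x0, y0, x1, y1, ...]` label vector. The sketch below reproduces the index computation for one keypoint, without the Gaussian blur; `heatmap_cell` is a name I made up for illustration.

```python
def heatmap_cell(label, j, tar_size=(64, 48), ori_size=(256, 192)):
    """Return the (row, col) heatmap cell set to 1 for keypoint j, or None if skipped."""
    x = label[j << 1]        # same as label[2 * j]
    y = label[j << 1 | 1]    # same as label[2 * j + 1]
    if x < 0 or y < 0:       # unlabeled or out-of-image keypoint
        return None
    y = min(y, ori_size[0] - 1)   # clamp to the last row of the input image
    x = min(x, ori_size[1] - 1)   # clamp to the last column
    row = int(y * tar_size[0] / ori_size[0])
    col = int(x * tar_size[1] / ori_size[1])
    return row, col


label = [96.0, 128.0, -1000000.0, -1000000.0, 500.0, 10.0]
cells = [heatmap_cell(label, j) for j in range(3)]
print(cells)
```

The second keypoint is skipped because of the large negative sentinel, and the third has its x coordinate clamped to 191 before being scaled down by 4.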
Data augmentation
- crop augmentation
- random scales
- rotation
- flip
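From the loops in `data_augmentation`, each input sample appears to expand into `cfg.nr_aug` samples: one random scale/crop that replaces the original, one horizontal flip, and the remaining slots filled by random rotations. A small sketch of that count (the helper name is mine):

```python
def augmentation_plan(nr_aug):
    """How many samples of each kind one input image expands into."""
    trem_num = nr_aug - 1                 # extra slots appended per image
    return {
        "scale_crop": 1,                  # overwrites the original slot
        "flip": 1,                        # first extra slot
        "rotate": max(trem_num - 1, 0),   # range(tremNum - 1) rotations
    }


plan = augmentation_plan(4)   # nr_aug = 4, as noted in make_data
print(plan, sum(plan.values()))
```

This is consistent with `make_data` batching `cfg.batch_size // cfg.nr_aug` records at a time, since each record contributes `nr_aug` images.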
def data_augmentation(trainData, trainLabel, trainValids, segms=None):
    trainSegms = segms
    tremNum = cfg.nr_aug - 1
    gotData = trainData.copy()
    trainData = np.append(trainData, [trainData[0] for i in range(tremNum * len(trainData))], axis=0)
    if trainSegms is not None:
        gotSegm = trainSegms.copy()
        trainSegms = np.append(trainSegms, [trainSegms[0] for i in range(tremNum * len(trainSegms))], axis=0)
    trainLabel = np.append(trainLabel, [trainLabel[0] for i in range(tremNum * len(trainLabel))], axis=0)
    trainValids = np.append(trainValids, [trainValids[0] for i in range(tremNum * len(trainValids))], axis=0)
    counter = len(gotData)
    for lab in range(len(gotData)):
        ori_img = gotData[lab].transpose(1, 2, 0)
        if trainSegms is not None:
            ori_segm = gotSegm[lab].copy()
        annot = trainLabel[lab].copy()
        annot_valid = trainValids[lab].copy()
        height, width = ori_img.shape[0], ori_img.shape[1]
        center = (width / 2., height / 2.)
        n = cfg.nr_skeleton
        # affrat = random.uniform(0.75, 1.25)
        affrat = random.uniform(0.7, 1.35)
        halfl_w = min(width - center[0], (width - center[0]) / 1.25 * affrat)
        halfl_h = min(height - center[1], (height - center[1]) / 1.25 * affrat)
        # img = cv2.resize(ori_img[int(center[0] - halfl_w) : int(center[0] + halfl_w + 1), int(center[1] - halfl_h) : int(center[1] + halfl_h + 1)], (width, height))
        img = cv2.resize(ori_img[int(center[1] - halfl_h): int(center[1] + halfl_h + 1),
                         int(center[0] - halfl_w): int(center[0] + halfl_w + 1)], (width, height))
        if trainSegms is not None:
            segm = cv2.resize(ori_segm[int(center[1] - halfl_h): int(center[1] + halfl_h + 1),
                              int(center[0] - halfl_w): int(center[0] + halfl_w + 1)], (width, height))
        for i in range(n):
            annot[i << 1] = (annot[i << 1] - center[0]) / halfl_w * (width - center[0]) + center[0]
            annot[i << 1 | 1] = (annot[i << 1 | 1] - center[1]) / halfl_h * (height - center[1]) + center[1]
            annot_valid[i] *= (
                (annot[i << 1] >= 0) & (annot[i << 1] < width) & (annot[i << 1 | 1] >= 0) & (annot[i << 1 | 1] < height))
        trainData[lab] = img.transpose(2, 0, 1)
        if trainSegms is not None:
            trainSegms[lab] = segm
        trainLabel[lab] = annot
        trainValids[lab] = annot_valid
        # flip augmentation
        newimg = cv2.flip(img, 1)
        if trainSegms is not None:
            newsegm = cv2.flip(segm, 1)
        cod = []
        allc = []
        for i in range(n):
            x, y = annot[i << 1], annot[i << 1 | 1]
            if x >= 0:
                x = width - 1 - x
            cod.append((x, y))
        if trainSegms is not None:
            trainSegms[counter] = newsegm
        trainData[counter] = newimg.transpose(2, 0, 1)
        # **** the joint index depends on the dataset ****
        for (q, w) in cfg.symmetry:
            cod[q], cod[w] = cod[w], cod[q]
        for i in range(n):
            allc.append(cod[i][0])
            allc.append(cod[i][1])
        trainLabel[counter] = np.array(allc)
        allc_valid = annot_valid.copy()
        for (q, w) in cfg.symmetry:
            allc_valid[q], allc_valid[w] = allc_valid[w], allc_valid[q]
        trainValids[counter] = np.array(allc_valid)
        counter += 1
        # rotated augmentation
        for times in range(tremNum - 1):
            angle = random.uniform(0, 45)
            if random.randint(0, 1):
                angle *= -1
            rotMat = cv2.getRotationMatrix2D(center, angle, 1.0)
            newimg = cv2.warpAffine(img, rotMat, (width, height))
            if trainSegms is not None:
                newsegm = cv2.warpAffine(segm, rotMat, (width, height))
            allc = []
            allc_valid = []
            for i in range(n):
                x, y = annot[i << 1], annot[i << 1 | 1]
                coor = np.array([x, y])
                if x >= 0 and y >= 0:
                    R = rotMat[:, : 2]
                    W = np.array([rotMat[0][2], rotMat[1][2]])
                    coor = np.dot(R, coor) + W
                allc.append(coor[0])
                allc.append(coor[1])
                allc_valid.append(
                    annot_valid[i] * ((coor[0] >= 0) & (coor[0] < width) & (coor[1] >= 0) & (coor[1] < height)))
            newimg = newimg.transpose(2, 0, 1)
            trainData[counter] = newimg
            if trainSegms is not None:
                trainSegms[counter] = newsegm
            trainLabel[counter] = np.array(allc)
            trainValids[counter] = np.array(allc_valid)
            counter += 1
    if trainSegms is not None:
        return trainData, trainLabel, trainSegms
    else:
        return trainData, trainLabel, trainValids
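The two label transforms that are easiest to get wrong are the flip, which must also swap left and right joints, and the rotation, which must use the same affine matrix as the image warp. The sketch below shows both for plain Python points. The three-joint skeleton and its symmetry pair are made up for illustration, and `rotate_point` writes out the 2×3 matrix that OpenCV documents for `cv2.getRotationMatrix2D` with scale 1.0.

```python
import math


def flip_keypoints(points, width, symmetry):
    """Mirror x coordinates and swap left/right joint pairs."""
    flipped = [((width - 1 - x) if x >= 0 else x, y) for x, y in points]
    for q, w in symmetry:
        flipped[q], flipped[w] = flipped[w], flipped[q]
    return flipped


def rotate_point(x, y, center, angle_deg):
    """Apply the affine map [[a, b, (1-a)cx - b*cy], [-b, a, b*cx + (1-a)cy]]."""
    a = math.cos(math.radians(angle_deg))
    b = math.sin(math.radians(angle_deg))
    cx, cy = center
    return (a * x + b * y + (1 - a) * cx - b * cy,
            -b * x + a * y + b * cx + (1 - a) * cy)


# Hypothetical skeleton: joint 0 = left eye, joint 1 = right eye, joint 2 unlabeled.
points = [(10.0, 5.0), (20.0, 6.0), (-1000000.0, -1000000.0)]
flipped = flip_keypoints(points, width=192, symmetry=[(0, 1)])

# A point one pixel right of the center, rotated by +90 degrees, ends up one pixel above it.
rx, ry = rotate_point(97.0, 128.0, center=(96.0, 128.0), angle_deg=90.0)
print(flipped, (round(rx, 6), round(ry, 6)))
```

Without the swap, a mirrored left eye would still be labeled as the left eye even though it now sits on the right side of the face, which is why the original code notes that the joint indices in `cfg.symmetry` depend on the dataset.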
Model training
CPN project layout
The layout of the TensorFlow project is also worth studying. Below is the directory structure of the cpn folder.