In engineering practice, a model's running speed matters as much as its accuracy. In this article I optimize a keypoint model with several different methods and compare their performance, hoping to pass on some practical tricks and experience.
Readers with experience in keypoint detection will know that the mainstream methods fall into two families: Heatmap-based and Regression-based.
The main difference between them lies in the supervision signal. A Heatmap-based method supervises the model to learn a Gaussian probability map: each GroundTruth point is rendered into a Gaussian heat map, the network outputs K feature maps for K keypoints, and argmax picks the maximum point of each map as the estimate. Because this method has to render Gaussian heat maps, and because the maximum point of the heat map corresponds directly to the result, it must maintain a relatively high-resolution heat map (64x64 is common; at lower resolutions the lower bound of the quantization error becomes too large and precision drops severely). This naturally leads to heavy computation and memory overhead.
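To make the resolution argument concrete, here is a minimal framework-free sketch of the Heatmap pipeline: render a Gaussian around a sub-pixel ground truth, then decode with argmax. The function names, the sigma of 2.0 and the example coordinates are assumptions chosen for illustration, not values from the experiments.

```python
import math

def render_gaussian_heatmap(x, y, width, height, sigma=2.0):
    """Render one keypoint at sub-pixel location (x, y) as a Gaussian heat map."""
    return [[math.exp(-((col - x) ** 2 + (row - y) ** 2) / (2 * sigma ** 2))
             for col in range(width)]
            for row in range(height)]

def argmax_decode(heatmap):
    """Return the (col, row) index of the largest response."""
    best_row, best_col, best_val = 0, 0, float("-inf")
    for row, line in enumerate(heatmap):
        for col, val in enumerate(line):
            if val > best_val:
                best_row, best_col, best_val = row, col, val
    return best_col, best_row

gt = (10.4, 20.7)  # hypothetical sub-pixel ground truth in heat map coordinates
decoded = argmax_decode(render_gaussian_heatmap(*gt, width=48, height=64))
# argmax can only return integer grid positions, so the fractional part is lost.
quant_error = math.hypot(decoded[0] - gt[0], decoded[1] - gt[1])
print(decoded, quant_error)
```

The decoded point snaps to the nearest grid cell, so the error is bounded by the cell size. Halving the heat map resolution doubles that bound in input-image pixels, which is why the heat map cannot be made arbitrarily small.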
The Regression-based method is simple and blunt: it directly supervises the model to learn the coordinate values, using an L1 or L2 loss on the coordinates. Since there is no need to render Gaussian heat maps or maintain high resolution, the feature maps output by the network can be small (such as 14x14 or even 7x7). Taking Resnet-50 as an example, the FLOPs are about one 20,000th of those of the Heatmap-based method, which is quite friendly to devices with weak computing power (such as mobile phones). In actual projects this method is used more often.
However, Regression has always been outperformed by Heatmap in accuracy. The fully convolutional structure of Heatmap retains position information completely, so the spatial generalization ability of Gaussian heat maps is stronger. The regression method has to flatten the feature map into a one-dimensional vector at the end, and position information is lost in the reshape. In addition, the fully connected layer in Regression has to convert position information into coordinate values. This implicit conversion is highly nonlinear, so the model is hard to train and slow to converge.
To improve the accuracy of Regression, I made a series of optimizations and record them here.
1.Regression
I use mobilenetv3 as the backbone for all experiments and build a MobileNetv3+Deeppose Baseline. The training data comes from my own project, and the config is as follows.
model = dict(
    type='TopDown',
    pretrained=None,
    backbone=dict(type='MobileNetV3'),
    neck=dict(type='GlobalAveragePooling'),
    keypoint_head=dict(
        type='DeepposeRegressionHead',
        in_channels=96,
        num_joints=channel_cfg['num_output_channels'],
        loss_keypoint=dict(type='SmoothL1Loss', use_target_weight=True)),
    train_cfg=dict(),
    test_cfg=dict(flip_test=True))
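The config above supervises coordinates with a Smooth L1 loss and a per-joint target weight. As a rough illustration of what that computes, here is a plain-Python sketch. It is not mmpose's implementation: the helper names, the beta of 1.0 and the sample values are assumptions, and the weights are taken to be 0 or 1.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 for one scalar: quadratic near zero, linear beyond beta."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta

def keypoint_smooth_l1(preds, targets, weights):
    """Mean Smooth L1 over K keypoints; a weight of 0 masks an unlabeled joint."""
    total = 0.0
    for (px, py), (tx, ty), w in zip(preds, targets, weights):
        total += w * (smooth_l1(px, tx) + smooth_l1(py, ty))
    return total / (2 * len(preds))

# Hypothetical normalized coordinates for three joints.
preds = [(0.50, 0.50), (0.10, 0.90), (0.30, 0.30)]
targets = [(0.50, 0.40), (0.10, 0.90), (0.80, 0.80)]
weights = [1.0, 1.0, 0.0]  # the third joint is not annotated
loss = keypoint_smooth_l1(preds, targets, weights)
print(loss)
```

Only the 0.1 error on the first joint contributes here. The large error on the third joint is masked out, which is the purpose of use_target_weight.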
Speed is measured on the CPU with ncnn. The results are as follows:
method | input size | AP50:95 | acc_pse | time |
---|---|---|---|---|
Deeppose | 192*256 | 41.3% | 65% | 2.5ms |
2.Heatmap
The backbone is again mobilenetv3. Unlike Regression, to obtain heat maps of size (48, 64) we need to add 3 deconv layers to the head, upsampling the backbone feature map from (6, 8) to (48, 64).
model = dict(
    type='TopDown',
    backbone=dict(type='MobileNetV3'),
    keypoint_head=dict(
        type='TopdownHeatmapSimpleHead',
        in_channels=96,
        out_channels=channel_cfg['num_output_channels'],
        loss_keypoint=dict(type='JointsMSELoss', use_target_weight=True)),
    train_cfg=dict(),
    test_cfg=dict(
        flip_test=True,
        post_process='default',
        shift_heatmap=True,
        modulate_kernel=11))
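The (6, 8) to (48, 64) upsampling can be checked with the standard output-size formula for a transposed convolution. The sketch below assumes each deconv layer uses kernel 4, stride 2 and padding 1, a common setting for this kind of head. The config above does not state these values, so treat them as assumptions.

```python
def deconv_out_size(size, kernel=4, stride=2, padding=1, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

def upsample_chain(size, num_layers=3):
    """Sizes after each of num_layers stacked deconv layers."""
    sizes = [size]
    for _ in range(num_layers):
        sizes.append(deconv_out_size(sizes[-1]))
    return sizes

width_sizes = upsample_chain(6)
height_sizes = upsample_chain(8)
print(width_sizes, height_sizes)
```

Each layer doubles the resolution, so three layers give the 8x upsampling from (6, 8) to (48, 64). The head therefore works on maps with 64 times as many positions as the backbone output.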
Speed is measured on the CPU with ncnn. The results are as follows:
method | input size | AP50:95 | acc_pse | time |
---|---|---|---|---|
Deeppose | 192*256 | 41.3% | 65% | 2.5ms |
Heatmap | 192*256 | 67.5% | 93% | 60ms |
Because the head has a different structure, the parameter count grows and inference time rises sharply. In return, accuracy is much higher. As discussed in the introduction, the fully convolutional structure of Heatmap retains position information completely, whereas the regression method loses position information when it flattens the feature map and has to learn a highly nonlinear mapping from features to coordinates.
3.RLE
Regression only cares about the mean of the discrete probability distribution (it predicts only coordinate values, and a single mean can correspond to countless distributions), so it loses the information about the distribution around $\mu$. The heatmap method, by contrast, explicitly specifies the GT distribution as a Gaussian heat map (with a manually set variance $\sigma$) and uses it as the learning target. The implicit maximum likelihood loss of RLE helps regression determine both the mean and the variance of the probability distribution, constructing the true error distribution and thus regressing coordinates more accurately.
model = dict(
    type='TopDown',
    backbone=dict(type='MobileNetV3'),
    neck=dict(type='GlobalAveragePooling'),
    keypoint_head=dict(
        type='DeepposeRegressionHead',
        in_channels=96,
        num_joints=channel_cfg['num_output_channels'],
        loss_keypoint=dict(
            type='RLELoss',
            use_target_weight=True,
            size_average=True,
            residual=True),
        out_sigma=True),
    train_cfg=dict(),
    test_cfg=dict(flip_test=True, regression_flip_shift=True))
mmpose already implements the RLE loss, so we only need to set loss_keypoint to RLELoss (with out_sigma=True on the head) in the config to run it.
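To give a feel for why predicting a variance helps, here is a simplified sketch of a likelihood-based regression loss: the negative log-likelihood of the target under a Laplace distribution with predicted mean and scale. This is only the simple density term. The full RLE additionally learns the shape of the error distribution with a flow model, which is omitted here. The function name and sample numbers are assumptions.

```python
import math

def laplace_nll(mu, sigma, target):
    """Negative log-likelihood of `target` under Laplace(mu, sigma)."""
    return math.log(2.0 * sigma) + abs(target - mu) / sigma

target = 0.50
confident_hit = laplace_nll(mu=0.50, sigma=0.05, target=target)   # right and sure
confident_miss = laplace_nll(mu=0.70, sigma=0.05, target=target)  # wrong and sure
hedged_miss = laplace_nll(mu=0.70, sigma=0.20, target=target)     # wrong but unsure
print(confident_hit, hedged_miss, confident_miss)
```

A confident miss is punished hardest, while a confident hit is rewarded most. For a fixed error, the loss is smallest when sigma equals the absolute error, so the predicted sigma is pushed to reflect the actual uncertainty of each sample instead of being fixed by hand.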
method | input size | AP50:95 | acc_pse | time |
---|---|---|---|---|
Deeppose | 192*256 | 41.3% | 65% | 2.5ms |
Heatmap | 192*256 | 67.5% | 93% | 60ms |
RLE | 192*256 | 67.3% | 90% | 2.5ms |
The table above shows that introducing the RLE loss raises the AP to 67.3%, close to the heatmap result, while inference time stays at 2.5ms. For details, please refer to the detailed explanation of RLE.
4.Integral Pose Regression
We know that Heatmap inference uses argmax to obtain the index with the highest score in the feature map, but argmax itself is not differentiable. To solve this, IPR decodes with Soft-Argmax: it first normalizes the heat map into a probability map with Softmax, then takes the expectation of the coordinates under that distribution as the prediction. We introduce the IPR mechanism into deeppose: replace the last fc with a conv layer, keep the spatial size of the backbone's last feature map, and use the Softmax expectation to obtain the predicted coordinates. A major benefit is that more supervision signals can be brought into training.
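The Soft-Argmax decoding described above can be sketched in plain Python as follows. The helper names and the toy 8x8 map are assumptions for illustration. Coordinates are normalized to [0, 1) by dividing the index by the map size, and beta is a sharpening factor applied before Softmax.

```python
import math

def softmax(logits, beta=1.0):
    """Numerically stable Softmax with sharpening factor beta."""
    peak = max(logits)
    exps = [math.exp(beta * (v - peak)) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_argmax_2d(heatmap, beta=1.0):
    """Expected (x, y) in [0, 1) under the Softmax-normalized heat map."""
    height, width = len(heatmap), len(heatmap[0])
    flat = [v for row in heatmap for v in row]
    probs = softmax(flat, beta)
    x = sum(p * ((i % width) / width) for i, p in enumerate(probs))
    y = sum(p * ((i // width) / height) for i, p in enumerate(probs))
    return x, y

heatmap = [[0.0] * 8 for _ in range(8)]
heatmap[2][5] = 1.0  # single response at row 2, column 5
x, y = soft_argmax_2d(heatmap, beta=16.0)
print(x, y)
```

With a large beta the expectation lands very close to the peak at (5/8, 2/8). Every step is a sum or an exponential, so gradients can flow from the coordinate loss back into the response map.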
I implemented an IPR head on top of mmpose:
import torch
import torch.nn as nn
import torch.nn.functional as F
from mmcv.cnn import build_conv_layer
from mmpose.models.builder import HEADS, build_loss


@HEADS.register_module()
class IntegralPoseRegressionHead(nn.Module):
    """Soft-Argmax (IPR) head: a 1x1 conv followed by a Softmax expectation."""

    def __init__(self,
                 in_channels,
                 num_joints,
                 feat_size,
                 loss_keypoint=None,
                 out_sigma=False,
                 debias=False,
                 train_cfg=None,
                 test_cfg=None):
        super().__init__()
        self.in_channels = in_channels
        self.num_joints = num_joints
        self.loss = build_loss(loss_keypoint)
        self.train_cfg = {} if train_cfg is None else train_cfg
        self.test_cfg = {} if test_cfg is None else test_cfg
        self.out_sigma = out_sigma
        # The 1x1 conv replaces the final fc and yields one map per joint.
        self.conv = build_conv_layer(
            dict(type='Conv2d'),
            in_channels=in_channels,
            out_channels=num_joints,
            kernel_size=1,
            stride=1,
            padding=0)
        # Fixed coordinate grids normalized to [0, 1); the map is square.
        self.size = feat_size
        wx = torch.arange(0.0, 1.0 * self.size, 1).view([1, self.size]).repeat([self.size, 1]) / self.size
        wy = torch.arange(0.0, 1.0 * self.size, 1).view([self.size, 1]).repeat([1, self.size]) / self.size
        self.wx = nn.Parameter(wx, requires_grad=False)
        self.wy = nn.Parameter(wy, requires_grad=False)
        if out_sigma:
            # Extra branch that predicts sigma for the RLE loss.
            self.gap = nn.AdaptiveAvgPool2d((1, 1))
            self.fc = nn.Linear(self.in_channels, self.num_joints * 2)
        if debias:
            # 64 corresponds to a flattened 8x8 map, i.e. feat_size=8.
            self.softmax_fc = nn.Linear(64, 64)

    def forward(self, x):
        """Forward function."""
        if isinstance(x, (list, tuple)):
            assert len(x) == 1, ('IntegralPoseRegressionHead only supports '
                                 'single-level feature.')
            x = x[0]
        featmap = self.conv(x)
        s = list(featmap.size())
        # Softmax over all spatial positions, sharpened by a factor of 16.
        featmap = featmap.view([s[0], s[1], s[2] * s[3]])
        featmap = F.softmax(16 * featmap, dim=2)
        featmap = featmap.view([s[0], s[1], s[2], s[3]])
        # Expected x and y coordinates under the normalized map.
        scoremap_x = featmap.mul(self.wx)
        scoremap_x = scoremap_x.view([s[0], s[1], s[2] * s[3]])
        soft_argmax_x = torch.sum(scoremap_x, dim=2, keepdim=True)
        scoremap_y = featmap.mul(self.wy)
        scoremap_y = scoremap_y.view([s[0], s[1], s[2] * s[3]])
        soft_argmax_y = torch.sum(scoremap_y, dim=2, keepdim=True)
        output = torch.cat([soft_argmax_x, soft_argmax_y], dim=-1)
        if self.out_sigma:
            x = self.gap(x).reshape(x.size(0), -1)
            pred_sigma = self.fc(x)
            pred_sigma = pred_sigma.reshape(pred_sigma.size(0), self.num_joints, 2)
            output = torch.cat([output, pred_sigma], dim=-1)
        # The normalized map is returned as well so a loss can supervise it.
        return output, featmap
After introducing IPR, the actual output features resemble those of the Heatmap method. The difference is that the Heatmap method supervises an artificial probability distribution, the Gaussian heat map, whereas IPR in deeppose takes the expectation as the coordinates and supervises it directly with the coordinate GT. The loss therefore decreases as long as the expectation is close to the GT. This raises a problem: supervising only the expected coordinates does not constrain the shape of the probability distribution.
As shown in the figure above, the upper and lower distributions have the same expectation (mean), but the distributions themselves are completely different. RLE has demonstrated that a reasonable probability distribution is crucial, so supervising the probability distribution is necessary to improve model performance. DSNT proposes using the JS divergence to pull the model's probability distribution toward a hand-made Gaussian distribution. There is a problem here: the variance of the Gaussian can only be set empirically and cannot be given adaptively for each sample, and a Gaussian may not be the best choice of distribution either.
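The point that the expectation alone does not pin down the distribution, and that a JS divergence term does, can be illustrated in one dimension. This is a hand-rolled sketch rather than the dsntnn implementation. The function names, the 9-bin layout and the sigma values are assumptions.

```python
import math

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def mean_of(probs):
    return sum(i * p for i, p in enumerate(probs))

def kl(p, q):
    """KL divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

size, center = 9, 4
# Hand-made Gaussian target (sigma = 1.0) centered on the ground truth.
target = normalize([math.exp(-(i - center) ** 2 / 2.0) for i in range(size)])
# Two predictions with the same mean but very different shapes.
peaked = normalize([math.exp(-(i - center) ** 2 / (2 * 1.2 ** 2)) for i in range(size)])
bimodal = normalize([1.0 if i in (0, 8) else 1e-6 for i in range(size)])
print(mean_of(peaked), mean_of(bimodal))
print(js_divergence(target, peaked), js_divergence(target, bimodal))
```

Both predictions have their expectation at bin 4, so a pure coordinate loss cannot tell them apart. The JS term is small for the peaked prediction and close to its maximum for the bimodal one.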
import dsntnn
import torch.nn as nn
from mmpose.models.builder import LOSSES
from mmpose.models.losses import RLELoss


@LOSSES.register_module()
class RLE_DSNTLoss(nn.Module):
    """RLE loss on the coordinates plus DSNT (JS) loss on the heat map."""

    def __init__(self,
                 use_target_weight=False,
                 size_average=True,
                 residual=True,
                 q_dis='laplace',
                 sigma=2.0):
        super().__init__()
        self.dsnt_loss = DSNTLoss(sigma=sigma, use_target_weight=use_target_weight)
        self.rle_loss = RLELoss(use_target_weight=use_target_weight,
                                size_average=size_average,
                                residual=residual,
                                q_dis=q_dis)
        self.use_target_weight = use_target_weight

    def forward(self, output, heatmap, target, target_weight=None):
        assert target_weight is not None
        loss1 = self.dsnt_loss(heatmap, target, target_weight)
        loss2 = self.rle_loss(output, target, target_weight)
        loss = loss1 + loss2  # the weighting between the two terms can be tuned
        return loss


@LOSSES.register_module()
class DSNTLoss(nn.Module):
    """JS divergence between the predicted heat map and a Gaussian target."""

    def __init__(self,
                 sigma,
                 use_target_weight=False,
                 size_average=True):
        super().__init__()
        self.use_target_weight = use_target_weight
        self.sigma = sigma
        self.size_average = size_average

    def forward(self, heatmap, target, target_weight=None):
        """Forward function.

        Note:
            - batch_size: N
            - num_keypoints: K
            - dimension of keypoints: D (D=2 or D=3)

        Args:
            heatmap (torch.Tensor[N, K, H, W]): Softmax-normalized
                heat maps returned by the head.
            target (torch.Tensor[N, K, D]): Target regression.
            target_weight (torch.Tensor[N, K, D]):
                Weights across different joint types (not used here).
        """
        loss = dsntnn.js_reg_losses(heatmap, target, self.sigma)
        if self.size_average:
            loss = loss / len(loss)  # average over the batch
        return loss.sum()
As the table below shows, model performance improves after introducing IPR+DSNT. Note that this configuration uses a 256*256 input, whereas the rows above it use 192*256.
method | input size | AP50:95 | acc_pse | time |
---|---|---|---|---|
Deeppose | 192*256 | 41.3% | 65% | 2.5ms |
Heatmap | 192*256 | 67.5% | 93% | 60ms |
RLE | 192*256 | 67.3% | 90% | 2.5ms |
RLE+IPR+DSNT | 256*256 | 70.2% | 95% | 3.5ms |
5.Removing the Bias of Integral Pose Regression
After introducing IPR we use Softmax to compute the expected coordinates, but computing the expectation through Softmax introduces an error, because Softmax makes every value non-zero. Even a distribution that is very sharp to begin with is softened by Softmax into one with a long tail, so the final expected value is inaccurate. Only when the response is large enough and the distribution sharp enough is the expected value close to the Argmax result. When the response is small and the distribution flat, the expected value drifts toward the center of the map. This effect becomes more pronounced with larger feature sizes.
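The drift toward the center is easy to reproduce in one dimension. In the sketch below (helper names and values are assumptions for illustration), a single response at index 2 is decoded with Soft-Argmax for different response strengths and map sizes.

```python
import math

def soft_argmax_1d(logits):
    """Expected index under the Softmax of `logits`."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return sum(i * e for i, e in enumerate(exps)) / total

def one_hot_response(size, index, strength):
    """A map that is zero everywhere except `strength` at `index`."""
    return [strength if i == index else 0.0 for i in range(size)]

true_index = 2
# Same map size, increasing response strength.
estimates = {s: soft_argmax_1d(one_hot_response(16, true_index, s))
             for s in (1.0, 4.0, 16.0)}
# Same response strength, increasing map size.
small_map = soft_argmax_1d(one_hot_response(8, true_index, 4.0))
large_map = soft_argmax_1d(one_hot_response(32, true_index, 4.0))
print(estimates, small_map, large_map)
```

With a weak response the estimate sits near the middle of the 16-bin map. It approaches index 2 only as the response grows. For the same response, the 32-bin map is biased much more than the 8-bin map, because more zero-valued positions contribute to the tail.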
Removing the Bias of Integral Pose Regression proposes the debias method to eliminate the influence of Softmax softening. Specifically, assuming that the response follows a Gaussian distribution, we can divide the feature map into four regions by cutting it at twice the coordinates of the maximum response point, so that the maximum sits at the center of region 1. We know that once Softmax is applied, regions 2, 3 and 4, whose original values are 0, are immediately filled with a long tail. For region 1, since the response is centered in the region, the expected value estimated within this region is accurate regardless of the magnitude of the response.
Let's go back to the Softmax formula:
$$\tilde{H}(p) = \frac{e^{H(p)}}{\sum_{p' \in \Omega} e^{H(p')}} \tag{1}$$
For brevity, let's use $C$ to represent the denominator:
$$C = \sum_{p' \in \Omega} e^{H(p')} \tag{2}$$
Since the response values of regions 2, 3 and 4 are assumed to be 0, the numerator there evaluates to 1, and the Softmax result after dividing the map into regions can be expressed as:
$$\tilde{H}(p) = \begin{cases} e^{H(p)} / C, & p \in \Omega_1 \\ 1 / C, & p \in \Omega_2 \cup \Omega_3 \cup \Omega_4 \end{cases} \tag{3}$$
Substituting this into the Soft-Argmax formula, the expected value can be expressed as:
$$J^r = \sum_{p \in \Omega} \tilde{H}(p)\, p = \sum_{p \in \Omega_1} \tilde{H}(p)\, p + \sum_{i=2}^{4} \sum_{p \in \Omega_i} \tilde{H}(p)\, p \tag{4}$$
That is, the expected value of region 1 plus the expected values of the other three regions. Since $\tilde{H}(p) = 1/C$ in regions 2, 3 and 4, the factor $1/C$ can be pulled out of the sums for these three regions, leaving only
$$\sum_{p \in \Omega_i} \tilde{H}(p)\, p = \frac{1}{C} \sum_{p \in \Omega_i} p, \quad i = 2, 3, 4 \tag{5}$$
The summation here is geometrically equivalent to multiplying the coordinates of the center point of the region by the area of the region. A simple demonstration for the interval $[n, m]$:
$$\sum_{p=n}^{m} p = \frac{n + m}{2}\,(m - n + 1)$$
Therefore, the expected value of the entire feature map can be regarded as the weighted sum of the center points of the four regions:
$$J^r = w_1 J_1 + w_2 J_2 + w_3 J_3 + w_4 J_4 \tag{6}$$
Due to the symmetry of the four regions, if the center point of region 1 is $J_1 = (x_0, y_0)$, then the center points of the remaining three regions are $J_2 = (x_0, y_0 + w/2)$, $J_3 = (x_0 + h/2, y_0)$ and $J_4 = (x_0 + h/2, y_0 + w/2)$.
Since each tail term is $1/C$ multiplied by the center point multiplied by the area, the weights are:
$$w_2 = \frac{2x_0 (w - 2y_0)}{C}, \quad w_3 = \frac{(h - 2x_0)\, 2y_0}{C}, \quad w_4 = \frac{(h - 2x_0)(w - 2y_0)}{C} \tag{7}$$
Substituting into the weighted sum formula (6), the expected value of the entire feature map can be expressed as:
$$J^r = w_1 J_1 + w_2 \left(J_1 + (0, \tfrac{w}{2})\right) + w_3 \left(J_1 + (\tfrac{h}{2}, 0)\right) + w_4 \left(J_1 + (\tfrac{h}{2}, \tfrac{w}{2})\right) \tag{8}$$
Since the weights of the four regions add up to 1, we have $w_1 = 1 - w_2 - w_3 - w_4$, so the expected value of the entire feature map simplifies to:
$$J^r = \left(1 - \frac{hw}{C}\right) J_1 + \left(\frac{h^2 w}{2C}, \frac{h w^2}{2C}\right) \tag{9}$$
Since $J^r$ is easily obtained by computing Soft-Argmax over the entire map, the accurate center point of region 1 follows by rearranging formula (9):
$$J_1 = \frac{C}{C - hw} \left( J^r - \left(\frac{h^2 w}{2C}, \frac{h w^2}{2C}\right) \right) \tag{10}$$
This step is equivalent to subtracting the redundant long tail from the expected value. Analyzing the formula further, the expected value estimated over the whole map is equivalent to the expected value of region 1 plus an offset.
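The debias correction derived above can be checked numerically. The sketch below is a framework-free illustration, not the code used in the experiments. It works in pixel-index coordinates, so it uses the discrete center (h - 1) / 2 where the continuous derivation has h / 2. The function name and the toy response are assumptions.

```python
import math

def biased_and_debiased(response):
    """Soft-Argmax on a 2D response map, before and after removing the tail."""
    h, w = len(response), len(response[0])
    exps = [[math.exp(v) for v in row] for row in response]
    C = sum(sum(row) for row in exps)  # Softmax denominator
    jr_x = sum(x * exps[x][y] for x in range(h) for y in range(w)) / C
    jr_y = sum(y * exps[x][y] for x in range(h) for y in range(w)) / C
    # The uniform tail has total weight h * w / C and is centered on the map,
    # so subtract its contribution and renormalize by the remaining weight.
    x0 = (C * jr_x - h * w * (h - 1) / 2) / (C - h * w)
    y0 = (C * jr_y - h * w * (w - 1) / 2) / (C - h * w)
    return (jr_x, jr_y), (x0, y0)

h, w, peak = 8, 8, (2, 5)
response = [[0.0] * w for _ in range(h)]
# A small symmetric bump around the peak; everything else stays at zero.
for dx in (-1, 0, 1):
    for dy in (-1, 0, 1):
        response[peak[0] + dx][peak[1] + dy] = 2.0 if (dx, dy) == (0, 0) else 1.0
biased, debiased = biased_and_debiased(response)
print(biased, debiased)
```

The plain Soft-Argmax estimate is pulled more than one pixel toward the map center in both axes, while the corrected estimate recovers the peak at (2, 5). The correction is exact here only because the response is symmetric around the peak and exactly zero elsewhere, which is the assumption the derivation makes.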
import torch
import torch.nn as nn
import torch.nn.functional as F
from mmcv.cnn import build_conv_layer
from mmpose.models.builder import HEADS, build_loss


@HEADS.register_module()
class IntegralPoseRegressionHead(nn.Module):
    """Soft-Argmax (IPR) head with an optional debias step."""

    def __init__(self,
                 in_channels,
                 num_joints,
                 feat_size,
                 loss_keypoint=None,
                 out_sigma=False,
                 debias=False,
                 train_cfg=None,
                 test_cfg=None):
        super().__init__()
        self.in_channels = in_channels
        self.num_joints = num_joints
        self.loss = build_loss(loss_keypoint)
        self.train_cfg = {} if train_cfg is None else train_cfg
        self.test_cfg = {} if test_cfg is None else test_cfg
        self.out_sigma = out_sigma
        self.debias = debias
        # The 1x1 conv replaces the final fc and yields one map per joint.
        self.conv = build_conv_layer(
            dict(type='Conv2d'),
            in_channels=in_channels,
            out_channels=num_joints,
            kernel_size=1,
            stride=1,
            padding=0)
        # Fixed coordinate grids normalized to [0, 1); the map is square.
        self.size = feat_size
        wx = torch.arange(0.0, 1.0 * self.size, 1).view([1, self.size]).repeat([self.size, 1]) / self.size
        wy = torch.arange(0.0, 1.0 * self.size, 1).view([self.size, 1]).repeat([1, self.size]) / self.size
        self.wx = nn.Parameter(wx, requires_grad=False)
        self.wy = nn.Parameter(wy, requires_grad=False)
        if out_sigma:
            # Extra branch that predicts sigma for the RLE loss.
            self.gap = nn.AdaptiveAvgPool2d((1, 1))
            self.fc = nn.Linear(self.in_channels, self.num_joints * 2)
        if debias:
            # 64 corresponds to a flattened 8x8 map, i.e. feat_size=8.
            self.softmax_fc = nn.Linear(64, 64)

    def forward(self, x):
        """Forward function."""
        if isinstance(x, (list, tuple)):
            assert len(x) == 1, ('IntegralPoseRegressionHead only supports '
                                 'single-level feature.')
            x = x[0]
        featmap = self.conv(x)
        s = list(featmap.size())
        featmap = featmap.view([s[0], s[1], s[2] * s[3]])
        if self.debias:
            # Pass the flattened map through an fc, then divide by the input
            # norm and the fc weight norms to keep the responses bounded.
            mlp_x_norm = torch.norm(self.softmax_fc.weight, dim=-1)
            norm_feat = torch.norm(featmap, dim=-1, keepdim=True)
            featmap = self.softmax_fc(featmap)
            featmap = featmap / norm_feat
            featmap = featmap / mlp_x_norm.reshape(1, 1, -1)
        # Softmax over all spatial positions, sharpened by a factor of 16.
        featmap = F.softmax(16 * featmap, dim=2)
        featmap = featmap.view([s[0], s[1], s[2], s[3]])
        # Expected x and y coordinates under the normalized map.
        scoremap_x = featmap.mul(self.wx)
        scoremap_x = scoremap_x.view([s[0], s[1], s[2] * s[3]])
        soft_argmax_x = torch.sum(scoremap_x, dim=2, keepdim=True)
        scoremap_y = featmap.mul(self.wy)
        scoremap_y = scoremap_y.view([s[0], s[1], s[2] * s[3]])
        soft_argmax_y = torch.sum(scoremap_y, dim=2, keepdim=True)
        if self.debias:
            # NOTE: featmap is already Softmax-normalized at this point, so C
            # is a sum of exp over probabilities, whereas the derivation
            # defines C over the raw (pre-Softmax) responses.
            C = featmap.reshape(s[0], s[1], s[2] * s[3]).exp().sum(dim=2).unsqueeze(dim=2)
            soft_argmax_x = C / (C - 1) * (soft_argmax_x - 1 / (2 * C))
            soft_argmax_y = C / (C - 1) * (soft_argmax_y - 1 / (2 * C))
        output = torch.cat([soft_argmax_x, soft_argmax_y], dim=-1)
        if self.out_sigma:
            x = self.gap(x).reshape(x.size(0), -1)
            pred_sigma = self.fc(x)
            pred_sigma = pred_sigma.reshape(pred_sigma.size(0), self.num_joints, 2)
            output = torch.cat([output, pred_sigma], dim=-1)
        # The normalized map is returned as well so a loss can supervise it.
        return output, featmap
method | input size | AP50:95 | acc_pse | time |
---|---|---|---|---|
Deeppose | 192*256 | 41.3% | 65% | 2.5ms |
Heatmap | 192*256 | 67.5% | 93% | 60ms |
RLE | 192*256 | 67.3% | 90% | 2.5ms |
RLE+IPR+DSNT | 256*256 | 70.2% | 95% | 3.5ms |
RLE+IPR+DSNT+debias | 256*256 | 71% | 95% | 3.5ms |
Many thanks to the Zhihu author Mirror, whose articles guided this work. I learned a lot from them. Interested readers can check the Zhihu address.