mmpose keypoints (part 4): optimizing the keypoint model (principles and code explained, continuously updated)

In engineering, a model's running speed matters as much as its accuracy. In this article I optimize the model with several different methods and compare their performance, hoping to share some practical tricks and experience.

Readers with experience in keypoint detection will know that mainstream keypoint methods are divided into Heatmap-based and Regression-based.

The main difference lies in the supervision. The Heatmap-based approach supervises the model to learn a Gaussian probability distribution map: each point in the GroundTruth is rendered into a Gaussian heatmap, the network outputs K feature maps corresponding to the K keypoints, and argmax picks the maximum point of each map as the estimate. Because a Gaussian heatmap must be rendered, and because the maximum point of the heatmap maps directly to the result, a relatively high-resolution heatmap has to be maintained (64x64 is common; anything smaller raises the quantization error floor and causes a serious loss of precision), which naturally leads to large compute and memory overhead.
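
To make the pipeline concrete, here is a minimal sketch (my own illustration, not mmpose's implementation) of rendering a Gaussian target and decoding with argmax; the 64x64 size and sigma are placeholder values:

import numpy as np

def render_gaussian(x, y, size=64, sigma=2.0):
    """Render a (size, size) Gaussian heatmap centered at keypoint (x, y)."""
    xs = np.arange(size)[None, :]
    ys = np.arange(size)[:, None]
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))

def decode_argmax(heatmap):
    """Take the location of the maximum response as the predicted keypoint."""
    idx = np.argmax(heatmap)
    return idx % heatmap.shape[1], idx // heatmap.shape[1]  # (x, y)

target = render_gaussian(20.0, 33.0)  # training target for one joint
print(decode_argmax(target))          # -> (20, 33)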

The Regression-based approach is simple and direct: supervise the model to learn the coordinate values themselves and compute an L1 or L2 loss on them. Since there is no need to render Gaussian heatmaps or maintain high resolution, the network's output feature maps can be small (14x14 or even 7x7). Taking ResNet-50 as an example, the regression head's FLOPs are about 1/20000 of a heatmap head's. This is quite friendly to devices with weak compute (such as mobile phones), and in real projects this approach is used more often.
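
For contrast, a minimal sketch of the regression pipeline (illustrative shapes, not the Deeppose head itself): pool the backbone features to a vector, predict 2K coordinates with an fc layer, and supervise them directly:

import torch
import torch.nn as nn

K = 17                              # number of keypoints (illustrative)
head = nn.Linear(96, K * 2)         # fc head: feature vector -> 2K coordinates
feat = torch.randn(1, 96, 8, 6)     # fake MobileNetV3-like backbone output
vec = feat.mean(dim=(2, 3))         # GlobalAveragePooling -> (1, 96)
pred = head(vec).view(1, K, 2)      # predicted (x, y) per joint
gt = torch.rand(1, K, 2)            # fake normalized GT coordinates
loss = nn.SmoothL1Loss()(pred, gt)  # direct coordinate supervision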

However, Regression has always been crushed by Heatmap in terms of accuracy. Heatmap's fully convolutional structure preserves position information end to end, so the Gaussian heatmap generalizes better spatially. The regression approach must flatten the feature map into a one-dimensional vector at the end, and position information is lost in that reshape. In addition, the fully connected layers in Regression must convert position information into coordinate values; this opaque conversion is highly nonlinear, making the model hard to train and converge.

To close the accuracy gap for Regression, I make a series of optimizations and record them here.

1. Regression

I use MobileNetV3 as the backbone for all experiments and build a MobileNetV3 + Deeppose baseline. The training data comes from our project; the config is as follows.

model = dict(
    type='TopDown',
    pretrained=None,
    backbone=dict(type='MobileNetV3'),
    neck=dict(type='GlobalAveragePooling'),
    keypoint_head=dict(
        type='DeepposeRegressionHead',
        in_channels=96,
        num_joints=channel_cfg['num_output_channels'],
        loss_keypoint=dict(type='SmoothL1Loss', use_target_weight=True)),
    train_cfg=dict(),
    test_cfg=dict(flip_test=True))

On the CPU side, model speed is measured with ncnn; the results are as follows:

| method | input size | AP50:95 | acc_pose | time |
| --- | --- | --- | --- | --- |
| Deeppose | 192×256 | 41.3% | 65% | 2.5ms |

2. Heatmap

Again using MobileNetV3 as the backbone. Unlike Regression, to obtain heatmap features of size (48, 64) we need to add 3 deconv layers to the head, upsampling the backbone's (6, 8) feature map to (48, 64), as sketched below.
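
A rough sketch of what those 3 deconv layers do (channel counts are illustrative, not TopdownHeatmapSimpleHead's exact ones); each ConvTranspose2d doubles the spatial size:

import torch
import torch.nn as nn

deconv = nn.Sequential(*[
    nn.Sequential(
        nn.ConvTranspose2d(c_in, 256, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(256),
        nn.ReLU(inplace=True))
    for c_in in (96, 256, 256)
])
x = torch.randn(1, 96, 8, 6)   # backbone output for a 192x256 (w, h) input
print(deconv(x).shape)         # -> torch.Size([1, 256, 64, 48]), i.e. (48, 64) as (w, h)

The config is as follows.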

model = dict(
    type='TopDown',
    backbone=dict(type='MobileNetV3'),
    keypoint_head=dict(
        type='TopdownHeatmapSimpleHead',
        in_channels=96,
        out_channels=channel_cfg['num_output_channels'],
        loss_keypoint=dict(type='JointsMSELoss', use_target_weight=True)),
    train_cfg=dict(),
    test_cfg=dict(
        flip_test=True,
        post_process='default',
        shift_heatmap=True,
        modulate_kernel=11))

On the CPU side, model speed is measured with ncnn; the results are as follows:

| method | input size | AP50:95 | acc_pose | time |
| --- | --- | --- | --- | --- |
| Deeppose | 192×256 | 41.3% | 65% | 2.5ms |
| Heatmap | 192×256 | 67.5% | 93% | 60ms |

Because the head structure differs, the parameter count grows, and inference time rises sharply. As explained above, Heatmap's fully convolutional structure fully preserves position information, giving the Gaussian heatmap stronger spatial generalization, while Regression loses position information in the final reshape and must learn a highly nonlinear mapping from features to coordinates; hence the accuracy gap.

3. RLE

Regression only cares about the mean of the discrete probability distribution (it predicts coordinate values only, and one mean can correspond to countless distributions), so the information about the distribution around $\mu$ is lost. The heatmap approach, by contrast, renders the GT distribution (with an artificially set variance $\sigma$) as a Gaussian heatmap and uses it as the learning target. RLE's residual log-likelihood loss helps the regression model estimate both the mean and the variance of the probability distribution, modeling the true error distribution and thus regressing coordinates more accurately.
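
mmpose's RLELoss implements the full flow-based loss; purely as a mental model, here is a simplified sketch of the likelihood idea with a Laplace density assumed for the error distribution (my own illustration, not the RLE paper's flow model):

import torch

def nll_laplace(pred, sigma, target):
    """Negative log-likelihood of a Laplace density with scale sigma.

    -log q = log(2 * sigma) + |target - pred| / sigma. sigma must be positive,
    e.g. produced by a sigmoid on the head's extra outputs.
    """
    return (torch.log(2 * sigma) + (target - pred).abs() / sigma).mean()

Minimizing this pushes pred towards the target while letting the model report its own uncertainty through sigma. The config is as follows.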

model = dict(
    type='TopDown',
    backbone=dict(type='MobileNetV3'),
    neck=dict(type='GlobalAveragePooling'),
    keypoint_head=dict(
        type='DeepposeRegressionHead',
        in_channels=96,
        num_joints=channel_cfg['num_output_channels'],
        loss_keypoint=dict(
            type='RLELoss',
            use_target_weight=True,
            size_average=True,
            residual=True),
        out_sigma=True),
    train_cfg=dict(),
    test_cfg=dict(flip_test=True, regression_flip_shift=True))

mmpose already implements the RLE loss; we only need to set loss_keypoint=dict(type='RLELoss', ...) in the config to run it.

| method | input size | AP50:95 | acc_pose | time |
| --- | --- | --- | --- | --- |
| Deeppose | 192×256 | 41.3% | 65% | 2.5ms |
| Heatmap | 192×256 | 67.5% | 93% | 60ms |
| RLE | 192×256 | 67.3% | 90% | 2.5ms |

From the table above we can see that introducing the RLE loss raises AP to 67.3%, close to the heatmap's, while inference time stays at 2.5ms. For a detailed explanation of RLE, see the paper "Human Pose Regression with Residual Log-likelihood Estimation".

4. Integral Pose Regression

We know that at Heatmap inference time, argmax picks the index of the highest response in the feature map, but argmax itself is not differentiable. To solve this, IPR decodes with Soft-Argmax: first normalize the probability heatmap with Softmax, then take the expectation to obtain the predicted coordinates. We introduce the IPR mechanism into Deeppose by replacing the final fc with a conv layer, keeping the spatial feature map of the backbone's last stage, and taking the softmax expectation as the predicted coordinates. A big benefit is that this brings more supervision information into training.
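
The core of Soft-Argmax in one dimension, as a tiny runnable sketch:

import torch
import torch.nn.functional as F

logits = torch.tensor([0.0, 1.0, 8.0, 1.0, 0.0])  # sharp peak at index 2
probs = F.softmax(logits, dim=0)                   # normalize to a distribution
coords = torch.arange(5.0)
print((probs * coords).sum())                      # -> 2.0, matching argmax, but differentiable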

I wrote an IPR head for mmpose:

import torch
import torch.nn as nn
import torch.nn.functional as F

# imports below assume the mmpose 0.x code base
from mmcv.cnn import build_conv_layer
from mmpose.models.builder import HEADS, build_loss


@HEADS.register_module()
class IntegralPoseRegressionHead(nn.Module):
    def __init__(self,
                 in_channels,
                 num_joints,
                 feat_size,
                 loss_keypoint=None,
                 out_sigma=False,
                 debias=False,
                 train_cfg=None,
                 test_cfg=None):
        super().__init__()

        self.in_channels = in_channels
        self.num_joints = num_joints

        self.loss = build_loss(loss_keypoint)

        self.train_cfg = {} if train_cfg is None else train_cfg
        self.test_cfg = {} if test_cfg is None else test_cfg

        self.out_sigma = out_sigma
        self.conv = build_conv_layer(
                            dict(type='Conv2d'),
                            in_channels=in_channels,
                            out_channels=num_joints,
                            kernel_size=1,
                            stride=1,
                            padding=0)

        self.size = feat_size
        self.wx = torch.arange(0.0, 1.0 * self.size, 1).view([1, self.size]).repeat([self.size, 1]) / self.size
        self.wy = torch.arange(0.0, 1.0 * self.size, 1).view([self.size, 1]).repeat([1, self.size]) / self.size
        self.wx = nn.Parameter(self.wx, requires_grad=False)
        self.wy = nn.Parameter(self.wy, requires_grad=False)

        if out_sigma:
            self.gap = nn.AdaptiveAvgPool2d((1, 1))
            self.fc = nn.Linear(self.in_channels, self.num_joints * 2)
        if debias:
            self.softmax_fc = nn.Linear(64, 64)  # assumes a flattened 8x8 (=64) feature map

    def forward(self, x):
        """Forward function."""
        if isinstance(x, (list, tuple)):
            assert len(x) == 1, ('IntegralPoseRegressionHead only supports '
                                 'single-level feature.')
            x = x[0]

        featmap = self.conv(x)                    # (N, K, H, W) per-joint response maps
        s = list(featmap.size())
        featmap = featmap.view([s[0], s[1], s[2] * s[3]])
        featmap = F.softmax(16 * featmap, dim=2)  # spatial softmax with temperature 16
        featmap = featmap.view([s[0], s[1], s[2], s[3]])
        scoremap_x = featmap.mul(self.wx)         # weight by normalized x-coordinates
        scoremap_x = scoremap_x.view([s[0], s[1], s[2] * s[3]])
        soft_argmax_x = torch.sum(scoremap_x, dim=2, keepdim=True)  # expectation = soft-argmax
        scoremap_y = featmap.mul(self.wy)
        scoremap_y = scoremap_y.view([s[0], s[1], s[2] * s[3]])
        soft_argmax_y = torch.sum(scoremap_y, dim=2, keepdim=True)
        output = torch.cat([soft_argmax_x, soft_argmax_y], dim=-1)
        if self.out_sigma:  # also predict per-joint sigmas (for RLE-style losses)
            x = self.gap(x).reshape(x.size(0), -1)
            pred_sigma = self.fc(x)
            pred_sigma = pred_sigma.reshape(pred_sigma.size(0), self.num_joints, 2)
            output = torch.cat([output, pred_sigma], dim=-1)

        return output, featmap
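
The head returns coordinates normalized to [0, 1) by the wx/wy grids. A hypothetical decode back to input-image pixels (shapes and input size are assumptions for illustration):

import torch

output = torch.rand(1, 17, 4)                   # fake head output: (x, y, sigma_x, sigma_y)
coords = output[..., :2]                        # normalized (x, y)
pixels = coords * torch.tensor([192.0, 256.0])  # pixels for a 192x256 (w, h) input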

After introducing IPR, the output features take a form similar to the Heatmap method's. The difference is that Heatmap imposes a man-made probability distribution (the Gaussian heatmap) as the target, whereas IPR in Deeppose takes the expectation as the coordinate and supervises it directly with the coordinate GT. The loss therefore decreases as long as the expectation is close to the GT, which creates a problem: the predicted coordinate (an expectation) places no constraint on the underlying probability distribution.
(figure: two distributions with the same mean but completely different shapes)
As the figure above shows, the upper and lower distributions share the same expectation, yet the distributions are completely different. RLE demonstrated that a reasonable probability distribution is crucial, and that supervising the distribution is necessary to improve model performance. DSNT proposes using JS divergence to pull the model's probability distribution towards a hand-crafted Gaussian. There are problems here: the variance of that Gaussian can only be set from empirical values rather than adapted per sample, and the Gaussian may not be the best choice in the first place.

import torch.nn as nn

# imports below are assumptions: dsntnn is Aiden Nibali's DSNT package
# (pip install dsntnn), and RLELoss is mmpose 0.x's built-in implementation
import dsntnn
from mmpose.models.builder import LOSSES
from mmpose.models.losses import RLELoss


@LOSSES.register_module()
class RLE_DSNTLoss(nn.Module):
    """Combined loss: RLE supervises the coordinates (and sigmas),
    DSNT supervises the probability distribution itself.
    """
    def __init__(self,
                 use_target_weight=False,
                 size_average=True,
                 residual=True,
                 q_dis='laplace',
                 sigma=2.0):
        super().__init__()
        self.dsnt_loss = DSNTLoss(sigma=sigma, use_target_weight=use_target_weight)
        self.rle_loss = RLELoss(use_target_weight=use_target_weight,
                                size_average=size_average,
                                residual=residual,
                                q_dis=q_dis)
        self.use_target_weight = use_target_weight

    def forward(self, output, heatmap, target, target_weight=None):

        assert target_weight is not None
        loss1 = self.dsnt_loss(heatmap, target, target_weight)
        loss2 = self.rle_loss(output, target, target_weight)

        loss = loss1 + loss2  # the relative weight of the two terms can be tuned

        return loss

@LOSSES.register_module()
class DSNTLoss(nn.Module):
    def __init__(self,
                 sigma,
                 use_target_weight=False,
                 size_average=True,
                 ):
        super(DSNTLoss, self).__init__()
        self.use_target_weight = use_target_weight
        self.sigma = sigma
        self.size_average = size_average
    
    def forward(self, heatmap, target, target_weight=None):
        """Forward function.

        Note:
            - batch_size: N
            - num_keypoints: K

        Args:
            heatmap (torch.Tensor[N, K, H, W]): Normalized heatmaps output
                by the head (each map sums to 1 after softmax).
            target (torch.Tensor[N, K, 2]): Target coordinates.
            target_weight (torch.Tensor[N, K, 2]): Weights across different
                joint types (not used by this loss).
        """
        # JS divergence between the predicted heatmap and a Gaussian of
        # standard deviation sigma centered on the target coordinates
        loss = dsntnn.js_reg_losses(heatmap, target, self.sigma)

        if self.size_average:
            loss /= len(loss)  # average over the batch

        return loss.sum()
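
Assuming the head and losses above are registered with mmpose, the keypoint_head in the config might be wired up like this (a sketch; names follow the custom classes above, and feat_size=8 matches the 8x8 map implied by Linear(64, 64)):

keypoint_head=dict(
    type='IntegralPoseRegressionHead',
    in_channels=96,
    num_joints=channel_cfg['num_output_channels'],
    feat_size=8,
    out_sigma=True,
    debias=False,
    loss_keypoint=dict(
        type='RLE_DSNTLoss',
        use_target_weight=True,
        size_average=True,
        residual=True,
        sigma=2.0))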

As the table below shows, model performance improves further after introducing IPR+DSNT (note that the input size for this experiment is 256×256).

| method | input size | AP50:95 | acc_pose | time |
| --- | --- | --- | --- | --- |
| Deeppose | 192×256 | 41.3% | 65% | 2.5ms |
| Heatmap | 192×256 | 67.5% | 93% | 60ms |
| RLE | 192×256 | 67.3% | 90% | 2.5ms |
| RLE+IPR+DSNT | 256×256 | 70.2% | 95% | 3.5ms |

5. Removing the Bias of Integral Pose Regression

After introducing IPR, we can compute expected coordinates via Softmax, but using Softmax to compute expectations introduces error, because Softmax makes every value non-zero: a distribution that is very sharp to begin with gets softened into a gradual one. As a result, the computed expectation is inaccurate. Only when the response is large enough and the distribution sharp enough does the expectation approach the argmax result; once the response is small and the distribution flat, the expectation drifts towards the center of the map. The effect becomes more dramatic as the feature size grows.

The paper Removing the Bias of Integral Pose Regression proposes a debias method to eliminate the softening effect of Softmax. Specifically, assuming the responses follow a Gaussian distribution, the feature map can be split into four regions, with region 1 spanning twice the width around the maximum response point:

(figure: the feature map split into four regions around the maximum response point)

We know that once Softmax is applied, regions 2, 3 and 4, originally all zeros, are instantly filled with a long tail, while in region 1 the response sits exactly at the region center, so the expectation estimated over this region is accurate regardless of the response magnitude.

Let's go back to the Softmax formula:

$$\tilde{H}(p) = \frac{e^{H(p)}}{\sum_{p'} e^{H(p')}}$$

For brevity, let $C$ denote the denominator:

$$C = \sum_{p'} e^{H(p')}$$

Since the responses in regions 2, 3 and 4 are assumed to be all 0, their numerators evaluate to $e^0 = 1$, and the Softmax result after splitting the regions can be written as:

$$\tilde{H}(p) = \frac{1}{C}, \qquad p \in \Omega_2 \cup \Omega_3 \cup \Omega_4$$

Substituting into the Soft-Argmax formula, the expectation becomes:

$$J^r = \sum_{p \in \Omega_1} \tilde{H}(p)\,p \;+\; \sum_{i=2}^{4} \sum_{p \in \Omega_i} \tilde{H}(p)\,p$$

That is, the expectation over region 1 plus the expectations over the other three regions. Since every point in regions 2, 3 and 4 has $\tilde{H}(p) = 1/C$, the factor $1/C$ can be pulled out of those three sums, leaving:

$$J^r = \sum_{p \in \Omega_1} \tilde{H}(p)\,p \;+\; \frac{1}{C} \sum_{i=2}^{4} \sum_{p \in \Omega_i} p$$

The remaining summation is geometrically equivalent to multiplying the region's center coordinate by the region's area. A quick demonstration on the 1-D interval $[n, m]$:

$$\sum_{k=n}^{m} k = \frac{n+m}{2}\,(m - n + 1)$$

i.e. the interval's center times the number of points it contains.

Therefore, the expectation over the entire feature map can be regarded as a weighted sum of the coordinates of the four regions' center points. By the symmetry of the four region centers, if region 1's center is $J_1 = (x_0, y_0)$, the centers of the remaining three regions are $J_2 = (x_0, y_0 + w/2)$, $J_3 = (x_0 + h/2, y_0)$, $J_4 = (x_0 + h/2, y_0 + w/2)$.

Matching the $1/C$ obtained above with center-times-area, each region's weight is:

$$w_i = \frac{|\Omega_i|}{C}, \qquad i = 2, 3, 4$$

Substituting into the weighted sum, the expectation of the whole feature map can be expressed as:

$$J^r = w_1 J_1 + w_2 J_2 + w_3 J_3 + w_4 J_4$$

Since the four weights are known to add up to 1, we have $w_1 = 1 - w_2 - w_3 - w_4$, so the expectation simplifies to:

$$J^r = (1 - w_2 - w_3 - w_4)\,J_1 + w_2 J_2 + w_3 J_3 + w_4 J_4$$

Since $J^r$ is easily obtained by running Soft-Argmax over the whole map, rearranging the formula yields the accurate center coordinate of region 1:

$$J_1 = \frac{J^r - w_2 J_2 - w_3 J_3 - w_4 J_4}{1 - w_2 - w_3 - w_4}$$

This step is equivalent to subtracting the spurious long tail from the expectation; put differently, the whole-map expectation is just an offset version of region 1's expectation.
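
A quick 1-D sanity check of the idea (illustrative numbers only): a uniform long tail drags plain Soft-Argmax away from the peak, and removing the tail contribution recovers the true location:

import torch
import torch.nn.functional as F

H = torch.zeros(64)
H[10] = 8.0                    # sharp response at x = 10
p = F.softmax(H, dim=0)
x = torch.arange(64.0)
soft = (p * x).sum()           # ~10.45: biased away from the peak
C = H.exp().sum()
tail = (x.sum() - x[10]) / C   # regions 2-4: each point carries weight 1/C
debiased = (soft - tail) / p[10]
print(soft.item(), debiased.item())  # ~10.45 vs ~10.0

With the derivation done, the head extended with the debias branch looks like this: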

@HEADS.register_module()
class IntegralPoseRegressionHead(nn.Module):
    def __init__(self,
                 in_channels,
                 num_joints,
                 feat_size,
                 loss_keypoint=None,
                 out_sigma=False,
                 debias=False,
                 train_cfg=None,
                 test_cfg=None):
        super().__init__()

        self.in_channels = in_channels
        self.num_joints = num_joints

        self.loss = build_loss(loss_keypoint)

        self.train_cfg = {} if train_cfg is None else train_cfg
        self.test_cfg = {} if test_cfg is None else test_cfg

        self.out_sigma = out_sigma
        self.debias = debias

        self.conv = build_conv_layer(
                            dict(type='Conv2d'),
                            in_channels=in_channels,
                            out_channels=num_joints,
                            kernel_size=1,
                            stride=1,
                            padding=0)

        self.size = feat_size
        self.wx = torch.arange(0.0, 1.0 * self.size, 1).view([1, self.size]).repeat([self.size, 1]) / self.size
        self.wy = torch.arange(0.0, 1.0 * self.size, 1).view([self.size, 1]).repeat([1, self.size]) / self.size
        self.wx = nn.Parameter(self.wx, requires_grad=False)
        self.wy = nn.Parameter(self.wy, requires_grad=False)

        if out_sigma:
            self.gap = nn.AdaptiveAvgPool2d((1, 1))
            self.fc = nn.Linear(self.in_channels, self.num_joints * 2)
        if debias:
            self.softmax_fc = nn.Linear(64, 64)  # assumes a flattened 8x8 (=64) feature map

    def forward(self, x):
        """Forward function."""
        if isinstance(x, (list, tuple)):
            assert len(x) == 1, ('IntegralPoseRegressionHead only supports '
                                 'single-level feature.')
            x = x[0]

        featmap = self.conv(x)
        s = list(featmap.size())
        featmap = featmap.view([s[0], s[1], s[2] * s[3]])
        if self.debias:
            # project with a learned linear layer over the flattened spatial dim,
            # scaled by the feature and weight norms (a cosine-style normalization
            # applied before the softmax)
            mlp_x_norm = torch.norm(self.softmax_fc.weight, dim=-1)
            norm_feat = torch.norm(featmap, dim=-1, keepdim=True)
            featmap = self.softmax_fc(featmap)
            featmap /= norm_feat
            featmap /= mlp_x_norm.reshape(1, 1, -1)
            
        featmap = F.softmax(16 * featmap, dim=2)
        featmap = featmap.view([s[0], s[1], s[2], s[3]])
        scoremap_x = featmap.mul(self.wx)
        scoremap_x = scoremap_x.view([s[0], s[1], s[2] * s[3]])
        soft_argmax_x = torch.sum(scoremap_x, dim=2, keepdim=True)
        scoremap_y = featmap.mul(self.wy)
        scoremap_y = scoremap_y.view([s[0], s[1], s[2] * s[3]])
        soft_argmax_y = torch.sum(scoremap_y, dim=2, keepdim=True)

        if self.debias:
            # closed-form debias correction: subtract the uniform long-tail
            # contribution and rescale (a 1-D simplification of the derivation above)
            C = featmap.reshape(s[0], s[1], s[2] * s[3]).exp().sum(dim=2).unsqueeze(dim=2)
            soft_argmax_x = C / (C - 1) * (soft_argmax_x - 1 / (2 * C))
            soft_argmax_y = C / (C - 1) * (soft_argmax_y - 1 / (2 * C))
            
        output = torch.cat([soft_argmax_x, soft_argmax_y], dim=-1)
        if self.out_sigma:
            x = self.gap(x).reshape(x.size(0), -1)
            pred_sigma = self.fc(x)
            pred_sigma = pred_sigma.reshape(pred_sigma.size(0), self.num_joints, 2)
            output = torch.cat([output, pred_sigma], dim=-1)

        return output, featmap

| method | input size | AP50:95 | acc_pose | time |
| --- | --- | --- | --- | --- |
| Deeppose | 192×256 | 41.3% | 65% | 2.5ms |
| Heatmap | 192×256 | 67.5% | 93% | 60ms |
| RLE | 192×256 | 67.3% | 90% | 2.5ms |
| RLE+IPR+DSNT | 256×256 | 70.2% | 95% | 3.5ms |
| RLE+IPR+DSNT+debias | 256×256 | 71.0% | 95% | 3.5ms |

Many thanks to the Zhihu mirror article for its guidance; I learned a lot from it. Interested readers can search for it on Zhihu.

Origin: blog.csdn.net/litt1e/article/details/126472321