Lightweight Stacked Hourglass Network

Author: Computer Vision Research Institute
Editor: 3D Vision Developer Community

 

Overview

In the field of AI painting, many researchers are working to improve the controllability of AI painting models, that is, to make the images the models generate better match what people ask for. Not long ago, a model called ControlNet took this controllability to new heights. Around the same time, researchers from Alibaba and Ant Group also made progress in the same area. This article is a detailed introduction to that work.

Code address: https://github.com/jameelhassan/PoseEstimation

Human pose estimation (HPE) is a classic computer vision task that represents a person's pose by localizing their joints. HPE can be used to understand and analyze human geometry and motion. The stacked hourglass architecture proposed by Newell et al. in [Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499] was one of the first compelling deep-learning-based HPE methods, because until then classical approaches had dominated the HPE literature. In that work, repeated bottom-up and top-down processing is used to capture information at different scales, and intermediate supervision is introduced to iteratively refine the predictions at each stage. This greatly improved accuracy compared to the state-of-the-art methods of the time.

[Figure: the stacked hourglass architecture]

However, HPE often has to run in real time, since it typically serves as the front end of another module, so computational efficiency is crucial. In this study, the researchers made architectural and non-architectural modifications to the stacked hourglass network to obtain a model that is both accurate and computationally efficient. Below is a brief description of the baseline model. The original architecture consists of multiple stacked hourglass units, each with four downsampling and upsampling levels. At each level, downsampling is performed by residual blocks and max pooling, while upsampling is performed by residual blocks and nearest-neighbor interpolation. This lets the model capture both local and global information, which is important for coherently understanding the full body and producing an accurate final pose estimate.

After each max-pooling operation, the network branches: another residual block applies further convolutions at the pre-pooling resolution, and its output is added as a skip connection to the corresponding upsampled feature maps in the second half of the hourglass. The output of the model is one heatmap per joint, modeling the probability that the joint is present at each pixel. Intermediate heatmaps are predicted after each hourglass, and the loss is applied to them as well. These intermediate predictions are then projected back to a larger number of channels and, together with the input to the current hourglass and its output feature maps, form the input to the next hourglass.
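To make this wiring concrete, here is a minimal PyTorch-style sketch (the hourglass bodies are abstracted away, and the module names, channel count, and number of joints are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class StackedHourglassHead(nn.Module):
    """Sketch of the intermediate-supervision wiring between two hourglasses.

    Each hourglass outputs feature maps; a 1x1 conv predicts per-joint
    heatmaps (supervised by the loss), another 1x1 conv projects those
    heatmaps back to the feature width, and the sum of (input features +
    hourglass features + projected heatmaps) feeds the next hourglass.
    """
    def __init__(self, hourglass1, hourglass2, channels=256, n_joints=16):
        super().__init__()
        self.hg1, self.hg2 = hourglass1, hourglass2
        self.pred1 = nn.Conv2d(channels, n_joints, kernel_size=1)  # heatmaps of hourglass 1
        self.pred2 = nn.Conv2d(channels, n_joints, kernel_size=1)  # heatmaps of hourglass 2
        self.remap = nn.Conv2d(n_joints, channels, kernel_size=1)  # project heatmaps back

    def forward(self, x):
        f1 = self.hg1(x)              # features from the first hourglass
        h1 = self.pred1(f1)           # intermediate heatmaps (loss applied here)
        x2 = x + f1 + self.remap(h1)  # input of the next hourglass
        f2 = self.hg2(x2)
        h2 = self.pred2(f2)           # final heatmaps (loss applied here too)
        return h1, h2
```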

 

Design Choices

Depthwise Separable Convolutions

Depthwise separable convolutions replace standard convolutions to reduce the number of parameters in the convolution operation. A depthwise convolution first performs the spatial convolution separately on each channel, and a pointwise (1×1) convolution then aggregates the information across channels, as shown in the following figure:

[Figure: depthwise separable convolution]
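As a rough illustration (a minimal PyTorch sketch, not the authors' implementation), a depthwise separable convolution is simply a per-channel spatial convolution followed by a 1×1 pointwise convolution:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (spatial, one filter per channel) + pointwise 1x1 conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```

For example, with 256 input and 256 output channels and a 3×3 kernel, this uses roughly 256·9 + 256·256 ≈ 68K weights, versus 256·256·9 ≈ 590K for a standard convolution.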

 

Dilated Convolution

The dilated convolution described in the equation below is a variant of the standard convolution that can grow the receptive field exponentially without the loss of resolution or coverage incurred by pooling operations.

[Equation: dilated convolution]
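The equation image from the original post is not reproduced above; for reference, the standard definition of an $l$-dilated convolution (following Yu and Koltun) is

$$(F *_{l} k)(\mathbf{p}) = \sum_{\mathbf{s} + l\,\mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t}),$$

where $F$ is the input feature map, $k$ the filter, and $l$ the dilation rate; $l = 1$ recovers the ordinary convolution, and stacking layers with $l = 1, 2, 4, \dots$ grows the receptive field exponentially with depth. In PyTorch this corresponds to the `dilation` argument of `nn.Conv2d`.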

 

Ghost Bottleneck

The Ghost bottleneck proposed in [GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition] also reduces the computational cost of convolutions by splitting the work differently. To produce a fixed number of output channels, the Ghost bottleneck uses a regular convolution to generate only a small subset of the channels, and the remaining channels are generated from them by cheaper linear operations, as shown in the figure below. These are concatenated, and further convolutions in the bottleneck produce the desired number of output channels.

[Figure: the Ghost module]
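The following is a hedged sketch of the Ghost module idea (the module name, the `ratio` parameter, and the use of a depthwise convolution as the "cheap operation" follow the GhostNet paper and are not taken from this work's code):

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Generate part of the channels with a regular conv and the rest with
    cheap (depthwise) operations, then concatenate. Assumes even out_ch."""
    def __init__(self, in_ch, out_ch, ratio=2, kernel_size=1, cheap_kernel=3):
        super().__init__()
        primary_ch = out_ch // ratio        # channels from the regular conv
        cheap_ch = out_ch - primary_ch      # channels from cheap operations
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, kernel_size, padding=kernel_size // 2),
            nn.BatchNorm2d(primary_ch),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(         # "cheap" linear op: depthwise conv
            nn.Conv2d(primary_ch, cheap_ch, cheap_kernel,
                      padding=cheap_kernel // 2, groups=primary_ch),
            nn.BatchNorm2d(cheap_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)  # out_ch channels in total
```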

 

DiCE Bottleneck

The DiCE (Dimension-wise Convolutions for Efficient networks) unit proposed by Mehta et al. comprises dimension-wise convolution and dimension-wise fusion. A convolution is applied along each of the three input dimensions (width, height, and depth/channels). To combine the information encoded along each dimension, an efficient fusion unit merges these representations. The DiCE unit can therefore efficiently capture information along both the spatial and channel dimensions.
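As a loose illustration only, the sketch below applies a grouped convolution along each of the three dimensions by permuting the tensor, then fuses the three encodings with a 1×1 convolution. This is a hypothetical simplification of the DiCE unit (for instance, it fixes the input height and width at construction time), not its actual formulation:

```python
import torch
import torch.nn as nn

class DiceLikeUnit(nn.Module):
    """Simplified, hypothetical DiCE-style unit: one convolution per input
    dimension (depth, width, height), followed by a pointwise fusion."""
    def __init__(self, channels, height, width, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # depth-wise: standard depthwise conv over (H, W), one filter per channel
        self.conv_depth = nn.Conv2d(channels, channels, kernel_size,
                                    padding=pad, groups=channels)
        # width-wise: treat W as the "channel" dimension
        self.conv_width = nn.Conv2d(width, width, kernel_size,
                                    padding=pad, groups=width)
        # height-wise: treat H as the "channel" dimension
        self.conv_height = nn.Conv2d(height, height, kernel_size,
                                     padding=pad, groups=height)
        # simple fusion: concatenate the three encodings and mix with a 1x1 conv
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x):                                           # x: (N, C, H, W)
        d = self.conv_depth(x)
        w = self.conv_width(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        h = self.conv_height(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        return self.fuse(torch.cat([d, w, h], dim=1))
```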

 

Shuffle Bottleneck

The shuffle unit, first proposed in [ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], uses pointwise group convolution and channel shuffling to improve computational efficiency while maintaining accuracy.

[Figure: the shuffle unit]
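A minimal sketch of the channel-shuffle operation at the core of this unit (the grouped 1×1 convolutions around it are omitted):

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so that information can flow
    between the groups of subsequent group convolutions."""
    n, c, h, w = x.shape
    assert c % groups == 0
    # (N, C, H, W) -> (N, groups, C // groups, H, W) -> transpose -> flatten back
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)
```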

 

Perceptual Loss

Perceptual loss is typically used to compare images that are similar but differ in small details. Here it is used as a feature-level mean squared error (MSE) between two images, i.e. the loss is computed on high-level feature maps rather than in the original image space. The hypothesis is that if the first hourglass is made to "perceive" what the second hourglass "perceives" at a high feature level, the overall performance of the network improves. The total loss, shown in the equation below, consists of the perceptual loss and the original prediction loss, with a higher weight on the prediction loss.

[Equation: total loss]
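The equation image is not reproduced above; based on the description, the combined objective can be written schematically as

$$\mathcal{L}_{total} = \lambda_{pred}\,\mathcal{L}_{MSE}\big(\hat{H}, H\big) + \lambda_{perc}\,\mathcal{L}_{MSE}\big(\phi_{1}, \phi_{2}\big), \qquad \lambda_{pred} > \lambda_{perc},$$

where $\hat{H}$ and $H$ are the predicted and ground-truth heatmaps and $\phi_{1}, \phi_{2}$ are the high-level feature maps of the first and second hourglass; the weights $\lambda$ are placeholders, and the exact weighting used in the paper may differ.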

 

Residual Connections

The researchers also replace the element-wise addition in the existing residual connections with concatenation followed by a pointwise convolution that restores the desired number of channels; this variant is called ResConcat. They additionally add a residual connection from the narrowest feature map of one hourglass (its "neck") to the neck of the next hourglass, called NarrowRes.
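A rough sketch of the ResConcat idea (names and channel handling are assumptions): instead of computing `out = f(x) + x`, the branch output and the skip are concatenated, and a pointwise convolution brings the result back to the desired width:

```python
import torch
import torch.nn as nn

class ResConcat(nn.Module):
    """Residual connection via concatenation + pointwise conv instead of addition."""
    def __init__(self, branch: nn.Module, channels: int):
        super().__init__()
        self.branch = branch                                  # assumed to preserve channel count
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        y = torch.cat([self.branch(x), x], dim=1)  # concatenate instead of adding
        return self.project(y)                     # back to the desired channel count
```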

 

Experiments

[Figures: experimental results]

 
 

Copyright statement: This article was originally published with the authorization of a contributing author of the Obi Zhongguang 3D Vision Developer Community and may not be reproduced without permission. It is shared for academic purposes only, and the copyright belongs to the original author. If any infringement is involved, please contact us to have the article deleted.

The 3D Vision Developer Community is a sharing and communication platform created by Obi Zhongguang for all developers, aiming to open up 3D vision technology to developers. The platform provides free courses in the field of 3D vision, exclusive Obi Zhongguang resources, and professional technical support.

Join [3D Vision Developer Community] to learn cutting-edge knowledge of the industry and empower developers to improve their skills!
Join [3D Vision AI Open Platform] to experience AI algorithm capabilities and help developers implement visual algorithms!

