Pose Anything is officially open source | A single model can localize keypoints for any object category

Traditional 2D pose estimation models are limited by their class-specific designs and only work for predefined object classes. This limitation is particularly problematic when dealing with novel objects, for which relevant training data is lacking.

To address this limitation, category-agnostic pose estimation (CAPE) was introduced. CAPE aims to achieve keypoint localization for any object class with a single model, requiring only a small number of support images with annotated keypoints. This not only enables pose estimation for objects with arbitrary keypoint definitions, but also significantly reduces the associated costs, paving the way for a wide range of adaptable pose estimation applications.

The authors propose a new CAPE method that exploits the inherent geometric relationships between keypoints through a newly designed graph Transformer decoder. By capturing and integrating this structural information, the method improves keypoint localization accuracy, marking a departure from previous CAPE techniques that treat keypoints as independent entities.

The authors validate the method on the MP-100 dataset, which includes more than 20,000 images covering more than 100 categories. The method outperforms the previous state of the art by 2.16% and 1.82% under the 1-shot and 5-shot settings, respectively. Furthermore, it is trained end-to-end and demonstrates scalability and efficiency compared with previous CAPE methods.

Code: https://orhir.github.io/pose-anything/

1 Introduction

Two-dimensional pose estimation, also known as keypoint localization, has made significant progress in computer vision research in recent years and has found a variety of applications in both academia and industry. The task is to predict the locations of specific semantic parts of an object in an image. It plays a vital role in human pose estimation (e.g., for video understanding and virtual reality), animal pose estimation (zoology), and vehicle pose estimation (autonomous driving).

However, a fundamental limitation of traditional pose estimation models is their inherent class-specificity. These models typically only work within predefined object categories, limiting their usefulness to the domains in which they were trained. As a result, real-world applications involving novel objects remain out of reach, owing to the lack of relevant training data.

[Figure 1: example results of the authors' method across object categories]

To address this challenge, Category-Agnostic Pose Estimation (CAPE) was proposed. CAPE aims to perform keypoint localization for arbitrary object classes using a single model, given only one or a few support images annotated with keypoints. This approach makes it possible to estimate object poses based on arbitrary keypoint definitions. Remarkably, it significantly reduces the cost of data collection, model training, and parameter tuning for each new category, paving the way for more flexible and adaptable applications in pose estimation. Figure 1 shows some results of the authors' work.

Current CAPE methods match support keypoints against corresponding keypoints in the Query image. Matching is performed in a latent space to facilitate keypoint localization. A key limitation of previous CAPE methods is that they treat keypoints as individual, disconnected points. In contrast, the authors argue that the underlying geometric structure of the object matters: this structure serves as a powerful prior that can significantly improve keypoint localization accuracy. Based on this insight, the authors introduce a new graph Transformer decoder specifically designed to capture and integrate this structural information, exploiting the inherent relationships and dependencies between keypoints.

The authors evaluate the method on the category-agnostic pose estimation benchmark MP-100. This dataset contains more than 20,000 images from more than 100 categories, including animals, vehicles, furniture, and clothing, and consists of samples drawn from existing class-specific pose estimation datasets. Since skeleton annotations were missing for some categories, the authors collected skeleton annotations from the original datasets and manually defined skeletons for the categories that lacked them.

The method improves over CapeFormer, the previous best method, by 2.16% and 1.82% under the 1-shot and 5-shot settings, respectively. It is trained end-to-end and shows scalability and efficiency compared to previous methods.

In summary, the authors make three main contributions:

  1. A new method is proposed, which utilizes the geometric relationship between key points and is implemented through a novel graph Transformer decoder.

  2. An updated version of the MP-100 dataset with skeletal annotations for all categories is provided.

  3. State-of-the-art performance on the MP-100 benchmark.

2 Related Works

Detection Transformer

DEtection TRansformer (DETR) is conceptually simple and broadly applicable, requiring minimal domain expertise while avoiding custom label assignment and non-maximum suppression.

However, the original DETR design has challenges such as slow convergence speed and reduced detection accuracy. To address these issues, many subsequent studies emerged that improved the DETR paradigm. These works pave the way for the development of state-of-the-art object detectors, leveraging innovations such as the reintroduction of multi-scale features and local cross-attention computation.

The connection between detection and pose estimation is increasingly prominent: the former focuses on bounding box estimation and localization, while the latter focuses on keypoint localization. Recognizing this, the authors incorporate several DETR improvements into the latest category-agnostic pose estimation methods, thereby improving their performance.

Category-Agnostic Pose Estimation

The main goal of pose estimation is to accurately locate the semantic keypoints of an object or instance. Traditionally, pose estimation methods have focused on specific categories such as humans, animals, or vehicles. Existing research mainly focuses on designing powerful convolutional neural networks or Transformer-based architectures. However, these methods only work for the object categories encountered during training.

A relatively unexplored direction is category-agnostic pose estimation, introduced by POMNet, where the focus shifts to building general representations and similarity metrics. POMNet predicts keypoints by comparing support keypoints with the Query image in an embedding space, handling object categories not encountered during training.

POMNet uses a Transformer to encode the Query image and the support keypoints, and directly predicts similarity by concatenating the support keypoint and Query image features. CapeFormer extends the matching paradigm to a two-stage framework that improves prediction accuracy by correcting unreliable matching results. The authors follow CapeFormer's approach and strengthen it with a more reliable baseline. Furthermore, they focus on the importance of geometric structure and integrate it seamlessly into their architecture.

Graph Neural Network

Originally designed for graph data such as social networks, citation networks, and biochemical graphs, GCNs have gradually gained attention in computer vision. They find applications in scene graph generation, point cloud classification, and action recognition. Scene graph generation parses images into graphs representing objects and their relationships, often combined with object detection. Point clouds obtained from lidar scans can be efficiently classified and segmented using GCNs. GCNs are also used on the naturally structured graph of human joints for activity recognition. Recent work has even proposed using GNNs as general-purpose feature extractors for vision tasks, representing images as graph structures.

Unlike GCNs, Graph Attention Networks (GAT) introduce self-attention, integrating the graph structure into the attention mechanism by masking attention. Similarly, aGCN relies on self-attention to determine the weights of neighboring joints, differing in its choice of activation functions and transformation matrices. Although GAT and aGCN improve the aggregation of neighboring joints, their restricted receptive field remains a challenge. Zhao et al. address this by adding a learnable weight matrix to the adjacency matrix, making it possible to learn semantic relationships between 2D joints. Similarly, Lin et al. exploit the standard Transformer encoder by adjusting the dimensionality of the encoder layers.

However, these methods fail to fully exploit the inherent graph structure. GraFormer addresses this by combining attention and graph convolution layers with a learnable adjacency matrix. However, it is aimed at domain-specific applications and is therefore not directly suited to the authors' category-agnostic pose estimation task.

The authors' category-agnostic pose estimation method addresses this by integrating self-attention and graph convolutional networks into the decoder architecture. Self-attention provides learnable connections between keypoints, while the graph convolutional network leverages the input skeleton graph to help the model share information between semantically related keypoints.

3 Method

The authors first introduce an enhanced baseline based on CapeFormer. They then describe a graph-based approach that exploits the rich graph structure available in the data. The full architecture is shown in Figure 2.

[Figure 2: overall architecture of the method]

An Enhanced Baseline

The authors make two changes to the CapeFormer architecture. First, they replace its ResNet-50 [14] backbone with a Transformer-based SwinV2-T [22] backbone of comparable size but greater capacity. Second, they remove the keypoint positional encoding, because it introduces an undesirable dependence on keypoint order. With these changes, the authors obtain an enhanced baseline with the same basic architecture as CapeFormer. The impact of these changes is detailed in the experiments section.

For completeness, the authors briefly review the design of CapeFormer. It is similar to the DETR architecture and consists of four sub-networks:

Shared backbone: A pre-trained ResNet-50 extracts features from the support and Query images. To obtain the support keypoint features, the support image feature maps are multiplied element-wise with keypoint masks, created by placing a Gaussian kernel centered at each support keypoint location. In scenarios with multiple support images, such as the 5-shot setting, the support keypoint features from the different images are averaged in feature space. This yields the Query feature map and the support keypoint features.
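A minimal PyTorch sketch of how such Gaussian-masked pooling might look; the mask width and the normalization by the mask sum are illustrative assumptions, not details taken from the paper:

```python
import torch

def gaussian_mask(h, w, center_xy, sigma=2.0):
    """Gaussian kernel centered at an (x, y) keypoint location on an h x w feature map."""
    ys = torch.arange(h, dtype=torch.float32).view(h, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, w)
    cx, cy = center_xy
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def extract_support_keypoint_features(support_feats, keypoints_xy, sigma=2.0):
    """
    support_feats: (S, C, H, W) feature maps of S support images (S=1 or S=5 shots)
    keypoints_xy:  (S, K, 2) keypoint locations in feature-map coordinates
    returns:       (K, C) support keypoint features, averaged over the S support images
    """
    S, C, H, W = support_feats.shape
    K = keypoints_xy.shape[1]
    feats = torch.zeros(S, K, C)
    for s in range(S):
        for k in range(K):
            mask = gaussian_mask(H, W, keypoints_xy[s, k], sigma)    # (H, W)
            weighted = support_feats[s] * mask                        # element-wise multiplication
            feats[s, k] = weighted.sum(dim=(1, 2)) / mask.sum()       # mask-normalized pooling
    return feats.mean(dim=0)                                          # average in feature space
```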

Transformer encoder: A Transformer encoder fuses the support keypoint features with the Query patch features. The encoder consists of three Transformer blocks, each containing a self-attention layer. The support keypoint features and Query features are concatenated before the self-attention layer and separated again afterwards. The output is a refined Query feature map and refined support keypoint features.
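Conceptually, each encoder block follows a concatenate, self-attend, separate pattern, as in this hypothetical PyTorch illustration (layer sizes and normalization placement are assumptions):

```python
import torch
import torch.nn as nn

class FusionEncoderLayer(nn.Module):
    """One encoder block: joint self-attention over [support keypoints ; Query patches]."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, kp_feats, patch_feats):
        # kp_feats: (B, K, C) support keypoint features; patch_feats: (B, H*W, C) Query patch features
        K = kp_feats.shape[1]
        x = torch.cat([kp_feats, patch_feats], dim=1)        # concatenate before self-attention
        x = self.norm1(x + self.attn(x, x, x)[0])
        x = self.norm2(x + self.ffn(x))
        return x[:, :K], x[:, K:]                            # separate again afterwards
```

Stacking three such blocks would yield the refined support keypoint features and refined Query feature map described above.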

Similarity-aware proposal generator: The proposal generator aligns the support keypoint features with the Query features to produce similarity maps. From these maps, peaks are selected as similarity-aware proposals. To balance efficiency and flexibility, a trainable inner-product mechanism [31] is adopted to explicitly model the similarity.
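A rough sketch of a trainable inner-product similarity with peak selection; the projection layers and the hard argmax used here are illustrative simplifications (the similarity maps themselves are supervised with a heatmap loss, as described later):

```python
import torch
import torch.nn as nn

class SimilarityProposalGenerator(nn.Module):
    """Trainable inner product between support keypoint features and Query patch features."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj_kp = nn.Linear(dim, dim)      # learnable projection of keypoint features
        self.proj_patch = nn.Linear(dim, dim)   # learnable projection of patch features

    def forward(self, kp_feats, patch_feats, feat_hw):
        # kp_feats: (B, K, C), patch_feats: (B, H*W, C), feat_hw: (H, W)
        B, K, _ = kp_feats.shape
        H, W = feat_hw
        # similarity maps via a trainable inner product
        sim = torch.einsum('bkc,bpc->bkp', self.proj_kp(kp_feats), self.proj_patch(patch_feats))
        # peak of each similarity map serves as the initial, similarity-aware proposal
        peaks = sim.argmax(dim=-1)                                        # (B, K)
        prop_x = (peaks % W).float() / W                                  # normalized x coordinate
        prop_y = torch.div(peaks, W, rounding_mode='floor').float() / H   # normalized y coordinate
        proposals = torch.stack([prop_x, prop_y], dim=-1)                 # (B, K, 2)
        return sim.reshape(B, K, H, W), proposals
```

In practice, the peak locations serve only as initial proposals that the decoder subsequently refines.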

Transformer decoder: A Transformer decoder network decodes keypoint positions from the Query feature map. It consists of three layers, each containing self-attention, cross-attention, and feed-forward blocks. An iterative refinement strategy similar to previous work is applied, so that each decoder layer predicts coordinate offsets relative to the coordinates predicted by the previous layer. Furthermore, similar to Conditional DETR, the decoder uses the predicted coordinates as enhanced reference points to pool features from the image feature maps.

A Graph-Based Approach

The core idea of the authors' work is to exploit the geometric structure encoded in the pose graph. The method builds on the enhanced baseline, replacing the Transformer decoder module with the new graph Transformer decoder module. The graph Transformer decoder and the associated loss functions are described next.

Graph Transformer Decoder (GTD)

The authors introduce a decoder called the Graph Transformer Decoder (GTD) that operates directly on graph structures. As shown in Figure 3, GTD uses a new feed-forward network in the Transformer decoder layer, which includes a graph convolutional network (GCN) layer.

[Figure 3: Graph Transformer Decoder (GTD)]

The authors note that the self-attention mechanism, which helps the model focus on relevant information, can be viewed as a graph convolutional network with a learnable adjacency matrix. For single-category pose estimation, this mechanism suffices to learn the relationships between keypoints and integrate the learned structure into the model. However, for category-agnostic pose estimation (CAPE), where the model must handle a variety of object categories, it is beneficial to explicitly consider the semantic connections between keypoints. This helps the model break symmetries, maintain a consistent structure, and handle noisy keypoints through information sharing between neighboring keypoints.

The authors introduce this prior into the Transformer decoder. Specifically, GTD is based on the original CapeFormer decoder, but changes the feed-forward network from a simple MLP to a GCN network. To address the over-smoothing problem commonly observed in deep GCNs, which can reduce the distinctiveness of node features and thus hurt performance, the authors add a linear layer applied to each node. Figure 4 illustrates the effect of the graph prior on the cross-attention layers: early layers focus on a single keypoint, while later layers also attend to adjacent keypoints (according to the graph structure).

[Figure 4: cross-attention maps across decoder layers]

The original decoder has three main components: self-attention, cross-attention, and a feed-forward network. Self-attention allows adaptive interactions between the support keypoint features, transforming the input features. Cross-attention sharpens attention on the proposed locations by extracting local localization information from the Query feature patches; this involves concatenating keypoint location embeddings with the keypoint features. Cross-attention takes the keypoint features (with positional embeddings) as Query, the patch features (with positional encoding) as Key and Value, and outputs the transformed keypoint features.

Finally, a new feed-forward network processes the output keypoint features. To introduce prior geometric knowledge, the authors insert a graph convolutional network (GCN) layer. This GCN layer further refines the keypoint features and facilitates information exchange between connected keypoints.

The output of the GCN layer for input keypoint features $F$ can be expressed as:

$$F' = \sigma\left(\tilde{A}\, F\, W\right)$$

where $W$ is a learnable parameter matrix, $\sigma$ is an activation function (ReLU), and $\tilde{A}$ is the symmetric normalized form of the adjacency matrix $A$, which is defined as 1 where node $i$ is connected to node $j$ and 0 elsewhere.
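Below is a minimal PyTorch sketch of such a GCN-based feed-forward block, including the symmetric adjacency normalization and the linear layer added to counter over-smoothing; the class names, dimensions, and exact placement of the linear layer are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn

def normalized_adjacency(edges, num_nodes):
    """Symmetric normalized adjacency A~ = D^{-1/2} (A + I) D^{-1/2} from a skeleton edge list."""
    A = torch.eye(num_nodes)                       # self-loops
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d_inv_sqrt = torch.diag(A.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ A @ d_inv_sqrt

class GCNFeedForward(nn.Module):
    """GTD-style feed-forward block: GCN layer followed by a per-node linear layer."""
    def __init__(self, dim=256):
        super().__init__()
        self.weight = nn.Linear(dim, dim, bias=False)   # learnable parameter matrix W
        self.act = nn.ReLU()
        self.linear = nn.Linear(dim, dim)               # linear layer per node, against over-smoothing

    def forward(self, feats, adj):
        # feats: (B, K, C) keypoint features, adj: (K, K) normalized adjacency A~
        out = self.act(adj @ self.weight(feats))        # F' = ReLU(A~ F W)
        return self.linear(out)
```

Here the adjacency matrix would be built from the category's skeleton annotation, which the updated MP-100 dataset provides for all categories.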

The output features are used to update the keypoint locations from the previous layer. The authors follow [32] and use the following update function:

$$P^{l} = \sigma\left(\sigma^{-1}\!\left(P^{l-1}\right) + \Delta P^{l}\right)$$

where $\sigma$ and $\sigma^{-1}$ are the sigmoid function and its inverse, respectively, and $\Delta P^{l}$ is the offset predicted by layer $l$ from the output features. The keypoint positions from the last decoder layer are used as the final keypoint predictions.
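As a rough illustration of this update rule (assuming, as in [32]-style decoders, that each layer predicts offsets via a small head on the refined features; names are hypothetical):

```python
import torch

def inverse_sigmoid(x, eps=1e-6):
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

def update_keypoints(prev_coords, offsets):
    """
    prev_coords: (B, K, 2) keypoint coordinates in [0, 1] from the previous decoder layer
    offsets:     (B, K, 2) offsets predicted by the current layer from the refined features
    returns:     (B, K, 2) updated coordinates, again in [0, 1]
    """
    return torch.sigmoid(inverse_sigmoid(prev_coords) + offsets)
```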

The training loss follows CapeFormer and uses two supervision signals: a heatmap loss and an offset loss. The heatmap loss supervises the similarity maps and the initial coordinate proposals, while the offset loss supervises the localization outputs:

$$\mathcal{L}_{\text{heatmap}} = \left\lVert \sigma(\hat{S}) - H^{gt} \right\rVert_2^2, \qquad \mathcal{L}_{\text{offset}} = \sum_{l} \left\lVert P^{l} - P^{gt} \right\rVert_1$$

where $\hat{S}$ is the similarity heatmap output by the proposal generator, $\sigma$ is the sigmoid function, $H^{gt}$ is the ground-truth heatmap, $P^{l}$ is the position output of each decoder layer, and $P^{gt}$ is the ground-truth position. The total loss is:

$$\mathcal{L} = \mathcal{L}_{\text{offset}} + \lambda\, \mathcal{L}_{\text{heatmap}}$$

where $\lambda$ balances the two terms.
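A hedged sketch of how these losses might be computed, assuming an MSE heatmap term and a per-layer L1 offset term; the exact distance functions and weighting in the official implementation may differ:

```python
import torch
import torch.nn.functional as F

def cape_losses(sim_maps, gt_heatmaps, layer_coords, gt_coords, lam=2.0):
    """
    sim_maps:     (B, K, H, W) similarity maps from the proposal generator (pre-sigmoid)
    gt_heatmaps:  (B, K, H, W) ground-truth Gaussian heatmaps
    layer_coords: list of (B, K, 2) coordinates, one entry per decoder layer
    gt_coords:    (B, K, 2) ground-truth keypoint coordinates
    lam:          weight balancing the two terms (assumed value)
    """
    heatmap_loss = F.mse_loss(torch.sigmoid(sim_maps), gt_heatmaps)      # supervises proposals
    offset_loss = sum(F.l1_loss(c, gt_coords) for c in layer_coords)     # supervises localization
    return offset_loss + lam * heatmap_loss
```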

4 Experiments

The MP-100 dataset is used for training and evaluation. It contains samples from existing class-specific pose estimation datasets, comprising over 20,000 images across 100 different categories, with the number of keypoints varying across categories (up to 68). To facilitate model training and evaluation, the samples are divided into five splits. Notably, each split ensures that the training, validation, and test categories are disjoint, meaning that the categories used for evaluation are unseen during training.

The dataset includes partial skeleton annotations in various formats, including differences in keypoint indexing (some start at zero, while others start at one). To standardize and enhance the dataset, the authors adopted a unified format that contains comprehensive skeletal definitions for all categories. This process involves cross-referencing the original data sets and manually annotating them where necessary to ensure completeness and consistency.

To quantify the performance of the model, the authors adopt the Probability of Correct Keypoint (PCK) metric and set the PCK threshold to 0.2, following the standard established by POMNet and CapeFormer.
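For reference, a simple sketch of a PCK computation; the normalization by the longer side of the object's bounding box is an assumption about the benchmark protocol, so treat this as illustrative only:

```python
import numpy as np

def pck(pred, gt, bbox_size, thr=0.2, visible=None):
    """
    pred, gt:  (K, 2) predicted and ground-truth keypoint coordinates in pixels
    bbox_size: normalization factor, e.g. the longer side of the object's bounding box
    visible:   optional (K,) boolean mask selecting annotated keypoints
    returns:   fraction of keypoints whose error is within thr * bbox_size
    """
    dist = np.linalg.norm(pred - gt, axis=-1)
    correct = dist <= thr * bbox_size
    if visible is not None:
        correct = correct[visible]
    return correct.mean()
```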

[Figure 5: qualitative comparison with POMNet and CapeFormer]

The authors present a qualitative comparison of their method with the previous CAPE methods CapeFormer and POMNet in Figure 5. The last column shows the complete method, which builds on the enhanced baseline and replaces the previous decoder with the new GTD module. As can be seen, the enhanced baseline already outperforms the previous state of the art in keypoint localization.

Furthermore, the structural information used by the method serves as a powerful prior for keypoint localization, helping to break symmetries and enforce structural consistency among keypoints. The first and third rows of Figure 5 show examples where prior methods confuse keypoints while the authors' method does not, and illustrate its advantage in maintaining structural consistency among keypoints. More examples related to Figure 5 are provided in the supplementary material.

Implementation Details

For a fair comparison, the network parameters, training parameters, data augmentation, and preprocessing are consistent with CapeFormer: both the encoder and decoder have three layers, and the loss weight is set to 2.0. The model is built on the MMPose framework and trained for 200 epochs using the Adam optimizer with a batch size of 16; the learning rate is decayed by a factor of 10 at epochs 160 and 180. More design choices and evaluations can be found in the supplementary material.
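The stated recipe corresponds to a schedule along these lines (a hypothetical PyTorch sketch; the initial learning rate is not given in this summary, so the value below is a placeholder):

```python
import torch

# Stand-in for the actual network; the real model is the graph Transformer described above.
model = torch.nn.Linear(1, 1)

# Adam optimizer, batch size 16, 200 epochs, LR decayed by 10x at epochs 160 and 180.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # placeholder initial LR
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[160, 180], gamma=0.1)

for epoch in range(200):
    # ... one training epoch over MP-100 with batch size 16 ...
    scheduler.step()
```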

Enhanced Backbone. The authors use a stronger Transformer-based SwinV2-T backbone instead of the original ResNet-50 backbone; it can also provide multi-scale feature maps. After trying various configurations, including multi-scale and single-scale feature maps, the authors found that simple bilinear upsampling of the final feature map gives comparable results while retaining simplicity.

In addition, taking advantage of the higher feature quality of the new backbone, the authors extract the support keypoint features with a lower-variance Gaussian kernel mask. These simple adjustments yield a 3.2% improvement.

Removing the support keypoint identifier. CapeFormer introduces a keypoint positional encoding called the "support keypoint identifier", generated by a sinusoidal encoding of the keypoint index. This encoding significantly improves PCK by 3%, but introduces a dependence on keypoint order: although the dataset contains multiple object categories, some of them share the same keypoint order.

This became apparent when the authors evaluated a model trained with this positional encoding on data whose keypoints were given in reverse order: PCK dropped by roughly 30%, indicating that the model had specialized to a specific keypoint ordering. Without the positional encoding, this drastic drop disappears. The authors argue that category-agnostic pose estimation (CAPE) should not rely on such assumptions and should accept support keypoints in no particular order; they therefore remove this encoding from their baseline.
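To see why such an encoding ties the model to a fixed keypoint order, consider a standard sinusoidal encoding of the keypoint index (a sketch of what the "support keypoint identifier" presumably looks like; CapeFormer's exact formulation may differ):

```python
import torch

def keypoint_identifier(num_keypoints, dim=256):
    """Sinusoidal encoding of the keypoint index."""
    pos = torch.arange(num_keypoints, dtype=torch.float32).unsqueeze(1)   # (K, 1)
    i = torch.arange(0, dim, 2, dtype=torch.float32)                      # (dim/2,)
    div = torch.exp(-i * torch.log(torch.tensor(10000.0)) / dim)
    enc = torch.zeros(num_keypoints, dim)
    enc[:, 0::2] = torch.sin(pos * div)
    enc[:, 1::2] = torch.cos(pos * div)
    return enc

ident = keypoint_identifier(17)
# Reversing the keypoint order assigns every keypoint a different identifier,
# which is why a model trained with this encoding degrades sharply on reordered keypoints.
assert not torch.allclose(ident, ident.flip(0))
```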

Benchmark Results

The method is compared with the previous CAPE methods CapeFormer and POMNet, as well as three baselines: ProtoNet, MAML, and Fine-tuned. More details on the evaluation of these models can be found in [42].

[Table 1: results on the MP-100 dataset in the 1-shot and 5-shot settings]

The authors report results in the 1-shot and 5-shot settings on the MP-100 dataset. As shown in Table 1, the enhanced baseline, despite being unbiased with respect to keypoint order (unlike CapeFormer), outperforms previous methods, improving the average PCK by 0.94% in the 1-shot setting and 1.60% in the 5-shot setting. The graph-based approach improves performance further, surpassing the enhanced baseline by 1.22% in the 1-shot setting and 0.22% in the 5-shot setting and achieving new state-of-the-art results in both settings.

The design also demonstrates scalability. As with DETR-based models, a larger backbone improves performance. The graph decoder design also boosts the larger enhanced baseline, improving results by 1.02% and 0.34% in the 1-shot and 5-shot settings, respectively.

[Figure 6: cross-domain results on cartoons and diffusion-generated images]

To evaluate robustness, the authors test a smaller version of the network on images from different domains; the results are shown in Figure 6. Although the model is trained only on real images, it adapts well to other data sources, such as cartoons and fictional animals created by diffusion models. In addition, the model performs well when the support and Query images come from different domains.

Ablation Study

The authors conduct a series of ablation experiments on the MP-100 dataset. First, they evaluate the method with different backbones, demonstrating the advantage of the Swin Transformer architecture for localization tasks. They then show the contribution of the geometric prior by evaluating the model with incorrect skeletal relationships. Finally, they further demonstrate the power of the graph structure by evaluating the model with masked inputs. All ablation experiments are conducted on a single split of the MP-100 test set, following [32, 42], under the 1-shot setting.

Different Backbones. The authors evaluate the model with various pre-trained backbones, including a CNN-based backbone (ResNet-50) and two pre-trained Transformer backbones, DINO and SwinV2. DINO and DINOv2 are trained by self-distillation of a self-supervised vision Transformer to obtain robust and refined semantic features.

Moreover, these encoded semantic representations are shared between related but distinct object categories. The Swin Transformer restricts self-attention to local windows, striking a balance between efficiency and performance in vision tasks. Its hierarchical design provides the flexibility to model at different scales, with computational complexity that is linear in image size. As shown in Table 2, SwinV2 outperforms the other options, delivering superior results while maintaining efficiency comparable to the CNN-based backbone. Additionally, using a larger backbone improves performance at the expense of efficiency and model size.

[Table 2: backbone ablation]

Evaluating the contribution of the graph structure prior. To evaluate the contribution of the graph structure prior, the authors evaluate the method with random graph inputs, where the edge connections are randomly selected for each instance. This results in a performance drop of 9.57%, validating the contribution of the graph decoder and its use of structural knowledge.

To emphasize the power of the graph information, the authors partially occlude the support/Query images before running the algorithm. Note that the only difference between the baseline model and the authors' model is the new graph-based feed-forward network. The quantitative analysis in Figure 7(a) shows that the authors' method consistently outperforms the baseline model. The dashed horizontal line represents the trivial baseline that simply outputs the support keypoint positions, which preserves the structure. Remarkably, the model can still predict keypoints even when a large portion of the support image is occluded, indicating that it has learned which keypoints are relevant for each category and matches them to the support features based on structure.

[Figure 7: occlusion study, (a) quantitative results and (b) qualitative examples]

For example, in Figure 7(b) (top), the model accurately predicts most of the key points of the sofa structure. However, when a large part of the Query image is occluded, the model's performance drops rapidly, although it preserves the structure, as shown in Figure 7(b) (bottom). For more examples and additional ablation studies, see the supplementary material.

5 Conclusion

The authors propose a novel approach to category-agnostic pose estimation (CAPE) that recognizes the importance of the inherent geometric structure within an object. They introduce a graph Transformer decoder that significantly improves keypoint localization accuracy by capturing and integrating structural information, exploiting the relationships and dependencies between keypoints. In addition, they provide an updated version of the MP-100 dataset that now includes skeleton annotations for all categories, further advancing CAPE research.

The experimental results show that the method outperforms the previous state-of-the-art method, CapeFormer, by a large margin. With improvements in both the 1-shot and 5-shot settings, the approach demonstrates scalability and efficiency, opening the door to more flexible and adaptable applications in computer vision.

Reference

[1] Pose Anything: A Graph-Based Approach for Category-Agnostic Pose Estimation.


Reprinted from: blog.csdn.net/jacke121/article/details/135040264