Reading Notes: GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB

This article covers a CVPR 2018 paper on 3D hand tracking from monocular RGB images. The work has three main components: training data generation, keypoint prediction, and 3D skeleton fitting. It is mainly compared against an ICCV 2017 paper, the first work on hand tracking from monocular RGB images. The abstract summarizes the main content of the paper. The first section introduces the current state of 3D hand pose estimation, the idea of this paper inspired by recent research, and the main contributions. The second section on related work covers multi-view methods, monocular methods, datasets, and learning-based methods. The third section describes the hand tracking system: training data generation, hand joint regression, and kinematic skeleton fitting. The fourth section presents the experiments, with qualitative and quantitative comparisons. The fifth section discusses limitations, and the sixth section concludes. The supplementary material gives the network architectures and training details. The code and datasets are open source.

Abstract: We address the extremely challenging problem of real-time 3D hand tracking based on monocular RGB sequences. Our tracking method combines a convolutional neural network with a kinematic 3D hand model, such that it generalizes well to unseen data, is robust to occlusions and different camera viewpoints, and leads to anatomically plausible and temporally smooth hand motions. For training the CNN, we propose a novel approach to generating synthetic training data, based on geometrically consistent image-to-image translation networks. More specifically, we use a neural network to convert synthetic images into "real" images, such that the resulting images follow the same statistical distribution as real-world hand images. To train this translation network, we combine an adversarial loss, a cycle-consistency loss, and a geometric consistency loss that preserves geometric properties (such as hand pose) during translation. We demonstrate that our hand tracking system outperforms the current state of the art on challenging monocular RGB sequences.

1. Introduction

Estimating the 3D pose of the hand is a long-standing goal of computer vision with many applications such as virtual/augmented reality (VR/AR) [17, 26] and human-computer interaction [38, 18]. While a large body of existing work considers markerless hand tracking or pose estimation, many methods require depth cameras [34, 47, 39, 44, 27, 7, 54] or multi-view setups [41, 1, 56]. In many applications, however, these requirements are disadvantageous because such hardware is less common, more expensive, and does not work in all scenarios.

In contrast, we address these issues and propose a novel real-time skeletal 3D hand tracking algorithm that is robust to object occlusion and clutter. Some recent work considers RGB-only markerless hand tracking [36, 63, 8], but with clear limitations. For example, the method of Simon et al. [36] estimates 3D joint positions in a multi-view setting; in the monocular case, however, it can only estimate 2D joint positions. Likewise, the method by Gomez-Donoso et al. [8] is limited to 2D. Recently, Zimmermann and Brox [63] proposed a monocular RGB-based 3D hand pose estimation method, but it only recovers relative 3D positions and suffers under occlusion.

Inspired by recent work in hand and body tracking [49, 20, 19], we combine CNN-based 2D and 3D hand joint predictions with a kinematic fitting step to track hands in global 3D from monocular RGB. The main problem with this (supervised) learning-based approach is the need for suitably labeled training data. While manually annotating 2D joint locations in single-view RGB images [14] has been shown to be feasible, accurate annotation in 3D is impossible due to the inherent depth ambiguity. One way to overcome this problem is to leverage existing multi-camera approaches to track 3D hand motion [41, 1, 56, 8]; however, the resulting annotations lack precision due to unavoidable tracking errors. Other works render synthetic hands for which perfect ground truth is known [20, 63]. However, CNNs trained on synthetic images alone generalize poorly to real images because of the statistical discrepancy between the two domains. We therefore use a geometrically consistent image-to-image translation network whose pose-preserving output exhibits less texture leakage and sharper contours. Once this network is trained, we can use it to convert any synthetically generated image into a "real" image while keeping the perfect (and cheap) ground-truth annotations. In the remainder of this paper, we refer to synthetic images processed by our translation network as "real" (in quotes) or GANerated, meaning that they follow the same statistical distribution as real images.

Finally, using the annotated RGB images generated by the GAN, we train a CNN that jointly regresses image-space 2D and root-relative 3D hand joint locations. While the 2D predictions combined with a skeletal hand model are sufficient to estimate the global translation of the hand, the relative 3D positions resolve the inherent ambiguities in global rotation and articulation that arise from 2D positions alone. In summary, our main contributions are:

• The first real-time hand tracking system to track global 3D hand pose from unconstrained monocular RGB images.

• A novel geometrically consistent GAN that performs image-to-image translation while preserving pose during translation.

• Based on this network, we are able to enhance synthetic hand image datasets such that the statistical distribution resembles real hand images.

• A new RGB dataset with annotated 3D hand joint positions. It exceeds existing datasets in size (>260k frames), image fidelity, and annotation accuracy.

2. Related work

Our goal is to track hand poses from unconstrained monocular RGB video streams at real-time frame rates. This is a challenging problem because of the large pose space, depth ambiguities, occlusions by objects, appearance changes due to lighting and skin tone, and camera viewpoint changes. While glove-based solutions can address these challenges, they are cumbersome to wear. Therefore, in the following we discuss marker-free camera-based methods that attempt to address these challenges.

Multi-View Approaches: Using multiple RGB cameras greatly alleviates occlusions during hand motion and interaction. [56] demonstrated hand tracking with two cameras using a discriminative approach that quickly finds the closest pose in a database. [23] showed tracking of hands and manipulated objects using 8 calibrated cameras in a studio setting. Ballan et al. [1] also used 8 synchronized RGB cameras to estimate the pose, adding input from discriminatively detected points on the fingers. [41, 42] used 5 RGB cameras and an additional depth sensor to demonstrate real-time hand pose estimation. Panteleris and Argyros [24] proposed to use a short-baseline stereo camera for hand pose estimation without disparity maps. All of the above methods rely on multiple calibrated cameras, which makes them difficult to set up and unsuitable for general hand motions in unconstrained scenarios such as community videos. More recently, Simon et al. [36] proposed a method to generate large amounts of 2D and 3D hand pose data using a panoptic studio setup, which limits natural motion and appearance variation. They also leverage their data for 2D hand pose estimation, but cannot estimate 3D pose from monocular RGB video. Our contribution addresses both the data variability and the difficult problem of 3D pose estimation in general scenes.

Monocular methods: Monocular methods for 3D hand pose estimation are preferable because they can be used in many applications without setup overhead. The availability of inexpensive consumer depth sensors has led to extensive research on using them for hand pose estimation. Hamer et al. [10] proposed one of the first generative methods to use monocular RGB-D data for hand tracking, even under partial occlusion. Since such methods often get stuck in local optima, Keskin et al. [15] proposed a learning-based discriminative method. Many subsequent works have improved the generative components [44, 46, 51, 52] and the learning-based discriminative ones [58, 16, 45, 49, 55, 7, 37, 22, 61, 5, 4, 29]. Hybrid methods that combine the best of generative and discriminative approaches show the best performance on benchmark datasets [47, 39, 40, 20, 59].

Despite the above progress in monocular RGB-D or depth-based hand pose estimation, these devices do not work in all scenarios, e.g. outdoors due to sunlight interference, and they consume more power. Furthermore, 3D hand pose estimation in unconstrained RGB videos allows us to deal with community videos, as shown in Figure 1. Early approaches to this problem [12, 43, 30] did not produce metrically accurate 3D poses, because they only retrieved the closest pose for a given input or assumed a fixed z-coordinate. Zimmermann and Brox [63] proposed a learning-based approach to this problem. However, their 3D joint predictions are relative to a canonical frame, i.e., the absolute coordinates are unknown, and the method is not very robust to object occlusions. Furthermore, their method cannot distinguish 3D poses that share the same 2D projection of the joint positions, since their 3D predictions are based solely on abstract 2D heatmaps without directly considering the image. In contrast, our work addresses these limitations by jointly learning 2D and 3D joint locations from image evidence, so we can correctly estimate poses with ambiguous 2D joint locations. Moreover, our skeleton fitting framework combines a hand model prior with these predictions to obtain global 3D coordinates.

Training data for learning-based methods: One of the challenges in using learning-based models for hand pose estimation is the difficulty of obtaining annotated data with sufficient real-world variability. For depth-based hand pose estimation, several training datasets have been proposed, using generative model fitting to obtain ground-truth annotations [49, 39] or better sampling of the pose space [21]. Simon et al. [36] proposed a multi-view bootstrapping method. However, this outside-in capture setup still suffers from occlusions when hands manipulate objects. Synthetic data promises perfect ground truth, but models trained on it suffer from the domain gap when applied to real input [20].

Techniques such as domain adaptation [6, 50, 25] aim to bridge the gap between real and synthetic data by learning features that are invariant to the latent differences. Other techniques use real-synthetic image pairs [13, 32, 3] to train networks that can generate images containing many features of real images. Because such pairs are hard to obtain, Shrivastava et al. [35] recently proposed a synthetic-to-real refinement network that only requires unpaired examples. However, the degree of refinement is limited because the output is constrained to stay pixel-wise similar to the input. In contrast, the unpaired image-to-image translation work of Zhu et al. [62] relaxes this constraint by learning a bijection between the two domains. We build on [62] to achieve stronger refinement and introduce a geometric consistency constraint to ensure valid annotation transfer. Without requiring corresponding pairs of real and synthetic images, we can generate hand images that contain many of the characteristics found in real datasets.

3. Hand tracking system

The main goal of this paper is to propose a real-time system for 3D monocular RGB hand tracking. The whole system is shown in Figure 2. Given a live monocular RGB video stream, we use a CNN hand joint regressor, RegNet, to predict 2D joint heatmaps and 3D joint locations (Section 3.2). RegNet is trained with images generated by a new image-to-image translation network, GeoConGAN (Section 3.1), which enriches synthetic hand images. The output of GeoConGAN - the GANerated images - is better suited for training a CNN that will operate on real images. After joint regression, we fit a kinematic skeleton to the 2D and 3D predictions by minimizing our fitting energy (Section 3.3), which has several key advantages for robust 3D hand pose tracking: it enforces biomechanical plausibility, it lets us recover absolute positions in 3D, and it allows us to impose temporal stability across frames.

3.1 Training data generation

Since it is not possible to annotate 3D joint locations in large numbers of real hand images, synthetically generated images are commonly used. While the main advantage of synthetic images is that the ground-truth 3D joint positions are known exactly, an important disadvantage is that they often lack realism. This discrepancy between real and synthetic images limits the generalization ability of CNNs trained only on synthetic images. To reduce this discrepancy, we propose GeoConGAN, an image-to-image translation network whose goal is to translate synthetic images into real-looking ones. Importantly, this network is trained with unpaired real and synthetic images, as described below. Note that for both real and synthetic data we use only foreground-segmented images showing hands on a white background, which helps training and concentrates the network capacity on the hand region.

Real Hand Image Acquisition: To obtain our real image dataset, we use a green screen setup to capture hand images in different poses from 7 subjects with different skin tones and hand shapes. In total, we captured 28,903 real hand images using a desktop webcam at a resolution of 640 × 480.

Synthetic Hand Image Generation: Our synthetic hand image dataset is a combination of the SynthHands dataset [20], which contains hand images from egocentric perspectives, and our own hand images rendered from various third-person perspectives. To generate the latter, we follow the standard strategy of state-of-the-art datasets [20, 63], where hand motions obtained via a hand tracker or a hand animation platform are retargeted to a kinematic 3D hand model.

Geometrically Consistent CycleGAN (GeoConGAN): While the above procedure allows the generation of a large number of synthetic training images with different hand pose configurations, training a hand joint regression network on synthetic images alone yields limited generalization, as we demonstrate in Section 4.1.

To address this, we train a network that converts synthetic images into "real" (or GANerated) images. Our translation network is based on CycleGAN [62], which uses an adversarial discriminator [9] to simultaneously learn cycle-consistent forward and reverse mappings. Cycle consistency means that the composition of the two mappings (in either direction) is the identity. In our case, we learn the mapping from synthetic to real images (synth2real) and from real to synthetic images (real2synth). Compared with many existing image-to-image or style transfer networks [13, 32], CycleGAN has the advantage that it does not require paired images, i.e., a given synthetic image need not have a real-image counterpart. This is crucial for our purposes, since such pairs are not available.

The architecture of GeoConGAN is illustrated in Figure 3. The inputs to the network are (cropped) synthetic and real images of hands on a white background, together with their respective silhouettes (foreground segmentation masks). At its core, GeoConGAN is similar to CycleGAN [62], with its discriminators, cycle consistency loss, and two trainable translators, synth2real and real2synth. However, unlike CycleGAN, we incorporate an additional geometric consistency loss (based on cross-entropy) to ensure that the images generated by the real2synth and synth2real components preserve the hand pose during image translation. Enforcing a consistent hand pose is crucial to ensure that the ground-truth joint positions of the synthetic image also apply to the "real" image generated by synth2real. Figure 4 shows the benefit of adding this new loss term.
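
For illustration, here is a minimal PyTorch sketch of how the generator-side objective described above could be assembled from the adversarial, cycle-consistency, and geometric-consistency terms. The network modules (synth2real, real2synth, the two discriminators, and SilNet) and the loss weights are placeholders, not the paper's exact configuration, and the separate discriminator updates are omitted.

```python
# Sketch of GeoConGAN's generator objective (assumptions: LSGAN-style adversarial loss,
# SilNet ends in a sigmoid, silhouettes are float masks in [0, 1], weights are made up).
import torch
import torch.nn.functional as F

def generator_loss(synth_img, real_img, synth_silhouette, real_silhouette,
                   synth2real, real2synth, disc_real, disc_synth, silnet,
                   w_cycle=10.0, w_geo=1.0):          # hypothetical loss weights
    fake_real = synth2real(synth_img)                 # synthetic -> "real" (GANerated)
    fake_synth = real2synth(real_img)                 # real -> synthetic

    # Adversarial terms: the generators try to fool the corresponding discriminators.
    pred_fr, pred_fs = disc_real(fake_real), disc_synth(fake_synth)
    adv = F.mse_loss(pred_fr, torch.ones_like(pred_fr)) + \
          F.mse_loss(pred_fs, torch.ones_like(pred_fs))

    # Cycle consistency: translating there and back should reproduce the input.
    cyc = F.l1_loss(real2synth(fake_real), synth_img) + \
          F.l1_loss(synth2real(fake_synth), real_img)

    # Geometric consistency: the silhouette predicted by the frozen, pre-trained SilNet
    # on the translated image must match the silhouette of the input image.
    geo = F.binary_cross_entropy(silnet(fake_real), synth_silhouette) + \
          F.binary_cross_entropy(silnet(fake_synth), real_silhouette)

    return adv + w_cycle * cyc + w_geo * geo
```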

To extract the silhouettes of the images generated by real2synth and synth2real (blue box in Fig. 3), we train a binary segmentation network, SilNet, based on a simple U-Net [31] with three stride-2 convolutions and three deconvolutions. Note that this is a relatively simple task, since the images have a white background. We chose a differentiable network rather than simple thresholding so that GeoConGAN can be trained end-to-end through the silhouette extraction. Our SilNet is pre-trained on a small disjoint subset of the data and kept fixed while training synth2real and real2synth. See the supplementary document for details.
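
The following is a minimal PyTorch sketch of a SilNet-like segmenter with three stride-2 convolutions and three deconvolutions; the channel counts and the skip connections are illustrative assumptions, not the paper's exact architecture.

```python
# Small U-Net-style silhouette network: 3 stride-2 convs down, 3 transposed convs up.
import torch
import torch.nn as nn

class SilNet(nn.Module):
    def __init__(self, ch=(16, 32, 64)):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, ch[0], 4, 2, 1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(ch[0], ch[1], 4, 2, 1), nn.ReLU(inplace=True))
        self.enc3 = nn.Sequential(nn.Conv2d(ch[1], ch[2], 4, 2, 1), nn.ReLU(inplace=True))
        self.dec3 = nn.Sequential(nn.ConvTranspose2d(ch[2], ch[1], 4, 2, 1), nn.ReLU(inplace=True))
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(ch[1] * 2, ch[0], 4, 2, 1), nn.ReLU(inplace=True))
        self.dec1 = nn.ConvTranspose2d(ch[0] * 2, 1, 4, 2, 1)

    def forward(self, x):
        e1 = self.enc1(x)                                  # 1/2 resolution
        e2 = self.enc2(e1)                                 # 1/4 resolution
        e3 = self.enc3(e2)                                 # 1/8 resolution
        d3 = self.dec3(e3)                                 # back to 1/4
        d2 = self.dec2(torch.cat([d3, e2], dim=1))         # skip connection, back to 1/2
        logits = self.dec1(torch.cat([d2, e1], dim=1))     # full resolution
        return torch.sigmoid(logits)                       # per-pixel hand probability

# Example: silhouettes for a batch of 256x256 RGB crops -> (2, 1, 256, 256).
probs = SilNet()(torch.randn(2, 3, 256, 256))
```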

Data Augmentation: After GeoConGAN training is complete, we feed all synthetically generated images into the synth2real component and obtain a set of "real" images with associated ground-truth 3D joint positions. We then perform background augmentation by compositing a GANerated image (foreground) onto a random image (background) using the foreground mask of the original synthetic image [53, 19, 28]. Similarly, we also composite randomly textured objects using the object masks produced when rendering the synthetic sequences [20]. Working on images without backgrounds or objects, and applying this augmentation as post-processing, greatly eases the task of GeoConGAN. Figure 5 shows some resulting images.
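
A minimal NumPy sketch of the background compositing described above; the images used in the example are random placeholders standing in for a GANerated hand, a random background, and the rendering mask.

```python
# Composite a foreground hand image onto a random background using a binary mask.
import numpy as np

def composite(hand_rgb: np.ndarray,        # (H, W, 3) uint8, hand on white background
              background_rgb: np.ndarray,  # (H, W, 3) uint8, random background image
              foreground_mask: np.ndarray  # (H, W), 1 = hand pixel, 0 = background
              ) -> np.ndarray:
    mask = foreground_mask.astype(np.float32)[..., None]               # (H, W, 1)
    out = mask * hand_rgb.astype(np.float32) + \
          (1.0 - mask) * background_rgb.astype(np.float32)
    return out.astype(np.uint8)

# Example with synthetic stand-in data.
hand = np.full((256, 256, 3), 255, dtype=np.uint8)
scene = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
mask = np.zeros((256, 256), dtype=np.uint8)
mask[64:192, 64:192] = 1
augmented = composite(hand, scene, mask)
```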

3.2 Hand joint regression

To regress the hand pose from (cropped) RGB hand images, we train a CNN, RegNet, which predicts the 2D and 3D positions of 21 hand joints. The 2D joint positions are represented as heatmaps in image space, and the 3D positions as coordinates relative to the root joint. We find that regressing 2D and 3D joints is complementary: the 2D heatmaps can represent uncertainty, while the 3D positions resolve depth ambiguities.

As shown in Figure 6, RegNet is based on a residual network consisting of 10 residual blocks of the ResNet50 architecture [11], as in [20]. Furthermore, we incorporate a (differentiable) module based on a projection layer (ProjLayer) to better couple the 2D and 3D predictions. The idea of ProjLayer is to perform an orthographic projection of the (preliminary) intermediate 3D predictions, from which 2D Gaussian heatmaps are rendered within the layer. These heatmaps are then used in the remainder of the network (conv) to obtain the final 2D and 3D predictions. In Figure 7a we show that this leads to improved results.

Training is based on a mixture of GANerated (Section 3.1) and synthetic images together with the corresponding ground-truth 3D joint positions. In total, the training set contains about 440,000 samples, 60% of which are GANerated. We empirically found that increasing this percentage does not further improve performance on real test data. We train RegNet with relative 3D joint positions, which we compute by normalizing the absolute 3D ground-truth joint positions such that the middle-finger metacarpophalangeal (MCP) joint is at the origin and the distance between the wrist joint and the middle MCP joint is 1. See the supplementary document for details.
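
For illustration, a minimal NumPy sketch of this root-relative normalization; the joint indices are assumptions for a generic 21-joint hand layout.

```python
# Normalize 3D joints: middle-finger MCP at the origin, wrist-to-middle-MCP distance = 1.
import numpy as np

WRIST, MIDDLE_MCP = 0, 9   # hypothetical indices in a 21-joint hand layout

def normalize_joints(joints_3d: np.ndarray) -> np.ndarray:
    """joints_3d: (21, 3) absolute 3D joint positions (e.g. in mm)."""
    rel = joints_3d - joints_3d[MIDDLE_MCP]                          # MCP at origin
    scale = np.linalg.norm(joints_3d[WRIST] - joints_3d[MIDDLE_MCP])
    return rel / scale                                               # unit wrist-MCP length

# Example on random joint positions.
normalized = normalize_joints(np.random.rand(21, 3) * 100.0)
```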

At test time, i.e. during hand tracking, the input to RegNet is a cropped RGB image whose (square) bounding box is derived from the 2D detections of the previous frame. In the first frame, the square bounding box is located at the center of the image, with a size equal to the height of the input image. Furthermore, we filter the output of RegNet with a 1€ filter [2] to obtain temporally smoother predictions.
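
A possible way to derive these crops is sketched below; the padding factor and the clamping behavior are assumptions, since the notes do not specify how the box is computed from the previous detections.

```python
# Square crop boxes: fixed central box for the first frame, joint-derived box afterwards.
import numpy as np

def initial_crop(image_w: int, image_h: int):
    """First frame: square crop centered in the image, side equal to the image height."""
    return (image_w - image_h) // 2, 0, image_h                      # (x0, y0, side)

def crop_from_previous_joints(joints_2d: np.ndarray, image_w: int, image_h: int, pad=1.25):
    """Later frames: square crop around the previous frame's 2D joint detections."""
    lo, hi = joints_2d.min(axis=0), joints_2d.max(axis=0)
    center = (lo + hi) / 2.0
    side = float((hi - lo).max()) * pad                              # hypothetical padding
    x0 = int(np.clip(center[0] - side / 2.0, 0, max(0.0, image_w - side)))
    y0 = int(np.clip(center[1] - side / 2.0, 0, max(0.0, image_h - side)))
    return x0, y0, int(side)

# Example: crop box from 21 detected 2D joints in a 640x480 frame.
box = crop_from_previous_joints(np.random.rand(21, 2) * [640, 480], 640, 480)
```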

3.3 Kinematic skeleton fitting

After obtaining the 2D joint predictions in the form of image-space heatmaps and the 3D joint coordinates relative to the root joint, we fit a kinematic skeleton model to these data. This ensures an anatomically plausible hand pose and allows us to retrieve the absolute hand pose, as described below. Furthermore, when processing a sequence of images, i.e. when tracking the hand, we can additionally impose temporal smoothness.

Kinematic hand model: Our kinematic hand model is shown in the skeleton fitting block of Figure 2. The model comprises a root joint (the wrist) and 20 finger joints, for a total of 21 joints; note that this count includes the fingertip joints, which do not have any degrees of freedom. Let $t \in \mathbb{R}^3$ and $R \in SO(3)$ (expressed as Euler angles for convenience) be the global position and rotation of the root joint, and let $\theta$ be the joint angles of the 15 one- or two-degree-of-freedom finger joints. We stack all parameters into $\Theta$. We denote by $\mathcal{M}(\Theta)$ the absolute 3D positions of all $J = 21$ hand joints (including the root and fingertips), where $\mathcal{M}_j(\Theta)$ is the position of the $j$-th joint; the positions of the non-root joints are computed by traversing the kinematic tree. Note that we use the camera coordinate system as the global coordinate system. To account for bone length variations between users, we perform a per-user bone adaptation: user-specific bone lengths are obtained by averaging relative bone lengths over 30 frames of 2D predictions while the user holds the hand parallel to the camera image plane. Despite the scale ambiguity inherent in RGB data, we can thus determine global 3D results, which is important for many applications and is not supported by previous work [63]. Furthermore, we obtain metrically accurate 3D results when the metric lengths of individual bones are provided. For model fitting, we minimize an energy composed of the terms described below.

2D fitting term: This term minimizes the distance between the hand joint positions projected onto the image plane and the heatmap maxima. It is given by $E_{2D}(\Theta) = \sum_{j=1}^{J} \omega_j \, \lVert \Pi(\mathcal{M}_j(\Theta)) - u_j \rVert_2^2$, where $u_j$ denotes the heatmap maximum of the $j$-th joint, $\omega_j > 0$ is a scalar confidence weight derived from the heatmap, and $\Pi$ is the projection from 3D space onto the 2D image plane based on the camera intrinsics. Note that this 2D term is essential for retrieving the absolute 3D position, since the 3D fitting term only considers root-relative articulation, as described below.

3D fitting term: This term encourages the hand articulation to match the predicted relative 3D joint positions; in addition, it resolves the depth ambiguities that remain when using 2D joint positions alone. We define it as $E_{3D}(\Theta) = \sum_{j=1}^{J} \lVert (\mathcal{M}_j(\Theta) - \mathcal{M}_{root}(\Theta)) - z_j \rVert_2^2$, where $z_j$ is the user-adapted position of the $j$-th joint relative to the root joint. It is computed from the RegNet output $x_j$ by rescaling each bone between joint $j$ and its parent $p(j)$ to the user-specific bone length, starting from $z_{root} = 0 \in \mathbb{R}^3$. The idea of using user-adapted positions is to avoid local minima caused by bone length inconsistencies between the hand model and the 3D predictions.

Joint angle limits: This term penalizes anatomically implausible hand articulations by preventing joints from bending too far. Mathematically, we define $E_{limit}(\Theta) = \lVert \max(0, \theta - \theta^{upper}) + \max(0, \theta^{lower} - \theta) \rVert_2^2$, where $\theta^{lower}$ and $\theta^{upper}$ are the lower and upper joint angle limits of the non-root degrees of freedom and the maximum is taken element-wise.

Temporal smoothness: This term penalizes deviations from constant velocity in $\Theta$. We formulate it as $E_{temp}(\Theta) = \lVert \nabla \Theta^{(t-1)} - \nabla \Theta^{(t)} \rVert_2^2$, where the velocity $\nabla \Theta$ of the pose parameters is approximated by finite (backward) differences.
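
For illustration, here is a minimal NumPy sketch of how the full fitting energy could be assembled from the four terms above, assuming the total energy (1) is a weighted sum. The forward kinematics function, the intrinsics K, and all weights are placeholders, not values from the paper.

```python
# Sketch of the combined fitting energy E = w2d*E2D + w3d*E3D + wlim*Elimit + wtemp*Etemp.
import numpy as np

def project(points_3d, K):
    """Perspective projection of (J, 3) camera-space points with 3x3 intrinsics K."""
    uvw = points_3d @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def fitting_energy(theta, forward_kinematics, K, heatmap_maxima, conf_weights,
                   target_rel_3d, theta_limits, theta_prev, theta_prev2,
                   w2d=1.0, w3d=1.0, wlim=1.0, wtemp=1.0, root=0):
    joints = forward_kinematics(theta)                       # (21, 3) absolute positions
    # 2D term: projected joints vs. heatmap maxima, weighted by heatmap confidence.
    e2d = np.sum(conf_weights * np.sum((project(joints, K) - heatmap_maxima) ** 2, axis=1))
    # 3D term: root-relative joints vs. user-adapted RegNet 3D prediction.
    e3d = np.sum(((joints - joints[root]) - target_rel_3d) ** 2)
    # Joint-limit term: quadratic penalty outside [lower, upper]
    # (use +/- inf for the unconstrained global translation/rotation components).
    lower, upper = theta_limits
    elim = np.sum(np.maximum(0.0, theta - upper) ** 2 + np.maximum(0.0, lower - theta) ** 2)
    # Temporal term: deviation from constant velocity via backward differences.
    etemp = np.sum(((theta - theta_prev) - (theta_prev - theta_prev2)) ** 2)
    return w2d * e2d + w3d * e3d + wlim * elim + wtemp * etemp
```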

Optimization: To minimize the total energy (1), we use a gradient-descent strategy. For the first frame, $\theta$ and $t$ are initialized to an open hand located at the center of the image, 45 cm from the camera plane. For the remaining frames, we initialize with the translation and joint parameters $t$ and $\theta$ of the previous frame. In our experiments we found that fast global hand rotations can lead to poor optimization results, corresponding to local minima of the non-convex energy landscape (1). To address this, we do not initialize the global rotation $R$ from the previous frame, but instead from the relative 3D joint predictions. Specifically, we exploit the observation that in the human hand the root joint and the four MCP joints of the non-thumb fingers form an (approximately) rigid structure. Therefore, to find the global rotation $R$, we solve $\min_{R \in SO(3)} \lVert R\,Y^{model} - Y^{pred} \rVert_F^2$, where $Y^{model}$ contains the (fixed) direction vectors of the hand model and $Y^{pred}$ the corresponding direction vectors of the current RegNet prediction. Both are of the form $Y = [y_{j_1}, \dots, y_{j_4}, n]$, where the $y_{j_i}$ are the (normalized) vectors pointing from the root joint to the respective non-thumb MCP joints $j_1, \dots, j_4$, and $n = y_{j_1} \times y_{j_4}$ is the (approximate) normal of the palm plane. $Y^{model}$ is computed from the 3D model points in world space with the model's global rotation set to identity; this is done only once per skeleton at the start of tracking. $Y^{pred}$ is computed in every frame by using the RegNet predictions $x_j$ in place of the model points. Although this problem (7) is non-convex, its global minimum can be computed efficiently because it is an instance of the orthogonal Procrustes problem [33, 48]: the global optimum is given by $R = UV^T$, where $U \Sigma V^T$ is the singular value decomposition of $Y^{pred}(Y^{model})^T$.
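
A NumPy sketch of this closed-form rotation initialization follows. The joint indices and the test harness at the end are illustrative assumptions; the Procrustes/SVD step itself is standard.

```python
# Orthogonal Procrustes initialization of the global rotation from palm directions.
import numpy as np

def palm_frame(joints, root=0, mcps=(5, 9, 13, 17)):   # hypothetical 21-joint indices
    """Build the 3x5 matrix [y_j1 ... y_j4, n] of palm directions from 3D joints."""
    y = joints[list(mcps)] - joints[root]
    y = y / np.linalg.norm(y, axis=1, keepdims=True)    # normalized root->MCP directions
    n = np.cross(y[0], y[-1])                           # approximate palm normal
    n = n / np.linalg.norm(n)
    return np.column_stack([y.T, n])                    # (3, 5)

def global_rotation(Y_model, Y_pred):
    """Solve min_R ||R @ Y_model - Y_pred||_F over SO(3) via SVD."""
    U, _, Vt = np.linalg.svd(Y_pred @ Y_model.T)
    R = U @ Vt
    if np.linalg.det(R) < 0:                            # keep a proper rotation (det = +1)
        U[:, -1] *= -1
        R = U @ Vt
    return R

# Example: recover a known rotation from noiseless directions.
rng = np.random.default_rng(0)
model_joints = rng.normal(size=(21, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R_true = Q if np.linalg.det(Q) > 0 else -Q
pred_joints = model_joints @ R_true.T
R_est = global_rotation(palm_frame(model_joints), palm_frame(pred_joints))  # ~ R_true
```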

4. Experiments

We quantitatively and qualitatively evaluate our method and compare our results with other state-of-the-art methods on various publicly available datasets. For this, we use the Percentage of Correct Keypoints (PCK) score, a popular metric for evaluating pose estimation accuracy. PCK counts a predicted keypoint as correct if it lies within a circle (2D) or sphere (3D) of a given radius around the ground truth.
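
As a concrete reference, a minimal NumPy sketch of a 3D PCK curve; the threshold range and the random data are illustrative only.

```python
# 3D PCK: fraction of predicted keypoints within a given radius of the ground truth.
import numpy as np

def pck_3d(pred, gt, thresholds_mm):
    """pred, gt: (N, J, 3) joint positions in mm; returns one PCK value per threshold."""
    dist = np.linalg.norm(pred - gt, axis=-1)                 # (N, J) per-joint errors
    return np.array([(dist <= t).mean() for t in thresholds_mm])

# Example: PCK curve from 20 mm to 50 mm on random stand-in data.
pred = np.random.rand(100, 21, 3) * 100.0
gt = pred + np.random.randn(100, 21, 3) * 10.0
curve = pck_3d(pred, gt, thresholds_mm=np.arange(20, 51, 5))
```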

4.1 Quantitative evaluation

Ablation study: In Figure 7a, we compare the accuracy obtained when training the joint regressor RegNet with different types of training data. Specifically, we compare using only synthetic images, synthetic images with color augmentation, and the combination of synthetic and GANerated images; for the latter, we also consider the additional use of the ProjLayer in RegNet. For color augmentation, we apply gamma correction with γ sampled uniformly from [0.25, 2]. We evaluated RegNet on the entire Stereo dataset [60] containing 12 sequences, while not training on any frame of this dataset. Training on purely synthetic data leads to poor accuracy (3D PCK@50mm ≈ 0.55). While color augmentation on synthetic images improves the results, our GANerated images significantly outperform standard augmentation techniques, achieving 3D PCK@50mm ≈ 0.80. This experiment validates training with GANerated images.
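
For illustration, a minimal sketch of the gamma-correction augmentation used in this ablation; the sampling range [0.25, 2] comes from the text, everything else is an assumption.

```python
# Random gamma-correction augmentation on uint8 RGB images.
import numpy as np

def random_gamma(image_uint8: np.ndarray, rng=np.random.default_rng()) -> np.ndarray:
    gamma = rng.uniform(0.25, 2.0)                     # gamma drawn uniformly from [0.25, 2]
    normalized = image_uint8.astype(np.float32) / 255.0
    return (np.power(normalized, gamma) * 255.0).astype(np.uint8)

augmented = random_gamma(np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8))
```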

Comparison with the state of the art: Figure 7b evaluates our detection accuracy on the Stereo dataset and compares it with existing methods [60, 63]. We follow the same evaluation protocol as [63], i.e., we train on 10 sequences and test on the other 2. Furthermore, [63] aligned their 3D predictions with the ground-truth wrist position, which we also do for fairness. Our method outperforms all existing methods. Furthermore, we tested our method without training on any sequence of the Stereo dataset and show that we still outperform some existing work (green curve in Fig. 7b), which demonstrates the generalization of our method. Figure 7c shows the 2D PCK (in pixels) on the Dexter+Object [40] and EgoDexter [20] datasets. We significantly outperform Zimmermann and Brox (Z&B) [63], which fails under difficult occlusions. Note that we cannot report 3D PCK because [63] only outputs root-relative 3D, and these datasets do not provide root joint annotations.

4.2 Qualitative evaluation

We qualitatively evaluate our method on three different video sources: publicly available datasets, live captures, and community (or legacy) videos (i.e., YouTube). Figure 8 presents qualitative results of Z&B [63] and our method on the Stereo [60], Dexter+Object [40], and EgoDexter [20] datasets. We provide reliable hand tracking even under severe occlusions, a significant improvement over [63] in these conditions. Although we already outperform Z&B [63] in the quantitative evaluation (Fig. 7c), we emphasize that this does not tell the full story, since the datasets of [40, 20] only provide annotations for visible fingertips due to the manual annotation process.

Therefore, errors at occluded joints are not reflected in the quantitative analysis at all. Since our method is explicitly trained to handle occlusions, in contrast to [63], our qualitative analysis in the supplementary video and in Figure 8 (columns 3-6) highlights our advantage in this case. We show real-time tracking results in Figure 9 and in the supplementary video. This sequence was tracked live using a regular desktop webcam in an office environment. Note how our method accurately recovers the full 3D articulated pose of the hand. In Figure 1, we demonstrate that our method also works on community or legacy RGB videos. In particular, we show 3D hand tracking on YouTube videos, which demonstrates the generalization ability of our method.

5. Limitations and Discussion

A difficult scenario for our method is a background with hand-like appearance: our RegNet then struggles to obtain good predictions, and tracking becomes unstable. This could be addressed with an explicit hand segmenter, similar to Zimmermann and Brox [63]. Furthermore, detection can become unreliable when multiple hands are close together in the input image. Thanks to our bounding-box tracker, our method can handle sufficiently separated hands; tracking interacting hands, or the hands of multiple people, is an interesting direction for future work. 3D hand tracking from purely 2D images is an extremely challenging problem. Although our real-time method advances the state of the art in RGB-only 3D hand tracking, there is still an accuracy gap between our results and existing RGB-D methods (our RGB-only method has an average error of ≈5 cm, while the RGB-D method of [40] has an average error of ≈2 cm on their Dexter+Object dataset). Nonetheless, we believe our method is an important step towards democratizing RGB-based 3D hand tracking.

6. Summary

Most existing research performs 2D hand tracking from monocular RGB, or uses additional input such as depth images or multi-view RGB for 3D hand motion tracking. The state-of-the-art method of Zimmermann and Brox [63] tackled monocular 3D hand tracking from RGB images; our proposed method addresses the same problem but goes a step further in several dimensions: it uses kinematic model fitting to obtain an absolute 3D hand pose, it is more robust to occlusions, and it generalizes better because we enrich our synthetic data to more closely resemble the distribution of real hand images. Our experimental evaluation demonstrates these advantages: our method significantly outperforms [63], especially in difficult occlusion scenarios. To further encourage future work on monocular RGB 3D hand tracking, we make our dataset available to the research community.

Supplementary material:

In this document, we provide details of RegNet and GeoConGAN networks (Section 1), additional quantitative evaluations (Section 2), and detailed visualizations of CNN RegNet outputs and final results (Section 3).

1. CNN and GAN details

1.1 GeoConGAN

Network Design: The architecture of GeoConGAN is based on CycleGAN [13], i.e., we train two conditional generators and two discriminator networks, one pair for synthetic and one for real images. Recently, methods have also been proposed that enrich synthetic images from unpaired data using only one generator and one discriminator. Both Shrivastava et al. [9] and Liu et al. [5] use an L1 loss (in addition to the common discriminator loss) between the conditioning synthetic input and the generated output, due to the lack of image pairs. This loss forces the generated image to be similar to the synthetic image in every respect, i.e., it may prevent the generator from producing realistic output if the synthetic data is not already close to the real distribution. Instead, we use a combination of cycle-consistency and geometric-consistency losses, which allows the generator to move away from the synthetic data, and thus closer to the real-world distribution, while preserving hand pose. Our GeoConGAN consists of ResNet generators and least-squares PatchGAN discriminator networks.

Training details: We train GeoConGAN in TensorFlow [1] for 20,000 iterations with a batch size of 8. We use the Adam optimizer [4] with a learning rate of 0.0002, β1 = 0.5, and β2 = 0.999.

1.2 RegNet

Projection Layer: Recent work on 3D body pose estimation integrates projection layers to train 3D pose prediction with 2D-only annotated data [2]. Since our training dataset provides perfect 3D ground truth, we use the projection layer solely as a refinement module that couples the 2D and 3D predictions. We project the intermediate relative 3D joint position predictions using an orthographic projection, where the origin of the 3D prediction (the middle MCP joint) is projected to the center of the rendered heatmap. Hence, our rendered heatmaps are also relative and do not necessarily correspond to the ground-truth 2D heatmap pixels, so we further process the rendered heatmaps before feeding them back into the main network branch. Note that the rendered heatmaps are differentiable with respect to the 3D predictions, which enables backpropagation of gradients through our ProjLayer.
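
The following PyTorch sketch illustrates a ProjLayer-like module: orthographically project root-relative 3D predictions (drop the z coordinate), place the root at the heatmap center, and render one Gaussian heatmap per joint. The heatmap resolution, projection scale, and Gaussian sigma are assumptions, and the subsequent processing of the rendered heatmaps is omitted.

```python
# Differentiable orthographic projection + Gaussian heatmap rendering for J joints.
import torch
import torch.nn as nn

class ProjLayer(nn.Module):
    def __init__(self, size=32, scale=10.0, sigma=1.5):
        super().__init__()
        self.size, self.scale, self.sigma = size, scale, sigma
        grid = torch.arange(size, dtype=torch.float32)
        self.register_buffer("ys", grid.view(1, 1, size, 1))   # pixel row coordinates
        self.register_buffer("xs", grid.view(1, 1, 1, size))   # pixel column coordinates

    def forward(self, rel_joints_3d):                    # (B, J, 3), root-relative
        # Orthographic projection: keep x/y, map the root (origin) to the heatmap center.
        uv = rel_joints_3d[..., :2] * self.scale + self.size / 2.0      # (B, J, 2)
        u = uv[..., 0].unsqueeze(-1).unsqueeze(-1)                      # (B, J, 1, 1)
        v = uv[..., 1].unsqueeze(-1).unsqueeze(-1)
        d2 = (self.xs - u) ** 2 + (self.ys - v) ** 2
        return torch.exp(-d2 / (2.0 * self.sigma ** 2))                 # (B, J, size, size)

# Example: heatmaps for a batch of 21 root-relative joint predictions (differentiable).
heatmaps = ProjLayer()(torch.randn(2, 21, 3) * 0.5)      # -> (2, 21, 32, 32)
```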

Training details: We train RegNet for 300,000 iterations with a batch size of 32 in the Caffe [3] framework. We use the AdaDelta [12] solver with an initial learning rate of 0.1, reduced to 0.01 after 150,000 iterations. All layers shared between our network and ResNet50 are initialized with weights from ImageNet pre-training. Both the 2D heatmap loss and the local 3D joint position loss are Euclidean losses, with loss weights of 1 and 100, respectively.
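
For illustration, a minimal PyTorch sketch of how the two weighted Euclidean losses could be combined (the paper uses Caffe; Caffe's exact EuclideanLoss normalization is not reproduced here).

```python
# Weighted sum of L2 losses on 2D heatmaps and root-relative 3D joint positions.
import torch
import torch.nn.functional as F

def regnet_loss(pred_heatmaps, gt_heatmaps, pred_joints_3d, gt_joints_3d,
                w_heatmap=1.0, w_3d=100.0):
    loss_2d = F.mse_loss(pred_heatmaps, gt_heatmaps, reduction="sum")
    loss_3d = F.mse_loss(pred_joints_3d, gt_joints_3d, reduction="sum")
    return w_heatmap * loss_2d + w_3d * loss_3d
```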

Computation time: In our real-time tracking system, the forward pass of RegNet takes 13 ms on a GTX 1080 Ti.

2. Comparison with RGB-D methods

3D tracking of hands in RGB-only images is an extremely challenging problem due to the inherent depth ambiguity of monocular RGB. Although our method improves the state of the art in RGB-only hand tracking, there is still a gap between RGB-only methods and RGB-D methods [6, 8, 10]. A quantitative analysis of this accuracy gap is shown in Figure 1, where we compare our results (dark blue) with the RGB-D method of Sridhar et al. [11] (red).

To better understand the source of errors, we performed an additional experiment in which we translated the global z position of our RGB results to best match the ground-truth depth. In Figure 1 we compare these depth-normalized results (light blue) with our original results (dark blue). A large part of the gap between RGB-based and RGB-D-based methods is due to an inaccurate estimate of the hand root position. Reasons for inaccurate root positions include a skeleton that does not exactly match the user's hand (in terms of bone lengths), and inaccuracies in the 2D predictions.
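
A minimal NumPy sketch of this depth-normalization experiment: shift each predicted hand along the camera z-axis so that its root depth matches the ground-truth root depth, keeping x/y and the articulation unchanged. The root index is an assumption.

```python
# Translate predicted joints along z so the root depth matches the ground truth.
import numpy as np

def normalize_depth(pred_joints: np.ndarray, gt_root_z: float, root: int = 0) -> np.ndarray:
    """pred_joints: (J, 3) in camera space; returns joints shifted along z only."""
    shifted = pred_joints.copy()
    shifted[:, 2] += gt_root_z - pred_joints[root, 2]
    return shifted

# Example with random stand-in predictions and a ground-truth root depth of 450 mm.
depth_normalized = normalize_depth(np.random.rand(21, 3) * 100.0, gt_root_z=450.0)
```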

3. Detailed qualitative evaluation

In Figures 2 and 3 we qualitatively evaluate each intermediate stage of the tracking solution as well as the final results. Figure 2 shows results on the EgoDexter dataset [6], with subjects grasping different objects in an office environment, and Figure 3 shows results on community videos downloaded from YouTube. In both figures we visualize: the heatmap maxima of the 2D joint detections (first row); the root-relative 3D joint detections (second row); the globally tracked 3D hand projected onto the camera plane (third row); and the globally tracked 3D hand visualized in a virtual scene together with the original camera frustum (fourth and fifth rows). See the supplementary video for full sequences.

 Paper link:

https://handtracker.mpi-inf.mpg.de/projects/GANeratedHands/content/GANeratedHands_CVPR2018.pdf

Supplementary link to the paper:

https://handtracker.mpi-inf.mpg.de/projects/GANeratedHands/content/GANeratedHands_CVPR2018_Supp.pdf

Links to data sets and other materials:

GANerated Hands Dataset