Feature extraction series based on deep learning (1): GCNv2 paper translation

I am studying deep-learning-based feature extraction and recording the papers I read along the way. Working in a task-driven manner makes the reading more efficient.

The content of the article will be modified from time to time!

Paper link

0 Abstract

This paper proposes GCNv2, a deep learning-based network for generating keypoints and descriptors. GCNv2 is built on our previous method GCN, a network trained for 3D projective geometry. GCNv2 uses a binary descriptor vector in the same format as the ORB feature, so that it can easily replace ORB in systems such as ORB-SLAM2. GCNv2 significantly improves computational efficiency over GCN, which could only run on desktop hardware. We show how a modified version of ORB-SLAM2 using GCNv2 features runs on the embedded low-power platform Jetson TX2. Experimental results show that GCNv2 retains accuracy comparable to GCN and is robust enough to be used for control of a flying drone. The source code is available at: https://github.com/jiexiong2016/GCNv2_SLAM

1 Introduction

The ability to estimate position is key to most applications involving mobile robots. This paper focuses on visual odometry (VO), the problem of estimating relative motion from visual information. It is the cornerstone of vision-based SLAM systems, such as the one used in our demonstration. As in our previous work [1], we only use an RGB-D sensor to estimate motion, and our target platform is a drone operating in indoor environments. An RGB-D sensor makes scale directly observable without the computational cost of visual-inertial fusion or of inferring depth with a neural network, as in [2], [3], [4]. This increases robustness, a key attribute for drones, especially indoors, where the margins for error are small and the environment is often less textured than outdoors. The method is suitable for any system that has an RGB-D sensor and does not require elaborate calibration and synchronization with other sensors. In contrast, fusion with other sensors, if desired, can happen at a slower rate and with less precise timing, which, for example, simplifies integration with a drone's flight control system.

Deep learning-based methods are a clear trend in SLAM research. SuperPoint, a CNN-based keypoint detection and description method, was proposed in [5]. Experimental results show that SuperPoint's descriptors are stronger than classic ones such as SIFT, while its detector is on par with classic methods. In our previous work [1], we introduced the geometric correspondence network GCN, specifically tailored to produce keypoints for camera motion estimation, achieving better accuracy than classical methods. However, due to the computational requirements of GCN and its multi-frame matching setup, it is difficult to achieve real-time performance in a fully operational SLAM system, e.g. on a drone: both keypoint extraction and matching are computationally expensive. Indeed, integrating deep learning into SLAM systems in performance-constrained environments was identified as an open problem in [6].

In this paper, we introduce GCNv2, built on the conclusions of [1], to improve computational efficiency while maintaining the high accuracy of GCN. We remove the multi-frame setup and instead predict on a single frame at a time. Our contributions are:
(1) GCNv2 maintains accuracy comparable to GCN, which achieved significant improvements for motion estimation, while significantly shortening inference time compared with related deep learning-based feature extraction methods.
(2) We include binarization of the feature vectors in training, which greatly speeds up matching. We designed GCNv2 to have the same format as ORB, so that it can be used directly as a drop-in keypoint extractor in SLAM systems such as ORB-SLAM2 [7] or SVO2 [8].
(3) We demonstrate effectiveness and robustness by using GCN-SLAM to control a real drone, and show that it handles failure cases of ORB-SLAM2. GCN-SLAM runs in real time on embedded low-power hardware such as the Jetson TX2, unlike GCN, which requires a desktop GPU for real-time inference.
Note: The contributions cover three aspects: objective performance improvements; a modular design that makes it easy to embed into existing systems; and system-level verification.

2 Related work

In this section, we review related work in two areas: first VO and SLAM methods, and then, with particular focus, deep learning-based methods for image correspondence.

A. VO and SLAM
A. VO and SLAM
In direct methods for VO and SLAM, motion is estimated by aligning frames directly based on pixel intensities, with [9] being an early example. DVO (Direct Visual Odometry), introduced in [10], adds a pose graph to reduce accumulated error. DSO [11] is a direct and sparse method that adds joint optimization of all model parameters. An alternative to frame-to-frame matching is to match each new frame to a volumetric representation, as in KinectFusion [12], Kintinuous [13] and ElasticFusion [14].

In indirect methods, the first step in a typical pipeline is to extract keypoints and then match them with the previous frame to estimate motion. Matching is based on keypoint descriptors and geometric constraints. The state-of-the-art in this category is still defined by ORB-SLAM2 [15], [7]. The ORB descriptor is a binary vector that allows high-performance matching.

Semi-direct methods lie between direct and indirect methods. SVO2 [8] is a sparse method in this category and can operate at hundreds of Hz. LSD-SLAM [16] was the first semi-dense method. RGBDTAM [17] combines a semi-dense photometric error and a dense geometric error for pose estimation.

Recently, many deep learning-based mapping systems have appeared, e.g. [18], [19]. These methods use deep learning-based single-view depth estimation to reduce the scale drift inherent in monocular systems. CNN-SLAM [18] feeds CNN depth predictions into LSD-SLAM. In DVSO [20], depth is predicted in a manner similar to [2], using a virtual stereo view. CodeSLAM [21] learns an optimizable representation for 3D reconstruction using a conditional autoencoder. In S2D [22], we build on DSO [11] and exploit depth predictions from a CNN within the joint optimization. There is also work on unsupervised learning of motion estimation: an image reconstruction loss is used for unsupervised learning in [4], [23]. However, geometry-based optimization methods still outperform end-to-end systems, as shown in [20].

B. Deep Correspondence Matching
B. Deep Correspondence Matching
There is a large body of recent work on training deep features for finding image correspondences using variants of metric learning [24], [25], [26], [27], [28], [29], [30], [31], [5]. The work in [32], [33] focuses on improving learning-based detection with better invariance. Targeting a different aspect, [34], [35], [36] generate training samples in a self-supervised manner to improve general feature matching.

Among the above methods, LIFT [30] uses a patch-based approach that performs both keypoint detection and descriptor extraction. SuperPoint [5] predicts keypoints and descriptors in a single network, trained with a self-supervised strategy as in [36]. It is worth noting that, as reported in [5], the performance of [5], [30], [31] is comparable to classical methods such as SIFT for motion estimation.

In GCN [1], we improved performance by learning keypoints and descriptors specifically for motion estimation, in contrast to reports on more general deep learning-based keypoint extractors [5], [31]. In this paper, we introduce a high-throughput variant of GCN, called GCNv2. We demonstrate the applicability of these keypoints to SLAM by building on ORB-SLAM2, since it provides a comprehensive, multi-threaded, state-of-the-art indirect SLAM system supporting monocular as well as RGB-D cameras. ORB-SLAM2 couples a tracking frontend with a backend that performs pose graph optimization, implemented with g2o [37] and using a binary bag-of-words model [38]. To simplify integration, we design the GCNv2 descriptor to have the same format as ORB.

3 GEOMETRIC CORRESPONDENCE NETWORK

In this section, we introduce the design of GCNv2, aiming to make GCN suitable for real-time SLAM applications running on embedded hardware. We first introduce the revised network structure and then detail the training methods for binarized feature descriptors and keypoint detectors.

A. Network Structure
A. Network Structure
The original GCN structure proposed in [1] consists of two main parts: an FCN [39] with a ResNet-50 backbone and a bidirectional recurrent convolutional network. The FCN performs dense feature extraction, while the bidirectional recurrent network locates the keypoints. [1] showed that GCN has impressive tracking performance compared to existing methods, but it was also noted that the algorithm has practical limitations for real-time SLAM systems. We identified two main problems: first, the network architecture is relatively large and requires powerful computing hardware, making it impossible to run in real time on embedded boards such as the Jetson TX2 used in our UAV experiments; second, GCN inference requires two or more frames as input at the same time, which not only increases computation but also increases the complexity of the algorithm.

To address these limitations, we introduce GCNv2, a simplified network structure based on a single view, for improved efficiency. The overall structure of the GCNv2 network is shown in Figure 2. Like GCN, GCNv2 predicts keypoints and descriptors simultaneously, i.e. the network outputs a probability map of keypoint confidence and a dense feature map for the descriptors. Inspired by SuperPoint [5], GCNv2 performs its predictions at a lower resolution than the raw image and uses only a single image. It first predicts the probability map and the dense feature map at low resolution, then pixel-shuffles the 256-channel probability map back to the original resolution, and finally performs non-maximum suppression on the full-resolution probability map, sampling the corresponding feature vectors from the dense feature map at the resulting keypoint locations (right part of Figure 2).

Figure 2 (GCNv2 network structure; image omitted)
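To make the decoding step above concrete, here is a minimal PyTorch sketch of how a 256-channel cell-wise confidence map and a low-resolution descriptor map could be turned into keypoints and descriptors. This is not the authors' code; the tensor shapes, NMS window, and number of keypoints are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def decode_keypoints(prob_cells, desc_lowres, num_kpts=1000, nms_kernel=5):
    """prob_cells:  (1, 256, H/16, W/16) cell-wise keypoint confidences
       desc_lowres: (1, C, H/16, W/16)   dense descriptor map"""
    # Pixel-shuffle the 256 channels back to a full-resolution probability map.
    prob = F.pixel_shuffle(prob_cells, upscale_factor=16)        # (1, 1, H, W)

    # Simple non-maximum suppression: keep only local maxima of the confidence map.
    local_max = F.max_pool2d(prob, nms_kernel, stride=1, padding=nms_kernel // 2)
    prob = prob * (prob == local_max).float()

    # Take the top-scoring locations as keypoints.
    H, W = prob.shape[-2:]
    scores, idx = prob.flatten().topk(num_kpts)
    ys = torch.div(idx, W, rounding_mode="floor")
    xs = idx % W

    # Sample descriptors from the low-resolution feature map at keypoint locations.
    grid = torch.stack([xs.float() / (W - 1) * 2 - 1,
                        ys.float() / (H - 1) * 2 - 1], dim=-1).view(1, 1, -1, 2)
    desc = F.grid_sample(desc_lowres, grid, align_corners=True)  # (1, C, 1, N)
    desc = F.normalize(desc.squeeze(2).squeeze(0).t(), dim=1)    # (N, C)
    return torch.stack([xs, ys], dim=-1), scores, desc
```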

GCNv2, together with GCN-SLAM, our adaptation of ORB-SLAM2, runs at approximately 80 Hz on a laptop with an Intel i7-7700HQ and a mobile NVIDIA 1070. To reach the higher frame rates required for real-time inference on the Jetson TX2, we introduce a smaller version of GCNv2, called GCNv2-tiny, in which the number of feature maps is cut in half from conv2 onwards. GCNv2-tiny runs at 40 Hz, and GCN-SLAM using it runs at 20 Hz on the TX2, making it well suited for deployment on a drone. Beyond the mobile platforms, a more comprehensive comparison of inference and matching times on a desktop computer is shown in Figure 3. It can be seen that inference time grows roughly quadratically with input resolution. At the same resolution, GCNv2 achieves a shorter inference time than GCN and SuperPoint, mainly due to the modifications we made to the network architecture. More details of the GCNv2 and GCNv2-tiny networks are available in our publicly released source code.

Figure 3 (comparison of inference and matching times; image omitted)

B. Feature Extractor
Keywords: metric learning; triplet loss
L_feat, the descriptor loss, is a triplet loss over corresponding and non-corresponding descriptor pairs (equation image not reproduced here).
Update position: ground-truth correspondences are obtained by warping keypoint positions into the other frame using the relative camera pose and the depth (equation image not reproduced here).
Binarized features: the descriptors are binarized during training so that matching can use fast binary comparisons, as with ORB (equation image not reproduced here).
Annotation: for the details, see the GCNv1 paper and the keyword references listed above.
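As a rough illustration of the descriptor objective described above (a triplet loss with margin m computed on binarized features), the following PyTorch sketch shows one plausible formulation. The straight-through sign binarization and the omission of the relaxed hard-negative mining (criterion c) are simplifying assumptions, not the paper's exact layer.

```python
import torch
import torch.nn.functional as F

class BinaryActivation(torch.autograd.Function):
    """Sign binarization with a straight-through gradient (an assumption;
    the paper's exact binary layer may differ)."""
    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # straight-through estimator

def triplet_feat_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss on binarized descriptors; margin m = 1 as stated in the
    training details below. Inputs are (N, C) float descriptor batches."""
    a = BinaryActivation.apply(anchor)
    p = BinaryActivation.apply(positive)
    n = BinaryActivation.apply(negative)
    d_pos = (a - p).pow(2).sum(dim=1)   # distance to the matching descriptor
    d_neg = (a - n).pow(2).sum(dim=1)   # distance to a non-matching descriptor
    return F.relu(d_pos - d_neg + margin).mean()
```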

C. Distributed Keypoint Detector
Like GCN, we treat keypoint detection as a binary classification problem. The training target of the probability map is a binary mask whose values indicate whether or not a pixel is a keypoint. Weighted cross-entropy is then used as the objective (loss) function for training, and the loss is always evaluated on two consecutive frames to enhance the consistency of the extracted keypoints.
Keypoint detection loss (equation image not reproduced here):
where α1 and α2 are weights that handle the class imbalance, preventing the loss from being dominated by non-keypoint pixels. We generate ground truth by detecting Shi-Tomasi corners in a 16 × 16 grid and warping them to the next frame using equation (3). This yields a better distribution of keypoints, and the objective function directly reflects the ability to track keypoints based on texture.
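A minimal sketch of a weighted cross-entropy of the kind described here is given below. The α values [0.1, 1.0] come from the training details in the next subsection; the exact normalization used in the paper may differ.

```python
import torch

def weighted_detector_loss(logits, target_mask, alpha1=0.1, alpha2=1.0):
    """Weighted cross-entropy over the keypoint probability map.
    logits and target_mask have the same shape, e.g. (B, 1, H, W);
    target_mask is a 0/1 mask (1 = keypoint pixel). alpha1 down-weights the
    dominant non-keypoint class, alpha2 weights the keypoint class."""
    prob = torch.sigmoid(logits)
    eps = 1e-6
    loss_pos = -alpha2 * target_mask * torch.log(prob + eps)
    loss_neg = -alpha1 * (1 - target_mask) * torch.log(1 - prob + eps)
    return (loss_pos + loss_neg).mean()
```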

D. Training Details
The final training loss is a weighted combination of the two losses, with L_feat weighted by 100 and L_det weighted by 1; the weights balance the scales of the two terms. The margin m in the triplet loss is set to 1. The relaxed criterion c for exhaustive negative sample mining is set to 8. The cross-entropy weights [α1, α2] are set to [0.1, 1.0]. We use the Adam optimizer with an initial learning rate lr = 1e-4, halved every 40 epochs, for a total of 100 epochs. The weights of GCNv2 are randomly initialized from a uniform distribution.
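The schedule above can be written down as a short PyTorch sketch. The model here is a stand-in placeholder (GCNv2 itself is not reproduced); only the optimizer, learning-rate schedule, epoch count and loss weights are taken from the text.

```python
import torch

# Stand-in model; in practice this would be the GCNv2 network.
model = torch.nn.Conv2d(1, 32, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate every 40 epochs; train for 100 epochs in total.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.5)

w_feat, w_det = 100.0, 1.0  # loss weights from the text

for epoch in range(100):
    # In the real training loop: loss = w_feat * L_feat + w_det * L_det per batch.
    loss = model(torch.zeros(1, 1, 32, 32)).mean()  # dummy stand-in loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```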

4 GCN-SLAM

For a keypoint-based SLAM system, one of the most important design choices is the keypoint extractor, since the keypoints are reused at several stages of the system. The ORB feature in ORB-SLAM2 is a robust choice because it is faster to compute than other keypoint extractors of comparable quality and has a compact descriptor that allows fast matching.

As shown in our previous work [1], GCN with a very simple pose estimator can perform on par with, or better than, ORB-SLAM2. Note that this was a bare-bones SLAM system with no pose graph optimization, no global bundle adjustment, and no loop closure detection. Combining GCN with these components into a complete SLAM system is therefore likely to produce even better results. However, GCN is too expensive a component to use in a real-time system on our embedded hardware. Below, we explain how we integrate GCNv2 into ORB-SLAM2 and name the resulting system GCN-SLAM.

Motion estimation in ORB-SLAM2 is based on tracking keypoints from frame to frame and solving a feature-based bundle adjustment problem. We briefly describe its feature detection and description. ORB-SLAM2 achieves multi-scale feature detection by running a single-scale algorithm on a scale pyramid, i.e. on iteratively downscaled copies of the input image. At each scale level, the FAST corner detector is applied in a 30×30 grid. If nothing is detected within a cell, FAST is run again with a lowered threshold. All detections from all cells at a given pyramid level are then collected, and a spatial partitioning algorithm prunes keypoints first by image coordinates and then by detection score. Typically about 1000 keypoints are retained per image, and an orientation is computed for each keypoint. Finally, each level of the scale pyramid is filtered with a Gaussian blur, and a 256-bit ORB descriptor is computed for each keypoint at its level based on the blurred image.
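For reference, a simplified Python/OpenCV sketch of the grid-based FAST detection described above is shown below. It covers a single scale level only; the cell size and the two thresholds are illustrative, and ORB-SLAM2's actual C++ implementation differs in detail.

```python
import cv2

def grid_fast(gray, cell=30, thr_high=20, thr_low=7):
    """gray: 2-D uint8 grayscale image. Detect FAST corners per grid cell,
    retrying with a lower threshold if a cell yields no detections."""
    fast_high = cv2.FastFeatureDetector_create(threshold=thr_high)
    fast_low = cv2.FastFeatureDetector_create(threshold=thr_low)
    keypoints = []
    h, w = gray.shape
    for y in range(0, h, cell):
        for x in range(0, w, cell):
            patch = gray[y:y + cell, x:x + cell]
            kps = fast_high.detect(patch, None)
            if not kps:                       # retry with a lowered threshold
                kps = fast_low.detect(patch, None)
            for kp in kps:
                kp.pt = (kp.pt[0] + x, kp.pt[1] + y)  # shift back to image coords
                keypoints.append(kp)
    return keypoints

# Descriptors could then be computed with OpenCV's ORB on the blurred image, e.g.:
# orb = cv2.ORB_create()
# kps, desc = orb.compute(cv2.GaussianBlur(gray, (7, 7), 2), keypoints)
```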

Our method computes keypoint locations and descriptors simultaneously in a single forward pass of the network. As mentioned before, the end result is intended to be a direct replacement for the ORB feature extractor described above. These two key point extraction methods are shown in Figure 4.

Figure 4 (comparison of the two keypoint extraction pipelines; image omitted)
Once keypoints and their corresponding descriptors have been found, ORB-SLAM2 relies mainly on two frame-to-frame tracking methods: first, it assumes constant velocity and projects the keypoints of the previous frame into the current frame; if that fails, the keypoints of the current frame are matched against the last created keyframe using bag-of-words similarity. We disabled the former, so that only keypoint-based reference-frame tracking is used. We also replaced the matching algorithm with a standard nearest-neighbour search in our experiments. These modifications were made so that we examine the performance of our keypoint extraction method rather than the other tracking heuristics of ORB-SLAM2.
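Since the descriptors share ORB's 256-bit binary format, the nearest-neighbour matching mentioned above can be done with a brute-force Hamming matcher. The sketch below is only an illustration, and the distance threshold is an arbitrary assumption.

```python
import cv2
import numpy as np

def match_descriptors(desc_query, desc_train, max_distance=64):
    """desc_query, desc_train: uint8 arrays of shape (N, 32), i.e. packed
    256-bit binary descriptors. Returns cross-checked nearest-neighbour
    matches with a Hamming distance below max_distance (illustrative)."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc_query, desc_train)
    return [m for m in matches if m.distance < max_distance]
```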

Finally, we keep the loop closure detection and pose graph optimization of ORB-SLAM2, except that we adapt them to the GCNv2 feature descriptor by computing a new bag-of-words vocabulary on the training dataset introduced in Section VA.

5 Experimental results

In this section, we present experimental results to support our claims about the performance of the keypoint extraction method and its use in the GCN-SLAM system. Our work is not intended as a replacement for ORB-SLAM2, but rather as a keypoint extraction method that is: i) tailored for motion estimation, ii) computationally efficient, and iii) suitable for use in SLAM systems. Section VB presents quantitative results benchmarking GCN-SLAM against ORB-SLAM2 and some related methods. Section VC qualitatively compares our approach with ORB features by using both in the same SLAM framework, as described in Section IV.

For quantitative experiments, a laptop equipped with an Intel i7-7700HQ processor and a mobile NVIDIA 1070 was used for evaluation. For qualitative experiments and real-life scenarios, we use an NVIDIA Jetson TX2 embedded computer for processing and an Intel RealSense D435 RGB-D camera sensor on a custom drone (see Figure 1).

A. Training Data
The original GCN was trained on the fr2 sequences of the TUM dataset [44], which provides accurate camera poses from a motion capture system. For GCNv2, we train the network on a subset of the SUN-3D [45] dataset created in recent work [22]. SUN-3D contains millions of RGB-D images recorded in a variety of typical indoor environments. A total of 44624 frames were extracted, roughly one frame per second. SUN-3D is a very rich dataset, and its diversity should lead to a more general network. However, the provided ground-truth poses are estimated by visual tracking with loop closure, and are therefore relatively accurate in a global sense but suffer from misalignment at the frame level. To account for this local error, we extract SIFT features and use the provided pose as an initial estimate in a bundle adjustment problem to update the relative pose of each frame pair. In this sense, GCNv2 is trained using self-annotated data from an RGB-D camera.

B. Quantitative Results
For comparison with the original GCN, we selected the same TUM dataset sequences as in [1] and evaluate tracking performance with both open-loop and closed-loop systems. We use the absolute trajectory error (ATE) [44] as the metric. Since we trained GCNv2 on a different dataset than the original GCN [1], we also show, for comparison, results from a version closer to the original recurrent structure. To this end we created GCNv2-large, with ResNet-18 as the backbone and deconvolutional upsampling of the feature maps; its bidirectional feature detector is moved to the lowest scale, as in the other two versions of GCNv2.

The frame-to-frame tracking results are shown in Table I. The columns to the left of the double vertical line are from [1], where 640 × 480 images were used; the columns to the right use images at half resolution, i.e. 320 × 240, since this is the resolution we use on the drone. The results are consistent with those reported in [5]: the performance of SuperPoint is comparable to classic methods such as SIFT, while GCNv2 is close to GCN and significantly better than SuperPoint. The performance of GCNv2 is comparable to GCN, or even slightly better in some cases, probably due to the larger training dataset discussed in Section VA. The exceptions (i.e. the cases where GCNv2 does worse than GCN; dashes denote tracking failures) are fr1_floor and fr1_360. These sequences require fine details, and since GCNv2 performs detection and descriptor extraction on lower-scale feature maps, performance suffers accordingly. This is supported by the fact that GCNv2-large successfully tracks fr1_360. Finally, we note that the smaller version, GCNv2-tiny, performs only slightly worse than GCNv2. (Translator's note: it looks like more than slightly...)
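For readers unfamiliar with the metric, the following NumPy sketch computes an ATE RMSE in the spirit of the TUM benchmark [44], assuming the estimated and ground-truth trajectories are already time-associated. It is a simplified illustration, not the official evaluation script.

```python
import numpy as np

def ate_rmse(gt_xyz, est_xyz):
    """gt_xyz, est_xyz: (N, 3) arrays of associated positions.
    Rigidly aligns the estimate to the ground truth (Horn/Umeyama, no scale)
    and returns the RMSE of the remaining position errors."""
    gt_mean, est_mean = gt_xyz.mean(axis=0), est_xyz.mean(axis=0)
    gt_c, est_c = gt_xyz - gt_mean, est_xyz - est_mean

    # Closed-form rotation via SVD of the cross-covariance matrix.
    H = est_c.T @ gt_c
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    aligned = est_c @ R.T + gt_mean
    err = np.linalg.norm(aligned - gt_xyz, axis=1)
    return np.sqrt((err ** 2).mean())
```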

Table I (frame-to-frame tracking ATE on TUM sequences; image omitted)

In Table II, we compare the closed-loop performance of GCN-SLAM with our previous work, as well as with ORB-SLAM2, ElasticFusion and RGBDTAM. GCN-SLAM successfully tracks all sequences with errors comparable to GCN, while ORB-SLAM2 fails on two of them. On fr1_360, which contains fast rotation, the error of GCN-SLAM is smaller than that of ORB-SLAM2. It is also worth noting that, on this particular sequence, the original GCN performed significantly better than both ORB-SLAM2 and GCN-SLAM. ORB-SLAM2 tracks well on all other sequences, and the errors of both GCN-SLAM and ORB-SLAM2 are small. These results are particularly encouraging because GCN-SLAM does not use the ORB-specific feature matching heuristics present in ORB-SLAM2, leaving room for further performance improvements.
Comment: I see this somewhat differently, but since the paper reports that all the sequences above are tracked successfully, the result still has merit.

Table II (closed-loop ATE comparison; image omitted)

C. Qualitative Results
To further validate the robustness of GCNv2 in a practical SLAM setting, we present results on four datasets collected in our own environment under different conditions: a) walking along a corridor, turning 180 degrees and walking back, with a handheld camera; b) walking in a circle in an outdoor parking lot with the handheld sensor during daylight; c) flying in an alcove with a window and rotating 180 degrees; d) flying in a kitchen and rotating 360 degrees while using GCN-SLAM for positioning.

Since there is no ground truth for the datasets we collected, these results can only be interpreted qualitatively (qualitative results are intuitive illustrations of the behaviour) and as a complement to the quantitative results in Section VB. These datasets were chosen to show that our method handles difficult scenarios, is robust, and can be used for real-time localization of a drone. Figure 5 shows the trajectories estimated by GCN-SLAM using ORB and using GCNv2 keypoints. Note that both methods are evaluated in exactly the same tracking pipeline (translator's note: "pipeline" is a little awkward to translate; the essence is that it is exactly the same experiment, like different media flowing through the same pipe), so that the comparison is fair, i.e. GCNv2 versus ORB features is the only difference. See the source code for the exact details. In Figure 5a, ORB features cannot cope with the 180-degree turn in the upper right corner of the trajectory. In Figure 5b, tracking with ORB fails almost immediately. Figures 5c and 5d further show that using GCN-SLAM as the basis for drone control improves performance: in Figure 5c, only an optical flow sensor is used for position estimation, while in Figure 5d GCN-SLAM is used as the localization source. The drone clearly holds its position better, and the latter trajectory is less noisy. In all four datasets, tracking is maintained with GCNv2 but lost with ORB. We use a remote control to send setpoints to the flight control unit on the drone, using its built-in position hold mode.

Figure 5 (estimated trajectories on the four collected datasets; image omitted)

In Figure 6, we further compare our keypoint extractor with the ORB keypoint extractor. We plot the number of inliers during local-map tracking in our SLAM system, first using ORB keypoints and then using GCNv2 keypoints. As the figure shows, although there are more ORB features, our method has a higher inlier ratio. In addition, as shown in Figure 1, the GCNv2 features are better distributed over the image than ORB.

Figure 6 (inlier counts during local map tracking, ORB vs GCNv2; image omitted)

Figure 1 (the custom drone and keypoint distribution of GCNv2 vs ORB; image omitted)

6 Conclusions

In our previous work [1], we found that GCN outperforms existing deep learning and classical methods in visual tracking. However, GCN cannot be directly deployed in a real-time SLAM system in an efficient manner due to its computational requirements and its use of multiple image frames. In this paper, we address these issues by proposing a smaller and more efficient version of GCN, termed GCNv2, which can easily be adapted into existing SLAM systems. We demonstrate that GCNv2 can be used effectively in a modern feature-based SLAM system to achieve state-of-the-art tracking performance. The robustness and performance of the method are verified by integrating GCNv2 into GCN-SLAM and using it for localization on our drone.

Limitations
GCNv2 is trained to predict projective geometry rather than general feature matching; this is an intentionally limited scope. As with other learning-based methods, generalization is an important factor. GCNv2 works relatively well in outdoor scenes, as our experiments show (see Figure 5b), even though the training dataset contains no outdoor data, and performance in such environments could likely be improved further. Here we targeted indoor environments and did not investigate outdoor environments further.

Future Work
In the future, we are interested in leveraging semantic information to reject outliers using higher-level cues and in fusing this information into the motion estimation to improve the capabilities of our system, especially in environments with non-static objects. We would also like to investigate training GCN in a self-supervised or unsupervised manner, to enable our system to improve itself online over time.

Afterword

Finally finished. It took more than two days, but it was also a new kind of attempt for me.

Maybe for experts who read papers regularly, my blog post is of little value, since it is essentially machine translation plus corrections. But for me, getting it done is the point. I may rework it later and elaborate on some of the key information in the paper.

Reference links:

https://blog.csdn.net/NolanTHU/article/details/123815708

GCNv1

paper

Paper translation

Understand the basics of RNN (Recurrent Neural Network) in one article

triplet loss

Revision history

2023.9.8 Modified some layout issues
2023.9.10 Added some links to GCNv1 related knowledge
2023.9.11 Improved Chapter 3


Origin blog.csdn.net/private_Jack/article/details/132686058