Multimodal feature fusion based on RGB-D images


Fusion approach

The spatial information of a depth map usually takes one of two forms: distance information or HHA-encoded information. Distance information describes the distance between the surface of the target object and the capture device; it reflects the spatial position relationships of objects in the scene and usually enters the computation as a single-channel image. HHA encoding is a spatially extended representation of the distance information, consisting of the horizontal disparity, the height above ground, and the angle the surface normal makes with the gravity direction at each point; it usually enters the computation as a three-channel image. Distance information has a simple representation and is easy to use. In contrast, HHA encoding is more complex to compute and consumes considerably more computing resources, but it allows the network to extract richer depth feature information.
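As a minimal sketch (not from the original post), the two depth representations might be prepared as network inputs roughly as follows; the normalization range, array layouts, and PyTorch usage here are assumptions for illustration:

```python
# Hedged sketch: a normalized single-channel distance map versus a
# precomputed three-channel HHA image, both converted to CHW tensors.
import numpy as np
import torch

def depth_to_tensor(depth_mm: np.ndarray, max_depth_mm: float = 10000.0) -> torch.Tensor:
    """Normalize a raw distance map (H, W) in millimetres to a (1, H, W) tensor."""
    depth = np.clip(depth_mm, 0, max_depth_mm) / max_depth_mm
    return torch.from_numpy(depth).float().unsqueeze(0)          # (1, H, W)

def hha_to_tensor(hha: np.ndarray) -> torch.Tensor:
    """Convert a precomputed HHA image (H, W, 3) in [0, 255] to a (3, H, W) tensor."""
    return torch.from_numpy(hha).float().permute(2, 0, 1) / 255.0  # (3, H, W)
```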

The core question of fusion is how to efficiently combine the depth feature information with the RGB image feature information.


Early fusion

Early multimodal feature fusion structures mainly perform a simple concatenation of the two images, forming a new four-channel or six-channel image that is fed into the network model. The network model in this case is a single-branch convolutional encoder-decoder structure, and the fusion of RGB features and depth features may also use simple element-wise addition. Because this way of handling the feature information is so simple, and the fused input is not processed by any further modules in later stages of the network, the network extracts less effective information and the model's semantic segmentation accuracy is low. This type of fusion method is called early fusion.
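A hedged sketch of early fusion follows: RGB and depth are concatenated along the channel dimension before a single-branch encoder-decoder. The layer configuration, input sizes, and class count are hypothetical placeholders, not the post's model:

```python
# Early fusion: concatenate RGB (3 ch) with a distance map (1 ch) or HHA (3 ch)
# into a 4- or 6-channel input for a single-branch encoder-decoder.
import torch
import torch.nn as nn

class SingleBranchSegNet(nn.Module):
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

rgb = torch.randn(1, 3, 480, 640)
depth = torch.randn(1, 1, 480, 640)        # or (1, 3, 480, 640) for HHA
x = torch.cat([rgb, depth], dim=1)         # 4-channel (or 6-channel) input
logits = SingleBranchSegNet(in_channels=4, num_classes=40)(x)
```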


Late fusion

Late fusion adopts a dual-branch structure. The network contains two independent convolutional neural network branches: one is the RGB branch and the other is the depth branch, used to extract RGB features and depth features respectively. The two sets of features are fused at the end of the encoder. Since the late fusion structure first applies convolutional processing to the depth image, image noise is suppressed; therefore, compared with the early fusion structure described above, late fusion can effectively improve the final semantic segmentation accuracy. However, this method cannot take full advantage of the complementary features of the input images at each stage of the encoder, so a large amount of useful information is still lost.
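A hedged sketch of late fusion is shown below: two independent encoders whose outputs are fused only once, at the end of encoding. The encoder and decoder layers are hypothetical placeholders:

```python
# Late fusion: independent RGB and depth encoders, fused only at the encoder output.
import torch
import torch.nn as nn

def make_encoder(in_channels: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
    )

class LateFusionSegNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.rgb_encoder = make_encoder(3)
        self.depth_encoder = make_encoder(1)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, rgb, depth):
        f_rgb = self.rgb_encoder(rgb)
        f_depth = self.depth_encoder(depth)
        fused = f_rgb + f_depth                # fuse once, at the end of the encoder
        return self.decoder(fused)

logits = LateFusionSegNet(num_classes=40)(torch.randn(1, 3, 480, 640),
                                          torch.randn(1, 1, 480, 640))
```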

Multi-level fusion

Multi-level fusion also uses a dual-branch structure. Its core idea is to fuse features at multiple stages during encoding or decoding. It can be divided into three major categories: multi-level encoding fusion, multi-level decoding fusion, and third-branch multi-level fusion.


Multi-level encoding fusion

The multi-level encoding fusion structure is shown in the figure. This fusion method fuses RGB features and depth features at every stage of encoding, taking into account the complementarity of the two modalities at each encoder stage. Compared with the early and late fusion methods described above, this approach makes complementary use of RGB-D features at multiple stages and avoids losing a large amount of useful information. However, the fusion operation itself is too simple and cannot deeply mine the complementary RGB and depth feature information.
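As a hedged sketch of this idea, depth features might be added into the RGB branch at every encoder stage, roughly as follows; the layer choices are hypothetical:

```python
# Multi-level encoding fusion: element-wise fusion at every encoder stage.
import torch
import torch.nn as nn

def stage(in_ch: int, out_ch: int) -> nn.Module:
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU())

class MultiLevelEncoderFusion(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.rgb_stages = nn.ModuleList([stage(3, 64), stage(64, 128)])
        self.depth_stages = nn.ModuleList([stage(1, 64), stage(64, 128)])
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, rgb, depth):
        f_rgb, f_depth = rgb, depth
        for rgb_stage, depth_stage in zip(self.rgb_stages, self.depth_stages):
            f_rgb = rgb_stage(f_rgb)
            f_depth = depth_stage(f_depth)
            f_rgb = f_rgb + f_depth            # fuse the two modalities at every stage
        return self.decoder(f_rgb)

logits = MultiLevelEncoderFusion(num_classes=40)(torch.randn(1, 3, 480, 640),
                                                 torch.randn(1, 1, 480, 640))
```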


Multi-level decoding fusion

The multi-level decoding fusion structure mainly considers the complementarity of features at each stage of the decoder. As shown in the figure, the extracted RGB features and depth features are passed to the decoder through skip connections at each stage and fused with the decoder's own features. This information processing method is similar to multi-level encoding fusion: although the features of the two modalities can be used in a complementary way at multiple stages, the contribution to the model's segmentation performance is ultimately limited.
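A hedged sketch of this variant follows: encoder features from both branches are skip-connected into the decoder stages at matching resolutions. The two-stage layout and layer choices are hypothetical:

```python
# Multi-level decoding fusion: RGB and depth encoder features are skip-connected
# into the decoder and fused with the decoder's own features.
import torch
import torch.nn as nn

def enc_stage(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU())

def dec_stage(in_ch, out_ch):
    return nn.Sequential(nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1), nn.ReLU())

class MultiLevelDecoderFusion(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.rgb_enc = nn.ModuleList([enc_stage(3, 64), enc_stage(64, 128)])
        self.depth_enc = nn.ModuleList([enc_stage(1, 64), enc_stage(64, 128)])
        self.dec1 = dec_stage(128, 64)
        self.dec2 = dec_stage(64, 64)
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, rgb, depth):
        rgb_feats, depth_feats = [], []
        f_rgb, f_depth = rgb, depth
        for r_stage, d_stage in zip(self.rgb_enc, self.depth_enc):
            f_rgb, f_depth = r_stage(f_rgb), d_stage(f_depth)
            rgb_feats.append(f_rgb)
            depth_feats.append(f_depth)
        x = self.dec1(rgb_feats[1] + depth_feats[1])   # deepest features, 1/4 -> 1/2 resolution
        x = x + rgb_feats[0] + depth_feats[0]          # skip connections at 1/2 resolution
        x = self.dec2(x)                               # 1/2 -> full resolution
        return self.head(x)

logits = MultiLevelDecoderFusion(num_classes=40)(torch.randn(1, 3, 480, 640),
                                                 torch.randn(1, 1, 480, 640))
```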


Third-branch multi-level fusion

The third-branch multi-level fusion structure adds a new fusion branch alongside the original RGB branch and depth branch. The structure of this fusion branch is not fixed: it can be a convolutional neural network branch with the same configuration as the original RGB and depth branches, or it can be a newly designed feature-fusion module, and it is used to fuse the RGB features extracted by the RGB branch with the depth features extracted by the depth branch. The structure is shown in the figure. Compared with the other two multi-level fusion methods, third-branch multi-level fusion can process the RGB and depth features more deeply and exploit their complementarity more fully, but the number of parameters increases accordingly and the computing resources consumed are also larger.
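As a hedged sketch of the third-branch idea, a separate fusion branch could receive the summed RGB and depth features at each stage and carry them forward through its own convolutions; the two-stage layout and layer choices below are hypothetical:

```python
# Third-branch multi-level fusion: a dedicated fusion branch accumulates and
# refines the combined RGB + depth features stage by stage.
import torch
import torch.nn as nn

def stage(in_ch, out_ch, stride=2):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1), nn.ReLU())

class ThirdBranchFusion(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.rgb_stages = nn.ModuleList([stage(3, 64), stage(64, 128)])
        self.depth_stages = nn.ModuleList([stage(1, 64), stage(64, 128)])
        # Fusion branch: its own convolution carries fused features to the next stage.
        self.fuse_transition = stage(64, 128)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, rgb, depth):
        f_rgb = self.rgb_stages[0](rgb)
        f_depth = self.depth_stages[0](depth)
        f_fuse = f_rgb + f_depth                            # fusion branch starts at stage 1
        f_rgb = self.rgb_stages[1](f_rgb)
        f_depth = self.depth_stages[1](f_depth)
        f_fuse = self.fuse_transition(f_fuse) + f_rgb + f_depth  # stage-2 fusion in the third branch
        return self.decoder(f_fuse)

logits = ThirdBranchFusion(num_classes=40)(torch.randn(1, 3, 480, 640),
                                           torch.randn(1, 1, 480, 640))
```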


In practical applications, these fusion methods are more often used in combination with one another, and the combined methods usually produce better results.


Origin blog.csdn.net/wagnbo/article/details/127751878