Multimodal scene graph for 3D Visual Grounding


Paper: "Free-form Description Guided 3D Visual Graph Network for Object Grounding in Point Cloud" (ICCV 2021)
Code: https://github.com/PNXD/FFL-3DOG


Introduction

The 3D visual grounding (3DVG) task poses three challenges:

  • Identify the main focus, i.e. the subject (target object), in complex and diverse text descriptions;
  • Understand the point cloud scene;
  • Locate the target object.

To address these challenges, the paper designs three modules:

  • First, a language scene graph module captures the rich structure and phrase correlations in complex text descriptions;
  • Second, relationships between proposals are modeled and used to enhance the visual features of the initial proposals;
  • Finally, a description-guided 3D visual graph module encodes the global context of phrases and proposals through a node matching strategy.

[Figure: graphical summary of the approach]

Put simply, the paper does three things:

  • First, the complex text description is split into three types of phrases (noun phrases, pronouns and relation phrases), and a language scene graph $G^l$ is built from them: nodes correspond to noun phrases and pronouns, edges to relation phrases;
  • Second, a proposal relation graph $G^o$ is built from the proposals produced by VoteNet; the language scene graph $G^l$ is then used to compute a matching score $\phi_1$, which is used to prune and refine the proposals in $G^o$;
  • Finally, the two graphs are fused through node matching to obtain what the paper calls the description-guided 3D visual graph $G^u$, which is used to perform the 3DVG task.
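
The pruning in the second step can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the paper's matching score $\phi_1$ is learned by the network, whereas this sketch substitutes plain cosine similarity between hand-written feature vectors, and the helper names `cosine` and `prune_proposals` are hypothetical and do not come from the FFL-3DOG code.

```python
import math

def cosine(a, b):
    """Cosine similarity; a stand-in for the learned matching score (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def prune_proposals(phrase_feats, proposal_feats, keep):
    """For each phrase node, keep the `keep` best-scoring proposals.

    Returns the sorted union of the surviving proposal indices.
    """
    kept = set()
    for phrase in phrase_feats:
        scores = [cosine(phrase, proposal) for proposal in proposal_feats]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        kept.update(ranked[:keep])
    return sorted(kept)

# Toy features: two phrase nodes, four proposals (made-up numbers).
phrase_feats = [[1.0, 0.0], [0.0, 1.0]]
proposal_feats = [[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0], [0.7, 0.7]]

kept = prune_proposals(phrase_feats, proposal_feats, keep=1)
print(kept)
```

With `keep=1`, only the proposal that best matches each phrase survives; raising `keep` trades a larger proposal graph for a lower risk of discarding the true target.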

This raises the following key questions:

  • How does the language scene graph parse the description, and how is the graph constructed?
  • How are the relationships within the visual scene graph built? Are they computed from distance?

Method

[Figure: overall framework of the method]

1. Language Scene Graph Module

Each node in the language scene graph corresponds to an object mentioned in the text description L, and each edge corresponds to a relationship between that object and another object mentioned in L. The graph is directed.
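
As a rough illustration of this structure, the sketch below stores a language scene graph as a list of phrase nodes plus directed, labeled edges. The class name `LanguageSceneGraph`, its methods, and the example sentence are all hypothetical, and the phrases are split by hand here; how the paper's parser actually extracts them is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class LanguageSceneGraph:
    """Directed graph: nodes are noun phrases or pronouns, edges are relation phrases."""
    nodes: List[str] = field(default_factory=list)
    # Each edge is (subject node index, relation phrase, object node index).
    edges: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_node(self, phrase: str) -> int:
        """Add a phrase node and return its index."""
        self.nodes.append(phrase)
        return len(self.nodes) - 1

    def add_edge(self, subj: int, relation: str, obj: int) -> None:
        """Add a directed edge from the subject node to the object node."""
        self.edges.append((subj, relation, obj))

    def out_edges(self, subj: int) -> List[Tuple[str, str]]:
        """Return (relation, object phrase) pairs leaving the given node."""
        return [(rel, self.nodes[o]) for s, rel, o in self.edges if s == subj]

# Hand-parsed example: "the brown chair next to the table under the window"
g = LanguageSceneGraph()
chair = g.add_node("the brown chair")
table = g.add_node("the table")
window = g.add_node("the window")
g.add_edge(chair, "next to", table)
g.add_edge(table, "under", window)

print(g.out_edges(chair))
```

Because the edges are directed, "the brown chair" has an outgoing "next to" edge to "the table", while "the window" has no outgoing edges at all.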


Origin blog.csdn.net/DUDUDUTU/article/details/130464925