Overview of scene graph generation


A scene graph (SG) is very helpful for understanding visual scenes. Since my research direction is scene graph generation (SGG), I need a comprehensive understanding of this field. I will write one part first and add more later; these are selected, important translated excerpts, kept only as personal study notes.

Reference papers:

arXiv:2104.01111 — https://arxiv.org/pdf/2104.01111.pdf

Summary

A scene graph is a structured representation of a scene that can clearly express the objects, attributes, and relationships between objects in the scene. As computer vision technology continues to develop, people are no longer satisfied with simply detecting and recognizing objects in images; instead, they expect a higher level of understanding and reasoning about visual scenes. For example, given an image, we not only want to detect and recognize the objects in it, but also to understand the relationships between objects (visual relationship detection) and to generate text descriptions based on the image content (image captioning). Or we might want the machine to tell us what the little girl in the image is doing (visual question answering, VQA), or even to remove the dog from the image and find similar images (image editing and retrieval), and so on. These tasks require a higher level of understanding and reasoning about visual scenes, and the scene graph is a very powerful tool for scene understanding. Consequently, scene graphs have attracted the attention of a large number of researchers, and the related research is often cross-modal, complex, and rapidly developing. This article summarizes the general definition of scene graphs, then provides a comprehensive and systematic discussion of scene graph generation (SGG) methods and of SGG aided by prior knowledge, and finally summarizes the most commonly used datasets.

1 Introduction

Currently, there is an explosion of work related to scene graph generation (SGG), but a comprehensive and systematic survey of SGG is lacking. To fill this gap, the survey (hereinafter referred to as "the paper") mainly reviews the methods and applications of SGG. The figure below shows the main structure of the survey. In Section 6, the paper summarizes commonly used datasets and evaluation methods for scene graphs and compares model performance. In Section 7, the paper discusses future development directions of SGG, and it concludes in Section 8.

1.1 Definition

The figure below summarizes the overall process of building a scene graph. As shown in Figure 2 (bottom), an object instance in a scene graph can be a person (girl), a place (tennis court), a thing (shirt), or a part of another object (arm). Attributes are used to describe the current state of an object; these might include the shape of the racket (the racket is a long bar), color (the girl's clothes are white), and pose (the girl is standing). Relationships are used to describe connections between pairs of objects, such as actions (e.g., a girl swinging a racket) and positions (a cone placed in front of the girl). A relationship is usually represented as a <subject-predicate-object> triplet, abbreviated as <s-p-o>.

Formally, a scene graph SG is a directed graph data structure defined as the tuple SG = (O,R,E), where:

  • O = \{o_1,...,o_n\} is the set of objects detected in the image, and n is the number of objects. Each object can be denoted as o_i = (c_i,A_i), where c_i and A_i represent the category and the attributes of the object respectively.
  • R is the set of relationships between nodes, where the relationship between the i-th and the j-th object instance is denoted r_{i\rightarrow j}, with i,j\in \{1,2,...,n\}.
  • E\subseteq O\times R\times O represents the edges between object instance nodes and relationship nodes, so there are at most n × n edges in the initial graph. When o_i is classified as background, or r_{i\rightarrow j} is classified as irrelevant, the edge (o_i,r_{i\rightarrow j})\in E is automatically removed.
  • Given an image I as input, an SGG method outputs a scene graph SG that contains the object instances localized in the image by bounding boxes and the relationships between each pair of object instances. This can be expressed as:

    SG_{O,R,E}^I=SGG(I).
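To make the tuple definition concrete, here is a minimal Python sketch of the SG = (O, R, E) structure; the class and field names (SGObject, SceneGraph, add_relation) are illustrative choices, not notation from the survey.

```python
# Minimal sketch of the SG = (O, R, E) definition; names are illustrative.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SGObject:
    category: str                                  # c_i
    attributes: List[str]                          # A_i
    bbox: Tuple[float, float, float, float]        # (x1, y1, x2, y2)

@dataclass
class SceneGraph:
    objects: List[SGObject] = field(default_factory=list)                # O
    relations: List[Tuple[int, str, int]] = field(default_factory=list)  # edges: (i, predicate, j)

    def add_relation(self, i: int, predicate: str, j: int) -> None:
        """Add an <s-p-o> edge r_{i->j} between object i and object j."""
        self.relations.append((i, predicate, j))

# Example corresponding to Figure 2: <girl-swinging-racket>
sg = SceneGraph()
sg.objects.append(SGObject("girl", ["standing", "white shirt"], (10, 20, 120, 300)))
sg.objects.append(SGObject("racket", ["long bar"], (90, 60, 160, 140)))
sg.add_relation(0, "swinging", 1)
```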

1.2  Construction Process

Referring to the formulation in "Unbiased Scene Graph Generation from Biased Training", a general SGG process is shown in Figure 3. Figure 3 (left) is an abstract representation of this SGG process, and Figure 3 (right) is a concrete example. Specifically, node I represents a given image, node X represents the features of objects, and node Z represents the categories of objects. Node \widetilde{Y} represents the predicted predicate category and the corresponding <s-p-o> triple; it receives the outputs of three branches and uses a fusion function to generate the final score. Node Y represents the ground-truth triplet label. The corresponding links are described as follows:

I\rightarrow X (Object Feature Extraction)

A pre-trained Faster R-CNN is often used to extract, from an input image I, a set of bounding boxes B=\{b_i|i=1,...,m\} and the corresponding feature maps X=\{x_i|i=1,...,m\}. This process can be expressed as:

Input:\{I\}\Rightarrow Output:\{x_i|i=1,...,m\}.

Through this process, the visual context of each object is encoded.
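As a hedged illustration of this I → X step, the sketch below uses a pre-trained torchvision Faster R-CNN to obtain the boxes B and then pools a feature x_i for each box with RoIAlign; the choice of FPN level and the 7×7 pooling size are assumptions, not the survey's prescription.

```python
# A minimal sketch of the I -> X step (object feature extraction), assuming a
# pre-trained Faster R-CNN from torchvision; variable names are illustrative.
import torch
import torchvision
from torchvision.ops import roi_align

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 600, 800)            # dummy RGB image tensor in [0, 1]

with torch.no_grad():
    # 1) Detection pass: B = {b_i | i = 1..m}
    detections = detector([image])[0]
    boxes = detections["boxes"]            # (m, 4) bounding boxes

    # 2) Feature pass: pool a fixed-size feature x_i for each box from a
    #    backbone feature map (here the finest FPN level, an assumption).
    feats = detector.backbone(image.unsqueeze(0))    # dict of FPN levels
    fmap = feats["0"]                                # stride-4 level
    scale = fmap.shape[-1] / image.shape[-1]         # spatial scale for RoIAlign
    X = roi_align(fmap, [boxes], output_size=(7, 7), spatial_scale=scale)
    # X has shape (m, C, 7, 7): one visual feature map per detected object.
```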

X\rightarrow Z (Object Classification)

This process can be simply expressed as:

Input:\{x_i\}\Rightarrow Output:\{z_i,z_i\in O\},i=1,...,m.

Z\rightarrow \widetilde{Y} (Object Class Input for SGG)

The predicate \widetilde{y}_{ij} between a pair of objects is predicted from the paired object labels (z_i,z_j) through an embedding layer M, producing the set \widetilde{Y}. This process can be expressed as:

Input:\{(z_i,z_j)\}\overset{M}{\Longrightarrow}Output:\{\tilde{y}_{ij}\},i\neq j;i,j=1,...,m.

Some prior knowledge is used here. For the calculation of prior knowledge, please refer to the original reference.

X\rightarrow \widetilde{Y} (Object Feature Input for SGG)

The concatenation of the paired object features [x_i,x_j] is taken as input and the corresponding predicate is predicted. This process can be expressed as:

Input:\{[x_i,x_j]\}\Rightarrow Output:\{\widetilde{y}_{ij}\},i\neq j;i,j=1,...,m.

I\rightarrow \widetilde{Y} (Visual Context Input for SGG)

In this branch, the visual context features of the joint region b_i\cup b_j are extracted as v_{ij}=Convs(RoIAlign(I,b_{i}\cup b_{j})), and the corresponding triples are predicted. This process can be expressed as:

Input:\{v_{ij}\}\Rightarrow Output:\{\widetilde{y}_{ij}\},i\neq j;i,j=1,...,m.

Training Loss 

Most models are trained with the conventional cross-entropy loss over object labels and predicate labels. In addition, to prevent any single branch from dominating the generation of the logits of \widetilde{Y}, the relevant paper adds an auxiliary cross-entropy loss that supervises the prediction of each branch separately.
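To make the three \widetilde{Y} branches and the loss description concrete, here is a hedged PyTorch-style sketch that fuses the logits of the Z → Ỹ, X → Ỹ, and I → Ỹ branches and adds one auxiliary cross-entropy loss per branch; the module names, dimensions, and sum-fusion are illustrative assumptions rather than the cited paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredicateHead(nn.Module):
    """Sketch of the three SGG branches; sizes and fusion are assumptions."""
    def __init__(self, num_classes, num_predicates, feat_dim=1024, embed_dim=200):
        super().__init__()
        self.label_embed = nn.Embedding(num_classes, embed_dim)        # Z -> Y~
        self.from_labels = nn.Linear(2 * embed_dim, num_predicates)
        self.from_feats = nn.Linear(2 * feat_dim, num_predicates)      # X -> Y~
        self.from_union = nn.Linear(feat_dim, num_predicates)          # I -> Y~ (v_ij)

    def forward(self, z_i, z_j, x_i, x_j, v_ij):
        y_z = self.from_labels(torch.cat([self.label_embed(z_i),
                                          self.label_embed(z_j)], dim=-1))
        y_x = self.from_feats(torch.cat([x_i, x_j], dim=-1))
        y_v = self.from_union(v_ij)
        y_fused = y_z + y_x + y_v          # fusion by summation (an assumption)
        return y_fused, (y_z, y_x, y_v)

def sgg_loss(y_fused, branch_logits, target):
    """Main cross-entropy plus one auxiliary CE per branch, as described above."""
    loss = F.cross_entropy(y_fused, target)
    loss += sum(F.cross_entropy(logits, target) for logits in branch_logits)
    return loss

head = PredicateHead(num_classes=150, num_predicates=50)
logits, branches = head(torch.tensor([3]), torch.tensor([7]),
                        torch.rand(1, 1024), torch.rand(1, 1024), torch.rand(1, 1024))
loss = sgg_loss(logits, branches, torch.tensor([12]))
```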

More concisely, scene graph generation can be roughly divided into three parts: feature extraction, contextualization, and graph construction and reasoning.

  1. Feature extraction: this process is mainly responsible for encoding objects or pairs of objects in an image, e.g., I\rightarrow X and I\rightarrow \widetilde{Y}.
  2. Contextualization: this associates different entities and is mainly used to enhance the contextual information between them, e.g., I\rightarrow X, Z\rightarrow \widetilde{Y}, X\rightarrow \widetilde{Y}, I\rightarrow \widetilde{Y}.
  3. Graph construction and reasoning: finally, this contextual information is used to predict predicates and complete graph construction and reasoning, e.g., the prediction of the node labels \widetilde{Y}.

On the other hand, as shown in Figure 2 (a), from the perspective of the SGG process, current scene graph generation approaches can be divided into two types:

  • Bottom-up: The first method is divided into two stages, namely object detection and pairwise relationship recognition. The first stage, which identifies the categories and attributes of the detected objects, is usually implemented with Faster R-CNN. This approach is called the bottom-up approach. It can be expressed in the following form:

P(SG|I)=P(B|I)*P(O|B,I)*P(R|B,O,I),

        where P(B|I), P(O|B, I), and P(R|B, O, I) represent the bounding box (B), object (O), and relationship (R) prediction models respectively.

  • Top-down: The other approach jointly detects and recognizes objects and their relationships; this is called the top-down approach. The corresponding probability model can be expressed as:

P(SG|I)=P(B|I)*P(O,R|B,I), \qquad (7)

        where P(O, R|B, I) represents the joint reasoning model of objects and their relationships based on object region proposals.

At a high level, the inference task and the other vision tasks involved include recognizing objects, predicting object coordinates, and detecting pairwise relation predicates between the recognized objects. Therefore, most current work focuses on the key challenge of visual relational reasoning.

1.3 Challenge

However, it is worth noting that research on scene graphs still faces some challenges. Current research mainly focuses on trying to solve the following three problems:

  • Construction of SGG models. The key question here is how to build a scene graph step by step. Different learning models have a crucial impact on the accuracy and completeness of the scene graphs generated by mining visual and textual information, so the study of suitable learning models is central to SGG research.
  • Introduction of prior knowledge. In addition to fully mining the objects and relationships in the current training set, additional prior knowledge is also crucial for constructing the scene graph. How to make full use of existing prior knowledge is another important issue.
  • Long-tailed distribution of visual relationships. The long-tailed distribution of predicates is a key challenge in visual relationship recognition. This long-tailed distribution and the data balance required for model training form an inherent contradiction that directly affects model performance. For the long-tail distribution problem, see "Long-Tailed Classification (1): Introduction to classification under long-tailed (imbalanced) distributions" on Zhihu (zhihu.com).

2 SCENE GRAPH GENERATION

A scene graph is a topological representation of a scene, whose main goal is to encode objects and their relationships. Furthermore, the key challenging task is to detect/recognize relationships between objects. At present, SGG methods can be roughly divided into CRF-based SGG, TransE-based SGG, CNN-based SGG, RNN/LSTM-based SGG and GNN-based SGG. In this section, we review each of these methods in detail.

2.1  CRF-based SGG

In visual relationship triples <s-p-o>, there is a strong statistical correlation between relational predicates and object pairs. Effective use of this information can greatly aid in identifying visual relationships. The CRF (Conditional Random Field) is a classic tool capable of incorporating statistical relationships into discriminative tasks. CRFs have been widely used in various graph reasoning tasks, including image segmentation, named entity recognition, and image retrieval. In the context of visual relationships, a CRF can be expressed as:

P(r_{s\to o}|X)=\frac{1}{Z}\exp(\Phi(r_{s\rightarrow o}|X;W)),\quad X=\{x_{r},x_{s},x_{o}\},

where x_r represents the appearance features and spatial configuration of the object pair, and x_s and x_o represent the appearance features of the subject and the object respectively. Generally speaking, most of these features are one-dimensional tensors of size 1 × N after ROI pooling, where N is the dimension of the tensor and its specific value can be controlled by network parameters. W is the model parameter, Z is the normalization constant, and \Phi is the joint potential. Similar CRFs are widely used in computer vision tasks and have proven effective in capturing statistical correlations in visual relationships. Figure 4 summarizes the basic structure of CRF-based SGG, which includes object detection and relationship prediction. The object detection model is used to obtain the regional visual features of subjects and objects, while the relationship prediction model uses those visual features to predict the relationship between them. Improved CRF-based SGG models achieve better performance by adopting a more suitable object detection model and a relationship prediction model with stronger reasoning capabilities. For example, related papers proposed the Deep Relational Network (DR-Net) and the Semantic Compatibility Network (SCN).
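A minimal sketch of this CRF formulation, assuming a simple linear joint potential: Φ scores every candidate predicate from (x_r, x_s, x_o), and dividing by Z is realized as a softmax, yielding P(r_{s→o} | X).

```python
import torch
import torch.nn as nn

class RelationCRFPotential(nn.Module):
    """Joint potential Phi(r | x_r, x_s, x_o; W); softmax gives P(r_{s->o} | X)."""
    def __init__(self, feat_dim, num_predicates):
        super().__init__()
        self.W = nn.Linear(3 * feat_dim, num_predicates)     # model parameters W

    def forward(self, x_r, x_s, x_o):
        phi = self.W(torch.cat([x_r, x_s, x_o], dim=-1))     # unnormalized potentials
        # Dividing by Z (the normalization constant) is exactly a softmax here.
        return torch.softmax(phi, dim=-1)

# Usage: ROI-pooled 1 x N features for the pair, the subject, and the object.
N = 512
model = RelationCRFPotential(N, num_predicates=70)
p = model(torch.rand(1, N), torch.rand(1, N), torch.rand(1, N))   # (1, 70)
```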

Inspired by the success of deep neural networks and CRF models, and in order to exploit statistical dependencies in the context of visual relationships, DR-Net incorporates statistical relationship modeling into a deep neural network framework. DR-Net unrolls the inference of relational modeling into a feed-forward network. In addition, DR-Net differs from previous CRFs: its statistical inference process is embedded into the deep relational network through iterative unrolling. The performance of DR-Net is better not only than classification-based methods but also than CRFs based on deep potentials. Further, SG-CRF (SGG via Conditional Random Fields) is defined as finding the best prediction of o_i, o^{bbox}_i, and r_{i\rightarrow j} that maximizes the following probability function:

P(SG|I)=\prod\limits_{o_i\in O}P(o_i,o_i^{bbox}|I)\prod\limits_{r_{i\to j}\in R}P(r_{i\to j}|I),

where o^{bbox}_i\in \mathbb{R}^4 represents the bounding box coordinates of the i-th object instance. Some previous methods tend to ignore the semantic compatibility between instances and relationships (i.e., the likelihood distribution over all 1-hop neighbor nodes of a given node), which leads to a significant degradation in model performance on real data. For example, this may cause the model to incorrectly recognize <dog-sitting inside-car> as <dog-driving-car>.

Furthermore, these models ignore the order of subject and object, leading to confusion between the two and thus potentially absurd predictions, e.g., <car-sitting inside-dog>. To solve these problems, SG-CRF proposes an end-to-end scene graph constructed through conditional random fields to improve the quality of SGG. More specifically, in order to learn the semantic compatibility of nodes in the scene graph, SG-CRF proposes a novel semantic compatibility network based on conditional random fields; and to distinguish subjects from objects in relationships, it proposes an effective relation sequence layer that can capture the subject-object order in visual relationships.

In general, CRF-based SGG can effectively model statistical correlations in visual relationships, and CRFs remain a classic tool for modeling such correlations in visual relationship recognition tasks.

2.2 TransE-based SGG

A knowledge graph is similar to a scene graph: it also contains a large number of fact triples, and this multi-relational data is represented in the form (head, label, tail), abbreviated as (h,l,t). Here, h,t \in O are the head entity and the tail entity respectively, and l is the relationship label between the two entities. These entities correspond to objects in the scene graph, so we use O to represent the set of entities in order to avoid confusion with E (the edges between objects).

Knowledge graph representation learning embeds triples into a low-dimensional vector space, and models based on TransE (Translation Embedding) have proven particularly effective. TransE treats the relationship as a translation between the head and tail entities; the model must learn vector embeddings of entities and relationships such that, for a tuple (h, l, t), h+l\approx t (t should be the nearest neighbor of h+l; otherwise, they should be far away from each other). This embedding can be learned by minimizing the following margin-based loss function:

\mathcal{L}=\sum_{(h,l,t)\in S}\;\sum_{({h}',l,{t}')\in {S}'_{(h,l,t)}}[\gamma +d(h+l,t)-d({h}'+l,{t}')]_{+},

where S is the training set, d measures the distance between two embeddings, \gamma is a margin hyperparameter, (h,l,t) is a positive tuple, (h',l,t') is a negative (corrupted) tuple, and

S'_{(h,l,t)}=\{(h',l,t)|h'\in O\}\cup\{(h,l,t')|t'\in O\}.

Relation tuples in scene graphs also have similar definitions and properties, which means that learning such visual relation embeddings is also of great help to scene graphs. 
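The margin-based TransE objective above translates almost directly into code; the following PyTorch sketch (entity and relation table sizes are placeholders) scores positive triples against corrupted ones with the [γ + d(h+l, t) − d(h'+l, t')]_+ hinge.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransE(nn.Module):
    """Sketch of TransE embeddings with the margin-based loss given above."""
    def __init__(self, num_entities, num_relations, dim=100, margin=1.0):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)
        self.margin = margin

    def distance(self, h, l, t):
        # d(h + l, t): L2 distance between the translated head and the tail.
        return (self.ent(h) + self.rel(l) - self.ent(t)).norm(p=2, dim=-1)

    def loss(self, pos, neg):
        # pos, neg: (batch, 3) index triples (h, l, t) and corrupted (h', l, t').
        d_pos = self.distance(pos[:, 0], pos[:, 1], pos[:, 2])
        d_neg = self.distance(neg[:, 0], neg[:, 1], neg[:, 2])
        return F.relu(self.margin + d_pos - d_neg).mean()

model = TransE(num_entities=100, num_relations=70)
pos = torch.tensor([[3, 5, 7]])      # e.g. <person-ride-horse> as indices
neg = torch.tensor([[3, 5, 9]])      # same triple with a corrupted tail
print(model.loss(pos, neg))
```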

Inspired by TransE's progress in relational representation learning for knowledge bases, VTransE explores how to model visual relationships by mapping the features of objects and predicates into a low-dimensional space. It is the first TransE-based SGG method and works by extending the TransE network. Subsequently, as shown in Figure 6, the attention mechanism and visual context information were introduced, yielding MATransE (Multimodal Attentional Translation Embeddings) and UVTransE (Union Visual Translation Embedding) respectively. In RLSV (Representation Learning via Jointly Structural and Visual Embedding) and AT (Analogies Transfer), TransD and analogy transformation, respectively, are used instead of TransE for visual relationship prediction.

More specifically, VTransE maps entities and predicates into a low-dimensional embedding vector space, where the predicate is interpreted as a translation vector between the embedded features of the subject's and object's bounding-box regions. Similar to knowledge graph tuples, the relationship in the scene graph is modeled as a simple vector translation, i.e., s+p\approx o, which can be regarded as the basic formulation of TransE-based SGG methods. While this is a good start, VTransE only considers features of subjects and objects, not of predicates or contextual information, although these have proven useful for recognizing relationships. To this end, MATransE, built on VTransE, combines the complementarity of language and vision with an attention mechanism and deep supervision, proposing a multimodal attentional translation embedding method. MATransE attempts to learn projection matrices W_s, W_p, and W_o that project <s-p-o> into a score space, and uses binary mask convolutional features m in its attention module. The relation s+p\approx o then becomes:

W_s(s,o,m)s+W_p(s,o,m)p\approx W_o(s,o,m)o.

MATransE designs two separate branches to handle the features of predicates and the features of subjects and objects directly, and achieves good results.

In addition to the large variation in the visual appearance of predicates, the sparsity of predicate representations in the training set and the very large predicate feature space also make visual relationship detection increasingly difficult. Take the Stanford VRD dataset as an example: it contains 100 object categories, 70 predicate categories, and a total of 30k training relation annotations. The number of possible <s-p-o> triples is 100^2 × 70 = 700k, which means that a large number of possible real relationships have no training examples at all. These unseen relationships should not be ignored simply because they are not included in the training set. Figure 5 gives an example of this situation. However, VTransE and MATransE are not suited to handling this problem. Therefore, detecting unseen/new relationships in the scene is crucial for building a complete scene graph.

Inspired by VTransE, UVTransE aims to improve generalization to rare or unseen relationships. On the basis of VTransE, UVTransE introduces the joint bounding box (union) feature u of the subject and object to better capture contextual information, and learns the predicate embedding under the guidance of the constraint p\approx u-s-o. By introducing the union of subject and object and using a context-augmented translation embedding model, UVTransE captures both common and rare relationships in the scene; this kind of exploration is very beneficial for building a relatively complete scene graph. Finally, UVTransE combines the scores of its vision, language, and object detection modules to rank the predicted relationship triples. The architectural details of UVTransE are shown in Figure 7; it models the predicate embedding as p\approx u(s,o)-(s+o).
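As a hedged illustration of the translation-embedding idea (VTransE's s + p ≈ o and UVTransE's p ≈ u − s − o), the sketch below derives a predicate vector from projected subject, object, and union-region features and scores it against a table of predicate embeddings; the projection layers and dot-product scoring are simplifications, not either paper's exact pipeline.

```python
import torch
import torch.nn as nn

class UVTransEPredicate(nn.Module):
    """Predict a predicate from p ~ u - s - o in a shared embedding space (sketch)."""
    def __init__(self, feat_dim, embed_dim, num_predicates):
        super().__init__()
        self.proj_s = nn.Linear(feat_dim, embed_dim)   # subject projection
        self.proj_o = nn.Linear(feat_dim, embed_dim)   # object projection
        self.proj_u = nn.Linear(feat_dim, embed_dim)   # union-region projection
        self.predicates = nn.Embedding(num_predicates, embed_dim)

    def forward(self, f_s, f_o, f_u):
        p_hat = self.proj_u(f_u) - self.proj_s(f_s) - self.proj_o(f_o)  # p ≈ u - s - o
        # Score each known predicate by similarity to the derived vector.
        return p_hat @ self.predicates.weight.t()

model = UVTransEPredicate(feat_dim=1024, embed_dim=128, num_predicates=70)
logits = model(torch.rand(1, 1024), torch.rand(1, 1024), torch.rand(1, 1024))
```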

Based on profound insights from knowledge-graph research, TransE-based SGG methods have developed rapidly and aroused strong interest among researchers, and the related results prove the effectiveness of this approach. In particular, TransE-based SGG methods are very helpful for mining unseen visual relationships, which directly affects the completeness of the scene graph. Therefore, this line of research remains valuable.

2.3 CNN-based SGG (The following parts will be updated later, focusing on 2.4 2.5)

CNN-based SGG methods attempt to use a convolutional neural network (CNN) to extract local and global visual features of an image, and then predict the relationships between subjects and objects through classification. As can be seen from most CNN-based SGG methods, this type of method mainly consists of three parts: region proposal, feature learning, and relationship classification. Among these, feature learning is the key part. We can use the following formula to express the feature learning of the subject, predicate, and object at the l-th layer:

v_*^l=f(W_*^l\otimes v_*^{l-1}+c_*^l),

where * stands for <s,p,o>, ⊗ is the matrix-vector product, and W_{*}^{l} and c_{*}^{l} are the parameters of the FC or Conv layer. Subsequent CNN-based SGG methods focus on designing new modules to learn better features {v}'. The mainstream CNN-based SGG methods are shown in Figure 8. The final features for relationship recognition are obtained by jointly considering the local visual features of multiple objects in LinkNet, or by introducing a box attention mechanism in BAR-CNN (Box Attention Relational CNN). To improve the efficiency of the SGG model, Rel-PN and IM-SGG (Interpretable Model for SGG) aim to select the most effective RoIs for visual relationship prediction. ViP-CNN (Visual Phrase-guided Convolutional Neural Network) and Zoom-Net pay more attention to the interaction between local features. Since CNNs perform well at extracting visual features from images, CNN-based SGG methods have been widely studied. In this section, we introduce these CNN-based SGG methods in detail.
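The per-branch update v_*^l = f(W_*^l ⊗ v_*^{l-1} + c_*^l) amounts to a small stack of FC (or Conv) layers per triplet component; the sketch below keeps three independent branches for subject, predicate, and object, which is the simplest reading of the formula and not any specific paper's design.

```python
import torch
import torch.nn as nn

def branch(dim, layers=2):
    """One branch of v^l = f(W^l v^{l-1} + c^l), with f = ReLU."""
    return nn.Sequential(*[nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
                           for _ in range(layers)])

class TripletBranches(nn.Module):
    """Independent feature learning for subject, predicate, and object (* in <s, p, o>)."""
    def __init__(self, dim=512):
        super().__init__()
        self.s, self.p, self.o = branch(dim), branch(dim), branch(dim)

    def forward(self, v_s, v_p, v_o):
        return self.s(v_s), self.p(v_p), self.o(v_o)

feats = TripletBranches()(torch.rand(1, 512), torch.rand(1, 512), torch.rand(1, 512))
```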

Scene graphs are generated by analyzing the relationships between multiple objects in an image. Therefore, it is necessary to consider the connections between related objects as much as possible instead of focusing on a single object in isolation. LinkNet improves SGG by explicitly modeling the interdependencies among all relevant objects. More specifically, LinkNet designs a simple and effective relational embedding module to jointly learn the connections between all related objects. In addition, LinkNet introduces a global context encoding module and a geometric layout encoding module to extract global contextual information and spatial information between object proposals from the whole image, thereby further improving performance. Concretely, LinkNet consists of three main steps: bounding box proposal, object classification, and relationship classification. However, LinkNet considers relation proposals for all objects, which makes it computationally expensive.

On the other hand, with the development of deep learning, object detection research has become increasingly mature. In contrast, identifying the associations between different entities in order to achieve higher-level visual understanding has become a new challenge, and it is also the key to scene graph construction. As analyzed in Section 2.2, in order to detect all relationships, first detecting all individual objects and then classifying all relationship pairs is inefficient and unnecessary, because the visual relationships among the quadratic number of object pairs are very sparse. Therefore, using visual phrases to express visual relationships can be a good solution, and Rel-PN has conducted corresponding research in this area. Similar to the region proposals for objects produced by Region Proposal Networks (RPN), Rel-PN uses a proposal selection module to select meaningful subject-object pairs for subsequent relationship prediction. This operation greatly reduces the computational complexity of SGG. The model structure of Rel-PN is shown in Figure 9(a).

Rel-PN's compatibility evaluation module uses two types of modules: a visual compatibility module and a spatial compatibility module. The visual compatibility module is mainly used to analyze the consistency of the appearance of two boxes, while the spatial compatibility module is mainly used to examine the positions and shapes of the two boxes. In addition, the Rel-PN-based IM-SGG considers three types of features, visual, spatial, and semantic, which are extracted by three corresponding models. Subsequently, similar to Rel-PN, these three types of features are fused for final relationship recognition. Unlike Rel-PN, IM-SGG uses an additional semantic module to capture strong prior knowledge of predicates, resulting in better performance (Figure 9(b)); this approach effectively improves the interpretability of SGG. More directly, ViP-CNN takes a similar approach to Rel-PN and also explicitly treats a visual relationship as a visual phrase containing three components. ViP-CNN attempts to jointly learn specific visual features for interaction, so that visual dependencies can be considered. In ViP-CNN, a Phrase-guided Message Passing Structure (PMPS) is proposed, which uses a gather-and-broadcast message passing flow mechanism to model the interdependency among local visual features. ViP-CNN is both fast and accurate, achieving significant improvements in performance.

In addition, to further improve the accuracy of SGG, some methods also study the interaction between different features, with the aim of predicting the visual relationships between different entities more accurately. This is because detecting and recognizing individual objects independently provides little help in fundamentally identifying visual relationships. Figure 10 gives an example in which even a perfect object detector would have difficulty distinguishing a person standing next to a horse from a person feeding the horse. Therefore, the information interaction between different objects is extremely important for understanding visual relationships, and many relevant works have been published on this topic. For example, in Zoom-Net, the detected interactions between pairs of objects are used for visual relationship recognition. Zoom-Net achieves convincing performance in recognizing complex visual relationships, without using any language priors, through deep message propagation and interaction between local object features and global predicate features. ViP-CNN uses similar feature interactions.

The key difference is that the CA-M (Context-Appearance Module) proposed by ViP-CNN attempts to directly fuse pairwise features to capture contextual information, whereas the SCA-M (Spatiality-Context-Appearance Module) proposed by Zoom-Net [23] performs spatially aware, channel-wise fusion of local and global contextual information. Therefore, SCA-M has an advantage in capturing the spatial and contextual relationships between subject, predicate, and object features. Figure 11 compares the structures of the appearance module (A-M, without information interaction), the context-appearance module (CA-M), and the spatiality-context-appearance module (SCA-M).

Attention mechanisms are also a good tool for improving visual relationship detection. BAR-CNN observes that, even in state-of-the-art feature extractors, the receptive fields of neurons may still be limited, meaning that the model may not cover the entire attention map. To this end, BAR-CNN proposes a box attention mechanism, which allows the visual relationship detection task to reuse existing object detection models without introducing additional complex components. This is a very interesting idea, and BAR-CNN also achieves competitive recognition performance. A schematic diagram of BAR-CNN is shown in Figure 9(c).

CNN-based SGG methods have been widely studied. However, many challenges still require further research, including how to ensure deep interaction between the different features of a triplet while reducing computational complexity as much as possible, and how to handle real but very sparse visual relationships. Solving these problems will further deepen research on CNN-based SGG methods.

2.4  RNN/LSTM-based SGG

A scene graph is a structured representation of an image. The interaction of information between different objects and their contextual information are crucial for recognizing the visual relationships between them. RNN- and LSTM-based models have natural advantages in obtaining contextual information in scene graphs and reasoning over structured information in graph structures; therefore, RNN/LSTM-based methods are also a popular research direction. As shown in Figure 12, several improved SGG models have been proposed based on standard RNN/LSTM networks. For example, IMP (Iterative Message Passing) and MotifNet (Stacked Motif Network) consider the feature interaction of local and global context information respectively. Similarly, PANet (Predicate Association Network) uses instance-level and scene-level context information, and SIG (Sketching Image Gist) introduces an attention-based RNN. The corresponding probabilistic interpretation of these RNN/LSTM-based SGG methods can be simplified to the conditional form of Eq. (7); these methods mainly use standard or improved RNN/LSTM networks to infer relationships by optimizing P(R|B, O, I).

Later RNN/LSTM-based models try to learn different types of contextual information by designing structured RNN/LSTM modules, such as AHRNN (attention-based hierarchical RNN) and VCTree (visual context tree model). This type of SGG method treats the scene graph as a hierarchical graph structure, so it is necessary to construct a hierarchical entity tree based on region proposals. Hierarchical context information can then be encoded as:

D=BiTreeLSTM(\{z_i\}_{i=0}^n),

where z_i are the features of the input nodes in the constructed hierarchical entity tree. Finally, a multilayer perceptron (MLP) classifier is used to predict the predicate p.
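A faithful BiTreeLSTM needs the constructed entity tree, so the sketch below uses a chain bidirectional LSTM as a simplified stand-in for D = BiTreeLSTM({z_i}), followed by the MLP predicate classifier; all sizes and the pairing scheme are illustrative.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Simplified stand-in for D = BiTreeLSTM({z_i}): a chain BiLSTM over node
    features, followed by an MLP predicate classifier over object pairs."""
    def __init__(self, in_dim, hidden, num_predicates):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(nn.Linear(4 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_predicates))

    def forward(self, z, pairs):                 # z: (1, n, in_dim); pairs: list of (i, j)
        ctx, _ = self.bilstm(z)                  # (1, n, 2*hidden) contextualized nodes
        feats = torch.stack([torch.cat([ctx[0, i], ctx[0, j]]) for i, j in pairs])
        return self.mlp(feats)                   # predicate logits per object pair

enc = ContextEncoder(in_dim=256, hidden=128, num_predicates=50)
logits = enc(torch.rand(1, 5, 256), pairs=[(0, 3), (2, 4)])   # shape (2, 50)
```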

As mentioned above, IMP was proposed in order to fully use the contextual information in an image to improve the accuracy of SGG. IMP attempts to solve the scene graph inference problem using standard RNNs and iteratively improves the model's predictions through message passing. The main highlight of this method is its novel primal-dual graph, which enables bidirectional flow of node and edge information and iteratively updates the two GRUs for nodes and edges. This form of information interaction helps the model recognize visual relationships between objects more accurately. Unlike IMP, where local information interacts, MotifNet starts from the observation that the strong independence assumption in local predictors limits the quality of global predictions. To this end, MotifNet encodes global context information through a recurrent sequential architecture based on LSTMs (Long Short-Term Memory networks). However, MotifNet only considers contextual information between objects and does not consider scene information. There are also works that study relationship classification by exchanging context between nodes and edges. Nevertheless, the above SGG methods mainly focus on the structural semantic features in the scene and ignore the correlation between different predicates. To this end, a two-stage predicate association network (PANet) was proposed. The main goal of the first stage is to extract instance-level and scene-level contextual information, while the second stage mainly captures the correlation between predicate-aligned features; in particular, an RNN module is used to fully capture the correlation between aligned features. This kind of predicate correlation analysis also achieves good results.

However, the methods discussed above often rely on object detection and predicate classification between objects. This has two inherent limitations: first, the object bounding boxes or relation pairs generated by object detection methods are not always necessary for generating a scene graph; second, SGG relies on a probabilistic ranking of output relations, which results in semantically redundant relationships. To this end, AHRNN proposes a hierarchical recurrent neural network based on a visual attention mechanism. This method first uses the visual attention mechanism to address the first limitation. Second, AHRNN treats the recognition of relation triples as a sequence learning problem using recurrent neural networks (RNNs). In particular, it adopts a hierarchical RNN to model relation triples, which handles long-term context and sequence information more effectively and thereby avoids ranking the probabilities of output relations.

On the other hand, VCTree observes that previous scene graph models use either a chain structure or a fully connected graph. VCTree argues that these two prior structures may not be optimal: the chain structure is too simple and may only capture simple spatial information or co-occurrence bias, while fully connected graphs lack the structure needed to distinguish hierarchical relationships from parallel relationships. To solve this problem, VCTree proposes a composite dynamic tree structure that can use TreeLSTM [84] for efficient context encoding, thereby effectively representing the hierarchical and parallel relationships among visual relationships. This tree structure provides a new research direction for scene graph representation. Figure 13 compares the chain structure, the fully connected graph structure, the subgraph structure, and the dynamic tree structure of a scene graph.

SIG also proposes a scene graph with a similar tree structure; the key difference comes from the observation that, when analyzing a scene, humans tend to describe the subjects and key relationships in the image first, which means that hierarchical analysis with primary and secondary orders is more in line with human habits. To this end, SIG proposes a human-like hierarchical SGG method in which the scene is represented by a human-mimetic HET (Hierarchical Entity Tree) composed of a series of image regions, and the HET is parsed with a Hybrid Long Short-Term Memory (Hybrid-LSTM) to capture the hierarchy and the sibling context information in the HET.

2.5 GNN-based SGG

A scene graph can be viewed as a graph structure, so an intuitive approach is to improve scene graph generation with the help of graph theory. The GCN (Graph Convolutional Network) is such a method: it is designed to handle graph-structured data, operates on local information, and can effectively learn information from adjacent nodes. GCNs have proven very effective in tasks such as relational reasoning, graph classification, node classification in large graphs, and visual understanding. Therefore, many researchers have studied GCN-based SGG methods directly. Similar to the conditional form of the RNN/LSTM-based SGG methods (Eq. (7)), and following the notation used in this article, the GCN-based SGG process can also be decomposed into three parts:

P(\langle V,E,O,R\rangle|I)=P(V|I)*P(E|V,I)*P(R,O|V,E,I)

where V is the set of nodes (objects in the image) and E is the set of edges in the graph (relationships between objects). On this basis, subsequent improved GNN-based SGG methods have been proposed. Most of these methods try to optimize the terms P(E|V, I) and P(R, O|V, E, I) by designing relevant modules, and design GCN-based networks for the graph labeling process P(R, O|V, E, I). Figure 14 shows some classic GNN-based SGG models. F-Net (Factorizable Net) completes the final SGG by decomposing and merging subgraphs; attention mechanisms are then introduced to design different types of GNN modules for SGG, such as Graph R-CNN, GPI, and ARN (Attentional Relational Network). Few-shot training and multi-agent training are applied in few-shot SGP [101] and CMAT (Counterfactual critic Multi-Agent Training) respectively. A probabilistic graph network (PGN) is designed for DG-PGNN, and a multi-modal graph ConvNet is developed for SGVST. In addition, other improved GNN-based network modules have been proposed for SGG, which we describe in detail below.
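To ground the P(R, O | V, E, I) term, here is a generic single-layer graph-convolution sketch that refines object-node features over a candidate adjacency matrix before any object/predicate classifiers are applied; it is a textbook GCN step, not the architecture of any particular SGG paper above.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One generic graph-convolution step: aggregate neighbor features via a
    normalized adjacency matrix, then apply a linear transform and ReLU."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):                   # x: (n, d), adj: (n, n) with self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        return torch.relu(self.linear(adj @ x / deg))   # mean-aggregation variant

# Usage: refine object-node features over a candidate graph, then classify
# objects / predicates from the refined features (classifiers omitted).
n, d = 5, 256
adj = torch.eye(n) + (torch.rand(n, n) > 0.7).float()   # toy candidate edges
x = torch.rand(n, d)
h = GCNLayer(d, d)(x, adj)
```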

As mentioned in Section 1.2, current SGG methods can be roughly divided into two categories: bottom-up methods and top-down methods. However, these frameworks build a quadratic number of candidate relationships, which is time-consuming. To this end, an efficient subgraph-based SGG framework called Factorizable Net (F-Net) is proposed to improve the efficiency of scene graph generation. In this approach, detected object region proposals are paired to construct a complete directed graph; then, the edges corresponding to similar union regions are merged into subgraphs to generate a more precise graph, where each subgraph has several objects and the relationships between them are represented by edges. F-Net achieves higher scene graph generation efficiency by replacing the original scene graph with these subgraphs. Furthermore, Graph R-CNN attempts to prune the original scene graph (removing unlikely relationships), thereby generating a sparse candidate graph structure. Finally, an attentional graph convolutional network (AGCN) is used to integrate global context information to achieve more efficient and accurate SGG.

Graph-based attention mechanisms also have important research value in scene graph generation. For example, previous SGG work often requires prior knowledge of the graph structure. Furthermore, these methods tend to ignore the overall structure and information of the entire image because they capture the representations of nodes and edges in a step-by-step manner, and detecting the visual relationships of region pairs one by one is not well suited to describing the structure of the whole scene. To this end, ARN proposes a semantic transformation module that produces semantic embeddings by transforming label embeddings and visual features into the same space, while a relation inference module is used to predict entity categories and relationships as the final scene graph result. In particular, to facilitate describing the structure of the entire scene, ARN proposes a graph-based self-attention model that embeds a joint graph representation describing all relationships; this module helps generate more accurate scene graphs. Additionally, one intuition is that, when recognizing an image of "a person riding a horse", the interaction between the person's legs and the horse's back can provide strong visual evidence for recognizing the predicate. RAAL-SGG (Toward Region-Aware Attention Learning for SGG) therefore points out that it is limiting to study SGG using only coarse-grained bounding boxes, and proposes a region-aware attention learning method that uses an object-level attention map neural network for finer-grained object region reasoning. The probability function of this model can be expressed as:

P(SG|I)=P(B|I)*P(F|B,I)*P(O|B,I,F)*P(R|B,I,F,O),\quad F=\{f^{(n)}\}_{n=1}^N, \qquad (17)

where f^{(n)} is the region set of the n-th object. Unlike Eq. (7), the object region set F considered in Eq. (17) is finer-grained than the coarse-grained bounding boxes B, which helps the model reason about predicates with the help of object interaction regions.

The reading order of entities in an RNN/LSTM-encoded context also has a crucial impact on SGG when predicting the visual relationships of the scene graph, and a fixed reading order may not be optimal. Even for different types of inputs, the scene graph generator should reveal the connections and relationships between objects to improve prediction accuracy. Formally, given the same features, the framework or function F should produce the same result even if the input is permuted. Motivated by this observation, a neural network architecture used for SGG should ideally be invariant to such input permutations. Herzig et al. proved this property; based on such an architecture or framework, information can be collected from the whole graph in a permutation-invariant manner. Building on this property, the authors proposed several common architectures and achieved competitive performance.

For most SGG methods, the long-tail distribution of relationships remains a challenge for relation feature learning, and existing methods often cannot handle unevenly distributed predicates. Therefore, Dornadula et al. try to construct scene graphs through few-shot learning of predicates, which can be extended to new predicates. Their few-shot SGG model fully trains graph convolution models and spatial and semantic shift functions on relationships with abundant data; new shift functions are then fine-tuned on a small number of examples of new, rare relationships. Compared with traditional SGG methods, the novelty of this model is that predicates are defined as functions, so the object representations are very useful for few-shot predicate prediction; these functions include a forward function that transforms the subject representation into the object representation and an inverse function that transforms the object representation back into the subject representation. This model achieves good results in learning rare predicates.

A comprehensive, accurate, and coherent scene graph is what we expect to achieve, and the semantics of the same node in different visual relationships should also be consistent. However, the currently widely used cross-entropy-based supervised learning paradigm may not guarantee such visual context consistency.

RelDN (Relation Detection Network) also found that using cross-entropy loss alone may have adverse effects on predicate classification, for example entity instance confusion (confusion between different instances of the same type) and proximal relationship ambiguity (the subject-object pairing problem among different triplets that share the same predicate). RelDN was proposed to solve these two problems. In RelDN, three types of features of relationship proposals, semantic, visual, and spatial, are combined together by addition; softmax normalization is then applied to obtain the distribution over predicate labels. Finally, graph contrastive losses are specifically constructed to address the two problems above.

Scene graphs provide a natural representation for reasoning tasks. Unfortunately, due to their non-differentiable representation, it is difficult to use scene graphs directly as intermediate components of visual reasoning tasks. DSGs (Differentiable Scene Graphs) are proposed to overcome this obstacle: the visual features of detected objects are taken as input to the differentiable scene graph generation module, whose output is a new set of node and edge features. The novelty of the DSG architecture lies in its decomposition of scene graph components, such that each element of a triple is represented by a dense descriptor. Therefore, DSGs can be used directly as intermediate representations for downstream reasoning tasks.

Although we have discussed many GNN-based SGG methods, there are still many other related methods. For example, a deep generative probabilistic graph neural network (DG-PGNN) is proposed to generate scene graphs with uncertainty, and SGVST introduces a scene-graph-based approach to generate story sentences from image streams, using a GCN to capture local fine-grained regional representations of objects in the scene graph.

In summary, GNN-based SGG methods have received extensive research attention because of their evident ability to capture structured information. As mentioned before, the recognition of predicates is interrelated, and contextual information plays a crucial role in scene graph generation. For this reason, researchers are paying increasing attention to RNN/LSTM-based and graph-based methods: RNN/LSTM has strong relational context modeling capabilities, and the inherent graph structure of the scene graph has likewise drawn attention to GNN-based SGG. In addition, TransE-based SGG is welcomed by researchers because of its intuitive modeling, which makes models highly interpretable. Since CNNs have strong visual feature learning capabilities, CNN-based SGG methods will remain a mainstream choice in the future.

3 SGG WITH PRIOR KNOWLEDGE

For SGG, a relationship is a combination of objects, and its semantic space is broader than that of objects; moreover, it is very difficult to exhaust all relationships from the SGG training data. Effectively learning relational representations from small amounts of training data is therefore particularly critical, and the introduction of prior knowledge can greatly help in detecting and recognizing visual relationships. In order to generate complete scene graphs efficiently and accurately, prior knowledge (such as language priors, visual priors, knowledge priors, context, etc.) is crucial. In this section, we introduce related SGG work that exploits prior knowledge.

3.1 SGG with Language Prior

Language priors often exploit the information embedded in semantic words to fine-tune the likelihood of relation predictions, thereby improving the accuracy of visual relationship prediction. Language priors can help identify visual relationships through observations of semantically related objects. For example, horses and elephants may appear in semantically similar contexts, e.g., "a man riding a horse" and "a man riding an elephant." Therefore, even though elephants and people rarely appear together in the training set, by introducing language priors and studying more common examples (such as "a person riding a horse"), we can still easily infer that a similar riding relationship may exist between the person and the elephant. This idea is illustrated in Figure 15. This approach also helps address the long-tail effect in visual relationships.

Many researchers have conducted detailed studies on introducing language priors. For example, Lu et al. propose training a visual appearance module and a language module simultaneously, and then combining the two scores to infer visual relationships in images. In particular, the language prior module projects semantically similar relationships into a tighter embedding space, which helps the model infer a similar visual relationship ("man riding an elephant") from the "man riding a horse" example. Similarly, VRL (deep variation-structured reinforcement learning) and CDDN (context-dependent diffusion network) also use language priors to improve the prediction of visual relationships. The difference is that CDDN uses semantic word embeddings to fine-tune the likelihood of predicted relationships, while VRL uses a variation-structured traversal scheme over a directed semantic action graph built from language priors, which means the latter can provide a richer and more compact representation of semantic associations than word embeddings. Furthermore, CDDN finds that similar objects have close internal correlations, which can be used to infer new visual relationships. To this end, CDDN adopts word embeddings to obtain a semantic graph while constructing a spatial scene graph to encode global contextual interdependencies. CDDN can effectively learn latent representations of visual relationships by combining prior semantics with visual scenes; in addition, given its isomorphism invariance on graphs, it is well suited to visual relationship detection.

On the other hand, although language priors can bridge the gap between model complexity and dataset complexity, their effectiveness is limited when semantic word embeddings are insufficient. To this end, a relation learning module with a prior predicate distribution is further introduced on top of IMP to better learn visual relationships. In more detail, a pre-trained tensor-based relation module is added as a dense relational prior before fine-tuning the relation estimation, and an iterative message passing scheme with GRUs is used, in the manner of a GCN, to obtain better feature representations and improve SGG performance. In addition to language priors, visual cues are also incorporated to identify visual relationships and localize phrases in images. This models the appearance, size, position, and attributes of entities, as well as the spatial relationships between pairs of objects connected by verbs or prepositions, and jointly infers visual relationships by automatically learning and combining the weights of these cues.

3.2 SGG with Statistical Prior

Statistical priors are another form of prior knowledge widely used in SGG, because objects in visual scenes usually exhibit strong structural regularities. For example, people tend to wear shoes, and mountains tend to have water around them. Furthermore, <cat-eat-fish> is common, while <fish-eat-cat> and <cat-ride-fish> are very unlikely. Such relationships can therefore be represented by statistical prior knowledge, and modeling the statistical correlations between object pairs and relationships can help us identify visual relationships correctly.

Due to the large space and long-tail nature of the relationship distribution, using only the annotations contained in the training set is not sufficient, and it is difficult to collect a sufficient amount of labeled training data. Therefore, LKD (Linguistic Knowledge Distillation) uses not only the annotations in the training set but also publicly available text from the Internet (Wikipedia) to collect external linguistic knowledge. This is mainly achieved by counting the words and expressions humans use to describe the relationships between object pairs in text, and then computing the conditional probability distribution of the predicate given a <subj,obj> pair, i.e., P(pred|subj, obj). A novel contribution is the use of knowledge distillation to obtain prior knowledge from internal and external linguistic data to address the long-tail relationship problem.
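The statistical prior described here reduces to counting: the sketch below estimates P(pred | subj, obj) from <s-p-o> annotations (or text-mined co-occurrences) with simple Laplace smoothing, and such a table can then be used to re-weight a model's predicate scores; the smoothing scheme and function names are illustrative.

```python
from collections import Counter

def build_predicate_prior(triplets, num_predicates=70, alpha=1.0):
    """Estimate P(pred | subj, obj) from <s-p-o> annotations by counting,
    with Laplace smoothing over the predicate vocabulary."""
    pair_counts = Counter()
    triple_counts = Counter()
    for s, p, o in triplets:
        pair_counts[(s, o)] += 1
        triple_counts[(s, p, o)] += 1

    def prior(s, p, o):
        return (triple_counts[(s, p, o)] + alpha) / \
               (pair_counts[(s, o)] + alpha * num_predicates)

    return prior

prior = build_predicate_prior([("cat", "eat", "fish"), ("cat", "watch", "fish"),
                               ("cat", "eat", "fish")])
print(prior("cat", "eat", "fish"))    # (2 + 1) / (3 + 70) ~ 0.041
print(prior("fish", "eat", "cat"))    # (0 + 1) / (0 + 70) ~ 0.014: implausible triple
```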

Similarly, DR-Net (Deep Relational Network) also notices the strong statistical correlations within <s-p-o> triplets. The difference is that DR-Net proposes a deep relational network to exploit these statistical correlations. DR-Net first extracts the local regions and spatial masks of each pair of objects, and then feeds them, together with the appearance of each single object, into the deep relational network for joint analysis to obtain the most likely relationship category. In addition, MotifNet performed a statistical analysis of the co-occurrence of relationships and object pairs on the Visual Genome dataset and found that these statistical co-occurrences provide strong regularization for relationship prediction. To this end, MotifNet uses LSTMs to encode the global context of objects and relationships so that the scene graph can be parsed. However, although the above methods observe the statistical co-occurrence of triplets, the deep models they design mine this statistical information only implicitly through message passing. KERN (Knowledge-Embedded Routing Network) also notices this statistical co-occurrence; the difference is that KERN formally represents this statistical knowledge as a structured graph, which is incorporated into the deep propagation network as additional guidance. This effectively regularizes the distribution of possible relationships and thereby reduces the ambiguity of predictions.

In addition, similar statistical priors are also used for complex indoor scene analysis. Statistical priors can effectively improve the performance of corresponding scene analysis tasks.

3.3 SGG with Knowledge Graph

A knowledge graph is a rich knowledge base that encodes how the world is structured. Commonsense knowledge graphs have been used as prior knowledge to effectively aid the generation of scene graphs.

To this end, GB-Net (Graph Bridging Network) proposes a new perspective that places scene graphs and knowledge graphs in a unified framework. More specifically, GB-Net regards the scene graph as an image-conditioned instantiation of the commonsense knowledge graph; based on this view, scene graph generation is redefined as a bridging mapping between the scene graph and the commonsense graph. Furthermore, the biases of existing labeled datasets in object pair and relation labels, as well as the noise and missing annotations they contain, increase the difficulty of developing reliable scene graph prediction models. To this end, KB-GAN (knowledge base and assisted image generation) proposes an SGG algorithm based on external knowledge and an image reconstruction loss to overcome these dataset problems. More specifically, KB-GAN uses the English subgraph of ConceptNet as its knowledge graph, and its knowledge-based module improves the feature refinement process by reasoning over the commonsense knowledge retrieved from ConceptNet. Similarly, many other related works use knowledge graphs as prior knowledge to assist relationship prediction.

Figure 16 shows the pipelines of SGG models using different types of prior knowledge. Prior knowledge has been shown to significantly improve the quality of SGG. Existing methods either use external curated knowledge bases such as ConceptNet or use statistical information from annotated corpora to obtain commonsense data. However, the former is limited by the incompleteness of external knowledge, while the latter is often based on hard-coded heuristics such as the co-occurrence probabilities of given categories. Therefore, the latest research attempts, for the first time, to treat the acquisition of visual commonsense as a machine learning task, obtaining visual commonsense data automatically from the dataset itself and improving the robustness of scene understanding. Although this exploration is valuable, how to obtain and fully exploit such prior knowledge remains a difficult problem that deserves further attention.

4 LONG-TAILED DISTRIBUTION ON SGG (If you are interested, learn about it yourself)

The Action Genome dataset is used to effectively mitigate the long-tail distribution problem, making it a first-choice dataset for this research direction.

5 DATASETS AND PERFORMANCE EVALUATION

5.1 Datasets

In this section, we provide a detailed summary of commonly used datasets in SGG tasks so that interested readers can make choices accordingly. We investigated a total of 14 commonly used datasets, including 10 static image scene graph datasets, 2 video scene graph datasets and 2 3D scene graph datasets.

  • RW-SGD: is built by manually selecting 5000 images from the YFCC100m and Microsoft COCO datasets, and then using AMT (Amazon’s Mechanical Turk) to generate artificially generated scene graphs from these selected images.
  • Visual Relation Dataset (VRD): is built for the task of visual relationship prediction. The construction of VRD highlights the long tail of infrequent relationships.
  • UnRel Dataset (UnRel-d): is a new challenging unusual relational data set containing more than 1000 images that can be queried via 76 triple queries. The relatively small number of data and visual relationships in UnRel-D makes the long-tail distribution of relationships in this dataset less obvious.
  • HCVRD dataset: contains 52,855 images, 1,824 object categories, 927 predicates, and 28,323 relationship types. Similar to VRD, HCVRD also has a long-tail distribution of infrequent relationships.
  • Open Images: a large-scale dataset that provides a large number of examples for object detection and relationship detection.
  • Visual Phrase: a dataset containing visual relationships, mainly used to improve object detection. It contains 13 common relationship types.
  • Scene Graph: a dataset containing visual relationships, designed to improve image retrieval tasks. Although it contains 23,190 relationship types, there are on average only 2.3 predicates per object category.
  • Visual Genome (VG), VG150 and VrR-VG (Visually-Relevant Relationships Dataset): VG is a large-scale visual dataset composed of various components, including objects, attributes, relationships, question-answer pairs, and so on. VG150 and VrR-VG are two datasets built on top of VG. VG150 removes objects with poor annotation quality, overlapping bounding boxes, and/or ambiguous object names, retaining 150 commonly used object categories. VrR-VG is also constructed from VG and filters out visually irrelevant relationships: by applying a hierarchical clustering algorithm to the word vectors of relationships, the top 1,600 objects and 500 relationships are selected from VG. VrR-VG is therefore a scene graph dataset that highlights visually relevant relationships, and the long-tail distribution of relationships in it is suppressed to a large extent.
  • CAD120++ and Action Genome: two video action reasoning datasets containing scenes of daily human activities. They can be used for analysis tasks related to spatiotemporal scene graphs.
  • Armeni and 3DSSG: two large-scale 3D semantic scene graph datasets covering 3D reconstructions of indoor buildings and real scenes. They are widely used in research related to 3D scene understanding (robot navigation, augmented and virtual reality, etc.).

Information on these datasets, together with their main properties as used in SGG tasks, is summarized in Table 1.

6.2 Evaluation method and performance comparison

Visual relationship detection is the core of SGG. The commonly used visual relationship detection and evaluation settings are:

  • Predicate Detection (Predicate Det.) (Figure 19(a)). It requires predicting possible predicates between pairs of objects, given a set of localized objects and their object labels. The goal is to study relationship prediction performance without the constraints of object detection [184].
  • Phrase Detection (Phrase Det.) (Figure 19(c)). It requires predicting a <subject-predicate-object> triplet for a given image and localizing one bounding box for the entire relationship that overlaps the ground-truth box by at least 0.5 IoU (a minimal sketch of this localization check is given after this list).
  • Relationship Detection (Relationship Det.) (Figure 19(d)). It requires outputting a set of <subject-predicate-object> triplets for a given image while localizing the objects in the image.
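
As referenced in the Phrase Det. item above, the following is a minimal sketch of the union-box IoU test, assuming boxes in (x1, y1, x2, y2) format; the helper names (union_box, iou, phrase_hit) are illustrative, not taken from any benchmark's official code.

```python
def union_box(box_a, box_b):
    """Smallest box enclosing both input boxes, all in (x1, y1, x2, y2) format."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def phrase_hit(pred_subj_box, pred_obj_box, gt_union_box, thresh=0.5):
    """Phrase Det. localization test: the union box of the predicted subject
    and object must overlap the ground-truth union box by at least `thresh` IoU."""
    return iou(union_box(pred_subj_box, pred_obj_box), gt_union_box) >= thresh
```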

However, the above three evaluation settings do not consider the long-tail distribution phenomenon common in real scenes, nor graph-level coherence. The evaluation of SGG was therefore further refined into "SGG diagnosis", which is based on the following three key metrics:

  • Relationship Retrieval (RR). This can be further divided into three subtasks.
    • Predicate Classification (PredCls): same as Predicate Det.
    • Scene Graph Classification (SGCls) (Figure 19(b)): its input is the ground-truth bounding boxes without labels, and it must predict both object categories and predicates.
    • Scene Graph Detection (SGDet): detect the scene graph from scratch; this is the same as Relationship Det.
  • Zero-Shot Relationship Retrieval (ZSRR). The visual relationships that ZSRR tests have never been observed in the training set. ZSRR has the same three subtasks as RR.
  • Sentence-to-Graph Retrieval (S2GR). Both RR and ZSRR are evaluated at the triplet level; the purpose of S2GR is to evaluate scene graphs at the graph level. It uses image caption sentences as queries to retrieve images represented by scene graphs.

Recall@K (R@K) is commonly used as the evaluation metric for the above tasks. However, because of reporting bias, R@K is easily dominated by high-frequency predicates. Mean Recall@K (mR@K) was therefore proposed: it computes R@K for each predicate separately and then averages over all predicates. Graph constraints are another factor to consider: some earlier works allow only one relation for a given object pair when computing R@K, while other works omit this constraint and allow multiple relations to be retrieved.
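
As a rough illustration of the difference between the two metrics, here is a minimal sketch of R@K and mR@K for a single image, assuming predictions are (triplet, score) pairs and a triplet is a (subject, predicate, object) tuple; in practice mR@K aggregates per-predicate recall over the whole test set before averaging, and the function names here are purely illustrative.

```python
from collections import defaultdict

def recall_at_k(pred_triplets, gt_triplets, k):
    """R@K: fraction of ground-truth triplets recovered among the top-K
    predictions (ranked by score) for one image."""
    topk = {t for t, _ in sorted(pred_triplets, key=lambda x: -x[1])[:k]}
    return len(topk & set(gt_triplets)) / max(len(gt_triplets), 1)

def mean_recall_at_k(pred_triplets, gt_triplets, k):
    """mR@K: compute recall separately for each predicate class and then
    average, so infrequent predicates are not drowned out by frequent ones."""
    per_predicate_gt = defaultdict(set)
    for triplet in gt_triplets:          # triplet = (subject, predicate, object)
        per_predicate_gt[triplet[1]].add(triplet)
    topk = {t for t, _ in sorted(pred_triplets, key=lambda x: -x[1])[:k]}
    recalls = [len(topk & gts) / len(gts) for gts in per_predicate_gt.values()]
    return sum(recalls) / max(len(recalls), 1)
```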

Currently, most existing SGG methods use the three graph-constrained subtasks of RR for performance evaluation. Following the classification outlined in Sections 2 and 3, Table 2 summarizes the performance of the relevant SGG methods. Most current methods are graph-based or RNN/LSTM-based SGG methods. Compared with R@K, mR@K values are generally lower; for datasets with an obvious long-tail distribution, mR@K is the fairer evaluation metric. In particular, methods such as VTransE, FactorizableNet, IMP, IMP+, Pixels2Graphs, FCSGG, and VRF use only visual features, and their performance is generally low. In contrast, VCTree, KERN, GPS-Net, and GB-Net exploit other knowledge (such as language embeddings, statistical information, counterfactual causality, etc.) in addition to visual features, which gives them additional knowledge and thus larger performance improvements.

Furthermore, almost all current methods hold that predictions of objects and relationships are interrelated rather than independent of each other. They therefore try to consider contextual information between objects, or use techniques such as GNNs, LSTMs, and message passing to capture these dependencies. However, these methods are usually trained with a cross-entropy loss (Equation 18 in the paper), which essentially still treats objects and relationships as independent entities. EBM proposes a new, carefully designed energy-based loss function that computes the joint likelihood of objects and relations; compared with the traditional cross-entropy loss, it achieves consistent mR@K improvements on multiple classic models, and EBM currently shows the best performance among the algorithms surveyed. In addition, ZSRR is also an important task, yet most current methods do not evaluate on it. Paying more attention to ZSRR evaluation would help the study of the long-tail distribution in scene graphs.
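
To make the independence point concrete, below is a minimal PyTorch-style sketch of the conventional training objective: one cross-entropy term for object labels and one for predicate labels, simply summed. The function name and tensor shapes are assumptions for illustration; the joint, energy-based alternative used by EBM is not reproduced here.

```python
import torch.nn.functional as F

def independent_sgg_loss(obj_logits, obj_labels, rel_logits, rel_labels):
    """Conventional SGG objective: each object prediction and each predicate
    prediction is penalized on its own, so objects and relationships are
    effectively treated as independent entities (the issue an energy-based,
    joint loss is designed to address)."""
    obj_loss = F.cross_entropy(obj_logits, obj_labels)  # obj_logits: [N_obj, C_obj]
    rel_loss = F.cross_entropy(rel_logits, rel_labels)  # rel_logits: [N_rel, C_rel]
    return obj_loss + rel_loss
```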

7 FUTURE RESEARCH

SGG aims to mine the relationships between objects in images or scenes to form a relationship graph. Although there are currently many related studies on SGG, there are still many directions worthy of attention.

  1. Long-tailed distributions in visual relationships. The inexhaustibility of visual relationships and the complexity of real scenes make the long-tail distribution of visual relationships inevitable. The data balance required for model training and the long-tail distribution of real data are therefore in inherent tension. Associative reasoning across scenes through similar objects or similar relationships may be a suitable research direction, as it may help alleviate, to some extent, the long-tail distribution problem on current scene graph datasets. In addition, evaluation metrics or tasks designed specifically for the long-tail setting are a potential research direction, since they would allow a fairer assessment of a model's learning ability in zero-/one-/few-shot settings; however, related research is still very limited. Although the long-tail problem has received extensive attention from researchers (Section 4), there are still a large number of potential, infrequent, unfocused, or even unseen relationships in scenes that remain to be explored.
  2. Relationship detection between distant objects. Currently, scene graph generation is based on a large number of small-scale relationship graphs, abstracted from small scenes in scene graph datasets through associated relationship prediction and inference models. The selection of potentially effective relationships [60], [61] and the establishment of the final relationships in the scene graph rely largely on the spatial distance between objects, so that relationships between two distant objects are usually assumed not to exist. In larger scenes, however, many such relationships do exist. Therefore, large-scale images could be added in an appropriate proportion to existing scene graph datasets, and relationships between distant objects could be properly considered during SGG to improve the completeness of the scene graph.
  3. SGG based on dynamic images. Current scene graphs are generated from the static images in scene graph datasets, and object relationships are predicted for static objects through the relevant inference models. There is little related research, and few works consider the role of objects' dynamic behavior in relationship prediction and inference. In practice, however, many relationships may need to be predicted from consecutive actions or events, i.e., video-based relationship detection and reasoning. Compared with static images, spatiotemporal scene graph analysis of dynamic images clearly offers a wider range of application scenarios. We therefore believe it is necessary to focus on relationship prediction based on the dynamic actions of objects in videos.
  4. Models and methods for visual reasoning. For SGG, the current mainstream approach is based on object detection and recognition. However, due to the limitations of existing scene graph datasets and of the relationship prediction models derived from them, it is difficult for existing models to continuously improve their relationship prediction capability. We therefore believe that online learning, reinforcement learning, and active learning are methods or strategies that could be introduced into future SGG approaches, as they would enable SGG models to continuously enhance their relationship prediction ability by leveraging large and constantly updated real-world data.

Overall, research in the field of scene graphs is developing rapidly and has broad application prospects. Scene graphs are expected to further facilitate the understanding of, and reasoning about, higher-level visual scenes. However, current research on scene graphs is not yet mature, and more effort and exploration are needed.

Please understand that there may be inaccuracies in the above statements. This is only used as a record for subsequent study.
