Fei-Fei Li's new work: generating images from scene graphs

AI Technology Review Press: Recently, Fei-Fei Li's student Justin Johnson uploaded a paper to arXiv, "Image Generation from Scene Graphs," which proposes generating images from structured scene graphs rather than unstructured text. This allows the model to reason explicitly about objects and the relationships between them, and to generate complex images containing multiple identifiable objects.


Abstract

To truly understand the visual world, a model must not only be able to recognize images, but also be able to generate them. Exciting progress has recently been made in generating images from natural language descriptions. These methods give stunning results on limited domains, such as descriptions of birds or flowers, but struggle to faithfully reproduce complex sentences with many objects and relationships. To overcome this limitation, the authors propose a method for generating images from scene graphs, which enables explicit reasoning about objects and their relationships. Their model uses graph convolutions to process the input graph, computes a scene layout by predicting bounding boxes and segmentation masks for objects, and converts the layout into an image with a cascaded refinement network. The network is trained adversarially against a pair of discriminators to ensure that the output images are realistic. Experiments validate the method on the Visual Genome and COCO-Stuff datasets, where qualitative results and user studies demonstrate that it can generate complex images with multiple objects.

Background

What I cannot create, I do not understand. —Richard Feynman

The act of creation rests on a deep understanding of what is being created. For example, cooks understand food more deeply than diners, novelists understand writing more deeply than readers, and filmmakers understand movies more deeply than moviegoers. If a computer vision system is to truly understand the visual world, it must not only be able to recognize images, but also be able to generate them.

Beyond conveying deep visual understanding, methods for generating realistic images may also be useful in practice. In the short term, automatic image generation can help artists and graphic designers work more effectively. One day, image and video search engines that retrieve existing content might be replaced by algorithms that generate custom images and videos tailored to each user's personal interests.

As a step toward these goals, there has been exciting recent progress on text-to-image synthesis: generating images from natural language descriptions by combining recurrent neural networks with generative adversarial networks. (The first author completed this work during an internship at Google Cloud AI.)


Figure 1

Leading methods for generating images from sentences, such as StackGAN, struggle to faithfully depict complex sentences with many objects. The authors overcome this limitation by generating images from scene graphs, in which objects and their relationships can be reasoned about explicitly.

These methods can produce stunning results on limited domains, such as detailed descriptions of birds or flowers. However, as shown in Figure 1, leading sentence-to-image methods do not perform well on complex sentences containing many objects.

A sentence is a linear structure, one word after another; however, as shown in Figure 1, the information conveyed by a complex sentence is often more explicitly represented as a scene graph of objects and their relationships. Scene graphs are a powerful structured representation for both images and language; they have been used for semantic image retrieval and for evaluating and improving image captioning, and methods have also been developed for converting sentences into scene graphs and for predicting scene graphs from images.

In this paper, the authors aim to generate images with multiple objects and complex relationships by conditioning generation on scene graphs, enabling the model to reason unambiguously about objects and their relationships.

This new task brings new challenges. The authors had to develop a method for processing scene-graph inputs; for this, they use a graph convolutional network that passes information along the edges of the graph. After processing the graph, the gap between the symbolic, graph-structured input and the 2D image output must be bridged; for this, a scene layout is constructed by predicting bounding boxes and segmentation masks for all objects in the graph. Given a layout, an image respecting it must be generated; for this, a cascaded refinement network (CRN) is used, which processes the layout at increasing spatial scales. Finally, the generated images must be realistic and contain recognizable objects; to this end, the model is trained adversarially against a pair of discriminator networks operating on image patches and on generated objects. All components of the model are learned jointly in an end-to-end fashion.
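To make the adversarial training step more concrete, below is a minimal sketch of the kind of objective used with a patch-level image discriminator and an object-level discriminator; the non-saturating binary cross-entropy form and the function names are illustrative assumptions rather than the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def gan_losses(d_real, d_fake):
    """Sketch of an adversarial objective for one discriminator (either the
    patch-level image discriminator or the object discriminator), given its
    logits on real inputs (d_real) and on generated inputs (d_fake)."""
    # Discriminator: score real inputs as 1 and generated inputs as 0.
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    # Generator: make generated inputs score as 1 (non-saturating form).
    g_loss = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    return d_loss, g_loss
```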

The authors conduct experiments on two datasets: Visual Genome, which provides human-annotated scene graphs, and COCO-Stuff [3], for which synthetic scene graphs are constructed from ground-truth object positions. On both datasets, qualitative results demonstrate that the method can generate complex images respecting the objects and relationships of the input scene graph, and comprehensive ablation studies validate each component of the model.

Automatic evaluation of generative image models is itself a challenging problem, so the results were also evaluated through two Amazon Mechanical Turk user studies. Compared with StackGAN, a leading text-to-image synthesis system, users found that this method's results better matched the COCO captions in 68% of trials, and on average about 59% of the objects in its images were recognizable.

Method

The authors' goal is to develop a model that takes as input a scene graph describing objects and their relationships, and generates a realistic image corresponding to that graph. The main challenges are threefold: first, a method must be developed to handle graph-structured input; second, the generated images must respect the objects and relationships specified by the graph; and third, the synthesized images must be realistic.

As shown in Figure 2, the authors convert a scene graph to an image with an image generation network f, which takes a scene graph G and noise z as input and outputs an image Î = f(G, z).

The scene graph G is processed by a graph convolutional network, which produces an embedding vector for each object; as shown in Figures 2 and 3, each layer of graph convolution mixes information along the edges of the graph.

The objects and relationships from G are respected by using the object embedding vectors from the graph convolutional network to predict a bounding box and segmentation mask for each object; these are combined to form a scene layout, shown in the middle of Figure 2, which acts as an intermediate representation between the graph and image domains.

The output image Î is generated from the layout using a cascaded refinement network (CRN), shown on the right of Figure 2: a series of modules processes the layout at increasing spatial scales, finally producing the image Î. Realistic images are encouraged by training f adversarially against a pair of discriminator networks, Dimg and Dobj.
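To illustrate how a cascaded refinement module consumes the layout at each scale, here is a minimal sketch of one such module; the channel sizes and exact layer configuration are assumptions made for illustration rather than the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinementModule(nn.Module):
    """Sketch of one cascaded-refinement module: the scene layout is resized
    to this module's working resolution, concatenated with the upsampled
    features from the previous module, and passed through two 3x3
    convolutions. Channel sizes are illustrative assumptions."""
    def __init__(self, layout_dim, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(layout_dim + in_dim, out_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_dim, out_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, layout, prev_feats):
        # Double the spatial resolution of the previous module's features,
        # then inject the layout at the matching scale.
        feats = F.interpolate(prev_feats, scale_factor=2, mode="nearest")
        lay = F.interpolate(layout, size=feats.shape[-2:], mode="bilinear",
                            align_corners=False)
        return self.net(torch.cat([lay, feats], dim=1))
```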

For a more detailed description of each component in the experiment, please refer to the original paper: https://arxiv.org/abs/1804.01622


Figure 2

An overview of the image generation network f for generating images from the scene graph. The input to the model is a scene graph specifying objects and relationships; it is processed with a graph convolutional network (Figure 3) that passes information along edges to compute embedding vectors for all objects. These vectors are used to predict object bounding boxes and segmentation masks, which are combined to form the scene layout (Fig. 4). The layout is converted to an image using a Cascaded Refinement Network (CRN) [6]. The model is trained adversarially against a pair of discriminator networks. During training, the model observes ground truth object bounding boxes and (optionally) segmentation masks, but these are predicted by the model at test time.

An example computational graph for a single graph convolutional layer is shown in Figure 3.


Figure 3

The computational graph for a single graph convolution layer. The graph consists of three objects o1, o2, and o3 and two edges (o1, r1, o2) and (o3, r2, o2). Along each edge, the three input vectors are passed to the functions gs, gp, and go; gp directly computes the output vector for the edge, while gs and go compute candidate vectors, which are fed to the symmetric pooling function h to compute the output vectors for the objects.
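The following is a minimal PyTorch sketch of one such graph convolution layer. For brevity it fuses gs, gp, and go into a single shared MLP with three output slices and uses averaging as the symmetric pooling function h; these are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """Sketch of one graph convolution layer: per-edge candidate vectors,
    then symmetric (average) pooling per object. Hidden sizes are assumptions."""
    def __init__(self, dim=128, hidden=512):
        super().__init__()
        # Shared MLP standing in for gs, gp, go: it reads the concatenated
        # (subject, predicate, object) vectors and emits three slices.
        self.g = nn.Sequential(nn.Linear(3 * dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 3 * dim))
        self.dim = dim

    def forward(self, obj_vecs, pred_vecs, edges):
        # obj_vecs: (O, dim); pred_vecs: (E, dim); edges: (E, 2) long tensor
        # holding (subject_index, object_index) for each relationship.
        s_idx, o_idx = edges[:, 0], edges[:, 1]
        triples = torch.cat([obj_vecs[s_idx], pred_vecs, obj_vecs[o_idx]], dim=1)
        cand_s, new_pred, cand_o = self.g(triples).split(self.dim, dim=1)

        # Symmetric pooling h: average every candidate vector that refers to
        # the same object, whether it appeared as subject or as object.
        O = obj_vecs.size(0)
        pooled = obj_vecs.new_zeros(O, self.dim)
        counts = obj_vecs.new_zeros(O, 1)
        pooled.index_add_(0, s_idx, cand_s)
        pooled.index_add_(0, o_idx, cand_o)
        counts.index_add_(0, s_idx, torch.ones_like(cand_s[:, :1]))
        counts.index_add_(0, o_idx, torch.ones_like(cand_o[:, :1]))
        new_obj = pooled / counts.clamp(min=1)
        return new_obj, new_pred
```

A full model stacks several such layers so that information propagates across multiple hops of the graph.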

To generate an image, one must move from the graph domain to the image domain. To this end, the authors use the object embedding vectors to compute a scene layout, which gives the coarse 2D structure of the generated image; the scene layout is computed by predicting a segmentation mask and bounding box for each object with an object layout network, as shown in Figure 4.


Figure 4

Moving from the graph domain to the image domain by computing the scene layout. The embedding vector of each object is passed to an object layout network, which predicts a layout for that object; summing all object layouts gives the scene layout. Internally, the object layout network predicts a soft binary segmentation mask and a bounding box for the object, which are combined with the embedding vector using bilinear interpolation to produce the object layout.
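A minimal sketch of this composition step is shown below; the tensor shapes, the normalized (x0, y0, x1, y1) box format, and the use of grid_sample for the bilinear warping are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def compose_scene_layout(obj_vecs, masks, boxes, H=64, W=64):
    """Sketch of scene-layout composition. Assumed shapes: obj_vecs (O, D);
    masks (O, M, M), soft values in [0, 1]; boxes (O, 4) as normalized
    (x0, y0, x1, y1). Each object's embedding is spread over its mask, the
    mask is warped into its box by bilinear sampling, and all object layouts
    are summed into one (1, D, H, W) scene layout."""
    O, D = obj_vecs.shape
    layout = obj_vecs.new_zeros(1, D, H, W)
    # Pixel-center coordinates of the output canvas, in [0, 1].
    ys = (torch.arange(H, dtype=obj_vecs.dtype) + 0.5) / H
    xs = (torch.arange(W, dtype=obj_vecs.dtype) + 0.5) / W
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    for i in range(O):
        x0, y0, x1, y1 = boxes[i]
        # Express canvas coordinates in the box's local frame, scaled to the
        # [-1, 1] range grid_sample expects; pixels outside the box fall
        # outside that range and are zero-padded.
        gx = 2.0 * (grid_x - x0) / (x1 - x0).clamp(min=1e-4) - 1.0
        gy = 2.0 * (grid_y - y0) / (y1 - y0).clamp(min=1e-4) - 1.0
        grid = torch.stack([gx, gy], dim=-1).unsqueeze(0)          # (1, H, W, 2)
        mask = masks[i].unsqueeze(0).unsqueeze(0)                  # (1, 1, M, M)
        warped = F.grid_sample(mask, grid, align_corners=False)    # (1, 1, H, W)
        layout = layout + obj_vecs[i].view(1, D, 1, 1) * warped
    return layout
```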


Figure 5

Examples of 64×64 images generated from scene graphs in the Visual Genome (four left columns) and COCO (four right columns) test sets. For each example, the input scene graph is shown along with a manually written sentence describing it; the model processes the scene graph and predicts a layout consisting of bounding boxes and segmentation masks for all objects, and this layout is then used to generate an image. Some results are also shown in which the model uses the ground-truth rather than the predicted scene layout. Some scene graphs contain duplicate relationships, indicated by double arrows. Masks for some stuff classes such as sky, street, and water are omitted for clarity.

Figure 6

Images generated by the authors' method trained on Visual Genome. In each row, generation starts with a simple scene graph on the left, and more objects and relationships are progressively added moving to the right. The images respect relationships such as "car below kite" and "boat on grass".

Experimental results


Table 1

 

Table 1 is an ablation study using Inception scores. On each dataset, the authors randomly divided the test set samples into 5 groups and reported the mean and standard deviation of the groups. On COCO, five samples are generated for each test set image by building different synthetic scene graphs. For StackGAN, the authors generate an image for each COCO test set caption and downsample its 256×256 output to 64×64 for fair comparison with the method in the paper.
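For reference, here is a minimal sketch of the Inception-score computation with the five-way split described above; it assumes the (N, C) class probabilities have already been produced by an Inception network run over the generated images.

```python
import torch

def inception_score(probs, n_splits=5):
    """Sketch of the Inception score: probs is an (N, C) tensor of softmax
    class probabilities from an Inception network on generated images.
    Returns the mean and standard deviation over n_splits groups."""
    scores = []
    for chunk in probs.chunk(n_splits, dim=0):
        p_y = chunk.mean(dim=0, keepdim=True)          # marginal label distribution
        # Exponential of the average KL divergence between p(y|x) and p(y).
        kl = (chunk * (chunk.clamp(min=1e-12).log()
                       - p_y.clamp(min=1e-12).log())).sum(dim=1)
        scores.append(kl.mean().exp().item())
    scores = torch.tensor(scores)
    return scores.mean().item(), scores.std().item()
```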

 


Table 2

Table 2 reports statistics of the predicted bounding boxes. R@t is object recall measured against the ground-truth boxes with an IoU threshold of t. σx and σa measure box variability by computing the standard deviation of the box x-position and box area within each object category, respectively, and then averaging across categories.
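For concreteness, the sketch below shows how an R@t-style recall could be computed, assuming predicted and ground-truth boxes are paired per object and given in (x0, y0, x1, y1) format; this is an illustration, not the authors' evaluation code.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / max(area_a + area_b - inter, 1e-8)

def recall_at(pred_boxes, gt_boxes, t=0.5):
    """Fraction of objects whose predicted box overlaps its ground-truth box
    with IoU >= t (boxes are assumed to be paired object-by-object)."""
    hits = sum(box_iou(p, g) >= t for p, g in zip(pred_boxes, gt_boxes))
    return hits / max(len(gt_boxes), 1)
```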

Analysis of results

Figure 5 shows example scene graphs from the Visual Genome and COCO test sets and images generated using the authors' method, along with predicted object bounding boxes and segmentation masks.

It is clear from these examples that the method can generate scenes with multiple objects, and even multiple instances of the same object type: for example, Figure 5(a) shows two sheep, (d) shows two buses, (g) shows three people, and (i) shows two cars.

These examples also show that the method generates images respecting the relationships in the input graph; for example, in (i) the second broccoli has another broccoli to its left and a carrot below it, and in (j) the man is riding the horse, with both the man's legs and the horse's legs positioned properly. Figure 5 also shows results in which the method generates images from the ground-truth rather than the predicted object layout.

In some cases, the predicted layout can be very different from the ground-truth object layout. For example, the position of the bird is not specified in (k): the method makes it stand on the ground, whereas in the ground-truth layout the bird is flying in the sky. The model is sometimes bottlenecked by layout prediction; in (n), for example, using the ground-truth layout instead of the predicted layout significantly improves image quality.

Figure 6 demonstrates the model's ability to generate complex images by starting with the simple graph on the left and progressively building up more complex graphs. The example shows that object positions are affected by the relationships in the graph: in the top sequence, adding the relationship "car below kite" causes the car to move to the right and the kite to move to the left so that the relationship between them is satisfied. In the bottom sequence, adding the relationship "boat on grass" causes the boat's position to shift.

Summary

In this paper, the authors develop an end-to-end method for generating images from scene graphs. Compared with leading methods for generating images from textual descriptions, generating images from structured scene graphs rather than unstructured text allows the model to reason explicitly about objects and the relationships between them, and to generate complex images containing multiple identifiable objects.

Paper download address: https://arxiv.org/abs/1804.01622

