Several approaches to continuous content generation with AIGC

background

It has only been about a year since AI drawing entered the public eye, going from "AI can draw a picture" to the continuous content generation everyone is working on now. Research on single-image generation has gradually matured: in theory, as long as you can design the prompt as required, the yield of usable outputs should be fairly high. But for generating continuous image sequences or continuous video, the models' ability still seems unsatisfactory.

Since these are image-production tools, why can a single image be generated well while continuous images cannot? Why can't we generate continuous animation that plays as video, the way a manga artist or animator would? Animators also draw frames one by one, so this problem, too, seems reducible to making pictures.

To answer these questions, we have to look at the differences among the forms of expression of a single image, a continuous image sequence, and video, and at the technical ideas behind each. Through the differences we can see where the problems lie, and we can also propose some possible solutions.

single image

Logically speaking, a single image is a static picture, and the problem to be solved is highlighting the subject:

1. The subject's appearance (clothing, adornment, color scheme), the emotion the subject conveys (actions, expressions), and the intention the subject expresses

2. To support the intention the subject conveys, the scene in which the subject appears must be arranged (which can be understood simply as the background)

3. The scene includes the environment, atmosphere, still lifes, and props

4. How to compose the above elements in layers (hierarchy, perspective, angle, light and shadow...)

5. What style to present it in; this covers both the painter's worldview and the constraints of the hardware (painting style, shooting equipment, aperture and focal length...)

A derivative of the single image: the long image

A single image is constrained by spatial extent. When such a small area is not enough to express what the author wants to convey, an engineering extension of the single image appears: the long image. "Dwelling in the Fuchun Mountains", "Along the River During the Qingming Festival", and the Dunhuang Grottoes murals are all examples of this kind. Does this form exist in modern media art? It does. The characteristic of this form is that it expands linearly along a single dimension to present more content, so its best carrier is paper, or a guided, linearly extending display (a promenade, a corridor-like LED screen). In short, any carrier that extends linearly in one dimension suits this form of expression.

Presentation in this form already has a certain continuity, though along a single linear dimension. A long image is typically produced as follows:

1. Determine the number of scenes the long image needs, the characters that appear in it, and the required props

2. Draw the key scenes as keyframes; drawing a keyframe uses all the skills of the single image

3. Conceive gradual transition frames between adjacent keyframes and fill them in step by step (Microsoft has used this idea in recent years to extend classical Chinese paintings); see the sketch below
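To make these three steps concrete, here is a minimal Python sketch of the workflow. It assumes a hypothetical `generate_image` text-to-image function (any Stable Diffusion call would do), and the transition prompts are naively blended by string; in practice a person or an LLM would write them.

```python
# Minimal sketch of the long-image workflow above. `generate_image` is a
# placeholder for any text-to-image call (e.g. a Stable Diffusion pipeline).
from typing import Callable, List


def plan_long_image(
    keyframe_prompts: List[str],
    n_transitions: int,
    generate_image: Callable[[str], "Image"],
):
    """Render keyframes, then fill transition slots between each pair."""
    frames = []
    for i, prompt in enumerate(keyframe_prompts):
        frames.append(generate_image(prompt))  # step 2: render each keyframe
        if i < len(keyframe_prompts) - 1:
            nxt = keyframe_prompts[i + 1]
            for t in range(1, n_transitions + 1):
                w = t / (n_transitions + 1)
                # step 3: a transition described as a blend of both keyframes
                blend = f"scene transitioning from ({prompt}) to ({nxt}), progress {w:.0%}"
                frames.append(generate_image(blend))
    return frames  # stitch horizontally to obtain the long image
```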

Comics

Paneled comics should be regarded as a non-continuous two-dimensional presentation. What does that mean? A single image lets you absorb all of its information in a single glance. A long image requires you to move your eyes linearly and take in the information over multiple glances. In paneled comics, ingesting the information requires the eyes to move up and down, or even diagonally, between panels: a non-continuous two-dimensional presentation.

The most difficult part of this form is selecting the appropriate positions in the two-dimensional layout for the content to be expressed:

1. Determine the number of panels and the page layout

2. Determine the content of each panel

continuous comics

The hardest part of continuous comics is presenting consistent storyboards, characters, and story across pages. Single-page paneled comics already work reasonably well; that is, a single page can already be laid out in a good non-continuous arrangement to express its content. What is hard when outputting many pages continuously is maintaining the consistency of the characters and of the story's description.

Why are character consistency and story consistency across pages difficult for AI, and why can people do it? The fundamental reason is that current drawing tools have no memory and no predictive ability; that is, they draw statelessly. The tool does not record what it drew before, so it cannot consider how to express the current input in light of the previous drawings, nor how to stay consistent with them. When a person does this work, they hold the previous pictures in memory, combine them with the input text to conceive how the picture should continue, decide which protagonist the story needs to appear, and determine the protagonist's appearance, attire, and actions.

If we want to maintain consistency of content, can we draw continuous story pictures by adding state and memory to the current stateless, memoryless drawing AI? In theory, yes. The bigger questions are: how to add memory and state; what memory and what state to add; and where to add them.

Two families of methods for adding memory and state can be discussed: internal memory versus external storage, and a built-in state machine versus an externally clocked one. In practice this comes down to either a plug-in that bolts the capability on, or a model that incorporates the capability itself.

With an external memory device, extract the relatively fixed things that appear in the story: characters, scenes, props, still lifes, animals, and costumes, and store them as materials; they can then be varied as the plot requires.
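A minimal sketch of this "external memory" idea: fixed story assets live in a library and are spliced into every frame's prompt, so characters and scenes stay consistent. All names and descriptions below are illustrative.

```python
# External asset library: relatively fixed story elements stored once,
# then reused in every prompt so the frames stay consistent.
ASSETS = {
    "scene:biology_lab": "clean, tidy laboratory, instruments everywhere, "
                         "colorful liquids flowing slowly through tubing",
    "character:hero1":   "tall bald future scientist, long serious face, "
                         "gold glasses, black overalls, strong build",
    "prop:robot_arm":    "sci-fi robotic arm",
}


def build_prompt(asset_keys, plot_text):
    """Compose a frame prompt from stored asset descriptions plus the plot."""
    parts = [ASSETS[k] for k in asset_keys]
    parts.append(plot_text)
    return ", ".join(parts)


prompt = build_prompt(
    ["scene:biology_lab", "character:hero1", "prop:robot_arm"],
    "the scientist shakes an instrument at the lab bench, centered",
)
```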

With an external state machine, save the rules or tables of state transitions (these rules can be obtained through association learning, e.g. FP-tree mining), and then predict the next state from the state information of the preceding frames (the granularity can be controlled as needed).
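A minimal sketch of the external state machine, assuming the simplest possible rule table: raw transition counts from observed frames, with the most frequent successor as the prediction. A mined rule set (FP-growth style, as mentioned above) would replace the counts.

```python
# Toy state machine: count observed frame-to-frame state transitions and
# predict the most likely next state. State names are illustrative.
from collections import Counter, defaultdict


class TransitionTable:
    def __init__(self):
        self.counts = defaultdict(Counter)

    def observe(self, prev_state: str, next_state: str):
        self.counts[prev_state][next_state] += 1

    def predict(self, current_state: str):
        successors = self.counts.get(current_state)
        return successors.most_common(1)[0][0] if successors else None


table = TransitionTable()
table.observe("hero_standing", "hero_walking")
table.observe("hero_walking", "hero_picks_up_flask")
print(table.predict("hero_standing"))  # -> "hero_walking"
```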

If the model incorporates this capability itself, it conceptually amounts to adding a time axis to the current drawing model (that is, adding, on top of the single image, a layer of frames and the sequential relationship between them), which solves the state and memory problems together. This is how many text-to-video ideas work, and such architectures already exist. The difficulty lies in designing the learning tasks so that the model remembers what it needs to remember, renders what needs to be rendered, and knows how to transition between states. Judging from current text-to-video models, there are results on some specific tasks, but general-domain results are not yet good.
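A minimal PyTorch sketch of "adding a time axis on top of the single-image model": per-frame features are produced as usual, then a temporal attention layer lets each frame attend to the others, which is where memory and state can live. This is an illustration of the idea, not any specific published architecture; real text-to-video models interleave many such layers inside the UNet.

```python
# Temporal self-attention over the frame axis of per-frame features.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim); attend across the frame axis,
        # treating each spatial token position as an independent sequence.
        b, f, t, d = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b * t, f, d)
        out, _ = self.attn(self.norm(x), self.norm(x), self.norm(x))
        x = (x + out).reshape(b, t, f, d).permute(0, 2, 1, 3)
        return x


feats = torch.randn(1, 8, 64, 320)  # 8 frames, 64 spatial tokens, 320 channels
print(TemporalAttention(320)(feats).shape)  # torch.Size([1, 8, 64, 320])
```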

video generation

Let's look at how video differs as a form of expression from the single images and paneled comics above. Technically, video expresses changes of the picture by refreshing frames. What does that mean? In the previous two forms, the picture stays where it is: you can see the history just by scanning back linearly. Video, by contrast, destroys the previous frame and redraws a new one. To keep the picture feeling continuous, the switch between consecutive frames has to feel reasonable:

1. Change the picture incrementally relative to the previous frame, relying on the viewer's persistence of vision plus the brain filling in the increments

2. Use the shot-switching techniques that humans find reasonable to achieve large picture changes, i.e. two-frame changes that viewers can reasonably accept

It seems that any point in two consecutive video frames can change; that is, video change is a two-dimensional continuous state prediction problem. Accordingly, there are two things a video generation model must learn (see the sketch after this list):

1. The distribution of incremental changes in the picture (the frame size is fixed; what must be predicted is which pixels of the previous frame will change, and what rules those changes follow)

2. How shot switching conforms to human perception (predict plausible states the picture can switch to, which can be constrained by, e.g., a driving video, a driving action skeleton, driving text...)
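A minimal NumPy sketch of learning target 1: measure which pixels change between consecutive frames, the empirical "distribution of incremental change" a video model has to capture. The threshold and random frames are illustrative.

```python
# Which pixels changed between consecutive frames, and how many of them.
import numpy as np


def change_mask(prev: np.ndarray, curr: np.ndarray, thresh: int = 12):
    """Boolean mask of pixels that changed noticeably between two frames."""
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return diff.max(axis=-1) > thresh  # collapse the RGB axis


def changed_fraction(frames):
    """Fraction of changed pixels per step: small for smooth motion,
    near 1.0 at a hard cut (the shot-switching case above)."""
    return [change_mask(a, b).mean() for a, b in zip(frames, frames[1:])]


frames = [np.random.randint(0, 255, (64, 64, 3), np.uint8) for _ in range(4)]
print(changed_fraction(frames))
```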

Approaches in detail

single image to long image


story material

| Category | Content | Description |
|---|---|---|
| Scene | Scene 1 - biology laboratory | Clean, tidy laboratory filled with instruments and equipment; colorful liquids flow slowly through the tubing; sci-fi biological apparatus |
| Scene | Scene 2 - hotel 1, Silicon Valley | |
| Character | Male protagonist 1 - future scientist | Tall, bald, long serious face, gold glasses, polo shirt, black overalls, strong build |
| Character | Heroine 1 - dark courtesan | |
| Character (animal) | Cat 1 - adorable spy cat | Bright eyes, furry, chubby, big face, agile body |
| Character (animal) | Dog 1 - the scientist's guard dog | |
| Still life | Laboratory reagent rack | |
| Still life | Colorful laboratory liquids | |
| Still life | Hotel bar | |
| Still life | Hotel wine rack | |
| Prop | Sci-fi robotic arm | |
| Prop | Monitor / cell phone | |

In the table below, the horizontal axis is the single-image generation pipeline, and the vertical axis is the story line, represented by keyframes.

| Key frame | Original story | Scene in the story | Main characters | Character action, expression, position in the scene | Still lifes in the story | Still life position in the scene | Props in the story | Prop position in the scene | Storyboard description | Image post-processing |
|---|---|---|---|---|---|---|---|---|---|---|
| Key frame 1 | | Biology laboratory | Hero 1 | The man shakes an instrument at the lab bench, doing the experiment intently with a serious expression, centered in the frame | Reagent rack; colorful liquids flowing in the laboratory glassware | Upper left of the scene | Robotic arm | Lower right of the frame, close to the hero | | Additions, deletions |
| Key frame 2 | | | | | | | | | | |
| Key frame 3 | | | | | | | | | | |

How to generate the keyframe images:

A: Generate a draft image, then manually replace the draft's elements with materials from the asset library, place the materials in suitable positions, and use SD to re-render the arranged image (or have a script do this automatically); a sketch follows the steps below:

1. Using the materials above, generate a draft image from scene description + still life description + protagonist description + prop description + positional relationship description

2. Following the draft's layer structure, lay out the scene, characters (faces), still lifes, and props from the asset library onto the draft

3. Re-render the arranged image
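A hedged sketch of step 3 using Hugging Face diffusers: once the draft has been manually composited with library assets, img2img re-renders it into one coherent picture. The model name, file names, and strength value are common defaults and assumptions, not the author's exact setup.

```python
# Re-render the manually composited draft with Stable Diffusion img2img.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

draft = Image.open("draft_with_assets.png").convert("RGB")  # step-2 composite
prompt = ("clean biology laboratory, future scientist shaking an instrument "
          "at the bench, reagent rack upper left, robotic arm lower right")

# Moderate strength keeps the composited layout while unifying the style.
result = pipe(prompt=prompt, image=draft, strength=0.55).images[0]
result.save("keyframe_1.png")
```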

B: Scene image + inpaint redrawing + skeleton (bone) control map to swap in the characters (see the sketch after these steps):

1. Determine which scene from the material library this keyframe should use

2. Based on the description of each still life's position in the scene and which still lifes are needed, place them at their scene positions with inpainting plus a still-life ControlNet

3. Re-render the image
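A hedged sketch of Method B with diffusers: take a stored scene image, inpaint the masked character region, and condition on a pose skeleton (the "bone map") via ControlNet. The model IDs and file names are illustrative assumptions, not the author's exact setup.

```python
# Inpaint a character into a library scene, conditioned on a pose skeleton.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

scene = Image.open("scene_biology_lab.png").convert("RGB")  # step 1: library scene
mask = Image.open("hero_region_mask.png").convert("L")      # region to redraw
pose = Image.open("hero_pose_skeleton.png").convert("RGB")  # bone map for the pose

result = pipe(
    prompt="tall bald future scientist, gold glasses, black overalls",
    image=scene,
    mask_image=mask,
    control_image=pose,
).images[0]
result.save("keyframe_redrawn.png")  # step 3: regenerated keyframe
```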

How to transition gradually between keyframes:

A. Between keyframes, generate 5-10 transitional text descriptions and outpaint/interpolate the in-between frames

B. Text-to-image model editing: use the text descriptions of the preceding and following keyframes and let the model generate a gradient between them (today this can only be controlled by text; it should eventually support text + image + more control conditions per step). A sketch of an embedding-interpolation variant follows:
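One way to approximate approach B, assuming a recent diffusers version that exposes `encode_prompt`: interpolate between the text embeddings of the two keyframe prompts and render one image per step. Linear interpolation is the simplest choice here; slerp is also common. The prompts and model ID are illustrative.

```python
# Interpolate prompt embeddings between two keyframe descriptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")


def embed(prompt: str) -> torch.Tensor:
    emb, _ = pipe.encode_prompt(
        prompt, device="cuda", num_images_per_prompt=1,
        do_classifier_free_guidance=False,
    )
    return emb


start = embed("scientist working alone in a biology laboratory, night")
end = embed("scientist toasting at a Silicon Valley hotel bar, warm light")

for i, w in enumerate(torch.linspace(0, 1, 7)):  # endpoints + 5 in-betweens
    mixed = torch.lerp(start, end, w.item())
    pipe(prompt_embeds=mixed).images[0].save(f"transition_{i}.png")
```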

Directly training a model with memory and state is the more thorough approach.

Paneled comics to multi-page comics

The idea here is the same as for continuous images above.

GIF to video generation

For now this can only be solved at the model level, which raises several questions:

1. How to design the model's learning tasks

2. How to introduce more control factors into the model to make it more controllable

3. Which control factors to introduce

summary

This article described the differences among several forms of continuous content and introduced several possible approaches for generating continuous content stably. Two abstract frameworks were proposed: plug-in memory and sequential modeling, along with their current problems.

1. Current AI drawing models follow a stateless generation paradigm, so they best support generating many independent single images

2. For single images, mainstream generation is based on latent diffusion models, and the training tasks mostly model the photographic frame, so support for natural human descriptions is insufficient

3. Long-image generation is the easiest case to reason about, since it is linear expansion. As long as the characters and scenes stay consistent across images, this can in theory be solved by bolting a plug-in memory system plus engineering tricks onto today's stateless generation; my personal judgment is that this will work well soon

4. Video generation fundamentally comes down to learning the distribution of continuous frame changes, which basically can only be solved by modeling. In theory, modeling will work better for video than for long images and continuous comics, because video frames are continuous while long images and comics are not.

 
