A new generation of "reference image" powerhouse: a complete application walkthrough of IP-Adapter

Introduction

No LoRA training required: style transfer from a single picture, with support for multiple reference images and multi-feature extraction. Its strong extensibility also lets it plug into dynamic prompt matrices, ControlNet, and more. This is IP-Adapter, a new way of "prompting with images" that makes your AIGC journey faster and easier.



Both are "reference image" workflows, but which one can recreate the picture in your mind?

Everyone is familiar with the idea of a "reference image". In the past, when I couldn't accurately describe the picture in my mind with a prompt, the easiest way was to find a roughly similar image, feed it into the img2img process, and let it do the rest.

Simple as img2img is, it has limitations it cannot get around, such as weak adherence to the prompt and limited diversity in the generated images. This is especially true once you add ControlNet for multi-layer control: the combination of reference image, model, and ControlNets has to be chosen carefully, or the result can fall apart on the spot...



But now we have a new "reference image" powerhouse: IP-Adapter. Before digging into how it works, let's first get an intuitive feel for its results.





The results are impressive, but is IP-Adapter the ultimate answer? How well does it generalize? Is its compatibility sufficient? How does it handle prompts? And if you want to integrate it into real work, what extension capabilities does it offer? Let's look at these one by one.

The core advantage of IP-Adapter: draw only the things you care about

Although IP-Adapter and img2img both operate as "reference image" methods, their underlying implementations have almost nothing in common.

To use a loose but easy-to-understand analogy, IP-Adapter and img2img are two painters. Give them a prompt asking for a man, with no reference picture, and both will produce something like Figure 1. But once we add Figure 2 as a reference, the difference between the two becomes apparent.



img2img essentially lays the reference picture down and starts copying. It knows it is supposed to draw a man, but it keeps modifying on top of the tiger, so the result is always awkward: tiger and man inevitably blur together into a forced hybrid. In this process the reference picture carries more weight, everything is drawn on top of it, and the result leans toward the reference.

 



IP-Adapter does not copy; it actually paints on its own. It always remembers the prompt and knows it wants to draw a man. It is more like inviting a master painter such as Xu Beihong to work out how to blend the characteristics of tiger and human: throughout the process it keeps adding "tiger" elements to the "man", such as golden pupils, the king-shaped (王) wrinkles on the forehead, and tiger-striped beard and hair. Here the prompt carries more weight, because the prompt is its ultimate goal.



 

Of course, all of this holds within a certain parameter range. Push the weight past a threshold and either method will go to the extreme of copying the reference picture. But even then, img2img gives only a 1:1 copy, while IP-Adapter retains more hints of the prompt.



Expanding the simple "reference image" into a much more capable workflow

Once you understand the logic of IP-Adapter, you will find that the changes it brings go well beyond "reference images". Here is an example of it from our own work, which we will then break down step by step.





The effect above is very simple to implement. You only need two ControlNet layers: one to host the IP-Adapter, and another that uses canny to lock down the outline of the product being placed.
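Outside the webui, the same combination can be reproduced with the diffusers library. The sketch below is a minimal example under stated assumptions: the model IDs, file names, and scale values are illustrative placeholders, not the exact setup used here.

# Minimal sketch: IP-Adapter + canny ControlNet in diffusers (illustrative setup)
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Attach IP-Adapter so the style reference acts as an "image prompt"
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin"
)
pipe.set_ip_adapter_scale(0.6)  # keep below the "copy the reference" threshold

style_ref = load_image("style_reference.png")  # hypothetical file paths
canny_map = load_image("product_canny.png")    # precomputed canny edge map

image = pipe(
    prompt="a product poster, clean studio background",
    image=canny_map,             # ControlNet input: locks the product outline
    ip_adapter_image=style_ref,  # IP-Adapter input: inherits the reference style
    num_inference_steps=30,
).images[0]
image.save("result.png")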



If you only apply it inside the webui, you could actually stop here. But this time we will go one step further and use new tools to unlock more creative capabilities.

The points I want to share below focus on the capabilities and effects of the engineering build-out (we will explain the specific methods in detail later):

① One picture acts as a LoRA, greatly reducing training cost.

② Feed in multiple reference images to provide richer generation results.

③ Exploit the strong attention to prompts to generate rich results from a prompt matrix.

④ Deploy the workflow on ComfyUI to achieve multi-step automated generation.



In the past, achieving a specific design style meant training a LoRA for it, which involved collecting training material, tagging, model training, and effect testing. It usually took a day or two, and the outcome still carried a lot of uncertainty.

Now, with a single IP-Adapter step, you can see the results in a few minutes. It saves an enormous amount of time, and the agility is on a different level.







With these characteristics we essentially get an "instant LoRA", and the only cost is finding a few reference pictures that match expectations.



At the same time, IP-Adapter can read multiple reference images at once, giving the generated results much richer diversity and randomness. This cannot be done in the img2img process and is the biggest difference between the two.
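Conceptually, reading multiple references comes down to fusing their image embeddings before they reach the image cross-attention branch. Below is a toy sketch of that idea; averaging pooled CLIP features is a simplifying assumption (real implementations keep richer per-token features), and the file names are hypothetical.

# Toy sketch: fuse several reference images into a single image-prompt embedding
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

refs = [Image.open(p) for p in ["ref_a.png", "ref_b.png", "ref_c.png"]]
inputs = processor(images=refs, return_tensors="pt")

with torch.no_grad():
    feats = encoder(**inputs).pooler_output     # (num_refs, hidden_dim)

image_prompt = feats.mean(dim=0, keepdim=True)  # one fused embedding
# This fused embedding then feeds the image cross-attention branch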









 

Now let's open up our thinking a little more. Because IP-Adapter attends strongly to prompts, the information in the prompt shows up more directly in the results. So while inheriting the style of the reference image, we can swap keywords in the prompt to point at different results, forming a combination matrix of prompts and further expanding the diversity of the output.
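A prompt matrix is nothing more than the Cartesian product of keyword slots. Here is a minimal sketch, where the slot values and the generate() call are hypothetical placeholders:

# Minimal prompt matrix: every combination of keyword slots becomes one prompt
from itertools import product

template = "a {material} sofa in a {style} living room, soft lighting"
slots = {
    "material": ["leather", "velvet", "rattan"],
    "style": ["minimalist", "art deco", "Japandi"],
}

prompts = [
    template.format(**dict(zip(slots, combo)))
    for combo in product(*slots.values())
]

for p in prompts:
    print(p)
    # generate(p, ip_adapter_image=style_ref)  # hypothetical pipeline call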





 

Going further, add different ControlNets and batch material reading to guide the generated results in a controllable way, with the batch-reading capability supplying richer templates. That completes an automated pipeline of "zero-cost instant LoRA + controllable ControlNet generation + diverse prompt-matrix generation".

We have used this pipeline in real projects. As for the results, everyone's feedback boils down to one gesture: the one-click triple like (一键三连).





 

The picture below shows how the above pipeline is deployed in actual work. The carrier is ComfyUI. Like webui, it is built on Stable Diffusion, but unlike webui's fixed web interface, it decomposes SD capabilities into separate nodes and wires those nodes together to implement various functions. It is therefore more open, free, and composable, and can automate entire pipelines, which greatly improves efficiency in practical applications. We will explain it in detail in the next issue.





 



At this point we have sorted out both the principles behind IP-Adapter and its applications. It has many advantages, but it still needs to be applied with the actual scenario in mind. The old principle holds: there is no best method, only a suitable one.

I hope you all have fun using it. Ideas and suggestions are most welcome. See you in the next issue.




This is the boring dividing line



A bit dry, but interesting to talk about

Having seen its performance, let's look at the underlying principles to see what is special about IP-Adapter.

We know that Stable Diffusion is a diffusion model whose core mechanism is the processing of noise. The prompt can be regarded as our target: through the iterative denoising process we move closer and closer to that target, and finally generate the expected picture.
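As a reminder of what that loop looks like, here is a schematic denoising sketch with classifier-free guidance. It stays at the pseudocode level of the blocks later in this section; the function names are placeholders, not a real library API.

# Schematic txt2img denoising loop (placeholder functions, pseudocode level)
latents = randn_like(latent_shape)         # start from pure random noise
text_emb = text_encoder(prompt)            # the "target" the loop steers toward

for t in scheduler.timesteps:              # e.g. 30-50 denoising steps
    noise_uncond = unet(latents, t, cond=null_emb)
    noise_text = unet(latents, t, cond=text_emb)
    # Guidance pushes each step a little further toward the prompt
    noise = noise_uncond + guidance_scale * (noise_text - noise_uncond)
    latents = scheduler.step(noise, t, latents)

image = vae_decode(latents)                # decode the final latents to pixels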





IP-Adapter treats the image as a separate prompt feature. Compared with earlier methods that simply extract image features and text features and concatenate them, IP-Adapter uses an adapter with decoupled cross-attention: the cross-attention over text features and the cross-attention over image features are kept separate, and a new cross-attention module is added to the UNet to inject the image features.

It is equivalent to splitting apart the img and prompt that the original SD would merge into one vector: img and prompt each form their own vector and are passed to the UNet layers separately, so the features in the img are better preserved, achieving more explicit inheritance and retention of image features.
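In code terms, decoupled cross-attention means one query from the UNet attends to text keys/values and image keys/values in two separate attention operations whose outputs are added. Here is a minimal sketch of that idea, with the dimensions and the default scale as illustrative assumptions:

# Minimal sketch of decoupled cross-attention: one query, two K/V sources
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    def __init__(self, dim, text_dim, image_dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        # Text branch: the original SD cross-attention projections
        self.to_k_text = nn.Linear(text_dim, dim)
        self.to_v_text = nn.Linear(text_dim, dim)
        # Image branch: the new K/V projections added by IP-Adapter
        self.to_k_image = nn.Linear(image_dim, dim)
        self.to_v_image = nn.Linear(image_dim, dim)
        self.scale = 1.0  # IP-Adapter weight; past a threshold it drifts toward copying

    def forward(self, hidden_states, text_emb, image_emb):
        q = self.to_q(hidden_states)
        # Attend to the text features (unchanged SD behavior)
        text_out = F.scaled_dot_product_attention(
            q, self.to_k_text(text_emb), self.to_v_text(text_emb)
        )
        # Attend to the image features through the added module
        image_out = F.scaled_dot_product_attention(
            q, self.to_k_image(image_emb), self.to_v_image(image_emb)
        )
        return text_out + self.scale * image_out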

In essence, IP-Adapter still runs a txt2img process. The prompt remains the most critical element; IP-Adapter simply steps in along the way to strengthen the prompting role of the reference picture.





As a comparison, img2img passes the reference image directly into the UNet in place of the original random noise, so every generated result is built on top of it. That makes the human-tiger mixing phenomenon easy to understand.
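The same pseudocode style makes the img2img difference clear: the reference is encoded into latents that replace the random starting noise, and only part of the noise schedule is applied on top. Function names are placeholders again.

# Schematic img2img initialization (placeholder functions, pseudocode level)
strength = 0.6                                  # how strongly to repaint
init_latents = vae_encode(reference_image)      # reference replaces random noise
start_t = int(num_steps * strength)

# Noise is added only part-way: lower strength = closer to a 1:1 copy
latents = scheduler.add_noise(init_latents, randn_like(init_latents), start_t)

for t in scheduler.timesteps[-start_t:]:        # denoise the remaining steps
    noise = unet(latents, t, cond=text_emb)
    latents = scheduler.step(noise, t, latents)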





Finally, let's look at the underlying differences between the two through pseudocode.

Structurally:

img2img uses the U-Net architecture, consisting of an encoder (downsampling) and a decoder (upsampling).

IP-Adapter adds an image encoder and an adapter containing the decoupled cross-attention mechanism.

# img2img: a plain U-Net with downsampling and upsampling paths
class UNet(nn.Module):
    # ... (encoder and decoder blocks) ...
    pass

# IP-Adapter: an image encoder plus an adapter wrapped around a
# pre-trained text-to-image model
class IPAdapter(nn.Module):
    def __init__(self, image_encoder, text_to_image_model):
        super().__init__()
        # ... (store the encoder/model, build the decoupled
        #      cross-attention projections) ...



In terms of process:

img2img passes through the encoder/decoder, i.e. a series of downsampling and upsampling steps.

IP-Adapter runs the reference through the image encoder, then lets the adapter module combine the text prompt with the image features as they interact with the pre-trained text-to-image model.

# img2img: the input image flows straight through encoder and decoder
encoded = unet_encoder(img2img_input)
decoded = unet_decoder(encoded)

# IP-Adapter: text and image take separate paths and are fused by the adapter
text_prompt, reference_image = ip_adapter_input
image_features = image_encoder(reference_image)
adapted_features = adapter_module(text_prompt, image_features)



On output:

img2img outputs a converted version of the input image.

IP-Adapter outputs a picture generated from both the text and image prompts.

# img2img: the output is a transformed version of the input image
output_image = img2img_model(img2img_input)

# IP-Adapter: the output is generated from text and image prompts together
generated_image = ip_adapter_model(text_prompt, reference_image)



And that really is the end this time. See you soon.





 

Author: JD Retail He Yunshen

Source: JD Cloud Developer Community. Please indicate the source when reprinting.
