Interpretation of the Drag Your GAN paper: point-based interactive dragging to edit generated images [DragGAN]

This tool, which edits pictures automatically with just a few mouse clicks, has become hugely popular. In some ways it goes beyond Photoshop: there is essentially no barrier to entry, and the general public can use it directly. More importantly, the obvious difference from Photoshop is that Photoshop is generally used for retouching, whereas Drag Your GAN can make the eyes really open and close, raise the corners of the mouth to reveal the teeth, and change posture as if a real person were posing. Very interesting.
You can click on the following two videos to see how magical the editing is:
DragGAN Video 1
DragGAN Video 2

Because DragGAN is built on the StyleGAN2 model, before reading this article it is recommended to read the StyleGAN series and get familiar with their architecture: Interpretation of NVIDIA's StyleGAN, StyleGAN2, StyleGAN3 series of papers, sorting out the style-based generator architecture
Let's start the introduction with a picture from the paper:

Users only need to click a few handle points (red dots) and target points (blue dots) on the image, and our method will move the handle points to reach the corresponding target points precisely. Users can choose to paint a mask of flexible areas (lighter areas), keeping the rest of the image fixed. This flexible point-based manipulation can control many spatial properties such as pose, shape, expression and layout across different object classes.

There are three main features:
    1. Flexibility: it can control different spatial attributes such as the position, pose, shape, expression, and layout of the generated object.
    2. Accuracy: it can control these spatial attributes with high precision.
    3. Versatility: it is applicable to different object categories, not limited to a single category.

GANs can be divided into unconditional GANs and conditional GANs:

1. Unconditional GAN: a generative model that maps a low-dimensional, randomly sampled latent vector to a realistic image. These models are trained with adversarial learning and can generate high-resolution photorealistic images, but most of them, such as StyleGAN, cannot directly make controllable edits to the generated images.

2. Conditional GAN: models the joint distribution of images and segmentation maps, and then computes a new image corresponding to an edited segmentation map, which enables editability.

Although text-to-image generation is currently very popular, its semantic control only operates at a coarse, high-level semantic level and cannot control the spatial attributes of images in a fine-grained way.
In general, this paper is relatively simple. The DragGAN model is mainly divided into two parts: motion supervision and handle point tracking.

1. Motion Supervision

Feature-based motion supervision drives the handle point (red) toward the target position (blue), and the features are updated during this iterative process. Specifically, motion supervision is achieved through a shifted feature-patch loss that optimizes the latent code (the same latent code described in the StyleGAN series). DragGAN also allows users to optionally draw a region of interest to perform region-specific edits. Since DragGAN does not depend on any additional network such as RAFT, it runs efficiently: in most cases an edit takes only a few seconds on a single RTX 3090 GPU. This allows live interactive editing sessions in which users can quickly iterate over different layouts until the desired output is achieved.
Unlike traditional shape deformation methods that simply apply a warp, our deformation is performed on the GAN's learned image manifold, which tends to obey the underlying object structure. For example, our method can hallucinate occluded content, like the teeth in a lion's mouth, and can deform consistently with an object's rigidity, like the bending of a horse's leg.
As seen in the videos, a GUI was also developed so the user can perform these actions interactively by simply clicking on the image. Both qualitative and quantitative comparisons confirm that our method outperforms UserControllableLT (which does point-based editing by transforming the GAN's latent vector, but only supports dragging a single point on the image, cannot handle multi-point constraints well, and is imprecise, i.e., the target point is often not reached after editing). The following is a comparison chart between the two:

2. Point Tracking (handle point tracking)

Point tracking is done via nearest-neighbor search in feature space, and the optimization process is repeated until the handle points reach the target points. To track handle points, an obvious approach is to perform optical flow estimation (the classic problem of estimating a motion field between two images) between consecutive frames, for example with the widely used RAFT. Another approach is "particle video", in which the particle video stream obtains the motion trajectories of feature points in a video sequence and then clusters the extracted trajectories, using the longest common subsequence (LCS), to obtain the dominant direction of motion; this method is called PIPs. PIPs considers information across multiple frames and thus handles long-range tracking better than previous methods.
In DragGAN, however, we perform point tracking on GAN-generated images without any of the above methods or additional neural networks, because the GAN's feature space is discriminative enough that tracking can be achieved simply by feature matching. Despite its simplicity, our method outperforms state-of-the-art point-tracking methods, including RAFT and PIPs, in experiments.

3. The entire iterative process

Let's first look at the process diagram of how a new image is obtained after the input image passes through the two blocks above (motion supervision and point tracking):

The image is first generated by the GAN generator from a latent code w. The user clicks the handle points (red) and the target points (blue) to be reached; motion supervision then drives the handle points toward the target positions and updates the latent code w to w'. During this process we need to track the positions of the handle points, which is done using the discriminative features of the feature space. The latent code keeps being updated at each iteration, and the loop repeats until the handle points reach the targets or the optimization is terminated early.
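To make the loop concrete, here is a minimal PyTorch-style sketch of the iterative editing procedure. The callables `generate_features`, `motion_loss_fn`, and `track_fn` are hypothetical placeholders for the generator's feature extraction, the motion supervision loss, and the point tracking step detailed below; this only illustrates the flow, not the official implementation.

```python
import torch

def drag_edit(w, handle_points, target_points,
              generate_features, motion_loss_fn, track_fn,
              lr=2e-3, max_iter=200, stop_dist=1.0):
    """Sketch of DragGAN's edit loop with assumed interfaces (not the official code)."""
    w = w.clone().requires_grad_(True)           # latent code to optimize
    opt = torch.optim.Adam([w], lr=lr)
    F0 = generate_features(w).detach()           # feature map of the initial image

    points = [p.clone() for p in handle_points]
    for _ in range(max_iter):
        # 1) Motion supervision: nudge features around each handle point toward its target
        F = generate_features(w)
        loss = motion_loss_fn(F, F0, points, target_points)
        opt.zero_grad()
        loss.backward()
        opt.step()

        # 2) Point tracking: relocate the handle points on the updated feature map
        with torch.no_grad():
            points = track_fn(generate_features(w), F0, points)

        # Stop once every handle point is within stop_dist pixels of its target
        if all(torch.norm(p.float() - t.float()) <= stop_dist
               for p, t in zip(points, target_points)):
            break
    return w.detach(), points
```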
A more detailed introduction, as shown below:

Motion supervision is achieved via a shifted patch loss on the generator's feature maps, and we track the points on the same feature space via nearest-neighbor search.
Here the handle points are defined as $\{\, p_i = (x_{p,i}, y_{p,i}) \mid i = 1, 2, \dots, n \,\}$,

and the corresponding target points are $\{\, t_i = (x_{t,i}, y_{t,i}) \mid i = 1, 2, \dots, n \,\}$.

4. Technical implementation details

4.1. Motion supervision

Let's first look at how motion supervision is performed concretely and which mathematical formulas are involved.
We use the feature map F after block 6 of StyleGAN2, which performs best among all features due to a good trade-off between resolution and discriminativeness. We resize F with bilinear interpolation to the same resolution as the final image. If you want to learn more about bilinear interpolation, you can read a blog post from a few days ago: LIBSVM and LIBLINEAR Support Vector Machine Library Visual Code Practice for Pattern Recognition and Regression.
Of course, bilinear interpolation will be clearer if you are familiar with StyleGAN3. In simple terms, it samples between adjacent pixels along the X and Y directions and interpolates values between the sample points. This is also one of StyleGAN3's improvements over StyleGAN2, which is introduced in the link above; feel free to check it out.
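As a small illustration of this resizing step, the sketch below assumes we already have an intermediate feature map (e.g. the output of StyleGAN2's 6th synthesis block, however your generator implementation exposes it) and upsamples it to the output resolution with bilinear interpolation; the shapes are made up for the example.

```python
import torch
import torch.nn.functional as F_nn

# Hypothetical intermediate feature map from the generator's 6th block:
# (batch, channels, h, w), e.g. 128x128 features for a 512x512 image.
feat = torch.randn(1, 512, 128, 128)
img_resolution = 512

# Bilinear upsampling to the final image resolution, so that feature
# coordinates and pixel coordinates line up one-to-one.
feat_resized = F_nn.interpolate(
    feat, size=(img_resolution, img_resolution),
    mode="bilinear", align_corners=False)

print(feat_resized.shape)  # torch.Size([1, 512, 512, 512])
```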

To move a handle point $p_i$ toward the target point $t_i$, the idea is to supervise a small region around $p_i$ (red circle) to move a small step toward $t_i$ (blue circle). We use $\Omega_1(p_i, r_1)$ to denote the set of pixels whose distance to $p_i$ is less than $r_1$. The motion supervision loss is then:

$$\mathcal{L} = \sum_{i} \sum_{q_i \in \Omega_1(p_i, r_1)} \lVert F(q_i + d_i) - F(q_i) \rVert_1 + \lambda \lVert (F - F_0) \cdot (1 - M) \rVert_1$$
where $F(q)$ denotes the feature value of $F$ at pixel $q$, $F_0$ is the feature map of the initial image, and $d_i$ is the normalized vector pointing from $p_i$ to $t_i$: $d_i = \frac{t_i - p_i}{\lVert t_i - p_i \rVert_2}$. Note that $F(q_i)$ in the first term is detached (stop-gradient), so the gradient pushes the patch features forward by $d_i$ rather than pulling the target features back.

Additionally, given a binary mask M of the editable region, the unmasked regions are kept fixed by the reconstruction loss, which is the second term above.
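Putting the two terms together, here is a minimal PyTorch sketch of how such a loss could be written: the circular neighborhood $\Omega_1$ is approximated on an integer pixel grid, the shifted features $F(q_i + d_i)$ are read off with `grid_sample`, and $F(q_i)$ is detached so the patch is pushed forward rather than pulled back. The tensor shapes, point format, and helper names are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F_nn

def motion_supervision_loss(F, F0, handle_points, target_points, mask,
                            r1=3, lam=20.0):
    """Sketch of the motion supervision loss. Assumed shapes: F, F0 are (1, C, H, W);
    points are float tensors (x, y) in pixel coordinates; mask is (1, 1, H, W)
    with 1 marking the editable region."""
    _, _, H, W = F.shape
    loss = 0.0
    for p, t in zip(handle_points, target_points):
        d = (t - p) / (torch.norm(t - p) + 1e-8)        # normalized direction d_i

        # Integer pixels q with ||q - p|| < r1 (the neighborhood Omega_1)
        xs = torch.arange(int(p[0]) - r1, int(p[0]) + r1 + 1)
        ys = torch.arange(int(p[1]) - r1, int(p[1]) + r1 + 1)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        q = torch.stack([gx, gy], dim=-1).reshape(-1, 2).float()
        q = q[torch.norm(q - p, dim=1) < r1]

        # Bilinearly sample F at arbitrary (x, y) positions via grid_sample
        def sample(feat, pts):
            grid = pts.clone()
            grid[:, 0] = 2 * grid[:, 0] / (W - 1) - 1    # x to [-1, 1]
            grid[:, 1] = 2 * grid[:, 1] / (H - 1) - 1    # y to [-1, 1]
            grid = grid.view(1, -1, 1, 2)
            return F_nn.grid_sample(feat, grid, mode="bilinear",
                                    align_corners=True).squeeze(-1)  # (1, C, N)

        f_q = sample(F, q).detach()          # F(q_i): stop-gradient
        f_q_shifted = sample(F, q + d)       # F(q_i + d_i): receives gradient
        loss = loss + (f_q_shifted - f_q).abs().mean()

    # Reconstruction term: keep regions outside the mask unchanged
    loss = loss + lam * ((F - F0) * (1 - mask)).abs().mean()
    return loss
```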

4.2. Point Tracking

As mentioned before, we do not use optical flow estimation or particle video tracking, because the discriminative features of the GAN capture dense correspondences well, so the points can be effectively tracked by nearest-neighbor search over feature patches. We denote the feature of the initial handle point as $f_i = F_0(p_i)$.
We denote the patch around $p_i$ as $\Omega_2(p_i, r_2) = \{\, (x, y) \mid |x - x_{p,i}| < r_2,\ |y - y_{p,i}| < r_2 \,\}$.

Then, searching $\Omega_2(p_i, r_2)$ for the nearest neighbor of $f_i$ gives the updated tracked point $p_i$:

$$p_i := \mathop{\arg\min}_{q_i \in \Omega_2(p_i, r_2)} \lVert F'(q_i) - f_i \rVert_1$$

where $F'$ is the feature map after the latest optimization step.
By updating $p_i$ in this way, we keep track of the handle points; for multiple handle points, the same process is applied to each point.
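A minimal sketch of that nearest-neighbor lookup might look like the following; the feature map layout `(1, C, H, W)` and the `(x, y)` point format are assumptions for illustration, not the paper's code.

```python
import torch

def track_point(F_new, f_i, p, r2=12):
    """Nearest-neighbor tracking sketch: search the square patch around the current
    handle point p for the pixel whose feature is closest (L1) to the initial handle
    feature f_i. F_new: (1, C, H, W); f_i: (C,); p: (x, y) integer coordinates."""
    _, C, H, W = F_new.shape
    x0, y0 = int(p[0]), int(p[1])

    # Clamp the square patch Omega_2(p, r2) to the image bounds
    x_lo, x_hi = max(x0 - r2, 0), min(x0 + r2, W - 1)
    y_lo, y_hi = max(y0 - r2, 0), min(y0 + r2, H - 1)

    patch = F_new[0, :, y_lo:y_hi + 1, x_lo:x_hi + 1]          # (C, h, w)
    dist = (patch - f_i.view(C, 1, 1)).abs().sum(dim=0)        # L1 distance map (h, w)

    idx = torch.argmin(dist)
    dy, dx = divmod(idx.item(), dist.shape[1])
    return torch.tensor([x_lo + dx, y_lo + dy])                # updated p_i

# Usage sketch: f_i would be F0[0, :, y_p, x_p], recorded before the first step.
```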

4.3. Hyperparameter settings

The code is implemented in PyTorch and has not been open-sourced yet. Are they going to wait until the last day of the month? Haha.
We use the Adam optimizer to optimize the latent code, with a step size of 0.002 for FFHQ and 0.001 for AFHQCat, LSUN Car, and other datasets. The hyperparameters are set to $\lambda = 20$, $r_1 = 3$, $r_2 = 12$. In our implementation, we stop the optimization when all handle points are no more than d pixels away from their corresponding target points, where d is set to 1 when there are no more than 5 handle points and 2 otherwise.
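For reference, these settings could be wired up roughly like this (the variable names, the shape of `w`, and the helper function are assumptions; the numbers are the ones quoted above):

```python
import torch

# Hyperparameters quoted above (assumed wiring, not the official code)
LAMBDA, R1, R2 = 20.0, 3, 12
lr = 2e-3                     # step size for FFHQ; 1e-3 for AFHQCat, LSUN Car, etc.
n_handle_points = 3
stop_dist = 1 if n_handle_points <= 5 else 2   # termination threshold d, in pixels

w = torch.randn(1, 16, 512, requires_grad=True)  # e.g. a StyleGAN2-style w+ code
optimizer = torch.optim.Adam([w], lr=lr)

def should_stop(handle_points, target_points, d=stop_dist):
    """Stop when every handle point is within d pixels of its target."""
    return all(torch.norm(p.float() - t.float()) <= d
               for p, t in zip(handle_points, target_points))
```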

Here is a comparison experiment with PIPs and RAFT:

It can be seen that our method tracks points more precisely, resulting in more precise edits.

5. Limitations

The sections above have shown the power of DragGAN, but of course there are also shortcomings: artifacts may appear when creating human poses that deviate from the training distribution. In addition, handle points in texture-less regions sometimes drift during tracking, whereas tracking generally works well in texture-rich regions. As shown below:


Although our method has some extrapolation ability and can create images outside the training distribution, such as the picture below with an extremely wide-open mouth and a greatly enlarged wheel, edits outside the training distribution are easily distorted; the wheel, for example, is visibly deformed.

Well, for more details, please refer to the original paper. Relatively speaking, this work is an improvement built on top of StyleGAN2, and it will be very clear to anyone who is familiar with StyleGAN.

References:

Paper address: Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold
The source code was said to be released in June; June is almost over, and it still has not been released:
GitHub: https://github.com/XingangPan/DragGAN
