Implementing Pix2Pix (image-to-image translation) with the PyTorch framework

Table of contents

1. pix2pix research background

2. Basic principle of Pix2Pix

(1) Schematic diagram

(2) Conditional GAN (cGAN)

(3) Formula principle

3. Pix2Pix network model

(1) From pixel-level GAN to patch-level GAN to image-level GAN (From PixelGANs to PatchGANs to ImageGANs)

(2) Discriminator model

(3) Generator model

4. Dataset download

5. pix2pix code implementation

6. The mainWindow interface displays the images generated by the generator

7. Model download 


1. pix2pix research background

Tip: pix2pix homepage

  

  • Digital image tasks
    • Computer Vision
      • Mimics how the human eye and brain process and understand visual information
      • Image classification, object detection, face recognition
    • Computer Graphics
      • Simulates visual perception of the physical world in a digital space
      • Animation, 3D modeling, virtual reality
    • Digital Image Processing
      • Transforms how an image is displayed, based on prior knowledge
      • Image enhancement, image restoration, camera ISP
  • Image translation (Image Translation): converting images between different forms. Given an image in the source domain, generate the corresponding image in the target domain, while constraining the distributions of the generated image and the source image to be as consistent as possible along some dimension
    • image restoration
    • video interpolation
    • image editing
    • style transfer
    • super resolution
  • Image Quality Assessment (IQA)
    • Pixel loss (MSE)
    • Structural loss (SSIM)
    • Color loss
    • Sharpness loss (GMSD)
    • Perceptual loss (use an ImageNet pre-trained model to extract image features, then compare the loss between the features; see the sketch after this list)
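To make the perceptual-loss idea concrete, here is a minimal sketch (my own illustration, not code from this article): it extracts features with a torchvision VGG16 pre-trained on ImageNet and compares them with an L1 distance. The chosen layer index and the L1 comparison are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    """Compare two images in the feature space of an ImageNet pre-trained VGG16."""
    def __init__(self, layer_index=16):  # layer_index is an arbitrary choice here
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        # Keep the convolutional features up to the chosen layer and freeze them.
        self.features = vgg.features[:layer_index].eval()
        for p in self.features.parameters():
            p.requires_grad = False
        self.criterion = nn.L1Loss()

    def forward(self, generated, target):
        return self.criterion(self.features(generated), self.features(target))

if __name__ == '__main__':
    loss_fn = PerceptualLoss()
    fake = torch.rand(1, 3, 256, 256)
    real = torch.rand(1, 3, 256, 256)
    print(loss_fn(fake, real).item())
```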

2. Basic principle of Pix2Pix

(1) Schematic diagram

Tip: This shows the process of converting an edge map into a photo taken with a mobile phone.

  • Generator model
    • First, the input edge map (unlike the noise input of the original GAN) is converted into a fake mobile-phone photo.
    • The goal of the generator is to make the generated photo fool the discriminator.
  • Discriminator model
    • First part
      • The photo generated by the generator, together with the corresponding real edge map, is fed to the discriminator.
      • The discriminator output should judge this pair as fake as far as possible.
    • Second part
      • The real edge map and the corresponding real photo are fed to the discriminator.
      • The discriminator output should judge this pair as real as far as possible.

(2) Conditional GAN (cGAN)

  • Original GAN
    • In the original GAN, the generator model (Generator Model) and the discriminator model (Discriminator Model) continuously compete against and optimize each other during training, so that the generator gradually approaches the probability distribution of the real data. In the end, simply feeding noise into the generator produces images that follow the distribution of real images and look real.
  • Conditional GANs (cGANs)
    • Prior works have conditioned GANs on discrete labels, text, and images. Image-conditional models have addressed image prediction from normal maps, future-frame prediction, photo generation, and image generation from sparse annotations.
    • y can be any form of auxiliary information, such as a class label or data from another modality. A conditional model can be implemented by adding an additional input layer that feeds y to both the generator and the discriminator.
    • In the generator model, the real data x is taken as input and the corresponding data y' is output.
    • In the discriminator model, the pair (x, y) is fed in to be judged real or fake; at the same time, the pair (x, y') is also fed in so that the discriminator can judge whether the data produced by the generator is real or fake.
    • Summary
      • In general, the goal of the generator in a conditional GAN is to make the generated image fool the discriminator as far as possible; the goal of the discriminator is to classify images produced by the generator as fake, and real input images as real, as reliably as possible. (A minimal sketch of this conditioning follows.)
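To illustrate the "additional input" idea above, here is a minimal sketch of my own (not the article's code): for image-to-image tasks, the condition x is simply concatenated with the image along the channel dimension before it enters the discriminator. The tiny architecture and channel counts are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Minimal sketch of cGAN conditioning for image-to-image translation:
# the condition x (e.g. an edge map) is concatenated with the real photo y
# or the generated photo y' along the channel dimension before discrimination.
class TinyConditionalDiscriminator(nn.Module):
    def __init__(self, in_channels=3 + 3):  # condition channels + image channels
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=2, padding=1),
        )

    def forward(self, condition, image):
        return self.net(torch.cat([condition, image], dim=1))

if __name__ == '__main__':
    x = torch.rand(1, 3, 64, 64)   # condition (edge map)
    y = torch.rand(1, 3, 64, 64)   # real or generated photo
    d = TinyConditionalDiscriminator()
    print(d(x, y).shape)           # a grid of patch scores
```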

(3) Formula principle

 Tip: The L1 loss is added because experiments show that the results are best when the L1 loss is included, as shown in the figure below:

 Tip: As you study the related GAN topics, you will find that the GAN field is where mathematical formulas and code come together best.
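For reference, since the original figure is not shown here, the combined objective from the pix2pix paper can be written as:

```latex
\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big]
                         + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big]

\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z)\rVert_{1}\big]

G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G)
```

Here x is the source-domain image (e.g. the edge map), y the target-domain image, z the noise (realized as dropout in pix2pix), and λ weights the L1 term (the paper uses λ = 100).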

3. Pix2Pix network model

(1) From pixel-level GAN to patch-level GAN to image-level GAN (From PixelGANs to PatchGANs to ImageGANs)

 

Tip: The paper progresses from pixel-level GAN to patch-level GAN and finally to image-level GAN. A 70 x 70 patch discriminator already produces results comparable to a full-image discriminator; scaling up to the full image does not improve the results, and its quality is in fact slightly lower.

   As can be seen from the figure below, for the street-view labels→photos task with input images of 256 x 256 resolution (inputs for larger receptive fields are zero-padded), the FCN-score evaluation across discriminators with different receptive-field sizes shows that the discriminator with a 70 x 70 receptive field works best.

Note: FCN is a pioneering work in image segmentation; the model uses a fully convolutional network to segment the different targets in an image. FCN: Fully Convolutional Networks for Semantic Segmentation.

(2) Discriminator model

Tip: With the original input image size of 286 x 286, the final output size of the discriminator is [1, 1, 30, 30]; the input image size used in this article is 256 x 256, so the discriminator output size is [1, 1, 26, 26].

The paper compares four discriminators with different receptive fields, all built from the same basic blocks, i.e. the four discriminators below share the same structure; the receptive field changes as the network depth changes.

  • The receptive field size is: 70 x 70
    • C64-C128-C256-C512
    • The last convolution layer has 1 output channel, followed by a Sigmoid activation that outputs the final discrimination probability (real / fake)
      • If the Sigmoid activation is not used here, then BCEWithLogitsLoss() must be used when computing the loss, because BCEWithLogitsLoss() applies the Sigmoid internally.
    • The first convolution layer does not use Batch Normalization
    • LeakyReLU is used as the activation function throughout (a sketch of this 70 x 70 PatchGAN follows this list)
  • The receptive field size is: 1 x 1
    • C64-C128
    • All convolutions use 1 x 1 kernels, since the receptive field is 1 x 1.
  • The receptive field size is: 16 x 16
    • C64-C128
  • The receptive field size is: 286 x 286 (full image)
    • C64-C128-C256-C512-C512-C512
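Below is a minimal sketch of the 70 x 70 C64-C128-C256-C512 discriminator described above. It is my own illustration: the 4 x 4 kernels, strides, and padding are the usual pix2pix choices, and the exact output size depends on them. The Sigmoid is omitted, so BCEWithLogitsLoss() would be used for the loss, as noted above.

```python
import torch
import torch.nn as nn

def C(in_ch, out_ch, stride=2, use_bn=True):
    """Ck block: 4x4 convolution -> (BatchNorm) -> LeakyReLU(0.2)."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1)]
    if use_bn:
        layers.append(nn.BatchNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2))
    return nn.Sequential(*layers)

class PatchDiscriminator70(nn.Module):
    """70 x 70 PatchGAN: C64-C128-C256-C512, then a 1-channel output convolution."""
    def __init__(self, in_channels=3 + 3):  # condition + image, concatenated
        super().__init__()
        self.model = nn.Sequential(
            C(in_channels, 64, use_bn=False),   # first layer: no Batch Normalization
            C(64, 128),
            C(128, 256),
            C(256, 512, stride=1),
            nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),
            # No Sigmoid here -> use nn.BCEWithLogitsLoss() for the loss.
        )

    def forward(self, condition, image):
        return self.model(torch.cat([condition, image], dim=1))

if __name__ == '__main__':
    d = PatchDiscriminator70()
    x = torch.rand(1, 3, 256, 256)
    y = torch.rand(1, 3, 256, 256)
    # With these padding/stride choices the output is a 30 x 30 grid of patch scores;
    # other padding choices give slightly different sizes.
    print(d(x, y).shape)
```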

(3) Generator model

Tip: The generator architecture is chosen by comparing FCN-scores on the street-view dataset; the comparison shows that the U-Net network works best as the generator.

        The original paper gives the encoder and decoder structures as follows:

Supplementary notes:

  • For the decoder
    • The last convolution layer outputs 3 channels (RGB); for colorization, the number of output channels is 2. The final activation function is Tanh()
    • ReLU is used as the activation function in the decoder convolutions
  • For the encoder
    • The first convolution layer does not use Batch Normalization
    • LeakyReLU is used as the activation function in the encoder convolutions, with the slope for negative values set to 0.2 (negative_slope=0.2)
  • For the encoder and decoder together
    • Skip connections are used throughout, and encoder and decoder layers are concatenated in corresponding pairs: if there are n layers in total, the i-th encoder layer is concatenated with the (n - i)-th decoder layer (see the U-Net sketch below)

        The decoder of the U-Net network is described here: https://mydreamambitious.blog.csdn.net/article/details/126092060
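To illustrate the encoder, decoder, and skip-connection points above, here is a condensed U-Net-style generator sketch of my own. It uses far fewer layers than the paper's 8-level U-Net and omits the Dropout on the inner decoder layers, so treat it as a structural illustration rather than the article's generator.

```python
import torch
import torch.nn as nn

def down(in_ch, out_ch, use_bn=True):
    """Encoder block: 4x4 stride-2 conv -> (BatchNorm) -> LeakyReLU(0.2)."""
    layers = [nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1)]
    if use_bn:
        layers.append(nn.BatchNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2))
    return nn.Sequential(*layers)

def up(in_ch, out_ch):
    """Decoder block: 4x4 stride-2 transposed conv -> BatchNorm -> ReLU."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
    )

class TinyUNetGenerator(nn.Module):
    """Reduced U-Net: encoder features are concatenated with decoder features."""
    def __init__(self, in_channels=3, out_channels=3):
        super().__init__()
        self.d1 = down(in_channels, 64, use_bn=False)  # no BatchNorm on the first layer
        self.d2 = down(64, 128)
        self.d3 = down(128, 256)
        self.u1 = up(256, 128)
        self.u2 = up(128 + 128, 64)   # input channels doubled by the skip connection
        self.final = nn.Sequential(
            nn.ConvTranspose2d(64 + 64, out_channels, 4, stride=2, padding=1),
            nn.Tanh(),                # final activation, 3 output channels (RGB)
        )

    def forward(self, x):
        e1 = self.d1(x)
        e2 = self.d2(e1)
        e3 = self.d3(e2)
        y = self.u1(e3)
        y = self.u2(torch.cat([y, e2], dim=1))        # skip connection
        return self.final(torch.cat([y, e1], dim=1))  # skip connection

if __name__ == '__main__':
    g = TinyUNetGenerator()
    print(g(torch.rand(1, 3, 256, 256)).shape)  # torch.Size([1, 3, 256, 256])
```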

4. Dataset download

pix2pix dataset download

Anime Sketch Colorization Pair dataset download
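The official pix2pix datasets store each training pair as a single image with the input and target side by side, and the Anime Sketch Colorization Pair dataset uses the same layout. Below is a minimal loading sketch of my own, assuming that side-by-side layout; the directory path is a placeholder, and which half is the input depends on the dataset.

```python
import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class PairedImageDataset(Dataset):
    """Each file holds input and target side by side; split it down the middle."""
    def __init__(self, root, image_size=256):
        self.paths = [os.path.join(root, f) for f in sorted(os.listdir(root))
                      if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
        self.transform = transforms.Compose([
            transforms.Resize((image_size, image_size)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),  # map to [-1, 1] for Tanh
        ])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        pair = Image.open(self.paths[index]).convert('RGB')
        w, h = pair.size
        left = pair.crop((0, 0, w // 2, h))    # e.g. sketch / label map
        right = pair.crop((w // 2, 0, w, h))   # e.g. color photo
        return self.transform(left), self.transform(right)

if __name__ == '__main__':
    # Placeholder path; adjust to wherever the dataset was extracted.
    dataset = PairedImageDataset('data/facades/train')
    x, y = dataset[0]
    print(x.shape, y.shape)  # torch.Size([3, 256, 256]) each
```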

5. pix2pix code implementation

Tip: The code is on GitHub. The code in this article was written with reference to the blogger below; I only made some modifications and added a mainWindow interface so that the trained model can conveniently be used for image style conversion.

Refer to the blogger's code: https://b23.tv/QUc0CNb

Download the code for this article: GitHub - KeepTryingTo/Pytorch-GAN: The process of using Pytorch to implement GAN
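The training loop in the linked code follows the standard pix2pix recipe. As a reference only, here is a condensed sketch of one training step, assuming the PatchDiscriminator70 and TinyUNetGenerator sketches above, BCEWithLogitsLoss for the GAN term, and λ = 100 for the L1 term (as in the paper); the repository's actual loop may differ in details.

```python
import torch
import torch.nn as nn

def train_step(G, D, opt_G, opt_D, x, y, lambda_l1=100.0):
    """One pix2pix step: x is the source-domain image, y the target-domain image."""
    bce = nn.BCEWithLogitsLoss()
    l1 = nn.L1Loss()

    # ---- Discriminator: judge (x, y) as real and (x, G(x)) as fake ----
    fake = G(x)
    pred_real = D(x, y)
    pred_fake = D(x, fake.detach())
    loss_D = 0.5 * (bce(pred_real, torch.ones_like(pred_real)) +
                    bce(pred_fake, torch.zeros_like(pred_fake)))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # ---- Generator: fool the discriminator and stay close to y in L1 ----
    pred_fake = D(x, fake)
    loss_G = bce(pred_fake, torch.ones_like(pred_fake)) + lambda_l1 * l1(fake, y)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```

The paper trains both networks with Adam at a learning rate of 2e-4 and betas of (0.5, 0.999), which is the usual choice for the two optimizers passed in here.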

6. The mainWindow interface displays the images generated by the generator

Tip: Here is a program (mainWindow.py) that displays the images produced by the generator. It loads the generator model saved after training and then uses the model to randomly generate images, as follows:

(1) Run mainWindow.py; the initial interface is as follows
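As a minimal illustration of what mainWindow.py does under the hood (the actual GUI code is in the repository), the sketch below loads a saved generator checkpoint and translates a single image. The checkpoint path, image paths, and loading style are placeholders that depend on how the model was saved.

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.utils import save_image

# Placeholder paths: adjust to the checkpoint and images you actually have.
CHECKPOINT = 'checkpoints/generator.pth'
INPUT_IMAGE = 'samples/input.jpg'
OUTPUT_IMAGE = 'samples/output.png'

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# If the whole model object was saved, this works directly; if only weights were
# saved, build the generator class from the repository and call load_state_dict instead.
generator = torch.load(CHECKPOINT, map_location=device)
generator.eval()

to_tensor = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])

x = to_tensor(Image.open(INPUT_IMAGE).convert('RGB')).unsqueeze(0).to(device)
with torch.no_grad():
    fake = generator(x)

# Map from [-1, 1] (Tanh output) back to [0, 1] before saving.
save_image(fake * 0.5 + 0.5, OUTPUT_IMAGE)
print('saved', OUTPUT_IMAGE)
```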

 7. Model download 

Link: https://pan.baidu.com/s/1J2fT-jNpmDLcqwbAN6ip6w 
Extraction code: tsx9

Reference links

 Research background of pix2pix

Origin blog.csdn.net/Keep_Trying_Go/article/details/130506045