Deep Learning 06-Deep Convolutional Generative Adversarial Network (DCGAN)

Overview

A GAN (Generative Adversarial Network) is a generative model consisting of two neural networks: a generator and a discriminator. The basic idea of a GAN is to learn to generate realistic samples by pitting the generator and the discriminator against each other.

The generator takes a random noise vector as input and gradually transforms it, through a series of neural network layers, into an output that resembles a real sample. Its goal is to have its generated samples mistaken for real ones by the discriminator, thereby deceiving it; its training objective is to minimize the difference between generated samples and real samples.

The discriminator classifies input samples as real or generated. Its goal is to judge the authenticity of a sample as accurately as possible; its training objective is to maximize its ability to distinguish real samples from generated ones.

The training process of GAN can be briefly described as the following steps:

1. Initialize the parameters of the generator and the discriminator.
2. Randomly select a batch of real samples as training data for the discriminator, and generate a batch of random noise vectors as input to the generator.
3. Use the generator to produce a batch of fake samples and mix them with the real samples to form the discriminator's training set.
4. Run the discriminator on this training set and compute the losses on the real and generated samples.
5. Update the parameters of the discriminator and the generator, so that the quality of the generated samples gradually improves while the discriminator's ability to tell them apart also grows.
6. Repeat steps 2-5 until the generator can produce samples that resemble real samples.
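The loop below is a minimal, self-contained sketch of steps 2-5 on toy one-dimensional "samples" (the toy data and model sizes are mine, purely for illustration; the full DCGAN training code appears later in this post):

# illustrative sketch only: a tiny generator G and discriminator D trained adversarially
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))                 # noise -> fake sample
D = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())   # sample -> P(real)
bce = nn.BCELoss()
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)

for step in range(100):
    real = torch.randn(32, 4) + 3.0        # step 2: a batch of "real" samples
    noise = torch.randn(32, 8)             # step 2: a batch of noise vectors
    fake = G(noise)                        # step 3: generated samples

    # steps 4-5: train D to output 1 on real samples and 0 on generated ones
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # step 5: train G so that D classifies its fakes as real
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()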
DCGAN (Deep Convolutional GAN) is an improved version of GAN that boosts the performance of the generator and discriminator by introducing convolutional neural networks (CNNs). DCGAN uses convolutional and transposed-convolutional layers in the discriminator and generator so that they can process image data. Compared with the original GAN, DCGAN produces noticeably better image detail and texture.

In summary, a GAN is a generative model that learns to generate realistic samples through the adversarial game between the generator and the discriminator, while DCGAN builds on GAN by introducing convolutional neural networks to better handle image data.

Introduction to the principle

The pioneering work on GANs is the classic paper Generative Adversarial Networks[2], published in 2014 by Ian Goodfellow, known as the "father of GAN". In this paper he proposed the generative adversarial network and designed the first GAN experiment: handwritten digit generation.

The emergence of GAN grew out of a simple idea:

“What I cannot create, I do not understand.”
—Richard Feynman

By the same logic, if deep learning cannot create images, then it does not really understand them. At the time, deep learning had begun to conquer various areas of computer vision and was achieving breakthroughs on almost every task. However, neural networks were still widely criticized as black-box models, so more and more researchers explored, from a visualization perspective, the features and feature combinations learned by convolutional networks, while GAN demonstrated the power of neural networks from the perspective of generative learning. GANs address a well-known problem in unsupervised learning: given a batch of samples, train a system that can generate similar new samples.

The structure of a generative adversarial network is shown in Figure 7-2. It mainly consists of the following two sub-networks.
• Generator: takes random noise as input and generates an image.
• Discriminator: judges whether an input image is a real image or a generated (fake) one.
[Figure 7-2: structure of a generative adversarial network]
When training the discriminator, we use both the fake images produced by the generator and real images from the real world; when training the generator, we only feed noise into the generator to produce fake images. The discriminator's judgment of these fake images then drives the generator to adjust its parameters.

The generator's goal is to produce images realistic enough that the discriminator believes they are real; the discriminator's goal is to tell the generator's images apart from real-world images. The two have opposite objectives and compete with each other during training, which is why the model is called a generative adversarial network.
The description above may seem abstract, so let us illustrate it with the story of a calligraphy-and-painting collector and a dealer in forged paintings, both focused on the works of Qi Baishi (shown in Figure 7-3). The forger plays the role of the generator: he hopes to imitate the master's originals and sell convincing fakes to collectors at high prices. The collector hopes to tell fakes from authentic works, so that genuine pieces keep circulating and forgeries are destroyed. The paintings traded here are mainly Qi Baishi's shrimp paintings, which are regarded as masterpieces and have long been sought after.
[Figure 7-3: Qi Baishi's shrimp paintings]
In this story, both the forger and the collector start out as novices with only a vague notion of what is genuine and what is fake. The forger's early products are little more than random scribbles, and the collector's eye is so poor that he takes many fakes for originals and many originals for fakes.

First, the collector gathers many fakes from the market together with original works by Qi Baishi, studies and compares them carefully, and learns the basic structure of the shrimp in the paintings: the creature has a curved body and a pair of plier-like claws. Any painting that does not meet these criteria is filtered out as a fake. Once the collector applies this standard in the market, most fakes can no longer fool him, and the forgers suffer heavy losses. However, a few forgeries that do have curved bodies and plier-like claws are still deceptive. So the forgers change their technique: they add a curved body and a pair of plier-like claws to their imitations, while everything else, such as colors and lines, is drawn at random. The first forgery produced this way is shown below.
[Figure: the forger's first imitation]
When the forgers put these paintings on the market, they easily fool the collector: each painting shows a curved creature with a pair of plier-like claws in front, which matches the collector's criteria for authenticity, so he buys them as originals. As time goes by, the collector buys back more and more fakes and suffers heavy losses, so he shuts himself away to study the differences between the fakes and the originals. After repeated comparison he finds that, besides the curved body, the shrimp in the originals have long antennae and translucent bodies, the details are very rich, and each segment of the shrimp is rendered in white.

When the collector finishes studying and returns to the market, the forgers' skills have not improved, and their fakes are easily seen through. The forgers then begin to try all sorts of new techniques for painting shrimp, most of which come to nothing. Among the many attempts, however, a few fakes still deceive the collector. The forgers notice that these imitations have long antennae and translucent bodies and are painted in great detail, as shown in Figure 7-5. They therefore begin to imitate such paintings in large quantities and sell them on the market, and many once again fool the collector.
[Figure 7-5: a more refined forgery]
Suffering heavy losses once more, the collector is forced to close his shop and study the differences between Qi Baishi's originals and the fakes, learning the characteristics of the originals and improving his eye. Through this repeated game between collector and forger, the collector gradually builds up, from nothing, the ability to tell originals from fakes, while the forger keeps raising the level of his imitations of Qi Baishi's work. The collector compares the forgeries supplied by the dealer against the originals and gains a deeper appreciation of Qi Baishi's paintings; the dealer keeps refining his craft and improving the quality of his forgeries, and even though what he produces is still fake, it comes very close to the original. Competing with each other, the two also push each other to learn and improve.

In this example, the forger corresponds to the generator and the collector to the discriminator. Both perform poorly at first because both are randomly initialized. The training process alternates between two steps. The first step trains the discriminator (only the discriminator's parameters are updated while the generator is frozen), with the goal of distinguishing originals from fakes. The second step trains the generator (only the generator's parameters are updated while the discriminator is frozen), so that the fake paintings it produces are judged authentic by the discriminator (accepted as originals by the collector). The two steps alternate, and both the generator and the discriminator reach a very high level. By the end of training, the shrimp images produced by the generator (shown in Figure 7-6) are almost indistinguishable from Qi Baishi's real paintings.
[Figure 7-6: shrimp images produced by the generator]
Next, let us think about the network design. The discriminator's task is to determine whether an input image is real or fake, so it can be treated as a binary classification network and implemented as a simple convolutional network. The generator's task is to produce a color image from noise. Here we use the widely adopted DCGAN (Deep Convolutional Generative Adversarial Networks) structure, i.e. a fully convolutional network, shown in the figure below. The network takes a 100-dimensional noise vector as input and outputs a 3×64×64 image. The input can be viewed as a 100×1×1 "image" that is gradually enlarged to 4×4, 8×8, 16×16, 32×32 and finally 64×64 through up-convolutions. An up-convolution, or transposed convolution, is a special convolution operation that acts roughly like the inverse of ordinary convolution: with a stride of 2, a regular convolution downsamples the input to half its size, while a transposed convolution with stride 2 upsamples it to twice its size. This upsampling process can be understood as follows: the information of the image is stored in the 100-dimensional vector, and based on that information the network first sketches basic attributes such as contours and tones in the early upsampling steps, then gradually refines the details; the deeper the network, the finer the details.
[Figure: the DCGAN generator structure]
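The output size of each transposed convolution follows out = (in − 1) × stride − 2 × padding + kernel_size (ignoring output_padding). Below is a small sketch applying this formula to the ladder just described, with the same kernel/stride/padding values used by the generator defined later in this post:

# out = (in - 1) * stride - 2 * padding + kernel_size   (output_padding = 0)
def convtranspose_out(size, kernel, stride, padding):
    return (size - 1) * stride - 2 * padding + kernel

size = 1  # the 100x1x1 noise is treated as a 1x1 "image" with 100 channels
for kernel, stride, padding in [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]:
    size = convtranspose_out(size, kernel, stride, padding)
    print(size, end=" ")  # prints: 4 8 16 32 64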

This chapter is quoted from the book "Deep Learning Framework PyTorch: Getting Started to Practice".

Terminology

  • Upsampling: in computer-vision deep learning, the feature maps produced by a convolutional neural network (CNN) are usually smaller than the input image, and sometimes we need to restore the image to its original size for further computation (for example, semantic segmentation). The operation that maps an image from a low resolution to a higher resolution by enlarging it is called upsampling.
  • Matrix zero padding is the process of adding zero values around the borders of a matrix. In computer vision and deep learning, zero padding is commonly used in image processing and convolutional neural networks (CNNs).
    In image processing, zero padding can be used to enlarge an image so that its size stays unchanged after a convolution operation. Adding zeros around the image ensures that the convolution kernel fully covers the edge pixels, so information at the image edges is not lost during convolution.
    In CNNs, zero padding is often used to control the input and output sizes of a convolutional layer. Padding the borders of the input keeps the spatial dimensions of the resulting feature map consistent with those of the input, which matters when building deep networks and handling inputs of different sizes.
    The amount of zero padding is specified by the number of padded rows and columns; in a CNN it is usually chosen together with the kernel size and stride. Choosing the padding appropriately lets you control the feature-map size and the receptive field while keeping the input and output sizes consistent.
  • Deconvolution (transposed convolution): there are three common upsampling methods, namely bilinear interpolation, transposed convolution, and unpooling; we only discuss transposed convolution here. The "deconvolution" referred to here is better called transposed convolution; it is not the exact inverse of forward convolution. In one sentence: a transposed convolution is a special forward convolution that first enlarges the input by inserting and padding zeros according to a certain ratio, then rotates the convolution kernel, and finally performs an ordinary forward convolution. A short comparison sketch follows this list.
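To make the upsampling and transposed-convolution bullets concrete, here is a minimal PyTorch sketch (the layer sizes are chosen only for illustration) comparing a parameter-free interpolation layer with a learnable transposed convolution:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 16, 16)                 # a 3-channel 16x16 feature map

# 1) interpolation-based upsampling: no learnable parameters, simply resizes the feature map
up_bilinear = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
print(up_bilinear(x).shape)                   # torch.Size([1, 3, 32, 32])

# 2) transposed convolution: a learnable kernel; (kernel, stride, padding) = (4, 2, 1) doubles the size
up_transposed = nn.ConvTranspose2d(3, 3, kernel_size=4, stride=2, padding=1)
print(up_transposed(x).shape)                 # torch.Size([1, 3, 32, 32])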

Zero padding

In some image-processing steps, I need to enlarge a NumPy matrix that has been read in. For example, a matrix that was originally (4, 6) now needs 3 extra rows/columns on the top, bottom, left, and right, padded with zeros so that subsequent numerical computation is not affected.
For example, in the figure below I have a 4×5 matrix of all 1s and need to surround it with 3 rows/columns of 0s to enlarge it; after the expansion I still operate on the original region.

  1. If the original matrix has shape (m, n) and p rows / q columns of zeros are padded on each edge, the padded matrix has shape (m + 2p, n + 2q).
  2. If the original matrix has shape (m, n) and a zero is inserted between every pair of adjacent elements, the new matrix has shape (m + m - 1, n + n - 1) = (2m - 1, 2n - 1).

[Figure: the 4×5 matrix of ones padded with 3 rows/columns of zeros]
NumPy provides a built-in function for this: np.pad.

#%%
import numpy as np

oneArry = np.ones((4, 5))
print(oneArry)
# pad 3 zeros on every side; equivalent to print(np.pad(oneArry, (3, 3)))
print("all sides", np.pad(oneArry, 3))
# note: the first element of the pad tuple pads before (top/left), the second pads after (bottom/right)
print("top-left", np.pad(oneArry, (3, 0)))
print("bottom-right", np.pad(oneArry, (0, 3)))

output

[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]
all sides [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0.]
 [0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0.]
 [0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0.]
 [0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
top-left [[0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 1. 1. 1. 1.]
 [0. 0. 0. 1. 1. 1. 1. 1.]
 [0. 0. 0. 1. 1. 1. 1. 1.]
 [0. 0. 0. 1. 1. 1. 1. 1.]]
bottom-right [[1. 1. 1. 1. 1. 0. 0. 0.]
 [1. 1. 1. 1. 1. 0. 0. 0.]
 [1. 1. 1. 1. 1. 0. 0. 0.]
 [1. 1. 1. 1. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]]

Zero insertion within the matrix refers to padding zeros between its elements (up, down, left, and right inside the matrix), for example:

[[1, 1],
 [1, 1]]

After the insertion it becomes:

[[1, 0, 1],
 [0, 0, 0],
 [1, 0, 1]]

Code

import numpy as np

matrix = [[1, 1],
          [1, 1]]

# target shape after inserting a zero between adjacent elements: (2m-1, 2n-1)
zero_inserted_matrix = np.zeros((2*len(matrix)-1, 2*len(matrix[0])-1))

# copy the original values to every other position; the gaps stay zero
for i in range(len(matrix)):
    for j in range(len(matrix[0])):
        zero_inserted_matrix[2*i][2*j] = matrix[i][j]

print(zero_inserted_matrix)

output

[[1. 0. 1.]
 [0. 0. 0.]
 [1. 0. 1.]]

Transposed convolution

Reference: https://www.zhihu.com/question/48279880
Transposed convolution is also called fractionally-strided convolution. It should be pointed out that the name "deconvolution" is not really appropriate, because transposed convolution is not the true deconvolution defined in the signal/image-processing field. Technically, deconvolution in signal processing is the inverse of the convolution operation, which is not what happens here. Later we will explain why "transposed convolution" is the more natural and appropriate name for this operation.

We can implement a transposed convolution with an ordinary convolution. Here is a simple example. The input is 2×2 (the blue part in the figure). First it is zero-padded with 2 rows/columns on every side (giving a 6×6 input), and then a 3×3 kernel with stride 1 is convolved over it, each step producing one value. This achieves upsampling, and the output size is 4×4, i.e. (6 - 3 + 1, 6 - 3 + 1).

[Figure: transposed convolution of a zero-padded 2×2 input, producing a 4×4 output]
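The equivalence described above can also be checked numerically. Below is a minimal PyTorch sketch for the single-channel, stride-1 case, where a transposed convolution equals an ordinary convolution of the zero-padded input with the kernel rotated by 180 degrees:

import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 2, 2)   # 2x2 input (batch of 1, single channel)
w = torch.randn(1, 1, 3, 3)   # 3x3 kernel

# transposed convolution with stride 1: output size (2 - 1) * 1 + 3 = 4
out_transposed = F.conv_transpose2d(x, w)

# the same thing as an ordinary convolution of the input padded with k - 1 = 2 zeros
# on every side, using the kernel rotated by 180 degrees
out_padded_conv = F.conv2d(F.pad(x, (2, 2, 2, 2)), w.flip(-1, -2))

print(out_transposed.shape)                                         # torch.Size([1, 1, 4, 4])
print(torch.allclose(out_transposed, out_padded_conv, atol=1e-5))   # True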
It is worth mentioning that, with different padding and strides, the same 2×2 input can be mapped to different output sizes. In the figure below, a transposed convolution is applied to the same 2×2 input, but a zero is inserted between the input elements and 2 rows/columns of zero padding are added around it; convolving with a 3×3 kernel then yields an upsampled result of size 5×5.
[Figure: transposed convolution of a 2×2 input with zero insertion, producing a 5×5 output]
Observing the transposed convolutions in the examples above helps build some intuition. But to apply transposed convolution further, we also need to understand how convolution is implemented as a matrix multiplication on a computer; from this implementation perspective we can see why "transposed convolution" is the most appropriate name.

In convolution, we define it like this: C is the sparse matrix built from the convolution kernel, input is the input image, and output is the output image. Through the convolution, expressed as a matrix multiplication, we downsample the large input image to a small output image. This matrix-multiplication implementation follows C × input = output.

The following example shows how this operation works on a computer. The input matrix is flattened into a 16×1 vector, and the convolution kernel is converted into a 4×16 sparse matrix. The sparse matrix is then multiplied by the flattened input, and the resulting 4×1 vector is reshaped into a 2×2 output.
[Figure: convolution implemented as a sparse-matrix multiplication]
Now, if we multiply the transpose C^T of the sparse matrix (16×4) by the flattened output (4×1), the result (16×1) has the same shape as the flattened input (16×1).


However, it is worth noting that the two operations are not inverses of each other. For the same convolution kernel (whose sparse matrix is not an orthogonal matrix), the transposed operation does not recover the original values; it only restores the original shape. This is where the name transposed convolution comes from, and it answers the question raised above: "transposed convolution" is more accurate than "deconvolution".
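The matrix view can be reproduced with a small NumPy sketch (my own construction of the 4×16 sparse matrix for a 3×3 kernel sliding over a 4×4 input, not code from the original post):

import numpy as np

kernel = np.arange(1, 10, dtype=float).reshape(3, 3)   # a 3x3 kernel
x = np.arange(16, dtype=float).reshape(4, 4)           # a 4x4 input

# build the 4x16 sparse matrix C: each row lays the kernel weights out at one of the
# 4 positions where the kernel fits on the flattened 4x4 input (stride 1, no padding)
C = np.zeros((4, 16))
for row, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    for ki in range(3):
        for kj in range(3):
            C[row, (i + ki) * 4 + (j + kj)] = kernel[ki, kj]

output = C @ x.reshape(16, 1)      # (4x16) @ (16x1) -> (4x1), i.e. the flattened 2x2 output
print(output.reshape(2, 2))

# "transposed convolution": C.T (16x4) times the flattened output (4x1) gives a 16x1 vector,
# something with the same shape as the original input, but not the same values
restored = C.T @ output
print(restored.reshape(4, 4).shape)   # (4, 4)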

Generate anime images

We will use DCGAN to train a model that generates 64×64 anime face images, and use this example to plan the directory structure of a neural-network project. Most open-source projects use a similar layout, so this will also make it easier to read open-source models later.

Computing power selection

Since GAN training requires substantial resources and takes a long time, a GPU server is recommended.
GPU cloud platforms offer many GPU models. By GPU architecture they can be roughly divided into the five categories below (AutoDL or InsCode is recommended, since they bill by usage time and instances can be released when you are done):

  1. NVIDIA Pascal-architecture GPUs, such as the Titan Xp and the GTX 10 series. These GPUs lack hardware acceleration for low-precision computation but have moderate single-precision compute. Because they are cheap, they are suitable for practicing on small models (such as CIFAR-10) or for debugging model code.
  2. NVIDIA Volta/Turing-architecture GPUs, such as the RTX 20 series and Tesla V100. These GPUs are equipped with Tensor Cores designed to accelerate low-precision (int8/float16) computation, although their single-precision compute is not much better than the previous generation. We recommend enabling mixed-precision training in the deep-learning framework on such instances to speed up model computation; compared with single-precision training, mixed precision usually delivers more than a 2x speedup.
  3. NVIDIA Ampere-architecture GPUs, such as the RTX 30 series and Tesla A40/A100. These GPUs carry third-generation Tensor Cores which, unlike the previous generation, support the TensorFloat32 format and can directly accelerate single-precision training (enabled by default in PyTorch). We still recommend float16 half-precision training to exploit their very high compute, which yields a more significant speedup than previous-generation GPUs.
  4. Cambricon MLU 200-series accelerator cards. Model training is not yet supported; inference on this series requires quantizing the model to int8, and a deep-learning framework adapted to the Cambricon MLU must be installed.
  5. Huawei Ascend-series accelerator cards. Both model training and inference are supported, but the MindSpore framework must be installed.

Choosing a GPU model is not difficult. For commonly used deep-learning models, training performance can be roughly estimated from the GPU's compute at the relevant precision. The AutoDL platform labels and ranks the compute of each GPU type, making it easy to choose one that suits you.

How many GPUs to choose depends on the training task. As a rule of thumb, one training run should finish within 24 hours, so that an improved model can be trained the next day. Some suggestions on choosing the number of GPUs:

  • 1 GPU: suitable for training tasks on smaller datasets, such as Pascal VOC.
  • 2 GPUs: same as a single GPU, but you can run two sets of hyperparameters at once or enlarge the batch size.
  • 4 GPUs: suitable for training tasks on medium-sized datasets, such as MS COCO.
  • 8 GPUs: a classic configuration that never goes out of style; suitable for all kinds of training tasks and very convenient for reproducing paper results.
  • More: for training models with huge numbers of parameters, large-scale hyperparameter search, or finishing training extremely fast.

The GPUs I commonly use, ranked from high to low performance, together with their basic specifications:

  1. A100:
    • Number of CUDA cores: 6912
    • Number of Tensor cores: 432
    • Video memory capacity: 40 GB
    • Memory bandwidth: 1555 GB/s
    • Architecture: Ampere
  2. V100:
    • Number of CUDA cores: 5120
    • Number of Tensor cores: 640
    • Video memory capacity: 16 GB / 32 GB / 32 GB HBM2
    • Memory bandwidth: 900 GB/s / 1134 GB/s / 1134 GB/s
    • Architecture: Volta
  3. P100:
    • Number of CUDA cores: 3584
    • Number of Tensor cores: 0
    • Video memory capacity: 16 GB / 12 GB HBM2
    • Memory bandwidth: 732 GB/s / 549 GB/s
    • Architecture: Pascal
  4. Tesla T4:
    • Number of CUDA cores: 2560
    • Number of Tensor cores: 320
    • Video memory capacity: 16 GB
    • Memory bandwidth: 320 GB/s
    • Architecture: Turing
  5. RTX A4000:
    • Number of CUDA cores: 6144
    • Number of Tensor cores: 192
    • Video memory capacity: 16 GB
    • Memory bandwidth: 448 GB/s
    • Architecture: Ampere

For this chapter's example, a P100 completes training in about an hour (3831.2 s).

Dataset

The dataset is on Kaggle: https://www.kaggle.com/code/splcher/starter-anime-face-dataset
Register an account with an email address and you can download it. On the dataset page you can also find other users' training code and results based on this dataset.
Reference code: https://www.kaggle.com/code/splcher/starter-anime-face-dataset

Directory planning

[Figure: project directory layout]
The main contents and functions of each file are as follows.

• checkpoints/: saves trained models so that the program can reload a model and resume training after an abnormal exit.
• data/: data-related operations, including data preprocessing and the dataset implementation.
• models/: model definitions; there can be multiple models (for example AlexNet and ResNet34 in earlier chapters of the quoted book), one model per file.
• utils/: utility functions that may be needed; in this experiment it mainly wraps visualization tools.
• config.py: configuration file; all configurable variables are gathered here with default values.
• main.py: the main entry point for training and testing; different operations and parameters can be specified through command-line arguments.
• requirements.txt: third-party libraries the program depends on.
• README.md: necessary documentation for the program.

Source code

Data source loading

Place the downloaded AnimeFaceDataset under the data directory and create a new dataset.py to load the dataset.

from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms


class AtomicDataset(Dataset):
    def __init__(self, root, image_size):
        Dataset.__init__(self)
        # ImageFolder expects root to contain one sub-directory per class
        self.dataset = datasets.ImageFolder(
            root,
            transform=transforms.Compose([
                transforms.Resize(image_size),
                transforms.CenterCrop(image_size),
                transforms.ToTensor(),
                # map pixel values from [0, 1] to [-1, 1], matching the generator's Tanh output
                transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
            ]))

    def __getitem__(self, index):
        return self.dataset[index]

    def __len__(self):
        return len(self.dataset)

    def toBatchLoader(self, batch_size):
        return DataLoader(self, batch_size=batch_size, shuffle=False)
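A quick usage sketch (hypothetical paths following the Config defaults shown below; ImageFolder requires the images to sit inside at least one sub-directory of the root):

ds = AtomicDataset("./data/AnimeFaceDataset", 64)
loader = ds.toBatchLoader(64)
images, labels = next(iter(loader))
print(images.shape)   # expected: torch.Size([64, 3, 64, 64])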

Define configuration class

config.py defines the configuration class

class Config:
    # size of the transformed (output) images
    img_size = 64
    # directory containing the training images; it must contain sub-directories,
    # and each sub-directory name is used as a class name
    img_root = "./data/AnimeFaceDataset"
    # number of samples loaded per batch
    batch_size = 64
    """
    Common abbreviations in this convolutional network:
        nz:  dimension of the input noise vector ("noise dimension").
        ngf: number of feature-map channels in the generator ("number of generator features").
        nc:  number of channels of the input image ("number of image channels").
    """
    # dimension of the noise, usually shaped (100, 1, 1)
    nz = 100
    # base number of generator feature maps, for 64x64 images
    ngf = 64
    # number of channels of generated or input images
    nc = 3
    # base number of discriminator feature maps, for 64x64 images
    ndf = 64
    # learning rate of the optimizers
    lr = 0.0002
    # Beta1 hyperparameter for the Adam optimizers
    beta1 = 0.5
    # number of epochs
    num_epochs = 50

    def __init__(self, kv):
        for key, value in kv.items():
            setattr(self, key, value)

Since these configuration values are static defaults, we can use fire to expose them on the command line; the values are passed into **kwargs via

python main.py <function name> --param1=value1 --param2=value2

main.py defines the train method

import fire


def train(**kwargs):
    print(kwargs)


if __name__ == "__main__":
    # fire maps every function in main.py onto the command line as:
    #   python main.py <function name> --param1=value1 --param2=value2
    # the parameters are passed into kwargs as a key-value dict
    fire.Fire()
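With the Config fields above, a hypothetical invocation could look like python main.py train --num_epochs=50 --batch_size=64; fire collects the flags into kwargs as {'num_epochs': 50, 'batch_size': 64}, which Config.__init__ then writes over the class defaults via setattr.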

Define model

Create a new models.py in the models directory to define the G and D models

import torch.nn as nn

"""
Parameters of nn.ConvTranspose2d:
    in_channels:    number of input channels
    out_channels:   number of output channels
    kernel_size:    size of the convolution kernel
    stride:         stride
    padding:        padding size
    output_padding: extra size added to the output
    groups:         number of groups for grouped convolution, default 1
    bias:           whether to use a bias, default True

The generator's goal is to produce a realistic image from a random noise vector. In the
generator, the channel count going from large to small can be understood as abstract features
being gradually turned into concrete image details: layer by layer, transposed convolutions
(ConvTranspose2d) turn low-resolution features into a high-resolution image, extracting and
compressing features to produce more detailed, realistic output.
"""


# generator network
class Generator(nn.Module):
    def __init__(self, nz, ngf, nc):
        super(Generator, self).__init__()
        self.main = nn.Sequential(
            # nz: dimension of the noise, usually (100, 1, 1)
            # ngf: base number of generator feature maps
            # nc: number of channels of the input/output image
            # output size = (input size - 1) * stride - 2 * padding + kernel_size + output_padding
            # with (kernel, stride, padding) = (4, 1, 0) the output spatial size equals the kernel size (4x4)
            # with (kernel, stride, padding) = (4, 2, 1) the output spatial size is twice the input size
            nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),
            # state size. (ngf*8) x 4 x 4
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            # state size. (ngf*4) x 8 x 8
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            # state size. (ngf*2) x 16 x 16
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),
            # state size. (ngf) x 32 x 32
            nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
            nn.Tanh()
            # state size. (nc) x 64 x 64
        )

    def forward(self, input):
        return self.main(input)


"""
The discriminator does the opposite of the transposed convolution: a (4, 2, 1) convolution
halves the spatial size, and a plain convolution (stride 1, no padding) gives height - kernel + 1.
"""
class Discriminator(nn.Module):
    def __init__(self, nc, ndf):
        super(Discriminator, self).__init__()
        self.main = nn.Sequential(
            # input is (nc) x 64 x 64
            nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf) x 32 x 32
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*2) x 16 x 16
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*4) x 8 x 8
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 8),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*8) x 4 x 4
            nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
            # state size. (1) x 1 x 1
        )

    def forward(self, input):
        return self.main(input)
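A quick shape check of the two networks (a sketch assuming the Config defaults nz=100, ngf=ndf=64, nc=3):

import torch

netG = Generator(100, 64, 3)
netD = Discriminator(3, 64)

z = torch.randn(2, 100, 1, 1)       # a batch of 2 noise vectors
fake = netG(z)
print(fake.shape)                   # torch.Size([2, 3, 64, 64])
print(netD(fake).shape)             # torch.Size([2, 1, 1, 1])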

Training

To train the D model: compare D's output on real images with 1 to compute a loss, and compare D's output on the images generated by G with 0 (so that D is not fooled) to compute a loss; minimize the sum of the two.
To train the G model: generate images, run them through the D model, and compare the output with 1 (trying to fool D); minimize that loss.
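Written out, these two steps minimize the standard binary-cross-entropy GAN objectives (a sketch of what the BCELoss calls in the code compute, where D(x) is the discriminator's estimated probability that x is real and G(z) is the image generated from noise z):

$$\mathcal{L}_D = -\big[\log D(x) + \log\big(1 - D(G(z))\big)\big], \qquad \mathcal{L}_G = -\log D\big(G(z)\big)$$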

# imports assumed from the project layout above (module paths follow the directory planning)
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.utils as vutils

from config import Config
from data.dataset import AtomicDataset
from models.models import Generator, Discriminator

# run on the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def train(**kwargs):
    # initialize Config with the parameters passed on the command line
    defaultConfig = Config(kwargs)
    # build the dataset from the given directory and image size
    dataset = AtomicDataset(defaultConfig.img_root, defaultConfig.img_size)
    # wrap it into an iterable loader with batches of defaultConfig.batch_size
    dataloader = dataset.toBatchLoader(defaultConfig.batch_size)
    # create the generator network
    netG = Generator(defaultConfig.nz, defaultConfig.ngf, defaultConfig.nc).to(device)
    # create the discriminator network
    netD = Discriminator(defaultConfig.nc, defaultConfig.ndf).to(device)
    # binary cross-entropy loss
    criterion = nn.BCELoss()
    # Setup Adam optimizers for both G and D
    optimizerD = optim.Adam(netD.parameters(), lr=defaultConfig.lr, betas=(defaultConfig.beta1, 0.999))
    optimizerG = optim.Adam(netG.parameters(), lr=defaultConfig.lr, betas=(defaultConfig.beta1, 0.999))
    # real images get label 1, fake images get label 0
    real_label = 1
    fake_label = 0
    # Lists to keep track of progress
    img_list = []
    G_losses = []
    D_losses = []
    iters = 0
    # a fixed batch of 64 noise vectors of shape 100x1x1, used for visualization
    fixed_noise = torch.randn(64, defaultConfig.nz, 1, 1, device=device)

    print("Starting Training Loop...")
    # For each epoch
    for epoch in range(defaultConfig.num_epochs):
        # For each batch in the dataloader
        for i, data in enumerate(dataloader, 0):

            ############################
            # (1) Update D network: maximize log(D(x)) + log(1 - D(G(z)))
            # real images fed to the discriminator are labeled 1
            # generated (fake) images fed to the discriminator are labeled 0
            ###########################
            ## train D on real images
            netD.zero_grad()
            # move the batch to the training device
            real_cpu = data[0].to(device)
            # actual batch size
            b_size = real_cpu.size(0)
            # a 1-D tensor of b_size elements, all set to 1 (the "real" label)
            label = torch.full((b_size,), real_label, device=device).float()
            # the discriminator produces one value per sample (b_size x 1 x 1), flattened to 1-D
            output = netD(real_cpu).view(-1)
            # loss on real images
            errD_real = criterion(output, label)
            # backpropagate to accumulate gradients
            errD_real.backward()
            # D_x is the discriminator's average predicted probability on real samples
            D_x = output.mean().item()

            ## train D on images produced from noise
            # a batch of b_size noise vectors with 100 channels
            noise = torch.randn(b_size, defaultConfig.nz, 1, 1, device=device)
            # feed the noise to the generator to produce a b_size x 3 x 64 x 64 batch of images
            fake = netG(noise)
            # the true label of generator output is 0 (fake)
            label.fill_(fake_label)
            # detach() returns a tensor that shares the same data but is cut off from the graph,
            # so this pass does not backpropagate into the generator
            # run the discriminator on the generated images, flattened to 1-D
            output = netD(fake.detach()).view(-1)
            # loss on fake images
            errD_fake = criterion(output, label)
            # backpropagate to accumulate gradients
            errD_fake.backward()
            # average predicted probability of the discriminator on fake samples
            D_G_z1 = output.mean().item()
            # total discriminator loss: real + fake
            errD = errD_real + errD_fake
            # update the discriminator weights
            optimizerD.step()

            ############################
            # (2) Update G network: maximize log(D(G(z)))
            # the generator is trained so that the discriminator judges its fakes as real (1),
            # which pushes the generated images to look realistic
            ###########################
            netG.zero_grad()
            label.fill_(real_label)  # fake labels are real for generator cost
            # feed the previously generated b_size x 3 x 64 x 64 images into D
            output = netD(fake).view(-1)
            # generator loss
            errG = criterion(output, label)
            # backpropagate
            errG.backward()
            # average probability that the discriminator assigns "real" to the fakes
            D_G_z2 = output.mean().item()
            # update the generator weights
            optimizerG.step()

            # print training statistics every 1000 batches
            if i % 1000 == 0:
                print('[%d/%d][%d/%d]\tLoss_D: %.4f\tLoss_G: %.4f\tD(x): %.4f\tD(G(z)): %.4f / %.4f'
                      % (epoch, defaultConfig.num_epochs, i, len(dataloader),
                         errD.item(), errG.item(), D_x, D_G_z1, D_G_z2))

            # Save Losses for plotting later
            G_losses.append(errG.item())
            D_losses.append(errD.item())

            # every 250 iterations, and on the last batch of the last epoch,
            # generate a batch of 64 images (3 x 64 x 64) from the fixed noise
            # and store them in img_list for later visualization
            if (iters % 250 == 0) or ((epoch == defaultConfig.num_epochs - 1) and (i == len(dataloader) - 1)):
                with torch.no_grad():
                    fake = netG(fixed_noise).detach().cpu()
                img_list.append(vutils.make_grid(fake, padding=2, normalize=True))

            iters += 1
    # save the trained generator to the checkpoints directory
    torch.save(netG.state_dict(), "./checkpoints/optimizerG.pt")

Visualization

Plot the losses

    # requires: import matplotlib.pyplot as plt
    # plot the loss curves of G and D
    plt.figure(figsize=(10, 5))
    plt.title("Generator and Discriminator Loss During Training")
    # the list index is the x coordinate, i.e. the iteration (batch) index
    plt.plot(G_losses, label="G")
    plt.plot(D_losses, label="D")
    plt.xlabel("iterations")
    plt.ylabel("Loss")
    plt.legend()
    plt.show()

[Figure: generator and discriminator loss curves]

Plot the evolution of the generated images

    # requires: import numpy as np, matplotlib.animation as animation,
    #           from IPython.display import HTML
    # create an 8x8-inch figure
    fig = plt.figure(figsize=(8, 8))
    plt.axis("off")
    ims = [[plt.imshow(np.transpose(i, (1, 2, 0)), animated=True)] for i in img_list]
    ani = animation.ArtistAnimation(fig, ims, interval=1000, repeat_delay=1000, blit=True)
    HTML(ani.to_jshtml())

Every 250 iterations, 64 images are generated with G from the fixed noise; the images become noticeably clearer as training progresses.
[Figures: image grids generated from the fixed noise at different training stages]
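Once training has finished, the saved generator can be reloaded from the checkpoint to produce new images without retraining. Below is a sketch assuming the Config defaults and the ./checkpoints/optimizerG.pt path used in train above (the output filename is arbitrary):

import torch
import torchvision.utils as vutils

from models.models import Generator

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# rebuild the generator with the same hyperparameters used for training (nz=100, ngf=64, nc=3)
netG = Generator(100, 64, 3).to(device)
netG.load_state_dict(torch.load("./checkpoints/optimizerG.pt", map_location=device))
netG.eval()

with torch.no_grad():
    noise = torch.randn(64, 100, 1, 1, device=device)
    fake = netG(noise).cpu()

# save the 64 generated faces as a single 8x8 grid image
vutils.save_image(fake, "generated_faces.png", padding=2, normalize=True)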

Other projects

CycleGAN

Reference address: https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix

StarGAN

Reference address: https://github.com/yunjey/stargan

Source: https://blog.csdn.net/liaomin416100569/article/details/131620536