Self-Attention Generative Adversarial Networks (SAGAN): Reproducing the Paper's Model

Copyright notice: https://blog.csdn.net/dickdick111/article/details/90110758

Lab midterm report: Self-Attention Generative Adversarial Networks

Paper Title: Self-Attention Generative Adversarial Networks

Paper Authors: Han Zhang, Ian Goodfellow, Dimitris Metaxas, Augustus Odena

Year: 2018

Student: Liang Yinglin-16340132

Description of the problem

Image synthesis is an important problem in computer vision. There has been remarkable progress in this direction with the emergence of Generative Adversarial Networks (GANs). However, convolutional GANs excel at synthesizing image classes with few structural constraints but fail to capture geometric or structural patterns that occur consistently in some classes. One reason is that the convolution operator has a local receptive field, so long-range dependencies can only be processed after passing through several convolutional layers.

One disadvantage of GANs is that, after training on large datasets containing many image classes, they cannot clearly distinguish the classes and struggle to capture the structure, texture and details of the images. As a result, it is hard to use a plain GAN to generate large numbers of high-quality images across different categories.

On the other hand, although increasing the size of the convolution kernel enlarges the receptive field and lets a layer represent more of the image, it does so at the expense of computational efficiency.

Introduction of the method

​ The authors propose Self-Attention Generative Adversarial Networks (SAGANs), which introduce a self-attention mechanism into convolutional GANs.

​ The self-attention module is complementary to convolutions and helps with modeling long range, multi-level dependencies across image regions. Armed with self-attention, the generator can draw images in which fine details at every location are carefully coordinated with fine details in distant portions of the image. Moreover, the discriminator can also more accurately enforce complicated geometric constraints on the global image structure.
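As a rough illustration of the mechanism described above, the following plain-Python sketch computes self-attention over a handful of positions. It is not the Self_Attn module used in the experiment: the function names, the list-of-vectors layout and the explicit weight matrices are assumptions made for readability. Each position's output is a softmax-weighted sum of value features from all positions, scaled by a scalar gamma and added back to the input. In the paper gamma is learnable and initialized to 0, and it is presumably the quantity that the ave_gamma values in the training logs track.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def self_attention(x, w_query, w_key, w_value, gamma):
    """x holds N feature vectors, one per spatial position (hypothetical layout)."""
    queries = [matvec(w_query, xi) for xi in x]
    keys = [matvec(w_key, xi) for xi in x]
    values = [matvec(w_value, xi) for xi in x]
    channels = len(values[0])
    out = []
    for j, query in enumerate(queries):
        # How strongly position j attends to every position i.
        scores = [sum(a * b for a, b in zip(query, key)) for key in keys]
        beta = softmax(scores)
        # Attention-weighted sum of the value features.
        attended = [sum(b * v[c] for b, v in zip(beta, values)) for c in range(channels)]
        # Residual connection: gamma = 0 leaves the input unchanged.
        out.append([gamma * o + xc for o, xc in zip(attended, x[j])])
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
unchanged = self_attention(features, identity, identity, identity, gamma=0.0)
mixed = self_attention(features, identity, identity, identity, gamma=1.0)
```

With gamma set to 0 the layer is an identity mapping, so the network can start from purely local convolutional features and gradually mix in information from distant positions as gamma moves away from 0.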

In addition to self-attention, the authors propose enforcing good conditioning of GAN generators using the spectral normalization technique, which had previously been applied only to the discriminator.
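Spectral normalization divides a layer's weight matrix by an estimate of its largest singular value. The sketch below shows the idea with power iteration in plain Python; the function names and the list-of-rows matrix layout are assumptions for illustration, and it is not the SpectralNorm wrapper that appears in the model printouts. Practical implementations usually run only one power-iteration step per training update and reuse the vector between updates, whereas this sketch iterates to convergence in one call.

```python
import math


def spectral_norm(weight, n_iters=50):
    """Estimate the largest singular value of a matrix given as a list of rows."""
    rows, cols = len(weight), len(weight[0])
    v = [1.0 / math.sqrt(cols)] * cols
    sigma = 0.0
    for _ in range(n_iters):
        # u = W v, normalized to unit length
        u = [sum(weight[r][c] * v[c] for c in range(cols)) for r in range(rows)]
        u_norm = math.sqrt(sum(a * a for a in u)) or 1.0
        u = [a / u_norm for a in u]
        # v = W^T u; its length converges to the largest singular value
        v = [sum(weight[r][c] * u[r] for r in range(rows)) for c in range(cols)]
        sigma = math.sqrt(sum(a * a for a in v))
        v = [a / (sigma or 1.0) for a in v]
    return sigma


def normalize_weight(weight, n_iters=50):
    """Rescale a nonzero weight matrix so that its spectral norm is about 1."""
    sigma = spectral_norm(weight, n_iters)
    return [[w / sigma for w in row] for row in weight]


example_weight = [[3.0, 0.0], [0.0, 4.0]]
example_sigma = spectral_norm(example_weight)      # close to 4.0
example_normalized = normalize_weight(example_weight)
```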

​ As a result, SAGAN significantly outperforms the state of the art in image synthesis by boosting the best reported Inception score from 36.8 to 52.52 and reducing Fréchet Inception distance from 27.62 to 18.65.

Preliminary results of the experiment

The structure of the SAGAN model

The generator has five layers and two self-attention layers:

  • Layer one
    • x = ConvTranspose2d(128, 512, kernel_size=(4, 4), stride=(1, 1))
    • x = SpectralNorm(x)
    • x = BatchNorm2d(512)
    • x = ReLU(x)
  • Layer two
    • x = ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    • x = SpectralNorm(x)
    • x = BatchNorm2d(256)
    • x = ReLU(x)
  • Layer three
    • x = ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    • x = SpectralNorm(x)
    • x = BatchNorm2d(128)
    • x = ReLU(x)
  • Self_Attn(128)
  • Layer four
    • x = ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    • x = SpectralNorm(x)
    • x = BatchNorm2d(64)
    • x = ReLU(x)
  • Self_Attn(64)
  • Layer five
    • x = ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    • x = Tanh()
Generator(
  (l1): Sequential(
    (0): SpectralNorm(
      (module): ConvTranspose2d(128, 512, kernel_size=(4, 4), stride=(1, 1))
    )
    (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (l2): Sequential(
    (0): SpectralNorm(
      (module): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (l3): Sequential(
    (0): SpectralNorm(
      (module): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (l4): Sequential(
    (0): SpectralNorm(
      (module): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (last): Sequential(
    (0): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (1): Tanh()
  )
  (attn1): Self_Attn(
    (query_conv): Conv2d(128, 16, kernel_size=(1, 1), stride=(1, 1))
    (key_conv): Conv2d(128, 16, kernel_size=(1, 1), stride=(1, 1))
    (value_conv): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1))
    (softmax): Softmax()
  )
  (attn2): Self_Attn(
    (query_conv): Conv2d(64, 8, kernel_size=(1, 1), stride=(1, 1))
    (key_conv): Conv2d(64, 8, kernel_size=(1, 1), stride=(1, 1))
    (value_conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
    (softmax): Softmax()
  )
)
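To check that the generator printed above really ends at imsize = 64, the spatial size can be traced layer by layer. The sketch below uses the standard output-size formula for a transposed convolution (with dilation 1 and no output padding) and assumes the 128-dimensional latent vector is fed in as a 128 x 1 x 1 tensor, which the printout does not state explicitly. The helper names are hypothetical.

```python
def conv_transpose_out(size, kernel, stride, padding=0):
    """Output size of ConvTranspose2d along one dimension (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


# (name, kernel_size, stride, padding), copied from the generator printout
generator_layers = [
    ("l1", 4, 1, 0),
    ("l2", 4, 2, 1),
    ("l3", 4, 2, 1),
    ("l4", 4, 2, 1),
    ("last", 4, 2, 1),
]


def trace_generator(size=1):
    """Return the spatial size after each generator layer, starting from a 1 x 1 input."""
    sizes = {}
    for name, kernel, stride, padding in generator_layers:
        size = conv_transpose_out(size, kernel, stride, padding)
        sizes[name] = size
    return sizes


generator_sizes = trace_generator()
print(generator_sizes)  # {'l1': 4, 'l2': 8, 'l3': 16, 'l4': 32, 'last': 64}

# Number of positions each self-attention layer attends over
attn1_positions = generator_sizes["l3"] ** 2  # 16 x 16 = 256
attn2_positions = generator_sizes["l4"] ** 2  # 32 x 32 = 1024
```

Under this assumption the two self-attention layers operate on 16 x 16 and 32 x 32 feature maps, so their attention maps have 256 x 256 and 1024 x 1024 entries per image.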

The discriminator also has five layers and two self-attention layers:

  • Layer one
    • x = Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    • x = SpectralNorm(x)
    • x = LeakyReLU(negative_slope=0.1)
  • Layer two
    • x = Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    • x = SpectralNorm(x)
    • x = LeakyReLU(negative_slope=0.1)
  • Layer three
    • x = Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    • x = SpectralNorm(x)
    • x = LeakyReLU(negative_slope=0.1)
  • Self_Attn(256)
  • Layer four
    • x = Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    • x = SpectralNorm(x)
    • x = LeakyReLU(negative_slope=0.1)
  • Self_Attn(512)
  • Layer five
    • x = Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1))
Discriminator(
  (l1): Sequential(
    (0): SpectralNorm(
      (module): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): LeakyReLU(negative_slope=0.1)
  )
  (l2): Sequential(
    (0): SpectralNorm(
      (module): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): LeakyReLU(negative_slope=0.1)
  )
  (l3): Sequential(
    (0): SpectralNorm(
      (module): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): LeakyReLU(negative_slope=0.1)
  )
  (l4): Sequential(
    (0): SpectralNorm(
      (module): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): LeakyReLU(negative_slope=0.1)
  )
  (last): Sequential(
    (0): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1))
  )
  (attn1): Self_Attn(
    (query_conv): Conv2d(256, 32, kernel_size=(1, 1), stride=(1, 1))
    (key_conv): Conv2d(256, 32, kernel_size=(1, 1), stride=(1, 1))
    (value_conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
    (softmax): Softmax()
  )
  (attn2): Self_Attn(
    (query_conv): Conv2d(512, 64, kernel_size=(1, 1), stride=(1, 1))
    (key_conv): Conv2d(512, 64, kernel_size=(1, 1), stride=(1, 1))
    (value_conv): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
    (softmax): Softmax()
  )
)
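The same kind of check works for the discriminator: a 64 x 64 input should shrink to a single output value per image. This sketch applies the standard Conv2d output-size formula (dilation 1) to the layer parameters in the printout above; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output size of Conv2d along one dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel_size, stride, padding), copied from the discriminator printout
discriminator_layers = [
    ("l1", 4, 2, 1),
    ("l2", 4, 2, 1),
    ("l3", 4, 2, 1),
    ("l4", 4, 2, 1),
    ("last", 4, 1, 0),
]


def trace_discriminator(size=64):
    """Return the spatial size after each discriminator layer for a square input."""
    sizes = {}
    for name, kernel, stride, padding in discriminator_layers:
        size = conv_out(size, kernel, stride, padding)
        sizes[name] = size
    return sizes


discriminator_sizes = trace_discriminator()
print(discriminator_sizes)  # {'l1': 32, 'l2': 16, 'l3': 8, 'l4': 4, 'last': 1}
```

The discriminator's attention layers therefore see 8 x 8 and 4 x 4 feature maps, which are much smaller than the 16 x 16 and 32 x 32 maps in the generator.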

The hyperparameters of the SAGAN model

batch_size = 64
g_lr = 0.0001
d_lr = 0.0004
lr_decay = 0.95
imsize = 64
total_step = 100000
optimizer = 'Adam'
beta1 = 0.0
beta2 = 0.9

The training set

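import torchvision.datasets as dsets
from torchvision import transforms

# Method of the data loader class; self.path is the LSUN root directory
# and self.imsize is the target image size (64 here).
def load_lsun(self, classes='church_outdoor_train'):
    lsun_transforms = transforms.Compose([
        transforms.Resize((self.imsize, self.imsize)),
        transforms.ToTensor(),  # pixel values in [0, 1]
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # rescale to [-1, 1]
    ])

    dataset = dsets.LSUN(self.path, classes=[classes], transform=lsun_transforms)
    return dataset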
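The Normalize step with mean 0.5 and standard deviation 0.5 maps pixel values from [0, 1] to [-1, 1], which matches the range of the generator's final Tanh. A minimal arithmetic sketch follows; the function names are hypothetical, and the inverse mapping is only a suggestion for bringing generated samples back to [0, 1] before saving them.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Per-channel operation applied by Normalize: (x - mean) / std."""
    return (pixel - mean) / std


def denormalize(value, mean=0.5, std=0.5):
    """Inverse mapping, taking a value in [-1, 1] back to [0, 1]."""
    return value * std + mean


print(normalize(0.0), normalize(0.5), normalize(1.0))  # -1.0 0.0 1.0
print(denormalize(-1.0), denormalize(1.0))             # 0.0 1.0
```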

The ground truth

[Image: ground-truth samples from the training set]

The generated images

after 1000 steps:

Elapsed [0:09:07.832233], G_step [1000/100000], D_step[1000/100000], d_out_real: 1.1565,  ave_gamma_l3: -0.0323, ave_gamma_l4: -0.0486

[Image: generated samples after 1000 steps]

after 10000 steps:

Elapsed [0:45:39.616833], G_step [10000/100000], D_step[10000/100000], d_out_real: 0.7750,  ave_gamma_l3: -0.1495, ave_gamma_l4: -0.2459

[Image: generated samples after 10000 steps]

after 35000 steps:

Elapsed [2:32:44.314908], G_step [35000/100000], D_step[35000/100000], d_out_real: 0.2414,  ave_gamma_l3: -0.2588, ave_gamma_l4: -0.3762

[Image: generated samples after 35000 steps]

The planned work

  • Compare spectral normalization with other normalization methods in this experiment.
  • Use the two-timescale update rule (TTUR) to compensate for slow learning in a regularized discriminator, making it possible to use fewer discriminator steps per generator step.
  • Verify the effect of the self-attention module on the experimental results.
  • Tune the hyperparameters used to train the model.
