Applications of GAN in Protein Design (1)

It will soon be time for the thesis proposal: I need to submit a research proposal to my supervisor.

According to my supervisor's earlier plan, my master's research will focus on applications of GAN in protein design.

So, in the middle of the busy schedule of reviewing for math courses, I have to grind through this new material and work out a plan.

I will begin with the classic GAN, then learn about WGAN, and then start to develop my own ideas in this direction.

0. GAN overview

  A GAN consists primarily of two parts: a Generator and a Discriminator.

  The Generator's input is a random data sample or a noise vector; the Generator is trained against real data so that its output is the kind of target we want to generate.

  The Discriminator is a classifier: it returns 1 for a real target and 0 for a generated (fake) target.

  The Generator's goal is to generate targets that can fool the Discriminator, while the Discriminator tries as hard as possible to identify the generated fakes.

The two models are thus locked in constant confrontation, and through this adversarial, iterative co-evolution against the Discriminator, the Generator finally becomes able to generate targets that are realistic enough.
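To make this concrete, here is a minimal PyTorch sketch of the two parts (my own illustration, not from the original post; the noise/data dimensions and layer sizes are arbitrary assumptions):

```python
import torch
import torch.nn as nn

Z_DIM, X_DIM = 16, 32  # assumed noise / data dimensions, chosen only for illustration

# Generator: maps a random noise vector z to a generated sample G(z)
G = nn.Sequential(
    nn.Linear(Z_DIM, 64), nn.ReLU(),
    nn.Linear(64, X_DIM),
)

# Discriminator: maps a sample to a score in (0, 1);
# 1 means "judged real", 0 means "judged fake"
D = nn.Sequential(
    nn.Linear(X_DIM, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)

z = torch.randn(8, Z_DIM)  # a batch of random noise
fake = G(z)                # a batch of generated targets
print(D(fake).shape)       # torch.Size([8, 1]): one realness score per target
```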

1. The principle of GAN

  First, for any real dataset, we can posit a true data distribution $P_{data}(x)$, where $x$ is a real data point; for example, we can picture each real data point as a vector, so the distribution over these vectors is $P_{data}(x)$. The Generator's goal is to generate a series of new vectors that obey this distribution. Suppose the Generator can currently generate the distribution $P_G(x \mid \theta)$, where $\theta$ is the set of parameters controlling this distribution; our goal is to train $\theta$ so that the artifacts the Generator produces conform, as far as possible, to the real data distribution $P_{data}(x)$.

  To train the Generator, we first collect a set of real data points $D = \{x^1, x^2, \ldots, x^m\}$. For a given $\theta$, each real data point $x^i$ has probability density $P_G(x^i \mid \theta)$ under the Generator. We can then compute the likelihood of $\{x^1, x^2, \ldots, x^m\}$ and train the Generator by finding the value of $\theta$ that maximizes this likelihood function:

$$\theta^* = \arg\max_{\theta} \prod_{i=1}^{m} P_G(x^i \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{m} \log P_G(x^i \mid \theta)$$

$$\approx \arg\max_{\theta} \mathbb{E}_{x \sim P_{data}}\left[\log P_G(x \mid \theta)\right] = \arg\min_{\theta} KL\left(P_{data} \,\|\, P_G\right)$$

    From this series of derivations we can see that maximizing the likelihood function amounts to making the distribution generated by the Generator as close as possible to the real data distribution; in other words, it makes the Generator generate data that is as realistic as possible. So we should look for a $\theta^*$ that brings $P_G$ closest to $P_{data}$. Now assume $P_G(x \mid \theta)$ is a neural network: we generate a random vector $z$, and the network $G(z) = x$ produces the generated data $x$. How do we compare whether the distribution of the data generated by $G(\cdot)$ is the same as the true distribution? We take a batch of samples $z$, all following some fixed prior distribution; passing them through $G(\cdot)$ produces another distribution $P_G$, and we compare this $P_G$ with the true distribution $P_{data}$.

  As we all know, as long as a neural network has nonlinear activation functions it can fit almost any function; so, by sampling from a simple prior such as a Gaussian distribution, a deep network can likewise learn to represent a very complex distribution.
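A quick way to see this (my own illustrative sketch, not from the original post): even one fixed nonlinear transform already warps Gaussian noise into a visibly non-Gaussian distribution; a trained network just learns which transform to apply.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)  # samples from the simple prior N(0, 1)

# A fixed nonlinear "generator" standing in for a learned G(z):
# exponentiation turns symmetric Gaussian noise into a skewed,
# heavy-tailed (log-normal) distribution.
x = np.exp(z)

def skew(s):
    return np.mean(((s - s.mean()) / s.std()) ** 3)

print(f"prior skewness  ~ {skew(z):.2f}")  # ~0: the Gaussian is symmetric
print(f"output skewness ~ {skew(x):.2f}")  # large and positive: clearly non-Gaussian
```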


 

    What we obtain through the generative model is a generated distribution, i.e. the distribution of the artifacts, and this distribution should be very similar to the distribution of the real data. GAN's job is to find, by adjusting the parameters $\theta$, the generated distribution that is closest to the real one. The GAN objective is as follows:

$$G^* = \arg\min_{G} \max_{D} V(G, D)$$

$$V(G, D) = \mathbb{E}_{x \sim P_{data}}\left[\log D(x)\right] + \mathbb{E}_{x \sim P_G}\left[\log(1 - D(x))\right]$$

(G = Generator, D = Discriminator)

    Our aim is to find a Generator $G^*$ good enough that its generated distribution matches the real data distribution; at the same time, for every given version of $G$, we need to find the $D$ that best separates the generated data from the real data. So for each generation of $G$ there is an optimal Discriminator $D^*$, and by following the joint evolution of $G$ and $D$ we can find an optimal $G^*$.

  Now we fix $G$ and solve for the optimal $D$. Expanding $V(G, D)$ gives:

$$V(G, D) = \mathbb{E}_{x \sim P_{data}}\left[\log D(x)\right] + \mathbb{E}_{x \sim P_G}\left[\log(1 - D(x))\right] = \int_x \left[ P_{data}(x) \log D(x) + P_G(x) \log(1 - D(x)) \right] dx$$

   Here, assume that D (x) can take any value, then the integral sign off, corresponding to the maximum value of this function we require the following

 

 

 

 

 

  Substituting $a = P_{data}(x)$ and $b = P_G(x)$, the original expression becomes a function of $D$: $f(D) = a \log D + b \log(1 - D)$. Taking the derivative with respect to $D$ and setting it to zero gives the value of $D$ at which the function is maximized:

$$\frac{df}{dD} = \frac{a}{D} - \frac{b}{1 - D} = 0 \quad\Longrightarrow\quad D^* = \frac{a}{a + b}$$

That is, $D^*(x) = \dfrac{P_{data}(x)}{P_{data}(x) + P_G(x)}$; so for a given $G$, this $D^*$ is the optimum. Substituting $D^*$ back in:

$$\max_D V(G, D) = \int_x \left[ P_{data}(x) \log \frac{P_{data}(x)}{P_{data}(x) + P_G(x)} + P_G(x) \log \frac{P_G(x)}{P_{data}(x) + P_G(x)} \right] dx$$
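As a quick sanity check of this maximization (my own addition, with arbitrary example values standing in for $P_{data}(x)$ and $P_G(x)$):

```python
import numpy as np

a, b = 0.7, 0.3  # stand-ins for P_data(x) and P_G(x) at some fixed x
D = np.linspace(1e-6, 1 - 1e-6, 100_001)
f = a * np.log(D) + b * np.log(1 - D)

print(f"argmax of f(D) ~ {D[np.argmax(f)]:.4f}")  # ~0.7000
print(f"a / (a + b)    = {a / (a + b):.4f}")      # 0.7000
```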

 

   In the final expression above a new divergence appears, called the Jensen-Shannon divergence (JSD), which is used to measure the difference between two distributions: if the two distributions are completely disjoint (no intersection at all), the JSD equals $\log 2$; if the two distributions are identical, then $JSD = 0$. From this we know that, for a given $G$,

$$\max_D V(G, D) = -2 \log 2 + 2 \cdot JSD\left(P_{data} \,\|\, P_G\right)$$

   Then, when $JSD(P_{data} \,\|\, P_G) = 0$, i.e. when $P_G = P_{data}$, we have $G = G^*$.
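The two endpoint values of the JSD quoted above are easy to verify numerically (my own sketch, using small discrete toy distributions):

```python
import numpy as np

def jsd(p, q):
    # Jensen-Shannon divergence between two discrete distributions
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # 0 * log(0) is taken as 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(jsd([0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]))  # disjoint -> log 2 ~ 0.6931
print(jsd([0.25] * 4, [0.25] * 4))                       # identical -> 0.0
```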

   At the same time, in order to obtain $G^*$ and $D^*$, we need to find loss functions that let us reach $G^*$ and $D^*$ via Gradient Descent.

  The GAN training procedure, together with the loss functions of the Generator and the Discriminator, was shown in a figure here: [figure: alternating training, in which the Discriminator is updated several times for every single Generator update]

   Note that reaching the max point to obtain $D^*$ may take several rounds of Gradient Descent. As for $G$: if the updated $G_{new}$ ends up with $V(G_{old}, D_{old}) < V(G_{new}, D_{new})$ compared with the previous $G_{old}$, the update has actually made things worse. The way to avoid this is to update $G$ only a little each time; hence the scheme shown in the figure above, where $D$ is updated several times while $G$ is updated only once. Also, we cannot guarantee that the point reached by Gradient Descent is a global optimum; we can only be sure it is a local optimum.
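Here is a minimal sketch of this alternating scheme (my own illustration; the toy "real data", network sizes, learning rates, and the choice K = 5 are all arbitrary assumptions):

```python
import torch
import torch.nn as nn

Z_DIM, X_DIM, K = 16, 32, 5  # K = Discriminator steps per single Generator step
G = nn.Sequential(nn.Linear(Z_DIM, 64), nn.ReLU(), nn.Linear(64, X_DIM))
D = nn.Sequential(nn.Linear(X_DIM, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCELoss()

def real_batch(n=64):
    return torch.randn(n, X_DIM) + 2.0  # stand-in for sampling the real dataset

for step in range(200):
    # update D several times: ascend V by pushing D(x) -> 1 and D(G(z)) -> 0
    for _ in range(K):
        x = real_batch()
        fake = G(torch.randn(64, Z_DIM)).detach()  # freeze G while training D
        loss_D = bce(D(x), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()

    # update G once, by a small step, minimizing the classic loss log(1 - D(G(z)))
    fake = G(torch.randn(64, Z_DIM))
    loss_G = torch.log(1.0 - D(fake) + 1e-8).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
```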

 

 2. Problems with the classic GAN

 $\log(1 - D(x))$ is $G$'s loss function during training. But looking at the graph of this function, we can see that when $D(x)$ is close to 0 the gradient is very small, so at the very start of training $G$ converges extremely slowly. We can change this loss function to $-\log(D(x))$: the trend of the loss is unchanged (it becomes the other curve in the original figure), but now the function has a large gradient at the start of training and a small gradient as it approaches convergence, which matches what we want during training.
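A quick numerical illustration of this (my own sketch): near $D(x) \approx 0$ the original loss $\log(1 - D)$ is nearly flat, while $-\log(D)$ has a huge gradient, and the situation reverses near convergence.

```python
# d/dD of the two candidate Generator losses, at several values of D(G(z))
for d in (0.01, 0.5, 0.99):  # start of training -> converged
    grad_orig = -1.0 / (1.0 - d)  # derivative of log(1 - D): flat near D ~ 0
    grad_new = -1.0 / d           # derivative of -log(D): steep near D ~ 0
    print(f"D(G(z)) = {d:4}: log(1-D) grad = {grad_orig:8.2f}, -log(D) grad = {grad_new:8.2f}")
```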

[figure: the two generator loss curves, $\log(1 - D(x))$ and $-\log(D(x))$, plotted against $D(x)$]

  Another problem: because GAN training uses samples drawn from the two distributions, when we train $D$ we find that no matter how powerful $G$ is, the sampled $P_G$ and $P_{data}$ have essentially no intersection (the data only come from sampling), so the Discriminator can always tell the artifacts apart. As illustrated in the figure below, the Discriminator can always find a way to separate $P_G$ from $P_{data}$:

[figure: the Discriminator finds a boundary that perfectly separates the sampled points of $P_G$ from those of $P_{data}$]

    At this point we naturally wonder whether the Discriminator could be made a bit weaker, yet we still want it to be able to tell fake images apart, so there is a contradiction. There is another possibility as well: if both distributions are very narrow (for example, lines), then $JSD = \log 2$, which is equivalent to having no intersection, and the Generator cannot be improved by Gradient Descent. One remedy is to add noise so that both distributions become much wider and a meaningful JSD can be computed; but the noise has to be decayed over time so that the two distributions can converge toward each other.
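A minimal sketch of this noise trick (my own illustration; the initial noise level and the linear decay schedule are arbitrary choices):

```python
import torch

def noisy(x: torch.Tensor, step: int, total_steps: int, sigma0: float = 0.5):
    # Add Gaussian noise that decays linearly to zero over training.
    # Widening both P_data and P_G this way gives them overlapping support,
    # so the JSD (and hence D's feedback to G) stays informative early on.
    sigma = sigma0 * max(0.0, 1.0 - step / total_steps)
    return x + sigma * torch.randn_like(x)

# usage inside the training loop sketched earlier (hypothetical wiring):
#   loss_D = bce(D(noisy(x, step, 200)), ones) + bce(D(noisy(fake, step, 200)), zeros)
```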

    There is yet another problem, called Mode Collapse. It arises because in many situations the Generator cannot produce a distribution with real diversity (for example, a mixture of several Gaussians), which deviates from the training process we imagine: we hope the Generator will produce a distribution as close as possible to the real one, but it often manages to generate only one mode of the real distribution. For example, if we train a GAN on pictures of cats, dogs, cattle, and sheep, hoping it will generate pictures of all of these animals, the GAN may end up able to generate only pictures of dogs.

[figure: two panels comparing a multi-modal $P_{data}$ with $P_G$; left: minimizing $KL(P_{data} \,\|\, P_G)$ makes $P_G$ spread over every mode of $P_{data}$; right: minimizing $KL(P_G \,\|\, P_{data})$ makes $P_G$ concentrate on a single mode]

    For the distribution on the left: wherever $P_{data}$ has mass but $P_G$ does not, the KLD tends to infinity. So, to keep the KLD from blowing up, the Generator must place some $P_G$ mass everywhere $P_{data}$ has mass; and since by its nature it cannot generate several separate Gaussians, all it can do is stretch itself to cover $P_{data}$'s curve, producing the situation in the left panel. This is not Mode Collapse, but it yields many meaningless samples, and the training results are poor.

    For the distribution on the right: wherever $P_G$ has mass, $P_{data}$ must also have mass, otherwise the Generator faces a very large penalty. To avoid that penalty, the Generator prefers to generate data inside a region it is sure is correct rather than risk producing different data elsewhere. This is what causes Mode Collapse: given a real distribution that is a superposition of several Gaussians, our Generator will produce data from only one of them.
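A small numerical illustration of this asymmetry (my own sketch, with hand-picked toy densities): a wide Gaussian that covers both modes scores well under the forward $KL(P_{data} \,\|\, P_G)$, while a narrow Gaussian collapsed onto one mode scores well under the reverse $KL(P_G \,\|\, P_{data})$.

```python
import numpy as np

xs = np.linspace(-8, 8, 4001)

def gauss(mu, sig):
    p = np.exp(-(xs - mu) ** 2 / (2 * sig ** 2))
    return p / np.trapz(p, xs)  # normalize to a density on the grid

p_data = 0.5 * gauss(-3, 0.5) + 0.5 * gauss(3, 0.5)  # two well-separated modes
covering = gauss(0, 3.0)     # one wide Gaussian smeared over both modes
collapsed = gauss(-3, 0.5)   # a narrow Gaussian sitting on a single mode

def kl(p, q):
    return np.trapz(p * np.log((p + 1e-12) / (q + 1e-12)), xs)

for name, p_g in (("covering", covering), ("collapsed", collapsed)):
    print(f"{name:9}: KL(P_data||P_G) = {kl(p_data, p_g):6.2f}  "
          f"KL(P_G||P_data) = {kl(p_g, p_data):6.2f}")
```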

3. Summary of the classic GAN

     I have finally finished working through GAN. My own impression: it really is a very nice idea, but on closer inspection it has a great many drawbacks, and taking the classic GAN as-is to solve real problems will very likely not work well. Concretely, the classic GAN has roughly the following weaknesses:

    (1) it is difficult to train; (2) the Generator's and Discriminator's losses do not indicate training progress; (3) the generated samples lack diversity; and so on.

    For these weaknesses of the classic GAN, a variety of papers have proposed improvements. The best known is WGAN, which improves GAN by using the Wasserstein distance: it fixes GAN's training instability, essentially solves the mode collapse problem, and provides a cross-entropy-like quantity that indicates training progress and result quality, where a smaller value means better generated samples.
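As a preview, here is a sketch of the usual WGAN critic update (my own summary of the standard recipe, not something derived in this post; the learning rate and clipping threshold follow common defaults):

```python
import torch
import torch.nn as nn

# WGAN "critic": same shape as the GAN Discriminator, but with no sigmoid
C = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
opt_C = torch.optim.RMSprop(C.parameters(), lr=5e-5)

def critic_step(x_real, x_fake, clip=0.01):
    # maximize E[C(real)] - E[C(fake)], i.e. minimize its negation;
    # the negated loss approximates the Wasserstein distance between
    # P_data and P_G -- the quantity that tracks sample quality
    loss = C(x_fake).mean() - C(x_real).mean()
    opt_C.zero_grad()
    loss.backward()
    opt_C.step()
    for p in C.parameters():  # weight clipping: a crude Lipschitz constraint
        p.data.clamp_(-clip, clip)
    return -loss.item()       # smaller = the two distributions are closer
```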

    More recently there is also seq-GAN, which combines reinforcement learning with GAN. My own preliminary idea is to explore whether seq-GAN can be put to use in protein design. That is all for part one; the next step is to study WGAN and seq-GAN, see whether anyone has already produced results in this area, and explore the feasibility of applying seq-GAN to protein design.

 
