Other papers

1. Learning Deep Transformer Models for Machine Translation

https://arxiv.org/pdf/1906.01787.pdf

Describes how to train a very deep Transformer. The problem is that gradients vanish as depth grows. The method: the outputs of all preceding layers are concatenated and passed through a linear layer that maps them back to the model dimension, and the result is fed into the next layer.

The idea is similar to residual connections, but it can make use of the results of all preceding layers, and the linear mixing weights are trainable.
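A minimal sketch of this layer-aggregation idea in PyTorch (my own illustration, not the paper's code; the module name and the uniform initialization of the mixing weights are my assumptions): the input to layer k is a learned weighted combination of the outputs of all preceding layers.

```python
import torch
import torch.nn as nn

class DynamicLayerCombination(nn.Module):
    """Sketch: the input of layer k is a learned linear combination of
    the outputs of all preceding layers (index 0 = the embeddings)."""

    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        # mix[k] holds k+1 learnable weights over outputs 0..k.
        self.mix = nn.ParameterList(
            [nn.Parameter(torch.full((k + 1,), 1.0 / (k + 1)))
             for k in range(len(layers))]
        )

    def forward(self, x):
        outputs = [x]
        for layer, w in zip(self.layers, self.mix):
            # Weighted sum of everything computed so far, then the layer.
            combined = sum(wi * oi for wi, oi in zip(w, outputs))
            outputs.append(layer(combined))
        return outputs[-1]

# Hypothetical usage with simple feed-forward "layers":
stack = DynamicLayerCombination(
    [nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(8)]
)
y = stack(torch.randn(2, 64))
```

With the weights fixed so that only the immediately preceding output gets weight 1, this reduces to an ordinary feed-forward stack; letting them be learned is what allows each layer to draw on all earlier results.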

 

 

It also discusses the effect of pre-norm versus post-norm (layer normalization applied before or after the residual addition): in the deep setting, post-norm again leads to vanishing gradients, while pre-norm does not. It adds that once the linear connection is introduced, this is no longer a problem and post-norm models can be trained as well.

The figure shows the difference between the pre-norm and post-norm placements.
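As a reference for that comparison, here is a minimal sketch of the two placements (my own illustration; `sublayer` stands in for the attention or feed-forward block):

```python
import torch
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Post-norm: LayerNorm after the residual addition. Every layer's
    gradient must pass through a LayerNorm, which can vanish at depth."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    """Pre-norm: LayerNorm before the sublayer. The residual path is a
    pure identity (linear) connection, so gradients reach lower layers
    directly even in very deep stacks."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```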

 

2. RBF neural network

 
