1. Learning Deep Transformer Models for Machine Translation
https://arxiv.org/pdf/1906.01787.pdf
Describes how to train very deep Transformers. The main problem is that gradients vanish as depth grows. The proposed method concatenates the outputs of all preceding layers and feeds them, after a learned linear combination, into the next layer.
The idea is similar to residual connections, but it can exploit the outputs of all earlier layers, and the weights of the linear combination are trainable.
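The linear combination of all preceding layers can be sketched as follows. This is a minimal NumPy illustration under my own naming (`dlcl_combine`, `weights`), not the paper's implementation; in the actual model the weights are learned parameters trained jointly with the network.

```python
import numpy as np

def dlcl_combine(layer_outputs, weights):
    """Weighted combination of all preceding layers' outputs (sketch).

    layer_outputs: list of L arrays, each of shape (seq_len, d_model),
        i.e. the embedding plus every earlier layer's output.
    weights: 1-D array of length L; learnable in the real model.
    Returns the combined input for the next layer, shape (seq_len, d_model).
    """
    stacked = np.stack(layer_outputs)               # (L, seq_len, d_model)
    return np.tensordot(weights, stacked, axes=1)   # weighted sum over L

# Toy usage: three "layers" of a 2-token, 4-dimensional model.
outs = [np.ones((2, 4)) * k for k in range(1, 4)]
w = np.array([0.2, 0.3, 0.5])
combined = dlcl_combine(outs, w)   # every entry is 0.2*1 + 0.3*2 + 0.5*3 = 2.3
```

In contrast to a plain residual connection, which only adds the immediately preceding output, this gives each layer direct, weighted access to every earlier representation.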
The paper also discusses the effect of pre-norm versus post-norm layer normalization: with post-norm, gradients vanish again once the network is deep, while pre-norm does not have this problem. It further notes that after the linear layer connections are added, post-norm is no longer an issue and deep post-norm models can also be trained.
The figure in the paper shows the difference between the pre-norm and post-norm arrangements.
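The pre-norm / post-norm distinction can be sketched as follows. This is a toy NumPy sketch under my own naming (`pre_norm_block`, `post_norm_block`), with a fixed linear map standing in for the attention or feed-forward sublayer; it only illustrates the ordering of operations, not a full Transformer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize the last dimension to zero mean, unit variance."""
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def post_norm_block(x, sublayer):
    """Original Transformer ordering: normalize AFTER the residual add."""
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    """Pre-norm ordering: normalize the sublayer input; the residual
    path stays a pure identity, so gradients reach early layers
    unimpeded even in very deep stacks."""
    return x + sublayer(layer_norm(x))

# Toy sublayer: a small fixed linear map (stands in for attention / FFN).
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.1
sub = lambda h: h @ W

x = rng.standard_normal((2, 4))
y_post = post_norm_block(x, sub)
y_pre = pre_norm_block(x, sub)
```

The key difference is the residual path: in post-norm, every block's output passes through a normalization, so the identity path is repeatedly rescaled; in pre-norm, the identity path is untouched from input to output.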
2. RBF neural network