[Artificial Intelligence and Deep Learning] Deep Learning in Natural Language Processing
- Multi-Head Attention Mechanism (a minimal implementation is sketched after this outline)
- Practical tricks for training with multi-head attention and position embeddings, and how to decode from language models
- Tip 1: Use layer normalization to stabilize training (see the pre-norm sketch after this outline)
- Tip 2: Learning rate warmup followed by inverse-square-root learning rate decay (see the schedule sketch below)
- Tip 3: Initialize parameters carefully (see the initialization sketch below)
- Tip 4: Label smoothing (see the label-smoothing sketch below)
- Results of the methods discussed above: the "ppl" column stands for perplexity, the exponential of the cross-entropy loss; lower ppl is better (a one-line computation is sketched after this outline).
- Key points about the Transformer language model
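A minimal PyTorch sketch of the multi-head attention mechanism named in the outline. The class and parameter names (`MultiHeadAttention`, `d_model`, `n_heads`) are illustrative assumptions, not taken from the notes:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must divide evenly into heads"
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        # One linear projection each for queries, keys, values, and output.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        batch, seq_len, _ = q.shape

        def split(x, proj):
            # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return proj(x).view(batch, -1, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        # Scaled dot-product attention, computed per head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        out = scores.softmax(dim=-1) @ v  # (batch, heads, seq, d_head)
        # Concatenate the heads, then project back to d_model.
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.w_o(out)

x = torch.randn(2, 5, 64)                                 # (batch, seq, d_model)
out = MultiHeadAttention(d_model=64, n_heads=8)(x, x, x)  # self-attention
```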
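For Tip 1, one common arrangement is the pre-norm residual sublayer, where layer normalization is applied before each attention or feed-forward block. The notes do not say whether pre-norm or post-norm is meant, so this is one plausible sketch:

```python
import torch.nn as nn

class PreNormSublayer(nn.Module):
    def __init__(self, d_model: int, sublayer: nn.Module, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # normalizes over the feature dim
        self.sublayer = sublayer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x + Sublayer(LayerNorm(x)): normalize first, then add the residual.
        return x + self.dropout(self.sublayer(self.norm(x)))
```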
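Tip 2 matches the schedule from "Attention Is All You Need": the learning rate rises linearly for `warmup` steps, then decays with the inverse square root of the step number. A sketch using the paper's defaults (`d_model=512`, `warmup=4000`):

```python
import torch

def transformer_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)  # avoid 0^-0.5 on the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Hooked into an optimizer via LambdaLR; base lr is 1.0 so the lambda's
# value is used directly. Adam betas/eps follow the original paper.
model = torch.nn.Linear(512, 512)  # stand-in for a real model
opt = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=transformer_lr)
```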
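Tip 3 is not spelled out in the notes; a common recipe, assumed here, is Xavier/Glorot initialization for every weight matrix, which keeps activation variance roughly constant across layers:

```python
import torch.nn as nn

def init_parameters(model: nn.Module) -> None:
    # Xavier-uniform for weight matrices; biases and LayerNorm
    # parameters keep the framework defaults.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
```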
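For Tip 4, label smoothing moves a small amount of target probability mass (0.1 in the original Transformer paper) from the gold token to all other vocabulary entries, penalizing over-confident output distributions. Recent PyTorch versions expose this directly:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # PyTorch >= 1.10

logits = torch.randn(8, 32000)           # (batch, vocab); dummy values
targets = torch.randint(0, 32000, (8,))  # dummy gold token ids
loss = criterion(logits, targets)
```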
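Finally, the perplexity reported with the results is just the exponential of the average per-token cross-entropy, so it can be read off the loss directly (the loss value below is made up for illustration):

```python
import math

cross_entropy = 3.2            # average per-token loss in nats (illustrative)
ppl = math.exp(cross_entropy)  # ~24.5; lower perplexity is better
```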