https://www.wxnmh.com/thread-1528249.htm
https://www.wxnmh.com/thread-1528251.htm
https://www.wxnmh.com/thread-1528254.htm
Word embeddings
Use pre-trained word embeddings when available (Kim, 2014) [12]
The optimal dimensionality of word embeddings is mostly task-dependent: a smaller dimensionality works better for more syntactic tasks such as named entity recognition (Melamud et al., 2016) [44] or part-of-speech (POS) tagging (Plank et al., 2016) [32], while a larger dimensionality is more useful for more semantic tasks such as sentiment analysis (Ruder et al., 2016) [45]
Depth
use deep Bi-LSTMs, typically consisting of 3-4 layers, e.g. for POS tagging (Plank et al., 2016) and semantic role labelling (He et al., 2017) [33]. Models for some tasks can be even deeper, cf. Google's NMT model with 8 encoder and 8 decoder layers (Wu et al., 2016) [20]
the performance improvements from making the model deeper than 2 layers are minimal (Reimers & Gurevych, 2017) [46]
For classification, deep or very deep models perform well only with character-level input; shallow word-level models remain the state of the art (Zhang et al., 2015; Conneau et al., 2016; Le et al., 2017) [28, 29, 30]
Layer connections
vanishing gradient problem
Highway layers (Srivastava et al., 2015) [1] are inspired by the gates of an LSTM.
Highway layers have been used predominantly to achieve state-of-the-art results for language modelling (Kim et al., 2016; Jozefowicz et al., 2016; Zilly et al., 2017) [2, 3, 4], but have also been used for other tasks such as speech recognition (Zhang et al., 2016) [5]
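The gating can be sketched in a few lines of NumPy. A transform gate t = sigmoid(x·W_t + b_t) blends a candidate transform H(x) with the untouched input; all weights, shapes, and the negative gate-bias initialization below are illustrative, not taken from the cited papers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """Highway layer: a learned gate t blends a transform H(x) with the input.
    output = t * H(x) + (1 - t) * x  (elementwise)."""
    h = np.tanh(x @ W_h + b_h)   # candidate transform H(x)
    t = sigmoid(x @ W_t + b_t)   # transform gate in (0, 1)
    return t * h + (1.0 - t) * x

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((4, d))
W_h = rng.standard_normal((d, d)) * 0.1
W_t = rng.standard_normal((d, d)) * 0.1
b_h = np.zeros(d)
b_t = np.full(d, -2.0)  # negative gate bias: start close to the identity (carry) behaviour
y = highway_layer(x, W_h, b_h, W_t, b_t)
assert y.shape == x.shape
```

With a very negative gate bias the layer passes its input through almost unchanged, which is what makes deep stacks of such layers trainable.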
Residual connections (He et al., 2016) [6] were first proposed for computer vision and were the main factor behind winning ILSVRC (ImageNet) 2015.
This simple modification mitigates the vanishing gradient problem, as the model can default to using the identity function if the layer is not beneficial.
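A minimal sketch of the idea, with an arbitrary ReLU sub-layer standing in for the learned function F:

```python
import numpy as np

def residual_block(x, f):
    """Residual connection: add the sub-layer's output to its input,
    so the block can fall back to the identity when f is not useful."""
    return x + f(x)

rng = np.random.default_rng(1)
d = 8
W = rng.standard_normal((d, d)) * 0.1
f = lambda z: np.maximum(0.0, z @ W)   # a small ReLU sub-layer F(x)
x = rng.standard_normal((4, d))
y = residual_block(x, f)
assert y.shape == x.shape
# If F collapses to zero, the block is exactly the identity:
assert np.allclose(residual_block(x, lambda z: np.zeros_like(z)), x)
```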
dense connections (Huang et al., 2017) [7] (best paper award at CVPR 2017) add direct connections from each layer to all subsequent layers.
Dense connections have been used successfully in computer vision. They have also been found useful for multi-task learning of different NLP tasks (Ruder et al., 2017) [49], while a residual variant that uses summation has been shown to consistently outperform residual connections for neural machine translation (Britz et al., 2017) [27].
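A sketch of dense connectivity in NumPy, where each layer receives the concatenation of the input and all earlier layers' outputs; the dimensions and growth rate are made up for illustration:

```python
import numpy as np

def dense_block(x, layers):
    """DenseNet-style connectivity: each layer sees the concatenation of the
    original input and all previous layers' outputs along the feature axis."""
    features = [x]
    for f in layers:
        h = f(np.concatenate(features, axis=-1))
        features.append(h)
    return np.concatenate(features, axis=-1)

rng = np.random.default_rng(2)
d, k, n_layers = 4, 3, 2   # input width d, "growth rate" k features per layer
Ws = [rng.standard_normal((d + i * k, k)) * 0.1 for i in range(n_layers)]
layers = [lambda z, W=W: np.tanh(z @ W) for W in Ws]
x = rng.standard_normal((5, d))
out = dense_block(x, layers)
assert out.shape == (5, d + n_layers * k)   # features accumulate by concatenation
```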
Dropout
While batch normalisation in computer vision has made other regularizers obsolete in most applications, dropout (Srivastava et al., 2014) [8] is still the go-to regularizer for deep neural networks in NLP.
A dropout rate of 0.5 has been shown to be effective in most scenarios (Kim, 2014).
The main problem hindering dropout in NLP has been that it could not be applied to recurrent connections: sampling a fresh dropout mask at every timestep accumulates over the sequence and effectively zeroes out the hidden states over time.
Recurrent dropout (Gal & Ghahramani, 2016) [11] addresses this by applying the same dropout mask at every timestep.
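A toy illustration of this idea: sample one (inverted) dropout mask per sequence and reuse it at every timestep, rather than resampling per step. The tanh-RNN and all dimensions here are placeholders, not the authors' actual model:

```python
import numpy as np

def rnn_with_recurrent_dropout(xs, W_x, W_h, p_drop, rng):
    """Toy tanh-RNN with variational (recurrent) dropout: one dropout mask is
    sampled per sequence and reused on the hidden state at every timestep."""
    d = W_h.shape[0]
    h = np.zeros(d)
    # Single inverted-dropout mask, shared across all timesteps:
    mask = (rng.random(d) >= p_drop) / (1.0 - p_drop)
    for x in xs:
        h = np.tanh(x @ W_x + (h * mask) @ W_h)
    return h

rng = np.random.default_rng(3)
d_in, d_h, T = 5, 8, 6
W_x = rng.standard_normal((d_in, d_h)) * 0.1
W_h = rng.standard_normal((d_h, d_h)) * 0.1
xs = rng.standard_normal((T, d_in))
h = rnn_with_recurrent_dropout(xs, W_x, W_h, p_drop=0.5, rng=rng)
assert h.shape == (d_h,)
```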
Multi-task learning...
Attention...
Optimization
Adam (Kingma & Ba, 2015) [21] is one of the most popular and widely used optimization algorithms and often the go-to optimizer for NLP researchers. It is often thought that Adam clearly outperforms vanilla stochastic gradient descent (SGD).
While Adam converges much faster, it has been observed that SGD with learning rate annealing slightly outperforms it (Wu et al., 2016). Recent work furthermore shows that SGD with properly tuned momentum outperforms Adam (Zhang et al., 2017) [42].
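For reference, classical SGD with momentum is only a few lines; the quadratic objective below is a stand-in for a real loss:

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.1, mu=0.9):
    """Classical momentum update: v <- mu*v - lr*grad; w <- w + v."""
    v = mu * v - lr * grad
    return w + v, v

# Minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w:
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(300):
    w, v = sgd_momentum_step(w, v, grad=w)
assert np.linalg.norm(w) < 1e-3   # converged close to the minimum at 0
```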
Ensembling
Ensembling multiple models is an important way to make results more reliable, and ensembles become more effective as the diversity of the combined models increases (Denkowski & Neubig, 2017).
ensembling different checkpoints of a model has been shown to be effective (Jean et al., 2015; Sennrich et al., 2016) [51, 52]
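Checkpoint ensembling can be as simple as averaging the probability distributions predicted by several saved checkpoints of the same model; the numbers below are made up:

```python
import numpy as np

def ensemble_predict(checkpoint_probs):
    """Average the class probabilities predicted by several checkpoints,
    then take the argmax per example."""
    avg = np.mean(checkpoint_probs, axis=0)   # (n_examples, n_classes)
    return avg.argmax(axis=-1)

# Three checkpoints' softmax outputs for two examples over three classes:
p1 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p2 = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
p3 = np.array([[0.4, 0.5, 0.1], [0.2, 0.2, 0.6]])
preds = ensemble_predict(np.stack([p1, p2, p3]))
assert preds.tolist() == [0, 2]
```

Note that no single checkpoint agrees on example 2; the average smooths out their disagreement.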
Hyperparameter optimization
simply tuning the hyperparameters of our model can yield significant improvements over baselines.
Automatic tuning of hyperparameters of an LSTM has led to state-of-the-art results in language modeling, outperforming models that are far more complex (Melis et al., 2017).
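Melis et al. used a sophisticated tuner, but even plain random search over a small space captures the workflow; `fake_eval` below is a hypothetical stand-in for training a model and scoring it on a dev set:

```python
import random

def random_search(train_and_eval, space, n_trials=20, seed=0):
    """Minimal random hyperparameter search: sample configurations from a
    space of candidate values and keep the one with the best dev score."""
    rng = random.Random(seed)
    best_score, best_cfg = float("-inf"), None
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = train_and_eval(cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score

# Stand-in objective: pretend dev accuracy peaks at lr=0.01, dropout=0.5.
def fake_eval(cfg):
    return -abs(cfg["lr"] - 0.01) - abs(cfg["dropout"] - 0.5)

space = {"lr": [0.1, 0.03, 0.01, 0.003], "dropout": [0.2, 0.5, 0.8]}
best_cfg, _ = random_search(fake_eval, space, n_trials=200)
assert best_cfg == {"lr": 0.01, "dropout": 0.5}
```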
LSTM tricks...
Task-specific best practices
Classification
CNNs have been popular for classification tasks in NLP.
Combining filter sizes near the optimal filter size, e.g. (3, 4, 5), performs best (Kim, 2014; Kim et al., 2016).
The optimal number of feature maps is in the range of 50-600 (Zhang & Wallace, 2015) [59].
1-max pooling outperforms average pooling and k-max pooling (Zhang & Wallace, 2015).
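A Kim-style feature extractor with filter widths (3, 4, 5) and 1-max pooling over time can be sketched in NumPy; all weights and dimensions are illustrative:

```python
import numpy as np

def conv_1max_features(emb, filters):
    """1-D convolution over word embeddings with several filter widths,
    each followed by ReLU and 1-max pooling over time."""
    feats = []
    for W in filters:                  # W: (width, emb_dim, n_maps)
        width = W.shape[0]
        T = emb.shape[0] - width + 1   # number of window positions
        # One activation per window position and feature map:
        conv = np.stack([np.tensordot(emb[t:t + width], W, axes=2)
                         for t in range(T)])             # (T, n_maps)
        feats.append(np.maximum(0.0, conv).max(axis=0))  # 1-max pool over time
    return np.concatenate(feats)

rng = np.random.default_rng(4)
emb_dim, n_maps = 6, 10
emb = rng.standard_normal((12, emb_dim))               # a 12-token "sentence"
filters = [rng.standard_normal((w, emb_dim, n_maps)) * 0.1 for w in (3, 4, 5)]
feat = conv_1max_features(emb, filters)
assert feat.shape == (3 * n_maps,)   # one pooled value per feature map
```

The pooled vector is what a final softmax classifier would consume; its length is independent of the sentence length.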
Sequence labelling
Tagging scheme BIO marks the first token of a segment with a B- tag, all remaining tokens in the segment with an I- tag, and tokens outside of segments with an O tag;
IOBES additionally distinguishes between single-token entities (S-) and the last token of a segment (E-).
Using IOBES and BIO yields similar performance (Lample et al., 2017)
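The two schemes are easy to pin down with a small converter from labelled spans to per-token tags (a sketch, not taken from any cited paper):

```python
def spans_to_tags(n_tokens, spans, scheme="BIO"):
    """Convert labelled spans [(start, end_exclusive, type)] to per-token
    tags in the BIO or IOBES scheme."""
    tags = ["O"] * n_tokens
    for start, end, typ in spans:
        if scheme == "IOBES" and end - start == 1:
            tags[start] = f"S-{typ}"     # single-token entity
            continue
        tags[start] = f"B-{typ}"         # first token of the segment
        for i in range(start + 1, end):
            tags[i] = f"I-{typ}"         # inside tokens
        if scheme == "IOBES":
            tags[end - 1] = f"E-{typ}"   # last token of the segment
    return tags

# "John lives in New York" with a PER and a LOC entity:
assert spans_to_tags(5, [(0, 1, "PER"), (3, 5, "LOC")]) == \
    ["B-PER", "O", "O", "B-LOC", "I-LOC"]
assert spans_to_tags(5, [(0, 1, "PER"), (3, 5, "LOC")], "IOBES") == \
    ["S-PER", "O", "O", "B-LOC", "E-LOC"]
```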
CRF output layer If there are dependencies between outputs, such as in named entity recognition, the final softmax layer can be replaced with a linear-chain conditional random field (CRF). This has been shown to yield consistent improvements for tasks that require the modelling of constraints (Huang et al., 2015; Ma & Hovy, 2016; Lample et al., 2016) [60, 61, 62].
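At decoding time, a linear-chain CRF's best tag sequence is found with the Viterbi algorithm over emission and transition scores; this is a minimal sketch with made-up scores:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Viterbi decoding for a linear-chain CRF: find the tag sequence that
    maximizes the sum of per-token emission scores and tag-to-tag
    transition scores. emissions: (T, K), transitions: (K, K)."""
    T, K = emissions.shape
    score = emissions[0].copy()            # best score ending in each tag
    back = np.zeros((T, K), dtype=int)     # backpointers
    for t in range(1, T):
        cand = score[:, None] + transitions   # cand[i, j]: reach tag j from i
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

emissions = np.array([[2.0, 0.0], [0.0, 1.0]])
# With free transitions, each position picks its best tag independently:
assert viterbi_decode(emissions, np.zeros((2, 2))) == [0, 1]
# Heavily penalizing the 0 -> 1 transition forces a globally consistent path:
assert viterbi_decode(emissions, np.array([[0.0, -5.0], [0.0, 0.0]])) == [0, 0]
```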
Constrained decoding...
Natural language generation
many of the tips presented so far stem from advances in language modelling
Modelling coverage A checklist can be used if it is known in advance which entities should be mentioned in the output, e.g. ingredients in recipes (Kiddon et al., 2016) [63]
...
Neural machine translation
While neural machine translation (NMT) is an instance of NLG, NMT has received so much attention that many best practices and hyperparameter choices apply exclusively to it.
Embedding dimensionality
2048-dimensional embeddings yield the best performance, but only do so by a small margin.
Even 128-dimensional embeddings perform surprisingly well and converge almost twice as quickly (Britz et al., 2017).
Encoder and decoder depth...
The encoder does not need to be deeper than 2-4 layers.
Deeper models outperform shallower ones, but more than 4 layers are not necessary for the decoder
Directionality
Bidirectional encoders outperform unidirectional ones by a small margin...
Beam search strategy
Medium beam sizes around 10 with length normalization penalty of 1.0 (Wu et al., 2016) yield the best performance (Britz et al., 2017).
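The length normalization of Wu et al. (2016) divides a hypothesis's summed log-probability by a length penalty, so that longer translations are not unfairly penalized when ranking beam candidates; a small sketch:

```python
def length_penalty(length, alpha=1.0):
    """GNMT length penalty (Wu et al., 2016): lp = ((5 + len) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def normalized_score(log_prob_sum, length, alpha=1.0):
    """Score used to rank finished beam hypotheses."""
    return log_prob_sum / length_penalty(length, alpha)

# On raw log-probability the shorter hypothesis wins (-4.0 > -5.5), but with
# alpha=1.0 the longer hypothesis with better per-token quality is preferred:
short = normalized_score(-4.0, length=4)
long = normalized_score(-5.5, length=8)
assert long > short
```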
Sub-word translation...