"What do you mean" RNN-based semantic slot filling (Pytorch implementation)

1. Overview

1.1 Tasks

Spoken Language Understanding (SLU), an emerging field between speech recognition and natural language processing, aims to allow computers to understand users' intentions from their speech. SLU is a critical component of spoken dialog systems. The following figure shows the main flow of a spoken dialogue system.

[Figure: main flow of a spoken dialogue system]

SLU understands the user's utterance mainly through the following three subtasks:

  1. Domain Detection
  2. User Intent Determination
  3. Semantic Slot Filling

For example, if the user inputs "play Jay Chou's Daoxiang", the domain detection module first identifies the domain as "music", the intent detection module then identifies the user's intent as "play_music" (rather than "find_lyrics"), and finally the slot filling module assigns each word to its corresponding slot: "play [O] / Jay Chou [B-singer] / 's [O] / Daoxiang [B-song]".

As can be seen from the above example, domain detection and user intent detection are usually treated as text classification problems, while slot filling is treated as a sequence labeling problem, that is, each word in a continuous sequence is assigned a corresponding semantic category label. The task of this experiment is semantic slot filling on the ATIS dataset. (Full code: https://github.com/llhthinker/slot-filling)

1.2 Dataset

This experiment is based on the ATIS (Airline Travel Information Systems) dataset. As the name suggests, the domain of the ATIS dataset is "airline travel". The ATIS dataset adopts the popular "in/out/begin (IOB)" notation: "B-xxx" means that the word belongs to slot xxx and is the first word of that slot; "I-xxx" means that the word belongs to slot xxx but is not the first word of that slot; "O" means that the word does not belong to any semantic slot. Part of the ATIS training dataset is shown below:

what    O
is  O
the O
arrival B-flight_time
time    I-flight_time
in  O
san B-fromloc.city_name
francisco   I-fromloc.city_name
for O
the O
DIGITDIGITDIGIT B-depart_time.time
am  I-depart_time.time
flight  O
leaving O
washington  B-fromloc.city_name
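A minimal sketch of how such an IOB file could be read into (words, labels) pairs, assuming one whitespace-separated "word label" pair per line and a blank line between sentences (the repository's actual preprocessing may differ):

def load_iob(path):
    """Read an IOB-formatted file into per-sentence (words, labels) pairs."""
    sentences, words, labels = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                   # a blank line ends the current sentence
                if words:
                    sentences.append((words, labels))
                    words, labels = [], []
                continue
            token, tag = line.split()[:2]  # word and its IOB tag
            words.append(token)
            labels.append(tag)
    if words:                              # handle a missing trailing blank line
        sentences.append((words, labels))
    return sentences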

The ATIS dataset has a total of 83 semantic slots, so there are \(83+83+1=167\) label categories for sequence labeling (83 "B-" labels, 83 "I-" labels, and the "O" label). The ATIS dataset is divided into a training set and a test set, with the following statistics:

                                    Training set    Test set
Total number of sentences           4978            893
Total number of words               56590           9198
Average number of words/sentence    11.4            10.3

2. Model

As mentioned above, slot filling is usually treated as a sequence labeling problem. Many machine learning algorithms can solve sequence labeling problems, including generative models such as HMM/CFG and the hidden vector state (HVS) model, and discriminative models such as CRF and SVM. This experiment mainly follows the paper "Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding" and implements semantic slot filling based on RNNs.

RNNs can be divided into simple RNNs and gated RNNs. In a simple RNN, the unit fully receives the output from the previous time step; a gated RNN instead uses learned gates to decide how much of the previous time step's information to take in and how much of the current state to retain. The following introduces three simple RNNs: Elman-RNN, Jordan-RNN, and Hybrid-RNN (a combination of Elman and Jordan), as well as the classic gated RNN: LSTM.

2.1 Elman-RNN

Elman-RNN takes the input at the current time step \(x_t\) and the hidden state from the previous time step \(h_{(t-1)}\) as input, as follows:

\[\begin{split}\begin{array}{ll}h_t = \sigma(W_{ih} x_t + b_{ih} + W_{hh} h_{(t-1)} + b_{hh}) \end{array}\end{split}\]

[Figure: Elman-RNN architecture]

It should be noted that PyTorch's default RNN (nn.RNN) is an Elman-RNN, but it only supports the \(\tanh\) and ReLU activation functions. Following the paper, this experiment uses the sigmoid activation function, so a custom cell is implemented. The PyTorch implementation is as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F


class ElmanRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(ElmanRNNCell, self).__init__()
        self.hidden_size = hidden_size
        self.i2h_fc1 = nn.Linear(input_size, hidden_size)   # W_ih: input -> hidden
        self.i2h_fc2 = nn.Linear(hidden_size, hidden_size)  # W_hh: previous hidden -> hidden
        self.h2o_fc = nn.Linear(hidden_size, hidden_size)   # hidden -> output

    def forward(self, input, hidden):
        # h_t = sigmoid(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)
        hidden = F.sigmoid(self.i2h_fc1(input) + self.i2h_fc2(hidden))
        output = F.sigmoid(self.h2o_fc(hidden))
        return output, hidden
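As a usage illustration (not part of the repository code; `vocab_size` and `sentence` are hypothetical), such a cell can be unrolled manually over one sentence:

# Hypothetical unrolling of ElmanRNNCell over one sentence (batch size 1).
embedding = nn.Embedding(vocab_size, 100)            # 100-dim word vectors, as in Section 3.1
cell = ElmanRNNCell(input_size=100, hidden_size=75)  # 75-dim hidden state, as in Section 3.1
hidden = torch.zeros(1, 75)                          # initial hidden state

outputs = []
for t in range(sentence.size(1)):                    # sentence: LongTensor of word ids, shape (1, seq_len)
    x_t = embedding(sentence[:, t])                  # (1, 100)
    output, hidden = cell(x_t, hidden)
    outputs.append(output)                           # per-time-step features for the slot classifier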

2.2 Jordan-RNN

Jordan-RNN takes the input at the current time step \(x_t\) and the output layer's output from the previous time step \(y_{(t-1)}\) as input, as follows:

\[\begin{split}\begin{array}{ll}h_t = \sigma(W_{ih} x_t + b_{ih} + W_{yh} y_{(t-1)} + b_{yh}) \end{array}\end{split}\]

[Figure: Jordan-RNN architecture]

The PyTorch implementation is as follows, where \(y_0\) is initialized as a trainable parameter:

class JordanRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(JordanRNNCell, self).__init__()
        self.hidden_size = hidden_size
        self.i2h_fc1 = nn.Linear(input_size, hidden_size)   # W_ih: input -> hidden
        self.i2h_fc2 = nn.Linear(hidden_size, hidden_size)  # W_yh: previous output -> hidden
        self.h2o_fc = nn.Linear(hidden_size, hidden_size)   # hidden -> output
        # y_0: trainable initial "previous output" for the first time step
        self.y_0 = nn.Parameter(nn.init.xavier_uniform(torch.Tensor(1, hidden_size)),
                                requires_grad=True)

    def forward(self, input, hidden=None):
        # `hidden` here is the previous output y_{t-1}, not a separate hidden state
        if hidden is None:
            hidden = self.y_0
        hidden = F.sigmoid(self.i2h_fc1(input) + self.i2h_fc2(hidden))
        output = F.sigmoid(self.h2o_fc(hidden))
        # return the output twice so the caller feeds y_t back in as `hidden` at the next step
        return output, output

2.3 Hybrid-RNN

Hybrid-RNN takes the input at the current time step \(x_t\), the hidden state from the previous time step \(h_{(t-1)}\), and the output layer's output from the previous time step \(y_{(t-1)}\) as input, as follows:

\[\begin{split}\begin{array}{ll}h_t = \sigma(W_{ih} x_t + b_{ih} + W_{hh} h_{(t-1)} + b_{hh} + W_{yh} y_{(t-1)} + b_{yh}) \end{array}\end{split}\]

where \(y_0\) is initialized as a trainable parameter. The PyTorch implementation is as follows:

class HybridRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(HybridRNNCell, self).__init__()
        self.hidden_size = hidden_size
        self.i2h_fc1 = nn.Linear(input_size, hidden_size)   # W_ih: input -> hidden
        self.i2h_fc2 = nn.Linear(hidden_size, hidden_size)  # W_hh: previous hidden -> hidden
        self.i2h_fc3 = nn.Linear(hidden_size, hidden_size)  # W_yh: previous output -> hidden
        self.h2o_fc = nn.Linear(hidden_size, hidden_size)   # hidden -> output
        # y_0: trainable initial "previous output" for the first time step
        self.y_0 = nn.Parameter(nn.init.xavier_uniform(torch.Tensor(1, hidden_size)),
                                requires_grad=True)

    def forward(self, input, hidden, output=None):
        if output is None:
            output = self.y_0
        # h_t = sigmoid(W_ih x_t + W_hh h_{t-1} + W_yh y_{t-1})
        hidden = F.sigmoid(self.i2h_fc1(input) + self.i2h_fc2(hidden) + self.i2h_fc3(output))
        output = F.sigmoid(self.h2o_fc(hidden))
        return output, hidden
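For illustration, unrolling this cell requires tracking both the previous hidden state and the previous output; a minimal sketch follows (`embedded` and `seq_len` are hypothetical names, not from the repository):

# Hypothetical unrolling of HybridRNNCell: track both h_{t-1} and y_{t-1}.
cell = HybridRNNCell(input_size=100, hidden_size=75)
hidden = torch.zeros(1, 75)     # initial hidden state h_0
output = None                   # None makes the cell fall back to its trainable y_0

for t in range(seq_len):
    x_t = embedded[:, t, :]                     # (1, 100) embedding of the t-th word
    output, hidden = cell(x_t, hidden, output)  # feed back both previous output and hidden state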

2.4 LSTM

LSTM introduces a memory cell \(c_t\) and three gates: an input gate \(i_t\), a forget gate \(f_t\), and an output gate \(o_t\). First, the current input \(x_t\) and the previous hidden state \(h_{(t-1)}\) are passed through a \(\tanh\) activation to obtain the candidate memory input \(g_t\). Then the forget gate \(f_t\) determines how much of the previous memory cell \(c_{(t-1)}\) is retained, and the input gate \(i_t\) determines how much of the candidate input \(g_t\) is admitted; the two parts are added to obtain the current memory cell \(c_t\). Finally, \(c_t\) is passed through \(\tanh\) and scaled by the output gate \(o_t\) to obtain the current hidden state \(h_t\), as follows:

\[\begin{split}\begin{array}{ll}i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf}) \\g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg}) \\o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho}) \\c_t = f_t c_{(t-1)} + i_t g_t \\h_t = o_t \tanh(c_t)\end{array}\end{split}\]

PyTorch already implements LSTM (nn.LSTM), so it can be called directly. The code snippet for the call is as follows:

self.rnn = nn.LSTM(input_size=embedding_dim,
                   hidden_size=hidden_size,
                   bidirectional=bidirectional,
                   batch_first=True)
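As an illustration of how this layer could produce per-token label scores (a sketch only; `vocab_size`, `num_labels`, `sentence`, and the other names are hypothetical, not necessarily the repository's):

embedding = nn.Embedding(vocab_size, embedding_dim)
num_directions = 2 if bidirectional else 1
classifier = nn.Linear(hidden_size * num_directions, num_labels)  # 167 slot labels for ATIS

embedded = embedding(sentence)      # (batch, seq_len, embedding_dim); sentence: word ids
rnn_out, _ = self.rnn(embedded)     # (batch, seq_len, hidden_size * num_directions)
log_probs = F.log_softmax(classifier(rnn_out), dim=2)   # per-token label distribution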

3. Experiment

3.1 Experimental setup

The experiments are based on Python 3.6 and PyTorch 0.4.0. For a controlled comparison, the following settings are used for all RNN models (a sketch of the corresponding training setup follows the list):

  • All RNN models use a single layer;
  • The word vectors are 100-dimensional, randomly initialized, and updated during training;
  • The hidden state dimension is set to 75;
  • Stochastic gradient descent (SGD) with momentum is used, with batch size 1, learning rate 0.1, and momentum 0.9, all kept fixed;
  • The number of epochs is 10;
  • Each RNN model is implemented in both unidirectional and bidirectional variants and trained separately.
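As a rough sketch of these settings (not the repository's exact training loop; `model` and `training_data` are hypothetical), the optimizer and loss could be set up as follows:

import torch.optim as optim

criterion = nn.NLLLoss()                                  # pairs with log_softmax outputs
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for epoch in range(10):
    for sentence, labels in training_data:                # batch size 1: one sentence at a time
        optimizer.zero_grad()
        log_probs = model(sentence)                       # (seq_len, num_labels) log-probabilities
        loss = criterion(log_probs, labels)               # labels: (seq_len,) gold slot label ids
        loss.backward()
        optimizer.step()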

3.2 Experimental results

When running on the CPU, the \(F_1\) scores on the test set and the average training time per epoch for the different models are as follows:

\(F_1(\%) / T(s)\)   Elman         Jordan        Hybrid        LSTM
Single               87.26 / 438   87.90 / 487   88.46 / 494   92.16 / 3721
Bi-Directional       92.88 / 565   90.31 / 580   91.85 / 613   93.75 / 4357

As can be seen from the above table:

  • Because the gated LSTM has more parameters and more operations, its training time per epoch is roughly 8 times that of the other three simple RNNs, and its \(F_1\) score is also higher than theirs;
  • The \(F_1\) scores of the bidirectional RNNs are generally higher than those of the unidirectional RNNs, at the cost of longer training time.

When running on the same GPU, the \(F_1\) scores on the test set and the average training time per epoch are as follows:

\(F_1(\%) / T(s)\)   Elman          Jordan         Hybrid         LSTM
Single               88.89 / 35.2   88.36 / 41.3   89.65 / 43.5   92.44 / 16.8
Bi-Directional       91.78 / 68.0   89.82 / 72.2   93.61 / 81.6   94.26 / 18.7

As can be seen from the above table, even with stochastic gradient descent (batch_size = 1), the GPU speedup is still quite noticeable. It is worth pointing out that although LSTM has more operations than the other three simple RNNs, it takes the least time. This is likely because the LSTM uses PyTorch's built-in, GPU-optimized implementation, while the other three cells are hand-written and benefit much less from GPU acceleration.
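For reference, the \(F_1\) reported above is the standard CoNLL-style slot-level F1, usually computed with the conlleval script; an alternative sketch in Python, assuming the third-party seqeval package (not used in the original code), would be:

from seqeval.metrics import f1_score

# true_tags / pred_tags: per-sentence lists of IOB labels, e.g.
# [["O", "B-fromloc.city_name", "I-fromloc.city_name"], ...]
print("slot F1: %.2f%%" % (100 * f1_score(true_tags, pred_tags)))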

4. Summary and Outlook

In general, treating slot filling as a sequence labeling problem is an effective approach, and RNNs can model the sequence well and extract relevant contextual features. Bidirectional RNNs outperform unidirectional RNNs, and LSTM outperforms the simple RNNs. Among the simple RNNs, Elman performs no worse (and sometimes better) than Jordan, while taking less time and being simpler to implement, which may be why the simple RNN implementations in mainstream deep learning frameworks (TensorFlow, PyTorch, etc.) are based on Elman. As a mixture of Elman and Jordan, Hybrid takes longer to train than either, and its \(F_1\) score is only slightly higher and not consistently so (on the CPU, bidirectional Elman outperforms bidirectional Hybrid); more experiments are needed to verify this.

As can be seen from the experimental settings, not much hyperparameter tuning was done in this experiment. For better results, more careful tuning could be applied, including:

  • Change the word vector dimension and the hidden state dimension;
  • Consider using pre-trained word vectors and then fix or fine-tune them;
  • Use regularization techniques, including L1/L2, Dropout, Batch Normalization, Layer Normalization, etc.;
  • Try a different optimizer (such as Adam), use mini-batches, and adjust the learning rate (see the sketch after this list);
  • Increase the number of epochs.
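For instance, switching to Adam with padded mini-batches might look like the following sketch (the padding index 0 and the tensor lists are hypothetical assumptions, not the repository's code):

import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence

optimizer = optim.Adam(model.parameters(), lr=1e-3)        # instead of SGD with lr=0.1
criterion = nn.NLLLoss(ignore_index=0)                     # ignore padded positions in the loss

# Pad variable-length sentences into a single (batch, max_len) tensor.
batch_sentences = pad_sequence(list_of_word_id_tensors, batch_first=True, padding_value=0)
batch_labels = pad_sequence(list_of_label_id_tensors, batch_first=True, padding_value=0)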

In addition, part-of-speech tags and named entity information could be incorporated into the input, and the Viterbi algorithm could be used for decoding the output. Different forms of gated RNNs (such as GRU and LSTM variants) and multi-layer RNNs could also be tried, possibly with residual connections.

References

Mesnil G, Dauphin Y, Yao K, et al. Using recurrent neural networks for slot filling in spoken language understanding[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(3): 530-539.

Wikipedia. Recurrent neural network. https://en.wikipedia.org/wiki/Recurrent_neural_network

PyTorch documentation. Recurrent layers. http://pytorch.org/docs/stable/nn.html#recurrent-layers

Hung-yi Lee. Machine Learning (2017, Spring). http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2017/Lecture/RNN.pdf

Yun-Nung (Vivian) Chen. Spring 105 - Intelligent Conversational Bot. https://www.csie.ntu.edu.tw/~yvchen/s105-icb/doc/170321_LU.pdf
