ELMo Explained


Preface

ELMo comes from the paper "Deep contextualized word representations", published by the Allen Institute at NAACL 2018. As the title suggests, it proposes a new kind of word representation. In the authors' own words: ELMo is a deep contextualized word representation model that models (1) the complex characteristics of word use (for example, syntax and semantics) and (2) how these uses vary across contexts (for example, polysemy). The word vectors are derived from the hidden states of a deep bidirectional language model (biLM) pretrained on a large corpus. They can be flexibly and easily added to existing models and significantly improve performance on many NLP tasks, such as question answering, textual entailment, and sentiment analysis. It sounds very exciting, and the principle is very reasonable! The rest of this post analyzes the paper and its PyTorch source code; for links, see the Portal section at the end of the article.

A note up front: the name "ELMo" can refer either to the word vector model or to the word vectors themselves, just as Word2Vec and GloVe carry both meanings. Below, "ELMo model" means the trained model, and "ELMo word vector" means the vectors it produces.

1. ELMo principle

Previously we generally used word embeddings such as Word2Vec and GloVe, but these embeddings are trained in a context-independent way: a given word has the same vector no matter what context it appears in, which is very unfriendly to polysemous words. The paper therefore computes the representation of each word from the specific input sentence that serves as its context, and proposes ELMo (Embeddings from Language Models). Its basic idea, put plainly, is to follow the usual routine of training a language model and then take the hidden-layer outputs of that language model as the representation of a word in its current context. Simple, but very useful!

1. ELMo overall model structure

As for the ELMo model structure, the paper does not actually give a diagram (which is painful for readers with a poor imagination, like the author). By piecing together the clues in the paper with the PyTorch source code, the author arrived at roughly the picture below (excuse the ugly hand drawing):

Suppose the input sentence has dimensions B ∗ W ∗ C, where B is the batch_size; W is num_words, the number of words in a sentence (padding may be needed within a batch); and C is max_characters_per_token, the number of characters in each word, which the paper fixes at 50 rather than setting dynamically per batch. D is projection_dim, the embedding_size with which a word enters the biLMs; it can also be understood as half the dimension of the final ELMo word vector.

From the picture, the input sentence will go through:

  1. Char Encode Layer: First comes a character encoding layer. Because ELMo is actually character-based, it first encodes all the characters in each word to obtain the word's representation. The output of this layer therefore has dimensions B ∗ W ∗ D, the familiar word-level encoding of a sentence.
  2. biLMs: The sentence representation is then modeled by the biLMs, i.e. bidirectional language models. In fact a forward and a backward language model are trained separately and their representations are concatenated. The final output has dimensions (L + 1) ∗ B ∗ W ∗ 2D, where the +1 accounts for the initial embedding layer, a bit like a residual connection; this is described in detail in the "biLMs" section below.
  3. Scalar Mixer: After the per-layer representations of the biLMs are obtained, a mixing layer linearly combines them (described in detail in the "Generate ELMo word vector" section below); the result is the final ELMo vector, with dimensions B ∗ W ∗ 2D. (A shape walk-through with dummy tensors follows this list.)
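To make the shapes concrete, here is a minimal walk-through with dummy tensors. The concrete sizes B = 2, W = 5, C = 50, D = 512, L = 2 are illustrative assumptions, not values fixed by the paper:

import torch

# Illustrative sizes only: batch, words per sentence, chars per word, word dim, biLM layers.
B, W, C, D, L = 2, 5, 50, 512, 2

char_ids = torch.randint(0, 262, (B, W, C))     # input of the Char Encode Layer: B * W * C
word_repr = torch.randn(B, W, D)                # output of the Char Encode Layer: B * W * D
bilm_out = torch.randn(L + 1, B, W, 2 * D)      # output of the biLMs: (L + 1) * B * W * 2D
elmo_vec = torch.randn(B, W, 2 * D)             # output of the Scalar Mixer: B * W * 2D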

That is just a general overview of the ELMo model. Still confused about the structure of each module? No matter, let's analyze them one by one:

2. Character encoding layer

This layer is called the "Char Encode Layer". Its input has dimensions B ∗ W ∗ C and its output B ∗ W ∗ D. After checking the source code, its structure looks like this:

The drawing is a bit messy; please bear with it~

First, the input sentence is reshaped to BW ∗ C, since the processing is applied to all characters. It then passes through the following layers:

  1. Char Embedding: This is an ordinary embedding layer that encodes each character. The vocabulary of all characters has about 262 entries, of which 0-255 are the unicode encodings of characters and the six entries 256-261 are <bow> (beginning of word), <eow> (end of word), <bos> (beginning of sentence), <eos> (end of sentence), <pow> (word padding symbol) and <pos> (sentence padding symbol). The vocabulary is thus quite small and there is no OOV. The embedding parameters here have dimensions 262 (num_characters) ∗ d (char_embed_dim). Note that d and the D of the previous section are two different concepts: d is the embedding dimension of characters, while D is the embedding dimension of words; the mapping between them appears later. The output of this part has dimensions BW ∗ C ∗ d.
  2. Multi-Scale convolutional layer: Next come convolutional layers at different scales. Note that they are arranged in parallel (in width), not stacked in depth: they share the same input and differ only in kernel_size and channel_size. They capture information about different n-grams, essentially imitating the model structure of TextCNN. Suppose there are m such convolutional layers, with kernel_size k1, k2, ..., km (e.g. 1, 2, 3, 4, 5, 6, 7) and channel_size d1, d2, ..., dm (e.g. 32, 64, 128, 256, 512, 1024). Note: these are all 1-dimensional convolutions, i.e. convolution is performed only along the sequence length. As in image processing, the convolution is followed by MaxPooling; the reason is that the sequence lengths produced by the different convolutions are inconsistent and could not otherwise be merged later, so max pooling is applied over the sequence dimension. In effect, the strongest character response in a word is taken as the representation of the whole word. Finally an activation layer is applied and this step is done. Depending on channel_size, the outputs of this step have dimensions BW ∗ d1, BW ∗ d2, ..., BW ∗ dm.
  3. Concat layer: The previous step yields m matrices of different dimensions. To simplify later processing, they are concatenated along the last dimension and then reshaped back to the word level, giving B ∗ W ∗ (d1 + d2 + ... + dm).
  4. Highway layer: Highway networks (see https://arxiv.org/abs/1505.00387) imitate the residual connections used in vision and are common in NLP; see the implementation in the code. The formula for this layer is:

                         y = g ∗ x + (1 − g) ∗ f(A(x)),   g = Sigmoid(B(x))

     It is essentially a fully connected layer plus a residual connection, except that an element-wise gate g is used to mix x and f(A(x)). H such Highway layers are stacked here, and the output dimension remains B ∗ W ∗ (d1 + d2 + ... + dm).
  5. Linear mapping layer: After the previous steps the vector dimension d1 + d2 + ... + dm is often quite large, so one more Linear layer maps it down to D, which is then fed to the subsequent layers as the word embedding. The output here has dimensions B ∗ W ∗ D. (A sketch of the whole pipeline follows this list.)
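Putting the pieces together, here is a minimal sketch of this pipeline based on the description above. This is the author's paraphrase, not the AllenNLP implementation; the character embedding size, filter list, number of highway layers and projection_dim are illustrative assumptions:

import torch
import torch.nn as nn

class CharEncoderSketch(nn.Module):
    """Minimal sketch of the Char Encode Layer: char embedding -> multi-scale
    1-D convolutions + max-pool -> concat -> highway -> linear projection."""
    def __init__(self, n_chars=262, char_dim=16,
                 filters=((1, 32), (2, 32), (3, 64), (4, 128)),
                 n_highway=2, projection_dim=512):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.convs = nn.ModuleList([nn.Conv1d(char_dim, num, kernel_size=width)
                                    for width, num in filters])
        n_filters = sum(num for _, num in filters)
        # Each highway layer produces the transform part and the gate with one Linear.
        self.highways = nn.ModuleList([nn.Linear(n_filters, 2 * n_filters)
                                       for _ in range(n_highway)])
        self.projection = nn.Linear(n_filters, projection_dim)

    def forward(self, char_ids):                      # char_ids: (B, W, C)
        B, W, C = char_ids.shape
        x = self.char_emb(char_ids.view(B * W, C))    # (B*W, C, d)
        x = x.transpose(1, 2)                         # Conv1d expects (B*W, d, C)
        pooled = [torch.relu(conv(x).max(dim=-1).values) for conv in self.convs]
        token = torch.cat(pooled, dim=-1)             # (B*W, d1 + ... + dm)
        for hw in self.highways:
            transform, gate = hw(token).chunk(2, dim=-1)
            gate = torch.sigmoid(gate)
            token = gate * token + (1 - gate) * torch.relu(transform)
        token = self.projection(token)                # (B*W, D)
        return token.view(B, W, -1)                   # (B, W, D)

# e.g. CharEncoderSketch()(torch.randint(0, 262, (2, 5, 50))).shape -> torch.Size([2, 5, 512])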

3. Principle of biLMs

ELMo is mainly based on biLMs (bidirectional language models). Let's first introduce mathematically what biLMs are.

Specifically, given a sequence of N tokens (t_{1}, t_{2}, ..., t_{N}), a forward language model (typically a multi-layer LSTM or similar) computes the probability of the current token given the preceding tokens:

                                       p(t_{1},t_{2},...,t_{N})=\prod_{k=1}^{N}p(t_{k}|t_{1},t_{2},...,t_{k-1})

At each position k, the model outputs a context-dependent representation \overrightarrow{h}_{k,j}^{LM} at every layer, where j = 1, ..., L indexes the layer. The output of the top layer, \overrightarrow{h}_{k,L}^{LM}, is used to predict the next token t_{k+1}.

Similarly, the backward language model is trained in the same way as the forward one, except that the input is reversed, i.e. it computes the probability of the current token given the tokens that follow it:

                                    p(t_{1},t_{2},...,t_{N})=\prod_{k=1}^{N}p(t_{k}|t_{k+1},t_{k+2},...,t_{N})

Similarly, the reverse LM at each position k will also generate a context-sensitive representation at each layer \overleftarrow{h}_{k,j}^{LM}.

The biLMs used by ELMo combine the forward and backward directions; the objective is to maximize the following log-likelihood:

                       \sum_{k=1}^{N}\left(\log p(t_{k}|t_{1},...,t_{k-1};\Theta_{x},\overrightarrow{\Theta}_{LSTM},\Theta_{s})+\log p(t_{k}|t_{k+1},...,t_{N};\Theta_{x},\overleftarrow{\Theta}_{LSTM},\Theta_{s})\right)

Here \Theta_{x} are the token embedding parameters, \Theta_{s} the output-layer parameters (before the Softmax), and \overrightarrow{\Theta}_{LSTM} and \overleftarrow{\Theta}_{LSTM} the parameters of the forward and backward LSTMs; the embedding and Softmax parameters are shared between the two directions.

It can be seen that this is effectively equivalent to training two LMs, one forward and one backward. It seems they can only be trained separately, because a standard LM cannot be trained bidirectionally.
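As a minimal sketch of that idea (a toy word-level biLM with made-up vocabulary size and dimensions, not the actual character-based AllenNLP implementation), the training loss is simply the sum of a forward and a backward LM loss, with the embedding and Softmax parameters shared:

import torch
import torch.nn as nn

class ToyBiLM(nn.Module):
    """Toy biLM: shared embedding and softmax (Theta_x, Theta_s),
    separate forward and backward LSTMs."""
    def __init__(self, vocab_size=10000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)        # Theta_x (shared)
        self.fwd = nn.LSTM(dim, dim, batch_first=True)  # forward LM
        self.bwd = nn.LSTM(dim, dim, batch_first=True)  # backward LM (reads the reversed sequence)
        self.out = nn.Linear(dim, vocab_size)           # Theta_s (shared)

    def loss(self, tokens):                             # tokens: (B, N)
        ce = nn.CrossEntropyLoss()
        x = self.emb(tokens)
        # Forward LM: predict t_k from t_1 .. t_{k-1}
        h_f, _ = self.fwd(x[:, :-1])
        loss_f = ce(self.out(h_f).flatten(0, 1), tokens[:, 1:].flatten())
        # Backward LM: reverse the sequence and predict its "next" token,
        # i.e. predict t_k from t_{k+1} .. t_N in the original order.
        x_rev = x.flip(dims=[1])
        h_b, _ = self.bwd(x_rev[:, :-1])
        loss_b = ce(self.out(h_b).flatten(0, 1), tokens.flip(dims=[1])[:, 1:].flatten())
        return loss_f + loss_b                          # jointly maximizes the log-likelihood above

# e.g. ToyBiLM().loss(torch.randint(0, 10000, (4, 12))) returns a scalar training loss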

In the schematic diagram, it looks like the following multilayer BiLSTM:

biLMs architecture

Here h denotes the internal dimension of the LSTM (cell_size in the code below), which may be quite large, e.g. D = 512 and h = 4096. So within each layer an extra Linear projection is needed to map the dimension from h back down to D before feeding the next layer. The final output stacks the outputs of every layer together with the embedding output. Each layer's output is the concatenation of the forward and backward outputs at each timestep, so the final output has dimensions (L + 1) ∗ B ∗ W ∗ 2D, where the +1 in L + 1 is the embedding layer's output, which is duplicated so that its dimension matches the 2D output of each biLMs layer.
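To illustrate just the stacking at the end, here is a minimal sketch with dummy layer outputs (the sizes are the illustrative ones used earlier, not values from the paper):

import torch

B, W, D, L = 2, 5, 512, 2
embedding = torch.randn(B, W, D)                  # Char Encode Layer output

# The embedding layer is duplicated so that its last dimension matches 2D.
layer_outputs = [torch.cat([embedding, embedding], dim=-1)]

for _ in range(L):
    fwd = torch.randn(B, W, D)                    # forward LSTM output of this layer (after projection to D)
    bwd = torch.randn(B, W, D)                    # backward LSTM output of this layer
    layer_outputs.append(torch.cat([fwd, bwd], dim=-1))

stacked = torch.stack(layer_outputs, dim=0)       # (L + 1, B, W, 2D)
print(stacked.shape)                              # torch.Size([3, 2, 5, 1024])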

4. Generate ELMo word vector

After passing through the biLMs layer, the obtained representation dimension is (L + 1) ∗ B ∗ W ∗ 2D, and then the final ELMo vector needs to be generated!

For each token t_{k}, an L-layer biLM produces 2L + 1 representations, as in the following formula:

                             R_{k}=\{x_{k}^{LM},\overrightarrow{h}_{k,j}^{LM},\overleftarrow{h}_{k,j}^{LM}\;|\;j=1,...,L\}=\{h_{k,j}^{LM}\;|\;j=0,...,L\}

Here h_{k,0}^{LM} is the embedding output of the word, and h_{k,j}^{LM}=\left[\overrightarrow{h}_{k,j}^{LM};\overleftarrow{h}_{k,j}^{LM}\right] is the concatenation of the forward and backward outputs of layer j.

For these representations, the paper uses the following formula to perform a scalar mix over them:

                            ELMo_{k}^{task}=E(R_{k};\Theta^{task})=\gamma^{task}\sum_{j=0}^{L}s_{j}^{task}h_{k,j}^{LM}

Here s_{j}^{task} are weights normalized by a softmax, and the scalar parameter \gamma^{task} scales the whole ELMo vector. Both are learned as parameters and take different values for different tasks.

The paper also mentions that the output distributions of the different layers may differ considerably, so in some cases a Layer Normalization is applied to each layer's output before the linear combination, as in the Transformer.

The vector after the Scalar Mixer has dimensions B ∗ W ∗ 2D; this is the generated ELMo word vector, which can be used for subsequent tasks.

5. Combining with downstream NLP tasks

Generally, the ELMo model is pre-trained on a very large corpus. Since it trains a language model, no labels are needed and plain text suffices, which is exactly why such a large corpus can be used. The advantage is obvious: once the ELMo model is trained, you can feed in a new sentence and obtain, for each word, an ELMo word vector conditioned on the context of that sentence.

The paper mentions that applying appropriate dropout and L2 regularization to ELMo during training improves the results.

At this point, the word vectors can be plugged into downstream NLP tasks such as question answering or sentiment analysis. In terms of where they are attached, they can be concatenated with the downstream task's own input embeddings, or with its output. In terms of whether the model is fixed, the ELMo word vectors can all be extracted in advance (keeping the ELMo model fixed, with no further training), or the ELMo model can be fine-tuned while training the downstream task. In short, it is very convenient to use and can be plugged in wherever you like.
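For reference, with the AllenNLP release this looks roughly like the snippet below. The file paths are placeholders for the pretrained options/weights files linked from the project home page, and the dropout value and number of output representations are choices rather than fixed settings:

import torch
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths: download the pretrained options/weights from the ELMo project page.
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

# num_output_representations=1 gives one scalar-mixed ELMo vector per token;
# pass requires_grad=True if you want to fine-tune the biLM weights.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sentences = [["I", "love", "NLP"], ["ELMo", "is", "context", "sensitive"]]
character_ids = batch_to_ids(sentences)          # (batch, max_words, 50) character ids
out = elmo(character_ids)
elmo_vectors = out["elmo_representations"][0]    # (batch, max_words, 2D)
mask = out["mask"]                               # (batch, max_words)

The returned elmo_vectors can then be concatenated with the downstream task's own embeddings (or its outputs), as described above.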

2. PyTorch implementation

1. Character encoding layer

What is implemented here is the Char Encode Layer mentioned earlier.

First, the implementation of the multi-scale CNN:

# multi-scale CNN

# Network definition
for i, (width, num) in enumerate(filters):
    conv = torch.nn.Conv1d(
            in_channels=char_embed_dim,
            out_channels=num,
            kernel_size=width,
            bias=True
    )
    self.add_module('char_conv_{}'.format(i), conv)

# forward function (excerpt; self._convolutions holds the filter specs saved in __init__,
# and activation is the configured nonlinearity, e.g. ReLU)
def forward(self, character_embedding):
    convs = []
    for i in range(len(self._convolutions)):
        conv = getattr(self, 'char_conv_{}'.format(i))
        convolved = conv(character_embedding)
        # (batch_size * sequence_length, n_filters for this width)
        convolved, _ = torch.max(convolved, dim=-1)
        convolved = activation(convolved)
        convs.append(convolved)
    # (batch_size * sequence_length, n_filters)
    token_embedding = torch.cat(convs, dim=-1)
    return token_embedding
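Note that torch.nn.Conv1d expects input of shape (batch, channels, length), so character_embedding above is assumed to have already been reshaped and transposed to (B*W, char_embed_dim, max_chars) before this loop. A hypothetical shape check (the sizes and filter list are illustrative):

import torch

batch_size, seq_len, max_chars, char_dim = 2, 5, 50, 16
filters = [(1, 32), (2, 32), (3, 64)]                    # (kernel_size, out_channels) pairs

character_embedding = torch.randn(batch_size * seq_len, char_dim, max_chars)
convs = [torch.nn.Conv1d(char_dim, num, kernel_size=width) for width, num in filters]

pooled = [torch.relu(conv(character_embedding).max(dim=-1).values) for conv in convs]
token_embedding = torch.cat(pooled, dim=-1)
print(token_embedding.shape)                             # torch.Size([10, 128]) == (B*W, 32+32+64)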

Then the implementation of the Highway layer:

# HighWay

# Network definition
self._layers = torch.nn.ModuleList([torch.nn.Linear(input_dim, input_dim * 2)
                                    for _ in range(num_layers)])

# forward function (self._activation is the configured nonlinearity, e.g. ReLU)
def forward(self, inputs):
    current_input = inputs
    for layer in self._layers:
        projected_input = layer(current_input)
        linear_part = current_input
        # NOTE: if you modify this, think about whether you should modify the initialization
        # above, too.
        nonlinear_part, gate = projected_input.chunk(2, dim=-1)
        nonlinear_part = self._activation(nonlinear_part)
        gate = torch.sigmoid(gate)
        current_input = gate * linear_part + (1 - gate) * nonlinear_part
    return current_input
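A standalone sketch of a single highway step on a dummy input, matching the formula y = g ∗ x + (1 − g) ∗ f(A(x)) above (the sizes are illustrative; input_dim must equal the concatenated filter size d1 + ... + dm):

import torch

input_dim = 128                                   # = d1 + ... + dm from the concat layer
layer = torch.nn.Linear(input_dim, input_dim * 2) # produces transform and gate together

x = torch.randn(10, input_dim)                    # (B*W, input_dim)
nonlinear_part, gate = layer(x).chunk(2, dim=-1)
gate = torch.sigmoid(gate)
y = gate * x + (1 - gate) * torch.relu(nonlinear_part)
print(y.shape)                                    # torch.Size([10, 128]): same shape in, same shape out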

2. biLMs layer

This part really consists of two LSTMs trained in opposite directions, whose outputs are concatenated directly after a projection. The code is as follows (taking a single direction and a single layer as an example):

# Network definition
# input_size: dimension of the input embedding
# hidden_size: dimension of the input/output hidden state
# cell_size: internal dimension of the LSTM cell.
# Typically input_size = hidden_size = D, and cell_size is the h mentioned above.
self.input_linearity = torch.nn.Linear(input_size, 4 * cell_size, bias=False)
self.state_linearity = torch.nn.Linear(hidden_size, 4 * cell_size, bias=True)
self.state_projection = torch.nn.Linear(cell_size, hidden_size, bias=False)  

# forward function (excerpt: the setup of timestep_input, previous_state / previous_memory,
# current_length_index and output_accumulator is omitted here)
def forward(self, inputs, batch_lengths, initial_state):
    for timestep in range(total_timesteps):

        # Do the projections for all the gates all at once.
        # Both have shape (batch_size, 4 * cell_size)
        projected_input = self.input_linearity(timestep_input)
        projected_state = self.state_linearity(previous_state)

        # Main LSTM equations using relevant chunks of the big linear
        # projections of the hidden state and inputs.
        input_gate = torch.sigmoid(projected_input[:, (0 * self.cell_size):(1 * self.cell_size)] +
                                   projected_state[:, (0 * self.cell_size):(1 * self.cell_size)])
        forget_gate = torch.sigmoid(projected_input[:, (1 * self.cell_size):(2 * self.cell_size)] +
                                    projected_state[:, (1 * self.cell_size):(2 * self.cell_size)])
        memory_init = torch.tanh(projected_input[:, (2 * self.cell_size):(3 * self.cell_size)] +
                                 projected_state[:, (2 * self.cell_size):(3 * self.cell_size)])
        output_gate = torch.sigmoid(projected_input[:, (3 * self.cell_size):(4 * self.cell_size)] +
                                    projected_state[:, (3 * self.cell_size):(4 * self.cell_size)])
        memory = input_gate * memory_init + forget_gate * previous_memory

        # shape (current_length_index, cell_size)
        pre_projection_timestep_output = output_gate * torch.tanh(memory)

        # shape (current_length_index, hidden_size)
        timestep_output = self.state_projection(pre_projection_timestep_output)

        output_accumulator[0:current_length_index + 1, index] = timestep_output

    # Mimic the pytorch API by returning state in the following shape:
    # (num_layers * num_directions, batch_size, ...). As this
    # LSTM cell cannot be stacked, the first dimension here is just 1.
    final_state = (full_batch_previous_state.unsqueeze(0),
                   full_batch_previous_memory.unsqueeze(0))

    return output_accumulator, final_state      

3. Generate ELMo word vector

This part is Scalar Mixer, and its code is as follows:

# Parameter definition
self.scalar_parameters = ParameterList(
        [Parameter(torch.FloatTensor([initial_scalar_parameters[i]]),
                   requires_grad=trainable) for i
         in range(mixture_size)])
self.gamma = Parameter(torch.FloatTensor([1.0]), requires_grad=trainable)

# forward function
def forward(self, tensors, mask=None):

	def _do_layer_norm(tensor, broadcast_mask, num_elements_not_masked):
	    tensor_masked = tensor * broadcast_mask
	    mean = torch.sum(tensor_masked) / num_elements_not_masked
	    variance = torch.sum(((tensor_masked - mean) * broadcast_mask)**2) / num_elements_not_masked
	    return (tensor - mean) / torch.sqrt(variance + 1E-12)
	
	normed_weights = torch.nn.functional.softmax(torch.cat([parameter for parameter
	                                                        in self.scalar_parameters]), dim=0)
	normed_weights = torch.split(normed_weights, split_size_or_sections=1)
	
	if not self.do_layer_norm:
	    pieces = []
	    for weight, tensor in zip(normed_weights, tensors):
	        pieces.append(weight * tensor)
	    return self.gamma * sum(pieces)
	
	else:
	    mask_float = mask.float()
	    broadcast_mask = mask_float.unsqueeze(-1)
	    input_dim = tensors[0].size(-1)
	    num_elements_not_masked = torch.sum(mask_float) * input_dim
	
	    pieces = []
	    for weight, tensor in zip(normed_weights, tensors):
	        pieces.append(weight * _do_layer_norm(tensor,
	                                              broadcast_mask, num_elements_not_masked))
	    return self.gamma * sum(pieces)
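The snippet above appears to be AllenNLP's ScalarMix module; assuming that module and import path, a usage sketch with dummy layer outputs (shapes as in the earlier examples) might look like this:

import torch
from allennlp.modules.scalar_mix import ScalarMix  # assumed import path of the module excerpted above

L, B, W, D = 2, 2, 5, 512
# One tensor per biLM layer plus the (duplicated) embedding layer, each of shape (B, W, 2D).
layer_outputs = [torch.randn(B, W, 2 * D) for _ in range(L + 1)]

mixer = ScalarMix(mixture_size=L + 1)
elmo_vector = mixer(layer_outputs)                 # (B, W, 2D): the final ELMo word vector
print(elmo_vector.shape)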

3. Experiment

Here are some examples of how ELMo performs when combined with actual downstream tasks, namely SQuAD (question answering), SNLI (textual entailment), SRL (semantic role labeling), Coref (coreference resolution), NER (named entity recognition) and SST-5 (sentiment analysis). The results are as follows:

Performance of ELMo combined with downstream NLP tasks

It can be seen that, even starting from fairly low baselines, adding ELMo surpasses the previous SOTA!

4. Some analysis

In the paper, the authors also make some interesting analyses, probing the advantages and characteristics of ELMo from various angles. For example:

1. Which layer of output is used?

The authors explored the effect of using different biLM layers and different L2 regularization weights, as shown in the following table:

"Last Only" here means using only the output of the top biLM layer, and λ is the weight of the L2 regularization. Using all layers is generally better, and a smaller L2 weight is also better, because it lets the representations of the different layers diverge; when the L2 weight is large, the parameters of all layers are pushed toward one another, so the outputs of the layers also tend to become similar.

2. Where should ELMo be added?

As mentioned earlier, the ELMo vector can be added at the input or at the output. The authors compared the two:

For question answering and textual entailment, adding ELMo to both the input and the output is better, while for semantic role labeling adding it only to the input is better. The paper speculates that this is because the first two tasks use attention, and adding ELMo at the output lets the attention see ELMo's output directly, which benefits the task; in semantic role labeling, a task-specific contextual representation matters more than the general-purpose output of the biLMs.

3. What does each layer's output focus on?

Through experiments, the paper concludes that the lower layers of the biLMs focus more on syntactic features such as part of speech, while the higher layers focus more on semantic features. For example, in the following experimental results:


The task on the left is word sense disambiguation and the task on the right is part-of-speech tagging. On word sense disambiguation, using the second layer works better than the first; on part-of-speech tagging, using the first layer instead works better than the second.

In general, it is better to use the outputs of all layers and simply let the model learn the specific weights itself.

4. Efficiency Analysis

Generally speaking, a network that uses a pre-trained model converges faster and can get by with less training data. The paper verifies this experimentally:

For example, on the SRL task, the model with ELMo reaches, using only 1% of the training data, the performance that the model without ELMo needs 10% of the data to reach!

5. Summary

ELMo has the following excellent characteristics:

  1. Context-sensitive: The representation of each word depends on the entire context in which it is used.
  2. Deep: The word representation combines all layers of a deep pre-trained neural network.
  3. Character-based: ELMo representations are purely character-based; after the CharCNN they are used as word representations, which solves the OOV problem, and the input vocabulary is very small.
  4. Rich resources: Complete source code, pre-trained models, parameters, and detailed usage instructions and examples are all available, which makes it another great project that benefits everyone. There is even a multilingual version, apparently done by Harbin Institute of Technology; see the link in the Portal section below.

Portal

Paper: https://arxiv.org/pdf/1802.05365.pdf
Project home page: https://allennlp.org/elmo
Source code: https://github.com/allenai/allennlp  (PyTorch; see the ELMo-related modules in the repository)
https://github.com/allenai/bilm-tf  (TensorFlow)
Multilingual: https://github.com/HIT-SCIR/ELMoForManyLangs  (multilingual ELMo from Harbin Institute of Technology for the CoNLL shared task, including Traditional Chinese)
