Tianchi NLP Competition - News Text Classification (6) - Text Classification Based on Deep Learning 3: BERT


Series
Tianchi NLP Competition - News Text Classification (1) - Understanding the Competition Task
Tianchi NLP Competition - News Text Classification (2) - Data Reading and Data Analysis
Tianchi NLP Competition - News Text Classification (3) - Text Classification Based on Machine Learning
Tianchi NLP Competition - News Text Classification (4) - Text Classification Based on Deep Learning 1: FastText
Tianchi NLP Competition - News Text Classification (5) - Text Classification Based on Deep Learning 2: TextCNN and TextRNN
Tianchi NLP Competition - News Text Classification (6) - Text Classification Based on Deep Learning 3: BERT


6. Text Classification Based on Deep Learning 3: BERT

6.1 Text Representation Methods - Part 4

Transformer principle

The Transformer was proposed in "Attention is All You Need". The encoder part of the model is a stack of encoders (six encoders stacked in sequence in the paper), and the decoder part is a stack of the same number of decoders.

(figure: encoder-decoder stack, omitted)

We focus on the encoder part. The encoders all have exactly the same structure but do not share parameters, and each encoder can be split into two sub-layers. After the input sequence is vectorized, it first flows through a self-attention layer, which helps the encoder look at the other words in the input sequence while encoding each word. The output of self-attention then flows into a feed-forward neural network; the feed-forward network at each input position is independent of the others. Finally, the output is passed on to the next encoder.

(figure: self-attention and feed-forward layers inside an encoder, omitted)

A key property of the Transformer can be seen here: the word at each position flows through its own path in the encoder. In the self-attention layer these paths depend on each other, but the feed-forward layer has no such dependencies, so the paths can be executed in parallel while flowing through the feed-forward network.
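To make the "independent per position" point concrete, here is a minimal PyTorch sketch (the sizes are illustrative, not taken from the post): a position-wise feed-forward network transforms every position with the same weights but without any interaction between positions, which is why this step can run in parallel.

import torch
import torch.nn as nn

# Position-wise feed-forward network: the same two linear layers are applied to
# every position independently, so all positions can be processed in parallel.
d_model, d_ff = 256, 1024                     # illustrative sizes
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 8, d_model)                # (batch, seq_len, hidden) toy input
out = ffn(x)                                  # applied per position; no cross-position interaction
print(out.shape)                              # torch.Size([2, 8, 256])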

Self-attention uses a multi-head mechanism so that different attention heads can focus on different parts of the input.

(figure: multi-head attention visualization, omitted)

When encoding "it", one attention head focuses on "the animal" while another focuses on "tired". In a sense, the model's representation of "it" combines the representations of "animal" and "tired".

For the detailed calculation of self-attention, please refer to Jay Alammar's blog post on the Transformer; it is not expanded here.

In addition, to let the model preserve word order, a positional encoding vector is added to each input embedding. As shown in the figure below, each row corresponds to the positional encoding of one position, so the first row is the vector added to the embedding of the first word in the input sequence. Each row contains 512 values, each between -1 and 1. Because the left half is generated by a sine function and the right half by a cosine function, a clear separation can be observed in the middle.

(figure: positional encoding heatmap, omitted)
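As a small NumPy sketch (not code from the post), the following builds such a position-encoding matrix; it follows the "left half sine, right half cosine" layout of the figure, whereas the original paper interleaves the sine and cosine dimensions.

import numpy as np

def positional_encoding(max_len=50, d_model=512):
    """One row per position; every value lies in [-1, 1]."""
    positions = np.arange(max_len)[:, None]                     # (max_len, 1)
    dims = np.arange(d_model // 2)[None, :]                     # (1, d_model/2)
    angle_rates = 1.0 / np.power(10000.0, 2.0 * dims / d_model)
    angles = positions * angle_rates                            # (max_len, d_model/2)
    # left half: sine, right half: cosine -> the visible split in the middle of the figure
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

pe = positional_encoding()
print(pe.shape, pe.min() >= -1, pe.max() <= 1)                  # (50, 512) True True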

One detail worth mentioning about the encoder is that each sub-layer (self-attention, feed-forward network) has a residual connection around it, followed by layer normalization. Visualizing the vectors and the LayerNorm operation looks like this:
(figure: residual connection and layer normalization, omitted)
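A hedged PyTorch sketch of that "Add & Norm" step (the class name and sizes are my own, not from the post):

import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection followed by layer normalization, wrapped around a sub-layer."""
    def __init__(self, d_model=256, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # x: (batch, seq_len, d_model); sublayer: self-attention or the feed-forward network
        return self.norm(x + self.dropout(sublayer(x)))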

Word representations based on pre-trained language models

Word representations based on a pre-trained language model can model contextual information and thus solve the problem that traditional static word vectors cannot capture polysemy. ELMo, the earliest of these models, is based on two one-way LSTMs: the left-to-right and right-to-left hidden vectors are concatenated to learn contextual word embeddings. GPT uses a Transformer instead of an LSTM as the encoder; it first pre-trains a language model and then fine-tunes the model parameters on downstream tasks. However, because GPT only uses a one-way language model, it has difficulty modeling contextual information. To solve these problems, researchers proposed BERT. The structure of BERT is shown in the figure below: it is a multi-layer Transformer encoder that obtains deep contextual representations through a series of pre-training tasks.

(figure: BERT model structure, omitted)

In the title of the ELMo paper, "Deep" refers to the bidirectional two-layer LSTM, and the more important word is "context". Traditional methods generate a static word vector for each word, so a word's representation is fixed and does not change with context. Because of polysemy, static word vectors have serious drawbacks. Take "bank" as an example: if the training corpus is large enough, all of its senses get mixed into the word vector learned in advance. In a downstream application, even when the context of "bank" in a new sentence contains words such as "money", so that we can basically determine it means a financial institution rather than a riverbank, the static word vector cannot change with the context, so the representation of "bank" still mixes multiple senses. To solve this problem, ELMo first pre-trains a language model and then dynamically adjusts the word embeddings in downstream tasks, so the final word representation can fully express the specific meaning of the word in its context, thereby solving the polysemy problem.

GPT, from OpenAI, is a generative pre-training model. Besides replacing ELMo's LSTM with a Transformer as the feature extractor, GPT also established the new pre-training-plus-fine-tuning paradigm in NLP. Although GPT uses the same two-stage approach as ELMo, in the first stage it does not use ELMo's structure of two one-way two-layer LSTMs spliced together, but instead uses an autoregressive one-way language model.

Google proposed BERT in a paper released in 2018. Like GPT, BERT uses a two-stage model of pre-training and fine-tuning, but in terms of model structure BERT follows the ELMo paradigm and replaces GPT's one-way language model with a bidirectional one. However, the BERT authors considered ELMo's splicing of two one-way language models too crude, so in the first-stage pre-training BERT proposes a masked language model: similar to a cloze test, it predicts a word from its context rather than modeling strictly left-to-right or right-to-left, which allows every layer of the model to freely encode information from both directions. To learn word order, BERT replaces the Transformer's trigonometric positional encoding with learnable position embeddings. To distinguish single-sentence from sentence-pair inputs, BERT also introduces a segment (sentence-type) embedding; the input representation of BERT is shown in the figure. In addition, to learn the relationship between sentences, BERT proposes the next-sentence-prediction task: during training, 50% of the second sentences in sentence pairs are the true following sentence, while the other 50% are randomly sampled from other documents. Ablation experiments showed that this pre-training task contributes noticeably to tasks that judge inter-sentence relationships. Besides the difference in model structure, the amount of unlabeled data BERT uses for pre-training is also much larger than GPT's.

(figure: BERT input representation, omitted)
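To make the masked-language-model objective described above concrete, here is a simplified sketch of the masking step (a toy re-implementation of the 15% / 80-10-10 rule from the BERT paper, not the code from Google's repository):

import random

def create_mlm_predictions(tokens, vocab, mask_prob=0.15, rng=random.Random(42)):
    """Pick ~15% of positions; replace 80% of them with [MASK], 10% with a random
    token, and keep 10% unchanged. The model must recover the original tokens."""
    output = list(tokens)
    masked_positions, masked_labels = [], []
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]") or rng.random() > mask_prob:
            continue
        masked_positions.append(i)
        masked_labels.append(token)         # the label is the original token
        r = rng.random()
        if r < 0.8:
            output[i] = "[MASK]"
        elif r < 0.9:
            output[i] = rng.choice(vocab)   # random replacement
        # else: keep the original token unchanged
    return output, masked_positions, masked_labels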

In the second stage, like GPT, BERT fine-tunes on downstream tasks. As shown in the figure below, unlike GPT, BERT greatly reduces the amount of task-specific modification required: adding a linear classifier on top of BERT is enough for most downstream tasks. Specifically, for sentence-pair (inter-sentence relation) tasks, similar to GPT, a separator is added between the two sentences and start and stop symbols are added at the two ends; at the output, the position corresponding to the start symbol [CLS] in the last BERT layer is fed into a Softmax + Linear classification layer. For single-sentence classification, start and stop symbols are added around the sentence, and the output part is the same as for sentence-pair tasks. For question answering, the answer is a span inside the given paragraph, so the input is constructed from the question and the paragraph in the same way as a sentence-pair task, and the output attaches a classifier that predicts the start and end positions over every token of the paragraph (the second "sentence") in the last BERT layer. Finally, for sequence labeling, the input is the same as for single-sentence classification; the difference is that a classifier is attached to the position of every token in the last BERT layer.

(figure: fine-tuning BERT for downstream tasks, omitted)
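The paragraph above boils down to attaching different linear heads to different positions of BERT's last layer. A minimal PyTorch sketch (the layer names, batch shape, and label count are illustrative assumptions, not the author's code):

import torch
import torch.nn as nn

hidden_size, num_labels = 256, 14                      # illustrative sizes

cls_classifier = nn.Linear(hidden_size, num_labels)    # sentence / sentence-pair classification
token_classifier = nn.Linear(hidden_size, num_labels)  # sequence labeling: one prediction per token
qa_head = nn.Linear(hidden_size, 2)                    # QA: start / end score for every token

sequence_output = torch.randn(4, 128, hidden_size)     # (batch, seq_len, hidden) from the last layer
logits_cls = cls_classifier(sequence_output[:, 0, :])  # [CLS] position        -> (4, 14)
logits_tok = token_classifier(sequence_output)         # every token position  -> (4, 128, 14)
start_end = qa_head(sequence_output)                   # start/end logits      -> (4, 128, 2)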

More importantly, BERT established the new two-stage "pre-training then fine-tuning" paradigm in NLP. In the first stage, a bidirectional language model is pre-trained on a large amount of unlabeled text; it is worth noting that using the Transformer as the feature extractor is ahead of traditional RNNs or CNNs in terms of parallelism and long-distance dependencies, and pre-training distills the lexical, syntactic, and semantic knowledge in the training data into the network parameters. In the second stage, data from the downstream task is used to fine-tune the parameters of different BERT layers, or BERT is used as a feature extractor to generate BERT embeddings that are fed to the downstream task as additional features. Although this two-stage paradigm comes from computer vision, it had not been exploited well in natural language processing before. As a culmination of recent NLP breakthroughs, BERT's biggest highlight is not only its strong performance but also that almost any NLP task can be adapted to BERT with little modification, so the linguistic knowledge learned during pre-training can be brought into downstream tasks to further improve model performance.

6.2 BERT-Based Text Classification

Bert Pretrain

The pre-training process uses the TensorFlow-based BERT source code released by Google. First, training data are created from the raw text. Since the competition data are all anonymized character IDs, the vocabulary is rebuilt here and a whitespace-based tokenizer is used.

# load_vocab, whitespace_tokenize and convert_by_vocab are helpers from BERT's tokenization.py
class WhitespaceTokenizer(object):
    """WhitespaceTokenizer with vocab."""
    def __init__(self, vocab_file):
        self.vocab = load_vocab(vocab_file)
        self.inv_vocab = {v: k for k, v in self.vocab.items()}

    def tokenize(self, text):
        split_tokens = whitespace_tokenize(text)
        output_tokens = []
        for token in split_tokens:
            if token in self.vocab:
                output_tokens.append(token)
            else:
                output_tokens.append("[UNK]")
        return output_tokens

    def convert_tokens_to_ids(self, tokens):
        return convert_by_vocab(self.vocab, tokens)

    def convert_ids_to_tokens(self, ids):
        return convert_by_vocab(self.inv_vocab, ids)
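
A hedged usage example (the vocabulary path and the sample text are placeholders; the competition text is already a string of space-separated IDs):

tokenizer = WhitespaceTokenizer("vocab.txt")          # hypothetical vocab file
tokens = tokenizer.tokenize("2967 6758 339 2021")     # tokens outside the vocab map to [UNK]
input_ids = tokenizer.convert_tokens_to_ids(tokens)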

Because the NSP pre-training task is removed, each document is split into multiple segments with a maximum length of 256; if the last segment is shorter than 256/2, it is discarded. Each segment is masked following the masked-language-model procedure in the original BERT paper and then written out in tfrecord format.

def create_segments_from_document(document, max_segment_length):
    """Split single document to segments according to max_segment_length."""
    assert len(document) == 1
    document = document[0]
    document_len = len(document)

    index = list(range(0, document_len, max_segment_length))
    other_len = document_len % max_segment_length
    if other_len > max_segment_length / 2:
        index.append(document_len)

    segments = []
    for i in range(len(index) - 1):
        segment = document[index[i]: index[i+1]]
        segments.append(segment)

    return segments
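
A quick check of the splitting rule on toy inputs (my own example, not from the post): a 600-token document with max_segment_length=256 yields two full segments and drops the 88-token remainder (shorter than 128), while a 700-token document keeps its 188-token remainder as a final segment.

doc = [[str(i) for i in range(600)]]
print([len(s) for s in create_segments_from_document(doc, 256)])   # [256, 256]

doc = [[str(i) for i in range(700)]]
print([len(s) for s in create_segments_from_document(doc, 256)])   # [256, 256, 188]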

During pre-training, only the masked language model task is executed, so the loss of the next-sentence-prediction task is not calculated.

(masked_lm_loss, masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output(
    bert_config, model.get_sequence_output(), model.get_embedding_table(),
    masked_lm_positions, masked_lm_ids, masked_lm_weights)

total_loss = masked_lm_loss

To fit the segment length and reduce training time, we adopted a BERT-mini model with the following configuration.

{
  "hidden_size": 256,
  "hidden_act": "gelu",
  "initializer_range": 0.02,
  "vocab_size": 5981,
  "hidden_dropout_prob": 0.1,
  "num_attention_heads": 4,
  "type_vocab_size": 2,
  "max_position_embeddings": 256,
  "num_hidden_layers": 4,
  "intermediate_size": 1024,
  "attention_probs_dropout_prob": 0.1
}

Since our overall framework uses PyTorch, we need to convert the final TensorFlow checkpoint into PyTorch weights.

import torch
# BertConfig, BertForPreTraining and load_tf_weights_in_bert ship with the Hugging Face
# transformers package (older code imported them from pytorch_pretrained_bert).
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, bert_config_file, pytorch_dump_path):
    # Initialise PyTorch model
    config = BertConfig.from_json_file(bert_config_file)
    print("Building PyTorch model from configuration: {}".format(str(config)))
    model = BertForPreTraining(config)

    # Load weights from tf checkpoint
    load_tf_weights_in_bert(model, config, tf_checkpoint_path)

    # Save pytorch-model
    print("Save PyTorch model to {}".format(pytorch_dump_path))
    torch.save(model.state_dict(), pytorch_dump_path)
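
A hedged call example with hypothetical paths (adjust them to wherever the TensorFlow checkpoint, config, and output file actually live):

convert_tf_checkpoint_to_pytorch(
    tf_checkpoint_path="./bert-mini/model.ckpt",        # hypothetical checkpoint prefix
    bert_config_file="./bert-mini/bert_config.json",
    pytorch_dump_path="./bert-mini/pytorch_model.bin",
)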

Pre-training consumes a lot of resources. If your hardware does not allow it, it is recommended to download an open-source pre-trained model directly.
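If you go that route, here is a minimal sketch of loading a released checkpoint with the Hugging Face transformers library (the directory layout and path are assumptions, not from the post):

from transformers import BertConfig, BertModel

# The directory is assumed to contain vocab.txt and pytorch_model.bin for the released model.
config = BertConfig.from_json_file("./bert-mini/bert_config.json")   # hypothetical path
model = BertModel.from_pretrained("./bert-mini/", config=config)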

Bert Finetune

(figure omitted)

For fine-tuning, the hidden vector of the first token of the last layer, [CLS], is taken as the representation of the sentence and then fed into a softmax layer for classification.

# sequence_output: (sen_num, seq_len, 256); pooled_output: the [CLS] vector after BERT's pooling layer
sequence_output, pooled_output = \
    self.bert(input_ids=input_ids, token_type_ids=token_type_ids)

if self.pooled:
    reps = pooled_output
else:
    reps = sequence_output[:, 0, :]  # sen_num x 256

if self.training:
    reps = self.dropout(reps)
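
The snippet stops at the sentence representation reps; below is a hedged sketch of the remaining classification step (the class name and the 14-way output are my own illustration, not the author's code):

import torch.nn as nn
import torch.nn.functional as F

class ClassifierHead(nn.Module):
    """Linear layer + softmax (via cross-entropy) over the news categories."""
    def __init__(self, hidden_size=256, num_labels=14, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, reps, labels=None):
        logits = self.classifier(self.dropout(reps))      # sen_num x num_labels
        if labels is None:
            return logits
        return logits, F.cross_entropy(logits, labels)    # softmax + NLL in one call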

Original post: blog.csdn.net/bosszhao20190517/article/details/107804216