Natural Language Processing: BERT

1. Introduction

         Hush! BERT is here, the model that refreshed the records on 11 NLP tasks at once. In essence, BERT is a pre-training model: it learns the relationship between a word and its context, as well as the relationship between sentences. For different downstream tasks, it can be transferred in different ways. In terms of structure, it is simply a stack of Transformer encoder layers (BERT stands for Bidirectional Encoder Representations from Transformers). How did such an unpretentious model stir up waves and turn the quiet NLP world upside down? Let's take a look at its fascinating points together. If you don't know the Transformer yet, it is strongly recommended to learn the Transformer first.

2. BERT theory

      1. Network structure

              Let's first look at the overall workflow of BERT. As the figure shows, the main structure used in pre-training and in fine-tuning is the same. Each input is a sentence pair made up of two sentences, in which some words are randomly masked; [CLS] is used as the start token and [SEP] as the separator between sentences. These tokens then pass through the embedding layer shown in Figure 2, which combines three kinds of embedding information: word embeddings, sentence (segment) embeddings, and position embeddings. On top of that sit multiple Transformer encoder layers. The final outputs are the predictions for the masked words and a prediction of whether sentence B is the next sentence of sentence A.
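              Below is a minimal sketch of how such a sentence pair is laid out before the embedding layer; the tokens and ids are made up for illustration and are not the output of the real WordPiece tokenizer.

tokens      = ["[CLS]", "my", "dog", "is", "[MASK]", "[SEP]", "he", "likes", "to", "play", "[SEP]"]
segment_ids = [0,       0,    0,     0,    0,        0,       1,    1,       1,    1,      1]
# segment_ids (token_type_ids) mark which sentence each token belongs to: 0 = sentence A, 1 = sentence B.
# After the tokens are converted to vocabulary ids, the sequence goes through the three
# embeddings (word + segment + position) and then through the stack of encoder layers.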


   2. Multi-task learning 

           From the introduction above, we know that BERT is trained as a multi-task model.

              1. Masked Language Model: first, a portion of the words in each sentence is randomly masked, which is somewhat similar to the mask used in the decoder part of the Transformer. In actual training, 15% of the tokens in each sentence are selected for prediction; of these, 80% are replaced with the [MASK] token, 10% are replaced with a random word, and 10% are left unchanged (a small sketch of this rule follows this list).

             Why use a mask? As we said earlier, BERT is essentially a stack of the encoders from the Transformer. The self-attention in each encoder layer computes pairwise interactions between words, so every output position contains information from all words, which means the information to be predicted would already be exposed. The encoder is essentially a process of encoding words, with input and output of the same length, and it was not designed for prediction tasks. So, to give the model a predictive function while avoiding this information leakage, the encoder has to be modified: borrowing from language models that predict a middle word from its surrounding words, and from the mask mechanism in the decoder, we randomly mask out some words and train the model to predict them.
             2. Next Sentence Prediction: at the same time, we want the model to understand the relationship between sentences, for example, given two sentences, to judge whether they form a context pair. For this, the input is designed as a sentence pair, and the subsequent embedding layer makes each word carry the information of the sentence it belongs to.
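             Here is a minimal sketch of the 80% / 10% / 10% masking rule described in point 1. It only illustrates the idea and is not the repository's actual implementation; the vocabulary and sentence below are made up.

import random

def corrupt_token(token, vocab):
    """Apply the masking rule to one token that was selected for prediction."""
    r = random.random()
    if r < 0.8:
        return "[MASK]"              # 80%: replace with the [MASK] token
    elif r < 0.9:
        return random.choice(vocab)  # 10%: replace with a random word
    else:
        return token                 # 10%: keep the original word unchanged

vocab = ["cat", "dog", "apple", "runs", "blue"]
sentence = ["my", "dog", "is", "very", "cute", "today"]
# choose roughly 15% of the positions as prediction targets (at least one)
num_to_mask = max(1, int(round(0.15 * len(sentence))))
target_positions = random.sample(range(len(sentence)), num_to_mask)
corrupted = [corrupt_token(t, vocab) if i in target_positions else t
             for i, t in enumerate(sentence)]
print(corrupted)  # the model is trained to recover the original words at target_positions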

            In this way, we transform the input and the output so that the pre-trained model has multiple capabilities: simple fine-tuning can then handle a variety of tasks, such as text classification, sequence labeling (word segmentation, entity recognition, part-of-speech tagging), and judging the relationship between sentences (QA, natural language inference), etc.

    3. Fine-tuning

            The original intention of the BERT model is to be fine-tuned to fit a variety of downstream tasks, and its design makes fine-tuning very convenient. There are a few common approaches (a rough sketch follows this list):

           1. Use the existing model as a feature extractor: remove its output layer and attach the output layer required by your task. Depending on the situation, you do not have to take only the features from the last layer; you can use whichever layer you need.

           2. Continue training on top of the existing model so that the new model fits your task better.

           3. Sometimes, depending on the situation, we don't need to fine-tune all the layers; we fine-tune only some of them and attach our own output layer.
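           A minimal TensorFlow 1.x sketch of approaches 1 and 3, using a stand-in encoder output tensor; the tensor name, scope names, and sizes below are illustrative assumptions and are not taken from the BERT repository.

import tensorflow as tf

# Stand-in for the encoder's pooled output (e.g. the [CLS] vector); illustrative only.
pooled_output = tf.placeholder(tf.float32, [None, 768], name="pooled_output")
labels = tf.placeholder(tf.int32, [None])
num_labels = 2

# Approach 1: feature extractor - stop gradients so the encoder stays frozen,
# and train only the newly attached output layer.
features = tf.stop_gradient(pooled_output)
logits = tf.layers.dense(features, num_labels, name="task_output")
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Approach 3: partial fine-tuning - pass only the variables you want to update
# (e.g. the top encoder layer plus the new head) to the optimizer.
trainable = [v for v in tf.trainable_variables()
             if v.name.startswith("bert/encoder/layer_11") or v.name.startswith("task_output")]
train_op = tf.train.AdamOptimizer(1e-5).minimize(loss, var_list=trainable)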

   4. Theoretical summary

             Above we introduced the theory of BERT: a model that stacks the encoders of the Transformer. To make predictions while preventing information leakage, it adopts the mask mechanism; to extract higher-level semantic information and give the model the ability to judge sentence relationships, it adds the next-sentence mechanism. There are no fancy tricks to speak of, yet its results have truly refreshed people's expectations. Perhaps this is what is meant by "the great Dao is simplicity". Respect!

3. Source code

         This article uses the TensorFlow version of the BERT source code for learning; the original GitHub link is here! Running BERT has fairly high hardware requirements, and I am still relatively poor, so I simply tried to fine-tune BERT on my own project, and it naturally failed. Of course, to be fair to myself, I tried various methods and still failed! At that point I felt this was what fate intended: the revolution has not yet succeeded, but this comrade may still give up. No matter how persistent I was, it would only be a waste of time, so I gave up decisively.

         But I still read the code seriously; I have to do my best! I divide the notes into two parts: one is my attempt to write comments for the source code, and the other is the parts worth learning from. The most important thing is to stay relaxed and mindful. Now, let's start our journey through the source code!

1. Thoughts on the training process

         In fact, the training process here is the same as in traditional TensorFlow, which is nothing more than these steps: data preprocessing, model building, training, and validation, all revolving around one core model. Imagine the model as a child: the data is its food, the model structure is the child itself, training is how it learns and grows, and validation is the answer sheet life hands it. It keeps growing by iterating this process. Under-fitting means the child lacks confidence, and over-fitting means it is too arrogant. The ultimate goal is to grow up and contribute to society, which is model deployment. Just as no one is perfect, we don't require the model to find the global optimum; the guiding principle is simply that the model is usable.

        Different data means that some children are born into rich families and some into poor ones. The data determines the upper limit of the model, just as the circumstances of one's birth determine, to some extent, the range of heights a life can reach. Of course, studying well may still lead to a better outcome. Different learning rates mean children have different personalities: some are enthusiastic but impulsive, others cautious but timid.

        Of course, in the end everyone's fate is completely different. One as great as BERT is admired by thousands and is the star of its era; a model like mine, trained on 4 GB of video memory with a playful mindset, is doomed from the start. This is our model, and this is part of us.

2. Parameter definition 

        Some parameters need to be specified when BERT runs, including mandatory items such as data_dir (data), bert_config_file (configuration file), task_name (task name), vocab_file (vocabulary), and init_checkpoint (pre-trained model), plus some optional items. The tf.flags module is used to read command-line parameters; tf.app.run() then finds the defined main function, and the required parameters are validated before it runs!

# Use tf.flags to define command-line parameters, then read them back through FLAGS
import tensorflow as tf

def main(_):
    print('she is a very beautiful girl, her name is', FLAGS.name)

flags = tf.flags
FLAGS = flags.FLAGS
flags.DEFINE_string('name', None, 'who is the one you love')
flags.mark_flag_as_required('name')
tf.app.run()
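
Assuming the snippet above is saved as demo.py (an illustrative file name, not from the repository), it can be run from the command line like this:

# python demo.py --name=Alice    -> prints the message with the name "Alice"
# python demo.py                 -> exits with an error, because --name was marked as required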

3. Doubts about training steps

  I have never been very clear about the difference between epoch and step, so this is a good opportunity to explain it.

     batch_size: the number of samples per batch

     steps: the number of training steps; each parameter update corresponds to one step (one step down the gradient)

     epochs: the number of times the entire dataset is traversed

    In addition, a warmup phase (a number of warm-up steps) is usually added: a small learning rate is used at the start of training to keep the model stable, which can also prevent over-fitting to a certain extent.

  if FLAGS.do_train:
    # Get the training data
    train_examples = processor.get_train_examples(FLAGS.data_dir)
    # Number of training steps = total samples * epochs / batch size
    num_train_steps = int(
        len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
    # Compute the number of warmup steps from the warmup proportion
    num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
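
As a quick sanity check of these formulas, with made-up numbers (not the repository's defaults):

# Illustrative numbers only: 10,000 training examples, train_batch_size = 32, num_train_epochs = 3
#   num_train_steps  = int(10000 / 32 * 3) = 937
# With warmup_proportion = 0.1:
#   num_warmup_steps = int(937 * 0.1)      = 93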

4. Embedding layer

      First, the embedding for each input word is looked up; then the position-embedding table is initialized, and the token_type embedding is computed according to the sentence each word belongs to. Adding the three together gives the final embedding output, corresponding to the three embedding layers in the paper. The only difference is that the position embedding does not use the sine/cosine functions; it is learned by the network.
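      A minimal sketch of adding the three embeddings, assuming illustrative sizes; this is a simplification of the idea, not a copy of the repository's embedding code (the variable names and shapes here are assumptions).

import tensorflow as tf

vocab_size, type_vocab_size, max_len, hidden = 30522, 2, 128, 768  # illustrative sizes
input_ids = tf.placeholder(tf.int32, [None, max_len])
token_type_ids = tf.placeholder(tf.int32, [None, max_len])

word_table = tf.get_variable("word_embeddings", [vocab_size, hidden])
type_table = tf.get_variable("token_type_embeddings", [type_vocab_size, hidden])
pos_table = tf.get_variable("position_embeddings", [max_len, hidden])  # learned, not sin/cos

word_emb = tf.nn.embedding_lookup(word_table, input_ids)       # [batch, max_len, hidden]
type_emb = tf.nn.embedding_lookup(type_table, token_type_ids)  # [batch, max_len, hidden]
pos_emb = tf.expand_dims(pos_table, 0)                         # [1, max_len, hidden], broadcast over batch
embedding_output = word_emb + type_emb + pos_emb               # the sum feeds the encoder stack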

5. Implementation of the mask

      In attention_mask, normal words are 1 and masked positions are 0. After the calculation below, the resulting adder is 0 for normal words and -10000 for masked positions. The adder is then added to the raw attention scores, so the masked positions end up with very large negative values and are effectively ignored by the softmax. That is how the mask works.

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    attention_scores += adder

6. Variable context management

        tf.variable_scope can be used as a context manager for variables, with tf.get_variable() defining variables inside it. In other words, a variable belongs to its scope; when there are many variables and names are repeated, scopes can be used to distinguish variables in different contexts: the variable names are the same, but because the contexts differ, they are different variables.

        At the same time, reuse=True can be used to re-enter a scope and share its variables.

import tensorflow as tf

# Two scopes with the same name: the second one sets reuse=True, so tf.get_variable
# returns the existing variable instead of creating a new one.
with tf.variable_scope("lover"):
    v = tf.get_variable("V", [1])
with tf.variable_scope("lover", reuse=True):
    v1 = tf.get_variable("V", [1])

if v is v1:
    print('They are the same variable')
else:
    print('They are not the same variable')

4. Nonsense

             Love is the source of all power. I used to think that the soul was imprisoned by the body, that it had once been so free and unrestrained. Then one day I suddenly realized that no matter how free the world of a soul is, it is a lonely one: it has its own time and space, but no way to communicate and merge with other worlds. Only through this body can the love in this world flow out and the love of other worlds flow in. Our soul is like water; it needs warmth, and love is the temperature of the soul. Water without warmth stops flowing and knows no happiness. Of course it does not die; it only freezes for a while, and if warm sunlight reaches it again one day, it can still melt. The capacity to love also comes in different sizes and many forms. The great earth has its love, the sky and the white clouds have theirs, and the sun, moon and stars have their own endless love. I don't know where their eyes and hearts are; after all, I am just a small person living on the surface. But I believe the world they live in has richer colors and more beautiful music, and that big world must be livelier. Maybe when my short life is over, I can join them too. So I am not afraid of facing darkness after this life ends. The light people can see is only a very small part; imagine one day being able to see all of it. Of course there would surely be different eyes then. For now, I love this life. Let time pass; there is nothing to fear!


Origin blog.csdn.net/gaobing1993/article/details/108711664