I heard you haven't read the BERT source code?

Preface

In an interview a few days ago, the interviewer pulled up BERT's source code and asked me to analyze it on the spot. Emm, that is pretty hardcore. Fortunately, Lao Song was fine: I had studied the Transformer implementation before and had written a text classification task with a Transformer, so it didn't stump me, hahaha.

However, it seems that interviewers these days are no longer satisfied with just asking about principles; they also want to gauge a candidate's coding ability by how well they can read source code.

So Lao Song figures you really should take a look at the BERT source code. I spent an afternoon going through the code piece by piece and commenting on the key points. You can take a look at my repository: BERT-pytorch (https://github.com/songyingxin/BERT-pytorch).

1. Overall description

When BERT-pytorch is packaged for distribution, it registers two main command-line entry points (see the setup.py sketch after this list):

  • bert-vocab: collects word frequencies, token2idx, idx2token, and other vocabulary information; corresponds to the build function in bert_pytorch.dataset.vocab.

  • bert: corresponds to the train function in bert_pytorch.__main__.
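
For context, command-line entry points like these are declared in setup.py via console_scripts. Here is a minimal sketch of what such a declaration looks like, based only on the mapping described above; the repository's actual setup.py may differ in its other fields.

```python
# setup.py (illustrative sketch; check the repository for the real file)
from setuptools import setup, find_packages

setup(
    name="bert_pytorch",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            # `bert-vocab` runs the vocabulary-building function
            "bert-vocab = bert_pytorch.dataset.vocab:build",
            # `bert` runs the training entry point
            "bert = bert_pytorch.__main__:train",
        ]
    },
)
```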

To make debugging easier, I created two separate files so that each of these two functions can be debugged on its own.

1. bert-vocab

python3 -m ipdb test_bert_vocab.py # debug bert-vocab

In fact, there is nothing particularly deep inside bert-vocab; it is just some common preprocessing steps from natural language processing. Ten minutes of stepping through it in the debugger is enough to understand it, and I have added a few comments that make it easy to follow.

The internal inheritance relationship is:

TorchVocab --> Vocab --> WordVocab
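
To make "word frequency, token2idx, idx2token" concrete, here is a minimal, self-contained sketch of vocabulary building in the same spirit. The name SimpleVocab and the special-token list are made up for illustration and do not claim to match the repository's classes.

```python
from collections import Counter

class SimpleVocab:
    """Toy stand-in for the TorchVocab / Vocab / WordVocab hierarchy."""
    SPECIALS = ["<pad>", "<unk>", "<eos>", "<sos>", "<mask>"]

    def __init__(self, texts, min_freq=1):
        # word frequency over whitespace-tokenized lines
        counter = Counter(tok for line in texts for tok in line.split())
        # special tokens first, then remaining tokens sorted by frequency
        tokens = self.SPECIALS + [
            t for t, c in counter.most_common()
            if c >= min_freq and t not in self.SPECIALS
        ]
        self.idx2token = tokens                                # idx -> token
        self.token2idx = {t: i for i, t in enumerate(tokens)}  # token -> idx
        self.freqs = counter                                   # word frequencies

    def to_indices(self, line):
        unk = self.token2idx["<unk>"]
        return [self.token2idx.get(tok, unk) for tok in line.split()]

# usage
vocab = SimpleVocab(["hello world", "hello bert"])
print(vocab.token2idx["hello"], vocab.to_indices("hello unseen"))
```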

2. Model architecture

  • Debug command:

python3 -m ipdb test_bert.py -c data/corpus.small -v data/vocab.small -o output/bert.model

[Figure: the overall pre-training model, with the BERT backbone feeding the MaskedLanguageModel and NextSentencePrediction heads]

Looking at the model as a whole, it is divided into two parts: MaskedLanguageModel and NextSentencePrediction. Both use BERT as the backbone, and each adds a fully connected layer plus a softmax layer on top to produce its output.

This code is relatively simple and easy to understand, so I will not dwell on it.
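
For readers who still want a concrete picture, here is a minimal sketch of the two pre-training heads in PyTorch. It mirrors the structure described above (BERT output, then a linear layer, then a softmax) rather than reproducing the repository's exact code; hidden and vocab_size are constructor parameters chosen here for illustration.

```python
import torch.nn as nn

class NextSentencePrediction(nn.Module):
    """Binary classifier on the first token's hidden state: is_next vs not_next."""
    def __init__(self, hidden):
        super().__init__()
        self.linear = nn.Linear(hidden, 2)
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):                              # x: (batch, seq_len, hidden)
        return self.softmax(self.linear(x[:, 0]))      # use only the first position

class MaskedLanguageModel(nn.Module):
    """Predicts a token id at every (masked) position of the sequence."""
    def __init__(self, hidden, vocab_size):
        super().__init__()
        self.linear = nn.Linear(hidden, vocab_size)
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):                              # x: (batch, seq_len, hidden)
        return self.softmax(self.linear(x))
```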

1. BERT Model

[Figure: the BERT model itself, i.e. BERTEmbedding feeding a stack of Transformer encoder blocks]

This part is essentially the Transformer encoder plus the BERT embedding layer. If you are not familiar with the Transformer, this is a good place to deepen your understanding of it.

For this part of the source code, I suggest first skimming the whole thing to build a general framework and understand the dependencies between the classes, and then working from the details back up to the whole. In terms of the figure above, that means reading from right to left, which works much better.
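
As a reading aid, here is a compressed sketch of that structure. The class name BERTBackboneSketch and the dimensions are made up, and PyTorch's built-in nn.TransformerEncoderLayer stands in for the repository's hand-written TransformerBlock, so treat it as a map of the shape of the code rather than the code itself.

```python
import torch
import torch.nn as nn

class BERTBackboneSketch(nn.Module):
    """Illustrative only: BERT embeddings + a stack of Transformer encoder blocks."""
    def __init__(self, vocab_size, hidden=256, n_layers=4, attn_heads=4, dropout=0.1):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden, padding_idx=0)
        self.segment = nn.Embedding(3, hidden, padding_idx=0)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=attn_heads, dim_feedforward=hidden * 4,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, tokens, segments):
        pad_mask = tokens == 0                    # padding id assumed to be 0
        x = self.token(tokens) + self.segment(segments)  # positional part omitted here
        return self.encoder(x, src_key_padding_mask=pad_mask)

# usage: a batch of 2 sequences of length 8
tokens = torch.randint(1, 100, (2, 8))
segments = torch.ones(2, 8, dtype=torch.long)
out = BERTBackboneSketch(vocab_size=100)(tokens, segments)   # (2, 8, 256)
```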

1. BERTEmbedding

It is divided into three parts:

  • TokenEmbedding: encodes the tokens; inherits from nn.Embedding, with the default initialization N(0, 1).

  • SegmentEmbedding: encodes which of the two sentences a token belongs to; inherits from nn.Embedding, with the default initialization N(0, 1).

  • PositionalEmbedding: encodes positional information as described in the paper; it generates a fixed vector representation and does not participate in training.

What deserves attention here is the positional embedding, because some interviewers dig hard into the details. For things like this that I did not think were helpful to me, I generally let them go after a quick look at the details, and that has proven to be a disadvantage.
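
Since this detail does come up in interviews, here is a minimal sketch of the fixed sinusoidal positional encoding from the "Attention Is All You Need" paper: sin on even dimensions, cos on odd ones, precomputed once, registered as a buffer, and never trained. It is a generic implementation (assuming an even d_model), not a copy of the repository's annotated version.

```python
import math
import torch
import torch.nn as nn

class PositionalEmbedding(nn.Module):
    """Fixed sinusoidal position encodings (not learned); assumes even d_model."""
    def __init__(self, d_model, max_len=512):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # 1 / 10000^(2i / d_model), computed in log space for numerical stability
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))    # buffer: saved, but no gradients

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        return self.pe[:, :x.size(1)]
```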

2. Transformer

I highly recommend reading this part alongside the paper; of course, you can skip it if you are already familiar with it. I have added notes at the key points, and if something is still unclear you can open an issue. I will not go into the details here.
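
As a pointer for that reading, the core of every Transformer block is scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. Below is a generic sketch of that formula, not the repository's annotated version.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, d_k); mask broadcastable, 0 where attention is blocked."""
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1)       # attention weights over the key positions
    return torch.matmul(attn, v), attn
```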

Finally

Personally, I think the code written by Google is really elegant and the structure is very clear; it takes only a few hours to understand the whole thing. I recommend using my debugging approach to step through it from beginning to end, which makes everything much clearer.

If you found this useful, give it a like.


Original link: https://zhuanlan.zhihu.com/p/76936436


