Preface
In an interview a few days ago, the interviewer handed me BERT's source code and asked me to analyze it on the spot, which was a bit intense. Fortunately, Lao Song came through fine: I had read the Transformer implementation before and had written a text classification task with a Transformer, so it didn't stump me, hahaha.
It seems that nowadays interviewers are no longer satisfied with asking about principles alone; they also gauge a candidate's coding ability by how well he can read source code.
So, since you really do want to read BERT's source code, Lao Song spent an afternoon going through the code piece by piece and annotating the key points. Take a look at my repository: BERT-pytorch (https://github.com/songyingxin/BERT-pytorch)
1. Overall description
When BERT-Pytorch is packaged, it exposes two main entry points:
bert-vocab: computes word frequencies, token2idx, idx2token, and related information; it corresponds to the build function in bert_pytorch.dataset.vocab.
bert: corresponds to the train function in bert_pytorch.__main__.
To make debugging easier, I created two new files so I could debug these two functions separately.
1. bert-vocab
python3 -m ipdb test_bert_vocab.py # debug bert-vocab
In fact, there is nothing critical inside bert-vocab; it is just the usual preprocessing found in natural language processing. Ten minutes of stepping through it in the debugger is enough to understand it, and I have added a few comments to make it even easier.
The internal inheritance relationship is:
TorchVocab --> Vocab --> WordVocab
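The hierarchy above can be sketched roughly as follows. This is a hypothetical simplification, not the repo's actual code: the attribute names `stoi`/`itos` follow the torchtext convention that the repo mirrors, and the exact special tokens and constructor signatures should be checked against the real source.

```python
from collections import Counter

class TorchVocab:
    """Base class: builds token<->index maps from a frequency Counter."""
    def __init__(self, counter, specials=("<pad>", "<unk>")):
        # Special tokens first, then remaining tokens in sorted order
        self.itos = list(specials) + sorted(
            tok for tok in counter if tok not in specials
        )
        self.stoi = {tok: idx for idx, tok in enumerate(self.itos)}

class Vocab(TorchVocab):
    """Adds the BERT-specific special tokens."""
    def __init__(self, counter):
        super().__init__(
            counter, specials=("<pad>", "<unk>", "<eos>", "<sos>", "<mask>")
        )

class WordVocab(Vocab):
    """Builds the vocab directly from an iterable of text lines."""
    def __init__(self, texts):
        counter = Counter(tok for line in texts for tok in line.split())
        super().__init__(counter)

vocab = WordVocab(["hello world", "hello bert"])
```

Each subclass only adds a thinner layer of convenience on top of its parent, which is why the whole file reads quickly.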
2. Model architecture
Debug command:
python3 -m ipdb test_bert.py -c data/corpus.small -v data/vocab.small -o output/bert.model
Viewed as a whole, the model splits into two parts, MaskedLanguageModel and NextSentencePrediction. Both use BERT as the backbone, and each adds a fully connected layer plus a softmax layer on top to produce its output.
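The two heads can be sketched like this. This is a minimal illustration, not the repo's exact code: the hidden size and vocabulary size are made-up numbers, and the random tensor stands in for BERT's output.

```python
import torch
import torch.nn as nn

class MaskedLanguageModel(nn.Module):
    """Predicts the original token at each (masked) position."""
    def __init__(self, hidden, vocab_size):
        super().__init__()
        self.linear = nn.Linear(hidden, vocab_size)
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):  # x: (batch, seq_len, hidden)
        return self.softmax(self.linear(x))

class NextSentencePrediction(nn.Module):
    """Binary is-next classification from the first ([CLS]) position."""
    def __init__(self, hidden):
        super().__init__()
        self.linear = nn.Linear(hidden, 2)
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):  # only the first token's hidden state is used
        return self.softmax(self.linear(x[:, 0]))

bert_out = torch.randn(4, 16, 256)                 # stand-in for BERT's output
mlm = MaskedLanguageModel(256, 30000)(bert_out)    # (4, 16, 30000)
nsp = NextSentencePrediction(256)(bert_out)        # (4, 2)
```

Note that the MLM head scores every position against the whole vocabulary, while the NSP head reads only the first position.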
This code is relatively simple and easy to understand, so skip it.
1. Bert Model
This part is essentially the Transformer Encoder plus BERT Embedding. If you are not familiar with the Transformer, you can build a deeper understanding starting from here.
For this part, I suggest first skimming the whole thing to get a general framework and to understand the dependencies between the classes, then working gradually from the details back up to the whole; that is, reading the architecture diagram from right to left works best.
1. BERTEmbedding
It is divided into three parts:
TokenEmbedding: encodes the tokens; inherits from nn.Embedding, with default initialization N(0, 1).
SegmentEmbedding: encodes which sentence each token belongs to; inherits from nn.Embedding, with default initialization N(0, 1).
PositionalEmbedding: encodes position information (see the paper); it generates a fixed vector representation and does not participate in training.
Pay particular attention to Positional Embedding here, because some interviewers probe exactly these details. For things I didn't consider useful to me, I used to let them go once I grasped the general idea, and that habit of skipping the details has proved to be a disadvantage.
2. Transformer
I strongly recommend reading this alongside the paper; of course, you can skip it if you are already familiar with it. I have added notes at the key points, and if something is still unclear you can raise an issue. I won't go into detail here.
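As a reading aid, one encoder block boils down to self-attention plus a feed-forward network, each wrapped in a residual connection and LayerNorm. A simplified sketch (the repo's actual TransformerBlock also adds dropout and an attention mask, and its layer-norm placement may differ):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder layer, post-norm style, simplified."""
    def __init__(self, hidden, heads, ff_hidden):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(hidden, ff_hidden),
            nn.GELU(),
            nn.Linear(ff_hidden, hidden),
        )
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x):  # x: (batch, seq_len, hidden)
        attn_out, _ = self.attn(x, x, x)   # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)       # residual + LayerNorm
        return self.norm2(x + self.ff(x))  # feed-forward sublayer

block = EncoderBlock(hidden=64, heads=4, ff_hidden=256)
y = block(torch.randn(2, 8, 64))  # shape preserved: (2, 8, 64)
```

BERT simply stacks N of these blocks on top of the summed embeddings; once you see one block clearly, the rest of the encoder is repetition.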
In closing
Personally, I think the code written by Google is really elegant, with a very clear structure; a few hours is enough to understand the whole thing. I recommend using my debugging method to step through it from beginning to end, which makes everything much clearer.
If you found this helpful, please give it a like.
Original link: https://zhuanlan.zhihu.com/p/76936436