Train the BERT language model from scratch

1. Data Preparation

1.1 Building a corpus

  If no corpus file (such as corpus.txt) is provided, you can build one from the training set and test set data. The code is as follows (code file name):

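# Collect the unique lines of the training set; a set drops exact duplicates.
filtered_line = set()

with open('../../data/raw/train.txt', 'r') as f:
    line = f.readline()
    while line:
        # The last line of a file may lack a trailing newline; add one so
        # that it compares equal to an otherwise identical earlier line.
        if line[-1] != '\n':
            line += '\n'
        filtered_line.add(line)

        line = f.readline()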
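
The overall idea described above (read the training and test sets, drop duplicate lines, write the result to a corpus file) could be sketched as a single helper. This is only a minimal sketch under stated assumptions, not the original author's script: the function name build_corpus, the test-set path ../../data/raw/test.txt, and the output name corpus.txt are hypothetical. Unlike the set-based snippet, it keeps lines in first-seen order and skips blank lines.

```python
from pathlib import Path


def build_corpus(input_paths, output_path):
    """Merge text files into one corpus file, dropping duplicate and blank lines.

    Returns the number of lines written.
    """
    seen = set()        # lines already collected, for O(1) duplicate checks
    unique_lines = []   # preserves first-seen order
    for path in input_paths:
        with open(path, 'r', encoding='utf-8') as f:
            for line in f:
                text = line.rstrip('\n')
                # Skip blank lines and lines that were already collected.
                if text.strip() and text not in seen:
                    seen.add(text)
                    unique_lines.append(text)

    with open(output_path, 'w', encoding='utf-8') as f:
        for text in unique_lines:
            f.write(text + '\n')
    return len(unique_lines)


if __name__ == '__main__':
    # Hypothetical paths: only train.txt appears in the original snippet;
    # the test-set path and the output file name are assumptions.
    sources = ['../../data/raw/train.txt', '../../data/raw/test.txt']
    existing = [p for p in sources if Path(p).exists()]
    if existing:
        build_corpus(existing, 'corpus.txt')
```

Passing the input paths as a list keeps the helper usable when only one of the two files is available, and stripping the newline before the duplicate check makes a final line without a trailing newline match an identical earlier line.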

Origin blog.csdn.net/herosunly/article/details/113937736