Modifying the BERT source code for multi-class classification tasks

(Author: Chen Yujue, data-master)

For a long time I used regular expressions for text classification, but regular expressions need constant maintenance: whenever an SMS template is updated, the new template has to be added to the expressions, which is quite labor-intensive. You can use pyltp to replace organization and person names in the text and then deduplicate it, which makes it easier to extract the templates that become regular expressions, but once there are more categories and more sources, the number of templates to write and test is still very large. So I wondered: could I take the text already classified by regular expressions, use the regex-assigned category as the label, and train BERT to do the classification, so that I no longer have to update the regular expressions by hand?

There are many ways to use BERT floating around online. There is bert4keras, developed by Su Jianlin, and the code I found on the Internet does run, but you still don't know the details. So I wanted a script where the modeling is done from the source code itself. The official repository is here: https://github.com/google-research/bert

The official documentation is actually quite clear. For classification tasks you only need to read the section "Sentence (and sentence-pair) classification tasks". The key part:

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/

Define BERT_BASE_DIR as the path to the pre-trained BERT model (download links are given in the official documentation; I used the Harbin Institute of Technology model from https://github.com/ymcui/Chinese-BERT-wwm, Chinese model download section, an address I found on the GitHub page of Su Jianlin's bert4keras). After decompressing the downloaded model archive, the directory looks like this:
[Image: directory listing of the extracted pre-trained model]
bert_model.ckpt is the checkpoint that records the model's weight parameters, and it actually consists of three files. (I had downloaded a pre-trained model from another source earlier and hit a bug here: the three ckpt files are read together, and mine would not load because my .ckpt.data file had been named .ckpt(1).data; after renaming it back to .ckpt it worked. The error shown was tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file, which means the checkpoint was not fully loaded, i.e. the meta and index files were not picked up. Solution reference: https://blog.csdn.net/Zhang_xi_xi_94/article/details/81293048. In short, a ckpt is made up of three files; as long as there are no extra characters in the names and all three share exactly the same prefix, it loads fine.)
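As a quick sanity check before training, here is a minimal sketch that verifies the three checkpoint files share the same prefix. The directory path and the single-shard data file name are assumptions; adjust them to whatever your download actually contains:

import os

# Hypothetical path; point this at your extracted pre-trained model directory.
BERT_BASE_DIR = "/path/to/bert/chinese_wwm_L-12_H-768_A-12"
CKPT_PREFIX = os.path.join(BERT_BASE_DIR, "bert_model.ckpt")

# A TF1 checkpoint is really three files that must share the exact same prefix.
# --init_checkpoint is given only the prefix (bert_model.ckpt), not a file name.
expected = [
    "bert_model.ckpt.data-00000-of-00001",  # assumed single-shard naming
    "bert_model.ckpt.index",
    "bert_model.ckpt.meta",
]
missing = [f for f in expected
           if not os.path.exists(os.path.join(BERT_BASE_DIR, f))]
if missing:
    # Stray characters in a name (e.g. "bert_model.ckpt(1).data-...") lead to
    # DataLossError: Unable to open table file when the checkpoint is loaded.
    raise FileNotFoundError("Checkpoint files missing or misnamed: %s" % missing)
print("Checkpoint prefix OK:", CKPT_PREFIX)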

bert_config.json describes the network structure, and vocab.txt is the model's vocabulary.

GLUE_DIR is the path to the input samples. The BERT source code can only handle a fixed set of input formats out of the box; if your data is not one of them, you have to write your own data processor, which means changing the source code. The built-in processors handle the XNLI, MultiNLI, MRPC and CoLA datasets.
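As an illustration, here is a minimal sketch of a custom processor for a single-sentence, multi-class task, modeled on the built-in processors in run_classifier.py. The class name SmsProcessor, the TSV column layout and the label set are my own assumptions, not part of the official code; if you paste the class directly into run_classifier.py you can drop the imports.

import os

# DataProcessor, InputExample and tokenization come from the official bert repo.
from run_classifier import DataProcessor, InputExample
import tokenization


class SmsProcessor(DataProcessor):
  """Hypothetical processor for a single-sentence, multi-class TSV dataset.

  Assumes data_dir contains train.tsv / dev.tsv / test.tsv with two
  tab-separated columns (label <TAB> text) and no header row.
  """

  def get_train_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

  def get_test_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

  def get_labels(self):
    # Replace with your own label set (I had 11 categories).
    return [str(i) for i in range(11)]

  def _create_examples(self, lines, set_type):
    examples = []
    for (i, line) in enumerate(lines):
      guid = "%s-%d" % (set_type, i)
      label = tokenization.convert_to_unicode(line[0])
      text_a = tokenization.convert_to_unicode(line[1])
      # Single-sentence classification: text_b stays None.
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples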

The remaining parameters are fairly conventional.

For classification you can run it exactly as the official docs show: after adding your own processor, pass the parameters to the script on the command line. Alternatively, copy run_classifier.py and hardcode the parameters inside the script, so you can run it directly without passing any arguments.
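A sketch of both points, assuming the SmsProcessor above: registering the new processor is one extra entry in the processors dict inside main() of run_classifier.py, and the flags can be hardcoded instead of passed on the command line. All paths and the "sms" task name are placeholders of my own.

# Inside main() of run_classifier.py, the built-in mapping looks like this;
# add your own entry so --task_name can find it:
processors = {
    "cola": ColaProcessor,
    "mnli": MnliProcessor,
    "mrpc": MrpcProcessor,
    "xnli": XnliProcessor,
    "sms": SmsProcessor,   # hypothetical custom processor from the sketch above
}

# Alternatively, hardcode the flag values in your copy of the script instead of
# passing them on the command line (paths here are placeholders):
FLAGS.task_name = "sms"
FLAGS.do_train = True
FLAGS.do_eval = True
FLAGS.data_dir = "/path/to/sms_data"
FLAGS.vocab_file = "/path/to/bert/vocab.txt"
FLAGS.bert_config_file = "/path/to/bert/bert_config.json"
FLAGS.init_checkpoint = "/path/to/bert/bert_model.ckpt"
FLAGS.max_seq_length = 50
FLAGS.train_batch_size = 256
FLAGS.learning_rate = 2e-5
FLAGS.num_train_epochs = 2.0
FLAGS.output_dir = "/tmp/sms_output/"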

One thing to watch when running: BERT defaults to TPU. Although there is a parameter controlling whether to train on TPU, even with it set to False the code still builds a tf.contrib.tpu.TPUEstimatorSpec; if you only have a CPU or GPU, you need to change that part to tf.estimator.EstimatorSpec.
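A sketch of the kind of change meant here, inside model_fn_builder in run_classifier.py. The variable names follow the official script, but treat the exact edit as an assumption and adapt it to your version of the code:

# Original (TPU-flavored) return inside model_fn for the training branch:
#   output_spec = tf.contrib.tpu.TPUEstimatorSpec(
#       mode=mode, loss=total_loss, train_op=train_op, scaffold_fn=scaffold_fn)
#
# CPU/GPU replacement: drop scaffold_fn and return a plain EstimatorSpec.
output_spec = tf.estimator.EstimatorSpec(
    mode=mode,
    loss=total_loss,
    train_op=train_op)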

That is essentially all there is to it. As for what fine-tuning actually does: it adds a new layer on top of the pre-trained BERT model to carry out your task. For classification, that means putting a softmax (or sigmoid) layer on top of BERT.
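That added layer is small. Here is a sketch of the classification head, following the shape of create_model in the official run_classifier.py (TF1-style); the softmax branch mirrors what the official script does for multi-class, and the sigmoid branch shows the multi-label variant mentioned above. The function name and multi_label switch are my own.

import tensorflow as tf

def classification_head(pooled_output, labels, num_labels, multi_label=False):
  """Hypothetical head on top of BERT's pooled [CLS] output (TF1-style sketch)."""
  hidden_size = pooled_output.shape[-1].value

  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))
  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  logits = tf.matmul(pooled_output, output_weights, transpose_b=True)
  logits = tf.nn.bias_add(logits, output_bias)

  if multi_label:
    # Multi-label: one independent sigmoid per class; labels are multi-hot.
    per_example_loss = tf.reduce_sum(
        tf.nn.sigmoid_cross_entropy_with_logits(
            labels=tf.cast(labels, tf.float32), logits=logits), axis=-1)
    probabilities = tf.nn.sigmoid(logits)
  else:
    # Multi-class: softmax over mutually exclusive classes; labels are class ids.
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    probabilities = tf.nn.softmax(logits, axis=-1)

  loss = tf.reduce_mean(per_example_loss)
  return loss, per_example_loss, logits, probabilities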

For a concrete code implementation see https://github.com/javaidnabi31/Multi-Label-Text-classification-Using-BERT/blob/master/multi-label-classification-bert.ipynb, alongside the official repository. That notebook can be used directly and has no bugs; you only need to modify the data-loading part. The official code is mainly useful for understanding the details.

After changing the code I used it for training. At first I took a bit under 30,000 labeled SMS messages, with batch size 32, learning rate 4e-5, NUM_TRAIN_EPOCHS = 1.0 and max_seq_length 128. I have 11 categories, and while the training set contained texts from all 11, the test set ended up with only 1, because I had not sampled the training and test sets randomly. The result was that the classifier performed worse than the regular expressions. So I adjusted things: I increased the combined training and test data to 60,000 messages, sampled training and test sets randomly so that both would cover all categories, changed the batch size to 256, changed the learning rate to 2e-5, set NUM_TRAIN_EPOCHS = 2.0, and reduced max_seq_length to 50 (my texts are all short and most never reach 128 tokens; padding them into length-128 vectors full of trailing zeros may hurt the prediction). With that, the results are better than the regular expressions and can replace them entirely.
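The split problem above (a test set that happened to contain only one category) is easy to avoid with a stratified random split. A minimal sketch, assuming the labeled SMS data sits in a tab-separated file with hypothetical label and text columns in the same order the processor sketch expects:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical input file: one labeled SMS per row, "label<TAB>text", no header.
df = pd.read_csv("labeled_sms.tsv", sep="\t", names=["label", "text"])

# stratify=df["label"] keeps the proportions of all 11 categories roughly
# equal in both the training and the test set.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["label"])

train_df.to_csv("train.tsv", sep="\t", index=False, header=False)
test_df.to_csv("dev.tsv", sep="\t", index=False, header=False)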

References:

  1. https://github.com/google-research/bert — the official BERT repository
  2. https://blog.csdn.net/HGlyh/article/details/106744286 — uses Su Jianlin's bert4keras; the implementation is relatively simple, so you may want to try this one first
  3. https://github.com/javaidnabi31/Multi-Label-Text-classification-Using-BERT/blob/master/multi-label-classification-bert.ipynb — the official run_classifier.py adapted for multi-label classification; it differs slightly from the current official script, perhaps because the official code changed or because the author made his own changes
  4. https://blog.csdn.net/weixin_37947156/article/details/84877254 — analysis of the official run_classifier.py source code; some of the content overlaps with the official GitHub docs
  5. https://github.com/ymcui/Chinese-BERT-wwm#%E4%B8%AD%E6%96%87%E6%A8%A1%E5%9E%8B%E4%B8%8B%E8%BD%BD — Chinese BERT models from Harbin Institute of Technology (the download that Su Jianlin's bert4keras also points to)

Original post: https://blog.csdn.net/weixin_39750084/article/details/108122492