LayoutLM (V1/V2) source code pitfall record


Foreword: I was reading VQA-related material recently and came across the LayoutLM repo, which has open-source code and a pretrained model, so I decided to give it a try.

LayoutLM repo address: https://github.com/microsoft/unilm/tree/master/layoutlm

Environment preparation

  • Install git-lfs (needed to download the large pretrained-model files)

On Ubuntu, git-lfs can be installed as follows:

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
apt-get install git-lfs
git lfs install

After installation, the pretrained model can be downloaded from Hugging Face as follows (the official repo also documents a way to download the pretrained model):

git clone https://huggingface.co/microsoft/layoutlm-base-uncased

Distributing pretrained models this way, through PyTorch and the transformers ecosystem, is quite common now.
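
Once downloaded, the checkpoint can also be loaded directly from Python. A minimal sketch, assuming a transformers version that ships the LayoutLM classes (the num_labels value below is illustrative):

# Minimal sketch: load the downloaded checkpoint with transformers.
from transformers import LayoutLMTokenizer, LayoutLMForTokenClassification

tokenizer = LayoutLMTokenizer.from_pretrained("./layoutlm-base-uncased")
model = LayoutLMForTokenClassification.from_pretrained(
    "./layoutlm-base-uncased",
    num_labels=13,  # illustrative; use the label count from data/labels.txt
)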

  • Install layoutlm: in the repo's layoutlm folder, build and install the layoutlm wheel package as follows
python setup.py bdist_wheel
pip install dist/*.whl
  • Then install the following requirements:
seqeval
tensorboardX
transformers
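
A quick sanity check that the requirements are importable (trivial sketch):

# If any of these imports fail, the corresponding pip install did not take.
import seqeval
import tensorboardX
import transformers
print(transformers.__version__)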

Data preparation

Here, because the document classification dataset is very large (30+ GB) and temporarily inaccessible, I started with the sequence labeling task for experiments and pipeline verification.

  • Prepare the data. Next, download the dataset from the website and decompress it. Since it is hosted on an external network, the download can take a long time even though the file is small (about 17 MB). Enter the example folder and run the preprocessing script (see the sketch after this step for what it does to bounding boxes):

bash preprocess.sh
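
For context: LayoutLM consumes each word's bounding box scaled to a 0-1000 coordinate grid, independent of the original page size, and the preprocessing handles this conversion. A minimal sketch of that normalization (the function name is mine; the real script's details may differ):

# Scale a pixel-space box (x0, y0, x1, y1) to LayoutLM's 0-1000 grid.
def normalize_bbox(bbox, page_width, page_height):
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# Example: a box on a 762x1000-pixel page.
print(normalize_bbox((100, 200, 300, 250), 762, 1000))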

Then train following the documentation on the official website:

python3.7 run_seq_labeling.py  --data_dir data \
                            --model_type layoutlm \
                            --model_name_or_path ./layoutlm-base-uncased/ \
                            --do_lower_case \
                            --max_seq_length 512 \
                            --do_train \
                            --num_train_epochs 100.0 \
                            --logging_steps 10 \
                            --save_steps 2 \
                            --output_dir ./output \
                            --overwrite_output_dir \
                            --labels data/labels.txt \
                            --per_gpu_train_batch_size 16 \
                            --per_gpu_eval_batch_size 16 \
                            --evaluate_during_training \
                            --fp16

Real-time logs are written to output/train.log, which records recall, precision, F1, and other metrics, as shown below.

           precision    recall  f1-score   support

 QUESTION       0.29      0.56      0.38         9
   HEADER       0.50      1.00      0.67         1
   ANSWER       0.04      0.20      0.06         5

micro avg       0.15      0.47      0.23        15
macro avg       0.22      0.47      0.30        15

07/04/2021 09:33:51 - INFO - __main__ -   ***** Eval results  *****
07/04/2021 09:33:51 - INFO - __main__ -     f1 = 0.22950819672131148
07/04/2021 09:33:51 - INFO - __main__ -     loss = 2.8498342037200928
07/04/2021 09:33:51 - INFO - __main__ -     precision = 0.15217391304347827
07/04/2021 09:33:51 - INFO - __main__ -     recall = 0.4666666666666667
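
The report above follows seqeval's entity-level metric style (seqeval is one of the requirements installed earlier, and as far as I can tell it is what the training script uses for evaluation). A toy sketch of producing such a report:

# Entity-level precision/recall/F1 on toy label sequences with seqeval.
from seqeval.metrics import classification_report, f1_score

y_true = [["B-QUESTION", "I-QUESTION", "O", "B-ANSWER"]]
y_pred = [["B-QUESTION", "I-QUESTION", "O", "O"]]
print(classification_report(y_true, y_pred))
print("f1 =", f1_score(y_true, y_pred))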

Problems encountered

  • With evaluation during training enabled, I hit: [Errno 2] No such file or directory: 'data/dev.txt'

The solution is to create the dev files. I soft-linked all the test*.txt files to dev*.txt:

cd data
ln -s test_box.txt dev_box.txt
ln -s test_image.txt dev_image.txt
ln -s test.txt dev.txt
cd ..

To run prediction on your own data, you need to handle the data format conversion yourself; refer to this link: https://github.com/microsoft/unilm/issues/125

  • When training on multiple GPUs with torch >= 1.5.0, the following error occurs: StopIteration: Caught StopIteration in replica 0 on device 0.

You can refer to the following link: https://www.pythonf.cn/read/153689 . Modify the source code, changing next(self.parameters()).dtype to torch.float32 (if training in fp32) or torch.float16 (if training in fp16).
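
The root cause is that under nn.DataParallel with torch >= 1.5.0, a replica's parameter iterator can be empty, so next(self.parameters()) raises StopIteration. A minimal sketch of the workaround; the function and variable names here are illustrative stand-ins, not the exact repo code:

import torch

# Stand-in for the attention-mask cast inside the model's forward pass.
def cast_attention_mask(mask: torch.Tensor, fp16: bool) -> torch.Tensor:
    # before (breaks in DataParallel replicas on torch >= 1.5.0):
    #   mask = mask.to(dtype=next(self.parameters()).dtype)
    # after: hard-code the dtype matching the training precision.
    return mask.to(dtype=torch.float16 if fp16 else torch.float32)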

V2 pitfall record

V2 has relatively little documentation and requires extra dependencies. Before installing them, you must manually upgrade pip, otherwise many of the packages will fail to install. The commands are as follows:

pip install --upgrade pip
pip install datasets
  • Note: I did not expect datasets to be a pip-installable package in its own right; I spent a while hunting for a relative path or local module with that name inside the layoutlm source code before realizing it is the Hugging Face datasets library.
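
For reference, a minimal sketch of the Hugging Face datasets library in action; the dataset identifier below is a community FUNSD mirror used purely as an illustration, not necessarily what layoutlmft itself loads:

from datasets import load_dataset

# "nielsr/funsd" is a community mirror of FUNSD on the Hugging Face hub.
dataset = load_dataset("nielsr/funsd")
print(dataset["train"][0].keys())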

After that, just run the training script from the official website. I found that the code in the layoutlmft folder has been polished: both the pretrained model and the dataset can be downloaded automatically, which saves a lot of hassle.

Original post: https://blog.csdn.net/u012526003/article/details/118464518