Training PaddleOCR on your own data set

Table of Contents

1. PPOCRLabel labeling tool

2. Train the text detection model

2.1 Prepare training image data and test image data

2.2 Label.txt for training and label.txt for testing

2.3 Download the pre-trained model

2.4 Modify the configuration file

2.5 Start training

2.6 Resume training from a checkpoint

2.7 Metric evaluation

2.8 Test results

3. Train the text recognition model

3.1 Data preparation

3.2 Prepare the dictionary

3.3 Download the pre-trained model and configuration file

3.4 Modify the configuration file

3.5 Start training

3.6 Evaluation

3.7 Prediction


1. PPOCRLabel labeling tool

The PPOCRLabel labeling tool lives in the PaddleOCR repository on GitHub and can be installed by following its README: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.0/PPOCRLabel/README_ch.md

After installation, use the tool to label your own data set. When labeling is finished, you will mainly get the following files:

Label.txt: Each line contains an image path followed by the text labels and the four corner coordinates of each text box. It is used to train the detection model. Note that the annotations for all images go into this single txt file, rather than one txt file per image.

rec_gt.txt: Lists each cropped sub-image together with its text content. It is used to train the recognition model.

crop_img: The folder that stores the cropped sub-images.
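To make the Label.txt layout concrete, here is a minimal Python sketch that parses one line. It assumes the layout PPOCRLabel usually writes: an image path, a tab, then a JSON list whose entries carry `transcription` and `points` fields. The sample line and the helper name `parse_det_label_line` are made up for illustration, so check them against your own file.

```python
import json

def parse_det_label_line(line):
    """Split one Label.txt line into (image_path, list of boxes).

    Assumed layout: "<image path>\t<JSON list>", where every JSON entry
    has a "transcription" string and a "points" list of four [x, y] corners.
    """
    image_path, annotation = line.rstrip("\n").split("\t", 1)
    boxes = json.loads(annotation)
    for box in boxes:
        if len(box["points"]) != 4:
            raise ValueError(f"{image_path}: expected 4 corner points")
    return image_path, boxes

# Hypothetical sample line, not taken from a real data set.
sample = 'det_train_images/0001.jpg\t[{"transcription": "hello", "points": [[10, 10], [90, 10], [90, 40], [10, 40]]}]'
path, boxes = parse_det_label_line(sample)
print(path, boxes[0]["transcription"], boxes[0]["points"])
```

Running this over every line of Label.txt before training is a cheap way to catch malformed annotations early.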

2. Train the text detection model

2.1 Prepare training image data and test image data

Here I put all the text detection training images in the det_train_images folder and all the text detection test images in the det_test_images folder.

2.2 Label.txt for training and label.txt for testing

Here I named the txt file for the training images det_train_label.txt and the txt file for the test images det_test_label.txt.
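If all your annotations sit in a single Label.txt, one way to produce the two files is a random split. The sketch below is only one possible approach: the 10% test ratio, the fixed seed, and the helper names are my assumptions, and it only splits the label lines, so the image paths inside them must still match where the images actually are.

```python
import random

def split_label_lines(lines, test_ratio=0.1, seed=0):
    """Shuffle label lines reproducibly and split them into (train, test)."""
    lines = [ln for ln in lines if ln.strip()]  # drop blank lines
    rng = random.Random(seed)
    rng.shuffle(lines)
    n_test = max(1, int(len(lines) * test_ratio)) if lines else 0
    return lines[n_test:], lines[:n_test]

def write_split(label_path, train_path, test_path, test_ratio=0.1, seed=0):
    """Read one label file and write the train and test label files."""
    with open(label_path, encoding="utf-8") as f:
        train, test = split_label_lines(f.read().splitlines(), test_ratio, seed)
    with open(train_path, "w", encoding="utf-8") as f:
        f.write("\n".join(train) + "\n")
    with open(test_path, "w", encoding="utf-8") as f:
        f.write("\n".join(test) + "\n")
    return len(train), len(test)

# Example call (paths are placeholders):
# write_split("Label.txt", "det_train_label.txt", "det_test_label.txt")
```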

2.3 Download the pre-trained model

First download the pre-trained model for the backbone. The PaddleOCR detection model currently supports two backbones, MobileNetV3 and ResNet50_vd. You can also swap in another backbone from PaddleClas according to your needs.

cd PaddleOCR/
# Download the MobileNetV3 pre-trained model
wget -P ./pretrain_models/ https://paddle-imagenet-models-name.bj.bcebos.com/MobileNetV3_large_x0_5_pretrained.tar
# Download the ResNet50 pre-trained model
wget -P ./pretrain_models/ https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_ssld_pretrained.tar
# Extract the pre-trained model files, taking MobileNetV3 as an example
tar -xf ./pretrain_models/MobileNetV3_large_x0_5_pretrained.tar -C ./pretrain_models/
# Note: after the backbone weights are extracted correctly, the folder contains many weight files named after network layers, laid out as follows:
./pretrain_models/MobileNetV3_large_x0_5_pretrained/
  └─ conv_last_bn_mean
  └─ conv_last_bn_offset
  └─ conv_last_bn_scale
  └─ conv_last_bn_variance
  └─ ......

If wget cannot connect, you can copy the URLs above into a browser and download the files there.
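If neither wget nor a browser is convenient, the same download-and-extract step can be done from Python with the standard library. This is a rough sketch: the helper names `download` and `extract` are mine, and the URL in the usage comment is the MobileNetV3 link from the commands above.

```python
import os
import tarfile
import urllib.request

def download(url, dest_dir):
    """Download url into dest_dir and return the local file path."""
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, os.path.basename(url))
    if not os.path.exists(local_path):  # skip files that are already there
        urllib.request.urlretrieve(url, local_path)
    return local_path

def extract(tar_path, dest_dir):
    """Unpack a tar archive into dest_dir and return the member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names

# Example usage (needs network access):
# url = "https://paddle-imagenet-models-name.bj.bcebos.com/MobileNetV3_large_x0_5_pretrained.tar"
# tar_path = download(url, "./pretrain_models")
# print(extract(tar_path, "./pretrain_models")[:5])
```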

2.4 Modify the configuration file

Here we take MobileNetV3 as an example. You need to change the training and test data paths in the configs/det/det_mv3_db.yml file to your own paths. If you are working in a docker environment, make sure the paths you enter are the ones inside the docker container.

Train:
  dataset:
    name: SimpleDataSet
    data_dir: /paddle
    label_file_list:
      - /paddle/det_train_label.txt
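A wrong data_dir or label path usually only shows up once training starts, so it can help to check the paths first. The sketch below assumes that each label line begins with an image path relative to data_dir, followed by a tab; the helper name `missing_images` is hypothetical.

```python
import os

def missing_images(data_dir, label_file):
    """Return image paths from label_file that do not exist under data_dir."""
    missing = []
    with open(label_file, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            rel_path = line.rstrip("\n").split("\t", 1)[0]
            if not os.path.exists(os.path.join(data_dir, rel_path)):
                missing.append(rel_path)
    return missing

# Example call with the paths from the config above:
# print(missing_images("/paddle", "/paddle/det_train_label.txt"))
```

An empty list means every image referenced in the label file was found.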

2.5 Start training

If you installed the CPU version, set the use_gpu field in the configuration file to false.

python3 tools/train.py -c configs/det/det_mv3_db.yml -o Global.pretrain_weights=./pretrain_models/MobileNetV3_large_x0_5_pretrained/

In the command above, -c selects the configs/det/det_mv3_db.yml configuration file for training. For a detailed explanation of the configuration file, refer to the official PaddleOCR documentation.

You can also use the -o parameter to change training parameters without editing the yml file, for example to set the learning rate to 0.0001:

python3 tools/train.py -c configs/det/det_mv3_db.yml -o Optimizer.base_lr=0.0001
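Conceptually, a -o option such as Optimizer.base_lr=0.0001 names a nested key in the yml file and gives it a new value. The following sketch shows that idea on a plain dictionary. It is not PaddleOCR's own implementation: the function `apply_override` and the starting values in `config` are made up, and values are kept as strings for simplicity.

```python
def apply_override(config, option):
    """Apply one "Section.key=value" string to a nested dict in place."""
    dotted_key, value = option.split("=", 1)
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})  # walk down, creating levels if needed
    node[keys[-1]] = value  # values stay strings in this sketch
    return config

# Hypothetical starting config, standing in for the parsed yml file.
config = {"Optimizer": {"base_lr": "0.001"}, "Global": {"use_gpu": "true"}}
apply_override(config, "Optimizer.base_lr=0.0001")
print(config["Optimizer"]["base_lr"])
```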

2.6 Resume training from a checkpoint

If training is interrupted and you want to resume from the interrupted model, specify the path of the model to load with Global.checkpoints:

python3 tools/train.py -c configs/det/det_mv3_db.yml -o Global.checkpoints=./your/trained/model

Note: Global.checkpoints takes priority over Global.pretrain_weights. When both parameters are specified, the model given by Global.checkpoints is loaded first; if that model path is wrong, the model given by Global.pretrain_weights is loaded instead.

2.7 Metric evaluation

python3 tools/eval.py -c configs/det/det_mv3_db.yml  -o Global.checkpoints="./output/db_mv3/best_accuracy" PostProcess.box_thresh=0.6 PostProcess.unclip_ratio=1.5

Note that each of the two PostProcess options in the command above must be preceded by a space. The official website tutorial omits the space, which produces the following error:

Traceback (most recent call last):
  File "tools/eval.py", line 70, in <module>
    config, device, logger, vdl_writer = program.preprocess()
  File "/paddle/PaddleOCR/tools/program.py", line 369, in preprocess
    FLAGS = ArgsParser().parse_args()
  File "/paddle/PaddleOCR/tools/program.py", line 49, in parse_args
    args.opt = self._parse_opt(args.opt)
  File "/paddle/PaddleOCR/tools/program.py", line 60, in _parse_opt
    k, v = s.split('=')
ValueError: too many values to unpack (expected 2)
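The traceback points at k, v = s.split('='), which only works when each option contains exactly one "=". The sketch below reproduces the failure with Python's shlex module standing in for the shell: without the space, the quoted path and the next option are joined into a single token with two "=" signs. The function `parse_opt` is my own stand-in that mimics the line from the traceback, not the actual PaddleOCR code.

```python
import shlex

def parse_opt(tokens):
    """Mimic the k, v = s.split('=') step shown in the traceback."""
    parsed = {}
    for s in tokens:
        k, v = s.split("=")  # fails if a token holds more than one "="
        parsed[k] = v
    return parsed

good = 'Global.checkpoints="./output/db_mv3/best_accuracy" PostProcess.box_thresh=0.6'
bad = 'Global.checkpoints="./output/db_mv3/best_accuracy"PostProcess.box_thresh=0.6'

print(parse_opt(shlex.split(good)))
try:
    parse_opt(shlex.split(bad))
except ValueError as err:
    print("ValueError:", err)
```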

2.8 Test results

I got an error when running the following test command from the official website:

python3 tools/infer_det.py -c configs/det/det_mv3_db.yml -o TestReader.infer_img="./doc/imgs_en/img_10.jpg" Global.checkpoints="./output/det_db/best_accuracy"

It reports the following error:

Traceback (most recent call last):
  File "tools/infer_det.py", line 114, in <module>
    config, device, logger, vdl_writer = program.preprocess()
  File "/paddle/PaddleOCR/tools/program.py", line 371, in preprocess
    merge_config(FLAGS.opt)
  File "/paddle/PaddleOCR/tools/program.py", line 115, in merge_config
    global_config.keys(), sub_keys[0])
AssertionError: the sub_keys can only be one of global_config: dict_keys(['Global', 'Architecture', 'Loss', 'Optimizer', 'PostProcess', 'Metric', 'Train', 'Eval']), but get: TestReader, please check your running command

At a glance, the TestReader.infer_img="./doc/imgs_en/img_10.jpg" parameter does not seem to be supported, so I set infer_img directly on line 17 of PaddleOCR/configs/det/det_mv3_db.yml to doc/imgs_en/img_10.jpg:

  infer_img: doc/imgs_en/img_10.jpg
  save_res_path: ./output/det_db/predicts_db.txt

After modifying PaddleOCR/configs/det/det_mv3_db.yml, remove TestReader.infer_img="./doc/imgs_en/img_10.jpg" from the test command:

python3 tools/infer_det.py -c configs/det/det_mv3_db.yml -o Global.checkpoints="./output/db_mv3/best_accuracy"

To test all the images in a folder, change infer_img: doc/imgs_en/img_10.jpg in PaddleOCR/configs/det/det_mv3_db.yml to infer_img: doc/imgs_en/.
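The idea that infer_img can point either at one image or at a folder of images can be sketched as follows. This only illustrates the behavior described here and is not the code PaddleOCR uses; the helper name `list_images` and the set of accepted extensions are my assumptions.

```python
import os

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}  # assumed set, adjust as needed

def list_images(path):
    """Return [path] for a single file, or all images inside a folder."""
    if os.path.isfile(path):
        return [path]
    if os.path.isdir(path):
        return sorted(
            os.path.join(path, name)
            for name in os.listdir(path)
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
        )
    raise FileNotFoundError(f"no image file or folder at {path}")

# Example calls (paths from the config above):
# list_images("doc/imgs_en/img_10.jpg")
# list_images("doc/imgs_en/")
```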

3. Train the text recognition model

3.1 Data preparation

After labeling with PPOCRLabel, the cropped sub-images and the corresponding txt file are generated. Each line in the txt file contains the path of a sub-image and the text content of that image.
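As a quick illustration of this file, the sketch below parses such lines and collects the characters they use, which is handy when preparing the dictionary later. It assumes the path and the text are separated by a tab; the sample lines and helper names are invented for the example.

```python
def read_rec_labels(lines):
    """Parse "image_path<TAB>text" lines into a list of (path, text) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        path, text = line.split("\t", 1)
        pairs.append((path, text))
    return pairs

def charset(pairs):
    """Collect every distinct character that appears in the label texts."""
    return sorted({ch for _, text in pairs for ch in text})

# Hypothetical sample lines, not taken from a real data set.
sample = ["crop_img/0001_crop_0.jpg\thello", "crop_img/0001_crop_1.jpg\tworld"]
pairs = read_rec_labels(sample)
print(charset(pairs))
```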

3.2 Prepare the dictionary

Finally, a dictionary ({word_dict_name}.txt) must be provided so that during training every character that appears can be mapped to a dictionary index. The dictionary therefore needs to contain all the characters you want to recognize correctly. {word_dict_name}.txt must be saved in UTF-8 encoding and written in the following format:

l
d
a
d
r
n

word_dict.txt has one character per line, which maps each character to a numeric index; for example, "and" will be mapped to [2 5 1]. ppocr/utils/ppocr_keys_v1.txt is a Chinese dictionary with 6623 characters, and ppocr/utils/ic15_dict.txt is an English dictionary with 36 characters. Here I merged ppocr_keys_v1.txt and ic15_dict.txt into one dictionary named chw_dict.txt. Then I set the dictionary path in the configs/rec/rec_chinese_common_train_v2.0.yml file to ppocr/utils/chw_dict.txt and set character_type to ch. Since we also want to recognize the "space" class, set the use_space_char field in the yml file to True. Note: use_space_char only takes effect when character_type=ch.

  # for data or label process
  character_dict_path: ppocr/utils/chw_dict.txt
  character_type: ch
  max_text_length: 25
  infer_mode: False
  use_space_char: True
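The merged chw_dict.txt can be built in several ways. One possible way is sketched below: concatenate the two dictionary files and keep only the first occurrence of each character. Dropping duplicates is my own choice here, and the function name `merge_dictionaries` is hypothetical.

```python
def merge_dictionaries(paths, out_path):
    """Concatenate dictionary files, keeping the first occurrence of each character."""
    seen = set()
    merged = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                ch = line.rstrip("\r\n")
                if ch and ch not in seen:
                    seen.add(ch)
                    merged.append(ch)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(merged) + "\n")  # one character per line, UTF-8
    return merged

# Example call with the files named above:
# merge_dictionaries(
#     ["ppocr/utils/ppocr_keys_v1.txt", "ppocr/utils/ic15_dict.txt"],
#     "ppocr/utils/chw_dict.txt",
# )
```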

3.3 Download the pre-trained model and configuration file

The official website gives only one download link for the model, so I went directly to GitHub to download the pre-trained model: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.0/doc/doc_ch/models_list.md#ocr%E6%A8%A1%E5%9E%8B%E5%88%97%E8%A1%A8v202021%E5%B9%B41%E6%9C%8820%E6%97%A5%E6%9B%B4%E6%96%B0

Download the pre-trained model you want, then unzip it and put it in the PaddleOCR/pretrain_models folder. At the same time download the corresponding configuration file and put it in the PaddleOCR/configs/rec folder. What I use here is rec_chinese_common_train_v2.0.yml and the corresponding pre-training model.

Chinese recognition model

Model name | Model introduction | Configuration file | Inference model size | Download link
ch_ppocr_mobile_v2.0_rec | Original ultra-lightweight model; supports Chinese, English, and digit recognition | rec_chinese_lite_train_v2.0.yml | 3.71M | Inference model / training model / pre-trained model
ch_ppocr_server_v2.0_rec | General model; supports Chinese, English, and digit recognition | rec_chinese_common_train_v2.0.yml | 94.8M | Inference model / training model / pre-trained model

Note: The training model is obtained by fine-tuning the pre-trained model on real data and vertically synthesized text data, and it performs better in real application scenarios. The pre-trained model is trained directly on the full set of real and synthetic data, which makes it better suited for fine-tuning on your own data set.

English recognition model

Model name | Model introduction | Configuration file | Inference model size | Download link
en_number_mobile_v2.0_rec | Original ultra-lightweight model; supports English and digit recognition | rec_en_number_lite_train.yml | 2.56M | Inference model / training model

3.4 Modify the configuration file

You need to change the training and test data paths in the ./configs/rec/rec_chinese_common_train_v2.0.yml file to your own paths. If you are working in a docker environment, make sure the paths you enter are the ones inside the docker container.

Train:
  dataset:
    name: SimpleDataSet
    data_dir: /paddle
    label_file_list: ["/paddle/rec_gt_train.txt"]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - RecAug: 
      - CTCLabelEncode: # Class handling label
      - RecResizeImg:
          image_shape: [3, 32, 320]
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: True
    batch_size_per_card: 256
    drop_last: True
    num_workers: 8

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: /paddle
    label_file_list: ["/paddle/rec_gt_test.txt"]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode: # Class handling label
      - RecResizeImg:
          image_shape: [3, 32, 320]
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order

The following modification example comes from the official documentation:

Global:
  ...
  # Modify image_shape to fit long text
  image_shape: [3, 32, 320]
  ...
  # Modify the character type
  character_type: ch
  # Add a custom dictionary; if you change the dictionary, point this path to the new one
  character_dict_path: ./ppocr/utils/ppocr_keys_v1.txt
  # Add data augmentation during training
  distort: true
  # Recognize spaces
  use_space_char: true
  ...
  # Modify the reader type
  reader_yml: ./configs/rec/rec_chinese_reader.yml
  ...
...
Optimizer:
  ...
  # Add a learning rate decay strategy
  decay:
    function: cosine_decay
    # Number of iterations in each epoch
    step_each_epoch: 20
    # Total number of training epochs
    total_epoch: 1000
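To get a feel for what a cosine decay schedule with these settings does, here is a small sketch of the usual cosine annealing formula, where the epoch index is the global step divided by step_each_epoch. The base learning rate of 0.001 is an assumed value for illustration, and the function is my own restatement of the formula rather than PaddleOCR code.

```python
import math

def cosine_decay_lr(base_lr, global_step, step_each_epoch, total_epoch):
    """Cosine-annealed learning rate, updated once per epoch."""
    epoch = global_step // step_each_epoch
    return base_lr * 0.5 * (math.cos(epoch * math.pi / total_epoch) + 1)

# Learning rate at the start, the midpoint, and the end of training.
for step in (0, 10000, 20000):
    print(step, cosine_decay_lr(0.001, step, step_each_epoch=20, total_epoch=1000))
```

The rate starts at the base value, falls to about half of it at the midpoint, and approaches zero at the final epoch.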

3.5 Start training

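# GPU training supports single-card and multi-card training; select the cards with CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Start training with the Chinese lite recognition configuration
python3 tools/train.py -c configs/rec/rec_chinese_lite_train_v2.0.yml

Running this produced the following error: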

2021-03-26 03:19:23,583 - ERROR - DataLoader reader thread raised an exception!
Traceback (most recent call last):
  File "tools/train.py", line 124, in <module>
    main(config, device, logger, vdl_writer)
  File "tools/train.py", line 97, in main
    eval_class, pre_best_model_dict, logger, vdl_writer)
  File "/paddle/PaddleOCR/tools/program.py", line 201, in train
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/dataloader_iter.py", line 684, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/usr/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/dataloader_iter.py", line 616, in _thread_loop
    batch = self._get_data()
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/dataloader_iter.py", line 700, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 8 workers exit unexpectedly, pids: 74939, 74940, 74941, 74942, 74943, 74944, 74945, 74946

    for idx, batch in enumerate(train_dataloader):
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/dataloader_iter.py", line 779, in __next__
    data = self._reader.read_next_var_list()
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
  [Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:158)

I changed batch_size_per_card in the rec_chinese_lite_train_v2.0.yml file from 256 to 64, but training still reported an error:

Traceback (most recent call last):
  File "tools/train.py", line 124, in <module>
    main(config, device, logger, vdl_writer)
  File "tools/train.py", line 97, in main
    eval_class, pre_best_model_dict, logger, vdl_writer)
  File "/paddle/PaddleOCR/tools/program.py", line 201, in train
    for idx, batch in enumerate(train_dataloader):
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/dataloader_iter.py", line 779, in __next__
    data = self._reader.read_next_var_list()
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/multiprocess_utils.py", line 134, in __handler__
    core._throw_error_if_process_failed()
SystemError: (Fatal) DataLoader process (pid   1. If run DataLoader by DataLoader.from_generator(...), queue capacity is set by from_generator(..., capacity=xx, ...).
  2. If run DataLoader by DataLoader(dataset, ...), queue capacity is set as 2 times of the max value of num_workers and len(places).
  3. If run by DataLoader(dataset, ..., use_shared_memory=True), set use_shared_memory=False for not using shared memory.) exited is killed by signal: 78129.
  It may be caused by insufficient shared storage space. This problem usually occurs when using docker as a development environment.
  Please use command `df -h` to check the storage space of `/dev/shm`. Shared storage space needs to be greater than (DataLoader Num * DataLoader queue capacity * 1 batch data size).
  You can solve this problem by increasing the shared storage space or reducing the queue capacity appropriately.
Bus error (at /paddle/paddle/fluid/imperative/data_loader.cc:161)

According to the error message above, the cause is that I am using a docker environment, where the shared memory /dev/shm defaults to 64 MB.

# df -h inside the docker environment
Filesystem      Size  Used Avail Use% Mounted on
overlay         3.5T  1.5T  1.9T  44% /
tmpfs            64M     0   64M   0% /dev
tmpfs           252G     0  252G   0% /sys/fs/cgroup
shm              64M   57M  7.6M  89% /dev/shm
/dev/sda2       3.5T  1.5T  1.9T  44% /paddle
tmpfs           252G   12K  252G   1% /proc/driver/nvidia
udev            252G     0  252G   0% /dev/nvidia0
tmpfs           252G     0  252G   0% /proc/acpi
tmpfs           252G     0  252G   0% /proc/scsi
tmpfs           252G     0  252G   0% /sys/firmware

# df -h outside the docker environment
Filesystem      Size  Used Avail Use% Mounted on
udev            252G     0  252G   0% /dev
tmpfs            51G  2.7M   51G   1% /run
/dev/sda2       3.5T  1.5T  1.9T  44% /
tmpfs           252G     0  252G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           252G     0  252G   0% /sys/fs/cgroup

So I deleted the current docker container and then specified the shared memory size with --shm-size=252G when creating the container again:

sudo nvidia-docker run --name ppocr -v $PWD:/paddle --shm-size=252G  --network=host -itd paddlepaddle/paddle:2.0.1-gpu-cuda11.0-cudnn8 /bin/bash

For the specific docker commands, see: https://blog.csdn.net/u013171226/article/details/115132594 . After increasing the shared memory, I changed batch_size_per_card back to 256 and no error was reported.
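The error message states that shared memory must exceed DataLoader Num * DataLoader queue capacity * 1 batch data size. The sketch below turns that into a rough estimate for the settings used here. It assumes float32 images of shape [3, 32, 320], ignores the labels, and takes a queue capacity of 16 (twice the 8 workers, following the hint in the message); the helper names are mine.

```python
import shutil

def batch_bytes(batch_size, image_shape, bytes_per_value=4):
    """Size of one batch of float32 images (labels ignored)."""
    channels, height, width = image_shape
    return batch_size * channels * height * width * bytes_per_value

def required_shm_bytes(num_workers, queue_capacity, one_batch_bytes):
    """Lower bound quoted in the error message."""
    return num_workers * queue_capacity * one_batch_bytes

def shm_free_bytes(path="/dev/shm"):
    """Free space on the shared-memory mount (Linux only)."""
    return shutil.disk_usage(path).free

one_batch = batch_bytes(256, (3, 32, 320))
needed = required_shm_bytes(num_workers=8, queue_capacity=16, one_batch_bytes=one_batch)
print(f"one batch: {one_batch / 2**20:.1f} MiB, estimated need: {needed / 2**30:.2f} GiB")
```

Under these assumptions a single batch of 256 images is already about 30 MiB, so a 64 MB /dev/shm is far too small, and lowering the batch size to 64 still does not bring the estimate under that limit.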

3.6 Evaluation

 

python3 tools/eval.py -c ./configs/rec/rec_chinese_lite_train_v2.0.yml  -o Global.checkpoints=./output/rec_chinese_lite_v2.0/latest

 

3.7 Prediction

python3 tools/infer_rec.py -c ./configs/rec/rec_chinese_common_train_v2.0.yml  -o Global.checkpoints=./output/rec_chinese_common_v2.0/best_accuracy Global.infer_img=doc/13_crop_4.jpg

 

References:

    https://www.bookstack.cn/read/PaddleOCR/detection.md

    https://www.bookstack.cn/read/PaddleOCR/recognition.md

Origin blog.csdn.net/u013171226/article/details/115179480