Table of Contents
2. Train the text detection model
2.1 Prepare training image data and test image data
2.2 Label.txt for training and label.txt for testing
2.3 Download the pre-trained model
2.4 Modify the configuration file
3. Train the text recognition model
3.3 Download the pre-trained model and configuration file
3.4 Modify the configuration file
1. PPOCRLabel labeling tool
The PPOCRLabel labeling tool lives in the PPOCRLabel folder of the PaddleOCR GitHub repository and can be installed by following the instructions there: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.0/PPOCRLabel/README_ch.md
After installation, use the tool to label your own data set. When labeling is finished, you mainly get the following output:
Label.txt: Each entry holds an image path, followed by the text labels and the four corner coordinates of each text box. It is used to train the detection model. Note that all images are listed in a single txt file, rather than one txt file per image.
rec_gt.txt: Each entry holds a cropped sub-image and its corresponding text content. It is used to train the recognition model.
crop_img: The folder that stores the cropped sub-images.
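A minimal sketch of reading one line of Label.txt. It assumes each line contains the image path, a tab, and then a JSON list whose entries carry "transcription" and "points" keys. That layout is an assumption about PPOCRLabel's output, so check a line of your own file first. The helper name parse_det_label_line and the sample line are hypothetical.

```python
import json


def parse_det_label_line(line):
    """Split one detection label line into (image_path, boxes).

    Assumed layout: <image path>\t<JSON list of annotations>.
    """
    image_path, annotation = line.rstrip("\n").split("\t", 1)
    boxes = []
    for item in json.loads(annotation):
        # Each box is stored as (text, list of four [x, y] corner points).
        boxes.append((item["transcription"], item["points"]))
    return image_path, boxes


# Hypothetical sample line, for illustration only.
sample = (
    'det_train_images/1.jpg\t'
    '[{"transcription": "hello", '
    '"points": [[10, 20], [110, 20], [110, 50], [10, 50]]}]'
)
path, boxes = parse_det_label_line(sample)
print(path)
print(boxes)
```

Running a check like this over the whole file before training is a cheap way to catch malformed lines early.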
2. Train the text detection model
2.1 Prepare training image data and test image data
Here I put all the text detection training images in the det_train_images folder and the text detection test images in the det_test_images folder.
2.2 Label.txt for training and label.txt for testing
Here I named the label file for the training images det_train_label.txt and the label file for the test images det_test_label.txt.
2.3 Download the pre-trained model
First download the pre-trained model of the backbone. The detection model of PaddleOCR currently supports two backbones, MobileNetV3 and ResNet50_vd. You can also replace the backbone with a model from PaddleClas according to your needs.
cd PaddleOCR/
# Download the MobileNetV3 pre-trained model
wget -P ./pretrain_models/ https://paddle-imagenet-models-name.bj.bcebos.com/MobileNetV3_large_x0_5_pretrained.tar
# Download the ResNet50 pre-trained model
wget -P ./pretrain_models/ https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_ssld_pretrained.tar
# Extract the pre-trained model file, using MobileNetV3 as an example
# (-C sets the directory to extract into)
tar -xf ./pretrain_models/MobileNetV3_large_x0_5_pretrained.tar -C ./pretrain_models/
# Note: after the backbone weights are extracted correctly, the folder contains many weight files named after network layers, laid out as follows:
./pretrain_models/MobileNetV3_large_x0_5_pretrained/
└─ conv_last_bn_mean
└─ conv_last_bn_offset
└─ conv_last_bn_scale
└─ conv_last_bn_variance
└─ ......
If the wget download above cannot connect, copy the URLs above into a browser and download the files there.
2.4 Modify the configuration file
Here we take MobileNetV3 as an example. You need to change the training data and test data paths in the configs/det/det_mv3_db.yml file to your own paths. If you are using a docker environment, make sure the paths you enter are the paths inside the docker container.
Train:
  dataset:
    name: SimpleDataSet
    data_dir: /paddle
    label_file_list:
      - /paddle/det_train_label.txt
2.5 Start training
If you installed the CPU version, set the use_gpu field in the configuration file to false.
python3 tools/train.py -c configs/det/det_mv3_db.yml -o Global.pretrain_weights=./pretrain_models/MobileNetV3_large_x0_5_pretrained/
In the command above, -c selects the configs/det/det_mv3_db.yml configuration file for training. For a detailed explanation of the configuration file, please refer to the official configuration documentation.
You can also use the -o parameter to change training parameters without editing the yml file, for example to set the learning rate to 0.0001:
python3 tools/train.py -c configs/det/det_mv3_db.yml -o Optimizer.base_lr=0.0001
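To make the idea of dotted -o overrides concrete, here is a minimal sketch of how a "Section.key=value" string can be merged into a nested configuration dictionary. It is an illustration only, not PaddleOCR's actual implementation. The function names apply_overrides and _parse_value and the sample config are hypothetical.

```python
import ast
import copy


def _parse_value(text):
    """Turn '0.0001' into a float, '[1, 2]' into a list, else keep the string."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_overrides(config, overrides):
    """Return a copy of config with 'A.b=value' style overrides applied."""
    merged = copy.deepcopy(config)
    for item in overrides:
        key, value = item.split("=", 1)
        *parents, leaf = key.split(".")
        node = merged
        for name in parents:
            # Walk down the nested dictionaries, creating levels if needed.
            node = node.setdefault(name, {})
        node[leaf] = _parse_value(value)
    return merged


# Hypothetical stand-in for the content of a yml file.
config = {"Optimizer": {"base_lr": 0.001}, "Global": {"use_gpu": True}}
merged = apply_overrides(config, ["Optimizer.base_lr=0.0001"])
print(merged["Optimizer"]["base_lr"])
print(config["Optimizer"]["base_lr"])  # the original config is left untouched
```

Working on a copy keeps the values loaded from the yml file intact, which makes it easy to compare the effective configuration with the file on disk.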
2.6 Resuming training from a checkpoint
If training is interrupted and you want to load the interrupted model and resume training, set Global.checkpoints to the path of the model to load:
python3 tools/train.py -c configs/det/det_mv3_db.yml -o Global.checkpoints=./your/trained/model
Note: Global.checkpoints has a higher priority than Global.pretrain_weights. That is, when both parameters are specified, the model specified by Global.checkpoints is loaded first. If that model path is wrong, the model specified by Global.pretrain_weights is loaded instead.
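The priority rule described in the note can be sketched as follows. This is a hypothetical illustration of the decision only, not PaddleOCR's loading code. The function choose_weights is an invented name, and treating "wrong path" as "path does not exist" is an assumption.

```python
import os


def choose_weights(checkpoints, pretrain_weights, exists=os.path.exists):
    """Pick which weights to load: checkpoints first, pretrain_weights as fallback."""
    if checkpoints and exists(checkpoints):
        return "checkpoints", checkpoints
    if pretrain_weights and exists(pretrain_weights):
        return "pretrain_weights", pretrain_weights
    # Neither path is usable: train from scratch.
    return None, None


# Simulate a file system in which only the pre-trained weights exist.
available = {"./pretrain_models/MobileNetV3_large_x0_5_pretrained/"}
source, path = choose_weights(
    "./your/trained/model",
    "./pretrain_models/MobileNetV3_large_x0_5_pretrained/",
    exists=lambda p: p in available,
)
print(source, path)
```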
2.7 Evaluation metrics
python3 tools/eval.py -c configs/det/det_mv3_db.yml -o Global.checkpoints="./output/db_mv3/best_accuracy" PostProcess.box_thresh=0.6 PostProcess.unclip_ratio=1.5
In the command above, each of the two PostProcess options must be preceded by a space. The official tutorial omits the spaces, which causes the following error:
Traceback (most recent call last):
File "tools/eval.py", line 70, in <module>
config, device, logger, vdl_writer = program.preprocess()
File "/paddle/PaddleOCR/tools/program.py", line 369, in preprocess
FLAGS = ArgsParser().parse_args()
File "/paddle/PaddleOCR/tools/program.py", line 49, in parse_args
args.opt = self._parse_opt(args.opt)
File "/paddle/PaddleOCR/tools/program.py", line 60, in _parse_opt
k, v = s.split('=')
ValueError: too many values to unpack (expected 2)
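The traceback points at k, v = s.split('='), which only works when each option contains exactly one "=". The sketch below reproduces that behavior in isolation. The function parse_opt is a simplified stand-in written for this illustration, not the code from tools/program.py.

```python
def parse_opt(tokens):
    """Parse 'key=value' tokens the same way the failing line does."""
    config = {}
    for s in tokens:
        k, v = s.strip().split("=")  # fails if the token holds more than one '='
        config[k] = v
    return config


# With spaces, the shell passes three separate tokens.
good = [
    "Global.checkpoints=./output/db_mv3/best_accuracy",
    "PostProcess.box_thresh=0.6",
    "PostProcess.unclip_ratio=1.5",
]
# Without spaces, everything arrives as a single token with three '=' signs.
bad = ["".join(good)]

print(parse_opt(good))
try:
    parse_opt(bad)
except ValueError as err:
    print("ValueError:", err)
```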
2.8 Test results
The following test command from the official website failed for me:
python3 tools/infer_det.py -c configs/det/det_mv3_db.yml -o TestReader.infer_img="./doc/imgs_en/img_10.jpg" Global.checkpoints="./output/det_db/best_accuracy"
It reported the following error:
Traceback (most recent call last):
File "tools/infer_det.py", line 114, in <module>
config, device, logger, vdl_writer = program.preprocess()
File "/paddle/PaddleOCR/tools/program.py", line 371, in preprocess
merge_config(FLAGS.opt)
File "/paddle/PaddleOCR/tools/program.py", line 115, in merge_config
global_config.keys(), sub_keys[0])
AssertionError: the sub_keys can only be one of global_config: dict_keys(['Global', 'Architecture', 'Loss', 'Optimizer', 'PostProcess', 'Metric', 'Train', 'Eval']), but get: TestReader, please check your running command
At a glance, the TestReader.infer_img="./doc/imgs_en/img_10.jpg" parameter does not seem to be supported, so I edited infer_img directly on line 17 of PaddleOCR/configs/det/det_mv3_db.yml and set it to doc/imgs_en/img_10.jpg:
infer_img: doc/imgs_en/img_10.jpg
save_res_path: ./output/det_db/predicts_db.txt
After modifying PaddleOCR/configs/det/det_mv3_db.yml, remove TestReader.infer_img="./doc/imgs_en/img_10.jpg" from the test command:
python3 tools/infer_det.py -c configs/det/det_mv3_db.yml -o Global.checkpoints="./output/db_mv3/best_accuracy"
If you want to test all the images in a folder, change infer_img: doc/imgs_en/img_10.jpg in PaddleOCR/configs/det/det_mv3_db.yml to infer_img: doc/imgs_en/.
3. Train the text recognition model
3.1 Data preparation
After labeling with PPOCRLabel, the cropped sub-images and the corresponding txt file are generated. Each line of the txt file holds the path of an image and the text content of that image.
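A minimal sketch of reading such a recognition label file, assuming the image path and the text on each line are separated by a tab (an assumption, so verify against your own rec_gt.txt). The helper name parse_rec_label_line and the sample line are hypothetical.

```python
def parse_rec_label_line(line):
    """Split one recognition label line into (image_path, text)."""
    # Split only on the first tab so that the text itself may contain spaces.
    image_path, text = line.rstrip("\n").split("\t", 1)
    return image_path, text


# Hypothetical sample line, for illustration only.
sample = "crop_img/1_crop_0.jpg\thello world\n"
image_path, text = parse_rec_label_line(sample)
print(image_path)
print(text)
```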
3.2 Prepare the dictionary
Finally, you need to provide a dictionary ({word_dict_name}.txt) so that, during training, every character that appears can be mapped to a dictionary index. The dictionary therefore has to contain all the characters you want to be recognized correctly. {word_dict_name}.txt must be written in the following format and saved with utf-8 encoding:
l
d
a
d
r
n
word_dict.txt has one character per line, which maps each character to a numeric index; "and" will be mapped to [2 5 1]. ppocr/utils/ppocr_keys_v1.txt is a Chinese dictionary with 6623 characters, and ppocr/utils/ic15_dict.txt is an English dictionary with 36 characters. Here I merged ppocr_keys_v1.txt and ic15_dict.txt into one dictionary and named it chw_dict.txt. Then I changed the dictionary path in the configs/rec/rec_chinese_common_train_v2.0.yml file to ppocr/utils/chw_dict.txt and set character_type to ch. We also want to support recognition of the "space" class, so set the use_space_char field in the yml file to True. Note: use_space_char only takes effect when character_type=ch.
# for data or label process
character_dict_path: ppocr/utils/chw_dict.txt
character_type: ch
max_text_length: 25
infer_mode: False
use_space_char: True
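The dictionary handling described in this section can be sketched in a few lines: merging two character lists without duplicates, and mapping text to indices. This is an illustration that reproduces the [2 5 1] example from the text by keeping the first index of a repeated character. PaddleOCR's own label encoder may number characters differently, and the function names here are hypothetical.

```python
def merge_dictionaries(*char_lists):
    """Concatenate character lists, keeping the first occurrence of each character."""
    merged, seen = [], set()
    for chars in char_lists:
        for ch in chars:
            if ch not in seen:
                seen.add(ch)
                merged.append(ch)
    return merged


def build_index(chars):
    """Map each character to the index of the line it first appears on."""
    index = {}
    for i, ch in enumerate(chars):
        index.setdefault(ch, i)
    return index


def encode(text, index):
    """Convert text to a list of dictionary indices."""
    return [index[ch] for ch in text]


# The six-line sample dictionary shown in this section.
sample_dict = ["l", "d", "a", "d", "r", "n"]
print(encode("and", build_index(sample_dict)))  # [2, 5, 1]

# Merging two small stand-ins for ppocr_keys_v1.txt and ic15_dict.txt.
print(merge_dictionaries(["a", "b"], ["b", "c"]))  # ['a', 'b', 'c']
```

In practice the two lists would be read from the dictionary files line by line, and the merged list written to chw_dict.txt with utf-8 encoding, one character per line.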
3.3 Download the pre-trained model and configuration file
The official website offers only one download link for the model, so I went directly to GitHub to download the pre-trained model: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.0/doc/doc_ch/models_list.md#ocr%E6%A8%A1%E5%9E%8B%E5%88%97%E8%A1%A8v202021%E5%B9%B41%E6%9C%8820%E6%97%A5%E6%9B%B4%E6%96%B0
Download the pre-trained model you want, unzip it, and put it in the PaddleOCR/pretrain_models folder. Also download the corresponding configuration file and put it in the PaddleOCR/configs/rec folder. Here I use rec_chinese_common_train_v2.0.yml and the corresponding pre-trained model.
Chinese recognition model
Model name | Model introduction | Configuration file | Inference model size | Download link |
---|---|---|---|---|
ch_ppocr_mobile_v2.0_rec | Original ultra-lightweight model, supports Chinese, English, and digit recognition | rec_chinese_lite_train_v2.0.yml | 3.71M | Inference model / training model / pre-trained model |
ch_ppocr_server_v2.0_rec | General model, supports Chinese, English, and digit recognition | rec_chinese_common_train_v2.0.yml | 94.8M | Inference model / training model / pre-trained model |
Note: The training model is obtained by fine-tuning the pre-trained model on real data and vertical synthetic text data, and it performs better in real application scenarios. The pre-trained model is trained directly on the full set of real and synthetic data, which makes it more suitable for fine-tuning on your own data set.
English recognition model
Model name | Model introduction | Configuration file | Inference model size | Download link |
---|---|---|---|---|
en_number_mobile_v2.0_rec | Original ultra-lightweight model, supports English and digit recognition | rec_en_number_lite_train.yml | 2.56M | Inference model / training model |
3.4 Modify the configuration file
You need to change the training data and test data paths in the ./configs/rec/rec_chinese_common_train_v2.0.yml file to your own paths. If you are using a docker environment, make sure the paths you enter are the paths inside the docker container.
Train:
  dataset:
    name: SimpleDataSet
    data_dir: /paddle
    label_file_list: ["/paddle/rec_gt_train.txt"]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - RecAug:
      - CTCLabelEncode: # Class handling label
      - RecResizeImg:
          image_shape: [3, 32, 320]
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: True
    batch_size_per_card: 256
    drop_last: True
    num_workers: 8
Eval:
  dataset:
    name: SimpleDataSet
    data_dir: /paddle
    label_file_list: ["/paddle/rec_gt_test.txt"]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode: # Class handling label
      - RecResizeImg:
          image_shape: [3, 32, 320]
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
The following is the modification example given in the official documentation:
Global:
  ...
  # Modify image_shape to fit long text
  image_shape: [3, 32, 320]
  ...
  # Modify the character type
  character_type: ch
  # Add a custom dictionary; to change the dictionary, point this path to the new one
  character_dict_path: ./ppocr/utils/ppocr_keys_v1.txt
  # Add data augmentation during training
  distort: true
  # Recognize spaces
  use_space_char: true
  ...
  # Modify the reader type
  reader_yml: ./configs/rec/rec_chinese_reader.yml
  ...
...
Optimizer:
  ...
  # Add a learning rate decay strategy
  decay:
    function: cosine_decay
    # Number of iterations per epoch
    step_each_epoch: 20
    # Total number of training epochs
    total_epoch: 1000
3.5 Start training
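# GPU training supports single-card and multi-card training; select the cards with CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Start training with the Chinese lite recognition config
python3 tools/train.py -c configs/rec/rec_chinese_lite_train_v2.0.yml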
Error:
2021-03-26 03:19:23,583 - ERROR - DataLoader reader thread raised an exception!
Traceback (most recent call last):
File "tools/train.py", line 124, in <module>
main(config, device, logger, vdl_writer)
File "tools/train.py", line 97, in main
eval_class, pre_best_model_dict, logger, vdl_writer)
File "/paddle/PaddleOCR/tools/program.py", line 201, in train
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/dataloader_iter.py", line 684, in _get_data
data = self._data_queue.get(timeout=self._timeout)
File "/usr/lib/python3.7/multiprocessing/queues.py", line 105, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/dataloader_iter.py", line 616, in _thread_loop
batch = self._get_data()
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/dataloader_iter.py", line 700, in _get_data
"pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 8 workers exit unexpectedly, pids: 74939, 74940, 74941, 74942, 74943, 74944, 74945, 74946
for idx, batch in enumerate(train_dataloader):
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/dataloader_iter.py", line 779, in __next__
data = self._reader.read_next_var_list()
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
[Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:158)
I changed batch_size_per_card in the rec_chinese_lite_train_v2.0.yml file from 256 to 64, but it still reported an error:
Traceback (most recent call last):
File "tools/train.py", line 124, in <module>
main(config, device, logger, vdl_writer)
File "tools/train.py", line 97, in main
eval_class, pre_best_model_dict, logger, vdl_writer)
File "/paddle/PaddleOCR/tools/program.py", line 201, in train
for idx, batch in enumerate(train_dataloader):
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/dataloader_iter.py", line 779, in __next__
data = self._reader.read_next_var_list()
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/multiprocess_utils.py", line 134, in __handler__
core._throw_error_if_process_failed()
SystemError: (Fatal) DataLoader process (pid 1. If run DataLoader by DataLoader.from_generator(...), queue capacity is set by from_generator(..., capacity=xx, ...).
2. If run DataLoader by DataLoader(dataset, ...), queue capacity is set as 2 times of the max value of num_workers and len(places).
3. If run by DataLoader(dataset, ..., use_shared_memory=True), set use_shared_memory=False for not using shared memory.) exited is killed by signal: 78129.
It may be caused by insufficient shared storage space. This problem usually occurs when using docker as a development environment.
Please use command `df -h` to check the storage space of `/dev/shm`. Shared storage space needs to be greater than (DataLoader Num * DataLoader queue capacity * 1 batch data size).
You can solve this problem by increasing the shared storage space or reducing the queue capacity appropriately.
Bus error (at /paddle/paddle/fluid/imperative/data_loader.cc:161)
The error message above points to the cause: I am using a docker environment here, and the default size of shared memory (/dev/shm) in docker is 64M.
# df -h inside the docker environment
Filesystem Size Used Avail Use% Mounted on
overlay 3.5T 1.5T 1.9T 44% /
tmpfs 64M 0 64M 0% /dev
tmpfs 252G 0 252G 0% /sys/fs/cgroup
shm 64M 57M 7.6M 89% /dev/shm
/dev/sda2 3.5T 1.5T 1.9T 44% /paddle
tmpfs 252G 12K 252G 1% /proc/driver/nvidia
udev 252G 0 252G 0% /dev/nvidia0
tmpfs 252G 0 252G 0% /proc/acpi
tmpfs 252G 0 252G 0% /proc/scsi
tmpfs 252G 0 252G 0% /sys/firmware
# df -h outside the docker environment
Filesystem Size Used Avail Use% Mounted on
udev 252G 0 252G 0% /dev
tmpfs 51G 2.7M 51G 1% /run
/dev/sda2 3.5T 1.5T 1.9T 44% /
tmpfs 252G 0 252G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 252G 0 252G 0% /sys/fs/cgroup
So I deleted the current docker container and created it again, this time using --shm-size=252G to specify the shared memory size:
sudo nvidia-docker run --name ppocr -v $PWD:/paddle --shm-size=252G --network=host -itd paddlepaddle/paddle:2.0.1-gpu-cuda11.0-cudnn8 /bin/bash
For the specific docker commands, see: https://blog.csdn.net/u013171226/article/details/115132594 . After increasing the shared memory, I changed batch_size_per_card back to 256 and no error was reported.
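The error message states that shared memory must exceed (DataLoader Num * DataLoader queue capacity * 1 batch data size), with the queue capacity being 2 times the larger of num_workers and the number of devices. A rough estimate under stated assumptions is sketched below: one DataLoader, one device, float32 images of shape [3, 32, 320], labels ignored. The function names are hypothetical, and the numbers are only an order-of-magnitude guide.

```python
import os
import shutil

MIB = 1024 * 1024


def batch_bytes(batch_size, image_shape, bytes_per_value=4):
    """Size of one image batch in bytes (float32 assumed)."""
    count = batch_size
    for dim in image_shape:
        count *= dim
    return count * bytes_per_value


def required_shm_bytes(batch_size, image_shape, num_workers,
                       num_loaders=1, num_places=1):
    """Estimate from the formula quoted in the error message."""
    queue_capacity = 2 * max(num_workers, num_places)
    return num_loaders * queue_capacity * batch_bytes(batch_size, image_shape)


for bs in (256, 64):
    need = required_shm_bytes(bs, (3, 32, 320), num_workers=8)
    print(f"batch_size_per_card={bs}: about {need // MIB} MiB of shared memory")

# Compare with what the current environment actually provides.
if os.path.isdir("/dev/shm"):
    total = shutil.disk_usage("/dev/shm").total
    print(f"/dev/shm total: {total // MIB} MiB")
```

Under these assumptions the estimate is about 480 MiB for a batch size of 256 and about 120 MiB for 64. Both exceed the 64M docker default, which is consistent with the error persisting after the batch size was reduced.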
3.6 Evaluation
python3 tools/eval.py -c ./configs/rec/rec_chinese_lite_train_v2.0.yml -o Global.checkpoints=./output/rec_chinese_lite_v2.0/latest
3.7 Prediction
python3 tools/infer_rec.py -c ./configs/rec/rec_chinese_common_train_v2.0.yml -o Global.checkpoints=./output/rec_chinese_common_v2.0/best_accuracy Global.infer_img=doc/13_crop_4.jpg