The whole PaddleOCR process: labeling, training, prediction, and deployment

Note: All operations in this document are performed in a Windows 10 environment

Note: This document uses the PaddleOCR project code at version release/2.4

Reference documentation:

PaddleOCR official Chinese documentation: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/README_ch.md

PaddleOCR pdserving deployment official documentation: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/deploy/pdserving/README_CN.md

PaddleOCR hubserving deployment official documentation: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/deploy/hubserving/readme.md

PaddleOCR official labeling tool documentation: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/PPOCRLabel/README_ch.md

PaddleOCR table recognition official documentation: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/ppstructure/table/README_ch.md

Training and prediction datasets officially provided by PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/doc/doc_ch/datasets.md

PaddleOCR official training documentation: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/doc/doc_ch/training.md

1. Preparation

1.1. Clone the full PaddleOCR source code locally for debugging and development

[Recommended] git clone https://github.com/PaddlePaddle/PaddleOCR

If the clone fails due to network problems, you can use the mirror hosted on Gitee instead:

git clone https://gitee.com/paddlepaddle/PaddleOCR

1.2. Create a virtual environment

Create a virtual environment for the project that suits your own development machine, and activate it whenever you work on the project.

Recommended method for beginners:

  • Install a Python 3 development environment; Python 3.8 is recommended. You can download the Windows installer directly from the official Python website; if you are unsure how, search online
  • Install the virtualenv third-party package from a terminal: pip install virtualenv
  • Enter the project directory: cd .\PaddleOCR\
  • Create a virtual environment: virtualenv venv
  • Activate the virtual environment (under PowerShell): .\venv\Scripts\activate.ps1

1.3. Install project dependencies in the virtual environment

The requirements.txt maintained by PaddleOCR is missing some dependencies, so create a new dependency file newrequirements.txt and copy the following contents into it:

aiofiles==0.8.0
astor==0.8.1
Babel==2.9.1
backports.entry-points-selectable==1.1.1
bce-python-sdk==0.8.64
cachetools==5.0.0
certifi==2021.10.8
cffi==1.15.0
cfgv==3.3.1
chardet==4.0.0
charset-normalizer==2.0.9
click==7.1.2
colorama==0.4.4
colorlog==6.6.0
cryptography==36.0.1
cssselect==1.1.0
cssutils==2.3.0
cycler==0.11.0
Cython==0.29.26
decorator==5.1.0
dill==0.3.4
distlib==0.3.4
easydict==1.9
et-xmlfile==1.1.0
fasttext==0.9.1
filelock==3.4.2
flake8==4.0.1
Flask==1.1.4
Flask-Babel==2.0.0
fonttools==4.28.5
func-timeout==4.3.5
future==0.18.2
grpcio==1.33.2
grpcio-tools==1.33.2
h5py==3.6.0
httptools==0.3.0
identify==2.4.0
idna==3.3
imageio==2.13.5
imgaug==0.4.0
iopath==0.1.9
itsdangerous==1.1.0
jieba==0.42.1
Jinja2==2.11.3
joblib==1.1.0
kiwisolver==1.3.2
layoutparser==0.3.2
lmdb==1.2.1
lxml==4.7.1
MarkupSafe==1.1.1
matplotlib==3.5.1
mccabe==0.6.1
multidict==5.2.0
multiprocess==0.70.12.2
networkx==2.6.3
nodeenv==1.6.0
numpy==1.19.3
onnx==1.9.0
opencv-contrib-python==4.4.0.46
opencv-python==4.2.0.32
openpyxl==3.0.9
packaging==21.3
paddle-serving-server==0.5.0
paddle-serving-server-gpu @ file:///D:/aeas/PaddleOCR/paddle_serving_server_gpu-0.7.0.post102-py3-none-any.whl
paddle2onnx==0.9.0
paddlehub==2.2.0
paddlenlp==2.2.2
paddleocr==2.3.0.2
paddlepaddle==2.2.1
pandas==1.3.5
pdf2image==1.16.0
pdfminer.six==20211012
pdfplumber==0.6.0
Pillow==8.4.0
platformdirs==2.4.1
portalocker==2.3.2
pre-commit==2.16.0
premailer==3.10.0
protobuf==3.19.1
pybind11==2.8.1
pyclipper==1.3.0.post2
pycodestyle==2.8.0
pycparser==2.21
pycryptodome==3.12.0
pyflakes==2.4.0
pyparsing==3.0.6
PyQt5==5.15.6
PyQt5-Qt5==5.15.2
PyQt5-sip==12.9.0
python-dateutil==2.8.2
python-Levenshtein==0.12.2
pytz==2021.3
PyWavelets==1.2.0
pywin32==303
PyYAML==6.0
pyzmq==22.3.0
rarfile==4.0
requests==2.26.0
sanic==21.12.0
sanic-routing==0.7.2
scikit-image==0.19.1
scikit-learn==1.0.2
scipy==1.7.3
sentencepiece==0.1.92
seqeval==1.2.2
Shapely==1.8.0
shellcheck-py==0.8.0.3
six==1.16.0
threadpoolctl==3.0.0
tifffile==2021.11.2
toml==0.10.2
tqdm==4.62.3
typing_extensions==4.0.1
urllib3==1.26.7
virtualenv==20.10.0
visualdl==2.2.2
Wand==0.6.7
websockets==10.1
Werkzeug==1.0.1

To avoid installation failures caused by network issues, specify the Aliyun PyPI mirror:

pip install -r newrequirements.txt -i https://mirrors.aliyun.com/pypi/simple/

1.4. Download the official models for secondary training, prediction, and deployment

The officially recommended models, which are used throughout this document:

Text Detection Model

Text Recognition Model

Orientation Classifier Model

Create a folder named inference in the PaddleOCR directory, enter it, and copy all the downloaded model archives into it.

Extract the models:

tar xf ch_PP-OCRv2_det_infer.tar
tar xf ch_PP-OCRv2_rec_infer.tar
tar xf ch_ppocr_mobile_v2.0_cls_infer.tar

2. How to label data

2.1. Raw data preparation

Here we take images of the front and back of ID cards as an example. Download all the data that needs to be labeled in advance. When a single folder contains a lot of data, it is recommended to group the images 500 per folder, and make sure the directory names contain no Chinese characters, e.g. train0-499.
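If there are many images, the grouping can be scripted. Below is a minimal Python sketch, assuming the raw images sit in a flat source folder named raw_images (a hypothetical name); it moves them into folders of 500 named train0-499, train500-999, and so on:

# group_images.py -- minimal sketch for grouping raw images 500 per folder
# assumes a flat source folder raw_images/ (hypothetical name, adjust as needed)
import os
import shutil

SRC = 'raw_images'
GROUP_SIZE = 500

images = sorted(f for f in os.listdir(SRC)
                if f.lower().endswith(('.jpg', '.jpeg', '.png')))
for i, name in enumerate(images):
    start = (i // GROUP_SIZE) * GROUP_SIZE
    # directory names like train0-499, with no Chinese characters
    dst_dir = f'train{start}-{start + GROUP_SIZE - 1}'
    os.makedirs(dst_dir, exist_ok=True)
    shutil.move(os.path.join(SRC, name), os.path.join(dst_dir, name))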

2.2. Tool preparation

Open the project cloned in section 1, make sure all dependencies are installed, and enter the PPOCRLabel directory:

cd .\PPOCRLabel\

Modify the PPOCRLabel.py file under PPOCRLabel as follows:

1. Change self.autoSaveNum = 5 on line 159 to self.autoSaveNum = 1, so that the labeling tool automatically saves the annotations every time one is completed
2. Change line 1893's if self.noLabelText == shape.label or result[1][0] == shape.label: to the following code, to prevent the tool from crashing:
if len(result) < 2:
    print('No data recognized')
if self.noLabelText == shape.label or (len(result) >= 2 and result[1][0] == shape.label):

Start the labeling tool; --lang=ch specifies Chinese, and the default is English

When it starts, the labeling tool downloads the official inference models to a system folder; the default location is usually C:\Users\Administrator\.paddleocr\2.3.0.2\ocr

Don't worry if you can't find it: the path is printed when the labeling tool starts. Record this path; it is needed later when evaluating the recognition rate of the secondary-trained model.

python .\PPOCRLabel.py --lang=ch

A labeling window then opens automatically. Click File --> Open Dir --> select the directory you need to label and confirm.

Once the files are loaded, click the automatic labeling button in the lower left corner of the tool to start the automatic labeling process; the official OCR models then run detection and recognition and produce the annotations.

2.3. Confirm the labeled content

When automatic labeling finishes, return to the first file. If a confirmation dialog pops up at this point, click Cancel; otherwise the current file (the last one) will be treated as confirmed.

Inspection process:

  • Starting from the first file, check whether each image's label boxes are positioned correctly and whether the recognized content is accurate
  • If a label box is wrong, adjust it, or delete it and label it again manually; after manual labeling, click the re-recognition button in the upper right corner
  • After re-recognition, check whether the result is correct; if it is wrong, correct it manually
  • Once the boxes and recognition results are all correct, click the confirm button in the lower right corner to complete the annotation of that image
  • Shortcut keys: W starts a rectangular annotation, Q starts a four-point annotation

When labeling is complete, the label file Label.txt, which can be used directly for model training, is generated automatically and can be found in the labeled directory.
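Before training, it is worth spot-checking the generated file. The following minimal sketch (the file name Label.txt is what PPOCRLabel writes; everything else is an assumption) parses each line -- an image path, a tab, and a json.dumps-encoded list of boxes -- and flags images with no boxes:

# check_labels.py -- minimal sketch for sanity-checking a PPOCRLabel Label.txt
import json

with open('Label.txt', encoding='utf-8') as fh:
    for line in fh:
        img_path, annots = line.rstrip('\n').split('\t')
        boxes = json.loads(annots)
        if not boxes:
            print(f'WARNING: no boxes for {img_path}')
        else:
            print(f'{img_path}: {len(boxes)} boxes')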

Guidelines for drawing label boxes:

  • For text that wraps across lines, use one label box per line
  • If the gap between pieces of text is too large, split them into separate boxes
  • Keep continuous text in a single box; for example, on an ID card the name field label and the actual name should not be boxed separately

Annotated labeling examples:

ID card front: (image omitted)

ID card back: (image omitted)

3. How to perform secondary training (example: secondary training of ID card front text detection)

3.1. Data preparation

Take the labeled data from step 2. Suppose your dataset directory is named idcard_front; it needs to be divided into training and test images, roughly 80% for training and 20% for testing.

Process training and test data:

  • Create a text_localization folder inside idcard_front, then create idcard_front_train_imgs and idcard_front_test_imgs folders inside text_localization, and create two files named train_label.txt and test_label.txt

  • Move the first 80% of the images in idcard_front into the idcard_front_train_imgs directory

  • Move the remaining 20% of the images in idcard_front into the idcard_front_test_imgs directory

  • Move the labels for the training images from Label.txt in idcard_front into train_label.txt, and fix the image paths in the file to match the new layout

    The annotation file format is as follows, with the two fields separated by "\t":

    "image file name                     annotation info encoded with json.dumps"
    idcard_front_train_imgs/img_1.jpg	[{"transcription": "MASA", "points": [[310, 104], [416, 141], [418, 216], [312, 179]]}, {...}]
    
  • The labels for the test images are processed the same way

  • Finally, delete the Label.txt file, then move the entire idcard_front folder into the PaddleOCR/train_data/ directory (this split can also be scripted; see the sketch after the directory layout below)

After processing, text_localization contains two folders and two files, and the idcard_front dataset should be organized as follows:

/PaddleOCR/train_data/idcard_front/text_localization/
  └─ idcard_front_train_imgs/    training images of the idcard_front dataset
  └─ idcard_front_test_imgs/     test images of the idcard_front dataset
  └─ train_label.txt             training labels of the idcard_front dataset
  └─ test_label.txt              test labels of the idcard_front dataset
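The steps above can be automated. Below is a minimal Python sketch, assuming it is run from inside idcard_front/ with the images and Label.txt directly in that folder (all paths here are assumptions); it performs the 80/20 split, moves the images, and rewrites the image paths so they are relative to the data_dir set in the training config:

# split_dataset.py -- minimal sketch of the 80/20 split described above
import os
import shutil

TRAIN_DIR = 'text_localization/idcard_front_train_imgs'
TEST_DIR = 'text_localization/idcard_front_test_imgs'
os.makedirs(TRAIN_DIR, exist_ok=True)
os.makedirs(TEST_DIR, exist_ok=True)

with open('Label.txt', encoding='utf-8') as fh:
    lines = fh.readlines()

split = int(len(lines) * 0.8)  # first 80% for training, the rest for testing
with open('text_localization/train_label.txt', 'w', encoding='utf-8') as train_fh, \
     open('text_localization/test_label.txt', 'w', encoding='utf-8') as test_fh:
    for i, line in enumerate(lines):
        img_path, annots = line.rstrip('\n').split('\t')
        img_name = os.path.basename(img_path)
        dst_dir, out = (TRAIN_DIR, train_fh) if i < split else (TEST_DIR, test_fh)
        shutil.move(img_name, os.path.join(dst_dir, img_name))
        # rewrite the path relative to text_localization/, matching the yml data_dir
        out.write(f'{os.path.basename(dst_dir)}/{img_name}\t{annots}\n')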

3.2. Preparation for training

Download the pretrained model:

Text detection pre-trained model download address

Prepare the pretrained model:

Create a folder named pretrain_models in PaddleOCR, then copy the downloaded pre-trained model into the pretrain_models directory.

Modify the training configuration file, located at PaddleOCR/configs/det/det_mv3_db.yml:

Global:
  use_gpu: false # 1. Turn this off if you are training on CPU
  ....
  ....
Train:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/idcard_front/text_localization/ # change to your training data directory
    label_file_list:
      - ./train_data/idcard_front/text_localization/train_label.txt # change to your training label file
    ....
    ....
Eval:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/idcard_front/text_localization/ # change to your test data directory
    label_file_list:
      - ./train_data/idcard_front/text_localization/test_label.txt # change to your test label file
    ....
    ....

3.3. Start training

# single-machine, single-GPU training of the mv3_db model
python tools/train.py -c configs/det/det_mv3_db.yml -o Global.pretrained_model=./pretrain_models/MobileNetV3_large_x0_5_pretrained

# single-machine, multi-GPU training; set the GPU IDs to use with the --gpus flag
python -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/det/det_mv3_db.yml \
     -o Global.pretrained_model=./pretrain_models/MobileNetV3_large_x0_5_pretrained

# multi-machine, multi-GPU training; set the machine IPs with --ips and the GPU IDs with --gpus
python -m paddle.distributed.launch --ips="xx.xx.xx.xx,xx.xx.xx.xx" --gpus '0,1,2,3' tools/train.py -c configs/det/det_mv3_db.yml \
     -o Global.pretrained_model=./pretrain_models/MobileNetV3_large_x0_5_pretrained

When training completes, the trained model is saved under the ./output/db_mv3/ directory; the most recent checkpoint is named latest.

3.4. Export the trained model as a prediction model

python tools/export_model.py -c configs/det/det_mv3_db.yml -o Global.pretrained_model=./output/db_mv3/latest Global.save_inference_dir=./inference

After the run is complete, the prediction model will be output to the ./inference directory
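As a quick sanity check of the exported model, you can point the paddleocr package (already in newrequirements.txt) at the new inference directory. A sketch, assuming a sample image test.jpg that you supply yourself:

# quick sanity check of the exported detection model -- a sketch, not part of
# the official workflow; test.jpg is a hypothetical sample image
from paddleocr import PaddleOCR

ocr = PaddleOCR(det_model_dir='./inference', lang='ch')
boxes = ocr.ocr('test.jpg', rec=False)  # run detection only
print(boxes)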

3.5. How to evaluate the recognition rate of the secondary-trained model

  • First prepare a batch of original images that were never used for training or testing and have not been labeled, and copy them into two folders.
  • Open the first folder with the labeling tool; after automatic labeling finishes, close the tool.
  • Replace the text detection model used by the labeling tool with the secondary-trained model, at the path recorded earlier: C:\Users\Administrator\.paddleocr\2.3.0.2\ocr
  • Restart the labeling tool and open the second folder for automatic labeling.
  • Open another labeling tool window, open the first folder, compare the annotations of the two folders, and judge the recognition rate of the new model from the actual results (a rough comparison sketch follows this list).
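Rather than eyeballing every file, a rough comparison script can count how many images received identical annotations from the two models. A minimal sketch, assuming the same images were copied into both folders and each folder now contains a Label.txt (the folder names below are hypothetical):

# compare_labels.py -- rough sketch for comparing the two labeling runs above
import json
import os

def load(path):
    labels = {}
    with open(path, encoding='utf-8') as fh:
        for line in fh:
            img, annots = line.rstrip('\n').split('\t')
            # index by file name so the two folders can be matched up
            labels[os.path.basename(img)] = sorted(
                obj['transcription'] for obj in json.loads(annots))
    return labels

baseline = load('folder1/Label.txt')   # labeled with the official model
candidate = load('folder2/Label.txt')  # labeled with the secondary-trained model
same = sum(1 for k in baseline if candidate.get(k) == baseline[k])
print(f'identical annotations on {same}/{len(baseline)} images')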

4. Deploying the recognition service with pdserving under Linux

A deployment code repository verified to work while writing this document: https://gitee.com/aeasringnar/pdserving.git

4.1. Model conversion

Convert the prediction models into Serving deployment models:

# enter the deployment directory
cd ./deploy/pdserving/
# convert the detection model
python -m paddle_serving_client.convert --dirname ./ch_PP-OCRv2_det_infer/ --model_filename inference.pdmodel --params_filename inference.pdiparams --serving_server ./ppocrv2_det_serving/ --serving_client ./ppocrv2_det_client/
# convert the recognition model
python -m paddle_serving_client.convert --dirname ./ch_PP-OCRv2_rec_infer/ --model_filename inference.pdmodel --params_filename inference.pdiparams --serving_server ./ppocrv2_rec_serving/ --serving_client ./ppocrv2_rec_client/

After the detection model conversion completes successfully, two additional folders, ppocrv2_det_serving and ppocrv2_det_client, appear in the current directory, with the following layout:

|- ppocrv2_det_serving/
  |- __model__  
  |- __params__
  |- serving_server_conf.prototxt  
  |- serving_server_conf.stream.prototxt

|- ppocrv2_det_client
  |- serving_client_conf.prototxt  
  |- serving_client_conf.stream.prototxt

The recognition model is the same.

4.2. Configuration file

Configuration file location: ./config.yml

Adjust the concurrency settings in config.yml to reach the maximum QPS; in general the detection-to-recognition concurrency ratio should be about 2:1.

# RPC port. rpc_port and http_port may not both be empty. When rpc_port is empty and http_port is not, rpc_port is automatically set to http_port + 1
rpc_port: 18091
# HTTP port. rpc_port and http_port may not both be empty. When rpc_port is available and http_port is empty, no http_port is generated automatically
http_port: 9998
# worker_num: maximum concurrency. When build_dag_each_worker=True, the framework creates worker_num processes, each building its own gRPC server and DAG
## When build_dag_each_worker=False, the framework sets max_workers=worker_num on the main thread's gRPC thread pool
worker_num: 2
# build_dag_each_worker: False, the framework creates one DAG inside the process; True, the framework creates multiple independent DAGs in each process
build_dag_each_worker: False
dag:
    # op resource type: True for the thread model; False for the process model
    is_thread_op: True
    # number of retries
    retry: 10
    # profiling: True generates Timeline performance data (with some performance cost); False disables it
    use_profile: False
    tracer:
        interval_s: 10
op:
    det:
        # concurrency: thread-level when is_thread_op=True, otherwise process-level
        concurrency: 2
        # when the op config has no server_endpoints, the local service config is read from local_service_conf
        local_service_conf:
            # client type: brpc, grpc, or local_predictor; local_predictor does not start a Serving service and predicts in-process
            client_type: local_predictor
            # det model path
            model_config: ./ppocrv2_det_serving
            # fetch result list, based on the alias_name of fetch_var in client_config
            fetch_list: ["save_infer_model/scale_0.tmp_1"]
            # compute device IDs: CPU prediction when devices is "" or unset; GPU prediction on the listed cards when devices is "0" or "0,1,2"
            devices: ""
            ir_optim: True
    rec:
        # concurrency: thread-level when is_thread_op=True, otherwise process-level
        concurrency: 1
        # timeout in ms
        timeout: -1
        # Serving retry count; no retries by default
        retry: 1
        # when the op config has no server_endpoints, the local service config is read from local_service_conf
        local_service_conf:
            # client type: brpc, grpc, or local_predictor; local_predictor does not start a Serving service and predicts in-process
            client_type: local_predictor
            # rec model path
            model_config: ./ppocrv2_rec_serving
            # fetch result list, based on the alias_name of fetch_var in client_config
            fetch_list: ["save_infer_model/scale_0.tmp_1"]
            # compute device IDs: CPU prediction when devices is "" or unset; GPU prediction on the listed cards when devices is "0" or "0,1,2"
            devices: ""
            ir_optim: True

4.3. Start the service

To start the service, run the following command:

# start the service; runtime logs are written to log.txt
python web_service.py &>log.txt &

4.4. Parsing and recognition service

The parsing service is built on Sanic; the recognition and parsing flow for the front and back of an ID card is as follows:

# api_server.py
from sanic import Sanic
from sanic.response import text
from sanic.response import json
import requests
import cv2
import base64
import time
import json as Json
import re


app = Sanic("ocr-server")


def hub_predict(img):
    data = {'images': [img]}
    headers = {"Content-type": "application/json"}
    url = "http://127.0.0.1:8866/predict/ocr_system"
    start_time = time.time()
    r = requests.post(url=url, headers=headers, data=Json.dumps(data))
    print(f'Recognition took {time.time() - start_time}s')
    # return the prediction results
    return r.json().get('results')

def pd_predict(img):
    data = {"key": ["image"], "value": [img]}
    headers = {"Content-type": "application/json"}
    url = "http://127.0.0.1:9998/ocr/prediction"
    r = requests.post(url=url, headers=headers, data=Json.dumps(data))
    values = r.json().get('value')
    return eval(values[0])


@app.post("/hub/idcard/predict")
async def hello_world(request):
    res = {
    
    
        'msg': 'ok',
        'code': 0,
        'data': {
    
    }
    }
    if len(request.files.keys()) > 1:
        res['msg'] = '单次只能传入一个文件'
        res['code'] = 1
        return json(res)
    f = request.files.get(list(request.files.keys())[0])
    img = base64.b64encode(f.body).decode('utf8')
    predict_res = hub_predict(img) 
    return_dict = res['data']
    is_back = False
    for item in ['中华人民共和国', '居民身份证', '签发机关', '有效期限']:
        if item in [obj['text'] for obj in predict_res[0]]:
            is_back = True
            break
    return_dict['is_back'] = is_back
    if is_back:
        for obj in predict_res[0]:
            item = obj['text']
            print(item)
            if '-' in item:
                return_dict['date'] = item.replace('有效期限', '')
            elif item.replace(' ', '') in ['中华人民共和国', '居民身份证', '签发机关', '有效期限']:
                continue
            else:
                return_dict['sign'] = item.replace('签发机关', '')
        return json(res)
    for obj in predict_res[0]:
        item = obj['text']
        print(item)
        if '姓名' in item:
            return_dict['name'] = item.replace('姓名', '')
        elif len(item) == 18 and (item.isnumeric() or item[:-1].isnumeric()):
            return_dict['idNo'] = item
        elif '公民身份号码' in item:
            return_dict['idNo'] = item.replace('公民身份号码', '').replace(' ', '')
        elif '性别' in item:
            return_dict['gender'] = item.replace('性别', '')
        elif '民族' in item:
            return_dict['nation'] = item.replace('民族', '')
        elif '出生' in item:
            return_dict['birthday'] = '-'.join(re.findall( r'\d+', item, re.M|re.I))
        elif '住址' in item:
            return_dict['address'] = item.replace('住址', '')
        elif item == '公民身份号码':
            continue
        elif item.replace(' ', '') in ['中华人民共和国', '居民身份证', '签发机关', '有效期限']:
            continue
        else:
            address = return_dict.get('address', '')
            address += item
            return_dict['address'] = address
    return json(res)

@app.post("/pd/idcard/predict")
async def hello_world(request):
    res = {
    
    
        'msg': 'ok',
        'code': 0,
        'data': {
    
    }
    }
    if len(request.files.keys()) > 1:
        res['msg'] = '单次只能传入一个文件'
        res['code'] = 1
        return json(res)
    f = request.files.get(list(request.files.keys())[0])
    img = base64.b64encode(f.body).decode('utf8')
    predict_res = pd_predict(img)
    return_dict = res['data']
    is_back = False
    for item in ['中华人民共和国', '居民身份证', '签发机关', '有效期限']:
        if item in predict_res:
            is_back = True
            break
    return_dict['is_back'] = is_back
    if is_back:
        for item in predict_res:
            print(item)
            if '-' in item:
                return_dict['date'] = item.replace('有效期限', '')
            elif item.replace(' ', '') in ['中华人民共和国', '居民身份证', '签发机关', '有效期限']:
                continue
            else:
                return_dict['sign'] = item.replace('签发机关', '')
        return json(res)
    for item in predict_res:
        print(item)
        if '姓名' in item:
            return_dict['name'] = item.replace('姓名', '')
        elif len(item) == 18 and (item.isnumeric() or item[:-1].isnumeric()):
            return_dict['idNo'] = item
        elif '公民身份号码' in item:
            return_dict['idNo'] = item.replace('公民身份号码', '').replace(' ', '')
        elif '性别' in item:
            return_dict['gender'] = item.replace('性别', '')
        elif '民族' in item:
            return_dict['nation'] = item.replace('民族', '')
        elif '出生' in item:
            return_dict['birthday'] = '-'.join(re.findall( r'\d+', item, re.M|re.I))
        elif '住址' in item:
            return_dict['address'] = item.replace('住址', '')
        elif item == '公民身份号码':
            continue
        elif item.replace(' ', '') in ['中华人民共和国', '居民身份证', '签发机关', '有效期限']:
            continue
        else:
            address = return_dict.get('address', '')
            address += item
            return_dict['address'] = address
    return json(res)


if __name__ == "__main__":
    app.run(host='0.0.0.0', port=8080, debug=True)

Start the service:

python api_server.py

How to test:

Use Postman to test the /pd/idcard/predict endpoint of the parsing service: send a POST request with an image file in the request body to perform recognition.
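If you prefer a script to Postman, here is a minimal client sketch using requests (assuming the Sanic service from section 4.4 runs locally on port 8080, and test.jpg is a hypothetical sample image):

# test_client.py -- minimal sketch of calling the parsing service
import requests

url = 'http://127.0.0.1:8080/pd/idcard/predict'
with open('test.jpg', 'rb') as fh:
    # send the image as a multipart file upload, matching request.files in the handler
    resp = requests.post(url, files={'file': fh})
print(resp.json())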


Source: https://blog.csdn.net/haeasringnar/article/details/122936537