Recognizing PDFs with PaddleOCR

Recently I needed to write a function that reads the contents of documents, and I found a useful Python library for it. Getting it to work was fairly tricky, so I am recording the steps here.

1. Prepare the Python environment

2. Install the dependencies

# Install the dependencies
# After installing pywt you may need to restart the kernel
pip install pywt -i https://mirror.baidu.com/pypi/simple

pip install "paddleocr>=2.2" --no-deps -r requirements.txt
# Install the layout parsing dependency
pip install -U https://paddleocr.bj.bcebos.com/whl/layoutparser-0.0.0-py3-none-any.whl

pip install PyMuPDF==1.20.2

pip install Flask  # only needed for the web service

pip install paddleocr==2.6.0.1

Specific versions are pinned here because mismatched versions caused quite a few pitfalls earlier on.
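Since the pinned versions matter, it can help to confirm what actually ended up installed. The following is only a minimal sketch using the standard library's `importlib.metadata`. The `PINNED` dictionary and the `find_mismatches` helper are names made up for this illustration, and the versions are the ones from the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install commands above (illustrative table).
PINNED = {"paddleocr": "2.6.0.1", "PyMuPDF": "1.20.2"}


def find_mismatches(pinned):
    """Return {package: installed_version_or_None} for every package that
    is missing or installed at a version other than the pinned one."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    for name, installed in find_mismatches(PINNED).items():
        print(f"{name}: expected {PINNED[name]}, found {installed}")
```

If the script prints nothing, the installed versions match the pinned ones.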

Finally, here is the Python code

# -*- coding=utf-8 -*-
from flask import Flask, jsonify
from flask import request
import fitz
from paddleocr import PaddleOCR
import time
app = Flask(__name__)
ocr = PaddleOCR(use_angle_cls=True, use_gpu=False)


@app.route("/resume", methods=['POST'])
def convertText():
    start_time = time.time()
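    # Read the uploaded PDF from the multipart form field named "file"
    file = request.files.get('file')
    if file is None:
        return jsonify({"error": "missing form field 'file'"}), 400
    result = []
    # Open the PDF from the uploaded bytes instead of from a path on disk
    pdfdoc = fitz.open(stream=file.read(), filetype="pdf")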
    for pg in range(pdfdoc.page_count):
        page = pdfdoc[pg]
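        rotate = 0
        # A zoom factor of 2 on each axis doubles the width and height,
        # so the rendered image has four times as many pixels.
        zoom_x = 2.0
        zoom_y = 2.0
        trans = fitz.Matrix(zoom_x, zoom_y).prerotate(rotate)
        pm = page.get_pixmap(matrix=trans, alpha=False)
        # Use the public save() API; the format is inferred from the extension
        pm.save('temp.png')

        # Run OCR on the rendered page
        page_result = ocr.ocr('temp.png', cls=True)
        result.append(page_result)
        end_time = time.time()  # time at which this page finished
        run_time = end_time - start_time  # elapsed time so far, in seconds
        print(run_time)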
    return jsonify({"data": result})


if __name__ == "__main__":
    app.config['JSON_AS_ASCII'] = False
    app.run(host='0.0.0.0',port=8059)
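The endpoint returns the raw OCR result, which is a deeply nested list. The sketch below shows one way to pull out just the recognized text. It assumes each recognized line looks like `[box, (text, confidence)]`, and it walks the nesting recursively because the exact depth may differ between PaddleOCR versions. `extract_lines` is a hypothetical helper and the sample data is invented for the example.

```python
def extract_lines(node):
    """Walk a nested OCR result and return a list of (text, confidence) pairs.

    Assumption: a recognized line is a 2-item sequence whose second item is
    a (text, confidence) pair. Anything else is treated as a container.
    """
    lines = []
    if isinstance(node, (list, tuple)):
        is_line = (
            len(node) == 2
            and isinstance(node[1], (list, tuple))
            and len(node[1]) == 2
            and isinstance(node[1][0], str)
        )
        if is_line:
            text, confidence = node[1]
            lines.append((text, float(confidence)))
        else:
            # Containers (pages, images) are searched recursively;
            # non-sequence values such as None are skipped.
            for child in node:
                lines.extend(extract_lines(child))
    return lines


# Invented sample: one page with two lines, then an empty page.
sample = [
    [
        [[[10.0, 10.0], [90.0, 10.0], [90.0, 30.0], [10.0, 30.0]], ("Resume", 0.99)],
        [[[10.0, 40.0], [120.0, 40.0], [120.0, 60.0], [10.0, 60.0]], ["Zhang San", 0.95]],
    ],
    None,
]

if __name__ == "__main__":
    for text, confidence in extract_lines(sample):
        print(f"{confidence:.2f}  {text}")
```

The helper accepts both tuples and lists for the `(text, confidence)` pair, so it also works on the client side, where JSON decoding has turned every tuple into a list.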

1. Flask is used here as the web framework because it makes it quick to stand up a service.

2. The JSON_AS_ASCII setting is there because of an encoding problem. By default, non-ASCII characters are escaped when the result is converted to JSON, so Chinese text comes back looking garbled. Setting it to False turns off ASCII-only encoding.

3. The host also needs to be set to 0.0.0.0. If it is not specified, the service cannot be reached from other machines on the LAN.
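The effect of the ASCII setting can be seen without Flask. This small sketch uses the standard `json` module, whose `ensure_ascii` flag is assumed here to be the behavior that the Flask setting controls. The `payload` value is just an example.

```python
import json

payload = {"data": "简历"}

# Default behavior: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(payload)

# With ASCII-only output turned off, the characters are written as-is.
readable = json.dumps(payload, ensure_ascii=False)

if __name__ == "__main__":
    print(escaped)
    print(readable)
```

Both strings decode to the same data, so the escaped form is still valid JSON. It is only hard to read when inspected by eye.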


Origin blog.csdn.net/qq_38623939/article/details/128000652