Source code walkthrough: how borb recognizes text in PDF images and writes it back to the PDF as an OCG (optional content group) layer

The main purpose is to follow borb's source code further and understand how it draws the layer.
Some knowledge of the PDF format specification is needed.
Preparation (macOS environment):

  1. Create a Python virtual environment
$ python3.9 -m venv ./venv
$ cd venv/bin
# Enter the bin directory and activate the environment
$ source activate
  1. Install the libraries used
$ cat requirements.txt
borb==2.1.7
certifi==2022.12.7
charset-normalizer==2.1.1
fonttools==4.38.0
idna==3.4
lxml==4.9.2
packaging==22.0
Pillow==9.3.0
pytesseract==0.3.10
python-barcode==0.14.0
qrcode==7.3.1
requests==2.28.1
urllib3==1.26.13
$ pip install -r requirements.txt -i https://mirrors.ustc.edu.cn/pypi/web/simple/
  1. The demo below is taken from another article and has been rearranged here
import typing
from decimal import Decimal
from io import BytesIO
from pathlib import Path

import requests
from PIL import Image as PILImage  # type: ignore[import]
from PIL import ImageDraw, ImageFont

from borb.pdf import (
    Document,
    SingleColumnLayout,
    Paragraph,
    PageLayout,
    Page,
    PDF,
)
from borb.pdf.canvas.layout.image.image import Image
from borb.pdf.canvas.layout.layout_element import Alignment
from borb.toolkit.ocr.ocr_as_optional_content_group import OCRAsOptionalContentGroup
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction


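def download_image() -> PILImage.Image:
    # Download the picture and keep a local copy under pic/
    req = requests.get("https://xxx/2022/11/ba916990aaa049a78fc1a5cb7a606924.png")
    req.raise_for_status()
    image: PILImage.Image = PILImage.open(BytesIO(req.content))
    lower = image.format.lower()
    Path("pic").mkdir(exist_ok=True)  # make sure the output directory exists
    image.save('pic/' + '42345234' + '.' + lower)
    return image

def load_image() -> PILImage.Image:
    # Load a local picture and shrink it to one third of its size
    image = PILImage.open('pic/44444.png')
    w, h = image.size
    print(image.size)
    print(int(w/3), int(h/3))
    # ANTIALIAS was removed in Pillow 10; Resampling.LANCZOS is the same filter
    image = image.resize((int(w/3), int(h/3)), PILImage.Resampling.LANCZOS)
    return image

def create_image() -> PILImage.Image: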
    # Create new Image
    img = PILImage.new("RGB", (256, 256), color=(255, 255, 255))

    # Create ImageFont
    # CAUTION: you may need to adjust the path to your particular font directory
    font = ImageFont.truetype("Arial.ttf", 24)

    # Draw text
    draw = ImageDraw.Draw(img)
    draw.text((10, 10),
              "Hello World!",
              fill=(0, 0, 0),
              font=font)

    # Return
    return img
# Main method to create the document
def create_document():

    # Create Document
    d: Document = Document()

    # Create/add Page
    p: Page = Page()
    d.add_page(p)

    # Set PageLayout
    l: PageLayout = SingleColumnLayout(p)

    # Add Paragraph
    l.add(Paragraph("Lorem Ipsum"))

    # Add Image
    l.add(Image(create_image()))
    # l.add(Image(load_image()))
    # l.add(Image(download_image()))
    # l.add(Image(
    #             "https://xxx/licenseTest/2022/11/ba916990aaa049a78fc1a5cb7a606924.png",
    #             width=Decimal(256),
    #             height=Decimal(256),
    #             horizontal_alignment=Alignment.CENTERED,
    #         ))

    # Write
    with open("output_001.pdf", "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, d)

def apply_ocr_to_document():

    # Set up everything for OCR
    tesseract_data_dir: Path = Path("tessdata/")
    assert tesseract_data_dir.exists()
    l: OCRAsOptionalContentGroup = OCRAsOptionalContentGroup(tesseract_data_dir)
    # Read Document
    doc: typing.Optional[Document] = None
    with open("output_001.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle, [l])

    assert doc is not None
    # print(doc)
    # Store Document
    with open("output_002.pdf", "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)

def read_modified_document():

    doc: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleTextExtraction()
    with open("output_002.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle, [l])

    print(l.get_text()[0])



def main():
    # load_image()
    # download_image()
    create_document()
    apply_ocr_to_document()
    # read_modified_document()

    
if __name__ == "__main__":
    main()

  1. When loading an image from an external URL, the width/height passed to Image must not be too large, otherwise borb reports an error
Image("https://xxx/licenseTest/2022/11/ba916990aaa049a78fc1a5cb7a606924.png",
     width=Decimal(256),
     height=Decimal(256),
     horizontal_alignment=Alignment.CENTERED,)
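
One way to keep the width/height within bounds is to scale the picture down before handing the values to Image. The following is a minimal sketch, not part of borb: the helper name fit_within and the 256 x 256 default box are assumptions chosen to match the demo, and the real limit depends on the page size and margins.

```python
from decimal import Decimal
from typing import Tuple


def fit_within(
    width: int,
    height: int,
    max_width: int = 256,   # assumed limit, taken from the demo values
    max_height: int = 256,  # assumed limit, taken from the demo values
) -> Tuple[Decimal, Decimal]:
    """Scale (width, height) down to fit the box, keeping the aspect ratio.

    Images that already fit are returned unchanged (never scaled up).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(
        Decimal(max_width) / Decimal(width),
        Decimal(max_height) / Decimal(height),
        Decimal(1),
    )
    # Round to whole units, which is what the demo passes to Image
    new_width = (Decimal(width) * scale).quantize(Decimal("1"))
    new_height = (Decimal(height) * scale).quantize(Decimal("1"))
    return new_width, new_height


if __name__ == "__main__":
    w, h = fit_within(1024, 512)
    print(w, h)
```

The two returned Decimal values can then be passed as width= and height= in the Image call shown above.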
  1. The tessdata directory used in the demo above has to be pulled from GitHub.

  2. If the run succeeds, two PDF files are generated. After opening the second one, output_002.pdf, the text in the picture can be selected and copied.

  3. The OCRAsOptionalContentGroup object handles the layer.
    Besides the tessdata directory, OCRAsOptionalContentGroup takes a minimal_confidence argument at initialization, which defaults to 0.75. After the image is recognized, the confidence of each result is checked, and the recognized text is painted onto the page:

ChunkOfText(
    e.get_text(),
    e.get_font(),
    e.get_font_size(),
    e.get_font_color(),
).paint(page, e.get_bounding_box())

ChunkOfText is defined in: borb/pdf/canvas/layout/text/chunk_of_text.py
paint is defined in: borb/pdf/canvas/layout/layout_element.py

The log printed before rendering contains the text content (the recognition result), the position where it is rendered, and the other operators the PDF format requires:

 q
BT
0.937255 0.937255 0.937255 rg
/F1 1.000000 Tf
22.000000 0 0 22.000000 76.500000 691.554000 Tm
(Hello) Tj
ET
Q 

 q
BT
0.937255 0.937255 0.937255 rg
/F1 1.000000 Tf
23.000000 0 0 23.000000 135.500000 690.761000 Tm
(World!) Tj
ET
Q 
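
To read a log like this more easily, the text and its position can be pulled out of each BT ... ET block. The sketch below is only an illustration and not borb code: the names TextRun and parse_text_runs are made up here, and the regular expressions assume the simple layout shown in the log (one Tf, one Tm and one Tj per block, no escaped parentheses in the string). In the Tm operator the first value is the horizontal scale and the last two are the x/y translation, so with a Tf size of 1 the effective font size is the scale itself.

```python
import re
from typing import List, NamedTuple


class TextRun(NamedTuple):
    text: str
    font_size: float  # Tf size multiplied by the horizontal scale of Tm
    x: float
    y: float


NUM = r"(-?\d+(?:\.\d+)?)"
_BLOCK = re.compile(r"BT(.*?)ET", re.DOTALL)
_TF = re.compile(r"/\S+\s+" + NUM + r"\s+Tf")
_TM = re.compile(r"\s+".join([NUM] * 6) + r"\s+Tm")
_TJ = re.compile(r"\((.*?)\)\s*Tj")


def parse_text_runs(stream: str) -> List[TextRun]:
    """Extract (text, size, x, y) from each simple BT ... ET block."""
    runs: List[TextRun] = []
    for block in _BLOCK.findall(stream):
        tf, tm, tj = _TF.search(block), _TM.search(block), _TJ.search(block)
        if not (tf and tm and tj):
            continue  # skip blocks that do not follow the simple layout
        a, _b, _c, _d, e, f = (float(v) for v in tm.groups())
        runs.append(TextRun(tj.group(1), float(tf.group(1)) * a, e, f))
    return runs


# The two blocks from the log above
SAMPLE = """
q
BT
0.937255 0.937255 0.937255 rg
/F1 1.000000 Tf
22.000000 0 0 22.000000 76.500000 691.554000 Tm
(Hello) Tj
ET
Q
q
BT
0.937255 0.937255 0.937255 rg
/F1 1.000000 Tf
23.000000 0 0 23.000000 135.500000 690.761000 Tm
(World!) Tj
ET
Q
"""

if __name__ == "__main__":
    for run in parse_text_runs(SAMPLE):
        print(run)
```

The rg operator before Tf sets the fill color; the three identical values make the overlaid text a light gray.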

The _add_ocr_optional_content_group method does what its name says: it adds the OCG layer to the PDF.

  1. borb does not support drawing Chinese text. If Chinese is added to lang="eng" in the call below, recognition succeeds but the drawing step fails
data = pytesseract.image_to_data(
    event.get_image(),
    lang="eng",
    config='--tessdata-dir "%s"' % str(self._tesseract_data_dir.absolute()),
    output_type=Output.DICT,
)

The data format is as follows:

{
    'level': [1, 2, 3, 4, 5, 5],
    'page_num': [1, 1, 1, 1, 1, 1],
    'block_num': [0, 1, 1, 1, 1, 1],
    'par_num': [0, 0, 1, 1, 1, 1],
    'line_num': [0, 0, 0, 1, 1, 1],
    'word_num': [0, 0, 0, 0, 1, 2],
    'left': [0, 12, 12, 12, 12, 71],
    'top': [0, 15, 15, 15, 15, 15],
    'width': [256, 127, 127, 127, 52, 68],
    'height': [256, 17, 17, 17, 17, 17],
    'conf': [-1, -1, -1, -1, 84, 68],
    'text': ['', '', '', '', 'Hello', 'Worldl']
}
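
In a dictionary shaped like this, each index is one row: levels 1 to 4 are the page, block, paragraph and line and carry conf = -1, while level 5 rows are the recognized words. The sketch below shows how such rows could be filtered with a confidence threshold. It is an illustration only and not borb's implementation: the function name confident_words is made up, and comparing conf / 100 against a 0 to 1 threshold such as minimal_confidence=0.75 is an assumption about how the two scales relate.

```python
from typing import Dict, List, Tuple

# (text, left, top, width, height) in image pixels, origin at the top left
Word = Tuple[str, int, int, int, int]


def confident_words(data: Dict[str, list], minimal_confidence: float = 0.75) -> List[Word]:
    """Keep only the word rows whose confidence reaches the threshold."""
    words: List[Word] = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])
        if conf < 0 or not text.strip():
            continue  # structural rows (page, block, paragraph, line)
        if conf / 100.0 < minimal_confidence:  # assumed scale conversion
            continue
        words.append(
            (text, data["left"][i], data["top"][i], data["width"][i], data["height"][i])
        )
    return words


# The recognition result shown above
SAMPLE = {
    "level": [1, 2, 3, 4, 5, 5],
    "left": [0, 12, 12, 12, 12, 71],
    "top": [0, 15, 15, 15, 15, 15],
    "width": [256, 127, 127, 127, 52, 68],
    "height": [256, 17, 17, 17, 17, 17],
    "conf": [-1, -1, -1, -1, 84, 68],
    "text": ["", "", "", "", "Hello", "Worldl"],
}

if __name__ == "__main__":
    print(confident_words(SAMPLE))
    print(confident_words(SAMPLE, minimal_confidence=0.5))
```

Under this assumption only "Hello" (conf 84) passes the default threshold, while the misread "Worldl" (conf 68) is kept only when the threshold is lowered.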
  1. Call chain: PDF.loads passes the listener down through the transformers until its _event_occurred function is called; pytesseract is executed there, and the result is returned as a Document
# 1. pdf.py
doc = PDF.loads(pdf_file_handle, [l])
# 2. borb/pdf/pdf.py, line 56
return ReadAnyObjectTransformer().transform(
    file,
    parent_object=None,
    context=ReadTransformerState(password=password),
    event_listeners=event_listeners,
)
# 3. borb/io/read/any_object_transformer.py, line 100
return super().transform(
    object_to_transform, parent_object, context, event_listeners
)
# 4. borb/io/read/transformer.py, line 124
out = h.transform(
    object_to_transform,
    parent_object=parent_object,
    context=context,
    event_listeners=event_listeners,
)
# 5. borb/io/read/reference/xref_transformer.py, line 77
l._event_occurred(BeginDocumentEvent())
# 6. borb/toolkit/ocr/ocr_as_optional_content_group.py, line 145
def _event_occurred(self, event: Event) -> None:
    ...
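
The pattern behind this call chain is a plain event listener: the reader walks the document and hands every event to each listener's _event_occurred, and each listener picks out the events it cares about. The toy sketch below shows only that pattern. None of these classes are borb's: BeginDocumentEvent here is a stand-in with the same name, and ImageEvent, ImageCollector and dispatch are hypothetical names invented for the illustration.

```python
from typing import Iterable, List


class Event:
    """Base class for everything the reader reports."""


class BeginDocumentEvent(Event):
    """Stand-in for the event fired once reading starts."""


class ImageEvent(Event):
    """Hypothetical event carrying the name of an image found on a page."""

    def __init__(self, image_name: str) -> None:
        self.image_name = image_name


class EventListener:
    def _event_occurred(self, event: Event) -> None:
        """Called for every event; subclasses decide what to react to."""


class ImageCollector(EventListener):
    """Reacts only to image events, the way an OCR listener would."""

    def __init__(self) -> None:
        self.images: List[str] = []

    def _event_occurred(self, event: Event) -> None:
        if isinstance(event, ImageEvent):
            # An OCR listener would run the recognition at this point
            self.images.append(event.image_name)


def dispatch(events: Iterable[Event], listeners: List[EventListener]) -> None:
    """Hand each event to each listener, in order."""
    for event in events:
        for listener in listeners:
            listener._event_occurred(event)


if __name__ == "__main__":
    collector = ImageCollector()
    dispatch([BeginDocumentEvent(), ImageEvent("Im1")], [collector])
    print(collector.images)
```

This is why the demo only has to pass [l] to PDF.loads: the listener object collects what it needs while the document is being read.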
  1. Details to be added...

Origin blog.csdn.net/zoeou/article/details/128360740