[Python] Easy-to-use office expert: Use OCR to parse PDF documents (with tutorial)

Article directory

Preface
1. Environment settings
2. Detection
3. Extraction
Summary

Preface

Document parsing involves examining data in documents and extracting useful information. It can reduce a lot of manual work through automation. A popular parsing strategy is to convert documents into images and use computer vision for recognition. While Document Image Analysis refers to the technology of obtaining information from the pixel data of an image of a document, in some cases there is no clear answer to what the expected results should be (text, images, charts, numbers, Tables, formulas...).

OCR (Optical Character Recognition) is the process of detecting and extracting text in images through computer vision. Its origins go back to World War I, when Israeli scientist Emanuel Goldberg created a machine that could read characters and convert them into telegraph code. The field has since become very sophisticated, combining image processing, text localization, character segmentation and character recognition. In essence, it is an object detection technique applied to text.

In this article I will show how to use OCR for document parsing. I'll show some useful Python code that can be easily used in other similar situations (just copy, paste, run) and provide a full source code download.

Here we will take the financial statements in PDF format of a listed company as an example (link below). https://s2.q4cdn.com/470004039/files/doc_financials/2021/q4/_10-K-2021-(As-Filed).pdf

1. Environment settings

The annoying part about document parsing is that there are so many tools for different types of data (text, graphics, tables) and none of them work perfectly. Here are some of the most popular methods and software packages:

Process documents as text: extract text with PyPDF2, tables with Camelot or tabula-py, and graphics with PyMuPDF.

Convert documents to images (OCR): use pdf2image for conversion, PyTesseract and many other libraries to extract data, or just use LayoutParser.

You may ask: "Why convert the pages into images instead of processing the PDF file directly?" You could process the file directly. The main disadvantage of that strategy is the encoding issue: documents can use different character encodings (e.g. UTF-8 or ASCII), so conversion to text may result in data loss. To avoid this problem, I will use OCR and convert each page to an image with pdf2image. Note that the PDF rendering library Poppler is required.

# pdf2image itself, with pip
pip install pdf2image
# Poppler, with conda
conda install -c conda-forge poppler
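
To make the encoding concern above more concrete, here is a small standard-library sketch. The sample string is my own and is not taken from the report; it only illustrates how characters can be lost or garbled when text is handled with the wrong character set:

```python
# Illustrative only: the sample string is an assumption, not text from the PDF.
sample = "café €100"

# ASCII cannot represent "é" or "€"; errors="replace" swaps each for "?".
ascii_bytes = sample.encode("ascii", errors="replace")
print(ascii_bytes)

# Decoding UTF-8 bytes with the wrong codec raises no error,
# it silently produces garbled text (mojibake).
wrong = "café".encode("utf-8").decode("latin-1")
print(wrong)
```

Rendering the page to an image sidesteps this class of problem, at the cost of having to recognize the characters again with OCR.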

You can read the file easily:

# READ AS IMAGE
import pdf2image

doc = pdf2image.convert_from_path("doc_apple.pdf")
len(doc) #<-- check num pages
doc[0]   #<-- visualize a page

The rendered page looks exactly like the original document. If you want to save the page images locally, you can use the following code:

# Save imgs
import os

folder = "doc"
if folder not in os.listdir():
  os.makedirs(folder)

p = 1
for page in doc:
  image_name = "page_"+str(p)+".jpg"  
  page.save(os.path.join(folder, image_name), "JPEG")
  p = p+1
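
One small refinement, offered as a sketch rather than as part of the original workflow: names like page_2.jpg and page_10.jpg do not sort in page order alphabetically. Zero-padding the page number avoids that. The helper name page_filename and the page count of 120 are hypothetical:

```python
import os

def page_filename(page_number, total_pages, folder="doc"):
    """Build a zero-padded path such as doc/page_007.jpg (hypothetical helper)."""
    width = len(str(total_pages))  # number of digits needed for the last page
    return os.path.join(folder, f"page_{page_number:0{width}d}.jpg")

# With padding, alphabetical order and page order agree.
names = [page_filename(p, 120) for p in (1, 2, 10, 100)]
print(names)
```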

Finally, we need to set up the CV engine we will use. LayoutParser appears to be the first general-purpose package for OCR based on deep learning. It relies on two well-known tools to accomplish the task:

Detectron: Facebook's state-of-the-art object detection library (its second version, Detectron2, is used here).

Tesseract: the best-known OCR system, created by Hewlett-Packard in 1985 and currently developed by Google.

pip install "layoutparser[ocr]"

Now we are ready to start the OCR procedure for information detection and extraction.

import layoutparser as lp
import cv2
import numpy as np
import io
import pandas as pd
import matplotlib.pyplot as plt

2. Detection

Object detection is the process of finding pieces of information in an image and then surrounding them with rectangular borders. For document parsing, this information is titles, text, graphics, tables...

Let's look at a complex page that contains a few things:

This page starts with a title, has a text block, then a figure and a table, so we need a trained model to recognize these objects. Luckily, Detectron can do this: we just need to select a pre-trained model and specify its path in the code.

The model I'm going to use can only detect 5 kinds of objects (text, title, list, table, figure). Therefore, if you need to identify other things (like equations), you have to use other models.

## load pre-trained model
model = lp.Detectron2LayoutModel(
  "lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x/config",
  extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
  label_map={0:"Text", 1:"Title", 2:"List", 3:"Table", 4:"Figure"})

## turn img into array
i = 21
img = np.asarray(doc[i])

## predict
detected = model.detect(img)

## plot
lp.draw_box(img, detected, box_width=5, box_alpha=0.2,
          show_element_type=True)
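
The 0.8 passed as MODEL.ROI_HEADS.SCORE_THRESH_TEST above is a confidence cut-off. As a rough, library-independent sketch of what such a threshold does (the Detection class, the keep_confident helper and the scores are made up for illustration, and this is not Detectron2's own code):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    score: float  # model confidence between 0 and 1

def keep_confident(detections, threshold=0.8):
    """Keep only detections whose confidence exceeds the threshold."""
    return [d for d in detections if d.score > threshold]

# Made-up candidates: the low-confidence figure is discarded.
candidates = [
    Detection("Title", 0.97),
    Detection("Table", 0.85),
    Detection("Figure", 0.42),
]
kept = keep_confident(candidates)
print([d.label for d in kept])
```

A higher threshold gives fewer but more reliable boxes; a lower one finds more blocks at the risk of false positives.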

The results contain details for each detected layout block, such as the coordinates of its bounding box. It is useful to sort the output in the order in which the blocks appear on the page:

## sort
new_detected = detected.sort(key=lambda x: x.coordinates[1])
## assign ids
detected = lp.Layout([block.set(id=idx) for idx,block in
                    enumerate(new_detected)])
## check
for block in detected:
  print("---", str(block.id)+":", block.type, "---")
  print(block, end='\n\n')
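
The sorting and id assignment above can be pictured without LayoutParser. The following sketch uses plain dictionaries with made-up (x1, y1, x2, y2) boxes; order_blocks is a hypothetical helper, not part of the library:

```python
def order_blocks(blocks):
    """Sort blocks top-to-bottom by y1 and attach a sequential id."""
    ordered = sorted(blocks, key=lambda b: b["box"][1])  # index 1 is the top edge
    return [dict(b, id=idx) for idx, b in enumerate(ordered)]

# Made-up boxes as (x1, y1, x2, y2) in pixels, deliberately out of order.
blocks = [
    {"type": "Table", "box": (100, 900, 1500, 1400)},
    {"type": "Title", "box": (100, 80, 900, 140)},
    {"type": "Text", "box": (100, 200, 1500, 600)},
]
ordered = order_blocks(blocks)
print([(b["id"], b["type"]) for b in ordered])
```

Sorting on the top edge alone assumes a single-column page; on multi-column layouts a different key would probably be needed.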

The next step in completing OCR is to correctly extract useful information from the detected content.

3. Extraction

Now that we have segmented the image, we need to use another model to process the segmented image and save the extracted output into a dictionary.

Since there are different types of output (text, titles, graphics, tables), a function is prepared here to display the results.

'''
{'0-Title': '...',
'1-Text': '...',
'2-Figure': array([[ [0,0,0], ...]]),
'3-Table': pd.DataFrame,
}
'''
def parse_doc(dic):
  for k,v in dic.items():
      if "Title" in k:
          print('\x1b[1;31m'+ v +'\x1b[0m')
      elif "Figure" in k:
          plt.figure(figsize=(10,5))
          plt.imshow(v)
          plt.show()
      else:
          print(v)
      print(" ")
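
Because each key starts with the block id, the entries can be put back into page order even if the dictionary is filled one type at a time (for example all text first, then figures). A minimal sketch, assuming keys of the form "id-Type" as in the comment above; in_page_order is a hypothetical helper and the values are placeholders:

```python
def in_page_order(dic):
    """Return (key, value) pairs sorted by the numeric id prefix of each key."""
    return sorted(dic.items(), key=lambda kv: int(kv[0].split("-", 1)[0]))

# Made-up dictionary filled out of page order.
example = {
    "0-Title": "Heading",
    "3-Text": "Closing paragraph",
    "1-Figure": "<image>",
    "10-Table": "<table>",
    "2-Table": "<table>",
}
print([key for key, _ in in_page_order(example)])
```

Converting the prefix to an integer matters: sorting the keys as plain strings would place "10-Table" before "2-Table".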

Let's look at the text first:

# load model
model = lp.TesseractAgent(languages='eng')
dic_predicted = {}
for block in [block for block in detected if block.type in ["Title","Text"]]:
  ## segmentation
  segmented = block.pad(left=15, right=15, top=5,
              bottom=5).crop_image(img)
  ## extraction
  extracted = model.detect(segmented)
  ## save
  dic_predicted[str(block.id)+"-"+block.type] = (
                extracted.replace('\n',' ').strip())

# check
parse_doc(dic_predicted)
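
The padding step above gives the OCR engine a small margin around each block so that characters at the edges are not cut off. As an illustration of the idea only (this is not LayoutParser's implementation; pad_box and crop are hypothetical helpers working on a plain list-of-rows image), padding can be combined with clamping so the box never leaves the page:

```python
def pad_box(box, width, height, left=15, right=15, top=5, bottom=5):
    """Grow an (x1, y1, x2, y2) box by the given margins, clamped to the image."""
    x1, y1, x2, y2 = box
    return (
        max(0, x1 - left),
        max(0, y1 - top),
        min(width, x2 + right),
        min(height, y2 + bottom),
    )

def crop(image_rows, box):
    """Crop a row-major image (list of rows) to an integer box."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image_rows[y1:y2]]

# Tiny 6x4 "image" where each pixel stores its own (x, y) position.
image = [[(x, y) for x in range(6)] for y in range(4)]
padded = pad_box((2, 1, 4, 3), width=6, height=4, left=1, right=5, top=1, bottom=0)
patch = crop(image, padded)
print(padded, len(patch), len(patch[0]))
```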

Next, let's look at the figures:

for block in [block for block in detected if block.type == "Figure"]:
  ## segmentation
  segmented = block.pad(left=15, right=15, top=5,
                        bottom=5).crop_image(img)
  ## save
  dic_predicted[str(block.id)+"-"+block.type] = segmented

# check
parse_doc(dic_predicted)

The two outputs above look good because these types are relatively simple, but tables are much more complex, especially the one we are looking at, because its rows and columns are merged.


for block in [block for block in detected if block.type == "Table"]:
  ## segmentation
  segmented = block.pad(left=15, right=15, top=5,
              bottom=5).crop_image(img)
  ## extraction
  extracted = model.detect(segmented)
  ## save
  dic_predicted[str(block.id)+"-"+block.type] = pd.read_csv(
                io.StringIO(extracted) )
# check
parse_doc(dic_predicted)
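
One reason this approach struggles is that pd.read_csv splits on commas by default, so a value such as 1,200 is broken into two fields. A possible alternative, sketched here under the assumption that each OCR line is a text label followed by numeric columns, is to peel numeric tokens off the end of each line. The split_row helper, the regular expression and the sample text are all made up for illustration:

```python
import re

# A token counts as numeric if it looks like 1,234 or 12.5 or (1,234) or 45%.
NUMERIC = re.compile(r"^\(?\$?\d[\d,]*(\.\d+)?%?\)?$")

def split_row(line):
    """Split an OCR'd table row into (label, [values]) using trailing numeric tokens."""
    tokens = line.split()
    values = []
    while tokens and NUMERIC.match(tokens[-1]):
        values.insert(0, tokens.pop())
    return " ".join(tokens), values

# Made-up OCR output, not taken from the report.
ocr_text = """Net sales 1,200 1,050
Cost of sales 700 640
Operating loss (1,234) (987)"""
rows = [split_row(line) for line in ocr_text.splitlines() if line.strip()]
print(rows)
```

This heuristic breaks when a label itself ends in a number (for example "Note 4"), so it is a starting point rather than a general solution.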

As expected, the extracted table is not very good. Fortunately, Python has packages specifically for processing tables, and we can process them directly from the PDF without converting the page into an image. The tabula-py package is used here:

import tabula
tables = tabula.read_pdf("doc_apple.pdf", pages=i+1)
tables[0]

The column names are still wrong, but the result is much better than with direct OCR.

Summary

This article is a simple tutorial that demonstrates how to use OCR for document parsing. The entire detection and extraction process was performed with the LayoutParser package, and the article showed how to handle text, figures and tables in PDF documents.