Python: a summary of methods for extracting text from PDFs

Table of contents

Foreword

1. pdfplumber

2. pdfminer

3. fitz / pymupdf

4. Performance comparison


Foreword

Recently I have had several tasks involving annual reports, all of which require extracting the text from the report PDFs before moving on to the next step. To improve efficiency, this article compares the Python libraries that can do this and analyzes their performance.

1. pdfplumber

Introduction:

  • Exposes details of each text character, rectangle, and line in a PDF file
  • Works best on non-scanned (machine-generated) PDFs
  • Built on pdfminer.six
  • The code is concise and easy to understand

Install:

pip install pdfplumber

Example:

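import pdfplumber

def pdf2txt(pdf_path):
    """Return the text of every page in the PDF, concatenated."""
    txt = ''
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # In some versions extract_text() returns None for a page without text
            txt = txt + (page.extract_text() or '')
    return txt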
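
Beyond plain text, pdfplumber also exposes the per-character details mentioned in the introduction. The following is only a minimal sketch and not part of the original workflow: the helper names format_char and first_chars are made up for illustration, and it assumes each entry of page.chars is a dictionary with the keys text, x0, top and size.

```python
def format_char(char):
    """Format one character dictionary as "'text' @ (x0, top) size N"."""
    return (
        f"{char['text']!r} @ ({char['x0']:.1f}, {char['top']:.1f}) "
        f"size {char['size']:.1f}"
    )


def first_chars(pdf_path, page_index=0, limit=5):
    """Describe the first `limit` characters on one page (hypothetical helper)."""
    # Imported here so that format_char can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        # page.chars is a list of dictionaries, one per character on the page
        return [format_char(c) for c in page.chars[:limit]]
```

Calling first_chars on an annual report would show where the first few characters sit on the page; page.rects and page.lines can be inspected in the same way.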

2. pdfminer

Introduction:

  • You can get the exact position of the text and other layout information
  • Can convert pdf to other formats (HTML/XML)
  • Supports basic encryption methods (RC4 and AES)

Install:

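The actively maintained fork is published as pdfminer.six (it is still imported as pdfminer), so it is recommended to install that package:

pip install pdfminer.six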

Example:

import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def parsePDF(PDF_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()  # in-memory buffer that collects the text
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)
    with open(PDF_path, 'rb') as fh:
        # check_extractable=True checks whether the PDF permits text extraction
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            page_interpreter.process_page(page)
        text = fake_file_handle.getvalue()
    converter.close()
    fake_file_handle.close()
    # Returns None when no text was extracted
    if text:
        return text
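
The introduction mentions that pdfminer can report where each piece of text sits on the page. Below is a rough sketch of that idea, assuming pdfminer.six is installed and using its high-level extract_pages function. The names text_boxes and reading_order are hypothetical helpers, and the sort assumes PDF coordinates with the origin at the bottom-left corner, so a larger y means higher on the page.

```python
def reading_order(boxes):
    """Sort (page_number, bbox, text) tuples top-to-bottom, then left-to-right.

    bbox is (x0, y0, x1, y1) with the origin at the bottom-left of the page.
    """
    return sorted(boxes, key=lambda b: (b[0], -b[1][3], b[1][0]))


def text_boxes(pdf_path):
    """Collect every text box with its page number and bounding box."""
    # Imported here so that reading_order works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    boxes = []
    for page_number, layout in enumerate(extract_pages(pdf_path), start=1):
        for element in layout:
            if isinstance(element, LTTextContainer):
                boxes.append((page_number, element.bbox, element.get_text()))
    return reading_order(boxes)
```

This is a simple ordering that ignores multi-column layouts; for annual reports with side-by-side columns the result would need further grouping.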

3. fitz / pymupdf

Introduction:

  • Supports multiple document formats
  • Can extract text and images
  • Can search for text

Install:

Installing a package named fitz directly tends to cause problems; it is recommended to install pymupdf instead, which provides the fitz module.

pip install pymupdf

Example:

import fitz

def parsePDF(filePath):
    with fitz.open(filePath) as doc:
        text = ""
        for page in doc.pages():
            text += page.get_text()
        if text:
            return text

Official example:

https://github.com/pymupdf/PyMuPDF/tree/master/tests
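
The text search listed in the introduction can be sketched as follows. This is an illustrative example under assumptions, not an official recipe: count_hits and find_pages are made-up names, and the code assumes a PyMuPDF version in which pages offer search_for (returning a list of hit rectangles) and a 0-based number attribute.

```python
def count_hits(pages, needle):
    """Map 1-based page numbers to the number of hit rectangles for `needle`."""
    hits = {}
    for page in pages:
        rects = page.search_for(needle)  # rectangles surrounding the matches
        if rects:
            hits[page.number + 1] = len(rects)  # page.number is 0-based
    return hits


def find_pages(pdf_path, needle):
    """Open a PDF with PyMuPDF and report which pages contain `needle`."""
    # Imported here so that count_hits can be exercised without PyMuPDF installed.
    import fitz  # provided by the pymupdf package

    with fitz.open(pdf_path) as doc:
        return count_hits(doc, needle)
```

Keeping the counting logic separate from the file handling makes it easy to try the function on any page-like objects before pointing it at a real report.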

4. Performance comparison

The three methods above were each used to extract text from the same PDF, recording the length of the extracted text and the running time. The results are as follows:

fitz not only extracted more text than the other two libraries, it was also more than 10 times faster!

In terms of text-extraction performance, fitz clearly outperforms the other libraries, yet there are few articles introducing it, which is a bit strange.
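
A comparison like this can be reproduced with a small timing harness. The one below is a sketch under stated assumptions rather than the script used for the results above: the benchmark function, its repeats parameter and the dictionary layout are all illustrative, and each extractor is assumed to be a function that takes a PDF path and returns a string or None.

```python
import time


def benchmark(extractors, pdf_path, repeats=1):
    """Time each extractor on the same PDF.

    extractors maps a label to a function taking a path and returning text.
    Returns {label: {"chars": text length, "seconds": mean time per run}}.
    """
    repeats = max(1, repeats)  # always run at least once
    results = {}
    for name, extract in extractors.items():
        start = time.perf_counter()
        for _ in range(repeats):
            text = extract(pdf_path) or ""  # treat None as empty text
        elapsed = (time.perf_counter() - start) / repeats
        results[name] = {"chars": len(text), "seconds": elapsed}
    return results
```

For example, passing a dictionary such as {"pdfplumber": pdf2txt, ...} together with the path of one annual report would give the text length and running time per library. Since two of the examples in this article share the name parsePDF, they would need distinct names first.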

The next article will compare the accuracy of the text extracted by the three methods. Stay tuned~

Origin blog.csdn.net/Achernar0208/article/details/129199937