Foreword
Recently I have had several tasks involving annual reports, all of which require extracting the text from the report PDFs before any further processing. To speed this up, I compared the efficiency of several Python libraries that can do this.
1. pdfplumber
Introduction:
- Gives access to detailed information about each text character, rectangle, and line in a PDF
- Works best on machine-generated PDFs rather than scanned ones
- Built on pdfminer.six
- The code is concise and easy to understand
Install:
pip install pdfplumber
Example:
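import pdfplumber

def pdf2txt(pdf_path):
    """Return the text of every page in the PDF, concatenated."""
    txt = ''
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # In some pdfplumber versions extract_text() returns None for a
            # page with no extractable text, so fall back to ''.
            txt += page.extract_text() or ''
    return txt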
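Beyond plain text, the character, rectangle, and line details mentioned in the introduction are available on each page object. The following is a minimal sketch and not part of the original comparison: the helper names count_page_objects, fonts_used, and summarize_pdf are made up for illustration, and it assumes pdfplumber's page.chars, page.rects, and page.lines lists and the "fontname" key on each character dict.

```python
from collections import Counter


def count_page_objects(page):
    """Count the characters, rectangles, and lines found on one page."""
    return {
        "chars": len(page.chars),
        "rects": len(page.rects),
        "lines": len(page.lines),
    }


def fonts_used(page):
    """Tally characters per font, using the 'fontname' key of each char dict."""
    return Counter(char["fontname"] for char in page.chars)


def summarize_pdf(pdf_path):
    """Return one object-count dict per page of the PDF at pdf_path."""
    # Imported here so the helpers above can be tried on any page-like object.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [count_page_objects(page) for page in pdf.pages]
```

Calling summarize_pdf on a report (for example a hypothetical "annual_report.pdf") returns a list with one dict per page, which is a quick way to spot scanned pages: they typically contain no characters at all.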
2. pdfminer
Introduction:
- Can extract the position of text and other layout information
- Can convert pdf to other formats (HTML/XML)
- Supports basic encryption methods (RC4 and AES)
Install (pdfminer.six is the maintained fork and provides the pdfminer package imported below):
pip install pdfminer.six
Example:
import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def parsePDF(PDF_path):
    """Return the text of the PDF, or None if nothing was extracted."""
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()  # in-memory buffer that collects the text
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)
    with open(PDF_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            page_interpreter.process_page(page)
        text = fake_file_handle.getvalue()
    converter.close()
    fake_file_handle.close()
    if text:
        return text
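The position information mentioned in the introduction can be read from the layout objects that pdfminer produces. The following is a rough sketch under stated assumptions: it uses extract_pages from pdfminer.high_level (available in pdfminer.six), the helper names text_boxes and text_boxes_from_pdf are invented for illustration, and the hasattr check is a duck-typed stand-in for testing isinstance(element, LTTextContainer).

```python
def text_boxes(page_layouts):
    """Yield (page_number, bbox, text) for each text-bearing layout element.

    page_layouts is any iterable of page layouts, where each layout is an
    iterable of elements with a .bbox tuple (x0, y0, x1, y1).
    """
    for page_number, layout in enumerate(page_layouts, start=1):
        for element in layout:
            # Assumption: only text elements expose get_text(); lines,
            # rectangles, and figures are skipped by this check.
            if hasattr(element, "get_text"):
                yield page_number, element.bbox, element.get_text().strip()


def text_boxes_from_pdf(pdf_path):
    """Collect the text boxes of every page in the PDF at pdf_path."""
    # Imported here so text_boxes() can be tried without the library.
    from pdfminer.high_level import extract_pages

    return list(text_boxes(extract_pages(pdf_path)))
```

Each returned tuple pairs a block of text with its bounding box in PDF coordinates, which helps when headers, footers, or table cells need to be separated by position rather than by content.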
3. fitz / pymupdf
Introduction:
- Multiple formats supported
- Can extract text and images
- Can search text
Install:
Installing the package named fitz directly tends to cause problems, so install pymupdf instead:
pip install pymupdf
Example:
import fitz

def parsePDF(filePath):
    """Return the text of the PDF, or None if nothing was extracted."""
    with fitz.open(filePath) as doc:
        text = ""
        for page in doc.pages():
            text += page.get_text()
        if text:
            return text
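The text search listed in the introduction works per page. Here is a minimal sketch, assuming a recent PyMuPDF release in which pages offer search_for() (it returns one rectangle per hit) and a zero-based number attribute; the helper names find_text and find_text_in_pdf are made up for illustration.

```python
def find_text(doc, needle):
    """Return (page_number, (x0, y0, x1, y1)) for every hit of needle in doc."""
    hits = []
    for page in doc:
        for rect in page.search_for(needle):
            # page.number is zero-based, so add 1 for a human-readable page.
            hits.append((page.number + 1, tuple(rect)))
    return hits


def find_text_in_pdf(file_path, needle):
    """Open the PDF at file_path and search every page for needle."""
    # Imported here so find_text() can be tried on any document-like object.
    import fitz

    with fitz.open(file_path) as doc:
        return find_text(doc, needle)
```

For an annual report, a call such as find_text_in_pdf(path, "Revenue") would list the pages and positions where that word appears, without extracting the full text first.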
Official example:
https://github.com/pymupdf/PyMuPDF/tree/master/tests
4. Performance comparison
I used the three methods above to extract text from the same PDF and recorded the length of the extracted text and the running time of each. The results are as follows:
fitz not only extracted more text, it was also more than 10 times faster!
In this test fitz clearly outperformed the other two libraries at text extraction, yet there are surprisingly few articles introducing it.
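A comparison like the one described in this section can be scripted with a small timing harness. The sketch below is not the script used for the measurements above: the names benchmark and format_results are hypothetical, and it only assumes that each extractor is a function that takes a PDF path and returns the text (or None).

```python
import time


def benchmark(extractors, pdf_path):
    """Run each extractor once on pdf_path and record text length and time.

    extractors maps a label to a function taking a path and returning text.
    """
    results = []
    for name, extract in extractors.items():
        start = time.perf_counter()
        text = extract(pdf_path) or ""  # treat a None result as empty text
        elapsed = time.perf_counter() - start
        results.append({"name": name, "chars": len(text), "seconds": elapsed})
    return results


def format_results(results):
    """Render the results as aligned plain-text rows, one per extractor."""
    rows = [f"{'library':<12}{'chars':>10}{'seconds':>10}"]
    for r in results:
        rows.append(f"{r['name']:<12}{r['chars']:>10}{r['seconds']:>10.3f}")
    return "\n".join(rows)
```

To use it with the functions from this article, pass a dict such as {"pdfplumber": pdf2txt, ...}. The pdfminer and fitz examples are both named parsePDF, so one of them has to be renamed first. A single run is noisy, so repeating the measurement a few times gives a fairer picture.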
The next article will compare the accuracy of the text extracted by the three methods. Stay tuned!