PDF-related Python libraries

I recently started working on PDF information extraction and used several Python libraries to work with PDFs. Here is a brief record of them.

pypdf

pypdf is a free, open-source, pure-Python PDF library that can split, merge, crop, and transform the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs.

Summary: pypdf is mainly used for modifying PDFs, especially splitting and merging, and it is very convenient to use.

PyPDF2 is no longer maintained after version 3.0.1; development continues under the name pypdf.

Install: pip install pypdf

pdfplumber

This library can extract text and tables from PDFs. It also supports visual debugging.

There are many libraries that support text extraction, but not many support table extraction.

I tried its table extraction. The results were barely usable and contained a lot of errors. You may need to try different extraction settings to get better results.

Install: pip install pdfplumber

pdfservices-python-sdk

This is the official SDK for Adobe's PDF conversion service. You need to register an account on the Adobe website to use it. Small volumes are free; larger volumes are charged.

I tried its PDF table extraction, and the results were impressive, much better than the open-source libraries. The drawback is that heavy usage is charged.

Adobe also has an official visualization page that shows the results of its information extraction API: https://acrobatservices.adobe.com/dc-visualizer-app/index.html

Install: pip install pdfservices-sdk

PyMuPDF

This library's functionality is roughly that of pypdf and pdfplumber combined: it can edit PDFs and extract information from them. However, as far as I could tell, it does not support table extraction.

I have not used this library myself, but I noticed that the h2ogpt project uses PyMuPDF to extract text content when doing document knowledge extraction, so I am noting it here.

Comparison

Here is an excerpt from the comparison with other libraries in pdfplumber's GitHub README:

pdfminer.six provides the foundation for pdfplumber. It primarily focuses on parsing PDFs, analyzing PDF layouts and object positioning, and extracting text. It does not provide tools for table extraction or visual debugging.

PyPDF2 is a pure-Python library "capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files." It can extract page text, but does not provide easy access to shape objects (rectangles, lines, etc.), table-extraction, or visually debugging tools.

pymupdf is substantially faster than pdfminer.six (and thus also pdfplumber) and can generate and modify PDFs, but the library requires installation of non-Python software (MuPDF). It also does not enable easy access to shape objects (rectangles, lines, etc.), and does not provide table-extraction or visual debugging tools.

camelot, tabula-py, and pdftables all focus primarily on extracting tables. In some cases, they may be better suited to the particular tables you are trying to extract.

Origin blog.csdn.net/yuanlulu/article/details/134018251