Python third-party libraries for extracting text from PDF files

PDF manipulation libraries

Five PDF libraries are covered: PyPDF3, textract, tika, pdfplumber, and pdfminer

PyPDF3

Install it with the pip package manager; the other libraries are installed the same way.

pip install PyPDF3

The advantage of this library is that it is easy to install. It does extract the text in the file, but it breaks the words of a single line across multiple lines and sometimes even splits a complete word, so the extraction quality is not very high.

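import PyPDF3

# Open the PDF in binary mode; the with block closes the file when done
with open(r'国际文件/National-Strategy-for-Advanced-Manufacturing-10072022.pdf', 'rb') as fhandle:
    pdfReader = PyPDF3.PdfFileReader(fhandle)
    # Iterate over every page instead of hard-coding the page count
    for i in range(pdfReader.getNumPages()):
        pagehandle = pdfReader.getPage(i)
        print(pagehandle.extractText())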
N
ATIONAL 
S
TRATEGY FOR 
 
A
DVANCED 
M
ANUFACTURING
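
The output above shows words split across lines. As a rough illustration of how such fragments could be stitched back together, here is a minimal post-processing sketch. The helper name join_fragments is hypothetical, and the approach (drop every line break, then collapse whitespace) is an assumption that only suits short spans such as a title; on a full page it would merge all paragraphs into one line.

```python
def join_fragments(text: str) -> str:
    """Heuristic: remove line breaks, then collapse runs of whitespace.

    This is an illustrative assumption, not part of PyPDF3. It relies on the
    extracted fragments keeping their own trailing spaces between words.
    """
    merged = text.replace("\r", "").replace("\n", "")
    # split() with no arguments drops all runs of whitespace
    return " ".join(merged.split())


if __name__ == "__main__":
    broken = "N\nATIONAL \nS\nTRATEGY FOR \n \nA\nDVANCED \nM\nANUFACTURING"
    print(join_fragments(broken))
```

Because the heuristic cannot tell a real line ending from a broken word, it is best applied to individual headings rather than to whole documents.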

textract

textract recognizes English text accurately. The text it returns is a byte stream, which can be converted to a normal string by calling decode. In practice, the extraction accuracy is very high.

# some python file
import textract
text = textract.process("国际文件/National-Strategy-for-Advanced-Manufacturing-10072022.pdf")
string = text.decode("utf-8")
print(string)
NATIONAL STRATEGY FOR ADVANCED MANUFACTURING
A Report by the SUBCOMMITTEE ON ADVANCED MANUFACTURING
COMMITTEE ON TECHNOLOGY of the
NATIONAL SCIENCE AND TECHNOLOGY COUNCIL
October 2022
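
Since textract hands back bytes, the decode step can fail when the bytes are not valid UTF-8. The following is a minimal sketch of a more forgiving decode. The helper name decode_extracted and the fallback order are assumptions for illustration and are not part of textract.

```python
def decode_extracted(raw: bytes, encodings=("utf-8", "latin-1")) -> str:
    """Try each encoding in turn and return the first successful decode.

    The encoding list is an assumption. latin-1 maps every byte to a
    character, so placing it last means the loop always returns a string.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if the caller passes a list without a catch-all encoding
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(decode_extracted("NATIONAL STRATEGY".encode("utf-8")))
```

In the example above, the result of textract.process(...) could be passed to this helper in place of the direct text.decode("utf-8") call.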

Apache Tika

Installation

pip install tika

  • tika-python is a Python port of the Apache Tika library
  • Because tika-python starts the Tika REST server in the background, Java 7+ must be installed on the system for the library to work properly.
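
Given the Java requirement, it can help to confirm that a Java runtime is reachable before calling the parser. This is a minimal sketch using only the standard library. The helper name java_available is hypothetical, and the check only looks for an executable on PATH; it does not verify the Java version.

```python
import shutil


def java_available(executable: str = "java") -> bool:
    """Return True if the named executable can be found on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if java_available():
        print("Java found on PATH")
    else:
        print("Java not found; install Java 7+ before using tika")
```
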
from tika import parser
file = "国际文件/National-Strategy-for-Advanced-Manufacturing-10072022.pdf"
file_data = parser.from_file(file)
text = file_data['content']
print(text)
NATIONAL STRATEGY FOR  
ADVANCED MANUFACTURING  

 

A Report by the 

SUBCOMMITTEE ON ADVANCED MANUFACTURING 

COMMITTEE ON TECHNOLOGY 

 

of the 

NATIONAL SCIENCE AND TECHNOLOGY COUNCIL 

October 2022 October 2022 

The final text extraction works quite well.

pdfPlumber

pdfplumber is easy to install and easy to use, and it extracts text well.

import pdfplumber

with pdfplumber.open(r'国际文件/National-Strategy-for-Advanced-Manufacturing-10072022.pdf') as pdf:
    # Iterate over every page instead of hard-coding the page count
    for page in pdf.pages:
        print(page.extract_text())
N S
ATIONAL TRATEGY FOR
A M
DVANCED ANUFACTURING
A Report by the
SUBCOMMITTEE ON ADVANCED MANUFACTURING
COMMITTEE ON TECHNOLOGY
of the
NATIONAL SCIENCE AND TECHNOLOGY COUNCIL
OOccttoobbeerr 22002222
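
The last line of the output above has every character doubled, which typically happens when the PDF draws the same text twice at the same position. pdfplumber documents a dedupe_chars() page method for this case, which is probably the better fix when you control the extraction step. For text that has already been extracted, the following is a minimal sketch of a string-level repair. The helper names are hypothetical, and the rule (collapse a token only when every character is doubled) is an assumption that can misfire on rare real words.

```python
def undouble_token(token: str, min_length: int = 4) -> str:
    """Collapse a token such as 'OOcctt' to 'Oct' if every char is doubled.

    min_length is an assumed guard so short tokens like 'oo' are left alone.
    """
    is_doubled = (
        len(token) >= min_length
        and len(token) % 2 == 0
        and token[0::2] == token[1::2]
    )
    return token[0::2] if is_doubled else token


def undouble_line(line: str) -> str:
    """Apply undouble_token to each space-separated token in a line."""
    return " ".join(undouble_token(token) for token in line.split(" "))


if __name__ == "__main__":
    print(undouble_line("OOccttoobbeerr 22002222"))
```
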

pdfminer

The official documentation is very detailed, but the library is somewhat complicated to use, and you need to read the sample code carefully to get started. The text extraction accuracy is also quite good.

import io

# Only the names used below are imported
from pdfminer3.layout import LAParams
from pdfminer3.pdfpage import PDFPage
from pdfminer3.pdfinterp import PDFResourceManager
from pdfminer3.pdfinterp import PDFPageInterpreter
from pdfminer3.converter import TextConverter

resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)

with open('国际文件/National-Strategy-for-Advanced-Manufacturing-10072022.pdf', 'rb') as fh:
    for page in PDFPage.get_pages(fh,
                                  caching=True,
                                  check_extractable=True):
        page_interpreter.process_page(page)

    text = fake_file_handle.getvalue()

# close open handles
converter.close()
fake_file_handle.close()

print(text)
NATIONAL STRATEGY FOR  

ADVANCED MANUFACTURING  

 

A Report by the 

SUBCOMMITTEE ON ADVANCED MANUFACTURING 

COMMITTEE ON TECHNOLOGY 

 

of the 

NATIONAL SCIENCE AND TECHNOLOGY COUNCIL 

October 2022 
October 2022 
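
The tika and pdfminer outputs shown above are accurate but padded with blank lines and trailing spaces. A minimal sketch of a cleanup pass follows. The helper name tidy_lines is hypothetical, and dropping every blank line is an assumption that also discards paragraph breaks, which may or may not be what you want.

```python
def tidy_lines(text: str) -> str:
    """Strip whitespace from each line and drop lines that end up empty."""
    stripped = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped if line)


if __name__ == "__main__":
    raw = "NATIONAL STRATEGY FOR  \n\nADVANCED MANUFACTURING  \n\n \n\nA Report by the \n"
    print(tidy_lines(raw))
```
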

Origin blog.csdn.net/weixin_46530492/article/details/131677105