PDF manipulation libraries
Five PDF libraries are covered: PyPDF3, textract, tika, pdfplumber, and pdfminer.
PyPDF3
Install it with the pip package manager; the other libraries are installed the same way.
pip install PyPDF3
The advantage of this library is that it is easy to install. It does extract the text in the file, but it breaks a single line of text into multiple lines and can even split a word in two, so the overall extraction quality is not very high.
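import PyPDF3

# Open the PDF in binary mode; the context manager closes the file afterwards
with open(r'国际文件/National-Strategy-for-Advanced-Manufacturing-10072022.pdf', 'rb') as fhandle:
    pdfReader = PyPDF3.PdfFileReader(fhandle)
    # Loop over every page instead of hard-coding the page count (54 for this file)
    for i in range(pdfReader.getNumPages()):
        pagehandle = pdfReader.getPage(i)
        print(pagehandle.extractText())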
N
ATIONAL
S
TRATEGY FOR
A
DVANCED
M
ANUFACTURING
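The fragmentation in this output can be partly repaired after extraction. The following is a minimal sketch and not part of PyPDF3: `merge_drop_caps` is a hypothetical helper that assumes a line holding a single capital letter is a drop cap belonging to the line that follows it. That assumption fails for genuine one-letter words such as "A" or "I", so treat it as a heuristic only.

```python
def merge_drop_caps(text: str) -> str:
    """Rejoin lone capital letters with the line that follows them."""
    pieces = []
    pending = ""  # a lone capital letter waiting for the rest of its word
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if len(line) == 1 and line.isupper():
            # Two lone letters in a row: keep the first one as its own piece
            if pending:
                pieces.append(pending)
            pending = line
        else:
            pieces.append(pending + line)
            pending = ""
    if pending:
        pieces.append(pending)
    return " ".join(pieces)


sample = "N\nATIONAL\nS\nTRATEGY FOR\nA\nDVANCED\nM\nANUFACTURING"
print(merge_drop_caps(sample))
```

Applied to the eight lines printed by PyPDF3, the helper returns the title on a single line.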
textract
textract recognizes English text accurately. The value returned by textract.process is a byte string; calling decode on it yields a normal text string. In practice, the extraction accuracy is very high.
# some python file
import textract
text = textract.process("国际文件/National-Strategy-for-Advanced-Manufacturing-10072022.pdf")
string = text.decode("utf-8")
print(string)
NATIONAL STRATEGY FOR ADVANCED MANUFACTURING
A Report by the SUBCOMMITTEE ON ADVANCED MANUFACTURING
COMMITTEE ON TECHNOLOGY of the
NATIONAL SCIENCE AND TECHNOLOGY COUNCIL
October 2022
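The decoding step is ordinary Python bytes handling, so it can be tried without textract installed. The sketch below uses a hypothetical helper, `to_text`, and assumes the bytes are UTF-8; passing `errors="replace"` is a choice made here so that a stray invalid byte turns into the replacement character instead of raising `UnicodeDecodeError`.

```python
def to_text(data: bytes, encoding: str = "utf-8") -> str:
    """Decode extracted bytes, replacing any undecodable byte with U+FFFD."""
    return data.decode(encoding, errors="replace")


# Stand-in for the bytes that an extractor would return
raw = "NATIONAL STRATEGY FOR ADVANCED MANUFACTURING".encode("utf-8")
print(to_text(raw))
```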
Apache Tika
Installation:
pip install tika
tika-python is a Python port of the Apache Tika library. Because it starts a Tika REST server in the background, Java 7+ must be installed on the system for the library to work normally.
from tika import parser
file = "国际文件/National-Strategy-for-Advanced-Manufacturing-10072022.pdf"
file_data = parser.from_file(file)
text = file_data['content']
print(text)
NATIONAL STRATEGY FOR
ADVANCED MANUFACTURING
A Report by the
SUBCOMMITTEE ON ADVANCED MANUFACTURING
COMMITTEE ON TECHNOLOGY
of the
NATIONAL SCIENCE AND TECHNOLOGY COUNCIL
October 2022 October 2022
The final text extraction works quite well.
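The parser returns a dictionary, and the extracted text sits under its 'content' key. As a hedged sketch of light post-processing: `tidy_content` is a hypothetical helper, and it assumes that 'content' may be missing or None (for example when no text could be extracted) and that the text may contain long runs of blank lines. It works on a plain dict, so it can be tried without a running Tika server.

```python
import re


def tidy_content(parsed: dict) -> str:
    """Return the 'content' text with runs of blank lines collapsed to one."""
    content = parsed.get("content") or ""  # tolerate a missing or None value
    content = re.sub(r"\n\s*\n+", "\n\n", content)
    return content.strip()


# Hand-written stand-in for a parser result, not real Tika output
fake_result = {
    "metadata": {},
    "content": "\n\n\n\nNATIONAL STRATEGY FOR\nADVANCED MANUFACTURING\n\n\n\nA Report by the\n",
}
print(tidy_content(fake_result))
```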
pdfPlumber
pdfplumber is easy to install and easy to use, and it extracts text well.
import pdfplumber

with pdfplumber.open(r'国际文件/National-Strategy-for-Advanced-Manufacturing-10072022.pdf') as pdf:
    # Loop over every page instead of hard-coding the page count (54 for this file)
    for page in pdf.pages:
        print(page.extract_text())
N S
ATIONAL TRATEGY FOR
A M
DVANCED ANUFACTURING
A Report by the
SUBCOMMITTEE ON ADVANCED MANUFACTURING
COMMITTEE ON TECHNOLOGY
of the
NATIONAL SCIENCE AND TECHNOLOGY COUNCIL
OOccttoobbeerr 22002222
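The last line of this output shows every character doubled, which typically happens when a PDF draws the same text twice. pdfplumber documents a `dedupe_chars()` page method aimed at this case, and that is probably the more robust route. The sketch below is only a text-level fallback: `undouble_token` and `undouble_line` are hypothetical helpers, and the `min_len` threshold is an assumption meant to leave short tokens such as "11" alone. A word that really consists of doubled letters would be collapsed as well.

```python
def undouble_token(token: str, min_len: int = 4) -> str:
    """Collapse a token such as 'OOcctt' to 'Oct' when every character is doubled."""
    is_doubled = (
        len(token) >= min_len
        and len(token) % 2 == 0
        and token[0::2] == token[1::2]
    )
    return token[0::2] if is_doubled else token


def undouble_line(line: str) -> str:
    """Apply undouble_token to each space-separated token of a line."""
    return " ".join(undouble_token(tok) for tok in line.split(" "))


print(undouble_line("OOccttoobbeerr 22002222"))  # October 2022
```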
pdfminer
The official documentation is very detailed, but the library is somewhat complicated to use: you need to read the sample code carefully to get started. The text extraction accuracy is also quite good.
import io

from pdfminer3.converter import TextConverter
from pdfminer3.layout import LAParams
from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer3.pdfpage import PDFPage

# The converter writes the extracted text into an in-memory buffer
resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)

with open('国际文件/National-Strategy-for-Advanced-Manufacturing-10072022.pdf', 'rb') as fh:
    for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
        page_interpreter.process_page(page)
    # Read the accumulated text once all pages have been processed
    text = fake_file_handle.getvalue()

# close open handles
converter.close()
fake_file_handle.close()

print(text)
NATIONAL STRATEGY FOR
ADVANCED MANUFACTURING
A Report by the
SUBCOMMITTEE ON ADVANCED MANUFACTURING
COMMITTEE ON TECHNOLOGY
of the
NATIONAL SCIENCE AND TECHNOLOGY COUNCIL
October 2022
October 2022
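The date appears twice in this output, most likely for the same reason as in the Tika result: the PDF contains the text twice. A minimal post-processing sketch follows. `drop_repeated_lines` is a hypothetical helper that removes a line when it is identical to the previous non-blank line. It would also drop a legitimately repeated line, so it should be applied with care.

```python
def drop_repeated_lines(text: str) -> str:
    """Remove lines that repeat the previous non-blank line."""
    kept = []
    previous = None  # last non-blank line seen so far
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and stripped == previous:
            continue  # same as the previous non-blank line: skip it
        kept.append(line)
        if stripped:
            previous = stripped
    return "\n".join(kept)


sample = "NATIONAL SCIENCE AND TECHNOLOGY COUNCIL\nOctober 2022\nOctober 2022"
print(drop_repeated_lines(sample))
```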