Easily extract PDF text (and other advanced operations) with the Python PyPDF2 library

PyPDF2 is a Python library for reading and manipulating PDF files, and it is especially handy when you need to extract text from them. Whether you want to analyze a document's content or search it for specific information, PyPDF2 can help. This article shows how to extract text from PDF files with PyPDF2 and provides sample code to get you started.

Install PyPDF2 library

First, you need to install the PyPDF2 library. You can use pip to install it:

pip install PyPDF2
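
The examples in this article use the newer class names (PdfReader, PdfWriter), so it can help to confirm which version pip actually installed. The following is a minimal sketch using only the standard library; the helper name installed_version is made up for illustration and is not part of PyPDF2.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("PyPDF2")
    if version is None:
        print("PyPDF2 is not installed; run: pip install PyPDF2")
    else:
        print(f"PyPDF2 {version} is installed")
```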


Open PDF files and read content

Let's start with a simple example. Suppose we have a PDF file named "YOLOv1.pdf" and we want to extract its text content.

import PyPDF2

# Open the PDF file in binary mode; the with block closes it automatically
with open('YOLOv1.pdf', 'rb') as pdf_file:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Get the number of pages in the PDF file
    num_pages = len(pdf_reader.pages)

    # Empty string that will hold the extracted text
    text = ""

    # Loop over every page and extract its text
    for page_num in range(num_pages):
        page = pdf_reader.pages[page_num]
        text += page.extract_text()

# Print the extracted text
print(text)

The above code will open a PDF file named "YOLOv1.pdf", iterate through each page and extract the text content into a string. Finally, it prints the extracted text.
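
Pages that contain only scanned images may yield no text at all, so it can be useful to skip empty results instead of concatenating blindly. The sketch below is one possible way to do that. The helper names join_page_texts and extract_pdf_text are hypothetical, and the newline separator is an arbitrary choice.

```python
def join_page_texts(pages, separator="\n"):
    """Concatenate the text of each page, skipping pages that yield no text.

    `pages` is any iterable of objects with an extract_text() method,
    such as PdfReader.pages.
    """
    chunks = []
    for page in pages:
        page_text = page.extract_text() or ""  # treat None like an empty string
        if page_text.strip():
            chunks.append(page_text)
    return separator.join(chunks)


def extract_pdf_text(path):
    """Open the PDF at `path` and return its text (requires PyPDF2)."""
    import PyPDF2  # imported here so join_page_texts works without it

    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        return join_page_texts(reader.pages)
```

With PyPDF2 installed, calling extract_pdf_text('YOLOv1.pdf') should return roughly the same text as the example above, minus any pages that produced nothing.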


Advanced usage

Beyond basic text extraction, PyPDF2 also offers other features, such as merging multiple PDF files, rotating pages, and adding bookmarks. The sections below walk through each of these with a code example.


Merge multiple PDF files

Sometimes, you may need to merge multiple PDF files into one. PyPDF2 allows you to do this.

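from PyPDF2 import PdfWriter

merger = PdfWriter()

# Raw strings (r"...") keep the backslashes in Windows paths literal
for pdf in [r"M:\YOLOv1.pdf", r"M:\YOLOv2.pdf"]:
    merger.append(pdf)

# Write the merged document, then release resources
merger.write(r"M:\merged.pdf")
merger.close()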

The above code will open two PDF files named 'YOLOv1.pdf' and 'YOLOv2.pdf' and merge their contents, in that order, into a new PDF file 'merged.pdf'.
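
When the files to merge live in one folder, the input list can be built instead of typed by hand. This is a rough sketch under that assumption. The names collect_pdfs and merge_pdfs are hypothetical helpers, and merge_pdfs relies on the same PdfWriter.append call used in the example above.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by file name."""
    return sorted(
        (p for p in Path(folder).iterdir()
         if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name,
    )


def merge_pdfs(paths, output_path):
    """Append each PDF in `paths` to one output file (requires PyPDF2)."""
    from PyPDF2 import PdfWriter  # imported here so collect_pdfs works without it

    merger = PdfWriter()
    for path in paths:
        merger.append(str(path))
    with open(output_path, "wb") as output_pdf:
        merger.write(output_pdf)
```

Sorting is plain lexicographic, so a file named 'part10.pdf' would come before 'part2.pdf'. Zero-padded names avoid that surprise.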

Rotate page

Sometimes, pages in a PDF file may need to be rotated. With PyPDF2 you can rotate pages to suit your needs.

import PyPDF2

# Open the source PDF (the raw string keeps the Windows path literal)
with open(r'M:\YOLOv1.pdf', 'rb') as pdf_file:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Create a new PDF writer object
    pdf_writer = PyPDF2.PdfWriter()

    # Rotate the first page 90 degrees clockwise
    page = pdf_reader.pages[0]
    page.rotate(90)
    pdf_writer.add_page(page)

    # Add the remaining, unrotated pages to the new file
    for page_num in range(1, len(pdf_reader.pages)):
        page = pdf_reader.pages[page_num]
        pdf_writer.add_page(page)

    # Save the result to a new PDF file while the source is still open
    with open(r'M:\YOLOv1-rd.pdf', 'wb') as output_pdf:
        pdf_writer.write(output_pdf)

The above code will open a PDF file named 'YOLOv1.pdf', rotate the first page 90 degrees clockwise, and save all pages, including the rotated one, to a new PDF file 'YOLOv1-rd.pdf'.
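
The same idea extends to rotating several pages at once. The function below is only a sketch. add_pages_with_rotation is a hypothetical helper that expects objects shaped like PdfReader.pages and PdfWriter, that is, pages with a rotate(angle) method and a writer with add_page(page).

```python
def add_pages_with_rotation(pages, writer, angle, indices):
    """Add every page to `writer`, rotating only those whose index is in `indices`.

    `angle` is passed to page.rotate() and must be a multiple of 90.
    Indices are zero-based.
    """
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")
    to_rotate = set(indices)
    for index, page in enumerate(pages):
        if index in to_rotate:
            page.rotate(angle)
        writer.add_page(page)
```

For example, add_pages_with_rotation(pdf_reader.pages, pdf_writer, 90, [0, 2]) would rotate the first and third pages and copy the rest unchanged.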

Add bookmark

You can also use PyPDF2 to add bookmarks to PDF files to make it easier to navigate and find content.

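import PyPDF2

# Open the source PDF (the raw string keeps the Windows path literal)
with open(r'M:\YOLOv1.pdf', 'rb') as pdf_file:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Create a new PDF writer object
    pdf_writer = PyPDF2.PdfWriter()

    # Loop over every page and add it to the new file
    for page_num in range(len(pdf_reader.pages)):
        page = pdf_reader.pages[page_num]
        pdf_writer.add_page(page)

    # Add bookmarks (add_outline_item was called add_bookmark in older PyPDF2 releases)
    pdf_writer.add_outline_item('Chapter 1', 0)  # bookmark "Chapter 1" on the first page
    pdf_writer.add_outline_item('Chapter 2', 5)  # bookmark "Chapter 2" on the sixth page

    # Save the bookmarked content to a new PDF file while the source is still open
    with open(r'M:\YOLOv1-copy.pdf', 'wb') as output_pdf:
        pdf_writer.write(output_pdf)

The above code will open a PDF file named 'YOLOv1.pdf', copy its contents into a new PDF file 'YOLOv1-copy.pdf', and add two bookmarks pointing to the first and sixth pages. Page indices are zero-based, so the document needs at least six pages for the second bookmark to be valid.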
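
Because bookmark targets are zero-based page indices, a title pointing past the last page is an easy mistake to make. One way to guard against it is sketched below. add_chapter_outline is a hypothetical helper, and it assumes the writer exposes add_outline_item(title, page_index), the method name recent PyPDF2 releases use for adding bookmarks.

```python
def add_chapter_outline(writer, chapters, page_count):
    """Add one outline item (bookmark) per (title, page_index) pair.

    Page indices are zero-based. Entries that fall outside the document
    are skipped and returned so the caller can report them.
    """
    skipped = []
    for title, page_index in chapters:
        if 0 <= page_index < page_count:
            writer.add_outline_item(title, page_index)
        else:
            skipped.append((title, page_index))
    return skipped
```

A call such as add_chapter_outline(pdf_writer, [('Chapter 1', 0), ('Chapter 2', 5)], len(pdf_reader.pages)) would add both bookmarks for a document of six or more pages and return an empty list.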

Conclusion

With the PyPDF2 library you can easily extract text from PDF files, which is useful for data analysis, information retrieval, and automation tasks. Hopefully this article and its sample code help you get started with PyPDF2 for PDF text extraction. For other advanced operations, such as scaling pages, see the official PyPDF2 documentation for more examples.

Origin blog.csdn.net/weixin_38739735/article/details/132893519