Several Ways to Work with PDFs in Python

Author | Chen Xi

Source | Early Python (ID: zaoqi-python)

Header image | CSDN, downloaded from Visual China

Preface

Hello everyone. I previously wrote a case study on working with PDFs in Python, "PDF batch merge". Its purpose was to give you a handy script, so it did not explain much of the underlying principle. It relied on PyPDF2, a very practical PDF processing module. This article takes a closer look at that module and mainly covers:

Combined use of the os module
Combined use of the glob module
Operations with the PyPDF2 module

Basic operation

The import statement for PyPDF2 usually looks like this:

from PyPDF2 import PdfFileReader, PdfFileWriter

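Two classes are imported here:

PdfFileReader can be thought of as a reader
PdfFileWriter can be thought of as a writer

Note that these names belong to PyPDF2 releases before 3.0; from 3.0 on, the classes are called PdfReader and PdfWriter.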

Next, we will get to know these two tools through a few cases. The sample files are the PDFs of 5 invoices.

Each invoice PDF consists of two pages.

Merge

The first task is to merge the 5 invoice PDFs into a single 10-page PDF. How should the reader and the writer work together here?

The logic is as follows:

The reader reads each PDF in turn
The reader passes what it has read to the writer
The writer outputs everything to one new PDF

There is one important point here: the reader can only hand its content to the writer page by page.

So the first and second steps above are not really independent. After the reader has read a PDF, it loops over all of that PDF's pages and hands them to the writer one by one. Output happens only once, after all the reading is finished.

The code makes the idea clearer:

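from PyPDF2 import PdfFileReader, PdfFileWriter

path = r'C:\Users\xxxxxx'
# One writer for the whole job, created before the loop
pdf_writer = PdfFileWriter()

for i in range(1, 6):
    # Read INV1.pdf ... INV5.pdf in turn
    pdf_reader = PdfFileReader(path + '/INV{}.pdf'.format(i))
    # Hand the pages to the writer one at a time
    for page in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(page))

# Output once, after every PDF has been read.
# The 合并PDF ("merged PDF") subfolder must already exist.
with open(path + r'\合并PDF\merge.pdf', 'wb') as out:
    pdf_writer.write(out)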

Because all the content must go to the same writer for the final output, the writer must be initialized outside the loop body.

If it were initialized inside the loop body, a new writer would be created each time a PDF is read. Each new writer would replace the previous one together with the pages it had received, so the merge would not work.

The code at the beginning of the loop body:

for i in range(1, 6):
    pdf_reader = PdfFileReader(path + '/INV{}.pdf'.format(i))

Its purpose is to read a new PDF file on each iteration and hand it to the reader for the subsequent operations. This approach is not really recommended: it works only because the PDF file names happen to follow a regular pattern, so the loop can generate the numbers directly. A better way is to use the glob module:

import glob
for file in glob.glob(path + '/*.pdf'):
    # Pass each matched file, not the folder path, to the reader
    pdf_reader = PdfFileReader(file)
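
As a minimal sketch of the glob approach, the helper below collects the PDF files in a folder using only the standard library. The function name list_pdfs is made up for illustration, and the sorting is an added assumption: glob.glob does not promise any particular order, so sorting keeps the merge order predictable.

```python
import glob
import os


def list_pdfs(folder):
    """Return the paths of all .pdf files directly inside folder, sorted by name.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    pattern = os.path.join(folder, '*.pdf')
    # glob.glob makes no promise about ordering, so sort for a predictable result.
    return sorted(glob.glob(pattern))
```

Each returned path could then be passed to PdfFileReader in place of the numbered loop. The sort is plain string ordering, so INV10.pdf would come before INV2.pdf; zero-padded file names avoid this.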

In the code, pdf_reader.getNumPages() returns the number of pages in the reader; combined with range, it lets us loop over all of the reader's pages.

pdf_writer.addPage(pdf_reader.getPage(page)) hands the current page to the writer.

Finally, the with statement creates the new PDF file, and the writer's pdf_writer.write(out) method writes the output into it.

Split

If you understand how the reader and the writer cooperate in the merge operation, splitting is easy to follow. As an example, we split INV1.pdf into two separate PDF documents. Again, let's lay out the logic first:

The reader reads the PDF document
The reader hands the pages to the writer one by one
The writer outputs immediately each time it receives a page

From this logic it follows that the writer must be both initialized and output inside the loop over the PDF's pages, not outside it.

The code is simple:

from PyPDF2 import PdfFileReader, PdfFileWriter
path = r'C:\Users\xxx'
# Raw strings keep the backslash from being read as an escape sequence
pdf_reader = PdfFileReader(path + r'\INV1.pdf')

for page in range(pdf_reader.getNumPages()):
    # Create a fresh writer for each page
    pdf_writer = PdfFileWriter()
    pdf_writer.addPage(pdf_reader.getPage(page))
    # As soon as the writer holds one page, write it out as its own PDF
    with open(path + r'\INV1-{}.pdf'.format(page + 1), 'wb') as out:
        pdf_writer.write(out)
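
The script above hard-codes the INV1 name. As a rough sketch of how the os module mentioned in the preface could help, the helper below derives the per-page output names from any source path. The name split_output_paths and its parameters are hypothetical and not part of PyPDF2.

```python
import os


def split_output_paths(src_path, num_pages, out_dir=None):
    """Return one output path per page, e.g. INV1.pdf -> INV1-1.pdf, INV1-2.pdf.

    Hypothetical helper for illustration. If out_dir is None, the files
    are placed next to the source file.
    """
    folder, filename = os.path.split(src_path)
    stem, ext = os.path.splitext(filename)
    if out_dir is None:
        out_dir = folder
    # Page numbers in the file names start at 1, as in the script above.
    return [os.path.join(out_dir, '{}-{}{}'.format(stem, n, ext))
            for n in range(1, num_pages + 1)]
```

In the split loop, the path for page index page would be the element at that index of split_output_paths(src_path, pdf_reader.getNumPages()).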

Watermark

The task is to add the image shown in the figure below to INV1.pdf as a watermark.

First, the preparation: insert the image to be used as the watermark into Word, adjust it to a suitable position, and save the document as a PDF file. Then we can write the code, which additionally needs the copy module. The details are explained in the figure below:

The code starts by initializing the reader and the writer and reading the watermark PDF page. The core code is a little harder to understand:

Watermarking essentially means merging the watermark PDF page with every page that needs a watermark.

The PDF to be watermarked may have many pages, while the watermark PDF has only one. If the watermark page were merged directly, you can think of it this way: once it has been merged with the first page, the original watermark page is used up.

So the watermark page cannot be merged directly. Instead, keep copying it into a spare new_page, merge that copy with the current page using the .mergePage method, and hand the merged pages to the writer for a single final output.

On the use of .mergePage: the call takes the form (page that appears underneath).mergePage(page that appears on top). The final result is shown in the figure:
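
The steps described above can be sketched as follows. This is a minimal sketch, not the original code from the figure. It assumes a PyPDF2 release older than 3.0 (the PdfFileReader, getPage, addPage and mergePage names used in this article); the function names watermark_pages and add_watermark and their parameters are made up for illustration.

```python
from copy import copy


def watermark_pages(pages, watermark_page):
    """Yield one watermarked page for each page in pages.

    Hypothetical helper. Assumes page objects with a mergePage method,
    where below.mergePage(above) draws `above` on top of `below`.
    """
    for page in pages:
        # Copy the watermark so the original stays untouched for the next page.
        new_page = copy(watermark_page)
        # The watermark copy is the lower layer; the document page goes on top.
        new_page.mergePage(page)
        yield new_page


def add_watermark(src_path, watermark_path, out_path):
    """Write a watermarked copy of src_path to out_path (hypothetical helper)."""
    # Imported here so that watermark_pages itself does not require PyPDF2.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    # The watermark PDF has a single page.
    watermark_page = PdfFileReader(watermark_path).getPage(0)
    pdf_reader = PdfFileReader(src_path)
    pdf_writer = PdfFileWriter()

    pages = (pdf_reader.getPage(i) for i in range(pdf_reader.getNumPages()))
    for new_page in watermark_pages(pages, watermark_page):
        pdf_writer.addPage(new_page)

    # One unified output after every page has been merged.
    with open(out_path, 'wb') as out:
        pdf_writer.write(out)
```

Because the watermark copy is the lower layer, it shows through only where the page on top is transparent. Swapping the roles, so that the document page calls mergePage with the watermark, would draw the watermark on top instead.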

Encryption

Encryption is very simple. Just remember: "encryption is applied to the writer".

So you only need to call pdf_writer.encrypt(password) once the other operations are complete.

Take the encryption of a single PDF as an example:
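
A minimal sketch of such an example is given here; it is not the author's original code. It again assumes a PyPDF2 release older than 3.0, and the function names copy_pages_encrypted and encrypt_file and their parameters are hypothetical.

```python
def copy_pages_encrypted(pdf_reader, pdf_writer, password):
    """Copy every page from the reader to the writer, then encrypt the writer.

    Hypothetical helper; expects PyPDF2 (< 3.0) style reader and writer objects.
    """
    for page in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(page))
    # Encryption is applied to the writer, after the pages have been added.
    pdf_writer.encrypt(password)
    return pdf_writer


def encrypt_file(src_path, out_path, password):
    """Write a password-protected copy of src_path to out_path (hypothetical helper)."""
    # Imported here so that copy_pages_encrypted itself does not require PyPDF2.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_writer = copy_pages_encrypted(
        PdfFileReader(src_path), PdfFileWriter(), password)
    with open(out_path, 'wb') as out:
        pdf_writer.write(out)
```

The same encrypt call could be added just before the final write in the merge or watermark scripts, since in each case the writer is what produces the output.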

Final words

Of course, beyond merging, splitting, encrypting, and watermarking PDFs, we can also use Python together with Excel and Word to meet more automation needs. These are left for readers to explore themselves.

Finally, I hope everyone sees that one of the core ideas of Python office automation is batch operation: freeing your hands and automating complex tasks!
