The Most Complete Summary! Several Ways to Work with PDFs in Python

Author | Chen Xi

Source | Early Python

Preface

This article mainly covers:

Combined use of the os module
Combined use of the glob module
Operations with the PyPDF2 module

Basic operation

PyPDF2 is usually imported like this:

from PyPDF2 import PdfFileReader, PdfFileWriter

Two classes are imported here:

PdfFileReader can be understood as a reader
PdfFileWriter can be understood as a writer

Next, we will look at how these two tools work through a few cases. The sample files are the PDFs of 5 invoices.

The PDF of each invoice consists of two pages:

Merge

The first job is to merge the 5 invoice PDFs into a single 10-page PDF. How should the reader and writer work together here?

The logic is as follows:

The reader reads all the PDFs
The reader passes what it has read to the writer
The writer outputs everything to one new PDF

There is also an important point here: the reader can only hand content to the writer one page at a time.

Therefore, the first and second steps in the logic are not really independent. After the reader reads a PDF, it loops over all the pages of that PDF and passes them to the writer one by one. The output happens only after all the reading is finished.

Looking at the code can make the idea clearer:

from PyPDF2 import PdfFileReader, PdfFileWriter

path = r'C:\Users\xxxxxx'
pdf_writer = PdfFileWriter()

for i in range(1, 6):
    pdf_reader = PdfFileReader(path + '/INV{}.pdf'.format(i))
    for page in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(page))

with open(path + r'\合并PDF\merge.pdf', 'wb') as out:
    pdf_writer.write(out)

Since all content needs to be delivered to the same writer for final output, the initialization of the writer must be outside the loop body.

If it were inside the loop body, a new writer would be created every time a PDF is read, so the content handed over by each earlier reader would be discarded, and the merge could not be achieved!

The code at the beginning of the loop body:

for i in range(1, 6):
    pdf_reader = PdfFileReader(path + '/INV{}.pdf'.format(i))

The purpose is to open a new PDF file in each iteration and hand it to a reader for the subsequent operations. This way of writing it is not really recommended: it only works because the PDF names happen to be very regular, so the numbers to loop over can be specified by hand. A better way is to use the glob module:

import glob

for file in glob.glob(path + '/*.pdf'):
    pdf_reader = PdfFileReader(file)

In the merge code, pdf_reader.getNumPages() returns the number of pages in the reader, and combined with range it lets us traverse every page of the reader.

pdf_writer.addPage(pdf_reader.getPage(page)) hands the current page to the writer.

Finally, a with statement creates the new PDF, and the writer's pdf_writer.write(out) method outputs it.
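
To make the glob approach concrete, here is a minimal sketch of the merge driven by glob instead of a hard-coded range. The helper names sorted_pdfs and merge_pdfs are hypothetical and do not come from the article. Because glob.glob returns paths in arbitrary order, the sketch sorts them by the number in each file name, so that a file such as INV10.pdf would come after INV2.pdf. It uses the same PdfFileReader and PdfFileWriter names as the rest of this article; PyPDF2 3.0 and later renamed them, so an older PyPDF2 release is assumed.

```python
import glob
import os
import re


def sorted_pdfs(folder):
    """Return the PDF paths directly inside `folder`, ordered by the number in each name."""
    def number_in_name(path):
        # 'INV10.pdf' -> 10; names without any digits sort first
        digits = re.findall(r'\d+', os.path.basename(path))
        return int(digits[0]) if digits else -1

    return sorted(glob.glob(os.path.join(folder, '*.pdf')), key=number_in_name)


def merge_pdfs(folder, out_file):
    """Merge every PDF in `folder` into `out_file`, page by page."""
    # Imported lazily; sorted_pdfs() above does not need PyPDF2.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_writer = PdfFileWriter()  # one writer, created outside the loop
    for file in sorted_pdfs(folder):
        pdf_reader = PdfFileReader(file)
        for page in range(pdf_reader.getNumPages()):
            pdf_writer.addPage(pdf_reader.getPage(page))

    with open(out_file, 'wb') as out:
        pdf_writer.write(out)
```

The output file is best written outside the scanned folder (or into a subfolder, as the merge code above does); otherwise a second run would pick up the merged file as one of its inputs.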

Split

If you understand how the reader and the writer cooperate in the merge operation, splitting is easy to understand. Here we take splitting INV1.pdf into two separate PDF documents as an example, and again start by going through the logic:

The reader reads the PDF document
The reader hands the pages to a writer one by one
The writer outputs immediately each time it receives a page

From this logic we can also see that both the creation of the writer and its output must sit inside the loop that goes through the pages of the PDF, not outside it.

The code is simple:

from PyPDF2 import PdfFileReader, PdfFileWriter

path = r'C:\Users\xxx'
pdf_reader = PdfFileReader(path + r'\INV1.pdf')

for page in range(pdf_reader.getNumPages()):
    # Create a fresh writer for each page as we reach it
    pdf_writer = PdfFileWriter()
    pdf_writer.addPage(pdf_reader.getPage(page))
    # As soon as the writer has received its page, output a PDF
    with open(path + r'\INV1-{}.pdf'.format(page + 1), 'wb') as out:
        pdf_writer.write(out)
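
The same splitting logic can be wrapped in a reusable function that builds the output names with the os module instead of string concatenation. This is only a sketch: page_output_path and split_pdf are hypothetical names, and the naming pattern simply follows the INV1-1.pdf, INV1-2.pdf convention used above.

```python
import os


def page_output_path(src_file, page_number, out_dir=None):
    """Build the output name for one page, e.g. INV1.pdf, page 2 -> INV1-2.pdf."""
    folder, name = os.path.split(src_file)
    stem, ext = os.path.splitext(name)
    return os.path.join(out_dir or folder, '{}-{}{}'.format(stem, page_number, ext))


def split_pdf(src_file, out_dir=None):
    """Write each page of `src_file` to its own PDF and return the new paths."""
    # Imported lazily; page_output_path() above does not need PyPDF2.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_reader = PdfFileReader(src_file)
    outputs = []
    for page in range(pdf_reader.getNumPages()):
        pdf_writer = PdfFileWriter()  # a fresh writer for every page
        pdf_writer.addPage(pdf_reader.getPage(page))
        target = page_output_path(src_file, page + 1, out_dir)
        with open(target, 'wb') as out:  # output inside the loop
            pdf_writer.write(out)
        outputs.append(target)
    return outputs
```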

Watermark

This task is to add the following picture as a watermark to INV1.pdf.

First comes the preparation. Insert the picture to be used as the watermark into Word, adjust its position, and save the document as a PDF file. Then the code can be written; the copy module is also needed. The details are shown in the figure below:

The first part initializes the reader and the writer and reads the watermark PDF page up front for later use. The core code is a little harder to understand:

Watermarking essentially means merging the watermark PDF page with every page that needs a watermark.

The PDF to be watermarked may have many pages, while the watermark PDF has only one. If the watermark page were merged directly, it can be pictured like this: once it has been merged with the first page, the watermark page is used up.

Therefore it cannot be merged directly. Instead, the watermark page must be copied into a spare page, new_page, each time, and the .mergePage method is then used to merge it with the current page. The merged page is handed to the writer for the final unified output!

On the use of .mergePage: the call has the form (page that appears underneath).mergePage(page that appears on top). The final effect is as shown in the figure:
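
Since the figures are not reproduced here, the following is a minimal sketch reconstructed from the description above, not the author's original code. The function names stamp_pages and add_watermark and all file path parameters are hypothetical; the calls themselves (copy, getPage, mergePage, addPage) are the ones the text names.

```python
from copy import copy


def stamp_pages(pages, watermark_page, pdf_writer):
    """Merge each page with its own copy of the watermark and hand it to the writer."""
    for page in pages:
        # Work on a copy so the watermark page that was read in is not modified
        new_page = copy(watermark_page)
        # new_page is underneath; the content of `page` is drawn on top of it
        new_page.mergePage(page)
        pdf_writer.addPage(new_page)


def add_watermark(src_file, watermark_file, out_file):
    """Watermark every page of `src_file` with the first page of `watermark_file`."""
    # Imported lazily; stamp_pages() above does not need PyPDF2.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    watermark_page = PdfFileReader(watermark_file).getPage(0)
    pdf_reader = PdfFileReader(src_file)
    pdf_writer = PdfFileWriter()

    pages = [pdf_reader.getPage(i) for i in range(pdf_reader.getNumPages())]
    stamp_pages(pages, watermark_page, pdf_writer)

    # Unified output after all pages have been merged
    with open(out_file, 'wb') as out:
        pdf_writer.write(out)
```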

Encryption

Encryption is very simple; just remember: "encryption is applied to the writer".

Therefore you only need to call pdf_writer.encrypt(password) after the other operations are finished.

Take the encryption of a single PDF as an example:
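
A minimal sketch of this case, assuming the PDF is simply copied page by page and then encrypted. The names copy_and_encrypt and encrypt_pdf, and the path and password parameters, are hypothetical; only pdf_writer.encrypt(password) comes from the text above.

```python
def copy_and_encrypt(pdf_reader, pdf_writer, password):
    """Hand every page of the reader to the writer, then encrypt the writer."""
    for page in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(page))
    # Encryption is applied to the writer, after the pages have been added
    pdf_writer.encrypt(password)


def encrypt_pdf(src_file, out_file, password):
    """Write a password-protected copy of `src_file` to `out_file`."""
    # Imported lazily; copy_and_encrypt() above does not need PyPDF2.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_writer = PdfFileWriter()
    copy_and_encrypt(PdfFileReader(src_file), pdf_writer, password)

    with open(out_file, 'wb') as out:
        pdf_writer.write(out)
```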

Of course, beyond PDF merging, splitting, encryption, and watermarking, Python can also be combined with Excel and Word to meet more automation needs, which are left for readers to explore themselves.


