The most complete summary! Working with PDFs in Python: merging, splitting, watermarking, and encryption

1. Introduction

Hello everyone. I previously wrote up a case study on working with PDFs in Python. Its aim was to give you a convenient script, so it did not explain much of the underlying principle. It relied on PyPDF2, a very practical module for PDF processing, and this article analyzes that module in detail. The article mainly involves:

  • Combined use of the os module
  • Combined use of the glob module
  • Operations with the PyPDF2 module

2. Basic operations

PyPDF2 is usually imported like this:

from PyPDF2 import PdfFileReader, PdfFileWriter

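Two classes are imported here (these names apply to PyPDF2 releases before 3.0; from 3.0 on they are called PdfReader and PdfWriter):

  • PdfFileReader can be understood as a reader
  • PdfFileWriter can be understood as a writer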
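
As a minimal sketch of how the two classes divide the work, assuming a PyPDF2 release older than 3.0 (where PdfFileReader, PdfFileWriter and the camelCase methods exist): the reader reports its page count and hands out pages by index, while the writer collects pages and writes them to a file opened in binary mode. The helper names iter_pages and copy_pdf are hypothetical and exist only for this illustration.

```python
def iter_pages(pdf_reader):
    """Yield the reader's pages one at a time, in document order."""
    for index in range(pdf_reader.getNumPages()):
        yield pdf_reader.getPage(index)


def copy_pdf(source_path, target_path):
    """Read every page of one PDF and write them unchanged to a new file."""
    # Imported inside the function so iter_pages stays usable without PyPDF2.
    # Assumption: PyPDF2 < 3.0, where these class names are available.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_reader = PdfFileReader(source_path)
    pdf_writer = PdfFileWriter()
    for page in iter_pages(pdf_reader):
        pdf_writer.addPage(page)
    # PDF data is binary, so the output file is opened with 'wb'.
    with open(target_path, 'wb') as out:
        pdf_writer.write(out)
```

Every case in this article is a variation on this read, hand over, write pattern.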

Next, a few cases show what these two tools can do. The sample files are five invoice PDFs, and each invoice PDF consists of two pages.

3. Merging

The first job is to merge the 5 invoice PDFs into a single 10-page PDF. How should the reader and the writer work together here?

The logic is as follows:

  1. The reader reads each PDF in turn
  2. The reader passes what it has read to the writer
  3. The writer outputs everything to one new PDF

There is an important point here: the reader can only hand content to the writer one page at a time.

So the first and second steps are not really independent. After the reader has read a PDF, the code loops over all pages of that PDF and hands them to the writer page by page. The output is written only after all the reading is finished.

Looking at the code can make the idea clearer:

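from PyPDF2 import PdfFileReader, PdfFileWriter

path = r'C:\Users\xxxxxx'
# Create the writer once, before the loop, so it collects pages from every PDF
pdf_writer = PdfFileWriter()

# Assumes the five invoices are named INV1.pdf to INV5.pdf
for i in range(1, 6):
    pdf_reader = PdfFileReader(path + '/INV{}.pdf'.format(i))
    # Hand the pages to the writer one at a time
    for page in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(page))

# Write once, after all PDFs have been read (the output folder must already exist)
with open(path + r'\合并PDF\merge.pdf', 'wb') as out:
    pdf_writer.write(out)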

Because all content has to go to the same writer for the final output, the writer must be initialized outside the loop body.

If it were initialized inside the loop body, a new writer would be created every time a PDF is read. The pages handed over from the earlier PDFs would be discarded each time, and the merge could not be achieved!
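
The effect of the writer's position can be seen without any real PDFs. The sketch below uses a hypothetical stand-in class, PageCollector, that only mimics the writer's addPage method, and plain strings in place of pages; it is an illustration of the loop structure, not of PyPDF2 itself.

```python
class PageCollector:
    """Hypothetical stand-in for a PDF writer: it only remembers the pages it is given."""

    def __init__(self):
        self.pages = []

    def addPage(self, page):
        self.pages.append(page)


# Five pretend invoices with two pages each (strings instead of real pages).
invoices = [['INV{}-page{}'.format(n, p) for p in (1, 2)] for n in range(1, 6)]


def merge_with_writer_outside(documents):
    writer = PageCollector()  # created once, before the loop
    for document in documents:
        for page in document:
            writer.addPage(page)
    return writer.pages


def merge_with_writer_inside(documents):
    writer = PageCollector()
    for document in documents:
        writer = PageCollector()  # mistake: replaces the writer and drops earlier pages
        for page in document:
            writer.addPage(page)
    return writer.pages


print(len(merge_with_writer_outside(invoices)))  # 10
print(merge_with_writer_inside(invoices))        # ['INV5-page1', 'INV5-page2']
```

With the writer created inside the loop, only the pages of the last invoice survive.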

The code at the beginning of the loop body:

for i in range(1, 6):
    pdf_reader = PdfFileReader(path + '/INV{}.pdf'.format(i))

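Its purpose is to read a new PDF file on each iteration and hand it to the reader for further processing. This style is actually not recommended. It works only because the PDFs happen to be named regularly, so the numbers can be hard-coded in the loop. A better approach is the glob module: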

import glob
for file in glob.glob(path + '/*.pdf'):
    # Pass each matched file, not the folder path, to the reader
    pdf_reader = PdfFileReader(file)

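In the code, pdf_reader.getNumPages() returns the number of pages in the reader. Combined with range, it lets you iterate over all of the reader's pages.

pdf_writer.addPage(pdf_reader.getPage(page)) hands the current page to the writer.

Finally, use with to create a new PDF and write it out through the writer's pdf_writer.write(out) method.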
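
One detail the glob approach leaves open is ordering: glob does not promise any particular order, and a plain alphabetical sort would place INV10.pdf before INV2.pdf. The following is a sketch of one way to combine os and glob for a predictable merge, assuming PyPDF2 older than 3.0. The names natural_key, list_pdfs and merge_folder are hypothetical helpers, not part of any library.

```python
import glob
import os
import re


def natural_key(file_path):
    """Sort key that compares digit runs in the file name as numbers, so INV2 < INV10."""
    name = os.path.basename(file_path)
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r'(\d+)', name)]


def list_pdfs(folder):
    """Return the PDFs directly inside folder in natural (human) order."""
    return sorted(glob.glob(os.path.join(folder, '*.pdf')), key=natural_key)


def merge_folder(folder, output_path):
    """Merge every PDF in folder into output_path, page by page."""
    # Assumption: PyPDF2 < 3.0, where these class names are available.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_writer = PdfFileWriter()  # one writer, created before the loop
    for file in list_pdfs(folder):
        pdf_reader = PdfFileReader(file)
        for page in range(pdf_reader.getNumPages()):
            pdf_writer.addPage(pdf_reader.getPage(page))

    # Create the output folder if it is missing, then write once at the end.
    os.makedirs(os.path.dirname(output_path) or '.', exist_ok=True)
    with open(output_path, 'wb') as out:
        pdf_writer.write(out)
```

If the output file is written into the same folder that is being scanned, a second run would pick up the merged file as input, so a separate output folder is the safer choice.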

4. Splitting

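If you understand how the reader and the writer cooperate when merging, splitting is easy to follow. Here we split INV1.pdf into 2 separate PDF documents as an example. Again, let's walk through the logic first:

  1. The reader reads the PDF document
  2. The reader hands the pages to the writer one at a time
  3. The writer outputs a file as soon as it receives a page

This logic also shows that both the initialization of the writer and the output must sit inside the loop that iterates over the pages of the PDF, not outside it.

The code is simple: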

from PyPDF2 import PdfFileReader, PdfFileWriter
path = r'C:\Users\xxx'
pdf_reader = PdfFileReader(path + r'\INV1.pdf')

for page in range(pdf_reader.getNumPages()):
    # Create a new writer for each page
    pdf_writer = PdfFileWriter()
    pdf_writer.addPage(pdf_reader.getPage(page))
    # Write the PDF out as soon as the writer has received its page
    with open(path + r'\INV1-{}.pdf'.format(page + 1), 'wb') as out:
        pdf_writer.write(out)
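
The same one-writer-per-output idea extends to splitting into groups of several pages. The sketch below is an illustration under the assumption of PyPDF2 older than 3.0; page_groups and split_pdf are hypothetical helper names, and output_pattern is an assumed parameter holding a file name template such as 'INV1-{}.pdf'.

```python
def page_groups(num_pages, pages_per_file):
    """Return lists of zero-based page indexes, one list per output file."""
    if pages_per_file < 1:
        raise ValueError('pages_per_file must be at least 1')
    return [list(range(start, min(start + pages_per_file, num_pages)))
            for start in range(0, num_pages, pages_per_file)]


def split_pdf(source_path, output_pattern, pages_per_file=1):
    """Split source_path into files named output_pattern.format(1), .format(2), ..."""
    # Assumption: PyPDF2 < 3.0, where these class names are available.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_reader = PdfFileReader(source_path)
    groups = page_groups(pdf_reader.getNumPages(), pages_per_file)
    for number, group in enumerate(groups, start=1):
        pdf_writer = PdfFileWriter()  # a fresh writer for every output file
        for index in group:
            pdf_writer.addPage(pdf_reader.getPage(index))
        with open(output_pattern.format(number), 'wb') as out:
            pdf_writer.write(out)
```

With pages_per_file=1 this reproduces the page-by-page split; for a two-page invoice, page_groups(2, 1) gives [[0], [1]].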

5. Watermarking

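This task adds an image as a watermark to INV1.pdf.

First, the preparation: insert the image to be used as the watermark into Word, adjust it to a suitable position, and save the document as a PDF file. Then the coding can begin, which additionally needs the copy module. The setup initializes the reader and the writer and reads the watermark PDF page in advance for later use. The core code is slightly harder to understand: adding a watermark essentially means merging the watermark PDF page with every page that needs a watermark.

The PDF to be watermarked may have many pages, while the watermark PDF has only one. If the watermark page were merged directly, you can think of it abstractly as the watermark page being used up after the first page. More concretely, .mergePage modifies the page it is called on, so the original watermark page would no longer be clean for the next page.

Therefore it cannot be merged directly. Instead, the watermark page is copied with copy each time into a fresh page, new_page, which is then merged with the current page through the .mergePage method. The merged pages are handed to the writer and written out together.

On using .mergePage: the call has the form bottom_page.mergePage(top_page). The page the method is called on ends up underneath, and the page passed as the argument is drawn on top.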
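
The copy-then-merge loop described above can be sketched as follows. This is a reconstruction from the description, not the author's exact code, and it assumes PyPDF2 older than 3.0. The function names add_watermark and watermark_file and their parameters are hypothetical.

```python
from copy import copy


def add_watermark(pdf_reader, watermark_page, pdf_writer):
    """Merge every page of pdf_reader over a fresh copy of watermark_page."""
    for index in range(pdf_reader.getNumPages()):
        # mergePage changes the page it is called on, so work on a copy
        # and keep the original watermark page clean for the next page.
        new_page = copy(watermark_page)
        # Watermark copy underneath, document page drawn on top.
        new_page.mergePage(pdf_reader.getPage(index))
        pdf_writer.addPage(new_page)


def watermark_file(source_path, watermark_path, output_path):
    """Watermark every page of source_path with the first page of watermark_path."""
    # Assumption: PyPDF2 < 3.0, where these class names are available.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    watermark_page = PdfFileReader(watermark_path).getPage(0)
    pdf_writer = PdfFileWriter()
    add_watermark(PdfFileReader(source_path), watermark_page, pdf_writer)
    with open(output_path, 'wb') as out:
        pdf_writer.write(out)
```

Because each output page starts life as a copy of the watermark page, it is safest to prepare the watermark PDF with the same page size as the document being watermarked.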

6. Encryption

Encryption is very simple. Just remember: encryption is applied to the writer.

So once the relevant operations are complete, you only need to call pdf_writer.encrypt(password).

For a single PDF, that means reading the file, handing every page to a writer, calling encrypt on the writer, and then writing the result out.
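
A minimal sketch of encrypting a single PDF, assuming PyPDF2 older than 3.0. The function names encrypt_pages and encrypt_file and their parameters are hypothetical; only the encrypt call on the writer comes from the text above.

```python
def encrypt_pages(pdf_reader, pdf_writer, password):
    """Copy all pages into the writer, then protect the writer with a password."""
    for index in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(index))
    # Encryption is set on the writer and applies when the writer is written out.
    pdf_writer.encrypt(password)


def encrypt_file(source_path, output_path, password):
    """Write a password-protected copy of source_path to output_path."""
    # Assumption: PyPDF2 < 3.0, where these class names are available.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_writer = PdfFileWriter()
    encrypt_pages(PdfFileReader(source_path), pdf_writer, password)
    with open(output_path, 'wb') as out:
        pdf_writer.write(out)
```

The same encrypt call can be added just before the final write in the merge, split, or watermark code, since each of them ends with a writer.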

Final words

Of course, beyond merging, splitting, encrypting, and watermarking PDFs, we can also combine Python with Excel and Word to meet more automation needs. Those are left for readers to explore.

Finally, I hope everyone understands that one core idea of Python office automation is batch operation: freeing your hands and automating tedious tasks!



Origin blog.51cto.com/15064626/2598021