Python office automation: basic use of the PyPDF2 library

Hello everyone, I'm Xiao Zhang~ Today's article is about office automation. In my opinion, three Python libraries currently handle PDFs quite well: PyPDF2, pdfplumber, and PDFMiner.


Today's tutorial focuses on PyPDF2 and uses it to perform the following basic PDF operations:

  • 1. Split a single PDF into multiple PDF files;
  • 2. Combine multiple PDFs into one PDF file;
  • 3. Rotate a page in a PDF;
  • 4. Add a watermark to a PDF;
  • 5. Encrypt a PDF;
  • 6. Decrypt a PDF;
  • 7. Get basic information about a PDF, such as author, title, and number of pages;

PyPDF2 history

Before we begin, a few words about the history of PyPDF2. Its predecessor was the pyPdf package, first released in 2005; the last version of that package came out in 2010. About a year later, a company named Phaseit sponsored a fork of pyPdf, which was later named PyPDF2. The two are functionally almost the same; the biggest difference is that PyPDF2 added support for Python 3.

At the time of writing, PyPDF2 has not been updated recently: the latest version was released in 2016. Its popularity has not faded, though. Forks such as PyPDF3 and PyPDF4 appeared later, but they are not fully backward compatible with PyPDF2 and are certainly not as popular.

PyPDF2 installation

Like other Python libraries, PyPDF2 can be installed with pip or conda:

pip install pypdf2

PDF information extraction

PyPDF2 can extract some metadata and text from a PDF, which gives a general picture of the file.

The data that can be extracted with PyPDF2 includes:

  • Author;
  • Creator;
  • Producer;
  • Subject;
  • Title;
  • Number of pages;

As test data I downloaded the sample PDF "Seige_of_Vicksburg_Sample_OCR" from the official website; it has six pages.


from PyPDF2 import PdfFileReader

# Path to the PDF document
pdf_path = "D:/Data/自动化办公/PDF/Seige_of_Vicksburg_Sample_OCR.pdf"

with open(pdf_path, 'rb') as f:
    pdf = PdfFileReader(f)
    # Metadata is returned as a DocumentInformation instance
    information = pdf.getDocumentInfo()
    number_of_pages = pdf.getNumPages()

    txt = f'''{pdf_path} information:
    Author : {information.author},
    Creator : {information.creator},
    Producer : {information.producer},
    Subject : {information.subject},
    Title : {information.title},
    Number of pages : {number_of_pages}
    '''
    print(txt)

The printed result is as follows:

D:/Data/自动化办公/PDF/Seige_of_Vicksburg_Sample_OCR.pdf information:
    Author : DSI,
    Creator : LuraDocument PDF Compressor Server 5.5.46.38,
    Producer : LuraDocument PDF v2.38,
    Subject : None,
    Title : Binder1.pdf,
    Number of pages : 6

In the example above, the PdfFileReader class is used to interact with the PDF file. Calling its getDocumentInfo() method returns a DocumentInformation instance, which stores the metadata we need; calling getNumPages() on the reader object returns the number of pages in the document.

In my view, the page count is the only one of these fields with real value; the method is handy when gathering statistics over a large batch of files.
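For bulk statistics, the idea can be sketched roughly as follows. This is only an illustration: collect_page_counts and total_pages are hypothetical helper names, and the page counter is passed in as a function so that the sketch runs without any PDF files. With PyPDF2, the counter would presumably wrap PdfFileReader(path).getNumPages().

```python
from typing import Callable, Dict, Iterable


def collect_page_counts(paths: Iterable[str],
                        count_pages: Callable[[str], int]) -> Dict[str, int]:
    """Map each path to its page count, using the supplied counter function."""
    return {path: count_pages(path) for path in paths}


def total_pages(page_counts: Dict[str, int]) -> int:
    """Sum the page counts of all files."""
    return sum(page_counts.values())


# Hypothetical data standing in for real files. With PyPDF2 the counter
# could be (assumption, not executed here):
#   count_pages = lambda p: PdfFileReader(p).getNumPages()
fake_counts = {"a.pdf": 6, "b.pdf": 10}
counts = collect_page_counts(fake_counts, lambda p: fake_counts[p])
print(total_pages(counts))  # prints 16
```

Passing the counter in keeps the statistics logic separate from the PDF library, so it can be checked with made-up numbers first.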

PDF page rotation

In PyPDF2, each page of a PDF exists as a page object. The getPage(page_index) method of the reader object returns the page instance at the given index, where page_index starts at 0.

There are two methods for rotating a page:

  • rotateClockwise(90), rotate 90 degrees clockwise;
  • rotateCounterClockwise(90), rotate 90 degrees counterclockwise;
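As far as I know, both methods only accept angles that are multiples of 90, and a counterclockwise rotation is equivalent to some clockwise one. A minimal sketch of that relationship, using a hypothetical helper named equivalent_clockwise that is not part of PyPDF2:

```python
def equivalent_clockwise(angle: int, counterclockwise: bool = False) -> int:
    """Return the clockwise angle (0, 90, 180 or 270) equivalent to a rotation."""
    if angle % 90 != 0:
        raise ValueError("rotation angle must be a multiple of 90 degrees")
    if counterclockwise:
        angle = -angle
    # Python's % always returns a non-negative result for a positive modulus
    return angle % 360


print(equivalent_clockwise(90))                         # prints 90
print(equivalent_clockwise(90, counterclockwise=True))  # prints 270
```

So rotateCounterClockwise(90) should give the same orientation as rotateClockwise(270).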

The following code means that the first page in the target PDF is rotated 90 degrees clockwise, the second page is rotated 90 degrees counterclockwise, and the position and angle of other pages remain unchanged;

from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_writer = PdfFileWriter()
pdf_reader = PdfFileReader(pdf_path)
# Rotate the first page 90 degrees clockwise
page_1 = pdf_reader.getPage(0).rotateClockwise(90)
pdf_writer.addPage(page_1)
# Rotate the second page 90 degrees counterclockwise
page_2 = pdf_reader.getPage(1).rotateCounterClockwise(90)
pdf_writer.addPage(page_2)
# Add the remaining pages unchanged
for i in range(2, pdf_reader.getNumPages()):
    pdf_writer.addPage(pdf_reader.getPage(i))

# Note: this overwrites the original file; use a different path to keep it
with open(pdf_path, 'wb') as fh:
    pdf_writer.write(fh)


Both the PdfFileReader and PdfFileWriter classes are used in the code. Page rotation does not modify the original PDF in place. Instead, a new PDF stream object is created in memory, each processed page is added to it with the addPage() method, and the in-memory object is then written to a file.

To be honest, page rotation is of little practical use; I included it mostly to fill out the article, ha ha.

Split a single PDF into multiple PDFs

from PyPDF2 import PdfFileReader, PdfFileWriter

# Path to the PDF document and the output folder
pdf_path = "D:/Data/自动化办公/PDF/Seige_of_Vicksburg_Sample_OCR.pdf"
save_path = 'D:/Data/自动化办公/PDF/'

# Split the PDF into one file per page
pdf_reader = PdfFileReader(pdf_path)
for i in range(pdf_reader.getNumPages()):
    pdf_writer = PdfFileWriter()  # a fresh writer for every page
    pdf_writer.addPage(pdf_reader.getPage(i))
    # Write each page to its own file, named after the page index
    with open(save_path + '{}.pdf'.format(i), 'wb') as fh:
        pdf_writer.write(fh)
    print('{} Save Successfully !\n'.format(i))

The code writes each page of the original PDF to its own PDF file, named after the page index.
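One possible refinement, offered only as a sketch: with many pages, names such as 10.pdf sort before 2.pdf in a file browser. A hypothetical helper, page_file_name, could build zero-padded, 1-based names instead. The naming scheme is my assumption, not something PyPDF2 requires.

```python
def page_file_name(index: int, total: int) -> str:
    """Build a zero-padded, 1-based file name for the page at a 0-based index."""
    width = len(str(total))  # number of digits needed for the last page
    return f"{index + 1:0{width}d}.pdf"


print(page_file_name(0, 6))    # prints 1.pdf
print(page_file_name(9, 120))  # prints 010.pdf
```

In the split loop, the name could then be appended to save_path in place of '{}.pdf'.format(i).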


Splitting can also be used to extract a fixed page range from a PDF. For example, to extract only pages 2-5 and nothing else, the code can be written like this:

pdf_writer = PdfFileWriter()
pdf_reader = PdfFileReader(pdf_path)
# Indices 1-4 correspond to pages 2-5, because page indices start at 0
for i in range(1, 5):
    pdf_writer.addPage(pdf_reader.getPage(i))
# Write the selected pages to a single file
with open(save_path + '2_5.pdf', 'wb') as fh:
    pdf_writer.write(fh)
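The mapping from human page numbers to zero-based indices is easy to get wrong by one, so here is a small sketch of it. page_range_to_indices is a hypothetical helper, not a PyPDF2 function.

```python
def page_range_to_indices(first_page: int, last_page: int, num_pages: int) -> list:
    """Convert an inclusive 1-based page range to a list of 0-based indices."""
    if not 1 <= first_page <= last_page <= num_pages:
        raise ValueError("page range must lie within 1..num_pages")
    return list(range(first_page - 1, last_page))


# Pages 2-5 of a six-page document
print(page_range_to_indices(2, 5, 6))  # prints [1, 2, 3, 4]
```

The returned indices are the same ones range(1, 5) produces in the code above, and each can be passed to getPage().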

Combine multiple PDF files into a single PDF

Although splitting and merging work in opposite directions, the classes and principles involved are the same.

PdfFileReader reads each PDF, and the page object of every page is retrieved in a loop. PdfFileWriter creates a new stream object, the page objects read earlier are added to it in order, and the result is finally written to a file on disk.
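The order in which pages end up in the merged file can be pictured with a short sketch. merge_order is a hypothetical helper that only computes the sequence of (document, page) pairs from the page counts; it does not touch any PDF.

```python
from typing import List, Tuple


def merge_order(page_counts: List[int]) -> List[Tuple[int, int]]:
    """List (document index, page index) pairs in the order they are merged."""
    return [(doc, page)
            for doc, count in enumerate(page_counts)
            for page in range(count)]


# Two hypothetical documents with 2 and 3 pages
print(merge_order([2, 3]))
# prints [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2)]
```

Each pair corresponds to one addPage(reader.getPage(page)) call on the shared writer, which is why all pages of the first file come before those of the second.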

from PyPDF2 import PdfFileReader, PdfFileWriter

p1_pdf = "D:/Data/自动化办公/PDF/Seige_of_Vicksburg_Sample_OCR.pdf"
p2_pdf = "D:/Data/自动化办公/PDF/Seige_of_Vicksburg_Sample_OCR.pdf"

merge_pdf = 'D:/Data/自动化办公/PDF/merge.pdf'

p1_reader = PdfFileReader(p1_pdf)
p2_reader = PdfFileReader(p2_pdf)

merge = PdfFileWriter()
# Add every page of the first PDF
for i in range(p1_reader.getNumPages()):
    merge.addPage(p1_reader.getPage(i))
# Add every page of the second PDF
for j in range(p2_reader.getNumPages()):
    merge.addPage(p2_reader.getPage(j))

# Write the merged document to disk
with open(merge_pdf, 'wb') as f:
    merge.write(f)


Add a watermark to a PDF

Of the functions covered today, I think this one is the most useful. Adding watermarks in batches relies mainly on the mergePage() method of the page object: the watermark effect is achieved by merging two pages.

Because PyPDF2 can only manipulate PDF objects, the watermark must first be saved in a PDF file of its own.

from PyPDF2 import PdfFileReader, PdfFileWriter

watermark = 'D:/Data/自动化办公/PDF/watermark.pdf'
input_pdf = 'D:/Data/自动化办公/PDF/merge.pdf'
output = 'D:/Data/自动化办公/PDF/merge_watermark.pdf'

# The watermark is the first page of the watermark PDF
watermark_obj = PdfFileReader(watermark)
watermark_page = watermark_obj.getPage(0)

pdf_reader = PdfFileReader(input_pdf)
pdf_writer = PdfFileWriter()

# Watermark all the pages
for page_index in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(page_index)
    page.mergePage(watermark_page)  # overlay the watermark on this page
    pdf_writer.addPage(page)

with open(output, 'wb') as out:
    pdf_writer.write(out)

The effect is as follows; from left to right: the original page, the watermark, and the original page after the watermark was added.

The effect above is not ideal: the page layout was not taken into account when the watermark was made, so part of it is cut off after merging.

The advantage of adding a watermark with the code above is that you can watermark only specific pages of the PDF, for example only the odd-numbered or only the even-numbered pages. This is flexible and efficient, and of course you can also process multiple files in a batch.
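Choosing which pages receive the watermark might look like the sketch below. pages_to_watermark is a hypothetical helper, and "odd" and "even" refer to 1-based page numbers as a reader would count them, which is my assumption about the intended meaning.

```python
def pages_to_watermark(num_pages: int, which: str = "all") -> list:
    """Return the 0-based indices of the pages that should get the watermark."""
    if which == "all":
        return list(range(num_pages))
    if which == "odd":   # pages 1, 3, 5, ... are indices 0, 2, 4, ...
        return [i for i in range(num_pages) if (i + 1) % 2 == 1]
    if which == "even":  # pages 2, 4, 6, ... are indices 1, 3, 5, ...
        return [i for i in range(num_pages) if (i + 1) % 2 == 0]
    raise ValueError("which must be 'all', 'odd' or 'even'")


print(pages_to_watermark(6, "odd"))   # prints [0, 2, 4]
print(pages_to_watermark(6, "even"))  # prints [1, 3, 5]
```

In the watermark loop, mergePage() would then be called only for indices in the returned list, while every page is still passed to addPage() so that no page is dropped from the output.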

PDF encryption and decryption

PDF encryption

If we do not want others to read the content of a PDF file, we can set a password on it with PyPDF2. For a single file, doing this manually with an existing tool is more efficient, but for multiple files the following method is highly recommended.

from PyPDF2 import PdfFileReader, PdfFileWriter

watermark = 'D:/Data/自动化办公/PDF/Seige_of_Vicksburg_Sample_OCR.pdf'
input_pdf = 'D:/Data/自动化办公/PDF/merge.pdf'
output = 'D:/Data/自动化办公/PDF/merge_watermark1.pdf'

watermark_obj = PdfFileReader(watermark)
watermark_page = watermark_obj.getPage(0)

pdf_reader = PdfFileReader(input_pdf)
pdf_writer = PdfFileWriter()

# Watermark all the pages (optional; encryption does not depend on it)
for page_index in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(page_index)
    page.mergePage(watermark_page)
    pdf_writer.addPage(page)

# Set the user password; encryption is applied when the file is written
pdf_writer.encrypt(user_pwd='123456', use_128bit=True)
with open(output, 'wb') as out:
    pdf_writer.write(out)


The key call is the encrypt() method, which has three parameters worth noting:

  • user_pwd, str, the user password, required to open and read the file;

  • owner_pwd, str, one level above the user password; a file opened with it has no restrictions. If not specified, owner_pwd defaults to the same value as user_pwd;

  • use_128bit, bool, whether to use 128-bit encryption; False means 40-bit encryption is used. The default is True;
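For the multi-file case, the first step is deciding which files to process and where the results go. A minimal sketch using only the standard library follows; plan_batch_outputs and the "_encrypted" suffix are hypothetical choices of mine, and the actual encryption of each pair would reuse the writer code above.

```python
from pathlib import Path


def plan_batch_outputs(folder, suffix="_encrypted"):
    """Pair every PDF in a folder with the path of its processed copy."""
    plan = []
    for src in sorted(Path(folder).glob("*.pdf")):
        if src.stem.endswith(suffix):
            continue  # skip files produced by an earlier run
        plan.append((src, src.with_name(src.stem + suffix + ".pdf")))
    return plan


# Example use (not executed here):
#   for src, dst in plan_batch_outputs("some/folder"):
#       ...read src, add its pages to a writer, encrypt, write to dst...
```

Writing to a separate output name keeps the unencrypted originals intact, which seems safer when a batch job is still being tested.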

PDF decryption

Decryption takes place when the file is read, using the decrypt() method.

from PyPDF2 import PdfFileWriter, PdfFileReader

input_pdf = 'reportlab-encrypted.pdf'
output_pdf = 'reportlab.pdf'
password = 'twofish'

pdf_writer = PdfFileWriter()
pdf_reader = PdfFileReader(input_pdf)
# decrypt() unlocks the reader in place and returns a status code, not a
# reader, so its return value must not be assigned back to pdf_reader
pdf_reader.decrypt(password)

for page in range(pdf_reader.getNumPages()):
    pdf_writer.addPage(pdf_reader.getPage(page))

with open(output_pdf, 'wb') as fh:
    pdf_writer.write(fh)

The approach in the example above is to read the encrypted file and write its pages to a new, unencrypted PDF.
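As far as I know from the PyPDF2 documentation, decrypt() returns an integer status: 0 if the password failed, 1 if it matched the user password, and 2 if it matched the owner password. Checking that value avoids silently continuing with a wrong password. A small sketch, with decrypt_outcome as a hypothetical helper:

```python
def decrypt_outcome(result: int) -> str:
    """Interpret the status code returned by the reader's decrypt() method."""
    outcomes = {
        1: "user password accepted",
        2: "owner password accepted",
    }
    if result not in outcomes:
        # 0 (or anything unexpected) is treated as a failed password
        raise ValueError("wrong password: decrypt() reported failure")
    return outcomes[result]


print(decrypt_outcome(1))  # prints user password accepted
```

It could be used as decrypt_outcome(pdf_reader.decrypt(password)) before the pages are copied.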

Summary

This article introduced the basic usage of the PyPDF2 library and implemented some basic operations with it, with code examples. A reminder, though: all of the operations above are only worthwhile in batch scenarios. For a single file, a conventional tool is recommended; showing off will only waste time.

I have not covered extracting and writing the text and graphics inside a PDF, because PyPDF2 is not good at this; pdfplumber and PDFMiner are much better at text extraction. As the saying goes, to do a good job you must first sharpen your tools. I will cover this in later tutorials, so stay tuned!

Well, the above is all the content of this article. Finally, thank you all for reading. See you in the next issue~


Origin blog.csdn.net/weixin_42512684/article/details/114860216