Introduction:

PyPDF2 is a Python library that makes it easy to process PDF files. It provides operations such as reading, splitting, merging, and file conversion.

Installation:

pip install pypdf2
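
The class names used in this article (PdfFileReader, PdfFileWriter) belong to the older PyPDF2 API and stopped working in PyPDF2 3.0, which uses PdfReader and PdfWriter instead. The sketch below shows one way to check which naming style an installed version expects. The helper names has_legacy_names and installed_version_is_legacy are made up for illustration and are not part of PyPDF2.

```python
def has_legacy_names(version):
    """Return True if this PyPDF2 version still accepts the legacy class
    names (PdfFileReader, PdfFileWriter). They stopped working in 3.0."""
    major = int(version.split(".")[0])
    return major < 3


def installed_version_is_legacy():
    # Imported lazily so has_legacy_names() works without PyPDF2 installed.
    import PyPDF2

    return has_legacy_names(PyPDF2.__version__)
```

If the check returns False, installing an older release (for example pip install "PyPDF2<3.0") should keep the examples in this article runnable as written.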

Core classes

PdfFileReader

Constructor:

PyPDF2.PdfFileReader(stream, strict=True, warndest=None, overwriteWarnings=True)

Initialize a PdfFileReader object. This operation may take some time because the cross-reference table of the PDF stream is read into memory.

Parameters:

stream: A File object, or an object that supports the standard read and seek methods like a File object. It can also be a string giving the path to the PDF file.
strict (bool): Determines whether the user should be warned of all problems; it also makes some correctable problems fatal. Defaults to True.
warndest: Destination for logging warnings (defaults to sys.stderr).
overwriteWarnings (bool): Determines whether to override Python's warnings.py module with a custom implementation (defaults to True).
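
To make the stream parameter concrete, here is a minimal sketch that accepts either raw PDF bytes or an already opened file-like object and hands PdfFileReader something with read() and seek(). The helpers as_binary_stream and open_reader are hypothetical names introduced for this example, and passing strict=False is just one reasonable choice for tolerating slightly malformed files.

```python
import io


def as_binary_stream(source):
    """Return an object with read() and seek() for the given source."""
    if isinstance(source, (bytes, bytearray)):
        return io.BytesIO(source)  # wrap in-memory PDF data
    if hasattr(source, "read") and hasattr(source, "seek"):
        return source  # already file-like, use as is
    raise TypeError("expected bytes or a seekable binary file object")


def open_reader(source, strict=False):
    # Imported lazily so as_binary_stream() works without PyPDF2 installed.
    from PyPDF2 import PdfFileReader  # legacy name, PyPDF2 < 3.0

    # strict=False: correctable problems are not treated as fatal.
    return PdfFileReader(as_binary_stream(source), strict=strict)
```

A call such as open_reader(open('D:\\1.pdf', 'rb')) would then behave like the constructor described above, while open_reader(pdf_bytes) covers data that never touched the disk.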

Properties and methods of the PdfFileReader object

Properties and methods	Description
getDestinationPageNumber(destination)	Retrieve the page number of a given Destination object
getDocumentInfo()	Retrieve the PDF file's document information dictionary, if it exists
getFields(tree=None, retval=None, fileobj=None)	Extract field data if this PDF contains interactive form fields
getFormTextFields()	Retrieve form fields that hold text data (inputs, drop-down lists) from the document
getNamedDestinations(tree=None, retval=None)	Retrieve the named destinations present in the document
getNumPages()	Return the number of pages in this PDF file
getOutlines(node=None, outlines=None)	Retrieve the document outline (bookmarks) present in the document
getPage(pageNumber)	Retrieve a page by number (zero-based) from this PDF file
getPageLayout()	Get the page layout
getPageMode()	Get the page mode
getPageNumber(pageObject)	Retrieve the page number of the given pageObject
getXmpMetadata()	Retrieve XMP data from the PDF document root
isEncrypted	Read-only boolean property showing whether the PDF file is encrypted
namedDestinations	Read-only property that accesses the getNamedDestinations() function

Function: Use PyPDF2 to read the basic information of a PDF file

from PyPDF2 import PdfFileReader


readFile = 'D:\\1.pdf'
# Create the PdfFileReader object
pdfFileReader = PdfFileReader(readFile)  # alternatively: pdfFileReader = PdfFileReader(open(readFile, 'rb'))
# Get the document information of the PDF file
documentInfo = pdfFileReader.getDocumentInfo()
print('documentInfo = %s' % documentInfo)
# Get the page layout
pageLayout = pdfFileReader.getPageLayout()
print('pageLayout = %s ' % pageLayout)

# Get the page mode
pageMode = pdfFileReader.getPageMode()
print('pageMode = %s' % pageMode)

# Get the XMP metadata
xmpMetadata = pdfFileReader.getXmpMetadata()
print('xmpMetadata  = %s ' % xmpMetadata)

# Get the number of pages in the PDF file
pageCount = pdfFileReader.getNumPages()

print('pageCount = %s' % pageCount)
for index in range(0, pageCount):
    # Return the pageObject for the given page index
    pageObj = pdfFileReader.getPage(index)
    print('index = %d , pageObj = %s' % (index, type(pageObj)))  # <class 'PyPDF2.pdf.PageObject'>
    # Get the page number of this pageObject within the PDF document
    pageNumber = pdfFileReader.getPageNumber(pageObj)
    print('pageNumber = %s ' % pageNumber)
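
getDocumentInfo() returns None when a PDF has no information dictionary, so printing it directly can be uninformative. The following sketch shows one way to turn the result into a short summary. The helpers summarize_info and summarize_pdf are hypothetical names for this example, and the fallback texts are arbitrary choices.

```python
def summarize_info(info):
    """Return a one-line summary of a document information dictionary.

    `info` may be None (PDF without an Info dictionary) or any mapping
    with PDF-style keys such as '/Title' and '/Author'.
    """
    if info is None:
        return "no document information"
    title = info.get("/Title") or "(untitled)"
    author = info.get("/Author") or "(unknown author)"
    return "%s by %s" % (title, author)


def summarize_pdf(path):
    # Imported lazily so summarize_info() works without PyPDF2 installed.
    from PyPDF2 import PdfFileReader  # legacy name, PyPDF2 < 3.0

    with open(path, "rb") as fh:
        reader = PdfFileReader(fh, strict=False)
        return reader.getNumPages(), summarize_info(reader.getDocumentInfo())
```

Calling summarize_pdf('D:\\1.pdf') would return the page count together with the summary line.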

PdfFileWriter

This class supports writing PDF files from pages produced by another class (typically PdfFileReader).

Properties and methods	Description
addAttachment(fname, fdata)	Embed a file inside the PDF
addBlankPage(width=None, height=None)	Append a blank page to this PDF file and return it
addBookmark(title, pagenum, parent=None, color=None, bold=False, italic=False, fit='/Fit', *args)	Add a bookmark to this PDF file
addJS(javascript)	Add JavaScript that will run when this PDF is opened
addLink(pagenum, pagedest, rect, border=None, fit='/Fit', *args)	Add an internal link from a rectangular area to the specified page
addPage(page)	Add a page to this PDF file; the page is usually obtained from a PdfFileReader instance
getNumPages()	Return the number of pages
getPage(pageNumber)	Retrieve a page by number from this PDF file
insertBlankPage(width=None, height=None, index=0)	Insert a blank page into this PDF file and return it. If the page size is not specified, the size of the last page is used
insertPage(page, index=0)	Insert a page into this PDF file; the page is usually obtained from a PdfFileReader instance
removeLinks()	Remove links and annotations from this output
removeText(ignoreByteStringObject=False)	Remove text from this output
write(stream)	Write the collection of pages added to this object out as a PDF file
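
The width and height arguments of addBlankPage() are given in PDF user-space units (1/72 inch), which is easy to get wrong when page sizes are known in millimetres. Here is a minimal sketch that creates a PDF of blank pages from millimetre sizes. The helper names mm_to_points and write_blank_pdf, and the A4 defaults, are assumptions made for this example.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def mm_to_points(mm):
    """Convert millimetres to PDF user-space units (1/72 inch)."""
    return mm * POINTS_PER_INCH / MM_PER_INCH


def write_blank_pdf(out_path, page_count=1, width_mm=210, height_mm=297):
    # Imported lazily so mm_to_points() works without PyPDF2 installed.
    from PyPDF2 import PdfFileWriter  # legacy name, PyPDF2 < 3.0

    writer = PdfFileWriter()
    for _ in range(page_count):
        # The first page needs an explicit size: an empty writer has no
        # earlier page whose size it could reuse.
        writer.addBlankPage(width=mm_to_points(width_mm),
                            height=mm_to_points(height_mm))
    with open(out_path, "wb") as out:
        writer.write(out)
```

With the defaults, write_blank_pdf('D:\\blank.pdf', page_count=2) would produce two A4-sized empty pages.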

Function: Use PyPDF2 to copy the pages of one PDF file into another PDF file

# encoding:utf-8
from PyPDF2 import PdfFileReader, PdfFileWriter

readFile = 'D:\\1.pdf'
outFile = 'D:\\2.pdf'
pdfFileWriter = PdfFileWriter()

# Create the PdfFileReader object
pdfFileReader = PdfFileReader(readFile)  # alternatively: pdfFileReader = PdfFileReader(open(readFile, 'rb'))
# Total number of pages in the document
numPages = pdfFileReader.getNumPages()

for index in range(0, numPages):
    pageObj = pdfFileReader.getPage(index)
    pdfFileWriter.addPage(pageObj)

# Append two blank pages (they take the size of the last page added)
pdfFileWriter.addBlankPage()
pdfFileWriter.addBlankPage()

# After all pages have been added, save them to the output file in one go
with open(outFile, 'wb') as outStream:
    pdfFileWriter.write(outStream)
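
The same reader and writer calls can also cut a page range out of a file instead of copying all of it. The sketch below assumes the range is given as 1-based, inclusive page numbers and clamps it to the document length. page_indices and extract_pages are hypothetical helper names, not PyPDF2 functions.

```python
def page_indices(num_pages, first, last):
    """Convert a 1-based inclusive page range into 0-based indices.

    The range is clamped to the document, so asking for pages 3-99 of a
    5-page file yields indices 2, 3 and 4.
    """
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return list(range(first - 1, min(last, num_pages)))


def extract_pages(in_path, out_path, first, last):
    # Imported lazily so page_indices() works without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter  # legacy names, PyPDF2 < 3.0

    with open(in_path, "rb") as src:
        reader = PdfFileReader(src)
        writer = PdfFileWriter()
        for index in page_indices(reader.getNumPages(), first, last):
            writer.addPage(reader.getPage(index))
        # Write while the source file is still open.
        with open(out_path, "wb") as dst:
            writer.write(dst)
```

For example, extract_pages('D:\\1.pdf', 'D:\\part.pdf', 2, 4) would save pages 2 to 4 of the input as a new file.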

PageObject

PageObject(pdf=None, indirectRef=None)

This class represents a single page in a PDF file. A PageObject is usually obtained by calling the getPage() method of a PdfFileReader object; an empty page can also be created with the createBlankPage() static method.

Parameters:

pdf: The PDF file to which the page belongs.
indirectRef: Stores the original indirect reference to this object in its source PDF.

Properties and methods of the PageObject object

Attribute or method	Description
static createBlankPage(pdf=None, width=None, height=None)	Return a new blank page
extractText()	Locate all text drawing commands, in the order they appear in the content stream, and extract the text
getContents()	Access the page contents; returns a Contents object or None
rotateClockwise(angle)	Rotate the page clockwise; the angle must be a multiple of 90 degrees
scale(sx, sy)	Scale the page by applying a transformation matrix to its contents and updating the page size
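
As an illustration of the PageObject methods, here is a minimal sketch that rotates every page of a file. rotateClockwise() only accepts multiples of 90 degrees, so the angle is validated first. normalize_rotation and rotate_pdf are hypothetical helper names for this example.

```python
def normalize_rotation(angle):
    """Return `angle` reduced to 0, 90, 180 or 270; reject other values."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")
    return angle % 360


def rotate_pdf(in_path, out_path, angle):
    # Imported lazily so normalize_rotation() works without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter  # legacy names, PyPDF2 < 3.0

    angle = normalize_rotation(angle)
    with open(in_path, "rb") as src:
        reader = PdfFileReader(src)
        writer = PdfFileWriter()
        for index in range(reader.getNumPages()):
            page = reader.getPage(index)
            page.rotateClockwise(angle)  # modifies the page in place
            writer.addPage(page)
        # Write while the source file is still open.
        with open(out_path, "wb") as dst:
            writer.write(dst)
```

A call such as rotate_pdf('D:\\1.pdf', 'D:\\rotated.pdf', 90) would turn each page a quarter turn clockwise.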

Function: Merge two specified PDF files into one PDF file

# encoding:utf-8
from PyPDF2 import PdfFileReader, PdfFileWriter

pdfFileWriter = PdfFileWriter()
inFileList = ['D:\\1.pdf',
              'D:\\2.pdf']
outFile = "D:\\3.pdf"
for inFile in inFileList:
    # Open each file to be merged in turn
    pdfReader = PdfFileReader(open(inFile, 'rb'))
    numPages = pdfReader.getNumPages()
    for index in range(0, numPages):
        pageObj = pdfReader.getPage(index)
        pdfFileWriter.addPage(pageObj)

# Finally, write all collected pages to the output file in one go
with open(outFile, 'wb') as outStream:
    pdfFileWriter.write(outStream)
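
The merge can also be wrapped in a reusable function that closes its input files afterwards. The sketch below keeps every input open until write() has finished, because page data is read lazily from the source streams. append_all_pages and merge_pdfs are hypothetical helper names for this example.

```python
def append_all_pages(reader, writer):
    """Copy every page of `reader` into `writer`; return the number copied."""
    count = reader.getNumPages()
    for index in range(count):
        writer.addPage(reader.getPage(index))
    return count


def merge_pdfs(in_paths, out_path):
    # Imported lazily so append_all_pages() works without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter  # legacy names, PyPDF2 < 3.0

    writer = PdfFileWriter()
    handles = []
    try:
        for path in in_paths:
            fh = open(path, "rb")
            handles.append(fh)
            append_all_pages(PdfFileReader(fh), writer)
        # The inputs must still be open here, so write before closing them.
        with open(out_path, "wb") as out:
            writer.write(out)
    finally:
        for fh in handles:
            fh.close()
```

With this helper, merge_pdfs(['D:\\1.pdf', 'D:\\2.pdf'], 'D:\\3.pdf') would perform the same merge as the script above.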
