Introduction:
PyPDF2 makes it easy to process PDF files. It supports operations such as reading, splitting, merging, and file conversion.
Installation:
pip install pypdf2
Core classes
PdfFileReader
Constructor:
PyPDF2.PdfFileReader(stream, strict=True, warndest=None, overwriteWarnings=True)
Initializes a PdfFileReader object. This operation can take some time, because the cross-reference tables of the PDF stream are read into memory.
Parameters:
- stream: A file object, or an object that supports the standard read and seek methods of a file object. It can also be a string giving the path to the PDF file.
- strict (bool): Determines whether the user is warned of all problems, and also makes some correctable problems fatal. Defaults to True.
- warndest: The destination for logging warnings (defaults to sys.stderr).
- overwriteWarnings (bool): Determines whether to override Python's warnings.py module with a custom implementation (defaults to True).
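As a rough illustration of what the stream parameter accepts, the sketch below normalizes a few kinds of input into a binary stream that offers read and seek. The helper name to_stream is hypothetical and not part of PyPDF2, and the commented usage line assumes a PyPDF2 release that still provides PdfFileReader.

```python
import io


def to_stream(source):
    """Return a binary stream with read() and seek() for the given source.

    Hypothetical helper for illustration; it is not part of PyPDF2.
    """
    if isinstance(source, (bytes, bytearray)):
        # Raw PDF bytes already in memory: wrap them in a seekable buffer.
        return io.BytesIO(bytes(source))
    if isinstance(source, str):
        # A path string: open the file in binary mode.
        return open(source, "rb")
    if hasattr(source, "read") and hasattr(source, "seek"):
        # Already a file-like object: use it unchanged.
        return source
    raise TypeError("expected bytes, a path string, or a file-like object")


# Usage (assumes PyPDF2 is installed and provides PdfFileReader):
#   from PyPDF2 import PdfFileReader
#   reader = PdfFileReader(to_stream(pdf_bytes), strict=False)
```

Passing strict=False in the usage line is only an example of relaxing the default; whether that is appropriate depends on how well-formed the input files are.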
Properties and methods of the PdfFileReader object
Properties and methods | description |
---|---|
getDestinationPageNumber(destination) | Retrieve the page number of a given destination object |
getDocumentInfo() | Retrieve the PDF file's document information dictionary |
getFields(tree=None, retval=None, fileobj=None) | Extract field data if this PDF contains interactive form fields |
getFormTextFields() | Retrieve form fields that hold text data (text inputs, drop-down lists) from the document |
getNamedDestinations(tree=None, retval=None) | Retrieve the named destinations present in the document |
getNumPages() | Return the number of pages in this PDF file |
getOutlines(node=None, outlines=None) | Retrieve the document outline present in the document |
getPage(pageNumber) | Retrieve a page by number (zero-based) from this PDF file |
getPageLayout() | Get the page layout |
getPageMode() | Get the page mode |
getPageNumber(pageObject) | Retrieve the page number of the given pageObject |
getXmpMetadata() | Retrieve XMP (Extensible Metadata Platform) data from the PDF document root |
isEncrypted | Read-only boolean property showing whether the PDF file is encrypted |
namedDestinations | Read-only property that accesses the getNamedDestinations() function |
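The outline returned by getOutlines() is nested, so a small recursive walk can help when listing bookmarks. The sketch below assumes that each outline entry is dict-like with a "/Title" key and that the children of an entry appear as a nested list right after it, which is how older PyPDF2 releases are understood to return outlines. The function name flatten_outline is hypothetical.

```python
def flatten_outline(outlines, depth=0):
    """Yield (depth, title) pairs from a nested outline list.

    Assumes entries are dict-like with a "/Title" key and that nested
    lists hold the children of the preceding entry.
    """
    for item in outlines:
        if isinstance(item, list):
            # A nested list holds child entries one level deeper.
            yield from flatten_outline(item, depth + 1)
        else:
            yield depth, item["/Title"]


# Usage (assumes a PdfFileReader instance named pdfFileReader):
#   for depth, title in flatten_outline(pdfFileReader.getOutlines()):
#       print("  " * depth + title)
```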
Function: Use PyPDF2 to read the basic information of pdf files
from PyPDF2 import PdfFileReader
readFile = 'D:\\1.pdf'
# Create the PdfFileReader object
# (alternatively: pdfFileReader = PdfFileReader(open(readFile, 'rb')))
pdfFileReader = PdfFileReader(readFile)
# Get the PDF file's document information
documentInfo = pdfFileReader.getDocumentInfo()
print('documentInfo = %s' % documentInfo)
# Get the page layout
pageLayout = pdfFileReader.getPageLayout()
print('pageLayout = %s ' % pageLayout)
# Get the page mode
pageMode = pdfFileReader.getPageMode()
print('pageMode = %s' % pageMode)
# Get the XMP metadata
xmpMetadata = pdfFileReader.getXmpMetadata()
print('xmpMetadata = %s ' % xmpMetadata)
# Get the number of pages in the PDF file
pageCount = pdfFileReader.getNumPages()
print('pageCount = %s' % pageCount)
for index in range(0, pageCount):
    # Get the pageObject for the given page index
    pageObj = pdfFileReader.getPage(index)
    print('index = %d , pageObj = %s' % (index, type(pageObj)))  # <class 'PyPDF2.pdf.PageObject'>
    # Get the page number of the pageObject within the PDF document
    pageNumber = pdfFileReader.getPageNumber(pageObj)
    print('pageNumber = %s ' % pageNumber)
PdfFileWriter
This class supports writing PDF files out, given pages produced by another class (typically PdfFileReader).
Properties and methods | description |
---|---|
addAttachment(fname, fdata) | Embed a file inside the PDF |
addBlankPage(width=None, height=None) | Append a blank page to this PDF file and return it |
addBookmark(title, pagenum, parent=None, color=None, bold=False, italic=False, fit='/Fit', *args) | Add a bookmark to this PDF file |
addJS(javascript) | Add JavaScript that will be launched when this PDF is opened |
addLink(pagenum, pagedest, rect, border=None, fit='/Fit', *args) | Add an internal link from a rectangular area to the specified page |
addPage(page) | Add a page to this PDF file. The page is usually obtained from a PdfFileReader instance |
getNumPages() | Return the number of pages |
getPage(pageNumber) | Retrieve a page by number from this PDF file |
insertBlankPage(width=None, height=None, index=0) | Insert a blank page into this PDF file and return it. If no page size is specified, the size of the last page is used |
insertPage(page, index=0) | Insert a page into this PDF file. The page is usually obtained from a PdfFileReader instance |
removeLinks() | Remove links and annotations from this output |
removeText(ignoreByteStringObject=False) | Remove text from this output |
write(stream) | Write the collection of pages added to this object out as a PDF file |
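The getNumPages(), getPage(), and addPage() methods listed above are enough to cut a page range out of a document. The following is a minimal sketch under that assumption; the function name copy_pages and its start/stop parameters are hypothetical, and it works with any reader and writer objects that expose those three methods.

```python
def copy_pages(reader, writer, start, stop):
    """Add pages start..stop-1 (zero-based) from reader to writer.

    Returns the number of pages copied. Hypothetical helper that relies
    only on getNumPages(), getPage(), and addPage().
    """
    total = reader.getNumPages()
    if not 0 <= start <= stop <= total:
        raise ValueError("page range %d..%d is outside 0..%d" % (start, stop, total))
    for index in range(start, stop):
        writer.addPage(reader.getPage(index))
    return stop - start


# Usage (assumes PyPDF2 is installed and provides these classes):
#   from PyPDF2 import PdfFileReader, PdfFileWriter
#   writer = PdfFileWriter()
#   copy_pages(PdfFileReader('D:\\1.pdf'), writer, 0, 2)  # first two pages
#   with open('D:\\2.pdf', 'wb') as f:
#       writer.write(f)
```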
Function: Use PyPDF2 to copy the pages of one PDF file into another PDF file
# encoding:utf-8
from PyPDF2 import PdfFileReader, PdfFileWriter
readFile = 'D:\\1.pdf'
outFile = 'D:\\2.pdf'
pdfFileWriter = PdfFileWriter()
# Create the PdfFileReader object
# (alternatively: pdfFileReader = PdfFileReader(open(readFile, 'rb')))
pdfFileReader = PdfFileReader(readFile)
# Total number of pages in the document
numPages = pdfFileReader.getNumPages()
for index in range(0, numPages):
    pageObj = pdfFileReader.getPage(index)
    pdfFileWriter.addPage(pageObj)
# Once every page has been added, save them all to the output file
with open(outFile, 'wb') as f:
    pdfFileWriter.write(f)
# Append two blank pages and write again, overwriting the previous output
pdfFileWriter.addBlankPage()
pdfFileWriter.addBlankPage()
with open(outFile, 'wb') as f:
    pdfFileWriter.write(f)
PageObject
PageObject(pdf=None,indirectRef=None)
This class represents a single page within a PDF file. A PageObject is usually obtained through the getPage() method of a PdfFileReader object, but an empty page can also be created with the createBlankPage() static method.
Parameters:
- pdf: The PDF file to which the page belongs.
- indirectRef: Stores the original indirect reference to this object in its source PDF.
Properties and methods of the PageObject object
Attribute or method | description |
---|---|
static createBlankPage(pdf=None, width=None, height=None) | Return a new blank page |
extractText() | Locate all text drawing commands, in the order they appear in the content stream, and extract the text |
getContents() | Access the page contents; returns the Contents object or None |
rotateClockwise(angle) | Rotate the page clockwise by angle degrees, which must be a multiple of 90 |
scale(sx, sy) | Scale the page by the given factors, applying a transformation matrix to its content and updating the page size |
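To show how a PageObject method combines with the reader and writer, the sketch below rotates every page before adding it to a writer. It assumes rotateClockwise() modifies the page in place and only accepts multiples of 90 degrees; the function name rotate_pages is hypothetical.

```python
def rotate_pages(reader, writer, angle=90):
    """Rotate every page of reader clockwise by angle and add it to writer.

    Returns the number of pages processed. Hypothetical helper; it assumes
    rotateClockwise() rotates the page object in place.
    """
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")
    count = reader.getNumPages()
    for index in range(count):
        page = reader.getPage(index)
        page.rotateClockwise(angle)
        writer.addPage(page)
    return count


# Usage (assumes PyPDF2 is installed and provides these classes):
#   from PyPDF2 import PdfFileReader, PdfFileWriter
#   writer = PdfFileWriter()
#   rotate_pages(PdfFileReader('D:\\1.pdf'), writer, 180)
#   with open('D:\\2.pdf', 'wb') as f:
#       writer.write(f)
```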
Function: Merge two specified PDF files into one PDF file
# encoding:utf-8
from PyPDF2 import PdfFileReader, PdfFileWriter
pdfFileWriter = PdfFileWriter()
inFileList = ['D:\\1.pdf',
              'D:\\2.pdf']
outFile = 'D:\\3.pdf'
for inFile in inFileList:
    # Open each file to be merged in turn; the input streams stay open until write() is called
    pdfReader = PdfFileReader(open(inFile, 'rb'))
    numPages = pdfReader.getNumPages()
    for index in range(0, numPages):
        pageObj = pdfReader.getPage(index)
        pdfFileWriter.addPage(pageObj)
# Finally, write all pages to the output file in one go
with open(outFile, 'wb') as f:
    pdfFileWriter.write(f)