Real-world applications often need to process PDF files. PyPDF2 is a Python library built for this: it supports operations such as reading, splitting, merging, and file conversion.
Documentation: http://pythonhosted.org/PyPDF2/
PyPDF2 installation
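Install from PyCharm: File -> Default Settings -> Project Interpreter, then add the PyPDF2 package. Alternatively, run pip install PyPDF2 from the command line.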
PdfFileReader
Constructor:
PyPDF2.PdfFileReader(stream, strict=True, warndest=None, overwriteWarnings=True)
Initializes a PdfFileReader object, which may take some time as the PDF stream's cross-reference table is read into memory.
Parameters:
- stream: A file object, or an object that supports the standard read and seek methods like a file object. It can also be a string representing the path to the PDF file.
- strict (bool): Determines whether the user should be warned of all problems; it also causes some correctable problems to be fatal. Defaults to True.
- warndest: Destination for logging warnings (defaults to sys.stderr).
- overwriteWarnings (bool): Determines whether to override Python's warnings.py module with a custom implementation (defaults to True).
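The stream and strict parameters can be sketched as follows. This is only a minimal sketch, assuming a PyPDF2 release that still provides PdfFileReader; the helper name count_pages and its reader_cls parameter are hypothetical and exist so the function can be tried with a stand-in class.

```python
def count_pages(path, strict=False, reader_cls=None):
    """Return the number of pages in the PDF at `path`.

    `count_pages` and `reader_cls` are hypothetical names used for this sketch.
    """
    if reader_cls is None:
        # Assumption: the installed PyPDF2 still ships PdfFileReader.
        from PyPDF2 import PdfFileReader as reader_cls

    # Pass an open binary file object as `stream`. The file has to stay open
    # while the reader is in use, so all the work happens inside the `with`.
    with open(path, "rb") as stream:
        # strict=False keeps correctable problems from being fatal.
        reader = reader_cls(stream, strict=strict)
        return reader.getNumPages()
```

Calling count_pages('some.pdf') would use the real PdfFileReader; passing the path string directly to PdfFileReader, as the rest of this article does, is the other supported form.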
Properties and methods of the PdfFileReader object
Property or method | Description |
---|---|
getDestinationPageNumber(destination) | Retrieve the page number of the given destination object |
getDocumentInfo() | Retrieve the document information dictionary of the PDF file |
getFields(tree=None, retval=None, fileobj=None) | Extract field data if this PDF contains interactive form fields |
getFormTextFields() | Retrieve form fields with text data (inputs, dropdowns) from the document |
getNamedDestinations(tree=None, retval=None) | Retrieve the named destinations present in the document |
getNumPages() | Return the number of pages in this PDF file |
getOutlines(node=None, outlines=None) | Retrieve the document outline present in the document |
getPage(pageNumber) | Retrieve the page with the given number (zero-based) from this PDF file |
getPageLayout() | Get the page layout |
getPageMode() | Get the page mode |
getPageNumber(pageObject) | Retrieve the page number of the given pageObject |
getXmpMetadata() | Retrieve XMP data from the PDF document root |
isEncrypted | Read-only boolean property showing whether the PDF file is encrypted |
namedDestinations | Read-only property that accesses the getNamedDestinations() function |
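A few of these members combine naturally into a quick summary of a document. The sketch below is illustrative only: summarize_pdf is a hypothetical helper, and it only assumes that the object passed in exposes isEncrypted, getNumPages() and getDocumentInfo() as listed in the table.

```python
def summarize_pdf(reader):
    """Collect a few basic facts from an already constructed reader.

    `summarize_pdf` is a hypothetical helper; `reader` is expected to be a
    PdfFileReader or any object exposing the same members.
    """
    summary = {"encrypted": bool(reader.isEncrypted)}
    if summary["encrypted"]:
        # Pages and metadata of an encrypted file normally need decrypting
        # first, so stop here instead of risking an error.
        return summary

    summary["pages"] = reader.getNumPages()

    # getDocumentInfo() may return None when the PDF has no info dictionary.
    info = reader.getDocumentInfo()
    # Index with [] instead of printing the whole dictionary, so that a
    # single entry is looked up by its key.
    summary["title"] = info["/Title"] if info and "/Title" in info else None
    return summary
```

With a real file this would be called as summarize_pdf(PdfFileReader(path)).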
PDF read operation:
# encoding:utf-8
from PyPDF2 import PdfFileReader, PdfFileWriter

readFile = 'C:/Users/Administrator/Desktop/RxJava 完全解析.pdf'
# Get the PdfFileReader object
pdfFileReader = PdfFileReader(readFile)  # Alternatively: pdfFileReader = PdfFileReader(open(readFile, 'rb'))

# Get the document information of the PDF file
documentInfo = pdfFileReader.getDocumentInfo()
print('documentInfo = %s' % documentInfo)

# Get the page layout
pageLayout = pdfFileReader.getPageLayout()
print('pageLayout = %s ' % pageLayout)

# Get the page mode
pageMode = pdfFileReader.getPageMode()
print('pageMode = %s' % pageMode)

# Get the XMP metadata
xmpMetadata = pdfFileReader.getXmpMetadata()
print('xmpMetadata = %s ' % xmpMetadata)

# Get the number of pages in the PDF file
pageCount = pdfFileReader.getNumPages()
print('pageCount = %s' % pageCount)

for index in range(0, pageCount):
    # Return the pageObject for the given page number
    pageObj = pdfFileReader.getPage(index)
    print('index = %d , pageObj = %s' % (index, type(pageObj)))  # <class 'PyPDF2.pdf.PageObject'>
    # Get the page number of the pageObject within the PDF document
    pageNumber = pdfFileReader.getPageNumber(pageObj)
    print('pageNumber = %s ' % pageNumber)
Output result:
documentInfo = {'/Title': IndirectObject(157, 0), '/Producer': IndirectObject(158, 0), '/Creator': IndirectObject(159, 0), '/CreationDate': IndirectObject(160, 0), '/ModDate': IndirectObject(160, 0), '/Keywords': IndirectObject(161, 0), '/AAPL:Keywords': IndirectObject(162, 0)}
pageLayout = None
pageMode = None
xmpMetadata = None
pageCount = 3
index = 0 , pageObj = <class 'PyPDF2.pdf.PageObject'>
pageNumber = 0
index = 1 , pageObj = <class 'PyPDF2.pdf.PageObject'>
pageNumber = 1
index = 2 , pageObj = <class 'PyPDF2.pdf.PageObject'>
pageNumber = 2
PdfFileWriter
This class supports writing PDF files out, given pages produced by another class (typically PdfFileReader).
Property or method | Description |
---|---|
addAttachment(fname, fdata) | Embed a file inside the PDF |
addBlankPage(width=None, height=None) | Append a blank page to this PDF file and return it |
addBookmark(title, pagenum, parent=None, color=None, bold=False, italic=False, fit='/Fit', *args) | Add a bookmark to this PDF file |
addJS(javascript) | Add JavaScript that will be launched when this PDF is opened |
addLink(pagenum, pagedest, rect, border=None, fit='/Fit', *args) | Add an internal link from a rectangular area to the specified page |
addPage(page) | Add a page to this PDF file; the page is usually obtained from a PdfFileReader instance |
getNumPages() | Return the number of pages |
getPage(pageNumber) | Retrieve the page with the given number from this PDF file |
insertBlankPage(width=None, height=None, index=0) | Insert a blank page into this PDF file and return it; if no page size is specified, the size of the last page is used |
insertPage(page, index=0) | Insert a page into this PDF file; the page is usually obtained from a PdfFileReader instance |
removeLinks() | Remove links and annotations from this output |
removeText(ignoreByteStringObject=False) | Remove text from this output |
write(stream) | Write the collection of pages added to this object out as a PDF file |
PDF write operation:
def addBlankpage():
    readFile = 'C:/Users/Administrator/Desktop/RxJava 完全解析.pdf'
    outFile = 'C:/Users/Administrator/Desktop/copy.pdf'
    pdfFileWriter = PdfFileWriter()

    # Get the PdfFileReader object
    pdfFileReader = PdfFileReader(readFile)  # Alternatively: pdfFileReader = PdfFileReader(open(readFile, 'rb'))
    numPages = pdfFileReader.getNumPages()

    for index in range(0, numPages):
        pageObj = pdfFileReader.getPage(index)
        pdfFileWriter.addPage(pageObj)  # Add each PageObject returned by the reader to the writer

    pdfFileWriter.addBlankPage()  # Append a blank page after the last page
    # Write everything out once; the with block closes the output file
    with open(outFile, 'wb') as outStream:
        pdfFileWriter.write(outStream)
The result: copy.pdf contains every page of the source document, followed by one blank page at the end.
Split the document (take the pages after the fifth page)
def splitPdf():
    readFile = 'C:/Users/Administrator/Desktop/RxJava 完全解析.pdf'
    outFile = 'C:/Users/Administrator/Desktop/copy.pdf'
    pdfFileWriter = PdfFileWriter()

    # Get the PdfFileReader object
    pdfFileReader = PdfFileReader(readFile)  # Alternatively: pdfFileReader = PdfFileReader(open(readFile, 'rb'))
    # Total number of pages in the document
    numPages = pdfFileReader.getNumPages()

    if numPages > 5:
        # Output the pages after the fifth page to a new file, i.e. split the document
        for index in range(5, numPages):
            pageObj = pdfFileReader.getPage(index)
            pdfFileWriter.addPage(pageObj)
        # Once every page has been added, save them to the file in one go
        with open(outFile, 'wb') as outStream:
            pdfFileWriter.write(outStream)
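The split above hard-codes the starting index. A slightly more general version might look like the following sketch; copy_page_range is a hypothetical helper, and it only relies on getNumPages(), getPage() and addPage() as described in the tables.

```python
def copy_page_range(reader, writer, start, stop=None):
    """Add pages [start, stop) of `reader` to `writer`.

    Indices are zero-based, matching getPage(). Returns the number of pages
    added. `copy_page_range` is a hypothetical helper for this sketch.
    """
    total = reader.getNumPages()
    # Clamp the range to the pages that actually exist.
    stop = total if stop is None else min(stop, total)
    start = max(start, 0)

    added = 0
    for index in range(start, stop):
        writer.addPage(reader.getPage(index))
        added += 1
    return added
```

For example, copy_page_range(pdfFileReader, pdfFileWriter, 5) would add the same pages as the loop in splitPdf(); the caller still writes the result with pdfFileWriter.write().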
Merge documents
def mergePdf(inFileList, outFile):
    '''
    Merge documents
    :param inFileList: list of the documents to merge
    :param outFile: output file for the merged document
    :return:
    '''
    pdfFileWriter = PdfFileWriter()
    for inFile in inFileList:
        # Open each file to be merged in turn; the input stream must stay open until write() is called
        pdfReader = PdfFileReader(open(inFile, 'rb'))
        numPages = pdfReader.getNumPages()
        for index in range(0, numPages):
            pageObj = pdfReader.getPage(index)
            pdfFileWriter.addPage(pageObj)

    # Finally, write everything to the output file in one go
    with open(outFile, 'wb') as outStream:
        pdfFileWriter.write(outStream)
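As an alternative to copying pages by hand, PyPDF2 also provides a PdfFileMerger class with append(), write() and close() methods. The following is a sketch of that different approach, not the method used above; merge_with_merger and its merger parameter are hypothetical names, and the import assumes a PyPDF2 release that still ships PdfFileMerger.

```python
def merge_with_merger(in_files, out_file, merger=None):
    """Concatenate the PDFs in `in_files` into `out_file`.

    `merge_with_merger` and `merger` are hypothetical names for this sketch.
    """
    if merger is None:
        # Assumption: the installed PyPDF2 still provides PdfFileMerger.
        from PyPDF2 import PdfFileMerger
        merger = PdfFileMerger()

    try:
        for path in in_files:
            # append() adds every page of the given file to the end.
            merger.append(path)
        with open(out_file, "wb") as out_stream:
            merger.write(out_stream)
    finally:
        # Release the input files held by the merger.
        merger.close()
```

Usage would be merge_with_merger(['a.pdf', 'b.pdf'], 'merged.pdf'), where the file names are placeholders.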
PageObject
PageObject(pdf=None, indirectRef=None)
This class represents a single page in a PDF file. A PageObject is usually obtained by calling the getPage() method of a PdfFileReader object; an empty page can also be created with the createBlankPage() static method.
Parameters:
- pdf: The PDF file the page belongs to.
- indirectRef: Stores the original indirect reference to this object in its source PDF.
Properties and methods of the PageObject object
Property or method | Description |
---|---|
static createBlankPage(pdf=None, width=None, height=None) | Return a new blank page |
extractText() | Find all text drawing commands, in the order they appear in the content stream, and extract the text |
getContents() | Access the page contents; returns a Contents object or None |
rotateClockwise(angle) | Rotate the page clockwise by the given angle, which must be a multiple of 90 degrees |
scale(sx, sy) | Scale the page by the given factors by applying a transformation matrix to its content and updating the page size |
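As an illustration of working with page objects, the sketch below rotates every page before handing it to a writer. add_rotated_pages is a hypothetical helper; it relies only on rotateClockwise(), getNumPages(), getPage() and addPage() as listed in the tables.

```python
def add_rotated_pages(reader, writer, angle=90):
    """Rotate each page of `reader` clockwise and add it to `writer`.

    `add_rotated_pages` is a hypothetical helper for this sketch.
    """
    # Check the angle up front so that nothing is added on bad input.
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")

    for index in range(reader.getNumPages()):
        page = reader.getPage(index)
        page.rotateClockwise(angle)  # changes the page object itself
        writer.addPage(page)
```

After the call, the writer is saved as usual with write().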
Rough extraction of a PDF's text content
def getPdfContent(filename):
    content = ""
    # The with block keeps the file open while the pages are read, then closes it
    with open(filename, "rb") as pdfStream:
        pdf = PdfFileReader(pdfStream)
        for i in range(0, pdf.getNumPages()):
            pageObj = pdf.getPage(i)
            extractedText = pageObj.extractText()
            content += extractedText + "\n"
    # return content.encode("ascii", "ignore")
    return content
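Because the extraction is rough, the returned string often needs some tidying before use. The following sketch is plain Python post-processing, not part of PyPDF2; tidy_extracted_text is a hypothetical helper, and whether collapsing whitespace like this is appropriate depends on the PDF.

```python
import re


def tidy_extracted_text(text):
    """Collapse runs of spaces/tabs and drop empty lines.

    `tidy_extracted_text` is a hypothetical helper for this sketch.
    """
    cleaned = []
    for line in text.splitlines():
        # Replace runs of spaces and tabs with one space, then trim the ends.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:  # skip lines that are empty after trimming
            cleaned.append(line)
    return "\n".join(cleaned)
```

It could be applied as tidy_extracted_text(getPdfContent(filename)).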