Python Made Simple: Processing PDF Files with PyPDF2

Real-world applications often need to process PDF files. PyPDF2 is a library built for this: it supports operations such as reading, splitting, merging, and file conversion.

Documentation: http://pythonhosted.org/PyPDF2/

PyPDF2 installation

Installing in PyCharm: File -> Default Settings -> Project Interpreter


PdfFileReader

Constructor:

PyPDF2.PdfFileReader(stream, strict=True, warndest=None, overwriteWarnings=True)

Initializes a PdfFileReader object, which may take some time as the PDF stream's cross-reference table is read into memory.

Parameters:

  • stream: A file object, or an object that supports the standard read and seek methods like a file object. It can also be a string giving the path to a PDF file.
  • strict (bool): Determines whether the user should be warned about every problem. It also makes some correctable problems fatal. Defaults to True.
  • warndest: Destination for logging warnings (defaults to sys.stderr).
  • overwriteWarnings (bool): Determines whether to override Python's warnings.py module with a custom implementation (defaults to True).
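
The forms that stream can take are easy to mix up, so here is a minimal sketch of a wrapper that accepts a path, an open binary file, or raw bytes. It assumes a PyPDF2 release older than 3.0, where PdfFileReader is still available. The helper names to_stream and make_reader are hypothetical and not part of PyPDF2.

```python
import io


def to_stream(source):
    """Wrap raw bytes in a seekable stream; pass anything else through.

    PdfFileReader already accepts a path string or a file-like object,
    so only raw bytes (for example a PDF held in memory) need wrapping.
    """
    if isinstance(source, (bytes, bytearray)):
        return io.BytesIO(bytes(source))
    return source


def make_reader(source, strict=False):
    """Build a PdfFileReader from a path, an open binary file, or raw bytes."""
    # Imported lazily so to_stream() stays usable without PyPDF2 installed.
    from PyPDF2 import PdfFileReader  # assumption: PyPDF2 < 3.0
    return PdfFileReader(to_stream(source), strict=strict)
```

Passing strict=False here is a choice made for the sketch: going by the parameter description, it keeps correctable problems from becoming fatal.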

Properties and methods of the PdfFileReader object

Property or method: description

  • getDestinationPageNumber(destination): Retrieves the page number of the given destination object.
  • getDocumentInfo(): Retrieves the document information dictionary of the PDF file.
  • getFields(tree=None, retval=None, fileobj=None): Extracts field data if this PDF contains interactive form fields.
  • getFormTextFields(): Retrieves form fields that hold text data (inputs, dropdowns) from the document.
  • getNamedDestinations(tree=None, retval=None): Retrieves the named destinations present in the document.
  • getNumPages(): Returns the number of pages in this PDF file.
  • getOutlines(node=None, outlines=None): Retrieves the document outline present in the document.
  • getPage(pageNumber): Retrieves the page with the given number from this PDF file.
  • getPageLayout(): Gets the page layout.
  • getPageMode(): Gets the page mode.
  • getPageNumber(pageObject): Retrieves the number of the page on which the given pageObject is located.
  • getXmpMetadata(): Retrieves XMP data from the PDF document root.
  • isEncrypted: Read-only boolean property showing whether the PDF file is encrypted.
  • namedDestinations: Read-only property that accesses the getNamedDestinations() function.
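
The isEncrypted property is normally checked before reading pages. The sketch below shows one hedged way to do that. It relies on decrypt(password), which is not listed above. In the PyPDF2 documentation, decrypt returns 0 when the password does not match. The function name unlock is hypothetical.

```python
def unlock(reader, password=""):
    """Decrypt `reader` in place if it is encrypted, then return it.

    `reader` is expected to behave like a PdfFileReader: it exposes the
    isEncrypted property and a decrypt(password) method.
    """
    if reader.isEncrypted:
        # decrypt() returns 0 when the password does not match.
        if reader.decrypt(password) == 0:
            raise ValueError("wrong password for encrypted PDF")
    return reader
```

Because the function only touches isEncrypted and decrypt, it can be tried with any object that offers the same two names.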

PDF read operation:

# encoding:utf-8
from PyPDF2 import PdfFileReader, PdfFileWriter

readFile = 'C:/Users/Administrator/Desktop/RxJava 完全解析.pdf'
# Get a PdfFileReader object
pdfFileReader = PdfFileReader(readFile)  # or: pdfFileReader = PdfFileReader(open(readFile, 'rb'))
# Get the document information of the PDF file
documentInfo = pdfFileReader.getDocumentInfo()
print('documentInfo = %s' % documentInfo)
# Get the page layout
pageLayout = pdfFileReader.getPageLayout()
print('pageLayout = %s ' % pageLayout)

# Get the page mode
pageMode = pdfFileReader.getPageMode()
print('pageMode = %s' % pageMode)

# Get the XMP metadata
xmpMetadata = pdfFileReader.getXmpMetadata()
print('xmpMetadata  = %s ' % xmpMetadata)

# Get the number of pages in the PDF file
pageCount = pdfFileReader.getNumPages()

print('pageCount = %s' % pageCount)
for index in range(0, pageCount):
    # Return the pageObject for the given page index
    pageObj = pdfFileReader.getPage(index)
    print('index = %d , pageObj = %s' % (index, type(pageObj)))  # <class 'PyPDF2.pdf.PageObject'>
    # Get the page number of the pageObject within the PDF document
    pageNumber = pdfFileReader.getPageNumber(pageObj)
    print('pageNumber = %s ' % pageNumber)

Output result:

documentInfo = {'/Title': IndirectObject(157, 0), '/Producer': IndirectObject(158, 0), '/Creator': IndirectObject(159, 0), '/CreationDate': IndirectObject(160, 0), '/ModDate': IndirectObject(160, 0), '/Keywords': IndirectObject(161, 0), '/AAPL:Keywords': IndirectObject(162, 0)}
pageLayout = None 
pageMode = None
xmpMetadata  = None 
pageCount = 3
index = 0 , pageObj = <class 'PyPDF2.pdf.PageObject'>
pageNumber = 0 
index = 1 , pageObj = <class 'PyPDF2.pdf.PageObject'>
pageNumber = 1 
index = 2 , pageObj = <class 'PyPDF2.pdf.PageObject'>
pageNumber = 2 

PdfFileWriter

This class writes PDF files out, given pages produced by another class (typically PdfFileReader).

Property or method: description

  • addAttachment(fname, fdata): Embeds a file inside the PDF.
  • addBlankPage(width=None, height=None): Appends a blank page to this PDF file and returns it.
  • addBookmark(title, pagenum, parent=None, color=None, bold=False, italic=False, fit='/Fit', *args): Adds a bookmark to this PDF file.
  • addJS(javascript): Adds JavaScript that will run when this PDF is opened.
  • addLink(pagenum, pagedest, rect, border=None, fit='/Fit', *args): Adds an internal link from a rectangular area to the specified page.
  • addPage(page): Adds a page to this PDF file. The page is usually obtained from a PdfFileReader instance.
  • getNumPages(): Returns the number of pages.
  • getPage(pageNumber): Retrieves the page with the given number from this PDF file.
  • insertBlankPage(width=None, height=None, index=0): Inserts a blank page into this PDF file and returns it. If no page size is specified, the size of the last page is used.
  • insertPage(page, index=0): Inserts a page into this PDF file. The page is usually obtained from a PdfFileReader instance.
  • removeLinks(): Removes links and annotations from this output.
  • removeText(ignoreByteStringObject=False): Removes text from this output.
  • write(stream): Writes the collection of pages added to this object out as a PDF file.

PDF write operation:

def addBlankpage():
    readFile = 'C:/Users/Administrator/Desktop/RxJava 完全解析.pdf'
    outFile = 'C:/Users/Administrator/Desktop/copy.pdf'
    pdfFileWriter = PdfFileWriter()

    # Get a PdfFileReader object
    pdfFileReader = PdfFileReader(readFile)  # or: pdfFileReader = PdfFileReader(open(readFile, 'rb'))
    numPages = pdfFileReader.getNumPages()

    # Copy every page of the source document into the writer
    for index in range(0, numPages):
        pageObj = pdfFileReader.getPage(index)
        pdfFileWriter.addPage(pageObj)

    # Append a blank page after the last page; with no size given,
    # it takes the size of the last page
    pdfFileWriter.addBlankPage()

    # Write the output once, after all pages have been added
    with open(outFile, 'wb') as out:
        pdfFileWriter.write(out)

Result: copy.pdf contains every page of the original document, followed by one blank page at the end.

Split the document (take the pages after the fifth page)

def splitPdf():
    readFile = 'C:/Users/Administrator/Desktop/RxJava 完全解析.pdf'
    outFile = 'C:/Users/Administrator/Desktop/copy.pdf'
    pdfFileWriter = PdfFileWriter()

    # Get a PdfFileReader object
    pdfFileReader = PdfFileReader(readFile)  # or: pdfFileReader = PdfFileReader(open(readFile, 'rb'))
    # Total number of pages in the document
    numPages = pdfFileReader.getNumPages()

    if numPages > 5:
        # Copy the pages after the fifth page (index 5 onward) to a new file, i.e. split the document
        for index in range(5, numPages):
            pageObj = pdfFileReader.getPage(index)
            pdfFileWriter.addPage(pageObj)
        # Once every page has been added, save them all to the file in one go
        with open(outFile, 'wb') as out:
            pdfFileWriter.write(out)
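
The hard-coded start index of 5 can be generalized. The following is a minimal sketch of a helper that copies an arbitrary 0-based page range. The name copy_page_range and the clamping behavior are assumptions made for illustration. It only calls getNumPages(), getPage(), and addPage(), as listed in the sections above.

```python
def copy_page_range(reader, writer, start, stop=None):
    """Copy pages start..stop-1 (0-based) from reader to writer.

    Returns the number of pages copied. Indices are clamped to the
    document, so an out-of-range request copies fewer pages instead of
    raising an error.
    """
    num_pages = reader.getNumPages()
    stop = num_pages if stop is None else min(stop, num_pages)
    start = max(start, 0)
    copied = 0
    for index in range(start, stop):
        writer.addPage(reader.getPage(index))
        copied += 1
    return copied
```

With real PyPDF2 objects, copy_page_range(pdfFileReader, pdfFileWriter, 5) adds the same pages as the loop in splitPdf(). The caller still has to write the output afterwards.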

Merge documents


def mergePdf(inFileList, outFile):
    '''
    Merge documents
    :param inFileList: list of the documents to merge
    :param outFile:    output file for the merged result
    :return:
    '''
    pdfFileWriter = PdfFileWriter()
    for inFile in inFileList:
        # Open each file to be merged in turn
        pdfReader = PdfFileReader(open(inFile, 'rb'))
        numPages = pdfReader.getNumPages()
        for index in range(0, numPages):
            pageObj = pdfReader.getPage(index)
            pdfFileWriter.addPage(pageObj)

    # Finally, write everything to the output file in one go
    with open(outFile, 'wb') as out:
        pdfFileWriter.write(out)
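
One detail the merge code leaves open is when the input files get closed. As far as I know, PyPDF2 reads page content lazily, so the inputs have to stay open until write() has finished. The sketch below keeps them open with contextlib.ExitStack and closes them afterwards. It assumes PyPDF2 older than 3.0. The name merge_pdfs and the reader_cls / writer_cls parameters are hypothetical and exist only so the function can be exercised without real PDFs.

```python
from contextlib import ExitStack


def merge_pdfs(in_paths, out_path, reader_cls=None, writer_cls=None):
    """Concatenate the PDFs in in_paths into out_path; return the page count."""
    if reader_cls is None or writer_cls is None:
        # Imported lazily; assumption: PyPDF2 < 3.0.
        from PyPDF2 import PdfFileReader, PdfFileWriter
        reader_cls = reader_cls or PdfFileReader
        writer_cls = writer_cls or PdfFileWriter

    writer = writer_cls()
    total = 0
    with ExitStack() as stack:
        for path in in_paths:
            # Keep every input open until write() below has finished.
            handle = stack.enter_context(open(path, "rb"))
            reader = reader_cls(handle)
            for index in range(reader.getNumPages()):
                writer.addPage(reader.getPage(index))
                total += 1
        with open(out_path, "wb") as out:
            writer.write(out)
    # Leaving the ExitStack block closes all input files.
    return total
```

Called as merge_pdfs(inFileList, outFile), it plays the same role as mergePdf() above.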

PageObject

PageObject(pdf=None,indirectRef=None)

This class represents a single page in a PDF file. A PageObject is usually obtained by calling the getPage() method of a PdfFileReader object. An empty page can also be created with the createBlankPage() static method.

Parameters:

  • pdf: The PDF file the page belongs to.
  • indirectRef: Stores the original indirect reference to this object in its source PDF.

Properties and methods of the PageObject object

Property or method: description

  • static createBlankPage(pdf=None, width=None, height=None): Returns a new blank page.
  • extractText(): Finds all text drawing commands, in the order they appear in the content stream, and extracts the text.
  • getContents(): Accesses the page contents and returns the Contents object or None.
  • rotateClockwise(angle): Rotates the page clockwise by the given angle, which must be a multiple of 90 degrees.
  • scale(sx, sy): Scales the page by the given factors by applying a transformation matrix to its contents and updating the page size.
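
As an illustration of how a PageObject method combines with the reader and writer classes, here is a minimal sketch that rotates every page of a document. The function name rotate_pages is hypothetical. It calls only getNumPages(), getPage(), rotateClockwise(), and addPage(), and it checks the angle up front because rotateClockwise() expects a multiple of 90 degrees.

```python
def rotate_pages(reader, writer, angle):
    """Add every page of reader to writer, rotated clockwise by angle degrees."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90, got %r" % (angle,))
    for index in range(reader.getNumPages()):
        page = reader.getPage(index)
        # rotateClockwise() changes the page object itself.
        page.rotateClockwise(angle)
        writer.addPage(page)
```

After the call, the writer still has to be saved with write(stream), as in the earlier examples.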

Roughly extract the text content of a PDF


def getPdfContent(filename):
    pdf = PdfFileReader(open(filename, "rb"))
    content = ""
    for i in range(0, pdf.getNumPages()):
        pageObj = pdf.getPage(i)

        extractedText = pageObj.extractText()
        content += extractedText + "\n"
        # return content.encode("ascii", "ignore")
    return content
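
When the text is needed page by page rather than as one string, a generator is a small variation on the function above. This is a sketch under the same assumptions (a reader that offers getNumPages() and getPage(), and pages that offer extractText()). The name iter_page_text and the 1-based numbering are choices made for illustration.

```python
def iter_page_text(reader):
    """Yield (page_number, text) pairs, with page numbers starting at 1."""
    for index in range(reader.getNumPages()):
        text = reader.getPage(index).extractText()
        # Strip surrounding whitespace so empty pages come back as "".
        yield index + 1, text.strip()
```

A caller can then skip empty pages with a simple check such as "if text:" inside the loop.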
