Working with PDF Files in Python

You are almost certainly familiar with PDFs. They are one of the most important and most widely used forms of digital media. PDF stands for Portable Document Format, and PDF files use the .pdf extension. The format is used to present and exchange documents reliably, independent of software, hardware, or operating system.

PDF was invented by Adobe and is now an open standard maintained by the International Organization for Standardization (ISO). PDFs can contain links and buttons, form fields, audio, video, and business logic.

In this article, we will learn how to perform various operations, such as:

  • Extract text from PDF
  • Rotate PDF pages
  • Split PDF

Installation

We will use a third-party module, PyPDF2.

PyPDF2 is a Python library built as a PDF toolkit. It has the following capabilities:

  • Extracting document information (title, author, etc.)
  • Splitting documents page by page
  • Merging documents page by page
  • Cropping pages
  • Merging multiple pages into a single page
  • Encrypting and decrypting PDF files
  • And much more!

To install PyPDF2, run the following command from the command line:

pip install PyPDF2

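The module name is case sensitive, so make sure the y is lowercase and every other letter is uppercase. The class and method names used in this article (PdfFileReader, PdfFileWriter, numPages, getPage, extractText, and so on) belong to the older PyPDF2 API. PyPDF2 3.0 removed them in favor of PdfReader, PdfWriter, reader.pages, and extract_text(), so install a release earlier than 3.0 to run the code exactly as shown. All the code and PDF files used in this tutorial are available here.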

1. Extract text from PDF files
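
A minimal sketch of the complete script, assembled from the individual lines explained in this section. The function name print_first_page and the file name example.pdf are placeholders chosen for illustration, and the method names assume a PyPDF2 release earlier than 3.0.

```python
def print_first_page(pdfPath="example.pdf"):
    # Imported inside the function so that defining it does not require
    # PyPDF2 to be installed; assumes a PyPDF2 release earlier than 3.0.
    import PyPDF2

    # open the PDF in binary read mode
    pdfFileObj = open(pdfPath, 'rb')

    # create a pdf reader object from the file object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # print the number of pages in the PDF file
    print(pdfReader.numPages)

    # get the first page (page numbers start at 0)
    pageObj = pdfReader.getPage(0)

    # print the text extracted from that page
    print(pageObj.extractText())

    # close the pdf file object
    pdfFileObj.close()
```

Calling print_first_page("example.pdf") prints the page count first, followed by whatever text PyPDF2 manages to extract from the first page.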

Let's walk through the code piece by piece:

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

Here we create an object of the PdfFileReader class from the PyPDF2 module, passing it the PDF file object, and get back a PDF reader object.

print(pdfReader.numPages)

The numPages property gives the number of pages in the PDF file. For example, in our case it was 455 (see the first line of the output).

pageObj = pdfReader.getPage(0)

Now we create an object of the PageObject class from the PyPDF2 module. The PDF reader object has a getPage() function, which takes a page number (starting from index 0) as its argument and returns a page object.

print(pageObj.extractText())

The page object has an extractText() function, which extracts the text from that PDF page.
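
The same calls can be put in a loop to read every page instead of only the first. This is a sketch, not part of the original listing: the function name extract_all_text is hypothetical, and the method names again assume a PyPDF2 release earlier than 3.0.

```python
def extract_all_text(pdfPath):
    """Return a list with the extracted text of each page, in page order."""
    # Imported here so that defining the function does not require PyPDF2;
    # assumes a PyPDF2 release earlier than 3.0.
    import PyPDF2

    pageTexts = []
    # the with statement closes the file even if extraction fails midway
    with open(pdfPath, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        # page numbers run from 0 to numPages - 1
        for pageNum in range(pdfReader.numPages):
            pageObj = pdfReader.getPage(pageNum)
            pageTexts.append(pageObj.extractText())
    return pageTexts
```

Returning a list rather than one long string keeps the page boundaries, so the text of page n is simply the n-th element.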

pdfFileObj.close()

Finally, we close the PDF file object.

Note: Although PDF files are great for laying out text in a way that is easy for people to read and print, they are not easy for software to parse into plain text. As a result, PyPDF2 may make mistakes when extracting text from a PDF, and it may be unable to open some PDFs at all. Unfortunately, there is not much you can do about this: PyPDF2 may simply be unable to work with certain PDF files.
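
Because extraction can fail, it may help to guard the call. The sketch below is one possible approach, not something the library prescribes: safe_page_text is a hypothetical helper, it catches Exception broadly because the exact error class depends on the PyPDF2 version, and it treats whitespace-only output as "no text".

```python
def safe_page_text(pdfPath, pageNum=0):
    """Return the text of one page, or None if nothing usable was extracted."""
    # Imported here so that defining the function does not require PyPDF2;
    # assumes a PyPDF2 release earlier than 3.0.
    import PyPDF2

    try:
        with open(pdfPath, 'rb') as pdfFileObj:
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
            text = pdfReader.getPage(pageNum).extractText()
    except Exception as err:
        # Deliberately broad (an assumption of this sketch): a missing file,
        # a damaged PDF, and an out-of-range page are all reported the same way.
        print("Could not read {}: {}".format(pdfPath, err))
        return None

    # A page may parse without error and still yield no usable text.
    if not text.strip():
        return None
    return text
```

A None result tells the caller to skip the file or fall back to another tool, rather than silently processing an empty string.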

2. Rotate PDF pages
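
A minimal sketch of the complete rotation function, assembled from the snippets explained in this section. The function name PDFrotate and the parameter origFileName are assumptions made for illustration (newFileName and rotation appear in the snippets), the up-front check on the angle is an addition, and the method names assume a PyPDF2 release earlier than 3.0.

```python
def PDFrotate(origFileName, newFileName, rotation):
    # Pages can only be rotated in 90-degree steps, so reject other
    # angles before any file is opened (a check added in this sketch).
    if rotation % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")

    # Imported here so that defining the function does not require PyPDF2;
    # assumes a PyPDF2 release earlier than 3.0.
    import PyPDF2

    # create a pdf reader object for the original PDF
    pdfFileObj = open(origFileName, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # create a pdf writer object for the new PDF
    pdfWriter = PyPDF2.PdfFileWriter()

    # rotate each page and add it to the writer
    for page in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(page)
        pageObj.rotateClockwise(rotation)
        pdfWriter.addPage(pageObj)

    # write the rotated pages to the new file
    newFile = open(newFileName, 'wb')
    pdfWriter.write(newFile)

    # close the original file object and the new file object
    pdfFileObj.close()
    newFile.close()
```

For example, PDFrotate('example.pdf', 'rotated_example.pdf', 270) would write a copy in which every page is turned 270 degrees clockwise. Both file names are placeholders.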

Some important points about the rotation code:

  • For rotation, we first create a PDF reader object for the original PDF.

pdfWriter = PyPDF2.PdfFileWriter()

The rotated pages will be written to a new PDF. To write PDFs, we use an object of the PdfFileWriter class from the PyPDF2 module.

for page in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(page)
        pageObj.rotateClockwise(rotation)
        pdfWriter.addPage(pageObj)

Now we iterate over each page of the original PDF. We get each page object with the getPage() method of the PDF reader class. We then rotate the page with the rotateClockwise() method of the page object class, passing it the rotation angle in degrees. Finally, we add the rotated page object to the PDF writer with the addPage() method of the PDF writer class.

newFile = open(newFileName, 'wb')
pdfWriter.write(newFile)
pdfFileObj.close()
newFile.close()

Now we have to write the PDF pages to a new PDF file. First we open the new file object, then we write the PDF pages to it with the write() method of the PDF writer object. Finally, we close the original PDF file object and the new file object.

3. Split a PDF file

The output will be three new PDF files: split 1 (pages 0 and 1), split 2 (pages 2 and 3), and split 3 (page 4 to the end).

This program uses no new functions or classes. Using simple logic and iteration, we create the splits of the given PDF according to the list of split points passed in.
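
A sketch of how such a split could be written. The names split_ranges and PDFsplit, the output naming scheme, and the split list [2, 4] used in the usage note are assumptions made for illustration. The method names assume a PyPDF2 release earlier than 3.0.

```python
def split_ranges(numPages, splits):
    """Turn split points such as [2, 4] into (start, end) page ranges.

    Each range covers pages start .. end - 1, and the last range runs
    to the end of the document.
    """
    bounds = [0] + list(splits) + [numPages]
    return list(zip(bounds[:-1], bounds[1:]))


def PDFsplit(pdfPath, splits):
    """Write one new PDF per range and return the list of file names."""
    # Imported here so that defining the function does not require PyPDF2;
    # assumes a PyPDF2 release earlier than 3.0.
    import PyPDF2

    # strip a trailing ".pdf" so the split number can be appended
    base = pdfPath[:-4] if pdfPath.lower().endswith('.pdf') else pdfPath
    outputs = []

    # the source file stays open until every split has been written
    with open(pdfPath, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        ranges = split_ranges(pdfReader.numPages, splits)
        for i, (start, end) in enumerate(ranges, 1):
            # a fresh writer for each split
            pdfWriter = PyPDF2.PdfFileWriter()
            for page in range(start, end):
                pdfWriter.addPage(pdfReader.getPage(page))
            outputName = "{}_split{}.pdf".format(base, i)
            with open(outputName, 'wb') as outputFile:
                pdfWriter.write(outputFile)
            outputs.append(outputName)
    return outputs
```

With splits = [2, 4], split_ranges yields pages 0-1, pages 2-3, and page 4 through the last page, which matches the three output files described above.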

Origin www.linuxidc.com/Linux/2019-12/161741.htm