Python Office Automation: PDF Operations Explained in Detail


1. Introduction to PyMuPDF

1. Introduction

Before introducing PyMuPDF, let's first look at MuPDF. As the name suggests, PyMuPDF is the Python interface to MuPDF.

MuPDF

MuPDF is a lightweight PDF, XPS, and e-book viewer. It consists of a software library, command line tools, and viewers for various platforms.

The renderer in MuPDF is tailored for high-quality anti-aliased graphics. It renders text with metrics and spacing accurate to within fractions of a pixel, for the highest fidelity in reproducing the look of a printed page on screen.

The viewer is small, fast, yet complete. It supports many document formats, such as PDF, XPS, OpenXPS, CBZ, EPUB, and FictionBook 2. You can annotate PDF documents and fill out forms with the mobile viewers (this feature is coming soon to the desktop viewer as well).

The command line tools allow you to annotate, edit, and convert documents to other formats such as HTML, SVG, PDF, and CBZ. You can also write scripts in JavaScript to manipulate documents.

PyMuPDF

PyMuPDF (current version 1.18.17) is a Python binding for MuPDF (current version 1.18.*).

With PyMuPDF you can access files with the extensions ".pdf", ".xps", ".oxps", ".cbz", ".fb2", or ".epub". In addition, about 10 popular image formats can also be handled like documents: ".png", ".jpg", ".bmp", ".tiff", etc.

2. Function

NEW: Layout-preserving text extraction!

The script fitzcli.py provides text extraction in different formats through the subcommand "gettext". Particularly interesting is layout preservation, which generates text as close as possible to the original physical layout, working around areas that contain images and reproducing text in tables and multi-column layouts.

For all supported document types you can:

  • Decrypt files

  • Access meta information, links and bookmarks

  • Render pages in raster formats (PNG and others) or in the vector format SVG

  • Search text

  • Extract text and images

  • Convert to other formats: PDF, (X)HTML, XML, JSON, text

For PDF documents, there are a large number of additional functions: they can be created, merged or split. Pages can be inserted, deleted, rearranged or modified in a variety of ways (including annotations and form fields).

  • Images and fonts can be extracted or inserted

  • Full support for embedded files

  • PDF files can be reformatted to support double-sided printing, posterization, and application of logos or watermarks

  • Full support for password protection: decryption, encryption, encryption method selection, permission levels and user/owner password settings

  • Support for the PDF Optional Content concept for images, text and drawings

  • Can access and modify low-level PDF structures

  • The command line module "python -m fitz …" is a multifunctional utility with the following features:

    • Encryption/decryption/optimization

    • Creation of sub-documents

    • Joining of documents

    • Image/font extraction

    • Full support for embedded files

    • Layout-preserving text extraction (all documents)

2. Installation

PyMuPDF can be installed from source or from wheels.

For Windows, Linux and Mac OSX platforms, wheels are available in the download section of PyPI. This covers the 64-bit versions of Python 3.6 through 3.9. There is also a 32-bit version for Windows. Recently, wheels have also become available for the Linux ARM architecture: look for the platform tag manylinux2014_aarch64.

Apart from the standard library, it has no mandatory external dependencies. A few convenience methods only work if certain packages are installed:

  • Pillow: required when using Pixmap.pil_save() and Pixmap.pil_tobytes()

  • fontTools: required when using Document.subset_fonts()

  • pymupdf-fonts: a nice-to-have selection of fonts for text output methods

Install with pip:

pip install PyMuPDF

Import library:

import fitz

A note on the name fitz

The standard Python import statement for this library is import fitz. There are historical reasons for this: the original rendering library of MuPDF was called Libart.

After Artifex Software acquired the MuPDF project, the focus of development shifted to writing a new, modern graphics library called "Fitz". Fitz started out as an R&D project to replace the aging Ghostscript graphics library, but instead became the rendering engine for MuPDF (quoted from Wikipedia).

3. How to use

1. Import the library and check the version

import fitz
print(fitz.__doc__)
PyMuPDF 1.18.16: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-08-05 00:00:01.
Built for Python 3.8 on linux (64-bit).

2. Open the document

doc = fitz.open(filename)

This creates the Document object doc. filename must be a Python string that names an existing file.
It is also possible to open a document from in-memory data, or to create a new, empty PDF. You can also use documents as context managers.

3. Document methods and properties

 

Method/Property        Description
Document.page_count    Number of pages (int)
Document.metadata      Metadata (dict)
Document.get_toc()     Get the table of contents (list)
Document.load_page()   Load a page

 

Example:

>>> doc.page_count
1
>>> doc.metadata
{'format': 'PDF 1.7',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': '福昕阅读器PDF打印机 版本 10.0.130.3456',
 'creationDate': "D:20210810173328+08'00'",
 'modDate': "D:20210810173328+08'00'",
 'trapped': '',
 'encryption': None}

4. Get metadata

PyMuPDF fully supports standard metadata. Document.metadata is a Python dictionary with the following keys.

It is available for all document types, although not all entries always contain data. The metadata values are strings, or None unless otherwise indicated. Also note that a value that is present does not always contain meaningful data.

 

Key            Value
producer       Producer (producing software)
format         Format: 'PDF-1.4', 'EPUB', etc.
encryption     Encryption method used, if any
author         Author
modDate        Date of last modification
keywords       Keywords
title          Title
creationDate   Date of creation
creator        Creating application
subject        Subject
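The creationDate and modDate values are plain strings in the PDF date format, for example "D:20210810173328+08'00'". If you need real datetime objects, something like the following sketch may help. The function name parse_pdf_date is made up for this example, and it only handles complete date strings (PDF also permits shortened forms, which this sketch simply rejects).

```python
import re
from datetime import datetime, timedelta, timezone

_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"   # date and time
    r"(?:([+-])(\d{2})'?(\d{2})'?|(Z))?"               # optional UTC offset or Z
)


def parse_pdf_date(value):
    """Return a datetime for a complete PDF date string, or None if it does not match."""
    match = _PDF_DATE.match(value or "")
    if match is None:
        return None
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    sign, off_h, off_m, zulu = match.groups()[6:]
    tz = None                                   # no offset given: naive datetime
    if zulu:
        tz = timezone.utc
    elif sign:                                  # explicit offset such as +08'00'
        offset = timedelta(hours=int(off_h), minutes=int(off_m))
        tz = timezone(offset if sign == "+" else -offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


if __name__ == "__main__":
    print(parse_pdf_date("D:20210810173328+08'00'"))
```

With an open document you would call it as parse_pdf_date(doc.metadata["creationDate"]).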

 

5. Get the outline (table of contents)

toc = doc.get_toc()
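Each entry returned by get_toc() is a list that starts with the hierarchy level (1 for top-level entries), the title, and the 1-based page number. A small sketch of how such a list could be printed as an indented outline is shown below. The function name format_toc and the sample entries are made up for illustration.

```python
def format_toc(toc):
    """Render get_toc()-style entries as indented text lines."""
    lines = []
    for lvl, title, page, *_ in toc:       # ignore any extra items per entry
        indent = "    " * (lvl - 1)        # level 1 is not indented
        lines.append(f"{indent}{title} (page {page})")
    return lines


# Illustrative data; with a real document use: toc = doc.get_toc()
sample_toc = [
    [1, "Introduction", 1],
    [2, "Installation", 2],
    [1, "Usage", 4],
]

if __name__ == "__main__":
    print("\n".join(format_toc(sample_toc)))
```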

6. Pages (Page)

Page handling is at the heart of MuPDF's functionality.

  • You can render a page as a raster or vector (SVG) image, with the option to scale, rotate, move, or shear the page.

  • You can extract page text and images in multiple formats and search for text strings.

  • For PDF documents, many more methods are available for adding text or images to pages.

First, a Page must be created. This is done with a method of Document:

page = doc.load_page(pno) # loads page number 'pno' of the document (0-based)
page = doc[pno] # the short form

Any integer pno with -∞ < pno < page_count can be used here. Negative numbers count backwards from the end, so doc[-1] is the last page, just as with Python sequences.

A more advanced approach is to use the document as an iterator over the pages:

for page in doc:
    pass  # do something with 'page'

# ... or read backwards
for page in reversed(doc):
    pass  # do something with 'page'

# ... or even use 'slicing'
for page in doc.pages(start, stop, step):
    pass  # do something with 'page'

Next, we introduce the most common Page operations.

a. Check the page for links, annotations, or form fields

When a document is displayed in viewer software, links appear as "hot areas". If you click while the cursor shows the hand symbol, you are usually taken to the target encoded in that hot area. Here's how to get all links:

# get all links on a page
links = page.get_links()

links is a Python list of dictionaries.

The links can also be accessed through an iterator:

for link in page.links():
    pass  # do something with 'link'

If you are dealing with PDF document pages, there may also be annotations (Annot) or form fields (Widget), each with its own iterator:

for annot in page.annots():
    pass  # do something with 'annot'

for field in page.widgets():
    pass  # do something with 'field'
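As a small illustration of working with the link dictionaries, the sketch below collects the web addresses from a list of links. Links that point to a web address carry a "uri" entry, while links to another page carry a "page" entry instead. The function name external_urls and the sample entries are made up, and the "from" values are shown as plain tuples here rather than the rectangle objects a real document returns.

```python
def external_urls(links):
    """Collect the target URLs of all links that point to a web address."""
    return [link["uri"] for link in links if link.get("uri")]


# Illustrative entries shaped like page.get_links() results; the values are made up.
sample_links = [
    {"kind": 2, "from": (72, 72, 200, 90), "uri": "https://example.com"},
    {"kind": 1, "from": (72, 100, 200, 118), "page": 3},
]

if __name__ == "__main__":
    print(external_urls(sample_links))
```

With a real page you would call it as external_urls(page.get_links()).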

b. Render the page

This example creates a raster image of the page content:

pix = page.get_pixmap()

pix is a Pixmap object that (in this case) contains an RGB image of the page and can be used for a variety of purposes.

The method Page.get_pixmap() offers many variants for controlling the image: resolution, color space (for example, to produce a grayscale image or an image with a subtractive color scheme), transparency, rotation, mirroring, shifting, shearing, etc.

For example, to create an RGBA image (i.e., one that contains an alpha channel), specify pix = page.get_pixmap(alpha=True).

A Pixmap has many methods and attributes that are referenced below. Among them are the integers width and height (each in pixels) and stride (the number of bytes of one horizontal image line). The attribute samples represents a rectangular area of bytes (a Python bytes object) that holds the image data.

You can also create a vector image of a page using page.get_svg_image().

c. Save the page image to a file

We can simply store the image in a PNG file:

pix.save("page-%i.png" % page.number)

d. Extract text and images

We can also extract all text, images and other information of a page in many different forms and levels of detail:

text = page.get_text(opt)

Use one of the following strings for opt to obtain different formats:

  • "text": (Default) Plain text with newlines. No formatting, no text location details, no images

  • "blocks": Generate a list of text blocks (paragraphs)

  • "words": Generate a list of words (strings that contain no spaces)

  • "html": Create a complete visual version of the page, including any images. This can be displayed in an internet browser

  • "dict" / "json": Same information level as HTML, but provided as a Python dictionary or, respectively, a JSON string

  • "rawdict" / "rawjson": A superset of "dict" / "json". Like XML, it additionally provides character-level details

  • "xhtml": Same text information level as the "text" version, but includes images

  • "xml": Contains no images, but complete position and font information down to each single text character. Use an XML module to interpret it

e. Search text

You can find the exact location of a text string on the page:

areas = page.search_for("mupdf")

This returns a list of rectangles, each of which surrounds one occurrence of the string "mupdf" (case-insensitive). You can use this information to highlight those areas (PDF only) or to create a cross-reference of the document.

7. PDF operations

PDF is the only document type that can be modified using PyMuPDF. Other file types are read-only.

However, you can convert any document, including images, to PDF using Document.convert_to_pdf() and then apply all PyMuPDF features to the conversion result.

Document.save() always stores a PDF on disk in its current (possibly modified) state.

You can usually choose whether to save to a new file or just append the modifications to the existing file ("incremental save"), which is usually much faster.

Here's how to work with PDF documents.

a. Modify, create, rearrange and delete pages

There are several ways to manipulate the so-called page tree (which describes the structure of all pages) of a PDF:

  • Document.delete_page() and Document.delete_pages() delete pages.

  • Document.copy_page(), Document.fullcopy_page() and Document.move_page() copy or move pages to other locations in the same document.

  • Document.select() shrinks a PDF down to selected pages. The parameter is the sequence of page numbers to keep. These integers must all be in the range 0 <= i < page_count. When executed, any pages missing from this list are removed. The remaining pages appear in the order, and as many times (!), as you specified.

    So you can easily create a new PDF with:

    • The first or last 10 pages

    • Odd or even pages only (for duplex printing)

    • Pages that contain or do not contain a given text

    • Reversed page order

    The newly saved document will contain links, annotations and bookmarks that are still valid (i.e. that point to a selected page or to some external resource).

  • Document.insert_page() and Document.new_page() insert new pages.

Additionally, the pages themselves can be modified through a range of methods (e.g. page rotation, annotation and link maintenance, text and image insertion).
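The page selections listed above for Document.select() boil down to building lists of 0-based page numbers. The following sketch shows one way to build them with plain Python. The function names are made up for this example, and "odd" and "even" refer to page numbers as a reader counts them (starting at 1).

```python
def last_pages(page_count, n=10):
    """0-based numbers of the last n pages (all pages if there are fewer)."""
    return list(range(max(page_count - n, 0), page_count))


def odd_pages(page_count):
    """Pages a reader counts as 1, 3, 5, ... (0-based numbers 0, 2, 4, ...)."""
    return list(range(0, page_count, 2))


def even_pages(page_count):
    """Pages a reader counts as 2, 4, 6, ... (0-based numbers 1, 3, 5, ...)."""
    return list(range(1, page_count, 2))


def reversed_pages(page_count):
    """All pages in reverse order."""
    return list(range(page_count - 1, -1, -1))


if __name__ == "__main__":
    print(odd_pages(7), even_pages(7), reversed_pages(4), last_pages(25))
```

With an open PDF, such a list is passed on directly, for example doc.select(reversed_pages(doc.page_count)), followed by saving the document under a new name.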

b. Join and split PDF documents

The method Document.insert_pdf() copies pages between different PDF documents. Here is a simple joiner example (doc1 and doc2 are opened PDFs):

# append complete doc2 to the end of doc1
doc1.insert_pdf(doc2)

Below is a snippet that splits doc1. It creates a new document from the first and last 10 pages:

doc2 = fitz.open() # new empty PDF
doc2.insert_pdf(doc1, to_page = 9) # first 10 pages
doc2.insert_pdf(doc1, from_page = len(doc1) - 10) # last 10 pages
doc2.save("first-and-last-10.pdf")

c. Save

Document.save() always saves the document in its current state.

You can write changes back to the original PDF by specifying the option incremental=True. This process is (usually) very fast, as the changes are appended to the original file without completely rewriting it.

d. Close

It is often desirable to "close" a document while the program continues to run, so that control of the underlying file is returned to the operating system.

This can be achieved with the Document.close() method. In addition to closing the underlying file, it also releases the buffers associated with the document.

 

That’s it for today’s sharing, thank you all for reading.


 


Origin blog.csdn.net/Rocky006/article/details/132916360