1. Introduction to PyMuPDF
1. Introduction
Before introducing PyMuPDF, let's first look at MuPDF. As the name suggests, PyMuPDF is the Python interface to MuPDF.
MuPDF
MuPDF is a lightweight PDF, XPS and e-book viewer. It consists of software libraries, command line tools, and viewers for various platforms.
The renderer in MuPDF is tailor-made for high-quality anti-aliased graphics. It renders text with metrics and spacing accurate to within fractions of a pixel, for the highest fidelity in reproducing the look of a printed page on screen.
The viewer is small and fast, yet complete. It supports multiple document formats such as PDF, XPS, OpenXPS, CBZ, EPUB and FictionBook 2. In the mobile viewers you can annotate PDF documents and fill out forms (this feature will soon be available in the desktop viewers as well).
The command line tools allow you to annotate, edit, and convert documents to other formats such as HTML, SVG, PDF and CBZ. You can also manipulate documents with JavaScript scripts.
PyMuPDF
PyMuPDF (current version 1.18.17) is a Python binding for MuPDF (current version 1.18.*).
Using PyMuPDF, you can access files with the extensions ".pdf", ".xps", ".oxps", ".cbz", ".fb2" or ".epub". In addition, about 10 popular image formats can also be handled like documents: ".png", ".jpg", ".bmp", ".tiff", etc.
2. Function
NEW: Layout-preserving text extraction!
The script fitzcli.py offers text extraction in different formats through its subcommand "gettext". Of particular interest is layout preservation, which produces text as close as possible to the original physical layout: areas are left free where images are, and text in tables and multi-column layouts is reproduced in place.
For all supported document types you can:
- Decrypt files
- Access meta information, links and bookmarks
- Render pages in raster formats (PNG and others) or as vector SVG
- Search text
- Extract text and images
- Convert to other formats: PDF, (X)HTML, XML, JSON, text
For PDF documents, there are a large number of additional functions: they can be created, merged or split. Pages can be inserted, deleted, rearranged or modified in a variety of ways (including annotations and form fields).
- Images and fonts can be extracted or inserted
- Full support for embedded files
- PDF files can be reformatted to support double-sided printing, posterization, and application of logos or watermarks
- Full support for password protection: decryption, encryption, encryption method selection, permission levels and user/owner password settings
- Support for the PDF optional content concept for images, text and drawings
- Low-level PDF structures can be accessed and modified
- The command line module "python -m fitz …" is a versatile utility with the following features:
  - Encryption / decryption / optimization
  - Creating sub-documents
  - Joining documents
  - Image / font extraction
  - Full support for embedded files
  - Layout-preserving text extraction (all documents)
2. Installation
PyMuPDF can be installed from source or from wheels.
For Windows, Linux and Mac OSX platforms, wheels are available in the Downloads section of PyPI. They cover the 64-bit versions of Python 3.6 to 3.9. There is also a 32-bit version for Windows. Since recently, wheels also exist for the Linux ARM architecture: look for the platform tag manylinux2014_aarch64.
Apart from the standard library, it has no mandatory external dependencies. There are some nice methods that only work if certain packages are installed:
- Pillow: required when using Pixmap.pil_save() and Pixmap.pil_tobytes()
- fontTools: required when using Document.subset_fonts()
- pymupdf-fonts: a nice selection of fonts for the text output methods
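To see which of these optional packages are present, a small check like the following can help. It is only a sketch: the import names (`PIL`, `fontTools`, `pymupdf_fonts`) are assumptions about how each package is imported, and the helper `check_optional_packages` is a hypothetical name used for illustration.

```python
import importlib.util

# Assumed import names for the optional packages listed above.
OPTIONAL_MODULES = {
    "Pillow": "PIL",
    "fontTools": "fontTools",
    "pymupdf-fonts": "pymupdf_fonts",
}


def check_optional_packages():
    """Return a dict mapping each package name to True if it can be imported."""
    return {
        package: importlib.util.find_spec(module) is not None
        for package, module in OPTIONAL_MODULES.items()
    }


for package, available in check_optional_packages().items():
    print(f"{package}: {'installed' if available else 'missing'}")
```

Nothing is imported by this check, so it is safe to run even when none of the packages are installed.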
Install with pip:
pip install PyMuPDF
Import library:
import fitz
A note on the name fitz
The standard Python import statement for this library is import fitz. There are historical reasons for this: the original rendering library of MuPDF was called Libart.
After Artifex Software acquired the MuPDF project, the focus of development shifted to writing a new, modern graphics library called "Fitz". Fitz was originally started as an R&D project to replace the aging Ghostscript graphics library, but became the rendering engine for MuPDF (quoted from Wikipedia).
3. How to use
1. Import the library and check the version
import fitz
print(fitz.__doc__)
PyMuPDF 1.18.16: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-08-05 00:00:01.
Built for Python 3.8 on linux (64-bit).
2. Open the document
doc = fitz.open(filename)
This creates the Document object doc. The filename must be a Python string naming an existing file.
It is also possible to open a document from data in memory, or to create a new, empty PDF. You can also use a document as a context manager.
3. Document methods and properties
Method / property | Description |
---|---|
Document.page_count | Number of pages (int) |
Document.metadata | Metadata (dict) |
Document.get_toc() | Get the table of contents (list) |
Document.load_page() | Load a page |
Example:
>>> doc.page_count
1
>>> doc.metadata
{'format': 'PDF 1.7',
'title': '',
'author': '',
'subject': '',
'keywords': '',
'creator': '',
'producer': '福昕阅读器PDF打印机 版本 10.0.130.3456',
'creationDate': "D:20210810173328+08'00'",
'modDate': "D:20210810173328+08'00'",
'trapped': '',
'encryption': None}
4. Get metadata
PyMuPDF fully supports standard metadata. Document.metadata is a Python dictionary with the keys listed below.
It is available for all document types, though not every entry always contains data. The metadata fields are strings, or None, unless indicated otherwise. Also note that not all of them always contain meaningful data, even if they are present.
Key | Value |
---|---|
producer | producer (producing software) |
format | format: ‘PDF-1.4’, ‘EPUB’, etc. |
encryption | encryption method used if any |
author | author |
modDate | date of last modification |
keywords | keywords |
title | title |
creationDate | date of creation |
creator | creating application |
subject | subject |
5. Get the table of contents (outline)
toc = doc.get_toc()
6. Pages (Page)
Page handling is at the heart of MuPDF's functionality.
- You can render a page as a raster or vector (SVG) image, optionally zooming, rotating, shifting or shearing it.
- You can extract a page's text and images in multiple formats and search for text strings.
- For PDF documents, there are many more methods for adding text or images to a page.
First, a Page object must be created. This is a method of Document:
page = doc.load_page(pno) # loads page number 'pno' of the document (0-based)
page = doc[pno] # the short form
Any integer -inf < pno < page_count can be used here. Negative numbers count backwards from the end, so doc[-1] is the last page, just like with Python sequences.
A more advanced approach is to use the document as an iterator over the pages:
for page in doc:
    pass  # do something with 'page'
# ... or read backwards
for page in reversed(doc):
    pass  # do something with 'page'
# ... or even use 'slicing'
for page in doc.pages(start, stop, step):
    pass  # do something with 'page'
Next, we look at the most common Page operations.
a. Check the page for links, annotations, or form fields
When a document is displayed in viewer software, links appear as "hot areas". If you click while the cursor shows a hand symbol, you will usually be taken to the target encoded in that hot area. Here is how to get all links:
# get all links on a page
links = page.get_links()
links is a list of Python dictionaries.
Can also be used as an iterator:
for link in page.links():
    pass  # do something with 'link'
If you are dealing with a PDF document page, there may also be annotations (Annot) or form fields (Widget), each of which has its own iterator:
for annot in page.annots():
    pass  # do something with 'annot'
for field in page.widgets():
    pass  # do something with 'field'
b. Render the page
This example creates a raster image of the page content:
pix = page.get_pixmap()
pix is a Pixmap object that (in this case) contains an RGB image of the page and can be used for a variety of purposes.
The method Page.get_pixmap() offers many variants for controlling the image: resolution, colorspace (for example, to produce a grayscale image or an image with a subtractive color scheme), transparency, rotation, mirroring, shifting, shearing, etc.
For example, to create an RGBA image (i.e. one containing an alpha channel), specify pix = page.get_pixmap(alpha=True).
A Pixmap has many methods and properties. These include the integers width and height (each in pixels) and stride (the number of bytes of one horizontal image line). The property samples holds the rectangular area of bytes that represents the image data (a Python bytes object).
You can also create a vector image of a page with page.get_svg_image().
c. Save the page image to a file
We can simply store the image in a PNG file:
pix.save("page-%i.png" % page.number)
d. Extract text and images
We can also extract all text, images and other information of a page in many different forms and levels of detail:
text = page.get_text(opt)
Use one of the following strings for opt to obtain different formats:
- "text": (default) plain text with line breaks. No formatting, no text position details, no images.
- "blocks": generates a list of text blocks (paragraphs).
- "words": generates a list of words (strings not containing spaces).
- "html": creates a full visual version of the page, including any images. This can be displayed in an internet browser.
- "dict" / "json": same information level as HTML, but provided as a Python dictionary or a JSON string, respectively.
- "rawdict" / "rawjson": a superset of "dict" / "json". It additionally provides character-level detail, like XML.
- "xhtml": same text information level as the "text" version, but includes images.
- "xml": contains no images, but full position and font information for every single text character. Use an XML module to interpret it.
e. Search text
You can find the exact location of a text string on the page:
areas = page.search_for("mupdf")
This returns a list of rectangles, each of which surrounds one occurrence of the string "mupdf" (case insensitive). You can use this information to highlight those areas (PDF only) or to create a cross-reference of the document.
7. PDF operations
PDF is the only document type that can be modified with PyMuPDF. Other file types are read-only.
However, you can convert any document, including images, to a PDF with Document.convert_to_pdf() and then apply all PyMuPDF features to the conversion result.
Document.save() always stores a PDF on disk in its current (possibly modified) state.
You can usually choose whether to save to a new file or just append your modifications to the existing file ("incremental save"), which is often much faster.
The following describes ways to manipulate PDF documents.
a. Modify, create, rearrange and delete pages
There are several ways to manipulate the so-called page tree (a structure describing all the pages) of a PDF:
- Document.delete_page() and Document.delete_pages() delete pages.
- Document.copy_page(), Document.fullcopy_page() and Document.move_page() copy or move a page to another location within the same document.
- Document.select() shrinks a PDF down to selected pages. The parameter is a sequence of the page numbers to keep. These integers must all be in the range 0 <= i < page_count. When executed, all pages missing from this list are deleted. The remaining pages appear in the order, and as many times (!), as you specify them. So you can easily create a new PDF with:
  - The first or last 10 pages
  - Only the odd or only the even pages (for double-sided printing)
  - Pages that do or do not contain a given text
  - The page order reversed
  The saved new document will contain links, annotations and bookmarks that are still valid (in other words, pointing either to a selected page or to some external resource).
- Document.insert_page() and Document.new_page() insert new pages.
In addition, pages themselves can be modified by a range of methods (e.g. page rotation, annotation and link maintenance, text and image insertion).
b. Join and split PDF documents
The method Document.insert_pdf() copies pages between different PDF documents. Here is a simple joiner example (doc1 and doc2 being open PDFs):
# append complete doc2 to the end of doc1
doc1.insert_pdf(doc2)
Below is a snippet that splits doc1. It creates a new document containing its first and last 10 pages:
doc2 = fitz.open() # new empty PDF
doc2.insert_pdf(doc1, to_page = 9) # first 10 pages
doc2.insert_pdf(doc1, from_page = len(doc1) - 10) # last 10 pages
doc2.save("first-and-last-10.pdf")
c. Save
Document.save() always saves the document in its current state.
You can write changes back to the original PDF by specifying the option incremental=True. This process is (usually) very fast, since the changes are appended to the original file without completely rewriting it.
d. Close
It is often desirable to "close" a document while the program continues to run, in order to hand control of the underlying file back to the operating system.
This is done with the Document.close() method. Apart from closing the underlying file, the buffers associated with the document are also freed.
That’s it for today’s sharing, thank you all for reading.