Extracting images from a PDF with Python's fitz library (PyMuPDF)

Foreword

Hello everyone, I am Kongkong star. In this article I will show how to extract the images in a PDF with Python's fitz library.

1. What is the fitz library?

fitz is the module name under which the PyMuPDF library is imported in Python (import fitz). It is a PDF processing library rather than a general-purpose image editing library: it can open PDF files, extract pages, text, and images, add annotations to pages, and render pages to images. The package that is published on PyPI under the name fitz is a different, unrelated project, as sections 2 and 3 show.

2. Install the fitz library

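pip install fitz

Note: this command installs a PyPI package that merely shares the name fitz. It is not PyMuPDF, as the version check in the next section shows, and it is not needed for the extraction code. Having it installed alongside PyMuPDF may break import fitz, so it is safer to skip this step or to uninstall the package afterwards.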

3. Check the fitz library version

pip show fitz

Name: fitz
Version: 0.0.1.dev2
Summary: Fitz: Workflow Mangement for neuroimaging data.
Home-page: http://github.com/kastman/fitz
Author: Erik Kastman
Author-email: [email protected]
License: BSD (3-clause)
Requires: configobj, configparser, httplib2, nibabel, nipype, numpy, pandas, pyxnat, scipy
Required-by:

4. What is the pymupdf library?

To use the fitz module for PDF work, you need to install the PyMuPDF library (package name pymupdf).

PyMuPDF is an open source PDF processing library for Python that can read, edit, create, and convert PDF documents. It is a Python binding for MuPDF, a lightweight open source document rendering engine that supports multiple platforms and multiple file formats.
PyMuPDF is fast and easy to use, and it suits automated and batch processing of PDF documents: extracting text and images, adding or modifying bookmarks and annotations, merging and splitting PDF files, and extracting individual pages. It can also render PDF pages to images, which is convenient for quick previews and thumbnail generation.
In short, PyMuPDF is a practical PDF processing library for Python that fits many scenarios, such as data processing, document processing, and office automation.

5. Install the pymupdf library

pip install pymupdf

6. View the version of the pymupdf library

pip show pymupdf

Name: PyMuPDF
Version: 1.22.3
Summary: Python bindings for the PDF toolkit and renderer MuPDF
Home-page: https://github.com/pymupdf/PyMuPDF
Author: Artifex
Author-email: [email protected]
License: GNU AFFERO GPL 3.0
Requires:
Required-by:

7. What is the relationship between fitz and pymupdf?

fitz is the name of the main module of the PyMuPDF library, and the one you import in code. You install the library with pip install pymupdf and then use it with import fitz. The fitz module provides the basic operations on PDF documents, such as opening, reading, editing, and saving.
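
Because the name fitz can refer to two different packages, it can help to check which distributions are actually installed before running the extraction code. The following is a minimal sketch that uses only the standard library; the helper name installed_version is a hypothetical one chosen for this example, and it reports the same version information that pip show prints in sections 3 and 6.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "PyMuPDF" is the distribution this article needs; "fitz" is the
    # unrelated PyPI package shown in section 3.
    for name in ("PyMuPDF", "fitz"):
        print(name, "->", installed_version(name) or "not installed")
```

If the output shows a version for fitz as well as for PyMuPDF, removing the unrelated package with pip uninstall fitz is a reasonable first step when import fitz fails.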

8. Extract the images from the PDF

1. Import library

import fitz

2. Define the pdf path

local = '/Users/kkstar/Downloads/'

3. Open the PDF file

pdf_doc = fitz.open(local+'demo_pic.pdf')

4. Traverse all pages

for pg in range(pdf_doc.page_count):
    page = pdf_doc[pg]

5. Get all images on the page

    image_list = page.get_images()

6. Iterate over all images

    for img in image_list:

7. Get the XREF number and image data of the image

        xref = img[0]
        pix = fitz.Pixmap(pdf_doc, xref)
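
Each entry returned by page.get_images() is a plain tuple, which is why the code above reads the XREF number with img[0]. As a sketch, assuming the field order given in the PyMuPDF documentation (xref, smask, width, height, bpc, colorspace, alt. colorspace, name, filter), the entry can be wrapped in a named tuple to make the other fields readable. ImageEntry and describe_image are hypothetical helper names, and the sample tuple is made up for illustration.

```python
from collections import namedtuple

# Field names follow the order documented for PyMuPDF's image lists;
# ImageEntry itself is a hypothetical helper, not part of PyMuPDF.
ImageEntry = namedtuple(
    "ImageEntry",
    "xref smask width height bpc colorspace alt_colorspace name filter",
)


def describe_image(img):
    """Wrap the first nine items of a get_images() entry in a named tuple."""
    return ImageEntry(*img[:9])


# Made-up sample entry for illustration only (not taken from a real PDF).
sample = (12, 0, 640, 480, 8, "DeviceRGB", "", "Im0", "DCTDecode")
entry = describe_image(sample)
print(entry.xref, entry.width, entry.height, entry.colorspace)
```

Slicing to the first nine items keeps the helper working if an entry carries extra trailing fields.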

8. If the image is in RGB color space, save it as a PNG file

        if str(fitz.csRGB) == str(pix.colorspace):
            img_path = local + f'image{pg+1}_{xref}.png'
            pix.save(img_path)
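
For reference, the steps above can be gathered into a single function. This is a sketch rather than the author's exact code: the names extract_images and image_filename are hypothetical, the document is opened in a with block so that it is closed automatically, and, unlike the original, images that are not RGB are not skipped. Instead, images with four or more color components (typically CMYK) are converted to RGB before saving, which is an assumption about what most readers want.

```python
import os


def image_filename(page_index, xref):
    """Build the same 'image<page>_<xref>.png' name used in the steps above."""
    return f"image{page_index + 1}_{xref}.png"


def extract_images(pdf_path, out_dir):
    """Save every image in the PDF as a PNG file and return the written paths."""
    import fitz  # PyMuPDF; imported here so image_filename works without it

    saved = []
    with fitz.open(pdf_path) as doc:
        for page_index in range(doc.page_count):
            for img in doc[page_index].get_images():
                xref = img[0]
                pix = fitz.Pixmap(doc, xref)
                # Assumption: convert CMYK-like images to RGB so PNG can hold them
                if pix.n - pix.alpha >= 4:
                    pix = fitz.Pixmap(fitz.csRGB, pix)
                path = os.path.join(out_dir, image_filename(page_index, xref))
                pix.save(path)
                saved.append(path)
    return saved
```

With the paths from this article, the call would be extract_images('/Users/kkstar/Downloads/demo_pic.pdf', '/Users/kkstar/Downloads/'). An image that appears on several pages is saved once per page, as in the original loop.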

Summary

[Image: the sample PDF]

[Image: the extracted images]

Origin blog.csdn.net/weixin_38093452/article/details/130950144