[Python basics] Converting a PDF to images

1. Background

Recently, a colleague was issued a PDF drawing that needed to be cropped, and wrote the PDF cropping code with the help of ChatGPT.

ChatGPT is indeed powerful. By chatting with it repeatedly and going through a lot of trial and error, my colleague eventually finished the program.

It was a remarkable experience, but after trying this kind of generative model it is clear that, contrary to what the media advertise, it will not replace programmers in the short term.

When the PDF produced by the ChatGPT-written code was opened in a PDF editor, it turned out that the PDF had not actually been cropped. I suggested a different approach to my colleague: first convert the PDF into a picture, crop the picture, and then convert the picture back into a PDF.
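
The last step of that idea, turning the cropped picture back into a PDF, is not covered by the code in section 3. Below is a minimal sketch of it, assuming PyMuPDF is installed and the picture was rendered at a known dpi. The function names pixels_to_points and image_to_pdf, the default of 288 dpi, and the file names are made up for illustration and are not part of my colleague's program.

```python
def pixels_to_points(pixels, dpi):
    """Convert a length in pixels to PDF points (1 point = 1/72 inch)."""
    return pixels * 72.0 / dpi


def image_to_pdf(image_path, pdf_path, dpi=288):
    """Place one image on a single PDF page sized to match the image."""
    import fitz  # PyMuPDF; imported here so pixels_to_points works without it

    pix = fitz.Pixmap(image_path)              # read the image to get its pixel size
    width = pixels_to_points(pix.width, dpi)   # page width in points
    height = pixels_to_points(pix.height, dpi) # page height in points

    doc = fitz.open()                          # new, empty PDF
    page = doc.new_page(width=width, height=height)
    page.insert_image(page.rect, filename=image_path)  # fill the whole page
    doc.save(pdf_path)
    doc.close()
```

A call such as image_to_pdf("cropped.png", "cropped.pdf", dpi=288) would then write a one-page PDF. The value 288 corresponds to a zoom factor of 4 applied to the 72 points per inch that PDF uses.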

2. About PyMuPDF and fitz

Converting a PDF to images uses the fitz module, which is the name under which the PyMuPDF package is imported.
For the import fitz statement in the code to work, the third-party package PyMuPDF must be installed in advance, for example from the Tsinghua mirror:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple/  PyMuPDF==1.18.14
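
To confirm that the installation succeeded, the installed version can be queried from Python. This is a small sketch using only the standard library (importlib.metadata, available from Python 3.8); the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package="PyMuPDF"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("PyMuPDF is not installed")
else:
    print("PyMuPDF version:", version)
```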

PyMuPDF is a Python binding for MuPDF.
MuPDF is a lightweight PDF, XPS and eBook viewer. MuPDF consists of software libraries, command line tools, and viewers for various platforms.

A note on the name fitz:
Fitz started out as an R&D project to replace the aging Ghostscript graphics library, but instead became MuPDF's rendering engine.

For more PDF operations, please refer to the API documentation of the third-party PyMuPDF package.

3. Code for converting a PDF to images

Sample code for converting a PDF to PNG images with Python 3.

# -*- coding: utf-8 -*-
# pip install -i https://pypi.tuna.tsinghua.edu.cn/simple/  PyMuPDF==1.18.14
# jn10010537

import fitz

import glob
import datetime
import os

def pyMuPDF_fitz(pdfPath, imagePath, count):
    pdfDoc = fitz.open(pdfPath)    # open the document
    pageCount = pdfDoc.page_count  # number of pages in the PDF
    for i in range(pageCount):     # process the PDF page by page
        page = pdfDoc[i]

        # Set the zoom factors and the rotation angle
        # Using the same value for zoom_x and zoom_y scales proportionally

        rotate = int(0)
        zoom_x = 4
        zoom_y = 4

        # The matrix parameter controls the resolution of the output image
        matrix = fitz.Matrix(zoom_x, zoom_y).preRotate(rotate)
        print("matrix:",matrix)

        rect = page.rect                               # e.g. Rect(0.0, 0.0, 882.17138671875, 635.3142700195312)
        clip = fitz.Rect(rect.tl + 15, rect.br - 13)   # clip area, e.g. Rect(15.0, 15.0, 869.17138671875, 622.3142700195312)

        # Render the clipped page area to a raster image (pixmap)
        pix = page.getPixmap(matrix=matrix, alpha=False, clip=clip)

        if not os.path.exists(imagePath):  # check whether the image folder exists
            os.makedirs(imagePath)         # create the image folder if it does not

        # Set the resolution of the jpg/tif file; the program default is 96
        image_dpi = 300
        pix.setResolution(image_dpi, image_dpi)

        # Write the PNG file
        pix.writePNG(imagePath + '/' + 'pdf2images_%s_%s.png' % (count, i))

if __name__ == "__main__":
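    # 1. Folder containing the PDF files
    pdfPath = r"backup"
    # 2. Folder in which to store the images
    imagePath = r"new"
    # 3. Numeric suffix for the output file names
    count = 1
    # 4. Collect the PDF files (os.path.join keeps the pattern portable)
    files = glob.glob(os.path.join(pdfPath, '*.pdf'))
    # 5. Call the conversion function for each file
    for file in files:
        pyMuPDF_fitz(file, imagePath, count)
        count = count + 1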
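
Two numbers in the listing are easy to misread: the zoom factor and the clip margins. PDF coordinates are measured in points (1/72 inch), so a zoom of 4 renders at 4 x 72 = 288 pixels per inch, and the clip rectangle moves the top-left corner in by 15 points and the bottom-right corner in by 13 points. The sketch below reproduces that arithmetic in plain Python without PyMuPDF; the helper names are hypothetical, and the pixel size is only an approximation of what the renderer produces after its own rounding.

```python
def zoom_to_dpi(zoom):
    """Rendering resolution for a zoom factor (the PDF base is 72 dpi)."""
    return zoom * 72


def clip_box(page_width, page_height, top_left_margin, bottom_right_margin):
    """Clip rectangle (x0, y0, x1, y1) in points for a page whose origin is (0, 0)."""
    return (
        top_left_margin,
        top_left_margin,
        page_width - bottom_right_margin,
        page_height - bottom_right_margin,
    )


def rendered_size(box, zoom):
    """Approximate pixel size of the image rendered from a clip box."""
    x0, y0, x1, y1 = box
    return round((x1 - x0) * zoom), round((y1 - y0) * zoom)


# Page size taken from the example comment in the listing.
box = clip_box(882.17138671875, 635.3142700195312, 15, 13)
print(zoom_to_dpi(4))          # 288
print(box)
print(rendered_size(box, 4))
```

As far as I can tell, the value 300 passed to setResolution only changes the dpi recorded in the image file; the number of pixels is determined by the zoom factor.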

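The listing pins PyMuPDF 1.18.14 and uses its camelCase method names. To my knowledge, later PyMuPDF releases deprecate those names in favor of snake_case ones such as get_pixmap, set_dpi and save, so here is a sketch of the same conversion written with the newer names. It is a variant under those assumptions, not a drop-in replacement: the function names find_pdfs, output_name and pdf_to_png, and the pathlib-based file handling, are illustrative choices.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the PDF files in a folder, sorted by name."""
    return sorted(Path(folder).glob("*.pdf"))


def output_name(image_dir, count, page_index):
    """Build the PNG path using the same naming scheme as the listing."""
    return Path(image_dir) / f"pdf2images_{count}_{page_index}.png"


def pdf_to_png(pdf_path, image_dir, count, zoom=4, dpi=300):
    """Render every page of one PDF to a cropped PNG (newer PyMuPDF names)."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    Path(image_dir).mkdir(parents=True, exist_ok=True)  # create folder once
    doc = fitz.open(pdf_path)
    matrix = fitz.Matrix(zoom, zoom)            # equal x/y zoom, no rotation
    for i, page in enumerate(doc):
        rect = page.rect
        clip = fitz.Rect(rect.tl + 15, rect.br - 13)    # same margins as the listing
        pix = page.get_pixmap(matrix=matrix, alpha=False, clip=clip)
        pix.set_dpi(dpi, dpi)                   # dpi value recorded in the file
        pix.save(str(output_name(image_dir, count, i)))
    doc.close()
```

The main loop then becomes a loop over enumerate(find_pdfs("backup"), start=1), calling pdf_to_png(str(pdf), "new", n) for each file. Creating the output folder once per file, instead of once per page, is a small simplification over the original.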

Origin blog.csdn.net/jn10010537/article/details/130757138