Quickly Convert PDF Files: Python and PyMuPDF Tutorial

  • The problem to solve

Uploading documents to Claude 2 for analysis is sometimes subject to a size limit, so a large PDF document has to be cut into several smaller ones. That is the reason for this article.

How to make a PDF of the size you want with Python and PyMuPDF?

PDF is a widely used file format that can be viewed and printed on any device. However, sometimes you may only need to view the first few pages in a PDF file rather than the entire file. In such cases, it may be useful to convert the PDF file to a new file containing only the specified number of pages. This article describes how to use Python and the PyMuPDF module to achieve this task.

  • Install the PyMuPDF module

Before using PyMuPDF, we need to install it. The GUI in this article also uses wxPython, so both packages can be installed with the following commands:

pip install PyMuPDF
pip install wxPython
  • Import the PyMuPDF and wxPython modules

Next, we need to import the PyMuPDF and wxPython modules:

import fitz
import wx
  • Create a GUI

To let users choose the PDF file and enter the number of pages to keep, we will create a simple GUI with the wxPython module. Here is a code example:

class PDFExtractorFrame(wx.Frame):
    def __init__(self, *args, **kw):
        super(PDFExtractorFrame, self).__init__(*args, **kw)

        panel = wx.Panel(self)
        vbox = wx.BoxSizer(wx.VERTICAL)

        self.file_picker = wx.FilePickerCtrl(panel, message="Select a PDF file", wildcard="PDF Files (*.pdf)|*.pdf",
                                            style=wx.FLP_DEFAULT_STYLE | wx.FLP_USE_TEXTCTRL)
        vbox.Add(self.file_picker, 0, wx.EXPAND | wx.ALL, 10)

        self.page_input = wx.TextCtrl(panel, value="1", style=wx.TE_PROCESS_ENTER)
        vbox.Add(self.page_input, 0, wx.EXPAND | wx.ALL, 10)

        extract_button = wx.Button(panel, label="Extract", size=(70, 30))
        extract_button.Bind(wx.EVT_BUTTON, self.on_extract)
        vbox.Add(extract_button, 0, wx.ALIGN_CENTER | wx.ALL, 10)

        panel.SetSizer(vbox)
        self.Bind(wx.EVT_TEXT_ENTER, self.on_extract, self.page_input)

This code defines a wx.Frame subclass called PDFExtractorFrame and builds the GUI elements in its constructor. It creates a wx.Panel and a vertical wx.BoxSizer that lays out three controls: a file picker, a text field, and a button. With this GUI, users can select a PDF file and enter the number of pages to keep. Both the button and the Enter key in the text field call the on_extract handler, which is shown in the full code.
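The text field returns a string, so the value has to be converted before it can be used as a page count. Here is a minimal sketch of that step. The helper name parse_page_count is hypothetical and not part of the original code; it raises ValueError for bad input so that the existing except ValueError branch in on_extract would still catch it.

```python
def parse_page_count(text):
    """Convert the text typed by the user into a positive page count."""
    # int() raises ValueError for input such as "abc", "3.5" or ""
    value = int(text.strip())
    if value < 1:
        raise ValueError("the page count must be at least 1")
    return value
```

In on_extract, the call int(self.page_input.GetValue()) could be replaced by parse_page_count(self.page_input.GetValue()) if zero and negative numbers should be rejected as well.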

  • Implement the conversion function

Next, we need to implement the conversion function. We will use the PyMuPDF module to open a PDF file and use it to copy a specified number of pages. Here is a code example:

    def extract_pages(self, input_pdf, page_number, output_pdf):
        # Open the source PDF document
        pdf_document = fitz.open(input_pdf)
        total_pages = pdf_document.page_count

        # Make sure the page count does not exceed the total number of pages
        page_number = min(page_number, total_pages)

        # Create a new PDF document that contains only the first page_number pages
        pdf_writer = fitz.open()
        for page in range(page_number):
            pdf_writer.insert_pdf(pdf_document, from_page=page, to_page=page)

        # Save the new PDF document to the given path
        pdf_writer.save(output_pdf)
        pdf_writer.close()
        pdf_document.close()

This code uses the PyMuPDF module's functions to convert a PDF file into a new PDF file containing only the first N pages. This function takes the source PDF file path, the number of pages to extract and the output path of the new PDF file as parameters, and returns None. The following is a detailed description of the function:

  • input_pdf: The path of the source PDF file.
  • page_number: The number of pages to keep, counted from the first page.
  • output_pdf: The output path of the new PDF file.

This function uses the fitz.open() function to open the input PDF file and reads its total number of pages from page_count. If the requested number of pages exceeds the total number of pages in the document, it is reduced to that total.

To build the new document, the function first creates an empty PDF document object by calling fitz.open() without arguments. It then copies the pages from the source PDF file one at a time with the insert_pdf() function, whose from_page and to_page arguments are zero-based, and appends each one to the new document object. Only the requested number of pages is copied.

Finally, the function uses the save() function to save the new PDF document to the specified output path, and uses the close() function to close all open PDF document objects to release resources.
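The same steps can also be written as a standalone function that copies the whole page range with one insert_pdf() call instead of a loop. This is only a sketch under some assumptions: the names first_pages_range and extract_first_pages are hypothetical, and the check for counts below 1 is an addition that the original code does not have.

```python
def first_pages_range(page_count, total_pages):
    """Return the zero-based, inclusive (from_page, to_page) pair for the first pages."""
    # Guard: a count below 1 would give to_page = -1, which is insert_pdf's
    # default value meaning "up to the last page", so reject it here.
    if page_count < 1:
        raise ValueError("the page count must be at least 1")
    if total_pages < 1:
        raise ValueError("the source document has no pages")
    return 0, min(page_count, total_pages) - 1


def extract_first_pages(input_pdf, page_count, output_pdf):
    """Write the first page_count pages of input_pdf to output_pdf."""
    import fitz  # PyMuPDF; imported here so first_pages_range works without it

    source = fitz.open(input_pdf)
    try:
        first, last = first_pages_range(page_count, source.page_count)
        target = fitz.open()  # new, empty PDF
        try:
            target.insert_pdf(source, from_page=first, to_page=last)
            target.save(output_pdf)
        finally:
            # Close the new document even if copying or saving fails
            target.close()
    finally:
        source.close()
    return last + 1  # number of pages actually written
```

The try/finally blocks make sure both documents are closed even when saving fails, which the original version does not guarantee.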

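  • Run the application

Save the full code below as a Python file and run it. Select a PDF file, enter the number of pages to keep, and click the button or press Enter. The result is written to output.pdf in the current working directory.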

  • Full code

import fitz  # PyMuPDF
import wx

class PDFExtractorApp(wx.App):
    def OnInit(self):
        self.frame = PDFExtractorFrame(None, title="PDF Page Extraction Tool")
        self.SetTopWindow(self.frame)
        self.frame.Show()
        return True

class PDFExtractorFrame(wx.Frame):
    def __init__(self, *args, **kw):
        super(PDFExtractorFrame, self).__init__(*args, **kw)

        panel = wx.Panel(self)
        vbox = wx.BoxSizer(wx.VERTICAL)

        self.file_picker = wx.FilePickerCtrl(panel, message="Select a PDF file", wildcard="PDF Files (*.pdf)|*.pdf",
                                            style=wx.FLP_DEFAULT_STYLE | wx.FLP_USE_TEXTCTRL)
        vbox.Add(self.file_picker, 0, wx.EXPAND | wx.ALL, 10)

        self.page_input = wx.TextCtrl(panel, value="1", style=wx.TE_PROCESS_ENTER)
        vbox.Add(self.page_input, 0, wx.EXPAND | wx.ALL, 10)

        extract_button = wx.Button(panel, label="Extract", size=(70, 30))
        extract_button.Bind(wx.EVT_BUTTON, self.on_extract)
        vbox.Add(extract_button, 0, wx.ALIGN_CENTER | wx.ALL, 10)

        panel.SetSizer(vbox)
        self.Bind(wx.EVT_TEXT_ENTER, self.on_extract, self.page_input)

    def on_extract(self, event):
        input_pdf = self.file_picker.GetPath()
        output_pdf = "output.pdf"
        try:
            page_number = int(self.page_input.GetValue())
            self.extract_pages(input_pdf, page_number, output_pdf)
            wx.MessageBox("PDF page extraction complete!", "Success", wx.OK | wx.ICON_INFORMATION)
        except ValueError:
            wx.MessageBox("Invalid page number input!", "Error", wx.OK | wx.ICON_ERROR)

    def extract_pages(self, input_pdf, page_number, output_pdf):
        # Open the source PDF document
        pdf_document = fitz.open(input_pdf)
        total_pages = pdf_document.page_count

        # Make sure the page count does not exceed the total number of pages
        page_number = min(page_number, total_pages)

        # Create a new PDF document that contains only the first page_number pages
        pdf_writer = fitz.open()
        for page in range(page_number):
            pdf_writer.insert_pdf(pdf_document, from_page=page, to_page=page)

        # Save the new PDF document to the given path
        pdf_writer.save(output_pdf)
        pdf_writer.close()
        pdf_document.close()

if __name__ == '__main__':
    app = PDFExtractorApp()
    app.MainLoop()

C:\pythoncode\new\copypdfsaveas.py
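The tool above keeps only the first pages of a document. For the original goal of cutting one PDF into several smaller documents, the same insert_pdf() call can be repeated over consecutive page ranges. The following is a rough sketch, not part of the original program: the function names and the _partN file naming scheme are assumptions made for illustration.

```python
from pathlib import Path


def page_ranges(total_pages, pages_per_part):
    """Return zero-based, inclusive (from_page, to_page) pairs covering the document."""
    if pages_per_part < 1:
        raise ValueError("pages_per_part must be at least 1")
    return [
        (start, min(start + pages_per_part, total_pages) - 1)
        for start in range(0, total_pages, pages_per_part)
    ]


def split_pdf(input_pdf, pages_per_part):
    """Write input_pdf as consecutive parts and return the paths that were written."""
    import fitz  # PyMuPDF; imported here so page_ranges works without it

    source_path = Path(input_pdf)
    written = []
    source = fitz.open(str(source_path))
    try:
        ranges = page_ranges(source.page_count, pages_per_part)
        for index, (first, last) in enumerate(ranges, start=1):
            # Hypothetical naming scheme: report.pdf -> report_part1.pdf, ...
            part_path = source_path.with_name(f"{source_path.stem}_part{index}.pdf")
            part = fitz.open()  # new, empty PDF
            try:
                part.insert_pdf(source, from_page=first, to_page=last)
                part.save(str(part_path))
            finally:
                part.close()
            written.append(part_path)
    finally:
        source.close()
    return written
```

For example, page_ranges(10, 4) gives [(0, 3), (4, 7), (8, 9)], so a 10-page file would be written as three parts. Note that the page count is only a rough stand-in for file size: a part with many large images may still exceed an upload limit, in which case a smaller pages_per_part value is needed.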

Origin blog.csdn.net/winniezhang/article/details/132034201