Using fitz to split PDF pages into left, middle, and right columns, extract the text, and save it to Excel

Introduction

fitz official website
fitz is the module name under which the PyMuPDF library is imported (import fitz). PyMuPDF is a Python binding for MuPDF, a library for rendering and processing PDF files, so fitz is not a separate wrapper around PyMuPDF: it is PyMuPDF itself.

fitz provides many functions including opening, creating, editing and saving PDF files. Here are some key features of fitz:

Open and read PDF files: Use the fitz.open method to open a PDF file; it returns a fitz.Document object. Through this object you can access and manipulate the pages, bookmarks (table of contents), metadata, and other information.

Page operation: You can get the number of PDF pages through the fitz.Document object, and you can get a single page according to the index or page number. You can crop, rotate, scale, and extract pages.

Text extraction: Use the page's get_text method to extract its text content. To extract only the text inside a given area, pass a rectangle through the clip parameter.

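Image extraction: Use the get_pixmap method to render a page as a raster image (a Pixmap), which you can save to a file or process further. Images embedded in the page are listed with the page's get_images method and extracted with the document's extract_image method.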

Create and edit PDF: fitz allows you to create a new blank PDF and add pages, bookmarks, and annotations to it. You can also edit existing PDFs, for example by deleting, replacing, and merging pages.

Advanced features: fitz also provides some advanced features, such as searching text, adding callouts and links, extracting embedded font information, etc.

Overall, fitz is a powerful PDF processing library that can help you perform various operations on PDFs in Python. You can use fitz to read, create, edit and extract the content of PDF files according to your needs.

The difference between fitz and pdfplumber

Reference: ChatGPT
fitz and pdfplumber are commonly used Python libraries for extracting PDF content, and they have some differences in usage and functions.

Underlying libraries: fitz is the import name of PyMuPDF, which is built on MuPDF; pdfplumber is built on pdfminer.six.

Richness of functions: fitz provides more advanced functions, such as page cropping, rendering, text search, and annotation. It can also handle images and vector objects. pdfplumber focuses on extracting text and tables, and exposes the position and size of the characters, lines, and rectangles on each page.

Text extraction accuracy: Due to the complexity of PDF files, different libraries may have different text extraction accuracy for some special cases. In general, both fitz and pdfplumber have relatively high accuracy in text extraction.

Ease of use: The API of pdfplumber is more concise and easier to get started with. fitz (PyMuPDF) exposes a much larger API, so it can be somewhat more complicated to use.

To sum up, if your main requirement is to extract text content, pdfplumber may be the better fit. If you need more advanced features or more complex processing, such as page cropping or image manipulation, fitz is the better choice.

Code

Watch out for PDF permission restrictions. The approach works well here because the PDF is typeset line by line rather than in blocks, so the extracted text has a fairly regular format.

import fitz  # PyMuPDF
import pandas as pd


def clip_pdf(filepath):
    # Open the source PDF and keep only 0-based pages 6..514
    # (the range used for the author's file; adjust for your document).
    src = fitz.open(filepath)
    src.select(list(range(6, 515)))
    doc = fitz.open()  # new, empty PDF that receives the cropped columns

    # Column boundaries are given relative to a 700-unit-wide page.
    pdf_width = 700
    # (left edge, right edge, slice of extracted lines to keep) per column
    columns = [
        (40, 240, slice(None, -2)),   # left
        (240, 440, slice(1, -1)),     # middle
        (440, 640, slice(None, -2)),  # right
    ]

    page_dfs = []
    for spage in src:  # for each page in the input
        r = spage.rect  # input page rectangle
        scale = r[2] / pdf_width
        for table_index, (left, right, keep) in enumerate(columns):
            # Clip rectangle: starts 85 units from the top, runs to the bottom
            rx = fitz.Rect(left * scale, 85 * scale, right * scale, r[3])
            # Create a new page and draw the clipped column onto it
            page = doc.new_page(-1, width=200 * scale, height=r[3])
            page.show_pdf_page(page.rect, src, spage.number, clip=rx)
            df = pd.DataFrame(page.get_text().split('\n')[keep], columns=['info'])
            df['table_index'] = table_index
            df['index'] = df.index
            df['page'] = spage.number
            page_dfs.append(df)

    total_df = pd.concat(page_dfs, axis=0)
    # Save the output files
    total_df.to_excel('output.xlsx')
    doc.save("output.pdf", garbage=3, deflate=True)

Origin blog.csdn.net/weixin_38235865/article/details/131401065