Extracting text, image, and table information from PDFs with PyMuPDF

Table of Contents

Preface

I. Install PyMuPDF

II. Using PyMuPDF

1. Import the library

2. Read the PDF

3. Extract information from the PDF

Summary


Preface

My work requires extracting information from PDF files, so I looked into PyMuPDF. The official documentation describes MuPDF, the library it wraps, as a "lightweight PDF, XPS and e-book viewer". In my experience so far the library is convenient to use and powerful, and it covers most of what my current work needs.


I. Install PyMuPDF

pip install PyMuPDF

II. Using PyMuPDF

1. Import the library

The code is as follows (example):

import fitz

2. Read the PDF

The code is as follows (example):

# Open the PDF file
doc = fitz.open(pdf_file)

Here pdf_file is the path to the PDF; the returned document object can then be iterated page by page.
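
Before opening a file it can be worth checking that it exists and at least looks like a PDF. The following is a minimal sketch using only the standard library; the helper name looks_like_pdf is hypothetical, and the check is deliberately rough: it only tests for the "%PDF-" header at the very start of the file, whereas some readers tolerate a few leading bytes before it.

```python
import os
import tempfile

def looks_like_pdf(path):
    """Return True if path is a file that starts with the PDF header "%PDF-"."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"

# Usage with two throwaway files (hypothetical content, not real PDFs)
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pdf")
    bad = os.path.join(tmp, "bad.pdf")
    with open(good, "wb") as f:
        f.write(b"%PDF-1.7\n")
    with open(bad, "wb") as f:
        f.write(b"not a pdf")
    good_result = looks_like_pdf(good)
    bad_result = looks_like_pdf(bad)
    missing_result = looks_like_pdf(os.path.join(tmp, "missing.pdf"))
print(good_result, bad_result, missing_result)
```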

3. Extract information from the PDF

  • Extract text: the text of each word and its position on the page

The code is as follows (example):

for page in doc:
    # Older PyMuPDF releases: page.getTextWords()
    words = page.get_text("words")
    for w in words:
        # Position: fitz.Rect(w[:4])
        # Text: w[4]
        print(fitz.Rect(w[:4]), w[4])
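
Each entry returned by the word extraction is a tuple of the form (x0, y0, x1, y1, word, block_no, line_no, word_no), so the words can be regrouped into lines of text. The following is a minimal sketch under that assumption; the helper name group_words_into_lines and the sample tuples are hypothetical, made up for illustration rather than taken from a real PDF.

```python
from collections import defaultdict

def group_words_into_lines(words):
    """Group PyMuPDF-style word tuples into lines of text.

    Each tuple is assumed to be
    (x0, y0, x1, y1, word, block_no, line_no, word_no).
    """
    lines = defaultdict(list)
    for w in words:
        # (block_no, line_no) identifies the line a word belongs to
        lines[(w[5], w[6])].append(w)
    result = []
    for key in sorted(lines):
        # Order the words inside a line by their word number
        ordered = sorted(lines[key], key=lambda w: w[7])
        result.append(" ".join(w[4] for w in ordered))
    return result

# Hypothetical sample data
sample_words = [
    (72.0, 72.0, 100.0, 84.0, "Hello", 0, 0, 0),
    (104.0, 72.0, 140.0, 84.0, "world", 0, 0, 1),
    (72.0, 90.0, 110.0, 102.0, "Second", 0, 1, 0),
    (114.0, 90.0, 140.0, 102.0, "line", 0, 1, 1),
]
for line in group_words_into_lines(sample_words):
    print(line)
```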

  • Extract images: the position of each image and its data as a numpy array

The code is as follows (example):

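for page in doc:
    # Older PyMuPDF releases: page.getImageList()
    img_list = page.get_images(full=True)
    for i, img in enumerate(img_list):
        # Position of the image on the page. img[:4] is
        # (xref, smask, width, height), not a rectangle,
        # so ask the page for the bounding box instead.
        print(page.get_image_bbox(img))

        # PyMuPDF can save the image directly
        pix = fitz.Pixmap(doc, img[0])
        if pix.n - pix.alpha >= 4:
            # CMYK cannot be written as PNG; convert to RGB first
            pix = fitz.Pixmap(fitz.csRGB, pix)
        # The ./图片 ("images") directory must already exist
        save_name = "./图片/page_{}_{}.png".format(page.number, i)
        pix.save(save_name)  # older releases: pix.writePNG(save_name)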
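
To make the entries of the image list easier to read, they can be wrapped in a named tuple. The sketch below assumes the field order given in the PyMuPDF documentation (xref, smask, width, height, bpc, colorspace, alternate colorspace, name, filter); the names ImageInfo and describe_image and the sample entry are hypothetical. Note that the first four fields give the image's reference number, mask, and size in pixels, not its position on the page.

```python
from collections import namedtuple

# Field names for the first nine items of an image-list entry
# (assumed order, as described in the PyMuPDF documentation)
ImageInfo = namedtuple(
    "ImageInfo",
    "xref smask width height bpc colorspace alt_colorspace name filter",
)

def describe_image(entry):
    """Wrap the first nine items of an image-list entry in a named tuple."""
    return ImageInfo(*entry[:9])

# Hypothetical entry, not taken from a real PDF
sample_entry = (12, 0, 640, 480, 8, "DeviceRGB", "", "Im1", "DCTDecode")
info = describe_image(sample_entry)
print(info.xref, info.width, info.height)
```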

If you would rather read the image data from the PDF straight into a numpy array, you can go through the PIL library, which is assumed to be installed already. The example also uses OpenCV (cv2) and numpy.

The code is as follows (example):

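import cv2
import numpy as np
from PIL import Image

def pixmap2array(pix):
    '''Convert a Pixmap into a numpy array (BGR channel order)'''
    # Pick the PIL mode from the colorspace and the alpha channel
    cspace = pix.colorspace
    if cspace is None:
        mode = "L"
    elif cspace.n == 1:
        mode = "L" if pix.alpha == 0 else "LA"
    elif cspace.n == 3:
        mode = "RGB" if pix.alpha == 0 else "RGBA"
    else:
        mode = "CMYK"

    # Convert the raw bytes into a PIL image
    img = Image.frombytes(mode, (pix.width, pix.height), pix.samples)
    # COLOR_RGB2BGR needs a 3-channel input, so normalize every mode to RGB first
    img = img.convert("RGB")
    # Convert the PIL image to a numpy array and swap RGB to BGR
    img = cv2.cvtColor(np.asarray(img), cv2.COLOR_RGB2BGR)

    return img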
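
To see what the conversion works with, it helps to know how pix.samples is laid out: the bytes run row by row, with n bytes per pixel and no padding between rows (so one row is width * n bytes long). The following is a minimal sketch of indexing into such a buffer by hand; the helper pixel_at and the 2x2 sample image are hypothetical.

```python
def pixel_at(samples, width, n, x, y):
    """Return the n component values of the pixel at column x, row y.

    samples is assumed to be laid out like pix.samples: row by row,
    with n bytes per pixel and no padding between rows.
    """
    offset = (y * width + x) * n
    return tuple(samples[offset:offset + n])

# Hypothetical 2x2 RGB image: red, green / blue, white
samples = bytes([255, 0, 0,   0, 255, 0,
                 0, 0, 255,   255, 255, 255])
print(pixel_at(samples, width=2, n=3, x=1, y=0))
print(pixel_at(samples, width=2, n=3, x=0, y=1))
```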

  • Extract tables: convert each table into a two-dimensional array and save it as an xlsx file

PyMuPDF, at least in the version used here, cannot extract tables directly; you would have to extract the ruling lines and rebuild the table yourself. So pdfplumber is used for this step instead (the example also uses openpyxl to write the xlsx file). To install pdfplumber:

pip install pdfplumber

The code is as follows (example):

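import pdfplumber
from openpyxl import Workbook

def analysis_table(pdf_file):
    # Create the workbook
    workbook = Workbook()
    sheet = workbook.active

    # Open the PDF
    with pdfplumber.open(pdf_file) as pdf:
        # Iterate over every page
        for page in pdf.pages:
            # Extract the table, inferring rows and columns from text alignment
            table = page.extract_table(table_settings={
                "vertical_strategy": "text",
                "horizontal_strategy": "text"})
            if table is None:
                # No table was found on this page
                continue
            print(table)

            # Append each row to the worksheet
            for row in table:
                print(row)
                sheet.append(row)

    workbook.save(filename="2.xlsx")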
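
The extracted table is a list of rows whose cells are strings or None, and the result is None when no table is found on a page, so a small cleanup step before writing the rows can be useful. The following is a minimal sketch of such a step; the helper name clean_table and the sample table are hypothetical and not part of pdfplumber.

```python
def clean_table(table):
    """Replace None cells with "", strip whitespace, and drop empty rows."""
    if table is None:  # no table was found on the page
        return []
    cleaned = []
    for row in table:
        cells = ["" if cell is None else str(cell).strip() for cell in row]
        if any(cells):  # keep the row only if some cell has content
            cleaned.append(cells)
    return cleaned

# Hypothetical extraction result
sample_table = [
    ["Name", " Qty ", None],
    [None, None, None],
    ["Apple", "3", "fresh\n"],
]
for row in clean_table(sample_table):
    print(row)
```

Each cleaned row can then be passed to sheet.append(row) as in the example above.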

Summary

The complete code is as follows (example):

import os

import fitz
import pdfplumber
from openpyxl import Workbook
from PIL import Image
import cv2
import numpy as np

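def analysis_table(pdf_file):

    # Create the workbook
    workbook = Workbook()
    sheet = workbook.active

    # Open the PDF
    with pdfplumber.open(pdf_file) as pdf:
        # Iterate over every page
        for page in pdf.pages:
            # Extract the table, inferring rows and columns from text alignment
            table = page.extract_table(table_settings={
                "vertical_strategy": "text",
                "horizontal_strategy": "text"})
            if table is None:
                # No table was found on this page
                continue
            print(table)

            # Append each row to the worksheet
            for row in table:
                print(row)
                sheet.append(row)

    workbook.save(filename="2.xlsx")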

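def pixmap2array(pix):
    '''Convert a Pixmap into a numpy array (BGR channel order)'''
    # Pick the PIL mode from the colorspace and the alpha channel
    cspace = pix.colorspace
    if cspace is None:
        mode = "L"
    elif cspace.n == 1:
        mode = "L" if pix.alpha == 0 else "LA"
    elif cspace.n == 3:
        mode = "RGB" if pix.alpha == 0 else "RGBA"
    else:
        mode = "CMYK"

    # Convert the raw bytes into a PIL image
    img = Image.frombytes(mode, (pix.width, pix.height), pix.samples)
    # COLOR_RGB2BGR needs a 3-channel input, so normalize every mode to RGB first
    img = img.convert("RGB")
    # Convert the PIL image to a numpy array and swap RGB to BGR
    img = cv2.cvtColor(np.asarray(img), cv2.COLOR_RGB2BGR)

    return img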

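def ananlysis_PDF(pdf_file):
    '''Parse the PDF: print words and image boxes, and save the images'''

    # Make sure the PDF exists
    if not os.path.exists(pdf_file):
        print("PDF file does not exist")
        return

    # Open the PDF file
    doc = fitz.open(pdf_file)

    # Walk through the pages and extract information
    for page in doc:
        # Older PyMuPDF releases: page.getTextWords()
        words = page.get_text("words")
        for w in words:
            print(fitz.Rect(w[:4]), w[4])

        # Older PyMuPDF releases: page.getImageList()
        img_list = page.get_images(full=True)
        for i, img in enumerate(img_list):
            # img[:4] is (xref, smask, width, height), so ask the page for the position
            print(page.get_image_bbox(img))
            pix = fitz.Pixmap(doc, img[0])
            if pix.n - pix.alpha >= 4:
                # CMYK cannot be written as PNG; convert to RGB first
                pix = fitz.Pixmap(fitz.csRGB, pix)
            # The ./图片 ("images") directory must already exist
            save_name = "./图片/page_{}_{}.png".format(page.number, i)
            pix.save(save_name)  # older releases: pix.writePNG(save_name)
            image = pixmap2array(pix)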
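
The complete code writes its PNG files into a fixed directory and will fail if that directory is missing. The following is a minimal sketch of a helper that creates the directory on demand and builds the same page_{}_{}.png file name; the helper name image_save_path is hypothetical, and the temporary directory only stands in for a real output folder.

```python
import os
import tempfile

def image_save_path(out_dir, page_number, index):
    """Create out_dir if needed and return the PNG path for one image."""
    os.makedirs(out_dir, exist_ok=True)
    return os.path.join(out_dir, "page_{}_{}.png".format(page_number, index))

# Usage with a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "images")
    path = image_save_path(out_dir, 0, 2)
    dir_created = os.path.isdir(out_dir)
print(os.path.basename(path), dir_created)
```

In the loop above, the result of image_save_path("./图片", page.number, i) could be used as save_name.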

 


Origin blog.csdn.net/wxplol/article/details/109304946