PDF text, picture and table information extraction based on pymupdf

Others 2021-03-28 20:03:08

Table of Contents

I. Install PyMuPDF

II. Using PyMuPDF

1. Import the library

2. Read the pdf

3. Extract pdf information

Preface

My work required extracting information from PDF files, so I looked into PyMuPDF. The official documentation describes it as a "lightweight PDF, XPS and e-book viewer". In my experience the library is convenient to use and powerful, and it basically meets my current needs.

I. Install PyMuPDF

pip install PyMuPDF

II. Using PyMuPDF

1. Import the library

The code is as follows (example):

import fitz

2. Read the pdf

The code is as follows (example):

# Open the PDF file (pdf_file is the path to the PDF)
doc = fitz.open(pdf_file)

This opens the PDF file and returns a document object whose pages can be iterated.
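Before passing a path to fitz.open, it can help to confirm that the file exists and starts with the PDF signature. The following is a minimal sketch using only the standard library. The helper name looks_like_pdf is hypothetical and not part of PyMuPDF, and the check is deliberately loose: it only inspects the first five bytes.

```python
from pathlib import Path

def looks_like_pdf(path):
    """Return True if `path` is an existing file that starts with the PDF signature."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        # PDF files begin with the header "%PDF-" followed by the version number
        return fh.read(5) == b"%PDF-"
```

A caller could skip the file or report an error when this returns False, instead of letting the open call fail later.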

3. Extract pdf information

Extract text information: the text of each word and its position

The code is as follows (example):

for page in doc:
    # Older PyMuPDF releases called this page.getTextWords()
    words = page.get_text("words")
    for w in words:
        # Position: fitz.Rect(w[:4])
        # Text: w[4]
        print(fitz.Rect(w[:4]), w[4])
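Each word tuple also carries block, line, and word numbers after the text, so the words can be regrouped into lines. The sketch below assumes the tuple layout (x0, y0, x1, y1, text, block_no, line_no, word_no) and works on plain tuples, so it runs without a PDF. The helper name group_words_into_lines is hypothetical, and the sample data is made up for illustration.

```python
from collections import defaultdict

def group_words_into_lines(words):
    """Group word tuples into lines; return a list of (bbox, text) pairs.

    Each word is assumed to be (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    lines = defaultdict(list)
    for w in words:
        lines[(w[5], w[6])].append(w)  # key: (block_no, line_no)

    result = []
    for key in sorted(lines):
        ws = sorted(lines[key], key=lambda w: w[7])  # order by word_no
        # The union of the word rectangles is the bounding box of the line
        bbox = (
            min(w[0] for w in ws),
            min(w[1] for w in ws),
            max(w[2] for w in ws),
            max(w[3] for w in ws),
        )
        result.append((bbox, " ".join(w[4] for w in ws)))
    return result

# Made-up sample data in the assumed layout
sample = [
    (10, 10, 40, 20, "Hello", 0, 0, 0),
    (45, 10, 80, 20, "world", 0, 0, 1),
    (10, 30, 50, 40, "Second", 0, 1, 0),
]
lines = group_words_into_lines(sample)
```

With real data, the list returned by the word extraction call above could be passed in directly in place of sample.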

Extract image information: the NumPy data and position of each image

The code is as follows (example):

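for page in doc:
    # Older PyMuPDF releases called this page.getImageList()
    img_list = page.get_images()
    for i, img in enumerate(img_list):
        xref = img[0]  # img is (xref, smask, width, height, ...), not a rectangle
        # Position of the image on the page (one Rect per placement)
        print(page.get_image_rects(xref))

        # PyMuPDF can save the image directly
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha >= 4:  # CMYK: convert to RGB before writing PNG
            pix = fitz.Pixmap(fitz.csRGB, pix)
        save_name = "./图片/page_{}_{}.png".format(page.number, i)
        pix.save(save_name)  # formerly pix.writePNG(); the directory must exist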
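Saving into a directory that does not exist will fail, so the output folder is best created up front. The sketch below uses only the standard library to build the same page_{page}_{index} file names as the loop above. The helper name image_output_path and its ext parameter are hypothetical additions for illustration.

```python
from pathlib import Path

def image_output_path(out_dir, page_number, index, ext="png"):
    """Create `out_dir` if needed and return a file path for one extracted image."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return folder / "page_{}_{}.{}".format(page_number, index, ext)
```

The returned path can be converted with str() and passed to the save call in place of the hard-coded save_name.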

To read the image data from the PDF and convert it to a NumPy array, use the PIL (Pillow) library together with OpenCV and NumPy. This article assumes they are already installed.

The code is as follows (example):

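from PIL import Image
import cv2
import numpy as np

def pixmap2array(pix):
    '''Convert a Pixmap to a NumPy array in BGR channel order.'''
    # Pick the PIL mode from the colorspace
    cspace = pix.colorspace
    if cspace is None:
        mode = "L"
    elif cspace.n == 1:
        mode = "L" if pix.alpha == 0 else "LA"
    elif cspace.n == 3:
        mode = "RGB" if pix.alpha == 0 else "RGBA"
    else:
        mode = "CMYK"

    # Build a PIL image from the raw bytes
    img = Image.frombytes(mode, (pix.width, pix.height), pix.samples)
    # Normalise to 3-channel RGB first: COLOR_RGB2BGR expects RGB input,
    # not gray or CMYK data
    img = img.convert("RGB")
    # Convert to NumPy and reorder the channels from RGB to BGR
    img = cv2.cvtColor(np.asarray(img), cv2.COLOR_RGB2BGR)

    return img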
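The mode selection in pixmap2array falls through to "CMYK" for every remaining case, including layouts that PIL has no mode for (for example CMYK with an alpha channel, which would have five bytes per pixel). The sketch below restates that mapping as a lookup table that raises on unknown combinations. The names pil_mode, MODE_TABLE, and BYTES_PER_PIXEL are hypothetical and not part of PyMuPDF or PIL.

```python
# (number of colour components, has alpha) -> PIL mode
MODE_TABLE = {
    (1, False): "L",
    (1, True): "LA",
    (3, False): "RGB",
    (3, True): "RGBA",
    (4, False): "CMYK",
}

# Bytes per pixel that Image.frombytes expects for each mode
BYTES_PER_PIXEL = {"L": 1, "LA": 2, "RGB": 3, "RGBA": 4, "CMYK": 4}

def pil_mode(color_components, has_alpha):
    """Map a pixmap's colour layout to a PIL mode string.

    color_components: None or 0 for no colorspace, 1 gray, 3 RGB, 4 CMYK.
    """
    if not color_components:
        return "L"  # no colorspace: treat as a single-channel mask
    key = (color_components, bool(has_alpha))
    if key not in MODE_TABLE:
        raise ValueError("no PIL mode for layout {}".format(key))
    return MODE_TABLE[key]
```

Comparing width * height * BYTES_PER_PIXEL[mode] with the length of the sample bytes before building the image gives an early, readable error when the layout does not match.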

Extract table information: convert the table to a two-dimensional array and save it as an .xlsx file

At the time of writing, PyMuPDF could not extract tables directly; the table would have to be reconstructed from the extracted lines. So pdfplumber is used here instead. To install pdfplumber:

pip install pdfplumber

The code is as follows (example):

import pdfplumber
from openpyxl import Workbook

def analysis_table(pdf_file):
    # Create the workbook
    workbook = Workbook()
    sheet = workbook.active

    # Open the PDF
    with pdfplumber.open(pdf_file) as pdf:
        # Loop over the pages
        for page in pdf.pages:
            # Extract the table
            table = page.extract_table(table_settings={
                "vertical_strategy": "text",
                "horizontal_strategy": "text"})
            print(table)
            if table is None:  # no table found on this page
                continue

            # Append each row to the worksheet
            for row in table:
                print(row)
                sheet.append(row)

    workbook.save(filename="2.xlsx")
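An extracted table is a list of rows, and individual cells may be None or padded with whitespace, so a small clean-up step before writing is often useful. The sketch below uses only the standard library and writes CSV text rather than .xlsx, which is a substitution made to keep the example dependency-free. The helper names clean_table and table_to_csv are hypothetical, and the sample rows are made up.

```python
import csv
import io

def clean_table(table):
    """Normalise a table: None becomes "", cells are stripped, empty rows dropped."""
    if not table:  # covers None (no table found) and an empty list
        return []
    cleaned = []
    for row in table:
        cells = ["" if cell is None else str(cell).strip() for cell in row]
        if any(cells):  # keep the row only if some cell has content
            cleaned.append(cells)
    return cleaned

def table_to_csv(table):
    """Serialise the cleaned table to CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(clean_table(table))
    return buf.getvalue()

# Made-up rows shaped like an extracted table
raw = [["Name", " Qty "], [None, None], ["Bolt", "4"]]
csv_text = table_to_csv(raw)
```

The cleaned rows could equally be appended to the worksheet in analysis_table instead of the raw ones.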

Summary

The complete code is as follows (example):

import os

import fitz
import pdfplumber
from openpyxl import Workbook
from PIL import Image
import cv2
import numpy as np

def analysis_table(pdf_file):

    # Create the workbook
    workbook = Workbook()
    sheet = workbook.active

    # Open the PDF
    with pdfplumber.open(pdf_file) as pdf:
        # Loop over the pages
        for page in pdf.pages:
            # Extract the table
            table = page.extract_table(table_settings={
                "vertical_strategy": "text",
                "horizontal_strategy": "text"})
            print(table)
            if table is None:  # no table found on this page
                continue

            # Append each row to the worksheet
            for row in table:
                print(row)
                sheet.append(row)

    workbook.save(filename="2.xlsx")

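def pixmap2array(pix):
    '''Convert a Pixmap to a NumPy array in BGR channel order.'''
    # Pick the PIL mode from the colorspace
    cspace = pix.colorspace
    if cspace is None:
        mode = "L"
    elif cspace.n == 1:
        mode = "L" if pix.alpha == 0 else "LA"
    elif cspace.n == 3:
        mode = "RGB" if pix.alpha == 0 else "RGBA"
    else:
        mode = "CMYK"

    # Build a PIL image from the raw bytes
    img = Image.frombytes(mode, (pix.width, pix.height), pix.samples)
    # Normalise to 3-channel RGB first: COLOR_RGB2BGR expects RGB input,
    # not gray or CMYK data
    img = img.convert("RGB")
    # Convert to NumPy and reorder the channels from RGB to BGR
    img = cv2.cvtColor(np.asarray(img), cv2.COLOR_RGB2BGR)

    return img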

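def analysis_PDF(pdf_file):
    '''Parse the PDF: print its words and save its images.'''

    # Check that the PDF exists
    if not os.path.exists(pdf_file):
        print("PDF file does not exist")
        return

    # Open the PDF file
    doc = fitz.open(pdf_file)

    # Loop over the pages and extract information
    for page in doc:
        words = page.get_text("words")
        for w in words:
            print(fitz.Rect(w[:4]), w[4])

        img_list = page.get_images()
        for i, img in enumerate(img_list):
            xref = img[0]  # img is (xref, smask, width, height, ...), not a rectangle
            print(page.get_image_rects(xref))
            pix = fitz.Pixmap(doc, xref)
            if pix.n - pix.alpha >= 4:  # CMYK: convert to RGB before writing PNG
                pix = fitz.Pixmap(fitz.csRGB, pix)
            save_name = "./图片/page_{}_{}.png".format(page.number, i)
            pix.save(save_name)  # the output directory must already exist
            image = pixmap2array(pix)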

Origin blog.csdn.net/wxplol/article/details/109304946
