[Python] Convert PDF to Word in just 2 lines of code (with a worked example)

1. Preparation

pdf2docx is a Python library that converts PDF files to docx files. It extracts the data in the PDF file with the PyMuPDF library, parses the layout, paragraphs, images, tables, etc. of the content with rules, and finally generates the docx file automatically with the python-docx library.

Step 1: install the PyMuPDF package:

pip install PyMuPDF

Step 2: install the python-docx package:

pip install python-docx

Step 3: install the pdf2docx package:

pip install pdf2docx

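To confirm that the three installs above succeeded, the installed versions can be queried from Python. This is a minimal sketch using only the standard library; the helper name installed_versions is made up for illustration and is not part of pdf2docx.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing.

    Hypothetical helper for this article, not a pdf2docx function.
    """
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    # The three packages installed in steps 1 to 3.
    for name, ver in installed_versions(["PyMuPDF", "python-docx", "pdf2docx"]).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED needs its pip install command run again.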

2. pdf2docx features

  • Parse and create page layouts

(1) Margins

(2) Sections and columns (currently at most two-column layouts are supported)

(3) Header and footer [TODO]

  • Parse and create paragraphs

(1) OCR text [TODO]

(2) Horizontal (left to right) or vertical (bottom to top) text

(3) Font styles such as font name, font size, bold/italic, and color

(4) Text styles such as highlight, underline and strikethrough

(5) List styles [TODO]

(6) External hyperlinks

(7) Paragraph horizontal alignment (left/right/center/justified) and spacing before and after paragraphs

  • Parse and create images

(1) Inline images

(2) Grayscale/RGB/CMYK and other color space images

(3) Images with a transparency channel

(4) Floating images (placed behind the text)

  • Parse and create tables

(1) Border styles such as width and color

(2) Cell background color

(3) Merged cells

(4) Vertical text in cells

(5) Tables with partially hidden borders

(6) Nested tables

  • Multi-process conversion

pdf2docx also parses table content and style, so it can be used as a table content extraction tool as well.
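Table extraction can be sketched as follows. To my understanding, the Converter class in pdf2docx provides an extract_tables method that returns each table as a list of rows, with None in place of cells covered by a merge. The helper names clean_table and extract_clean_tables are made up for this article, and the cleanup (None becomes an empty string, text is stripped) is only one possible choice.

```python
import sys


def clean_table(table):
    """Replace None cells with '' and strip whitespace from cell text.

    Hypothetical helper: assumes a table is a list of rows of str-or-None cells.
    """
    return [["" if cell is None else str(cell).strip() for cell in row]
            for row in table]


def extract_clean_tables(pdf_file, start=0, end=None):
    """Extract tables from the given page range and tidy up their cells."""
    # Imported here so the helper above can be used without pdf2docx installed.
    from pdf2docx import Converter

    cv = Converter(pdf_file)
    try:
        tables = cv.extract_tables(start=start, end=end)
    finally:
        cv.close()
    return [clean_table(table) for table in tables]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python extract_tables.py ./ResNet.pdf
    for index, table in enumerate(extract_clean_tables(sys.argv[1])):
        print(f"Table {index}:")
        for row in table:
            print("  ", row)
```

The rows come back as plain Python lists, so they can be passed straight to the csv module or a DataFrame if needed.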

3. Limitations

  • Text recognition for scanned PDFs is not currently supported
  • Only languages written from left to right are supported (so Arabic, for example, is not supported)
  • Rotated text is not supported
  • Parsing is rule-based, so 100% restoration of the PDF's styles cannot be guaranteed
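Because scanned PDFs are not supported, it can help to check whether a file has an extractable text layer before converting it. The following is a rough sketch that reads the page text with PyMuPDF (fitz.open and page.get_text). The helper names and the 20-character threshold are assumptions made for illustration, not part of pdf2docx, and a PDF that passes this check can still convert poorly.

```python
import sys


def has_text_layer(page_texts, min_chars=20):
    """Return True if the pages hold at least min_chars non-whitespace characters.

    Hypothetical heuristic; min_chars=20 is an arbitrary threshold.
    """
    total = sum(len("".join(text.split())) for text in page_texts)
    return total >= min_chars


def pdf_page_texts(pdf_file):
    """Return the plain text of every page, as extracted by PyMuPDF."""
    # Imported here so has_text_layer can be used without PyMuPDF installed.
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_file)
    try:
        return [page.get_text() for page in doc]
    finally:
        doc.close()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_text.py ./ResNet.pdf
    if has_text_layer(pdf_page_texts(sys.argv[1])):
        print("Text layer found; pdf2docx should be able to read the text.")
    else:
        print("Little or no text found; this may be a scanned PDF.")
```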

4. Example

Suppose we have a PDF file named ResNet.pdf. The code is as follows:

from pdf2docx import parse
pdf_file = './ResNet.pdf'
docx_file = './resnet.docx'
# convert pdf to docx
parse(pdf_file, docx_file)

Running this code writes the converted document to resnet.docx.

The result is acceptable, but some content may be lost in the conversion!
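When only part of a document is needed, or the file is long, the Converter class can be used instead of parse. The sketch below assumes, based on my reading of the pdf2docx documentation, that Converter.convert accepts zero-based start and end page indexes (end excluded) and a multi_processing flag. The function names convert_pages and default_docx_path are made up for this article.

```python
import sys
from pathlib import Path


def default_docx_path(pdf_file):
    """Return the PDF path with its suffix replaced by .docx (hypothetical helper)."""
    return Path(pdf_file).with_suffix(".docx")


def convert_pages(pdf_file, docx_file=None, start=0, end=None, multi_processing=False):
    """Convert pages [start, end) of pdf_file to docx; end=None means the last page."""
    # Imported here so default_docx_path can be used without pdf2docx installed.
    from pdf2docx import Converter

    docx_file = str(docx_file or default_docx_path(pdf_file))
    cv = Converter(str(pdf_file))
    try:
        cv.convert(docx_file, start=start, end=end,
                   multi_processing=multi_processing)
    finally:
        cv.close()
    return docx_file


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python convert_pages.py ./ResNet.pdf
    # Converts only the first two pages, using several processes.
    print("Wrote", convert_pages(sys.argv[1], end=2, multi_processing=True))
```

Keeping the call under the __main__ guard matters when multi_processing is enabled, because Python's multiprocessing may re-import the script in worker processes on some platforms.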

Origin blog.csdn.net/wzk4869/article/details/130521287