1. Preparation
pdf2docx is a Python library that converts PDF files to docx files. It extracts the content of the PDF file with the PyMuPDF library, parses the layout, paragraphs, images, tables, etc. using rules, and finally generates the docx file automatically with the python-docx library.
Step 1: install the PyMuPDF package:
pip install PyMuPDF
Step 2: install the python-docx package:
pip install python-docx
Step 3: install the pdf2docx package:
pip install pdf2docx
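To confirm that the three packages were installed, a small check like the one below can help. It is only a sketch: it relies on the standard-library importlib.metadata module (Python 3.8 or later), and the names REQUIRED and installed_versions are made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used in the pip commands above.
REQUIRED = ("PyMuPDF", "python-docx", "pdf2docx")


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be installed again with the matching pip command.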
2. pdf2docx features
- Parse and create page layouts
(1) Margins
(2) Sections and columns (currently at most two-column layouts are supported)
(3) Header and footer [TODO]
- Parse and create paragraphs
(1) OCR text [TODO]
(2) Horizontal (left to right) or vertical (bottom to top) text
(3) Font styles such as font family, font size, bold/italic, and color
(4) Text styles such as highlight, underline and strikethrough
(5) List styles [TODO]
(6) External hyperlinks
(7) Paragraph horizontal alignment (left/right/center/justified) and spacing before and after paragraphs
- Parse and create images
(1) Inline images
(2) Grayscale/RGB/CMYK and other color space images
(3) Images with a transparency channel
(4) Floating images (placed behind the text)
- Parse and create tables
(1) Border styles such as width and color
(2) Cell background color
(3) Merged cells
(4) Vertical text in cells
(5) Tables with partially hidden borders
(6) Nested tables
- Support multi-process conversion
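Multi-process conversion can be sketched as follows. This assumes the Converter class and the multi_processing and cpu_count keyword arguments described in the pdf2docx documentation; check them against your installed version. The helper names build_convert_options and convert_with_multiprocessing are hypothetical.

```python
def build_convert_options(cpu_count=None):
    """Build keyword arguments for Converter.convert (assumed option names)."""
    options = {"multi_processing": True}
    if cpu_count is not None:
        options["cpu_count"] = cpu_count
    return options


def convert_with_multiprocessing(pdf_file, docx_file, cpu_count=None):
    """Convert a PDF to docx, spreading the pages over several processes."""
    from pdf2docx import Converter  # imported lazily so the helper above works alone

    cv = Converter(pdf_file)
    try:
        cv.convert(docx_file, **build_convert_options(cpu_count))
    finally:
        cv.close()


# Usage (keep the call under a __main__ guard, since child processes
# re-import the script on platforms that spawn new interpreters):
#
# if __name__ == "__main__":
#     convert_with_multiprocessing("./ResNet.pdf", "./resnet.docx", cpu_count=4)
```

Multiple processes mainly pay off for long documents; for a PDF of a few pages the start-up cost may outweigh the gain.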
Because pdf2docx also parses table content and styles, it can be used as a tool for extracting table content as well.
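Table extraction could look like the sketch below. It assumes that Converter.extract_tables returns a list of tables, each table being a list of rows of cell strings, with None for cells absorbed by a merge; verify this against your pdf2docx version. The helpers extract_tables and table_to_csv are names chosen for this illustration.

```python
import csv
import io


def extract_tables(pdf_file, start=0, end=None):
    """Return the tables found on pages [start, end) of the PDF."""
    from pdf2docx import Converter  # imported lazily

    cv = Converter(pdf_file)
    try:
        return cv.extract_tables(start=start, end=end)
    finally:
        cv.close()


def table_to_csv(table):
    """Render one extracted table (a list of rows) as CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Empty (None) cells become empty strings; stray whitespace is trimmed.
        writer.writerow("" if cell is None else str(cell).strip() for cell in row)
    return buffer.getvalue()


# Usage:
#
# for table in extract_tables("./ResNet.pdf", start=0, end=1):
#     print(table_to_csv(table))
```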
3. Limitations
- Scanned PDFs are not currently supported, because text recognition (OCR) is not implemented
- Only languages written from left to right are supported (so Arabic is not supported)
- Rotated text is not supported
- Because parsing is rule-based, the original PDF styles cannot always be reproduced exactly
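Since scanned PDFs cannot be converted, it may be worth checking for a text layer before running the conversion. The sketch below uses PyMuPDF's fitz.open and page.get_text (named getText in old PyMuPDF releases). The function names are hypothetical, and an empty page is only a hint that the page is a scan, not proof.

```python
def read_page_texts(pdf_file):
    """Return the extracted text of every page (requires PyMuPDF)."""
    import fitz  # PyMuPDF; imported lazily

    doc = fitz.open(pdf_file)
    try:
        return [page.get_text() for page in doc]
    finally:
        doc.close()


def pages_without_text(page_texts):
    """Return the 0-based indices of pages whose extracted text is empty."""
    return [i for i, text in enumerate(page_texts) if not text.strip()]


# Usage:
#
# empty = pages_without_text(read_page_texts("./ResNet.pdf"))
# if empty:
#     print("Pages with no text layer (possibly scanned):", empty)
```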
4. Example
Suppose we have a PDF file named ResNet.pdf.
The code is as follows:
from pdf2docx import parse
pdf_file = './ResNet.pdf'
docx_file = './resnet.docx'
# convert pdf to docx
parse(pdf_file, docx_file)
The output is written to resnet.docx.
The result is acceptable, but some content may be lost during conversion!
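To get a rough idea of what was lost, the words in the PDF can be compared with the words in the generated docx. The following is only a sketch under several assumptions: it reads the PDF with PyMuPDF and the docx with python-docx, compares whitespace-separated words, and ignores layout entirely. All function names are made up for this illustration.

```python
from collections import Counter


def read_pdf_text(pdf_file):
    """Concatenate the text of all PDF pages (requires PyMuPDF)."""
    import fitz  # PyMuPDF; imported lazily

    doc = fitz.open(pdf_file)
    try:
        return "\n".join(page.get_text() for page in doc)
    finally:
        doc.close()


def _iter_text(container):
    """Yield paragraph text from a document or cell, including nested tables."""
    for paragraph in container.paragraphs:
        yield paragraph.text
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                # Merged cells may be visited more than once; that only
                # inflates the docx counts, which is harmless here.
                yield from _iter_text(cell)


def read_docx_text(docx_file):
    """Concatenate the text of a docx file (requires python-docx)."""
    from docx import Document  # imported lazily

    return "\n".join(_iter_text(Document(docx_file)))


def missing_words(pdf_text, docx_text):
    """Count the words that occur more often in the PDF than in the docx."""
    return Counter(pdf_text.split()) - Counter(docx_text.split())


# Usage:
#
# lost = missing_words(read_pdf_text("./ResNet.pdf"), read_docx_text("./resnet.docx"))
# print(lost.most_common(20))
```

Words split differently by the two libraries (hyphenation, ligatures) will show up as false positives, so the counts are best read as a hint of where to look.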