Convert PDF files to Word format with Python in 3 seconds! (With packaging)

PDF documents follow a fixed specification: every character is placed at exact coordinates on the page, and shapes (lines, rectangles, curves, etc.) are drawn at given coordinates. As a result, transmitting and printing documents as PDF keeps the layout consistent and avoids the problems Word files can run into under different rendering engines, such as broken formatting or extra or missing pages.
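To make "position-based" a bit more concrete, here is a minimal sketch of the kind of work a converter has to do: rebuilding text lines from fragments that only carry coordinates. The `Fragment` tuple, the `group_into_lines` helper and the 2-point tolerance are assumptions made up for this illustration; this is not how pdf2docx is implemented internally.

```python
from typing import List, Tuple

# Hypothetical fragment type: (x, y, text), with y measured from the top of the page.
Fragment = Tuple[float, float, str]


def group_into_lines(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Rebuild text lines by clustering fragments on y, then ordering each line by x."""
    lines: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: (f[1], f[0])):
        # Same line if the fragment's y is close to the y of the line's first fragment.
        if lines and abs(frag[1] - lines[-1][0][1]) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Tuples sort by x first, so sorted(line) gives left-to-right order.
    return [" ".join(text for _, _, text in sorted(line)) for line in lines]


sample = [(120.0, 50.5, "World"), (72.0, 50.0, "Hello"), (72.0, 80.0, "Second line")]
print(group_into_lines(sample))  # ['Hello World', 'Second line']
```

A real converter also has to infer paragraphs, tables and styles in a similar way, which is why the results are never guaranteed to be perfect.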

A Word document uses a flow layout: the relative distance between elements determines their final position on the page. That makes it suitable for editing, because changing earlier content automatically updates the layout of everything after it.

PDF to Word conversion is an old topic. The difficulty lies in mapping PDF's position-based format onto Word's content-based format. A PDF document has no real concept of paragraphs or tables. To convert PDF to Word, you have to parse "horizontal and vertical lines surrounding text" in the PDF into a Word "table", parse "text with a horizontal line below it" as "underlined text", and so on.

This article uses the pdf2docx library, which supports Windows and Linux and requires Python >= 3.6. To install pdf2docx:

pip install pdf2docx

Using pdf2docx

from pdf2docx import Converter
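Before the batch script, here is a minimal sketch of converting a single file with `Converter`. The `convert_one` helper and the file name `sample.pdf` are hypothetical names chosen for this example. The `start` and `end` arguments of `convert` are zero-based page indexes, and to my understanding `end` is exclusive, like a Python slice; check the pdf2docx documentation for your installed version.

```python
from pathlib import Path


def convert_one(pdf_path, docx_path=None, start=0, end=None):
    """Convert one PDF to docx and return the output path (hypothetical helper)."""
    # Imported inside the function so the module loads even before pdf2docx is installed.
    from pdf2docx import Converter

    pdf_path = Path(pdf_path)
    # Default output: same name, .docx suffix.
    docx_path = Path(docx_path) if docx_path else pdf_path.with_suffix(".docx")

    cv = Converter(str(pdf_path))
    try:
        # end=None means "up to the last page".
        cv.convert(str(docx_path), start=start, end=end)
    finally:
        cv.close()  # release the file even if the conversion fails
    return docx_path
```

For example, `convert_one("sample.pdf", end=2)` would convert only the first two pages of a file named sample.pdf in the current folder.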

The idea is as follows:

  1. Get the path of the current folder.
  2. Pick out all PDF files in that folder.
  3. Split each PDF file name into name and suffix.
  4. Join the file name with '.docx' to build the Word file name (the name stays the same; only the format changes).
  5. Use pdf2docx to convert the file.
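Steps 1 to 4 can be sketched with the standard library alone, without touching pdf2docx. The `plan_conversions` helper below is a hypothetical name for this illustration; it uses `pathlib` and a case-insensitive suffix check, which is a small variation on the approach described above.

```python
from pathlib import Path


def plan_conversions(folder="."):
    """Return (pdf_path, docx_path) pairs for every PDF directly inside folder."""
    pairs = []
    for path in sorted(Path(folder).iterdir()):
        # Steps 2-3: keep only files whose suffix is .pdf (upper or lower case).
        if path.is_file() and path.suffix.lower() == ".pdf":
            # Step 4: same name, new suffix.
            pairs.append((path, path.with_suffix(".docx")))
    return pairs


# Show what would be converted in the current folder.
for pdf, docx in plan_conversions():
    print(pdf.name, "->", docx.name)
```

Separating the "what to convert" logic from the conversion itself makes it easy to check the file list before any conversion is run.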

The source code is very simple. It is given below, with the ideas explained in the comments.

import os
from pdf2docx import Converter

def pdf_docx():
    # Get the current working directory
    file_path = os.getcwd()

    # Iterate over all files in the directory
    for file in os.listdir(file_path):
        # Split the file name from its suffix
        file_name, suff_name = os.path.splitext(file)

        # Skip files that are not PDFs
        if suff_name.lower() != '.pdf':
            continue
        # Full path of the source PDF file
        pdf_name = os.path.join(file_path, file)
        # Full path of the docx file to create
        docx_name = os.path.join(file_path, file_name + '.docx')
        # Load the PDF document and convert it
        cv = Converter(pdf_name)
        cv.convert(docx_name)
        cv.close()

if __name__ == '__main__':
    pdf_docx()

Test

The PDF document we prepared contains formatting and pictures. Let's test it first.

The console prints the following information. Converting the 3-page PDF to docx took 0.17 seconds.

[INFO] Start to convert E:\Python\pycharm++\GOGO数据\卢本伟.pdf
[INFO] [1/4] Opening document...
[INFO] [2/4] Analyzing document...
[WARNING] Replace font "MicrosoftYaHeiUI" with "Times New Roman" due to lack of data.
Deprecation: 'getText' removed from class 'Page' after v1.19.0 - use 'get_text'.
Deprecation: 'getImageList' removed from class 'Page' after v1.19.0 - use 'get_images'.
Deprecation: 'getImageBbox' removed from class 'Page' after v1.19.0 - use 'get_image_bbox'.
Deprecation: 'getPNGData' removed from class 'Pixmap' after v1.19.0 - use 'tobytes'.
Deprecation: 'getDrawings' removed from class 'Page' after v1.19.0 - use 'get_drawings'.
Deprecation: 'getLinks' removed from class 'Page' after v1.19.0 - use 'get_links'.
Deprecation: 'getArea' removed from class 'Rect' after v1.19.0 - use 'get_area'.
[INFO] [3/4] Parsing pages...
[INFO] (1/3) Page 1
[INFO] (2/3) Page 2
[INFO] (3/3) Page 3
[INFO] [4/4] Creating pages...
[INFO] (1/3) Page 1
[INFO] (2/3) Page 2
[INFO] (3/3) Page 3
[INFO] Terminated in 0.17s.

The converted docx file format is as follows:

We can now convert PDF to Word, but this approach is still quite limited: what if my PC does not have a Python environment?

Next, we package the script so that documents can be converted anytime and anywhere. A common way to package Python programs is PyInstaller.

pip install pyinstaller 

Detailed steps

PyInstaller is a command-line tool. The detailed steps are as follows:

1. In cmd, switch to the directory containing the Python file.

2. Run the command pyinstaller -F pdfToword.py (the -F option bundles everything into a single exe file).

After execution, you will find that 3 folders have been generated.

The dist folder contains the exe file we have packaged.

3. Double-click the exe and it runs successfully: PDF-to-Word conversion with one click.
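One thing to keep in mind: the script scans `os.getcwd()`, so the exe converts the PDFs in whatever folder it is started from. When you double-click it, that is typically the folder containing the exe, but a shortcut or a terminal opened elsewhere can change it. A possible variation, sketched below with a hypothetical `app_dir` helper, is to use the exe's own folder when the program is running as a PyInstaller bundle (PyInstaller sets `sys.frozen` in that case).

```python
import os
import sys


def app_dir():
    """Folder to scan for PDFs: the exe's folder when bundled, else the current directory."""
    # PyInstaller sets sys.frozen on the bundled executable.
    if getattr(sys, "frozen", False):
        return os.path.dirname(os.path.abspath(sys.executable))
    return os.getcwd()


print(app_dir())
```

In the script above, `file_path = os.getcwd()` could then be replaced by `file_path = app_dir()`; this is a suggestion, not part of the original code.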

It’s convenient enough~~
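If you would rather not type the packaging command by hand each time, PyInstaller can also be started from Python through `PyInstaller.__main__.run`, which takes the same arguments as the command line. The helper names `pyinstaller_args` and `package` below are hypothetical, and the script name pdfToword.py is taken from step 2.

```python
def pyinstaller_args(script, onefile=True, name=None):
    """Build the argument list that would otherwise be typed after `pyinstaller`."""
    args = [script]
    if onefile:
        args.append("--onefile")  # long form of -F
    if name:
        args += ["--name", name]  # optional name for the generated exe
    return args


def package(script, **options):
    """Run PyInstaller in-process with the arguments built above."""
    # Imported lazily so pyinstaller_args works without PyInstaller installed.
    import PyInstaller.__main__
    PyInstaller.__main__.run(pyinstaller_args(script, **options))


print(pyinstaller_args("pdfToword.py"))  # ['pdfToword.py', '--onefile']
```

Calling `package("pdfToword.py")` from the script's folder should then be equivalent to the command in step 2.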

This concludes today’s sharing.

Origin blog.csdn.net/m0_60961651/article/details/135358135