Convert PDF to Word easily with just a few lines of Python

Converting PDF to Word is an old problem. The difficulty lies in mapping PDF's position-based layout of elements onto Word's content-based format.

PDF documents have no concept of paragraphs or tables. A PDF-to-Word converter therefore has to infer them: for example, it parses "horizontal and vertical lines around text" into a Word table, and "text with a horizontal line below it" into underlined text.
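To make the idea concrete, here is a toy sketch of the "text with a horizontal line below it" rule. It is not pdf2docx's actual algorithm; the Rect type, the is_underline function, and the tolerance values are all made up for illustration.

```python
from typing import NamedTuple

class Rect(NamedTuple):
    """Axis-aligned box in page coordinates (y grows downward)."""
    x0: float
    y0: float
    x1: float
    y1: float

def is_underline(text: Rect, line: Rect, tol: float = 2.0) -> bool:
    """Return True if `line` looks like an underline for `text`."""
    # A stroke counts as horizontal when it is much wider than it is tall.
    is_horizontal = (line.x1 - line.x0) > 5 * (line.y1 - line.y0)
    # It must sit within `tol` units of the bottom edge of the text box.
    gap = line.y0 - text.y1
    close_below = -tol <= gap <= tol
    # It must cover most of the width of the text.
    overlap = min(text.x1, line.x1) - max(text.x0, line.x0)
    covers_text = overlap >= 0.8 * (text.x1 - text.x0)
    return is_horizontal and close_below and covers_text

word = Rect(72, 100, 172, 112)          # hypothetical text box
stroke = Rect(72, 113, 172, 113.5)      # thin stroke just under the text
print(is_underline(word, stroke))       # True
```

A real converter has to apply many rules like this one at once (table borders, strike-through, highlights), which is why the mapping is hard.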

Note: pdf2docx supports Windows and Linux and requires Python >= 3.6.


How to install pdf2docx:

pip install pdf2docx


Using pdf2docx

from pdf2docx import Converter
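As a minimal sketch of how Converter can be used on a single file, the helper below converts a page range and always closes the converter. The names convert_pages and default_docx_path are hypothetical; the start and end arguments of convert() are zero-based page indexes in pdf2docx, with end excluded.

```python
from pathlib import Path

def default_docx_path(pdf_path):
    """Return the PDF path with its suffix swapped for .docx."""
    return Path(pdf_path).with_suffix(".docx")

def convert_pages(pdf_path, docx_path=None, start=0, end=None):
    """Convert pages [start, end) of one PDF and return the docx path."""
    # Imported here so this sketch can be loaded before pdf2docx is installed.
    from pdf2docx import Converter

    docx_path = docx_path or default_docx_path(pdf_path)
    cv = Converter(str(pdf_path))
    try:
        cv.convert(str(docx_path), start=start, end=end)
    finally:
        # Always release the PDF file, even if the conversion fails.
        cv.close()
    return Path(docx_path)
```

For example, convert_pages("report.pdf", end=2) would convert only the first two pages of a (hypothetical) report.pdf into report.docx.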

The idea is as follows

  1. Get the path of the current folder.

  2. Pick out all PDF files in the current folder.

  3. Split each PDF file name into its name and suffix.

  4. Join the file name with '.docx' to build the Word file name (the format changes, the file name stays the same).

  5. Use pdf2docx for file conversion.
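Steps 1 to 4 can be sketched with the standard library alone. This version uses pathlib rather than os.path, and the function name plan_conversions is made up for illustration; it only plans the file names and leaves the actual conversion (step 5) to pdf2docx.

```python
from pathlib import Path

def plan_conversions(folder="."):
    """Return (pdf_path, docx_path) pairs for every PDF in `folder`."""
    folder = Path(folder)
    pairs = []
    for path in sorted(folder.iterdir()):
        # Steps 2-3: keep regular files whose suffix is .pdf (case-insensitive).
        if not path.is_file() or path.suffix.lower() != ".pdf":
            continue
        # Step 4: same name, new suffix.
        pairs.append((path, path.with_suffix(".docx")))
    return pairs

if __name__ == "__main__":
    for pdf, docx in plan_conversions():
        print(pdf.name, "->", docx.name)
```

Separating the planning from the conversion makes it easy to check which files would be touched before anything is written.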

source code

The code is very simple. The full source is given below, and the comments explain each step.

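import os
from pdf2docx import Converter

def pdf_docx():
    # Get the current working directory
    file_path = os.getcwd()

    # Iterate over every file in the directory
    for file in os.listdir(file_path):
        # Split the file name into base name and suffix
        file_name, suff_name = os.path.splitext(file)

        # Skip files that are not PDFs
        if suff_name.lower() != '.pdf':
            continue
        # Full path of the source PDF (os.path.join works on Windows and Linux)
        pdf_name = os.path.join(file_path, file)
        # Full path of the docx file to create (same name, new suffix)
        docx_name = os.path.join(file_path, file_name + '.docx')
        # Load the PDF, convert it, and release the file
        cv = Converter(pdf_name)
        cv.convert(docx_name)
        cv.close()

# Run the conversion when the script (or the packaged exe) is started
if __name__ == '__main__':
    pdf_docx()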

test

The PDF document we prepared contains formatting and pictures. Let's test it first.


The console prints the following, showing that a 3-page PDF was converted to docx in 0.17 seconds:

[INFO] Start to convert E:\Python\pycharm++\GOGO数据\卢本伟.pdf
[INFO] [1/4] Opening document...
[INFO] [2/4] Analyzing document...
[WARNING] Replace font "MicrosoftYaHeiUI" with "Times New Roman" due to lack of data.
Deprecation: 'getText' removed from class 'Page' after v1.19.0 - use 'get_text'.
Deprecation: 'getImageList' removed from class 'Page' after v1.19.0 - use 'get_images'.
Deprecation: 'getImageBbox' removed from class 'Page' after v1.19.0 - use 'get_image_bbox'.
Deprecation: 'getPNGData' removed from class 'Pixmap' after v1.19.0 - use 'tobytes'.
Deprecation: 'getDrawings' removed from class 'Page' after v1.19.0 - use 'get_drawings'.
Deprecation: 'getLinks' removed from class 'Page' after v1.19.0 - use 'get_links'.
Deprecation: 'getArea' removed from class 'Rect' after v1.19.0 - use 'get_area'.
[INFO] [3/4] Parsing pages...
[INFO] (1/3) Page 1
[INFO] (2/3) Page 2
[INFO] (3/3) Page 3
[INFO] [4/4] Creating pages...
[INFO] (1/3) Page 1
[INFO] (2/3) Page 2
[INFO] (3/3) Page 3
[INFO] Terminated in 0.17s.

The converted docx file format is as follows:


We have now completed the conversion from PDF to Word, but this approach has a big limitation: what if the PC has no Python environment?

Next, we package the script as an executable so that documents can be converted anytime, anywhere.

The common way to package a Python program is with pyinstaller.

pip install pyinstaller 


detailed steps

pyinstaller is a command-line tool. The detailed steps are as follows:

1. In cmd, switch to the directory containing the Python file


2. Execute the command pyinstaller -F pdfToword.py


After execution, you will find that 3 folders are generated


Among them, the dist folder has the exe file that we have packaged.


3. Double-click the exe to run it. PDF-to-Word conversion now takes a single click.

It's convenient enough~~
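One thing to watch with a packaged exe: the script scans the current working directory, which may not be the exe's own folder if it is started from a shortcut or another directory. A possible workaround, sketched below, relies on PyInstaller setting sys.frozen on the bundled executable; the function name base_dir is made up for illustration.

```python
import sys
from pathlib import Path

def base_dir():
    """Folder to scan for PDFs: next to the exe when frozen, else the working directory."""
    # PyInstaller sets sys.frozen on the bundled executable.
    if getattr(sys, "frozen", False):
        return Path(sys.executable).resolve().parent
    return Path.cwd()

if __name__ == "__main__":
    print("Scanning", base_dir())
```

Using base_dir() in place of os.getcwd() would make the exe always convert the PDFs that sit beside it, at the cost of ignoring the directory it was launched from.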

That's the end of today's sharing


Origin blog.csdn.net/qq_34160248/article/details/124249480