Converting PDF to Word is an old topic. The difficulty lies in mapping PDF's position-based layout of elements onto Word's content-based format.
A PDF document has no concept of paragraphs or tables. A PDF-to-Word converter has to recognize "horizontal and vertical lines around text" as a Word table, recognize "text with a horizontal line below it" as underlined text, and so on.
Note that pdf2docx supports Windows and Linux and requires Python 3.6 or later.
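To make the "text with a horizontal line below it" rule concrete, here is a minimal sketch of that kind of geometric test. It is not pdf2docx's actual algorithm: the Box class, the is_underline function, and both thresholds are hypothetical, and the coordinates are assumed to have y growing downward.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle; y is assumed to grow downward (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def is_underline(text, line, max_gap=2.0, min_overlap=0.8):
    """Guess whether `line` underlines `text`. Thresholds are illustrative only."""
    width = line.x1 - line.x0
    height = line.y1 - line.y0
    # An underline is a thin, horizontal stroke.
    if height >= width:
        return False
    # It must sit close to the bottom edge of the text box.
    gap = line.y0 - text.y1
    if abs(gap) > max_gap:
        return False
    # It must cover most of the text horizontally.
    overlap = min(text.x1, line.x1) - max(text.x0, line.x0)
    return overlap >= min_overlap * (text.x1 - text.x0)
```

A line that passes this test would be emitted as an underline attribute on the text instead of as a separate drawing. Table detection follows the same idea, with a grid of lines in place of a single stroke.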
How to install pdf2docx:
pip install pdf2docx
Using pdf2docx:
from pdf2docx import Converter
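As a minimal sketch of how Converter is typically used on a single file: the helper name convert_pages and its existence check are assumptions for illustration, while start and end are the page-range arguments of Converter.convert, which pdf2docx documents as zero-based.

```python
from pathlib import Path


def convert_pages(pdf_path, docx_path, start=0, end=None):
    """Convert one PDF to docx, optionally limited to a page range.

    Hypothetical helper; `start` and `end` are passed straight through
    to Converter.convert.
    """
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(pdf_path)

    # Imported here so a missing input file is reported before pdf2docx is needed.
    from pdf2docx import Converter

    cv = Converter(str(pdf_path))
    try:
        cv.convert(str(docx_path), start=start, end=end)
    finally:
        # Always release the PDF, even if the conversion fails.
        cv.close()
```

For example, convert_pages("report.pdf", "report.docx", start=1) would skip the first page; the file names here are placeholders.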
The idea is as follows:
- Get the path of the folder that contains the PDF files.
- Filter out all the PDF files in the current folder.
- Split each PDF file name into name and suffix.
- Join the name with '.docx' to build the Word file name (the format changes, the file name does not).
- Use pdf2docx to convert the file.
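The steps above can be sketched with pathlib as follows. This is only an illustration: plan_conversions and convert_all are hypothetical names, and the suffix match is made case-insensitive here, which is an assumption beyond the steps listed above.

```python
from pathlib import Path


def plan_conversions(folder):
    """Return (pdf_path, docx_path) pairs for every PDF directly inside `folder`."""
    pairs = []
    for path in sorted(Path(folder).iterdir()):
        # Keep only regular files whose suffix is .pdf (case-insensitive: assumption).
        if path.is_file() and path.suffix.lower() == ".pdf":
            # Same file name, new suffix.
            pairs.append((path, path.with_suffix(".docx")))
    return pairs


def convert_all(folder):
    """Convert every planned pair with pdf2docx."""
    from pdf2docx import Converter  # imported lazily; planning needs only the stdlib

    for pdf_path, docx_path in plan_conversions(folder):
        cv = Converter(str(pdf_path))
        try:
            cv.convert(str(docx_path))
        finally:
            cv.close()
```

Keeping the file-name planning separate from the conversion makes it possible to check which files would be touched before anything is converted.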
source code
The code is short. The full source follows, and each step is explained in the comments.
import os
from pdf2docx import Converter


def pdf_docx():
    # Get the current working directory
    file_path = os.getcwd()
    # Walk through every file in it
    for file in os.listdir(file_path):
        # Split the file name from its suffix
        file_name, suff_name = os.path.splitext(file)
        # Skip anything that is not a PDF
        if suff_name != '.pdf':
            continue
        # Full path of the source PDF
        pdf_name = os.path.join(file_path, file)
        # Full path of the docx file to create (same name, new suffix)
        docx_name = os.path.join(file_path, file_name + '.docx')
        # Load the PDF, convert it, then release the file
        cv = Converter(pdf_name)
        cv.convert(docx_name)
        cv.close()


if __name__ == '__main__':
    pdf_docx()
test
The PDF document we prepared contains both formatting and pictures. Let's test with it first.
The console prints the following, and the 3-page pdf -> docx conversion completes in 0.17 seconds:
[INFO] Start to convert E:\Python\pycharm++\GOGO数据\卢本伟.pdf
[INFO] [1/4] Opening document...
[INFO] [2/4] Analyzing document...
[WARNING] Replace font "MicrosoftYaHeiUI" with "Times New Roman" due to lack of data.
Deprecation: 'getText' removed from class 'Page' after v1.19.0 - use 'get_text'.
Deprecation: 'getImageList' removed from class 'Page' after v1.19.0 - use 'get_images'.
Deprecation: 'getImageBbox' removed from class 'Page' after v1.19.0 - use 'get_image_bbox'.
Deprecation: 'getPNGData' removed from class 'Pixmap' after v1.19.0 - use 'tobytes'.
Deprecation: 'getDrawings' removed from class 'Page' after v1.19.0 - use 'get_drawings'.
Deprecation: 'getLinks' removed from class 'Page' after v1.19.0 - use 'get_links'.
Deprecation: 'getArea' removed from class 'Rect' after v1.19.0 - use 'get_area'.
[INFO] [3/4] Parsing pages...
[INFO] (1/3) Page 1
[INFO] (2/3) Page 2
[INFO] (3/3) Page 3
[INFO] [4/4] Creating pages...
[INFO] (1/3) Page 1
[INFO] (2/3) Page 2
[INFO] (3/3) Page 3
[INFO] Terminated in 0.17s.
The converted docx file format is as follows:
We have now converted a PDF to Word, but this approach has a big limitation: what if the PC has no Python environment?
Next, we package the script as an executable so that documents can be converted anytime, anywhere.
The common way to package a Python program is pyinstaller.
pip install pyinstaller
detailed steps
pyinstaller is a command-line tool. The detailed steps are:
1. In cmd, switch to the directory that contains the Python file.
2. Run the command pyinstaller -F pdfToword.py
After it finishes, you will find that 3 folders have been generated.
The dist folder among them holds the packaged exe file.
3. Double-click the exe to run it: one-click PDF-to-Word conversion.
It's convenient enough~~
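The packaging step can also be driven from Python instead of typing the command by hand. The sketch below assumes pyinstaller is installed in the same interpreter; build_command and build_exe are hypothetical helper names, and --onefile is the long form of -F.

```python
import subprocess
import sys


def build_command(script, onefile=True):
    """Build the argument list for running PyInstaller as a module."""
    cmd = [sys.executable, "-m", "PyInstaller"]
    if onefile:
        cmd.append("--onefile")  # long form of -F: bundle everything into one exe
    cmd.append(script)
    return cmd


def build_exe(script):
    """Run PyInstaller on `script`; raises CalledProcessError if the build fails."""
    subprocess.run(build_command(script), check=True)
```

Calling build_exe("pdfToword.py") from the script's directory should be equivalent to step 2 above, with the exe again ending up in the dist folder.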
That's the end of today's sharing