Convert PDF to Word in 4 lines of Python code

1. Convert pdf to docx

Converting a PDF to Word format is harder than it sounds. Many online converters charge a fee, and the results are often poor.
In Python, the pdf2docx library handles this well, and it can be installed directly with pip.

pip install pdf2docx

Wait for the installation to be completed.
The library's Converter class has a convert() method, which converts a PDF to Word format
(the entire document is converted by default). The complete code is as follows:

from pdf2docx import Converter

cv = Converter("C:/Users/ypzhao/Desktop/毕业论文.pdf")
cv.convert("C:/Users/ypzhao/Desktop/毕业.docx", start=0, end=None)
cv.close()
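The three calls above can be wrapped so that the Converter is closed even when the conversion raises an error. This is a minimal sketch, not part of the original code: the helper names docx_path_for and pdf_to_docx are made up for illustration, and pdf2docx is imported inside the function so that the path helper can be used on its own.

```python
from pathlib import Path


def docx_path_for(pdf_path):
    """Return the path of a .docx file next to the given PDF (hypothetical helper)."""
    return Path(pdf_path).with_suffix(".docx")


def pdf_to_docx(pdf_path, docx_path=None, **convert_kwargs):
    """Convert one PDF and close the Converter even if conversion fails."""
    # Imported here so docx_path_for works without pdf2docx installed.
    from pdf2docx import Converter

    docx_path = docx_path or docx_path_for(pdf_path)
    cv = Converter(str(pdf_path))
    try:
        # Extra keyword arguments (start, end, pages) are passed through unchanged.
        cv.convert(str(docx_path), **convert_kwargs)
    finally:
        cv.close()
    return docx_path
```

Calling pdf_to_docx("thesis.pdf") would then write thesis.docx next to the source file.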

To convert only specified pages, the complete code is as follows:

from pdf2docx import Converter

cv = Converter("C:/Users/ypzhao/Desktop/毕业论文.pdf")
cv.convert("C:/Users/ypzhao/Desktop/毕业论文.docx", pages=[0,2])
cv.close()

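The pages parameter takes a list of zero-based page indices, so pages=[0,2] converts the first and third pages. To convert a continuous range of pages, use start and end instead.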
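To make the difference between start/end and pages concrete, here is a plain-Python illustration of which page indices each form selects. It is not pdf2docx's own code. It assumes what the pdf2docx documentation describes: indices are zero-based, end is exclusive, and an explicit pages list takes priority. The function name selected_pages is hypothetical.

```python
def selected_pages(page_count, start=0, end=None, pages=None):
    """Zero-based page indices picked by start/end or by an explicit list."""
    if pages:
        # An explicit list wins: pages=[0, 2] means first and third page.
        return [int(p) for p in pages]
    # Otherwise take the half-open range [start, end); end=None means "to the last page".
    stop = page_count if end is None else end
    return list(range(page_count))[start:stop]
```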

2. Convert docx to pdf

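First install the library (the same package is also published on PyPI as pywin32). This approach requires Windows with Microsoft Word installed.

pip install pypiwin32

from win32com.client import Dispatch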

old_file_path = r"C:/Users/ypzhao/Desktop/毕业论文.docx"
new_file_path = r"C:/Users/ypzhao/Desktop/毕业论文_convert.pdf"
word = Dispatch('Word.Application')
doc = word.Documents.Open(old_file_path)
doc.SaveAs(new_file_path,17)
doc.Close()
word.Quit()

The result is the same as a PDF produced with Word's own "Save As" command.

First, import the Dispatch function from the win32com library. Then, define the variables old_file_path and new_file_path to hold the original file path and the target file path respectively. Here, the original file path is "C:/Users/ypzhao/Desktop/毕业论文.docx", and the target file path is "C:/Users/ypzhao/Desktop/毕业论文_convert.pdf".

Next, create a Word.Application object, open the original file, and obtain the Document object. Call the SaveAs method to save the document in PDF format to the specified path, and set parameter 17 to specify the saving format of the document as PDF. Finally, close the document and exit the Word application.

The win32com library lets Python programs drive COM applications on Windows, such as Word, Excel, PowerPoint and other Office software, to automate their operations. In the code above, the Dispatch function creates a Word.Application object, which opens the specified DOCX document and saves it as a PDF through the SaveAs method. The whole process needs no manual intervention.
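If SaveAs fails, the code above never reaches word.Quit(), which can leave a Word process running in the background. The sketch below is one way to guard against that with try/finally. It is an illustration under stated assumptions, not the author's code: the function name docx_to_pdf and the dispatch parameter are made up (the parameter only makes it possible to try the function without Word), and passing absolute paths is a precaution, since Word may not resolve relative paths against the script's working directory.

```python
import os

WD_FORMAT_PDF = 17  # the same format code the code above passes to SaveAs


def docx_to_pdf(src, dst, dispatch=None):
    """Save src as a PDF at dst and always quit Word afterwards."""
    if dispatch is None:
        # Imported lazily so this module can be loaded where pywin32 is missing.
        from win32com.client import Dispatch as dispatch
    word = dispatch("Word.Application")
    try:
        doc = word.Documents.Open(os.path.abspath(src))
        try:
            doc.SaveAs(os.path.abspath(dst), WD_FORMAT_PDF)
        finally:
            doc.Close()  # close the document even if SaveAs raised
    finally:
        word.Quit()  # quit Word even if opening or saving raised
```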

3. Convert doc to docx

from win32com.client import Dispatch
old_file_path = r"C:/Users/ypzhao/Desktop/毕业论文.doc"
new_file_path = r"C:/Users/ypzhao/Desktop/毕业论文_convert.docx"

word = Dispatch('Word.Application')
doc = word.Documents.Open(old_file_path)
doc.SaveAs(new_file_path,12)
doc.Close()
word.Quit()

4. Convert xls format to xlsx

from win32com.client import Dispatch
old_file_path = r"C:/Users/ypzhao/Desktop/毕业论文.xls"
new_file_path = r"C:/Users/ypzhao/Desktop/毕业论文_convert.xlsx"

excel = Dispatch('Excel.Application')
wb = excel.Workbooks.Open(old_file_path)
wb.SaveAs(new_file_path,51)
wb.Close()
excel.Quit()
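Sections 2 to 4 all follow the same pattern and differ only in the application and the numeric format code passed to SaveAs. As a rough sketch, the three codes can be collected in one table. The names SAVE_AS and plan_conversion are hypothetical; the codes are the ones used above, which correspond to Word's wdFormatPDF (17) and wdFormatXMLDocument (12) and Excel's xlOpenXMLWorkbook (51).

```python
from pathlib import PurePath

# target suffix -> (COM ProgID, SaveAs format code), as used in sections 2 to 4
SAVE_AS = {
    ".pdf": ("Word.Application", 17),    # wdFormatPDF
    ".docx": ("Word.Application", 12),   # wdFormatXMLDocument
    ".xlsx": ("Excel.Application", 51),  # xlOpenXMLWorkbook
}


def plan_conversion(src, target_suffix):
    """Return (ProgID, format code, target path) for one source file."""
    prog_id, file_format = SAVE_AS[target_suffix]
    src = PurePath(src)
    # Same naming scheme as above: "<name>_convert<suffix>" next to the source.
    dst = src.with_name(f"{src.stem}_convert{target_suffix}")
    return prog_id, file_format, dst
```

This only plans the conversion; the returned values still have to be passed to Dispatch, Open and SaveAs as in the code above.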

5. Convert pdf to docx in batches

The pdf2docx library can also convert PDF files to DOCX files in batches.

First, define the variables path and path_convert to represent the directory where the original file is located and the directory where the converted file is stored respectively. Here they are "C:/Users/ypzhao/Desktop/pdf/" and "C:/Users/ypzhao/Desktop/docx/" respectively.

Then, use the listdir method in the os module to traverse all files in the directory, determine if the file type is PDF, and then convert it. Use the Converter class to open the PDF file, specify the converted target file name {file_name}.docx, and call the convert method to convert it to a DOCX file, and specify the page number range. Finally, close the Converter object and the conversion is complete.

import os
from pdf2docx import Converter

path = "C:/Users/ypzhao/Desktop/pdf/"
path_convert = "C:/Users/ypzhao/Desktop/docx/"

for i in os.listdir(path):
    # splitext copes with file names that contain several dots or none at all
    file_name, file_suffix = os.path.splitext(i)
    if file_suffix.lower() == ".pdf":
        cv = Converter(path + i)
        cv.convert(path_convert + file_name + ".docx", start=0, end=None)
        cv.close()
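It can help to work out which files will be converted, and where the results will go, before any conversion starts. The following is a minimal sketch of that planning step using pathlib. The function name pdf_targets is hypothetical, and accepting an upper-case ".PDF" extension is an added assumption.

```python
from pathlib import Path


def pdf_targets(src_dir, dst_dir):
    """Pair every PDF in src_dir with a .docx path in dst_dir."""
    dst_dir = Path(dst_dir)
    pairs = []
    for src in sorted(Path(src_dir).iterdir()):
        # suffix is only the last extension, so "a.v2.pdf" keeps the stem "a.v2".
        if src.is_file() and src.suffix.lower() == ".pdf":
            pairs.append((src, dst_dir / (src.stem + ".docx")))
    return pairs
```

Each (source, target) pair can then be handed to Converter and convert() in the same way as in the loop above.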

6. Convert docx to pdf in batches

from time import sleep
import os
from win32com.client import Dispatch

path = "C:/Users/ypzhao/Desktop/docx/"
path_convert = "C:/Users/ypzhao/Desktop/pdf/"
print("-----Converting doc to docx-----")

for i in os.listdir(path):
    # splitext copes with file names that contain several dots or none at all
    file_name, file_suffix = os.path.splitext(i)
    if file_suffix.lower() == ".doc":
        word = Dispatch('Word.Application')
        doc = word.Documents.Open(path + i)
        doc.SaveAs(path + file_name + ".docx", FileFormat=12)
        print(i, "converted")
        doc.Close()
        word.Quit()
        sleep(3)

print("-----Converting docx to pdf-----")
for i in os.listdir(path):
    file_name, file_suffix = os.path.splitext(i)
    if file_suffix.lower() == ".docx":
        word = Dispatch('Word.Application')
        doc = word.Documents.Open(path + i)
        doc.SaveAs(path_convert + file_name + ".pdf", FileFormat=17)
        print(i, "...converted")
        doc.Close()
        word.Quit()
        sleep(3)

This code first converts every legacy Word document (.doc) in the specified directory to .docx, and then converts every .docx file in that directory to PDF.

The sleep(3) call pauses execution so that the next document is not opened before the Word application has fully closed, which could otherwise cause errors.
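One possible alternative, which is not the approach taken above, is to start Word once, convert every file with the same instance, and quit once at the end. Whether this makes the pause unnecessary is an assumption that has not been verified here. In the sketch below, batch_docx_to_pdf and its dispatch parameter are hypothetical names; the parameter exists only so that the function can be tried without Word installed.

```python
import os


def batch_docx_to_pdf(src_dir, dst_dir, dispatch=None):
    """Convert every .docx in src_dir to PDF using a single Word instance."""
    if dispatch is None:
        # Imported lazily so this module can be loaded where pywin32 is missing.
        from win32com.client import Dispatch as dispatch
    src_dir, dst_dir = os.path.abspath(src_dir), os.path.abspath(dst_dir)
    converted = []
    word = dispatch("Word.Application")  # started once for the whole batch
    try:
        for name in sorted(os.listdir(src_dir)):
            stem, suffix = os.path.splitext(name)
            if suffix.lower() != ".docx":
                continue
            doc = word.Documents.Open(os.path.join(src_dir, name))
            try:
                doc.SaveAs(os.path.join(dst_dir, stem + ".pdf"), FileFormat=17)
            finally:
                doc.Close()
            converted.append(name)
    finally:
        word.Quit()  # quit once, even if one of the files failed
    return converted
```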

Because conversion can introduce problems such as formatting errors or a scrambled page layout, it is recommended to check each converted document manually once the conversion is complete.
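A manual review can be preceded by a quick automated check that every expected output file exists and is not empty. The sketch below does only that. It says nothing about layout or formatting, and the function name suspicious_outputs is hypothetical.

```python
import os


def suspicious_outputs(paths):
    """Return the expected output paths that are missing or empty."""
    return [p for p in paths if not os.path.isfile(p) or os.path.getsize(p) == 0]
```

Any path this function returns points to a conversion that most likely failed and should be rerun before the remaining documents are reviewed by hand.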

Origin blog.csdn.net/m0_58857684/article/details/130804532