20230809 Use python3 to convert DOCX files to TXT files under WIN10

2023/8/9 11:38


python docx txt

 

 


https://blog.51cto.com/u_16175446/6620474
How to implement the specific steps of reading Word content into TXT in Python


Original article, mob649e81576de1, 2023-07-04 14:08:13
Article tags: Python, Word, txt file. Article classification: Python, back-end development.

Reading Word content with Python and converting it to TXT
As an experienced developer, I am happy to show you how to use Python to read the content of a Word document and convert it to txt format. The steps of the whole process, and the code each one needs, are given below.

Steps
Step     Description
Step 1   Install the python-docx library
Step 2   Open the Word document
Step 3   Read the document content
Step 4   Save the content as a txt file
Code explanation
Step 1: Install the python-docx library
python-docx is a Python library that can read, query, and modify docx files created by Microsoft Word 2007 or later. Before starting, install the library by running the following command on the command line:

pip install python-docx
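
One detail that is easy to trip over: the package is installed as python-docx but imported as docx. The following is a minimal sketch for checking that the import name resolves after installation, using only the standard library. The helper name has_python_docx is hypothetical and not part of any library.

```python
import importlib.util


def has_python_docx() -> bool:
    """Return True if the `docx` module (installed by python-docx) can be found."""
    return importlib.util.find_spec("docx") is not None


if __name__ == "__main__":
    if has_python_docx():
        print("python-docx is available")
    else:
        print("python-docx is missing; run: pip install python-docx")
```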
Step 2: Open the Word document
To open a Word document, we use the Document class from the python-docx library. The following code opens the document:

from docx import Document

doc_path = "path_to_your_word_file.docx"
doc = Document(doc_path)

 


Please replace path_to_your_word_file.docx with the full path of the Word document you want to read.

Step 3: Read the document content
In this step, we will use the paragraphs property of the Document object to access each paragraph of the document, and then use the text property to obtain the text content of the paragraph. Here is the code to read the document content:

text_content = ""
for paragraph in doc.paragraphs:
    text_content += paragraph.text


In the above code, we create an empty string variable text_content and then use a for loop to iterate through each paragraph in the document. By accessing the paragraph.text property, we can get the text content of each paragraph and add it to the text_content string.

 

Step 4: Save the content as a txt file
The last step is to save the read content as a txt file. Here is the code to save the content as a txt file:

txt_file_path = "path_to_save_txt_file.txt"
with open(txt_file_path, "w") as txt_file:
    txt_file.write(text_content)

 


Please replace path_to_save_txt_file.txt with the full path to the txt file you wish to save.

In the above code, we use the open function to open a txt file and assign it to the txt_file variable. Then, we use the write method to write the previously read content text_content into the txt file.
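
Because the open call above passes no encoding argument, Python picks a platform-dependent default. As a rough sketch (the helper name default_text_encoding is hypothetical), the standard locale module can report which encoding that would be on the current machine. On WIN10 this is normally the ANSI code page, assuming Python's UTF-8 mode is not enabled.

```python
import locale


def default_text_encoding() -> str:
    """Return the encoding that open() falls back to when none is given."""
    # Passing False asks for the preferred encoding without changing locale settings.
    return locale.getpreferredencoding(False)


if __name__ == "__main__":
    print("open(path, 'w') would write using:", default_text_encoding())
```

If the reported encoding cannot represent some characters in the document, write raises UnicodeEncodeError, which is one reason to pass encoding explicitly.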

At this point, you have completed the entire process of converting Word content to txt.

I hope this article can help you and enable you to easily use Python to read Word documents and convert their contents to txt format. If you have any further questions, please feel free to ask.

 


txt2docx1.py [removes all line breaks!]

from docx import Document

doc_path = "path_to_your_word_file.docx"
doc = Document(doc_path)


text_content = ""
for paragraph in doc.paragraphs:
    text_content += paragraph.text


txt_file_path = "path_to_save_txt_file.txt"
with open(txt_file_path, "w") as txt_file:
    txt_file.write(text_content)

 

 


txt2docx2.py [adds a line break after each paragraph]

from docx import Document

doc_path = "path_to_your_word_file.docx"
doc = Document(doc_path)


text_content = ""
for paragraph in doc.paragraphs:
    text_content += paragraph.text
    text_content += '\n'


txt_file_path = "path_to_save_txt_file.txt"
with open(txt_file_path, "w") as txt_file:
    txt_file.write(text_content)
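
The difference between txt2docx1.py and txt2docx2.py comes down to the separator placed between paragraphs. A minimal sketch of that difference follows. join_paragraphs is a hypothetical helper, and the SimpleNamespace objects are stand-ins for python-docx paragraphs, which expose the same text attribute.

```python
from types import SimpleNamespace


def join_paragraphs(paragraphs, separator="\n"):
    """Concatenate the .text of each paragraph, placing separator between them."""
    return separator.join(paragraph.text for paragraph in paragraphs)


# Stand-ins for doc.paragraphs so the sketch runs without a Word file.
fake_paragraphs = [
    SimpleNamespace(text="First paragraph"),
    SimpleNamespace(text="Second paragraph"),
]

run_together = join_paragraphs(fake_paragraphs, separator="")  # like txt2docx1.py
one_per_line = join_paragraphs(fake_paragraphs)                # like txt2docx2.py

print(repr(run_together))
print(repr(one_per_line))
```

With a real document, join_paragraphs(doc.paragraphs) would take the place of the loop. Unlike the loop in txt2docx2.py, str.join does not append a trailing '\n' after the last paragraph.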

 

 

 

 

 


txt2docx3utf8.py [writes the output as UTF-8]

from docx import Document

doc_path = "path_to_your_word_file.docx"
doc = Document(doc_path)


text_content = ""
for paragraph in doc.paragraphs:
    text_content += paragraph.text
    text_content += '\n'


#with open("path_to_save_utf8_file.txt", "w", encoding="UTF-8") as utf8_file:
#txt_file_path = "path_to_save_txt_file.txt"
#with open(txt_file_path, "w") as txt_file:
txt_file_path = "path_to_save_txt+utf8_file.txt"
with open(txt_file_path, "w", encoding="UTF-8") as txt_file:
    txt_file.write(text_content)

Whether the TXT file is saved with ANSI or UTF-8 encoding, the text content is the same!
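
The text is the same in both encodings once decoded, but the bytes on disk are not. The sketch below illustrates this under one assumption: "gbk" is used as a stand-in for ANSI, which matches a Simplified Chinese WIN10 system but may differ on other locales. The helper write_and_read_bytes and the sample string are hypothetical.

```python
import os
import tempfile


def write_and_read_bytes(text, encoding):
    """Write text with the given encoding and return the raw bytes on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        # newline="" keeps '\n' as-is so both files differ only by encoding.
        with open(path, "w", encoding=encoding, newline="") as txt_file:
            txt_file.write(text)
        with open(path, "rb") as raw_file:
            return raw_file.read()


sample = "DOCX 转 TXT\n"
ansi_bytes = write_and_read_bytes(sample, "gbk")    # assumed ANSI code page
utf8_bytes = write_and_read_bytes(sample, "utf-8")

print(ansi_bytes != utf8_bytes)                               # bytes differ
print(ansi_bytes.decode("gbk") == utf8_bytes.decode("utf-8"))  # text matches
```

A reader that guesses the wrong encoding will show garbled characters, so it helps to know which of the two a given file uses.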

 


docx2txt2all.py/docx2txt+ansi3all.py [converts every DOCX in the current directory to an ANSI-encoded TXT]

# coding=utf-8
import os

from docx import Document


# Get the current directory
path = os.getcwd()
# List all files in the current directory
files = os.listdir(path)

# Traverse all files
for file in files:
    # Only process Word documents
    if file.endswith('.docx'):
        # Construct the output file name: same name with a .txt extension
        new_file = file.replace('.docx', '.txt')

        doc = Document(file)

        # Collect the text of every paragraph, one paragraph per line
        text_content = ""
        for paragraph in doc.paragraphs:
            text_content += paragraph.text
            text_content += '\n'

        # No encoding argument: the system default is used (ANSI on WIN10)
        with open(new_file, "w") as txt_file:
            txt_file.write(text_content)


utf8docx2tx4all.py [converts every DOCX in the current directory to a UTF-8 encoded TXT]

# coding=utf-8
import os

from docx import Document


# Get the current directory
path = os.getcwd()
# List all files in the current directory
files = os.listdir(path)

# Traverse all files
for file in files:
    # Only process Word documents
    if file.endswith('.docx'):
        # Construct the output file name: same name with a .txt extension
        new_file = file.replace('.docx', '.txt')

        doc = Document(file)

        # Collect the text of every paragraph, one paragraph per line
        text_content = ""
        for paragraph in doc.paragraphs:
            text_content += paragraph.text
            text_content += '\n'

        # Write the result explicitly as UTF-8
        with open(new_file, "w", encoding="UTF-8") as txt_file:
            txt_file.write(text_content)
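
The batch scripts above pick files with endswith('.docx') and build the output name with str.replace. A possible variation, sketched here with pathlib, separates the "which files, which output names" step from the conversion itself. plan_conversions is a hypothetical helper. It skips the "~$" lock files that Word leaves next to a document that is currently open, on the assumption that those should never be converted.

```python
from pathlib import Path


def plan_conversions(directory):
    """Return (docx_path, txt_path) pairs for the Word documents in directory."""
    pairs = []
    for source in sorted(Path(directory).glob("*.docx")):
        # "~$name.docx" is Word's lock file, not a real document; skip it.
        if source.name.startswith("~$"):
            continue
        # with_suffix only swaps the final extension, so "a.docx.docx"
        # becomes "a.docx.txt" rather than "a.txt.txt".
        pairs.append((source, source.with_suffix(".txt")))
    return pairs


if __name__ == "__main__":
    for docx_path, txt_path in plan_conversions(Path.cwd()):
        print(docx_path.name, "->", txt_path.name)
```

Each pair can then go through the same Document(...) and open(..., "w", encoding="UTF-8") steps as in the script above.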


Origin blog.csdn.net/wb4916/article/details/132185618