Use Python to extract and filter sentences in Word documents

In Python programming, we often need to deal with text data. This post shows how to use Python to read Word documents and extract the sentences that contain specific keywords. It also shows how to filter the extracted sentences, for example by sentence length or by the occurrence of specific characters.

Preparation

Before starting, make sure you have the following library installed:

python-docx: A library for reading and manipulating Word documents.

You can install it via pip with the following command:

pip install python-docx

Extract sentences with specific keywords

First, let's see how to extract the sentences that contain a specific keyword from a Word document. We'll use the python-docx library to read the document content and a regular expression to split each paragraph into sentences. Here is the corresponding code:

import re
from docx import Document

def extract_sentences_with_keyword(docx_file, keyword):
    document = Document(docx_file)
    keyword_sentences = []

    for paragraph in document.paragraphs:
        sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', paragraph.text)
        for sentence in sentences:
            if keyword in sentence:
                keyword_sentences.append(sentence)

    return keyword_sentences

The above code defines a function named extract_sentences_with_keyword. It takes the path of a Word document and a keyword as input and returns a list of the sentences that contain that keyword.
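To see what the regular expression does without opening a Word file, here is a minimal sketch that applies the same pattern to a plain string. The helper name split_sentences and the sample text are made up for illustration; they are not part of the code above.

```python
import re

# Same pattern as in extract_sentences_with_keyword: split on whitespace that
# follows "." or "?", but not after abbreviations such as "Dr." or "e.g."
SENTENCE_BOUNDARY = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')

def split_sentences(text):
    """Hypothetical helper: split plain text into sentences."""
    return SENTENCE_BOUNDARY.split(text)

sample = "Dr. Li tested the dosage form. Is bioavailability high? It depends on the dosage form."
sentences = split_sentences(sample)

# Keep only the sentences that mention the keyword (case-sensitive match)
matches = [s for s in sentences if "dosage form" in s]

for s in matches:
    print(s)
```

Two things are worth noticing here: the pattern does not split after "!", and the keyword test with `in` is case-sensitive, so "Dosage form" would not match "dosage form".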

Filter sentences

Next, we can filter based on the length of the sentences and only keep sentences whose length is greater than a given threshold. Here is the corresponding code:

def filter_long_sentences(sentences, min_length):
    filtered_sentences = [sentence for sentence in sentences if len(sentence) > min_length]
    return filtered_sentences

The above code defines a function named filter_long_sentences. It takes a list of sentences and a minimum length as input and returns a list of the sentences whose length is greater than the minimum length.
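As a quick illustration of the behavior, the sketch below repeats the function and runs it on two made-up sentences. The sample strings and the threshold of 20 are assumptions chosen only for the demo.

```python
def filter_long_sentences(sentences, min_length):
    # Keep sentences with strictly more than min_length characters
    return [sentence for sentence in sentences if len(sentence) > min_length]

examples = [
    "Too short.",
    "This sentence is comfortably longer than twenty characters.",
]
kept = filter_long_sentences(examples, 20)
print(kept)
```

Note that len counts characters, not words, and that the comparison is strict: a sentence of exactly min_length characters is dropped.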

Batch process multiple documents

If you have multiple Word documents to process, you can use the following code to batch process multiple documents and save the results to a new Word document:

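import os
from docx import Document

# Note: this script also calls filter_long_sentences from the previous section,
# so that function must be defined in the same file.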

def extract_sentences_with_keywords(docx_file, keywords):
    document = Document(docx_file)
    keyword_sentences = []

    for paragraph in document.paragraphs:
        for sentence in paragraph.text.split('.'):
            for keyword in keywords:
                if keyword in sentence:
                    keyword_sentence = f"{keyword}: {sentence.strip()}"
                    keyword_sentences.append(keyword_sentence)
                    break

    return keyword_sentences

def filter_sentences_without_ABCD(sentences):
    filtered_sentences = []
    for sentence in sentences:
        if not all(char in sentence for char in 'ABCD'):
            filtered_sentences.append(sentence)
    return filtered_sentences

# Create a new Word document
output_doc = Document()

# Process all documents in the folder
folder_path = "data"  # Replace with your folder path
keywords = ["中药药剂学", "生物利用度", "剂型", "热原表面活性剂", "中成药", "溶出度", "返砂"]

for filename in os.listdir(folder_path):
    if filename.endswith(".docx"):
        file_path = os.path.join(folder_path, filename)
        sentences = extract_sentences_with_keywords(file_path, keywords)
        filtered_sentences = filter_long_sentences(sentences, 40)
        filtered_sentences = filter_sentences_without_ABCD(filtered_sentences)

        for sentence in filtered_sentences:
            output_doc.add_paragraph(sentence)

# Save the output Word document
output_path = "output.docx"  # Replace with your output file path
output_doc.save(output_path)

In the code above, we create a new document object named output_doc. We then loop through all the documents in the specified folder, extract the sentences that contain any of the keywords, and filter them by length and by characters. Finally, the sentences that satisfy the conditions are added to the new document one by one.
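The character filter is easy to misread, so here is a small sketch of what it actually keeps. It drops a sentence only when all four letters A, B, C and D appear in it, presumably to remove multiple-choice items; that reading of the intent is an assumption, and the sample sentences are invented for the demo.

```python
def filter_sentences_without_ABCD(sentences):
    # Same logic as in the batch script: drop a sentence only if it
    # contains all of the uppercase letters A, B, C and D
    return [s for s in sentences if not all(ch in s for ch in "ABCD")]

samples = [
    "Which is correct? A. tablet B. capsule C. pill D. powder",
    "Plan A and plan B were compared.",
    "The dissolution rate was measured.",
]
kept = filter_sentences_without_ABCD(samples)

for s in kept:
    print(s)
```

The second sentence survives even though it contains A and B, because C and D are missing. If you want to drop every sentence that contains any of the four letters, replace all with any.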

To use the above code, replace data with the path of the folder containing the Word documents to be processed, and replace output.docx with the path of the output Word document.
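When collecting the input files, it can help to skip the temporary lock files that Word creates while a document is open (their names start with "~$"), since they end in .docx but are not real documents. The following is a minimal sketch using pathlib; the helper name list_docx_files is hypothetical, and the demo runs in a throwaway folder.

```python
import tempfile
from pathlib import Path

def list_docx_files(folder):
    """Hypothetical helper: return .docx paths in folder, skipping Word lock files."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file()
        and p.suffix.lower() == ".docx"
        and not p.name.startswith("~$")
    )

# Demo: create a few empty files in a temporary folder and list them
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.docx", "a.docx", "~$a.docx", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = [p.name for p in list_docx_files(tmp)]

print(found)
```

Sorting the paths also makes the order of the output sentences reproducible, which os.listdir does not guarantee.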

After running the code, the sentences that meet the conditions will be saved in the output Word document.
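One caveat: the batch script splits paragraphs on the ASCII period, while the example keywords are Chinese, and Chinese text usually ends sentences with full-width punctuation such as "。". If your documents are written that way, a splitter along the following lines may work better. This is only a sketch under that assumption; the helper name split_cjk_sentences and the sample text are invented for illustration.

```python
import re

# Assumption: sentences end with the full-width marks 。！？
# A match is a run of non-terminator characters plus an optional terminator.
SENTENCE = re.compile(r'[^。！？]+[。！？]?')

def split_cjk_sentences(text):
    """Hypothetical helper: split text on full-width sentence punctuation."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]

# Sample text: three short sentences, the last one without a terminator
text = "剂型会影响生物利用度。溶出度如何测定？中成药很常见"
parts = split_cjk_sentences(text)

# Keyword matching works exactly as before
with_keyword = [s for s in parts if "剂型" in s]
print(len(parts), with_keyword)
```

To try it in the batch script, you could call this helper on paragraph.text instead of paragraph.text.split('.').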

I hope this post is helpful to you! These are the detailed steps and code examples for using Python to extract and filter sentences in Word documents. With these techniques, you can process and analyze text data more easily. Happy programming!