Extracting and Filtering Sentences from Word Documents with Python

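Text processing is a common task in Python. This post shows how to read a Word document with Python and extract the sentences that contain specific keywords. It also shows how to filter the extracted sentences, for example by sentence length or by the presence of particular characters.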

Prerequisites

Before you start, make sure the following library is installed:

python-docx: a library for reading and manipulating Word documents.

You can install it with pip:

pip install python-docx

Extracting Sentences That Contain a Keyword

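First, let's look at how to extract the sentences in a Word document that contain a given keyword. We use python-docx to read the document and a regular expression to split each paragraph into sentences. Here is the code: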

import re
from docx import Document

def extract_sentences_with_keyword(docx_file, keyword):
    """Return every sentence in the Word document that contains keyword."""
    document = Document(docx_file)
    keyword_sentences = []

    for paragraph in document.paragraphs:
        # Split on whitespace that follows '.' or '?', skipping abbreviations such as "e.g." and "Mr."
        sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', paragraph.text)
        for sentence in sentences:
            if keyword in sentence:
                keyword_sentences.append(sentence)

    return keyword_sentences

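The code above defines a function named extract_sentences_with_keyword. It takes the path of a Word document and a keyword, and returns a list of the sentences that contain the keyword. Note that the regular expression only splits on English punctuation ('.' or '?') followed by whitespace.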
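
Because the pattern above targets English punctuation, Chinese text ending in "。" would come back as one long sentence. Here is a minimal sketch of a splitter for Chinese punctuation, assuming sentences end with 。, ！ or ？. The name split_chinese_sentences is hypothetical and not part of the original code, and the sample text is made up for illustration:

```python
import re

def split_chinese_sentences(text):
    """Split text after each Chinese sentence-ending mark (。！？), keeping the mark."""
    # Each match is a run of non-terminator characters plus one optional terminator
    parts = re.findall(r'[^。！？]+[。！？]?', text)
    return [part.strip() for part in parts if part.strip()]

sample = "剂型是药物的应用形式。什么是生物利用度？溶出度很重要！"
for sentence in split_chinese_sentences(sample):
    print(sentence)
```

A function like this could stand in for the re.split call when the documents are written in Chinese; the keyword check itself would stay the same.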

Filtering Sentences

Next, we can filter the sentences by length, keeping only those longer than a given threshold. Here is the code:

def filter_long_sentences(sentences, min_length):
    # Keep only sentences with more than min_length characters
    filtered_sentences = [sentence for sentence in sentences if len(sentence) > min_length]
    return filtered_sentences

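The code above defines a function named filter_long_sentences. It takes a list of sentences and a minimum length, and returns the sentences whose length in characters is strictly greater than that minimum.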
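
To make the boundary behavior concrete, here is a small sketch that repeats the function and runs it on made-up sample sentences. Note that len counts characters, not words, and that a sentence whose length equals min_length is dropped because the comparison is strict:

```python
def filter_long_sentences(sentences, min_length):
    # Same logic as above: keep sentences strictly longer than min_length characters
    return [sentence for sentence in sentences if len(sentence) > min_length]

# "Eleven here" is exactly 11 characters long, so it does not pass a threshold of 11
samples = ["Short.", "Eleven here", "This sentence is clearly longer."]
print(filter_long_sentences(samples, 11))
# prints ['This sentence is clearly longer.']
```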

Batch Processing Multiple Documents

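If you have several Word documents to process, the following code handles them in one pass and saves the results to a new Word document. It reuses filter_long_sentences from the previous section: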

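import os
from docx import Document

def extract_sentences_with_keywords(docx_file, keywords):
    """Return sentences containing any keyword, each prefixed with the first keyword that matched."""
    document = Document(docx_file)
    keyword_sentences = []

    for paragraph in document.paragraphs:
        # Naive split on the ASCII period only; the Chinese full stop "。" is not treated as a boundary
        for sentence in paragraph.text.split('.'):
            for keyword in keywords:
                if keyword in sentence:
                    # Label the sentence with the matching keyword
                    keyword_sentence = f"{keyword}: {sentence.strip()}"
                    keyword_sentences.append(keyword_sentence)
                    break  # record each sentence once, under its first matching keyword

    return keyword_sentences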

def filter_sentences_without_ABCD(sentences):
    # Drop a sentence only if it contains all four letters A, B, C and D; keep the rest
    filtered_sentences = []
    for sentence in sentences:
        if not all(char in sentence for char in 'ABCD'):
            filtered_sentences.append(sentence)
    return filtered_sentences

# Create a new Word document
output_doc = Document()

# Process every document in the folder
folder_path = "data"  # replace with your folder path
keywords = ["中药药剂学", "生物利用度", "剂型", "热原表面活性剂", "中成药", "溶出度", "返砂"]

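for filename in sorted(os.listdir(folder_path)):
    # Skip Word's temporary lock files (named "~$..."), which are not valid documents
    if filename.endswith(".docx") and not filename.startswith("~$"):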
        file_path = os.path.join(folder_path, filename)
        sentences = extract_sentences_with_keywords(file_path, keywords)
        filtered_sentences = filter_long_sentences(sentences, 40)
        filtered_sentences = filter_sentences_without_ABCD(filtered_sentences)

        for sentence in filtered_sentences:
            output_doc.add_paragraph(sentence)

# Save the output Word document
output_path = "output.docx"  # replace with your output file path
output_doc.save(output_path)

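In the code above, we create a new document object named output_doc. We then loop over every document in the specified folder, extract the sentences that contain any of the keywords, and apply the length filter and the character filter. Finally, each sentence that passes is added to the new document as its own paragraph.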
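
The character filter is easy to misread, so here is a sketch of its behavior on made-up strings. It drops a sentence only when all four letters A, B, C and D appear in it (presumably to discard multiple-choice items, though that reading is an assumption) and keeps everything else:

```python
def filter_sentences_without_ABCD(sentences):
    # Keep a sentence unless it contains every one of the letters A, B, C and D
    return [s for s in sentences if not all(char in s for char in "ABCD")]

# Hypothetical sample sentences, invented for illustration
samples = [
    "剂型: A. 片剂 B. 胶囊 C. 颗粒 D. 丸剂",  # contains A, B, C and D: dropped
    "剂型: 方案A与方案B的比较",              # only A and B: kept
    "溶出度: 不含任何字母的句子",            # none of the letters: kept
]
for sentence in filter_sentences_without_ABCD(samples):
    print(sentence)
```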

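To use the code, replace data with the path of the folder containing the Word documents to process, and replace output.docx with the path of the output Word document.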
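
If you prefer pathlib for handling the folder path, a minimal sketch of the file-listing step could look like this. The helper name list_docx_files is hypothetical; it also skips names starting with "~$", which Word uses for temporary lock files while a document is open:

```python
from pathlib import Path

def list_docx_files(folder_path):
    """Return the .docx files in folder_path, sorted by name, skipping "~$" lock files."""
    folder = Path(folder_path)
    return sorted(
        path for path in folder.glob("*.docx")
        if not path.name.startswith("~$")
    )

if __name__ == "__main__":
    for path in list_docx_files("data"):  # "data" is the example folder used in this post
        print(path)
```

Each returned path can be passed to the extraction function in place of the os.path.join result.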

After the code runs, the sentences that meet the criteria are saved in the output Word document.

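I hope this post was helpful! That covers the steps and code for extracting and filtering sentences from Word documents with Python. With these techniques you can process and analyze text data more easily. Happy coding!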