[Python Treasure Box] From Word to Markdown: The Secret of Python Document Processing

Document Magic: The Python Document Processing Library Manual

Preface

In an era of information overload, document processing is an indispensable skill. This article explores eight document processing tools that can be used from Python, from handling Word and PDF files to generating Excel and Markdown output. Whether for office automation or data analysis, these tools are useful additions to your toolbox.

1. Python-docx

1.1 Introduction to Python-docx

Python-docx is a Python library for processing Word documents. It provides rich functions that allow users to create, modify and manipulate .docx files. Here's a quick rundown of some basic features:

1.1.1 Basic functions

Python-docx can create a new Word document and add text, paragraphs, headings, and more to it. Here's a simple example:

from docx import Document

# Create a new Word document
doc = Document()

# Add a heading
doc.add_heading('Hello, Python-docx!', level=1)

# Add a paragraph
doc.add_paragraph('This is a simple paragraph.')

# Save the document
doc.save('example.docx')

1.1.2 Paragraph processing

Python-docx allows users to perform various operations on paragraphs, including adding text, setting styles, etc. The following example demonstrates how to create a paragraph containing multiple styles:

from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT

doc = Document()

# Add the paragraph that will be styled
paragraph = doc.add_paragraph('This is a styled paragraph.')

# Set the font size on the paragraph's first run
run = paragraph.runs[0]
run.font.size = Pt(12)

# Center the paragraph
paragraph.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER

doc.save('styled_paragraph.docx')

1.1.3 Style and formatting

Python-docx supports rich styles and formatting options, and users can customize the appearance of text, paragraphs, and the entire document. The following example demonstrates how to add styled text:

from docx import Document
from docx.shared import Pt, RGBColor

doc = Document()

# Add a run of styled text
paragraph = doc.add_paragraph()
run = paragraph.add_run('This is styled text.')
run.bold = True
run.italic = True
run.font.size = Pt(14)
run.font.color.rgb = RGBColor(255, 0, 0)  # red; the setter expects an RGBColor

doc.save('styled_text.docx')

1.2 Advanced usage

1.2.1 Table operations

In addition to basic text and paragraph operations, Python-docx also provides powerful table processing functions. Users can create tables, add data, and set styles. Here is a simple example:

from docx import Document

# Create a new Word document
doc = Document()

# Table data: one header row plus three data rows
data = [
    ['Header 1', 'Header 2', 'Header 3'],
    [1, 'A', 'X'],
    [2, 'B', 'Y'],
    [3, 'C', 'Z'],
]

# Size the table from the data so that every row has cells to fill
table = doc.add_table(rows=len(data), cols=len(data[0]))

# Fill in the cells (cell text must be a string)
for row_num, row_data in enumerate(data):
    for col_num, cell_data in enumerate(row_data):
        table.cell(row_num, col_num).text = str(cell_data)

# Save the document
doc.save('table_example.docx')

1.2.2 Image insertion

Python-docx also supports inserting images into documents. Here's a simple example demonstrating how to insert an image:

from docx import Document
from docx.shared import Inches

doc = Document()

# Insert an image (setting both width and height can distort the aspect ratio)
doc.add_picture('image.jpg', width=Inches(2.0), height=Inches(1.5))

# Save the document
doc.save('image_insert_example.docx')

1.2.3 Page settings

Users can set the page attributes of the document through Python-docx, including page margins, paper size, etc. Here's a simple page setup example:

from docx import Document
from docx.shared import Inches

doc = Document()

# Set the page margins on every section
for section in doc.sections:
    section.left_margin = Inches(1.0)
    section.right_margin = Inches(1.0)
    section.top_margin = Inches(1.0)
    section.bottom_margin = Inches(1.0)

# Set the paper size (US Letter, 8.5 x 11 inches)
doc.sections[0].page_width = Inches(8.5)
doc.sections[0].page_height = Inches(11)

# Save the document
doc.save('page_setup_example.docx')

The above are some advanced uses of the Python-docx library, which provides richer functions, including table operations, image insertion and page settings. For detailed usage and more features, please refer to the official documentation.

2. PyPDF2

2.1 PyPDF2 Overview

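PyPDF2 is a Python library for processing PDF files. It provides functions for reading, merging, splitting and modifying PDF documents. Note that PyPDF2 3.0 dropped the older camelCase names (PdfFileReader, getPage, numPages) in favor of PdfReader, reader.pages and len(reader.pages). Here's a quick rundown of some basic features: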

2.1.1 Reading and extracting text

PyPDF2 allows users to read PDF documents and extract text content. The following example demonstrates how to read a PDF document and print the text:

from PyPDF2 import PdfReader

# Open the PDF file
with open('example.pdf', 'rb') as file:
    pdf_reader = PdfReader(file)

    # Get the number of pages in the PDF
    num_pages = len(pdf_reader.pages)

    # Read the text of each page
    for page_num in range(num_pages):
        page = pdf_reader.pages[page_num]
        text = page.extract_text()
        print(f'Page {page_num + 1}:\n{text}\n')

2.1.2 Merge and split PDFs

PyPDF2 allows users to merge multiple PDF files into one file, or split a PDF file into multiple files. The example below demonstrates how to merge two PDF files:

from PyPDF2 import PdfReader, PdfWriter

# Open the PDF files to merge
with open('file1.pdf', 'rb') as file1, open('file2.pdf', 'rb') as file2:
    pdf1_reader = PdfReader(file1)
    pdf2_reader = PdfReader(file2)

    # Create a new PDF writer object
    pdf_writer = PdfWriter()

    # Add every page of both files to the writer, in order
    for page in pdf1_reader.pages:
        pdf_writer.add_page(page)

    for page in pdf2_reader.pages:
        pdf_writer.add_page(page)

    # Save the merged content to a new file
    with open('merged_file.pdf', 'wb') as merged_file:
        pdf_writer.write(merged_file)
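
The section above mentions splitting but only demonstrates merging. The following is a minimal sketch of the reverse operation, assuming PyPDF2 2.x or later (PdfReader, PdfWriter, add_page). The helper names page_ranges and split_pdf and the split_part file prefix are illustrative choices, not part of the library:

```python
def page_ranges(num_pages, pages_per_file):
    """Return (start, stop) page index pairs that cover num_pages in chunks."""
    if pages_per_file < 1:
        raise ValueError('pages_per_file must be at least 1')
    return [(start, min(start + pages_per_file, num_pages))
            for start in range(0, num_pages, pages_per_file)]


def split_pdf(input_path, pages_per_file=1, prefix='split_part'):
    """Write each chunk of pages to its own PDF and return the file names."""
    # Imported here so page_ranges() stays usable without PyPDF2 installed
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(input_path)
    output_paths = []
    ranges = page_ranges(len(reader.pages), pages_per_file)
    for part, (start, stop) in enumerate(ranges, start=1):
        # One writer per output file, filled with the pages of this chunk
        writer = PdfWriter()
        for page_num in range(start, stop):
            writer.add_page(reader.pages[page_num])
        output_path = f'{prefix}_{part}.pdf'
        with open(output_path, 'wb') as output_file:
            writer.write(output_file)
        output_paths.append(output_path)
    return output_paths


# A 5-page document split into files of at most 2 pages each
print(page_ranges(5, 2))  # [(0, 2), (2, 4), (4, 5)]
```

Calling split_pdf('example.pdf', pages_per_file=2) would then write split_part_1.pdf, split_part_2.pdf and so on next to the script.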
2.1.3 Annotations and watermarks

PyPDF2 allows users to add comments and watermarks to PDF documents. The following example demonstrates how to add text annotations and watermarks to PDF documents:

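from PyPDF2 import PdfReader, PdfWriter
from PyPDF2.generic import AnnotationBuilder  # available in recent PyPDF2 releases such as 3.0

# Open the PDF file and the watermark
# (watermark.pdf is an assumed file name for a one-page PDF holding the watermark)
with open('example.pdf', 'rb') as file, open('watermark.pdf', 'rb') as watermark_file:
    pdf_reader = PdfReader(file)
    watermark = PdfReader(watermark_file).pages[0]

    # Create a new PDF writer object
    pdf_writer = PdfWriter()

    # Watermark each page, add it to the writer, then annotate it
    for page_num, page in enumerate(pdf_reader.pages):
        # Add the watermark
        page.merge_page(watermark)
        pdf_writer.add_page(page)

        # Add a text annotation; rect is (x1, y1, x2, y2) in points
        annotation = AnnotationBuilder.free_text('This is a comment.', rect=(100, 100, 300, 130))
        pdf_writer.add_annotation(page_number=page_num, annotation=annotation)

    # Save the annotated and watermarked content to a new file
    with open('annotated_file.pdf', 'wb') as annotated_file:
        pdf_writer.write(annotated_file)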

2.2 Advanced usage

2.2.1 Document encryption and decryption

PyPDF2 allows users to encrypt and decrypt PDF documents to enhance document security. Here is a simple example:

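from PyPDF2 import PdfReader, PdfWriter

# Open the PDF file
with open('example.pdf', 'rb') as file:
    pdf_reader = PdfReader(file)

    # Create a new PDF writer object
    pdf_writer = PdfWriter()

    # Copy every page into the writer
    for page in pdf_reader.pages:
        pdf_writer.add_page(page)

    # Encrypt with a user password and an owner password
    pdf_writer.encrypt('password', 'owner_password')

    # Save the encrypted content to a new file
    with open('encrypted_file.pdf', 'wb') as encrypted_file:
        pdf_writer.write(encrypted_file)

# Decrypt: reopen the encrypted file and unlock it with the user password
encrypted_reader = PdfReader('encrypted_file.pdf')
if encrypted_reader.is_encrypted:
    encrypted_reader.decrypt('password')
print(len(encrypted_reader.pages))

2.2.2 Rotate page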

PyPDF2 allows users to rotate pages in PDF documents to suit specific layout needs. Here is a simple example:

from PyPDF2 import PdfReader, PdfWriter

# Open the PDF file
with open('example.pdf', 'rb') as file:
    pdf_reader = PdfReader(file)

    # Create a new PDF writer object
    pdf_writer = PdfWriter()

    # Rotate each page and add it to the writer
    for page in pdf_reader.pages:
        page.rotate(90)  # rotate 90 degrees clockwise
        pdf_writer.add_page(page)

    # Save the rotated content to a new file
    with open('rotated_file.pdf', 'wb') as rotated_file:
        pdf_writer.write(rotated_file)

2.2.3 Extracting images

PyPDF2 also provides functions for extracting images from PDF documents. Here is a simple example:

from PyPDF2 import PdfReader

# Open the PDF file
with open('example.pdf', 'rb') as file:
    pdf_reader = PdfReader(file)

    # Get the first page
    page = pdf_reader.pages[0]

    # page.images lists the images embedded in the page (recent PyPDF2 releases)
    for image_num, image in enumerate(page.images):
        # image.name includes the file extension; image.data holds the raw bytes
        with open(f'image_{image_num}_{image.name}', 'wb') as image_file:
            image_file.write(image.data)

The above are some advanced uses of the PyPDF2 library, including document encryption, page rotation, image extraction and other functions. For detailed usage and more features, please refer to the official documentation.

3. ReportLab

3.1 Introduction to ReportLab

ReportLab is a Python library for generating PDF documents. It provides flexible functionality that allows users to create complex PDF files containing text, images, and charts. Here's a quick rundown of some basic features:

3.1.1 Generate PDF document

ReportLab allows users to create new PDF documents and add content to them. The following example demonstrates how to create a simple PDF document:

from reportlab.pdfgen import canvas

# Create the PDF document
pdf_file_path = 'example.pdf'
pdf_canvas = canvas.Canvas(pdf_file_path)

# Add text to the PDF
pdf_canvas.drawString(100, 750, 'Hello, ReportLab!')

# Save the PDF document
pdf_canvas.save()

3.1.2 Custom page layout

ReportLab allows users to customize the layout of the page, including size, margins and orientation. The following example demonstrates how to create a PDF document with a custom page layout:

from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

# Create the PDF document with the letter page size
pdf_file_path = 'custom_layout.pdf'
pdf_canvas = canvas.Canvas(pdf_file_path, pagesize=letter)

# Canvas has no margin setters, so margins are applied as drawing offsets
# (72 points = 1 inch)
left_margin = 72
top_margin = 72

# Add text at the top-left corner of the margin box
# (the origin is the bottom-left corner of the page)
pdf_canvas.drawString(left_margin, letter[1] - top_margin, 'Custom Layout Example')

# Save the PDF document
pdf_canvas.save()
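
Canvas coordinates are measured in points (72 per inch) from the bottom-left corner of the page, so margins and orientation come down to a little arithmetic. The sketch below works this out in plain Python without importing ReportLab; the names text_origin and landscape_size and the 14-point line spacing are assumptions made for illustration:

```python
POINTS_PER_INCH = 72.0

# US Letter is 8.5 x 11 inches, i.e. 612 x 792 points
LETTER = (8.5 * POINTS_PER_INCH, 11 * POINTS_PER_INCH)


def text_origin(page_size, left_margin, top_margin, line_index=0, leading=14.0):
    """Return the (x, y) drawing position of a text line inside the margins."""
    page_width, page_height = page_size
    x = left_margin
    # y grows upward, so lines further down the page have smaller y values
    y = page_height - top_margin - line_index * leading
    return x, y


def landscape_size(page_size):
    """Return the page size with the longer side as the width."""
    width, height = page_size
    return (max(width, height), min(width, height))


print(text_origin(LETTER, 72, 72))                # (72, 720.0)
print(text_origin(LETTER, 72, 72, line_index=2))  # (72, 692.0)
print(landscape_size(LETTER))                     # (792.0, 612.0)
```

The resulting coordinates can be passed straight to drawString, and the landscape size to the pagesize argument of canvas.Canvas.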
3.1.3 Add graphics and charts

ReportLab allows users to add a variety of graphics and charts to PDF documents, including lines, rectangles, circles, and charts. Here is a simple example:

from reportlab.pdfgen import canvas

# Create the PDF document
pdf_file_path = 'graphics_and_charts.pdf'
pdf_canvas = canvas.Canvas(pdf_file_path)

# Draw a line
pdf_canvas.line(100, 750, 400, 750)

# Draw a filled rectangle
pdf_canvas.rect(100, 700, 300, 50, fill=True)

# Draw a filled circle
pdf_canvas.circle(250, 650, 25, fill=True)

# Save the PDF document
pdf_canvas.save()

3.2 Advanced usage

3.2.1 Image insertion

In addition to basic text and graphics operations, ReportLab also provides the function of inserting images. Here is a simple example demonstrating how to insert a picture into a PDF document:

from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from reportlab.lib import utils

# Create the PDF document with the letter page size
pdf_file_path = 'image_insert.pdf'
pdf_canvas = canvas.Canvas(pdf_file_path, pagesize=letter)

# Path of the image to insert
image_path = 'image.jpg'

# Get the image size (pixels, treated here as points)
img_width, img_height = utils.ImageReader(image_path).getSize()

# Compute the position that centers the image on the page
x_pos = (letter[0] - img_width) / 2
y_pos = (letter[1] - img_height) / 2

# Insert the image
pdf_canvas.drawInlineImage(image_path, x_pos, y_pos, width=img_width, height=img_height)

# Save the PDF document
pdf_canvas.save()
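
The centering arithmetic above yields negative coordinates when the image is larger than the page. One way to handle that, sketched here in plain Python, is to scale the image down before centering it. The function name fit_and_center and its margin parameter are hypothetical helpers, not ReportLab API:

```python
def fit_and_center(page_size, image_size, margin=0.0):
    """Return (x, y, width, height) for an image scaled to fit and centered."""
    page_width, page_height = page_size
    image_width, image_height = image_size

    # Space left for the image once the margin is removed on every side
    available_width = page_width - 2 * margin
    available_height = page_height - 2 * margin

    # Shrink only when needed and keep the aspect ratio; never enlarge
    scale = min(1.0, available_width / image_width, available_height / image_height)
    width = image_width * scale
    height = image_height * scale

    # Center the scaled image on the full page
    x = (page_width - width) / 2
    y = (page_height - height) / 2
    return x, y, width, height


# An image twice as wide as a letter page is halved and centered vertically
print(fit_and_center((612, 792), (1224, 792)))  # (0.0, 198.0, 612.0, 396.0)
```

The four returned values map directly onto the x, y, width and height arguments of drawInlineImage.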
3.2.2 Add table

ReportLab allows users to add tables to PDF documents and supports custom table styles. Here is a simple example:

from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from reportlab.lib import colors
from reportlab.platypus import Table, TableStyle

# Create the PDF document with the letter page size
pdf_file_path = 'table_example.pdf'
pdf_canvas = canvas.Canvas(pdf_file_path, pagesize=letter)

# Define the table data
data = [['Name', 'Age', 'Country'],
        ['John', 30, 'USA'],
        ['Emma', 25, 'Canada'],
        ['Mike', 35, 'UK']]

# Create the table object
table = Table(data)

# Define the table style
style = TableStyle([('BACKGROUND', (0, 0), (-1, 0), colors.grey),
                    ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
                    ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
                    ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
                    ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
                    ('BACKGROUND', (0, 1), (-1, -1), colors.beige),
                    ('GRID', (0, 0), (-1, -1), 1, colors.black)])

table.setStyle(style)

# Lay the table out within the available space, then draw it on the canvas
table.wrapOn(pdf_canvas, 400, 600)
table.drawOn(pdf_canvas, 100, 500)

# Save the PDF document
pdf_canvas.save()

3.2.3 Using templates

ReportLab supports the use of templates to implement repetitive structures in documents. Here's a simple example using templates:

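from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph, PageBreak

def create_template(canvas, doc):
    # Draw the repeated content; called once for every page
    canvas.saveState()
    canvas.setFont('Helvetica', 10)
    canvas.drawString(100, 50, 'This is a template')
    canvas.restoreState()

# Create the PDF document with the letter page size
doc = SimpleDocTemplate('template_example.pdf', pagesize=letter)

# Two pages of content to show the template repeating
styles = getSampleStyleSheet()
story = [Paragraph('Page one', styles['Normal']),
         PageBreak(),
         Paragraph('Page two', styles['Normal'])]

# Apply the template to the first page and to all later pages, then save
doc.build(story, onFirstPage=create_template, onLaterPages=create_template)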

The above are some advanced uses of the ReportLab library, including functions such as picture insertion, table addition and use of templates. For detailed usage and more features, please refer to the official documentation.

4. XlsxWriter

4.1 XlsxWriter Overview

XlsxWriter is a Python library for creating Excel files (.xlsx). It writes new files only and cannot read or modify existing ones. It provides rich functionality that allows users to generate Excel documents with formatting, charts, and formulas. Here's a quick rundown of some basic features:

4.1.1 Create Excel file

XlsxWriter allows users to create new Excel files and add content to them. The following example demonstrates how to create an Excel file containing data and charts:

import xlsxwriter

# Create the Excel file
excel_file_path = 'example.xlsx'
workbook = xlsxwriter.Workbook(excel_file_path)

# Add a worksheet (named Sheet1 by default)
worksheet = workbook.add_worksheet()

# Write the data
data = [
    ['Name', 'Age', 'Country'],
    ['John', 30, 'USA'],
    ['Emma', 25, 'Canada'],
    ['Mike', 35, 'UK'],
]

for row_num, row_data in enumerate(data):
    for col_num, cell_data in enumerate(row_data):
        worksheet.write(row_num, col_num, cell_data)

# Add a column chart of the Age column
chart = workbook.add_chart({'type': 'column'})
chart.add_series({'values': '=Sheet1!$B$2:$B$4'})

worksheet.insert_chart('E2', chart)

# Close the workbook to write the file to disk
workbook.close()
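
The example mixes two addressing styles: worksheet.write takes zero-indexed (row, column) numbers, while the chart series uses an absolute A1-style range. The sketch below shows how the two relate in plain Python. The helper names column_letters and absolute_range are made up for this illustration; XlsxWriter ships comparable helpers of its own in xlsxwriter.utility, such as xl_rowcol_to_cell.

```python
def column_letters(col):
    """Convert a zero-indexed column number to Excel letters (0 -> A, 26 -> AA)."""
    if col < 0:
        raise ValueError('column index must be non-negative')
    letters = ''
    col += 1  # switch to 1-based so the base-26 conversion has no zero digit
    while col > 0:
        col, remainder = divmod(col - 1, 26)
        letters = chr(ord('A') + remainder) + letters
    return letters


def absolute_range(sheet, first_row, first_col, last_row, last_col):
    """Build an absolute range formula from zero-indexed cell coordinates."""
    start = f'${column_letters(first_col)}${first_row + 1}'
    end = f'${column_letters(last_col)}${last_row + 1}'
    return f'={sheet}!{start}:{end}'


# Rows 1 to 3 of column 1 (zero-indexed) are the Age values written above
print(absolute_range('Sheet1', 1, 1, 3, 1))  # =Sheet1!$B$2:$B$4
```

Building the range from the same indices used in the write loop keeps the chart in step with the data when rows are added.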
4.1.2 Processing Worksheets

XlsxWriter allows users to work with worksheets, including inserting charts, formatting and adding data validation. Here is a simple example:

import xlsxwriter

# Create the Excel file
excel_file_path = 'worksheet_handling.xlsx'
workbook = xlsxwriter.Workbook(excel_file_path)
worksheet = workbook.add_worksheet()

# Write the values the chart will plot
worksheet.write_column('B2', [10, 20, 30])

# Insert a line chart
chart = workbook.add_chart({'type': 'line'})
chart.add_series({'values': '=Sheet1!$B$2:$B$4'})
worksheet.insert_chart('E2', chart)

# Apply a cell format
format_bold = workbook.add_format({'bold': True})
worksheet.write('A1', 'Bold Text', format_bold)

# Add data validation: C1 only accepts whole numbers from 1 to 10
worksheet.data_validation('C1', {'validate': 'integer',
                                 'criteria': 'between',
                                 'minimum': 1,
                                 'maximum': 10})

# Close the workbook to write the file to disk
workbook.close()

4.1.3 Cell formatting and styling

XlsxWriter allows users to format and style cells, including fonts, colors and borders. Here is a simple example:

import xlsxwriter

# Create the Excel file
excel_file_path = 'cell_formatting.xlsx'
workbook = xlsxwriter.Workbook(excel_file_path)
worksheet = workbook.add_worksheet()

# Apply bold and italic formats
format_bold = workbook.add_format({'bold': True})
format_italic = workbook.add_format({'italic': True})
worksheet.write('A1', 'Bold Text', format_bold)
worksheet.write('A2', 'Italic Text', format_italic)

# Set a cell background color
format_bg_color = workbook.add_format({'bg_color': 'yellow'})
worksheet.write('B1', 'Yellow Background', format_bg_color)

# Set a cell border
format_border = workbook.add_format({'border': 1})
worksheet.write('B2', 'Cell with Border', format_border)

# Close the workbook to write the file to disk
workbook.close()

4.2 Advanced usage

4.2.1 Working with charts

XlsxWriter allows users to create and customize various charts, including bar charts, line charts, etc. Here is a simple example:

import xlsxwriter

# Create the Excel file
excel_file_path = 'chart_advanced.xlsx'
workbook = xlsxwriter.Workbook(excel_file_path)
worksheet = workbook.add_worksheet()

# Write the data
data = [
    ['Month', 'Sales'],
    ['January', 150],
    ['February', 200],
    ['March', 180],
]

for row_num, row_data in enumerate(data):
    for col_num, cell_data in enumerate(row_data):
        worksheet.write(row_num, col_num, cell_data)

# Create a bar chart
chart = workbook.add_chart({'type': 'bar'})

# Configure the chart's data series
chart.add_series({
    'categories': '=Sheet1!$A$2:$A$4',
    'values': '=Sheet1!$B$2:$B$4',
    'name': 'Sales',
})

# Set the chart title
chart.set_title({'name': 'Monthly Sales'})

# Label the X and Y axes
chart.set_x_axis({'name': 'Month'})
chart.set_y_axis({'name': 'Sales'})

# Insert the chart
worksheet.insert_chart('E2', chart)

# Close the workbook to write the file to disk
workbook.close()

4.2.2 Freezing panes

XlsxWriter allows users to freeze panes in Excel files so that certain rows or columns are always visible when scrolling. Here is a simple example:

import xlsxwriter

# Create the Excel file
excel_file_path = 'freeze_panes.xlsx'
workbook = xlsxwriter.Workbook(excel_file_path)
worksheet = workbook.add_worksheet()

# Freeze the top two rows
worksheet.freeze_panes(2, 0)

# Write the data
data = [
    ['Header 1', 'Header 2', 'Header 3'],
    [1, 'A', 'X'],
    [2, 'B', 'Y'],
    [3, 'C', 'Z'],
]

for row_num, row_data in enumerate(data):
    for col_num, cell_data in enumerate(row_data):
        worksheet.write(row_num, col_num, cell_data)

# Close the workbook to write the file to disk
workbook.close()

4.2.3 Using formulas

XlsxWriter allows users to use formulas in Excel files. Here is a simple example:

import xlsxwriter

# Create the Excel file
excel_file_path = 'formulas.xlsx'
workbook = xlsxwriter.Workbook(excel_file_path)
worksheet = workbook.add_worksheet()

# Write the data (the Sum column is filled in by formulas below)
data = [
    ['Number 1', 'Number 2', 'Sum'],
    [10, 20, None],
    [30, 40, None],
]

for row_num, row_data in enumerate(data):
    for col_num, cell_data in enumerate(row_data):
        worksheet.write(row_num, col_num, cell_data)

# Add the SUM formulas
worksheet.write_formula('C2', '=SUM(A2:B2)')
worksheet.write_formula('C3', '=SUM(A3:B3)')

# Close the workbook to write the file to disk
workbook.close()

The above are some advanced uses of the XlsxWriter library, including functions such as processing charts, freezing panes, and using formulas. For detailed usage and more features, please refer to the official documentation.

5. Pandoc

5.1 Understanding Pandoc

Pandoc is a document format conversion tool that can convert documents from one markup language to another. Here’s a quick rundown of some of Pandoc’s basic features:

5.1.1 Markup conversion

Pandoc supports multiple markup languages, including Markdown, HTML, LaTeX, etc., and users can convert documents from one format to another. Here is an example:

pandoc input.md -o output.pdf

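This command converts a Markdown document to PDF. Pandoc produces PDF through an external engine (LaTeX by default), so one must be installed for the command to work.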

5.1.2 Supported formats

Pandoc supports many input and output formats, and users can choose the appropriate format for conversion according to their needs. Here is an example listing some of the supported formats:

pandoc --list-input-formats
pandoc --list-output-formats

5.2 Integration with Python

Pandoc can be driven from Python: a script can call the pandoc executable to convert document formats. Here is an example of calling Pandoc using the subprocess module:

import subprocess

input_file = 'input.md'
output_file = 'output.pdf'

# check=True raises CalledProcessError if pandoc exits with an error
subprocess.run(['pandoc', input_file, '-o', output_file], check=True)

5.2.1 Execute Pandoc command

In Python, you can use the subprocess module to execute Pandoc commands from within a script and so automate document processing.
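
A slightly more defensive version of that call might look like the sketch below. It relies only on the standard library; the function names build_pandoc_command and run_pandoc are illustrative, and the error handling is one possible choice rather than a recommended pattern from the Pandoc documentation:

```python
import shutil
import subprocess


def build_pandoc_command(input_file, output_file):
    """Return the argument list for a basic pandoc conversion."""
    return ['pandoc', str(input_file), '-o', str(output_file)]


def run_pandoc(input_file, output_file):
    """Run pandoc and raise RuntimeError with its message if it fails."""
    # Fail early with a clear message when pandoc is not installed
    if shutil.which('pandoc') is None:
        raise RuntimeError('pandoc was not found on PATH')

    command = build_pandoc_command(input_file, output_file)
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f'pandoc failed: {result.stderr.strip()}')
    return output_file


print(build_pandoc_command('input.md', 'output.html'))
# ['pandoc', 'input.md', '-o', 'output.html']
```

Passing the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.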

5.2.2 Custom conversion options

Users can customize the conversion process by adding options to the Pandoc command, such as specifying output formats, style files, etc.
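
As a rough illustration, the options can be assembled in Python before the command is run. The sketch below uses a handful of real Pandoc flags (--from, --to, --standalone, --toc, --css, --template); the function name pandoc_args, its keyword parameters and the style.css file are assumptions made for the example:

```python
def pandoc_args(input_file, output_file, *, from_format=None, to_format=None,
                standalone=False, toc=False, css=None, template=None):
    """Assemble a pandoc argument list from optional conversion settings."""
    args = ['pandoc', str(input_file), '-o', str(output_file)]
    if from_format:
        args += ['--from', from_format]      # input format, e.g. markdown
    if to_format:
        args += ['--to', to_format]          # output format, e.g. html
    if standalone:
        args.append('--standalone')          # full document, not a fragment
    if toc:
        args.append('--toc')                 # include a table of contents
    if css:
        args += ['--css', str(css)]          # stylesheet for HTML output
    if template:
        args += ['--template', str(template)]
    return args


print(pandoc_args('input.md', 'output.html', to_format='html',
                  standalone=True, toc=True, css='style.css'))
```

The resulting list can be handed to subprocess.run in the same way as the basic command.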

5.2.3 Automation and scripting

Combining Python and Pandoc, users can automate and script document processing, thereby improving work efficiency.
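
For instance, a script could convert every Markdown file in a folder in one pass. The following is a minimal sketch using only the standard library; the function names plan_conversions and convert_all, and the default .html extension, are hypothetical:

```python
import shutil
import subprocess
from pathlib import Path


def plan_conversions(source_dir, output_dir, extension='.html'):
    """Pair each .md file in source_dir with its target path in output_dir."""
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    return [(source, output_dir / (source.stem + extension))
            for source in sorted(source_dir.glob('*.md'))]


def convert_all(source_dir, output_dir, extension='.html'):
    """Convert every Markdown file with pandoc and return the written paths."""
    if shutil.which('pandoc') is None:
        raise RuntimeError('pandoc was not found on PATH')

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    converted = []
    for source, target in plan_conversions(source_dir, output_dir, extension):
        # check=True stops the batch at the first file pandoc cannot convert
        subprocess.run(['pandoc', str(source), '-o', str(target)], check=True)
        converted.append(target)
    return converted
```

A call such as convert_all('docs', 'build') would then write one HTML file per Markdown source; separating the planning step from the execution makes the mapping easy to inspect before anything is run.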

5.3 Advanced usage

5.3.1 Custom styles and templates

Pandoc allows users to apply custom styles and templates to customize the appearance of generated documents. Here's an example using custom styles:

pandoc input.md -o output.pdf --template=custom_template.latex

5.3.2 Mathematical formula support

Pandoc supports inserting mathematical formulas into documents and is compatible with the LaTeX mathematical environment. Here is an example:

# Math Example

This is an inline math equation: $E=mc^2$

And here is a displayed math equation:

$$
\int_{-\infty}^{\infty} e^{-x^2} \, dx = \sqrt{\pi}
$$

5.3.3 Multiple document splicing

Pandoc allows users to merge multiple documents into a single document for better organization and management of content. Here is an example:

pandoc file1.md file2.md -o merged_output.pdf

5.3.4 Filters and extensions

Pandoc supports more complex document conversion and processing through filters and extensions. Users can write custom filters or use existing extensions. Here is an example:

pandoc input.md -o output.pdf --filter=my_filter.py
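
A filter of this kind is a program that reads Pandoc's JSON document tree on standard input and writes a modified tree to standard output. The sketch below shows what a file like my_filter.py could contain, using only the standard library; the uppercase transformation and the function names walk, shout and apply_filter are invented for illustration and stand in for whatever the real filter would do:

```python
import json
import sys


def walk(node, action):
    """Rebuild the JSON tree, replacing elements for which action returns a value."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if 't' in node:  # Pandoc elements carry their type in the 't' key
            replacement = action(node)
            if replacement is not None:
                return replacement
        return {key: walk(value, action) for key, value in node.items()}
    return node


def shout(element):
    """Example action: turn the text of every Str element into upper case."""
    if element.get('t') == 'Str':
        return {'t': 'Str', 'c': element['c'].upper()}
    return None


def apply_filter(document):
    """Apply the example action to a whole Pandoc JSON document."""
    return walk(document, shout)


def main():
    """Entry point when run by pandoc: JSON in on stdin, JSON out on stdout."""
    json.dump(apply_filter(json.load(sys.stdin)), sys.stdout)


# In-memory demonstration on a tiny hand-written document
sample = {'pandoc-api-version': [1, 23], 'meta': {},
          'blocks': [{'t': 'Para', 'c': [{'t': 'Str', 'c': 'hello'}]}]}
print(apply_filter(sample)['blocks'][0]['c'][0])  # {'t': 'Str', 'c': 'HELLO'}
```

To run it under Pandoc, the demonstration lines at the bottom would be replaced by a call to main(). For anything beyond a toy filter, a helper package for writing Pandoc filters is usually a better starting point than hand-rolled tree walking.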

The above are some advanced uses of Pandoc, including custom styles, mathematical formula support, multi-document splicing, and filters. For detailed usage and more features, please refer to the official documentation.

6. PyMuPDF

6.1 Processing PDF documents

PyMuPDF is a Python library for processing PDF documents, providing functions for reading, rendering, and manipulating PDFs. Here's a quick rundown of some basic features:

import fitz

# Open the PDF file
pdf_document = fitz.open('example.pdf')

# Get the number of pages
num_pages = pdf_document.page_count

# Read the text of each page
for page_num in range(num_pages):
    page = pdf_document[page_num]
    text = page.get_text()
    print(f'Page {page_num + 1}:\n{text}\n')

# Close the PDF file
pdf_document.close()

6.2 Advanced usage

6.2.1 Image extraction

PyMuPDF allows users to extract images from PDF documents. Here is a simple example:

import fitz

# Open the PDF file
pdf_document = fitz.open('example.pdf')

# Extract the images page by page
for page_num in range(pdf_document.page_count):
    page = pdf_document[page_num]
    images = page.get_images(full=True)

    for img_index, img in enumerate(images):
        # The first item of each entry is the image's xref number
        xref = img[0]
        base_image = pdf_document.extract_image(xref)
        image_bytes = base_image["image"]

        # Save the image using the file extension PyMuPDF reports for it
        image_file_path = f'image_page{page_num + 1}_img{img_index + 1}.{base_image["ext"]}'
        with open(image_file_path, 'wb') as image_file:
            image_file.write(image_bytes)

# Close the PDF file
pdf_document.close()

6.2.2 Document merging

PyMuPDF allows users to merge multiple PDF documents into one document. The following is a simple example:

import fitz

# Open the two source PDF files
pdf_document1 = fitz.open('file1.pdf')
pdf_document2 = fitz.open('file2.pdf')

# Create a new, empty PDF
merged_document = fitz.open()

# Append all pages of each source document to the new one
merged_document.insert_pdf(pdf_document1)
merged_document.insert_pdf(pdf_document2)

# Save the merged PDF document
merged_document.save('merged_file.pdf')

# Close all PDF files
pdf_document1.close()
pdf_document2.close()
merged_document.close()

6.2.3 Add comments

PyMuPDF allows users to add comments to PDF documents. Here is a simple example:

import fitz

# Open the PDF file
pdf_document = fitz.open('example.pdf')

# Get the first page
page = pdf_document[0]

# Add a text ("sticky note") annotation at point (100, 100)
annot = page.add_text_annot((100, 100), 'This is a comment')

# Color the annotation red, then apply the change
annot.set_colors(stroke=(1, 0, 0))
annot.update()

# Save the modified PDF document
pdf_document.save('annotated_file.pdf')

# Close the PDF file
pdf_document.close()

The above are some advanced uses of the PyMuPDF library, including image extraction, document merging, and adding comments. For detailed usage and more features, please refer to the official documentation.

7. openpyxl

7.1 Reading and writing Excel files

openpyxl is a Python library for reading and writing Excel files, supporting the xlsx format. Here's a quick rundown of some basic features:

from openpyxl import Workbook, load_workbook

# Create the Excel file
workbook = Workbook()
worksheet = workbook.active

# Write the data
worksheet['A1'] = 'Hello'
worksheet['B1'] = 'World'

# Save the Excel file
workbook.save('example.xlsx')

# Read the Excel file back
loaded_workbook = load_workbook('example.xlsx')
loaded_worksheet = loaded_workbook.active

# Get a cell value
cell_value = loaded_worksheet['A1'].value
print(cell_value)

7.2 Advanced usage

7.2.1 Style and formatting

openpyxl allows users to style and format cells, including fonts, colors, and borders. Here is a simple example:

from openpyxl import Workbook
from openpyxl.styles import Font, Alignment, PatternFill

# Create the Excel file
workbook = Workbook()
worksheet = workbook.active

# Pick the cell to style
cell = worksheet['A1']
cell.value = 'Styled Text'

# Set the font style
font = Font(size=14, bold=True, color='FF0000')
cell.font = font

# Set the text alignment
alignment = Alignment(horizontal='center', vertical='center')
cell.alignment = alignment

# Set the background color
fill = PatternFill(start_color='FFFF00', end_color='FFFF00', fill_type='solid')
cell.fill = fill

# Save the Excel file
workbook.save('styled_cell.xlsx')

7.2.2 Chart insertion

openpyxl allows users to insert charts into Excel files. Here is a simple example:

from openpyxl import Workbook
from openpyxl.chart import BarChart, Reference

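# Create the Excel file
workbook = Workbook()
worksheet = workbook.active

# Write the data
data = [['Month', 'Sales'],
        ['January', 150],
        ['February', 200],
        ['March', 180]]

for row_data in data:
    worksheet.append(row_data)

# Create a bar chart
chart = BarChart()
chart.title = 'Monthly Sales'
chart.x_axis.title = 'Month'
chart.y_axis.title = 'Sales'

# Plot the Sales column; the header cell supplies the series name
data_range = Reference(worksheet, min_col=2, min_row=1, max_col=2, max_row=4)
chart.add_data(data_range, titles_from_data=True)

# Use the month names as the category labels
categories = Reference(worksheet, min_col=1, min_row=2, max_row=4)
chart.set_categories(categories)

# Insert the chart
worksheet.add_chart(chart, 'E2')

# Save the Excel file
workbook.save('chart_example.xlsx')

7.2.3 Pivot table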

openpyxl can read and preserve pivot tables that already exist in a workbook, but it is not designed to create them from scratch. The following example assumes that pivot_table_example.xlsx already contains a pivot table built in Excel on its active sheet, and marks that table to be refreshed when the file is next opened:

from openpyxl import load_workbook

# Load a workbook that already contains a pivot table (created in Excel)
workbook = load_workbook('pivot_table_example.xlsx')
worksheet = workbook.active

# Pivot tables found on a sheet are kept in its private _pivots list
pivot_table = worksheet._pivots[0]

# Ask Excel to refresh the pivot table from its source data on open
pivot_table.cache.refreshOnLoad = True

# Save the Excel file
workbook.save('pivot_table_example.xlsx')
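
Where a live pivot table is not required, the same cross-tabulation can be computed in plain Python and written to the sheet as ordinary cells. The sketch below uses only the standard library and sample Region, Product and Sales data; the function names pivot_sum and pivot_rows are hypothetical helpers:

```python
from collections import defaultdict


def pivot_sum(rows, row_field, col_field, value_field):
    """Sum value_field for every (row_field, col_field) pair in the data."""
    header, *records = rows
    r, c, v = (header.index(name) for name in (row_field, col_field, value_field))
    totals = defaultdict(lambda: defaultdict(int))
    for record in records:
        totals[record[r]][record[c]] += record[v]
    return {key: dict(columns) for key, columns in totals.items()}


def pivot_rows(totals, corner_label):
    """Flatten the nested totals into rows, with 0 for missing combinations."""
    columns = sorted({col for cols in totals.values() for col in cols})
    table = [[corner_label, *columns]]
    for row_key in sorted(totals):
        table.append([row_key, *(totals[row_key].get(col, 0) for col in columns)])
    return table


data = [['Region', 'Product', 'Sales'],
        ['North', 'A', 100],
        ['South', 'B', 150],
        ['East', 'A', 120],
        ['West', 'B', 200]]

summary = pivot_rows(pivot_sum(data, 'Region', 'Product', 'Sales'), 'Region')
for row in summary:
    print(row)
# ['Region', 'A', 'B']
# ['East', 120, 0]
# ['North', 100, 0]
# ['South', 0, 150]
# ['West', 0, 200]
```

Each row of summary is a plain list, so it can be added to a worksheet with worksheet.append(row) in the same way as the source data.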

The above are some advanced uses of the openpyxl library, including styling and formatting, chart insertion, and pivot tables. For detailed usage and more features, please refer to the official documentation.

8. mistune

8.1 Parse and generate Markdown documents

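mistune is a Python library for parsing Markdown and rendering it, most commonly to HTML; it supports standard Markdown syntax. Its API changed substantially between mistune 0.8 and 2.0, and this example uses the 2.0+ style. Here's a quick rundown of some basic features:

import mistune

# Create a Markdown parser with the default HTML renderer
markdown = mistune.create_markdown()

# Parse Markdown text and render it to HTML
parsed_text = markdown('**Hello, Mistune!**')
print(parsed_text)

# mistune.html is a ready-made shortcut that does the same in one call
generated_text = mistune.html('**Markdown Generated!**')
print(generated_text)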

8.2 Advanced usage

8.2.1 Extension and configuration

mistune allows users to customize the behavior of the Markdown parser using extensions and configurations. Here is a simple example:

import html

import mistune

# Custom renderer: tag Python code blocks with a CSS class
class CustomRenderer(mistune.HTMLRenderer):
    def block_code(self, code, info=None):
        if info and info.strip() == 'python':
            return f'<pre><code class="python">{html.escape(code)}</code></pre>\n'
        return super().block_code(code, info)

# Create a Markdown parser that uses the custom renderer
markdown_with_extension = mistune.create_markdown(renderer=CustomRenderer(escape=False))

# Parse Markdown text
custom_parsed_text = markdown_with_extension('```python\nprint("Hello, Mistune!")\n```')
print(custom_parsed_text)

8.2.2 Advanced syntax support

mistune supports some advanced Markdown syntax, such as tables, task lists, etc. Here's an example with a table and task list:

import mistune

# Create a Markdown parser with the table and task list plugins enabled
markdown = mistune.create_markdown(plugins=['table', 'task_lists'])

# Markdown text containing a table and a task list
advanced_syntax_text = """
| Header 1 | Header 2 |
| -------- | -------- |
| Cell 1   | Cell 2   |

- [x] Task 1
- [ ] Task 2
"""

parsed_advanced_syntax_text = markdown(advanced_syntax_text)
print(parsed_advanced_syntax_text)

8.2.3 HTML rendering

mistune supports rendering Markdown text to HTML. Here is an example:

from mistune import Markdown, HTMLRenderer

# Create the HTML renderer
html_renderer = HTMLRenderer()

# Create the Markdown parser
markdown = Markdown(renderer=html_renderer)

# Parse the Markdown text and render it to HTML
markdown_html = markdown('**Markdown to HTML**')
print(markdown_html)

The above are some advanced uses of the mistune library, including extension and configuration, advanced syntax support, and HTML rendering. For detailed usage and more features, please refer to the official documentation.

Official document link:

  1. Python-docx

  2. PyPDF2

  3. ReportLab

  4. XlsxWriter

  5. Pandoc

  6. PyMuPDF

  7. openpyxl

  8. mistune

Summary

After working through these eight tools, readers will have a solid grounding in Python document processing. Whether you are handling everyday documents and data visualization at work, or generating and converting documents in personal projects, these libraries offer rich functionality and flexible ways of working, and can help you handle document tasks more efficiently.

Origin blog.csdn.net/qq_42531954/article/details/135449850