Python Office Automation: docx


202201 note migration

Introduction

The python-docx package can be used to process docx files automatically. It can generate a docx file from scratch or modify existing docx files in batches. (As far as I can tell, it only works with .docx files. To work with .doc files, you have to save them as .docx first.)

The first step is to install it:

pip install python-docx

Note that the package to install is python-docx. Do not run pip install docx directly: the docx package installs successfully, but it cannot be used. (As for why, see reference 2.)

Official demo

The official documentation (reference 1) gives a simple demo that shows, step by step, how to generate a docx file automatically.

To better illustrate what the package can do, I added a few things to the official demo.

from docx import Document
from docx.shared import Inches, Pt

root_path = 'D:/Document/1-2021碎片学习/12-python/word自动化/docx_sample/'

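# Create a document object from the blank default template
document = Document()

# Set document-wide styles. 'Normal' is the default paragraph style, which most other styles are based on.
# Note that document, paragraph, and run can each set their own style, and the closest one wins:
# the lower the level, the higher the priority of its style.
document.styles['Normal'].font.name = 'Microsoft Yahei UI'
document.styles['Heading 1'].font.size = Pt(29)  # set the font size of all level-1 headings to 29 pt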

# Add a heading; the level parameter controls the heading level, and level=0 creates a title
document.add_heading('Document Title', 0)

# Add a paragraph (and set its initial text)
p = document.add_paragraph('A plain paragraph having some ')
# Keep appending text to the paragraph as runs, setting the style of each run
p.add_run('bold').bold = True
p.add_run(' and some ')
p.add_run('italic.').italic = True

document.add_heading('Heading, level 1', level=1)
document.add_paragraph('Intense quote', style='Intense Quote')  # add a quote-style paragraph

# Add a bulleted list paragraph
document.add_paragraph(
    'first item in unordered list', style='List Bullet'
)
# Add a series of automatically numbered list paragraphs
document.add_paragraph(
    'first item in ordered list', style='List Number'
)
document.add_paragraph(
    'second item in ordered list', style='List Number'
)



# Add a picture
document.add_picture(root_path + 'data/VF.jpeg', width=Inches(1.25))

records = (
    (3, '101', 'Spam'),
    (7, '422', 'Eggs'),
    (4, '631', 'Spam, spam, eggs, and spam')
)
# Add a table and fill it in
table = document.add_table(rows=1, cols=3)
hdr_cells = table.rows[0].cells
hdr_cells[0].text = 'Qty'
hdr_cells[1].text = 'Id'
hdr_cells[2].text = 'Desc'
for qty, item_id, desc in records:  # item_id avoids shadowing the built-in id()
    row_cells = table.add_row().cells
    row_cells[0].text = str(qty)
    row_cells[1].text = item_id
    row_cells[2].text = desc

# Add a page break
document.add_page_break()

# Save to a file on disk
document.save(root_path + 'data/demo.docx')

The final generated docx file looks like this:

[Image: the generated demo.docx]

These few lines already cover most of the basic features of python-docx.

It can generate:

  • headings at all levels;
  • body text in various formats (partly bold, italic, and so on);
  • various lists;
  • pictures;
  • tables.

The rest of this note briefly introduces the main elements of python-docx and ends with an example of working with an existing docx file.

First, a few basic concepts (referred to below as the three elements):

  • Document: a Word document object. A Word file exists in memory as a Document object.
  • Paragraph: literally a paragraph. The content of a Word document is made up of paragraphs. Pressing Enter in a document starts a new paragraph. Pressing Shift + Enter inserts a soft return, which does not start a new paragraph but only breaks the line within the current one.
  • Run: a run of text. Each paragraph is made up of one or more runs, and consecutive text with the same style within a paragraph forms one run. So a Paragraph object consists of a list of Run objects.

Here is an example that shows the relationship between runs and paragraphs (picture from reference 3):

[Image: a sample Word document]

The structure of the above is actually:

[Image: the same document broken down into paragraphs and runs]

Note that the soft return near the end of the example does not start a new paragraph, so the lines on either side of it, including the penultimate line, belong to one paragraph. The second line is empty and contains only a carriage return, so that paragraph has no runs.

I will write a verification of this later.

Regarding styles: document, paragraph, and run can each set their own style, and the closest one wins. The lower the level, the higher the priority of its style. That is, within a given paragraph, the global style set on the document can be overridden by the style set on the paragraph. If a paragraph does not specify its own style, it follows the document's Normal style.

Read and modify existing docx

In practice, creating a docx from scratch is fairly rare. The most common use is reading existing docx files in batches and modifying them.

Let's first look at how the docx file generated above is represented in memory, and then, based on an example I did earlier, see how to use the package in practice.

The following simple example demonstrates basic bulk text replacement:

# Example: read an existing docx file
from docx import Document
import os

root_path = 'D:/Document/1-2021碎片学习/12-python/word自动化/docx_sample/'
data_path = os.path.join(root_path, 'data/')

document = Document(os.path.join(data_path, 'demo.docx'))

print(document)

# Every document object has a paragraphs list made up of Paragraph objects
print(document.paragraphs)

for p in document.paragraphs:
    # print(p)
    # print(p.text)
    for run in p.runs:  # a paragraph is in turn made up of runs
        # print(run)
        print(run.text)     # try editing demo.docx to see how the runs change
        if run.text == 'bold':  # replace the keyword
            run.text = 'bold bold'

document.save(os.path.join(data_path, 'new_demo.docx'))

So if you only need batch text replacement, you can prepare the template first: apply bold, a different font, or some other style so that the text to be replaced becomes a run of its own, and then test that run with an if statement inside the loop.

References

  1. python-docx official documentation (to be honest, it is a bit brief)
  2. The "from docx import Document" error problem
  3. python-docx, the Word power tool (very well written, especially the explanation of the three elements, with both pictures and text)
  4. An introductory tutorial on operating Word with Python
  5. a browser bundle
  6. The optional values of style in python-docx (lists the values, but does not explain the effect of each style in detail)
  7. python-docx: setting paragraph format, global style, and paragraph style

Origin blog.csdn.net/wlh2220133699/article/details/131719907