python-docx: a power tool for Word

Two days ago a friend asked me for help. While writing her thesis, she had accidentally replaced all the Chinese double quotes with English ones, and for various reasons could not roll the change back. With a dissertation of more than 80,000 words, fixing it by hand looked hopeless. What to do?

My first thought was Word's replace function. It can find the quotes, but it cannot do a dynamic replacement, i.e. replace only the quotes on both sides while leaving the content between them unchanged.

Another option was VBA, doing the replacement in a program. I have done several VBA projects, but not for a long time, and picking it up again would be laborious. With all the VBA concepts and usage to relearn, the cost was too high, so I gave up on it.

A third option was to operate Word from Python. I am more familiar with Python, and others have already built good wheels. Sure enough, it did not take long to find python-docx, a Python library with complete documentation that is powerful enough to solve the replacement problem on its own.

Before we start, a quick look at python-docx.

Introduction to python-docx

python-docx is a Python library for creating and modifying Microsoft Word documents. It covers a wide range of Word operations and is one of the most commonly used Word tools.

Concepts

Before using it, first understand a few concepts:

  • Document: a Word document object. Unlike the Worksheet concept in VBA, each Document is independent: opening different Word documents gives different Document objects, which do not affect each other

  • Paragraph: a paragraph. A Word document consists of multiple paragraphs. Pressing Enter in the document starts a new paragraph; pressing Shift + Enter inserts a line break without starting a new paragraph

  • Run: a run (segment). Each paragraph consists of multiple runs; a stretch of contiguous text with the same style forms one run, so a Paragraph object has a list of Run objects

For example, take a Word document with the following content:

[Figure: Word document content]

python-docx divides it into paragraphs and runs. In this example the second paragraph has no content, so its list of runs is empty.

Installation

You can use pip to install:

pip install python-docx

On the command line, run the following statement; if there is no error, the installation succeeded:

$ python -c "import docx"

A quick test

After installing python-docx, try it out:

from docx import Document


document = Document()
paragraph = document.add_paragraph('Lorem ipsum dolor sit amet.')
prior_paragraph = paragraph.insert_paragraph_before('Lorem ipsum')


document.save(r"D:\test.docx")

  • Import the Document class

  • Create a new document object, document

  • Add a paragraph to the document

  • Insert another paragraph before that paragraph

  • Finally, call the save method of the document object to save the document

Opening the saved test.docx in Word shows the two paragraphs in order: "Lorem ipsum" first, then "Lorem ipsum dolor sit amet."

Problem analysis and solution

With the basics of python-docx understood, we can tackle the problem. The general idea is:

  1. Read the document content

  2. Find the content between quotes

  3. Change the quotes around the found content from English to Chinese and put the content back

  4. After processing, save the document under a new name

Finding the target

The first problem to solve is how to find the content between quotes.

For example, the document contains this passage:

...
对"基于需求的教育资源配
置系统观"的研究,尤其是对"以学习者为中心"和从"个性化学习"、"精准教学"视角出发的
教育资源配置问题提供了理论"支持\\以及"方向指导
...

English quotation marks do not distinguish between opening and closing marks. So how do we make sure the match is not "和从" or "、", nor the whole span "以学习者为中心"和从"个性化学习"、"精准教学", and that a quotation spanning a line break is not missed?

After reviewing regular expressions, I finally arrived at the following expression:

'"(?:[^"])*"'

  • ?: makes the parentheses a non-capturing group; with a capturing group, findall would return the group's content instead of the whole match

  • [^"] matches any character except ", which avoids a greedy match running from the first " all the way to the last "

  • Overall, the expression matches the content between two ", where that content itself contains no "
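
To see the expression at work, here is a small check on a shortened version of the sample passage (the shortening is mine, for illustration). Note that [^"] also matches a line break, so the quotation that spans two lines is found as well:

```python
import re

# Shortened sample; "\n" stands for the line break inside the last quotation.
text = '对"以学习者为中心"和从"个性化学习"、"精准教学"视角出发,提供了理论"支持\n以及"方向指导'

pattern = re.compile(r'"(?:[^"])*"')
matches = pattern.findall(text)

for match in matches:
    print(repr(match))
```

Because findall scans from left to right and matches never overlap, the text between two quotations, such as "和从", is not picked up.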

Later, while tidying things up, I found another way:

'".*?"'

But . cannot match the newline \n, so the optional flag re.S is needed:

import re

text = '理论"支持\n以及"方向指导'  # text is the string to search
pattern = re.compile('".*?"', re.S)

re.findall(pattern, text)

  • Import the regular expression module re

  • re.S is an optional flag that makes . match every character, including newlines

  • Use findall to find all matches
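
The difference between the greedy form, the non-greedy form and the re.S flag is easiest to see side by side. A small sketch on a made-up string in the style of the sample passage:

```python
import re

text = '理论"支持\n以及"方向,"精准教学"视角'

# Greedy: runs from the first quote to the last one.
greedy = re.findall(r'".*"', text, re.S)

# Non-greedy without re.S: "." stops at the line break, so quotes get mispaired.
lazy_without_flag = re.findall(r'".*?"', text)

# Non-greedy with re.S: each quotation is matched on its own.
lazy_with_flag = re.findall(r'".*?"', text, re.S)

print(greedy)
print(lazy_without_flag)
print(lazy_with_flag)
```

Without re.S the first quotation cannot be matched across the line break, so the closing quote of the first quotation is paired with the opening quote of the next one.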

For more usage, see the link on Python regular expressions in the references at the end of the article.

Implementation

With the finding problem solved, the replacement is much easier:

from docx import Document
import re


doc = Document(r"D:\论文.docx")
restr = '"(?:[^"])*"'


for p in doc.paragraphs:
    matchRet = re.findall(restr, p.text)
    for r in matchRet:
        p.text = p.text.replace(r, '“' + r[1:-1] + '”')
doc.save(r'D:\论文_修正.docx')

  • Import the Document class and the regular expression module

  • Open the target document; the r before the string makes it a raw string, so backslashes are taken literally instead of being treated as escapes

  • Loop over the paragraphs of the document and match each one against the regular expression

  • Loop over the matches, change the surrounding quotes to Chinese quotes, and replace the paragraph's text; r[1:-1] takes the slice from the second character (index 1, since the first position is 0) up to but not including the last one, which strips the surrounding quotes

  • Finally, save the document
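
One caveat with the loop above: assigning to p.text replaces all runs of the paragraph with a single run, so character formatting inside the paragraph, such as bold or italic, is lost. The following is a sketch of an alternative, not the approach used in this article: instead of a regular expression, it walks through the text of each run and turns straight quotes into opening and closing quotes alternately. The function name convert_quotes_in_runs is hypothetical.

```python
def convert_quotes_in_runs(run_texts):
    """Replace straight double quotes with Chinese quotes, alternating
    between opening and closing, across a list of run texts."""
    opening = True  # the next straight quote opens a quotation
    converted = []
    for text in run_texts:
        chars = []
        for ch in text:
            if ch == '"':
                # \u201c is the opening quote, \u201d the closing quote
                chars.append('\u201c' if opening else '\u201d')
                opening = not opening
            else:
                chars.append(ch)
        converted.append(''.join(chars))
    return converted


# A quotation that starts in one run and ends in the next one.
result = convert_quotes_in_runs(['对"以学习者', '为中心"和从"个性化学习"'])
print(result)
```

With python-docx this could be applied per paragraph by passing [run.text for run in p.runs] and assigning each result back to the matching run.text, which leaves the formatting of each run in place. Unlike the regular expression, this sketch treats an unpaired quote as an opening quote, so it assumes the quotes in each paragraph are balanced.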

Note: python-docx does not give any warning when saving a document, and the save completes instantly, so save with care

Done. I quickly sent the corrected document over ......

Before I could savor the moment, she said: "Thank you so much! Could you also help me generate a table of figures? This must be ......"

Well, with a tool like this in hand, just get it done ......

Powerful python-docx

The quick test above showed how to insert paragraphs. Here are some other python-docx features.

For brevity, the examples below omit importing the Document class and creating the instance; document is an instance of Document.

Add a heading

By default, add_heading adds a top-level heading, i.e. a level-1 heading. The level parameter sets the level, in the range 1 to 9; there is also level 0, which produces the document title (the Title style):

# Add a level-1 heading
document.add_heading('I am a level-1 heading')

# Add a level-2 heading
document.add_heading('I am a level-2 heading', level=2)

# Add the document title
document.add_heading('I am the document title', level=0)

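Add a page break

If a paragraph does not fill a page but the following content should start on a new page, insert a page break. Calling add_page_break directly inserts the break after the last paragraph:

# Insert a page break at the end of the document
document.add_page_break()

# Page break after a specific paragraph
from docx.enum.text import WD_BREAK
paragraph = document.add_paragraph("A page of my own")  # add a paragraph
paragraph.runs[-1].add_break(WD_BREAK.PAGE)  # add a page break after the paragraph's last run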

Working with tables

Tables are common in Word documents. How does python-docx add and manipulate them?

# Add a 2×2 table
table = document.add_table(rows=2, cols=2)

# Get the cell in the first row, second column
cell = table.cell(0, 1)

# Set the cell text
cell.text = 'I am cell text'

# Rows of the table
row = table.rows[1]
row.cells[0].text = 'Foo bar to you.'
row.cells[1].text = 'And a hearty foo bar to you too sir!'

# Add a row
row = table.add_row()

A more complex example:

# Table data
items = (
    (7, '1024', 'Phone'),
    (3, '2042', 'Laptop'),
    (1, '1288', 'Desktop'),
)

# Add a table with one row and three columns
table = document.add_table(1, 3)

# Set the table header
heading_cells = table.rows[0].cells
heading_cells[0].text = 'Quantity'
heading_cells[1].text = 'Code'
heading_cells[2].text = 'Description'

# Fill the data into the table, one new row per item
for item in items:
    cells = table.add_row().cells
    cells[0].text = str(item[0])
    cells[1].text = item[1]
    cells[2].text = item[2]

Add pictures

Adding a picture corresponds to Insert > Picture in Word's menu. By default the picture is inserted at its original size:

document.add_picture('image-filename.png')

To set the picture size when inserting:

from docx.shared import Cm
# Set the picture width to 10 cm
document.add_picture('image-filename.png', width=Cm(10))

Besides centimeters, python-docx also provides inches (Inches); for example, one inch is Inches(1.0)

Styles

Styles can be applied to the whole document, to a paragraph, or to a run; the more specific the target, the higher the priority of the style

python-docx offers a rich set of style features. Here is a brief introduction to paragraph styles and text styles

Paragraph styles

Paragraph styles include alignment, list style, line spacing, indentation, background color and so on. They can be set when adding a paragraph or afterwards:

# Add a paragraph with the unordered list style
document.add_paragraph('I am an unordered list paragraph', style='List Bullet')

# After adding a paragraph, set the style through the style attribute
paragraph = document.add_paragraph('I am an unordered list paragraph too')
paragraph.style = 'List Bullet'

Text styles

As the python-docx document structure shown earlier illustrates, content with different styles within a paragraph is split into multiple runs, so text styles are set through the run

Set bold / italic

paragraph = document.add_paragraph('Add a paragraph')
# Make the run's text bold
run = paragraph.add_run('Add a run')
run.bold = True

# Make the run's text italic
run = paragraph.add_run('I am italic')
run.italic = True

Set the font

Setting the font is slightly more complicated. For example, to set a piece of text to 宋体 (SimSun):

from docx.oxml.ns import qn

paragraph = document.add_paragraph('My font is 宋体')
run = paragraph.runs[0]
run.font.name = '宋体'
# East Asian text uses a separate font attribute, set here on the underlying XML
run._element.rPr.rFonts.set(qn('w:eastAsia'), '宋体')

Summary

python-docx is a powerful Word library that can carry out almost every operation available in Word. Today, through one example, I introduced some of its basic usage. Space is limited, so I cannot cover more here; if you are interested, dig deeper, and perhaps you can make working with Word as simple as Markdown.

References

https://python-docx.readthedocs.io/en/latest/

https://www.runoob.com/python/python-reg-expressions.html

https://www.cnblogs.com/nixindecat/p/12157623.html


Python Technology


Origin blog.csdn.net/ityouknow/article/details/105236575