OpenXML parsing of Word files (taking Python3 as an example)

OpenXML parsing of Word files

Since Office 2007, the newly launched .docx files can be converted to OpenXML format losslessly, so that third-party tools can generate and modify Word files. This article uses Python as the background to briefly analyze the common elements in OpenXML, mainly as a personal memo.

My personal email address is [email protected], welcome to exchange letters.

This article is still under construction. Due to the impact of the epidemic, the completion date is undecided. QAQ
12/1/2020 update: I feel that this pit should not be filled. If you have any questions, please email to discuss.

Contents of this article:

  • Preparation
  • Getting to Know OpenXML
  • Get Word XML
    • Manually save the Word file as an XML file
    • Get OpenXML with python-docx
  • Common structure of Word OpenXML
    • Overall structure: body, styles, setting, etc.
    • Paragraph Node:<w:p>
    • Basic format unit Run node:<w:r>
    • Format Properties node: <w:pPr>with<w:rPr>
      • font<w:rFonts>
      • font size <w:sz>,<w:szCs>
    • The visible text Text:<w:t>
    • revision number rsid
    • Ruby phonetic system:<w:ruby>

Preparation

Python parses two common libraries of OpenXML: python-docx and lxml , which can be installed through pip. If you are not familiar with the two libraries, please refer to the following information:

The content of the text used as an example in this article is "Spring River Flowers and Moonlight Night" written by Zhang Ruoxu, a poet of the Tang Dynasty. The full text is as follows:

《春江花月夜》
【唐·张若虚】
春江潮水连海平,海上明月共潮生。
滟滟随波千万里,何处春江无月明?
江流宛转绕芳甸,月照花林皆似霰。
空里流霜不觉飞,汀上白沙看不见。
江天一色无纤尘,皎皎空中孤月轮。
江畔何人初见月?江月何年初照人?
人生代代无穷已,江月年年只相似。
不知江月待何人,但见长江送流水。
白云一片去悠悠,青枫浦上不胜愁。
谁家今夜扁舟子?何处相思明月楼?
可怜楼上月徘徊,应照离人妆镜台。
玉户帘中卷不去,捣衣砧上拂还来。
此时相望不相闻,愿逐月华流照君。
鸿雁长飞光不度,鱼龙潜跃水成文。
昨夜闲潭梦落花,可怜春半不还家。
江水流春去欲尽,江潭落月复西斜。
斜月沉沉藏海雾,碣石潇湘无限路。
不知乘月几人归?落月摇情满江树。
  • Paste the full text into the newly created word document in plain text, without modifying any format , and directly save the file as " Chunjianghuayueye.docx ".

  • (not required) After that, save the file as a "word xml(*.xml)" file, name it " Chunjiang Huayueye.xml ", and open it with your favorite xml viewer ( personally recommend Notepad++ with the XML Tools plug-in installed Notepad++ is no longer recommended, because Chinese people understand the reason, I use VSCode myself now).

This article will take this as an example.


Getting to Know OpenXML

After the preparations are complete, let's look at a simple example of OpenXML (note that this is not a complete word file):

<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex">
	<w:p w:rsidR="00691F43" w:rsidRDefault="00691F43" w:rsidP="00691F43">
		<w:r>
			<w:rPr>
				<w:rFonts w:hint="eastAsia"/>
			</w:rPr>
			<w:t>《春江花月夜》</w:t>
		</w:r>
	</w:p>
	<w:p w:rsidR="00691F43" w:rsidRDefault="00691F43" w:rsidP="00691F43">
		<w:r>
			<w:rPr>
				<w:rFonts w:hint="eastAsia"/>
			</w:rPr>
			<w:t>【唐·张若虚】</w:t>
		</w:r>
	</w:p>
</w:body>

These are the first two natural paragraphs in the body. OpenXML uses namespaces, see XML Namespaces . In this example, the root tag is <w:body>, and the attributes of the root tag xmlns:XXare used to describe the namespace. There are three namespaces in this example: w, wpcand cx, in fact, a word file may have more than a dozen names space, but the usual ones are w.

Get Word XML

Manually save the Word file as an XML file

Select in turn in word: File - Save As - Save as type and select "Word XML Document (*.xml)" - Save.

Get OpenXML with python-docx

1. Only read, not modify:

from docx import Document
from lxml import etree

doc = Document('春江花月夜.docx')
body_xml_str = doc._body._element.xml # 获取body中的xml
body_xml = etree.fromstring(body_xml_str) # 转换成lxml结点
print(etree.tounicode(body_xml)) # 打印查看

for p in doc.paragraphs:
    p_xml_str = p._p.xml # 按段落获取xml
    p_xml = etree.fromstring(p_xml_str) # 转换成lxml结点
    print(etree.tounicode(p_xml)) # 打印查看

Note: Modifying the xml obtained in this way will not affect the original word file.

2. If you want to directly modify the xml of the original word file , you can directly use the variable types starting with docx.Document type( CT_such as doc._body._element, p._p, doc.styles._elementetc.) lxml.etree.Element(although there seems to be no inheritance relationship between them).

print(type(doc._body._element)) # <class 'docx.oxml.document.CT_Body'>
print(type(p._p)) # <class 'docx.oxml.text.paragraph.CT_P'>
print(type(doc.styles._element)) # <class 'docx.oxml.styles.CT_Styles'>

For example:

from docx import Document
from lxml import etree

doc = Document('春江花月夜.docx')
body_xml = doc._body._element # 获取body中的xml
print(etree.tounicode(body_xml)) # 打印查看

for p in doc.paragraphs:
    p_xml = p._p # 按段落获取xml
    print(etree.tounicode(p_xml)) # 打印查看

Common structure of Word OpenXML

The main reference material in this section is the official OpenXML Microsoft document.

(Friends who can’t open it can copy the address to the browser: https://docs.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing, change en-us in the address to zh-cn Browse machine-translated Simplified Chinese version)

Overall structure: body, styles, setting, etc.

  • bodyIt is the text node, and all visually visible text information should be in this part.
  • stylesIt is equivalent to the style sheet in Word, but there should be more high-end usage. I personally guess that the theme setting in Word should also be here, but it has not been verified yet.
  • settingContains file version information, author information, etc.

Paragraph Node:<w:p>

Nodes used for paragraphs, paragraphs are also formatted at this level, ie <w:pPr>.

Paragraph nodes do not directly contain text, but contain lower-level <w:r>nodes first, and then contain text.

Basic format unit Run node:<w:r>

I haven't figured out how to translate the run here. If you have any good suggestions, please leave a message or write to us.

A piece of text with an independent format occupies a Run node <w:r>, and its format node is <w:rPr>.

The actual test found that sometimes the format of the front and back is clearly the same, but two Run nodes are still divided. The specific reason is currently unknown, but (incomplete) tests found that after merging two Run nodes into one, it does not affect The visual effect after the file is opened.

Format Properties node: <w:pPr>with<w:rPr>

  • pRrMeans Paragraph Properties, usually used to represent the relevant properties of the paragraph.
  • rPrIt means Run Properties, which is usually used to represent the related properties of fonts.

font<w:rFonts>

rFontsIt means RunFonts, which is used to represent fonts in different languages ​​and the current language. Click here to view the official documents .

It may not be appropriate to use the word "language family", but I can't think of other words to describe it. If you have good suggestions, please leave a message or write to us.

If no font is specified explicitly, this text will use the font specified by the previous level.

Currently rFontssupports four language families, namely:

{
    
    
    'ascii':'ASCII Font. Represents the following attribute in the schema: w:ascii',
    'eastAsia':'East Asian Font. Represents the following attribute in the schema: w:eastAsia',
    'hAnsi':'High ANSI Font. Represents the following attribute in the schema: w:hAnsi',
    'cs':'Complex Script Font. Represents the following attribute in the schema: w:cs',
}

rFontsIndicates w:hintwhich language font is used for this text.

For example, to set both the Chinese font and the Western font of a Chinese character to Song typeface:

<w:r>
    <w:rPr>
        <w:rFonts w:ascii="宋体" w:eastAsia="宋体" w:hAnsi="宋体" w:cs="宋体" w:hint="eastAsia"/>
    </w:rPr>
    <w:t>这一段文字字体为宋体。</w:t>
</w:r>

font size <w:sz>,<w:szCs>

szIndicates Non-Complex Script Font Size, which is the size of single-byte characters (such as ASCII coded characters, etc.) in a simple understanding.

szCsIndicates the Complex Script Font Size, which can be simply understood as the size of double-byte characters (such as Chinese, Japanese, Korean, Arabic, etc.).

If no font size is specified explicitly, this paragraph of text will use the font size specified by the previous level.

The sentence The font sizes specified by this element's val attribute are expressed as half-point values ​​in sz's official information shows that the value in <w:sz>and <w:szCs>is twice the actual point value (the original translation is "the font size represented by the value of this element , representing a half-point value).

For example, for a Chinese character of size five, consider the correspondence between the Chinese character size and the point size as follows. Its point value is 10.5, and its double is 21, so there should be<w:szCs w:val="21"/>

<w:r>
    <w:rPr>
        <w:rFonts w:hint="eastAsia"/>
        <w:sz w:val="21"/>
        <w:szCs w:val="21"/>
    </w:rPr>
    <w:t>这一段文字是五号字</w:t>
</w:r>
SZ_ZH_PT = {
    
    	# 中文字号与磅值对应关系
    "八号": 5,
    "七号": 5.5,
    "小六": 6.5,
    "六号": 7.5,
    "小五": 9,
    "五号": 10.5,
    "小四": 12,
    "四号": 14,
    "小三": 15,
    "三号": 16,
    "小二": 18,
    "二号": 22,
    "小一": 24,
    "一号": 26,
    "小初": 36,
    "初号": 42,
}

The visible text Text:<w:t>

Normally, there should be <w:t>a node in each run, and the content contained in this node is the text actually displayed in the word.

paragraph.textThe sum function in python-docx run.textreturns <w:t>the content in the node, for example:

from docx import Document

doc = Document('春江花月夜.docx')
for paragraph in doc.paragraphs:
    print('本段内容长度为%d' % len(paragraph.text))
    for i, run in enumerate(paragraph.runs):
        print('第{}个run的内容:{}'.format(i, run.text))

revision number rsid

Word will record every modification to the file and randomly number the revision. In OpenXML, the revision number is represented by rsid, specifically w:rsidR, , w:rsidRPretc.

The specific version change information is <w:settings>saved <w:rsids>in the following.

Ruby phonetic system:<w:ruby>

Characters marked with pinyin are represented as a field in word. In OpenXML, it can be seen as in <w:r>, <w:ruby>结点replacing the node <w:t>with .

A typical Zhuyin xml:

<w:ruby>
    <w:rubyPr>
        <w:rubyAlign w:val="center"/>
        <w:lid w:val="zh-CN"/>
    </w:rubyPr>
    <w:rt>
        <w:r>
            <w:rPr></w:rPr>
            <w:t>zhōng</w:t>
        </w:r>
    </w:rt>
    <w:rubyBase>
        <w:r>
            <w:rPr></w:rPr>
            <w:t></w:t>
        </w:r>
    </w:rubyBase>
</w:ruby>

Guess you like

Origin blog.csdn.net/liuqixuan1994/article/details/104486600