About parsing docx with Python

My recent work involves parsing docx files. I looked at many approaches in C++, Java, and Python, and in the end found that for practicality and simplicity there is no getting around Python. I then looked at many Python libraries for parsing docx and finally chose python-docx. Many tutorials already cover using python-docx to parse Word documents, but they mostly use only its basic functions, such as:

Paragraph printing:

from docx import Document

document = Document('demo.docx')  # open the file
for paragraph in document.paragraphs:
    print(paragraph.text)  # print the text of each paragraph

Or table extraction:

from docx import Document  # import the library

path = 'demo.docx'  # file path
document = Document(path)  # load the file
tables = document.tables  # all tables in the document

for table in tables:
    for row in table.rows:  # read each row
        row_content = []
        for cell in row.cells:  # read every cell in the row
            row_content.append(cell.text)
        print(row_content)  # output each row as a list

Reading a document this way only gives us the most basic text of each part. Some special characters cannot be read or parsed.

For example, symbols such as check boxes cannot be parsed in this way, but we need them. At this point we have to change our approach and work with the underlying Open XML of the Word document.

In fact, to get to know Open XML, just change the extension of a docx file to .zip and unzip it. You will find that the file is actually made up of many XML files, which store the content and formatting of the document. Open document.xml (in the word folder) and you will see the body content.
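The same inspection can be done from Python without renaming anything. This is a minimal sketch using only the standard zipfile module. The helper names list_docx_parts and read_document_xml are made up for illustration, and word/document.xml is assumed to be where the body lives, which is the usual location in a docx package.

```python
import zipfile


def list_docx_parts(path):
    """Return the names of all files stored inside the .docx archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


def read_document_xml(path):
    """Return the body XML (word/document.xml) as a string."""
    with zipfile.ZipFile(path) as archive:
        return archive.read("word/document.xml").decode("utf-8")
```

Calling list_docx_parts('demo.docx') should show entries such as word/document.xml and word/styles.xml, and read_document_xml('demo.docx') returns the raw XML you would otherwise see after unzipping by hand.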

document.xml follows the usual XML data structure, with tags, attributes, and so on. So what we have to do is find the tags that represent the special symbols and restore them to the corresponding positions in the original text.

The reason for choosing python-docx here is that it exposes the underlying Open XML structure while reading the docx, so we can search the XML for tags directly.

First, let's look at the Open XML format (see https://blog.csdn.net/liuqixuan1994/article/details/104486600/):

Overall structure: body, styles, settings, etc.

Paragraph node: <w:p>

Run node, the basic formatting unit: <w:r>

Formatting properties nodes: <w:pPr> and <w:rPr>

Font: <w:rFonts>

Font size: <w:sz>, <w:szCs>

Visible text: <w:t>
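To make these node names concrete, here is a small sketch that parses a hand-written paragraph fragment with the standard xml.etree.ElementTree module and reads the text, font, and size of each run. The SAMPLE fragment and the describe_runs helper are assumptions made for illustration, not taken from a real document. Note that w:sz stores the size in half-points, so 24 means 12 pt.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace bound to the w: prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hand-written sample paragraph (an assumption, for illustration only).
SAMPLE = f"""
<w:p xmlns:w="{W_NS}">
  <w:pPr><w:jc w:val="center"/></w:pPr>
  <w:r>
    <w:rPr>
      <w:rFonts w:ascii="Arial"/>
      <w:sz w:val="24"/>
      <w:szCs w:val="24"/>
    </w:rPr>
    <w:t>Hello</w:t>
  </w:r>
  <w:r><w:t xml:space="preserve"> world</w:t></w:r>
</w:p>
"""


def describe_runs(paragraph_xml):
    """Return (text, font, size in half-points) for each w:r in a w:p."""
    paragraph = ET.fromstring(paragraph_xml)
    result = []
    for run in paragraph.findall("w:r", NS):
        # A run may hold several w:t nodes; join their text.
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        fonts = run.find("w:rPr/w:rFonts", NS)
        size = run.find("w:rPr/w:sz", NS)
        font = fonts.get(f"{{{W_NS}}}ascii") if fonts is not None else None
        half_points = int(size.get(f"{{{W_NS}}}val")) if size is not None else None
        result.append((text, font, half_points))
    return result


for text, font, half_points in describe_runs(SAMPLE):
    print(repr(text), font, half_points)
```

Runs without a w:rPr child simply report None for font and size, since they inherit formatting from the paragraph or document styles.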

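Next, we need to work out how special characters are represented in the XML file. I have encountered about three types: a <w:sym> element whose w:font and w:char attributes identify the symbol, a <w:instrText> element containing FORMCHECKBOX, and a <w:checked> element.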

So we can search the XML directly for the corresponding tags to find the special characters.

First, we read the document, get its XML, and find the <w:r> tags, the basic units of a paragraph:

from xml.dom.minidom import parseString

from docx import Document
from lxml import etree

doc = Document('demo.docx')
body_xml_str = doc._body._element.xml  # get the XML of the body
body_xml = etree.fromstring(body_xml_str)  # convert to an lxml node
print(etree.tounicode(body_xml))  # print it for inspection

for p in doc.paragraphs:
    p_xml_str = p._p.xml  # get the XML paragraph by paragraph
    p_xml = etree.fromstring(p_xml_str)  # convert to an lxml node
    print(etree.tounicode(p_xml))  # print it for inspection
    xml_dom = parseString(etree.tounicode(p_xml))  # re-parse with minidom
    stus = xml_dom.getElementsByTagName('w:r')  # all runs in the paragraph
    for si in stus:
        print(si)

Extract special characters in paragraphs:

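# `si` is one w:r node from the loop above
sym_id = si.getElementsByTagName('w:sym')
for sym_i in sym_id:
    # the font name and character code must match the document's XML exactly
    if (sym_i.getAttribute('w:font') == 'Wingdings2'
            and sym_i.getAttribute('w:char') == '0052'):
        print('special character', end='')

sym_box = si.getElementsByTagName('w:instrText')
for box_i in sym_box:
    # strip() tolerates padding spaces around the field instruction
    if box_i.childNodes and box_i.childNodes[0].data.strip() == 'FORMCHECKBOX':
        print('special character', end='')

sym_box = si.getElementsByTagName('w:checked')
for box_i in sym_box:
    print('special character', end='')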

In this way, the special characters can be found. Of course, you have to work out the matching representation for each specific scenario.
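One way to handle scenario-specific matching is a lookup table that maps each (font, character code) pair to a replacement string, applied while walking the paragraph in document order so the symbol lands back at its original position. The sketch below uses only the standard library. The SYMBOLS table, the "[x]" replacement, the sample XML, and the function name are all hypothetical, and the font string is copied from the check above, so it has to be adjusted to whatever your own documents contain.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"  # ElementTree spells qualified names as {namespace}tag

# Hypothetical lookup table: (w:font, w:char) -> text to substitute.
SYMBOLS = {("Wingdings2", "0052"): "[x]"}

# Hand-written sample paragraph (an assumption, for illustration only).
SAMPLE = (
    f'<w:p xmlns:w="{W_NS}">'
    '<w:r><w:t xml:space="preserve">Agree: </w:t></w:r>'
    '<w:r><w:sym w:font="Wingdings2" w:char="0052"/></w:r>'
    '<w:r><w:t xml:space="preserve"> yes</w:t></w:r>'
    '</w:p>'
)


def paragraph_text_with_symbols(paragraph_xml, symbols=SYMBOLS, unknown="[?]"):
    """Rebuild paragraph text, replacing w:sym nodes via the lookup table."""
    paragraph = ET.fromstring(paragraph_xml)
    parts = []
    # iter() visits nodes in document order, which preserves positions.
    for node in paragraph.iter():
        if node.tag == W + "t":
            parts.append(node.text or "")
        elif node.tag == W + "sym":
            key = (node.get(W + "font"), node.get(W + "char"))
            parts.append(symbols.get(key, unknown))
    return "".join(parts)


print(paragraph_text_with_symbols(SAMPLE))
```

Symbols missing from the table come out as the unknown marker instead of vanishing silently, which makes it easier to spot representations that still need a mapping.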

Origin blog.csdn.net/wi162yyxq/article/details/108431881