Read content from Word and output to Excel in required format

1. Read Word

The python-docx library can create, edit, and update Microsoft Word (.docx) files. It cannot process legacy .doc files.

Function libraries and usage syntax

The basic idea is to treat a Word file as a Document object made up of Paragraph objects; each paragraph's text attribute holds the text content of that paragraph.
Other objects, such as tables and pictures, can be processed as well.
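
As a minimal sketch of this object model, the helpers below collect the text of every non-empty paragraph and of every table cell. The function names (load_document, paragraph_texts, table_rows) are illustrative and not part of python-docx; only Document, .paragraphs, .tables, .rows, .cells and .text come from the library.

```python
def load_document(path):
    """Open an existing .docx file (requires `pip install python-docx`)."""
    # Imported lazily so the helpers below work on any document-like object.
    from docx import Document
    return Document(path)


def paragraph_texts(document):
    """Return the text of each non-empty paragraph, in document order."""
    return [p.text.strip() for p in document.paragraphs if p.text.strip()]


def table_rows(document):
    """Return every table as a list of rows, each row a list of cell texts."""
    return [
        [[cell.text for cell in row.cells] for row in table.rows]
        for table in document.tables
    ]
```

With a real file this would be called as paragraph_texts(load_document('demo.docx')), where demo.docx is a placeholder file name.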

Importing libraries and instantiating

from docx import Document
document = Document()  # new empty document; pass a path, e.g. Document('demo.docx'), to open an existing file

If the import fails, install the library with pip:

pip install python-docx

Function syntax

- document.add_heading()	#	add a heading
- document.add_paragraph()	#	add a paragraph (style='List Bullet' / 'List Number')
- document.add_picture()	#	add a picture
- document.add_table()	#	add a table
- document.add_page_break()	#	add a page break
- document.save('demo.docx')	#	save the file
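
The calls listed above can be combined roughly as follows. This is only a sketch: build_demo and main are hypothetical helper names, the sample text is made up, and add_picture is left as a comment because it needs a real image file.

```python
def build_demo(document, path):
    """Populate a python-docx Document with sample content and save it to `path`."""
    document.add_heading('Demo report', level=1)
    document.add_paragraph('An ordinary paragraph.')
    document.add_paragraph('A bulleted item', style='List Bullet')
    table = document.add_table(rows=1, cols=2)
    table.cell(0, 0).text = 'Key'
    table.cell(0, 1).text = 'Value'
    document.add_page_break()
    # document.add_picture('logo.png') would embed an image (placeholder file name).
    document.save(path)


def main():
    from docx import Document  # requires `pip install python-docx`
    build_demo(Document(), 'demo.docx')
```

Calling main() writes demo.docx to the current working directory.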

2. Write to Excel

Function import and usage

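import xlwt
# Create a workbook, i.e. one Excel file
workbook = xlwt.Workbook(encoding='utf-8')
# Create a worksheet, i.e. one sheet inside the workbook
worksheet = workbook.add_sheet('Sheet1')

# Write to Excel
worksheet.write(0, 0, 'context')	#	row, column, value

# Save (save() is a method of the workbook, not of the worksheet)
workbook.save('demo.xls')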
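
To write more than one cell, the write call is typically wrapped in a loop over rows and columns. The sketch below assumes the data is already a list of rows; write_rows and save_xls are hypothetical helper names, while Workbook, add_sheet, write and save are the xlwt calls used above.

```python
def write_rows(worksheet, rows, start_row=0):
    """Write each inner sequence of `rows` to one worksheet row, starting at column 0."""
    for r, row in enumerate(rows, start=start_row):
        for c, value in enumerate(row):
            worksheet.write(r, c, value)  # (row index, column index, value)


def save_xls(rows, path, sheet_name='Sheet1'):
    """Create a workbook with a single sheet, fill it, and save it as .xls."""
    import xlwt  # requires `pip install xlwt`
    workbook = xlwt.Workbook(encoding='utf-8')
    write_rows(workbook.add_sheet(sheet_name), rows)
    workbook.save(path)
```

For example, save_xls([['name', 'score'], ['Ann', 90]], 'demo.xls') would produce a two-row sheet.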

3. Regular expressions

Reference: Runoob tutorial, Regular Expressions
The re module gives Python full regular expression support.

Commonly used functions and usage

>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)   # re.I ignores case
>>> m = pattern.match('Hello World Wide Web')
>>> print(m)                              # match succeeded, returns a Match object
<re.Match object; span=(0, 11), match='Hello World'>
>>> m.group(0)                            # the entire matched substring
'Hello World'
>>> m.span(0)                             # index range of the entire matched substring
(0, 11)
>>> m.group(1)                            # substring matched by the first group
'Hello'
>>> m.span(1)                             # index range of the first group's match
(0, 5)

>>> pattern = re.compile(r'\d+')          # matches one or more digits
>>> m = pattern.match('666')              # match() only looks at the start of the string

pattern = re.compile(r'\d+')   # find digits
result1 = pattern.findall('runoob 123 google 456')         # ['123', '456']
result2 = pattern.findall('run88oob123google456', 0, 10)   # scans only positions 0-9: ['88', '12']
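
Because match() only tests the start of the string, it is easy to confuse with search(). The short comparison below is a sketch with made-up sample strings; it uses only standard re functions.

```python
import re

digits = re.compile(r'\d+')

# match() succeeds only if the pattern matches at position 0.
at_start = digits.match('abc 123')

# search() scans forward and returns the first match anywhere in the string.
anywhere = digits.search('abc 123')

# findall() returns every non-overlapping match; the optional pos/endpos
# arguments limit the scan to the characters from pos up to endpos.
limited = digits.findall('run88oob123google456', 0, 10)

print(at_start)          # None
print(anywhere.group())  # 123
print(limited)           # ['88', '12']
```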

The overall approach for the project:
Read Word
Get all paragraphs in the docx file and split them into items at a common marker (the '[' character). Then filter the items with a regular expression grouping rule to extract the required content, and store it in a dict whose values are lists.
Write to Excel
Look up the values in the dict by key and write them to specific rows and columns in Excel.
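
The steps above might be sketched as follows. The post does not show its actual grouping rule, so the item layout "[label] body", the pattern ITEM_RE and the function names split_items, group_items and to_cells are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed item layout "[label] body"; the real grouping rule is not given in the post.
ITEM_RE = re.compile(r'\[(?P<label>[^\]]+)\]\s*(?P<body>.*)', re.S)


def split_items(paragraphs):
    """Join the paragraphs and cut the text into items that each start at a '['."""
    text = '\n'.join(paragraphs)
    return [item.strip() for item in re.findall(r'\[[^\[]*', text)]


def group_items(items):
    """Store each item's body in a dict of lists keyed by its label."""
    groups = defaultdict(list)
    for item in items:
        match = ITEM_RE.match(item)
        if match:  # items that do not fit the rule are filtered out
            groups[match.group('label')].append(match.group('body').strip())
    return dict(groups)


def to_cells(groups):
    """Yield (row, column, value): one row per key, key in column 0, values after it."""
    for row, (key, values) in enumerate(groups.items()):
        yield row, 0, key
        for col, value in enumerate(values, start=1):
            yield row, col, value
```

Each tuple produced by to_cells can be passed directly to worksheet.write(row, col, value). Text that appears before the first '[' is ignored in this sketch.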

Origin blog.csdn.net/qq_32301683/article/details/104200849