Author: Ann Star Fruit
Source: AirPython (public account)
The previous article summarized common operations for writing data to Word; for details, see Word in Python Office Automation (Part 1). Reading data is just as practical as writing it. This article covers how to read the contents of a Word document comprehensively and points out a few things to watch for.
Basic Information
We again use the python-docx library to read Word documents. First, let's read the document's basic information: sections, page margins, header and footer margins, page width and height, page orientation, and so on.
Before reading this information, we build a Document object from the document path. The self. prefix in the snippets indicates that they sit inside the methods of a class.
from docx import Document
# Source file path
self.word_path = './output.docx'
# Open the document and build a document object
self.doc = Document(self.word_path)
1-Section
# 1. Get section information
# Note: a section controls the page size, header, and footer of its pages
msg_sections = self.doc.sections
print("Section list:", msg_sections)
# Number of sections
print('Number of sections:', len(msg_sections))
2-Page Margin
The left_margin, top_margin, right_margin, and bottom_margin attributes of a section object return the left, top, right, and bottom margins of that section.
def get_page_margin(section):
    """
    Get the page margins (in EMU) of a section
    :param section:
    :return:
    """
    # Left, top, right, and bottom margins, respectively
    left, top, right, bottom = section.left_margin, section.top_margin, section.right_margin, section.bottom_margin
    return left, top, right, bottom

# 2. Page margin information
first_section = msg_sections[0]
left, top, right, bottom = get_page_margin(first_section)
print('Left margin:', left, ", top margin:", top, ", right margin:", right, ", bottom margin:", bottom)
The return values are in EMU (English Metric Units). The conversion to centimeters and inches is: 1 cm = 360,000 EMU and 1 inch = 914,400 EMU.
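To make raw EMU values easier to read, here is a minimal sketch of the conversion. It assumes only the standard constants (914,400 EMU per inch, 360,000 EMU per centimeter); the helper names emu_to_cm and emu_to_inches are hypothetical and not part of python-docx.

```python
EMU_PER_INCH = 914400
EMU_PER_CM = 360000


def emu_to_cm(emu):
    """Convert a length in EMU to centimeters."""
    return emu / EMU_PER_CM


def emu_to_inches(emu):
    """Convert a length in EMU to inches."""
    return emu / EMU_PER_INCH


# Example value: a margin of 1143000 EMU
margin = 1143000
print(emu_to_cm(margin), "cm,", emu_to_inches(margin), "inches")
```

The length values returned by python-docx also expose .cm and .inches properties, so in practice left.cm gives the same result without a helper.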
3-Header and Footer Margins
Header margin: header_distance
Footer margin: footer_distance
def get_header_footer_distance(section):
    """
    Get the header and footer margins
    :param section:
    :return:
    """
    # Header margin and footer margin, respectively
    header_distance, footer_distance = section.header_distance, section.footer_distance
    return header_distance, footer_distance

# 3. Header and footer margins
header_distance, footer_distance = get_header_footer_distance(first_section)
print('Header margin:', header_distance, ", footer margin:", footer_distance)
4-Page Width and Height
Page width: page_width
Page height: page_height
def get_page_size(section):
    """
    Get page width and height
    :param section:
    :return:
    """
    # Page width and height, respectively
    page_width, page_height = section.page_width, section.page_height
    return page_width, page_height

# 4. Page width and height
page_width, page_height = get_page_size(first_section)
print('Page width:', page_width, ", page height:", page_height)
5-Page Orientation
Page orientation is either landscape (horizontal) or portrait (vertical).
Use the orientation property of a section object to get that section's page orientation.
def get_page_orientation(section):
    """
    Get page orientation
    :param section:
    :return:
    """
    return section.orientation

# 5. Page orientation
# Type: <class 'docx.enum.base.EnumValue'>
# Values: PORTRAIT (0), LANDSCAPE (1)
page_orientation = get_page_orientation(first_section)
print("Page orientation:", page_orientation)
Similarly, you can assign to this attribute to set the orientation of a section.
from docx.enum.section import WD_ORIENT
# Set page orientation (landscape or portrait)
# Set to landscape
first_section.orientation = WD_ORIENT.LANDSCAPE
# Set to portrait
# first_section.orientation = WD_ORIENT.PORTRAIT
self.doc.save(self.word_path)
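One thing to watch for: as the python-docx documentation describes it, orientation is stored separately from the page size, so assigning orientation alone leaves page_width and page_height as they were. A minimal sketch of swapping them follows; ensure_landscape_dimensions is a hypothetical helper, and the SimpleNamespace object is only a stand-in for a real section so the snippet runs on its own.

```python
from types import SimpleNamespace


def ensure_landscape_dimensions(section):
    """Swap width and height if the page is still taller than it is wide."""
    if section.page_width < section.page_height:
        section.page_width, section.page_height = (
            section.page_height,
            section.page_width,
        )


# Stand-in for a section with A4 portrait dimensions (values in EMU)
demo_section = SimpleNamespace(page_width=7560310, page_height=10692130)
ensure_landscape_dimensions(demo_section)
print(demo_section.page_width, demo_section.page_height)
```

With a real document, the call would go between setting first_section.orientation and saving the file.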
Paragraph
Use the paragraphs property of the document object to get all paragraphs in the document.
Note: the paragraphs returned here do not include those in headers, footers, or tables.
# Get all paragraphs in the document; by default this excludes paragraphs in headers, footers, and tables
paragraphs = self.doc.paragraphs
# 1. Number of paragraphs
paragraphs_length = len(paragraphs)
print('The document contains a total of {} paragraphs'.format(paragraphs_length))
1-Paragraph Content
We can iterate over the document's paragraph list and read each paragraph's content through the text property of the paragraph object.
# Read the text of every paragraph
contents = [paragraph.text for paragraph in self.doc.paragraphs]
print(contents)
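Because doc.paragraphs skips text inside tables, a reader who wants that text as well has to walk the table cells separately. The sketch below shows one way to do it using only attributes already mentioned in this article (paragraphs, tables, rows, cells, text) plus cell.paragraphs. The function name iter_all_text is hypothetical, and the SimpleNamespace objects are stand-ins so the snippet runs without a real file.

```python
from types import SimpleNamespace


def iter_all_text(doc):
    """Yield paragraph text from the body, then from every table cell."""
    for paragraph in doc.paragraphs:
        yield paragraph.text
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    yield paragraph.text


def make_paragraph(text):
    """Stand-in for a paragraph object (illustration only)."""
    return SimpleNamespace(text=text)


demo_cell = SimpleNamespace(paragraphs=[make_paragraph("in a cell")])
demo_table = SimpleNamespace(rows=[SimpleNamespace(cells=[demo_cell])])
demo_doc = SimpleNamespace(paragraphs=[make_paragraph("body")], tables=[demo_table])
print(list(iter_all_text(demo_doc)))
```

Note that this yields all body paragraphs first and table text afterwards, so the result does not follow the original document order.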
2-Paragraph Format
As the previous article showed, paragraphs also carry formatting.
Use the paragraph_format property to get a paragraph's basic format information.
This includes alignment, left and right indentation, line spacing, spacing before and after the paragraph, and so on.
# 2. Get the format information of a paragraph
paragraph_someone = paragraphs[0]
# 2.1 Paragraph content
content = paragraph_someone.text
print('Paragraph content:', content)
# 2.2 Paragraph format
paragraph_format = paragraph_someone.paragraph_format
# 2.2.1 Alignment
# Type: <class 'docx.enum.base.EnumValue'>
alignment = paragraph_format.alignment
print('Paragraph alignment:', alignment)
# 2.2.2 Left and right indentation
left_indent, right_indent = paragraph_format.left_indent, paragraph_format.right_indent
print('Paragraph left indent:', left_indent, ", right indent:", right_indent)
# 2.2.3 First line indent
first_line_indent = paragraph_format.first_line_indent
print('Paragraph first line indent:', first_line_indent)
# 2.2.4 Line spacing
line_spacing = paragraph_format.line_spacing
print('Paragraph line spacing:', line_spacing)
# 2.2.5 Spacing before and after the paragraph
space_before, space_after = paragraph_format.space_before, paragraph_format.space_after
print('Spacing before and after the paragraph:', space_before, ',', space_after)
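The values printed above need some interpretation. In python-docx, a format property that is not set directly on the paragraph comes back as None (it is inherited from the style), and line_spacing is a float when it is a multiple of the line height but an integer length in EMU when it is a fixed height. A minimal sketch of turning that into readable text, with the hypothetical helper name describe_line_spacing and the constant 12,700 EMU per point:

```python
EMU_PER_PT = 12700


def describe_line_spacing(value):
    """Return a readable description of a paragraph_format.line_spacing value."""
    if value is None:
        return "inherited from style"
    if isinstance(value, float):
        return f"{value} x line height"
    # Otherwise treat the value as a fixed height in EMU
    return f"fixed {value / EMU_PER_PT} pt"


print(describe_line_spacing(None))
print(describe_line_spacing(1.5))
print(describe_line_spacing(152400))
```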
Text Block-Run
A text block (Run) is part of a paragraph, so to get text block information you must first get a paragraph object.
We take the basic information and font format information of text blocks as examples.
1-Basic Information of the Text Block
Use the runs property of the paragraph object to get all text block objects in the paragraph.
def get_runs(paragraph):
    """
    Get all text block information in a paragraph: the runs, their number, and their contents
    :param paragraph:
    :return:
    """
    # The text blocks (Run objects) contained in the paragraph
    runs = paragraph.runs
    # Number of text blocks
    runs_length = len(runs)
    # Content of each text block
    runs_contents = [run.text for run in runs]
    return runs, runs_length, runs_contents
2-Text Block Format Information
A text block is the smallest text unit in the document; its font properties are available through the font property of the text block object.
These mirror the properties used when setting text block formatting: font name, size, color, bold, italic, and so on can all be read.
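# 2. Text block format information
# Includes: font name, size, color, bold, and so on
# Text blocks of the paragraph read earlier
runs, runs_length, runs_contents = get_runs(paragraph_someone)
# The font attribute of one text block
run_someone_font = runs[0].font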
# Font name
font_name = run_someone_font.name
print('Font name:', font_name)
# Font color (RGB)
# Type: <class 'docx.shared.RGBColor'>
font_color = run_someone_font.color.rgb
print('Font color:', font_color)
print(type(font_color))
# Font size
font_size = run_someone_font.size
print('Font size:', font_size)
# Bold
# True: bold; False: not bold; None: not set on the run (inherited from the style)
font_bold = run_someone_font.bold
print('Bold:', font_bold)
# Italic
# True: italic; False: not italic; None: not set on the run
font_italic = run_someone_font.italic
print('Italic:', font_italic)
# Underline
# True: underlined; False: not underlined; None: not set on the run
font_underline = run_someone_font.underline
print('Underline:', font_underline)
# Strikethrough / double strikethrough
# True: struck through; False: no strikethrough; None: not set on the run
font_strike = run_someone_font.strike
font_double_strike = run_someone_font.double_strike
print('Strikethrough:', font_strike, "\nDouble strikethrough:", font_double_strike)
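The font flags read above are tri-state: True, False, or None, where None means the run does not set the flag itself and the value comes from the style. A small sketch that summarizes them in one pass; the helper names describe_tristate and summarize_font_flags are hypothetical, and the SimpleNamespace object stands in for a real font object.

```python
from types import SimpleNamespace


def describe_tristate(value):
    """Map a tri-state font flag to a readable label."""
    if value is None:
        return "inherited"
    return "on" if value else "off"


def summarize_font_flags(font):
    """Collect the tri-state flags of a font-like object into a dict."""
    names = ("bold", "italic", "underline", "strike", "double_strike")
    return {name: describe_tristate(getattr(font, name)) for name in names}


# Stand-in for run.font (illustration only)
demo_font = SimpleNamespace(
    bold=True, italic=None, underline=False, strike=None, double_strike=None
)
print(summarize_font_flags(demo_font))
```

With a real document, summarize_font_flags(runs[0].font) would be the equivalent call.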
Table
The tables attribute of the document object returns all table objects in the current document.
# All table objects in the document
tables = self.doc.tables
# 1. Number of tables
table_num = len(tables)
print('Number of tables contained in the document:', table_num)
1-All Data in the Table
There are two ways to get all the data in a table.
The first way: iterate over all tables in the document, then over each row and cell, and read each cell's content through the cell's text property.
# 2. Read all table data
# All table objects
# tables = [table for table in self.doc.tables]
print('The contents are:')
for table in tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text, end='')
        # Line break after each row
        print()
    # Blank lines after each table
    print('\n')
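The second way: use the _cells property of the table object to get all cells in the table, then iterate over them to read each cell's value. Note that _cells is a private attribute (leading underscore), so it may change between python-docx versions.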
def get_table_cell_content(table):
    """
    Read the contents of all cells in the table
    :param table:
    :return:
    """
    # All cells
    cells = table._cells
    # Content of all cells
    content = [cell.text for cell in cells]
    return content
2-Table Style
# 3. Table style name
# e.g. Table Grid
table_someone = tables[0]
style = table_someone.style.name
print("Table style:", style)
3-Number of Table Rows and Columns
table.rows: iterable of the rows in the table
table.columns: iterable of the columns in the table
def get_table_size(table):
    """
    Get the number of rows and columns of the table
    :param table:
    :return:
    """
    # Number of rows and columns
    row_length, column_length = len(table.rows), len(table.columns)
    return row_length, column_length
4-Row Data and Column Data
Sometimes we need to get all the data by row or by column.
def get_table_row_datas(table):
    """
    Get the row data in the table
    :param table:
    :return:
    """
    rows = table.rows
    datas = []
    # Collect the text of each cell in a row into a list and append it to the result
    for row in rows:
        datas.append([cell.text for cell in row.cells])
    return datas

def get_table_column_datas(table):
    """
    Get the column data in the table
    :param table:
    :return:
    """
    columns = table.columns
    datas = []
    # Collect the text of each cell in a column into a list and append it to the result
    for column in columns:
        datas.append([cell.text for cell in column.cells])
    return datas
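Row data in the list-of-lists shape returned by get_table_row_datas is often easier to work with as records. The sketch below assumes that the first row of the table is a header row, which is an assumption about the document and not something python-docx knows; rows_to_records is a hypothetical helper name.

```python
def rows_to_records(rows):
    """Turn [[header...], [values...], ...] into a list of dicts keyed by header."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Sample data in the same shape as the output of get_table_row_datas
demo_rows = [["Name", "Age"], ["Ann", "30"], ["Bob", "25"]]
print(rows_to_records(demo_rows))
```

For a real table, rows_to_records(get_table_row_datas(table)) would apply it to the data read above.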
Image
Sometimes we need to save the pictures in a Word document to local disk. A Word document is actually a compressed (ZIP) file; after unpacking it, you will find that the pictures it contains are all stored in the /word/media/ directory.
There are two ways to extract the pictures:
- Unzip the document file and copy the pictures from that directory
- Use python-docx's built-in objects to extract the pictures (recommended)
import os

def get_word_pics(doc, word_path, output_path):
    """
    Extract pictures from a Word document
    :param doc: Document object
    :param word_path: source file path
    :param output_path: output directory
    :return:
    """
    # Create the output directory if it does not exist
    os.makedirs(output_path, exist_ok=True)
    # Source file name without directory or extension, used as a prefix
    word_name = os.path.splitext(os.path.basename(word_path))[0]
    for rel in doc.part.rels.values():
        # Keep only image relationships, e.g. "media/image1.png"
        if "image" in rel.target_ref:
            img_name = f'{word_name}_{os.path.basename(rel.target_ref)}'
            # Write the image bytes to a file
            with open(os.path.join(output_path, img_name), "wb") as f:
                f.write(rel.target_part.blob)
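The first method listed above, unzipping the document and copying the pictures, needs nothing beyond the standard library. Here is a minimal sketch using zipfile; the function name extract_media is hypothetical, and it assumes the pictures sit under word/media/ inside the archive as described.

```python
import os
import zipfile


def extract_media(word_path, output_path):
    """Copy every file under word/media/ in the .docx archive to output_path."""
    os.makedirs(output_path, exist_ok=True)
    saved = []
    with zipfile.ZipFile(word_path) as archive:
        for name in archive.namelist():
            # Skip directory entries and anything outside word/media/
            if name.startswith("word/media/") and not name.endswith("/"):
                target = os.path.join(output_path, os.path.basename(name))
                with archive.open(name) as src, open(target, "wb") as dst:
                    dst.write(src.read())
                saved.append(target)
    return saved
```

A call such as extract_media('./output.docx', './pics') returns the list of files it wrote, which makes it easy to check whether the document contained any pictures at all.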
Header and Footer
Headers and footers belong to sections. Let's take one section object as an example.
# Get a section
first_section = self.doc.sections[0]
Use the header and footer properties of the section object to get the header and footer objects. Because a header or footer may contain multiple paragraphs, we first use its paragraphs property to get all paragraphs, then iterate over them, and finally concatenate their text to get the full header or footer content.
# Note: the header and footer may contain multiple paragraphs
# All paragraphs in the header
header_content = "".join([paragraph.text for paragraph in first_section.header.paragraphs])
print("Header content:", header_content)
# All paragraphs in the footer
footer_content = "".join([paragraph.text for paragraph in first_section.footer.paragraphs])
print("Footer content:", footer_content)
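A section does not always define its own header. python-docx header and footer objects have an is_linked_to_previous attribute that is True when the content is inherited from the previous section. The sketch below checks that flag before reading and joins paragraphs with a newline so that they stay distinguishable. The helper name read_header_text is hypothetical, and the SimpleNamespace objects are stand-ins for real sections.

```python
from types import SimpleNamespace


def read_header_text(section):
    """Return the section's own header text, or None if the header is inherited."""
    header = section.header
    if header.is_linked_to_previous:
        return None
    return "\n".join(paragraph.text for paragraph in header.paragraphs)


# Stand-ins for two sections (illustration only)
own_header = SimpleNamespace(
    is_linked_to_previous=False,
    paragraphs=[SimpleNamespace(text="Report"), SimpleNamespace(text="2024")],
)
linked_header = SimpleNamespace(is_linked_to_previous=True, paragraphs=[])
print(read_header_text(SimpleNamespace(header=own_header)))
print(read_header_text(SimpleNamespace(header=linked_header)))
```

The same pattern applies to section.footer.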