Python Office Automation with Word (Part 2)

Author: Ann Star Fruit 

Source: AirPython (public account)

The previous article summarized common operations for writing data to Word; for details, see Word of Python Office Automation (Part 1). Reading data is just as practical as writing it. This article walks through how to read the data in a Word document and points out a few things to watch for.

Basic Information

We again use the python-docx library to read Word documents. First, let's read the basic information of the document: sections, page margins, header and footer margins, page width and height, page orientation, and so on.

Before obtaining the basic information, we construct a Document object from the document path.

from docx import Document

# Source file directory
self.word_path ='./output.docx'

# Open the document and build a document object
self.doc = Document(self.word_path)

1-Section

# 1. Get section information
# Note: each section can set its own page size, header, and footer
msg_sections = self.doc.sections
print("Section list:", msg_sections)
# Number of sections
print('Number of sections:', len(msg_sections))

2-Page Margin

The left_margin, top_margin, right_margin, and bottom_margin attributes of the section object return the left, top, right, and bottom margins of the current section.

def get_page_margin(section):
    """
    Get the page margins (EMU) of a section
    :param section:
    :return:
    """
    # Corresponding to: left, top, right, and bottom margins
    left, top, right, bottom = section.left_margin, section.top_margin, section.right_margin, section.bottom_margin
    return left, top, right, bottom

# 2. Page margin information
first_section = msg_sections[0]
left, top, right, bottom = get_page_margin(first_section)
print('Left margin:', left, ",Top margin:", top, ",Right margin:", right, ",Bottom margin:", bottom)

The return values are in EMU (English Metric Units). The conversion is: 1 cm = 360,000 EMU and 1 inch = 914,400 EMU.
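As a minimal sketch of that conversion, the constants below are the standard EMU ratios (914,400 per inch, 360,000 per centimeter, 12,700 per point); the helper names are made up for illustration and the sample margin is a hypothetical value.

```python
# Standard EMU ratios used by Office documents
EMU_PER_INCH = 914400
EMU_PER_CM = 360000
EMU_PER_PT = 12700


def emu_to_cm(emu):
    """Convert an EMU value to centimeters."""
    return emu / EMU_PER_CM


def emu_to_inches(emu):
    """Convert an EMU value to inches."""
    return emu / EMU_PER_INCH


def emu_to_pt(emu):
    """Convert an EMU value to points."""
    return emu / EMU_PER_PT


if __name__ == "__main__":
    left_margin = 1143000  # hypothetical margin value in EMU
    print("cm:", emu_to_cm(left_margin), "inches:", emu_to_inches(left_margin))
```

The length values returned by python-docx also expose `.cm`, `.inches`, and `.pt` properties, so in practice `first_section.left_margin.cm` should give the same result without any hand-written helpers.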

3-Header and Footer Margins

Header margin: header_distance

Footer margin: footer_distance

def get_header_footer_distance(section):
    """
    Get the header and footer margins
    :param section:
    :return:
    """
    # correspond to the header margin and footer margin respectively
    header_distance, footer_distance = section.header_distance, section.footer_distance
    return header_distance, footer_distance

# 3. Header and footer margin
header_distance, footer_distance = get_header_footer_distance(first_section)
print('Header margin:', header_distance, ", footer margin:", footer_distance)

4-Page Width and Height

Page width: page_width

Page height: page_height

def get_page_size(section):
    """
    Get page width and height
    :param section:
    :return:
    """
    # Corresponding to page width and height respectively
    page_width, page_height = section.page_width, section.page_height
    return page_width, page_height

# 4. Page Width, height
page_width, page_height = get_page_size(first_section)
print('page width:', page_width, ", page height:", page_height)

5-Page Orientation

The page orientation is either landscape (horizontal) or portrait (vertical).

Use the orientation property of the section object to get the page orientation of a section.

def get_page_orientation(section):
    """
    Get page orientation
    :param section:
    :return:
    """
    return section.orientation

# 5. Page orientation
# Type: <class 'docx.enum.base.EnumValue'>
# Contains: PORTRAIT (0), LANDSCAPE (1)
page_orientation = get_page_orientation(first_section)
print("page orientation:", page_orientation)

Similarly, you can assign to this attribute to set the orientation of a section.

from docx.enum.section import WD_ORIENT

# Set page orientation (horizontal, vertical)
# Set to horizontal
first_section.orientation = WD_ORIENT.LANDSCAPE
# Set to vertical
# first_section.orientation = WD_ORIENT.PORTRAIT
self.doc.save(self.word_path)
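One point to watch: as far as we know, assigning `orientation` only changes the orientation flag, and python-docx does not swap the page width and height for you. The sketch below swaps them when they disagree with the wanted orientation. The function name is made up for illustration, it assumes both dimensions are set (not None), and it uses a stand-in object with US Letter dimensions in EMU so that it runs without a .docx file.

```python
from types import SimpleNamespace


def match_dimensions_to_orientation(section, landscape):
    """Swap page width and height if they disagree with the wanted orientation.

    `section` is any object with page_width and page_height attributes,
    such as a python-docx section. Assumes both values are not None.
    """
    width, height = section.page_width, section.page_height
    is_wide = width > height
    if is_wide != landscape:
        section.page_width, section.page_height = height, width
    return section.page_width, section.page_height


if __name__ == "__main__":
    # Stand-in for a section: US Letter portrait, 8.5 x 11 inches in EMU
    fake_section = SimpleNamespace(page_width=7772400, page_height=10058400)
    print(match_dimensions_to_orientation(fake_section, landscape=True))
```

With a real document, this would be called as `match_dimensions_to_orientation(first_section, landscape=True)` right after setting `first_section.orientation`, and before saving.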

Paragraph

Use the paragraphs property of the document object to get all paragraphs in the document

Note: The paragraphs obtained here do not include headers, footers, or paragraphs inside tables

# Get all the paragraphs in the document object, the default does not include: headers, footers, paragraphs in the table
paragraphs = self.doc.paragraphs

# 1. The number of paragraphs
paragraphs_length = len(paragraphs)
print('The document contains a total of {} paragraphs'.format(paragraphs_length))
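Because paragraphs inside tables are skipped, they have to be collected separately. The following is a minimal sketch of one way to do that by walking tables, rows, and cells; the function name is made up for illustration, and the stand-in objects only mimic the attributes used here so that the sketch runs without a .docx file.

```python
from types import SimpleNamespace


def iter_table_paragraph_texts(doc):
    """Yield the text of every paragraph found inside the document's tables."""
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    yield paragraph.text


if __name__ == "__main__":
    # Stand-in objects (not python-docx) with the same attribute names
    cell = SimpleNamespace(paragraphs=[SimpleNamespace(text="Name"),
                                       SimpleNamespace(text="Age")])
    fake_doc = SimpleNamespace(
        tables=[SimpleNamespace(rows=[SimpleNamespace(cells=[cell])])])
    print(list(iter_table_paragraph_texts(fake_doc)))
```

With a real document, pass `self.doc` instead of the stand-in. Merged cells may be visited more than once when iterating by row, so the same text can appear repeatedly.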

1-Paragraph Content

We can traverse the list of paragraphs in the document and get each paragraph's content through the text property of the paragraph object

# 0, read all paragraph data
contents = [paragraph.text for paragraph in self.doc.paragraphs]
print(contents)

2-Paragraph Format

Through the previous article, we know that paragraphs also have formatting

Use the paragraph_format property to get the basic format information of the paragraph

Including: alignment, left and right indentation, line spacing, spacing before and after the paragraph, etc.

# 2. Get the format information of a paragraph
paragraph_someone = paragraphs[0]

# 2.1 Paragraph content
content = paragraph_someone.text
print('Paragraph content:', content)

# 2.2 Paragraph format
paragraph_format = paragraph_someone.paragraph_format

# 2.2.1 Alignment
# <class 'docx.enum.base.EnumValue'>
alignment = paragraph_format.alignment
print('Paragraph alignment:', alignment)

# 2.2.2 Left and right indentation
left_indent, right_indent = paragraph_format.left_indent, paragraph_format.right_indent
print('Paragraph left indent:', left_indent, ", right indent:", right_indent)

# 2.2.3 First line indent
first_line_indent = paragraph_format.first_line_indent
print('Paragraph first line indent:', first_line_indent)

# 2.2.4 Line spacing
line_spacing = paragraph_format.line_spacing
print('Paragraph line spacing:', line_spacing)

# 2.2.5 The space before and after the paragraph
space_before, space_after = paragraph_format.space_before, paragraph_format.space_after
print('The space before and after paragraph are:', space_before, ' ,', space_after)
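Many of these values print as None, which means the setting is not defined on the paragraph itself and is inherited from its style. Line spacing is a further special case: to our understanding it is a float when expressed as a multiple of the line height and a length in EMU when it is a fixed height. A minimal sketch of a helper that makes the three cases readable follows; the function name is made up for illustration.

```python
EMU_PER_PT = 12700  # one point is 12,700 EMU


def describe_line_spacing(value):
    """Return a readable description of a paragraph's line_spacing value."""
    if value is None:
        # Nothing set on the paragraph itself
        return "inherited from style"
    if isinstance(value, float):
        # A float is a multiple of the line height, e.g. 1.5
        return f"{value:g} lines"
    # Otherwise treat it as a fixed length in EMU
    return f"{value / EMU_PER_PT:g} pt"


if __name__ == "__main__":
    for sample in (None, 1.5, 228600):
        print(describe_line_spacing(sample))
```

With a real paragraph this would be called as `describe_line_spacing(paragraph_format.line_spacing)`.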

 

Text block-Run

The text block Run is part of the paragraph, so to get the text block information, you must first get a paragraph instance object

Take basic information and font format information of text blocks as examples

1-Basic information of the text block

We use the runs property of the paragraph object to get all the text block objects in the paragraph

def get_runs(paragraph):
    """
    Get all the text block information under the paragraph, including: number, content list
    :param paragraph:
    :return:
    """
    # The text block contained in the paragraph object Run
    runs = paragraph.runs

    # number
    runs_length = len(runs)

    # text block content
    runs_contents = [run.text for run in runs]

    return runs, runs_length, runs_contents

2-Text block format information

The text block is the smallest text unit in the document, and its font properties can be obtained by using the font property of the text block object

These mirror the attributes used when setting a text block's format: font name, size, color, bold, italic, and so on can all be read back

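# 2. Text block format information
# Contains: font name, size, color, whether to be bold, etc.
# Text blocks of the first paragraph, via get_runs() defined above
runs, runs_length, runs_contents = get_runs(paragraphs[0])
# The font attribute of a text block
run_someone_font = runs[0].font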

# font name
font_name = run_someone_font.name
print('font name :', font_name)

# font color(RGB)
# <class 'docx.shared.RGBColor'>
font_color = run_someone_font.color.rgb
print('font color:', font_color)
print(type(font_color))

# font size
font_size = run_someone_font.size
print('font size:', font_size)

# Whether to bold
# True: Bold; None/False: No bold
font_bold = run_someone_font.bold
print('Whether to bold:', font_bold)

# Whether to italic
# True: italic; None/False: not italic
font_italic = run_someone_font.italic
print('Is it italic:', font_italic)

# Underlined
# True: with underline; None/False: font without underline
font_underline = run_someone_font.underline
print('with underline:', font_underline)

# strikethrough/double strikethrough
# True: with strikethrough; None/False: the font has no strikethrough
font_strike = run_someone_font.strike
font_double_strike = run_someone_font.double_strike
print('With strikethrough:', font_strike, "\nWith double strikethrough:", font_double_strike)
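Since most of these attributes come back as None when the text block simply inherits its formatting, it can be handy to keep only the values that are set explicitly. The sketch below does that for the attributes read above; the function name is made up for illustration, and the stand-in font object only mimics the attribute names so that the sketch runs without a .docx file.

```python
from types import SimpleNamespace

FONT_ATTRIBUTES = ("name", "size", "bold", "italic",
                   "underline", "strike", "double_strike")


def explicit_font_settings(font):
    """Return only the font settings that are set on the text block itself."""
    settings = {attr: getattr(font, attr) for attr in FONT_ATTRIBUTES}
    settings["color"] = font.color.rgb
    # None means "not set here", so drop those entries
    return {key: value for key, value in settings.items() if value is not None}


if __name__ == "__main__":
    # Stand-in for run.font (not python-docx) with the same attribute names
    fake_font = SimpleNamespace(name="Arial", size=None, bold=True, italic=None,
                                underline=None, strike=False, double_strike=None,
                                color=SimpleNamespace(rgb=None))
    print(explicit_font_settings(fake_font))
```

With a real text block this would be called as `explicit_font_settings(runs[0].font)`. Note that False is kept, because it is an explicit "off" rather than an inherited value.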

Table

The tables attribute of the document object can get all table objects in the current document

# All table objects in the document
tables = self.doc.tables

# 1. Number of tables
table_num = len(tables)
print('Number of tables contained in the document:', table_num)

1-All data in the table

There are 2 ways to get all the data in the table

The first way: by traversing all tables in the document, then traversing by row and cell, and finally obtaining the text content of all cells through the text property of the cell

# 2. Read all table data
# All table objects
# tables = [table for table in self.doc.tables]
print('The contents are:')
for table in tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text, end='')
        print()
    print('\n')

Another way is to use the _cells property of the table object (a private attribute, so it may change between versions) to get all the cells in the table, and then traverse them to get the cell values

def get_table_cell_content(table):
    """
    Read the contents of all cells in the table
    :param table:
    :return:
    """
    # All cells
    cells = table._cells
    cell_size = len(cells)

    # All cells content
    content = [cell.text for cell in cells]
    return content

2-Table Style

# 3. Style name
# Table Grid
table_someone = tables[0]
style = table_someone.style.name
print("style:", style)

3-Number of table rows, number of columns

table.rows: Iteration object of row data in the table

table.columns: column data iteration object in the table

def get_table_size(table):
    """
    Get the number of rows and columns of the table
    :param table:
    :return:
    """
    # Number of rows and columns
    row_length, column_length = len(table.rows), len(table.columns)
    return row_length, column_length

4-Row Data, Column Data

Sometimes, we need to get all the data by row or column alone

def get_table_row_datas(table):
    """
    Get row data in the table
    :param table:
    :return:
    """
    rows = table.rows
    datas = []

    # Get the data of each cell to form a list and add it to the result list
    for row in rows:
        datas.append([cell.text for cell in row.cells])
    return datas

def get_table_column_datas(table):
    """
    Get the column data in the table
    :param table:
    :return:
    """
    columns = table.columns
    datas = []

    # Get the cell data in each column to form a list and add it to the result list
    for column in columns:
        datas.append([cell.text for cell in column.cells])
    return datas
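Once the row data is a plain list of lists, a common next step is to treat the first row as a header and turn the rest into dictionaries. This is only a sketch under that assumption (the table really does have a header row); the function name and sample data are made up for illustration.

```python
def rows_to_records(rows):
    """Turn [[header...], [values...], ...] into a list of dicts keyed by header."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


if __name__ == "__main__":
    # Hypothetical output of get_table_row_datas(table)
    sample_rows = [["Name", "Age"], ["Alice", "30"], ["Bob", "25"]]
    print(rows_to_records(sample_rows))
```

With a real table this would be called as `rows_to_records(get_table_row_datas(table))`. Cell text is always a string, so numeric columns still need converting afterwards.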

Image

Sometimes we need to save the pictures in a Word document to local disk. A Word document is actually a compressed (ZIP) file; after unzipping it, we find that the pictures it contains are all placed in the /word/media/ directory

There are 2 methods to extract document pictures, namely:

  • Unzip the document file and copy the pictures in the corresponding directory
  • Use python-docx's built-in method to extract pictures (recommended)

import os

def get_word_pics(doc, word_path, output_path):
    """
    Extract pictures from a Word document
    :param doc: Document object
    :param word_path: source file name
    :param output_path: result directory
    :return:
    """
    # Source file name without directory or extension, used as a prefix
    word_name = os.path.splitext(os.path.basename(word_path))[0]

    for rel in doc.part.rels.values():
        # Skip external links (no embedded data) and non-image relationships
        if rel.is_external or "image" not in rel.target_ref:
            continue

        # Picture save directory
        os.makedirs(output_path, exist_ok=True)

        # New name: <document name>_<image file name>
        img_name = f'{word_name}_{os.path.basename(rel.target_ref)}'

        # Write to file
        with open(os.path.join(output_path, img_name), "wb") as f:
            f.write(rel.target_part.blob)
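For comparison, the first method (unzipping the document and copying the pictures) can be sketched with only the standard library. The function name is made up for illustration, and the sketch assumes the pictures sit under word/media/ inside the archive, as described above.

```python
import os
import zipfile


def extract_media(docx_path, output_dir):
    """Copy every file under word/media/ out of a .docx archive.

    Returns the list of paths that were written.
    """
    os.makedirs(output_dir, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Keep only files inside the media folder, skip directory entries
            if name.startswith("word/media/") and not name.endswith("/"):
                target = os.path.join(output_dir, os.path.basename(name))
                with open(target, "wb") as f:
                    f.write(archive.read(name))
                saved.append(target)
    return saved
```

A call such as `extract_media('./output.docx', './pics')` would then copy the pictures into a hypothetical `./pics` directory. Unlike the python-docx approach, this does not tell you where in the document each picture is used.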

Header and footer

Headers and footers belong to sections. Let's take one section object as an example.

# Get a certain section
first_section = self.doc.sections[0]

Use the header and footer properties of the section object to get the header and footer objects. Since a header or footer may contain multiple paragraphs, we first use the paragraphs property of the header or footer object to get all the paragraphs, then traverse them to read their text, and finally concatenate everything into the header and footer content.

# Note: The header and footer may contain multiple paragraphs
# All paragraphs in the header
header_content = "".join([paragraph.text for paragraph in first_section.header.paragraphs])
print("Header content:", header_content)

# Footer
footer_content = "".join([paragraph.text for paragraph in first_section.footer.paragraphs])
print("footer content:", footer_content)
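Since every section has its own header and footer, the same idea extends to the whole document. The sketch below collects one (header, footer) pair per section and joins paragraphs with a newline so that they stay distinguishable. The function name is made up for illustration, and the stand-in objects only mimic the attribute names so that the sketch runs without a .docx file.

```python
from types import SimpleNamespace


def header_footer_texts(doc):
    """Return a list of (header_text, footer_text) tuples, one per section."""
    results = []
    for section in doc.sections:
        header = "\n".join(p.text for p in section.header.paragraphs)
        footer = "\n".join(p.text for p in section.footer.paragraphs)
        results.append((header, footer))
    return results


if __name__ == "__main__":
    # Stand-in objects (not python-docx) with the same attribute names
    def part(*texts):
        return SimpleNamespace(paragraphs=[SimpleNamespace(text=t) for t in texts])

    fake_doc = SimpleNamespace(sections=[
        SimpleNamespace(header=part("Report", "Draft"), footer=part("Page 1")),
    ])
    print(header_footer_texts(fake_doc))
```

With a real document this would be called as `header_footer_texts(self.doc)`.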

 

Origin blog.csdn.net/yoggieCDA/article/details/110056745