python-docx in practice: my colleague asked me to help write 178 daily reports!

Preface

Not long ago, a colleague had a project to report to management. Part of the work was to take the daily data in an Excel workbook, turn each day into a daily report, and write it up in Word. Good grief! A full 178 days of reports had to be produced. Copying and pasting all of that by hand would be enough to make you cough up blood. (Sort it out yourself!) OK, it's time to bring out Python office automation.

1. Basic data collation

First, let's look at the sample data and the required output documents (sensitive data has been anonymized). The original Excel file contains n sheets, and each sheet holds one day's data. A day either has no records or has records (number of departments ≥ 1, number of records per department ≥ 1). These two cases need two kinds of daily report: one is a plain-text description, and the other is a document with an attached table.

Roll up your sleeves and start cursing! Oh no, I mean, start writing code!

First merge the sheets into one, which makes it easier to see the patterns in the daily records and simplifies later processing. Use the xlrd library to open the workbook and get the names of its sheets, then use pandas to loop over the sheets and merge them. The DataFrame format maps very naturally onto Excel tables.

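def merge_sheet(filepath):  # Merge multiple sheets that share the same header
    wb = xlrd.open_workbook(filepath)  # Note: xlrd >= 2.0 can only open .xls files
    sheets = wb.sheet_names()
    frames = []
    for name in sheets:
        frames.append(pd.read_excel(filepath, sheet_name=name))
    # DataFrame.append was removed in pandas 2.0, so concatenate the frames instead
    df_total = pd.concat(frames, ignore_index=True)
    df_total.to_excel("merge.xlsx", index=False)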
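
As a variation on the merge step, pandas can read every sheet by itself: `pd.read_excel(path, sheet_name=None)` returns a dictionary that maps each sheet name to a DataFrame. The sketch below only shows the stacking part on in-memory frames; the helper name `merge_frames` and the file name in the comment are made up for illustration.

```python
import pandas as pd


def merge_frames(sheets):
    """Stack a {sheet_name: DataFrame} mapping into one DataFrame.

    `sheets` is expected to look like the result of
    pd.read_excel("daily.xlsx", sheet_name=None)  # hypothetical file name
    """
    # ignore_index=True renumbers the rows 0..n-1 across all sheets
    return pd.concat(list(sheets.values()), ignore_index=True)
```

Because every sheet shares the same header, the frames line up column by column and no extra alignment is needed.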

2. Output the two kinds of daily report

(1) Plain text document

Given the required report format, the daily report for a day without records only needs the [Date] (日期) column and the [Reporting Department] (填报部门) column: for every date whose [Reporting Department] is "None" (无), output one report. After looking at the data in the original table, I simply filtered out the rows with no reported records in Excel and put them into a sheet named "无" (None).

You could also use .groupby() here to group by the [Reporting Department] column and take the "None" group. One point to note, though: Python is very powerful, but not everything has to be done in Python.
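
For readers who do want to stay in pandas, here is a minimal sketch of both ways to pick out the no-record dates. It assumes the merged table has the columns 日期 and 填报部门 and that "无" marks a day without records; the function names are hypothetical.

```python
import pandas as pd


def no_record_dates(df, dept_col="填报部门", date_col="日期", marker="无"):
    """Boolean-mask version: keep the rows whose department equals the marker."""
    mask = df[dept_col] == marker
    return df.loc[mask, date_col].tolist()


def no_record_dates_groupby(df, dept_col="填报部门", date_col="日期", marker="无"):
    """groupby version: group by department and take the marker group.

    get_group raises KeyError when no row carries the marker.
    """
    return df.groupby(dept_col).get_group(marker)[date_col].tolist()
```

Both return the same list of dates; the mask version is shorter and does not fail when the marker is absent.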

The libraries and modules to import are as follows:

import pandas as pd
import xlrd
from docx import Document
from docx.shared import Pt
from docx.shared import Inches
from docx.oxml.ns import qn
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT
from docx.enum.section import WD_ORIENTATION

The basic process is simple: read in the data for days with no reported records and output one Word document per date.

def wu_to_word(filepath):
    df = pd.read_excel(filepath, sheet_name="无")
    date_list = list(df['日期'])
    for d in date_list:
        filename = wordname+str(d)+").docx"  # Output Word file name
        title = "("+str(d)[:4]+"."+str(d)[4:6]+"."+str(d)[6:8]+")"  # Subtitle date, XXXX.XX.XX
        word = str(d)[:4]+"年"+str(d)[4:6]+"月"+str(d)[6:8]+"日"  # Opening and sign-off date, XXXX年XX月XX日
        wu_doc(title, word, filename)
        print(f"File: {filename}, {title}, {word} saved")
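
The string slicing above relies on the 日期 column holding eight-digit values such as 20200917. A small helper could build both date strings in one place and reject malformed values at the same time. This is only a sketch under that assumption; `format_report_dates` is a made-up name.

```python
from datetime import datetime


def format_report_dates(d):
    """Return (subtitle, long_date) for a date given as 20200917 or '20200917'."""
    # strptime raises ValueError for impossible dates such as 20201340
    parsed = datetime.strptime(str(d), "%Y%m%d")
    subtitle = f"({parsed.year}.{parsed.month:02d}.{parsed.day:02d})"
    long_date = f"{parsed.year}年{parsed.month:02d}月{parsed.day:02d}日"
    return subtitle, long_date


print(format_report_dates(20200917))  # ('(2020.09.17)', '2020年09月17日')
```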

Content that is the same in every document can be defined up front.

wordname = "XX company business data sheet (daily report"  # the closing ")" is added when the file name is built
all_title = "XX company business report"

Generating Word content without a table is relatively easy. Just pay attention to adjusting the formatting.

def wu_doc(title, word, filename):  # Takes the subtitle date, the opening/sign-off date, and the file name
    doc = Document()  # Create the document object
    section = doc.sections[0]  # Get the page section
    section.orientation = WD_ORIENTATION.LANDSCAPE  # Set the page orientation to landscape
    new_width, new_height = section.page_height, section.page_width  # Swap width and height to turn the portrait page into landscape
    section.page_width = new_width
    section.page_height = new_height
    # Global paragraph settings
    doc.styles['Normal'].font.name = u'宋体'  # Font
    doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')  # Chinese fonts need this extra setting
    doc.styles['Normal'].font.size = Pt(14)  # Font size: Chinese size 四号 corresponds to 14 pt
    t1 = doc.add_paragraph()  # Add a paragraph
    t1.paragraph_format.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER  # Center
    _t1 = t1.add_run(all_title)  # Add the paragraph content (main title)
    _t1.bold = True  # Bold
    _t1.font.size = Pt(22)
    t2 = doc.add_paragraph()  # Add another paragraph
    t2.paragraph_format.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER  # Center
    _t2 = t2.add_run(title + "\n")  # Add the paragraph content (subtitle)
    _t2.bold = True
    doc.add_paragraph(word + "无记录。\n\n").paragraph_format.first_line_indent = Inches(0.35)  # Add a paragraph with its content and set the first-line indent
    doc.add_paragraph(word).paragraph_format.alignment = WD_PARAGRAPH_ALIGNMENT.RIGHT  # Right-align the sign-off date
    doc.save(dir+filename)  # Save as path + file name; dir is the output folder string and must be defined beforehand (not shown here)

Done! That takes care of the 104 daily reports for days with no records, so those can be handed in already. I'd rather not study the rest, hahaha.

(2) Document with an attached table

The processing of the reported data is a bit more complicated. Let's look at the original data first.

For example, on year X, month X, day X, N departments submitted data. Following the sample document, the descriptive paragraph needs to be organized in the following format:

Department A: "Submission Content 1" X records; "Submission Content 2" Y records; Department B: ...; Department C: ...;

The attached table needs to be organized in the following format. The plan is to assemble a list with the data needed for each row and then write it into the table row by row:

First-level indicator | Second-level indicator | Third-level indicator | Fourth-level indicator | Submission by each department | Remarks
lalala | hahaha | balabala | (if empty, reuse the parent level) | Department A: Submission content 1 | Records exist but were not uploaded; the system crashed
aaa | bbb | ccc | ddd | Department A: Submission content 2 | Uploaded, good report
 |  |  |  | Department B: Submission content 1 |

The basic process is similar. After reading the sheet, first group by date, so that each group holds the data of one or more departments for one day. Then build the table rows needed for that day's attachment, assemble the descriptive paragraph, and finally output one Word document per date.

def what_to_word(filepath):
    df = pd.read_excel(filepath, sheet_name="有")
    df.fillna('', inplace=True)  # Replace NaN values with empty strings
    dates = []  # List of dates
    df_total = []  # One DataFrame per date
    list_total = []  # Table data needed for every Word document
    for d in df.groupby('日期'):
        dates.append(d[0])
        df_total.append(d[1])
    for index,date in enumerate(dates):
        list_oneday = []  # Table data needed for one Word document
        for row in range(len(df_total[index])):
            list_row = get_table_data(df_total, index, row)  # One row of data
            list_oneday.append(list_row)
        list_total.append(list_oneday)
    for index, date in enumerate(dates):
        filename = wordname+str(date)+").docx"  # Output Word file name
        title = "("+str(date)[:4]+"."+str(date)[4:6]+"."+str(date)[6:8]+")"  # Subtitle date, XXXX.XX.XX
        word = str(date)[:4]+"年"+str(date)[4:6]+"月"+str(date)[6:8]+"日"  # Opening and sign-off date, XXXX年XX月XX日
        sentence = get_sentence(df_total, index)  # Descriptive paragraph for that day
        what_doc(title, word, sentence, list_total[index], filename)  # Pass in the required content and output the document
        print(f"File: {filename} saved")

Let us take a look at how to organize tables, organize paragraphs, and output documents.

1. Organize the table

Get one row of data from the Excel sheet (note: df_total[df_index] is a DataFrame, and its values attribute is a two-dimensional NumPy array), tidy up the indicator levels, each department's submission, and the remarks, and return a list.

def get_table_data(df_total, df_index, table_row):
    list1 = df_total[df_index].values[table_row]  # One row of the Excel sheet
    list2 = list1[3:7]  # First- to fourth-level indicators
    for i in range(len(list2)):  # If the current indicator is marked '空' (empty), reuse the parent indicator
        if list2[i] == '空' and i != 0:
            list2[i] = list2[i - 1]
    content = list1[2] + ":\n" + list1[-4]  # Submission content
    if '否' in list1[-2]:  # Remarks
        remark = '有记录未上传,' + str(list1[-1])
    else:
        remark = '已上传'
    list3 = list2.tolist()  # Table data for the Word document, converted from a NumPy array to a list
    list3.append(str(content))
    list3.append(str(remark))
    return list3
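
The "reuse the parent indicator" rule is easy to try out on a plain list. In the sketch below the function name and the `empty_marker` parameter are made up for illustration; pass whatever value marks an empty cell in your own data (the listing above compares against '空').

```python
def fill_from_parent(levels, empty_marker=""):
    """Return a copy of `levels` where each empty entry takes the value to its left.

    The first entry has no parent, so it is left as it is.
    """
    filled = list(levels)  # work on a copy so the caller's list is untouched
    for i in range(1, len(filled)):
        if filled[i] == empty_marker:
            filled[i] = filled[i - 1]
    return filled


print(fill_from_parent(["aaa", "bbb", "", ""]))  # ['aaa', 'bbb', 'bbb', 'bbb']
```

Because the loop runs left to right, a run of empty cells all inherit the nearest filled level.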

2. Organize paragraphs

Count the unique values in the [Reporting Department] column of the day's data to find out how many departments (N) submitted data. Group by department to collect the relevant information, combine it into the format [(报送内容,记录数,是否上报,备注)] (submission content, record count, reported or not, remarks), and then assemble a description string of the form "N departments submitted data: Department X: "Submission content XXX" X records; ...".

def get_sentence(df_total, df_index):
    df_oneday = df_total[df_index]
    num = df_oneday['填报部门'].nunique()  # Number of departments
    group = []  # Department names
    detail = []  # Data of each department; every element is a tuple (, , , )
    info = ''  # Description of the submissions
    for item in df_oneday.groupby('填报部门'):
        group.append(item[0])
        detail.append(
            list(
                zip(
                    list(item[1]['报送内容']),
                    list(item[1]['记录数']),
                    list(item[1]['是否上报']),
                    list(item[1]['备注'])
                )
            )
        )
    for index, g in enumerate(group):  # Assemble each department's submissions
        mes = str(g)+':'  # Department prefix
        for i in range(len(detail[index])):
            _mes = detail[index][i]
            if int(_mes[1])>0:
                mes = mes + f'“{_mes[0]}”{_mes[1]}条记录;'
        info = info + mes
    info = info[:-1]+"。"  # Replace the final semicolon with a full stop
    sentence = f"有{num}个部门报送了数据:{info}"
    return sentence
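
The same sentence can be assembled with str.join, which avoids having to trim the trailing semicolon afterwards. This is a simplified sketch, not the code above: it takes a plain dictionary from department name to (content, record count) pairs instead of a DataFrame, and `build_sentence` is a made-up name.

```python
def build_sentence(details):
    """Build the description from {department: [(content, record_count), ...]}."""
    parts = []
    for dept, items in details.items():
        # Only entries with at least one record are mentioned
        items_text = ";".join(
            f"“{content}”{count}条记录" for content, count in items if int(count) > 0
        )
        parts.append(f"{dept}:{items_text}")
    return f"有{len(details)}个部门报送了数据:" + ";".join(parts) + "。"
```

One difference to be aware of: a department whose entries all have zero records still appears as a bare "name:" here, so filter such departments out first if that is not wanted.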

3. Output document

(Warning: patience required!) Adjusting text and table styles in Word is tedious and has to be done step by step. The preset table headers are as follows:

table_title = ['First-level indicators','Second-level indicators','Third-level indicators','Fourth-level indicators','Submission status of each department','Remarks']

See code comments for other details.

def what_doc(title, word, sentence, table, filename):  # Takes the subtitle date, opening/sign-off date, paragraph, table data, and file name
    doc = Document()
    section = doc.sections[0]
    new_width, new_height = section.page_height, section.page_width
    section.orientation = WD_ORIENTATION.LANDSCAPE
    section.page_width = new_width
    section.page_height = new_height
    # Global paragraph settings
    doc.styles['Normal'].font.name = u'宋体'  # Font
    doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')  # Chinese fonts need this extra setting
    doc.styles['Normal'].font.size = Pt(14)  # Font size: Chinese size 四号 corresponds to 14 pt
    t1 = doc.add_paragraph()  # Main title
    t1.paragraph_format.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER  # Center
    _t1 = t1.add_run(all_title)
    _t1.bold = True
    _t1.font.size = Pt(22)
    t2 = doc.add_paragraph()  # Subtitle
    t2.paragraph_format.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER  # Center
    _t2 = t2.add_run(title + "\n")
    _t2.bold = True
    doc.add_paragraph(word + sentence +"\n\n").paragraph_format.first_line_indent = Inches(0.35)  # First-line indent
    doc.add_paragraph(word).paragraph_format.alignment = WD_PARAGRAPH_ALIGNMENT.RIGHT  # Right-align
    doc.add_paragraph("各部门具体报送情况见附件:")

    doc.add_page_break()  # Page break ------------------------------------------------------
    fujian = doc.add_paragraph().add_run("\n附件")
    fujian.bold = True
    fujian.font.size = Pt(16)
    t3 = doc.add_paragraph()  # Attachment title
    t3.paragraph_format.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER  # Center
    _t3 = t3.add_run("XX公司业务数据表")
    _t3.bold = True
    _t3.font.size = Pt(22)

    rows = len(table)+1
    word_table = doc.add_table(rows=rows, cols=6, style='Table Grid')  # Create a table with `rows` rows and 6 columns; the 'Table Grid' style draws the borders
    word_table.autofit=True  # Let Word adjust the column widths automatically
    table = [table_title] + table  # Fixed header + table data
    for row in range(rows):  # Write into the table
        cells = word_table.rows[row].cells
        for col in range(6):
            cells[col].text = str(table[row][col])
    for i in range(len(word_table.rows)):    # Walk through rows and columns, styling cell by cell
        for j in range(len(word_table.columns)):
            for par in word_table.cell(i, j).paragraphs:  # Change the font size
                for run in par.runs:
                    run.font.size = Pt(10.5)
            for par in word_table.cell(0, j).paragraphs:  # Make the first row bold
                for run in par.runs:
                    run.bold = True
    doc.save(dir+filename)  # dir is the output folder string and must be defined beforehand (not shown here)

Done! The 74 daily reports for days with records have been written as well, making 178 in total.

After a flurry of activity as fierce as a tiger, the daily reports were finally generated in batch. Time to add a chicken leg to lunch...
Origin blog.csdn.net/zohan134/article/details/108680844