Write a few lines of Python to finish a day's workload in one minute, and my colleague said: Nice one!

A few days ago, a reader said he had to organize thousands of documents and was going bald over it, and asked whether Python could solve the problem. Let's take a look, and you can think about it too.

The actual content has been anonymized to protect the privacy of the files.

The situation is roughly this: one folder holds multiple meeting notices (this article uses 7 documents as an example).

Each notice opens in basically the same format.

The task is to extract four key pieces of information from each meeting document (study time, study content, study form, and moderator) and organize them into an Excel table.

In the reader's real case, nearly 1,000 meeting notices had accumulated over four years (that many meetings in four years is impressive in itself...), and opening the files one by one to record them in Excel is far too much work.

Isn't this kind of repetitive, boring job exactly the sort of task to hand over to Python? I can't have my readers not knowing how to do this!

Let's take a look at how to solve this problem with Python, which will mainly involve:

  • openpyxl: writing Excel files

  • python-docx: reading Word files

  • glob: getting file paths in batches

To keep things simple, this article works with seven notice files, named 会议通知1.docx, 会议通知2.docx, ..., 会议通知7.docx (会议通知 means "meeting notice"), stored in a Notice folder. The Excel template that receives the output is named Meeting_temp.xlsx.

Basic logic

Before writing any code, it helps to break the complete problem into a few small steps. From the requirements, the code can be roughly divided into the following steps:

  • Get all the meeting notice files in the Notice folder;

  • Parse each Word file, extract the four required pieces of information, and write them to Excel;

  • Save the Excel file.

With the logic laid out, the code follows naturally. Step 1 can be done with the glob library, and the next two steps combine the python-docx library for Word with the openpyxl library for Excel.

Both of these libraries have been covered in earlier articles. If you are not familiar with them, read those first!
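Since glob has not been shown yet, here is a minimal sketch of step 1 on its own. The helper name list_notices and the demo file names are made up for illustration. Skipping names that start with "~$" is an extra precaution (Word leaves such lock files next to a document while it is open) and is not part of the original code.

```python
import glob
import os
import tempfile

def list_notices(folder):
    """Return the paths of all .docx files directly inside `folder`, sorted by name."""
    pattern = os.path.join(folder, '*.docx')
    # Skip "~$..." lock files that Word creates while a document is open
    return sorted(p for p in glob.glob(pattern)
                  if not os.path.basename(p).startswith('~$'))

# Demo in a throwaway folder; the files are hypothetical and created empty
with tempfile.TemporaryDirectory() as tmp:
    for name in ['notice1.docx', 'notice2.docx', '~$notice1.docx', 'readme.txt']:
        open(os.path.join(tmp, name), 'w').close()
    found = [os.path.basename(p) for p in list_notices(tmp)]
    print(found)
```

Only the two real .docx files come back; the text file and the lock file are ignored.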

Code

First import the required libraries:

from docx import Document
from openpyxl import load_workbook
import glob

Read the template Excel into the program:

path = r'C:\Users\xxx'  # location of the Notice folder and the Excel template; change to suit your setup
workbook = load_workbook(path + r'\Meeting_temp.xlsx')
sheet = workbook.active

It is a good idea to get a single file working before writing any batch code, so let's first parse 会议通知1.docx and make sure the result is correct. Since the document's structure and the location of the key information are not yet clear, start by printing the Word file one Paragraph at a time and looking at the output:

wordfile = Document(path + r'\Notice\会议通知1.docx')
for paragraph in wordfile.paragraphs:
    print(paragraph.text)  # print the text; printing the Paragraph object itself only shows its repr

The text layout of the document is fairly clean: basically one sentence per paragraph, so the required information can be picked out simply by checking the first few characters of each paragraph:

    for paragraph in wordfile.paragraphs:
        if paragraph.text[0:5] == '学习时间:':  # "study time:"
            study_time = paragraph.text[5:]
        if paragraph.text[0:4] == '主持人:':  # "moderator:"
            host = paragraph.text[4:]
        if paragraph.text[0:5] == '学习形式:':  # "study form:"
            study_type = paragraph.text[5:]
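The same prefix check can be tried without any Word file by feeding it plain strings. The sketch below is only an illustration: the names FIELD_PREFIXES and extract_fields and the sample lines are invented, and str.startswith replaces the fixed-length slices so the prefix lengths do not have to be counted by hand. The colon in each prefix has to match the one actually used in the documents.

```python
# Map each label (as it appears in the notices) to an output field name
FIELD_PREFIXES = {
    '学习时间:': 'study_time',  # "study time:"
    '主持人:': 'host',          # "moderator:"
    '学习形式:': 'study_type',  # "study form:"
}

def extract_fields(lines):
    """Return the text after each known prefix, or '' if the prefix never appears."""
    fields = {name: '' for name in FIELD_PREFIXES.values()}
    for line in lines:
        for prefix, name in FIELD_PREFIXES.items():
            if line.startswith(prefix):
                fields[name] = line[len(prefix):]
    return fields

# Made-up sample standing in for [p.text for p in wordfile.paragraphs]
sample = ['会议通知', '学习时间:2021年1月5日', '主持人:张三', '学习形式:集中学习']
result = extract_fields(sample)
print(result)
```

Starting every field at an empty string also means a notice with a missing line yields a blank cell instead of an undefined variable.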

The study content is a special case. Unlike the other three pieces of information, it does not sit in a single sentence with the keyword as its first few characters.

The four-character heading 学习内容 ("study content") and the actual items are spread across separate paragraphs. A simple strategy works here:

Create an empty list, then go through each paragraph: if its first character is a digit and its second character is the Chinese enumeration comma "、", store the paragraph in the list. Finally, join the list elements into one long string:

    content_lst = []
    for paragraph in wordfile.paragraphs:
        if paragraph.text[0:5] == '学习时间:':
            study_time = paragraph.text[5:]
        if paragraph.text[0:4] == '主持人:':
            host = paragraph.text[4:]
        if paragraph.text[0:5] == '学习形式:':
            study_type = paragraph.text[5:]
        if len(paragraph.text) >= 2:
            if paragraph.text[0].isdigit() and paragraph.text[1] == '、':
                content_lst.append(paragraph.text)
    content = ' '.join(content_lst)
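One caveat with the check above: it only accepts a single digit before "、", so an item numbered "10、" would be skipped because its second character is "0". If the notices can contain ten or more items, a regular expression is one possible variation. The sketch below is not from the original code; ITEM_PATTERN, collect_items and the sample lines are illustrative names and data.

```python
import re

# One or more digits at the start of the line, followed by the enumeration comma
ITEM_PATTERN = re.compile(r'^\d+、')

def collect_items(lines):
    """Join every line that looks like a numbered item into one string."""
    return ' '.join(line for line in lines if ITEM_PATTERN.match(line))

# Made-up sample standing in for the paragraph texts of one notice
sample = ['学习内容:', '1、第一项', '2、第二项', '10、第十项', '学习形式:集中学习']
content = collect_items(sample)
print(content)
```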

After parsing the Word file, the content needs to be written to the Excel file.

In short, combine the values obtained above into a list and write it to the Excel file with the sheet.append(list) method:

number = 0  # global counter, written out as the serial number

wordfile = Document(path + r'\Notice\会议通知1.docx')
content_lst = []
for paragraph in wordfile.paragraphs:
    if paragraph.text[0:5] == '学习时间:':
        study_time = paragraph.text[5:]
    if paragraph.text[0:4] == '主持人:':
        host = paragraph.text[4:]
    if paragraph.text[0:5] == '学习形式:':
        study_type = paragraph.text[5:]
    if len(paragraph.text) >= 2:
        if paragraph.text[0].isdigit() and paragraph.text[1] == '、':
            content_lst.append(paragraph.text)
content = ' '.join(content_lst)
number += 1
sheet.append([number, study_time, content, study_type, host])

Once single-file parsing works, use glob to get every document in the folder and loop over them to parse each one in turn. And of course, remember to save the Excel file at the end.

The complete code is as follows:

from docx import Document
from openpyxl import load_workbook
import glob

path  = r'C:\Users\xxx'
workbook = load_workbook(path + r'\Meeting_temp.xlsx')
sheet = workbook.active
number = 0

for file in glob.glob(path + r'\Notice\*.docx'):
    wordfile = Document(file)
    # Reset per file so a notice with a missing field does not reuse the previous file's value
    study_time = host = study_type = ''
    content_lst = []
    for paragraph in wordfile.paragraphs:
        if paragraph.text[0:5] == '学习时间:':  # "study time:"
            study_time = paragraph.text[5:]
        if paragraph.text[0:4] == '主持人:':  # "moderator:"
            host = paragraph.text[4:]
        if paragraph.text[0:5] == '学习形式:':  # "study form:"
            study_type = paragraph.text[5:]
        if len(paragraph.text) >= 2:
            # Numbered items such as "1、..." make up the study content
            if paragraph.text[0].isdigit() and paragraph.text[1] == '、':
                content_lst.append(paragraph.text)
    content = ' '.join(content_lst)
    number += 1
    sheet.append([number, study_time, content, study_type, host])

workbook.save(path + r'\Meeting_notice.xlsx')

The core is only about 30 lines of code, and it takes only three seconds to run!
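One detail worth keeping in mind for the real batch of nearly 1,000 files: the serial number follows the order in which glob returns the paths, which is not guaranteed, and a plain alphabetical sort would put 会议通知10.docx before 会议通知2.docx. If the rows should follow the numbers in the file names, one option is a "natural" sort key. This is a sketch under that assumption, and natural_key is a hypothetical helper, not part of the original code.

```python
import os
import re

def natural_key(path):
    """Split the file name into text and number chunks so numbers compare numerically."""
    name = os.path.basename(path)
    # re.split with a capturing group alternates: text, digits, text, digits, ...
    chunks = re.split(r'(\d+)', name)
    return [int(c) if i % 2 else c for i, c in enumerate(chunks)]

paths = ['会议通知10.docx', '会议通知2.docx', '会议通知1.docx']
ordered = sorted(paths, key=natural_key)
print(ordered)
```

In the complete code this would mean looping over sorted(glob.glob(...), key=natural_key) instead of the bare glob.glob(...) call.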


Origin blog.csdn.net/pdcfighting/article/details/113285592