Preprocessing for model fine-tuning

1. Preparation of resume text annotation data

Goal: Convert the original dataset to the text/document extraction annotation format supported by PaddleNLP, in preparation for subsequent model fine-tuning.

Tool: Label Studio

Reference manual:

applications/information_extraction/label_studio_text.md · PaddlePaddle/PaddleNLP - Gitee.com
https://gitee.com/paddlepaddle/PaddleNLP/blob/develop/applications/information_extraction/label_studio_text.md

Label Studio is an open-source data labeling tool for creating, managing, and maintaining various types of machine learning datasets. It provides a web-based user interface that lets users easily create custom labeling tasks and invite other users to collaborate on the labeling work. Label Studio supports a wide variety of data types and label types, including text, images, audio, video, and other custom data types.

For environment configuration, see:

Resume information extraction (3): UIE format conversion and fine-tuning training for text extraction - PaddlePaddle AI Studio (baidu.com)
https://aistudio.baidu.com/aistudio/projectdetail/5418951?channelType=0&channel=0

Dataset preparation:

1. Data import

In NER tasks, Label Studio only supports importing txt documents, but our resumes are all Word files. The goal is therefore to save the content of the multiple Word resumes into a single txt file, with each line holding the content of one resume.

First, let's look at how to parse multiple resume files:

You can use the glob module to get all the filenames in a specified directory that match a given pattern, then iterate through these files in a for loop and perform the same parsing operation on each one. Here is a sample code that does this:

import glob
import os

# Get the names of all Word files in the specified directory
dir_path = 'path/to/directory'
file_pattern = '*.docx'  # or '*.doc', depending on your file type
file_names = glob.glob(os.path.join(dir_path, file_pattern))

# Loop over each Word file and apply the same parsing operation
for file_name in file_names:
    # Parse the file here, e.g. with the python-docx library
    pass
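As a concrete example of that parsing step, here is a minimal sketch using the python-docx library mentioned in the comment (read_docx_text is a hypothetical helper name; python-docx must be installed):

from docx import Document

def read_docx_text(path):
    # Concatenate the text of every paragraph in a .docx file
    doc = Document(path)
    return "".join(p.text for p in doc.paragraphs)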

Rewrite the content extraction code based on this skeleton (the ie callable used on the last line is defined in the sketch after the code block):

import glob
import os
from zipfile import ZipFile

from bs4 import BeautifulSoup

dir_path = '../ResumeFiles/resume'   # relative path of the folder that stores the resume files
file_pattern = '*.docx'  # or '*.doc', depending on your file type
file_names = glob.glob(os.path.join(dir_path, file_pattern))

for file_name in file_names:
    print('---------------------"Resume {}"------------------------'.format(file_name))
    # A .docx file is a zip archive; the body text lives in word/document.xml
    document = ZipFile(file_name)
    xml = document.read("word/document.xml")
    wordObj = BeautifulSoup(xml.decode("utf-8"), "html.parser")
    texts = wordObj.findAll("w:t")  # <w:t> elements hold the visible text runs
    paragraphs_text = ""
    for text in texts:
        paragraphs_text += text.text
    print(ie(paragraphs_text))  # ie() performs information extraction on the text and returns the result
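For reference, ie is a PaddleNLP Taskflow instance; judging from the log line in the output below, it loads the default uie-base model. A minimal sketch, with the schema fields assumed from the printed results:

from paddlenlp import Taskflow

# Schema assumed from the printed output: name, date of birth, telephone
schema = ['姓名', '出生日期', '电话']
ie = Taskflow('information_extraction', schema=schema)  # loads uie-base by default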

Print result:

[2023-05-11 17:54:30,441] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'C:\Users\ysy20010615\.paddlenlp\taskflow\information_extraction\uie-base'.
---------------------"Resume ../ResumeFiles/resume\1.docx"------------------------
[{'Name': [{'text': 'Zhang Jiwei', 'start': 0, 'end': 3, 'probability': 0.9939343973137511}], 'Date of Birth': [{'text': '1998.11', 'start': 51, 'end': 58, 'probability': 0.9982788171921086}]}]
---------------------"Resume ../ResumeFiles/resume\2.docx"------------------------
[{'Name': [{'text': 'Lin Guorui', 'start': 402, 'end': 405, 'probability': 0.9860693449614004}], 'Date of Birth': [{'text': '1990.2.8', 'start': 1101, 'end': 1109, 'probability': 0.9969562203449982}], 'Phone': [{'text': '138 3108 8888', 'start': 985, 'end': 998, 'probability': 0.40060286240933607}]}]
---------------------"Resume ../ResumeFiles/resume\3.docx"------------------------
[{'Name': [{'text': 'Lin Wenshu', 'start': 0, 'end': 3, 'probability': 0.9931329291108639}], 'Date of Birth': [{'text': '1996.05', 'start': 18, 'end': 25, 'probability': 0.9974436452829565}], 'Phone': [{'text': '13801138023', 'start': 43, 'end': 54, 'probability': 0.9825973998539936}]}]
---------------------"Resume ../ResumeFiles/resume\4.docx"------------------------
[{'Name': [{'text': 'Lin Yanan', 'start': 0, 'end': 3, 'probability': 0.9905885262895744}, {'text': 'Yimiao', 'start': 2026, 'end': 2028, 'probability': 0.8292328175251846}, {'text': 'Luo Jialiang', 'start': 2889, 'end': 2892, 'probability': 0.7978081125050949}, {'text': 'Luo Jialiang', 'start': 3220, 'end': 3223, 'probability': 0.9071803656674007}, {'text': 'Lu Liangwei', 'start': 3740, 'end': 3743, 'probability': 0.7161086518860031}, {'text': 'Luo Jialiang', 'start': 3818, 'end': 3821, 'probability': 0.42471927022469913}, {'text': 'Luo Jialiang', 'start': 4134, 'end': 4137, 'probability': 0.303645922465833}, {'text': 'Lu Liangwei', 'start': 4056, 'end': 4059, 'probability': 0.9645008868973619}], 'Date of Birth': [{'text': '1996.05', 'start': 41, 'end': 48, 'probability': 0.9984617729245855}], 'Phone': [{'text': '13801138823', 'start': 51, 'end': 62, 'probability': 0.9902135906254692}]}]
---------------------"Resume ../ResumeFiles/resume\5.docx"------------------------
[{'Name': [{'text': 'Jiang Yiyun', 'start': 112, 'end': 115, 'probability': 0.9881768328965066}, {'text': 'Luo Jialiang', 'start': 2947, 'end': 2950, 'probability': 0.7255947350980705}, {'text': 'Luo Jialiang', 'start': 3272, 'end': 3275, 'probability': 0.887640133829251}], 'Date of Birth': [{'text': '1996.05', 'start': 3, 'end': 10, 'probability': 0.9960509002566198}], 'Phone': [{'text': '13812138123', 'start': 25, 'end': 36, 'probability': 0.9519620354312615}]}]

Process finished with exit code 0

The above code can extract and print out the content of multiple files.

Next, let's store the text in a txt file, writing each resume on its own line:

    # At the end of each loop iteration, append this resume's text as one line
    with open('../ResumeFiles/resume/sample.txt', 'a', encoding='utf-8') as f:
        paragraphs_text = paragraphs_text + "\n"
        f.write(paragraphs_text)

Python's built-in open() function is used to get a file object for sample.txt; '\n' is appended to each resume's paragraphs_text so that every write adds exactly one line to the file.

The output file is opened in 'a' mode, with the encoding set to utf-8.

Note:

Mode a means append; mode w means write. If the file were opened in w mode, each write would overwrite the previous contents, and in the end the file would hold only the last resume.
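A quick illustration of the difference, using a throwaway demo.txt (a hypothetical file name for this example):

# With 'w', every open() truncates the file, so only the last write survives
with open('demo.txt', 'w', encoding='utf-8') as f:
    f.write('resume 1\n')
with open('demo.txt', 'w', encoding='utf-8') as f:
    f.write('resume 2\n')
# demo.txt now contains only "resume 2"; with 'a' it would contain both lines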

It is also worth noting that opening the file with the with statement ensures it is closed automatically once the file operation completes, avoiding resource leaks and related errors.
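The with statement is roughly equivalent to this explicit try/finally pattern (shown only for illustration):

f = open('../ResumeFiles/resume/sample.txt', 'a', encoding='utf-8')
try:
    f.write(paragraphs_text)
finally:
    f.close()  # with open(...) calls close() automatically, even if an exception occurs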

A side note:

While writing the code, I mistakenly typed paragraphs_text = paragraphs_text + "\n" as:

  paragraphs_text += paragraphs_text + "\n"

The result: the program raised no error and produced no output, but memory filled up, the computer fan spun loudly, and ordinary operations on the desktop began to freeze.

After force-killing the process, this error was reported:

Traceback (most recent call last):
File "D:\pycharm\5.4\recognition_model\extraction.py", line 25, in <module>
paragraphs_text += text.text
MemoryError

A MemoryError: Python tried to use more memory than the system had available. The buggy statement appends paragraphs_text to itself, so the string roughly doubles in length every time the line executes; this exponential growth quickly exhausts memory on any sizeable input.
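The exponential growth is easy to reproduce in isolation (a toy snippet, not from the original program):

paragraphs_text = "x"
for i in range(5):
    paragraphs_text += paragraphs_text + "\n"  # the buggy statement: length -> 2 * length + 1
    print(i, len(paragraphs_text))
# Prints lengths 3, 7, 15, 31, 63 -- the string roughly doubles on every pass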

After correcting the error and rerunning the program, a sample.txt file is generated under ../ResumeFiles/resume:

File contents: [screenshot of sample.txt omitted]

The goal of converting multiple Word files into a single txt file has been achieved, with one line per resume.
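Putting the pieces together, the whole preprocessing step looks roughly like this (a consolidated sketch under the same assumptions as above; the UIE extraction call is left out since only the txt merge is needed here):

import glob
import os
from zipfile import ZipFile

from bs4 import BeautifulSoup

dir_path = '../ResumeFiles/resume'
file_names = glob.glob(os.path.join(dir_path, '*.docx'))

for file_name in file_names:
    # A .docx file is a zip archive; the body text lives in word/document.xml
    xml = ZipFile(file_name).read("word/document.xml")
    wordObj = BeautifulSoup(xml.decode("utf-8"), "html.parser")
    # Join every <w:t> text run into one string per resume
    paragraphs_text = "".join(t.text for t in wordObj.findAll("w:t"))
    # Append this resume as a single line of sample.txt
    with open(os.path.join(dir_path, 'sample.txt'), 'a', encoding='utf-8') as f:
        f.write(paragraphs_text + "\n")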

2023.5.11

To be continued...

 
