Use Taskflow to complete resume information extraction

Steps to build the model:

The first step is data processing: the text has to be extracted from the resume files.

Before that, the data set must be prepared. Most of the files provided by the competition are Word documents in .docx format.

A .docx file is XML-based and can contain text, objects, styles, formatting, and images. These are stored as separate files that are compressed into a single ZIP archive with the .docx extension. For example, opening a .docx file with a ZIP tool shows several folders, and the files inside them are in XML format.

Opening the document.xml file shows that the body of the document is expressed in XML.
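The archive layout can also be checked from Python instead of a ZIP tool. The following is a minimal sketch that uses only the standard-library zipfile module. To stay self-contained it builds a tiny stand-in archive in memory (the XML content is a placeholder, not a complete Word document), and the helper name list_docx_parts is hypothetical. With a real resume, the path to the .docx file would be passed instead of the buffer.

```python
import io
from zipfile import ZipFile


def list_docx_parts(docx_file):
    """Return the names of all files stored inside a .docx (ZIP) archive."""
    with ZipFile(docx_file) as archive:
        return archive.namelist()


# Build a tiny stand-in archive in memory (placeholder content, not a full Word file).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document/>")
    archive.writestr("word/styles.xml", "<styles/>")
buffer.seek(0)

parts = list_docx_parts(buffer)
print(parts)
```

In a real .docx file, word/document.xml is the part that holds the body text, which is why the extraction code reads that entry.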

Extract Word document information

The project uses BeautifulSoup to extract the text of the Word document. Source code:

from zipfile import ZipFile
from pprint import pprint

from bs4 import BeautifulSoup
from paddlenlp import Taskflow

# Define the schema for entity extraction: name, date of birth, phone number
schema = ['姓名', '出生日期', '电话']
ie = Taskflow('information_extraction', schema=schema)

# Read the main document part from the .docx (ZIP) archive
with ZipFile('D:\\DeskTop\\dataset_CV\\dataset_CV\\CV\\3.docx') as document:
    xml = document.read("word/document.xml")

# Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
wordObj = BeautifulSoup(xml.decode("utf-8"), "html.parser")
# Each w:t element holds one run of text
texts = wordObj.find_all("w:t")
paragraphs_text = ""
for text in texts:
    paragraphs_text += text.text + "\n"
pprint(ie(paragraphs_text))

Taskflow is the out-of-the-box task interface of PaddleNLP. Its information_extraction task provides general information extraction for text and documents, as well as opinion extraction for reviews. It can extract various types of information, including but not limited to named entities (such as person names, place names, and organization names), relations (such as the director of a movie or the release date of a song), events (such as a car accident at a certain intersection or an earthquake in a certain place), and review information such as evaluation dimensions, opinion words, and sentiment. Users describe the extraction target in natural language, and the corresponding information is extracted from the input text or document without any training.

The code defines a schema variable with three fields: name (姓名), date of birth (出生日期), and phone number (电话). This schema is then used to create the information extraction Taskflow object ie.
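The result returned by ie can be reduced to one value per schema field. The sketch below assumes the result has the list-of-dicts shape described in the PaddleNLP information extraction documentation: one dict per input text, mapping each field name to a list of candidates with "text", "start", "end", and "probability" keys. The helper name best_values is hypothetical, and the sample is hand-written with made-up values so that the sketch runs without loading the model; it is not real model output.

```python
def best_values(results, schema):
    """Pick the highest-probability text for each schema field.

    Assumes `results` is a list with one dict per input text, where each dict
    maps a field name to a list of {"text", "start", "end", "probability"} dicts.
    Fields with no candidate are mapped to None.
    """
    record = results[0] if results else {}
    best = {}
    for field in schema:
        candidates = record.get(field, [])
        if candidates:
            best[field] = max(candidates, key=lambda c: c["probability"])["text"]
        else:
            best[field] = None
    return best


# Hand-written sample in the assumed output shape (made-up values, not model output).
sample = [{
    "姓名": [{"text": "张三", "start": 0, "end": 2, "probability": 0.98}],
    "电话": [
        {"text": "13800000000", "start": 10, "end": 21, "probability": 0.91},
        {"text": "010-0000", "start": 30, "end": 38, "probability": 0.40},
    ],
}]
schema = ["姓名", "出生日期", "电话"]

print(best_values(sample, schema))
```

With the real pipeline, best_values(ie(paragraphs_text), schema) would be the corresponding call. Keeping None for missing fields makes it easy to see which items the model did not find in a given resume.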

The ZipFile class from the zipfile module in the Python standard library opens the Word document and reads the "word/document.xml" file inside it. BeautifulSoup then parses the XML text into an object, wordObj. Finally, the code finds all "w:t" tags in wordObj and appends their text, one run per line, to the string variable paragraphs_text.
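Joining every "w:t" run with a newline can split a single paragraph across several lines, because Word often stores one paragraph as multiple runs. As an alternative sketch (not the method used in this project), the text can be grouped by "w:p" paragraph with the standard-library xml.etree.ElementTree module. The helper name docx_paragraphs is hypothetical, and the sample document below is a simplified stand-in built in memory, not a complete Word file.

```python
import io
import xml.etree.ElementTree as ET
from zipfile import ZipFile

# WordprocessingML main namespace, bound to the w: prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(docx_file):
    """Return one string per non-empty w:p paragraph in word/document.xml."""
    with ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # Concatenate all text runs (w:t) that belong to this paragraph.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


# Simplified stand-in document: two paragraphs, the first split into two runs.
sample_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:r><w:t>Name: </w:t></w:r><w:r><w:t>Zhang San</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Phone: 13800000000</w:t></w:r></w:p>'
    '</w:body></w:document>'
)
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", sample_xml)
buffer.seek(0)

paragraphs = docx_paragraphs(buffer)
print("\n".join(paragraphs))
```

The joined string can be passed to ie in the same way as paragraphs_text. Whether paragraph-level grouping improves extraction on the competition data is not something this sketch verifies.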



Origin blog.csdn.net/qq_53162179/article/details/130650090