[Entertainment Star Knowledge Map 2] Information Extraction

Table of contents

1. Project introduction

2. Introduction to information extraction

3. ChatGPT information extraction code practice

4. Main logic of information extraction

5. Project source code


1. Project introduction

This project takes the large amount of encyclopedia data crawled in the previous crawler project and extracts structured key information from it:

[Entertainment Star Knowledge Map 1] Encyclopedia Crawler (Encarta1993's Blog, CSDN): https://blog.csdn.net/u014147522/article/details/131160490

In this project we focus on six information points:

  • Name

  • Gender

  • Birthday

  • Birthplace

  • Graduation school

  • Main works
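The target output for each person can be sketched as a JSON template. The English key names here are illustrative assumptions; the prompt used later asks the model for the corresponding Chinese field names.

```python
# Illustrative target schema: an empty string for a missing field, and a
# JSON array when a field has multiple values (e.g. several main works).
TARGET_TEMPLATE = {
    "Name": "",
    "Gender": "",
    "Birthday": "",
    "Birthplace": "",
    "Graduation School": "",
    "Main Works": [],
}

print(list(TARGET_TEMPLATE))
```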

2. Introduction to information extraction

Information extraction tasks refer to identifying and extracting specific types of information from text. This information can be entities (such as names, locations, organizations, etc.), relationships (such as associations between people, attributes of items, etc.) or events (such as time, action, status, etc.). Information extraction tasks usually include the following steps:

1. Entity recognition: Identify entities in text, such as names, locations, organizations, etc.

2. Relationship recognition: Identify relationships between entities, such as associations between characters, item attributes, etc.

3. Event recognition: Identify events described in text, such as time, action, status, etc.

4. Information extraction: Extract the required information from the text, such as a company’s headquarters location, a person’s contact information, etc.
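As a toy illustration of step 1 and step 4, the sketch below runs two regex patterns over a single sentence. The patterns are made-up stand-ins for a real NER model, and the example sentence is invented for demonstration.

```python
import re

# Example sentence (invented for demonstration)
text = "Huang Xiaoming was born on November 13, 1977 in Qingdao, Shandong."

# Step 1: entity recognition -- a date pattern and adjacent capitalized words
date = re.search(r"[A-Z][a-z]+ \d{1,2}, \d{4}", text)
names = re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+(?: [A-Z][a-z]+)?", text)

# Step 4: information extraction -- assemble the fields we care about
record = {
    "name": names[0] if names else "",
    "birthday": date.group(0) if date else "",
}
print(record)
```

Real systems replace the regexes with a trained model (or, as below, an LLM), but the overall shape of the output stays the same: structured fields pulled out of free text.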

Since 2023 is widely regarded as the breakout year of large models, with LLMs now handling virtually every NLP task in a unified way, this project uses ChatGPT for information extraction.

  • Large language models

A large language model (LLM) is a pre-trained deep learning model that can be used for a variety of natural language processing tasks, including information extraction. Information extraction is the process of extracting structured information from unstructured text. LLM can improve its performance in information extraction tasks by learning large amounts of text data. LLM is usually pre-trained using self-supervised learning, which means it can learn from unlabeled data without the need for human-labeled data. LLM can be fine-tuned in various ways to adapt to different information extraction tasks.

  • ChatGPT

ChatGPT is an artificial intelligence chatbot developed by OpenAI. It uses large-scale language models based on GPT-3.5 and GPT-4 and is able to understand and learn human language and conduct natural conversations and interactions. ChatGPT can not only chat, but also complete various tasks, such as writing emails, video scripts, copywriting, translation, coding, papers, etc.

3. ChatGPT information extraction code practice

Interaction with large models is driven mainly by prompts.

A prompt is a text fragment used to guide a large language model to generate natural language text. When using a large language model, we provide a prompt to steer the model toward the output we expect. A prompt can be a word, a sentence, a paragraph, or a complete document.

import openai

from utils import get_api_key  # project helper that reads the OpenAI API key


openai.api_key = get_api_key()


def call_gpt(context):
    # Chinese instruction appended after the context; it asks the model to
    # find or infer the six fields (name, gender, birthday, birthplace,
    # graduation school, main works), use an empty string for missing fields,
    # output JSON, and use a JSON array when a field has multiple values.
    prompt = "\n\n\n根据上文中给定的介绍细节,请仔细找出或推测出这个人的‘姓名、性别、生日、出生地、毕业学校、主要作品’这6个信息点,如果没有则用空字符串代替,并按照json格式输出,如果value有多个则按照jsonarray输出"
    content = context + prompt

    # System role pins the model down as an information-extraction bot;
    # the user message carries the article text plus the instruction.
    messages = [
        {
            'role': 'system',
            'content': '你是一个自动信息抽取专家机器人。'
        },
        {
            'role': 'user',
            'content': content
        }
    ]

    # Legacy openai<1.0 SDK interface
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=messages,
    )

    return response["choices"][0]["message"]["content"]


if __name__ == "__main__":
    context = "黄晓明,1977年11月13日出生于山东省青岛市市南区,中国内地影视男演员、流行乐歌手,毕业于北京电影学院表演系"
    result = call_gpt(context=context)
    print(result)


Executing the above code produces:

{
    "Name": "Huang Xiaoming",
    "Gender": "Male",
    "Birthday": "November 13, 1977",
    "Birthplace": "Shinan District, Qingdao City, Shandong Province",
    "Graduation School": "Beijing Film Academy",
    "Major Works": ""
}

As we can see, the output meets our information extraction requirements well.
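Before using a reply downstream, it is worth parsing and checking it. Below is a minimal validation sketch, assuming the English key names shown in the sample output; in practice the model may answer with Chinese keys, since the prompt is in Chinese.

```python
import json

# The six keys we expect in the model's JSON reply (assumed English names).
EXPECTED_KEYS = {"Name", "Gender", "Birthday", "Birthplace",
                 "Graduation School", "Major Works"}


def validate(reply: str) -> dict:
    """Parse the model reply and verify all six fields are present."""
    record = json.loads(reply)
    missing = EXPECTED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record


reply = ('{"Name": "Huang Xiaoming", "Gender": "Male", '
         '"Birthday": "November 13, 1977", "Birthplace": "Qingdao", '
         '"Graduation School": "Beijing Film Academy", "Major Works": ""}')
record = validate(reply)
```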

4. Main logic of information extraction

This project focuses on extracting information from the star data crawled in the previous project. The following is one of the crawled records:

{
    "title": "黄晓明",
    "url": "https://baike.baidu.com/item/黄晓明/6597",
    "summary": "\n黄晓明,1977年11月13日出生于山东省青岛市,中国内地男演员、歌手,毕业于北京电影学院表演系\n[1-2]  。1998年主演个人首部电视剧《爱情不是游戏》进入演艺圈\n[3] \n。2001年凭借古装剧《大汉天子》获得关注\n[4] \n。自2005年起连续10年入选“福布斯中国名人榜”\n[5] \n。2006年参演古装片《夜宴》\n[378] \n。2007年主演民国剧《新上海滩》\n[440] \n;同年发行个人首张专辑《It's Ming》\n[382] \n 。2009年凭借歌曲《好人卡》获得北京流行音乐典礼年度金曲奖\n[391] \n。2010年凭借谍战片《风声》获得第17届北京大学生电影节最受欢迎男演员奖\n[6] \n。2011年成立黄晓明工作室\n[383] \n。2013年凭借剧情片《中国合伙人》获得中国电影金鸡奖、中国电影华表奖、大众电影百花奖最佳男主角奖\n[7-9]   。2015年成为首位在好莱坞中国剧院留下手印的中国内地男演员\n[10] \n。2016年凭借史诗片《大唐玄奘》获得第13届中国长春电影节最佳男主角奖\n[11] \n。2017年主演古装剧《琅琊榜之风起长林》\n[12] \n。2018年主演爱情片《无问西东》上映\n[13] \n。2019年凭借剧情片《烈火英雄》该片获得第35届大众电影百花奖最佳男主角奖、第33届中国电影金鸡奖最佳男主角奖\n[15-16]  ;同年担任第32届中国电影金鸡奖评委\n[17] \n。2020年主演民国剧《鬓边不是海棠红》\n[380] \n。2021年主演年代剧《光荣与梦想》播出\n[377] \n。演艺事业外,他还热心于公益慈善\n[390] \n。2008年担任中国儿童少年基金会形象大使。2009年担任联合国儿童基金香港委员会儿童基金会爱心大使\n[18] \n。2014年当选山东省十大杰出青年\n[19] \n,同年成立“黄晓明明天爱心基金”。2016年担任中国保护大熊猫研究中心形象大使\n[20] \n。\n",
    "basic-info": "\n\n中文名\n\n黄晓明\n\n外文名\n\nHuang Xiaoming\n\n别    名\n\n教主、猫、钢钉侠、熊猫明\n[376] \n、囧明\n\n国    籍\n\n中国\n\n民    族\n\n汉族\n\n出生地\n\n山东省青岛市市南区\n\n出生日期\n\n1977年11月13日\n\n星    座\n\n天蝎座\n\n血    型\n\nO型\n\n身    高\n\n179 cm\n[21] \n\n毕业院校\n\n北京电影学院\n\n职    业\n\n演员、歌手\n[22] \n\n经纪公司\n\n黄晓明工作室\n\n代表作品\n\n中国合伙人、风声、烈火英雄、无问西东、大唐玄奘、大上海、撒娇女人最好命、大汉天子、神雕侠侣、新上海滩、暗香、精忠岳飞、鬓边不是海棠红、匹夫、锦绣缘华丽冒险、琅琊榜之风起长林、赵氏孤儿、鹿鼎记、玫瑰之战、暗恋、什么都可以、缘、精忠传奇、就算没有明天\n\n\n\n主要成就\n\n第29届中国电影金鸡奖最佳男主角奖\n第32届大众电影百花奖最佳男主角奖\n第15届中国电影华表奖优秀男演员奖\n第32届中国电影金鸡奖评委\n第13届中国长春电影节最佳男主角奖\n\n展开\n\n\n\n主要成就\n\n第29届中国电影金鸡奖最佳男主角奖\n第32届大众电影百花奖最佳男主角奖\n第15届中国电影华表奖优秀男演员奖\n第32届中国电影金鸡奖评委\n第13届中国长春电影节最佳男主角奖\n\n第17届北京大学生电影节最受欢迎男演员\n第10届华语电影传媒大奖最具人气男演员\n第11届华语电影传媒大奖最受瞩目男演员\n山东省十大杰出青年称号\n[23] \n联合国艾滋病规划署中国亲善大使\n[24] \n中国电影家协会青年和新文艺群体工作委员会会长\n[25] \n第12届中国长春电影节最佳男主角奖\n第35届大众电影百花奖最佳男主角奖\n第33届中国电影金鸡奖最佳男主角奖\n[26] \n\n收起\n\n\n\n\n\n公益基金\n\n黄晓明明天爱心基金\n\n生    肖\n\n蛇\n\n影友会\n\n明教\n\n性    别\n\n男\n\n\n"
}

We run the extraction over each crawled record in turn:

import json
import random
import time

from tqdm import tqdm

from extractor import call_gpt  # the ChatGPT extraction function defined above


def main():
    # person.jsonl: one crawled encyclopedia record per line (previous project)
    with open("data/person.jsonl", "r", encoding="utf-8") as f:
        data = [i.strip() for i in f.readlines() if i.strip()]

    with open("data/result.jsonl", "w", encoding="utf-8") as f:
        for line in tqdm(data):
            record = json.loads(line)
            # Concatenate title, summary and infobox as the model context
            query = record["title"] + "\n\n\n" + record["summary"] + "\n\n\n" + record["basic-info"] + "\n\n\n"
            url = record["url"]
            try:
                res = call_gpt(query)
                # One line per record: extracted JSON, a tab, then the source URL
                f.write(json.dumps(json.loads(res), ensure_ascii=False) + "\t" + url + "\n")
            except KeyboardInterrupt:
                break
            except Exception:
                # API errors (rate limits, timeouts, unparsable replies):
                # back off for two minutes, then move on to the next record
                print("error")
                time.sleep(120)
                continue

            # Random short pause between requests to stay under rate limits
            time.sleep(random.random() * 3)


if __name__ == "__main__":
    main()

Here, person.jsonl comes from the previous crawler project. After running this code you get result.jsonl, which contains the information extraction results.
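Each line of result.jsonl is the extracted JSON object, a tab character, and then the source URL. A small sketch of reading that format back:

```python
import json


def parse_line(line: str):
    """Split a result.jsonl line into (extracted record, source URL)."""
    payload, url = line.rstrip("\n").rsplit("\t", 1)
    return json.loads(payload), url


record, url = parse_line('{"Name": "黄晓明"}\thttps://baike.baidu.com/item/黄晓明/6597\n')
```

Splitting from the right on the last tab keeps the parse robust, since the URL contains no tab and `json.dumps` escapes any tab inside the JSON payload.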

5. Project source code

https://gitee.com/hl0929/baike-extractor

Origin blog.csdn.net/u014147522/article/details/132066686