Building an intelligent question answering system for movie knowledge with Python + neo4j, based on a knowledge graph

Author:qyan.li

Date:2022.6.3

Topic: An Intelligent Question Answering System for Movie Knowledge Using Python to Construct a Knowledge Graph

Reference: Python creates knowledge graph

1. Preface:

Recently, a course design project required me to research knowledge graphs. While doing so, I found that many students online had built knowledge graphs themselves, so I decided to build one too. After some research, and taking my own technical level into account, I settled on an intelligent question answering system for movie knowledge based on a knowledge graph (mainly because the dataset is easy to construct). Although the system is relatively simple, I learned a lot of new things along the way and gained a deeper understanding of the overall system framework.

2. System preparation:

Before building the intelligent question answering system, some preparatory work is needed, mainly in two areas:

  • Neo4j software installation:

    Building and using the knowledge graph requires the Neo4j graph database for visual management and operation, so it must be configured properly in advance. There are many online tutorials on configuring Neo4j, but I still had a hard time with it. Here are the problems I ran into and the corresponding solutions:

    1. Without changing the JDK version, no version of Neo4j will work:

      Reference: Neo4j Installation and Usage Tutorial_Huali's Blog-CSDN Blog_neo4j Installation Tutorial

    2. The key to the error is that the JDK version does not match the Neo4j version, so the JDK version needs to be changed:

      Reference: [neo4j installation problem] You are using an unsupported version of the Java runtime._vxiao_shen_longv's blog-CSDN blog

    3. Install JDK 8:

      Reference: JDK8.0 installation and configuration_I want rua panda's blog-CSDN blog_jdk8.0

      Small tips:

      Follow the tutorial carefully when configuring the environment variables; do not omit or change the configuration of any variable.

  • Movie knowledge database construction:

    Building the movie knowledge database is essentially an application of web crawling, and the target is an old acquaintance: the Douban Top 250 (it feels like it is about to be crawled to death!!!). For each movie, the crawler collects the title together with eight tags (main actors, director, release time, one-sentence review, region, genre, number of reviewers, and rating) and saves them to a csv file.

    Partial data display of the movie dataset file:
    [figure: sample rows of the movie dataset file]

    The crawler code is not explained here. The complete code is provided at the end, together with the dataset file movieInfo.csv, which you can download and use yourself.
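To make the dataset layout concrete, here is a minimal sketch of writing and reading such a file with Python's standard csv module. Only the title column and the file name movieInfo.csv come from the code in this post; the other column names and the sample row are assumptions for illustration, and the real file may use different headers.

```python
import csv
import io

# 'title' is the column used by the graph-building code; the other eight
# column names are assumed here and may differ in the real movieInfo.csv.
COLUMNS = ["title", "actor", "director", "time", "info",
           "country", "type", "num", "rate"]

def write_movies(rows, fileobj):
    """Write a list of dicts as csv with a header row."""
    writer = csv.DictWriter(fileobj, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)

def read_movies(fileobj):
    """Read the csv back into a list of dicts (all values are strings)."""
    return list(csv.DictReader(fileobj))

# Hypothetical sample row, not taken from the real dataset.
sample = [{"title": "Example Movie", "actor": "Actor A", "director": "Director B",
           "time": "1994", "info": "A one-sentence review.", "country": "Country C",
           "type": "Drama", "num": "1000", "rate": "9.0"}]

# An in-memory buffer stands in for the file on disk
buffer = io.StringIO()
write_movies(sample, buffer)
buffer.seek(0)
movies = read_movies(buffer)
print(movies[0]["title"])
```

Every value read back this way is a string, so numeric tags such as the rating would need an explicit conversion before any arithmetic.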

3. System Construction

With the software configured and the dataset built, we can move on to the most exciting part: building the knowledge graph system. The core of the construction is the python module py2neo, which connects to the neo4j database and carries out the various neo4j operations from python. Both the construction of the knowledge graph and the later content retrieval rely on this module.

The complete code for this part is given first, to make the explanation that follows easier:

## Import the required modules
import pandas as pd
from py2neo import Graph,Node,Relationship

## Connect to the graph database and configure neo4j
graph = Graph("http://localhost:7474//browser/",auth = ('*****','********'))
# Delete all existing data
graph.delete_all()
# Begin a new transaction (the returned object is not used below)
graph.begin()


## Read the csv source data
storageData = pd.read_csv('./movieInfo.csv',encoding = 'utf-8')
# Get all column labels
columnLst = storageData.columns.tolist()
# Get the number of rows
num = len(storageData['title'])

# Build the knowledge graph (centred on the movies)
for i in range(num):

    # Skip The Matrix 2 and The Matrix 3
    if storageData['title'][i] == '黑客帝国2:重装上阵' or storageData['title'][i] == '黑客帝国3:矩阵革命':
        continue

    # Build the property dictionary for each movie
    dict = {}
    for column in columnLst:
        dict[column] = storageData[column][i]
    # print(dict)
    node1 = Node('movie',name = storageData['title'][i],**dict)
    graph.merge(node1,'movie','name')

    ## The code above builds the main node of every movie; the code below builds the child nodes and their relationships
    # Remove the title entry so that no title child node is created
    dict.pop('title')
    ## Child nodes and relationships
    for key,value in dict.items():
        ## Create the child node
        node2 = Node(key,name = value)
        graph.merge(node2,key,'name')
        ## Create the relationship
        rel = Relationship(node1,key,node2)
        graph.merge(rel)

A few important points about the code:

  • py2neo connects to the database with graph = Graph("http://localhost:7474//browser/",auth = ('*****','********')); replace the asterisks with your own user name and password when you actually call it.

    The call differs between old and new versions of py2neo; see: https://blog.csdn.net/u010785550/article/details/116856031

  • The Matrix 2 and The Matrix 3 are skipped because their eight tags contain unknown characters that cause an error when the neo4j node is built, so they are simply eliminated while the data is being read.
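Instead of dropping such rows, the problematic values could be cleaned before the nodes are built. The post does not say which characters caused the error, so the following is only a sketch under the assumption that missing values and non-printable characters were the cause; clean_value and clean_row are hypothetical helper names.

```python
import math

def clean_value(value):
    """Return a cleaned string, or None if nothing usable is left."""
    # pandas represents missing csv cells as float NaN
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    # Drop non-printable characters (control codes, zero-width marks, ...)
    text = "".join(ch for ch in str(value) if ch.isprintable())
    text = text.strip()
    return text or None

def clean_row(row):
    """Keep only the properties that still have a value after cleaning."""
    cleaned = {key: clean_value(value) for key, value in row.items()}
    return {key: value for key, value in cleaned.items() if value is not None}

# Hypothetical row with a zero-width space, a NUL byte and a missing value
row = {"title": "黑客帝国2:重装上阵", "director": "A\u200bB\x00", "time": float("nan")}
print(clean_row(row))
```

A tag that ends up empty is left out of the result, so no child node would be created for it, while the movie itself stays in the graph.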

The following explains the core of the knowledge graph construction code. I am new to this myself, so if there are problems in the code or the explanation, corrections are welcome.

The two most important parts of building a knowledge graph are creating the nodes and connecting the relationships between them. The main body of the code therefore revolves around these two tasks, using the Node class, the Relationship class and the merge function to create nodes and connect the relationships between nodes.

  1. node1 = Node('movie',name = storageData['title'][i],**dict) builds a single node. node1 belongs to the category movie, its name is set to the title of the crawled movie, and the trailing dict supplies additional properties for the node (here, the eight tags of each movie).

    graph.merge(node1,'movie','name') inserts the created node into the knowledge graph; movie is the category.

  2. node2 = Node(key,name = value) creates a node for each of the eight attributes of a movie. The category is the column name, such as time, actor, director, etc., and name is the specific content under that tag. These child nodes also have to be inserted into the knowledge graph with merge.

  3. rel = Relationship(node1,key,node2) uses the Relationship class to connect nodes. The call form Relationship(node1,relationship,node2) creates a relationship pointing from node1 to node2. Here a relationship is created from each movie node to each of its eight tags, and the relationship is the column name.
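The effect of the two merge calls can be illustrated without a database. The sketch below is a plain-Python stand-in, not py2neo itself: it assumes that a node is identified by its category plus its name property, which is how graph.merge(node, label, 'name') is used above, so a value shared by several movies produces one node with several incoming relationships. TinyGraph, add_movie and the sample rows are made up for illustration.

```python
class TinyGraph:
    """In-memory stand-in for the merge behaviour used in this post."""

    def __init__(self):
        self.nodes = {}             # (label, name) -> property dict
        self.relationships = set()  # (start key, relationship type, end key)

    def merge_node(self, label, name, **properties):
        # Reuse the node if (label, name) already exists, else create it
        key = (label, name)
        self.nodes.setdefault(key, {"name": name}).update(properties)
        return key

    def merge_relationship(self, start, rel_type, end):
        # A set keeps each (start, type, end) triple only once
        self.relationships.add((start, rel_type, end))

def add_movie(graph, row):
    """Mirror the loop body above: one movie node plus one node per tag."""
    movie = graph.merge_node("movie", row["title"], **row)
    for key, value in row.items():
        if key == "title":
            continue
        tag = graph.merge_node(key, value)
        graph.merge_relationship(movie, key, tag)

# Hypothetical rows: two movies sharing the same director value
g = TinyGraph()
add_movie(g, {"title": "Movie A", "director": "Director X", "time": "1994"})
add_movie(g, {"title": "Movie B", "director": "Director X", "time": "1999"})
print(len(g.nodes), len(g.relationships))
```

The shared director value is stored once and is pointed to by both movies, which is the property that makes relationship-based queries possible later.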

    OK, the main part of the code is done. Run it, and you can see the resulting knowledge graph in the neo4j browser, as shown in the figure below:
    [figure: the knowledge graph in the neo4j browser]


    This is the dividing line: the code above shows how to build the movie knowledge graph with py2neo, and what follows shows how to use this knowledge graph to retrieve movie content.

    As before, the code is pasted first for reference:

    # Import the required modules
    import jieba.posseg as pseg
    import jieba
    from fuzzywuzzy import fuzz
    from py2neo import Graph
    
    ## Create the neo4j object used later to run Cypher statements
    graph = Graph("http://localhost:7474//browser/",auth = ('*****','********'))
    
    ## Determining the user's intent
    # Matching templates for the eight types of question
    info = ['这部电影主要讲的是什么?','这部电影的主要内容是什么?','这部电影主要说的什么问题?','这部电影主要讲述的什么内容?']
    director = ['这部电影的导演是谁?','这部电影是谁拍的?']
    actor = ['这部电影是谁主演的?','这部电影的主演都有谁?','这部电影的主演是谁?','这部电影的主角是谁?']
    time = ['这部电影是什么时候播出的?','这部电影是什么时候上映的?']
    country = ['这部电影是那个国家的?','这部电影是哪个地区的?']
    type = ['这部电影的类型是什么?','这是什么类型的电影']
    rate = ['这部电影的评分是多少?','这部电影的评分怎么样?','这部电影的得分是多少分?']
    num = ['这部电影的评价人数是多少?','这部有多少人评价过?']
    # Reply templates for the eight types of question
    infoResponse = '{}这部电影主要讲述{}'
    directorResponse = '{}这部电影的导演为{}'
    actorResponse = '{}这部电影的主演为{}'
    timeResponse = '{}这部电影的上映时间为{}'
    countryResponse = '{}这部电影是{}的'
    typeResponse = '{}这部电影的类型是{}'
    rateResponse = '{}这部电影的评分为{}'
    numResponse = '{}这部电影评价的人数为{}人'
    # Dictionary of user-intent templates
    stencil = {'info':info,'director':director,'actor':actor,'time':time,'country':country,'type':type,'rate':rate,'num':num}
    # Dictionary of reply templates
    responseDict = {'infoResponse':infoResponse,'directorResponse':directorResponse,'actorResponse':actorResponse,'timeResponse':timeResponse,'countryResponse':countryResponse,'typeResponse':typeResponse,'rateResponse':rateResponse,'numResponse':numResponse}
    
    # Guess the user's intent from how well the input matches the templates
    ## Fuzzy matching reference: https://blog.csdn.net/Lynqwest/article/details/109806055
    def AssignIntension(text):
        '''
        :param text: the user input to be matched
        :return: dict: the matching score of each intent
        '''
        stencilDegree = {}
        for key,value in stencil.items():
            score = 0
            for item in value:
                degree = fuzz.partial_ratio(text,item)
                score += degree
            stencilDegree[key] = score/len(value)
    
        return stencilDegree
    
    
    ## Extracting the entity from the question
    ## jieba word segmentation reference: https://blog.csdn.net/smilejiasmile/article/details/80958010
    def getMovieName(text):
        '''
        :param text: the user input
        :return: the movie name contained in the input
        '''
        movieName = ''
        jieba.load_userdict('./selfDefiningTxt.txt')
        words =pseg.cut(text)
        for w in words:
            ## Extract the movie name from the question
            if w.flag == 'lqy':
                movieName = w.word
        return movieName
    
    
    ## Generate the Cypher statement, query the knowledge graph and return the result
    ## Reference for running Cypher with py2neo: https://blog.csdn.net/qq_38486203/article/details/79826028
    def SearchGraph(movieName,stencilDcit = {}):
        '''
        :param movieName: the movie name to query
        :param stencilDcit: dictionary of intent matching scores
        :return: the user-intent category, the knowledge graph query result
        '''
        classification = [k for k,v in stencilDcit.items() if v == max(stencilDcit.values())][0]
        ## Run the Cypher statement from python to perform the query
        cyphere = 'match (n:movie) where n.title = "' + str(movieName) + '" return n.' + str(classification)
        object = graph.run(cyphere)
        # Stays None when the query returns no rows
        result = None
        for item in object:
            result = item
        return classification,result
    
    ## Answer the question using the reply template
    def respondQuery(movieName,classification,item):
        '''
        :param movieName: the movie name
        :param classification: the user-intent category
        :param item: the knowledge graph query result
        :return: None
        '''
        query = classification + 'Response'
        response = [v for k,v in responseDict.items() if k == query][0]
        print(response.format(movieName,item))
    
    def main():
        queryText = '肖申克的救赎这部电影的导演是谁?'
        movieName = getMovieName(queryText)
        dict = AssignIntension(queryText)
        classification,result = SearchGraph(movieName,dict)
        respondQuery(movieName,classification,result)
    
    if __name__ == '__main__':
        main()
    

    First, a note on the system: the movie question answering system built in this project can only answer eight kinds of question, corresponding to the eight tags attached to each movie when the movie node is built: actor (leading actors), director (director), time (release time), country (release country), type (movie genre), num (number of reviewers), rate (movie rating), and content (one-sentence review).

    The overall idea behind the question answering system:

    • Match the user input against the preset question templates to determine the type of question being asked (one of the eight types above)

    • Parse the user input and extract the entity in the sentence (here, the movie name)

    • Combine the question category and the movie name into a Cypher query statement, and query the knowledge graph to get the result

    • Fill the returned result into the corresponding reply template and output it, completing the whole movie question answering process

      The four steps of the question answering system are explained below, with the implementation steps and the main code:

      1. User intent matching:

      The idea here is fairly simple. A python fuzzy matching library is used to match the sentence entered by the user against every sentence in each pre-built category list. The matching scores for a category are averaged and stored in a dictionary, and the category with the highest score in the dictionary is taken as the user's intent.

      AssignIntension() is the corresponding function: it receives the user input and returns a dictionary of matching scores.
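The averaging scheme can be sketched with the standard library alone. This version swaps fuzz.partial_ratio for difflib.SequenceMatcher.ratio, so the scores will differ from those of fuzzywuzzy, and it uses only three of the eight template lists to stay short; assign_intent and best_intent are hypothetical names.

```python
from difflib import SequenceMatcher

# A subset of the question templates defined in the code above
templates = {
    "director": ["这部电影的导演是谁?", "这部电影是谁拍的?"],
    "actor": ["这部电影是谁主演的?", "这部电影的主演是谁?"],
    "time": ["这部电影是什么时候播出的?", "这部电影是什么时候上映的?"],
}

def assign_intent(text):
    """Average similarity between the input and each category's templates."""
    scores = {}
    for intent, questions in templates.items():
        total = sum(SequenceMatcher(None, text, q).ratio() for q in questions)
        scores[intent] = total / len(questions)
    return scores

def best_intent(text):
    """Pick the category with the highest average score."""
    scores = assign_intent(text)
    return max(scores, key=scores.get)

print(best_intent("肖申克的救赎这部电影的导演是谁?"))
```

Because every template shares the prefix 这部电影, the scores of different categories tend to sit close together, so adding more varied templates per category is one simple way to widen the margin.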

      2. Content entity extraction:

      In this project, content entity extraction is responsible for extracting the movie title from the user's question, which is the key to all subsequent processing.

      The movie name is contained in the user's input, so the first idea is to split the sentence with Chinese word segmentation and then pick out the movie name. However, movie names are diverse and complex, so word segmentation may split a title into pieces, and it is also hard to tell which piece belongs to the title.

      Plain word segmentation therefore cannot do the job, and we rely on the custom dictionary feature of jieba. jieba supports importing a custom dictionary: during segmentation, the words you define are recognized and kept as single words, so the movie name is no longer split. The custom dictionary and the code are also placed in the folder at the end for reference.

      The code jieba.load_userdict('./selfDefiningTxt.txt') imports the custom dictionary.

      OK, the movie name is now kept intact, but how do we know which word is the movie name? jieba provides part-of-speech tagging, and custom dictionaries support it too. We only need to add a special tag after each movie name as its part of speech (this example uses lqy, the abbreviation of my own name) and then pick out the words tagged lqy after segmentation to get the movie name.

      if w.flag == 'lqy':
          movieName = w.word
      

      Each word produced by segmentation has two attributes, word and flag, which store the word itself and its part of speech respectively.

      Reference: jieba word segmentation added to custom dictionary_Am the gentlest blog-CSDN blog_jieba custom dictionary
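For reference, each line of a jieba user dictionary holds a word, optionally followed by a frequency and a part-of-speech tag, separated by spaces. The sketch below generates such lines from a list of movie titles so that every title carries the custom tag lqy. The function name, the frequency value and the sample titles are assumptions, and the actual selfDefiningTxt.txt in the repository may have been produced differently.

```python
def build_userdict_lines(titles, tag="lqy", freq=10):
    """Return one 'word freq tag' line per usable movie title."""
    lines = []
    for title in titles:
        title = title.strip()
        # Fields are space-separated, so titles containing whitespace
        # are skipped here as a precaution
        if not title or any(ch.isspace() for ch in title):
            continue
        lines.append(f"{title} {freq} {tag}")
    return lines

# Sample titles for illustration only
titles = ["肖申克的救赎", "霸王别姬", "Forrest Gump", ""]
lines = build_userdict_lines(titles)
print("\n".join(lines))

# Writing the file that jieba.load_userdict() would read:
# with open("./selfDefiningTxt.txt", "w", encoding="utf-8") as f:
#     f.write("\n".join(lines) + "\n")
```

Generating the dictionary from the title column of the dataset keeps the two in sync, so every movie in the graph can also be recognized in a question.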

      3. Cypher statement query:

      As I understand it, neo4j, like mysql, has its own official query language: Cypher is the official query language of neo4j. For a detailed explanation of Cypher, readers who need it can turn to other blog posts to learn the syntax. Only the simplest Cypher query statement used in this application is shown here:

      // Query the release time of 肖申克的救赎 (The Shawshank Redemption)
      match (n:movie) where n.title = '肖申克的救赎' return n.time
      

      With the movie name and the user-intent category obtained above, the Cypher statement can be constructed and run against the knowledge graph, which returns the target result.

      cyphere = 'match (n:movie) where n.title = "' + str(movieName) + '" return n.' + str(classification)

      The code above constructs the Cypher statement; running the query through py2neo then returns the target content.
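String concatenation breaks as soon as a movie title contains a double quote, and it lets arbitrary text reach the database. A hedged alternative is to pass the title as a query parameter and to check the property name against a fixed list, since Cypher parameters cannot stand in for property names. build_query and ALLOWED_PROPERTIES are hypothetical names; the property list mirrors the keys of the stencil dictionary above.

```python
# Property names taken from the keys of the stencil dictionary
ALLOWED_PROPERTIES = {"info", "director", "actor", "time",
                      "country", "type", "rate", "num"}

def build_query(movie_name, classification):
    """Return a Cypher string and its parameters for one lookup."""
    if classification not in ALLOWED_PROPERTIES:
        raise ValueError(f"unknown property: {classification!r}")
    # The property name is validated above, the title goes in as $title
    query = f"MATCH (n:movie) WHERE n.title = $title RETURN n.{classification}"
    return query, {"title": str(movie_name)}

query, params = build_query("肖申克的救赎", "director")
print(query)
print(params)

# With py2neo the pair could then be run as graph.run(query, params)
```

The query text stays the same for every movie, and only the parameter dictionary changes from one question to the next.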

      4. Reply statement matching:

      After the target result has been queried from the knowledge graph, the query result and the movie name are substituted into a reply template.

      There are eight reply templates in total; the result has to be substituted into the one that matches the user's intent, and printing it completes the question answering function of the system.
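Put together, the lookup and substitution amount to a dictionary access followed by str.format. The sketch below uses two of the eight templates and adds a fallback for the case where the query returned nothing, which the original code does not handle; build_reply and the fallback text are assumptions.

```python
# Two of the eight reply templates from the code above
response_templates = {
    "director": "{}这部电影的导演为{}",
    "time": "{}这部电影的上映时间为{}",
}

def build_reply(movie_name, classification, result):
    """Fill the template for the detected intent with the query result."""
    if result is None or classification not in response_templates:
        # Hypothetical fallback when nothing was found
        return "No information found for {}".format(movie_name)
    return response_templates[classification].format(movie_name, result)

# "Director X" is a placeholder, not real query output
print(build_reply("肖申克的救赎", "director", "Director X"))
print(build_reply("肖申克的救赎", "director", None))
```

Keying the templates by the intent name directly avoids having to append 'Response' to build the lookup key.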

4. Summary and reflection:

This project uses python to build a very simple knowledge graph question answering system. Small as it is, it has all the essential parts, and the project gives a basic understanding of the process of building such a system. However, it still has quite a few problems and plenty of room for improvement:

  • The dataset is handled crudely: data that does not meet the requirements and is hard to process, such as The Matrix sequels, is simply discarded, which is unacceptable in a well-built project.

  • Because I am not familiar with Cypher syntax, the knowledge graph is not used efficiently. Careful readers will notice that retrieval in this project only uses the eight properties in the property dictionary of the movie Node and never uses a relationship. Personally, I think the relationship is the core strength of a knowledge graph, but using it requires more advanced Cypher syntax, which is the focus of future improvement.
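As one possible direction, the graph built above already contains relationships that a query can follow: every movie node points to its tag nodes, and both the relationship type and the tag node's category are the column name. Under that schema, a Cypher query for "which movies share this director value" could be built as sketched below. movies_by_tag and TAGS are hypothetical names, the column names are assumed to match the keys of the stencil dictionary, and because the director column may hold several names in one string, an exact match on name is only a first approximation.

```python
# Column names assumed to exist as categories and relationship types
TAGS = {"info", "director", "actor", "time",
        "country", "type", "rate", "num"}

def movies_by_tag(tag, value):
    """Build a Cypher query that follows movie -> tag relationships."""
    if tag not in TAGS:
        raise ValueError(f"unknown tag: {tag!r}")
    # Backticks quote the category and relationship type in Cypher
    query = (
        f"MATCH (m:movie)-[:`{tag}`]->(t:`{tag}` {{name: $value}}) "
        "RETURN m.title"
    )
    return query, {"value": value}

# "Director X" is a placeholder value
query, params = movies_by_tag("director", "Director X")
print(query)
```

A question such as "which other movies did this director make" would then need its own question templates and a reply template that can list several titles.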

5. The complete code:

Considering that some readers may have trouble accessing github, a Baidu Netdisk link is provided; the github link is given after it:

Link: https://pan.baidu.com/s/1E9-BQUAlfi05dyDgNxK9bQ
Extract code:dbo9

GitHub link:

https://github.com/booue/Movie-Knowledge-QS-system-using-KnowledgeGraph


Finally finished!!! This is my first time working with knowledge graphs, so if anything is inappropriate, criticism and corrections are welcome.


Origin blog.csdn.net/DALEONE/article/details/125116858