Scraping Data Online with Python and Importing It into Neo4j to Build a Knowledge Graph

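I have been learning Neo4j recently. Using the Douban Top 250 movie list as the data set, I scrape the pages online with Python and write the results into Neo4j to build a knowledge graph. The steps are described in detail below.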

1. Knowledge graph design

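Analyzing the page shows that scraping it yields movie, country, type, time, director, actor, and score. Here I model movie, country, type, time, director, and actor as nodes, and score as a property of movie. Many tutorials online use only movie, director, and actor as nodes and store everything else as properties of movie. I tried that before, but the result was not what I wanted; the reason is discussed later in this post. The node and relationship design is shown in the figure below.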
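The design can also be written down as plain data. The following is a minimal sketch: the labels, properties, and relationship types are the ones used by the script in section 2, while the names `NODE_LABELS`, `RELATIONSHIPS`, and `describe` are hypothetical and exist only for this illustration.

```python
# Node labels and the properties stored on each (as used by the script in section 2).
NODE_LABELS = {
    "Movie": ["name", "score"],
    "Person": ["name"],   # directors and actors share one label
    "Time": ["year"],
    "Country": ["name"],
    "Type": ["name"],
}

# (start label, relationship type, end label)
RELATIONSHIPS = [
    ("Person", "directed", "Movie"),
    ("Person", "acted_in", "Movie"),
    ("Movie", "created_in", "Time"),
    ("Movie", "produced_by", "Country"),
    ("Movie", "belong_to", "Type"),
]


def describe(relationships):
    """Render each relationship as a Cypher-style pattern string."""
    return ["(:%s)-[:%s]->(:%s)" % rel for rel in relationships]


for pattern in describe(RELATIONSHIPS):
    print(pattern)
```

Every relationship touches a Movie node, so a movie is always one hop away from its year, country, and genres.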

2. Scraping the data and writing it to Neo4j

Here is the code:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import urllib.request
import re
import random
from py2neo import Graph

# Pool of User-Agent strings; one is chosen at random for each request.
ua_list = [
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36",  # Chrome
    "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0",  # Firefox
    "Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko",  # IE
    "Opera/9.99 (Windows NT 5.1; U; zh-CN) Presto/9.9.9",  # Opera
]

if __name__ == "__main__":
    # Connect to the graph database; adjust the URL and credentials to your own Neo4j instance.
    graph = Graph(
        "http://localhost:11010/",
        username="admin",
        password="password"
    )
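    # Names of nodes already created; kept across all pages so that a node
    # is not created again when it reappears on a later page.
    list_item = []
    punc = ':· - ...:-'
    # The Top 250 list spans 10 pages of 25 movies each.
    for page_no in range(10):
        ua = random.choice(ua_list)
        url = 'https://movie.douban.com/top250?start=' + str(page_no * 25) + '&filter='
        req = urllib.request.Request(url, headers={'User-agent': ua})
        html = urlopen(req).read()
        soup = BeautifulSoup(html, 'lxml')
        page = soup.find_all('div', {'class': 'item'})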
        for item in page:
            try:
                # The info paragraph has two lines: people (director, cast)
                # and facts (year / country / genres).
                info = item.find('p', {'class': ""}).text.strip().split('\n')
                text0 = info[0]
                text1 = info[1]
                # get film title
                film = item.find('span', {'class': 'title'}).text.strip()
                film = re.sub(r"[%s]+" % punc, "", film.strip())
                # get score
                score = item.find('span', {'class': 'rating_num'}).text.strip()
                graph.run(
                    "CREATE (movie:Movie {name:'" + film + "', score:'" + score + "'})")
                # get director
                directors = text0.strip().split('   ')[0].strip().split(':')[1]
                directors = re.sub(r"[%s]+" % punc, "", directors.strip())  # strip special characters first
                if len(directors.split('/')) > 1:
                    print(film + ' has more than one director')
                # create the director node
                if directors not in list_item:
                    graph.run(
                        "CREATE (director:Person {name:'" + directors + "'})")
                    list_item.append(directors)
                # create the director-movie relationship
                graph.run(
                    "match (p:Person{name:'" + directors + "'}),(b:Movie{name:'" + film + "'})" + " CREATE (p)-[:directed]->(b)")
                # get actors
                actors = text0.strip().split('   ')[1].strip().split(':')[1]
                actors = re.sub(r"[%s]+" % punc, "", actors.strip())  # strip special characters first
                actor = actors.split('/')
                if len(actor) > 1:
                    # Skip the last entry, as the original loop did; presumably
                    # the page truncates the cast list with '...', which leaves
                    # an empty or cut-off final name.
                    actor = actor[:-1]
                for name in actor:
                    # create the actor node
                    if name not in list_item:
                        graph.run(
                            "CREATE (actor:Person {name:'" + name + "'})")
                        list_item.append(name)
                    # create the actor-movie relationship
                    graph.run(
                        "match (p:Person{name:'" + name + "'}),(b:Movie{name:'" + film + "'})" + " CREATE (p)-[:acted_in]->(b)")
                # get release year
                year = text1.strip().split('/')[0].strip()
                if year not in list_item:
                    graph.run(
                        "CREATE (time:Time {year:'" + year + "'})")
                    list_item.append(year)
                graph.run(
                    "match (p:Time{year:'" + year + "'}),(b:Movie{name:'" + film + "'})" + " CREATE (b)-[:created_in]->(p)")
                # get country; a movie may list more than one, only the first is used
                country = text1.strip().split('/')[1].strip().split(' ')[0]
                if country not in list_item:
                    graph.run(
                        "CREATE (country:Country {name:'" + country + "'})")
                    list_item.append(country)
                graph.run(
                    "match (p:Country {name:'" + country + "'}),(b:Movie{name:'" + film + "'})" + " CREATE (b)-[:produced_by]->(p)")
                # get genres; a movie may have one or several
                types = text1.strip().split('/')[2].strip().split(' ')
                for genre in types:
                    # create the genre node
                    if genre not in list_item:
                        graph.run(
                            "CREATE (type:Type {name:'" + genre + "'})")
                        list_item.append(genre)
                    # create the movie-genre relationship
                    graph.run(
                        "match (p:Type{name:'" + genre + "'}),(b:Movie{name:'" + film + "'})" + " CREATE (b)-[:belong_to]->(p)")
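            except Exception as e:
                # Skip entries whose layout does not match the parsing above, but report them.
                print('skipped one entry:', e)
                continue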

The code is fairly rough; I will refine it later.
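One possible refinement is to move the string handling into a pure function so that it can be checked without network access or a database. The sketch below assumes the two-line layout that the script relies on (people on the first line, year / country / genres on the second). The function name `parse_info` and the sample text are illustrative only. Unlike the script, it keeps names unmodified instead of stripping punctuation, which is only safe if values are later passed to Cypher as query parameters rather than concatenated into the query string.

```python
import re


def parse_info(text):
    """Split a Douban info paragraph into director, actors, year, country, genres.

    Assumed layout (not verified against the live page):
      line 1: "导演: <names>   主演: <names>"
      line 2: "<year> / <country> / <genres>"
    """
    lines = [ln.strip() for ln in text.strip().split("\n")]
    people, facts = lines[0], lines[1]

    # Director and cast are separated by a run of two or more whitespace
    # characters (\s also matches non-breaking spaces).
    parts = re.split(r"\s{2,}", people)

    def after_colon(s):
        # Drop the "导演:" / "主演:" prefix (ASCII or full-width colon).
        return re.split(r"[::]", s, maxsplit=1)[-1].strip()

    director = after_colon(parts[0])
    cast = after_colon(parts[1]) if len(parts) > 1 else ""
    actors = [a.strip() for a in cast.split("/")]
    # Drop empty entries and entries cut off by a trailing "...".
    actors = [a for a in actors if a and not a.endswith("...")]

    fields = [f.strip() for f in facts.split("/")]
    return {
        "director": director,
        "actors": actors,
        "year": fields[0],
        "country": fields[1].split()[0],  # first country only, as in the script
        "genres": fields[2].split(),
    }


# Illustrative sample in the assumed layout.
sample = (
    "导演: 弗兰克·德拉邦特 Frank Darabont   主演: 蒂姆·罗宾斯 Tim Robbins /...\n"
    "        1994 / 美国 / 犯罪 剧情"
)
info = parse_info(sample)
print(info)
```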

3. The resulting knowledge graph

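The overall result is shown in the figure above: related information can be retrieved explicitly through the country, type, and time nodes. If only movie, director, and actor were nodes, you would have to click on a specific movie node to see its country, type, and time properties.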
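Because country is a node of its own, such a lookup becomes a one-hop pattern match. The sketch below builds a parameterized Cypher query for it; the helper name `movies_by_country_query` is hypothetical, and the `$country` parameter syntax assumes a reasonably recent Neo4j version.

```python
def movies_by_country_query(country):
    """Return a Cypher query and its parameters for all movies of one country."""
    query = (
        "MATCH (m:Movie)-[:produced_by]->(c:Country {name: $country}) "
        "RETURN m.name AS name, m.score AS score"
    )
    return query, {"country": country}


query, params = movies_by_country_query("美国")
print(query)
print(params)

# With a py2neo Graph object named `graph`, as in the script in section 2,
# the query could be run like this:
#     for record in graph.run(query, **params):
#         print(record["name"], record["score"])
```

The same pattern works for Type and Time by swapping the label, the relationship type (`belong_to`, `created_in`), and the property name.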

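With that, a simple Douban Top 250 knowledge graph is built. One problem remains, however: duplicated data. After finishing, I found that not only nodes but also relationships are duplicated. I am still investigating this.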
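A plausible cause of this duplication (an assumption, not verified against the author's database) is the use of CREATE. CREATE always adds a new node or relationship, so re-running the script, or creating a name that the Python-side list has not seen, produces duplicate nodes. A later MATCH ... CREATE then adds one relationship for every matching pair of nodes. Cypher's MERGE clause matches an existing pattern and creates it only if it is missing. The sketch below shows the idea for one movie and its director. `RecordingGraph` is a hypothetical stand-in that only records queries so the example runs without Neo4j; a real py2neo `Graph` would be passed in its place. The sample values are for illustration.

```python
class RecordingGraph:
    """Stand-in for a py2neo Graph: records queries instead of running them."""

    def __init__(self):
        self.calls = []

    def run(self, cypher, **params):
        self.calls.append((cypher, params))


def merge_movie_with_director(graph, film, score, director):
    """Write one movie, its director, and their relationship idempotently."""
    # MERGE reuses the node if it already exists, so re-runs add nothing new.
    graph.run("MERGE (m:Movie {name: $name}) SET m.score = $score",
              name=film, score=score)
    graph.run("MERGE (p:Person {name: $name})", name=director)
    # MERGE on the relationship avoids a second :directed edge between the same pair.
    graph.run(
        "MATCH (p:Person {name: $person}), (m:Movie {name: $film}) "
        "MERGE (p)-[:directed]->(m)",
        person=director, film=film,
    )


demo = RecordingGraph()
merge_movie_with_director(demo, "肖申克的救赎", "9.7", "弗兰克·德拉邦特 Frank Darabont")
for cypher, params in demo.calls:
    print(cypher, params)
```

Passing values as parameters also avoids broken queries when a name contains a quote character. Duplicates that already exist in the database would still need to be cleaned up separately.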


Reposted from blog.csdn.net/haiziccc/article/details/103268674