Project design collection (artificial intelligence direction): a curated collection of meaningful project designs (not limited to NLP, knowledge graphs, computer vision, etc.) that helps newcomers quickly master skills through hands-on practice, make better use of the CSDN platform, independently complete and upgrade project designs, and strengthen their core technical skills.
Building a knowledge graph from scratch: build an encyclopedia knowledge graph, perform knowledge extraction with Deepdive, simple semantic search with ES, and simple KBQA with REfO
These are study notes from my own introduction to knowledge graphs. They serve as a semi-tutorial that gives beginners a preliminary understanding of the various knowledge graph tasks. There are currently no plans to add more.
1. Introduction
The goal is to cover the knowledge in Baidu Encyclopedia, Interactive Encyclopedia, and Chinese Wikipedia, with entities numbering in the tens of millions and relationships in the billions. So far, Baidu Encyclopedia and Interactive Encyclopedia have been completed: 4,190,390 entries from Baidu Encyclopedia and 4,382,575 entries from Interactive Encyclopedia. Converting them to RDF yields 128,596,018 triples. Stored in Neo4j, the graph has 16,498,370 nodes, 56,371,456 relationships, and 61,967,517 attributes.
For the source of the project, see the top or end of the article
https://download.csdn.net/download/sinat_39620217/87988980
- Table of contents
  - Knowledge extraction from Baidu Encyclopedia and Interactive Encyclopedia
    - Semi-structured data
      - Baidu Encyclopedia crawler
      - Interactive Encyclopedia crawler
    - Unstructured data
      - WeChat official account crawler
      - Huxiu.com crawler
  - Knowledge extraction from unstructured text
  - Knowledge storage
  - Knowledge fusion
  - KBQA
  - Semantic search
2. Get data
2.1 Semi-structured data
Semi-structured data is crawled from Baidu Encyclopedia and Interactive Encyclopedia with the scrapy framework. It currently falls into two categories: the film domain and the general domain.
- General domain: 4,190,390 entries from Baidu Encyclopedia and 3,677,150 entries from Interactive Encyclopedia. For crawling details, see "Building a Knowledge Graph from Scratch (7): Encyclopedia Knowledge Graph Construction (1): Knowledge Extraction from Baidu Encyclopedia".
- Film domain: Baidu Encyclopedia yields 22,219 films and 13,967 actors, and Interactive Encyclopedia yields 13,866 films and 5,931 actors. For a detailed introduction, see "Building a Knowledge Graph from Scratch (1): Acquisition of Semi-structured Data".
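To make "semi-structured" concrete, here is a minimal sketch of turning an infobox-style block of name/value pairs into an attribute dictionary. It uses only the Python standard library as a stand-in for the project's scrapy spiders. The `<dl>/<dt>/<dd>` layout, the `basicInfo` class name, and the sample page are assumptions made for illustration, not the markup of either encyclopedia.

```python
from html.parser import HTMLParser


class InfoboxParser(HTMLParser):
    """Collect <dt>name</dt><dd>value</dd> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.attributes = {}
        self._tag = None     # cell currently being read: "dt", "dd" or None
        self._name = ""      # text of the most recent <dt>
        self._buffer = []    # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (for example links) is kept as well.
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buffer).strip()
        if tag == "dt":
            self._name = text
        elif self._name:
            self.attributes[self._name] = text
            self._name = ""
        self._tag = None


def parse_infobox(html):
    """Return the name/value pairs found in an infobox-style HTML fragment."""
    parser = InfoboxParser()
    parser.feed(html)
    return parser.attributes


# Hypothetical page fragment, not copied from a real encyclopedia page.
SAMPLE = """
<dl class="basicInfo">
  <dt>Title</dt><dd>Farewell My Concubine</dd>
  <dt>Director</dt><dd><a href="/item/1">Chen Kaige</a></dd>
  <dt>Release year</dt><dd>1993</dd>
</dl>
"""

if __name__ == "__main__":
    for name, value in parse_infobox(SAMPLE).items():
        print(name, "->", value)
```

Each extracted pair can later become one attribute of the entity that the page describes.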
2.2 Unstructured Data
Unstructured data comes mainly from WeChat official accounts, Huxiu.com news, and the unstructured text in the encyclopedias.
The WeChat official account crawler (ie/craw/weixin_spider) collects the title, release time, official account name, content, and reference source of each article an account publishes. The Huxiu.com crawler (ie/craw/news_spider) collects the title, brief description, author, release time, and content of each news item.
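The fields listed above can be pictured as two record types. The sketch below is only an illustration: the class names, the attribute names, and the JSON-lines output are assumptions and are not taken from the spiders in ie/craw.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class WeixinArticle:
    """One article from a WeChat official account (field names are assumed)."""
    title: str
    publish_time: str
    account_name: str
    content: str
    source: str


@dataclass
class HuxiuNews:
    """One Huxiu.com news item (field names are assumed)."""
    title: str
    summary: str
    author: str
    publish_time: str
    content: str


def to_json_line(record):
    """Serialize a record as one JSON line, keeping Chinese text readable."""
    return json.dumps(asdict(record), ensure_ascii=False)


if __name__ == "__main__":
    news = HuxiuNews("A title", "A summary", "An author", "2018-01-01", "Body text")
    print(to_json_line(news))
```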
3. Knowledge extraction from unstructured text
3.1 Knowledge extraction based on Deepdive
Deepdive is an open source knowledge extraction system developed by the Stanford University InfoLab. It extracts structured relational data from unstructured text through weakly supervised learning. This walkthrough is based on [Deepdive with Chinese support: Stanford University's open source knowledge extraction tool (triple extraction)](http://www.openkg.cn/dataset/cn-deepdive) on OpenKG, which we use to extract actor-film relationships in the film domain.
For a detailed introduction, please refer to Building a Knowledge Graph from Scratch (5) Deepdive Extracting Actor-Movie Relationships
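The idea behind weak supervision can be sketched in a few lines: generate candidate (actor, film) pairs from sentences, then label the candidates that a small set of known pairs confirms, leaving the rest for the model to decide. This is plain Python, not Deepdive code. The name lists, the known pair, and the labeling rule are assumptions made for illustration.

```python
from itertools import product

# Tiny hypothetical seed data; a real run would take these from the graph.
KNOWN_ACTED_IN = {("Leslie Cheung", "Farewell My Concubine")}
ACTORS = {"Leslie Cheung", "Gong Li"}
FILMS = {"Farewell My Concubine", "Red Sorghum"}


def candidates(sentence):
    """Return every (actor, film) pair whose names both occur in the sentence."""
    actors = sorted(a for a in ACTORS if a in sentence)
    films = sorted(f for f in FILMS if f in sentence)
    return list(product(actors, films))


def label(pair):
    """Return 1 if the seed data confirms the pair, otherwise 0 (unlabeled)."""
    return 1 if pair in KNOWN_ACTED_IN else 0


if __name__ == "__main__":
    text = "Leslie Cheung and Gong Li both appear in Farewell My Concubine."
    for pair in candidates(text):
        print(pair, label(pair))
```

An unlabeled pair is not treated as a negative example here, since the seed data is incomplete by construction.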
3.2 Neural Network Relation Extraction
We use our own encyclopedia graph to build a distant supervision dataset and train on it with OpenNRE. The final dataset contains 18,226 relational facts, 336,693 no-relation (NA) entity pairs, and 354,919 entity pairs in total, across 462 relations (including NA).
For a detailed introduction, please refer to Building a Knowledge Graph from Scratch (9) Encyclopedia Knowledge Graph Construction (3) Dataset Construction and Practice of Neural Network Relation Extraction
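As a rough sketch of how such a dataset is assembled: every ordered pair of graph entities that co-occurs in a sentence becomes an instance, labeled with the relation the graph holds for that pair, or NA when it holds none. The record keys, the sample triples, and the summary function are assumptions for illustration and do not reproduce the exact OpenNRE input format.

```python
from collections import Counter
from itertools import permutations

# Hypothetical slice of the graph: (head, relation, tail).
TRIPLES = {
    ("Farewell My Concubine", "director", "Chen Kaige"),
    ("Farewell My Concubine", "starring", "Leslie Cheung"),
}
ENTITIES = {"Farewell My Concubine", "Chen Kaige", "Leslie Cheung"}
RELATION_OF = {(head, tail): rel for head, rel, tail in TRIPLES}


def build_instances(sentences):
    """Label each ordered entity pair in each sentence with a relation or NA."""
    instances = []
    for sentence in sentences:
        found = sorted(e for e in ENTITIES if e in sentence)
        for head, tail in permutations(found, 2):
            relation = RELATION_OF.get((head, tail), "NA")
            instances.append(
                {"sentence": sentence, "head": head, "tail": tail, "relation": relation}
            )
    return instances


def summarize(instances):
    """Count distinct entity pairs, split into relational facts and NA pairs."""
    pairs = {(i["head"], i["tail"]): i["relation"] for i in instances}
    counts = Counter("NA" if rel == "NA" else "fact" for rel in pairs.values())
    return {
        "facts": counts["fact"],
        "na_pairs": counts["NA"],
        "total_pairs": len(pairs),
    }


if __name__ == "__main__":
    corpus = [
        "Chen Kaige directed Farewell My Concubine.",
        "Leslie Cheung starred in Farewell My Concubine.",
    ]
    print(summarize(build_instances(corpus)))
```

The summary mirrors the statistics quoted above: facts plus NA pairs add up to the total number of entity pairs (18,226 + 336,693 = 354,919 in the real dataset).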
4. Structured data to RDF
There are two main ways to convert structured data to RDF: direct mapping and the R2RML language. The R2RML-based approach is more flexible and highly customizable. Several useful tools exist for R2RML; here we use the d2rq tool, which is based on R2RML-KIT.
For a detailed introduction, please refer to Building a Knowledge Graph from Scratch (2) Database Access to RDF and Jena
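To give a feel for direct mapping, the sketch below turns one relational row into N-Triples lines: the row becomes a subject IRI, each non-NULL column becomes a predicate with a literal value, and the table becomes the row's type. It is a simplified illustration in plain Python, not what d2rq generates. The base IRI, the IRI shapes, and the `movie` table are assumptions.

```python
BASE = "http://example.org/"  # hypothetical base IRI
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value):
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    text = str(value)
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def row_to_ntriples(table, key_column, row):
    """Map one table row (a dict of column -> value) to N-Triples lines."""
    subject = f"<{BASE}{table}/{row[key_column]}>"
    lines = [f"{subject} <{RDF_TYPE}> <{BASE}{table}> ."]
    for column, value in row.items():
        if value is None:
            continue  # a NULL column produces no triple
        predicate = f"<{BASE}{table}#{column}>"
        lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


if __name__ == "__main__":
    row = {"movie_id": 1, "title": "Farewell My Concubine", "rating": None}
    for line in row_to_ntriples("movie", "movie_id", row):
        print(line)
```

A mapping language such as R2RML adds what this sketch lacks: custom IRI templates, typed literals, and joins that turn foreign keys into links between resources.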
5. Knowledge storage
5.1 Storing data into Neo4j
A graph database is a type of NoSQL database built on graph theory: both its storage structure and its query methods are graph-based. The basic elements of a graph, nodes and edges, correspond to nodes and relationships in the graph database. We store the data obtained above in Neo4j.
For the encyclopedia graph, see "Building a Knowledge Graph from Scratch (8): Encyclopedia Knowledge Graph Construction (2): Storing Data in Neo4j".
For the film domain, see "Building a Knowledge Graph from Scratch (6): Storing Data in Neo4j".
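One way to picture the move from triples to a property graph: a triple whose object is another entity becomes a relationship, and a triple whose object is a plain value becomes an attribute of the subject node. The sketch below performs only this split in memory. The two Cypher strings show how the result might be written with a Neo4j driver, but the `Entity` label, the `RELATION` type, and the property names are assumptions, and no database is contacted.

```python
# Parameterized Cypher statements (assumed schema) for writing the result.
NODE_QUERY = "MERGE (n:Entity {name: $name}) SET n += $props"
REL_QUERY = (
    "MATCH (a:Entity {name: $head}), (b:Entity {name: $tail}) "
    "MERGE (a)-[:RELATION {type: $type}]->(b)"
)


def split_triples(triples, entity_names):
    """Split triples into node attribute dicts and entity-to-entity relationships."""
    nodes = {name: {} for name in entity_names}
    relationships = []
    for subject, predicate, obj in triples:
        nodes.setdefault(subject, {})
        if obj in entity_names:
            relationships.append((subject, predicate, obj))
        else:
            nodes[subject][predicate] = obj
    return nodes, relationships


if __name__ == "__main__":
    triples = [
        ("Farewell My Concubine", "director", "Chen Kaige"),
        ("Farewell My Concubine", "year", "1993"),
    ]
    entities = {"Farewell My Concubine", "Chen Kaige"}
    nodes, relationships = split_triples(triples, entities)
    attribute_count = sum(len(props) for props in nodes.values())
    print(len(nodes), "nodes,", len(relationships), "relationships,",
          attribute_count, "attributes")
```

This split is why a graph loaded from triples reports separate counts for nodes, relationships, and attributes.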
6. KBQA
6.1 Simple KBQA based on REfO
Building on the REfO-based KBQA implementation and examples that Zhejiang University published on OpenKG, we implement a simple knowledge question answering system on our own knowledge graph.
For a detailed introduction, please see Building a Knowledge Graph from Scratch (3) Simple Knowledge Questions and Answers Based on REfO
- example
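The rule-based approach maps a question pattern to a query template. The sketch below illustrates that flow with Python's built-in `re` module on plain strings, as a stand-in for REfO, which matches patterns over sequences of objects such as tagged words. The question patterns, the SPARQL predicates (`:movie_title`, `:has_actor`, `:person_name`), and the function name are all hypothetical.

```python
import re

# Each rule pairs a question pattern with a SPARQL template (assumed schema).
RULES = [
    (
        re.compile(r"^who (?:starred|acted) in (?P<film>.+?)\??$", re.IGNORECASE),
        'SELECT ?x WHERE {{ ?m :movie_title "{film}" . '
        "?m :has_actor ?a . ?a :person_name ?x . }}",
    ),
    (
        re.compile(
            r"^what (?:films|movies) did (?P<actor>.+?) (?:act|star) in\??$",
            re.IGNORECASE,
        ),
        'SELECT ?x WHERE {{ ?a :person_name "{actor}" . '
        "?m :has_actor ?a . ?m :movie_title ?x . }}",
    ),
]


def question_to_sparql(question):
    """Return the SPARQL query of the first matching rule, or None."""
    for pattern, template in RULES:
        match = pattern.match(question.strip())
        if match:
            return template.format(**match.groupdict())
    return None


if __name__ == "__main__":
    print(question_to_sparql("Who starred in Farewell My Concubine?"))
    print(question_to_sparql("What films did Gong Li act in?"))
```

Questions that match no rule return `None`, which is the main limit of a purely rule-based system: coverage grows only as fast as the rule list.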
7. Semantic search
7.1 Simple semantic search based on Elasticsearch
This project is a simplified version of Zhejiang University's Elasticsearch-based KBQA implementation and examples, applied to our own database.
For a detailed introduction, please see Building a Knowledge Graph from Scratch (4) Simple Semantic Search Based on ES
- example
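Once a question has been reduced to an entity name and, optionally, an attribute name, the lookup can be expressed as an Elasticsearch bool query. The sketch below only builds the request body as a Python dict and sends nothing to a server. The field names `subj` and `pred` and the function name are assumptions about how the triples might be indexed.

```python
import json


def build_entity_query(entity, attribute=None, size=10):
    """Build an Elasticsearch request body that looks up an entity's facts.

    The "subj" and "pred" field names are hypothetical index fields.
    """
    must = [{"match": {"subj": entity}}]
    if attribute is not None:
        must.append({"match": {"pred": attribute}})
    return {"size": size, "query": {"bool": {"must": must}}}


if __name__ == "__main__":
    body = build_entity_query("Farewell My Concubine", "director")
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

Every clause under `must` has to match, so adding the attribute narrows the hits from all facts about the entity to the one fact the question asks for.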