I. Introduction
Neo4j is a NoSQL database built on graph theory. Graph databases have big advantages in relationship-heavy fields such as social networks, logistics and transportation, recommendation systems, and fraud detection. In this Chat, I will cover:
- How Neo4j compares with relational and other non-relational databases
- Which areas suit Neo4j and which do not
- Neo4j installation
- An introduction to the Cypher query language
- Hands-on cases:
- Bank fraud ring analysis
- Literature index
- Find the source of spam
- Corporate Relationship Construction
- Social relationship analysis to realize a simple friend recommendation function
II. Main text
I think everyone is familiar with relational (SQL) databases such as MySQL, and MongoDB, a non-relational (NoSQL) database, is no stranger either. Today we are going to introduce another type of database: the graph database, which is based on graph theory.
When graph theory comes up, some readers may take a sharp breath. If you have studied data structures, you will know that graphs are among the hardest topics to learn and involve many obscure algorithms. Don't worry: Neo4j, introduced today, already encapsulates many of these algorithms, so you do not need to touch the lower layers, which is very convenient. Let's start by comparing the advantages and disadvantages of graph databases with relational and other non-relational databases.
Unlike traditional databases, a graph database stores nodes (objects) and edges (relationships between objects). When the data has intricate relationships, this type of database is often the best choice.
1. Comparison of the advantages of Neo4j with relational and other non-relational databases
(1) Comparison of Neo4j and relational database
Relational databases, represented by MySQL, have been around for a long time and have always been the mainstay of the database field. They store highly structured data in two-dimensional tables, and data must be operated on strictly according to the declared conventions (such as foreign key constraints). You can think of them as ledgers.
However, precisely because a relational database requires these conventions to be defined before tables are created, tables often constrain and reference one another. As the database grows, these mutual constraints keep increasing, and the number of operations needed for search and matching can grow exponentially with the depth of the query, which consumes a lot of resources. For example, when you want to answer a question such as "Who are Xiao Ming's friends?", a relational database involves some expensive layers of indirection, such as querying through an index table:
From the table, you can see that the friend of Xiao Ming (ID: 2) is Xiao Hua (ID: 1). You might say that this is not complicated; it's just one table.
But if I ask for "Xiao Ming's friend's friend's friend...", the depth keeps increasing, and each additional layer adds another pass through the index table, so the layers of indirection keep piling up. The query becomes slower and slower, and the memory overhead grows.
Another point: if I ask in the opposite direction, "Whose friend is Xiao Ming?", you might say Xiao Hua, of course. But if you look at the index table carefully, Xiao Hua's FriendID is 3, not Xiao Ming's ID. In other words, the friendship is recorded in one direction only, and a relational database handles this kind of reverse question poorly: it has to search the FriendID column instead of following a stored link.
Does this kind of reverse question make sense? Of course; it is meaningful and useful. For example, Xiao Ming likes programming, so ask in the opposite direction: who else likes programming? Find an expert who likes programming (judged by measures such as closeness centrality) and recommend that person to Xiao Ming to follow. That is already a simple recommendation feature.
In contrast, a graph database has unique advantages: it stores nodes, node attributes (properties), and the relationships between nodes. Relationships are organized by type and direction, and they are accessed directly from the node. For a question like "Xiao Ming's friend's friend's friend...", a greater depth just means visiting more nodes. Even queries over complex connections can return within milliseconds.
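To make the contrast concrete, here is a minimal sketch in plain Python (not Neo4j's API) of how a graph store can answer "friend of a friend" questions by following references from node to node. The IDs for Xiao Ming and Xiao Hua follow the example above; the third person (Xiao Gang), the dictionaries, and the friends_within helper are made up for illustration.

```python
from collections import deque

# Hypothetical data: person id -> name, and directed "friend" edges.
people = {1: "Xiao Hua", 2: "Xiao Ming", 3: "Xiao Gang"}
friends = {2: [1], 1: [3], 3: []}  # Xiao Ming -> Xiao Hua -> Xiao Gang

# Reverse adjacency list, built once, answers "whose friend is X?"
friend_of = {pid: [] for pid in people}
for src, targets in friends.items():
    for dst in targets:
        friend_of[dst].append(src)


def friends_within(start, depth):
    """Return ids reachable from start in at most `depth` hops (breadth-first)."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue  # do not go deeper than requested
        for nxt in friends[node]:
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append((nxt, dist + 1))
    return found


print([people[p] for p in friends_within(2, 1)])  # direct friends of Xiao Ming
print([people[p] for p in friends_within(2, 2)])  # plus friends of friends
print([people[p] for p in friend_of[2]])          # who lists Xiao Ming as a friend
```

Going one level deeper only means following more references; no extra table is needed, and the reverse question is answered from the reverse list.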
How should we understand the concepts of nodes, relationships, and attributes?
Here is an example. Create a node with the label Person whose name is Xiao Ming: Xiao Ming is a node. Xiao Ming likes watching movies (an attribute of the node), and he and Xiao Hua are friends (a relationship). With that, a simple relationship has been established.
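The same example can be written down as a tiny data model. This is only a sketch in plain Python to show how the three concepts fit together; the class names, the hobby key, and the FRIEND relationship type are my own assumptions, not Neo4j's storage format.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set                                      # e.g. {"Person"}
    properties: dict = field(default_factory=dict)   # the node's attributes


@dataclass
class Relationship:
    start: Node                                      # relationships have a direction
    rel_type: str                                    # and a type
    end: Node
    properties: dict = field(default_factory=dict)


xiao_ming = Node({"Person"}, {"name": "Xiao Ming", "hobby": "watching movies"})
xiao_hua = Node({"Person"}, {"name": "Xiao Hua"})
friendship = Relationship(xiao_ming, "FRIEND", xiao_hua)

print(friendship.start.properties["name"],
      friendship.rel_type,
      friendship.end.properties["name"])
```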
(2) Comparison between Neo4j and other non-relational databases
At present, most NoSQL databases are based on collections and documents. The data is stored in unconnected collections, which makes it harder to link records and establish relationships between them; in other words, the data is discrete. To establish a relationship, one collection is usually embedded in another, and this overhead is also large. However, when the data does not carry many relationships, this kind of database is particularly efficient and has good read and write performance.
This article compares 8 different NoSQL databases for reference
2. Which areas are suitable for Neo4j and which are not?
It should be explained here that Neo4j is not the only graph database, but it is a very good one. Its development began in 2003 and it was released in 2007. It is used by many companies, including eBay, Adidas, and Walmart.
Neo4j is based on graph theory, so it has a natural advantage in processing graph-like data such as maps. It is therefore suitable for logistics management and traffic big data.
Since the basic elements of Neo4j are nodes and relationships, it is also particularly suitable for social networks with complex relationships. It is well suited to implementing recommendation systems and helpful for analyzing customer transaction data. It can be used to detect fraud as well, and an example is given below. It is even used in games, for example "Romance of the Three Kingdoms 13" by Koei.
To summarize the suitable areas briefly:
- Social network
- Transportation Big Data (Logistics)
- Recommendation systems
- Fraud analysis
- Web security (spam, etc.)
However, there are also areas where graph databases are not a good fit:
- Recording large amounts of event-based data
- Large-scale distributed data processing
- Binary data storage
- Structured data that fits naturally in a relational database
3. Neo4j installation
Windows has an exe installer, which is convenient: just follow the visual wizard step by step. Installation on Mac should not be difficult either, so I won't cover it here. The installation steps below are based on Ubuntu 16.04 Linux.
(1) Install the Java environment
Neo4j is implemented in Java, so you need to install the Java Runtime Environment (JRE). If Java is already installed and working, skip this step. Open a terminal:
sudo apt update
sudo apt upgrade
sudo apt install default-jre default-jre-headless
If that doesn't work, try these two commands first and then run it again.
sudo update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/bin/java
sudo update-alternatives --set javac /usr/lib/jvm/java-8-openjdk-amd64/bin/javac
Then check the Java version by running java -version.
(2) Install Neo4j
First, we add the repository key to our keychain.
wget --no-check-certificate -O - https://debian.neo4j.org/neotechnology.gpg.key | sudo apt-key add -
Then add the repository to the apt source list.
echo 'deb http://debian.neo4j.org/repo stable/' | sudo tee /etc/apt/sources.list.d/neo4j.list
Update and install:
sudo apt update
sudo apt install neo4j
The server should have started automatically, and it should also start again at boot. If necessary, the server can be stopped:
sudo service neo4j stop
And started again:
sudo service neo4j start
Visit Neo4j
You should now be able to access the database via http://localhost:7474/browser/.
Let me introduce the interface you see after opening it. First there is a login screen asking for a username and password; both default to neo4j the first time you open it. After you log in, it automatically prompts you to change your password. After logging in it looks like this.
Here is some sample code:
You can try the Example Graphs tutorial first; entering a query statement gives this view:
The Table view looks like this; you can see the attributes of each node: name, born, etc.
The Text view shows the same data as a text table:
The Code view shows the submitted code and the data in JSON format:
4. Introducing the Cypher query language
Just as MySQL has SQL, Neo4j has a corresponding query language, Cypher. Cypher borrows from the structure of SQL and has many familiar keywords. Database operations are nothing more than creating, deleting, updating, and querying, which are introduced one by one below:
(1) Create
Creating means adding data. The basic elements in a graph database are nodes, relationships, and attributes. Neo4j has two keywords for creating data. One is CREATE (lowercase also works):
<1>CREATE creates a node
create (n:User {name:"Dav"})
Here n is the variable name, User is the label (in a graph database, a label can be thought of as the counterpart of a table in a relational database), and the attributes are in the curly braces.
<2>CREATE creates a relationship:
MATCH (n{name:"a"}),(m{name:"b"}) CREATE (n)-[r:KNOWS]->(m) return n,m
A few explanations:
- The parentheses are nodes, and nodes can also be unlabeled (although that is not good practice). The square brackets are relationships. A relationship has a node on each side, like this: ()-[]->()
- MATCH is the query keyword
- RETURN n,m returns the two nodes so that the graph is displayed
The other keyword is MERGE. The difference between MERGE and CREATE is that MERGE is roughly MATCH plus CREATE: it checks whether the data already exists in the database before creating it.
<3>MERGE creates a node
MERGE (n:Test{name:"c"}) ON CREATE SET n.created = timestamp() return n
This first checks whether a node with the label Test whose name attribute is c exists. If it exists, the existing node is used; otherwise a new node is created.
Here MERGE is used to create the node, and the SET keyword is also used. SET changes the attributes of a node, so it really belongs to the modifying operations covered in (3).
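If the difference between CREATE and MERGE still feels abstract, the following plain-Python sketch models it as "get or create". It is only an illustration of the semantics described above, not how Neo4j implements MERGE; the nodes list and the create and merge helpers are hypothetical names.

```python
import time

nodes = []  # each node is a dict: {"label": ..., "props": {...}}


def create(label, **props):
    """CREATE: always adds a new node, even if an identical one exists."""
    node = {"label": label, "props": dict(props)}
    nodes.append(node)
    return node


def merge(label, on_create=None, **props):
    """MERGE: reuse a node matching label and props, otherwise create it."""
    for node in nodes:
        if node["label"] == label and all(
            node["props"].get(key) == value for key, value in props.items()
        ):
            return node  # matched: the ON CREATE part is skipped
    node = create(label, **props)
    if on_create:
        node["props"].update(on_create())  # runs only for newly created nodes
    return node


first = merge("Test", on_create=lambda: {"created": int(time.time() * 1000)}, name="c")
second = merge("Test", on_create=lambda: {"created": -1}, name="c")
print(first is second, len(nodes))  # the second MERGE reuses the node

create("Test", name="c")
print(len(nodes))                   # CREATE adds a duplicate
```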
<4>MERGE creates a relationship
Creating a relationship with MERGE first checks whether the relationship exists. If it does, no data is modified; otherwise a new relationship is created.
MATCH (a:Person{name:'Joel Silver'}),(b:Person{name:'J.T. Walsh'}) MERGE (a)-[r:LOVES]->(b)
This matches the person named Joel Silver and the person named J.T. Walsh and establishes a LOVES relationship between them. When the following message appears, the relationship has been created successfully.
Created 1 relationship, completed after 199 ms.
(2) Delete
<1>The DELETE keyword deletes data:
MATCH (n) DELETE n
This reports an error if a node still has relationships, because the relationships must be deleted before the node can be:
MATCH ()-[r:朋友]->(m) DELETE r,m
This finds the friend (朋友) relationship r and the node m that it points to, and deletes r and m together. This succeeds:
Deleted 1 node, deleted 1 relationship, completed after 8 ms.
<2>REMOVE can also remove data, but only labels and attributes:
MATCH (n) REMOVE n:Test
This uses REMOVE to strip the Test label from every node that carries it:
Removed 3 labels, completed after 9 ms.
Be careful when using REMOVE: unlike DELETE, it does not fail when a node still has relationships, because the node itself is not deleted.
After the label is removed, the relationships still exist and point to a node without that label.
(3) Update
Use SET to change a node. There is already an example just above, so I won't repeat it.
(4) Query
Use MATCH to query nodes. Cypher also has ways to narrow down the search.
<1>LIMIT restricts the number of results
MATCH p=()-[r:LOVES]->() RETURN p LIMIT 25
Use the LIMIT keyword to return at most the specified number of results.
<2>WHERE implements conditional filtering
MATCH p=(n:Person)-[:LOVES]->() WHERE n.name <> "a" RETURN p
This queries all paths in which the name attribute of node n is not equal to "a"; <> means not equal.
<3>Use an index
Use CREATE INDEX ON to create an ordinary index on the name attribute of nodes labeled Person:
CREATE INDEX ON :Person(name)
Output:
Added 1 index, completed after 46 ms.
Use the index to query:
MATCH (n:Person) WHERE n.name IN ["a","b"] RETURN n as Person
Get the picture:
Query with an explicit index hint:
MATCH(n:Person) USING INDEX n:Person(name) WHERE n.name = 'a' RETURN n as Person
Use the DROP keyword to delete an existing index:
DROP INDEX ON :Person(name)
Output:
Removed 1 index, completed after 1 ms.
Cypher also provides functions that assist queries, such as size() and any(). For details, please refer to the relevant API documentation. Only two important ones are mentioned here: finding the shortest path and finding all shortest paths.
<1>The shortest path: shortestpath()
Taking this graph as an example, find the shortest path from Joel Silver to Jonathan Lipnicki:
MATCH (p1:Person {name:"Jonathan Lipnicki"}),(p2:Person{name:"Joel Silver"}), p=shortestpath((p1)-[*..10]-(p2)) RETURN p
Here [*..10] means that the shortest path is searched for over relationships of any type, up to a path depth of 10.
Get the picture:
<2>All shortest paths
MATCH (p1:Person {name:"Jonathan Lipnicki"}),(p2:Person{name:"Joel Silver"}), p=allshortestpaths((p1)-[*..10]-(p2)) RETURN p
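To show what these two functions compute, here is a small breadth-first search in plain Python. It is a sketch of the idea only, not Neo4j's implementation; the graph, the node names A to E, and the function names are made up, and edges are treated as undirected, as in the patterns above.

```python
from collections import deque


def all_shortest_paths(graph, start, goal, max_depth=10):
    """Return every shortest path from start to goal, at most max_depth hops.

    graph maps each node to a list of its neighbours (undirected edges).
    """
    if start == goal:
        return [[start]]
    dist = {start: 0}
    parents = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Stop expanding at the depth limit or once the goal level is reached.
        if dist[node] == max_depth or (goal in dist and dist[node] >= dist[goal]):
            continue
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                parents[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                parents[nxt].append(node)  # another equally short way in
    if goal not in dist:
        return []

    def build(node):
        if node == start:
            return [[start]]
        return [path + [node] for par in parents[node] for path in build(par)]

    return build(goal)


def shortest_path(graph, start, goal, max_depth=10):
    """Return one shortest path, or None if there is none within max_depth."""
    paths = all_shortest_paths(graph, start, goal, max_depth)
    return paths[0] if paths else None


# Hypothetical graph with two equally short routes from A to E.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
print(shortest_path(graph, "A", "E"))
print(all_shortest_paths(graph, "A", "E"))
print(all_shortest_paths(graph, "A", "E", max_depth=2))  # too shallow: no path
```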
That concludes the theory; let's move on to the hands-on part.
5. Hands-on cases
(1) Analysis of Bank Fraud Ring
A concept needs to be introduced here. First-party bank fraud essentially means fabricating identities out of pieces of other people's real identity information in order to commit fraud.
It has the following characteristics:
- <1>Two or more people form a fraud ring
- <2>People in the fraud ring share some legitimate contact information, such as phone numbers
We can use Neo4j to identify the existence of fraud rings.
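The two characteristics above boil down to "several account holders share the same contact detail". A minimal sketch of that check in plain Python looks like this; the holders, phone numbers, and addresses are invented, and find_rings is a hypothetical helper, not the query from the tutorial.

```python
from collections import defaultdict


def find_rings(holders):
    """Map each contact detail shared by 2+ people to its sorted members.

    holders maps a person to the set of contact details they registered.
    """
    members = defaultdict(set)
    for person, details in holders.items():
        for detail in details:
            members[detail].add(person)
    return {d: sorted(p) for d, p in members.items() if len(p) >= 2}


# Made-up account holders and contact details.
holders = {
    "A": {"phone:555-0100", "addr:1 Main St"},
    "B": {"phone:555-0100"},
    "C": {"addr:1 Main St", "phone:555-0199"},
}

for detail, ring in sorted(find_rings(holders).items()):
    print(ring, detail, len(ring))  # members, shared detail, ring size
```

In a graph database the same grouping falls out of the structure: the shared phone number is a single node, and the ring is simply everyone connected to it.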
First, we create a fraud ring (due to space limitations, the code is not posted here; you can refer to this tutorial).
Query suspicious fraud rings:
From left to right, the columns are the members of the fraud ring, the suspicious contact information, and the size of the fraud ring. Calculate the risk of fraud:
(2) Literature index
Here is a small example. In academia you often need to look up papers, usually through full-text search, which is not efficient. You can use Neo4j to find closely related papers. Open Neo4j and first insert some data manually.
create (论文1:论文图谱{论文名:"论文1"}), (论文2:论文图谱{论文名:"论文2"}), (论文3:论文图谱{论文名:"论文3"}), (论文4:论文图谱{论文名:"论文4"}), (论文5:论文图谱{论文名:"论文5"}), (论文6:论文图谱{论文名:"论文6"}), (论文7:论文图谱{论文名:"论文7"}), (论文1)-[:Similar]->(论文2), (论文1)-[:Similar]->(论文3), (论文2)-[:Similar]->(论文4), (论文2)-[:Similar]->(论文5), (论文3)-[:Similar]->(论文5), (论文5)-[:Similar]->(论文6), (论文7)-[:Similar]->(论文2), (论文7)-[:Similar]->(论文6) return *
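Look for the chains of similarity between Paper 1 (论文1) and Paper 6 (论文6), so that you can find out which papers are the main references connecting them. In the statements, the label 论文图谱 means "paper graph" and the attribute 论文名 means "paper name".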
MATCH n=allshortestPaths((论文1:论文图谱{论文名:"论文1"})-[*..6]->(论文6:论文图谱{论文名:"论文6"})) RETURN n
Next, I want to build a literature index through a complete example, covering data acquisition, import, and analysis. I will use Scrapy to crawl information about 1,000 books, save it to CSV, import it into Neo4j, and analyze it further.
The first step is to crawl the information with Scrapy. After crawling, the data looks like this.
From left to right, the columns are the book's UPC code, name, category, stock, price, rating, number of ratings, and description. The target website is books.toscrape.com. First use the scrapy shell to run a simple crawling experiment and analyze the web page.
scrapy shell http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
Design ideas:
- <1>Define the items according to the web page information just analyzed
- <2>Design the crawler (spider) based on the web page just analyzed
- The crawler needs to visit each single page and extract the required information
- After crawling a page, the crawler needs to move on to the next target page
- <3>Configure the relevant options in settings
- <4>Process special data in pipelines
I have uploaded the code to git.
Import the data into Neo4j through a CSV file. First put books2.csv in this directory:
/var/lib/neo4j/import
First read the file to check that it can be loaded:
LOAD CSV WITH HEADERS FROM "file:///books2.csv" AS line WITH line RETURN line
LOAD CSV WITH HEADERS FROM "file:///books2.csv" AS line CREATE (:Books2 { Id: line.upc, Name: line.name, Price: line.price, Rate: line.review_rating, content: line.jianjie, kinds: line.Kinds, stock: line.stock})
In this way, 1,000 books are stored as nodes. Output:
Added 1000 labels, created 1000 nodes, set 5998 properties, completed after 228 ms.
Query 25 of them to take a look:
You can also view their attributes, but there are no relationships yet. Create several book-category nodes:
create (n:书类名{Name:"Sequential Art"})
..
..
Then establish the relationships:
MATCH (n:Books2),(m:书类名) where n.kinds = m.Name create(n)-[r:属于]->(m) RETURN n,r,m LIMIT 25
A simple book-category graph has now been built, and we can index books by rating, category, and price, which gives us a simple book recommendation system. I will complete it in the fifth case.
(3) Find the source of spam
If you don't want to install Neo4j locally, you can go to the official website of the company Weiyun Shuju and try Neo4j online. This spam case is built on that platform.
Enter in the command box:
MATCH m=(s:Person)-->(e:Email)-->(r:Person) WHERE e.title=~'.*普通发票.*' RETURN m LIMIT 15
Only 15 results are returned here. What if we want to find the source of the spam? Usually, the title or content of spam contains words about promotions, recruitment, and so on.
Here we search for the keyword "invoice" (发票) by going through the titles of all emails. A person who sends this kind of email frequently will have sent a lot of them, so we only output senders with more than 105 such emails.
MATCH m=(s:Person)-->(e:Email)-->(r:Person) WHERE e.title=~'.*发票.*' WITH s,COUNT(e) AS num,COLLECT(e) AS emails,COLLECT(r) AS recevies WHERE num > 105 RETURN s,emails,recevies
Get the picture:
Clearly, almost all of the invoice emails came from the mailbox [email protected], so the main culprit has been found.
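The aggregation in that query (filter titles by a pattern, group by sender, keep senders above a threshold) can be sketched in plain Python as follows. The addresses and titles are invented, frequent_senders is a hypothetical helper, and the demo uses a threshold of 2 instead of 105 to keep the data small. Note that Cypher's =~ has to match the whole string, which is why the pattern needs .* on both sides; re.fullmatch behaves the same way.

```python
import re
from collections import defaultdict


def frequent_senders(emails, pattern, threshold):
    """Group matching emails by sender and keep senders above the threshold.

    emails is an iterable of (sender, title, recipient) tuples.
    """
    regex = re.compile(pattern)
    by_sender = defaultdict(list)
    for sender, title, recipient in emails:
        if regex.fullmatch(title):  # whole-string match, like =~ in Cypher
            by_sender[sender].append((title, recipient))
    return {s: msgs for s, msgs in by_sender.items() if len(msgs) > threshold}


# Made-up sample data.
emails = [
    ("spam@example.com", "普通发票", "u1@example.com"),
    ("spam@example.com", "发票 fee discount", "u2@example.com"),
    ("spam@example.com", "new 发票 offer", "u3@example.com"),
    ("alice@example.com", "发票 for last month", "u1@example.com"),
    ("alice@example.com", "meeting notes", "u2@example.com"),
]

result = frequent_senders(emails, ".*发票.*", 2)
for sender, msgs in result.items():
    print(sender, len(msgs))
```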
Is this the end? No. I said above that "reverse questions" are very valuable. Now that we have found the main culprit, we might as well take a closer look.
What kind of emails does he usually send, and what are their characteristics? From this we can dig out this kind of sender and the usual "routine" of spam.
MATCH m=(s:Person)-->(e:Email)-->(r:Person)
WHERE s.account=~'[email protected]'
RETURN s,e,r
Get the picture:
Look carefully and you will find that, in addition to "invoice", the culprit's emails also contain phrases such as "call to talk in person" and "fee discount". We can remember such phrases and filter emails that contain them next time.
From the above example you should be able to appreciate the advantages of graph databases in handling spam and finding information, especially when dealing with "reverse questions". And the query efficiency is at the millisecond level.
(4) Corporate relationship construction
This case is also based on the Weiyun Shuju platform.
MATCH (n:`公司`) RETURN n LIMIT 25
Investment graph:
MATCH a=(:公司 {名称:'中航工业集团公司'})-[r*]->() RETURN nodes(a)
This gives an intuitive picture of the company: who invested in whom, and where the cash flows. It also presents the company's financial management intuitively.
Guarantee graph:
MATCH a=(:公司 {名称:'中航工业集团公司'})-[r:担保*]->() RETURN nodes(a)
And so on.
(5) Social relationship analysis to realize a simple friend recommendation function
Here we will use the book data from the second case. First create a circle of friends.
create (小北:朋友圈{姓名:"小北",喜欢的书类:"Poetry"}), (小菲:朋友圈{姓名:"小菲",喜欢的书类:"Science Fiction"}), (小鹏:朋友圈{姓名:"小鹏",喜欢的书类:"Music"}), (小颖:朋友圈{姓名:"小颖",喜欢的书类:"Politics"}), (小兰:朋友圈{姓名:"小兰",喜欢的书类:"Music"}), (小峰:朋友圈{姓名:"小峰",喜欢的书类:"Travel"}), (小讯:朋友圈{姓名:"小讯",喜欢的书类:"Poetry"}), (小东:朋友圈{姓名:"小东",喜欢的书类:"Sequential Art"}), (小唯:朋友圈{姓名:"小唯",喜欢的书类:"Young Adult"}), (小窦:朋友圈{姓名:"小窦",喜欢的书类:"Poetry"}), (小齐:朋友圈{姓名:"小齐",喜欢的书类:"Default"}), (小林:朋友圈{姓名:"小林",喜欢的书类:"Poetry"}), (小锐:朋友圈{姓名:"小锐",喜欢的书类:"Default"}), (小伟:朋友圈{姓名:"小伟",喜欢的书类:"Young Adult"}), (Xiaoling:朋友圈{姓名:"Xiaoling",喜欢的书类:"Business"}), (小讯)-[:认识]->(小窦), (小讯)-[:认识]->(小齐), (小讯)-[:认识]->(小林), (小讯)-[:认识]->(小鹏), (小讯)-[:认识]->(小伟), (小讯)-[:认识]->(小峰), (小菲)-[:认识]->(小鹏), (小菲)-[:认识]->(小峰), (小菲)-[:认识]->(小唯), (小峰)-[:认识]->(小北), (小峰)-[:认识]->(小兰), (小东)-[:认识]->(小林), (小东)-[:认识]->(小锐), (小东)-[:认识]->(小菲), (小鹏)-[:认识]->(小颖), (小北)-[:认识]->(小兰), (小颖)-[:认识]->(小东), (小唯)-[:认识]->(小鹏), (小唯)-[:认识]->(小锐), (小伟)-[:认识]->(Xiaoling)
Show Xiaofeng’s circle of friends:
MATCH n=(:朋友圈{姓名:"小峰"})-[*..6]-() return n
Several concepts need to be introduced here.
<1>First-degree relationship (direct relationship)
MATCH n=(:朋友圈{姓名:"小讯"})-[:认识]-() return n
<2> Second-degree relationship
MATCH n=(:朋友圈{姓名:"小讯"})-[*..2]-() return n
I once saw a question: if you live in a village, how many people do you have to go through to reach Obama? The answer is six. Suppose you are in a village: the village head, the town head, the county head, the mayor, the governor, and the president of your country are six people, and they are enough to reach Obama. This is why we usually search to a depth of 6.
<3>The shortest acquaintance path between two strangers
We can use Neo4j to find the shortest path through which two people who do not know each other can get in touch.
MATCH n=shortestPath((小讯:朋友圈{姓名:"小讯"})-[*..6]-(小锐:朋友圈{姓名:"小锐"})) return n
<4>All the shortest acquaintance paths between two strangers
MATCH n = allshortestPaths((小讯:朋友圈{姓名:"小讯"})-[*..6]-(小菲:朋友圈{姓名:"小菲"})) return n
<5>Build a recommendation system based on a node's influence or other attributes
Have you ever thought about how apps such as Bilibili (B station), Taobao, and QQ build their recommendation systems? For example, on Bilibili every uploader (Up host) must select the submission area, type, and keyword tags when uploading a video, which classifies the data. If you often watch videos in the Guanyan District, you clearly visit that area frequently, so videos from it can be recommended to you, ranked by the influence of the uploader (the size of the node). Of course, they also use many machine learning algorithms. In my opinion, Neo4j can also be used to build a simple recommendation feature.
For example, let's make a small book recommendation. When creating the nodes, I also recorded the category of books each person likes (you will notice that some apps ask you to pick your preferences when you first use them).
Combine the data of the second case:
MATCH (n:朋友圈),(m:Books2) where n.喜欢的书类 = m.kinds and toInt(m.Rate)>4 create (m)-[r:推荐]->(n) return m,r,n
What we have done here is select, for each person, books rated higher than 4 in the category that person filled in.
From the graph it is also easy to see that people who are recommended the same books share the same interests. Take Xiao Qi (小齐) and Xiao Rui (小锐): the graph shows that the two do not know each other directly. Based on this, we can introduce Xiao Rui and Xiao Qi to each other.
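Both steps can be sketched in plain Python to make the logic explicit. The three people and their favorite categories are taken from the data above, while the books, their ratings, and the helper names recommend_books and suggest_friends are made up for illustration. The ratings are strings because values loaded from CSV are text, which is why the Cypher statement converts them with toInt.

```python
from itertools import combinations

# Favorite categories taken from the circle of friends above.
people = {"小齐": "Default", "小锐": "Default", "小北": "Poetry"}

# Made-up books; Rate is a string, as it would be after loading from CSV.
books = [
    {"Name": "Book A", "kinds": "Default", "Rate": "5"},
    {"Name": "Book B", "kinds": "Default", "Rate": "3"},
    {"Name": "Book C", "kinds": "Poetry", "Rate": "5"},
]


def recommend_books(people, books, min_rate=4):
    """For each person, pick books in their favorite category rated above min_rate."""
    return {
        person: [
            b["Name"] for b in books
            if b["kinds"] == category and int(b["Rate"]) > min_rate
        ]
        for person, category in people.items()
    }


def suggest_friends(recs, knows):
    """Pair up people who share a recommended book and do not know each other."""
    return [
        (a, b) for a, b in combinations(recs, 2)
        if set(recs[a]) & set(recs[b]) and frozenset((a, b)) not in knows
    ]


recs = recommend_books(people, books)
print(recs)
print(suggest_friends(recs, knows=set()))
```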
As a result, we have completed two kinds of recommendation: one for books and one for friends. Real applications are certainly not this rough and simple and take far more into account. This is only a brief introduction.
III. Recommended reading:
While learning Neo4j, I also found a lot of material. It is listed here for your reference.
- "Neo4j Authoritative Guide"-Tsinghua University Press
- "Neo4j Full Stack Development"-Publishing House of Electronics Industry
- "Graph Database"-People's Posts and Telecommunications Press
- "Proficient in Scrapy Web Crawler"-Tsinghua University Press
- Neo4j official documentation
IV. Closing words
Neo4j is indeed a very good database. Although it has not yet become widespread, I believe its application scenarios will grow in the future. After all, drawing a picture is the most intuitive way for us humans to understand the world. Thank you very much for joining this Chat. This is the second Chat I have given, and although I had the experience of the first, I was honestly still very nervous. I am still a student, and there are bound to be oversights, mistakes, and omissions in my work; please forgive them. If you have any questions, you can leave a message in the comment area, which is also good experience and growth for me. Finally, thank you for your support.