Applications of the Neo4j graph database in social networks and other fields

I. Introduction

Neo4j is a NoSQL database based on graph theory. This kind of database has significant advantages in relationship-heavy fields such as social networks, logistics and transportation, recommendation systems, and fraud detection. In this Chat, I will cover:

  1. Comparison of the advantages of Neo4j with relational and other non-relational databases
  2. Which areas are suitable for Neo4j and which areas are not suitable
  3. Neo4j installation
  4. Introducing the Cypher query language
  5. Hands-on case studies:
  • Bank fraud ring analysis
  • Literature index
  • Find the source of spam
  • Corporate Relationship Construction
  • Social relationship analysis to realize a simple friend recommendation function

II. Main text

I think everyone is familiar with relational (SQL) databases such as MySQL, and no stranger to MongoDB, a non-relational (NoSQL) database. Today we are going to introduce another type of database: the graph database. This kind of database is based on graph theory.

When it comes to graph theory, some readers may gasp. If you have studied data structures, you will know that graphs are among the hardest topics, involving many obscure algorithms. Don't worry: Neo4j already encapsulates many of these algorithms, so you do not need to touch the underlying layer, which is very convenient. Let's start by comparing the advantages and disadvantages of graph databases with relational and other non-relational databases.

Unlike traditional databases, a graph database stores nodes (objects) and edges (relationships between objects). When the data has intricate relationships, this type of database is usually a very good choice.

1. Comparison of the advantages of Neo4j with relational and other non-relational databases

(1) Comparison of Neo4j and relational database

Relational databases, represented by MySQL, have been around for a long time and have always been the mainstay of the database field. They store highly structured data in two-dimensional tables, and the data must be operated on strictly according to the defined constraints (such as foreign key constraints). You can think of them as ledgers.

However, precisely because relational databases require these constraints to be defined before tables are created, tables often constrain and reference each other. As the database grows, these mutual constraints keep increasing, and the number of operations needed for search and matching grows rapidly, which consumes a lot of resources. For example, to answer a question such as "Who are Xiao Ming's friends?", a relational database has to go through some expensive layers of indirection, such as querying an index table that maps each person's ID to a FriendID.

From the table, you can find that the friend of Xiao Ming (ID: 2) is Xiao Hua (ID: 1). You might say that this is not complicated; it's just one table.

But if I ask about "Xiao Ming's friend's friend's friend...", the depth keeps increasing, and each additional layer adds another index-table lookup, so the layers of indirection keep piling up. The query becomes slower and slower, and the memory overhead grows.

Another point: if I ask in the opposite direction, "Whose friend is Xiao Ming?", you might say Xiao Hua, of course. But if you look at the index table carefully, Xiao Hua's FriendID is 3, which is not Xiao Ming. In other words, this table cannot answer such reverse questions directly.

Does this kind of reverse question make sense? Of course; it is meaningful and useful. For example, Xiao Ming likes programming, so ask in the opposite direction: who else likes programming? Find an expert who likes programming (judged by measures such as closeness centrality) and recommend that person to Xiao Ming to follow. That gives us a simple recommendation function.

In contrast, a graph database has unique advantages: it stores nodes, node attributes, and the relationships between nodes. These relationships are organized by type and direction, and they are accessed directly through the nodes. When asking about "Xiao Ming's friend's friend's friend...", increasing the depth just means visiting more nodes. Even queries over complex connections can return at the millisecond level.
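To make the contrast concrete, here is a minimal Python sketch (plain Python, not Neo4j code, and not a description of how Neo4j is implemented internally). It keeps both the outgoing and the incoming relationships of every node, so forward and reverse questions are answered the same way. The names outgoing and incoming and the third person "Person 3" are assumptions made for illustration.

```python
from collections import defaultdict

# Directed "friend" edges. "Person 3" is a made-up placeholder for ID 3.
edges = [("Xiao Ming", "Xiao Hua"), ("Xiao Hua", "Person 3")]

outgoing = defaultdict(list)  # node -> nodes it points to
incoming = defaultdict(list)  # node -> nodes that point to it
for source, target in edges:
    outgoing[source].append(target)
    incoming[target].append(source)

# Forward question: who are Xiao Ming's friends?
print(outgoing["Xiao Ming"])
# Reverse question: who lists Xiao Hua as a friend?
print(incoming["Xiao Hua"])
# One more level of depth is just one more hop from node to node.
print([fof for friend in outgoing["Xiao Ming"] for fof in outgoing[friend]])
```

Storing both directions costs some extra memory, but neither question needs a scan over all records.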

How should we understand the concepts of nodes, relationships, and attributes?

Here is an example. Create a node with the label person whose name is Xiao Ming; Xiao Ming is a node. Xiao Ming likes watching movies (an attribute of the node), and he and Xiao Hua are friends (a relationship). With that, a simple relationship is established.
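The three concepts can be written down as a small Python data model. This is only an illustrative sketch: the class names Node and Relationship, the hobby key, and the FRIEND type are assumptions, not Neo4j's API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                       # e.g. "person"
    properties: dict = field(default_factory=dict)   # attributes of the node


@dataclass
class Relationship:
    start: Node
    type: str                                        # e.g. "FRIEND"
    end: Node


xiao_ming = Node("person", {"name": "Xiao Ming", "hobby": "watching movies"})
xiao_hua = Node("person", {"name": "Xiao Hua"})
friendship = Relationship(xiao_ming, "FRIEND", xiao_hua)

print(friendship.start.properties["name"], friendship.type, friendship.end.properties["name"])
```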


(2) Comparison between Neo4j and other non-relational databases

At present, most NoSQL databases are based on collections and documents. The data is stored in disconnected collections, which makes it harder to connect records and establish relationships between them. In other words, the data is discrete. If a relationship has to be established, one collection is usually embedded in another, which is also costly. However, when the data does not have much "relationship" in it, this kind of database is particularly efficient and has good read and write performance.

For reference, there is an article that compares 8 different NoSQL databases.

2. Which areas are suitable for Neo4j and which are not?

It should be noted that Neo4j is not the only graph database, but it is a very good one. It was developed in 2003 and released in 2007, and it is used by many companies, including eBay, Adidas, and Walmart.

Neo4j is based on graph theory, so it has natural advantages in processing map-like network data. It is therefore suitable for logistics management and traffic big data.

Since the basic elements of Neo4j are nodes and relationships, it is also particularly suitable for social networks with complex relationships. In addition, it works well for implementing recommendation systems, helps with analyzing customer transaction data, and can be used to detect fraud; an example is given below. It is even used in games, for example "Three Kingdoms 13" by Koei.

To summarize, the suitable areas are:

  • Social networks
  • Transportation big data (logistics)
  • Recommendation systems
  • Fraud analysis
  • Web security (spam, etc.)

However, there are also areas that are not suitable for graph databases.

  • Recording large amounts of event-based data
  • Large-scale distributed data processing
  • Binary data storage
  • Structured data that fits well in a relational database

3. Neo4j installation

Windows has an exe installer, which is convenient: just follow the graphical wizard step by step. Installation on Mac should not be difficult either, so I won't cover it here. The tutorial below is based on Linux, Ubuntu 16.04.

(1) Install the Java environment

Neo4j is implemented in Java, so you need to install the Java Runtime Environment (JRE). If Java is already installed and working, skip this step. Open a terminal:

sudo apt update
sudo apt upgrade
sudo apt install default-jre default-jre-headless

If that doesn't work, run these two commands first and then try again:

sudo update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/bin/java
sudo update-alternatives --set javac /usr/lib/jvm/java-8-openjdk-amd64/bin/javac

Then check the Java version with java -version.

(2) Install Neo4j

First, we add the repository key to our keychain.

wget --no-check-certificate -O - https://debian.neo4j.org/neotechnology.gpg.key | sudo apt-key add -

Then add the repository to the apt source list.

echo 'deb http://debian.neo4j.org/repo stable/' | sudo tee /etc/apt/sources.list.d/neo4j.list

Update the package list and install Neo4j:

sudo apt update
sudo apt install neo4j

The server should start automatically after installation, and it should also start again on boot. If necessary, the server can be stopped:

sudo service neo4j stop

And started again:

sudo service neo4j start

(3) Access Neo4j

You should now be able to access the database via http://localhost:7474/browser/.

Let's go through the interface. At first there is a login screen asking for a username and password. On first use, both default to neo4j. After you log in, you are prompted to change the password.

The start page offers some sample code. You can try the Example Graphs tutorial first: enter its query statement and the result is drawn as a graph.

The result can also be viewed in other ways:

  • Table shows the attributes of each node: name, born, etc.
  • Text shows the data as a plain-text table
  • Code shows the submitted query and the returned data in JSON format

4. Introducing the Cypher query language

Just as MySQL has SQL, Neo4j has its own query language, Cypher. Cypher borrows from the structure of SQL and has many familiar keywords. Database operations come down to creating, deleting, updating, and querying, which are introduced one by one below:

(1) Create

Creating means adding data. The basic elements in a graph database are nodes, relationships, and attributes. Neo4j has two keywords for creating data. One is CREATE (lowercase also works):

<1>CREATE to create a node

create (n:User {name:"Dav"})

Here n is the variable name, User is the label (a label in a graph database is roughly comparable to a table in a relational database), and the attributes are written inside the curly braces.

<2>CREATE creates a relationship:

MATCH (n{name:"a"}),(m{name:"b"}) CREATE (n)-[r:KNOWS]->(m) return n,m

Here are a few explanations:

  • Parentheses denote nodes. Nodes can also be unlabeled, but that is not good practice. Square brackets denote relationships, with a node on each side, like this: ()-[]->()
  • MATCH is the query keyword
  • return n,m returns the two nodes so that the graph is displayed

The other keyword is MERGE. The difference between MERGE and CREATE is that MERGE is roughly MATCH plus CREATE: it first checks whether the data already exists in the database and only creates it if it does not.

<3>MERGE to create a node

MERGE (n:Test{name:"c"}) ON CREATE SET n.created = timestamp() return n

This first checks whether a node with the label Test and the attribute name equal to "c" exists. If it does, the existing node is used; otherwise a new node is created.

Besides MERGE, this statement uses the SET keyword. SET changes the attributes of a node and belongs to the "update" category.

<4>MERGE to create a relationship

When a relationship is created with MERGE, it first checks whether the relationship already exists. If it does, no data is modified; otherwise a new relationship is created.

MATCH (a:Person{name:'Joel Silver'}),(b:Person{name:'J.T. Walsh'}) MERGE (a)-[r:LOVES]->(b)

This matches the person named Joel Silver and the person named J.T. Walsh and creates a LOVES relationship between them. The following message means that it was created successfully:

Created 1 relationship, completed after 199 ms.
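The "check first, create only if missing" behaviour of MERGE can be sketched in a few lines of Python. This is an illustration of the idea, not Neo4j's implementation; the function merge_node and the in-memory list nodes are hypothetical.

```python
import time

nodes = []  # in-memory stand-in for the database


def merge_node(label, key, value):
    """Return (node, created): reuse a matching node or create a new one."""
    for node in nodes:
        if node["label"] == label and node["props"].get(key) == value:
            return node, False
    node = {"label": label, "props": {key: value}}
    # Mirrors ON CREATE SET n.created = timestamp(): set only on creation.
    node["props"]["created"] = int(time.time() * 1000)
    nodes.append(node)
    return node, True


first, created_first = merge_node("Test", "name", "c")
second, created_second = merge_node("Test", "name", "c")
print(created_first, created_second, len(nodes))
```

Running the same merge twice leaves a single node, which is the property that makes MERGE safe to repeat.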

(2) Delete

<1>The DELETE keyword can delete data:

MATCH (n)DELETE n

This reports an error if a matched node still has relationships, because a node's relationships must be deleted before the node itself:

MATCH ()-[r:朋友]->(m) DELETE r,m

This finds the 朋友 (friend) relationship r and the node m it points to, and deletes r and m together. This succeeds:

Deleted 1 node, deleted 1 relationship, completed after 8 ms.

<2>REMOVE removes labels and attributes:

MATCH (n) REMOVE n:Test

This removes the Test label from every node that has it:

Removed 3 labels, completed after 9 ms.

Be careful: REMOVE is not like DELETE. It does not delete the node, so it does not report an error when the node still has relationships.

After the label is removed, the node and its relationships remain in the database; the relationships simply point to a node that no longer has the Test label.
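The difference between the two keywords can be mimicked with a tiny in-memory graph in Python. This is a sketch under assumptions: delete_node and remove_label are hypothetical helpers that only imitate the behaviour described above.

```python
nodes = {1: {"labels": {"Test"}}, 2: {"labels": {"Test"}}}
relationships = [(1, "KNOWS", 2)]  # (start id, type, end id)


def delete_node(node_id):
    """Like DELETE: refuse while the node still has relationships."""
    if any(node_id in (start, end) for start, _, end in relationships):
        raise ValueError(f"node {node_id} still has relationships")
    del nodes[node_id]


def remove_label(node_id, label):
    """Like REMOVE n:Label: the node itself stays in the graph."""
    nodes[node_id]["labels"].discard(label)


remove_label(1, "Test")   # node 1 survives, only its label is gone
try:
    delete_node(1)        # fails: the KNOWS relationship still exists
except ValueError as err:
    print(err)
relationships.clear()     # delete the relationship first...
delete_node(1)            # ...then the node can be deleted
print(nodes)
```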

(3) Update

Use SET to change a node. There is already an example above, so I won't repeat it.

(4) Query

Use MATCH to query nodes. Cypher offers several ways to narrow down the search.

<1>LIMIT restricts the number of results

MATCH p=()-[r:LOVES]->() RETURN p LIMIT 25

The LIMIT keyword returns at most the specified number of results, here 25 paths.

<2>WHERE implements conditional filtering

MATCH p=(n:Person)-[:LOVES]->() WHERE n.name <> "a" RETURN p

This returns all LOVES paths that start at a Person node n whose name attribute is not equal to "a"; <> means "not equal".

<3>Using an index

Use CREATE INDEX ON to create an ordinary index on the name attribute of Person nodes:

CREATE INDEX ON :Person(name)

Output:

Added 1 index, completed after 46 ms.

Use the index to query:

MATCH (n:Person) WHERE n.name IN ["a","b"] RETURN n as Person

The matching nodes are displayed as a graph.

To specify the index explicitly, use USING INDEX:

MATCH(n:Person) USING INDEX n:Person(name) WHERE n.name = 'a' RETURN n as Person

Use the DROP keyword to delete an existing index:

DROP INDEX ON :Person(name)

Output:

Removed 1 index, completed after 1 ms.

Cypher also provides functions to assist queries, such as size() and any(). For details, please refer to the official documentation. Only two important ones are covered here: querying the shortest path and querying all shortest paths.

<1>The shortest path: shortestpath()

For example, find the shortest path between Jonathan Lipnicki and Joel Silver in the example graph:

MATCH (p1:Person {name:"Jonathan Lipnicki"}),(p2:Person{name:"Joel Silver"}), p=shortestpath((p1)-[*..10]-(p2)) RETURN p

Here [*..10] means that relationships of any type may be followed, up to a path length of 10, and the shortest such path is returned.

The shortest path is displayed as a graph.
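For readers curious about what a shortest-path search does underneath, here is a minimal breadth-first search in Python with a depth limit similar to [*..10]. It is a textbook sketch, not Neo4j's actual algorithm, and the toy graph with nodes A to E is made up.

```python
from collections import deque


def shortest_path(graph, start, goal, max_depth=10):
    """Breadth-first search over an undirected adjacency dict.

    Returns the node list of one shortest path, or None if the goal
    cannot be reached within max_depth hops.
    """
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        if len(path) - 1 >= max_depth:  # this path already uses max_depth hops
            continue
        for neighbor in graph.get(path[-1], []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical toy graph, not the movie data set.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
print(shortest_path(graph, "A", "E"))
print(shortest_path(graph, "A", "E", max_depth=2))
```

Because breadth-first search explores paths in order of length, the first path that reaches the goal is a shortest one.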

<2>All shortest paths

MATCH (p1:Person {name:"Jonathan Lipnicki"}),(p2:Person{name:"Joel Silver"}), p=allshortestpaths((p1)-[*..10]-(p2)) RETURN p

That concludes the theory. Let's move on to the hands-on part.

5. Hands-on case studies

(1) Analysis of Bank Fraud Ring

First, a concept: first-party bank fraud essentially means using pieces of other people's real identity information to fabricate identities and commit fraud.

It has the following characteristics:

  • <1>Two or more people form a fraud ring
  • <2>People in the fraud ring share some legitimate contact information, such as phone numbers

We can use Neo4j to identify the existence of fraud rings.
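The two characteristics above suggest a simple detection idea: group account holders by each piece of contact information and flag any piece shared by two or more people. The Python sketch below illustrates this idea only; the holders, phone numbers, and addresses are invented, and it is not the query used in the tutorial.

```python
from collections import defaultdict

# Hypothetical account holders and the contact details they registered.
holders = {
    "John": {"phone": "555-0101", "address": "1 Main St"},
    "Jane": {"phone": "555-0101", "address": "9 Oak Ave"},
    "Matt": {"phone": "555-0199", "address": "1 Main St"},
    "Anna": {"phone": "555-0142", "address": "7 Pine Rd"},
}

# Group holders by each (kind, value) piece of contact information.
shared = defaultdict(set)
for name, contacts in holders.items():
    for kind, value in contacts.items():
        shared[(kind, value)].add(name)

# Contact information used by two or more people is suspicious.
rings = {info: sorted(names) for info, names in shared.items() if len(names) >= 2}
for info, names in rings.items():
    print(info, names, "ring size:", len(names))
```

In a graph database the same grouping falls out of the structure: the shared phone number or address is a node with several incoming relationships.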

First, we create a fraud ring (due to space limitations, the code is not posted here; you can refer to the tutorial).

Then we query the suspicious fraud rings. From left to right, the result lists the members of the fraud ring, the shared contact information, and the size of the fraud ring. From this, the risk of fraud can be calculated.

(2) Literature index

Let me start with a small example. In academia, you often need to look up papers, usually by full-text search, which is not efficient. With Neo4j you can find closely related papers. Open Neo4j and insert some data manually first (论文 means "paper", 论文图谱 "paper graph", and 论文名 "paper name").

create (论文1:论文图谱{论文名:"论文1"}),(论文2:论文图谱{论文名:"论文2"}),(论文3:论文图谱{论文名:"论文3"}),(论文4:论文图谱{论文名:"论文4"}),(论文5:论文图谱{论文名:"论文5"}),(论文6:论文图谱{论文名:"论文6"}),(论文7:论文图谱{论文名:"论文7"}),(论文1)-[:Similar]->(论文2),(论文1)-[:Similar]->(论文3),(论文2)-[:Similar]->(论文4),(论文2)-[:Similar]->(论文5),(论文3)-[:Similar]->(论文5),(论文5)-[:Similar]->(论文6),(论文7)-[:Similar]->(论文2),(论文7)-[:Similar]->(论文6) return *

Look up the similarity paths from Paper 1 (论文1) to Paper 6 (论文6) to find out which papers are the main references along the way.

MATCH n=allshortestPaths((论文1:论文图谱{论文名:"论文1"})-[*..6]->(论文6:论文图谱{论文名:"论文6"})) RETURN n

Next, I want to implement a literature index through a complete example, covering data acquisition, import, and analysis. I will use Scrapy to crawl information about 1,000 books, save it to a csv file, import it into Neo4j, and analyze it further.

The first step is crawling with Scrapy. The crawled data has the following columns, from left to right: the book's upc code, name, category, stock, price, rating, number of reviews, and description. The target website is books.toscrape.com. Start with the scrapy shell and run a simple crawling experiment to analyze the web page:

scrapy shell http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html

Design ideas:

  • <1>Define the items according to the web page information just analyzed
  • <2>Design the spider based on the web page just analyzed
    • The spider needs to extract the required information from a single page
    • After finishing one page, the spider needs to move on to the next target page
  • <3>Configure the relevant options in settings
  • <4>Process special data in pipelines
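As an illustration of the last step, "processing special data" typically means turning scraped text into numbers. The sketch below is plain Python rather than the uploaded Scrapy code: clean_item is a hypothetical helper, and the raw formats (a rating word, a stock sentence, a price with a currency sign) are assumptions about what the page delivers. In a Scrapy project, the same logic would sit inside a pipeline's process_item method.

```python
import re

# Assumed mapping from rating words to numbers.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def clean_item(item):
    """Convert text fields into numbers, as an item pipeline might."""
    cleaned = dict(item)
    cleaned["review_rating"] = RATING_WORDS.get(item["review_rating"], 0)
    match = re.search(r"\d+", item["stock"])
    cleaned["stock"] = int(match.group()) if match else 0
    cleaned["price"] = float(item["price"].lstrip("£"))
    return cleaned


# Made-up raw values for one book.
raw = {
    "name": "A Light in the Attic",
    "review_rating": "Three",
    "stock": "In stock (22 available)",
    "price": "£51.77",
}
print(clean_item(raw))
```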

I have uploaded the code to git.

Import the data into Neo4j from the csv file. First put books2.csv in this directory:

/var/lib/neo4j/import

Then read the file to check that it can be loaded:

LOAD CSV WITH HEADERS FROM "file:///books2.csv" AS line WITH line RETURN line

If the rows are returned, create one node per row:

LOAD CSV WITH HEADERS FROM "file:///books2.csv" AS line CREATE (:Books2 { Id: line.upc, Name: line.name, Price: line.price,Rate:line.review_rating,content:line.jianjie,kinds:line.Kinds,stock:line.stock})

In this way, 1,000 books are stored as nodes. Output:

Added 1000 labels, created 1000 nodes, set 5998 properties, completed after 228 ms.
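What LOAD CSV WITH HEADERS does can be pictured with Python's csv module: each row becomes a mapping from column name to value, and one node is built per row. This is only an analogy; the two sample rows are invented, and the description column is left out for brevity.

```python
import csv
import io

# Two made-up rows standing in for books2.csv.
sample = io.StringIO(
    "upc,name,price,review_rating,Kinds,stock\n"
    "u001,Book A,51.77,3,Poetry,22\n"
    "u002,Book B,13.99,5,Travel,3\n"
)

nodes = []
for line in csv.DictReader(sample):   # like LOAD CSV WITH HEADERS ... AS line
    nodes.append({
        "Id": line["upc"],
        "Name": line["name"],
        "Price": line["price"],       # values stay strings unless converted
        "Rate": line["review_rating"],
        "kinds": line["Kinds"],
        "stock": line["stock"],
    })

print(len(nodes), nodes[0]["Name"], type(nodes[0]["Price"]).__name__)
```

As in this sketch, LOAD CSV reads every field as a string, so numeric comparisons on values such as the rating need an explicit conversion.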

Query 25 nodes to check the result. You can view the various attributes, but there are no relationships yet. Create several book-category nodes (书类名 means "book category name"):

create (n:书类名{Name:"Sequential Art"})
..
..

Then create the relationships (属于 means "belongs to"):

MATCH (n:Books2),(m:书类名) where n.kinds = m.Name create(n)-[r:属于]->(m) RETURN n,r,m LIMIT 25

A simple book-category graph has now been built, and we can index by book rating, category, and price to build a simple book recommendation system. I will do that as part of the fifth case.

(3) Find the source of spam

If you don't want to install Neo4j locally, you can go to the official website of Weiyun Shuju and try Neo4j online. This spam case is based on that platform.

Enter in the command box:

MATCH m=(s:Person)-->(e:Email)-->(r:Person) WHERE e.title=~'.*普通发票.*' RETURN m LIMIT 15

Only 15 results are returned here. What if we want to find the source of the spam? The title or content of spam usually contains words about promotions, recruitment, and so on.

Here we go through the titles of all emails and search for the keyword 发票 ("invoice"). Someone who sends this kind of email frequently will have sent a lot of them. We only output senders whose number of such emails exceeds 105.

MATCH m=(s:Person)-->(e:Email)-->(r:Person) WHERE e.title=~'.*发票.*' WITH s,COUNT(e) AS num,COLLECT(e) AS emails,COLLECT(r) AS recevies WHERE num > 105 RETURN s,emails,recevies

In the resulting graph, almost all of the invoice emails clearly come from one mailbox, [email protected], so the main culprit has been found.
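The logic of that query (filter by title, count per sender, keep senders above a threshold) can be written in a few lines of Python. This is an illustrative sketch with invented senders and titles, and the threshold is lowered to 1 so that the tiny sample produces a result.

```python
import re
from collections import Counter

# Made-up (sender, title) pairs standing in for Person-->Email paths.
emails = [
    ("a@example.com", "代开普通发票"),
    ("a@example.com", "发票优惠"),
    ("a@example.com", "Meeting notes"),
    ("b@example.com", "发票"),
    ("c@example.com", "Lunch?"),
]

pattern = re.compile(r".*发票.*")  # whole-title match, like =~ in Cypher
threshold = 1                      # the article uses 105 on the real data

# COUNT(e) per sender, then keep only the frequent senders.
counts = Counter(sender for sender, title in emails if pattern.fullmatch(title))
suspects = {sender: num for sender, num in counts.items() if num > threshold}
print(suspects)
```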

Is that the end? No. As I said above, "reverse questions" are very valuable. Now that we have found the main culprit, we might as well take a closer look.

What kind of emails does he usually send, and what characterizes them? From this we can dig out this kind of sender's "routine" for sending spam.

MATCH m=(s:Person)-->(e:Email)-->(r:Person)
WHERE s.account=~'[email protected]'
RETURN s,e,r

Look at the resulting graph carefully and you will find that, besides "invoice", the culprit's emails also contain phrases such as "call to chat in person" and "fee discount". We can note these phrases and filter emails containing them next time.

This example shows the advantages of graph databases in handling spam and finding information, especially for "reverse questions". And the query efficiency is at the millisecond level.

(4) Corporate relationship construction

This case is also based on the Weiyun Shuju platform.

MATCH (n:`公司`) RETURN n LIMIT 25

Investment graph (公司 means "company" and 名称 means "name"):

MATCH a=(:公司 {名称:'中航工业集团公司'})-[r*]->() RETURN nodes(a)

This gives an intuitive overview of the company: who invested in whom, and how the cash flows. It also gives an intuitive view for the company's financial management.

Guarantee graph (担保 means "guarantee"):

MATCH a=(:公司 {名称:'中航工业集团公司'})-[r:担保*]->() RETURN nodes(a)

And so on.

(5) Social relationship analysis to realize a simple friend recommendation function

Here we will reuse the book data from the second case. First create a circle of friends (朋友圈 means "circle of friends", 姓名 "name", 喜欢的书类 "favorite book category", and 认识 "knows").

create (小北:朋友圈{姓名:"小北",喜欢的书类:"Poetry"}),(小菲:朋友圈{姓名:"小菲",喜欢的书类:"Science Fiction"}),(小鹏:朋友圈{姓名:"小鹏",喜欢的书类:"Music"}),
(小颖:朋友圈{姓名:"小颖",喜欢的书类:"Politics"}),(小兰:朋友圈{姓名:"小兰",喜欢的书类:"Music"}),(小峰:朋友圈{姓名:"小峰",喜欢的书类:"Travel"}),
(小讯:朋友圈{姓名:"小讯",喜欢的书类:"Poetry"}),(小东:朋友圈{姓名:"小东",喜欢的书类:"Sequential Art"}),(小唯:朋友圈{姓名:"小唯",喜欢的书类:"Young Adult"}),
(小窦:朋友圈{姓名:"小窦",喜欢的书类:"Poetry"}),(小齐:朋友圈{姓名:"小齐",喜欢的书类:"Default"}),(小林:朋友圈{姓名:"小林",喜欢的书类:"Poetry"}),
(小锐:朋友圈{姓名:"小锐",喜欢的书类:"Default"}),(小伟:朋友圈{姓名:"小伟",喜欢的书类:"Young Adult"}),(Xiaoling:朋友圈{姓名:"Xiaoling",喜欢的书类:"Business"}),
(小讯)-[:认识]->(小窦),(小讯)-[:认识]->(小齐),(小讯)-[:认识]->(小林),(小讯)-[:认识]->(小鹏),(小讯)-[:认识]->(小伟),(小讯)-[:认识]->(小峰),
(小菲)-[:认识]->(小鹏),(小菲)-[:认识]->(小峰),(小菲)-[:认识]->(小唯),(小峰)-[:认识]->(小北),(小峰)-[:认识]->(小兰),
(小东)-[:认识]->(小林),(小东)-[:认识]->(小锐),(小东)-[:认识]->(小菲),(小鹏)-[:认识]->(小颖),(小北)-[:认识]->(小兰),
(小颖)-[:认识]->(小东),(小唯)-[:认识]->(小鹏),(小唯)-[:认识]->(小锐),(小伟)-[:认识]->(Xiaoling)

Show Xiao Feng's circle of friends:

MATCH n=(:朋友圈{姓名:"小峰"})-[*..6]-() return n

A few concepts need to be introduced here.

<1> First-degree relationship (direct relationship)

MATCH n=(:朋友圈{姓名:"小讯"})-[:认识]-() return n


<2> Second-degree relationship

MATCH n=(:朋友圈{姓名:"小讯"})-[*..2]-() return n
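A variable-length pattern such as [*..2] collects everything reachable within a given number of hops. The Python sketch below imitates this with a breadth-first search that records the hop count of each person. It is an illustration only: within_hops is a hypothetical helper, and the graph contains just a few of the relationships created above, treated as undirected.

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Return {node: hop count} for nodes reachable in at most max_hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:   # do not expand beyond the limit
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]
    return distance


# A subset of the friend circle, written as an undirected adjacency dict.
friends = {
    "小讯": ["小窦", "小鹏"],
    "小窦": ["小讯"],
    "小鹏": ["小讯", "小颖"],
    "小颖": ["小鹏", "小东"],
    "小东": ["小颖"],
}
print(within_hops(friends, "小讯", 1))  # first-degree relationships
print(within_hops(friends, "小讯", 2))  # first- and second-degree relationships
```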

I once saw a question: if you live in a village, how many people do you have to go through to reach Obama? The answer is about six. Starting from the village, the chain goes through the village head, town head, county head, mayor, governor, and the president of your country, who can reach Obama, so six people are enough. That is why we usually search to a depth of 6.

<3>The shortest acquaintance path between two strangers

We can use Neo4j to find the shortest path along which two people who don't know each other can get in touch.

MATCH n=shortestPath((小讯:朋友圈{姓名:"小讯"})-[*..6]-(小锐:朋友圈{姓名:"小锐"})) return n

<4> All shortest acquaintance paths between two strangers

MATCH n = allshortestPaths((小讯:朋友圈{姓名:"小讯"})-[*..6]-(小菲:朋友圈{姓名:"小菲"})) return n
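Finding all shortest paths differs from finding one: the search has to remember every predecessor that reaches a node with the same minimal hop count. The Python sketch below shows one common way to do this. It is not Neo4j's implementation, all_shortest_paths is a hypothetical helper, and the graph is a subset of the friend circle, here searched from 小讯 to 小锐.

```python
from collections import deque


def all_shortest_paths(graph, start, goal):
    """Return every shortest path from start to goal (undirected BFS)."""
    distance = {start: 0}
    parents = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in distance:             # reached for the first time
                distance[neighbor] = distance[node] + 1
                parents[neighbor] = [node]
                queue.append(neighbor)
            elif distance[neighbor] == distance[node] + 1:
                parents[neighbor].append(node)       # another equally short route
    if goal not in distance:
        return []

    def build(node):
        if node == start:
            return [[start]]
        return [path + [node] for parent in parents[node] for path in build(parent)]

    return build(goal)


graph = {
    "小讯": ["小林", "小鹏"],
    "小林": ["小讯", "小东"],
    "小鹏": ["小讯", "小唯"],
    "小东": ["小林", "小锐"],
    "小唯": ["小鹏", "小锐"],
    "小锐": ["小东", "小唯"],
}
for path in all_shortest_paths(graph, "小讯", "小锐"):
    print(" - ".join(path))
```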

<5> Build a recommendation system based on node influence or other attributes

Have you ever wondered how apps such as Bilibili (B station), Taobao, and QQ build their recommendation systems? For example, on Bilibili every uploader (Up host) must select a submission area, type, and keyword tags when uploading a video. This completes the classification of the data. If you often watch videos in a particular area, videos from that area can be recommended to you, ranked by the influence of the uploader (the size of the node). Of course, these platforms also use many machine learning algorithms. In my opinion, Neo4j can also be used for a simple recommendation function.

For example, let's make a small book recommendation. When creating the person nodes, I also recorded the category of book each person likes (you will notice that many apps ask you for your preferences when you first use them).

Combine this with the data of the second case:

MATCH (n:朋友圈),(m:Books2) where n.喜欢的书类 = m.kinds and toInt(m.Rate)>4  create (m)-[r:推荐]->(n) return m,r,n

What we have done here is recommend to each person the books rated higher than 4 in the category that person filled in.
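The same matching rule can be sketched in plain Python: join people and books on the category, keep books rated above 4, and then look for people who end up with the same recommendations. The book rows are invented, and only three of the people from the friend circle are used.

```python
# Made-up sample rows; Rate is a string, as loaded from the csv file.
books = [
    {"Name": "Book A", "kinds": "Poetry", "Rate": "5"},
    {"Name": "Book B", "kinds": "Poetry", "Rate": "3"},
    {"Name": "Book C", "kinds": "Default", "Rate": "5"},
]
people = [
    {"姓名": "小林", "喜欢的书类": "Poetry"},
    {"姓名": "小齐", "喜欢的书类": "Default"},
    {"姓名": "小锐", "喜欢的书类": "Default"},
]

# Same condition as the Cypher WHERE clause: same category, rating above 4.
recommendations = {
    person["姓名"]: [
        book["Name"]
        for book in books
        if book["kinds"] == person["喜欢的书类"] and int(book["Rate"]) > 4
    ]
    for person in people
}
print(recommendations)

# People who receive the same books share a taste and could be introduced.
similar = [
    (a, b)
    for a in recommendations
    for b in recommendations
    if a < b and recommendations[a] and recommendations[a] == recommendations[b]
]
print(similar)
```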

From the resulting graph, it is easy to see that people who are recommended the same books share similar interests. Take Xiao Qi and Xiao Rui: the graph shows that the two do not know each other directly. Based on this, we can introduce Xiao Rui and Xiao Qi to each other.

As a result, we have completed two levels of recommendation: book recommendation and friend recommendation. A real application would certainly not be this rough and simple, and would consider many more details. This is only a simple introduction.

III. Recommended resources

While learning Neo4j, I found a lot of material, listed here for your reference.

  • "Neo4j Authoritative Guide", Tsinghua University Press
  • "Neo4j Full Stack Development", Publishing House of Electronics Industry
  • "Graph Database", People's Posts and Telecommunications Press
  • "Proficient in Scrapy Web Crawler", Tsinghua University Press
  • Neo4j official documentation

IV. Closing words

Neo4j is indeed a very good database. Although it is not yet mainstream, I believe its application scenarios will keep growing. After all, graphs are the most intuitive way for us humans to understand the world. Thank you very much for joining this Chat. This is the second Chat I have done, and although I had the experience of the first one, to be honest I was still very nervous. I am still a student, and there are bound to be oversights, mistakes, and omissions in what I do; please forgive me. If you have any questions, you can leave a message in the comment area. This is also a good learning experience for me. Finally, thank you for your support.


Origin blog.csdn.net/litianquan/article/details/82770826