An article recommendation system in practice, based on the personalized PageRank algorithm in Neo4j

The new version of the Neo4j graph algorithms library (algo) adds support for personalized PageRank, and I had been looking for an interesting application of the algorithm to see how well it works. I recently read Peter Lofgren's paper "Efficient Algorithms for Personalized PageRank" (https://arxiv.org/pdf/1512.04633.pdf), which contains an interesting example:

We want to try personalized search on a citation network and see how changing the source of personalized PageRank produces different rankings. The paper uses the citation graph provided by Citeseer. The plan is to build a paper search application: the user enters a keyword and an author name and gets back all papers containing the keyword, ranked from that author's point of view. For a given author, all of their papers are given equal weight as sources, and personalized PageRank is then used to rank the papers that match the keyword. For example, the keyword "entropy" means different things to different authors, so we can compare the top results for "entropy" from different points of view.

Next, we reconstruct this scenario using Neo4j.

Prerequisites

  • Neo4j
  • Neo4j graph algorithms library (algo)
  • Neo4j APOC library
  • GraphAware NLP plugin

We need to download all of these plugins and configure Neo4j as follows:

dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware
com.graphaware.runtime.enabled=true
com.graphaware.module.NLP.1=com.graphaware.nlp.module.NLPBootstrapper
dbms.security.procedures.whitelist=ga.nlp.*,algo.*,apoc.*
dbms.security.procedures.unrestricted=apoc.*,algo.*
apoc.import.file.enabled=true

Graph Model

[Figure: graph model]

As the figure shows, our model is very simple. Nodes carry one of two labels, Author and Article. Each Author node has one or more AUTHOR relationships to Article nodes, and an Article node can have REFERENCES relationships to other Article nodes.

To speed up queries, we also define the graph schema: unique constraints on the index property of Article nodes and on the name property of Author nodes.

CALL apoc.schema.assert(
{},
{Article:['index'],Author:['name']})

Data import

We use the citation dataset available from aminer.org (https://static.aminer.cn/lab-datasets/citation/dblp.v10.zip). At the time of writing this is the latest version of the dataset, and most importantly it is stored in JSON format.

For more information about this dataset, see the paper "ArnetMiner: Extraction and Mining of Academic Social Networks" (http://keg.cs.tsinghua.edu.cn/jietang/publications/KDD08-Tang-et-al-ArnetMiner.pdf)

Translator's note: the first author of "ArnetMiner: Extraction and Mining of Academic Social Networks" is Professor Jie Tang of Tsinghua University.

We import the data into Neo4j in two steps: the first step imports all the papers and their authors, and the second step creates the citation relationships between the papers.

We use apoc.periodic.iterate to import the data in batches.

Import papers and authors

CALL apoc.periodic.iterate(
'UNWIND ["dblp-ref-0.json","dblp-ref-1.json","dblp-ref-2.json","dblp-ref-3.json"] as file
CALL apoc.load.json("file:///neo4j/import/" + file)
yield value return value',
'MERGE (a:Article{index:value.id})
ON CREATE SET a += apoc.map.clean(value,["id","authors","references"],[0])
WITH a,value.authors as authors
UNWIND authors as author
MERGE (b:Author{name:author})
MERGE (b)-[:AUTHOR]->(a)'
,{batchSize: 10000, iterateList: true})

Create the citation relationships

CALL apoc.periodic.iterate(
'UNWIND ["dblp-ref-0.json","dblp-ref-1.json","dblp-ref-2.json","dblp-ref-3.json"] as file
CALL apoc.load.json("file:///neo4j/import/" + file)
yield value return value',
'MERGE (a:Article{index:value.id})
WITH a,value.references as references
UNWIND references as reference
MERGE (b:Article{index:reference})
MERGE (a)-[:REFERENCES]->(b)'
,{batchSize: 10000, iterateList: true})
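
To make the two-step import easier to follow, here is a minimal in-memory sketch in Python. It is only an illustration, not the way Neo4j or APOC does the work: the sample records are made up, the field names (id, authors, references) mirror those used in the queries above, and the sketch assumes one JSON object per line.

```python
import json

# Hypothetical sample records; the field names mirror the import queries above.
SAMPLE = """\
{"id": "p1", "title": "Paper one", "authors": ["Ann", "Bob"], "references": ["p2"]}
{"id": "p2", "title": "Paper two", "authors": ["Ann"]}
"""

def load_graph(lines):
    records = [json.loads(line) for line in lines if line.strip()]
    articles, authored, cites = {}, set(), set()
    # Step 1: create the articles and link each one to its authors.
    for rec in records:
        articles[rec["id"]] = {k: v for k, v in rec.items()
                               if k not in ("id", "authors", "references")}
        for name in rec.get("authors") or []:
            authored.add((name, rec["id"]))
    # Step 2: create the citation links; like MERGE, an unknown target
    # article is created as an empty placeholder.
    for rec in records:
        for ref in rec.get("references") or []:
            articles.setdefault(ref, {})
            cites.add((rec["id"], ref))
    return articles, authored, cites

articles, authored, cites = load_graph(SAMPLE.splitlines())
print(sorted(authored))
print(sorted(cites))
```

Running the citation step as a second pass keeps the first pass simple: every article is created with its properties before any citation link refers to it.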

PageRank algorithm

PageRank was originally designed to analyze the importance of web pages. It considers both the number and the quality of the links pointing to a page. For example, if a home page has one link from reddit and another from my blog, the two links do not carry the same weight.

This process applies naturally to a citation network. A citation can be seen as one article casting a "yes" vote for another article. Which articles receive the most "yes" votes? That is exactly the kind of problem PageRank solves best.

Running PageRank on the citation network finds the most important and most influential articles in the network.
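
The voting idea can be sketched as a few lines of Python. This is a minimal power-iteration version under stated assumptions (a damping factor of 0.85, a fixed number of iterations, scores normalized to sum to 1, and a made-up four-article graph); the algo library's implementation and score scale may differ.

```python
def pagerank(edges, damping=0.85, iterations=50):
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by articles that cite nothing is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1 - damping + damping * dangling) / len(nodes) for n in nodes}
        # Each article splits its rank equally among the articles it cites.
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank

# Hypothetical citation graph: A and B cite C, and C cites D.
scores = pagerank([("A", "C"), ("B", "C"), ("C", "D")])
for article, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(article, round(score, 3))
```

In this toy graph C outranks A and B because it collects two votes, and D ranks high because its single vote comes from the well-cited C.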

Run PageRank and store the result as a node property

CALL algo.pageRank('Article', 'REFERENCES')

The most important articles by PageRank

MATCH (a:Article)
RETURN a.title as article,
a.pagerank as score
ORDER BY score DESC
LIMIT 10

Results are as follows:

[Figure: query results]

Natural Language Processing (NLP)

If we want to recommend articles by keyword, we first need to extract keywords from the graph. Thanks to the GraphAware NLP plugin this is very simple, and the work can be done even without understanding the underlying NLP algorithms.

The NLP process adds more nodes and relationships to our graph model, as shown below:

[Figure: graph model after NLP processing]

NLP schema definition

To optimize the NLP process, we need to define some constraints and indexes.

CALL ga.nlp.createSchema()

Add a processing pipeline

Define the configuration of the processing pipeline. For more information about processing pipelines, see here (https://github.com/graphaware/neo4j-nlp#pipelines-and-components)

CALL ga.nlp.processor.addPipeline({
textProcessor: 'com.graphaware.nlp.processor.stanford.StanfordTextProcessor',
name: 'defaultPipeline',
threadNumber: 4,
processingSteps: {tokenize: true,
ner: true,
dependency: false}})

Set the default pipeline

CALL ga.nlp.processor.pipeline.default('defaultPipeline')

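Text annotation

The raw text is broken down into words, parts of speech, and functions. This analysis of the text is only a starting point.

To learn more about text annotation, I recommend Christophe Willemsen's article "Reverse Engineering Book Stories with Neo4j and GraphAware NLP" (https://graphaware.com/neo4j/2017/07/24/reverse-engineering-book-stories-nlp.html)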

CALL apoc.periodic.iterate(
"MATCH (n:Article) WHERE exists (n.title) RETURN n",
"CALL ga.nlp.annotate({text: n.title, id: id(n)})
YIELD result MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)",
{batchSize:1, iterateList:true})

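Keyword extraction

TextRank is a relatively simple, unsupervised method of text summarization that can be applied directly to topic extraction. Its goal is to retrieve keywords and build a graph of word co-occurrences in order to obtain the key phrases that best describe a document, with the PageRank algorithm ranking the importance of the individual words.

--- From "Efficient unsupervised keyword extraction using graphs" (https://graphaware.com/neo4j/2017/10/03/efficient-unsupervised-topic-extraction-nlp-neo4j.html)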

CALL apoc.periodic.iterate(
"MATCH (a:AnnotatedText) RETURN a",
"CALL ga.nlp.ml.textRank({annotatedText: a}) YIELD result
RETURN distinct 'done' ",
{batchSize:1,iterateList:true})
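
To give a feel for what TextRank does, here is a small Python sketch of its core idea: link words that occur near each other, then run a PageRank-style iteration over that co-occurrence graph. It is not the GraphAware implementation; the window size, the damping factor, and the token list are assumptions, and a real pipeline would also filter stop words and merge adjacent keywords into key phrases.

```python
from collections import defaultdict

def textrank(words, window=1, damping=0.85, iterations=50):
    # Undirected co-occurrence graph: words at most `window` positions apart are linked.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    score = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        # Each word shares its score equally among the words it co-occurs with.
        score = {
            w: (1 - damping) + damping * sum(score[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    return sorted(score.items(), key=lambda item: item[1], reverse=True)

# Hypothetical tokens, assumed to be already cleaned of stop words.
tokens = ["social", "network", "analysis", "social", "graph", "mining", "social", "media"]
ranked = textrank(tokens)
print(ranked[0][0])
```

The word that co-occurs with the most other words ends up with the highest score, which is why it is a good keyword candidate.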

Get the 10 keywords that occur most often in article titles

MATCH (k:Keyword)-[:DESCRIBES]->()
WHERE k.numTerms > 1
RETURN k.value as Keyphrase,
count(*) AS n_articles
ORDER BY n_articles DESC
LIMIT 10

The results are as follows:

[Figure: query results]

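Basic article recommendation

If you have followed this article step by step, you now have a basic article recommendation system based on PageRank scores and NLP keyword extraction.

Top 10 recommended articles for the keyword "social networks"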

MATCH (k:Keyword)-[:DESCRIBES]->()<-[:HAS_ANNOTATED_TEXT]-(a:Article)
WHERE k.value = "social networks"
RETURN a.title as title, a.pagerank as p
ORDER BY p DESC
LIMIT 10

The results are as follows:

[Figure: query results]

Personalized PageRank algorithm

Personalized PageRank scores the other nodes from the point of view of one or more source nodes.
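
A minimal Python sketch of the idea, under stated assumptions: instead of restarting the random walk at any node, the walk always restarts at one of the source nodes, each with equal weight. The damping factor, the handling of nodes without outgoing links (their rank returns to the sources), and the toy graph are choices made for this illustration and may differ from the algo library.

```python
def personalized_pagerank(edges, sources, damping=0.85, iterations=50):
    sources = set(sources)
    nodes = sorted({n for edge in edges for n in edge} | sources)
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)
    # Restart distribution: equal weight on every source node, zero elsewhere.
    restart = {n: (1.0 / len(sources) if n in sources else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        # Rank stuck on nodes without outgoing links goes back to the sources.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1 - damping + damping * dangling) * restart[n] for n in nodes}
        for node in nodes:
            for target in out_links[node]:
                new_rank[target] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank

# Hypothetical citation graph: a1 cites x, a2 cites y, and both x and y cite z.
edges = [("a1", "x"), ("a2", "y"), ("x", "z"), ("y", "z")]
from_a1 = personalized_pagerank(edges, ["a1"])
from_a2 = personalized_pagerank(edges, ["a2"])
print(from_a1["x"] > from_a1["y"], from_a2["y"] > from_a2["x"])
```

The same graph yields different rankings depending on the sources: seen from a1, x scores higher than y, and seen from a2 the order is reversed.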

We compute the PageRank scores again, but this time we use the articles described by the keyword "social networks" as the source nodes.

MATCH (k:Keyword)-[:DESCRIBES]->()<-[:HAS_ANNOTATED_TEXT]-(a:Article)
WHERE k.value = "social networks"
WITH collect(a) as articles
CALL algo.pageRank.stream('Article', 'REFERENCES', {sourceNodes: articles})
YIELD nodeId, score
WITH nodeId,score order by score desc limit 10
MATCH (n) where id(n) = nodeId
RETURN n.title as article, score

[Figure: query results]

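We can see that "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (http://infolab.stanford.edu/pub/papers/google.pdf) by Sergey Brin and Larry Page ranks first. This suggests that Google's early research on graphs and PageRank had a huge influence on the field of social networks.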

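Personalized recommendation system

To restate, the goal of this article is to reproduce this scenario:

The keyword "entropy" means different things to different people, so we want to compare the results for the keyword "entropy" from different points of view.

First we find all articles by a given author; these articles serve as the source nodes for personalized PageRank. Then we run the PageRank algorithm, projecting only the Article nodes described by the keyword "entropy" together with the REFERENCES relationships between them.

We can filter out unwanted relationships with a Cypher projection.

A relationship returned by the relationship query is projected only if both its source node and its target node are returned by the node query. If either the source node or the target node is not returned by the node query, the relationship is ignored.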
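
The filtering rule can be written down in a few lines of Python. This is only an illustration of the rule as described above, with made-up node ids; it is not how the algo library loads the graph internally.

```python
def project(node_ids, relationships):
    # Keep a relationship only when both of its endpoints were returned by the node query.
    allowed = set(node_ids)
    return [(source, target) for source, target in relationships
            if source in allowed and target in allowed]

# Hypothetical ids: the node query returned 1, 2 and 3; the relationship query returned four pairs.
node_ids = [1, 2, 3]
relationships = [(1, 2), (2, 4), (5, 3), (3, 1)]
projected = project(node_ids, relationships)
print(projected)
```

Here the pairs (2, 4) and (5, 3) are dropped because nodes 4 and 5 are not part of the node query result.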

Recommendation examples

Below are the recommended articles for the keyword "entropy" from the point of view of Jose C. Principe.

MATCH (a:Article)<-[:AUTHOR]-(author:Author)
WHERE author.name="Jose C. Principe"
WITH collect(a) as articles
CALL algo.pageRank.stream(
'MATCH (a:Article)-[:HAS_ANNOTATED_TEXT]->()<-[:DESCRIBES]-(k:Keyword)
WHERE k.value contains "entropy" RETURN distinct id(a) as id',
'MATCH (a1:Article)-[:REFERENCES]->(a2:Article)
RETURN id(a1) as source,id(a2) as target',
{sourceNodes: articles,graph:'cypher'})
YIELD nodeId, score
WITH nodeId,score order by score desc limit 10
MATCH (n) where id(n) = nodeId
RETURN n.title as article, score

[Figure: query results]

Recommended articles for the keyword "entropy" from the point of view of Hong Wang

MATCH (a:Article)<-[:AUTHOR]-(author:Author)
WHERE author.name="Hong Wang"
WITH collect(a) as articles
CALL algo.pageRank.stream(
'MATCH (a:Article)-[:HAS_ANNOTATED_TEXT]->()<-[:DESCRIBES]-(k:Keyword)
WHERE k.value contains "entropy" RETURN distinct id(a) as id',
'MATCH (a1:Article)-[:REFERENCES]->(a2:Article)
RETURN id(a1) as source,id(a2) as target',
{sourceNodes: articles,graph:'cypher'})
YIELD nodeId, score
WITH nodeId,score order by score desc limit 10
MATCH (n) where id(n) = nodeId
RETURN n.title as article, score

[Figure: query results]

Conclusion

As expected, searching from the perspectives of different authors yields different recommendations.

Neo4j is very powerful on its own, and with the right plugins for a particular domain it becomes even more powerful.

Origin www.cnblogs.com/cuiyubo/p/11297311.html