Hands-on tutorial on building a small-scale financial knowledge graph: quantitative analysis, the graph database neo4j, graph algorithms, relationship prediction, named entity recognition, a detailed Cypher cheat sheet, and more.


The project source code can be downloaded from the link below:

https://download.csdn.net/download/sinat_39620217/87987419

1. Knowledge graph storage methods

Knowledge graphs are mainly stored in one of two ways: the Resource Description Framework (RDF) or a graph database.

1.1 Resource Description Framework Features

  • Data is stored as triples
  • Has a standard inference engine
  • Is a W3C standard
  • Makes data easy to publish
  • Mostly used in academia

1.2 Graph Database Features

  • Both nodes and relationships can carry properties
  • No standard inference engine
  • Graph traversal is efficient
  • Supports transaction management
  • Mostly used in industry
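
To make the contrast between the two storage methods concrete, here is a minimal sketch in plain Python (no database involved) that stores one fact both ways. The identifiers STOCK_X, holder_A, and holding_1 and the ratio 0.05 are made-up illustrations, not data from this project.

```python
# RDF style: everything is a (subject, predicate, object) triple.
# Attaching data to the relationship itself needs extra triples about
# an intermediate node (here the hypothetical "holding_1").
triples = [
    ("holder_A", "holds", "holding_1"),
    ("holding_1", "of_stock", "STOCK_X"),
    ("holding_1", "hold_ratio", 0.05),
    ("STOCK_X", "name", "Example Bank"),
]

# Property graph style: nodes and relationships both carry properties.
nodes = {
    "holder_A": {"label": "Shareholder"},
    "STOCK_X": {"label": "Stock", "name": "Example Bank"},
}
relationships = [
    {"start": "holder_A", "type": "HOLDS", "end": "STOCK_X",
     "properties": {"hold_ratio": 0.05}},
]


def ratio_from_triples(holder, stock):
    """Walk the triples to recover the holding ratio."""
    for s, p, o in triples:
        if s == holder and p == "holds":
            facts = {pp: oo for ss, pp, oo in triples if ss == o}
            if facts.get("of_stock") == stock:
                return facts.get("hold_ratio")
    return None


def ratio_from_graph(holder, stock):
    """Read the ratio directly from the relationship's properties."""
    for rel in relationships:
        if rel["start"] == holder and rel["end"] == stock:
            return rel["properties"].get("hold_ratio")
    return None


print(ratio_from_triples("holder_A", "STOCK_X"))
print(ratio_from_graph("holder_A", "STOCK_X"))
```

Both lookups return the same value; the difference is that the property graph keeps the ratio on the relationship itself, which is the first feature listed in 1.2.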

2. Graph database neo4j

neo4j is a NoSQL graph database that offers high-performance, scalable reads and writes and is queried with Cypher, an efficient graph query language. For more information, visit the official neo4j website, which also provides an Online Sandbox for a quick hands-on start.

2.1 Software download

Download link: https://neo4j.com/download-center/

2.2 Start and log in

2.2.1 Windows

  • Enter the neo4j directory and start the service
cd neo4j/bin
./neo4j start
  • If the startup succeeds, the terminal shows the following message
Starting Neo4j.
Started neo4j (pid 30914). It is available at http://localhost:7474/
There may be a short delay until the server is ready.

(1) Visit the page: http://localhost:7474

(2) The initial username and password are both neo4j (select bolt as the host type)

(3) Enter the old password, then set a new password. Note: a JDK must be installed locally before starting neo4j (JDK 11 is recommended): https://www.oracle.com/java/technologies/javase-downloads.html

2.2.2 MacOS

After adding a database with Add Local DBMS, open Neo4j Browser.

2.3 Prerequisite knowledge

The Cypher query language is required to perform CRUD operations on neo4j.

2.4 Possible problems and solutions during Windows installation

  • Problem: After installing JDK 1.8.0_261, neo4j reported the following error during startup:
Unable to find any JVMs matching version "11"
  • Solution: The message asks for JDK 11, so download and install jdk-11.0.8. On macOS, you can run ls -la /Library/Java/JavaVirtualMachines/ to check the installed JDKs and their versions.

3. Knowledge graph data preparation

3.1 Free open source financial data interface

A free Tushare account may not be able to pull all of the data. In that case, refer to the stock data acquisition methods provided in the issues (see also 3.2.9).

3.1.1 Tushare

Official website link: http://www.tushare.org/

3.1.2 JoinQuant

Official website link: https://www.joinquant.com/

3.1.3 Importing modules

import tushare as ts  # see the installation instructions on the Tushare website
import csv
import time
import pandas as pd
# The pro_api token below may have expired; apply for your own or use the free version
pro = ts.pro_api('4340a981b3102106757287c11833fc14e310c4bacf8275f067c9b82d')

3.2 Data acquisition

3.2.1 Basic stock information

stock_basic = pro.stock_basic(list_status='L', fields='ts_code,symbol,name,industry')
# Rename the columns to make the later import into neo4j easier
basic_rename = {'ts_code': 'TS代码', 'symbol': '股票代码', 'name': '股票名称', 'industry': '行业'}
stock_basic.rename(columns=basic_rename, inplace=True)
# Save as stock_basic.csv
stock_basic.to_csv('financial_data\\stock_basic.csv', encoding='gbk')

3.2.2 Shareholder Information

frames = []
# Fetch shareholder information for all listed stocks within one year (a single reporting period also works)
for i in range(3610):
    code = stock_basic['TS代码'].values[i]
    # Use a separate name so the per-stock result does not overwrite the accumulated data
    holder = pro.top10_holders(ts_code=code, start_date='20180101', end_date='20181231')
    frames.append(holder)
    if i % 600 == 0:
        print(i)  # progress
    time.sleep(0.4)  # data interface rate limit
# DataFrame.append was removed in pandas 2.0, so combine the parts with pd.concat
holders = pd.concat(frames, ignore_index=True)
# Save as stock_holders.csv
holders.to_csv('financial_data\\stock_holders.csv', encoding='gbk')
# Preview the shareholder data of a single stock
sample = pro.top10_holders(ts_code='000001.SZ', start_date='20180101', end_date='20181231')
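
The per-stock loops in this section all follow the same pattern: call the interface once per code, pause to respect the rate limit, and collect the results. The sketch below factors that pattern into a helper that also retries failed calls. The names fetch_all and fetch_one are hypothetical (they are not part of Tushare), and the fake fetch function only stands in for a real call such as pro.top10_holders.

```python
import time


def fetch_all(codes, fetch_one, pause=0.4, retries=2):
    """Call fetch_one(code) for every code, pausing between calls.

    Returns (results, failed): results maps code -> fetched value,
    failed lists the codes that still raised after all retries.
    """
    results, failed = {}, []
    for code in codes:
        for attempt in range(retries + 1):
            try:
                results[code] = fetch_one(code)
                break
            except Exception:
                if attempt == retries:
                    failed.append(code)
            time.sleep(pause)  # wait before retrying
        time.sleep(pause)  # respect the interface rate limit
    return results, failed


# Stand-in for a real API call; it fails for one code on purpose
def fake_fetch(code):
    if code == "BAD":
        raise RuntimeError("simulated API error")
    return [code, "holder"]


results, failed = fetch_all(["A", "BAD", "B"], fake_fetch, pause=0)
print(sorted(results), failed)
```

With real data, the values of results would be DataFrames that can be combined with pd.concat, and the codes in failed can be fetched again later instead of aborting the whole run.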

3.2.3 Stock concept information

frames = []
for i in range(358):
    concept_id = 'TS' + str(i)  # avoid shadowing the built-in id()
    concept_detail = pro.concept_detail(id=concept_id)
    frames.append(concept_detail)
    time.sleep(0.4)  # data interface rate limit
# DataFrame.append was removed in pandas 2.0, so combine the parts with pd.concat
concept_details = pd.concat(frames, ignore_index=True)
# Save as stock_concept.csv
concept_details.to_csv('financial_data\\stock_concept.csv', encoding='gbk')

3.2.4 Stock announcement information

for i in range(3610):
    code = stock_basic['TS代码'].values[i]
    notices = pro.anns(ts_code=code, start_date='20180101', end_date='20181231', year='2018')
    notices.to_csv("financial_data\\notices\\"+str(code)+".csv",encoding='utf_8_sig',index=False)
# Preview the announcements of a single stock
notices = pro.anns(ts_code='000001.SZ', start_date='20180101', end_date='20181231', year='2018')

3.2.5 Financial news information

news = pro.news(src='sina', start_date='20180101', end_date='20181231')
news.to_csv("financial_data\\news.csv",encoding='utf_8_sig')

3.2.6 Concept information

concept = pro.concept()
concept.to_csv('financial_data\\concept.csv', encoding='gbk')

3.2.7 Constituents of Shanghai-Hong Kong Stock Connect and Shenzhen-Hong Kong Stock Connect

# Get the Shanghai Stock Connect constituents
sh = pro.hs_const(hs_type='SH')
sh.to_csv("financial_data\\sh.csv",index=False)
# Get the Shenzhen Stock Connect constituents
sz = pro.hs_const(hs_type='SZ')
sz.to_csv("financial_data\\sz.csv",index=False)

3.2.8 Stock price information

for i in range(3610):
   code = stock_basic['TS代码'].values[i]
   price = pro.query('daily', ts_code=code, start_date='20180101', end_date='20181231')
   price.to_csv("financial_data\\price\\"+str(code)+".csv",index=False)

3.2.9 Use the free interface to obtain stock data

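import tushare as ts
# Fundamental information
df = ts.get_stock_basics()
# Announcement information
ts.get_notices("000001")
# Sina stock forum (guba)
ts.guba_sina()
# Historical price data
ts.get_hist_data("000001")
# Historical price data (weekly granularity)
ts.get_hist_data("000001",ktype="w")
# Historical price data (monthly granularity; in the Tushare documentation ktype M means month, not 1 minute)
ts.get_hist_data("000001",ktype="m")
# Historical price data (5-minute granularity)
ts.get_hist_data("000001",ktype="5")
# Index data (sh: SSE Composite; sz: SZSE Component; hs300: CSI 300; sz50: SSE 50; zxb: SME Board index; cyb: ChiNext index)
ts.get_hist_data("cyb")
# Macro data (consumer price index)
ts.get_cpi()
# Get tick-by-tick data
ts.get_tick_data('000001', date='2018-10-08', src='tt')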

3.3 Data preprocessing

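3.3.1 Mode of the number of trading days per stock

import os
import numpy as np
import pandas as pd

# Run this after 3.3.2, which creates the price_logreturn files
listdir = os.listdir("financial_data\\price_logreturn")
yaxis = list()
for i in listdir:
    stock = pd.read_csv("financial_data\\price_logreturn\\"+i)
    yaxis.append(len(stock['logreturn']))
# Count how many stocks have each number of trading days
counts = np.bincount(yaxis)

# The most common number of trading days (the mode)
np.argmax(counts)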
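
Stocks that were suspended or newly listed have fewer rows than the rest, and the correlation step in 3.3.3 only keeps series of one common length (242). As a minimal sketch of what the mode computation does, the example below uses the standard library on a made-up list of row counts; the numbers are hypothetical.

```python
from collections import Counter

# Hypothetical row counts, one per stock file
trading_days = [242, 242, 180, 242, 95, 242, 180]

counts = Counter(trading_days)
mode_value, frequency = counts.most_common(1)[0]
print(mode_value, frequency)  # 242 4
```

np.bincount followed by np.argmax gives the same mode; Counter is used here only to keep the sketch free of third-party dependencies.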

3.3.2 Calculating stock log returns

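The stock log return and the Pearson correlation coefficient are calculated as follows, where P_t is the closing price on trading day t, and x and y are two log-return series of length n:

$$r_t = \ln\frac{P_{t+1}}{P_t}$$

$$\rho_{x,y} = \frac{\sum x_i y_i - \frac{1}{n}\sum x_i \sum y_i}{\sqrt{\left(\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2\right)\left(\sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2\right)}}$$

import pandas as pd
import numpy as np
import os

listdir = os.listdir("financial_data\\price")

for l in listdir:
    stock = pd.read_csv('financial_data\\price\\'+l)
    # Sort by trading date (oldest first) so that next_close is the following day's close
    stock = stock.sort_values('trade_date').reset_index(drop=True)
    stock['next_close'] = stock['close'].shift(-1)
    # The last row has no following day, so drop it
    stock = stock.drop(index=stock.index[-1])
    stock['logreturn'] = np.log(stock['next_close'] / stock['close'])
    stock.to_csv("financial_data\\price_logreturn\\"+l,index=False)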
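
As a small worked example of the log return calculation, the sketch below applies it to four made-up closing prices using only the standard library. It also checks a property that makes log returns convenient: they add up over time, so their sum equals the log of the overall price ratio.

```python
import math

closes = [10.0, 11.0, 9.9, 9.9]  # hypothetical closing prices, oldest first

# One log return per pair of consecutive days: ln(next close / current close)
log_returns = [math.log(nxt / cur) for cur, nxt in zip(closes, closes[1:])]
print([round(r, 4) for r in log_returns])

# The sum of the log returns equals the log of the total price ratio
total = sum(log_returns)
print(math.isclose(total, math.log(closes[-1] / closes[0])))
```

Note that n prices give n - 1 returns, which is why the code above drops the last row of each file.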

3.3.3 Correlation coefficient of log returns among stocks

from math import sqrt

def multipl(a,b):
    # Sum of the element-wise products of a and b
    sumofab=0.0
    for i in range(len(a)):
        temp=a[i]*b[i]
        sumofab+=temp
    return sumofab

def corrcoef(x,y):
    n=len(x)
    # Sums
    sum1=sum(x)
    sum2=sum(y)
    # Sum of products
    sumofxy=multipl(x,y)
    # Sums of squares
    sumofx2 = sum([pow(i,2) for i in x])
    sumofy2 = sum([pow(j,2) for j in y])
    num=sumofxy-(float(sum1)*float(sum2)/n)
    # Pearson correlation coefficient
    den=sqrt((sumofx2-float(sum1**2)/n)*(sumofy2-float(sum2**2)/n))
    return num/den
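
For comparison, the same coefficient can be written in its mean-centered form, $\rho = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}$, which is algebraically equivalent to the sums above. The sketch below is one possible implementation, not the author's; the function name pearson and the sample values are assumptions. It also guards against a constant series, for which the denominator is zero.

```python
import math
from math import sqrt


def pearson(x, y):
    """Pearson correlation using mean-centered sums."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if den == 0:
        return float("nan")  # a constant series has no defined correlation
    return num / den


x = [0.01, -0.02, 0.015, 0.0, -0.005]  # hypothetical log returns
print(pearson(x, [2 * v + 0.001 for v in x]))  # close to 1.0
print(pearson(x, [-v for v in x]))             # close to -1.0
print(math.isnan(pearson(x, [1.0] * 5)))       # constant series
```

A linear transformation with positive slope leaves the coefficient at 1, and flipping the sign of one series flips the sign of the coefficient.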

Since the original data contains millions of rows, only the first 300 stocks are selected for correlation analysis to reduce the amount of computation.

listdir = os.listdir("financial_data\\300stock_logreturn")
# Read each file once instead of re-reading it inside the nested loop
returns = {}
for f in listdir:
    series = pd.read_csv("financial_data\\300stock_logreturn\\"+f)['logreturn']
    # Keep only stocks with 242 trading days so that all series have equal length
    if len(series) == 242:
        returns[f] = list(series)
s1 = list()
s2 = list()
corr = list()
for i in returns:
    for j in returns:
        c = corrcoef(returns[i], returns[j])
        # The first 10 characters of the file name are the TS code plus a trailing dot
        s1.append(str(i)[:10])
        s2.append(str(j)[:10])
        corr.append(c)
        print(str(i)[:10], str(j)[:10], c)
corrdf = pd.DataFrame()
corrdf['s1'] = s1
corrdf['s2'] = s2
corrdf['corr'] = corr
corrdf.to_csv("financial_data\\corr.csv")

4 Building a financial knowledge graph

Install third-party libraries

pip install py2neo

4.1 Connection based on python

For the specific code, see "3.1 python operation neo4j-connection".

from pandas import DataFrame
from py2neo import Graph,Node,Relationship,NodeMatcher
import pandas as pd
import numpy as np
import os
# Connect to the Neo4j database (use the new password set in 2.2 if you changed it)
graph = Graph('http://localhost:7474/db/data/',username='neo4j',password='neo4j')

4.2 Read data

stock = pd.read_csv('stock_basic.csv',encoding="gbk")
holder = pd.read_csv('holders.csv')
concept_num = pd.read_csv('concept.csv')
concept = pd.read_csv('stock_concept.csv')
sh = pd.read_csv('sh.csv')
sz = pd.read_csv('sz.csv')
corr = pd.read_csv('corr.csv')

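4.3 Filling missing values and deduplication

# Fill missing industry values with '未知' (unknown)
stock['行业'] = stock['行业'].fillna('未知')
# Drop duplicate shareholder rows, keeping the first occurrence
holder = holder.drop_duplicates(subset=None, keep='first', inplace=False)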

4.4 Create entities

Entities: concept, stock, shareholder, and Stock Connect

# Use separate names for the nodes so that the sz and sh DataFrames read in 4.2 are not overwritten
sz_node = Node('深股通',名字='深股通')
graph.create(sz_node)

sh_node = Node('沪股通',名字='沪股通')
graph.create(sh_node)

# Concept nodes
for i in concept_num.values:
   a = Node('概念',概念代码=i[1],概念名称=i[2])
   print('概念代码:'+str(i[1]),'概念名称:'+str(i[2]))
   graph.create(a)

# Stock nodes
for i in stock.values:
   a = Node('股票',TS代码=i[1],股票名称=i[3],行业=i[4])
   print('TS代码:'+str(i[1]),'股票名称:'+str(i[3]),'行业:'+str(i[4]))
   graph.create(a)

# Shareholder nodes
for i in holder.values:
   a = Node('股东',TS代码=i[0],股东名称=i[1],持股数量=i[2],持股比例=i[3])
   print('TS代码:'+str(i[0]),'股东名称:'+str(i[1]),'持股数量:'+str(i[2]))
   graph.create(a)
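
Because graph.create adds a new node on every call, rerunning the script above produces duplicate nodes. One possible alternative, sketched here under assumptions, is to send a parameterized Cypher merge statement (see 7.6) so that a stock node is created only if it does not exist yet. The helper name stock_merge_query is hypothetical, and the values passed in are purely illustrative.

```python
def stock_merge_query(ts_code, name, industry):
    """Return a parameterized Cypher statement and its parameters."""
    query = (
        "MERGE (s:股票 {TS代码: $ts_code}) "
        "SET s.股票名称 = $name, s.行业 = $industry"
    )
    params = {"ts_code": ts_code, "name": name, "industry": industry}
    return query, params


query, params = stock_merge_query("000001.SZ", "平安银行", "银行")
print(query)
print(params)
```

With the py2neo connection from 4.1, the statement could then be executed with graph.run(query, **params). Passing values as parameters instead of concatenating them into the query string also avoids quoting problems in names.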

4.5 Create relationship

Relationships: stock-shareholder, stock-concept, stock-announcement, stock-Stock Connect, and stock-stock correlation

matcher = NodeMatcher(graph)
for i in holder.values:    
   a = matcher.match("股票",TS代码=i[0]).first()
   b = matcher.match("股东",TS代码=i[0])
   for j in b:
       r = Relationship(j,'参股',a)
       graph.create(r)
       print('TS',str(i[0]))
           
for i in concept.values:
   a = matcher.match("股票",TS代码=i[3]).first()
   b = matcher.match("概念",概念代码=i[1]).first()
   if a is None or b is None:
       continue
   r = Relationship(a,'概念属于',b)
   graph.create(r)

noticesdir = os.listdir("notices\\")
for n in noticesdir:
   notice = pd.read_csv("notices\\"+n,encoding="utf_8_sig")
   notice['content'] = notice['content'].fillna('空白')
   for i in notice.values:
       a = matcher.match("股票",TS代码=i[0]).first()
       b = Node('公告',日期=i[1],标题=i[2],内容=i[3])
       graph.create(b)
       r = Relationship(a,'发布公告',b)
       graph.create(r)
       print(str(i[0]))
       
for i in sz.values:
   a = matcher.match("股票",TS代码=i[0]).first()
   b = matcher.match("深股通").first()
   r = Relationship(a,'成分股属于',b)
   graph.create(r)
   print('TS代码:'+str(i[0]),'--深股通')

for i in sh.values:
   a = matcher.match("股票",TS代码=i[0]).first()
   b = matcher.match("沪股通").first()
   r = Relationship(a,'成分股属于',b)
   graph.create(r)
   print('TS代码:'+str(i[0]),'--沪股通')

# Build the correlation relationships between stocks ([:-1] strips the trailing dot from the saved code)
corr = pd.read_csv("corr.csv")
for i in corr.values:
   a = matcher.match("股票",TS代码=i[1][:-1]).first()
   b = matcher.match("股票",TS代码=i[2][:-1]).first()
   r = Relationship(a,str(i[3]),b)
   graph.create(r)
   print(i)

5 Data Visualization Query

Using the Cypher language, take Ping An Bank (平安银行) as an example to run visual queries.

5.1 View all associated entities

match p=(m)-[]->(n) where m.股票名称="平安银行" or n.股票名称="平安银行" return p;

5.2 Limit the number of displays

Once the log-return correlations between stocks have been added, Ping An Bank has many associated entities, so limit the number of results displayed:

match p=(m)-[]->(n) where m.股票名称="平安银行" or n.股票名称="平安银行" return p limit 300;

5.3 Query the log-return correlation between two specified stocks

match p=(m)-[]->(n) where m.股票名称="平安银行" and n.股票名称="万科A" return p;

6 neo4j graph algorithm

6.1 Centrality algorithm (Centralities)

6.2 Community detection algorithm (Community detection)

6.3 Path finding algorithm (Path finding)

6.4 Similarity algorithm (Similarity)

6.5 Link Prediction

6.6 Preprocessing algorithm (Preprocessing)

6.7 Algorithm library installation and import method

Taking Windows as an example, the neo4j algorithm library is not included in the installation package, so the algorithm package has to be downloaded separately:

(1) Download graph-algorithms-algo-3.5.4.0.jar

(2) Move graph-algorithms-algo-3.5.4.0.jar to the plugins directory under the root directory of the neo4j database

(3) Edit neo4j.conf in the conf directory of the neo4j database and add the following configuration

dbms.security.procedures.unrestricted=algo.*

(4) Use the following command to view a list of all algorithms

CALL algo.list()

6.8 Algorithm Practice - Link Prediction

6.8.1 Adamic Adar algorithm

This algorithm measures the closeness of two nodes based on their shared neighbors. It was proposed by Lada Adamic and Eytan Adar in "Friends and neighbors on the Web" in 2003. The closeness of two nodes is calculated as follows:

$$A(x, y) = \sum_{u \in N(x) \cap N(y)} \frac{1}{\log |N(u)|}$$

where N(u) is the set of nodes adjacent to node u. A(x, y) = 0 means that node x and node y are not close, and the larger the value, the higher the closeness.

Adamic Adar algorithm cypher code reference:

MERGE (zhen:Person {name: "Zhen"})
MERGE (praveena:Person {name: "Praveena"})
MERGE (michael:Person {name: "Michael"})
MERGE (arya:Person {name: "Arya"})
MERGE (karin:Person {name: "Karin"})

MERGE (zhen)-[:FRIENDS]-(arya)
MERGE (zhen)-[:FRIENDS]-(praveena)
MERGE (praveena)-[:WORKS_WITH]-(karin)
MERGE (praveena)-[:FRIENDS]-(michael)
MERGE (michael)-[:WORKS_WITH]-(karin)
MERGE (arya)-[:FRIENDS]-(karin)

// Compute the closeness between Michael and Karin
MATCH (p1:Person {name: 'Michael'})
MATCH (p2:Person {name: 'Karin'})
RETURN algo.linkprediction.adamicAdar(p1, p2) AS score
// score: 0.910239

// Compute the closeness between Michael and Karin based on FRIENDS relationships only
MATCH (p1:Person {name: 'Michael'})
MATCH (p2:Person {name: 'Karin'})
RETURN algo.linkprediction.adamicAdar(p1, p2, {relationshipQuery: "FRIENDS"}) AS score
// score: 0.0
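
To see where these two scores come from, the Adamic Adar score can be recomputed by hand. The following is a minimal pure-Python sketch on the same sample graph, treating relationships as undirected; it illustrates the idea and is not the implementation used by the neo4j algorithm library. The helper names neighbors and adamic_adar are assumptions.

```python
import math

# Same sample graph as the Cypher above: (person, person, relationship type)
edges = [
    ("Zhen", "Arya", "FRIENDS"),
    ("Zhen", "Praveena", "FRIENDS"),
    ("Praveena", "Karin", "WORKS_WITH"),
    ("Praveena", "Michael", "FRIENDS"),
    ("Michael", "Karin", "WORKS_WITH"),
    ("Arya", "Karin", "FRIENDS"),
]


def neighbors(edge_list, rel_type=None):
    """Build an undirected adjacency map, optionally for one relationship type."""
    adj = {}
    for a, b, t in edge_list:
        if rel_type is None or t == rel_type:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj


def adamic_adar(adj, x, y):
    """Sum 1 / log(degree) over the common neighbors of x and y."""
    common = adj.get(x, set()) & adj.get(y, set())
    return sum(1.0 / math.log(len(adj[u])) for u in common)


all_score = adamic_adar(neighbors(edges), "Michael", "Karin")
friends_score = adamic_adar(neighbors(edges, "FRIENDS"), "Michael", "Karin")
print(round(all_score, 6))  # 0.910239
print(friends_score)        # 0
```

Michael and Karin share a single neighbor, Praveena, who has three neighbors, so the score is 1 / ln 3. When only FRIENDS relationships are considered they share no neighbor, and the score drops to zero.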

6.8.2 Common Neighbors

This algorithm is based on the number of neighbors that two nodes have in common. The calculation formula is as follows:

$$CN(x, y) = |N(x) \cap N(y)|$$

where N(x) is the set of nodes adjacent to node x, and the common neighbors are the intersection of the two sets. The higher the value of CN(x, y), the closer node x and node y are.

Common Neighbors algorithm cypher code reference:

MATCH (p1:Person {name: 'Michael'})
MATCH (p2:Person {name: 'Karin'})
RETURN algo.linkprediction.commonNeighbors(p1, p2) AS score

6.8.3 Resource Allocation

The calculation formula of the resource allocation algorithm is as follows:

$$RA(x, y) = \sum_{u \in N(x) \cap N(y)} \frac{1}{|N(u)|}$$

where N(u) is the set of nodes adjacent to node u. The higher the value of RA(x, y), the closer node x and node y are.

Resource Allocation algorithm cypher code reference:

MATCH (p1:Person {name: 'Michael'})
MATCH (p2:Person {name: 'Karin'})
RETURN algo.linkprediction.resourceAllocation(p1, p2) AS score

6.8.4 Total Neighbors

This is the total number of distinct neighbors of the two nodes. The calculation formula is as follows:

$$TN(x, y) = |N(x) \cup N(y)|$$

where N(u) is the set of nodes adjacent to node u.

Total Neighbors algorithm cypher code reference:

MATCH (p1:Person {name: 'Michael'})
MATCH (p2:Person {name: 'Karin'})
RETURN algo.linkprediction.totalNeighbors(p1, p2) AS score
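
The three neighbor-based scores in 6.8.2 to 6.8.4 are simple set operations, which the following pure-Python sketch makes explicit. It hard-codes the adjacency sets of the sample graph from 6.8.1 (all relationship types, undirected); the function names are assumptions and this is not the library's implementation.

```python
# Adjacency sets of the sample graph from 6.8.1
adj = {
    "Zhen": {"Arya", "Praveena"},
    "Praveena": {"Zhen", "Karin", "Michael"},
    "Michael": {"Praveena", "Karin"},
    "Arya": {"Zhen", "Karin"},
    "Karin": {"Praveena", "Michael", "Arya"},
}


def common_neighbors(adj, x, y):
    """Size of the intersection of the two neighbor sets."""
    return len(adj[x] & adj[y])


def resource_allocation(adj, x, y):
    """Sum 1 / degree over the common neighbors of x and y."""
    return sum(1.0 / len(adj[u]) for u in adj[x] & adj[y])


def total_neighbors(adj, x, y):
    """Size of the union of the two neighbor sets."""
    return len(adj[x] | adj[y])


cn = common_neighbors(adj, "Michael", "Karin")
ra = resource_allocation(adj, "Michael", "Karin")
tn = total_neighbors(adj, "Michael", "Karin")
print(cn, round(ra, 4), tn)  # 1 0.3333 4
```

Michael and Karin have one common neighbor (Praveena, with three neighbors), which gives CN = 1 and RA = 1/3. The union of their neighbor sets contains four people, so TN = 4.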

Official documentation (Link Prediction, Introduction): https://neo4j.com/docs/graph-algorithms/3.5/labs-algorithms/linkprediction/

7. Cypher cheat sheet: basic syntax

7.1 Creating Nodes

Create nodes of type Person (properties: name, age, and sex)

create (:Person{name:"Tom",age:18,sex:"male"})
create (:Person{name:"Jimmy",age:20,sex:"male"})

7.2 Creating Relationships

Find the two Person nodes named Tom and Jimmy, and create a relationship between them with type Friend and the property relation set to best

match(p1:Person),(p2:Person)
where p1.name="Tom" and p2.name = "Jimmy"
create(p1) -[:Friend{relation:"best"}] ->(p2);

7.3 Create an index

create index on :Person(name)
// Create a unique constraint (property values must be unique)
create constraint on (n:Person) assert n.name is unique

7.4 Delete node

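// Plain delete
match(p:Person{name:"Jimmy"}) delete p
match (a)-[r:knows]->(b) delete r,b
// Cascading delete (deleting a node also deletes its relationships)
match (n{name: "Mary"}) detach delete n
// Delete all nodes (fails if a node still has relationships; use detach delete in that case)
match (m) delete m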

7.5 Delete relationship

// Plain delete
match(p1:Person)-[r:Friend]-(p2:Person)
where p1.name="Jimmy" and p2.name="Tom"
delete r
// Delete all relationships
match ()-[r]-() delete r

7.6 The merge keyword

If the pattern exists, merge returns it directly; if it does not exist, merge creates it and returns it (in practice it is often used to avoid errors when adding properties to nodes).

// Create or get a node
merge (p:Person { name: "Jim1" }) return p;

// Create or get a node + set a property value on creation + return property values
merge (p:Person { name: "Koko" })
on create set p.time = timestamp()
return p.name, p.time

// Create a relationship
match (a:Person {name: "Jim"}),(b:Person {name: "Tom"})
merge (a)-[r:friends]->(b)

7.7 Updating Nodes

7.7.1 Updating property values

match (n {name:'Jim'})
set n.name='Tom'
set n.age=20
return n

7.7.2 Add new properties and property values

match (n {name:'Mary'}) set n += {age:20} return n

7.7.3 Remove a property

match(n{name:'Tom'}) remove n.age return n

7.7.4 Update node type (multiple labels allowed)

// (1) Set a single label
match (n{name:'Jim'}) set n:Person return n
// (2) Set multiple labels
match (n{name:'Jim'}) set n:Person:Student return n

7.8 Matching

7.8.1 Limit node type and attribute matching

match (n:Person{name:"Jim"}) return n
match (n) where n.name = "Jim" return n
match (n:Person)-[:Relation]->(m:Person) where n.name = 'Mary' return m

7.8.2 Optional matching (null is returned for missing parts)

optional match (n)-[r]->(m) return m

7.8.3 Matching at the beginning of a string

match (n) where n.name starts with 'J' return n

7.8.4 End of string matching

match (n) where n.name ends with 'J' return n

7.8.5 String Contains Matching

match (n) where n.name contains 'g' return n

7.8.6 String exclusion matching

match (n) where not n.name starts with 'J' return n

7.8.7 Regular expression match =~ (fuzzy match)

match (n) where n.name =~ '.*J.*' return n // equivalent to SQL like '%J%'

7.8.8 Regular expression match =~ (case insensitive)

match (n) where n.name =~ '(?i)b.*' return n // equivalent to SQL like 'b%', ignoring case
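
One detail that is easy to miss: the =~ operator has to match the entire string, which is why the patterns above need the leading and trailing .* wildcards. Python's re.fullmatch behaves the same way, so the following sketch can be used to try out a pattern before putting it into a query. The list of names is made up.

```python
import re

names = ["Jim", "Bob", "bill", "Rajesh", "Abby"]

# Counterpart of: where n.name =~ '.*J.*'
contains_j = [n for n in names if re.fullmatch(r".*J.*", n)]

# Counterpart of: where n.name =~ '(?i)b.*'
starts_with_b = [n for n in names if re.fullmatch(r"(?i)b.*", n)]

# Without wildcards the whole string must equal the pattern
only_j = [n for n in names if re.fullmatch(r"J", n)]

print(contains_j)     # ['Jim']
print(starts_with_b)  # ['Bob', 'bill']
print(only_j)         # []
```

"Rajesh" is not matched by the first pattern because the match is case sensitive unless (?i) is given, and "Abby" is not matched by the second because the pattern is anchored at the start of the string.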

7.8.9 Attribute value contains (IN)

match (n { name: 'Jim' }),(m) where m.name in ['Tom', 'Koo'] and (n)<--(m) return m

7.8.10 "or" match (|)

match p=(n)-[:knows|:likes]->(m) return p

7.8.11 Arbitrary node and specified range depth relationship

match p=(n)-[*1..3]->(m) return p

7.8.12 Arbitrary nodes and relationships of arbitrary depth

match p=(n)-[*]->(m) return p

7.8.13 Deduplication return

match (n) where n.ptype='book' return distinct n

7.8.14 Sort return (desc descending order; asc ascending order)

match (n) where n.ptype='book' return n order by n.price desc

7.8.15 Rename returns

match (n) where n.ptype='book' return n.pname as name

7.8.16 Chained conditions (with): for example, return the people whose name starts with 张 (Zhang) and who know more than 10 people

match (a)-[:knows]-(b)
where a.name =~ '张.*'
with a, count(b) as friends
where friends > 10
return a

7.8.17 Union with deduplication (union)

match (a)-[:knows]->(b) return b.name
union
match (a)-[:likes]->(b) return b.name

7.8.18 Union without deduplication (union all)

match (a)-[:knows]->(b) return b.name
union all
match (a)-[:likes]->(b) return b.name

7.8.19 View node properties/ID

match (p) where p.name = 'Jim' 
return keys(p), properties(p), id(p)

7.8.20 Paginated return

match (n) where n.name='John' return n skip 10 limit 10

7.9 Reading files

7.9.1 Read network resource csv file

load csv with headers from 'http://www.download.com/abc.csv' as line
create (:Track{trackId:line.id,name:line.name,length:line.length})

7.9.2 Read network resources in batches

For example, commit a csv import every 800 rows (the default is 1000)

using periodic commit 800
load csv with headers from 'http://www.download.com/abc.csv' as line
create (:Track{trackId:line.id,name:line.name,length:line.length})

7.9.3 Reading local files

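// fieldterminator sets a custom delimiter and can be omitted for comma-separated files
load csv with headers from 'file:///00000.csv' as line fieldterminator ';'
create (:Data{date:line['date'],open:line['open']})

7.9.4 Precautions

※ Local csv files must be UTF-8 encoded
※ Local csv files must be placed in the import directory under the neo4j database directory
※ If a local csv file contains a row of column names, with headers must be added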
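
Several files in section 3 are saved with encoding='gbk', so they have to be re-encoded before load csv can read them. The following is a minimal sketch of such a conversion using only the standard library; the function name convert_to_utf8 and the file names are hypothetical, and the demonstration writes its own small temporary file rather than touching project data.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="gbk"):
    """Re-encode a text file to UTF-8 so that load csv can read it."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Demonstration with a temporary GBK file (hypothetical content)
tmp_dir = tempfile.mkdtemp()
gbk_file = os.path.join(tmp_dir, "stock_basic_gbk.csv")
utf8_file = os.path.join(tmp_dir, "stock_basic_utf8.csv")
content = "TS代码,股票名称\n000001.SZ,平安银行\n"
with open(gbk_file, "w", encoding="gbk", newline="") as f:
    f.write(content)

convert_to_utf8(gbk_file, utf8_file)
with open(utf8_file, encoding="utf-8", newline="") as f:
    converted = f.read()
print(converted)
```

Alternatively, the files can be written as UTF-8 in the first place by passing a UTF-8 encoding to to_csv, as the announcement and news exports in section 3 already do with utf_8_sig.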

7.10 foreach keyword


  • Personal summary

1. Nodes use ()
2. Relationships use []
3. where conditions use "="
4. Inside {} use ":"
5. Relationships are established with (m)-[:r]->(n)
6. Regular expressions use "=~"
7. Node or relationship syntax: (variable name:type {property name: property value}) or [variable name:type {property name: property value}]
8. When matching relationships for display, return the path p from p=(m)-[r]->(n) rather than r; returning only r displays nothing


Origin blog.csdn.net/sinat_39620217/article/details/131608595