Implementing a custom Chinese full-text index in Neo4j

To improve database retrieval efficiency, the usual approach is to start with the primary index and then, as needs grow, move on to more complex measures such as load balancing, read/write separation, and horizontal or vertical sharding of databases and tables.
An index improves retrieval efficiency by storing redundant information: it trades space for time and slows down writes, so choosing which fields to index is very important.

  • Neo4j can create an index on the nodes of a given label, and the index is updated automatically whenever a matching node property is added or updated. Neo4j indexes are implemented with Lucene by default (this can be customized; the Spatial index, for example, uses a custom RTree implementation). A newly created index, however, only supports exact matching (get) by default. Fuzzy queries (query) need a full-text index, which means controlling how Lucene tokenizes text behind the scenes.
  • Neo4j's default full-text tokenization targets Western languages: exact queries use Lucene's KeywordAnalyzer (keyword tokenizer) and fulltext queries use a white-space tokenizer, neither of which is meaningful for Chinese. Chinese text therefore needs a Chinese tokenizer plugged in, such as IK Analyzer or Ansj. Deep-learning-based segmentation systems such as Director Liang's pullword are more powerful still.
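
To see why the default analyzers are of little use for Chinese, the sketch below imitates the two behaviours in plain Java with no Lucene dependency. The class and method names are hypothetical, and the real Lucene analyzers do more than this; the point is only that text without spaces never gets split.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy imitation of keyword-style and whitespace-style tokenization (not Lucene code). */
final class DefaultTokenizationDemo {

    /** Keyword style: the whole value is a single token, so only exact matches hit. */
    static List<String> keywordTokens(String text) {
        return Collections.singletonList(text);
    }

    /** Whitespace style: split on runs of whitespace only. */
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // An English phrase is split into three tokens.
        System.out.println(whitespaceTokens("Suzhou Education Company"));
        // A Chinese phrase has no spaces, so it stays one unsearchable token.
        System.out.println(whitespaceTokens("苏州教育公司"));
    }
}
```

With either style, a search for 教育 cannot match the single token 苏州教育公司, which is why a dictionary-based Chinese tokenizer is needed.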

This article uses the commonly used IK Analyzer tokenizer as an example to show how to build a full-text index on a field and run fuzzy queries against it in Neo4j.


IKAnalyzer tokenizer

IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java.
Features of IKAnalyzer 3.0:

  • It uses its own "forward iterative finest-granularity segmentation algorithm" and supports two segmentation modes, fine-grained and maximum word length, with a processing speed of 830,000 characters per second (1600 KB/s).
  • It uses a multi-sub-processor analysis mode that supports segmenting letters, numbers, and Chinese words, and is compatible with Korean and Japanese characters. Dictionary storage is optimized for a smaller memory footprint, and user-defined dictionary extensions are supported.
  • IKQueryParser is a query analyzer optimized for Lucene full-text search (strongly recommended by its author). It introduces a simple search-expression analysis algorithm and uses disambiguation to optimize the permutations and combinations of query keywords, which can greatly improve the hit rate of Lucene searches.
    IK Analyzer is not yet available in the Maven central repository, so you have to download it and install it into your local repository manually. When I have time I will set up a private Maven repository on GitHub and upload the toolkits that are missing from Maven central.
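
The difference between the fine-grained and maximum-word-length modes can be pictured with a toy dictionary matcher. This is not IK Analyzer's actual algorithm: the class `ToySegmenter`, its method names, and the four-word dictionary are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy dictionary segmenter with two modes (an illustration, not IK Analyzer code). */
final class ToySegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    ToySegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
        int longest = 1;
        for (String word : dictionary) {
            longest = Math.max(longest, word.length());
        }
        this.maxWordLength = longest;
    }

    /** Fine-grained mode: emit every dictionary word found at every start position. */
    List<String> fineGrained(String text) {
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            int limit = Math.min(text.length(), start + maxWordLength);
            for (int end = start + 1; end <= limit; end++) {
                String candidate = text.substring(start, end);
                if (dictionary.contains(candidate)) {
                    tokens.add(candidate);
                }
            }
        }
        return tokens;
    }

    /** Longest-word mode: greedy forward match, falling back to a single character. */
    List<String> longestMatch(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int limit = Math.min(text.length(), start + maxWordLength);
            int end = start + 1; // fallback when no dictionary word starts here
            for (int candidateEnd = limit; candidateEnd > start; candidateEnd--) {
                if (dictionary.contains(text.substring(start, candidateEnd))) {
                    end = candidateEnd;
                    break;
                }
            }
            tokens.add(text.substring(start, end));
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        ToySegmenter segmenter = new ToySegmenter(
                new HashSet<>(Arrays.asList("苏州", "教育", "公司", "教育公司")));
        System.out.println(segmenter.fineGrained("苏州教育公司"));
        System.out.println(segmenter.longestMatch("苏州教育公司"));
    }
}
```

Fine-grained output produces more, overlapping terms, which favours recall when indexing; the longest-match output is shorter and closer to how a reader would split the phrase.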

IKAnalyzer custom user dictionary

  • Dictionary file
    A custom dictionary is a file with the .dic suffix. It must be saved as UTF-8 without a BOM.
  • Configuring the dictionary
    Pay attention to the paths of the dictionaries and of the IKAnalyzer.cfg.xml configuration file: IKAnalyzer.cfg.xml must be in the src root directory. The dictionaries can be placed anywhere, but their paths must be configured correctly in IKAnalyzer.cfg.xml. With the following configuration, ext.dic and stopword.dic should be in the same directory.
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
    <properties>
    <comment>IK Analyzer extension configuration</comment>

    <!-- users can configure their own extension dictionary here -->
    <entry key="ext_dict">/ext.dic;</entry>

    <!-- users can configure their own extension stop-word dictionary here -->
    <entry key="ext_stopwords">/stopword.dic</entry>
    </properties>
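
Because a BOM at the start of a dictionary file is easy to introduce by accident, it can help to write and check the file programmatically. The following is a minimal sketch using only the JDK; the class `DictionaryFiles` and its methods are hypothetical helpers, and the one-entry-per-line layout is an assumption about the .dic format.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helpers for writing and checking a .dic file. */
final class DictionaryFiles {

    /** Writes one entry per line as UTF-8; Files.write does not emit a BOM. */
    static void writeDictionary(Path file, List<String> words) throws IOException {
        Files.write(file, words, StandardCharsets.UTF_8);
    }

    /** Returns true if the file starts with the UTF-8 BOM bytes EF BB BF. */
    static boolean hasUtf8Bom(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        Path dic = Files.createTempFile("ext", ".dic");
        writeDictionary(dic, Arrays.asList("唯亭镇", "金陵东路"));
        System.out.println("BOM present: " + hasUtf8Bom(dic));
        Files.delete(dic);
    }
}
```

Running `hasUtf8Bom` on a dictionary saved by a text editor is a quick way to rule out the BOM before looking for other reasons why custom words are not picked up.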

Neo4j full-text index building

Specify IKAnalyzer as the Lucene analyzer, then add the chosen property of every node to the new full-text index:

@Override
public void createAddressNodeFullTextIndex() {
    try (Transaction tx = graphDBService.beginTx()) {
        IndexManager index = graphDBService.index();
        // Legacy Lucene index that tokenizes with IKAnalyzer
        Index<Node> addressNodeFullTextIndex = index.forNodes("addressNodeFullTextIndex",
                MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "analyzer", IKAnalyzer.class.getName()));

        ResourceIterator<Node> nodes = graphDBService.findNodes(DynamicLabel.label("AddressNode"));
        while (nodes.hasNext()) {
            Node node = nodes.next();
            // Add the text property to the full-text index; skip nodes that
            // lack it, because a null value cannot be indexed
            Object text = node.getProperty("text", null);
            if (text != null) {
                addressNodeFullTextIndex.add(node, "text", text);
            }
        }
        tx.success();
    }
}

 

Neo4j full-text index test

Both single-keyword queries (such as 'Limited') and multi-keyword fuzzy queries (such as 'Suzhou education company') return results by default, and the results are already sorted by relevance.

package uadb.tr.neodao.test;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.graphdb.index.IndexManager;
import org.neo4j.helpers.collection.MapUtil;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import org.wltea.analyzer.lucene.IKAnalyzer;

import com.lt.uadb.tr.entity.adtree.AddressNode;
import com.lt.util.serialize.JsonUtil;

/**
 * AddressNodeNeoDaoTest
 *
 * @author geosmart
 */
@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = { "classpath:app.neo4j.cfg.xml" })
public class AddressNodeNeoDaoTest {
    @Autowired
    GraphDatabaseService graphDBService;

    @Test
    public void test_selectAddressNodeByFullTextIndex() {
        try (Transaction tx = graphDBService.beginTx()) {
            IndexManager index = graphDBService.index();
            // Open the index with the same IKAnalyzer configuration it was created with
            Index<Node> addressNodeFullTextIndex = index.forNodes("addressNodeFullTextIndex",
                    MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "analyzer", IKAnalyzer.class.getName()));
            // Multi-keyword query against the text field; hits come back ordered by relevance
            IndexHits<Node> foundNodes = addressNodeFullTextIndex.query("text", "苏州 教育 公司");
            for (Node node : foundNodes) {
                AddressNode entity = JsonUtil.ConvertMap2POJO(node.getAllProperties(), AddressNode.class, false, true);
                System.out.println(entity.getAll地址实全称());
            }
            tx.success();
        }
    }
}
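
What a multi-keyword lookup does conceptually can be sketched with a toy inverted index. This is not how Lucene scores results: the class `ToyFullTextIndex` is hypothetical, it expects terms that a tokenizer has already produced, and it ranks simply by the number of query terms a node matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy inverted index: term -> node ids, ranked by matched term count (not Lucene scoring). */
final class ToyFullTextIndex {
    private final Map<String, Set<Long>> postings = new HashMap<>();

    /** Indexes a node under each of its already-segmented terms. */
    void add(long nodeId, List<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new HashSet<>()).add(nodeId);
        }
    }

    /** Returns ids matching any space-separated keyword, most matched keywords first. */
    List<Long> query(String keywords) {
        Set<String> terms = new HashSet<>(Arrays.asList(keywords.trim().split("\\s+")));
        Map<Long, Integer> matchCounts = new HashMap<>();
        for (String term : terms) {
            for (Long id : postings.getOrDefault(term, Collections.<Long>emptySet())) {
                matchCounts.merge(id, 1, Integer::sum);
            }
        }
        List<Long> ids = new ArrayList<>(matchCounts.keySet());
        // Higher match count first; ties broken by node id for a stable order
        ids.sort((a, b) -> {
            int byCount = matchCounts.get(b) - matchCounts.get(a);
            return byCount != 0 ? byCount : Long.compare(a, b);
        });
        return ids;
    }

    public static void main(String[] args) {
        ToyFullTextIndex index = new ToyFullTextIndex();
        index.add(1L, Arrays.asList("苏州", "教育", "公司"));
        index.add(2L, Arrays.asList("苏州", "公司"));
        index.add(3L, Arrays.asList("上海", "银行"));
        System.out.println(index.query("苏州 教育 公司"));
    }
}
```

The node that contains all three keywords is listed before the one that contains two, and the node that contains none is not returned at all.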

 

Querying with the custom full-text index in Cypher

Regular-expression query

profile
match (a:AddressNode{ruleabbr:'TOW',text:'唯亭镇'})<-[r:BELONGTO]-(b:AddressNode{ruleabbr:'STR'})
where b.text=~ '金陵.*'
return a,b
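
Cypher's =~ operator follows Java regular-expression syntax and has to match the whole property value, which is why the pattern ends in .* here. The plain-Java sketch below shows the same whole-string behaviour; the class name and the sample street names other than the 金陵 prefix are made up for illustration.

```java
import java.util.regex.Pattern;

/** Shows whole-string regex matching, the behaviour assumed for Cypher's =~ operator. */
final class RegexMatchDemo {

    /** Returns true only if the regex matches the entire value. */
    static boolean matchesWhole(String regex, String value) {
        return Pattern.matches(regex, value);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("金陵.*", "金陵东路")); // prefix plus wildcard matches
        System.out.println(matchesWhole("金陵.*", "南金陵路")); // value does not start with the prefix
        System.out.println(matchesWhole("金陵", "金陵东路"));   // no wildcard, so a partial match fails
    }
}
```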

Full-text index query

profile
START b=node:addressNodeFullTextIndex("text:金陵*")
match (a:AddressNode{ruleabbr:'TOW',text:'唯亭镇'})<-[r:BELONGTO]-(b:AddressNode)
where b.ruleabbr='STR'
return a,b
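
The string passed to the index in the START clause is a Lucene query, so user input placed in it may need escaping. Here is a minimal sketch of a helper that builds a prefix query like the one above; the class `LuceneQueryStrings` is hypothetical, and the escaped character set is an assumption modelled on the special characters of Lucene's classic query syntax.

```java
/** Hypothetical helper that builds "field:prefix*" query strings for a Lucene-backed index. */
final class LuceneQueryStrings {
    // Characters treated as special by Lucene's classic query syntax (assumed set)
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Prefixes every special character with a backslash. */
    static String escape(String term) {
        StringBuilder escaped = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }

    /** Builds a prefix query such as text:金陵* from a field name and a raw prefix. */
    static String prefixQuery(String field, String prefix) {
        return field + ":" + escape(prefix) + "*";
    }

    public static void main(String[] args) {
        System.out.println(prefixQuery("text", "金陵"));
        System.out.println(prefixQuery("text", "A(1)"));
    }
}
```

Whitespace is not handled here: a prefix containing spaces would be read as several query terms, so it would have to be quoted or split by the caller.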

Combining exact and fulltext indexes in the legacy index

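For nodes labeled AddressNode, the text property is indexed with a different index type depending on the category given by the node's ruleabbr property: addressnode_fulltext_index (province -> city -> district/county -> township/subdistrict -> street/road/lane or residential estate) and addressnode_exact_index (house number -> building number -> unit number -> floor number -> room number).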

profile
START a=node:addressnode_fulltext_index("text:商业街"),b=node:addressnode_exact_index("text:二期19")
match (a:AddressNode{ruleabbr:'STR'})-[r:BELONGTO]-(b:AddressNode{ruleabbr:'TAB'})
return a,b limit 10
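
When nodes are indexed, the choice between the two indexes can be made by a small routing function. The sketch below covers only the ruleabbr codes that appear in this article; the class `AddressIndexRouter` is hypothetical, and placing TOW in the fulltext group is an assumption based on it being a township-level node.

```java
/** Hypothetical router from a ruleabbr code to the legacy index that holds its text. */
final class AddressIndexRouter {
    static final String FULLTEXT_INDEX = "addressnode_fulltext_index";
    static final String EXACT_INDEX = "addressnode_exact_index";

    /** Picks the index name for a ruleabbr code; unknown codes are rejected. */
    static String indexFor(String ruleabbr) {
        switch (ruleabbr) {
            case "TOW": // township level (assumed to belong to the fulltext group)
            case "STR": // street level, queried through the fulltext index above
                return FULLTEXT_INDEX;
            case "TAB": // queried through the exact index above
                return EXACT_INDEX;
            default:
                throw new IllegalArgumentException("No index mapping for ruleabbr: " + ruleabbr);
        }
    }

    public static void main(String[] args) {
        System.out.println(indexFor("STR"));
        System.out.println(indexFor("TAB"));
    }
}
```

Rejecting unmapped codes, rather than defaulting to one index, keeps a node from silently landing in the wrong index when a new category is introduced.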

Original address: http://neo4j.com.cn/topic/58184ea2cdf6c5bf145675c3

Origin www.cnblogs.com/jpfss/p/11411128.html