Chinese word segmentation in Lucene

Regarding Chinese word segmentation: since Lucene is, after all, developed mainly by developers abroad, it naturally focuses more on English text. Fortunately, the Lucene download package ships the SmartCN tokenizer for Chinese, and every time a new version of Lucene is released this package is updated along with it.

The Chinese tokenizer recommended by the author is the IK tokenizer. Before getting into the formal explanation, let's first look at the analyzers built into Lucene.


Basic introduction to the analyzer types
WhitespaceAnalyzer: splits text on whitespace only and performs no other normalization on the tokens.
SimpleAnalyzer: splits text on non-letter characters, lowercases the tokens, and removes numeric characters.
StopAnalyzer: additionally removes common stop words such as a, the, an; the stop-word list can be customized.
StandardAnalyzer: Lucene's built-in standard analyzer; lowercases tokens and removes stop words and punctuation.
CJKAnalyzer: a tokenizer for Chinese, Japanese, and Korean text, with only mediocre support for Chinese.
SmartChineseAnalyzer: better Chinese support, but poor extensibility.
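
To see how these analyzers differ in practice, you can print the token stream that each one produces. Below is a minimal sketch, assuming Lucene 4.3 is on the classpath; the class name AnalyzerDemo, the field name "field", and the sample sentence are arbitrary choices for illustration:

package com.ikforlucene;

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {

    public static void main(String[] args) throws Exception {
        // Swap in WhitespaceAnalyzer, SimpleAnalyzer, etc. to compare the output
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
        TokenStream ts = analyzer.tokenStream("field", new StringReader("Lucene is an open source search library"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}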
 

To evaluate a tokenizer, the key points are its segmentation efficiency, flexibility, and extensibility. A good Chinese tokenizer should usually support an extended dictionary, a stop-word dictionary, and a synonym dictionary. Of course, what matters most is fitting your own business: sometimes no custom dictionaries are needed at all, in which case this can be ignored when choosing a tokenizer.

The latest IK version released on the IK official website supports Lucene well, but its Solr support is weaker; you have to modify the source code to make it work with Solr 4.x. The IK package the author uses is a version modified by others that supports Solr 4.3 and fully supports the extended dictionary, stop-word dictionary, and synonym dictionary; its configuration in Solr is very simple, requiring only a small entry in schema.xml to use the powerful customization features of the IK tokenizer. However, the IK package released by the IK author on the official website does not support the dictionary extension features in Lucene out of the box; if you want them, you need to modify the source code yourself, although extending the synonym support on your own is quite easy.


Below, the author gives a test done in Lucene using the last IK version released on the official website, with the synonym support extended by the author; the source code is given later.

First, let's look at a pure word segmentation test.



Java code:
package com.ikforlucene;

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Test {

    public static void main(String[] args) throws Exception {
        // The analyzer below is the modified IK analyzer with synonym support
        IKSynonymsAnalyzer analyzer = new IKSynonymsAnalyzer();
        String text = "三劫散仙是一个菜鸟"; // "San Jie San Xian is a rookie"
        TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset(); // reset the stream before consuming it
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();   // signal the end of the stream
        ts.close(); // close the stream
    }
}
In the first run, 三劫 and 散仙 are not segmented as whole terms. To make 三劫 one term and 散仙 one term, you need to add 三劫 and 散仙 to the extended dictionary (note that the dictionary is read line by line, and the file must be saved as UTF-8 or UTF-8 without BOM).

The running result after extending the dictionary is as follows:

三劫
散仙
是
一个
菜鸟

The third step is to test the stop-word dictionary: we block the word 菜鸟 (rookie), one term per line, saved in the same format as above. After adding the stop-word dictionary, the blocked word no longer appears in the running result.
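
For reference, both the extended dictionary and the stop-word dictionary are plain text files with one term per line, saved as UTF-8 without BOM. The file names ext.dic and stopword.dic below are only illustrative; use whatever names your IK package is configured to load:

ext.dic:
三劫
散仙

stopword.dic:
菜鸟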

Finally, let's test the synonym part. Here the author adds 河南人 (Henan person) and 洛阳人 (Luoyang person) to the synonym dictionary as synonyms of the word 一个 (the author is only doing a test here; the synonyms in a real production environment must of course be proper ones). Note that synonyms are also read line by line, and the synonyms on each line are separated by commas.
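
For this test, the corresponding line in the synonym file could look like the following (the file must be the one configured in the IKSynonymsAnalyzer shown later; all comma-separated terms on a line are treated as synonyms of one another):

synonyms.txt:
一个,河南人,洛阳人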


After adding the synonym dictionary, the running result is as follows:



三劫
散仙
是
一个
河南人
洛阳人



At this point, IK has passed the tests for most of these functions under Lucene 4.3. The source code of the synonym extension is given below; interested readers can use it as a reference.




package com.ikforlucene;

import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilterFactory;
import org.apache.solr.core.SolrResourceLoader;
import org.wltea.analyzer.lucene.IKTokenizer;

/**
 * A custom IK analyzer for Lucene that loads a synonym dictionary.
 */
public class IKSynonymsAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {

        Tokenizer token = new IKTokenizer(reader, true); // enable smart segmentation

        // Build the parameters for the SynonymFilterFactory
        Map<String, String> paramsMap = new HashMap<String, String>();
        paramsMap.put("luceneMatchVersion", "LUCENE_43");
        paramsMap.put("synonyms", "E:\\同义词\\synonyms.txt");
        SynonymFilterFactory factory = new SynonymFilterFactory(paramsMap);
        SolrResourceLoader loader = new SolrResourceLoader("");
        try {
            // Load the synonym file through the Solr resource loader
            factory.inform(loader);
        } catch (IOException e) {
            e.printStackTrace();
        }

        return new TokenStreamComponents(token, factory.create(token));
    }
}


As for the synonym part, you can download the IK source code from the official website first and then drop this synonym extension into it; it is very simple and convenient.
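
As a small usage sketch, the analyzer can then be passed to an IndexWriterConfig when building an index, so that IK segmentation plus synonym expansion is applied at index time. This assumes Lucene 4.3; the index path E:\index is only a placeholder:

package com.ikforlucene;

import java.io.File;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexDemo {

    public static void main(String[] args) throws Exception {
        // Every indexed text field will go through IK segmentation plus the synonym filter
        Directory dir = FSDirectory.open(new File("E:\\index")); // placeholder index directory
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, new IKSynonymsAnalyzer());
        IndexWriter writer = new IndexWriter(dir, config);
        // ... add documents with writer.addDocument(...) here ...
        writer.close();
        dir.close();
    }
}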
