Learning Lucene: introduction and project setup

1. Introduction to full-text search

1.1. What is full-text search?

I finally have time to write about the full-text search material I studied earlier. According to Baidu's definition, the concept of full-text search is:

Full-text search is a technique in which a program scans every word in a document and builds an index for each word, recording how many times and where it appears. When the user issues a query, the search runs against this index, much like looking a word up in the retrieval table of a dictionary.

In plain terms, full-text search breaks the query down into terms and then retrieves the stored information that matches those terms. The best example is Baidu:


The query "before you came we were already champions" is broken into terms such as "before you came", "already", and "champion", which are then matched against the indexed information. This style of search is full-text search.
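To make the idea concrete, here is a minimal sketch of an inverted index in Java: a map from each word to the positions where it occurs. This is only my own illustration of the concept, not how Lucene (or Baidu) actually implements it.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndexDemo {

    public static void main(String[] args) {
        String document = "before you came we were already champions";

        // Inverted index: word -> list of positions where the word appears
        Map<String, List<Integer>> index = new HashMap<>();
        String[] words = document.split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            index.computeIfAbsent(words[pos], w -> new ArrayList<>()).add(pos);
        }

        // "Searching" is now a lookup in the index instead of a scan over the text
        System.out.println("champions -> " + index.get("champions"));
        System.out.println("already   -> " + index.get("already"));
    }
}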

1.2. What is the difference between full-text search and a database query?

Reading this far, some people may say that full-text search and database queries are both just queries, so there should be little difference between them!

The difference can be summed up as follows.


If databases could solve the full-text search problem, there would have been no need for Lucene to appear. Database queries match in a rather rigid way; at best you can add a leading and trailing wildcard, which is quite limiting:

select xxx from xxx where xxx like '%before you came we were already champions%'

With a statement like this, querying with one word more or one word fewer can produce very different results. And when the data set is large, say tens of millions or even billions of records, the efficiency of a database query becomes quite low.

Full-text search was born precisely to deal with this situation.

Full-text search can handle millions of records while staying efficient, and the results it returns are more intelligent.

2. Introduction to Lucene

Official website: http://lucene.apache.org/

What is Lucene, and why introduce it?

Because Lucene is an open-source full-text indexing and search library supported by the Apache Software Foundation. Lucene provides a simple yet powerful application programming interface (API) for full-text indexing and search, and in a Java development environment it is a mature, free, open-source tool.

Of course, you can also build on Solr or Elasticsearch, but this article is about developing with Lucene, so the others will not be covered here. It is worth mentioning, though, that Solr and Elasticsearch are both search products built on top of Lucene.

2.1. How Lucene full-text search works

Here is a brief overview of the principle behind Lucene's full-text search:

Lucene can be understood through an analogy with a database: Lucene tokenizes the text of the source information word by word and builds an index library from it (the analogue of the database). You can then query this pre-built index through Lucene and get data back in the form of Document objects. In Lucene, a Document corresponds to a row in a database table, and a Field corresponds to a column.
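As a rough sketch of what this looks like in code (assuming Lucene 8.1.0 with lucene-core and lucene-queryparser on the classpath, a local indexDir directory, and the built-in StandardAnalyzer instead of IKAnalyzer for brevity; the field names id and content are my own choices):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneHelloWorld {

    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("indexDir"));

        // A Document is the analogue of a database row, a Field of a column
        Document doc = new Document();
        doc.add(new StringField("id", "1", Field.Store.YES));
        doc.add(new TextField("content",
                "before you came we were already champions", Field.Store.YES));

        // Write the document into the index library
        try (IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.addDocument(doc);
        }

        // Query the index and get Document objects back
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("content", new StandardAnalyzer()).parse("champions");
            for (ScoreDoc sd : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(sd.doc).get("content"));
            }
        }
    }
}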

Note: Lucene is suited to indexing plain text (txt). In theory doc, ppt, xls and similar formats can also be indexed, but the test results are poor. For pdf and other document formats, a tool such as POI should be used to convert the data to text first, and the result indexed afterwards.
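For example, a small sketch of that conversion step using Apache POI (the poi-ooxml dependency, the file name sample.docx, and the class name are my assumptions, not part of the original post):

import java.io.FileInputStream;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class PoiExtractDemo {

    public static void main(String[] args) throws Exception {
        // Extract plain text from a .docx file, then index that text with Lucene
        try (FileInputStream in = new FileInputStream("sample.docx");
             XWPFDocument docx = new XWPFDocument(in);
             XWPFWordExtractor extractor = new XWPFWordExtractor(docx)) {
            String text = extractor.getText();
            System.out.println(text); // feed this string into a Lucene TextField
        }
    }
}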

Figure (omitted) from: https://blog.csdn.net/weixin_42633131/article/details/82873731

2.2. Tokenizers

A tokenizer is what Lucene uses to extract terms from text and implement its indexing. Although Lucene ships with several tokenizers of its own, they only work well for English; their results on Chinese are not good. So here I recommend a Chinese tokenizer: IKAnalyzer.
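To see what a built-in Lucene analyzer produces for English text, here is a small sketch using StandardAnalyzer (my own example; the sentence and the field name "content" are arbitrary, and the IKAnalyzer equivalent is shown in section 3.1.4):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StandardAnalyzerDemo {

    public static void main(String[] args) throws Exception {
        try (Analyzer analyzer = new StandardAnalyzer()) {
            // Tokenize an English sentence with the built-in analyzer
            TokenStream ts = analyzer.tokenStream("content",
                    "before you came we were already champions");
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
            ts.close();
        }
    }
}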

2.2.1. The IKAnalyzer tokenizer

IK is a relatively long-standing and well-known Chinese tokenizer. It segments well and lets you add new terms, which makes it a good choice.

The disadvantage is that it stopped being updated after the 2012 release. There is a view that IK has reached the ceiling of Chinese word segmentation, that the product is essentially finished and needs no further updates, and that all that remains is adding new words to its extension files. However, the original IK only supports Lucene 4, not later versions; to support a newer version of Lucene you need to modify the IK tokenizer's source code.

For recent update details maintained on GitHub, see: https://github.com/magese/ik-analyzer-solr

3. Using Lucene 8.1.0 in Java

3.1. Importing the required packages

Download the Lucene jar packages from the official website: http://lucene.apache.org/

Download the IKAnalyzer jar package from GitHub.

Import the jar packages into the project.
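The exact file names depend on the version you download, but as a rough guide (my own summary, not the original post's screenshot) the examples in this article need roughly:

lucene-core-8.1.0.jar

lucene-queryparser-8.1.0.jar (only if you use the query-parsing sketch above)

the IKAnalyzer jar built for Lucene/Solr 8, from the GitHub repository mentioned above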

Also bring in the IKAnalyzer-related files:

IKAnalyzer.cfg.xml

ext.dic

stopword.dic

3.1.1. IKAnalyzer.cfg.xml

The IKAnalyzer.cfg.xml file should be placed in the src directory (the classpath root); otherwise the tokenizer will not pick up the extension dictionary or the stop-word dictionary. This configuration file points to the two files ext.dic and stopword.dic.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionary here -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- Users can configure their own extension stop-word dictionary here -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>

3.1.2. ext.dic

ext.dic is the extension dictionary: a word added to it is treated as a single term and will no longer be split apart by the tokenizer.

3.1.3. stopword.dic

stopword.dic is the stop-word dictionary: during tokenization, any token that matches an entry in this dictionary is filtered out.
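Both dictionary files share the same plain-text format: one entry per line, saved as UTF-8 (without a BOM, as far as I know, since a BOM can keep the first entry from loading). A hypothetical example with placeholder entries (in practice these would usually be Chinese terms):

ext.dic:
full-text search
champions league

stopword.dic:
a
the
of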

3.1.4. Tokenization test

Here you can write a small method to test the tokenization:

import java.io.IOException;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class IkanalyzerTest {

    private static void analysisString() {
        String text = "before you came we were already champions";
        StringReader sr = new StringReader(text);
        // true enables smart (coarse-grained) segmentation
        IKSegmenter ik = new IKSegmenter(sr, true);
        Lexeme lex = null;
        try {
            while ((lex = ik.next()) != null) {
                // System.out.print(lex.getLexemeText() + "|");
                System.out.println(lex.getLexemeText());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // close the stream resource
            if (sr != null) {
                sr.close();
            }
        }
    }

    public static void main(String[] args) {
        analysisString();
    }
}


Test results:

Loaded extension dictionary: ext.dic

Loaded extension stop-word dictionary: stopword.dic

before

you

came

we

were

already

champions


Source: https://www.cnblogs.com/bestlmc/p/11865658.html