[Lucene] Apache Lucene Full-Text Search Engine: Architecture Introduction and Practice

Lucene is an open source library for full-text indexing and search, supported and provided by the Apache Software Foundation. Lucene offers a simple yet powerful application programming interface for full-text indexing and searching. It is a mature, free, open source tool for the Java development environment, and has been the most popular free Java information retrieval library in recent years. --"Baidu Encyclopedia"

  This blog post covers two aspects: first it introduces the principles of full-text search in Lucene, and then it shows how to use Lucene through program examples. While researching the principles of full-text search I read several articles online, and I partially referred to two of them when writing this one (their addresses are at the end of the article); thanks to the original authors.

1. Full-text search

  What is full-text search? Suppose you want to find a certain string in a file. The most direct idea is to scan from the beginning until you find it. That is perfectly practical for files with small amounts of data, but it breaks down for large amounts. The same goes for finding which files contain a given string: scanning through dozens of gigabytes on a hard disk this way is, as you can imagine, extremely slow.
  The data in such files is unstructured data, meaning it has no inherent structure. To solve the efficiency problem above, we first extract some of the information from the unstructured data and reorganize it so that it has a certain structure, and then search this structured data, which makes the search comparatively fast. This is full-text search: the process of first building an index and then searching that index.
  So how does Lucene build an index? Suppose there are two documents with the following content:

The content of Article 1 is: Tom lives in Guangzhou, I live in Guangzhou too.
The content of Article 2 is: He once lived in Shanghai.

  The first step is to pass the documents to the tokenizer (Tokenizer), which splits each document into individual words and removes punctuation and stop words. Stop words are words with no special standalone meaning, such as "a", "the", and "too" in English. After tokenization we obtain tokens (Token), as follows:

Article 1 after tokenization: [Tom] [lives] [Guangzhou] [I] [live] [Guangzhou]
Article 2 after tokenization: [He] [lives] [Shanghai]

  The tokens are then passed to the language processing component (Linguistic Processor). For English, this component generally converts letters to lowercase and reduces each word to its root form, e.g. "lives" becomes "live" and "drove" becomes "drive". The result is terms (Term), as follows:

Article 1 after processing: [tom] [live] [guangzhou] [i] [live] [guangzhou]
Article 2 after processing: [he] [live] [shanghai]
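
  These two steps can be observed directly through Lucene's analyzer API. Below is a minimal sketch (the class name is my own; it assumes the Lucene 5.3.1 dependencies introduced later in this article). Note that the StandardAnalyzer used in the sample code only lowercases and removes stop words, while the EnglishAnalyzer from lucene-analyzers-common also applies Porter stemming, which performs the "lives" to "live" reduction described above:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        // EnglishAnalyzer tokenizes, lowercases, removes stop words, and stems
        Analyzer analyzer = new EnglishAnalyzer();
        TokenStream ts = analyzer.tokenStream("contents",
                new StringReader("Tom lives in Guangzhou, I live in Guangzhou too."));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            // prints roughly: [tom] [live] [guangzhou] [i] [live] [guangzhou] [too]
            // (Lucene's built-in English stop-word list does not include "too",
            // so the output differs slightly from the idealized example above)
            System.out.print("[" + term + "] ");
        }
        ts.end();
        ts.close();
        analyzer.close();
    }
}
```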

  Finally, the terms are passed to the indexing component (Indexer), which produces the following index structure:

| Keyword   | Article number [frequency] | Positions |
|-----------|----------------------------|-----------|
| guangzhou | 1[2]                       | 3, 6      |
| he        | 2[1]                       | 1         |
| i         | 1[1]                       | 4         |
| live      | 1[2], 2[1]                 | 2, 5; 2   |
| shanghai  | 2[1]                       | 3         |
| tom       | 1[1]                       | 1         |

  The above is the core of the Lucene index structure. The keywords are sorted alphabetically, so Lucene can use binary search to locate a keyword quickly. In the implementation, Lucene stores the three columns above as a term dictionary file (Term Dictionary), a frequencies file, and a positions file. The dictionary file saves not only each keyword but also pointers into the frequency file and the position file, through which the keyword's frequency and position information can be found.
  The search process first binary-searches the dictionary to find the term, follows the pointer into the frequency file to read all the article numbers, and returns those as the result; the term's occurrences within a particular article can then be located from the position information. So indexing for the first time may be slow, but afterwards there is no need to rebuild the index for every search, so queries are fast. Of course, this describes English search; the rules for Chinese are different, and I will look into the relevant material later.
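
  To make that structure concrete, here is a minimal in-memory sketch of an inverted index with a sorted term dictionary and position-carrying postings. It is purely illustrative and is not how Lucene actually implements it (Lucene keeps the dictionary, frequencies, and positions in separate on-disk files):

```java
import java.util.*;

// Illustrative only: a tiny in-memory inverted index mirroring the table above.
public class TinyInvertedIndex {

    // One posting: which article the term appears in, and at which positions.
    static class Posting {
        final int articleId;
        final List<Integer> positions = new ArrayList<>();
        Posting(int articleId) { this.articleId = articleId; }
    }

    // TreeMap keeps terms sorted, mirroring Lucene's alphabetically
    // ordered term dictionary; lookup is O(log n), like binary search.
    private final TreeMap<String, List<Posting>> dictionary = new TreeMap<>();

    void add(String term, int articleId, int position) {
        List<Posting> postings = dictionary.computeIfAbsent(term, t -> new ArrayList<>());
        Posting last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
        if (last == null || last.articleId != articleId) {
            last = new Posting(articleId);
            postings.add(last);
        }
        last.positions.add(position);
    }

    List<Posting> search(String term) {
        return dictionary.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        String[][] articles = {
            {"tom", "live", "guangzhou", "i", "live", "guangzhou"},
            {"he", "live", "shanghai"}
        };
        for (int id = 0; id < articles.length; id++) {
            for (int pos = 0; pos < articles[id].length; pos++) {
                index.add(articles[id][pos], id + 1, pos + 1);
            }
        }
        for (Posting p : index.search("live")) {
            // prints: article 1 positions [2, 5], then article 2 positions [2]
            System.out.println("article " + p.articleId + " positions " + p.positions);
        }
    }
}
```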

2. Sample code

  Following the analysis above, full-text search takes two steps: build the index first, then search it. To test this process I wrote two Java classes, one to test indexing and one to test retrieval. First create a Maven project; the pom.xml is as follows:

 
```xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>demo.lucene</groupId>
    <artifactId>Lucene01</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <build/>

    <dependencies>
        <!-- Lucene core package -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>5.3.1</version>
        </dependency>
        <!-- Lucene query parser package -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>5.3.1</version>
        </dependency>
        <!-- Lucene analyzer package -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>5.3.1</version>
        </dependency>
    </dependencies>
</project>
```

  Before writing the program, I first need some files to index. I grabbed some English documents at random (Chinese will be studied later) and put them in the D:\lucene\data\ directory. The documents are all dense English text, so I won't include a screenshot.
  Next, let's write the Java class that builds the index:

 
```java
import java.io.File;
import java.io.FileReader;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * Class that builds the index
 * @author Ni Shengwu
 */
public class Indexer {

    private IndexWriter writer; // the index writer instance

    // Constructor: instantiate the IndexWriter
    public Indexer(String indexDir) throws Exception {
        Directory dir = FSDirectory.open(Paths.get(indexDir));
        Analyzer analyzer = new StandardAnalyzer(); // standard analyzer; strips whitespace and stop words such as "is", "a", "the"
        IndexWriterConfig config = new IndexWriterConfig(analyzer); // put the analyzer into the index writer configuration
        writer = new IndexWriter(dir, config); // instantiate the index writer
    }

    // Close the index writer
    public void close() throws Exception {
        writer.close();
    }

    // Index all files under the given directory
    public int indexAll(String dataDir) throws Exception {
        File[] files = new File(dataDir).listFiles(); // get all files under this path
        for (File file : files) {
            indexFile(file); // index each file via indexFile below
        }
        return writer.numDocs(); // return the number of indexed documents
    }

    // Index a single file
    private void indexFile(File file) throws Exception {
        System.out.println("Indexing file: " + file.getCanonicalPath());
        Document doc = getDocument(file); // build the Document for this file
        writer.addDocument(doc); // add the document to the index
    }

    // Build a Document and set its fields; a Document is like a row in a database table
    private Document getDocument(File file) throws Exception {
        Document doc = new Document();
        // add fields
        doc.add(new TextField("contents", new FileReader(file))); // the file contents
        doc.add(new TextField("fileName", file.getName(), Field.Store.YES)); // the file name, stored in the index
        doc.add(new TextField("fullPath", file.getCanonicalPath(), Field.Store.YES)); // the full file path, stored in the index
        return doc;
    }

    public static void main(String[] args) {
        String indexDir = "D:\\lucene"; // where the index will be saved
        String dataDir = "D:\\lucene\\data"; // directory containing the files to index
        Indexer indexer = null;
        int indexedNum = 0;
        long startTime = System.currentTimeMillis(); // record the start time
        try {
            indexer = new Indexer(indexDir);
            indexedNum = indexer.indexAll(dataDir);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                indexer.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        long endTime = System.currentTimeMillis(); // record the end time
        System.out.println("Indexing took " + (endTime - startTime) + " ms");
        System.out.println("Indexed " + indexedNum + " files in total");
    }
}
```

  I wrote the program following the indexing process described above, and the comments explain it clearly, so I won't repeat the details here. Run the main method and look at the results: a total of 7 files were indexed in 649 milliseconds, which is quite fast, and the printed paths of the indexed files are correct. Afterwards you can see that some files have been generated under D:\lucene\; these are the index files.
  Now that we have the index, we can search it for the string we want. I opened one of the files at random and picked an ugly string, "generate-maven-artifacts", as the search target. Before running the search, take a look at the search code:

 
```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Searcher {

    public static void search(String indexDir, String q) throws Exception {

        Directory dir = FSDirectory.open(Paths.get(indexDir)); // open the directory where the index lives
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new StandardAnalyzer(); // standard analyzer; strips whitespace and stop words such as "is", "a", "the"
        QueryParser parser = new QueryParser("contents", analyzer); // query parser for the "contents" field
        Query query = parser.parse(q); // parse the query string into a Query object

        long startTime = System.currentTimeMillis(); // record the search start time
        TopDocs docs = searcher.search(query, 10); // run the query and keep the top 10 hits in docs
        long endTime = System.currentTimeMillis(); // record the search end time
        System.out.println("Matching " + q + " took " + (endTime - startTime) + " ms");
        System.out.println("Found " + docs.totalHits + " matching records");

        for (ScoreDoc scoreDoc : docs.scoreDocs) { // iterate over each hit
            Document doc = searcher.doc(scoreDoc.doc); // scoreDoc.doc is the docID; fetch the document by it
            System.out.println(doc.get("fullPath")); // fullPath is a field we defined when building the index
        }
        reader.close();
    }

    public static void main(String[] args) {
        String indexDir = "D:\\lucene";
        String q = "generate-maven-artifacts"; // the string to search for
        try {
            search(indexDir, q);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```

  Run the main method to see the result: Lucene retrieves the file correctly. If I remove the "-" characters in the middle, it still finds it; but if I strip off the leading characters and search for just "rtifacts", nothing is found. This shows that Lucene's index is organized by whole terms. The problem can be solved, and I will write about it in a follow-up article.
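
  As a preview, one possible approach is a wildcard query; this is just a sketch under my own assumptions (the class name is hypothetical, and the follow-up article may take a different route, such as n-gram analysis). Lucene's WildcardQuery accepts a leading wildcard, at the cost of scanning the term dictionary, which is slow on large indexes:

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.FSDirectory;

public class WildcardSearchSketch {
    public static void main(String[] args) throws Exception {
        // Assumes the index built by Indexer above exists at D:\lucene
        IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("D:\\lucene")));
        IndexSearcher searcher = new IndexSearcher(reader);

        // Match any term in "contents" ending with "rtifacts".
        // A leading wildcard forces a scan of the whole term dictionary.
        Query query = new WildcardQuery(new Term("contents", "*rtifacts"));
        TopDocs docs = searcher.search(query, 10);
        for (ScoreDoc sd : docs.scoreDocs) {
            System.out.println(searcher.doc(sd.doc).get("fullPath"));
        }
        reader.close();
    }
}
```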

Partially referenced from:
http://blog.csdn.net/forfuture1978/article/details/4711308
http://www.cnblogs.com/dewin/archive/2009/11/24/1609905.html
