Step 1: About JDK version
Use at least JDK8 version, please download JDK8 or higher: Download and configure JDK environment
Step 2: Lucene concept
Lucene is an open source library that lets Java developers add search to their own applications and get ranked results similar to those of search engines such as Google and Baidu.
Step 3: Run first, see the effect, and then learn
As usual, first download the runnable project from the download area, configure it and run it, and then study the steps that produce this effect.
Run the TestLucene class; you should see the effect shown in the figure.
There are 10 pieces of data in total, and the keyword search returns 6 hits. Different hits have different match scores. The first hit scores highest because it contains both 护眼 (eye protection) and 带光源 (with light source). The other hits score lower: they do not match the eye protection keyword, only the light source keyword.
Step 4: Imitate and troubleshoot
After confirming that the runnable project runs correctly, follow the steps of the tutorial and write the code again yourself.
While doing so, your code will inevitably differ from the original somewhere, and you will not get the expected result. When that happens, compare the correct answer (the runnable project) with your own code to locate the problem.
Learning this way is effective and troubleshooting is efficient, which noticeably speeds up learning and helps you get past the obstacles along the way.
DiffMerge is recommended for folder comparison. Compare your own project folder with the runnable project folder.
It shows which files differ between the two folders and clearly marks the differences inside them.
A portable (no-install) version and a usage tutorial are available here: diffmerge download and usage tutorial
Step 5: Lucene version
The Lucene version used here is 7.2.1, the latest release as of March 9, 2018.
Step 6: jar package
All required jar packages are included in the project and can be used directly, including a Chinese word segmenter compatible with Lucene 7.2.1.
Step 7: TestLucene.java
This is the complete code of TestLucene.java. The code is explained in detail in the following steps.
Step 8: Word segmenter
Prepare a Chinese word segmenter. The concepts behind word segmenters are explained in detail in the word segmenter concept tutorial. For now, just use it.
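To give a feel for what a segmenter does, here is a toy sketch in plain Java. It is not the segmenter bundled with the project, which is not shown here and may work quite differently (for example, by dictionary lookup). The sketch simply splits text into overlapping two-character terms, an approach similar in spirit to the bigram indexing of Lucene's CJKAnalyzer. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: splits text into overlapping two-character terms.
// This is an assumed stand-in, not the tutorial's Chinese word segmenter.
class BigramSegmenterSketch {

    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<>();
        // Work on code points so that characters outside the BMP stay intact.
        int[] codePoints = text.codePoints().toArray();
        if (codePoints.length == 1) {
            terms.add(text); // a single character is its own term
        }
        for (int i = 0; i + 1 < codePoints.length; i++) {
            terms.add(new String(codePoints, i, 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Chinese has no spaces between words, so the text must be cut into terms somehow.
        System.out.println(bigrams("护眼带光源")); // [护眼, 眼带, 带光, 光源]
    }
}
```

Terms such as 护眼 and 光源 are what later get matched against the index. A dictionary-based segmenter would avoid meaningless terms like 眼带.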
Step 9: Create the index
1. First prepare 10 pieces of data.
These 10 pieces of data are all strings, comparable to rows in a product table.
2. Add them to the index through the createIndex method.
Create an in-memory index. One reason the search is fast in this example is that the index is held in memory, so lookups avoid disk reads. More generally, Lucene looks keywords up in an index built in advance rather than scanning every record the way an unindexed database query does.
Create a configuration object based on the Chinese word segmenter.
Create the index writer.
Iterate over the 10 pieces of data and add them to the index one by one.
Create a Document for each piece of data and add it to the index. Each Document has one field, called "name". The query created on line 49 of TestLucene.java specifies this field as the one to search.
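The idea behind an index can be sketched without Lucene. The toy example below builds an inverted index in plain Java: a map from each term to the ids of the documents that contain it. It is only an illustration of the concept, not Lucene's implementation. The sample product names, the whitespace splitting, and all class and method names are assumptions made for the sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: term -> ids of the documents whose "name" contains that term.
// Assumed illustration, not the tutorial's createIndex method.
class InvertedIndexSketch {

    static Map<String, Set<Integer>> buildIndex(List<String> names) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < names.size(); docId++) {
            // Whitespace splitting stands in for a real word segmenter.
            for (String term : names.get(docId).toLowerCase().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, playing the role of the product names.
        List<String> names = Arrays.asList(
                "eye protection lamp with light source",
                "led light source",
                "desk lamp");
        Map<String, Set<Integer>> index = buildIndex(names);
        System.out.println(index.get("light")); // [0, 1]
        System.out.println(index.get("lamp"));  // [0, 2]
    }
}
```

Looking up a term is then a single map access instead of a scan over every record, which is the main reason an index-based search scales well.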
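Step 10: Create the query
Build a query for the keyword 护眼带光源 (eye protection, with light source) against the "name" field. This is the same "name" field given to each Document in the create-index step, and it is comparable to a column name in a table.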
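Step 11: Execute the search
Next, execute the search:
Create the index reader:
Create a searcher based on the reader:
Specify how many results to show per page:
Execute the search: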
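Step 12: Display the query results
Each element of the ScoreDoc[] hits array is one search result. First, iterate over the array.
Then get the docid of the current result. The docid is effectively the primary key of that piece of data in the index.
Then use the docid to fetch the corresponding Document from the index through the searcher.
Next, print the data in the Document. Although the Document currently has only the name field, the code iterates over all fields and prints each value, so it needs no changes when a Document has several fields.
scoreDoc.score is the match score of the current hit. The higher the score, the better the match.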
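The relationship between hits, docid and score can be mimicked in a few lines of plain Java. In the toy sketch below, the Hit class plays a role similar to ScoreDoc (a document id plus a score), the score is simply the number of query terms found in the name, and only the best n hits are returned. Lucene's real scoring formula is more sophisticated; the data, the scoring rule, and all names here are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy search: returns the top n hits as (doc, score) pairs, best score first.
class TopHitsSketch {

    // Plays a role similar to ScoreDoc: the id of a document and its match score.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+")));
    }

    static List<Hit> search(List<String> names, String query, int n) {
        Set<String> queryTerms = terms(query);
        List<Hit> hits = new ArrayList<>();
        for (int doc = 0; doc < names.size(); doc++) {
            Set<String> docTerms = terms(names.get(doc));
            int matched = 0;
            for (String term : queryTerms) {
                if (docTerms.contains(term)) {
                    matched++;
                }
            }
            if (matched > 0) {
                hits.add(new Hit(doc, matched)); // assumed score: number of matched terms
            }
        }
        // Highest score first; ties are broken by the lower document id.
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed()
                .thenComparingInt(h -> h.doc));
        return hits.subList(0, Math.min(n, hits.size()));
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "eye protection lamp with light source",
                "led light source",
                "wooden desk",
                "eye protection glasses");
        for (Hit hit : search(names, "eye protection light source", 2)) {
            // The document id is used to fetch the stored data, like a primary key.
            System.out.println(hit.doc + "\t" + hit.score + "\t" + names.get(hit.doc));
        }
    }
}
```

With this sample data the first name matches all four query terms and comes first, while names that match only part of the query follow with lower scores.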
Step 13: Run results
As shown in the figure, there are 10 pieces of data in total, and the keyword query returns 6 hits. Different hits have different match scores. The first hit scores highest because it contains both 护眼 (eye protection) and 带光源 (with light source). The other hits score lower: they do not match the eye protection keyword, only the light source keyword.
Step 14: Differences from like
A like query can also search, so what is different about using Lucene? There are two main points:
1. Relevance
Looking at the run results, you can see that hits with different degrees of relevance are returned. like cannot do this.
2. Performance
When the amount of data is small, like performs very well, but with a large amount of data its performance is much worse. The next tutorial demonstrates a query over 140,000 pieces of data.
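The relevance point can be illustrated with a toy comparison in plain Java. Here String.contains stands in for a single like '%keyword%' pattern, and a count of matched terms stands in for a relevance score. Neither is how a database or Lucene actually works internally; the sample names and all identifiers are assumptions for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy comparison of a substring match (like-style) with a term-based match.
class LikeVsTermsSketch {

    // Rough analogue of: WHERE name LIKE '%keyword%'
    static List<String> likeMatch(List<String> names, String keyword) {
        List<String> matches = new ArrayList<>();
        for (String name : names) {
            if (name.contains(keyword)) {
                matches.add(name);
            }
        }
        return matches;
    }

    // Number of distinct query terms that appear in the name (assumed relevance measure).
    static int matchedTerms(String name, String query) {
        Set<String> nameTerms = new HashSet<>(Arrays.asList(name.split("\\s+")));
        Set<String> queryTerms = new HashSet<>(Arrays.asList(query.split("\\s+")));
        queryTerms.retainAll(nameTerms);
        return queryTerms.size();
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "eye protection lamp with light source",
                "lamp with light source",
                "eye protection lamp");
        String query = "eye protection light source";

        // The whole keyword never appears as one contiguous substring, so nothing matches.
        System.out.println(likeMatch(names, query)); // []

        // Term matching still finds every name and tells how well each one matches.
        for (String name : names) {
            System.out.println(matchedTerms(name, query) + "\t" + name);
        }
    }
}
```

A substring match is all or nothing, while term matching yields a graded result that can be sorted by how much of the query each name covers.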
Step 15: Overview of the approach
Now that you have worked through Lucene once yourself and have a feel for it, here is the overall approach:
1. Collect the data first. The data can come from the file system, a database, the network, or manual entry, or be written directly in memory as in this example
2. Create an index from the data
3. The user enters a keyword
4. Create a query from the keyword
5. Fetch data from the index using the query
6. Display the query results to the user
For more information, see: https://how2j.cn/k/search-engine/search-engine-intro/1672.html