Getting started with Lucene 7, a search engine library

Step 1: JDK version

Use JDK 8 or later. If it is not installed yet, download and configure a JDK first.

Step 2: Lucene concept

Lucene is an open-source library that lets Java developers add full-text search to their own applications, with ranked results similar to those of search engines such as Google or Baidu.

Step 3: Run first, see the effect, and then learn

As usual, first download the runnable project from the download area, configure it, and run it; then study the steps that produce the result.
Run the TestLucene class; the expected output is shown in the figure.
There are 10 pieces of data in total, and the keyword search returns 6 hits. Each hit has its own matching score. The first hit scores highest because it matches both "eye protection" (护眼) and "with light source" (带光源); the others score lower because they match only the "light source" keyword, not "eye protection".

[Figure: expected output of TestLucene]

Step 4: imitate and troubleshoot

Once the runnable project works, follow the tutorial steps and write the code yourself.
While doing so you will inevitably introduce differences that keep your code from producing the expected result. When that happens, compare the correct answer (the runnable project) with your own code to locate the problem.
Learning this way is effective and troubleshooting is efficient, which noticeably speeds up progress.

DiffMerge is recommended for folder comparison: compare your own project folder with the runnable project folder.
It shows which files in the two folders differ and marks the differences clearly.
A portable (no-install) setup and usage guide is available: DiffMerge download and usage tutorial

Step 5: Lucene version

The Lucene version used here is 7.2.1, the latest release as of 2018-03-09.

Step 6: jar package

All required jar packages are included in the project and can be used directly, including a Chinese word segmenter (IKAnalyzer) compatible with Lucene 7.2.1.

[Figure: jar packages]

Step 7: TestLucene.java

This is the complete code of TestLucene.java; it is explained step by step below.

package com.how2java;

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class TestLucene {

    public static void main(String[] args) throws Exception {
        // 1. Prepare the Chinese analyzer (word segmenter)
        IKAnalyzer analyzer = new IKAnalyzer();

        // 2. Build the index
        List<String> productNames = new ArrayList<>();
        productNames.add("飞利浦led灯泡e27螺口暖白球泡灯家用照明超亮节能灯泡转色温灯泡");
        productNames.add("飞利浦led灯泡e14螺口蜡烛灯泡3W尖泡拉尾节能灯泡暖黄光源Lamp");
        productNames.add("雷士照明 LED灯泡 e27大螺口节能灯3W球泡灯 Lamp led节能灯泡");
        productNames.add("飞利浦 led灯泡 e27螺口家用3w暖白球泡灯节能灯5W灯泡LED单灯7w");
        productNames.add("飞利浦led小球泡e14螺口4.5w透明款led节能灯泡照明光源lamp单灯");
        productNames.add("飞利浦蒲公英护眼台灯工作学习阅读节能灯具30508带光源");
        productNames.add("欧普照明led灯泡蜡烛节能灯泡e14螺口球泡灯超亮照明单灯光源");
        productNames.add("欧普照明led灯泡节能灯泡超亮光源e14e27螺旋螺口小球泡暖黄家用");
        productNames.add("聚欧普照明led灯泡节能灯泡e27螺口球泡家用led照明单灯超亮光源");
        Directory index = createIndex(analyzer, productNames);

        // 3. Build the query
        String keyword = "护眼带光源";
        Query query = new QueryParser("name", analyzer).parse(keyword);

        // 4. Search
        IndexReader reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        int numberPerPage = 1000;
        System.out.printf("Total number of records: %d%n", productNames.size());
        System.out.printf("Search keyword: \"%s\"%n", keyword);
        ScoreDoc[] hits = searcher.search(query, numberPerPage).scoreDocs;

        // 5. Show the search results
        showSearchResults(searcher, hits, query, analyzer);

        // 6. Close the reader
        reader.close();
    }

    private static void showSearchResults(IndexSearcher searcher, ScoreDoc[] hits, Query query, IKAnalyzer analyzer)
            throws Exception {
        System.out.println("Found " + hits.length + " hits.");
        System.out.println("No.\tScore\tResult");
        for (int i = 0; i < hits.length; ++i) {
            ScoreDoc scoreDoc = hits[i];
            int docId = scoreDoc.doc;
            Document d = searcher.doc(docId);
            List<IndexableField> fields = d.getFields();
            System.out.print((i + 1));
            System.out.print("\t" + scoreDoc.score);
            for (IndexableField f : fields) {
                System.out.print("\t" + d.get(f.name()));
            }
            System.out.println();
        }
    }

    private static Directory createIndex(IKAnalyzer analyzer, List<String> products) throws IOException {
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter writer = new IndexWriter(index, config);

        for (String name : products) {
            addDoc(writer, name);
        }
        writer.close();
        return index;
    }

    private static void addDoc(IndexWriter w, String name) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("name", name, Field.Store.YES));
        w.addDocument(doc);
    }
}

Step 8: Analyzer (word segmenter)

Prepare a Chinese word segmenter. Word segmentation is covered in detail in the separate "word segmenter concept" tutorial; for now, simply use it.

// 1. Prepare the Chinese analyzer (word segmenter)
IKAnalyzer analyzer = new IKAnalyzer();

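Step 9: Create the index

1. First prepare the data.
Each item is a plain string, standing in for a row of a product table. (The original tutorial counts 10 items; the listing reproduced here contains nine productNames.add calls.)
2. Add the data to the index through the createIndex method.

Create an in-memory index. RAMDirectory keeps the whole index in memory, which suits a small demo like this one. Lucene's speed advantage over a database LIKE query comes mainly from its inverted index, which maps each term to the documents containing it, rather than from the index being held in memory.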

Directory index = new RAMDirectory();
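
To illustrate why an index lookup is fast, here is a minimal plain-Java sketch of an inverted index: a map from each term to the ids of the documents containing it, so a search looks terms up instead of scanning every record. This is not Lucene's actual implementation. The class ToyInvertedIndex, its methods, and the English product names are hypothetical stand-ins, and whitespace splitting takes the place of a real analyzer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: term -> sorted ids of the documents containing it.
// Hypothetical illustration only; Lucene's real index is far more elaborate.
class ToyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Lower-case and split on whitespace; a stand-in for a real analyzer.
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).trim().split("\\s+");
    }

    // Record every non-empty term of the document under the given id.
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // A lookup is one map access; it does not scan the documents.
    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        // Made-up English product names standing in for the tutorial's data.
        List<String> docs = Arrays.asList(
                "Philips LED bulb warm white",
                "Philips eye protection desk lamp with light source",
                "Opple LED bulb bright light source");
        ToyInvertedIndex index = new ToyInvertedIndex();
        for (int id = 0; id < docs.size(); id++) {
            index.add(id, docs.get(id));
        }
        System.out.println(index.lookup("light"));   // [1, 2]
        System.out.println(index.lookup("philips")); // [0, 1]
        System.out.println(index.lookup("candle"));  // []
    }
}
```

The cost of a lookup depends on the number of matching documents rather than on the total number of documents, which is the property a full-table LIKE scan lacks.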


Create a configuration object based on the Chinese analyzer:

IndexWriterConfig config = new IndexWriterConfig(analyzer);


Create the index writer:

IndexWriter writer = new IndexWriter(index, config);


Iterate over the data and add each item to the index:

for (String name : products) {
    addDoc(writer, name);
}


Each item becomes one Document, which is added to the index. The Document has a single field called "name"; the query parser created in Step 10 is told to search this field.

private static void addDoc(IndexWriter w, String name) throws IOException {
    Document doc = new Document();
    doc.add(new TextField("name", name, Field.Store.YES));
    w.addDocument(doc);
}

// 2. Build the index
List<String> productNames = new ArrayList<>();
productNames.add("飞利浦led灯泡e27螺口暖白球泡灯家用照明超亮节能灯泡转色温灯泡");
productNames.add("飞利浦led灯泡e14螺口蜡烛灯泡3W尖泡拉尾节能灯泡暖黄光源Lamp");
productNames.add("雷士照明 LED灯泡 e27大螺口节能灯3W球泡灯 Lamp led节能灯泡");
productNames.add("飞利浦 led灯泡 e27螺口家用3w暖白球泡灯节能灯5W灯泡LED单灯7w");
productNames.add("飞利浦led小球泡e14螺口4.5w透明款led节能灯泡照明光源lamp单灯");
productNames.add("飞利浦蒲公英护眼台灯工作学习阅读节能灯具30508带光源");
productNames.add("欧普照明led灯泡蜡烛节能灯泡e14螺口球泡灯超亮照明单灯光源");
productNames.add("欧普照明led灯泡节能灯泡超亮光源e14e27螺旋螺口小球泡暖黄家用");
productNames.add("聚欧普照明led灯泡节能灯泡e27螺口球泡家用led照明单灯超亮光源");
Directory index = createIndex(analyzer, productNames);

private static Directory createIndex(IKAnalyzer analyzer, List<String> products) throws IOException {
    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter writer = new IndexWriter(index, config);

    for (String name : products) {
        addDoc(writer, name);
    }
    writer.close();
    return index;
}

private static void addDoc(IndexWriter w, String name) throws IOException {
    Document doc = new Document();
    doc.add(new TextField("name", name, Field.Store.YES));
    w.addDocument(doc);
}

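Step 10: Create the query

Build a query for the keyword 护眼带光源 against the "name" field. This is the "name" field given to each Document in the indexing step; it is comparable to a column name in a table.

String keyword = "护眼带光源";
Query query = new QueryParser("name", analyzer).parse(keyword);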

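Step 11: Run the search

Next, run the search.
Create the index reader:

IndexReader reader = DirectoryReader.open(index);


Create a searcher on top of the reader:

IndexSearcher searcher = new IndexSearcher(reader);


Specify how many results to fetch. The variable is named numberPerPage, but the second argument of search is the maximum number of top hits to return:

int numberPerPage = 1000;


Run the search: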

ScoreDoc[] hits = searcher.search(query, numberPerPage).scoreDocs;

// 4. Search
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
int numberPerPage = 1000;
System.out.printf("Total number of records: %d%n", productNames.size());
System.out.printf("Search keyword: \"%s\"%n", keyword);

ScoreDoc[] hits = searcher.search(query, numberPerPage).scoreDocs;
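
Since search returns the top hits as one ranked array, real paging has to be done on top of it. One simple approach, sketched below in plain Java, is to fetch the top hits once and slice the requested page out of the array. The Paging class and its pageOf helper are hypothetical and not part of Lucene; they accept any array, so a ScoreDoc[] would presumably fit as well.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: slices one page out of a ranked array.
class Paging {
    // pageNumber is 1-based; an out-of-range page yields an empty array.
    static <T> T[] pageOf(T[] hits, int pageNumber, int pageSize) {
        if (pageNumber < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNumber and pageSize must be >= 1");
        }
        // Compute the start offset in long arithmetic to avoid int overflow.
        long start = (long) (pageNumber - 1) * pageSize;
        int from = (int) Math.min(start, hits.length);
        int to = (int) Math.min((long) from + pageSize, hits.length);
        return Arrays.copyOfRange(hits, from, to);
    }

    public static void main(String[] args) {
        String[] hits = {"hit1", "hit2", "hit3", "hit4", "hit5"};
        System.out.println(Arrays.toString(pageOf(hits, 1, 2))); // [hit1, hit2]
        System.out.println(Arrays.toString(pageOf(hits, 3, 2))); // [hit5]
        System.out.println(Arrays.toString(pageOf(hits, 4, 2))); // []
    }
}
```

Slicing is fine for a small result set like this one. For deep paging over large indexes, Lucene's IndexSearcher.searchAfter is worth looking at instead.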

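Step 12: Display the search results

Each element of the ScoreDoc[] hits array is one search result. First iterate over them:

for (int i = 0; i < hits.length; ++i) {
    ScoreDoc scoreDoc = hits[i];


Then get the docId of the current result. The docId plays the role of a primary key for this record in the index (it is an internal id of the current reader, not a stable business key):

int docId = scoreDoc.doc;


Using the docId, fetch the corresponding Document from the index through the searcher:

Document d = searcher.doc(docId);


Then print the data in the Document. Although the Document currently has only the name field, the code iterates over all fields and prints each value, so it needs no change when a Document has several fields.
scoreDoc.score is the matching score of the current hit; the higher the score, the better the match.

List<IndexableField> fields = d.getFields();
System.out.print((i + 1));
System.out.print("\t" + scoreDoc.score);
for (IndexableField f : fields) {
    System.out.print("\t" + d.get(f.name()));
}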

private static void showSearchResults(IndexSearcher searcher, ScoreDoc[] hits, Query query, IKAnalyzer analyzer)
        throws Exception {
    System.out.println("Found " + hits.length + " hits.");
    System.out.println("No.\tScore\tResult");
    for (int i = 0; i < hits.length; ++i) {
        ScoreDoc scoreDoc = hits[i];
        int docId = scoreDoc.doc;
        Document d = searcher.doc(docId);
        List<IndexableField> fields = d.getFields();
        System.out.print((i + 1));
        System.out.print("\t" + scoreDoc.score);
        for (IndexableField f : fields) {
            System.out.print("\t" + d.get(f.name()));
        }
        System.out.println();
    }
}

Step 13: Run result

As shown in the figure, there are 10 pieces of data in total, and the keyword search returns 6 hits. Each hit has its own matching score. The first hit scores highest because it matches both "eye protection" (护眼) and "with light source" (带光源); the others score lower because they match only the "light source" keyword, not "eye protection".

[Figure: run result]

Step 14: Difference from LIKE

A SQL LIKE query can also search, so what does Lucene add? There are two main points:
1. Relevance
The run result shows hits with different degrees of relevance, ranked by score. LIKE cannot do this.
2. Performance
With a small amount of data, LIKE performs very well, but with a large amount of data its performance is much worse. The next tutorial demonstrates a query over 140,000 records.
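
To make the relevance point concrete, here is a small plain-Java sketch. It uses neither Lucene nor a database: String.contains stands in for LIKE '%keyword%', and a simple count of matching terms stands in for Lucene's relevance score, which is computed differently. The class name, the English product names, and the hand-split query terms are illustrative assumptions. It mirrors the tutorial's data, where the keyword 护眼带光源 does not appear verbatim in any product name.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical comparison of substring matching and term-based scoring.
class LikeVersusTerms {
    // Equivalent in spirit to: WHERE name LIKE '%keyword%'
    static boolean likeMatch(String name, String keyword) {
        return name.contains(keyword);
    }

    // Crude relevance: the number of query terms that occur in the name.
    static int termScore(String name, List<String> terms) {
        int score = 0;
        for (String term : terms) {
            if (name.contains(term)) {
                score++;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        // Made-up English product names standing in for the tutorial's data.
        List<String> names = Arrays.asList(
                "desk lamp eye protection reading with light source",
                "led bulb e14 warm light source",
                "led bulb e27 energy saving");
        String keyword = "eye protection light source";
        // Terms split by hand; an analyzer would normally do this.
        List<String> terms = Arrays.asList("eye protection", "light source");
        for (String name : names) {
            System.out.printf("like=%b score=%d %s%n",
                    likeMatch(name, keyword), termScore(name, terms), name);
        }
    }
}
```

The whole-keyword substring test matches none of the three names, while the term count separates them into scores of 2, 1, and 0, which is the kind of ranking a search engine provides.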

Step 15: Idea map

Having worked through Lucene once by hand, you now have a feel for it. To summarise the overall flow:
1. Collect the data. It can come from the file system, a database, the network, manual input, or, as in this example, be written directly in memory.
2. Create an index from the data.
3. The user enters a keyword.
4. Create a query from the keyword.
5. Use the query to fetch data from the index.
6. Display the results to the user.

[Figure: idea map]


For more information, see: https://how2j.cn/k/search-engine/search-engine-intro/1672.html

Origin blog.csdn.net/java_zdc/article/details/105861056