Lucene Series: Development Practice

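1.Setting up the development environment

1.1.Downloading Lucene

Lucene is a toolkit for building full-text search features. Download lucene-7.4.0 from the official website and unpack it.
Official website: http://lucene.apache.org/
Version: lucene-7.4.0
JDK requirement: 1.8 or later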

1.2.Required jar files

lucene-core-7.4.0.jar
lucene-analyzers-common-7.4.0.jar

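2.Getting started

2.1.Requirements

Implement a file search feature: given a keyword, find every file whose name or content contains it. Queries on Chinese words must also work, and queries that combine several conditions must be supported.
In this example the source content is a set of files on disk.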

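2.2.Creating the index

2.2.1.Steps

Step 1: Create a Java project and add the jar files.
Step 2: Create an IndexWriter object.

1) Specify where the index library is stored, using a Directory object.
2) Provide an IndexWriterConfig object.

Step 3: Create a Document object.
Step 4: Create Field objects and add them to the Document.
Step 5: Write the Document to the index library with the IndexWriter. This is the point where the index is built; both the index and the Document are written to the index library.
Step 6: Close the IndexWriter.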
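To make the steps above more concrete, here is a toy, JDK-only sketch of what "writing a document to an index library" amounts to conceptually: the field values are stored, and each term is mapped to the ids of the documents that contain it. The class `ToyIndex`, its methods, and the lowercase-and-split analysis step are all assumptions made for illustration. This is not how Lucene stores data internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of an index library (hypothetical, for illustration only).
class ToyIndex {
    // "field:term" -> ids of the documents containing that term in that field
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    // document id -> stored field values
    private final List<Map<String, String>> stored = new ArrayList<>();

    // Rough counterpart of adding a document: store the fields, then index their terms.
    int addDocument(Map<String, String> fields) {
        int docId = stored.size();
        stored.add(new LinkedHashMap<>(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Assumed analysis step: lowercase, then split on anything that is not a letter or digit.
            for (String token : field.getValue().toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(field.getKey() + ":" + token, k -> new TreeSet<>()).add(docId);
                }
            }
        }
        return docId;
    }

    // Rough counterpart of a term query: exact lookup of one term in one field.
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.<Integer>emptySet());
    }

    // Returns the stored (original) value of a field.
    String storedValue(int docId, String field) {
        return stored.get(docId).get(field);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("filename", "apache lucene.txt");
        doc.put("content", "Lucene is a full-text search toolkit");
        index.addDocument(doc);
        for (int id : index.search("filename", "apache")) {
            System.out.println(index.storedValue(id, "filename")); // prints: apache lucene.txt
        }
    }
}
```

The split between `postings` and `stored` mirrors the distinction between indexing a field and storing it: a search only consults the term map, while the stored values are what gets shown in the results.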

2.2.2.Code

package com.bruceliu.test;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

import java.io.File;

/**
 * @author bruceliu
 * @Date: 2019/12/10 16:32
 * @QQ:1241488705
 * @Description:
 */
public class TestLucene {

    //Create the index
    @Test
    public void createIndex() throws Exception {

        //Location of the index library on disk
        //D:\temp\index
        Directory directory = FSDirectory.open(new File("D:\\temp\\index").toPath());
        //The index library can also be kept in memory
        //Directory directory = new RAMDirectory();
        //Create the IndexWriterConfig object (the no-arg constructor uses StandardAnalyzer)
        IndexWriterConfig config = new IndexWriterConfig();
        //Create the IndexWriter object
        IndexWriter indexWriter = new IndexWriter(directory, config);
        //Directory containing the source documents
        File dir = new File("C:\\Users\\bruceliu\\Desktop\\Luncen\\searchsource");
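        for (File f : dir.listFiles()) {
            //Skip subdirectories; only regular files are indexed
            if (!f.isFile()) {
                continue;
            }
            //File name
            String fileName = f.getName();
            //File content (read as UTF-8 here; adjust if the files use another encoding)
            String fileContent = FileUtils.readFileToString(f, "UTF-8");
            //File path
            String filePath = f.getPath();
            //File size in bytes
            long fileSize  = FileUtils.sizeOf(f);
            //Create the file name field
            //First argument: field name
            //Second argument: field value
            //Third argument: whether to store the value
            Field fileNameField = new TextField("filename", fileName, Field.Store.YES);
            //File content field
            Field fileContentField = new TextField("content", fileContent, Field.Store.YES);
            //File path field. Note: TextField analyzes and indexes the value as well as storing it;
            //a store-only field would use StoredField instead
            Field filePathField = new TextField("path", filePath, Field.Store.YES);
            //File size field (stored as text)
            Field fileSizeField = new TextField("size", fileSize + "", Field.Store.YES);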

            //Create the Document object
            Document document = new Document();
            document.add(fileNameField);
            document.add(fileContentField);
            document.add(filePathField);
            document.add(fileSizeField);
            //Build the index entries and write them to the index library
            indexWriter.addDocument(document);
        }
        //Close the IndexWriter
        indexWriter.close();

        System.out.println("Index created successfully");
    }

}

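2.3.Inspecting the index files with Luke

We use luke-7.4.0, which matches the Lucene version and can open index libraries created by Lucene 7.4.0. Note that this Luke release is compiled with JDK 9, so a JDK 9 runtime is needed to run it.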

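2.4.Querying the index

2.4.1.Steps

Step 1: Create a Directory object pointing at the index library.
Step 2: Create an IndexReader object from the Directory.
Step 3: Create an IndexSearcher object from the IndexReader.
Step 4: Create a TermQuery object, specifying the field to search and the keyword.
Step 5: Run the query.
Step 6: Iterate over the results and print them.
Step 7: Close the IndexReader.

2.4.2.Code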


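    //Query the index library
    //Additional imports needed: org.apache.lucene.index.DirectoryReader, IndexReader, Term;
    //org.apache.lucene.search.IndexSearcher, Query, ScoreDoc, TermQuery, TopDocs
    @Test
    public void searchIndex() throws Exception {
        //Location of the index library on disk
        //D:\temp\index
        Directory directory = FSDirectory.open(new File("D:\\temp\\index").toPath());
        //Create the IndexReader object
        IndexReader indexReader = DirectoryReader.open(directory);
        //Create the IndexSearcher object
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        //Create the query. TermQuery does not analyze its term, so the term must
        //match an indexed term exactly (lowercase here)
        Query query = new TermQuery(new Term("filename", "apache"));
        //Run the query
        //First argument: the query object; second argument: maximum number of results to return
        TopDocs topDocs = indexSearcher.search(query, 10);
        //Total number of matching documents
        System.out.println("Total hits: "+ topDocs.totalHits);
        //Iterate over the results
        //topDocs.scoreDocs holds the ids of the matching documents
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            //scoreDoc.doc is the id of the document
            //Look up the Document object by its id
            Document document = indexSearcher.doc(scoreDoc.doc);
            System.out.println(document.get("filename"));
            //System.out.println(document.get("content"));
            System.out.println(document.get("path"));
            System.out.println(document.get("size"));
            System.out.println("-------------------------");
        }
        //Close the IndexReader
        indexReader.close();
    }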

3.Analyzers

3.1.How an analyzer tokenizes text

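  //Show how the standard analyzer tokenizes text
  //Additional imports needed: org.apache.lucene.analysis.Analyzer, TokenStream;
  //org.apache.lucene.analysis.standard.StandardAnalyzer;
  //org.apache.lucene.analysis.tokenattributes.CharTermAttribute, OffsetAttribute
  @Test
  public void testTokenStream() throws Exception {
      //Create a standard analyzer
      Analyzer analyzer = new StandardAnalyzer();
      //Obtain a TokenStream
      //First argument: field name, any value will do here
      //Second argument: the text to analyze
      TokenStream tokenStream = analyzer.tokenStream("test", "The Spring Framework provides a comprehensive programming and configuration model.");
      //Attribute that exposes the text of each term
      CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
      //Attribute that exposes the start and end offset of each term
      OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
      //Reset the stream to the beginning; required before the first incrementToken() call
      tokenStream.reset();
      //Iterate over the terms; incrementToken() returns false when the stream is exhausted
      while(tokenStream.incrementToken()) {
          //Start offset of the term
          System.out.println("start->" + offsetAttribute.startOffset());
          //The term itself
          System.out.println(charTermAttribute);
          //End offset of the term
          System.out.println("end->" + offsetAttribute.endOffset());
      }
      //Signal end of stream, then release resources
      tokenStream.end();
      tokenStream.close();
  }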

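Output (terms are lowercased, and the stop words "the", "a" and "and" are dropped, which is why their offsets are missing):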

start->4
spring
end->10
start->11
framework
end->20
start->21
provides
end->29
start->32
comprehensive
end->45
start->46
programming
end->57
start->62
configuration
end->75
start->76
model
end->81
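The output above can be reproduced with a small JDK-only sketch, which may help show what the analyzer is doing to the text: find word-like runs, lowercase them, drop stop words, and keep the character offsets of whatever remains. The class `ToyEnglishAnalyzer`, the regular expression, and the three-word stop list are assumptions chosen for this sentence; Lucene's real tokenizer and stop-word list are more elaborate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy analyzer (hypothetical): lowercasing, stop-word removal, offsets.
class ToyEnglishAnalyzer {
    // Assumed stop words for this sketch; a tiny subset, not Lucene's list.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "and"));
    // A token is a run of letters or digits.
    static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    static final class Token {
        final String term;
        final int start; // offset of the first character
        final int end;   // offset just past the last character

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            String term = matcher.group().toLowerCase();
            // Stop words are skipped, but offsets still refer to the original text.
            if (!STOP_WORDS.contains(term)) {
                tokens.add(new Token(term, matcher.start(), matcher.end()));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The Spring Framework provides a comprehensive programming and configuration model.";
        for (Token token : analyze(text)) {
            System.out.println("start->" + token.start);
            System.out.println(token.term);
            System.out.println("end->" + token.end);
        }
    }
}
```

Because offsets are measured against the original text, the gap between `end->29` and `start->32` is exactly where the dropped word "a" and its surrounding spaces sit.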

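4.Chinese analyzers

4.2.Chinese analyzers bundled with Lucene

StandardAnalyzer:
Single-character tokenization: Chinese text is split one character at a time. For example, "我爱中国"
becomes "我", "爱", "中", "国".
SmartChineseAnalyzer:
Handles Chinese fairly well but is hard to extend; custom dictionaries, stop-word lists and synonym lists are awkward to manage.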
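The single-character behaviour described for StandardAnalyzer can be imitated in a few lines of plain Java. This is only a sketch of the effect, not of Lucene's implementation; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that mimics single-character tokenization of Chinese text.
class UnigramDemo {
    // Splits the text into one token per Unicode code point.
    static List<String> splitIntoCharacters(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints().forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitIntoCharacters("我爱中国")); // prints: [我, 爱, 中, 国]
    }
}
```

With tokens like these, a search for the word "中国" has to be answered by combining matches on "中" and "国", which is one reason a dictionary-aware analyzer tends to give better results for Chinese.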

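4.3.IKAnalyzer

Usage:
Step 1: Add the jar file to the project.

Step 2: Put the configuration file, the extension dictionary and the stop-word dictionary on the classpath.

Note: hotword.dic and ext_stopword.dic must be saved as UTF-8 without a BOM.
For that reason, do not edit the extension dictionary files with Windows Notepad.

Use EditPlus.exe to save the files as UTF-8 without a BOM.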
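If it is unclear whether a dictionary file already carries a BOM, it can be checked programmatically. The following is a minimal JDK-only sketch: the UTF-8 BOM is the byte sequence EF BB BF at the very start of the file, and Java's UTF-8 decoder does not strip it, so it would end up glued to the first dictionary entry. The class `BomCheck` and its methods are hypothetical helpers and not part of IKAnalyzer; the file path is supplied by the caller.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical utility for detecting and removing a UTF-8 byte order mark.
class BomCheck {
    // True if the content starts with the UTF-8 BOM bytes EF BB BF.
    static boolean hasUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
    }

    // Returns the content without a leading BOM; content without a BOM is returned unchanged.
    static byte[] stripUtf8Bom(byte[] bytes) {
        return hasUtf8Bom(bytes) ? Arrays.copyOfRange(bytes, 3, bytes.length) : bytes;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java BomCheck <dictionary file>");
            return;
        }
        Path dictionary = Paths.get(args[0]);
        byte[] content = Files.readAllBytes(dictionary);
        if (hasUtf8Bom(content)) {
            // Rewrite the file in place without the three BOM bytes.
            Files.write(dictionary, stripUtf8Bom(content));
            System.out.println("BOM removed");
        } else {
            System.out.println("no BOM found");
        }
    }
}
```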

4.4.Testing the custom analyzer


 //Show how the IK analyzer tokenizes text
 @Test
 public void testTokenStream() throws Exception {
     //Create an IK analyzer
     Analyzer analyzer = new IKAnalyzer();
     //Obtain a TokenStream
     //First argument: field name, any value will do here
     //Second argument: the text to analyze
     //TokenStream tokenStream = analyzer.tokenStream("test", "The Spring Framework provides a comprehensive programming and configuration model.");
     TokenStream tokenStream = analyzer.tokenStream("test", "我爱中国");
     //Attribute that exposes the text of each term
     CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
     //Attribute that exposes the start and end offset of each term
     OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
     //Reset the stream to the beginning; required before the first incrementToken() call
     tokenStream.reset();
     //Iterate over the terms; incrementToken() returns false when the stream is exhausted
     while(tokenStream.incrementToken()) {
         //Start offset of the term
         System.out.println("start->" + offsetAttribute.startOffset());
         //The term itself
         System.out.println(charTermAttribute);
         //End offset of the term
         System.out.println("end->" + offsetAttribute.endOffset());
     }
     //Signal end of stream, then release resources
     tokenStream.end();
     tokenStream.close();
 }

Tokenization result:

start->0
我
end->1
start->1
爱
end->2
start->2
中国
end->4
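A result like the one above, where "中国" stays together as one term, is what a dictionary-based segmenter produces. As a rough illustration of the idea, here is a simplified forward maximum matching sketch in plain Java. It is a stand-in technique chosen for brevity and is not IKAnalyzer's actual algorithm; the class name, the one-word dictionary and the maximum word length are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified forward maximum matching (illustrative only, not IKAnalyzer's algorithm).
class ForwardMaxMatch {
    // Works on char indexes, so it assumes text within the Basic Multilingual Plane.
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Start with the longest candidate allowed at this position.
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed one-entry dictionary, just enough for the sample sentence.
        Set<String> dictionary = new HashSet<>(Collections.singletonList("中国"));
        System.out.println(segment("我爱中国", dictionary, 4)); // prints: [我, 爱, 中国]
    }
}
```

Adding a word to the dictionary set changes the segmentation, which is the role the extension dictionary plays for a dictionary-based analyzer.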

4.5.Using the custom analyzer

@Test
public void addDocument() throws Exception {
    //Location of the index library
    Directory directory = FSDirectory.open(new File("D:\\temp\\index").toPath());
    //Pass the IK analyzer to the config so that documents are tokenized with it
    IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());
    //Create the IndexWriter object
    IndexWriter indexWriter = new IndexWriter(directory, config);
//...
}

bruceliu9527
