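Full-text search workflow with Lucene

Creating the index

Obtain the documents
- Original documents: the data you want to search over.
- Search engines: obtain the original documents with a crawler.
- Site search: the data in a database.
- Local search: read the files on disk directly with IO streams.
Analyze the documents (each document is split into fields, and each field is then tokenized)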
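Build Term objects
- Each keyword is wrapped in a Term object. A Term has two parts: the field the keyword belongs to (the field name) and the keyword itself (the field value).
- The keywords come from tokenizing the field text: split the string on whitespace to get a word list, convert every word to lowercase, strip punctuation, and remove stop words.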
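The tokenization steps listed above (split on whitespace, lowercase, strip punctuation, drop stop words) can be sketched in plain Java. This is only an illustration of the idea, not how Lucene's analyzers are implemented; the class name and the stop-word list are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SimpleTokenizerSketch {
    // Hypothetical stop-word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // 1) Split the text on whitespace.
        for (String raw : text.split("\\s+")) {
            // 2) Lowercase, then 3) drop every character that is not a letter or digit.
            String word = raw.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
            // 4) Skip empty strings and stop words.
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [lucene, search, library, written, in, java]
        System.out.println(tokenize("Lucene is a search library, written in Java."));
    }
}
```

Whitespace splitting only works for languages that separate words with spaces, which is why the article switches to a dedicated Chinese analyzer later on.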
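Build Document objects
- Create one Document object for each original document.
- Each Document object contains multiple fields, whose Terms (from the previous step) are what gets indexed or stored.
Create the index
- Build an index from the keyword list and save it to the index store.
- The index store holds the index, the Document objects, and the mapping between keywords and documents.
- An index that goes from words to documents is called an inverted index.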
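To make the inverted index idea concrete, here is a minimal in-memory sketch in plain Java: a map from each word to the ids of the documents that contain it. It is a toy model under simplifying assumptions (ASCII words only, no scoring, nothing written to disk) and says nothing about Lucene's actual on-disk format; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    // document id -> original text (the id is the position in this list)
    private final List<String> documents = new ArrayList<>();

    /** Stores the text, indexes its words, and returns the new document id. */
    int add(String text) {
        int docId = documents.size();
        documents.add(text);
        // \W+ splits on anything that is not an ASCII letter, digit, or underscore.
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Word -> document ids: the lookup direction that makes the index "inverted". */
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.emptySortedSet());
    }

    String document(int docId) {
        return documents.get(docId);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Lucene builds an inverted index");
        index.add("A database stores structured data");
        index.add("The index maps words to documents");
        System.out.println(index.search("index")); // prints [0, 2]
    }
}
```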

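Querying the index

User query interface
- Where the user enters the search criteria (for example, Baidu's search box).
Wrap the keyword in a query object
- It holds the field to query and the keyword to search for.
Execute the query
- Search the specified field for the keyword.
- Once the keyword is found, use it to find the corresponding documents.
Render the results
- Look up each document object by its document id.
- Highlight the keyword, paginate, and present the results to the user.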
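The query flow above can be modeled in a few lines of plain Java: a per-field term dictionary, a lookup by field and keyword, and a render step that pages the hits and highlights the keyword. This is a simplified sketch and not Lucene's search API; the class name, the zero-based page parameter, and the `<em>` highlight tags are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

class QueryFlowSketch {
    // document id -> document (field name -> field value)
    private final List<Map<String, String>> docs = new ArrayList<>();
    // field name -> (term -> ids of the documents containing the term in that field)
    private final Map<String, Map<String, List<Integer>>> index = new HashMap<>();

    void add(Map<String, String> doc) {
        int id = docs.size();
        docs.add(doc);
        for (Map.Entry<String, String> field : doc.entrySet()) {
            Map<String, List<Integer>> terms =
                    index.computeIfAbsent(field.getKey(), k -> new HashMap<>());
            String[] words = field.getValue().toLowerCase(Locale.ROOT).split("\\W+");
            // LinkedHashSet removes repeated words so each id is recorded once per term.
            for (String term : new LinkedHashSet<>(Arrays.asList(words))) {
                if (!term.isEmpty()) {
                    terms.computeIfAbsent(term, k -> new ArrayList<>()).add(id);
                }
            }
        }
    }

    /** Execute the query: look the keyword up in the chosen field, get document ids. */
    List<Integer> search(String field, String keyword) {
        return index.getOrDefault(field, Collections.emptyMap())
                .getOrDefault(keyword.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    /** Render one page (zero-based) of hits with the keyword wrapped in <em> tags. */
    List<String> render(String field, String keyword, int page, int pageSize) {
        List<Integer> hits = search(field, keyword);
        int from = Math.min(page * pageSize, hits.size());
        int to = Math.min(from + pageSize, hits.size());
        List<String> lines = new ArrayList<>();
        for (int id : hits.subList(from, to)) {
            String value = docs.get(id).get(field); // fetch the document by its id
            lines.add(value.replaceAll(
                    "(?i)\\b" + Pattern.quote(keyword) + "\\b", "<em>$0</em>"));
        }
        return lines;
    }

    public static void main(String[] args) {
        QueryFlowSketch s = new QueryFlowSketch();
        s.add(Collections.singletonMap("name", "Spring in Action"));
        s.add(Collections.singletonMap("name", "Apache Lucene"));
        s.add(Collections.singletonMap("name", "Spring Boot guide"));
        System.out.println(s.render("name", "spring", 0, 1)); // prints [<em>Spring</em> in Action]
    }
}
```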

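Note:

The data Lucene stores has no fixed schema. Lucene only builds an index over the Terms that can be indexed, and each query looks up that index to find the matching Documents (every document has a unique number, its document id).
For structured data, a relational database is still the right place to store it.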

Environment setup

Dependencies

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>7.4.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.4.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>7.4.0</version>
</dependency>
<!-- Provides FileUtils -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-io</artifactId>
    <version>1.3.2</version>
</dependency>

IK Analyzer (a Chinese analyzer)

Link: https://pan.baidu.com/s/1Rsled6iBdgOZAbjzxS_BCA
Extraction code: f5wh

IKAnalyzer.cfg.xml must sit at the root of the classpath. The locations of hotword.dic (hot words) and stopword.dic (stop words) are specified inside IKAnalyzer.cfg.xml.

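Adding to the index store

Steps

1. Create a Directory object that specifies where the index store is saved.
2. Create an IndexWriter object based on the Directory object.
3. Read the files on disk and create a document object for each file.
4. Add fields to the document object.
5. Write the document object to the index store.
6. Close the IndexWriter object.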

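// Imports used by the methods in this article (place them at the top of the enclosing class file).
import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
// The usual IKAnalyzer package; adjust it if your jar uses a different one.
import org.wltea.analyzer.lucene.IKAnalyzer;

public void addDocument() throws Exception {
    /*
     * 1. Create a Directory object that specifies where the index store is saved.
     *    In memory: Directory directory = new RAMDirectory();
     *    On disk:
     */
    Directory directory = FSDirectory.open(new File("C:\\temp\\index").toPath());
    /*
     * 2. Create an IndexWriter object based on the Directory object.
     *    new IndexWriterConfig() without an analyzer falls back to the built-in
     *    StandardAnalyzer (an English analyzer), so IKAnalyzer is passed explicitly.
     */
    IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());
    IndexWriter indexWriter = new IndexWriter(directory, config);
    /*
     * 3. Read the files on disk and create a document object for each file.
     */
    File dir = new File("C:\\temp\\searchsource");
    File[] files = dir.listFiles();
    if (files == null) {
        // listFiles() returns null when the path does not exist or is not a directory.
        indexWriter.close();
        throw new IllegalStateException("Not a readable directory: " + dir);
    }
    for (File f : files) {
        if (!f.isFile()) {
            continue; // skip subdirectories
        }
        // File name
        String fileName = f.getName();
        // File path
        String filePath = f.getPath();
        // File content
        String fileContent = FileUtils.readFileToString(f, "utf-8");
        // File size in bytes (FileUtils.sizeOf(File) is not available in commons-io 1.3.2)
        long fileSize = f.length();
        /*
         * 4. Create the fields.
         *    Argument 1: field name, argument 2: field value, argument 3: whether to store the value.
         */
        Field fieldName = new TextField("name", fileName, Field.Store.YES);
        // StoredField: stored only, not indexed
        Field fieldPath = new StoredField("path", filePath);
        Field fieldContent = new TextField("content", fileContent, Field.Store.YES);
        // LongPoint: indexed for numeric queries, value not stored
        Field fieldSizeValue = new LongPoint("size", fileSize);
        // Create the document object
        Document document = new Document();
        // Add the fields to the document object
        document.add(fieldName);
        document.add(fieldPath);
        document.add(fieldContent);
        document.add(fieldSizeValue);
        /*
         * 5. Write the document object to the index store.
         */
        indexWriter.addDocument(document);
    }
    /*
     * 6. Close the IndexWriter object.
     */
    indexWriter.close();
}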

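Common Field types
The Stored property
- Whether the field's value is saved in the index store. If a field is only used for searching and its value is not needed when the matching Document is retrieved, do not store it (Field.Store.NO); document.get() then returns null for that field.
The Indexed property
- Whether the field is indexed so that it can be searched (for example, url and path do not need to be searchable, so they should use StoredField).
The Analyzed property
- Whether the field is tokenized. Some values must not be split (for example, an ID card number).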

Analyzers

Testing tokenization

public void testTokenStream() throws Exception {
    // 1) Create an Analyzer object
    // Analyzer analyzer = new StandardAnalyzer(); // built-in English analyzer
    Analyzer analyzer = new IKAnalyzer();
    // 2) Get a TokenStream from the analyzer's tokenStream method.
    //    First argument: the field name (can be left empty when no Document is involved)
    //    Second argument: the text to tokenize
    TokenStream tokenStream = analyzer.tokenStream("", "Lucene是一款高性能的、可扩展的信息检索(IR)工具库。信息检索是指文档搜索、文档内信息搜索或者文档相关的元数据搜索等操作。");
    // 3) Register a CharTermAttribute on the TokenStream; it acts like a pointer to the current token
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    // 4) Call reset() on the TokenStream; skipping this call causes an exception
    tokenStream.reset();
    // 5) Iterate over the TokenStream with a while loop
    while (tokenStream.incrementToken()) {
        System.out.println(charTermAttribute.toString());
    }
    // 6) Signal the end of the stream, then close the TokenStream
    tokenStream.end();
    tokenStream.close();
}

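Editing hot words and stop words
- Add a hot word (treated as a single unit during tokenization): hotword.dic
- Add a stop word (removed from the sentence after tokenization): stopword.dic
Note
- hotword.dic and stopword.dic must be saved as UTF-8 without a BOM.
  In other words, do not edit the dictionary files with Windows Notepad.
- An editor such as EditPlus can save a file as UTF-8 without a BOM.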
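A dictionary file can also be checked for a BOM programmatically. The sketch below uses only the JDK and relies on the fact that a UTF-8 BOM is the three bytes EF BB BF at the start of the file; the class name is made up, and the file paths are whatever you pass on the command line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

class BomCheckSketch {
    /** True if the data starts with the UTF-8 byte order mark EF BB BF. */
    static boolean hasBom(byte[] data) {
        return data.length >= 3
                && data[0] == (byte) 0xEF
                && data[1] == (byte) 0xBB
                && data[2] == (byte) 0xBF;
    }

    /** Returns the data without a leading BOM, or the same array if there is none. */
    static byte[] stripBom(byte[] data) {
        return hasBom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    /** Rewrites the file without its BOM; leaves the file untouched if it has none. */
    static void stripBomInPlace(Path file) throws IOException {
        byte[] data = Files.readAllBytes(file);
        if (hasBom(data)) {
            Files.write(file, stripBom(data));
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java BomCheckSketch hotword.dic stopword.dic
        for (String arg : args) {
            Path file = Paths.get(arg);
            boolean bom = hasBom(Files.readAllBytes(file));
            System.out.println(file + (bom ? ": has a BOM" : ": no BOM"));
        }
    }
}
```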

Deleting from the index store

Delete all documents in the index store

public void deleteAllDocument() throws Exception {
    // Create an IndexWriter that uses IKAnalyzer as its analyzer
    IndexWriter indexWriter = new IndexWriter(
            FSDirectory.open(new File("C:\\temp\\index").toPath()),
            new IndexWriterConfig(new IKAnalyzer()));
    // Delete every document
    indexWriter.deleteAll();
    // Close the index store
    indexWriter.close();
}

Delete the documents that match a query term

public void deleteDocumentByQuery() throws Exception {
    // Create an IndexWriter that uses IKAnalyzer as its analyzer
    IndexWriter indexWriter = new IndexWriter(
            FSDirectory.open(new File("C:\\temp\\index").toPath()),
            new IndexWriterConfig(new IKAnalyzer()));
    // Delete every document whose "name" field contains the term "apache"
    indexWriter.deleteDocuments(new Term("name", "apache"));
    // Close the index store
    indexWriter.close();
}

Updating the index store

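public void updateDocument() throws Exception {
    // Create an IndexWriter that uses IKAnalyzer as its analyzer
    IndexWriter indexWriter = new IndexWriter(
            FSDirectory.open(new File("C:\\temp\\index").toPath()),
            new IndexWriterConfig(new IKAnalyzer()));
    // Create a new document object
    Document document = new Document();
    // Add fields to the document object. The values are placeholders: a freshly
    // created Document is empty, so document.get("name") would return null here
    // and TextField rejects a null value.
    document.add(new TextField("name", "updated document name", Field.Store.YES));
    document.add(new TextField("name1", "newly added field name1", Field.Store.YES));
    // Update: replace the documents whose "name" field contains the term "spring"
    indexWriter.updateDocument(new Term("name", "spring"), document);
    // Close the index store
    indexWriter.close();
}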

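Notes

Lucene performs an update by first deleting the Documents that match the Term and then adding the new Document.
A Lucene Document has no fixed schema, so any field can be added to it; likewise, a search finds the whole Document through a Term in one of its fields.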
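The delete-then-add behavior can be illustrated with a small in-memory model in plain Java. It is only a sketch of the semantics described above (every matching document is removed, one new document is added, and the new document may carry different fields); it does not use or reproduce Lucene's IndexWriter, and all names in it are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class UpdateSketch {
    // Each document is a schema-free map of field name -> field value.
    private final List<Map<String, String>> docs = new ArrayList<>();

    void add(Map<String, String> doc) {
        docs.add(doc);
    }

    /** Removes every document whose field contains the term; returns how many were removed. */
    int delete(String field, String term) {
        int before = docs.size();
        docs.removeIf(doc -> containsTerm(doc.get(field), term));
        return before - docs.size();
    }

    /** Update = delete all matches, then add the replacement (added even if nothing matched). */
    void update(String field, String term, Map<String, String> replacement) {
        delete(field, term);
        add(replacement);
    }

    List<Map<String, String>> all() {
        return Collections.unmodifiableList(docs);
    }

    private static boolean containsTerm(String value, String term) {
        return value != null
                && Arrays.asList(value.toLowerCase(Locale.ROOT).split("\\W+")).contains(term);
    }

    public static void main(String[] args) {
        UpdateSketch store = new UpdateSketch();
        store.add(Collections.singletonMap("name", "spring guide"));
        store.add(Collections.singletonMap("name", "spring boot"));
        store.add(Collections.singletonMap("name", "apache lucene"));
        // Both "spring" documents are removed and a single new document is added.
        store.update("name", "spring", Collections.singletonMap("name1", "new field"));
        System.out.println(store.all()); // prints [{name=apache lucene}, {name1=new field}]
    }
}
```

Because the match is by term, an update can replace several documents with one, so the term used for updates is normally one that identifies a single document.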
