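I. What Is Full-Text Search

1. The data we deal with in everyday life falls into two broad categories:

Structured data: data with a fixed format or bounded length, such as database records and metadata;
Unstructured data: data with variable length or no fixed format, such as emails and Word documents.
Unstructured data is also known as full-text data.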

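2. Ways to search unstructured data

Serial scanning: go through a set of files one by one, reading each document from start to finish in search of a given string. If a document contains the string, it is one of the files we want. Then move on to the next file, until every file has been scanned;
Full-text search: extract part of the information from the unstructured data and reorganize it into index data that has some structure, then search the index instead of the raw text;
Dictionary example: in a dictionary, the pinyin table and the radical lookup table play the role of the index.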
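
To make the difference between the two approaches concrete, here is a minimal sketch in plain Java (no Lucene involved). The class name `InvertedIndexSketch` and its methods are hypothetical and exist only for illustration. Serial scanning rereads every document for each query, while the index is built once and then answers a query with a single map lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class, not part of Lucene.
class InvertedIndexSketch {

    /** Serial scanning: a substring search over the full text of every document. */
    static List<Integer> serialScan(List<String> docs, String word) {
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            if (docs.get(id).toLowerCase().contains(word.toLowerCase())) {
                hits.add(id);
            }
        }
        return hits;
    }

    /** Full-text search, step 1: build the index once (word -> ids of documents containing it). */
    static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String word : docs.get(id).toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    index.computeIfAbsent(word, k -> new TreeSet<>()).add(id);
                }
            }
        }
        return index;
    }

    /** Full-text search, step 2: answer a query by looking the word up in the index. */
    static Set<Integer> search(Map<String, Set<Integer>> index, String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Lucene is a full-text search library",
                "Solr is built on Lucene",
                "A dictionary has a pinyin index");
        Map<String, Set<Integer>> index = buildIndex(docs);
        System.out.println(serialScan(docs, "lucene")); // prints [0, 1]
        System.out.println(search(index, "lucene"));    // prints [0, 1]
    }
}
```

Building the index costs time up front, but every later query avoids rereading the documents, which is the same trade-off a dictionary makes with its lookup tables.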

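II. About Lucene

1. Lucene is an open-source full-text search engine toolkit. It started as a subproject of the Apache Software Foundation's Jakarta project (it is now a top-level Apache project). Full-text search engines widely used in industry today, such as Solr and ElasticSearch, are built on Lucene.
2. Related concepts:

Index: roughly comparable to a database;
Document: comparable to one file, or one record in a database;
Field: comparable to a column in a database table;
Term: the content of a field is analyzed (split into words, lowercased, stripped of punctuation, filtered for stop words, and so on) to produce the final tokens. Each token can be thought of as a single word, and each such word becomes a Term.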
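
The analysis steps described for Term can be sketched in plain Java as follows. This is only an approximation under stated assumptions: the class `TermAnalysisSketch` and its tiny stop-word list are hypothetical, and real Lucene analyzers (StandardAnalyzer, IKAnalyzer) are considerably more sophisticated, especially for Chinese text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class, not part of Lucene.
class TermAnalysisSketch {

    // Assumed stop-word list for the example; real analyzers ship their own.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and"));

    /** Turns the raw text of a field into a list of terms. */
    static List<String> analyze(String fieldText) {
        List<String> terms = new ArrayList<>();
        // Splitting on anything that is not a letter or digit tokenizes the text
        // and removes punctuation in one step.
        for (String token : fieldText.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            String lower = token.toLowerCase(Locale.ROOT); // normalize case
            if (!STOP_WORDS.contains(lower)) {              // drop stop words
                terms.add(lower);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene is a Full-Text Search Library, written in Java."));
        // prints [lucene, full, text, search, library, written, in, java]
    }
}
```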

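III. Usage

1. Create a new Java project and import the required JAR files.
2. Third-party Chinese analyzer: IK-analyzer (skip this step if you use the default analyzer):
(1) Add the IK-analyzer JAR to the project;
(2) Download the IKAnalyzer2012FF_u1 archive, extract it, and copy IKAnalyzer.cfg.xml (configuration file), ext.dic (extension dictionary), and stopword.dic (stop-word dictionary) into the project's src directory. The two dictionary files can be named freely, as long as the names are set in the configuration file.
The corresponding Maven dependencies are as follows: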

<!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
    <dependency>
      <groupId>commons-io</groupId>
      <artifactId>commons-io</artifactId>
      <version>2.5</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-common -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-common</artifactId>
      <version>4.10.3</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-core -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>4.10.3</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-queryparser -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-queryparser</artifactId>
      <version>4.10.3</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.janeluo/ikanalyzer -->
    <dependency>
      <groupId>com.janeluo</groupId>
      <artifactId>ikanalyzer</artifactId>
      <version>2012_u6</version>
    </dependency>

3. Create the index:
Place a few txt files in the F:\myworkspace\document folder to serve as the source documents.

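    import java.io.File;
    import java.io.IOException;

    import org.apache.commons.io.FileUtils;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.LongField;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;
    import org.junit.Test;
    import org.wltea.analyzer.lucene.IKAnalyzer;

    // Place this method inside a JUnit 4 test class.
    @Test
    public void testLucene() {
        // Standard analyzer alternative (splits Chinese text into single characters):
        // Analyzer analyzer = new StandardAnalyzer();
        // Third-party Chinese analyzer used to analyze the document content
        Analyzer analyzer = new IKAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
        // try-with-resources closes the IndexWriter and the Directory even if indexing fails
        try (Directory directory = FSDirectory.open(new File("F:\\myworkspace\\store"));
             IndexWriter indexWriter = new IndexWriter(directory, config)) {
            // Folder that holds the source documents
            File[] fileList = new File("F:\\myworkspace\\document").listFiles();
            if (fileList == null) {
                return; // folder is missing or cannot be read
            }
            for (File file1 : fileList) {
                if (!file1.isFile()) {
                    continue; // skip subdirectories
                }
                Document document = new Document();
                // File-name field. Arguments: field name, field content, whether to store the value
                Field fileNameField = new TextField("fileName", file1.getName(), Field.Store.YES);
                // File-size field
                Field fileSizeField = new LongField("fileSize", FileUtils.sizeOf(file1), Field.Store.YES);
                // File-path field (not analyzed, not indexed, only stored)
                Field filePathField = new StoredField("filePath", file1.getPath());
                // File-content field, read as UTF-8
                String fileContent = FileUtils.readFileToString(file1, "utf-8");
                Field fileContentField = new TextField("fileContent", fileContent, Field.Store.YES);
                document.add(fileNameField);
                document.add(fileSizeField);
                document.add(filePathField);
                document.add(fileContentField);
                // Analyze the document, build its index entries, and write both to the index store
                indexWriter.addDocument(document);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }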

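4. Once the steps above have run, index files appear in the index location, F:\myworkspace\store. These files cannot be read directly; a Lucene index viewer is needed to inspect them. Download link: https://download.csdn.net/download/qq_36413084/10569588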
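5. Searching:
A query object can be created in two ways:
1) Use one of the Query subclasses provided by Lucene
Query is an abstract class, and Lucene provides many concrete query types, such as TermQuery for exact term matching and NumericRangeQuery for numeric ranges. For example:

    Query query = new TermQuery(new Term("name", "lucene"));

The first argument of Term is the name of the field to search. With the index built above, fileContent and fileName can be searched this way. filePath is a StoredField (stored but not indexed), so it cannot be searched, and fileSize is a numeric LongField, which should be queried with NumericRangeQuery rather than TermQuery.
The second argument is the term to search for. TermQuery does not pass it through the analyzer, so it has to match an indexed term exactly.
The complete code is as follows: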

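    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.junit.Test;

    // Combined-condition query (place this method inside a JUnit 4 test class)
    @Test
    public void testBooleanQuery() {
        // Open the index store and a reader on it; try-with-resources closes both
        try (Directory directory = FSDirectory.open(new File("F:\\myworkspace\\store"));
             IndexReader indexReader = DirectoryReader.open(directory)) {
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);

            // Exact single-condition query:
            // Query query = new TermQuery(new Term("fileName", "apache"));

            // Boolean query, i.e. a query with multiple conditions
            BooleanQuery query = new BooleanQuery();
            Query query1 = new TermQuery(new Term("fileContent", "apache"));
            Query query2 = new TermQuery(new Term("fileContent", "lucene"));
            // MUST: a document has to satisfy both conditions to match
            query.add(query1, BooleanClause.Occur.MUST);
            query.add(query2, BooleanClause.Occur.MUST);
            // Run the query. Arguments: the query object, the maximum number of results to return
            TopDocs topDocs = indexSearcher.search(query, 10);

            // Total number of matching documents
            System.out.println("Total hits: " + topDocs.totalHits);
            // topDocs.scoreDocs holds the ids of the matching documents
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                // scoreDoc.doc is the document id; use it to load the stored document
                Document document = indexSearcher.doc(scoreDoc.doc);
                System.out.println(document.get("fileName"));    // file name
                System.out.println(document.get("fileContent")); // file content
                System.out.println(document.get("fileSize"));    // file size
                System.out.println(document.get("filePath"));    // file path
                System.out.println("----------------------------------");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }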
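
Conceptually, two `MUST` clauses ask for the documents that appear in the result of both term queries. The sketch below shows that idea in plain Java as the intersection of two ascending lists of document ids. It is an assumption-laden illustration, not Lucene's implementation (which also scores documents and uses more elaborate data structures), and the class name `MustClauseSketch` and the sample ids are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class, not part of Lucene.
class MustClauseSketch {

    /** Intersects two ascending lists of document ids with a two-pointer walk. */
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i);
            int y = b.get(j);
            if (x == y) {        // id is in both lists: it satisfies both MUST clauses
                result.add(x);
                i++;
                j++;
            } else if (x < y) {  // advance whichever list is behind
                i++;
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up document ids for the terms "apache" and "lucene"
        List<Integer> apacheDocs = Arrays.asList(0, 2, 5, 7);
        List<Integer> luceneDocs = Arrays.asList(2, 3, 7, 9);
        System.out.println(intersect(apacheDocs, luceneDocs)); // prints [2, 7]
    }
}
```

By the same analogy, `BooleanClause.Occur.SHOULD` behaves roughly like a union and `BooleanClause.Occur.MUST_NOT` like a set difference.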

2) Use QueryParser to parse a query expression
QueryParser parses the query expression entered by the user into a Query instance. For example:

           QueryParser queryParser = new QueryParser("name", new IKAnalyzer());
           Query query = queryParser.parse("name:lucene");

The complete code is as follows:

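    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.junit.Test;
    import org.wltea.analyzer.lucene.IKAnalyzer;

    // Place this method inside a JUnit 4 test class.
    @Test
    public void testQueryParser() {
        // Open the index store and a reader on it; try-with-resources closes both
        try (Directory directory = FSDirectory.open(new File("F:\\myworkspace\\store"));
             IndexReader indexReader = DirectoryReader.open(directory)) {
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);

            // QueryParser arguments: the default field to search, the analyzer
            QueryParser queryParser = new QueryParser("fileContent", new IKAnalyzer());
            // Search the default field
            Query query = queryParser.parse("测试");
            // Or name the field explicitly instead of relying on the default:
            // Query query = queryParser.parse("fileContent:apache");

            // Run the query. Arguments: the query object, the maximum number of results to return
            TopDocs topDocs = indexSearcher.search(query, 10);

            // Total number of matching documents
            System.out.println("Total hits: " + topDocs.totalHits);
            // topDocs.scoreDocs holds the ids of the matching documents
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                // scoreDoc.doc is the document id; use it to load the stored document
                Document document = indexSearcher.doc(scoreDoc.doc);
                System.out.println(document.get("fileName"));    // file name
                System.out.println(document.get("fileContent")); // file content
                System.out.println(document.get("fileSize"));    // file size
                System.out.println(document.get("filePath"));    // file path
                System.out.println("----------------------------------");
            }
        } catch (IOException | ParseException e) {
            e.printStackTrace();
        }
    }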
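
The role of the default field in a query expression can be illustrated with a toy parser in plain Java. This is a deliberately simplified sketch: the class `QueryExpressionSketch` is hypothetical and handles only a single `field:text` clause, whereas Lucene's QueryParser supports a much richer syntax and also runs the text through the analyzer.

```java
import java.util.Arrays;

// Hypothetical illustration class, not part of Lucene.
class QueryExpressionSketch {

    /**
     * Splits a single-clause expression into {field, text}.
     * "fileName:apache" names its field; "apache" falls back to the default field.
     */
    static String[] parseClause(String defaultField, String expression) {
        int colon = expression.indexOf(':');
        if (colon > 0) {
            return new String[] {
                expression.substring(0, colon).trim(),
                expression.substring(colon + 1).trim()
            };
        }
        return new String[] { defaultField, expression.trim() };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseClause("fileContent", "apache")));
        // prints [fileContent, apache]
        System.out.println(Arrays.toString(parseClause("fileContent", "fileName:apache")));
        // prints [fileName, apache]
    }
}
```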

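6. Viewing the results: the source documents in this article are txt files, and each txt file is one record. Because the whole fileContent field is stored and printed, a search returns the full text of every matching document rather than just the matching excerpt.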

