Getting Started with Lucene (Full Text Search)

Lucene implements full-text retrieval in two parts: an indexing process and a search process.

The process of creating an index:

1. Obtain the original document

    The original document refers to the content to be indexed and searched. Raw content includes web pages on the Internet, data in databases, files on disk, and more.


2. Create the document object

    The purpose of obtaining the original content is to index it. Before indexing, the original content must be turned into documents (Document). A document contains one or more fields (Field), and the content is stored in the fields.

    Here we can treat each file on disk as a document. The Document then includes several Fields (file_name: the file name, file_path: the file path, file_size: the file size, file_content: the file content).


3. Analyze documents

    The original content is turned into documents (Document) containing fields (Field), and the content in the fields needs to be analyzed. Analysis extracts words from the original document, converts letters to lowercase, removes punctuation marks, removes stop words, and so on, producing the final lexical units, which can be understood as individual words.
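The analysis steps above can be sketched in plain Java. This is a simplified illustration only, not Lucene's actual analyzer pipeline, and the stop-word list here is a made-up minimal one:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AnalyzeSketch {
    // A tiny, illustrative stop-word list (real analyzers ship much larger lists)
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "and", "is"));

    // Lowercase, split on whitespace, strip punctuation, drop stop words
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase().split("\\s+")) {
            String token = raw.replaceAll("\\p{Punct}", "");
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "is" and "a" are removed as stop words; punctuation is stripped
        System.out.println(analyze("Lucene is a Java full-text search engine."));
    }
}
```

Each element of the returned list corresponds to one lexical unit of the kind the indexing step consumes.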


4. Create the index

    The lexical units obtained from analyzing all documents are indexed. The purpose of the index is search: ultimately, only the indexed lexical units need to be searched in order to find the Documents.
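The index built from those lexical units is an inverted index: a map from each term to the list of documents containing it. A toy sketch, assuming the documents are already tokenized (this is an illustration of the idea, not Lucene's data structure):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndexSketch {
    // term -> ids of the documents containing that term (the "posting list")
    static Map<String, List<Integer>> buildIndex(List<List<String>> docs) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId)) {
                List<Integer> postings = index.computeIfAbsent(term, k -> new ArrayList<>());
                // avoid adding the same document twice for a repeated term
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "search"),
                Arrays.asList("solr", "search"));
        System.out.println(buildIndex(docs).get("search"));  // documents 0 and 1
    }
}
```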


The process of querying the index:

1. User query interface

    The interface through which the user enters the query content.


2. Create a query

    Before the keyword a user enters can be searched, a query object must be constructed. The query object can specify the Field (document domain) to search and the query keyword, and from it the specific query syntax is generated.


3. Execute the query

    According to the query syntax, each search term is looked up in the inverted index dictionary, which yields the linked list of documents associated with that term.
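Conceptually, executing a query means looking each search term up in that dictionary and merging the resulting document lists; for an AND query this is an intersection of sorted posting lists. A toy sketch with hypothetical posting-list data, not Lucene's actual implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QuerySketch {
    // Intersect two sorted posting lists: documents containing both terms
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) { result.add(a.get(i)); i++; j++; }
            else if (cmp < 0) i++;
            else j++;
        }
        return result;
    }

    public static void main(String[] args) {
        // posting lists looked up in the inverted index dictionary (hypothetical data)
        List<Integer> lucene = Arrays.asList(0, 2, 3);
        List<Integer> search = Arrays.asList(0, 1, 3);
        System.out.println(intersect(lucene, search));  // [0, 3]
    }
}
```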


4. Render the results

    Display the query results to users through a friendly interface, so that users can find the information they want from the search results. To help users locate their results quickly, many display effects are provided, such as highlighting keywords in the search results, page snapshots (as provided by Baidu), and so on.
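Keyword highlighting in a result snippet can be sketched by wrapping each match in a tag. This is a simplified, case-sensitive illustration; Lucene itself provides a dedicated highlighter module for this:

```java
public class HighlightSketch {
    // Wrap every occurrence of the keyword in <em> tags for display
    static String highlight(String text, String keyword) {
        return text.replace(keyword, "<em>" + keyword + "</em>");
    }

    public static void main(String[] args) {
        System.out.println(highlight("apache lucene is a search library", "lucene"));
        // apache <em>lucene</em> is a search library
    }
}
```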



Using Lucene

1. Import the jar packages


2. Create an index library and query the index library

When creating an index library, the relevant Field subclasses (domain objects) are introduced below:

Introduction to the Field subclasses:

Field class: StringField(FieldName, FieldValue, Store.YES)
    Data type: String
    Analyzed: N    Indexed: Y    Stored: Y or N
    Builds a String Field that is not analyzed; the entire string is indexed as a single term (e.g., order number, name).
    Whether it is stored in the document is determined by Store.YES or Store.NO.

Field class: LongField(FieldName, FieldValue, Store.YES)
    Data type: Long
    Analyzed: Y    Indexed: Y    Stored: Y or N
    Builds a Long numeric Field that is analyzed and indexed (e.g., price).
    Whether it is stored in the document is determined by Store.YES or Store.NO.

Field class: StoredField(FieldName, FieldValue)
    Data type: overloaded methods support multiple types
    Analyzed: N    Indexed: N    Stored: Y
    Builds Fields of various types that are neither analyzed nor indexed, but are stored in the document.

Field class: TextField(FieldName, FieldValue, Store.NO) or TextField(FieldName, reader)
    Data type: String or Reader
    Analyzed: Y    Indexed: Y    Stored: Y or N
    If a Reader is passed, Lucene assumes the content is large and adopts the Unstored strategy.

When querying the index library, the search methods of the index searcher are:

Method: indexSearcher.search(query, n)
    Searches by Query and returns the n highest-scoring records.
Method: indexSearcher.search(query, filter, n)
    Searches by Query with a filter rule added, and returns the n highest-scoring records.
Method: indexSearcher.search(query, n, sort)
    Searches by Query with a sort rule added, and returns the top n records.
Method: indexSearcher.search(booleanQuery, filter, n, sort)
    Searches by Query with both a filter rule and a sort rule added, and returns the top n records.


Code:

package com.xushuai.lucene;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

import java.io.File;
import java.io.IOException;

/**
 * First use of Lucene
 * Author: xushuai
 * Date: 2018/5/6
 * Time: 15:08
 * Description:
 */
public class LuceneDemo {
    /*
     * Steps to create the index library:
     *   Step 1: create a Java project and import the jar packages.
     *   Step 2: create an IndexWriter object.
     *      1) Specify the Directory object where the index library is stored.
     *      2) Specify an analyzer to analyze the document content.
     *   Step 3: create Document objects.
     *   Step 4: create Field objects and add them to the Document objects.
     *   Step 5: use the IndexWriter to write the Document objects into the index library.
     *      The index is created during this step, and both the index and the Document objects
     *      are written to the index library.
     *   Step 6: close the IndexWriter object.
     */

    /**
     * Create the index library
     * @author: xushuai
     * @date: 2018/5/6 15:12
     * @throws: IOException
     */
    @Test
    public void luceneCreateIndexRepository() throws IOException {
        //path where the index library is stored
        Directory directory = FSDirectory.open(new File("D:\\lucene&solr\\lucene\\index"));

        //create the analyzer (using its subclass, the standard analyzer)
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LATEST, analyzer);

        //construct the index writer from the index library path and the analyzer
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

        //read the original documents, create the corresponding Document objects, and set the related Fields
        File dir = new File("D:\\lucene&solr\\lucene\\searchsource");
        //iterate over all files under dir and create a Document object for each
        for (File file : dir.listFiles()) {
            //get the file name, path, size, and content
            String file_name = file.getName();
            String file_path = file.getPath();
            Long file_size = FileUtils.sizeOf(file);
            String file_content = FileUtils.readFileToString(file);

            //create a Field for each file attribute (parameters: field name, field value, and whether to store)
            Field fileNameField = new TextField("filename", file_name, Field.Store.YES);
            Field filePathField = new StoredField("filepath", file_path);
            Field fileSizeField = new LongField("filesize", file_size, Field.Store.YES);
            Field fileContentField = new TextField("filecontent", file_content, Field.Store.YES);

            //create the Document object
            Document document = new Document();
            //add the Field objects to it
            document.add(fileContentField);
            document.add(fileNameField);
            document.add(filePathField);
            document.add(fileSizeField);

            //hand the Document to the writer
            indexWriter.addDocument(document);
        }
        //release resources
        indexWriter.close();
    }



    /*
     * Steps to query the index:
     *   Step 1: create a Directory object, i.e. the location where the index library is stored.
     *   Step 2: create an IndexReader object, which requires the Directory object.
     *   Step 3: create an IndexSearcher object, which requires the IndexReader object.
     *   Step 4: create a TermQuery object, specifying the field to query and the query keyword.
     *   Step 5: execute the query.
     *   Step 6: return the query results; iterate over them and print them.
     *   Step 7: close the IndexReader object.
     */

    /**
     * Query the index
     * @author: xushuai
     * @date: 2018/5/6 15:48
     * @throws: IOException
     */
    @Test
    public void luceneSearchIndexRepository() throws IOException {
        //specify the location of the index library
        Directory directory = FSDirectory.open(new File("D:\\lucene&solr\\lucene\\index"));

        //create the index reader
        IndexReader indexReader = DirectoryReader.open(directory);

        //create the index searcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        //create the query condition; first parameter: field name, second parameter: field value
        //here the condition is: documents whose file name contains "apache"
        Query query = new TermQuery(new Term("filename", "apache"));

        //execute the query; first parameter: query condition, second parameter: maximum number of results to return
        TopDocs topDocs = indexSearcher.search(query, 10);
        //print the size of the result set
        System.out.println("Total hits: " + topDocs.totalHits);

        //iterate over the result set
        for (ScoreDoc doc : topDocs.scoreDocs) {
            //get the matched Document; the doc attribute of ScoreDoc holds the document id
            Document document = indexSearcher.doc(doc.doc);
            //print the file name
            System.out.println("File name:  " + document.get("filename"));
            //print the file size
            System.out.println("File size: " + document.get("filesize"));
            //print the file path
            System.out.println("File path: " + document.get("filepath"));
            //print the file content
            System.out.println(document.get("filecontent"));

            //separator
            System.out.println("------------------------------------------------------------------------------");
        }
        //release resources
        indexReader.close();
    }
}

Result of creating the index library (the index library can be inspected with a tool):

Result of querying the index library:
Total hits: 2
File name:  apache lucene.txt
File size: 724
File path: D:\lucene&solr\lucene\searchsource\apache lucene.txt
# Apache Lucene README file

## Introduction

Lucene is a Java full-text search engine.  Lucene is not a complete
application, but rather a code library and API that can easily be used
to add search capabilities to applications.

 * The Lucene web site is at: http://lucene.apache.org/
 * Please join the Lucene-User mailing list by sending a message to:
   [email protected]

## Files in a binary distribution

Files are organized by module, for example in core/:

* `core/lucene-core-XX.jar`:
  The compiled core Lucene library.

To review the documentation, read the main documentation page, located at:
`docs/index.html`

To build Lucene or its documentation for a source distribution, see BUILD.txt

------------------------------------------------------------------------------
File name:  Welcome to the Apache Solr project.txt
File size: 5464
File path: D:\lucene&solr\lucene\searchsource\Welcome to the Apache Solr project.txt
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


Welcome to the Apache Solr project!
-----------------------------------

Solr is the popular, blazing fast open source enterprise search platform
from the Apache Lucene project.

For a complete description of the Solr project, team composition, source
code repositories, and other details, please see the Solr web site at
http://lucene.apache.org/solr


Getting Started
---------------

See the "example" directory for an example Solr setup.  A tutorial
using the example setup can be found at
   http://lucene.apache.org/solr/tutorial.html
or linked from "docs/index.html" in a binary distribution.
Also, there are Solr clients for many programming languages, see 
   http://wiki.apache.org/solr/IntegratingSolr


Files included in an Apache Solr binary distribution
----------------------------------------------------

example/
  A self-contained example Solr instance, complete with a sample
  configuration, documents to index, and the Jetty Servlet container.
  Please see example/README.txt for information about running this
  example.

dist/solr-XX.war
  The Apache Solr Application.  Deploy this WAR file to any servlet
  container to run Apache Solr.

dist/solr-<component>-XX.jar
  The Apache Solr libraries.  To compile Apache Solr Plugins,
  one or more of these will be required.  The core library is
  required at a minimum. (see http://wiki.apache.org/solr/SolrPlugins
  for more information).

docs/index.html
  The Apache Solr Javadoc API documentation and Tutorial


Instructions for Building Apache Solr from Source
-------------------------------------------------

1. Download the Java SE 7 JDK (Java Development Kit) or later from http://java.sun.com/
   You will need the JDK installed, and the $JAVA_HOME/bin (Windows: %JAVA_HOME%\bin) 
   folder included on your command path. To test this, issue a "java -version" command 
   from your shell (command prompt) and verify that the Java version is 1.7 or later.

2. Download the Apache Ant binary distribution (1.8.2+) from 
   http://ant.apache.org/  You will need Ant installed and the $ANT_HOME/bin (Windows: 
   %ANT_HOME%\bin) folder included on your command path. To test this, issue a 
   "ant -version" command from your shell (command prompt) and verify that Ant is 
   available. 

   You will also need to install Apache Ivy binary distribution (2.2.0) from 
   http://ant.apache.org/ivy/ and place ivy-2.2.0.jar file in ~/.ant/lib -- if you skip 
   this step, the Solr build system will offer to do it for you.

3. Download the Apache Solr distribution, linked from the above web site. 
   Unzip the distribution to a folder of your choice, e.g. C:\solr or ~/solr
   Alternately, you can obtain a copy of the latest Apache Solr source code
   directly from the Subversion repository:

     http://lucene.apache.org/solr/versioncontrol.html

4. Navigate to the "solr" folder and issue an "ant" command to see the available options
   for building, testing, and packaging Solr.
  
   NOTE: 
   To see Solr in action, you may want to use the "ant example" command to build
   and package Solr into the example/webapps directory. See also example/README.txt.


Export control
-------------------------------------------------
This distribution includes cryptographic software.  The country in
which you currently reside may have restrictions on the import,
possession, use, and/or re-export to another country, of
encryption software.  BEFORE using any encryption software, please
check your country's laws, regulations and policies concerning the
import, possession, or use, and re-export of encryption software, to
see if this is permitted.  See <http://www.wassenaar.org/> for more
information.

The U.S. Government Department of Commerce, Bureau of Industry and
Security (BIS), has classified this software as Export Commodity
Control Number (ECCN) 5D002.C.1, which includes information security
software using or performing cryptographic functions with asymmetric
algorithms.  The form and manner of this Apache Software Foundation
distribution makes it eligible for export under the License Exception
ENC Technology Software Unrestricted (TSU) exception (see the BIS
Export Administration Regulations, Section 740.13) for both object
code and source code.

The following provides more details on the included cryptographic
software:
    Apache Solr uses the Apache Tika which uses the Bouncy Castle generic encryption libraries for
    extracting text content and metadata from encrypted PDF files.
    See http://www.bouncycastle.org/ for more details on Bouncy Castle.

------------------------------------------------------------------------------

