Lucene: preliminary use, understanding, and code

Lucene full-text search

1. What is full-text search

The data we encounter in daily life can be roughly divided into:
  Structured data: data with a fixed length and format, such as database records and metadata.
  Unstructured data: data with no fixed length or format, such as Word documents and txt files.
  Structured data can be queried conveniently, for example through SQL statements. Unstructured data is much more troublesome: when we want to pull a particular word or piece of information out of it, there is no rule to search by.
  To find data inside unstructured data, we can:
  Extract the unstructured data in a certain format and reorganize it into structured data, so that it can be searched as easily as structured data. This information, extracted from unstructured data and then reorganized, is called an index.
  For example: in a dictionary, the explanation of each character is unstructured data, which would be very troublesome to search directly. We can first extract the pronunciation or radical of each character to form an index, with each entry pointing to a page number. When looking up a character, we first find the page number through the radical or pronunciation, which makes the lookup very convenient.

  This process of building an index first and then searching the index is called Full-text Search.

Full-text search process:
[Figure: the full-text search process]
1. Green indicates the indexing process: the original content to be searched is indexed to build the index library. Indexing includes: determining the original content to be searched, collecting documents, creating documents, analyzing documents, and indexing documents.

2. Red indicates the search process: content is searched from the index library. Searching includes: the user creates a query through the search interface, the query is executed against the index library, and the results are rendered.

2. Some concepts in Lucene

Original files:
  The data we want to search in; this is the raw data.
  For example: records in a database, txt files on disk...


Document object:
  After obtaining the content of the original files, we create an index by adding the content to Document objects, which are then stored in the index library. A Document contains many Field objects. A Document can be seen as a record in a database table, and a Field as a column of that table.
[Figure: the structure of a Document and its Fields]
Note: each Document can have multiple Fields, different Documents can have different Fields, and the same Document can contain duplicate Fields (same field name and field value).

Each document has a unique number, which is the document id.


3. My own understanding of Lucene

Lucene reads the original data and wraps each piece of it in a separate Document object. When documents are added to the index library, each field is automatically segmented by the tokenizer to form a dictionary, and the dictionary is indexed. The index (dictionary) and the Document objects are then added to the index library, together with the relationship between them (the inverted index structure). When a user searches for a word, Lucene first looks it up in the index (dictionary), then finds the corresponding Document objects through the dictionary-to-document relationship, and finally reads the matching Field objects to return the content.


Inverted index structure:
[Figure: dictionary (left) and inverted table of document ids (right)]
On the left is the dictionary; on the right is the inverted table, which records, for each word, the ids of the documents containing it. A query looks up the word, finds the corresponding documents, and then retrieves the content; this is the inverted index structure.
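To make this concrete, here is a minimal sketch of the idea, independent of Lucene's actual implementation (the two documents and their ids are made up for illustration):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexDemo {
    public static void main(String[] args) {
        // Two made-up documents; the array index serves as the document id
        String[] docs = {
                "lucene is a full text search library",  // doc 0
                "lucene builds an inverted index"        // doc 1
        };

        // Dictionary on the left, inverted table on the right:
        // each word maps to the ids of the documents that contain it
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.length; id++) {
            for (String word : docs[id].split(" ")) {
                index.computeIfAbsent(word, k -> new ArrayList<>()).add(id);
            }
        }

        // Searching a word is a dictionary lookup that returns document ids
        System.out.println(index.get("lucene"));  // [0, 1]
        System.out.println(index.get("library")); // [0]
    }
}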

4. Steps to use Lucene

1. Import the jar packages.
[Figure: the Lucene jar packages to import]

2. Create the original files.
[Figure: the original files on disk]
3. Create the index library, parsing all the files above and writing them into it:

@Test
public void createLucene() throws IOException {
    //1. Create the Directory object and specify where the index library is stored.
    //This variant stores the index library on the local disk
    Directory dir = FSDirectory.open(new File("D:\\luncene").toPath());
    //This variant keeps the index library in memory (not recommended: it must be rebuilt on every run)
    //Directory dir1 = new RAMDirectory();
    //2. Writer configuration: StandardAnalyzer is used by default, which is not friendly to Chinese;
    //a Chinese analyzer can be passed to the constructor later
    IndexWriterConfig config = new IndexWriterConfig();
    //3. Create the IndexWriter from the Directory and IndexWriterConfig; it writes data into the index library
    IndexWriter writer = new IndexWriter(dir, config);
    //4. Load the data and wrap it in Document objects
    File filePaths = new File("D:\\lucenePro");
    //List all files in the folder
    File[] files = filePaths.listFiles();
    //Iterate over the files
    for (File file : files) {
        //File name
        String fileName = file.getName();
        //File path
        String filePath = file.getPath();
        //File content
        String fileContent = FileUtils.readFileToString(file, "UTF-8");
        //File size
        long fileSize = FileUtils.sizeOf(file);

        //Create the Field objects and put the corresponding values into them
        Field fieldName = new TextField("name", fileName, Field.Store.YES);
        Field fieldPath = new TextField("path", filePath, Field.Store.YES);
        Field fieldContent = new TextField("content", fileContent, Field.Store.YES);
        //Index the size as a LongPoint so the numeric range query in section 11 can match it,
        //and store it as a separate StoredField so it can be read back from search results
        Field fieldSize = new LongPoint("size", fileSize);
        Field fieldSizeStored = new StoredField("size", fileSize);
        //Create the Document and add the fields to it
        Document document = new Document();
        document.add(fieldName);
        document.add(fieldPath);
        document.add(fieldContent);
        document.add(fieldSize);
        document.add(fieldSizeStored);

        //Write the document into the index library
        writer.addDocument(document);
    }

    //Release the writer
    writer.close();
}

Note: if you use FileUtils, you need the commons-io jar on the classpath.
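Assuming a Maven build (the version number below is only illustrative), the dependency looks like this:

<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.6</version>
</dependency>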

After execution, open the folder that the Directory object points to and you can see:
[Figure: the generated index files]
The index library has been created successfully.

5. Use Luke to view the index library

Luke download address

After downloading, you must go into Luke's directory, open a terminal in the directory containing the pom.xml file, and run mvn package before Luke can be used; otherwise you get an error like:
Error: Unable to access jarfile .\target\luke-swing-with-deps.jar

In addition: recent Lucene distributions ship with Luke included, so you can simply double-click it to use it.

6. Query the index library

@Test
public void queryLucene() throws IOException {
    //Create the Directory object pointing at the index library
    Directory dir = FSDirectory.open(new File("D:\\luncene").toPath());
    //Create the reader
    IndexReader reader = DirectoryReader.open(dir);
    //Create the searcher
    IndexSearcher searcher = new IndexSearcher(reader);
    //Build the query condition. Term: the first argument is the field to query,
    //the second is the keyword to search for
    Query query = new TermQuery(new Term("name", "apache"));
    //Search. First argument: the query condition; second argument: the maximum number of results
    TopDocs topDocs = searcher.search(query, 10);
    //Total number of hits
    TotalHits totalHits = topDocs.totalHits;
    System.out.println(totalHits);

    //Get the ScoreDoc[] from the TopDocs; each ScoreDoc holds a document id
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (ScoreDoc scoreDoc : scoreDocs) {
        //Look up the Document object by its id
        Document doc = searcher.doc(scoreDoc.doc);
        //Read the stored field values
        String name = doc.get("name");
        String path = doc.get("path");
        String content = doc.get("content");
        String size = doc.get("size");
        System.out.println(name);
        System.out.println(path);
        System.out.println(content);
        System.out.println(size);
        System.out.println("-----------------------------------");
    }
    reader.close();
}

7. Chinese word segmenter

Using the built-in StandardAnalyzer, English is segmented well, but Chinese is not: every single character is treated as a separate word. To segment Chinese properly, you need a Chinese analyzer: IKAnalyzer.
  How to use it:
1. Import the jar package and the configuration files
[Figure: the IKAnalyzer jar and configuration files]
Configuration files:
hotword.dic: extension dictionary; special words added here are recognized as whole tokens by the analyzer
stopword.dic: the list of stop words and sensitive words
IKAnalyzer.cfg.xml: IKAnalyzer configuration information; a typical file is sketched below
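For reference, a typical IKAnalyzer.cfg.xml looks roughly like this (the ext_dict/ext_stopwords keys follow the stock file shipped with IK Analyzer; the .dic file names must match your own files):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionary: custom words to be recognized as whole tokens -->
    <entry key="ext_dict">hotword.dic;</entry>
    <!-- extension stop-word dictionary -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>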

Use Cases:

/**
 * Indexing with the Chinese analyzer
 */
@Test
public void createIndexByIk() throws IOException {
    //Create the Directory object
    Directory dir = FSDirectory.open(new File("D:\\luncene").toPath());
    //Create the IndexWriterConfig and specify the analyzer
    IndexWriterConfig writerConfig = new IndexWriterConfig(new IKAnalyzer());
    //Create the IndexWriter
    IndexWriter writer = new IndexWriter(dir, writerConfig);
    //The data to index (a Chinese sample sentence about IK Analyzer)
    String text = "但是老外写的分词器对中文分词一般都是单字分词,分词的效果不好。 国人林良益写的IK Analyzer应该是最好的Lucene中文分词器之一,而且随着Lucene的版本更新而不断更新";
    Field field = new TextField("name", text, Field.Store.YES);
    Document document = new Document();
    document.add(field);

    writer.addDocument(document);
    writer.close();
}

Result:
[Figure: the tokens produced by IKAnalyzer]
If some words need special segmentation, such as company names, they must be maintained separately in hotword.dic.

8. Maintaining the index library: adding an index

This works the same way as building the index library initially, except that only a single file is added; see the sketch below.
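A minimal sketch, following the same pattern as createLucene above (the field values here are made up for illustration):

@Test
public void addIndex() throws IOException {
    //Open the existing index library for writing, using the same analyzer as before
    IndexWriter writer = new IndexWriter(FSDirectory.open(new File("D:\\luncene").toPath()),
            new IndexWriterConfig(new IKAnalyzer()));
    //Wrap the single new file's data in a Document, exactly as when building the library
    Document document = new Document();
    document.add(new TextField("name", "new.txt", Field.Store.YES));
    document.add(new TextField("content", "content of the newly added file", Field.Store.YES));
    //Append it to the existing index library
    writer.addDocument(document);
    writer.close();
}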

9. Deleting indexes from the index library

1. Delete all indexes

@Test
public void deleteAllIndex() throws IOException {
    IndexWriter writer = new IndexWriter(FSDirectory.open(new File("D:\\luncene").toPath()),
            new IndexWriterConfig(new IKAnalyzer()));
    //Delete all indexes. Use with great care: once deleted, they cannot be recovered!
    writer.deleteAll();
    writer.close();
}

2. Delete the index according to Query

@Test
public void deleteByQuery() throws IOException {
    IndexWriter writer = new IndexWriter(FSDirectory.open(new File("D:\\luncene").toPath()),
            new IndexWriterConfig(new IKAnalyzer()));
    //Declare the Query
    Query query = new TermQuery(new Term("name", "web"));
    //Delete the documents that match the query
    writer.deleteDocuments(query);
    writer.close();
}

10. Updating the index library

An update actually deletes the old index entry first and then adds the new one; Lucene wraps both steps into a single call, as sketched below.
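A minimal sketch using IndexWriter.updateDocument (the term "apache" and the replacement field value are made up for illustration):

@Test
public void updateIndex() throws IOException {
    IndexWriter writer = new IndexWriter(FSDirectory.open(new File("D:\\luncene").toPath()),
            new IndexWriterConfig(new IKAnalyzer()));
    //The replacement document
    Document document = new Document();
    document.add(new TextField("name", "updated document name", Field.Store.YES));
    //Deletes every document whose name field contains the term "apache",
    //then adds the new document, all in one call
    writer.updateDocument(new Term("name", "apache"), document);
    writer.close();
}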

11. Querying the index

1. Query:
TermQuery: keyword query

  Not demonstrated here; it works the same as the query in section 6.


Range query (built with LongPoint/IntPoint in current Lucene versions):

@Test
public void selectQuery() throws IOException {
    //Query has two common kinds:
    //1. TermQuery: keyword query; takes the field to search and the keyword (not demonstrated here)
    //2. Range query: generally used for numeric fields
    IndexReader indexReader = DirectoryReader.open(FSDirectory.open(new File("D:\\luncene").toPath()));
    //Create the searcher
    IndexSearcher searcher = new IndexSearcher(indexReader);
    //Build the range query. LongPoint must match the type used when the field was indexed;
    //if the field was indexed as an IntPoint, use IntPoint.newRangeQuery instead
    Query query = LongPoint.newRangeQuery("size", 10L, 100L);
    TopDocs topDocs = searcher.search(query, 10);
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (ScoreDoc scoreDoc : scoreDocs) {
        System.out.println(searcher.doc(scoreDoc.doc).get("name"));
    }
    indexReader.close();
}

2. QueryParser: it first tokenizes the query string with an analyzer and then searches using the resulting terms. It lives in a separate jar, lucene-queryparser, which must be added to the project.
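Assuming a Maven build, the dependency is roughly as follows (the version should match your lucene-core version; 8.4.0 is only an example):

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>8.4.0</version>
</dependency>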
test:

@Test
public void QueryParserTest() throws IOException, ParseException {
    IndexReader indexReader = DirectoryReader.open(FSDirectory.open(new File("D:\\luncene").toPath()));
    IndexSearcher searcher = new IndexSearcher(indexReader);
    //Create the QueryParser. Argument 1: the default search field, used when the query string
    //does not name a field. Argument 2: the analyzer to use
    QueryParser queryParser = new QueryParser("name", new IKAnalyzer());
    //Parse the query string into a Query
    Query query = queryParser.parse("Lucene是java开发的");
    //Search
    TopDocs topDocs = searcher.search(query, 10);
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (ScoreDoc scoreDoc : scoreDocs) {
        Document doc = searcher.doc(scoreDoc.doc);
        System.out.println(doc.get("name"));
    }
    indexReader.close();
}

The search result shows that QueryParser first segments the query string and searches with each resulting term; any document that matches at least one of the terms is returned.

This concludes the initial use of Lucene. Most projects later use frameworks built on top of Lucene, such as Elasticsearch (es) and Solr.

Origin: blog.csdn.net/weixin_43431123/article/details/112480828