To optimize ES retrieval, first understand the underlying Lucene: a tour of the Lucene source code structure

Analysis of Lucene source code structure

Foreword

I haven’t written a blog post in half a year. If it weren’t for the many challenges I ran into at work, I probably wouldn’t have bothered to study the Lucene source code. Sure enough, deadlines and difficulties are the primary drivers of productivity. So yes: I want to write a series of posts about storage engines.

When I was working at a start-up, I was asked to optimize Elasticsearch. At the time I took a cursory look at the principles underlying ES and Lucene (Lucene is the basic retrieval unit of ES: each ES shard corresponds to a Lucene instance), but it stayed at the surface and I never dug into the source code. Later, after joining Baidu, I became part of the Chuisou team, hoping to gain a deeper understanding of general-purpose search engines.

I have long held the view that the best way to learn something is to explain it to others, and that gives me the motivation to write this Lucene source code analysis. Since I am writing while learning, mistakes due to my limited level are inevitable, and I hope readers will point them out. Another important motivation is that blogs explaining the Lucene source code are scarce, and the number of related books is essentially zero: the few that exist are based on a Lucene version from ten years ago, and most blog posts do not cover Lucene systematically, picking out only one or two isolated topics. Having read most of the Lucene source code analyses available, I find that many important points are omitted. I can say without hesitation that the series I am writing aims to be the most comprehensive and approachable of them all.

This series is mainly aimed at readers who have experience using ES or Lucene and who understand basic concepts such as Term, Doc, Field, and PostingList. The posts will not teach you how to build a search engine with Lucene; rather, they use the Lucene source code to explain the principles behind a search engine.

Coming back to the present article: it does not go into code details. Instead it lays out the overall framework, explains how I plan to walk through the Lucene source code, and surveys the structure of the Lucene project itself.

Table of contents

My rough draft divides the series into the following chapters, though the plan will likely expand as needed. Some topics cannot be covered thoroughly in a single chapter; special data structures such as the BKD tree, for example, really deserve a chapter of their own.

0. Overview

That is, going through the code structure of the entire Lucene project to figure out what each major class is roughly responsible for.

1. Stored Field storage method

The so-called forward index is the Stored Field, which stores the fields' original content. Here we focus on the following questions: 1. How is each data type stored? 2. How is the index that is finally written compressed?
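As a tiny illustration of what a stored field is, here is a hedged sketch (the field name, writer, searcher, and docID are hypothetical, and a Lucene 7.x classpath is assumed):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;

// Write side: a stored-only field is written to the stored-fields files (.fdt/.fdx).
// "writer" is assumed to be an open IndexWriter.
Document doc = new Document();
doc.add(new StoredField("url", "https://example.com/1"));
writer.addDocument(doc);

// Read side: stored fields are loaded by docID, e.g. to render a result page.
// "searcher" is assumed to be an IndexSearcher over the same index.
Document hit = searcher.doc(docID);
String url = hit.get("url");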

2. Doc Value storage method

The Doc Value here is key-value data, a structure designed to speed up filtering and sorting. The main concerns are: 1. What DocValue types exist (SortedNumericDocValues, SortedSetDocValues, and so on), and what are their application scenarios? 2. How are DocValues stored?
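For intuition, a minimal sketch of the write and read paths (the field names are invented; "writer" and "searcher" are assumed to be an open IndexWriter and IndexSearcher on Lucene 7.x):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.BytesRef;

Document doc = new Document();
doc.add(new NumericDocValuesField("price", 1999));                 // one numeric value per doc
doc.add(new SortedSetDocValuesField("tag", new BytesRef("sale"))); // multi-valued sorted terms
writer.addDocument(doc);

// Sorting reads the column-stride doc values, not the stored fields.
TopDocs top = searcher.search(new MatchAllDocsQuery(), 10,
    new Sort(new SortField("price", SortField.Type.LONG)));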

3. Point Values storage method

This is a data structure introduced in Lucene 6 to speed up RangeQuery. The underlying structure is a BKD tree, so we focus on: how are PointValues stored, and how do they optimize RangeQuery?
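For example (a sketch; the "price" field is invented), the point-based range query API on Lucene 7.x is a one-liner:

import org.apache.lucene.document.IntPoint;
import org.apache.lucene.search.Query;

// Index side: doc.add(new IntPoint("price", 1999)) puts the value into the BKD tree.
// Query side: the BKD tree is what accelerates this range query.
Query q = IntPoint.newRangeQuery("price", 1000, 2000); // matches 1000 <= price <= 2000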

4. Norm Value storage method

5. Term Vector storage method

6. Inverted index storage method

The preceding chapters mainly cover the forward index; here we turn to the inverted index. The core of a search engine is the storage of the inverted index. We focus on: how is the inverted index compressed? How is it stored? Which data structures are used?
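To make "term -> [doc1, doc3, doc5]" concrete before we dig into the encoding, here is a sketch of walking a posting list through the public API (Lucene 7.x assumed; "leafReader" is an assumed LeafReader, and the field and term are invented):

import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

Terms terms = leafReader.terms("body");        // the per-field term dictionary (may be null)
TermsEnum te = terms.iterator();
if (te.seekExact(new BytesRef("lucene"))) {    // look the term up in the dictionary
    PostingsEnum postings = te.postings(null); // its posting list: docIDs in increasing order
    for (int d = postings.nextDoc(); d != DocIdSetIterator.NO_MORE_DOCS; d = postings.nextDoc()) {
        System.out.println("doc=" + d + " freq=" + postings.freq());
    }
}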

7. Workflow of index construction

This chapter mainly covers the process Lucene goes through from receiving an indexing request to the final encoding and storage, focusing on flush, commit, and how multi-threading is used.

8. Forward index retrieval method

9. RangeQuery retrieval method

10. Inverted index retrieval method

11. Posting list merge method

12. Segment merge method

13. Other

Lucene source code project structure

This blog series is based on Lucene 7.7.3. Each major version of Lucene brings many iterative changes, so be sure to pick a suitable stable version when studying the source code. Besides reading the code, the main way to chew through it is to run a demo in debug mode and step through it; that way you can understand almost everything. As is typical of open-source projects, the class hierarchy is deep and the packaging and abstraction are excellent. It also uses many design patterns common in industry, which makes it well worth learning from.

Here, let's first sort out which classes each directory of the Lucene project contains, and what they are mainly used for.

.
├── LucenePackage.java
├── analysis
├── codecs
├── document
├── geo
├── index
├── package-info.java
├── search
├── store
└── util

analysis

This package is mainly used to analyze queries and documents, breaking them into individual tokens. It is not the focus of our study, for one very important reason: in practice we often don't want to use the analyzers Lucene provides by default, and this tokenization work is frequently done offline.

codecs

The encoding package. It includes the definitions and implementations of the encoding and decoding of all kinds of data, as well as the implementations of data structures such as the BKD tree and the skip list. It can be regarded as a core package.

Let me briefly describe what each class is for. A large number of them are abstract classes, containing only declarations and no implementation. This is a very extensible design: developers can implement subclasses that meet their own needs on top of these abstract classes.

├── BlockTermState.java    # records a term's state within a block
├── Codec.java  # abstract class; defines how an index is encoded // TODO
├── CodecUtil.java  # utility class for reading version headers
├── CompoundFormat.java  # abstract class; defines the compound file format
├── DocValuesConsumer.java # abstract class; declares the create and merge interfaces for DocValues
├── DocValuesFormat.java # abstract class; defines the DocValues format // TODO
├── DocValuesProducer.java # abstract class; declares the DocValues read interface
├── FieldInfosFormat.java # abstract class; declares the read/write interface for FieldInfos
├── FieldsConsumer.java # abstract class; declares the write and merge interfaces for all fields
├── FieldsProducer.java # abstract class // TODO
├── FilterCodec.java # abstract class; forms a delegation pattern together with Codec: this is the delegator, Codec is the delegate
├── LegacyDocValuesIterables.java # deprecated
├── LiveDocsFormat.java # abstract class; declares the read/write operations for live/deleted documents // TODO
├── MultiLevelSkipListReader.java # skip list reader
├── MultiLevelSkipListWriter.java # skip list writer
├── MutablePointValues.java # abstract class; defines a mutable PointValues type
├── NormsConsumer.java # abstract class; declares the methods for writing norms
├── NormsFormat.java # abstract class; defines the norms format
├── NormsProducer.java # abstract class; declares the norms read interface
├── PointsFormat.java # abstract class; declares the points format
├── PointsReader.java  # abstract class; declares the methods for reading point values
├── PointsWriter.java  # abstract class; declares the methods for writing point values
├── PostingsFormat.java # abstract class; declares the postings format
├── PostingsReaderBase.java # abstract class; declares the methods for reading postings lists
├── PostingsWriterBase.java # abstract class; declares the methods for writing postings lists
├── PushPostingsWriterBase.java  # adds a push API on top of the class above; a SAX-style API, where the above is DOM-style
├── SegmentInfoFormat.java  # abstract class; declares the SegmentInfo format
├── StoredFieldsFormat.java # abstract class; declares the StoredFields format
├── StoredFieldsReader.java # abstract class; declares the methods for reading stored fields
├── StoredFieldsWriter.java # abstract class; declares the methods for writing stored fields
├── TermStats.java  # data class recording docFreq and termFreq
├── TermVectorsFormat.java # abstract class; declares TermVectors-related methods
├── TermVectorsReader.java # abstract class; declares the methods for reading term vectors
├── TermVectorsWriter.java # abstract class; declares the methods for writing term vectors
├── blocktree # all the TermDict-related encoding lives here
├── compressing # the final implementations of the StoredFields and TermVectors abstractions, plus some compression algorithms
├── lucene50 # the final implementations of PostingsReader, PostingsWriter, SkipReader, SkipWriter
├── lucene60  # the final implementations of PointsReader, PointsWriter, PointsFormat
├── lucene62 # the final implementation of SegmentInfoFormat
├── lucene70 # the final implementations of DocValuesWriter, DocValuesReader, NormsConsumer, NormsProducer
├── package-info.java
└── perfield # support for per-field formats

The inheritance diagram is as follows:

Don't be intimidated by all the classes above; they can in fact be broken down along two dimensions:

The first dimension is data. The entire Lucene codebase divides the data it processes into the following categories:

  1. PostingList: the inverted list, i.e. inverted index data of the form term -> [doc1, doc3, doc5].
  2. BlockTree: the mapping from a term to its PostingList. This mapping is generally represented by an FST, a tree-shaped structure similar to a trie, which is why Lucene calls it BlockTree here; I am more used to calling it the Term Dict.
  3. StoredField: the original field content as stored.
  4. DocValue: key-value data, mainly used to speed up sorting and filtering on fields.
  5. TermVector: term vector information, mainly recording statistics such as the occurrence frequency of each distinct term.
  6. Norms: normalization information, used for example to boost certain fields.
  7. PointValue: data used to speed up range queries.

The second dimension is behavior, i.e. the Writer and Reader defined for each kind of data. A Format is essentially the medium through which the corresponding Writer and Reader are obtained.

The classes in this directory are derived from the permutations and combinations of these two dimensions, which makes them easy to keep straight.
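The cross product shows up directly in the Codec API: one Format accessor per data category, and each Format hands out the matching Reader/Writer pair. A minimal sketch (assuming a Lucene 7.x classpath, where the default codec is Lucene70):

import org.apache.lucene.codecs.Codec;

public class CodecInspect {
    public static void main(String[] args) {
        Codec codec = Codec.getDefault();
        System.out.println("codec:         " + codec.getName());
        System.out.println("postings:      " + codec.postingsFormat());
        System.out.println("doc values:    " + codec.docValuesFormat());
        System.out.println("stored fields: " + codec.storedFieldsFormat());
        System.out.println("term vectors:  " + codec.termVectorsFormat());
        System.out.println("norms:         " + codec.normsFormat());
        System.out.println("points:        " + codec.pointsFormat());
    }
}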

document

This package mainly defines data types such as Field, Document, Point, and DocValue, together with their combinations with primitive types such as int, float, and String.
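A sketch of how these types combine on a single document (the field names are invented; Lucene 7.x assumed):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;

Document doc = new Document();
doc.add(new StringField("sku", "A-42", Field.Store.NO));   // indexed as a single un-tokenized term
doc.add(new StoredField("title", "Lucene in a nutshell")); // stored only, fetched by docID
doc.add(new IntPoint("price", 1999));                      // BKD tree, accelerates range queries
doc.add(new NumericDocValuesField("price_sort", 1999));    // doc values, for sorting/filtering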

geo

Some utility classes for geographic information.

index

The classes here are very rich; I consider this the absolute core package of Lucene. There are too many files in the package to cover one by one here, so I will just post the class inheritance diagram:

It can be broken down into these categories:

  1. Reader-related: IndexReader, LeafReader, CompositeReader, and so on.
  2. DocValues-related: SortedDocValues, SortedNumericDocValues, NumericDocValues, BinaryDocValues. These are clearer when read together with TermsEnum and DocValuesWriter.
  3. MergePolicy-related: the segment merging strategies. These classes define when to merge, how to merge, merge sizes, and other details.
  4. TermsEnum-related: in essence a collection class over terms, organizing terms field by field.
  5. IndexDeletionPolicy-related: the index deletion policies.
  6. TermsHash-related: the main role of this class is to serve as the base class of TermVectorsConsumer and FreqProxTermsWriter.
  7. DocValuesWriter: this one should be familiar; under it sit SortedDocValuesWriter, BinaryDocValuesWriter, SortedSetDocValuesWriter, SortedNumericDocValuesWriter, and NumericDocValuesWriter.
  8. MergeScheduler-related: defines the mechanics of executing segment merges. By default the subclass ConcurrentMergeScheduler is used, which merges with multiple threads; SerialMergeScheduler, which merges segments serially, is rarely used (a configuration sketch follows this list).
  9. DocValuesFieldUpdates: holds the DocValues update information for all documents in a segment.
  10. In addition, there is necessary structural information, such as SegmentInfo and other state classes like WriterState, plus definitions of core classes such as Term, DocWriter, and IndexWriter, which will come up later.
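As a small illustration of the MergePolicy and MergeScheduler items above, a configuration sketch (the size cap and index path are illustrative values; Lucene 7.x assumed):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;

TieredMergePolicy policy = new TieredMergePolicy();
policy.setMaxMergedSegmentMB(1024);                     // cap the size of merged segments
IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer())
    .setMergePolicy(policy)
    .setMergeScheduler(new ConcurrentMergeScheduler()); // the multi-threaded default
try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/tmp/idx")), cfg)) {
    // ... addDocument calls ...
    writer.commit(); // fsync the segment files and publish a new commit point
}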

search

The retrieval-related classes all live here:

  1. One large category is the Query implementations. Lucene provides developers with all kinds of queries, such as the simple TermQuery, BooleanQuery for Boolean retrieval, SynonymQuery for synonym support, and so on; these will be introduced later.
  2. Scorer-related classes; dedicated scorers are used in the PhraseQuery scenario, for example.
  3. Collector-related classes, which provide a variety of Collector implementations for collecting the final results; the most commonly used is TotalHitCountCollector (see the sketch after this list).
  4. Some utility classes that support the above, such as DocIdSet and HitQueue.
  5. Two subdirectories are worth attention: similarities, which provides the algorithms for computing similarity, and spans, which is used to build some advanced queries; both will be discussed later.
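A minimal counting-search sketch (the index path, field, and term are invented; Lucene 7.x assumed):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;
import org.apache.lucene.store.FSDirectory;

try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/idx")))) {
    IndexSearcher searcher = new IndexSearcher(reader);
    TotalHitCountCollector collector = new TotalHitCountCollector(); // counts hits, never ranks them
    searcher.search(new TermQuery(new Term("sku", "A-42")), collector);
    System.out.println("hits: " + collector.getTotalHits());
}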

store

And finally, the classes for writing to and reading from disk are here.

In my opinion, the best part of Lucene's design is that it separates the logic of persisting data to disk from the behavior that generates that data, through the abstractions DataInput, DataOutput, and Directory:

  1. DataInput provides an abstraction over reading data: it defines how to read a vint, a vlong, and so on.
  2. DataOutput provides an abstraction over writing data: it defines how to write a vint, a vlong, and so on.
  3. Directory provides the ways to read and write the underlying storage, e.g. SimpleFSDirectory, NIOFSDirectory, and MMapDirectory.
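A round-trip sketch of these abstractions (the file name and path are invented; Lucene 7.x assumed):

import java.nio.file.Paths;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.MMapDirectory;

try (Directory dir = new MMapDirectory(Paths.get("/tmp/idx"))) {
    try (IndexOutput out = dir.createOutput("demo.bin", IOContext.DEFAULT)) {
        out.writeVInt(128);       // variable-length int: small values take fewer bytes
        out.writeVLong(1L << 40);
        out.writeString("lucene");
    }
    try (IndexInput in = dir.openInput("demo.bin", IOContext.DEFAULT)) {
        System.out.println(in.readVInt());   // 128
        System.out.println(in.readVLong());  // 1099511627776
        System.out.println(in.readString()); // lucene
    }
}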

util

Various encoding and algorithm implementations live in this package, specifically including:

  1. automaton: the implementation of finite-state machines, effectively a regular-expression engine; it is the essential machinery behind regexp queries.
  2. bkd: the implementation of the BKD tree, built to speed up range queries.
  3. fst: the implementation of the FST. Its practical purpose is to provide the term -> id mapping, with the advantages of fast lookup, low memory consumption, and support for prefix queries (a small sketch follows this list).
  4. graph: mentioned later; it provides helpers for automaton.
  5. mutable: implementations of mutable values.
  6. packed: several ways of encoding packed integers, such as two-byte and four-byte encodings; mentioned later.
  7. Others.
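The FST item above can be made concrete with a small sketch on the Lucene 7.x API (the keys and output values are invented):

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
IntsRefBuilder scratch = new IntsRefBuilder();
// Keys must be added in sorted order; the FST shares both prefixes and suffixes.
builder.add(Util.toIntsRef(new BytesRef("cat"), scratch), 5L);
builder.add(Util.toIntsRef(new BytesRef("dog"), scratch), 7L);
builder.add(Util.toIntsRef(new BytesRef("dogs"), scratch), 13L);
FST<Long> fst = builder.finish();
System.out.println(Util.get(fst, new BytesRef("dog"))); // prints 7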

Having finished this chapter, I feel a slight regret: it would probably have been better to come back and write this table of contents after the main articles were done. In any case, the hole is dug; I will fill it in gradually.
