Alibaba Cloud InfluxDB®: The Data Query Process Based on the TSI Index

Introduction

Time-series databases are among the fastest-growing database categories. InfluxDB, the industry's most popular time-series database, is simple to deploy and easy to operate, supports high-performance reads and writes of time-series data, and is widely used in fields such as application monitoring and the Internet of Things (IoT).

Alibaba Cloud InfluxDB® is an optimized version of the open-source InfluxDB. Its internal data organization is consistent with the open-source version and can be summarized in two parts: data and index. When a user submits an InfluxQL query, InfluxDB parses the statement into an AST, traverses the tree to find the key metadata to be queried (the measurement, tag key/value pairs, and so on), uses the index to locate the corresponding time-series data in the data files (TSM), decodes the data points, and returns them to the user. There are two index implementations. One is the in-memory inverted index (inmem), which is limited by available memory; if the process goes down, all TSM files must be scanned and parsed to rebuild the index in memory, so recovery takes a long time. The other is the file-based inverted index (TSI), which has a small memory footprint, can support millions or even tens of millions of time series, and is barely affected by downtime recovery.

This article focuses on the file-based inverted index (TSI) and walks through, in depth, what happens inside InfluxDB when a user runs a data query.

Process Overview

Query entry point

InfluxDB registers a number of internal services; the httpd service is responsible for handling external requests. Normally, read and write requests arrive at services/httpd/handler.go. For a select statement, the handler invoked is serveQuery.

Query preparation

Query preparation covers four aspects: updating counters, parameter parsing, AST generation, and authentication.

1. Counter statistics

First, the query counter h.stats.QueryRequests is incremented. Second, a defer is set up to record the elapsed time when the current query finishes:

defer func(start time.Time) {
    atomic.AddInt64(&h.stats.QueryRequestDuration, time.Since(start).Nanoseconds())
}(time.Now())

2. Parameter parsing

InfluxDB needs to parse several critical parameters from the submitted query form in order to determine details such as the database to access, the retention policy, and the output format:

parameter name   effect
q                the query statement
node_id          node id in the cluster edition; ignored in the standalone edition
epoch            timestamp precision of the output; valid values are epoch=[h, m, s, ms, u, ns]
db               the database to query
rp               the retention policy to query
params           additional parameters
chunked          controls whether points are returned in streamed batches instead of a single response; if set to true, InfluxDB batches the response per series or per chunk_size points (default 10,000)
chunk_size       number of points per batch
async            whether the query executes asynchronously

3. AST generation

After the query is extracted from the form, InfluxDB uses its own InfluxQL parsing framework to generate an internal AST, similar to a traditional relational database. Since the InfluxQL parser is a large body of code with little bearing on the core query logic, this article does not expand on it.

4. Authentication

For security reasons, users may enable authentication on InfluxDB. Once enabled, only a user with matching permissions and the correct password may access the corresponding database.

Query execution

After the preparations are complete, the executeSelectStatement function in coordinator/statement_executor.go is eventually executed; this function drives the query processing. The path from serveQuery in services/httpd/handler.go to executeSelectStatement in coordinator/statement_executor.go passes through many layers of function calls. To make the code easier to follow, the figure below shows the call stack (where a call goes through a Go interface, the figure shows the concrete struct that implements the interface).


[Figure: call stack from serveQuery to executeSelectStatement]


Inside executeSelectStatement there are several key actions:
(1) createIterators creates the iterators; internally, creation accesses the TSI and decodes TSM data (analyzed in depth in the "Query details: iterators, TSI, and TSM" section);
(2) an Emitter is created from the iterators; the Emitter cursor's Scan() function is called in a loop to read data points one at a time from the cursor.

The Emitter may encounter several situations while reading data:
(a) if there are multiple series, it generally takes the first point of each series in turn, then the second point of each series, and so on;
(b) if the data read so far reaches the configured chunkSize (default 10,000 when the query parameter is absent), the partial result is returned first, making the response batched;
(c) if the query is an aggregation such as count, the cursor may return a single aggregate value; the aggregation work is done by the reducer-related iterators;
(d) if the already decoded data has been fully scanned and more reading is needed, the cursor forces the underlying TSM files to be decoded again, until all the required data has been read.
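The chunked read loop in case (b) can be sketched as follows. The cursor interface and emit function here are simplified stand-ins for illustration; InfluxDB's actual Emitter and cursor types are considerably richer.

```go
package main

import "fmt"

// cursor is a minimal stand-in for the cursor the Emitter drains.
// Scan reports ok=false when the cursor is exhausted.
type cursor interface {
	Scan() (seriesKey string, ts int64, value float64, ok bool)
}

type row struct {
	series string
	ts     int64
	val    float64
}

// sliceCursor serves pre-built rows, standing in for a real TSM-backed cursor.
type sliceCursor struct {
	rows []row
	i    int
}

func (c *sliceCursor) Scan() (string, int64, float64, bool) {
	if c.i >= len(c.rows) {
		return "", 0, 0, false
	}
	r := c.rows[c.i]
	c.i++
	return r.series, r.ts, r.val, true
}

// emit drains the cursor and flushes a batch every chunkSize points,
// mirroring case (b). It returns the number of batches flushed.
func emit(cur cursor, chunkSize int, flush func(batch []row)) int {
	var batch []row
	batches := 0
	for {
		s, ts, v, ok := cur.Scan()
		if !ok {
			break
		}
		batch = append(batch, row{s, ts, v})
		if len(batch) == chunkSize {
			flush(batch) // return this partial result to the client early
			batch = nil
			batches++
		}
	}
	if len(batch) > 0 {
		flush(batch) // final partial batch
		batches++
	}
	return batches
}

func main() {
	cur := &sliceCursor{rows: make([]row, 25)}
	fmt.Println(emit(cur, 10, func(b []row) {})) // 25 points, chunks of 10 -> 3 batches
}
```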


[Figure: the Emitter's data-reading process]

Returning results

When the query finishes, in the synchronous case the results are taken from the channel in order, processed according to the chunked protocol, and finally returned to the user.

Query details: iterators, TSI, and TSM

CreateIterator

As mentioned, a key action of executeSelectStatement, the main function that executes a select, is creating the iterator (or cursor). This idea is similar to how a traditional relational database executes its plan tree. Creating the cursor depends on the AST (SelectStatement) generated by InfluxQL parsing during query preparation. The creation process also resembles a traditional relational database and is split into two steps, Prepare and Select; Prepare can be further divided into Compile and the Prepare work after Compile, while Select builds the iterators from the prepared SelectStatement.

The Prepare step

Prepare first runs Compile, which mainly does the following:
1. Preprocessing: parse, validate, and record global attributes of the current query, such as parsing the query's time range and validating the legality of the query conditions and aggregation conditions.
2. Field preprocessing: for example, a query on the time field is automatically rewritten to timestamp.
3. Rewrite distinct query conditions.
4. Rewrite regular-expression query conditions.

The Prepare work after Compile mainly does the following:
1. If the query aggregates, the max-select-buckets limit is configured, and the lower bound of the query time range is unspecified, rewrite the lower bound according to the limit.
2. If an additional query-interval configuration exists, adjust the upper and lower bounds of the query time range.
3. Obtain the shard map to query (a LocalShardMapping object) from the time bounds.
4. For a wildcard query, replace * with all possible tag keys and field keys (fetching the tag keys already accesses the TSI index).
5. Validate that the query type is legal.
6. Determine the start and end times aligned to the query interval (group by time()) and check again that the number of buckets does not exceed the max-select-buckets configuration.

The Select step

Inside InfluxDB there are five iterator builders: buildFieldIterator, buildAuxIterator, buildExprIterator, buildCallIterator, and buildVarRefIterator. Different queries trigger different creation paths; the builders call one another and together produce the final cursor.

As the Select function builds the cursor, the calls descend until we reach Engine.CreateIterator. An Engine corresponds to one shard; if the query spans multiple shards, an outer loop walks all shards involved in the query (LocalShardMapping) and calls CreateIterator on each shard's Engine, which is responsible for the data that falls in that shard. Depending on whether the query reads raw data or computes an aggregate function, Engine.CreateIterator calls createVarRefIterator or createCallIterator.

Both ultimately call the createTagSetIterators function. Before the call, all matching series are looked up as call arguments (this is where the TSI index is accessed). Next, the program splits the series evenly, using the smaller of the series count and the number of CPU cores as the divisor, and calls createTagSetGroupIterators to continue processing. That function iterates over its assigned series and calls createVarRefSeriesIterator for each series key. Inside createVarRefSeriesIterator, if ref has a value, buildCursor is called directly; if ref is nil, the opt.Aux parameter contains the fields to query, so the function iterates over them and calls buildCursor for each field.

In buildCursor, the fields of the measurement are looked up first, and then the Field structure (containing the field's ID, type, and so on) is found among them by field name. The code then branches on the field's type; for a float field, for example, buildFloatCursor is called. Under the hood, the typed buildCursor variants call the TSM file-access functions: when a new cursor object is created (newKeyCursor), fs.locations returns all matching TSM files and the blocks within them (KeyCursor.seeks); when data is read, TSM access functions such as peekTSM() and nextTSM() are called according to the data type.

Querying the TSI inverted index

Let us first look at a diagram of InfluxDB's index organization under TSI (shown below). The db (database), rp (retention policy), shard, and Index levels are all represented as directories on disk. TSI uses a partitioning strategy, so under the Index directory there are eight partition directories, numbered 0 through 7; each partition directory holds the TSI files and their WAL (TSL):


[Figure: TSI index directory organization]


TSI uses an LSM-based mechanism: write operations are appended to the WAL (a TSL file, corresponding to level 0) in LogEntry format, and so are modifications and deletions; data is truly deleted only during background compaction (level 0 to level 1). Compacting a TSL file means converting the WAL entries into the TSI format and writing a new TSI file. TSI files (levels 1 through 6) are regularly compacted from lower levels to higher levels. The essence of compaction is to merge the tag blocks of the same measurement: all tag values of the same tag key of the same measurement are put together, and the different series ids of the same tag value of the same tag key of the same measurement are merged together.

Let us look at the internal structure of a TSI file to understand how, when a query executes, InfluxDB uses it to find the measurement, tag keys, and tag values involved and the associated data in the TSM files.

index_file

First, in tsdb/index/tsi1/index_file.go there is a very interesting constant, IndexFileTrailerSize.

IndexFileTrailerSize = IndexFileVersionSize +
    8 + 8 + // measurement block offset + size
    8 + 8 + // series id set offset + size
    8 + 8 + // tombstone series id set offset + size
    8 + 8 + // series sketch offset + size
    8 + 8 + // tombstone series sketch offset + size
    0

From its definition it is easy to see that:
1. IndexFileTrailerSize occupies a fixed 82 bytes at the end of a TSI file, which makes the trailer easy to parse when reading the TSI;
2. the definition basically tells us which sections a TSI file contains:


[Figure: TSI file sections]


After carefully analyzing the 1.7 code, we found an interesting point: during queries, the series id set region does not actually serve to look up series keys by series id. In fact, in InfluxDB 1.7 there is a dedicated _series directory, and the series-id-to-series-key mappings are stored only there.

measurement block

Next, look at the measurement block, defined in tsdb/index/tsi1/measurement_block.go. It is easy to see that the measurement block likewise stores its meta information in a similar trailer, alongside several other sections.

MeasurementTrailerSize = 0 +
    2 + // version
    8 + 8 + // data offset/size
    8 + 8 + // hash index offset/size
    8 + 8 + // measurement sketch offset/size
    8 + 8 // tombstone measurement sketch offset/size


[Figure: measurement block layout]


(1) The trailer is the index of the whole MeasurementBlock; it stores the offset and size of every other section.

(2) The data offset/size section is the set of all MeasurementBlockElements. A MeasurementBlockElement contains the measurement's name, the offset and size of its tag set in the file, and the series id information of the current measurement.

(3) The hash index section stores the file offsets of the MeasurementBlockElements, so a MeasurementBlockElement's offset can be located quickly without reading the whole TSI file.

(4) The measurement sketch and tombstone measurement sketch use the HyperLogLog++ algorithm for cardinality estimation.

Tag block

Now look at the tag block, defined in tsdb/index/tsi1/tag_block.go; it too has a similar trailer definition:

const TagBlockTrailerSize = 0 +
  8 + 8 + // value data offset/size
  8 + 8 + // key data offset/size
  8 + 8 + // hash index offset/size
  8 + // size
  2 // version

(1) The trailer is the tag block's meta information; it holds the offset and size of each of the other components in the file.

(2) The key data section is the tag key blocks. It contains a hash index of its own, which allows a specified tag key block to be located quickly by tag key; the data offset and data size fields point to the file region of all the tag value blocks corresponding to the current tag key.

(3) The value data section is designed similarly to key data. Inside the tag value block lives the series id set we care most about.

(4) The hash index section allows a tag key block's offset to be located quickly by tag key.
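The hash indexes that appear throughout these blocks all serve the same purpose: mapping a key to a file offset without scanning the region. The toy model below illustrates the idea with FNV hashing and linear probing; the real TSI implementation uses a different hash function and on-disk encoding, and the type here is purely illustrative.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashIndex is a toy model of the hash index regions in TSI blocks:
// a fixed array of (key -> file offset) slots addressed by hash,
// with linear probing on collision.
type hashIndex struct {
	keys    []string
	offsets []int64
}

func newHashIndex(capacity int) *hashIndex {
	return &hashIndex{keys: make([]string, capacity), offsets: make([]int64, capacity)}
}

func (h *hashIndex) slot(key string) int {
	f := fnv.New32a()
	f.Write([]byte(key))
	return int(f.Sum32()) % len(h.keys)
}

func (h *hashIndex) insert(key string, offset int64) {
	for i, n := h.slot(key), 0; n < len(h.keys); i, n = (i+1)%len(h.keys), n+1 {
		if h.keys[i] == "" || h.keys[i] == key {
			h.keys[i], h.offsets[i] = key, offset
			return
		}
	}
}

// lookup finds a key's block offset in O(1) expected time,
// instead of scanning every element in the region.
func (h *hashIndex) lookup(key string) (int64, bool) {
	for i, n := h.slot(key), 0; n < len(h.keys); i, n = (i+1)%len(h.keys), n+1 {
		if h.keys[i] == key {
			return h.offsets[i], true
		}
		if h.keys[i] == "" { // empty slot: key was never inserted
			return 0, false
		}
	}
	return 0, false
}

func main() {
	idx := newHashIndex(8)
	idx.insert("host", 4096) // pretend the "host" tag key block starts at offset 4096
	off, ok := idx.lookup("host")
	fmt.Println(off, ok)
}
```

Because the slot array is written into the file alongside the data, a reader can probe it directly with file seeks, which is exactly what makes point lookups cheap in a large TSI file.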

The _series directory

The series id set region originally stored all the series keys of the entire database; this may be a historical leftover. In InfluxDB 1.7.7 there is a dedicated _series directory, and the series-id-to-series-key mappings are stored only there.


[Figure: _series directory structure]


Looking at the directory structure of the _series directory, it is similar to tsi and is likewise divided into eight partitions. The latest versions of InfluxDB retrieve the series-id-to-series-key mappings from the series files in the _series directory.

Grouping and concurrency

After the query obtains all the series keys from the TSI inverted index files, the series keys are grouped according to the group by conditions; the grouping algorithm is a hash. Grouping allows the different series keys to be queried in parallel and finally aggregated independently, which significantly improves query performance.
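The idea can be sketched as follows: hash each series key's group-by value into a bucket, then let each bucket be scanned by its own goroutine and merge the partial results. The function names and the counting "aggregation" are illustrative stand-ins, not InfluxDB's actual grouping code.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// groupSeries buckets series keys by the hash of their GROUP BY value,
// so each group can be scanned and aggregated independently.
func groupSeries(seriesKeys []string, groupOf func(string) string, buckets int) [][]string {
	groups := make([][]string, buckets)
	for _, key := range seriesKeys {
		f := fnv.New32a()
		f.Write([]byte(groupOf(key)))
		i := int(f.Sum32()) % buckets
		groups[i] = append(groups[i], key)
	}
	return groups
}

func main() {
	keys := []string{"cpu,host=a", "cpu,host=b", "cpu,host=a", "cpu,host=c"}
	groups := groupSeries(keys, func(k string) string { return k }, 4)

	// Aggregate each group in parallel (here: count occurrences), then merge.
	var mu sync.Mutex
	var wg sync.WaitGroup
	counts := map[string]int{}
	for _, g := range groups {
		wg.Add(1)
		go func(g []string) {
			defer wg.Done()
			local := map[string]int{}
			for _, k := range g {
				local[k]++
			}
			mu.Lock()
			for k, n := range local {
				counts[k] += n
			}
			mu.Unlock()
		}(g)
	}
	wg.Wait()
	fmt.Println(counts["cpu,host=a"]) // two series keys hashed into the same group
}
```

The key property is that the hash sends every occurrence of the same group value to the same bucket, so each goroutine's partial aggregate is already complete for its groups and the final merge is trivial.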

Summary

To sum up: the TSI file format is a multi-level index. Every level has a trailer, making it fast and easy to find the offsets of the different regions; every partition also has its own trailer, and the measurement block, the tag block, and the key data within a tag block all carry hash indexes to speed up file-offset lookups.

For a query, we find the MeasurementBlockElement corresponding to the measurement in the measurement block, filter the tag keys according to the query condition, and then find all the associated tag value blocks in the corresponding tag block. From the series id set in the tag value block we take the series ids, and use them to find, via the _series directory, all the series keys involved in the query.

Retrieving data from TSM

Let us first look at the TSM design. A TSM file consists of four areas: Header, Blocks, Index, and Footer.


[Figure: TSM file layout — Header, Blocks, Index, Footer]


The Header consists of a 4-byte magic number (identifying the file type) followed by a 1-byte version number.

The Blocks area is a sequence of independent data blocks; each block contains a checksum generated by the CRC32 algorithm to guarantee the block's integrity. Inside a block, the timestamps and the values are stored separately and compressed with different compression algorithms. The breakdown is shown below:


[Figure: internal layout of a data block]
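The per-block CRC32 check can be sketched as follows. This is a simplified model assuming the checksum directly precedes the payload; the real TSM block also carries a type byte and separately compressed timestamp and value streams, which are omitted here.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// encodeBlock prefixes the payload with a 4-byte big-endian CRC32,
// modeling the integrity checksum each TSM block carries.
func encodeBlock(data []byte) []byte {
	out := make([]byte, 4+len(data))
	binary.BigEndian.PutUint32(out, crc32.ChecksumIEEE(data))
	copy(out[4:], data)
	return out
}

// verifyBlock recomputes the checksum and rejects a corrupted block
// before any decompression is attempted.
func verifyBlock(block []byte) ([]byte, error) {
	if len(block) < 4 {
		return nil, fmt.Errorf("block too short")
	}
	want := binary.BigEndian.Uint32(block)
	data := block[4:]
	if crc32.ChecksumIEEE(data) != want {
		return nil, fmt.Errorf("checksum mismatch")
	}
	return data, nil
}

func main() {
	b := encodeBlock([]byte("points"))
	data, err := verifyBlock(b)
	fmt.Println(string(data), err)
}
```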


The Index area stores the index of the Blocks area. It consists of a series of index entries, sorted lexicographically by key and then by timestamp. An index entry consists of: the key length, the key, the data type, the number of blocks, the minimum and maximum time of the block, the block's offset in the file, and the block's length.


[Figure: index entry layout]


The Footer stores the offset of the Index area.

The index layer of a TSM file is fully loaded into memory after InfluxDB starts; the data portion is not loaded because it would consume too much memory. Retrieving data roughly follows these steps:

1. For each series key obtained from the TSI, find the corresponding index entries in the Index; since the keys are ordered, a binary search can be used.
2. After the index entries are found, filter out the entries still needed by time, using [MinTime, MaxTime].
3. Locate the list of candidate data blocks through the index entries.
4. Load the qualifying data blocks into memory, decompress them with the decompression algorithm matching the data type, and use binary search again to find the target data.
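Steps 1 and 2 can be sketched as a binary search over sorted index entries followed by a time-range filter. The indexEntry struct below is a cut-down illustrative model, not the real TSM index entry type.

```go
package main

import (
	"fmt"
	"sort"
)

// indexEntry is a cut-down model of a TSM index entry: the series key
// plus the time range and file location of one block. Entries are kept
// sorted by key, which is what makes the binary search possible.
type indexEntry struct {
	Key              string
	MinTime, MaxTime int64
	Offset, Size     int64
}

// findBlocks locates all blocks for key whose time range overlaps
// [minT, maxT], following steps 1 and 2 above.
func findBlocks(index []indexEntry, key string, minT, maxT int64) []indexEntry {
	// Step 1: binary search for the first entry with this key.
	i := sort.Search(len(index), func(i int) bool { return index[i].Key >= key })
	var out []indexEntry
	for ; i < len(index) && index[i].Key == key; i++ {
		e := index[i]
		// Step 2: keep only entries whose time range overlaps the query's.
		if e.MaxTime >= minT && e.MinTime <= maxT {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	index := []indexEntry{
		{Key: "cpu,host=a#value", MinTime: 0, MaxTime: 99, Offset: 5, Size: 100},
		{Key: "cpu,host=a#value", MinTime: 100, MaxTime: 199, Offset: 105, Size: 100},
		{Key: "mem,host=a#value", MinTime: 0, MaxTime: 199, Offset: 205, Size: 100},
	}
	blocks := findBlocks(index, "cpu,host=a#value", 50, 120)
	fmt.Println(len(blocks)) // both cpu blocks overlap [50, 120]
}
```

The surviving entries carry the block offsets and sizes, which is exactly what steps 3 and 4 need to load and decompress the candidate blocks.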

Conclusion

InfluxDB's query process is fairly complex, but the source code is elegantly implemented with a clear module structure, making it well suited for in-depth study by anyone working in the time-series database field. Because space is limited, parts of this article are incomplete, and many details will require readers to consult the open-source community's materials as they read. Given the limits of my knowledge, there may be mistakes in the description; corrections and discussion are welcome.

Alibaba Cloud InfluxDB® is now officially available for commercial use. Please visit the purchase page (https://common-buy.aliyun.com/?commodityCode=hitsdb_influxdb_pre#/buy) and the documentation (https://help.aliyun.com/document_detail/113093.html?spm=a2c4e.11153940.0.0.57b04a02biWzGa).

Origin: yq.aliyun.com/articles/719105