Protecting TDengine query performance: how does 3.0 greatly reduce out-of-order data interference?

In a time-series database (TSDB), out-of-order data is defined as data whose timestamps do not arrive at the database in increasing order. Although the definition is simple, a time-series database needs corresponding processing logic to guarantee that data is stored in timestamp order, which inevitably increases the complexity of the database architecture and can affect performance.

It is well known that handling completely out-of-order data is a hard problem in the industry, so the problem TDengine focuses on starts from real business scenarios: how to efficiently handle occasional out-of-order data (caused by device damage, power outages, network anomalies, backfilling of historical data, and so on).

The data flow of TDengine is: disk (WAL) → memory (vnode buffer) → disk (data files).
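This pipeline can be sketched in a few lines of Python. Note this is a toy illustration of the flow described above, not TDengine's actual API: the class name, the flush threshold, and the method names are all invented for the example.

```python
class Vnode:
    """Toy sketch of the TDengine write path: every write is first appended
    to the WAL (durability, arrival order), then inserted into the in-memory
    buffer, and flushed to a data file when the buffer fills. Names and the
    flush threshold are illustrative assumptions, not TDengine internals."""

    def __init__(self, flush_threshold=3):
        self.wal = []          # durable log, kept in arrival order
        self.buffer = []       # in-memory rows, kept timestamp-ordered
        self.data_files = []   # flushed, ordered batches
        self.flush_threshold = flush_threshold

    def write(self, ts, value):
        self.wal.append((ts, value))     # 1. log in arrival order
        self.buffer.append((ts, value))  # 2. buffer in memory...
        self.buffer.sort()               #    ...keeping timestamps ordered
        if len(self.buffer) >= self.flush_threshold:
            self.data_files.append(self.buffer)  # 3. flush to a data file
            self.buffer = []

v = Vnode()
for ts in (1970, 2023, 1998):            # 1998 arrives out of order
    v.write(ts, None)
print([t for t, _ in v.wal])             # arrival order: [1970, 2023, 1998]
print([t for t, _ in v.data_files[0]])   # stored order:  [1970, 1998, 2023]
```

The key point the sketch captures is that the WAL preserves arrival order while everything downstream must be timestamp-ordered, which is exactly where out-of-order handling comes in.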

In the WAL, we record data in the order it arrives at the database, but once data is stored in the database proper, timestamps must be in order. Out-of-order data is therefore handled in two stages after it is written to the WAL: in memory and on disk.

Accordingly, we divide out-of-order data into two categories:

1. In-memory out-of-order data: within the same table, data whose timestamp falls inside the time range of the data currently held in memory.


2. On-disk out-of-order data: within the same table, data whose timestamp falls inside the time range of data already on disk.

For type-1 out-of-order data, TDengine maintains a skip list for each table in memory and sorts incoming data as it is inserted, which solves the problem.

Scenario: after creating a table, we write a small batch of data with timestamps ranging from 1970 to 2023 (the amount of data is small enough that no flush to disk is triggered). When we then write an out-of-order record with a 1998 timestamp, the skip list places it in order, so the data is held in memory as "1970 → 1998 → 2023". The cost of this sorting is borne by the write operation, but since very little data is kept in memory, the impact is minimal.
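A minimal skip list keyed by timestamp shows how the in-memory ordering is maintained. This is a simplified sketch of the data structure named above (TDengine's real implementation is in C and stores full rows, not bare timestamps):

```python
import random

class SkipNode:
    def __init__(self, ts, level):
        self.ts = ts
        self.forward = [None] * level  # one forward pointer per level

class SkipList:
    """Minimal skip list keyed by timestamp: the kind of structure TDengine
    uses to keep each table's in-memory data ordered. Simplified sketch."""

    MAX_LEVEL = 8

    def __init__(self):
        self.head = SkipNode(None, self.MAX_LEVEL)
        self.level = 1

    def insert(self, ts):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        # Descend from the top level, tracking the rightmost node < ts.
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].ts < ts:
                node = node.forward[i]
            update[i] = node
        # Pick a random height for the new node (repeated coin flips).
        lvl = 1
        while random.random() < 0.5 and lvl < self.MAX_LEVEL:
            lvl += 1
        self.level = max(self.level, lvl)
        new = SkipNode(ts, lvl)
        for i in range(lvl):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def timestamps(self):
        out, node = [], self.head.forward[0]
        while node:
            out.append(node.ts)
            node = node.forward[0]
        return out

sl = SkipList()
for ts in (1970, 2023, 1998):  # 1998 arrives out of order
    sl.insert(ts)
print(sl.timestamps())         # → [1970, 1998, 2023]
```

Insertion costs O(log n) expected time, which is why the sorting overhead lands on the write path yet stays small while the buffer holds little data.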

When it comes to the flush-to-disk stage (for details on flushing, see the article "Directly related to TDengine performance: 3.0 flush mechanism optimization and usage principles"), these three ordered records may become out-of-order data of the second type:

First, some background on the design of the data files on disk. TDengine partitions data using the database creation parameter duration (days). Suppose a database's duration is set to 10 days; then, starting from 00:00 on January 1, 1970, every 10 days forms one data file group. A written record is stored in the data file group whose time range covers its timestamp, which means each data file group contains data within a fixed time range. Within a file group, data exists in the form of data blocks, and the time range covered by each block depends on what was actually flushed at the time.
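The file-group assignment described above amounts to integer division of days-since-epoch by the duration. The helper below is an illustrative reconstruction of that rule, not TDengine's internal formula:

```python
from datetime import datetime, timezone

def file_group_id(ts, duration_days=10):
    """Return the index of the data file group a timestamp falls into,
    counting duration-sized windows from the Unix epoch (1970-01-01 00:00
    UTC). Illustrative sketch of the partitioning rule described in the
    text; TDengine's exact internal computation may differ."""
    days_since_epoch = int(ts.timestamp()) // 86400
    return days_since_epoch // duration_days

# With duration = 10 days: 1970-01-01 falls in group 0,
# and 1970-01-11 opens group 1.
print(file_group_id(datetime(1970, 1, 1, tzinfo=timezone.utc)))   # → 0
print(file_group_id(datetime(1970, 1, 11, tzinfo=timezone.utc)))  # → 1
```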

When data is flushed to disk, each record goes to its own data file group, and whether it is out of order is determined by comparing its timestamp against the time ranges of the data blocks already in that file group. Several situations can arise. Taking three records belonging to 2023, 1998, and 1970, here are examples of three cases:

  1. When the record "2023-01-01 xxxxxxx" is flushed, its timestamp does not overlap the time range of any existing data block of this table: it is normal, in-order data.
     
  2. "1998-01-01 xxxxxxxx": although this record falls within the table's overall time range (1970-2023) when flushed, it does not intersect the local time range of any data block in the table. This kind of disorder usually occurs when backfilling historical data, for example when a monitored device is brought back into service after a long power outage. So although this record looks out of order, handling it is not much different from handling normal data.
     
  3. "1970-01-01 xxxxxxxx": this record intersects the time range of an existing data block when flushed. Back in 2.0, if flushed data overlapped the timestamps of an existing data block, the out-of-order data was written as a sub-block appended to the data file; at query time, the sub-blocks had to be read into memory and sorted, and when there were many sub-blocks, query performance suffered. After a redesign in 3.0, the out-of-order data is merged with the original data and rewritten into new data blocks, which are appended to the data file with the index rewritten, while the old data blocks are treated as fragments. In this way, the cost of handling the data is shifted onto the flush operation, so subsequent queries are essentially unaffected.
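The three cases above can be summarized as a small classification function. This is an illustrative sketch of the decision logic, with invented names and simplified integer "year" timestamps; it is not TDengine's actual code:

```python
def classify_record(ts, table_range, block_ranges):
    """Classify a flushed timestamp against existing on-disk blocks,
    mirroring the three cases above. `table_range` is (min_ts, max_ts)
    across all of the table's blocks; `block_ranges` lists each block's
    (min_ts, max_ts). Illustrative sketch only."""
    for lo, hi in block_ranges:
        if lo <= ts <= hi:
            # Case 3: overlaps an existing block. In 3.0 the block is
            # merged with the new data and rewritten (old block becomes
            # a fragment); in 2.0 a sub-block was appended instead and
            # sorted at query time.
            return "merge-and-rewrite"
    if table_range[0] <= ts <= table_range[1]:
        # Case 2: inside the table's overall range but between blocks,
        # e.g. backfilled history; handled much like normal data.
        return "gap-insert"
    # Case 1: beyond all existing data; a plain in-order append.
    return "append"

blocks = [(1960, 1975), (2020, 2023)]
print(classify_record(2024, (1960, 2023), blocks))  # → append
print(classify_record(1998, (1960, 2023), blocks))  # → gap-insert
print(classify_record(1970, (1960, 2023), blocks))  # → merge-and-rewrite
```

The design consequence is visible in the third branch: only genuinely overlapping data pays the merge-and-rewrite cost, and it pays it once at flush time instead of on every query.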

In summary, when out-of-order data is written to disk, the data blocks and index files are newly generated, which is friendly to subsequent queries. Since out-of-order data is only an occasional scenario in real workloads, this processing adds essentially no performance burden. Even the small amount of fragmented data and invalid data blocks left behind by disorder can be cleaned up or reorganized with the enterprise-edition compact function.

From many perspectives, TDengine 3.0 has achieved great optimization here. As the 2.0 era gradually comes to an end, we hope everyone switches from 2.0 to 3.0 as soon as possible; see: How to migrate data from TDengine 2.x to 3.x?


Origin blog.csdn.net/taos_data/article/details/131910108