Ele.me's Distributed Time Series Database: LinDB

Background

  1. Ele.me's need for a time series database comes mainly from its various monitoring systems, which use it to store monitoring metrics. Originally Graphite was used, but later we needed multi-dimensional metrics (mainly the ability to attach multiple tags to a metric to form a series, and then filter and group by those tags), and at that point Graphite could hardly meet the requirements.

  2. The TSDBs currently in common use in the industry fall into the following categories:

  • InfluxDB: many companies use InfluxDB, including some monitoring systems inside Ele.me. Its advantages are support for multiple dimensions and multiple fields, and a storage engine optimized for the characteristics of time series data. However, the open source version does not support clustering. Many companies build clusters themselves, mostly by sharding on the metric name, which leads to single-point hotspots. Ele.me currently takes a similar approach; even with the best servers dedicated to the largest metrics, query performance is still not ideal, and sharding by series would be fairly costly;

  • Graphite: offers many calculation functions for writing and querying by metric, but it is hard to support multiple dimensions, including queries across data centers or clusters. Ele.me originally stored business-level monitoring metrics in Graphite and it worked well, but later it could no longer meet some of our needs. Due to its storage structure it is also very IO-heavy: based on current online data, write amplification is dozens of times or more;

  • OpenTSDB: based on HBase, so the storage layer does not need to be built yourself, and it can do query aggregation. It inherits HBase's hotspot problems. We previously built a TSDB on HBase to solve some of OpenTSDB's problems, such as hotspots, and pushed some query aggregation down into HBase to improve query performance, but it still depends heavily on HBase/HDFS;

  • HiTSDB: Alibaba's TSDB, also backed by HBase, with many optimizations in its data structures and indexes. We did not study it in detail; interested readers can try it on Alibaba Cloud;

  • Druid: Druid is really an OLAP system, although it can also store time series data; we dropped it after looking at its architecture diagram;

  • ES: some companies use Elasticsearch directly for storage. We did not test it ourselves, but we never felt that ES is a real TSDB;

  • Atlas: from Netflix, an in-memory TSDB that keeps the last few hours of data entirely in memory and needs external storage for historical data; we did not study it in detail;

  • Beringei: from Facebook, an in-memory TSDB that, like Atlas, keeps the latest data in memory; it still appeared to be in incubation;

3. In the end we decided to implement a distributed time series database ourselves, to solve the following problems:

    • Lightweight: currently the only dependency is Zookeeper;

    • Sharding by series, which removes hotspots and enables true horizontal scaling;

    • Real-time writes and real-time queries with good query performance, since it is mostly used by monitoring systems;

    • Since Ele.me runs active-active across data centers, and the monitoring system itself is active-active, the database must support writing within a single data center and aggregated queries across data centers;

    • Automatic rollup: for example, users can write at 10s precision and the system automatically rolls the data up to minute, hour and day granularity to support queries over large time ranges, such as reports;

    • Supports a SQL-like query language;

    • Supports multiple replicas to improve the reliability of the whole system: as long as one replica survives, the service works normally, and the number of replicas is user-specified;

Overall Design

LinDB adopts an architecture that separates compute and storage: a compute layer, LinProxy, and a storage layer, LinStorage.

Notes:

  1. LinProxy mainly handles SQL parsing and the intermediate merge/re-aggregation computation. For queries that do not cross clusters, a separate LinProxy is not required: each node of a single cluster embeds a LinProxy to serve queries;

  2. LinDB Client is mainly used for writing data, and also provides some query APIs;

  3. The LinStorage nodes form a cluster. Replication runs between nodes, and the Leader node of each replica set serves reads and writes. This design mainly follows Kafka; LinDB can be understood as Kafka-like replicated writes plus an underlying time series storage layer;

  4. LinMaster is mainly responsible for allocating databases, shards and replicas, i.e. scheduling the LinStorage nodes and managing metadata (currently stored in Zookeeper). Since all LinStorage nodes are peers, one node in the cluster is elected Master through Zookeeper. Every node reports its state to the Master via heartbeats, and the Master schedules based on those states. If the Master dies, another Master is elected automatically; this process is essentially lossless to the service, so users barely notice it.
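
To make the Master election concrete, here is a minimal sketch of how one node among peer storage nodes could be elected Master through Zookeeper. It uses Apache Curator's LeaderLatch recipe; Curator, the connect string and the election path are assumptions for illustration, since the article only states that the Master is chosen via Zookeeper.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class MasterElection {
    public static void main(String[] args) throws Exception {
        String nodeId = args.length > 0 ? args[0] : "node-1";

        // Connect to Zookeeper (connect string is a placeholder).
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // All peer nodes create a latch on the same path; Zookeeper picks exactly one leader.
        LeaderLatch latch = new LeaderLatch(client, "/lindb/master", nodeId);
        latch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                // This node becomes the Master: start scheduling shards/replicas here.
                System.out.println(nodeId + " became Master");
            }

            @Override
            public void notLeader() {
                // Lost leadership (e.g. session expired): stop scheduling, act as a normal node.
                System.out.println(nodeId + " is a normal storage node");
            }
        });
        latch.start();

        Thread.currentThread().join(); // keep the process alive
    }
}
```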

Write

The whole writing process is divided into the following two parts:

  1. WAL replication: this part of the design follows Kafka. A user write is considered successful as soon as it is written to the WAL (since LinDB mainly serves monitoring systems, we do not give strong guarantees on data consistency), which improves the system's write throughput;

  2. Local write: this step parses the WAL data and writes it into the node's own storage structure; only data written to local storage can be queried;

Unlike some systems that complete the whole process within each write request, we split it into these two steps and made them asynchronous.

WAL Replication

LinDB's replica replication currently uses a multi-channel replication protocol based on replicating the WAL between nodes. Writing the WAL on each node is an independent operation: as soon as the Leader's WAL is written successfully, the Client's write is considered successful, and the node holding the Leader is responsible for copying that WAL to the corresponding Followers. Likewise, a copy is considered successful once the Follower's WAL is written, as shown below:

Multi-Channel Replication Protocol

Treating a write as successful once it reaches the Leader's WAL improves the write rate, but it also brings the following problems:

  • Data consistency issues

  • Data loss issues

The figure above uses Server1 as the Leader, with 3 replicas replicating over 1-WAL, as an example:

  1. Server1 is currently the Leader of the shard and accepts writes from the Client, while Server2 and Server3 are Followers accepting replication requests from Server1. At this point the 1-WAL channel is the current write channel, and Server2 and Server3 may lag behind Server1.

Notes:

  • The whole process maintains the following indexes:

  1. The Append Index, advanced when the Client writes, indicating how far the Client has written;

  2. A Replica Index for each Follower, indicating how far the Leader has replicated to that Follower;

  3. Each Follower's Ack Index, indicating how far that Follower has successfully persisted to its local WAL;

  4. A Follower's replication request is effectively a write from a special Client, so the Follower also has its own Append Index;

  • Only indexes that have been acked are considered complete. On the Leader, WAL data below the smallest Ack Index can be deleted;

  • During this process, if Server2 or Server3 has a problem, its Consume Index stops moving, and processing continues only after the corresponding service recovers;

  • During the whole process, the following situations may occur:


  1. Leader Replica Index > Follower Append Index: the Leader Replica Index must be reset according to the Follower Append Index. There are two possible sub-cases, described in the replication sequence;

  2. Leader Replica Index < Follower Append Index: again there are two sub-cases, described in the replication sequence;

If Server1 dies at this point, a new Leader is elected from Server2 and Server3; suppose Server2 becomes the Leader.

  • Server2 opens the 2-WAL replication channels to Server1 and Server3. Since Server1 is currently down, it replicates only to Server3 for the time being; the data write channel is now 2-WAL.

  • After Server1 starts and recovers, Server2 opens the 2-WAL replication channel to Server1, and Server1 copies to Server2 and Server3 whatever data in 1-WAL has not yet been replicated to them.

In abnormal situations, missing ACKs can prevent WAL data from being deleted normally, causing the WAL to take up too much disk. Therefore the WAL also needs SIZE- and TTL-based cleanup. Once the WAL is cleaned up by SIZE or TTL, the indexes can become inconsistent, in the ways described above.

Problems brought by the multi-channel replication protocol:

  • Each channel has its own index sequence, so the last index of every channel must be saved, whereas single-channel replication only needs to save one last index. This cost is acceptable; a simplified sketch of the per-channel index bookkeeping follows.
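
A simplified, single-process sketch of the per-channel index bookkeeping described above (Append Index, per-Follower Replica Index and Ack Index). Class and method names are illustrative, not LinDB's actual code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified bookkeeping for one WAL replication channel on the Leader side. */
public class ChannelIndexes {
    private long appendIndex;                                                 // last index written by Clients
    private final Map<String, Long> replicaIndex = new ConcurrentHashMap<>(); // how far each Follower has been sent
    private final Map<String, Long> ackIndex = new ConcurrentHashMap<>();     // how far each Follower has acked

    public synchronized long append(byte[] record) {
        // Writing the record to the local WAL is omitted; only the index moves here.
        return ++appendIndex;
    }

    public void onAck(String follower, long index) {
        ackIndex.merge(follower, index, Math::max);
    }

    /** WAL entries below this index have been acked by every Follower and may be deleted. */
    public long smallestAckIndex() {
        return ackIndex.values().stream().mapToLong(Long::longValue).min().orElse(0L);
    }

    /**
     * Reconcile with a Follower after WAL cleanup or restart: if the Leader's replica index
     * ran ahead of (or fell behind) the Follower's append index, reset it to the Follower's
     * append index and resume replication from there.
     */
    public void resetReplicaIndex(String follower, long followerAppendIndex) {
        replicaIndex.put(follower, followerAppendIndex);
    }
}
```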

Local Write

Background

  • Writes are isolated at the shard level: each shard has an independent thread responsible for its writes (as sketched after these notes), so a traffic increase in one database or one shard does not affect writes to other databases. If a single machine carries so many shards that it ends up with too many threads, the right fix is to add machines, or to allocate the number of shards sensibly when creating a new database.

  • Since each shard is written by a single thread, in most cases there is no need to worry about lock contention caused by multi-threaded writes.
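
A minimal sketch of the shard-level write isolation described above: one single-threaded writer per shard, so writes within a shard are serialized and shards do not contend with each other. The class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Each shard gets its own single-threaded executor, so writes to one shard never block another. */
public class ShardWriters {
    private final Map<Integer, ExecutorService> writers = new ConcurrentHashMap<>();

    public Future<?> write(int shardId, Runnable writeTask) {
        ExecutorService writer = writers.computeIfAbsent(
                shardId,
                id -> Executors.newSingleThreadExecutor(
                        r -> new Thread(r, "shard-writer-" + id)));
        // All writes of one shard are serialized on its own thread: no lock contention inside the shard.
        return writer.submit(writeTask);
    }

    public void shutdown() {
        writers.values().forEach(ExecutorService::shutdown);
    }
}
```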

Data Storage Structure

As an example, take the data structure of a single database on a single node:

  • A database has multiple shards on a single node, and all shards share one set of index data;

  • All data is partitioned into time slices according to the database's interval, and each slice stores its own data files and index files.

  1. This design mainly makes TTL handling easy: when data expires, the corresponding directory is simply deleted;

  2. Each shard contains segments, and each segment stores the data of its time slice according to the interval;

  3. Why is each segment further divided into many data families by interval? The main problem LinDB solves is storing massive amounts of monitoring data, which is almost always written at the latest time; historical data is basically never written. LinDB's storage is similar to an LSM tree, so to reduce the write amplification caused by merging data files, the segment's time slice is sharded further; the final layout was chosen after measurement.

The following is an example with an interval of 10s:

  1. Segments are stored by day;

  2. Each segment is divided into data families by hour, one family per hour, and the files in each family store the actual data in columns; the sketch below shows how a timestamp maps to a segment, family and slot.
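
A small sketch of how a write timestamp could map to a segment (day), data family (hour) and slot within the family for a 10s interval. The UTC assumption and the naming are illustrative only.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

/** Maps a write timestamp to its segment (day), data family (hour) and slot (10s interval). */
public class TimeSlicing {
    private static final long INTERVAL_MS = 10_000;                 // database interval: 10s
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    public static void main(String[] args) {
        long timestamp = 1514214168614L;                            // example timestamp from the article
        ZonedDateTime time = Instant.ofEpochMilli(timestamp).atZone(ZoneOffset.UTC);

        String segment = DAY.format(time);                          // one segment directory per day
        int family = time.getHour();                                // one data family per hour
        long millisIntoHour = timestamp % (60 * 60 * 1000);
        int slot = (int) (millisIntoHour / INTERVAL_MS);            // 0..359 columns inside the family

        System.out.printf("segment=%s family=%d slot=%d%n", segment, family, slot);
    }
}
```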

Write Process

Notes:

  • The system starts one write thread per shard, and that thread is responsible for all write operations on the shard.

  • First, the measurement, tags and fields of the data are written into the database's index files, generating the corresponding measurement ID, time series ID and field ID; this is essentially a string->int conversion (see the sketch after this list). The benefit is that all data is then stored as numeric types, which greatly reduces the overall storage size, because otherwise the measurement/tags/field metadata would be repeated for every data point. For example, for cpu{host=1.1.1.1} load=1 1514214168614, after conversion cpu => 1 (measurement ID), host=1.1.1.1 => 1 (time series ID), load => 1 (field ID), so what is finally stored is 1 1 1514214168614 => 1. This follows the design of OpenTSDB.

  • If the index write fails, the whole write is considered failed. There are two kinds of failure; one is a problem with the data write format, which is reported as a failure directly.

  • With the IDs obtained from the index, plus the write timestamp and the database interval, the system calculates which family under which segment the data should be written to. The family write goes directly to memory to achieve high throughput; when memory usage reaches its limit, a Flush operation is triggered.

  • The whole write path writes to memory first, and a Flusher thread then dumps the in-memory data to the corresponding files, so each batch of data is written sequentially; the latest data is also rolled up by field type, which further reduces disk IO.
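
A minimal sketch of the string->int conversion described in the write process, reproducing the cpu{host=1.1.1.1} load example. In LinDB the mappings live in the database's index files; here plain in-memory maps stand in for them, and all class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Minimal string->int dictionary: measurement, tags and field names become small integer IDs. */
public class IdIndex {
    private final Map<String, Integer> measurementIds = new ConcurrentHashMap<>();
    private final Map<String, Integer> seriesIds = new ConcurrentHashMap<>();   // measurement + tags -> TSID
    private final Map<String, Integer> fieldIds = new ConcurrentHashMap<>();
    private final AtomicInteger measurementSeq = new AtomicInteger();
    private final AtomicInteger seriesSeq = new AtomicInteger();
    private final AtomicInteger fieldSeq = new AtomicInteger();

    public int measurementId(String measurement) {
        return measurementIds.computeIfAbsent(measurement, m -> measurementSeq.incrementAndGet());
    }

    public int timeSeriesId(String measurement, String tags) {
        return seriesIds.computeIfAbsent(measurement + "|" + tags, s -> seriesSeq.incrementAndGet());
    }

    public int fieldId(String measurement, String field) {
        return fieldIds.computeIfAbsent(measurement + "|" + field, f -> fieldSeq.incrementAndGet());
    }

    public static void main(String[] args) {
        IdIndex index = new IdIndex();
        // cpu{host=1.1.1.1} load=1 1514214168614  becomes  1 1 1514214168614 => 1
        int mid = index.measurementId("cpu");                    // => 1
        int tsid = index.timeSeriesId("cpu", "host=1.1.1.1");    // => 1
        int fid = index.fieldId("cpu", "load");                  // => 1
        System.out.printf("%d %d %d 1514214168614 => 1%n", mid, tsid, fid);
    }
}
```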

Query Engine

LinDB queries need to solve the following problems:

  1. Queries across multiple data centers;

  2. Efficient streaming query computation;

Notes:

  • To support queries across data centers or clusters, LinProxy is introduced; LinProxy is mainly responsible for handling user-facing query requests;

  1. SQL Plan parses the SQL, generating the final execution plan and the intermediate results that need to be computed;

  2. Through the Metadata in Zookeeper, the request is routed to the corresponding service in the specific LinDB cluster;

  3. Each LinConnect is responsible for communicating with one LinDB cluster, and each LinConnect keeps a copy of that cluster's metadata. Whenever the metadata changes, the server pushes it to LinConnect, so LinConnect's metadata is updated in near real time;

  4. Aggregator Stream performs the final merge computation over the intermediate results from each LinConnect;

  5. The whole LinProxy pipeline is asynchronous, so threads can perform computation while IO is waiting;

Node Query

  • Each node receives requests from LinConnect, computes intermediate results internally and returns them to LinConnect; the details are described later;

Notes:

  • As shown, one query request from the Client generates many small query tasks. Each task has a single responsibility: it does only its own work and passes the result to the next task. All query computation is asynchronous and non-blocking, and IO and CPU tasks are separated;

  • The whole server-side query uses the Actor pattern to simplify the pipeline;

  • If a task completes without producing a result, no downstream task is created; every downstream task is created only if its upstream task produced a result;

  • Finally, the low-level results are aggregated into the final result by Reduce Aggregate; a sketch of such a pipeline follows.
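
A rough sketch of this kind of asynchronous, non-blocking query pipeline using CompletableFuture: leaf tasks scan shards on an IO pool, tasks that produce no result spawn no downstream work, and a final reduce step aggregates the partial results on a CPU pool. This illustrates the pattern only; it is not LinDB's actual Actor implementation, and all names are hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

/** Non-blocking query pipeline: leaf tasks scan shards in parallel, a reduce step merges partial results. */
public class QueryPipeline {
    public static void main(String[] args) {
        ExecutorService ioPool = Executors.newFixedThreadPool(4);   // IO-bound work: reading shard data
        ExecutorService cpuPool = Executors.newFixedThreadPool(2);  // CPU-bound work: aggregation

        List<Integer> shards = List.of(0, 1, 2, 3);

        // One leaf task per shard; a shard without matching data yields Optional.empty(),
        // so it produces no downstream work.
        List<CompletableFuture<Optional<Double>>> leafTasks = shards.stream()
                .map(shard -> CompletableFuture.supplyAsync(() -> scanShard(shard), ioPool))
                .collect(Collectors.toList());

        CompletableFuture<Double> reduced = CompletableFuture
                .allOf(leafTasks.toArray(new CompletableFuture[0]))
                .thenApplyAsync(ignored -> leafTasks.stream()
                        .map(CompletableFuture::join)
                        .flatMap(Optional::stream)
                        .mapToDouble(Double::doubleValue)
                        .sum(), cpuPool);                           // final Reduce Aggregate

        System.out.println("aggregated result = " + reduced.join());
        ioPool.shutdown();
        cpuPool.shutdown();
    }

    private static Optional<Double> scanShard(int shard) {
        // Stand-in for the real shard scan; odd shards pretend to have no matching series.
        return shard % 2 == 0 ? Optional.of(shard * 1.5) : Optional.empty();
    }
}
```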

Storage Structure

Inverted Index

The inverted index is divided into two parts; currently the index-related data is still stored in RocksDB.

  1. A unique ID is generated from the measurement + tags of each time series (similar to the doc ID in Lucene).

  2. The inverted index on tags points to a list of TSIDs. The TSID list is stored as a bitmap, so the desired data can be filtered out with bitmap operations at query time, as sketched after this list. The bitmap implementation is RoaringBitmap.

  3. Each type of data is stored in a separate RocksDB column family.
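
A minimal sketch of the tag inverted index described above, using the RoaringBitmap library (org.roaringbitmap:RoaringBitmap is an assumed dependency): each tag key=value maps to a bitmap of TSIDs, and a query ANDs the bitmaps of its filters. The surrounding class is hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.roaringbitmap.RoaringBitmap;

/** Tag key=value -> bitmap of TSIDs; a query intersects the bitmaps of its tag filters. */
public class TagInvertedIndex {
    private final Map<String, RoaringBitmap> index = new ConcurrentHashMap<>();

    public void add(String tagKeyValue, int tsid) {
        index.computeIfAbsent(tagKeyValue, k -> new RoaringBitmap()).add(tsid);
    }

    /** TSIDs matching tag1 AND tag2, computed as the AND of the two bitmaps. */
    public RoaringBitmap filter(String tag1, String tag2) {
        RoaringBitmap left = index.getOrDefault(tag1, new RoaringBitmap());
        RoaringBitmap right = index.getOrDefault(tag2, new RoaringBitmap());
        return RoaringBitmap.and(left, right);
    }

    public static void main(String[] args) {
        TagInvertedIndex idx = new TagInvertedIndex();
        idx.add("host=1.1.1.1", 1);
        idx.add("host=1.1.1.1", 2);
        idx.add("disk=/dev/sda", 2);
        idx.add("disk=/dev/sda", 3);
        System.out.println(idx.filter("host=1.1.1.1", "disk=/dev/sda")); // {2}
    }
}
```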

Memory Structure

  1. To improve write performance, recent data is written into memory, and the in-memory data is dumped to files after memory reaches a certain limit or after a certain time.

  2. The memory store is split into a writable part and an unwritable part. The writable part accepts normal writes, while the unwritable part is being dumped to files; once the dump succeeds, the unwritable part is cleared.

  3. If the writable part has also hit its write limit but the unwritable part has not finished dumping, writes are blocked until memory becomes available, in order not to consume too much memory and cause an OOM.

  4. MemoryTable stores the Measurement ID -> Measurement Store mapping in a Map, i.e. each measurement is stored in its own store.

  5. The data of each TSID under a measurement is stored in the Measurement Store: each TSID's data lives in a Memory Block, the Memory Blocks are kept in an ArrayList ordered by TSID, and the TSIDs themselves are kept in a bitmap. The position of a TSID in the bitmap is used to locate its Memory Block in the ArrayList (see the sketch after this list). The reason for not using a Map directly is that the whole system is implemented in Java, and Java's Map structures are not well suited to storing many small objects: they duplicate a lot of data in memory.

  6. Each TSID corresponds to one timeline, and each timeline may carry multiple data values: for example, a count has only a count value, while a timer has count/sum/min/max and so on. Each data type is stored in chunks. A chunk is split between on-heap and off-heap memory: recent data is kept on-heap, while historical data is compressed and moved off-heap, keeping as much recent data in memory as possible, because LinDB mainly stores monitoring data and monitoring mostly cares about recent data.
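
A small sketch of the TSID lookup described in point 5: the TSIDs live in a bitmap, the Memory Blocks live in a list ordered by TSID, and the bitmap's rank of a TSID gives its position in the list. Names are illustrative, and a double[] stands in for the real Memory Block.

```java
import java.util.ArrayList;
import java.util.List;
import org.roaringbitmap.RoaringBitmap;

/** Memory Blocks are kept in a list ordered by TSID; the TSID bitmap's rank gives the list position. */
public class MeasurementStore {
    private final RoaringBitmap tsids = new RoaringBitmap();
    private final List<double[]> memoryBlocks = new ArrayList<>(); // stand-in for the real Memory Block

    public void put(int tsid, double[] block) {
        // Assumes TSIDs are added in increasing order, as they are assigned by the index.
        tsids.add(tsid);
        memoryBlocks.add(block);
    }

    public double[] get(int tsid) {
        if (!tsids.contains(tsid)) {
            return null;
        }
        // rank(x) = number of set bits <= x, so the block sits at position rank - 1.
        int position = (int) tsids.rank(tsid) - 1;
        return memoryBlocks.get(position);
    }
}
```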

File Storage Structure

File storage is similar to the memory layout: data of the same measurement is stored together in a block, and at query time the Measurement ID is used to locate the block containing that measurement's data.

  1. After the Measurement Blocks comes an Offset Block, which stores the file offset of each Measurement Block; each offset takes 4 bytes.

  2. After the Offset Block comes a Measurement Index Block, which stores the Measurement IDs in order, in the form of a bitmap.

  3. A Footer Block is stored at the end of the file, containing mainly Version (2 bytes) + Measurement Index Offset (4 bytes) + Measurement Index Length (4 bytes).

  4. Data blocks contain only numeric values, so XOR compression is used, following Facebook's Gorilla paper (a simplified illustration follows).
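
A simplified illustration of the Gorilla-style XOR idea referenced in point 4: consecutive equal values XOR to zero and cost a single bit, while changed values only need their meaningful bits plus leading/trailing-zero information. The full bit-packing of the paper is omitted here.

```java
/** Simplified illustration of Gorilla-style XOR encoding: identical consecutive values XOR to 0. */
public class XorEncodeDemo {
    public static void main(String[] args) {
        double[] points = {1.0, 1.0, 1.0, 1.5, 1.5};
        long previous = Double.doubleToLongBits(points[0]);
        System.out.println("first value stored raw: 0x" + Long.toHexString(previous));

        for (int i = 1; i < points.length; i++) {
            long current = Double.doubleToLongBits(points[i]);
            long xor = previous ^ current;
            if (xor == 0) {
                // The full Gorilla encoding stores a single '0' bit here.
                System.out.println("point " + i + ": repeated value, 1 bit");
            } else {
                // The full encoding stores leading/trailing zero counts plus the meaningful bits.
                int meaningfulBits = 64 - Long.numberOfLeadingZeros(xor) - Long.numberOfTrailingZeros(xor);
                System.out.println("point " + i + ": " + meaningfulBits + " meaningful bits");
            }
            previous = current;
        }
    }
}
```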

Measurement Block:

  • Each Measurement Block is laid out in the same way as the file itself, except that the Measurement ID is replaced with the TSIDs inside the measurement.

  • A TS Entry stores the data of each column of a TSID; one column of data corresponds to the data points of one period of time.

Query logic:

  • When a DataFile is loaded for the first time, its Measurement Index is put into memory. The queried Measurement ID is looked up in the Measurement Index to get its position N, and then the offset of the specific Measurement Block is read from the Offset Block: since each offset is 4 bytes, offset position = (N-1) * 4, and reading 4 bytes from there yields the real offset.

  • In the same way, the specific TS Entry can be found through the TSID; the column data is then filtered by the query conditions, yielding exactly the data that needs to be read. A sketch of the offset lookup follows.
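
A small sketch of the offset arithmetic in the query logic above: given position N from the Measurement Index, the 4-byte offset of the Measurement Block is read at (N-1) * 4 in the Offset Block. Big-endian byte order and the helper names are assumptions.

```java
import java.nio.ByteBuffer;

/** Locates a Measurement Block: position N in the Measurement Index -> 4-byte offset in the Offset Block. */
public class OffsetLookup {
    /**
     * @param offsetBlock raw bytes of the Offset Block (4 bytes per measurement, in index order)
     * @param n           1-based position of the Measurement ID inside the Measurement Index bitmap
     * @return file offset of that measurement's Measurement Block
     */
    public static int measurementBlockOffset(byte[] offsetBlock, int n) {
        int position = (n - 1) * 4;                      // each offset is stored in 4 bytes
        return ByteBuffer.wrap(offsetBlock, position, 4).getInt();
    }

    public static void main(String[] args) {
        // Offset Block for three measurements with block offsets 0, 120 and 310.
        ByteBuffer demo = ByteBuffer.allocate(12).putInt(0).putInt(120).putInt(310);
        System.out.println(measurementBlockOffset(demo.array(), 2)); // -> 120
    }
}
```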

Summary

LinDB has been officially serving the company's monitoring system for two years. From 1.0 to 2.0 it has run stably for more than two years, with almost no incidents apart from one RocksDB problem. The performance of 3.0 has improved greatly. The design basically builds on mature solutions in the industry and has evolved step by step.

People have asked why LinDB is so fast. We studied the practices of many TSDBs, took the best parts of their designs, and then made some optimizations for the characteristics of time series data.

  1. Time series data is generally written at the latest time, but the writes are still random across series. We first turn them into sequential writes in memory, and finally write the files sequentially, with all data in order, so queries become sequential reads as well, which is critical;

  2. The written measurement/tags/fields are converted into ints, an inverted index is generated, and finally a TSID is produced (similar to Lucene's doc ID), which greatly reduces the final data volume. After all, metric strings make up the bulk of the data, as in OpenTSDB. Although InfluxDB stores a time range per block, it still puts these strings in the block header, and that is a cost, especially during compaction;

  3. Unlike other TSDBs that store timestamps directly (a millisecond timestamp takes 8 bytes; delta encoding exploits the time ordering and compresses well, but we wanted to go further), we use one bit to represent time. Specifically, as described above, the high-order part of the time is encoded in the directory/segment structure together with the storage interval, a delta is computed against that high-order part, and the delta is stored as 1 bit indicating whether a slot has data. Since most monitoring data is continuous, this is reasonable, and it greatly reduces the space spent on time data;

  4. We found that, for a metric with multiple fields, the adjacent points within each field are usually very similar. LinDB 2.0 stored data directly in RocksDB, with multiple fields stored together and adjacent points compressed; the compression ratio was not very high, and every query had to read all the data out. In LinDB 3.0 we therefore implemented columnar storage ourselves: the same column is stored together to improve the compression ratio, and only the required data is read at query time. We did not use gzip/snappy/zlib for compression because they are not well suited to numeric data; instead we adopted the XOR approach from Facebook's Gorilla paper, which many TSDBs now use;

  5. With the above, sequential reads are no longer a problem, and queries by TSID are not a problem either, because the whole design is based on TSID -> data. What remains is the random reads for a set of TSIDs obtained from the inverted index. As described above, we put the TSIDs into a bitmap and compute the offset from the bitmap, locating the data directly: thanks to the storage layout we can find a TSID's data precisely, without a binary search;

  6. Another point: once the interval is specified when a database is created, LinDB rolls the data up automatically. Unlike InfluxDB, which requires writing many Continuous Queries, everything in LinDB is automatic;

  7. Query computation uses parallel, streaming processing;

Summed up in one sentence: an efficient index plus a pile of numbers, and then it is all about how to handle that pile of numbers.

Self-monitoring

LinDB also comes with some monitoring functions of its own.

Overview

Dashboard

Outlook for the future

  1. Richer query functions;

  2. Optimize memory usage;

  3. Improved self-monitoring;

  4. If possible, plan to open source;

Comparative test

Below are some query performance comparisons between InfluxDB and LinDB 2.0. Since InfluxDB clustering requires the commercial version, InfluxDB was tested as a single machine with the default configuration and without cache. Server configuration: Alibaba Cloud machine, 8 cores, 16 GB memory.

Large-dimension test

Tags: host(40000), disk(4), partition(20), simulating server disk monitoring. The total number of series is 3.2 million, and each series writes one data point.

Small-dimension aggregation test over 1 day

Tags: host(400), disk(2), partition(10), simulating server disk monitoring. The total number of series is 8K. Each series writes one day of data, one point every 2s, i.e. 43200 points per series per day; across all series this is 43200 * 8000 = 345,600,000 points, more than 300 million.

Small-dimension aggregation test over 7 days

Tags: host(400), disk(2), partition(10), simulating server disk monitoring. The total number of series is 8K. Each series writes 7 days of data, one point every 5s, i.e. 17280 points per series per day; across all series and all days this is 17280 * 8000 * 7 = 967,680,000 points, more than 900 million. One note on this test: it benefits from LinDB's automatic rollup; if InfluxDB had Continuous Queries enabled, I believe it should be fine too.

Author

Huang Jie: joined Ele.me in 2015 and is currently a senior development manager in the framework tools department, mainly responsible for Ele.me's monitoring system and the tools surrounding it.

