VLDB 2023 Paper Interpretation - Krypton: ByteDance's Real-Time Serving and Analytical SQL Engine

" Krypton originates from Krypton in the DC universe. It is the hometown of Superman and is named after the krypton element ."

Introduction

In recent years, in addition to complex analytical requirements, ByteDance's internal businesses have demanded stronger online serving capabilities over real-time data. Most teams had to operate multiple systems for different workloads; this met their needs, but it introduced data consistency problems across systems, wasted large amounts of resources on ETL between them, and forced engineers to learn and maintain several systems at once. To solve this, we started the Krypton project, a new-generation real-time serving and analytical system (HSAP: Hybrid Serving and Analytical Processing) for complex businesses, aiming to handle complex big-data analysis while also serving real-time data online.

Paper link: https://www.vldb.org/pvldb/vol16/p3528-chen.pdf

Background and introduction

The figure above shows ByteDance's typical advertising backend architecture. Data flows into different systems through Kafka. On the offline path, data usually flows into Spark/Hive for computation, and the results are exported via ETL to systems such as HBase/ES/ClickHouse to serve online queries. On the real-time path, data enters HBase/ES directly to serve high-concurrency, low-latency online queries, and also flows into ClickHouse/Druid to serve online aggregation queries. The problem, as noted in the introduction, is that data is stored redundantly in multiple copies, which causes many consistency problems and wastes a great deal of resources. To solve this, we designed Krypton (an HSAP system). Its main design goals are:

  1. Tunable. We want one system that can cope with a wide variety of workloads, with each component able to scale independently as the workload requires.

  2. High concurrency and low latency. To serve online workloads, the system must sustain million-level concurrency with millisecond-level latency.

  3. Strong data consistency. Our customers expect atomic data ingestion and support for Snapshot Reads.

  4. High freshness. Most users need data to be visible at sub-second level; some serving scenarios require millisecond-level visibility.

  5. High-throughput ingestion. In big-data scenarios, import performance is critical.

  6. Standard SQL support. Many users are migrating from systems such as MySQL, so ANSI SQL support is critical for migration.

System Overview

Data Model

As shown in the figure, Krypton supports two levels of partitioning: the first level is called a Partition and the second a Tablet, and each level supports Range/Hash/List partitioning strategies. Each Tablet contains a set of Rowsets, and the data inside each Rowset is sorted by the Sort Key defined in the schema. Each Rowset carries a version number, so rows with the same Primary Key may exist in multiple Rowsets; at read time, the multiple versions are merged into one copy according to the table's merge algorithm. A Tablet's Commit Version is the largest Rowset version number under that Tablet; for example, the Commit Version of Tablet 2 in the figure is 21, the version number of Rowset 5. Every query carries the version number of the data it reads, which provides Snapshot Reads.

Depending on the merge algorithm, Krypton supports three table models (a merge-on-read sketch follows the list):

  1. Duplicate Table: multiple copies of the same row may coexist; nothing is merged.

  2. Unique Table: a Primary Key (PK) must be defined; only one copy of each PK is kept, with higher versions overwriting lower ones.

  3. Aggregate Table: like the Unique Table, a PK must be defined, but the algorithm used to merge columns across rows with the same PK can be customized.
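
To make the merge semantics concrete, here is a minimal merge-on-read sketch over rowset versions; the function and data layout are illustrative stand-ins, not Krypton's actual code:

```python
def merge_on_read(rowsets, model, agg=None):
    """Merge multiple rowset versions at read time.

    rowsets: list of (version, rows) pairs, where rows = [(pk, value), ...].
    model:   'duplicate' | 'unique' | 'aggregate'.
    agg:     column-merge function for 'aggregate', e.g. lambda a, b: a + b.
    """
    rowsets = sorted(rowsets, key=lambda rv: rv[0])   # ascending version
    if model == "duplicate":
        return [row for _, rows in rowsets for row in rows]  # keep every copy

    merged = {}
    for _, rows in rowsets:
        for pk, value in rows:
            if model == "unique" or pk not in merged:
                merged[pk] = value                    # higher version wins
            else:                                     # 'aggregate'
                merged[pk] = agg(merged[pk], value)
    return sorted(merged.items())

# A query pinned at version 21 sees both rowsets merged into one snapshot.
rs = [(10, [("k1", 1), ("k2", 2)]), (21, [("k1", 5)])]
print(merge_on_read(rs, "unique"))                        # [('k1', 5), ('k2', 2)]
print(merge_on_read(rs, "aggregate", lambda a, b: a + b)) # [('k1', 6), ('k2', 2)]
```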

Architecture

As shown in the figure above, Krypton’s architecture has the following characteristics:

  1. Separation of storage and compute

    1. Krypton's data lives on Cloud Store, e.g., HDFS or object stores with a standard S3 interface; metadata is likewise kept in external storage systems such as ZooKeeper or a distributed KV store.

  2. Read/write separation

    1. The Ingestion Server handles data import, and the Compaction Server periodically merges data. On ingest, the Ingestion Server writes a WAL and puts the data into an in-memory buffer; when the buffer fills, it is flushed as a columnar file to the Cloud Store, the new data is registered with the Meta Server, and the Commit Version of the affected Tablet is updated (see the sketch after this list).

    2. The Coordinator and Data Servers form the read path. The Coordinator fetches the latest schema and data versions from the Meta Server, generates a distributed execution plan, and sends it to the Data Servers, which execute the query plan. Krypton's query processor uses MPP execution.

    3. To improve data visibility further, we support a Dirty Read mode in which Data Servers read data directly from the Ingestion Server's memory, providing millisecond-level visibility.

  3. Cache

    1. To meet the low-latency requirements of online serving, the Coordinator supports a Metadata Cache, Plan Cache, and Result Cache, and the Data Server supports a multi-level data cache across DRAM, PMem, and SSD. To reduce latency spikes, we also support cache pre-warming: new data is pushed to the Data Servers before it is registered with the Meta Server.
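
To make the write path concrete, here is a self-contained sketch of the ingestion flow from point 2.1; every class and method name here is a hypothetical stand-in, not Krypton's API:

```python
class WAL:
    def __init__(self): self.log = []
    def append(self, entry): self.log.append(entry)

class CloudStore:
    def __init__(self): self.files = {}
    def write_columnar_file(self, rows):
        path = f"file-{len(self.files)}"      # stand-in for an HDFS/S3 object
        self.files[path] = list(rows)
        return path

class MetaServer:
    def __init__(self): self.commit_version = {}
    def register_rowset(self, tablet, path):
        v = self.commit_version.get(tablet, 0) + 1
        self.commit_version[tablet] = v       # bump the Tablet's Commit Version
        return v

class IngestionServer:
    """WAL append -> in-memory buffer -> columnar flush -> register with Meta."""
    def __init__(self, wal, cloud, meta, buffer_limit=2):
        self.wal, self.cloud, self.meta = wal, cloud, meta
        self.buffer, self.buffer_limit = [], buffer_limit

    def ingest(self, tablet, rows):
        self.wal.append((tablet, rows))       # durability first
        self.buffer.extend(rows)
        if len(self.buffer) >= self.buffer_limit:
            path = self.cloud.write_columnar_file(self.buffer)   # flush
            self.meta.register_rowset(tablet, path)              # then commit
            self.buffer.clear()

meta = MetaServer()
server = IngestionServer(WAL(), CloudStore(), meta)
server.ingest("tablet-2", ["row1", "row2"])   # fills the buffer, triggers a flush
print(meta.commit_version["tablet-2"])        # 1
```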

Materialized View

Materialized Views (MVs) play an important role in both serving and AP scenarios. Building on its architecture, Krypton implements single-table, real-time, strongly consistent MVs, and an MV does not need to share its Base Table's partitioning strategy.

MV Maintenance

Inside the Ingestion Server, when the Base Table's in-memory data is about to be flushed, the MV Query is executed to transform that in-memory data into MV data, and the MV data is flushed atomically with the Base Table data. On success, both are registered with the Meta Server and the version numbers of the Base Table and the MV are updated atomically, guaranteeing consistency between them.
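
A minimal sketch of this atomic base-plus-MV flush, assuming a single meta dictionary standing in for the Meta Server (names are illustrative):

```python
def flush_with_mv(meta, tablet, base_rows, mv_query):
    """Run the MV query over the base table's in-memory rows, then publish
    base data and MV data in one atomic meta update: both version numbers
    advance together, or neither does."""
    mv_rows = mv_query(base_rows)                  # in-memory rows -> MV rows
    staged = dict(meta)                            # stage the commit
    staged[f"{tablet}/base_version"] = staged.get(f"{tablet}/base_version", 0) + 1
    staged[f"{tablet}/mv_version"] = staged.get(f"{tablet}/mv_version", 0) + 1
    staged[f"{tablet}/mv_rows"] = mv_rows
    meta.clear(); meta.update(staged)              # atomic publish

meta = {}
by_day = lambda rows: [(d, sum(v for day, v in rows if day == d))
                       for d in sorted({day for day, _ in rows})]
flush_with_mv(meta, "t1", [("2022-05-01", 3), ("2022-05-01", 4)], by_day)
print(meta["t1/base_version"], meta["t1/mv_version"])   # 1 1 -- always in step
```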

Query Rewrite

Here we introduce a fairly special rewrite scenario, again from ByteDance's internal business. The original query aggregates data over a time window, for example from 2022-05-01 00:00:00 to 2022-05-09 14:12:15.

Because such queries aggregate a large amount of data and have tight online latency requirements, we use MVs to accelerate them. The approach is as follows:

  1. Create two MVs for the original table, one that aggregates by day and one that aggregates by hour.

  2. Split the query's time window into three parts:

    1. 2022-05-01 00:00:00 - 2022-05-09 00:00:00

    2. 2022-05-09 00:00:00 - 2022-05-09 14:00:00

    3. 2022-05-09 14:00:00 - 2022-05-09 14:12:15

  3. For window 2.a, query the day-level MV; for window 2.b, the hour-level MV; for window 2.c, the detail table; finally, merge the three partial results.

The entire rewrite is completed automatically by the optimizer, with no user involvement.
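
A minimal sketch of that window split, assuming left-closed, right-open windows (this mirrors the rewrite logic but is not the optimizer's code):

```python
from datetime import datetime

def split_window(start: datetime, end: datetime):
    """Split [start, end) into day-MV, hour-MV, and detail-table segments."""
    day_end = end.replace(hour=0, minute=0, second=0, microsecond=0)
    hour_end = end.replace(minute=0, second=0, microsecond=0)
    segments = []
    if start < day_end:
        segments.append(("day_mv", start, day_end)); start = day_end
    if start < hour_end:
        segments.append(("hour_mv", start, hour_end)); start = hour_end
    if start < end:
        segments.append(("detail", start, end))
    return segments

for src, s, e in split_window(datetime(2022, 5, 1), datetime(2022, 5, 9, 14, 12, 15)):
    print(f"{src}: {s} -> {e}")
# day_mv:  2022-05-01 00:00:00 -> 2022-05-09 00:00:00
# hour_mv: 2022-05-09 00:00:00 -> 2022-05-09 14:00:00
# detail:  2022-05-09 14:00:00 -> 2022-05-09 14:12:15
```

The aggregated results from the two MVs and the detail table are then merged, exactly as step 3 describes.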

Automatic Data Model Derivation

In addition, since an MV is itself a table, it may use a different table model. Krypton automatically derives the MV's table model from the Base Table's model and the MV Query, reducing the user's burden.

Query Processor

Krypton implements a push-based vectorized engine with a coroutine-based asynchronous scheduling and execution framework. The figure above illustrates the execution of a query: the Coordinator turns the optimized query into Fragments and dispatches them to a group of Data Servers. The query in the figure produces two Fragments. Fragment 1 scans the two tables and performs a colocated join, shuffling its results to the Data Servers hosting Fragment 0; Fragment 0 aggregates the data, and the Coordinator periodically pulls the results. Internally, a Fragment is divided into multiple Pipes, each a chain of operators whose execution never blocks. Pipes are connected by a Local Exchanger operator, and each Pipe can run at its own degree of parallelism.
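
The pipe-and-exchanger structure can be illustrated with coroutines; the following is a toy model of a two-pipe fragment connected by a queue standing in for the Local Exchanger, not Krypton's engine:

```python
import asyncio

async def scan(batches, out):
    # Pipe 1: push each batch downstream without blocking.
    for batch in batches:
        await out.put(batch)
    await out.put(None)                       # end-of-stream marker

async def aggregate(inp, result):
    # Pipe 2: consume batches as they arrive and fold them into one value.
    total = 0
    while (batch := await inp.get()) is not None:
        total += sum(batch)                   # per-batch (vectorized) work
    result.append(total)

async def main():
    exchange = asyncio.Queue(maxsize=2)       # the "Local Exchanger"
    result = []
    await asyncio.gather(
        scan([[1, 2, 3], [4, 5]], exchange),
        aggregate(exchange, result),
    )
    print(result[0])                          # 15

asyncio.run(main())
```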

Statistics and Query Cache

  1. Query Cache

    1. Cache Maintenance: to avoid serving stale data, the version number is made part of the cache key, and a background thread periodically compares entries against the data versions in the Meta Server and removes expired cache entries (see the sketch after this list).

    2. Plan/Stats/Result Cache: query plans are cached on the Coordinator, as are the selectivity estimates of some query fragments; query results are cached too, which pays off when data changes infrequently and many identical queries arrive. Krypton also caches some intermediate results of query execution so that other queries can reuse them.

  2. Statistics

    1. Incremental Stats: Krypton dynamically maintains table row counts and column NDVs, using HLL to compute NDV deltas. When the Ingestion Server flushes data, it computes the row count and HLL NDV of the in-memory data and submits them to the Meta Server.

    2. Dynamic Sampling: to estimate filter selectivity, Krypton issues a sample query plan fragment during planning to collect statistics. On the TPC-H 1T test set, estimates from sampled data differ from the true statistics by only about 1%, and the overhead of the sample query is under 2% of execution time. In addition, each query collects some lightweight statistics during execution and returns them to the Coordinator to help the optimizer refresh its statistics.
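
As an illustration of the versioned cache keys described under Cache Maintenance, here is a small sketch; the class and its methods are hypothetical, not Krypton's interfaces:

```python
class VersionedResultCache:
    """Result cache keyed by (query, data version); a background task evicts
    entries whose version lags the Meta Server's committed version."""

    def __init__(self):
        self.entries = {}                          # (sql, version) -> result

    def get(self, sql, version):
        return self.entries.get((sql, version))

    def put(self, sql, version, result):
        self.entries[(sql, version)] = result

    def evict_stale(self, committed_version):
        # Periodic comparison against the Meta Server's committed version.
        self.entries = {(sql, v): r for (sql, v), r in self.entries.items()
                        if v >= committed_version}

cache = VersionedResultCache()
cache.put("SELECT count(*) FROM t", 21, 1000)
print(cache.get("SELECT count(*) FROM t", 21))     # 1000: hit at version 21
cache.evict_stale(committed_version=22)            # data has advanced to v22
print(cache.get("SELECT count(*) FROM t", 21))     # None: stale entry removed
```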

Concurrency control

Krypton uses a combination of static and dynamic methods to determine the concurrency of Query execution.

  1. During planning, the Optimizer determines Fragment-level and Pipe-level parallelism based on the number of Data Servers. This avoids the overhead of modifying plans dynamically and removes Local Exchangers where possible to avoid data shuffles.

  2. During execution, each Pipe corresponds to an execution Task, which is handed to a coro-thread to run. The actual degree and order of execution are decided dynamically by the underlying Coro-scheduler based on current system conditions. Tasks can be given different priorities; when a higher-priority task arrives, the Coro-scheduler dynamically reduces the number of coro-threads serving in-progress tasks. Compared with pthreads, coro-threads have far cheaper context switches, and IO can be asynchronous, keeping the CPU fully utilized.

Resource isolation

The workloads of Serving and AP are quite different, so resource isolation is very important for mixed workload scenarios. Krypton implements a two-level resource isolation strategy.

1. DS-instance-granularity resource isolation

Since Krypton uses a cloud-native deployment model, each DS instance corresponds to a container, so DS instances can be divided into multiple Resource Groups, and different workloads are isolated by Resource Group. Thanks to Krypton's storage-compute separation, multiple Resource Groups can share one copy of the data. For ad-hoc ETL queries, Krypton can quickly spin up resources for processing and release them when the job finishes.

2. Coro-based resource isolation within a DS

Within a Resource Group, different queries also need isolation from one another. Krypton provides a coroutine-based fair scheduling strategy. As shown in Figure 6, each core is bound to a Task Group that manages all Tasks assigned to it, each Task corresponding to one coro-thread. A Task is first submitted to the Local Task Queue to wait for execution; if it has not finished after running for a time slice t, it is moved to the Global Time-slicing Queue. When its Local Task Queue is empty, a Task Group fetches Tasks from the Global Queue, which is prioritized by the CPU time each Task has already consumed. This is the essence of the fair scheduling algorithm.
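
The two-level queueing can be sketched as follows; the time accounting and classes are simplified stand-ins for the scheduler described above:

```python
import heapq
from collections import deque

class Task:
    def __init__(self, name, work):
        self.name, self.left, self.done = name, work, False
    def run(self, t):                          # run for at most one time slice
        ran = min(t, self.left)
        self.left -= ran
        self.done = self.left == 0
        return ran
    def __lt__(self, other):                   # heap tie-break on equal CPU time
        return self.name < other.name

class FairScheduler:
    """Per-core local queue plus a global queue ordered by CPU time consumed."""
    def __init__(self, time_slice):
        self.time_slice = time_slice
        self.local = deque()                   # newly submitted tasks
        self.global_q = []                     # (cpu_time_used, task)

    def submit(self, task):
        self.local.append((0.0, task))

    def run_once(self):
        if self.local:
            used, task = self.local.popleft()
        elif self.global_q:
            used, task = heapq.heappop(self.global_q)   # least CPU time first
        else:
            return None
        ran = task.run(self.time_slice)
        if not task.done:                      # unfinished: back to global queue
            heapq.heappush(self.global_q, (used + ran, task))
        return task

sched = FairScheduler(time_slice=10)
sched.submit(Task("long", 35)); sched.submit(Task("short", 5))
while sched.run_once():                        # short finishes early; long
    pass                                       # accumulates CPU time and waits
```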

Optimizations specific to Serving scenarios

1. Lightweight API

In serving scenarios, individual queries are usually simple and return few results. When the Coordinator finds that it has generated a single-node plan, it directly calls the Lightweight API of the corresponding DS to fetch the result. The Lightweight API avoids the multiple RPC round trips of the large-query path and a large amount of thread switching.

2. Dirty Read

For scenarios with strict freshness requirements, we provide Dirty Read. The Coordinator sends the query to the DS together with the Committed Version; the DS fetches uncommitted data from the Ingestion Server's memory and merges it with the committed data before returning. After the Ingestion Server flushes in-memory data to HDFS, it keeps that data cached for a while, guaranteeing that a Dirty Read request can always see all data after the Committed Version, with no holes.
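
A minimal sketch of the merge performed by a Dirty Read, assuming rows are (version, key, value) triples (illustrative only):

```python
def dirty_read(committed_rows, ingestion_memory, committed_version):
    """Merge committed data as of committed_version with the uncommitted rows
    still held in the Ingestion Server's memory, so results have no holes."""
    fresh = [(v, pk, val) for v, pk, val in ingestion_memory
             if v > committed_version]         # only data past the snapshot
    merged = {pk: val for _, pk, val in sorted(committed_rows + fresh)}
    return merged                              # later versions overwrite

committed = [(20, "k1", 1), (21, "k2", 2)]
in_memory = [(21, "k2", 2), (22, "k1", 9)]     # version 22 not yet committed
print(dirty_read(committed, in_memory, committed_version=21))
# {'k1': 9, 'k2': 2} -- millisecond-fresh value of k1 is visible
```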

Multi-level Cache

To meet performance requirements, Krypton implements a multi-level cache inside the Data Server, using DRAM, PMem, and SSD as cache media. As shown in the figure below, the cache module has three parts: the Cache Index, the Replacement Policy, and the Cache Storage Engine.

Replacement Policy

AP queries often scan large amounts of data, while serving traffic shows strong access locality. Our cache replacement policy therefore needs to be scan-resistant to protect serving performance.

We chose SLRU as our cache replacement policy. Besides being scan-resistant, it does not require taking a lock again to access data already in the cache. Unlike Memcached's SLRU, we store the Cache Index in a lock-free hash table, further reducing lock overhead. Compared with a FIFO policy, ours improves P99 latency by 28% in serving scenarios.
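
For intuition, here is a minimal SLRU with illustrative segment sizes (real implementations track bytes rather than item counts, and use the lock-free index mentioned above):

```python
from collections import OrderedDict

class SLRU:
    """New items enter a probationary segment; a second hit promotes them to
    the protected segment, so a one-pass scan cannot flush hot data."""

    def __init__(self, prob_size=2, prot_size=2):
        self.prob, self.prot = OrderedDict(), OrderedDict()
        self.prob_size, self.prot_size = prob_size, prot_size

    def get(self, key):
        if key in self.prot:
            self.prot.move_to_end(key)            # refresh protected entry
            return self.prot[key]
        if key in self.prob:                      # second access: promote
            value = self.prob.pop(key)
            self.prot[key] = value
            if len(self.prot) > self.prot_size:   # demote coldest protected
                demoted, dval = self.prot.popitem(last=False)
                self._insert_prob(demoted, dval)
            return value
        return None

    def put(self, key, value):
        self._insert_prob(key, value)

    def _insert_prob(self, key, value):
        self.prob[key] = value
        if len(self.prob) > self.prob_size:
            self.prob.popitem(last=False)         # evict coldest probationary

cache = SLRU()
cache.put("a", "A"); cache.put("b", "B")
cache.get("a")                                    # 'a' promoted to protected
for k in ("s1", "s2", "s3"):                      # a scan floods probation only
    cache.put(k, k)
print(cache.get("a"))                             # 'A' survived the scan
```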

NUMA-Aware Async PMem Write

PMem offers good read latency and throughput, but write bandwidth is its bottleneck: PMem write bandwidth is only about one-sixth of DRAM's, it saturates at a lower concurrency level than PMem reads do, and performance drops drastically when accessed across NUMA nodes.

Krypton implements a NUMA-aware asynchronous write strategy to improve PMem write performance. As shown in the figure above, each PMem device has a dedicated write thread pool bound to its NUMA node and responsible for all writes to that device; asynchronous write tasks are dispatched to the corresponding pool. In our tests, with 3 threads per pool, PMem write performance improved by 23%.
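
The dispatch structure can be sketched as below; actual NUMA pinning requires OS-level affinity calls, which this illustration omits:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_DEVICES = 2
# One small pool per PMem device, conceptually bound to the device's NUMA node.
pools = {dev: ThreadPoolExecutor(max_workers=3,   # 3 threads/pool, as in the test
                                 thread_name_prefix=f"pmem{dev}")
         for dev in range(NUM_DEVICES)}
devices = {dev: [] for dev in range(NUM_DEVICES)} # in-memory stand-in for PMem

def write(dev, offset, data):
    devices[dev].append((offset, data))           # the actual device write

def async_write(dev, offset, data):
    # All writes to a device go through its own NUMA-local pool.
    return pools[dev].submit(write, dev, offset, data)

futures = [async_write(i % NUM_DEVICES, i * 4096, b"page") for i in range(8)]
for f in futures:
    f.result()                                    # wait for the async writes
print({dev: len(ws) for dev, ws in devices.items()})   # {0: 4, 1: 4}
```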

ZonedStore-Based SSD Cache

The SSD cache lets Krypton keep as much data as possible cached locally and warm up quickly after a restart. Most SSD caches in the industry are built on LSM-tree KV stores such as RocksDB, but the LSM-tree was not designed for SSD caching and causes substantial space waste and read/write amplification. To solve this, we designed ZonedStore.

ZonedStore divides the SSD into equal-sized Zones, only one of which is writable at a time. Newly written data is appended sequentially to the currently writable Zone, reducing write amplification inside the SSD. Since most cache items in ZonedStore are larger than 4 KB, the index of all items can be kept in memory, speeding up lookups and reducing read amplification. To speed up index recovery after a restart, a Summary Segment is written at the end of each Zone.

ZonedStore reclaims space at Zone granularity. Each Zone's garbage ratio and access frequency are tracked in its in-memory Zone metadata, and the GC policy selects Zones with a high garbage ratio and low access frequency for reclamation (see the sketch below). Evicted cache items are only marked soft-deleted; because cache data in Krypton is immutable, they can continue to serve reads until their Zone is reclaimed. To keep GC from causing write amplification, ZonedStore simply discards the still-valid data in a reclaimed Zone.
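
A sketch of the victim-selection logic; the scoring formula is illustrative, not ZonedStore's actual heuristic:

```python
def pick_gc_zone(zones):
    """Pick the zone with a high garbage ratio and low access frequency."""
    def score(zone):
        return zone["garbage_ratio"] / (1.0 + zone["access_freq"])
    victim = max(zones, key=score)
    # Valid data in the victim is discarded rather than rewritten, trading a
    # few cache misses for zero GC write amplification.
    victim["items"] = []
    return victim["id"]

zones = [
    {"id": 0, "garbage_ratio": 0.7, "access_freq": 0.1, "items": ["a", "b"]},
    {"id": 1, "garbage_ratio": 0.9, "access_freq": 5.0, "items": ["c"]},
    {"id": 2, "garbage_ratio": 0.2, "access_freq": 0.0, "items": ["d"]},
]
print(pick_gc_zone(zones))   # 0: garbage-heavy and rarely accessed
```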

As the figure above shows, ZonedStore delivers substantially better latency and throughput than RocksDB under every workload.

Storage format

To handle serving and AP workloads at once, Krypton designed its own storage file format. A Data Page (1 MB) is the basic unit of reads and writes. A file is divided into three sections: Data, Index, and Meta, each organized by column. When processing a query, the indexes are first used to filter down to the Data Pages that must be read, and only then are those pages accessed.

Encoding and Index Algorithms

Krypton uses a variety of data encodings and indexes to speed up scans and point lookups. To locate data's physical position quickly, users can choose appropriate indexes in DDL. Krypton supports the following indexes (a pruning sketch follows the list):

  1. Ordinal Index: quickly finds the target Data Page by row number.

  2. Sparse Index: Min/Max, Bloom Filter, and Ribbon Filter indexes quickly filter out irrelevant Data Pages.

  3. Short-key Index: a special sparse index built using the first 36 bytes of the Sort Key as the index key.

  4. BitMap Index: quickly filters row numbers for equality predicates.

  5. Skip Index: quickly locates data within a Data Page.
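
As a tiny example of index-based pruning, here is Min/Max filtering over per-page statistics (the data layout is illustrative):

```python
def prune_pages(page_stats, lo, hi):
    """Keep only Data Pages whose [min, max] range can overlap [lo, hi]."""
    return [pid for pid, (pmin, pmax) in page_stats.items()
            if pmax >= lo and pmin <= hi]

pages = {0: (1, 80), 1: (90, 150), 2: (151, 400)}   # page id -> (min, max)
print(prune_pages(pages, lo=100, hi=120))           # [1]: pages 0 and 2 skipped
```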

Nested Type Handling

Krypton handles composite data types differently from Dremel. Dremel stores only the leaf nodes, whereas Krypton organizes all fields as a B-tree and stores each field's data sequentially and independently. Non-leaf nodes store their children's Occurrence and Validity information; leaf nodes store the data. Occurrence is the prefix sum of the number of occurrences of the subfield, so the offset and length of repeated data can be obtained in O(1) time; even for nested, repeated data we achieve O(m) lookup, where m is the depth of the schema tree (see the sketch after the list below). Validity distinguishes an empty field from a NULL field, and no data at all is stored for NULL fields, making sparse data cheap to store. Compared with Dremel, our approach has two advantages:

  1. Better storage efficiency for sparse fields.

  2. Better seek efficiency for nested repeated types.
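
The Occurrence prefix-sum idea can be sketched as follows for one repeated field; the flat-list layout is illustrative:

```python
rows = [["a", "b"], [], ["c", "d", "e"]]     # a repeated string field

occ = [0]                                    # prefix sums of child occurrences
leaf = []                                    # leaf node stores values only
for row in rows:
    leaf.extend(row)
    occ.append(occ[-1] + len(row))           # occ == [0, 2, 2, 5]

def values_of_row(i):
    # Offset and length come straight from the prefix sums: O(1) per level,
    # O(m) overall for a schema tree of depth m.
    return leaf[occ[i]:occ[i + 1]]

print(values_of_row(2))                      # ['c', 'd', 'e']
print(values_of_row(1))                      # [] -- empty row, no data stored
```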

Query Engine Integration

Krypton's storage format is co-designed with query execution. To minimize IO, late materialization and predicate pushdown are used extensively: predicate filtering and column pruning happen in the format layer, together with pushed-down runtime-filter predicates and the file indexes. During a read, predicates that can use an index run first and produce a set of selected row numbers (a Selection Vector). Next, the expression framework evaluates the predicates that cannot use an index, further shrinking the selection, and column pruning is applied. Finally, data is materialized only for the rows in the Selection Vector. Krypton can also compute directly on encoded data, in which case the format layer returns encoded data to the query engine.
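
The read path can be summarized in a short sketch; the predicates and column layout are illustrative:

```python
def scan_with_late_materialization(columns, index_pred, expr_pred, projected):
    """Index predicates build a selection vector, expression predicates shrink
    it, and only then are projected columns materialized for surviving rows."""
    n = len(next(iter(columns.values())))
    selection = [i for i in range(n) if index_pred(i, columns)]   # via indexes
    selection = [i for i in selection if expr_pred(i, columns)]   # via exprs
    return {col: [columns[col][i] for i in selection]             # materialize
            for col in projected}

cols = {"a": [1, 5, 9, 12], "b": ["w", "x", "y", "z"]}
out = scan_with_late_materialization(
    cols,
    index_pred=lambda i, c: c["a"][i] > 3,        # matched by a sparse index
    expr_pred=lambda i, c: c["a"][i] % 2 != 0,    # no index: expression framework
    projected=["b"],
)
print(out)   # {'b': ['x', 'y']}
```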

We compared against the Parquet format on the TPC-H and Magnus datasets (Magnus is an internal ByteDance ML dataset that makes heavy use of composite types). As the table above shows, Krypton's format reads 21% faster than Parquet on TPC-H and 40% faster on Magnus. In file size, Krypton is 13% larger on TPC-H, mainly because of its internal indexes, but 8% smaller on Magnus, thanks to its efficient storage of composite types.

Experiments

Environment

  1. Benchmark workloads: YCSB Workload C + TPC-H 1T

  2. Production environment: the Zhuxiaobang scenario (a ByteDance one-stop home-decoration and home-services platform). This is a typical feature-serving scenario, requiring aggregation over arbitrary time windows for any feature of a given user, with continuous data ingestion, real-time queries, and a query QPS of 10K/s.

  3. Cluster configuration: 8 physical machines (2.4 GHz, 48 cores, 96 vCPUs, 128 GB DRAM, 512 GB PMem, 2 TB NVMe, 25G NICs)

  • Coordinators: 2

  • Data Servers: 3

  • Compaction Server: 1

  • Ingestion Server: 1

  • Metadata Server: 1

Hybrid Performance

  • Resource Group Isolation


We created two Resource Groups to carry the YCSB and TPC-H workloads respectively. As Table 4 and Figure 9 show, compared with running YCSB and TPC-H 1T separately, isolation via Resource Groups causes no obvious performance loss.

  • Fair Scheduling

To verify that fair scheduling resolves resource contention within a single Resource Group, we ran TPC-H Q6 and Q21 in the same Resource Group, representing a short query and a long query respectively.

Both queries start with 1 client each; the number of Q6 clients then increases through 1, 2, 4, and 8.

From Figure 10, we can see:

  1. Without fair scheduling, Q21's performance regresses significantly as Q6's concurrency increases;

  2. With fair scheduling, we allocated 20% of resources to Q21 and 80% to Q6; Q21's latency then increased only slightly with the number of clients.

As Figure 11 shows, when Q6 first starts it does not use up its 80% share, consuming only about 53%, and fair scheduling adaptively lends the remaining 27% to Q21. As the number of Q6 clients grows, both Q6 and Q21 come to fully use their own shares.

  • Adaptive Parallelism Control

To verify our adaptive concurrency control, we used 4 clients (G0 - G3), each repeatedly issuing Q6 at maximum parallelism. Figure 12 shows that with only G0 running and CPU to spare, Q6 executes at maximum parallelism; as G1 - G3 start, CPU contention appears, and the number of coro-threads each client runs adjusts dynamically.

Production Performance

  • Effects of Optimizing Time Range Queries

To test the effect of rewriting time-range queries with MVs, we replayed the real online workload of Zhuxiaobang.

We fixed the end time and varied the start time dynamically, with the overall time range spanning from 10 minutes to 10 hours.

  • Effects of Lightweight API

We compared latency at 10K QPS online: with the Lightweight API enabled, query P99 latency dropped by 45%.

  • Data Freshness of Streaming Ingestion

Data freshness is defined as the interval between when a record is ingested and when it becomes queryable. As Figure 15 shows, P99 data freshness holds steady at around 15 ms and does not degrade as the ingestion rate increases.

  • Read/Write Scenario in Production

Zhuxiaobang is a typical mixed read/write scenario, with a daily peak from 18:00 to 22:00 during which the ingestion rate rises by 460% and query QPS by 300%. Because Krypton has a read/write-separated architecture, as shown in Figure 16, query P99 latency barely changes during the peak and stays within 60 ms.

Summary

Throughout Krypton's design, development, and rollout we learned a number of useful lessons:

  1. Most of Krypton's internal customers previously used Doris, and the tooling ecosystem around Doris is fairly complete, so we decided from the start to stay fully compatible with Doris at the interface and data-model level. Thanks to this, users met little resistance when migrating from Doris, and parts of the existing ecosystem could continue to be used.

  2. Look for optimization opportunities in real user scenarios. For example, we found users with high QPS whose query patterns were essentially fixed, differing only in filter values; here the Result/Plan Cache pays off handsomely. Other techniques, such as a WAL that supports compression and a fully asynchronous write path, proved extremely valuable in high-rate ingestion scenarios.

  3. Test with production traffic. Krypton is a complex system, and users are often skeptical of a new system's stability, so we built a dual-read/dual-write framework to gray-scale production traffic onto Krypton and switched the traffic over once the system had run stably.
