Tens of thousands of QPS on a single node! A look at Apache Doris's high-concurrency features


With the rapid expansion of its user base, more and more users are adopting Apache Doris to build unified analytics platforms within their enterprises. On the one hand, Apache Doris is asked to handle ever-larger business scales; on the other, it must answer increasingly diverse analytical demands: beyond typical OLAP scenarios such as statistical reports, ad-hoc queries, and interactive analysis, it is expanding into business scenarios such as recommendation, risk control, tagging and user profiling, and IoT, among which data service (Data Serving) is one of the representative classes of requirements.

Data Serving usually refers to providing data access services to users or enterprise customers. The most frequent query pattern is retrieving one or a few rows of data by key, for example:

  • Order detail queries

  • Product detail queries

  • Logistics status queries

  • Transaction detail queries

  • User information queries

  • User profile attribute queries

  • ...

Unlike ad-hoc queries oriented toward large-scale data scanning and computation, Data Serving in real business typically takes the form of high-concurrency point queries: the amount of data returned is small, usually a single row or a few rows, but the queries are extremely sensitive to latency, expected to return results within milliseconds, and face the challenge of ultra-high concurrency.

When faced with such business requirements in the past, people usually deployed different system components to carry these queries. OLAP databases are generally built on columnar storage engines and query frameworks designed for big-data scenarios, with system capability measured in data throughput, so their performance in high-concurrency Data Serving point-query scenarios often falls short of user expectations. Users therefore typically introduce KV systems such as Apache HBase to handle point queries, with Redis as a cache layer to absorb the pressure of high concurrency. Such an architecture, however, is complex and suffers from redundant storage and high maintenance costs.

Converging these workloads onto a unified analytics platform raises the bar for what Apache Doris must carry, and it also made us think more systematically about how to better meet business needs in such scenarios. With these considerations in mind, the upcoming version 2.0 introduces, on top of the existing capabilities, a series of optimizations aimed at point queries; a single node can reach ultra-high concurrency of tens of thousands of QPS, greatly broadening the boundary of applicable scenarios.

# How to handle high-concurrency queries?

High concurrency has always been one of Apache Doris's strengths. For high-concurrency queries, the core is balancing limited system resources against the load imposed by concurrent execution; in other words, minimizing the CPU, memory, and IO overhead of each individual SQL statement. The key is to reduce the amount of underlying data scanned and the computation that follows. The main optimization techniques are as follows:

   Partition and bucket pruning

Apache Doris uses two-level data partitioning. The first level is the Partition, which usually takes time as the partition key; the second level is the Bucket, which distributes data across nodes by hash to improve read parallelism and overall read throughput. Reasonable bucketing can markedly improve query performance. Take the following query as an example:

select * from user_table where id = 5122 and create_date = '2022-01-01'

The user sets create_date as the partition key and id as the bucketing key, with 10 buckets. After partition and bucket pruning, irrelevant data is filtered out quickly, and in the end only a small amount of data needs to be read, for example one bucket within one partition. The query results can thus be located quickly, minimizing both the amount of data scanned and the latency of a single query.
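
Here is a minimal sketch of such a table, assuming the column names from the query above (the extra column, partition layout, and replication setting are illustrative):

-- Sketch only: create_date is the range partition key, id the hash bucketing key.
CREATE TABLE user_table (
    id BIGINT,
    create_date DATE,
    city VARCHAR(32)
) ENGINE=OLAP
DUPLICATE KEY(id, create_date)
PARTITION BY RANGE(create_date) (
    PARTITION p202201 VALUES LESS THAN ('2022-02-01'),
    PARTITION p202202 VALUES LESS THAN ('2022-03-01')
)
DISTRIBUTED BY HASH(id) BUCKETS 10
PROPERTIES ("replication_num" = "1");

With this layout, the example query is pruned to partition p202201 and to the single bucket that the hash of id 5122 maps to.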

   Indexes

In addition to partition and bucket pruning, Doris provides a rich set of index structures to speed up data reading and filtering. The indexes fall roughly into two categories: smart indexes, which are generated automatically as Doris data is written, with no user intervention, and secondary indexes, which are created manually.

Smart indexes include prefix indexes and ZoneMap indexes:

  • A prefix sparse index (Sorted Index) is an index built on top of the sorted structure. Doris stores data in files ordered by the sort columns, and creates one sparse index entry for every 1024 rows of sorted data; the key of each entry is the value of the prefix sort columns of the first row in those 1024 rows. When a query's conditions include these sort columns, the prefix sparse index quickly locates the starting row (see the sketch after this list).

  • The ZoneMap index is built at the Segment and Page levels. For each column, the maximum and minimum values within each Page are recorded, and likewise at the Segment level. For equality or range queries, these Min/Max values can quickly filter out rows that do not need to be read.
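
To make the prefix sparse index concrete, here is a hypothetical table (names and values are illustrative only); only queries whose conditions cover the leading key columns can use it:

-- The key columns (user_id, age, city) form the sorted prefix.
CREATE TABLE user_events (
    user_id BIGINT,
    age INT,
    city VARCHAR(32),
    note TEXT
) ENGINE=OLAP
DUPLICATE KEY(user_id, age, city)
DISTRIBUTED BY HASH(user_id) BUCKETS 10
PROPERTIES ("replication_num" = "1");

-- Can use the prefix sparse index: the condition covers the leading key columns.
SELECT * FROM user_events WHERE user_id = 100 AND age = 30;

-- Cannot use it effectively: the leading key column user_id is absent.
SELECT * FROM user_events WHERE age = 30;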

Secondary indexes must be created manually. They include the Bloom Filter index and the Bitmap index, plus the Inverted index and the NGram Bloom Filter index newly added in version 2.0. We will not go into detail here; you can first learn about them from the official documentation below (illustrative creation statements follow the links), and a series of follow-up articles will cover them.

Official documentation:

  • Inverted index: https://doris.apache.org/zh-CN/docs/dev/data-table/index/inverted-index

  • NGram BloomFilter index: https://doris.apache.org/zh-CN/docs/dev/data-table/index/ngram-bloomfilter-index
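
As a rough illustration of how these are declared (table and index names are placeholders, reusing the illustrative tables above; the authoritative syntax and options are in the documents linked above):

-- Bloom Filter index: declared as a table property on selected columns.
ALTER TABLE user_table SET ("bloom_filter_columns" = "city");

-- Bitmap index on a low-cardinality column.
CREATE INDEX idx_city_bm ON user_table (city) USING BITMAP;

-- Inverted index (new in 2.0), e.g. for text conditions.
CREATE INDEX idx_note_inv ON user_events (note) USING INVERTED PROPERTIES("parser" = "english");

-- NGram Bloom Filter index (new in 2.0), to accelerate LIKE queries.
CREATE INDEX idx_note_ngram ON user_events (note) USING NGRAM_BF PROPERTIES("gram_size" = "3", "bf_size" = "256");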

Let's take the following query as an example:

select * from user_table where id > 10 and id < 1024

Assuming that id is specified as the key when the table is created, the Memtable and the on-disk data are organized in order by id. When a query's filter conditions include the prefix field, the prefix index can be used for fast filtering: the key conditions are split into multiple ranges at the storage layer, and a binary search over the prefix index yields the corresponding row-number ranges. Since the prefix index is sparse, it can only roughly locate the row range; indexes such as ZoneMap, Bloom Filter, and Bitmap then further narrow the set of rows to scan. Through indexing, the number of rows scanned is greatly reduced, CPU and IO pressure drops, and the overall concurrency capability of the system improves substantially.

   Materialized views

A materialized view is a classic space-for-time trade-off. In essence, it pre-computes a predefined SQL analysis statement and persists the result in a table that is transparent to the user but physically stored. In scenarios that query both aggregated and detail data, or that need to match different prefix indexes, queries that hit a materialized view get faster responses and avoid large amounts of on-the-fly computation, improving performance and reducing resource consumption.

-- For aggregations: read the pre-aggregated columns of the materialized view directly
create materialized view store_amt as select store_id, sum(sale_amt) from sales_records group by store_id;
SELECT store_id, sum(sale_amt) FROM sales_records GROUP BY store_id;

-- For this query, k3 matches the materialized view's prefix column, so the view accelerates it
CREATE MATERIALIZED VIEW mv_1 as SELECT k3, k2, k1 FROM tableA ORDER BY k3;
select k1, k2, k3 from tableA where k3=3;

   Runtime Filter

In addition to the indexes described above, Doris adds a dynamic filtering mechanism, the Runtime Filter. In a multi-table join, the right table is usually called the build table and the left table the probe table, and the left table is generally larger than the right. In the implementation, the right table is read first and a hash table is built in memory (the Build phase); then each row of the left table is read and matched against the hash table, returning the rows that satisfy the join condition (the Probe phase). While building the hash table on the right table, the Runtime Filter also generates a filter structure on the join columns, such as Min/Max or IN conditions, and pushes it down to the left table. The left table can then use this structure to filter its data, reducing the amount the Probe side must transmit and compare. In most join scenarios, the Runtime Filter passes through nodes automatically, pushing the filter down to the lowest scan node or through distributed Shuffle Joins.

In most join queries, the Runtime Filter significantly reduces the amount of data read and thereby speeds up the whole query.
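
For instance, in the following hypothetical join, fact_orders is large and dim_stores is small:

-- While building the hash table on dim_stores, Doris generates an IN / Min-Max
-- filter on store_id and pushes it down to the scan of fact_orders, so
-- non-matching rows are skipped before the Probe phase.
SELECT f.order_id, f.amount
FROM fact_orders f
JOIN dim_stores d ON f.store_id = d.store_id
WHERE d.region = 'East';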

   TOPN optimization

Queries for the largest or smallest N rows have a wide range of applications, such as fetching the latest 100 records that match some condition, or the highest or lowest prices of certain commodities; the performance of such queries matters greatly for real-time analytics. Doris introduces TOPN optimization to address the high IO, CPU, and memory consumption these queries incur on big data sets (a sample query is sketched after this list):

  • The Scanner layer first reads only the sort and query fields, uses a heap sort to retain the top N rows, updates the currently known largest or smallest value range in real time, and dynamically pushes it down to the Scanner.

  • Using the pushed-down range condition, the Scanner layer uses indexes to skip whole files and data blocks, greatly reducing the amount of data read.

  • Wide tables usually have many queried fields, but in a TOPN scenario only N rows are ultimately needed. Reading is therefore split into two phases: the first phase locates the row numbers using only the few sort and condition columns and performs the sort; the second phase fetches the full rows by row number for the top N results. This greatly reduces scan overhead.
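
A typical query shape that benefits from this optimization (reusing the illustrative user_table schema from earlier):

-- Fetch the latest 100 matching rows. As the scan proceeds, the currently
-- known 100th-largest create_date is pushed down to the Scanner as a range
-- condition, letting ZoneMap indexes skip whole files and pages; only the
-- final 100 row numbers are used to read the remaining wide columns.
SELECT *
FROM user_table
WHERE city = 'Beijing'
ORDER BY create_date DESC
LIMIT 100;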

Through the above series of optimizations, unnecessary data can be pruned and the amount of data read and sorted reduced, significantly cutting system IO, CPU, and memory consumption. Beyond these, caching mechanisms including SQL Cache and Partition Cache, as well as join optimizations, can further improve concurrency; space does not permit covering them here.

# New features in Apache Doris 2.0

With the optimizations described above, Apache Doris already supports thousands of QPS per node. But in some Data Serving scenarios with ultra-high concurrency requirements (such as tens of thousands of QPS), bottlenecks remain:

  • Columnar storage engines are unfriendly to row-level reads; with a wide-table model, the columnar format greatly amplifies random-read IO;

  • The execution engine and query optimizer of an OLAP database are too heavyweight for simple queries such as point queries; short paths are needed in query planning to handle them;

  • SQL request handling and query-plan parsing and generation are performed by the FE module, which is implemented in Java; parsing and generating large numbers of execution plans under high concurrency causes high CPU overhead;

  • ...

With these problems in mind, Apache Doris carried out a series of optimizations around three design points: reducing SQL memory and IO overhead, improving query execution efficiency, and reducing SQL parsing overhead.

   Row Store Format

Unlike the columnar storage format, a row-based storage format is friendlier to data-service scenarios: data is stored row by row, so fetching an entire row at a time is far more efficient and greatly reduces the number of disk accesses. In Apache Doris 2.0 we therefore introduced a row store format, which encodes each row and stores it in a separate, additional column at the cost of extra storage space. Users can enable it by specifying the following property in the CREATE TABLE statement's PROPERTIES:

"store_row_column" = "true"

We chose JSONB as the encoding format for the row store mainly for the following reasons:

  • Flexible schema change: as data evolves, the table schema may change with it, and it is important that the row store format handle such changes flexibly; for example, dropped fields and modified field types must be synchronized to the row store in time. By using JSONB as the encoding and storing columns as JSONB fields, extending fields and changing attributes becomes very convenient.

  • Higher performance: accessing a row in row-store format can be faster than assembling it from columnar storage, since the data sits in one place, which significantly reduces disk access overhead in high-concurrency scenarios. In addition, individual columns can still be accessed quickly by mapping each column ID to its value inside the JSONB object.

  • Lower storage cost: a compact binary format such as JSONB reduces the total size of data stored on disk, making the row store more cost-effective.

Using JSONB encoding and decoding for the row store helps solve both the performance and the storage problems faced in high-concurrency scenarios. Rows are stored in the storage engine as a hidden column (__DORIS_ROW_STORE_COL__). During Memtable flush, each column is encoded into JSONB and written into this hidden column. When reading, the hidden column is located by its column ID, the specific row by its row number, and each field is then deserialized.

Related PR: https://github.com/apache/doris/pull/15491

   Point query short path optimization (Short-Circuit)

Normally, executing a SQL statement takes three steps: the SQL Parser parses the statement and generates an abstract syntax tree (AST); the Query Optimizer then produces an executable plan; and finally the plan is executed to obtain the results. For complex queries over large data volumes, the plan generated by the query optimizer undoubtedly executes more efficiently, but for point queries with low-latency, high-concurrency requirements, running the entire optimizer pipeline is unsuitable and incurs unnecessary extra overhead.

To solve this problem, we implemented a short-circuit path for point queries that bypasses the query optimizer and PlanFragment stages, simplifying the SQL execution process and retrieving the required data directly through a fast, efficient read path.


When FE receives such a query, the planner generates a lightweight Short-Circuit Plan as the point query's physical plan. The plan requires no equivalence transformation, logical optimization, or physical optimization; it only performs basic analysis on the AST and builds a fixed plan, reducing the optimizer's overhead.

For a simple primary-key point query such as select * from tbl where pk1 = 123 and pk2 = 456, which involves only a single Tablet, a lightweight RPC interface can interact with the StorageEngine directly, avoiding the generation of a complex Fragment Plan and eliminating the scheduling overhead of the MPP query framework. The RPC interface is defined as follows:

message PTabletKeyLookupRequest {
    required int64 tablet_id = 1;       // target tablet, computed from the key condition columns
    repeated KeyTuple key_tuples = 2;   // primary key values in string form, e.g. ['123', '456']
    optional Descriptor desc_tbl = 4;   // descriptor table
    optional ExprList  output_expr = 5; // output expressions
}

message PTabletKeyLookupResponse {
    required PStatus status = 1;
    optional bytes row_batch = 5;       // serialized result row(s)
    optional bool empty_batch = 6;      // true if the key was not found
}

rpc tablet_fetch_data(PTabletKeyLookupRequest) returns (PTabletKeyLookupResponse);

The tablet_id above is computed from the primary-key condition columns, and key_tuples is the string form of the primary key; in the example above, key_tuples would look like ['123', '456']. key_tuples is encoded into the primary-key storage format, the key's row number in the Segment File is identified through the primary-key index, and the delete bitmap is checked to see whether the row is still live: if so, its row number is returned, otherwise NotFound is returned. That row number is then used to perform a direct point lookup on the __DORIS_ROW_STORE_COL__ column, so we only need to locate one row in that column, fetch the raw value in JSONB format, and deserialize it into the values consumed by the subsequent output expressions.

Related PR: https://github.com/apache/doris/pull/15491

   Prepared Statement Optimization (PreparedStatement)

Part of the CPU overhead in high-concurrency queries comes from FE-side analysis and SQL parsing. To address it, we provide prepared statements on the FE side that are fully compatible with the MySQL protocol. When CPU becomes the bottleneck for primary-key point queries, prepared statements are highly effective, yielding a performance improvement of more than 4x.


Prepared statements work by caching the pre-parsed SQL and expression objects in a session-level in-memory HashMap, so subsequent queries reuse the cached objects directly.

Prepared statements use the MySQL binary protocol (https://dev.mysql.com/doc/dev/mysqlserver/latest/page_protocol_binary_resultset.html#sect_protocol_binary_resultset_row) as the transport protocol. The protocol is implemented in the file mysql_row_buffer.[h|cpp] and conforms to standard MySQL binary encoding. With a client such as a JDBC client, the first phase sends a PREPARE MySQL command carrying the precompiled statement to FE, which parses and analyzes the statement and caches it in the HashMap shown in the figure above; in the second phase, the client sends an EXECUTE MySQL command with the placeholders substituted and encoded in binary format. FE deserializes it according to the MySQL protocol, obtains the placeholder values, and generates the corresponding query conditions.


In addition to caching statements in FE, reusable structures also need to be cached in BE, including pre-allocated computation blocks, query descriptors, and output expressions; serializing and deserializing these structures creates CPU hotspots, so caching them matters. Each prepared statement carries a UUID called CacheID. When BE executes a point query, it finds the corresponding reusable class by CacheID and reuses these structures when evaluating expressions.

Here is an example of using PreparedStatement in JDBC:

1. Set the JDBC URL and enable server-side prepared statements

url = jdbc:mysql://127.0.0.1:9030/ycsb?useServerPrepStmts=true

2. Using prepared statements

// use `?` as the placeholder; the PreparedStatement should be reused
PreparedStatement readStatement = conn.prepareStatement("select * from tbl_point_query where key = ?");
...
readStatement.setInt(1, 1234);
ResultSet resultSet = readStatement.executeQuery();
...
readStatement.setInt(1, 1235);
resultSet = readStatement.executeQuery();
...

Related PR: https://github.com/apache/doris/pull/15491

   Row Cache

Doris has a Page-level cache, where each Page stores the data of one particular column; the Page Cache is therefore a per-column cache.


For the row store introduced above, one row contains data from multiple columns, and large queries can evict the pages a point query needs from the cache. To improve the hit rate of row-level reads, a dedicated row cache (Row Cache) is introduced.

The Row Cache reuses Doris's LRU Cache mechanism. A memory threshold is initialized at startup, and stale cache rows are evicted once it is exceeded. For a primary-key point query, hitting versus missing the row cache at the storage layer can mean a performance gap of dozens of times (memory access versus disk IO), so the row cache can greatly improve point-query performance, especially in scenarios with high cache hit rates.


To enable the Row Cache, set the following configuration items in BE:

disable_storage_row_cache=false   // whether to enable the row cache; disabled by default
row_cache_mem_limit=20%           // percentage of memory the row cache may use; 20% by default

Related PR: https://github.com/apache/doris/pull/15491

# Benchmark

Based on the series of optimizations above, Apache Doris's performance in Data Serving scenarios improves further. We ran a benchmark with the standard Yahoo! Cloud Serving Benchmark (YCSB) performance testing tool, with the following environment configuration and data scale:

  • Machine environment: a single cloud server with 16 cores, 64 GB memory, and 4 x 1 TB hard disks

  • Cluster size: 1 FE + 3 BE

  • Data scale: 100 million rows in total, averaging about 1 KB per row; the cache was warmed up before testing.

  • The test table schema and query statement are as follows:

-- Table creation statement:

CREATE TABLE `usertable` (
  `YCSB_KEY` varchar(255) NULL,
  `FIELD0` text NULL,
  `FIELD1` text NULL,
  `FIELD2` text NULL,
  `FIELD3` text NULL,
  `FIELD4` text NULL,
  `FIELD5` text NULL,
  `FIELD6` text NULL,
  `FIELD7` text NULL,
  `FIELD8` text NULL,
  `FIELD9` text NULL
) ENGINE=OLAP
UNIQUE KEY(`YCSB_KEY`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`YCSB_KEY`) BUCKETS 16
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"in_memory" = "false",
"persistent" = "false",
"storage_format" = "V2",
"enable_unique_key_merge_on_write" = "true",
"light_schema_change" = "true",
"store_row_column" = "true",
"disable_auto_compaction" = "false"
);

-- Query statement:

SELECT * from usertable WHERE YCSB_KEY = ?

The results with the optimizations enabled (row store, short-circuit point query, and PreparedStatement all turned on) versus disabled are as follows:

da38bd65026e6196e70c20f8632f17b2.png

With the optimizations enabled, average query latency fell by 96%, the 99th-percentile latency dropped to 1/28 of its previous value, and concurrency rose from 1,400 QPS to 30,000 QPS, an increase of more than 20x: a qualitative leap in overall performance and concurrent load capacity.

# Best Practices

Note that the point-query optimizations implemented at this stage all target the Unique Key primary-key model, and Merge-on-Write and Light Schema Change must be enabled before use. The following is an example CREATE TABLE statement for point-query scenarios:

CREATE TABLE `usertable` (
  `USER_KEY` BIGINT NULL,
  `FIELD0` text NULL,
  `FIELD1` text NULL,
  `FIELD2` text NULL,
  `FIELD3` text NULL
) ENGINE=OLAP
UNIQUE KEY(`USER_KEY`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`USER_KEY`) BUCKETS 16
PROPERTIES (
"enable_unique_key_merge_on_write" = "true",
"light_schema_change" = "true",
"store_row_column" = "true"
);

Note:

  • Enable light_schema_change to support the Column IDs used by the JSONB row-store encoding

  • Enable store_row_column to store the row format

Once the table is created, primary-key queries of the following form can be greatly accelerated by the row store format and short-circuit execution:

select * from usertable where USER_KEY = xxx;

Point-query performance can be further improved with JDBC prepared statements; if memory allows, the Row Cache can also be enabled in the BE configuration file. Usage examples were given above and are not repeated here.

# Summary

By introducing the row store format, short-circuit optimization for point queries, prepared statements, and the row cache, Apache Doris achieves ultra-high concurrency of tens of thousands of QPS per node, a performance leap of dozens of times. As clusters scale out and machine configurations improve, Apache Doris can use the added hardware for computing acceleration, and its MPP architecture scales horizontally and linearly. Apache Doris can therefore genuinely serve both high-throughput OLAP analysis and high-concurrency Data Serving online workloads within a single architecture, greatly simplifying the technical stack for mixed workloads and giving users a unified analytics experience across scenarios.

These capabilities are the fruit of joint efforts by developers in the Apache Doris community and continuous contributions from SelectDB engineers. The release process is in full swing, and version 2.0 will be available soon.


Origin: blog.csdn.net/u013411339/article/details/131345864