From the Volcano model to the Pipeline execution model: the evolution of the Apache Doris execution model

Author: SelectDB technical team

In modern database systems, the execution engine connects the parts of the database architecture; together with the query optimizer and the storage engine, it forms one of the three major modules of a database. Taking the complete execution process of a SQL statement as an example, the role of the execution engine is as follows:

  • After receiving a SQL query, the query optimizer performs lexical and syntactic analysis of the SQL and generates the optimal execution plan based on its cost model and rules;
  • The execution engine schedules the generated execution plan onto the compute nodes, operates on the data in the underlying storage engine according to that plan, and returns the query results.

In the entire query process, query execution is a crucial link. It typically involves reading, filtering, sorting, and aggregating data before results can be returned, and whether these steps are designed reasonably directly affects query performance and resource utilization. These capabilities are provided by the execution model, and different execution models differ greatly in data processing, query optimization, and concurrency control. A suitable execution model is therefore crucial to improving query efficiency and system performance.

Currently, common execution models in the industry include the Iterator Model, the Materialization Model, and the Vectorized/Batch Model. Among them, the Volcano Model (an iterator model) is the most commonly used in database query optimization and execution. Each operation is abstracted as an Operator, and the entire SQL query is constructed into an Operator tree. During execution, next() is called from the top of the tree downward, and data is pulled and processed from the bottom up; this processing style is therefore also called the pull-based execution model. The Volcano Model is widely used in database query optimization and execution because of its high flexibility, good scalability, and ease of implementation and optimization.

As a typical MPP database, Apache Doris also adopted the Volcano Model in past versions. When a user initiates a SQL query, Apache Doris parses the query into a distributed execution plan and distributes it to the execution nodes. The single execution task distributed to a node is called an Instance. Here we use a simple SQL query to walk through the execution process of an Instance under the Volcano Model:

select age, sex from employees where age > 30

Iterative model-volcano model.png

As shown in the figure above, an Instance is a tree of operators (ExecNode), and operators are connected through data redistribution (Exchange) operators to realize the transmission and processing of the data flow. Each operator implements a next() method. When an operator's next() is called, it calls next() on its child operator to obtain input data, then processes the data and outputs it. Because next() is a synchronous method, it blocks while no data is available. The next() method of the root operator must therefore be called in a loop until all data is processed, at which point the calculation result of the entire Instance is obtained.
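To make the pull-based execution above concrete, the following is a minimal, self-contained C++ sketch of the Volcano Model for the example query. The operator classes and row layout here are invented for illustration; they are not Apache Doris's actual ExecNode interfaces.

```cpp
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Illustrative sketch of the Volcano (iterator) model: every operator
// exposes next(), and rows are pulled from the root downward one at a time.
struct Row { int age; std::string sex; };

struct Operator {
    virtual ~Operator() = default;
    // Returns the next row, or std::nullopt when the input is exhausted.
    virtual std::optional<Row> next() = 0;
};

// Leaf operator: scans an in-memory table.
struct ScanOperator : Operator {
    std::vector<Row> table;
    size_t pos = 0;
    std::optional<Row> next() override {
        if (pos >= table.size()) return std::nullopt;
        return table[pos++];
    }
};

// Filter operator: pulls from its child until a row passes the predicate.
struct FilterOperator : Operator {
    std::unique_ptr<Operator> child;
    std::optional<Row> next() override {
        while (auto row = child->next()) {
            if (row->age > 30) return row;   // WHERE age > 30
        }
        return std::nullopt;
    }
};

int main() {
    auto scan = std::make_unique<ScanOperator>();
    scan->table = {{25, "F"}, {42, "M"}, {37, "F"}};
    FilterOperator filter;
    filter.child = std::move(scan);
    // The driver loops on the root's next() until all data is consumed.
    while (auto row = filter.next()) { /* emit row->age, row->sex */ }
}
```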

As can be seen from the above execution process, the Volcano Model is a simple, easy-to-use, and highly flexible execution model. However, in single-machine multi-core scenarios it has some problems that need to be further solved and optimized, reflected in the following aspects:

  • Blocking thread execution: With a fixed-size thread pool, an Instance occupies a thread and blocks while executing. If a large number of Instances make requests at the same time, the execution thread pool fills up, causing the query engine to appear hung and unable to respond to subsequent requests. When Instances depend on each other, logical deadlocks may even occur: an Instance running in the current thread may depend on other Instances that are still sitting in the waiting queue and cannot be executed, which further increases system load and pressure. In addition, when the number of Instance threads running concurrently on an execution node far exceeds the number of CPU cores, scheduling between Instances falls to the operating system, which may incur context-switch overhead; in co-located deployments this thread-switching overhead becomes even more significant.
  • CPU resource preemption: Instance threads compete for CPU resources, so queries of different sizes and from different tenants may interfere with each other.
  • Inability to fully utilize multi-core computing power: The parallelism of the execution plan depends on the data distribution. When there are N data buckets on an execution node, the number of Instances running on that node cannot exceed N, so the bucket configuration is particularly important. Too few buckets make it difficult to exploit multi-core computing power, while too many cause fragmentation problems. In most scenarios, the degree of parallelism must be set manually during performance tuning, and in a production environment, estimating a reasonable number of data buckets is very challenging. Unreasonable bucketing prevents the performance advantages of Doris, and the multi-core capabilities of the machine, from being fully exploited.

Introduction of Pipeline execution model

In order to solve these problems of past versions, Apache Doris introduced the Pipeline execution model in version 2.0 to replace the Volcano Model, and further upgraded the Pipeline execution model in version 2.1.


Taking a Join scenario as an example, the following figure shows how two Instances form a query plan under the Pipeline execution model.

Introduction of Pipeline execution model.png

In this plan, the Probe operation of the Join depends on the build operation (Build) of the hash table: the Probe cannot start until all of the data obtained from the Exchange has been processed and the hash table has been built. This dependency causes each Instance to be split into two Pipeline Tasks. The Pipeline scheduler places Pipeline Tasks in the Ready queue of the worker thread pool, and worker threads pick up Pipeline Tasks according to different strategies. Whether a Pipeline Task yields the thread after processing a data block depends on whether its upstream data is ready and whether its running time has exceeded the time-slice upper limit.
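The split can be pictured with the hypothetical sketch below (the names HashJoinSharedState and run_one_block, and the time-slice logic, are assumptions for illustration, not Doris's real interfaces): the probe task's blocking condition is simply whether the build side has finished.

```cpp
#include <atomic>
#include <chrono>
#include <functional>

// Illustrative only: an Instance is split into two Pipeline Tasks at the
// pipeline breaker. The probe task stays blocked until the build task has
// finished building the hash table.
struct HashJoinSharedState {
    std::atomic<bool> hash_table_ready{false};
};

struct PipelineTask {
    std::function<bool()> can_run;       // blocking condition
    std::function<bool()> run_one_block; // process one block; true = done
};

// Worker loop: run a task block by block, yielding the thread when the
// time slice is used up or the task's input is no longer ready.
void run_with_time_slice(PipelineTask& task, std::chrono::milliseconds slice) {
    auto start = std::chrono::steady_clock::now();
    while (task.can_run()) {
        if (task.run_one_block()) return;                    // task finished
        if (std::chrono::steady_clock::now() - start >= slice) return;
    }
    // Not ready: the scheduler re-queues the task (omitted here).
}

int main() {
    HashJoinSharedState join_state;
    PipelineTask build{[] { return true; },
                       [&] { join_state.hash_table_ready = true; return true; }};
    PipelineTask probe{[&] { return join_state.hash_table_ready.load(); },
                       [] { return true; }};
    run_with_time_slice(probe, std::chrono::milliseconds(10)); // blocked: no-op
    run_with_time_slice(build, std::chrono::milliseconds(10)); // builds table
    run_with_time_slice(probe, std::chrono::milliseconds(10)); // now runs
}
```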

Design and implementation of Pipeline execution model

The Pipeline execution model decomposes the execution plan into Pipeline Tasks at blocking operations and schedules Pipeline Tasks onto the thread pool in a time-sharing manner, making blocking operations asynchronous and solving the problem of an Instance occupying a single thread for a long time. At the same time, different scheduling strategies can be used to allocate CPU resources between large and small queries and between tenants, managing system resources more flexibly. The Pipeline execution model also applies data pooling to pool the data within individual data buckets, lifting the limit that the bucket count imposes on the number of Instances. This improves Apache Doris's ability to utilize multi-core systems, avoids frequent thread creation and destruction, and improves the concurrency performance and stability of the system.

01 Deblocking transformation

As can be seen from the introduction above, under the Volcano Model of previous versions the execution engine contained blocking operations, which caused two core problems: first, too many blocked threads would fill the thread pool, leaving it unable to respond to subsequent queries; second, thread scheduling relied entirely on the operating system and could not take query priority into account, leaving performance on the table. To solve these two problems, we redesigned the execution logic to remove blocking.

For the first problem, we fixed the execution thread pool to the same size as the number of CPU cores and ensured that no blocking operations run on the execution threads. To avoid thread blocking that would lead to operating-system-level thread scheduling, we split Pipeline Tasks at all blocking operators, for example using independent threads for operations such as disk I/O and RPC.

For the second problem, we designed a purely user-mode polling scheduler. By continuously polling the status of all executable Pipeline Tasks, it hands the tasks that currently need to run to the execution threads. This approach avoids the overhead of frequent thread switching in the operating system and also allows priorities and other customized scheduling strategies to be added, improving the flexibility and scalability of the system.
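A minimal sketch of such a user-mode polling scheduler is shown below. It is illustrative only and simplifies heavily (the real Doris scheduler differs in structure and detail); it assumes each task exposes a non-blocking can_run() check.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>

// Simplified sketch: one poller thread scans blocked tasks; worker threads
// (one per CPU core) only ever run ready tasks, so they never block inside
// an operator.
struct Task { bool (*can_run)(); void (*run)(); };

struct Scheduler {
    std::mutex mu;
    std::condition_variable cv;
    std::deque<Task*> blocked, ready;

    void poll_once() {                       // called in a loop by the poller
        std::lock_guard<std::mutex> lk(mu);
        for (auto it = blocked.begin(); it != blocked.end();) {
            if ((*it)->can_run()) {
                ready.push_back(*it);
                it = blocked.erase(it);
                cv.notify_one();
            } else {
                ++it;
            }
        }
    }

    void worker_loop() {                     // one per CPU core
        for (;;) {
            std::unique_lock<std::mutex> lk(mu);
            cv.wait(lk, [&] { return !ready.empty(); });
            Task* t = ready.front();
            ready.pop_front();
            lk.unlock();
            t->run();                        // never blocks; may re-enqueue
        }
    }
};

int main() {
    static bool done = false;
    Scheduler s;
    Task t{[] { return true; }, [] { done = true; }};
    s.blocked.push_back(&t);
    s.poll_once();               // t.can_run() is true -> moved to ready
    s.ready.front()->run();      // a worker thread would do this in a loop
}
```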

Pipeline implementation-deblocking transformation.png

02 Parallelization transformation

In versions before 2.0, the concurrency of the Apache Doris execution engine had to be set manually by the user (i.e., the session variable parallel_fragment_exec_instance_num) and could not be adjusted dynamically for different workloads. Setting a reasonable concurrency often required detailed analysis, which placed an undue burden on users, while an unreasonable setting could cause performance problems. How to make full use of machine resources and automatically choose the concurrency of each query task therefore became an urgent problem.

The common pipeline concurrency solutions today are represented by Presto and DuckDB. Presto shuffles data into a reasonable number of partitions during execution, which has the advantage of requiring essentially no special concurrency control. DuckDB does not introduce an additional Shuffle operation during execution, but must introduce an extra synchronization mechanism instead. After comprehensively comparing these solutions, we concluded that the DuckDB approach can hardly avoid using locks, and locks run counter to our goal of removing blocking, so we chose the implementation approach represented by Presto.

To achieve pipeline concurrency, Presto introduced Local Exchange to re-partition data. For example, for a Hash Aggregation, Presto further divides the data into N partitions by the aggregation key, so that the machine's N cores can be fully utilized and each execution thread only needs to build a smaller hash table. For Apache Doris, we chose to make full use of the MPP architecture itself and partition the data into a reasonable number of partitions directly during the Shuffle, so no additional Local Exchange needs to be introduced.
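The following small C++ sketch shows why partitioning by the aggregation key enables independent parallel aggregation; the function name and row representation are invented for the example. Rows with equal keys always land in the same partition, so each partition can be aggregated by its own thread with its own smaller hash table.

```cpp
#include <functional>
#include <string>
#include <vector>

// Illustrative sketch: hash-partition rows by aggregation key so that N
// execution threads each build a smaller, independent hash table.
std::vector<std::vector<std::string>> partition_by_key(
        const std::vector<std::string>& keys, size_t n_partitions) {
    std::vector<std::vector<std::string>> parts(n_partitions);
    std::hash<std::string> h;
    for (const auto& k : keys) {
        parts[h(k) % n_partitions].push_back(k); // same key -> same partition
    }
    return parts;   // each partition can be aggregated by its own thread
}

int main() {
    auto parts = partition_by_key({"a", "b", "a", "c", "b"}, 4);
    // parts[i] now holds all rows whose key hashes to partition i.
}
```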

Pipeline implementation-parallel transformation.png

Based on this, we need improvements in two areas: increasing concurrency during the Shuffle, and enabling concurrent execution after data is read in the Scan layer. For the former, FE only needs to be aware of the BE environment and set a reasonable number of partitions. For the latter, Doris's execution threads in the Scan layer were strongly bound to the number of storage tablets, so the concurrency logic of the Scan layer had to be refactored to meet our needs.

The basic idea of Scan pooling is to pool the data read by the Scanner threads so that multiple Pipeline Tasks can fetch data directly from the pool for execution. This fully decouples the Scanners from the execution threads, improving the concurrency performance and stability of the system.
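A simplified sketch of such a shared pool is shown below. It is an illustration under assumptions, not Doris code: a real Pipeline Task would yield the thread rather than block on a condition variable, but the decoupling between Scanner threads (producers) and Pipeline Tasks (consumers) is the same idea.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

using Block = std::vector<int>;  // stand-in for a column-data block

// Scanner threads push blocks into one shared pool; any number of Pipeline
// Tasks pull from it, decoupling execution concurrency from tablet count.
class BlockPool {
    std::mutex mu;
    std::condition_variable cv;
    std::deque<Block> blocks;
    bool scan_finished = false;

public:
    void push(Block b) {                       // called by scanner threads
        { std::lock_guard<std::mutex> lk(mu); blocks.push_back(std::move(b)); }
        cv.notify_one();
    }
    void finish() {                            // all scanners are done
        { std::lock_guard<std::mutex> lk(mu); scan_finished = true; }
        cv.notify_all();
    }
    std::optional<Block> pop() {               // called by Pipeline Tasks
        std::unique_lock<std::mutex> lk(mu);
        cv.wait(lk, [&] { return !blocks.empty() || scan_finished; });
        if (blocks.empty()) return std::nullopt;  // scan done, pool drained
        Block b = std::move(blocks.front());
        blocks.pop_front();
        return b;
    }
};

int main() {
    BlockPool pool;
    pool.push({1, 2, 3});   // a scanner thread would call this
    pool.finish();
    while (auto b = pool.pop()) { /* a Pipeline Task consumes block b */ }
}
```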

Pipeline implementation-parallel transformation-2.png

Further improvements to the Pipeline execution model

The introduction of the Pipeline execution model significantly improved the query performance and stability of Apache Doris under mixed workloads. However, it was still an experimental feature in Apache Doris 2.0, and as community users adopted it, some new problems began to emerge:

  • Limited execution concurrency: Execution concurrency was still constrained by the static concurrency parameter set by FE and by the number of tablets in the storage layer, so the execution engine could not fully utilize the machine's multi-core resources. At the same time, data skew in the storage layer could cause long-tail queries during execution.
  • High execution overhead: Expressions were independent across Instances, yet the initialization parameters of Instances shared a large amount of common state, so every execution had to repeat redundant initialization steps, significantly increasing execution overhead.
  • High scheduling overhead: During query execution, the scheduler put all blocked tasks into a blocking queue, and a dedicated thread polled that queue, moving executable tasks into the runnable queue. As a result, one CPU core was pinned as scheduling overhead whenever queries were running, and on small machines in particular, the cost of this fixed polling thread was very noticeable.
  • Poor Profile readability: The Pipeline Profile metrics lacked intuitiveness and readability, making performance analysis difficult.

To provide higher query performance and a more stable query experience, Apache Doris has greatly optimized the Pipeline execution model in the latest version 2.1, transforming it into an event-driven execution model and addressing the problems above. For ease of understanding, the improved Pipeline execution model is referred to as PipelineX in the following text.

01 Execution concurrency transformation

As mentioned earlier, Pipeline execution concurrency is restricted by two factors: the static concurrency parameter set by FE and the number of tablets in the storage layer, which prevents the execution engine from fully utilizing machine resources. In addition, skew in the data itself may lead to long-tail problems during query execution. We use a simple aggregation query to explain this in detail.

Assume there is a Table A with exactly 1 tablet and 100M rows of data, on which we execute an aggregation query:

 SELECT COUNT(*) FROM A GROUP BY A.COL_1;

Generally speaking, during the complete execution of a SQL query, the query is divided into multiple query fragments (Fragments). Each Fragment is a logical unit of the query execution process and may contain multiple SQL operators. After BE receives a Fragment issued by FE, it starts multiple execution threads to execute the Fragment in parallel so that each Fragment can be processed efficiently. As shown in the figure below, Doris splits this query into 2 Fragments and executes them separately:

New Pipeline-Execution Concurrency Transformation.png
For ease of understanding, only the first part of the logical plan (Fragment 0) is discussed. Since Table A has only one tablet, the execution concurrency of Fragment 0 is always limited to 1, i.e., a single thread aggregates all 100M rows. Ideally, a 16-core machine could run this Fragment with a concurrency of 8: if single-threaded execution takes time x, then with 8 threads each reading 100M/8 rows, execution would take roughly x/8. In this example, then, there is roughly an 8x performance loss.

To solve this problem, Apache Doris 2.1 introduces a Local Shuffle node in the execution engine, which removes the restriction that the number of tablets in the storage layer places on execution concurrency. The specific implementation:

  • Execution threads execute their respective Pipeline Tasks, and each Pipeline Task holds only its own runtime state (the Local State); global information is held by a single Pipeline object shared by multiple Tasks (the Global State).
  • On a single BE, data distribution is handled by the Local Shuffle node, which ensures that data stays balanced across the multiple Pipeline Tasks (see the sketch below).
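Below is a minimal, hypothetical illustration of this split (all field names invented): one shared Global State per Pipeline, one lightweight Local State per Pipeline Task, and a round-robin Local Shuffle keeping the tasks balanced.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

struct GlobalState {         // shared, initialized once per Pipeline:
    int expr_context = 0;    // e.g. expression contexts, descriptors
};

struct LocalState {          // per Pipeline Task:
    long rows_processed = 0; // e.g. counters and small buffers
};

struct LocalShuffle {
    std::atomic<size_t> next{0};
    size_t n_tasks;
    explicit LocalShuffle(size_t n) : n_tasks(n) {}
    size_t pick_task() { return next++ % n_tasks; } // round-robin balance
};

int main() {
    GlobalState global;                // one per Pipeline, shared by tasks
    std::vector<LocalState> locals(8); // one per Pipeline Task
    LocalShuffle shuffle(locals.size());
    for (long block = 0; block < 100; ++block) {
        // Each incoming block goes to whichever task is next in line.
        locals[shuffle.pick_task()].rows_processed += 4096;
    }
}
```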

New Pipeline-Execution Concurrency Transformation-2.png

The example above illustrates how the PipelineX execution engine removes the tablet-count limitation. In addition, Local Shuffle can also avoid long-tail queries caused by data skew. Still using the aggregation query above, suppose Table A now has 2 tablets, where Tablet 1 has 10M rows of data and Tablet 2 has 90M rows:

  • Pipeline engine: Before the transformation (left in the figure below), when Fragment 1 is executed, the execution time of Thread 2 is about 9 times that of Thread 1.
  • PipelineX engine: After the transformation (right in the figure below), Local Shuffle evenly distributes the 100M rows across the 2 execution threads, so they are no longer affected by storage-layer data skew and their execution times are the same.

New Pipeline-solve the problem of data skew.png

02 Execution process transformation

As mentioned above, expressions were independent across Instances even though their initialization parameters shared a large amount of common state, so each execution incurred redundant initialization. To reduce this unnecessary overhead, PipelineX reuses shared state and splits step 3 of the Pipeline execution process into steps 3 and 5 of the PipelineX execution process. This way the heavier Global State is initialized only once, while the lighter Local States are initialized serially.

New Pipeline-Execution Process Transformation.png

03 Scheduling model transformation

During Pipeline scheduling, ready tasks are stored in the ready queue waiting to be scheduled, and blocked tasks are stored in the blocking queue waiting for their execution conditions to be met; an extra CPU core is needed to poll the blocking queue and move tasks whose conditions are met into the ready queue. PipelineX instead encapsulates blocking conditions in Dependency objects, and the blocked/ready state of a Task is driven entirely by event notifications. For example, when RPC data arrives, the ExchangeSourceOperator's dependency is satisfied and the task enters the ready queue.

New Pipeline-scheduling model transformation.png

PipelineX's core change to execution scheduling is the introduction of event-driven scheduling. A query is divided into multiple Pipelines, which form a directed acyclic graph (DAG) with Pipelines as nodes and the dependencies between upstream and downstream Pipelines as edges. All edges are abstracted as Dependencies, and whether a Pipeline can execute depends on whether all of its Dependencies satisfy their execution conditions. Continuing with the simple aggregation query as an example, the query is divided into the following DAG:

New Pipeline-Scheduling Model Transformation-2.png

For simplicity, the figure only shows the Dependencies between upstream and downstream Pipelines. In fact, all blocking conditions of a Pipeline are abstracted as Dependencies; for example, a Scan Node can only execute after the Scanner has read data, and this too is abstracted as a Dependency that conditions whether Pipeline 0 can execute.

For each Pipeline, the execution flow chart is as follows:

New Pipeline-Execution Flowchart.png

After the event-driven PipelineX transformation, each Pipeline Task checks whether all of its execution conditions are met before executing. When all Dependencies are satisfied, the Pipeline executes; when some condition is not met, the Task is added to the blocking queue of the corresponding Dependency. When an external event arrives, all Tasks blocked on it re-evaluate their execution conditions and, if satisfied, enter the execution queue.
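This mechanism can be sketched as follows (a simplified, hypothetical Dependency; Doris's real class is presumably richer): a task that finds a Dependency not ready parks itself on that Dependency, and the event that satisfies it moves exactly those parked tasks to the run queue, with no polling thread involved.

```cpp
#include <deque>
#include <mutex>

struct PipelineTask {};  // payload omitted; only pointers are scheduled here

struct Dependency {
    std::mutex mu;
    bool ready = false;
    std::deque<PipelineTask*> blocked_tasks;

    bool try_block(PipelineTask* t) {      // true = task must wait
        std::lock_guard<std::mutex> lk(mu);
        if (ready) return false;
        blocked_tasks.push_back(t);
        return true;
    }
    // Called by the event source (e.g. when RPC data arrives).
    void set_ready(std::deque<PipelineTask*>& run_queue) {
        std::lock_guard<std::mutex> lk(mu);
        ready = true;
        while (!blocked_tasks.empty()) {   // wake every parked task
            run_queue.push_back(blocked_tasks.front());
            blocked_tasks.pop_front();
        }
    }
};

int main() {
    PipelineTask t;
    Dependency dep;
    std::deque<PipelineTask*> run_queue;
    if (dep.try_block(&t)) {       // dependency not ready: task parks
        dep.set_ready(run_queue);  // event arrives -> task wakes
    }
    // run_queue now holds &t; an execution thread picks it up from here.
}
```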

Based on the above transformation, PipelineX eliminates the extra overhead of the polling thread, in particular the performance loss of a polling thread scanning all Pipeline Tasks when the cluster load is high. At the same time, thanks to the Dependency encapsulation, Doris's PipelineX engine has a more flexible scheduling framework, which will make it easier to implement Spill later.

04 Profile transformation

The PipelineX engine reorganized the Operator Profile, removing unreasonable metrics and adding necessary ones. In addition, thanks to the scheduling model transformation, all blocking is encapsulated by Dependencies, so we add each Dependency's wait time to the Profile as WaitForDependency, allowing the time cost of each stage to be grasped intuitively. Take the Scan Operator and the Exchange Source Operator in a Profile as examples:

  • Scan Operator: the total execution time of OLAP_SCAN_OPERATOR is 457.750ms (including the Scanner's data reading and execution time), of which 436.883ms was spent blocked waiting for the Scanner to scan data.
OLAP_SCAN_OPERATOR  (id=4.  table  name  =  Z03_DI_MID):
    -  ExecTime:  457.750ms
    -  WaitForDependency[OLAP_SCAN_OPERATOR_DEPENDENCY]Time:  436.883ms

  • Exchange Source Operator: the execution time of EXCHANGE_OPERATOR is 86.691us, and the time spent waiting for upstream data is 409.256us.
EXCHANGE_OPERATOR  (id=3):
    -  ExecTime:  86.691us
    -  WaitForDependencyTime:  0ns
        -  WaitForData0:  409.256us

Summary and Outlook

After completing the transformation to the Pipeline execution model, Apache Doris has thoroughly solved the problems of cluster hangs and resource preemption under high load, and CPU utilization has been greatly improved. The iteration to the PipelineX execution engine further optimized the concurrent execution and scheduling models, bringing significant gains to the Apache Doris execution engine and helping users further improve execution efficiency in real production environments.

Currently, we are combining data spilling (writing intermediate data to disk), a technique widely used in big data scenarios, with the PipelineX engine to further improve query performance and reliability. In the future, we plan to implement more automatic runtime optimizations in PipelineX, such as adaptive concurrency and adaptive plan tuning, to further improve execution efficiency and performance. We will also dig into NUMA (non-uniform memory access) locality to make fuller use of hardware resources and provide better query performance.
