Packed with substance: a look at the PolarDB-X parallel computing framework

Introduction: This article gives a detailed introduction to the entire distributed parallel execution framework of PolarDB-X. After reading it, you should have a comprehensive picture of our executor.

Author: Xuandiqifeng

 

 

The article PolarDB-X Hybrid Executor for HTAP explains in detail the original intent behind the PolarDB-X executor design: to inject parallel computing capabilities into PolarDB-X, serve both TP and AP scenarios, and gradually build a database with TB-level data processing capabilities. To achieve this, we studied a wide range of database and big-data products, including analytical databases and real-time data warehouses, and drew on their strengths to build a new parallel execution engine. This article introduces the entire distributed parallel execution framework in detail; after reading it, you should have a comprehensive picture of our executor.

 

 

▶ Overall design

 

PolarDB-X is a shared-nothing database that adopts an architecture separating compute from storage. Data is stored as shards on the DN (data) nodes, while the compute nodes are called CNs. During computation, communication between DN and CN, between DN and DN, and between CN and CN goes over Gigabit or 10-Gigabit networks. Each CN node hosts a scheduling component, a resource management component, an RPC component, and so on. A complex query is scheduled to run across multiple CN nodes, and since data is generally distributed across the DNs according to a uniform strategy, each CN node will access multiple DNs at the same time.

 

[Figure 1]

 

When a user submits a complex SQL statement, it often needs to access multiple DN nodes, which triggers the parallel scheduling strategy. The overall execution flow is easy to follow:

 

  1. The CN that the user connects to takes on the role of Query Coordinator;
  2. The query first reaches the Query Coordinator, where the optimizer generates the latest plan, which is then split into multiple sub-plans (Fragments). Each Fragment may contain multiple operators. A Fragment responsible for scanning a DN must include a Scan operator to pull data from that DN; a Fragment may also contain other operators such as Agg or Join;
  3. The Task Scheduler inside the Query Coordinator encapsulates the Fragments into Tasks according to the defined scheduling logic and dispatches them to the appropriate CNs for execution; some resource calculation may be involved here;
  4. After a CN receives a Task, it applies for execution resources, builds the execution context, starts the Task, and periodically reports its status to the Query Coordinator;
  5. Tasks exchange data with each other through the data transmission layer (DTL). When all Tasks have finished, the results are returned to the Query Coordinator, which is responsible for returning them to the user;
  6. After the results have been returned to the user, the Query Coordinator and the Tasks scheduled on each CN node perform cleanup and release their computing resources.

 

The overall flow is roughly as described above. Attentive readers will notice that our framework has a concept called a Split on the DN side. Data is stored on the DNs as shards, and a Split records the address of a data shard (partition). For a Task that contains a Scan operator, we compute which partitions it needs to access and which DNs those partitions live on, encapsulate them as Splits, and divide them proportionally among the scan Tasks. However, at run time each scan Task is not given all of its Splits up front: only part of them is pre-assigned, and whichever Task scans faster keeps fetching the remaining Splits from the Query Coordinator. This avoids, as far as possible, the long-tail effect caused by unbalanced resources across the scan Tasks. But if a table has only 2 shards, does that mean there can be at most 2 scan Tasks, with little parallel speed-up? To handle this, we also support splitting further by segment; in that case a Split records not only the address of the shard but also its offset within the shard. Even when the number of data shards is limited, we can then still launch more scan Tasks during execution to accelerate the scan in parallel.
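To make the batched pulling concrete, here is a minimal sketch of the idea; the class and method names are illustrative assumptions, not the actual PolarDB-X API. A coordinator-side queue hands out small batches, so a faster scan task simply comes back more often and ends up consuming more splits.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Illustrative sketch only: the coordinator keeps the remaining splits in a queue,
// and each scan task pulls a small batch whenever its local batch runs out.
class SplitQueue {
    private final Queue<String> pendingSplits = new ArrayDeque<>();

    SplitQueue(List<String> allSplits) {
        pendingSplits.addAll(allSplits);
    }

    // Called by a scan task when its pre-assigned batch is exhausted.
    synchronized List<String> pollBatch(int batchSize) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < batchSize && !pendingSplits.isEmpty()) {
            batch.add(pendingSplits.poll());
        }
        return batch; // an empty list means the scan task has no more work
    }
}
```

A scan task would loop: pull a batch, scan those partitions on the DN, and come back for more; a slower task simply returns less often and therefore receives fewer splits.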

 

[Figure 2]

 

 

▶ Execution plan

 

The execution engine runs the distributed execution plan produced by the optimizer. The plan is made up of operators. Because PolarDB-X stores data on the DN nodes as shards, plan execution tries to respect data locality as much as possible: the parts of the plan that can be pushed down are executed on the DNs, and the parts that cannot are split into Fragments and executed on the CN nodes. So the question here is: how is a plan produced by the optimizer split into a distributed plan and placed onto the CNs for execution?


To better understand this process, let's take a simple SQL statement as an example: select * from (select userid, count(*) as b from user_data group by userid) as T where T.b > 10. The optimizer generates a relatively optimal plan like this:

 

[Figure 3]

 

For a parallel execution plan, to execute efficiently and minimize data transfer, the plan is cut into different Fragments according to whether the computation requires data redistribution (ReDistribution), and the Fragments are distributed to the corresponding nodes for execution; some operations are also pushed down to reduce the data produced by the scan. In this example, the count(*) ... group by userid requires rows with the same userid to end up on the same node, so a redistribution by userid forms a natural Fragment boundary, and the plan above may become an execution plan composed of several sub-fragments like the following.

 

[Figure 4]

 

Different Fragments exchange data through Network Write/Read operators. More complex queries, such as multi-table joins, have more Fragments and more complex data exchange patterns. The concurrency of each Fragment can differ and is derived from cost. The basic unit of multi-machine scheduling is the Stage. A Stage records the location information of its upstream and downstream Fragments so that network channels (DTL) can be established between them. After a Fragment is scheduled onto the CN nodes, it is encapsulated into logical execution Tasks; for example, if the concurrency of Fragment-1 is 2, Task-1.0 and Task-1.1 are scheduled onto two CN nodes respectively.

 

[Figure 5]

 

A Task is still a logical unit of computation on a CN node. The PolarDB-X executor supports not only single-machine parallelism (Parallel Query) but also multi-machine parallelism (MPP), so a second level of scheduling is introduced inside the CN node. (The benefits of this second level go beyond this; we will come back to them later.) Within a Task, we continue to split the plan into different Pipelines based on the data-exchange characteristics between operators.

 

[Figure 6]

 

Different Pipelines can also have different concurrency. Each Pipeline computes its own degree of concurrency according to the size of the data it processes and generates concrete execution units called Drivers. Drivers establish their upstream and downstream local channels (Local Channels) according to the second-level scheduling.

 

[Figure 7]

 

At this point you should be able to follow the whole process from a logical execution plan to distributed physical execution. Several new terms have been introduced; here is a unified review:

 

  • Fragment: a sub-plan obtained by cutting the logical execution plan according to whether data needs to be redistributed during the computation.
  • Stage: the scheduling unit that wraps a Fragment. Besides the Fragment itself, a Stage records the scheduling location information of its upstream and downstream Stages.
  • Task: Stages do not run directly on the CNs; based on their concurrency they are decomposed into a series of Tasks that can be scheduled onto the CNs. Tasks are still logical execution units.
  • Pipeline: within a CN, a Task is further divided into different Pipelines according to the second-level concurrency.
  • Driver: a Pipeline contains multiple Drivers. A Driver is a concrete execution unit, a chain of runnable operators.

 

Generally speaking, a complex query contains multiple Fragments; each Fragment corresponds one-to-one with a Stage; each Stage contains multiple Tasks; each Task is divided into different Pipelines; and a Pipeline contains multiple Drivers. Only with the Fragment/Stage/Task/Pipeline/Driver concepts above in mind will the design described next be clear.
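To make these containment relationships concrete, here is a minimal data-structure sketch; the record names and fields are illustrative assumptions, not the real classes:

```java
import java.util.List;

// Illustrative containment only: one query holds several fragments; each fragment
// corresponds to one stage; a stage fans out into tasks across CN nodes; a task is
// cut into pipelines; a pipeline runs as several drivers.
record Driver(int driverId, List<String> operators) {}
record Pipeline(int pipelineId, List<Driver> drivers) {}
record Task(String taskId, String cnNode, List<Pipeline> pipelines) {}
record Stage(int stageId, String fragment, List<Task> tasks) {}
record Query(String queryId, List<Stage> stages) {}
```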

 

 

▶ Scheduling strategy

 

Parallel computing has to solve the task scheduling problem at the very start of execution. Scheduling, plainly understood, means dispatching the divided Tasks onto the CN nodes for execution so as to make full use of the computing resources of every CN. A few questions naturally arise:

 

1. During execution the computing resources of the CN nodes are unbalanced; in multi-machine scheduling, how are the Tasks spread across different CN nodes?

2. How do the Tasks that interact with the DNs pull data in parallel? For example, a logical table is split into 16 physical tables distributed over 4 DN nodes, and 4 Drivers pull data in parallel. Each Driver does not pull exactly 4 physical tables; instead, the number of physical tables it pulls is determined by its own consumption capacity. And could the scan requests issued by multiple Drivers all land on the same DN node at the same time, turning that DN into a bottleneck?

3. We could simply schedule multiple Tasks of the same query onto one CN node, which already achieves single-machine parallelism. Why do we need two levels of scheduling?

 

First-level scheduling (across nodes)

 

To address questions (1) and (2), we introduced a scheduling module (the Task Scheduler) inside the CN node, mainly responsible for scheduling Tasks across CN nodes; we call this the first-level scheduling. In first-level scheduling, Tasks belonging to the same Stage are scheduled onto different CN nodes, guaranteeing that a CN node runs at most one Task of the same Stage. During scheduling, the Task state machine is continuously maintained through heartbeats, as is the load information of every CN node in the cluster; the whole scheduling is based on CN load. The multi-machine scheduling process is as follows:

 

 

 

 

The Resource Manager (RM) is the resource management module on the CN node. Through the Task heartbeat mechanism, RM maintains the load of every CN node in the cluster in real time. The Task Scheduler selects appropriate CN nodes based on that load; for example, if the load of CN-1 is much higher than that of the other CN nodes in the cluster, the Tasks of the current query are distributed to other CN nodes to avoid CN-1. When the executor runs a Task, the mapping between the Task and the DN Splits it consumes is not fixed at Task creation time: each Task dynamically pulls Splits in batches for consumption, so, plainly put, whoever has the stronger consumption capacity is likely to consume more Splits. Similarly, to keep multiple Tasks from hammering the Splits of the same DN at the same time, at the start of scheduling we interleave the Splits of the different DNs into the global Split queue in a zig-zag fashion according to their address information. The pressure on each DN is thus spread as evenly as possible during consumption, and the resources of every DN are fully used during the computation.
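The zig-zag interleaving can be sketched as follows; the grouping by DN address and the types used here are assumptions for illustration only. Consecutive entries in the resulting queue target different DNs, so concurrently consuming tasks spread their pressure across the DNs:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ZigZagSplits {
    // Interleave splits round-robin across DNs so that consecutive splits in the
    // global queue target different DN nodes.
    static List<String> zigZag(Map<String, List<String>> splitsByDn) {
        List<Iterator<String>> iterators = new ArrayList<>();
        for (List<String> splits : splitsByDn.values()) {
            iterators.add(splits.iterator());
        }
        List<String> ordered = new ArrayList<>();
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Iterator<String> it : iterators) {
                if (it.hasNext()) {
                    ordered.add(it.next());
                    progress = true;
                }
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        Map<String, List<String>> byDn = new LinkedHashMap<>();
        byDn.put("dn-1", List.of("t_00", "t_01", "t_02"));
        byDn.put("dn-2", List.of("t_03", "t_04", "t_05"));
        // Prints [t_00, t_03, t_01, t_04, t_02, t_05] -- alternating DNs.
        System.out.println(zigZag(byDn));
    }
}
```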

With first-level scheduling alone we could also schedule multiple Tasks of the same Stage onto the same CN, which would in effect give us single-machine parallelism as well. But with such a design, two issues are easy to overlook:

 

  • The logic of first-level scheduling is relatively complex and requires multiple rounds of interaction. A CN has to maintain the state of every Task at the same time, which is costly and unacceptable in TP scenarios;
  • In first-level scheduling, the higher the concurrency, the more Tasks are generated, and the more network transmission channels have to be established between them.

 

Second-level scheduling (inside a node)

 

To address these shortcomings of first-level scheduling, we drew on Hyper's paper [1] and introduced second-level scheduling: single-machine parallel scheduling inside the CN node. In short, the local scheduling component (Local Scheduler) inside the CN further parallelizes the Task, so that a Task scheduled onto a CN can itself run in parallel. In the figure below, Stage-1 and Stage-2 are upstream and downstream of each other, each with a concurrency of 9, scheduled onto 3 CN nodes. With only one level of scheduling, each CN node would run 3 Tasks, and a total of 9 × 9 = 81 channels would be established between upstream and downstream. Since the Tasks inside a CN node are independent of each other, the drawbacks are obvious:

 

[Figure 9]

 

  1. The large number of channels amplifies the network overhead: the same buffer is sent multiple times, and both sending and receiving cost CPU and memory;
  2. The unit of data transfer is the Task, and because the data itself is skewed, load imbalance (hash skew) arises between the Tasks within the same node, leading to a long-tail problem.

 

When first-level and second-level scheduling are combined, the first-level concurrency of Stage-1 and Stage-2 becomes 3, so each CN node runs only 1 Task, and the concurrency inside each Task is 3. Since the unit of shuffle is the Task, only 3 × 3 = 9 channels need to be established between Stage-1 and Stage-2, greatly reducing network overhead. At the same time, the data received by a Task is shared among its three Drivers, so all Drivers in the Task can consume the received data and execute in parallel, avoiding the long-tail problem. Take HashJoin as an example: suppose Ta is a large table and Tb a small table. To join them we can either shuffle both Ta and Tb so that matching rows land on the same nodes, or broadcast the small table Tb to the nodes where Ta resides. The network cost of the former is Ta + Tb, while the cost of the latter is N * Tb (where N is the number of broadcast copies). With only one level of scheduling, N could be large, so during execution we might have to shuffle both sides; with the combined first- and second-level scheduling strategy, BroadcastHashJoin can be chosen during execution, avoiding the shuffle of the large table and improving execution efficiency.


In addition, under the second-level scheduling strategy, the multiple threads inside a Task can easily share data, which enables more efficient algorithms. As shown in the figure below, during a HashJoin the multiple threads (Drivers) in the build-side Task cooperate: after the build side receives its shuffled data, the threads jointly build a single shared hash table, so the Task has only one build table. After the probe side receives its shuffled data, it needs no further redistribution; it reads the received data directly and probes in parallel.
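A minimal sketch of the shared-build idea, with all names illustrative: the build-side drivers of one task insert concurrently into a single shared hash table, and the probe-side drivers then probe that one table directly, without any further redistribution:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

class SharedHashJoin {
    // One hash table per task, shared by all build drivers in that task.
    private final Map<Long, List<Object[]>> hashTable = new ConcurrentHashMap<>();

    // Each build driver calls this concurrently on its slice of the shuffled data.
    void buildChunk(List<Object[]> rows, int keyIndex) {
        for (Object[] row : rows) {
            long key = ((Number) row[keyIndex]).longValue();
            hashTable.computeIfAbsent(key, k -> new CopyOnWriteArrayList<>()).add(row);
        }
    }

    // Probe drivers read their received chunks directly against the shared table.
    List<Object[]> probe(long key) {
        return hashTable.getOrDefault(key, List.of());
    }
}
```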

 

[Figure 10]

 

 

 

▶ Parallel execution

 

Having covered scheduling, the next questions are how a Task actually runs on a CN, and how our system handles exceptions encountered during execution.

 

Thread model

 

When it comes to execution, experienced readers may notice that our scheduling does not solve the scheduling-deadlock problem. Take the execution plan below, a join of two tables, as an example; two kinds of problems typically arise:

 

1. If f3 and f2 are scheduled first and the cluster has no scheduling resources left, then f1 cannot be scheduled. HashJoin builds the build table first, and f1 happens to be the build side, so execution deadlocks: f1 waits for the computing resources of f3 and f2 to be released, while f2 and f3 wait for f1 to finish building the build table;
2. If f1 is scheduled first and f2 and f3 have no scheduling resources at that moment, the data f1 pulls from the DNs cannot actually be sent to f3, because f3 has not been scheduled yet.

 

[Figure 11]

 

To solve problem (1), the industry has many approaches; the most common is to build a scheduling dependency (Scheduler Dependency) at the start of scheduling: f1 -> f3 -> f2. To solve problem (2), the data pulled by f1 is usually held in memory first and spilled to disk when it no longer fits. So to handle these two problems, an execution framework would need both complex scheduling dependencies in multi-machine scheduling and support for spilling. In fact, we do not consider scheduling dependencies at all: we schedule f1/f2/f3 all at once. Why is that?

This brings us to the logical thread model in our executor. In most computing engines, a query first goes through a resource scheduling node to apply for execution threads and memory on each CN; once granted, these execution resources are held by the scheduling component for the Tasks of the current query and cannot be used by other queries. This is real execution resource allocation, bound to scheduling, and when the available execution resources on a CN are insufficient, scheduling deadlock occurs. In PolarDB-X we do not apply for real thread resources during scheduling: scheduling only considers the load of each CN, not how much real resource each CN has left. Our thread model is not bound to scheduling resources, and a Driver does not monopolize a real thread; in other words, real threads do not correspond one-to-one to scheduling resources. Although the Driver is the basic unit of execution, from the scheduler's point of view it is only a logical thread model.

Does that mean any Task can always be scheduled onto a CN successfully? Yes. But scheduling all execution units onto the CNs at once still costs memory and CPU. For example, if f2 starts producing data before f1 has finished, f2 keeps running and its output has to be buffered somewhere, yet it clearly cannot buffer data indefinitely. To solve this, we rely on time-slice execution.

 

Time slice execution

 

Inside each CN node we have a pool of execution threads that run the Drivers. Every Driver queues up to enter the thread pool and participate in computation; if a Driver becomes blocked, it leaves the pool and enters a blocking queue, waiting to be woken up. For example, once the f2 Driver starts, it pulls data from the DN into a bounded buffer. If the f1 Driver has not finished, the buffer belonging to the f2 Driver fills up and the Driver blocks. As soon as it blocks, the execution framework removes the f2 Driver from the executor and adds it to the blocking queue; put simply, it gives up its computing resources and waits to be woken. When the f1 Driver finishes, the f2 Driver is woken up and moved to the pending queue, where it waits to be dispatched into the execution thread pool again. Some memory is still wasted here, but compared with CPU, memory is relatively abundant.

 

[Figure 12]

 

The core of time-slice execution is deciding when a Driver blocks. In summary, a Driver blocks in three situations:

 

  • The operator dependency model: for example, if the f1 Driver in the figure has not finished, the f2 Driver is considered blocked (this is a configurable option);
  • Insufficient computing resources (mainly memory): the corresponding Driver is suspended until resources are released;
  • Waiting for a DN response: after the physical SQL is sent to the DN, the Driver is suspended until the physical SQL finishes executing.

 

In addition, borrowing from the Linux time-slice scheduling mechanism, we track each Driver's running time at the software level; if it exceeds a threshold (500 ms), the Driver is forced off the execution thread, added to the pending queue, and waits for the next scheduling round. This software-level time-slice model prevents complex queries from hogging computing resources for a long time. The implementation is actually quite simple: after each batch of data is processed, we accumulate the Driver's running time and exit the thread pool once it exceeds the threshold. The pseudo code of the Driver processing logic is shown below. During execution the Driver follows the classic producer-consumer model: every time a Chunk is consumed we accumulate the elapsed time and yield once it exceeds the predefined threshold.

 

[Figure 13]
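Since the original pseudo code is only available as an image, here is a minimal sketch of such a time-sliced driver loop based on the description above; the operator interface and the way the threshold is checked are assumptions:

```java
import java.time.Duration;

// Illustrative driver loop: process one chunk at a time, stop when blocked,
// and yield the execution thread once the accumulated slice exceeds ~500 ms.
class DriverLoop {
    enum State { BLOCKED, YIELDED, FINISHED }

    private static final long TIME_SLICE_NANOS = Duration.ofMillis(500).toNanos();

    interface Operator {
        boolean isBlocked();     // e.g. output buffer full, waiting on a DN, waiting on the build side
        boolean isFinished();
        void processNextChunk(); // pull one chunk from upstream and push it downstream
    }

    State run(Operator root) {
        long elapsed = 0;
        while (!root.isFinished()) {
            if (root.isBlocked()) {
                return State.BLOCKED;        // move the driver to the blocking queue
            }
            long start = System.nanoTime();
            root.processNextChunk();         // classic producer-consumer step
            elapsed += System.nanoTime() - start;
            if (elapsed > TIME_SLICE_NANOS) {
                return State.YIELDED;        // re-enqueue into the pending queue
            }
        }
        return State.FINISHED;
    }
}
```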

 

Task state machine

 

In highly concurrent systems, frequent waits and task switches are common bottlenecks, and asynchronous processing is a proven way to avoid them and push the performance of a high-concurrency system to its limit. The entire back end of the PolarDB-X executor therefore uses a fully asynchronous execution framework. At the same time, MPP execution involves coordination across multiple machines, so we must maintain these asynchronous states inside the system. Maintaining the asynchronous state is particularly important: for example, if one Task of a query fails, all running Tasks of that query across the cluster must be notified immediately so they can be terminated at once, preventing suspended Tasks from holding on to resources.


Inside the executor we therefore maintain state along three dimensions: Task, Stage, and Query. These states depend on each other. For example, if a Query is cancelled, all of its Stages are notified immediately; each Stage listens for the state change and notifies all of its Tasks in time; only after the Tasks have been cancelled does the Stage move to its final Cancel state, and only then is the Query marked Cancel. For this we introduce an asynchronous listener mechanism on the state machine: once a state change occurs, the related handling logic is called back asynchronously. Maintaining these states also lets us diagnose in time, via queries or monitoring, whether a Task is abnormal and where the abnormality occurred, which helps later troubleshooting.
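A minimal sketch of the asynchronous state-machine idea described above; the states and listener mechanism shown here are illustrative, not the actual implementation:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

class TaskStateMachine {
    enum TaskState { PLANNED, RUNNING, FINISHED, CANCELED, FAILED }

    private final AtomicReference<TaskState> state = new AtomicReference<>(TaskState.PLANNED);
    private final List<Consumer<TaskState>> listeners = new CopyOnWriteArrayList<>();
    private final Executor notifier = Executors.newSingleThreadExecutor();

    void addStateChangeListener(Consumer<TaskState> listener) {
        listeners.add(listener);
    }

    // Terminal states (FINISHED / CANCELED / FAILED) are never overwritten.
    void transitionTo(TaskState newState) {
        TaskState old = state.getAndUpdate(s ->
                s == TaskState.FINISHED || s == TaskState.CANCELED || s == TaskState.FAILED ? s : newState);
        if (old != newState && state.get() == newState) {
            for (Consumer<TaskState> listener : listeners) {
                notifier.execute(() -> listener.accept(newState)); // asynchronous callback
            }
        }
    }
}
```

A Stage-level state machine would register itself as a listener on each of its Tasks, and the Query on each of its Stages, so a failure or cancellation propagates through callbacks rather than blocking waits.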

 

[Figure 14]

 

 

▶ Resource isolation

 

If there are too many concurrent requests, resource shortage forces the requesting threads to queue. Worse, when a running thread consumes a large amount of computing resources (CPU and memory), it seriously affects the other Drivers running normally, which is unacceptable for an executor aimed at HTAP scenarios. For resource isolation we therefore isolate computing resources by workload, and the isolation is preemptive.

 

CPU

 

At the CPU level we rely on cgroups for resource isolation. According to the workload, CPU resources are divided into two groups: the AP group and the TP group. The CPU of the TP group is not limited, while the AP group is hard-isolated by cgroup, with its CPU bounded by a minimum threshold (cpu.min.cfs_quota) and a maximum threshold (cpu.max.cfs_quota). The execution threads are divided into three pools: the TP Core Pool, the AP Core Pool, and the SlowQuery AP Core Pool; the latter two belong to the AP group and are subject to strict CPU limits. Drivers are dispatched to the different pools according to their workload. This looks neat, but two problems remain:

 

1. What if the cost-based workload identification is inaccurate?

2. AP queries consume more resources; what if several slow queries in the same group interfere with each other?

 

The main risk in problem (1) is that an AP query gets identified as TP, letting the AP work affect TP, which is unacceptable. During execution we therefore monitor the execution time of TP Drivers; a query that has not finished after a certain threshold actively gives up its time slice and is then scheduled onto the AP Core Pool. To address problem (2), Drivers that have been running in the AP Core Pool for too long are further gracefully degraded and scheduled onto the SlowQuery AP Core Pool, which is given a lower execution weight so that it runs its Drivers as infrequently as possible.
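A minimal sketch of this routing and degradation logic; only the three pool names follow the text, while the pool sizes and time thresholds are illustrative assumptions:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class WorkloadRouter {
    enum Workload { TP, AP }

    private final ExecutorService tpCorePool = Executors.newFixedThreadPool(16);
    private final ExecutorService apCorePool = Executors.newFixedThreadPool(8);     // inside the AP cgroup
    private final ExecutorService slowApCorePool = Executors.newFixedThreadPool(4); // lower weight, also AP cgroup

    // Route a driver based on its (cost-estimated) workload and how long it has already run.
    ExecutorService poolFor(Workload workload, long elapsedMillis) {
        if (workload == Workload.TP) {
            // A "TP" driver that keeps running past the threshold was probably misclassified:
            // demote it so it cannot starve real TP traffic.
            return elapsedMillis < 1_000 ? tpCorePool : apCorePool;
        }
        // Long-running AP drivers are degraded once more to the slow-query pool.
        return elapsedMillis < 30_000 ? apCorePool : slowApCorePool;
    }
}
```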

 

[Figure 15]

 

Memory

 

At the memory level, the heap of a CN node can be roughly divided into four areas:

 

  • TP Memory: stores temporary data produced during TP computation
  • AP Memory: stores temporary data produced during AP computation
  • Other: data structures, temporary objects, metadata, etc.
  • System Reserved: memory reserved for the system

 

TP Memory and AP Memory each have maximum and minimum thresholds, and they can preempt each other. The basic principle is: TP Memory may preempt AP Memory and does not release it until the query ends, while AP Memory may preempt TP Memory, but as soon as TP needs the memory, AP Memory must release it immediately, either by killing the AP query or by spilling to disk.
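A minimal sketch of this preemption rule, assuming simple per-pool counters; the real memory manager is certainly more involved:

```java
// Illustrative sketch of the TP/AP memory preemption rule, not the real memory manager.
class MemoryPools {
    private final long tpQuota;
    private final long apQuota;
    private long tpUsed;
    private long apUsed;

    MemoryPools(long tpQuota, long apQuota) {
        this.tpQuota = tpQuota;
        this.apQuota = apQuota;
    }

    // TP may spill over into whatever AP has left unused, and keeps it until the query ends.
    synchronized boolean reserveTp(long bytes) {
        long available = (tpQuota - tpUsed) + (apQuota - apUsed);
        if (bytes > available) {
            return false; // even preemption is not enough: AP queries must spill or be killed first
        }
        tpUsed += bytes;
        return true;
    }

    // AP may borrow idle TP memory, but the loan is revocable at any time.
    synchronized boolean reserveAp(long bytes) {
        long available = (apQuota - apUsed) + (tpQuota - tpUsed);
        if (bytes > available) {
            return false; // the AP query should spill to disk or abort
        }
        apUsed += bytes;
        return true;
    }

    // How much memory AP is currently borrowing beyond its own quota; when TP asks for
    // memory back, this is the amount AP must release immediately (spill or abort).
    synchronized long apBytesToRevoke() {
        return Math.max(0, apUsed - apQuota);
    }
}
```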

 

[Figure 16]

 

 

▶ Data Transmission Layer (DTL)

 

Parallel computing makes full use of the resources of all CNs, so data inevitably has to be exchanged between nodes: the data of upstream and downstream Tasks needs to be transmitted. If there are N upstream Tasks and M downstream Tasks, M * N channels are needed between them. We abstract these channels into the data transmission layer (DTL). The design of this layer typically faces two problems:

 

1. A channel has a sending end and a receiving end. When the sender keeps sending data faster than the receiver can process it, memory on the receiving side blows up;

2. Data may be lost during transmission.

 

[Figure 17]

 

There are two main data transmission approaches in the industry: push and pull. With push, the sender pushes data to the receiver; to keep the receiver from being overwhelmed, flow-control logic must be introduced. The usual approach is to reserve slots on the receiver: when the slots are full of data, the sender is told to pause, and when slots are freed by consumption the sender is told to resume. This involves multiple interactions between sender and receiver, and the flow-control mechanism is relatively complex. With pull, the sender first writes data into its own buffer and the receiver pulls from that buffer on demand; if the receiver does not pull for a long time, the sender's buffer fills up and the sender is back-pressured. To avoid frequent back pressure, the sender's buffer should not be set too small. Weighing the two, we chose the pull approach. Using pull, we still run into two questions:

 

1. A receiver usually connects to multiple upstream senders. Should it broadcast a pull request to all of them every time?
2. How much data should be requested from a sender per pull, i.e. what is averageBytesPerRequest?

 

Let's answer question (2) first: we record lastAverageBytesPerRequest, the current number of connected channels n, and the total responseBytes returned by the last round, and from these compute the current averageBytesPerRequest; the specific formula is given below. As for question (1), given the current averageBytesPerRequest and the remaining space in the receiver's buffer, we can estimate how many upstream senders to request from in this round.
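The original formula is shown only as an image, so the following is one plausible reconstruction from the three quantities named above (a smoothed per-channel average), not the actual PolarDB-X formula:

```java
class PullSizeEstimator {
    private long lastAverageBytesPerRequest = 1 << 20; // arbitrary 1 MB seed, an assumption

    // Plausible reconstruction only: smooth the per-channel response size of the
    // last round (responseBytes over n channels) into the running average.
    long update(long responseBytes, int n) {
        long observed = Math.max(1, responseBytes / Math.max(1, n));
        lastAverageBytesPerRequest = (lastAverageBytesPerRequest + observed) / 2;
        return lastAverageBytesPerRequest;
    }

    // Question (1): with the estimate and the receiver's free buffer space we can decide
    // how many upstream senders to poll in this round instead of broadcasting to all.
    int sendersToRequest(long freeBufferBytes, int connectedSenders) {
        long perRequest = Math.max(1, lastAverageBytesPerRequest);
        return (int) Math.min(connectedSenders, Math.max(1, freeBufferBytes / perRequest));
    }
}
```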

 

 

 

[Figure 18]

 

To guarantee reliable transmission during asynchronous communication, we adopt a mechanism similar to TCP ACK: when the receiver pulls data with a token, it means all data before that token has been consumed by the receiver, so the sender can release that data and then return the next batch of data together with nextToken to the receiver.
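A minimal sketch of the token handshake; types and names are assumptions. Pulling with token k acknowledges everything before k, so the sender can release those pages and return the next batch together with nextToken; if a response is lost, the receiver can simply retry with the same token:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

class SenderBuffer {
    record PullResult(List<byte[]> pages, long nextToken) {}

    private final NavigableMap<Long, byte[]> pagesByToken = new ConcurrentSkipListMap<>();
    private long nextToken;

    synchronized void enqueue(byte[] page) {
        pagesByToken.put(nextToken++, page);
    }

    // Pulling with `token` acknowledges all pages before it, so they can be released;
    // then up to maxPages starting at `token` are returned along with the next token.
    synchronized PullResult pull(long token, int maxPages) {
        pagesByToken.headMap(token).clear();          // ack: the sender releases acknowledged data
        List<byte[]> batch = new ArrayList<>();
        long t = token;
        while (batch.size() < maxPages && pagesByToken.containsKey(t)) {
            batch.add(pagesByToken.get(t));
            t++;
        }
        return new PullResult(batch, t);              // the receiver sends `t` back as the next token
    }
}
```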

 

[Figure 19]

 

 

 

▶ Effect display

 

After all this technical detail, let's look at something simple and concrete. Here we take TPC-H Q13 as an example to demonstrate the executor's acceleration in different scenarios. For the convenience of the screenshots, a LIMIT is appended to Q13. In this test environment, both the CN and DN specifications are 2*16C64G.

 

Running single-machine and single-threaded, it takes 3 min 31 s:

[Figure 20]

 

With Parallel Query acceleration (single-machine, multi-threaded execution), it takes 23.3 s:

[Figure 21]

 

With MPP acceleration, using the resources of two CN nodes at the same time, it takes 11.4 s:

[Figure 22]

 

 

▶ Summary

 

Whether it is a simple query or a complex query, in Parallel Query or MPP scenarios, everything shares one execution framework; what differs per scenario is mostly the concurrency settings and the scheduling strategy. Compared with other products in the industry, the main characteristics of the PolarDB-X executor are:

 

  1. On the resource model, it uses lightweight resource management. Unlike big-data computing engines, it needs no extra resource-management node and no strict resource pre-allocation, mainly because our scenario is online computing on small clusters;
  2. On the scheduling model, the executor supports DAG scheduling, which allows a more flexible concurrency model than classic MPP scheduling: the concurrency of different Stages and Pipelines can differ;
  3. Unlike other products that accelerate AP by plugging in an external parallel computing engine, the PolarDB-X parallel executor is built in. Different queries share one execution model, ensuring that TP and AP enjoy the same SQL compatibility.

 

PolarDB-X parallel computing has been running stably in production for nearly two years. In those two years, we have not only done a great deal of stability work on the execution framework but also accumulated many optimizations at the operator layer. That is still not enough: adaptive execution is currently the hottest topic, and adaptive execution combined with the pipeline model is even more challenging. We are studying it now; friends who are interested are welcome to join us and make progress together!

 

 

▶ Reference

[1] V. Leis, et al. Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age. SIGMOD, 2014.
[2] Presto: SQL on Everything.
[3] A Deep Dive into Query Execution Engine of Spark SQL.
[4] Impala: A Modern, Open-Source SQL Engine for Hadoop.
[5] FusionInsight LibrA: Huawei's Enterprise Cloud Data Analytics Platform. Proc. VLDB Endow. 11(12): 1822-1834 (2018).

 

 

【Related Reading】

HTAP database "compulsory course": PolarDB-X Online Schema Change

Type binding and code generation of PolarDB-X vectorization engine

Need to explain a lot of instructions every time? Use PolarDB-X vectorization engine

PolarDB-X hybrid executor for HTAP

PolarDB-X CBO optimizer for HTAP

Such as BMW 3 Series and 5 Series: PolarDB-X and DRDS go hand in hand

PolarDB-X storage architecture "best production practice based on Paxos"

PolarDB-X private protocol: improve the performance and stability of the cluster

Technical Interpretation | Implementation of PolarDB-X Distributed Transaction

Technical Interpretation | PolarDB-X Strongly Consistent Distributed Transaction

PolarDB-X consistency consensus protocol (X-Paxos)
