In-depth evaluation and reflection on Trino fault-tolerance mode

This article is shared from the Huawei Cloud Community post "Toward Unified Batch Processing and Interactive Analysis: An In-depth Evaluation of and Reflection on Trino's Fault-Tolerance Mode", author: HetuEngine Level 9 Endorsement.

This article was originally written by the Huawei Cloud Big Data R&D team. Original authors: Wenbo and Mengyue.

1 Introduction to Trino

On December 27, 2020, the Presto community leaders Martin Traverso, Dain Sundstrom, and David Phillips announced that the open source project PrestoSQL would be renamed TrinoDB (referred to as Trino in this article).

Trino is an open source, high-performance, distributed SQL query engine designed to run interactive analytical queries over a variety of heterogeneous data sources, supporting data volumes from GB to PB. Built specifically for interactive analysis, Trino can federate queries across data from different sources (including Hive, AWS S3, Alluxio, MySQL, Kafka, Elasticsearch, etc.) and provides a well-designed framework for writing custom connectors. It is ideal for analyst scenarios where expected response times range from sub-second to minutes.
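For example, a single Trino query can join tables that live in different catalogs. Below is a minimal sketch of such a federated query; the hive and mysql catalog, schema, and table names are illustrative placeholders that assume the corresponding connectors are configured:

```sql
-- Join Hive fact data against a MySQL dimension table in one query
-- (catalog/schema/table names are placeholders)
SELECT o.order_id, c.customer_name, o.amount
FROM hive.sales.orders AS o
JOIN mysql.crm.customers AS c
  ON o.customer_id = c.id
WHERE o.order_date >= DATE '2023-01-01';
```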


Trino was born to fill the gap between Facebook's internal real-time queries and its ETL processing. Its core goal is interactive querying, what we usually call ad-hoc query, and many companies use it as an OLAP compute engine. In recent years business scenarios have grown more complex: besides interactive queries, many companies also need to handle batch workloads, and technology leaders have begun to consider how to use Trino for batch processing over large data sets.

2 Limitations of the traditional Trino architecture

In the traditional Trino execution architecture, Trino plans all tasks for a given query up front. These tasks are interdependent: the output of one task is the input of the next. For an MPP engine this interdependence is necessary, but it also means that if any task fails along the way, the whole task chain collapses and the entire SQL statement aborts.

The process by which Trino executes a SQL query is shown below (from the Trino official website):

[Figure: Trino query execution flow, from the Trino official website]

Advantages:

Data is streamed through tasks without intermediate checkpoints, resulting in high throughput and low latency.

Shortcomings:

  • No fine-grained failure recovery: if a problem occurs, the entire query must be rerun from scratch
  • Complete reliance on memory for data processing and exchange
  • Once the execution plan is determined, it cannot be adjusted to match actual execution progress

3 Trino Fault Tolerant Execution Architecture (FTE)

The Trino open source community has designed a new fault-tolerant execution architecture (FTE) that enables advanced resource-aware scheduling with fine-grained retries. The project is codenamed "Tardigrade".

Project Tardigrade aims to break down the old all-or-nothing execution barrier, opening up new opportunities in resource management, adaptive query optimization, and failure recovery. The project is named after the tardigrade, one of the world's most indestructible creatures, mirroring the robustness FTE brings to Trino.


The Tardigrade project brings several immediately visible benefits:

  • A long-running SQL query that hits a failure no longer has to be rerun from scratch;
  • A query that needs more memory than is currently available in the cluster can still run to completion;
  • Multiple queries submitted at the same time can share resources fairly and make steady progress.

From an implementation standpoint, Trino builds core capabilities such as task-level fault tolerance, automatic retries, and shuffle directly into the kernel, as shown below (from the Trino official website):

[Figure: Trino fault-tolerant execution with task-level retries, from the Trino official website]

Trino divides a query's execution into multiple stages. In fault-tolerant mode, the shuffle data of an upstream stage is written out to intermediate storage (AWS S3, HDFS, and local disk are supported), and the downstream stage reads the data it needs from that storage; along the way, subsequent tasks can be re-optimized and re-allocated. A minimal configuration sketch follows.
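For reference, here is a minimal configuration sketch for enabling task-level fault tolerance in open-source Trino. Property names follow the Trino documentation for recent releases and may vary by version; the exchange directory is a placeholder:

```properties
# etc/config.properties -- enable task-level retries (QUERY is the other policy)
retry-policy=TASK
# optional: number of retries per task before the query fails (Trino default: 4)
task-retry-attempts-per-task=4

# etc/exchange-manager.properties -- spool intermediate shuffle data externally
exchange-manager.name=filesystem
# placeholder path; S3 or HDFS URIs may also be used, depending on the Trino version
exchange.base-directories=/mnt/trino-exchange
```

With retry-policy=QUERY, an exchange manager is generally not needed, but a failure restarts the entire query.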


Improvements:

  • Adaptive planning: query plans can be adjusted dynamically while data is buffered
  • Resource management: resource allocation can be adjusted while a query runs. When the cluster is idle, a single query can use all available resources; as more workload arrives, the initial query's allocation can be gradually reduced.
  • Fine-grained failure recovery: failed tasks are restarted transparently, making ETL completion times more predictable.

Next, this article takes you through an in-depth, hands-on look at Trino's fault-tolerant execution mode.

4 Basic performance test

First, we ran a baseline performance test with ample compute resources: the TPC-DS 1TB data set on 2 CNs + 16 Workers at 136GB per process. We executed the 99 TPC-DS queries with fault tolerance off and then on; execution times are summarized below:

[Figure: TPC-DS 99 execution times with fault tolerance disabled vs. enabled]

To test write performance, we used a CTAS statement over catalog_sales, the largest table in TPC-DS:

```sql
create table catalog_sales_copy as select * from catalog_sales;
```

The test data is as follows:

Execution time (unit: seconds):

| Data set | Compute resources | Fault tolerance and spill disabled | Task fault tolerance | Task fault tolerance + spill |
|----------|-------------------|------------------------------------|----------------------|------------------------------|
| 1TB      | 1 CN + 2 Workers, 20GB/process   | 622.2 | 673  | 687  |
| 10TB     | 1 CN + 3 Workers, 136GB/process  | 3445  | 1485 | 1486 |

Summary:

  • Enabling Task fault tolerance flushes intermediate exchange results to disk, which costs performance; overall execution time is roughly twice what it was before;
  • Query fault tolerance has no spill-to-disk step, and its performance matches running with fault tolerance disabled;
  • On the 1TB data set, Task fault tolerance also costs 8%-10% of write performance, yet on the 10TB data set write performance actually improves; this deserves deeper analysis.

5 Stability test for large data volume scenarios

This section stress-tests TPC-DS in a scenario where compute resources are severely insufficient. The results are as follows:

Error rate by configuration:

| Data volume | Compute resources | No fault tolerance | Task fault tolerance | Task fault tolerance + spill to disk |
|-------------|-------------------|--------------------|----------------------|--------------------------------------|
| 1TB  | 1 CN + 2 Workers, 40GB/process   | 7.07%  | 0%    | 0%    |
| 1TB  | 1 CN + 2 Workers, 20GB/process   | 12.12% | 0%    | 0%    |
| 1TB  | 1 CN + 2 Workers, 10GB/process   | 16.16% | 4.04% | 0%    |
| 10TB | 1 CN + 3 Workers, 136GB/process  | 8.08%  | 0%    | 0%    |
| 50TB | 1 CN + 16 Workers, 136GB/process | 13.13% | 6.06% | 5.05% |

Summary:

  • Using Task fault tolerance when memory is insufficient greatly improves the SQL success rate, and combining it with the spill-to-disk feature (see the configuration sketch after this list) gives even better fault tolerance;
  • On the 50TB data set, Task fault tolerance still improves the success rate, but some complex SQL can hit single-point bottlenecks; the main bottleneck observed so far is single-point aggregation.
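For reference, here is a minimal sketch of the open-source Trino spill configuration behind the "Task fault tolerance + spill to disk" column above. Property names come from the Trino documentation and may differ across versions; the spill path is a placeholder:

```properties
# etc/config.properties -- let memory-intensive operators spill to local disk
spill-enabled=true
# local directory for spill files (placeholder; use a fast, dedicated disk)
spiller-spill-path=/data/trino/spill
# optional cap on total spilled bytes across all queries on a node
max-spill-per-node=100GB
```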

6 High concurrency scenario testing

6.1 1TB TPC-DS standard data set

Compute resource specification: 1 CN + 8 Workers, 136GB/process

Test SQL use case: Q01 (a multi-fact-table join query; Q29 in TPC-DS 99)

The test results are shown in the table below:

(Throughput in completed queries per minute; "failed" cells show the failure rate.)

| Test scenario | Concurrency | Fault tolerance disabled | Query fault tolerance | Task fault tolerance |
|---------------|-------------|--------------------------|-----------------------|----------------------|
| Multi-fact-table join Q01, 1 round  | 1   | 4.1/min       | 5.2/min      | 2.6/min |
| Multi-fact-table join Q01, 1 round  | 100 | 7.3/min       | 7.2/min      | 8.1/min |
| Multi-fact-table join Q01, 1 round  | 200 | 17.50% failed | 18% failed   | 7.9/min |
| Multi-fact-table join Q01, 5 rounds | 1   | 5.2/min       | 4.8/min      | 3.4/min |
| Multi-fact-table join Q01, 5 rounds | 100 | 8.3/min       | 8.6/min      | 8.6/min |
| Multi-fact-table join Q01, 5 rounds | 200 | 64.9% failed  | 74.9% failed | 8.5/min |


6.2 10TB TPC-DS standard data set

Compute resource specification: 1 CN + 8 Workers, 136GB/process

Test SQL use case, Q02 (single-table multi-column aggregation and sort):

```sql
select
    ws_item_sk,
    ws_web_site_sk,
    sum(ws_sales_price) total
from
    web_sales
where
    ws_sold_date_sk >= 2450815
    and ws_sold_date_sk <= 2451179
group by
    ws_item_sk,
    ws_web_site_sk
having
    sum(ws_sales_price) > 0
order by
    total desc
limit 100;
```

With TASK fault tolerance enabled, all runs completed successfully. The test results are shown in the table below (throughput in completed queries per minute; "failed" cells show the failure rate):

| Test scenario | Concurrency | No fault tolerance | Task fault tolerance |
|---------------|-------------|--------------------|----------------------|
| Single-table aggregation/sort Q02, 1 round  | 1   | 3.3/min       | 1.3/min |
| Single-table aggregation/sort Q02, 1 round  | 100 | 7.9/min       | 5.7/min |
| Single-table aggregation/sort Q02, 1 round  | 200 | 9.7/min       | 8.8/min |
| Single-table aggregation/sort Q02, 1 round  | 300 | 8.5/min       | 5.9/min |
| Single-table aggregation/sort Q02, 1 round  | 400 | 97.25% failed | 6.8/min |
| Single-table aggregation/sort Q02, 5 rounds | 1   | 7.1/min       | 2.0/min |
| Single-table aggregation/sort Q02, 5 rounds | 100 | 10.7/min      | 9.5/min |
| Single-table aggregation/sort Q02, 5 rounds | 200 | 10.3/min      | 9.3/min |
| Single-table aggregation/sort Q02, 5 rounds | 300 | 8.20% failed  | 8.0/min |
| Single-table aggregation/sort Q02, 5 rounds | 400 | 99.1% failed  | 6.6/min |

Summary:

Task fault tolerance raises the Trino engine's concurrency ceiling and greatly reduces errors such as "Encountered too many errors talking to a worker node".

7 Horizontal comparison tests across multiple engines

First, we selected the TPC-DS 99 queries that fail 100% of the time when Trino runs with fault tolerance disabled under constrained compute resources:

Q04, Q11, Q23, Q38, Q64, Q65, Q67, Q74, Q75, Q78, Q80, Q81, Q85, Q87, Q93, Q95, Q97

Using identical compute resources (memory, CPU, number of containers), we compared the performance of Trino, Spark, and Hive (Tez) head to head.

Note: The Trino tests actually used the kernel of Huawei Cloud HetuEngine 2.0.

7.1 1TB TPC-DS standard data set

[Figure: 1TB TPC-DS comparison of Trino (Task fault tolerance), Spark, and Hive (Tez)]

With 1TB of data and identical resources, Trino with Task fault tolerance enabled successfully executes the SQL that previously failed, at roughly 3x the performance of Spark and tens of times that of Hive (Tez).

7.2 10TB TPC-DS standard data set

We ran the same comparison on the 10TB TPC-DS standard data set:

[Figure: 10TB TPC-DS comparison of Trino (Task fault tolerance) and Spark]

With 10TB of data and identical resources, Trino with Task fault tolerance enabled again successfully executes the SQL that previously failed, at roughly 3x the performance of Spark.

8 Comprehensive evaluation

Based on the test data above, we summarize as follows.

Basic single-concurrency performance

  1. Sufficient memory: fault tolerance disabled = Query fault tolerance > Task fault tolerance
  2. Insufficient memory: Task fault tolerance completes, while fault tolerance disabled / Query fault tolerance cannot produce results

Stability in large data volume scenarios

Task fault tolerance + spill to disk > Task fault tolerance > Disable fault tolerance

  • 1-10TB data sets: Task fault tolerance is very stable, with a 100% pass rate
  • 50TB data set: Task fault tolerance combined with spill to disk performs better than Task fault tolerance alone (one fewer failed use case)

Stability in concurrent scenarios

Task fault tolerance > Disable fault tolerance

Horizontal performance comparison of multiple engines

  • 1TB TPC-DS data set: Trino (Task fault tolerance) > Spark > Hive (Tez)
  • 10TB TPC-DS data set: Trino (Task fault tolerance) > Spark

Overall, Trino's FTE capability exceeded our expectations for both performance and stability. As it evolves and matures, we believe Trino will play an ever greater role in one-stop data processing and analysis scenarios.

9 Thoughts and improvements

With first-hand test data and conclusions in hand, we next consider how to put Trino's fault-tolerance mode to good use and maximize its value, while identifying potential problems ahead of time and exploring solutions.

9.1 Fault Tolerance Mode Enablement Decision

The test data above shows that turning on fault-tolerance mode costs some short-query performance (while, conversely, potentially improving large-query performance). We therefore need to think about when and how to turn it on.

Several approaches are available:

  • Let users choose

The simplest approach is to let business users enable or disable fault-tolerance mode at their own discretion. Experienced users usually know which queries are likely to be computationally expensive or long-running, and they can switch flexibly between "interactive mode" and "fault-tolerant mode" by changing the session parameters of their JDBC connection (a hedged sketch follows this list);

  • Cost-based decisions

Whether to enable "fault-tolerance mode" can be decided from the predicted cost of the SQL. This technique generally relies on column-level statistics collected ahead of time (see the statistics sketch after this list). However, such statistics are sometimes unavailable, and the accuracy of cost-based prediction is often less than ideal;

  • Adaptive selection

A query can start in "interactive mode" by default; after it has run for N minutes, the engine kernel, having observed available resources, workload characteristics, and other signals, decides on its own whether to switch "fault-tolerant mode" on or off. This idea requires combining the Trino engine with machine learning and AI techniques;

  • Decisions based on historical information

For certain classes of queries against specific data sources, historical execution records can be collected, analyzed, and modeled in advance. Using this prior-knowledge model, the optimal execution mode is chosen before the SQL runs.
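The first two approaches can be illustrated with a short SQL sketch. Two caveats: open-source Trino manages retry-policy as cluster-level configuration, so whether it can be toggled per session depends on the version and distribution in use; and the table name below is a placeholder, with ANALYZE requiring a connector that supports statistics collection:

```sql
-- Approach 1: user-driven switching via a session property
-- (assumes the deployment exposes the retry policy per session;
--  many setups instead fix it per cluster or per compute instance)
SET SESSION retry_policy = 'TASK';

-- Approach 2: cost-based decisions need statistics to exist first
ANALYZE hive.tpcds.catalog_sales;          -- collect table and column statistics
SHOW STATS FOR hive.tpcds.catalog_sales;   -- inspect row counts and data sizes
EXPLAIN SELECT count(*) FROM hive.tpcds.catalog_sales;  -- view the estimated plan
```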

9.2 Scaling out to larger deployments

Given that Trino now has a fault-tolerant execution mode and the test data looks good, a natural question arises: can we build larger-scale analytical query acceleration services on top of this capability?

In real business scenarios, enterprises may need on-demand task submission and elastic resource scheduling, especially in large-scale, cloud-native environments. Even with fault-tolerance mode on, a single Trino cluster's Coordinator node can still become a concurrency bottleneck. Moreover, from a software architecture perspective, a single Trino cluster carries availability risks, which threatens SLA targets in a cloud service environment.

To address these problems, Huawei Cloud's interactive analysis engine HetuEngine provides a three-tier distributed architecture and exposes a single, globally unique JDBC service address to the business through a unified SQL access portal, HSFabric.

[Figure: HetuEngine three-tier architecture with the HSFabric unified SQL access portal]

Through HSFabric's unified SQL access portal, HetuEngine decouples business-layer logic from any specific compute instance. Multiple compute instances can be scaled out horizontally within a single resource tenant, and SQL tasks within the same tenant can be flexibly distributed across different compute instances.

From both multi-tenant and single-tenant perspectives, HetuEngine's concurrency capacity can be expanded horizontally, while service availability and resource utilization also improve.

On top of this architecture, HetuEngine lets service administrators freely decide whether to turn fault-tolerant execution on or off per tenant, to better match the business requirements of different scenarios.


9.3 Troubleshooting and recovery

During Trino's fault-tolerant execution, a large amount of inter-stage shuffle data lands in the distributed file system. Taking HDFS as an example, consider the problems that can arise.

Suppose that while a large SQL statement is executing and Trino is writing shuffle data to HDFS, the physical node hosting Trino suffers an accident (power outage, network disconnection, operating system crash, etc.), or Trino itself fails and stops working (due to overload, for example). The entire Trino cluster may grind to a halt, and an administrator has to intervene manually to restore it to normal operation.

Clearly, Trino then has at least two problems to think through and solve:

  • How to recover the Trino cluster quickly in an emergency
  • How to ensure residual files on HDFS are cleaned up promptly so that storage space is not exhausted

Huawei Cloud's interactive analysis engine HetuEngine, with its three-tier service-based and containerized architecture, can effectively address both challenges.

Regarding question 1:

Thanks to the fully containerized deployment architecture, when any software process in any HetuEngine compute instance (each corresponding to one distributed Trino cluster) fails or crashes, the Service layer can quickly and automatically launch a new container process to take over, completing fault self-healing before manual intervention is needed.

When available resources run short, HetuEngine supports online elastic scaling of compute instances, dynamically balancing resource utilization by adjusting the number of Workers and quickly replacing Worker resources lost to failures.

When the Coordinator node fails, HetuEngine responds on three fronts:

  1. Worker nodes in the same compute instance immediately reconnect to the standby Coordinator;
  2. The standby Coordinator is promoted to primary;
  3. The unified SQL portal immediately routes new SQL requests to the new primary Coordinator.


Regarding question 2:

HetuEngine's Service layer monitors around the clock, tracking, discovering, and promptly cleaning up job residue at every level (data, files, directories, metadata, etc.).

It also provides multi-dimensional insight into historical tasks, generating high-value SQL operations charts and decision-support information that are presented on the console.

These comprehensive Service-layer capabilities greatly reduce the expertise demanded of data analysis platform administrators and remove their worries about long-term operations.

9.4 Lossless elastic scaling of the big data platform

Generally speaking, big data platform elastic scaling solutions cover only batch engines such as Hive and Spark. Because Hive and Spark have fault-tolerant execution built in, even if the platform's control plane forcibly shrinks away a physical node that is running a Hive/Spark job, the job still finishes successfully; at worst, local tasks are retried and execution takes longer. Elastic scaling for the Hive and Spark engines is therefore relatively easy, requiring attention only to resource-level management operations.

For an MPP-architecture engine such as Trino, however, this elastic scaling model faces the following challenges:

  • MPP SQL engines are generally long-running; if a node is forcibly killed during scale-in, the SQL tasks running on it may fail;
  • Trino has a single Coordinator by default; killing the Coordinator's node during scale-in makes the whole Trino cluster unavailable and fails every running SQL task;
  • Expanding a Trino cluster requires the platform management side to deeply understand Trino's internal service discovery and operating mechanisms, and to customize configuration for the specific cluster's IPs and ports, before new nodes can join an existing cluster.

In short, to scale the Trino family of engines elastically at the big data platform level without business loss, a resource management and business access layer spanning multiple resource tenants and multiple compute instances (Trino clusters) needs to be abstracted between the platform service layer and the Trino kernel layer.

HetuEngine's service layer shields the big data platform layer from the underlying Trino kernel details, exposes REST APIs, and translates the platform's management and operation requests into concrete changes on specific Trino clusters, while also handling day-to-day status monitoring and self-maintenance of multiple Trino clusters.


On this architecture, Trino's fault-tolerant execution capability can further shorten the wait for elastic scaling at the big data platform level.

One feasible approach, roughly: the platform's service layer issues a scale-in instruction to HetuEngine's service layer, which determines which compute instances run on the node about to be removed and dynamically switches them into fault-tolerant mode. Under normal circumstances, the service layer can then quickly report back that the scale-in may proceed, without waiting for in-flight SQL tasks to finish.

9.5 Summary

With the architecture and ideas above, Huawei Cloud HetuEngine can handle the new problems that fault-tolerant execution may introduce, significantly improve real-world operations efficiency in production, and help users comfortably reap the benefits of fault-tolerant execution.

Next, HetuEngine will gradually introduce and refine intelligent switching between the two execution modes, further improve its fit with elastic scaling of big data cloud services, and keep innovating and evolving in one-stop SQL analysis on the data lake.

10 HetuEngine 2.0 version preview

HetuEngine 2.0 is expected to be officially released with Huawei Cloud MRS 3.3.0-LTS on September 30, 2023. This version brings a series of new capabilities, such as:

  • A new kernel running on Java 17, taking basic performance and stability to a new level; TPC-DS runs 30% faster
  • Active defense for big SQL: advance prompting/interception, in-flight circuit breaking, after-the-fact statistics
  • Fault-tolerant execution mode: a wider range of applicable scenarios, enabling one-stop SQL processing and analysis
  • Multiple compute instances within a tenant: automatic load balancing; single-business concurrency scales horizontally
  • New data source types: Hudi, MySQL
  • Support for creating Hudi tables and inserting data
  • Support for connecting Hue to HetuEngine, with a visual SQL editing page
  • Support for proxy-user mode, enabling proxy authentication and auditing against customers' own user systems


Related links: https://support.huaweicloud.com/intl/zh-cn/cmpntguide-lts-mrs/mrs_01_1711.html

