Spark Committer in-depth interpretation: Apache Spark Native Engine

This article comes from You Xiduo, a big data technology expert at NetEase Hangyan, Apache Kyuubi PMC member, and Apache Spark Committer. It focuses on Apache Spark and the Native Engine: what a Native Engine is, why we should build one, and how to build one.

Preface

Apache Spark is a distributed computing engine built on the JVM. The execution performance of its individual SQL operators, such as Aggregation and Join, has not improved for a long time. The main source of performance gains when migrating from Spark 2 to Spark 3 is AQE, but AQE is a framework for optimizing execution plans and shuffle data reads; it has nothing to do with operator performance itself, so its optimization effect has an upper limit. Against the backdrop of cost reduction and efficiency improvement, computing cluster resources are increasingly tight, and holding the baseline without adding machines has become the norm even as the number of computing tasks keeps growing. Hence the need to use lower-level languages such as C, C++, and Rust to implement a Native Engine that accelerates Spark SQL computing performance. In fact, the OLAP ecosystem already has many successful native computing engines, such as ClickHouse and Doris. Generally speaking, a Native Engine also implies an engine based on vectorized computation.

01 Native Engine

Why do we need Native Engine?

SSDs are now very common in big data clusters. For example, the shuffle disks we use for Spark are all SSDs, or even NVMe in core clusters, so IO is no longer the computing bottleneck. A bottleneck never disappears; it only moves from one resource to another. Most current computing bottlenecks come from insufficient compute, that is, CPU, with memory sometimes accounting for a share as well.

Figure 1 JVM vs. C++

So what if we keep optimizing around the JVM? It is possible, of course, but it would be very difficult, and the industry has no mature solutions or projects for it.

Take Spark's current optimization approach, Codegen, which faces many JVM limitations: a single method's bytecode cannot exceed 64KB, the JIT does not optimize methods whose bytecode exceeds 8KB, and so on. In wide-table computation scenarios this easily causes performance regressions, which is why Spark Codegen limits the number of fields. On the other hand, it is difficult for developers to exploit modern CPU features from the JVM, such as SIMD, a CPU feature that accelerates computation with a single instruction over multiple data streams, exposed through the common AVX instruction set, AVX2, and even the more modern AVX-512. In C++ this is easy to achieve: just call a C++ library and enable the corresponding instruction-set optimization at compile time. In terms of memory management, JVM garbage collection efficiency also affects task performance. Keeping GC within 10% of total CPU time is generally considered healthy, yet even then 10% of computing capacity is wasted, and performance gets worse on larger heaps, such as those exceeding 32GB.
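To make the contrast concrete, here is a minimal Java sketch (illustrative only, not Spark code). A tight loop over a flat primitive array is the shape that HotSpot's C2 JIT *may* auto-vectorize on a best-effort basis; in C++, by comparison, the same loop can be compiled with an explicit instruction set such as -mavx2 or -mavx512f, giving the developer direct control.

```java
// Sketch: a columnar, primitive-array layout keeps the hot loop simple enough
// that HotSpot's C2 JIT may auto-vectorize it with SIMD instructions.
// On the JVM this is best-effort; in C++ the same loop can be compiled with
// -mavx2 / -mavx512f for explicit, guaranteed instruction-set selection.
public class ColumnarSum {
    // Tight loop over a flat double[] column: no object headers, no pointer
    // chasing, sequential access -- the pattern auto-vectorizers look for.
    public static double sum(double[] column) {
        double total = 0.0;
        for (int i = 0; i < column.length; i++) {
            total += column[i];
        }
        return total;
    }
}
```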

Therefore, using lower-level languages to develop a Native Engine that accelerates computing performance has become mainstream in the industry.

Native Engine Current Situation

Figure 2 Spark Native Engine

In the Spark ecosystem, calls for a Native Engine have existed for some time. The first product was Databricks' Photon, whose development started as early as 2018; even Spark's parent company needed five years from research and development to official commercial availability, which shows the enormous engineering effort involved. Unfortunately, Photon is not open source, and we can only analyze its design and implementation from the published paper.

Meta has open-sourced Velox, which can be understood as a SQL engine SDK implemented in C++. It provides many libraries around SQL, such as logical/physical execution plans, functions, vector data structures, Parquet reading and writing, and memory pool management, so developers can build their own SQL engines on top of it. Velox's original intent was to accelerate JVM-based computing engines, and Meta's Presto team is developing a Native Engine based on it. The Intel team built and open-sourced Gluten on Spark + Velox, positioning it against Databricks' Photon. Both Photon and Gluten promise that users can achieve a 2x improvement in Spark SQL task performance without modifying a single line of code, which is very exciting.

Besides these two projects, there is also Apache Arrow DataFusion, a Native Engine developed in Rust, and Blaze, open-sourced by Kuaishou, a Spark Native Engine built on DataFusion. Blaze's community had become inactive and relatively few domestic companies participate, so we kept a conservative wait-and-see attitude (PS: it seems to be active again recently).

02 Introduction to Gluten

Gluten is an open source project initiated by Intel and Kyligence.

Gluten architecture

Gluten is essentially a project developed against Spark's plug-in interfaces; it can be understood as a large library of Spark plug-ins, so it does not need to invade the Spark code base. This is a very good design, also adopted by downstream Spark projects such as Iceberg and Delta, and it ensures the stability of the Spark kernel.

Figure 3 Gluten Overview

As shown in Figure 3, Gluten is a vectorized execution engine that supports multiple Native Engine backends. The community currently develops mainly around the Velox and ClickHouse backends, and may integrate Apache DataFusion or other excellent open source projects in the future. Its core principle consists of two parts:

  • Execution plan passing: Substrait carries the execution plan between the JVM and Native. Substrait can be understood as a cross-language serialization SDK for relational algebra objects based on Google protobuf; Velox also accepts Substrait as execution plan input.
  • Data transfer through the vectorization framework provided by Spark SQL. Community Spark SQL provides a row-based implementation by default: operators operate on Rows. In the vectorized framework, operators operate on a ColumnarBatch, which can contain multiple rows of data; each field in the ColumnarBatch is backed by a ColumnVector data structure.
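The two data models can be sketched in a few lines of Java. The class and field names below (SimpleRow, SimpleBatch) are hypothetical illustrations, not Spark's actual InternalRow / ColumnarBatch / ColumnVector API:

```java
// Illustrative sketch of row-based vs. columnar data layout.
public class LayoutSketch {
    // Row layout: each record is its own object, fields side by side.
    public static final class SimpleRow {
        final long id;
        final double price;
        public SimpleRow(long id, double price) { this.id = id; this.price = price; }
    }

    // Columnar layout: one batch holds many rows; each field's values are
    // stored contiguously in a primitive array acting as a column vector.
    public static final class SimpleBatch {
        final int numRows;
        final long[] idColumn;       // "column vector" for field 0
        final double[] priceColumn;  // "column vector" for field 1
        SimpleBatch(int numRows, long[] idColumn, double[] priceColumn) {
            this.numRows = numRows;
            this.idColumn = idColumn;
            this.priceColumn = priceColumn;
        }
    }

    // Pivot rows into a columnar batch, as a RowToColumnar-style operator would.
    public static SimpleBatch toBatch(SimpleRow[] rows) {
        long[] ids = new long[rows.length];
        double[] prices = new double[rows.length];
        for (int i = 0; i < rows.length; i++) {
            ids[i] = rows[i].id;
            prices[i] = rows[i].price;
        }
        return new SimpleBatch(rows.length, ids, prices);
    }
}
```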

Gluten plugin library

Figure 4 Gluten Plugin

As shown in Figure 4, the Gluten Plugin is implemented on top of Spark's open DriverPlugin and ExecutorPlugin interfaces. Through this entry point, Gluten connects its capabilities to Spark: 1) it loads dynamic link libraries through the Java interface (.so files on Linux, such as gluten.so, velox.so, arrow.so); 2) it dynamically loads Spark plug-ins:

  • SQL Extension: converts execution plans, turning a Spark Plan into a Gluten Plan through the Columnar Rule exposed by Spark
  • CachedBatchSerializer: supports RDD Columnar Cache; the corresponding user interface is dataset.cache
  • ShuffleManager: supports Columnar Shuffle Exchange
  • Gluten UI: displays C++/Java build information, Gluten execution plan fallback information, etc.
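The library-loading step can be sketched as follows. The class and method names (NativeBootstrap, loadNativeLibraries, linuxLibraryFileName) are hypothetical illustrations, not Gluten's actual plugin classes:

```java
// Minimal sketch of the native-library bootstrap idea. In the real system,
// classes implementing Spark's DriverPlugin / ExecutorPlugin perform this
// on startup; the names here are illustrative.
public class NativeBootstrap {
    // On Linux, System.loadLibrary("gluten") resolves libgluten.so from
    // java.library.path; a real plugin loads gluten, velox, arrow, etc.
    public static void loadNativeLibraries(String... names) {
        for (String name : names) {
            System.loadLibrary(name); // throws UnsatisfiedLinkError if missing
        }
    }

    // The platform-specific file name the loader looks for on Linux.
    public static String linuxLibraryFileName(String name) {
        return "lib" + name + ".so";
    }
}
```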

One of the biggest advantages of building a Native Engine on Spark's execution layer, as Gluten does, is that there is no need to reinvent the wheel: it directly inherits Spark's battle-tested code base and future community iterations, such as the robust scheduling framework, speculative execution, task/stage failure retry at different granularities, and rich ecosystem integrations like YARN/K8s. We only need to focus on the SQL performance we care about.

Figure 5 Gluten execution model

As shown in Figure 5, Gluten fully follows Spark's design: a Task runs in a thread as the smallest execution unit. Gluten passes execution plan fragments through JNI calls; the Native side then builds its own Native Task and executes it on the current thread. In other words, every Spark Task has a corresponding Native Task during execution.

What problems does Gluten solve?

Having covered Gluten's basic principles and implementation, you may feel that Gluten does not do much and that implementing a Spark Native Engine is not difficult. Let's take a deeper look at the problems Gluten actually solves.

Execution plan conversion

Figure 6 Execution plan conversion

As mentioned above, Gluten converts execution plans by injecting Columnar Rules into the Spark SQL Extension, and it matches only Spark's physical execution plans to reduce conversion complexity. This conversion process is fully compatible with the Spark AQE framework. We know that at runtime AQE splits the plan at Stage granularity and keeps re-optimizing the parent execution plan based on completed child plans; therefore, every time a new Stage starts, AQE feeds the Columnar Rule a new Spark execution plan fragment. Gluten has a concept called WholeStageTransformer, similar in spirit to Spark's WholeStageCodegen optimization: it converts the entire Stage's execution plan fragment into a Native execution plan via Substrait, so that the whole Stage runs on the Native Engine and no extra data interaction with the JVM is needed during execution.

JVM + Native coexistence execution plan

Figure 7 Coexistence execution plan and rollback

Of course, running the entire Stage on the Native Engine is what we hope for, but reality is rarely so ideal. After all, even Hive and Spark, both implemented on the JVM, sometimes produce different results for the same SQL; with a Native Engine, such differences are only amplified. A typical example is Hive UDFs, which Native cannot run: how could C++ execute code implemented in Java? For operators that are not supported natively, Gluten therefore falls back to Spark. As shown in Figure 7, the left side is an execution plan where JVM and Native coexist. If a Stage contains both operators running on the JVM and operators running on Native, Gluten inserts a ColumnarToRow or RowToColumnar operator between them to bridge the data structures. Note that these operators cost more than Spark's equivalents, because in Gluten they also involve copying data between OffHeap and OnHeap.

The problem then recurs: if operators within a Stage repeatedly jump between the JVM and Native, there will be many ColumnarToRow and RowToColumnar operators, and the Stage's performance will most likely degrade. Gluten therefore also supports Stage-granularity fallback: if a Stage contains multiple ColumnarToRow operators, the entire Stage falls back to the JVM. This guarantees that the Stage's computational performance matches Spark's, avoiding regressions. From this we can also see that the proportion of fallen-back operators is one of the key factors affecting Gluten's performance.
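The Stage-level fallback heuristic can be sketched as a simple counting rule. The class and threshold below are illustrative, not Gluten's actual rule implementation:

```java
// Sketch of a stage-level fallback heuristic: model a stage as a linear
// operator chain, count the JVM<->Native boundaries (each of which would
// become a ColumnarToRow / RowToColumnar bridge), and fall the whole stage
// back to the JVM when there are too many. Names and threshold are illustrative.
public class FallbackPolicy {
    // Each element is true if that operator runs on the Native engine.
    public static int countBridges(boolean[] runsOnNative) {
        int bridges = 0;
        for (int i = 1; i < runsOnNative.length; i++) {
            if (runsOnNative[i] != runsOnNative[i - 1]) bridges++;
        }
        return bridges;
    }

    // Too many bridges: run the entire stage on the JVM instead.
    public static boolean shouldFallBackToJvm(boolean[] runsOnNative, int maxBridges) {
        return countBridges(runsOnNative) > maxBridges;
    }
}
```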

Data interaction

Figure 8 Data interaction

The above describes how Gluten delivers execution plans; here we look at how data flows. As shown in Figure 8, Gluten uses two ColumnarBatch iterators, one maintained by the JVM and one by Native, interacting through JNI. The actual data is kept on the Native side, managed as OffHeap memory; what is exposed to the JVM is only a handle pointing to the ColumnarBatch plus some basic metadata, such as its row count and field count. This makes the whole data flow very lightweight. Only when specific operators such as ColumnarToRow need to change the data's memory layout does Gluten copy the actual data OnHeap, expose it to the JVM, and convert it to UnsafeRow. In other words, two kinds of ColumnarBatch circulate in the JVM: a lightweight one containing only a handle and metadata, and a heavy one containing actual data.
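The lightweight batch can be sketched like this. NativeBatchHandle is a hypothetical name for illustration, not Gluten's actual API:

```java
// Sketch of the "lightweight" ColumnarBatch idea: the JVM object holds only
// an opaque native handle plus metadata, while the column data itself stays
// in OffHeap memory on the Native side.
public class NativeBatchHandle implements AutoCloseable {
    private final long nativeAddress; // opaque pointer returned over JNI
    private final int numRows;
    private final int numColumns;
    private boolean closed = false;

    public NativeBatchHandle(long nativeAddress, int numRows, int numColumns) {
        this.nativeAddress = nativeAddress;
        this.numRows = numRows;
        this.numColumns = numColumns;
    }

    public int numRows() { return numRows; }
    public int numColumns() { return numColumns; }

    @Override
    public void close() {
        if (!closed) {
            closed = true;
            // A real implementation would call a JNI method here to free the
            // OffHeap batch, e.g. nativeRelease(nativeAddress) (hypothetical).
        }
    }
}
```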

Unified memory management

Knowing how the data flows raises a bigger question: how does Gluten manage memory? Data may live in the JVM or in Native, so how are OnHeap and OffHeap memory unified? This underpins Spark's stability guarantees, for example whether SQL operators need to spill and whether RDD blocks can be cached.

Figure 9 Gluten under Spark memory framework

As shown in Figure 9, Gluten fully follows Spark's unified memory management framework. The Spark memory pool is divided into two parts: the Execution Memory Pool, responsible for runtime memory such as that used during SQL sorting/aggregation, and the Storage Memory Pool, responsible for storage memory such as the RDD cache. After Gluten is plugged in, the Execution Memory Pool has three usage scenarios:

  • Spark's own usage: since JVM and Native operators may coexist, Spark operators still need to request memory, the same as community Spark behavior.
  • Arrow Memory Pool: Gluten manages memory through Arrow in the JVM operators it implements. For example, the RowToColumnar operator buffers a batch of Rows to produce a ColumnarBatch, so it must request a block of memory in advance.
  • Velox Memory Pool: the key to unified memory. Velox provides a memory pool to manage allocation and release by Native operators at runtime. By registering a listener, Gluten calls back over JNI into the JVM's Task Memory Consumer, thereby hooking into Spark's memory management framework.
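The callback idea in the last bullet can be sketched as a toy accountant. All names here (MemoryBridge, Spillable) are hypothetical, not the real Velox/Gluten classes: a Native-side reservation request is forwarded to a JVM-side budget, which triggers a spill callback when the request would exceed the task's share of the pool.

```java
// Toy sketch of the callback-based memory bridge: Native operators reserve
// memory through a JVM-side accountant; when over budget, the accountant asks
// the operator to spill before granting (or rejecting) the request.
public class MemoryBridge {
    // Callback the Native operator implements: free up to `bytesRequested`
    // bytes (e.g. by spilling to disk) and return how many bytes were freed.
    public interface Spillable { long spill(long bytesRequested); }

    private final long capacity; // this task's share of the execution pool
    private long used = 0;

    public MemoryBridge(long capacity) { this.capacity = capacity; }

    // Called (via JNI in the real system) when a Native operator reserves memory.
    public boolean reserve(long bytes, Spillable consumer) {
        if (used + bytes > capacity) {
            // Over budget: ask the operator to spill, then re-check.
            used = Math.max(0, used - consumer.spill(bytes));
        }
        if (used + bytes > capacity) return false; // still over: OOM path
        used += bytes;
        return true;
    }

    public void release(long bytes) { used = Math.max(0, used - bytes); }
    public long used() { return used; }
}
```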

Gluten's RDD Columnar Cache uses the Storage Memory Pool. Spark Cache has a StorageLevel attribute, so users can choose to cache data OnHeap or OffHeap; for example, our most commonly used dataset.cache selects OnHeap by default, though we can define a new StorageLevel to explicitly cache OffHeap. With a Native Engine, we also recommend caching OffHeap. Ideally, a Spark task running on a Native Engine should use very little OnHeap memory, with all data-related operators using OffHeap, which also reduces the GC penalty in large-heap scenarios.

03 Gluten Performance

We ran performance tests comparing Gluten and Spark on the TPCDS 1TB data set, as shown in Figure 10. Of the 99 test cases, Gluten beats Spark in 95% of them. Overall, Gluten also performs better than Spark: it runs faster and uses fewer resources.

Comparison item          | Gluten | Spark | Result
Total execution time (s) | 2830   | 7082  | Gluten is 2.5x faster
Total CPU hours          | 339.6  | 655.7 | Gluten saves 48.2%
Peak memory usage (GB)   | 9.3    | 14.1  | Gluten saves 34%

Figure 10 TPCDS-1TB

These numbers are below Intel's official results, which report roughly a 2.8x performance improvement. Many factors affect performance: CPU, shuffle disks, and network cards at the hardware level; operating system version, GCC version, and so on at the software level. This test nonetheless shows that Gluten has very good prospects, but much work remains before production deployment. After all, the TPCDS test cases cannot cover all Spark usage patterns, such as table writes, the DataSource v1/v2 interfaces, Dataset cache, and Hive UDFs.

04 Future plans

Besides NetEase, the Gluten community also has deep participation from Alibaba Cloud, Baidu, Meituan, and other companies, and it iterates rapidly. We hope that, while participating deeply in the Gluten community, we can surface our internal usage scenarios and problems to the community, making them part of the standard and accelerating adoption.

Since a Native Engine has certain operating system dependencies, and some Hadoop clusters running on older hardware still use the aging CentOS 7, this is an obstacle to adoption. We therefore plan to deploy first on K8s clusters, loading the Native Engine from an Ubuntu Docker image, so that Spark SQL tasks are accelerated without users noticing any change.

Finally, teams with ideas are very welcome to get in touch with us.


Origin my.oschina.net/u/4565392/blog/10316086