The official version of Apache Spark 3.0.0 is finally released, with a comprehensive analysis of important features

Past memory big data

Originally planned for release at the end of 2019, Apache Spark 3.0.0 was finally released today, just ahead of the Spark + AI Summit to be held next Tuesday. Development of Apache Spark 3.0.0 started on October 2, 2018, nearly 21 months ago. This release went through two preview versions and three rounds of voting:

• First preview release on November 6, 2019, see Preview release of Spark 3.0 [1]
• Second preview release on December 23, 2019, see Preview release of Spark 3.0 [2]
• [VOTE] Apache Spark 3.0.0 RC1 on March 21, 2020 [3]
• [VOTE] Apache Spark 3.0 RC2 on May 18, 2020 [4]
• [VOTE] Apache Spark 3.0 RC3 on June 6, 2020 [5]
Apache Spark 3.0 adds many exciting new features, including Dynamic Partition Pruning, Adaptive Query Execution, Accelerator-aware Scheduling, a Data Source API with catalog support, vectorization in SparkR, and support for Hadoop 3/JDK 11/Scala 2.12. In total, this release resolves more than 3,400 issues.

The distribution of these 3,400+ issues across Spark's components is as follows:

(Chart: distribution of resolved issues by Spark component)
The main features in Apache Spark 3.0.0 are as follows:

(Chart: overview of the major features in Spark 3.0.0)

I have already covered these major features in this blog's category, and interested readers can check those posts out. Below is a quick look at the more important new features of Spark 3.0; for a more comprehensive list, see the official Spark Release 3.0.0 notes [6].

Dynamic Partition Pruning

Dynamic partition pruning uses information inferred at runtime to prune partitions further. For example, consider the following query:


SELECT * FROM dim_iteblog 
JOIN fact_iteblog 
ON (dim_iteblog.partcol = fact_iteblog.partcol) 
WHERE dim_iteblog.othercol > 10

Suppose the filter dim_iteblog.othercol > 10 leaves only a small portion of the dim_iteblog table. Because earlier versions of Spark could not compute this cost dynamically, the fact_iteblog table might end up scanning a large amount of useless data. With dynamic partition pruning, the irrelevant data in the fact_iteblog table can be filtered out at runtime. After this optimization, the amount of data scanned by the query is greatly reduced, improving performance by up to 33 times.
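
To make this concrete, here is a hedged, minimal Scala sketch: the table layout and data are purely illustrative (fact_iteblog is partitioned by partcol so the scan can actually be pruned), and the configuration key shown is the Spark 3.0 switch for this feature, which defaults to true.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dpp-demo").getOrCreate()
import spark.implicits._

// Build two small demo tables; fact_iteblog is partitioned by partcol so that
// its scan can actually be pruned at runtime.
Seq((1, 5), (2, 20), (3, 30)).toDF("partcol", "othercol")
  .write.mode("overwrite").saveAsTable("dim_iteblog")
spark.range(0, 10000).selectExpr("id", "id % 3 AS partcol")
  .write.mode("overwrite").partitionBy("partcol").saveAsTable("fact_iteblog")

// Dynamic partition pruning is enabled by default in Spark 3.0.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

val result = spark.sql(
  """SELECT * FROM dim_iteblog
    |JOIN fact_iteblog ON (dim_iteblog.partcol = fact_iteblog.partcol)
    |WHERE dim_iteblog.othercol > 10""".stripMargin)

// The fact_iteblog scan should now show a dynamicpruningexpression among its
// PartitionFilters in the physical plan.
result.explain(true)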

In the TPC-DS benchmark test, 60 out of 102 queries got a speedup of 2 to 18 times.


The issues corresponding to this feature are SPARK-11150 and SPARK-28888. The past memory big data official account also covered this feature in detail a while back; for details, see "Introduction to Apache Spark 3.0 Dynamic Partition Pruning" and "Use of Apache Spark 3.0 Dynamic Partition Pruning".

Adaptive Query Execution

Adaptive query execution (also called Adaptive Query Optimisation or Adaptive Optimisation) optimizes query execution plans by allowing the Spark planner to choose among alternative execution plans at runtime, based on runtime statistics.

As early as 2015, the Spark community proposed the basic idea of adaptive execution: an interface for submitting a single map stage was added to Spark's DAGScheduler, and an attempt was made to adjust the number of shuffle partitions at runtime. However, that implementation had certain limitations. In some scenarios it introduced additional shuffles (and therefore additional stages), and it could not handle cases such as three tables being joined in the same stage. It was also difficult, within that framework, to flexibly implement other adaptive-execution features such as changing the execution plan or handling skewed joins at runtime. As a result, the feature remained experimental and its configuration parameters were never mentioned in the official documentation. The idea came mainly from engineers at Intel and Baidu; see SPARK-9850 for details, and the corresponding article "Apache Spark SQL Adaptive Execution Practice".

The Adaptive Query Execution (AQE) in Apache Spark 3.0 builds on the ideas of SPARK-9850; see SPARK-23128 for details. The goal of SPARK-23128 is to implement a flexible framework for adaptive execution in Spark SQL and to support changing the number of reducers at runtime. The new implementation resolves all of the limitations discussed above, and other capabilities (such as changing the join strategy and handling skewed joins) will be implemented as separate features and provided as plug-ins in later versions.

The AQE framework currently provides three functions:

• Dynamically merge shuffle partitions;
• Dynamically adjust the join strategy;
• Dynamically optimize skew joins (skew joins).
In the 1 TB TPC-DS benchmark without precomputed statistics, Spark 3.0 makes q77 about 8x faster, q5 about 2x faster, and 26 other queries more than 1.1x faster. AQE can be enabled by setting the SQL configuration spark.sql.adaptive.enabled to true; its default value is false.
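
As a hedged sketch of how a user would turn this on (the configuration keys below are the Spark 3.0 names for the AQE framework and its sub-features; the demo data is made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("aqe-demo").getOrCreate()
import spark.implicits._

// Enable the AQE framework (off by default in Spark 3.0.0).
spark.conf.set("spark.sql.adaptive.enabled", "true")
// Dynamically coalesce shuffle partitions after each stage.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
// Dynamically split skewed partitions in sort-merge joins.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

// With AQE on, Spark can, for example, switch to a broadcast hash join at
// runtime once the real size of the smaller side is known.
val orders = spark.range(0, 1000000).selectExpr("id", "id % 100 AS cust_id")
val customers = Seq((1L, "iteblog"), (2L, "spark")).toDF("id", "name")
orders.join(customers, orders("cust_id") === customers("id")).explain()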


Accelerator-aware Scheduling

Nowadays, big data and machine learning are increasingly combined. In machine learning, because compute iterations can take a very long time, developers often use GPUs, FPGAs or TPUs to accelerate the computation. Apache Hadoop 3.1 added native support for GPUs and FPGAs, and Spark, as a general-purpose compute engine, is certainly not far behind. Engineers from Databricks, NVIDIA, Google and Alibaba have added native GPU scheduling support to Apache Spark. This work fills the gap in Spark's scheduling of GPU resources, organically integrates big data processing with AI applications, and expands Spark's application scenarios in deep learning, signal processing and many other big data workloads. The issue for this work is SPARK-24615, and the related SPIP (Spark Project Improvement Proposals) document can be found at SPIP: Accelerator-aware scheduling [7].


Currently, YARN and Kubernetes, the resource managers supported by Apache Spark, already support GPUs. For Spark itself to support GPUs, two main changes are needed at the technical level:

• At the cluster manager level, the cluster managers need to be upgraded to support GPUs and to expose related APIs so that users can control the usage and allocation of GPU resources.
• Within Spark, the scheduler needs to be modified so that it can recognize the GPU demand in user task requests and then complete the allocation based on the GPU supply on each executor.
Because adding GPU support to Apache Spark is a relatively large feature, the project is divided into several stages. Apache Spark 3.0 supports GPUs under the standalone, YARN and Kubernetes resource managers, with essentially no impact on existing normal jobs. Support for TPUs, GPU support under the Mesos resource manager, and GPU support on Windows are not goals of this version. Moreover, fine-grained scheduling within a single GPU card is not supported; Apache Spark 3.0 treats a GPU card and its memory as an inseparable unit. For details, see the article "Apache Spark 3.0 Will Built-in Support for GPU Scheduling" from the past memory big data official account.
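
A hedged sketch of what this looks like for a user on Spark 3.0 follows; the resource amounts and the discovery-script path are placeholders, while the properties and the TaskContext.resources() API are the accelerator-aware scheduling interfaces that ship in this release.

// spark-submit style configuration (placeholder values):
//   --conf spark.executor.resource.gpu.amount=2 \
//   --conf spark.task.resource.gpu.amount=1 \
//   --conf spark.executor.resource.gpu.discoveryScript=/path/to/getGpusResources.sh

import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("gpu-demo").getOrCreate()
val sc = spark.sparkContext

// Inside a task, read which GPU addresses the scheduler assigned to it.
val assignments = sc.parallelize(1 to 4, 2).map { i =>
  val gpus = TaskContext.get().resources()("gpu").addresses
  s"element $i handled with GPU(s) ${gpus.mkString(",")}"
}.collect()

assignments.foreach(println)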

Apache Spark DataSource V2

The Data Source API defines the interfaces for reading data from and writing data to storage systems, such as Hadoop's InputFormat/OutputFormat and Hive's SerDe. These APIs are a good fit for users programming against RDDs in Spark. Although they get the job done, the cost to users is still quite high and Spark cannot optimize such access. To solve these problems, Spark 1.3 introduced Data Source API V1, through which we can easily read data from various sources while Spark applies some of the SQL engine's optimizations to the reads, such as column pruning and filter push-down.


Data Source API V1 abstracts a set of interfaces that cover most scenarios. However, as the number of users grew, several problems gradually emerged:

• Some interfaces depend on SQLContext and DataFrame
• Limited extensibility; it is hard to push down other operators
• Lack of support for columnar (vectorized) reads
• Lack of partitioning and sorting information
• Write operations do not support transactions
• No support for stream processing
To address these problems of Data Source V1, the community introduced Data Source API V2 starting with Apache Spark 2.3.0. Besides keeping the original functionality, it fixes several of the V1 problems, such as no longer depending on the upper-level APIs and providing stronger extensibility. The issue for Data Source API V2 is SPARK-15689. Although the feature already appeared in the Spark 2.x line, it was not very stable, so the community opened two umbrella issues for the stability and new features of Spark DataSource API V2: SPARK-25186 and SPARK-22386. The stable version of Spark DataSource API V2 ships with Apache Spark 3.0.0 and is one of this release's major new features.
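
To give a feel for the new API, below is a hedged, minimal read-only connector sketch against the Spark 3.0 connector interfaces; the package, class names and the fixed integer data are made up, and options handling, pushdown and write support are omitted.

package com.example.datasource

import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read._
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class SimpleSourceV2 extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    StructType(Seq(StructField("value", IntegerType)))

  override def getTable(schema: StructType,
                        partitioning: Array[Transform],
                        properties: util.Map[String, String]): Table =
    new SimpleTable(schema)
}

class SimpleTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "simple_source"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ)

  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new ScanBuilder with Scan with Batch {
      override def build(): Scan = this
      override def readSchema(): StructType = tableSchema
      override def toBatch(): Batch = this
      // Two partitions, each producing a small range of integers.
      override def planInputPartitions(): Array[InputPartition] =
        Array(SimpleRange(0, 5), SimpleRange(5, 10))
      override def createReaderFactory(): PartitionReaderFactory = new SimpleReaderFactory
    }
}

case class SimpleRange(start: Int, end: Int) extends InputPartition

class SimpleReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
    val range = partition.asInstanceOf[SimpleRange]
    new PartitionReader[InternalRow] {
      private var current = range.start - 1
      override def next(): Boolean = { current += 1; current < range.end }
      override def get(): InternalRow = InternalRow.fromSeq(Seq(current))
      override def close(): Unit = ()
    }
  }
}

Assuming these classes are on the classpath, such a source could then be read with spark.read.format("com.example.datasource.SimpleSourceV2").load().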

For a more detailed introduction to Apache Spark DataSource V2, see "Apache Spark DataSource V2 Introduction and Getting Started Programming Guide (Part 1)" and "(Part 2)" from the past memory big data official account.

Rich API and functions

To meet new use cases and simplify the development of Spark applications, Apache Spark 3.0 provides new features and enhances existing ones.

Enhanced pandas UDF

Pandas UDF was originally introduced in Spark 2.3 to extend UDFs in PySpark and integrate the pandas API into PySpark applications. However, as more UDF types were added, the existing interface became hard to understand. To solve this problem, Spark 3.0 introduces a new pandas UDF interface that uses Python type hints. This release adds two new pandas UDF types, iterator of series to iterator of series and iterator of multiple series to iterator of series, as well as three new pandas function APIs: grouped map, map and co-grouped map. For a detailed introduction, see "Apache Spark 3.0's new Pandas UDF and Python Type Hints" from the past memory big data official account: https://www.iteblog.com/archives/9814.html.

A complete set of join hints

Although the community keeps making the optimizer smarter, there is no guarantee that it will always make the best decision in every situation. The choice of join algorithm is based on statistics and heuristics; when the optimizer cannot make the best choice, users can still use join hints to influence it to pick a better plan. Apache Spark 3.0 extends the existing join hints with new hints: SHUFFLE_MERGE, SHUFFLE_HASH and SHUFFLE_REPLICATE_NL.
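
A hedged example of the new hints (the tables are tiny made-up temp views, so the hints here only illustrate the syntax):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-hints").getOrCreate()
import spark.implicits._

Seq((1, "a"), (2, "b")).toDF("id", "v1").createOrReplaceTempView("t1")
Seq((1, "x"), (2, "y")).toDF("id", "v2").createOrReplaceTempView("t2")

// SQL-style hints added in Spark 3.0.
spark.sql("SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id").explain()
spark.sql("SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id").explain()

// Equivalent Dataset API form.
spark.table("t1").hint("shuffle_hash").join(spark.table("t2"), "id").explain()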

New built-in functions

32 new built-in functions and higher-order functions have been added to the Scala API. Among them, a set of MAP-specific built-in functions (transform_keys, transform_values, map_entries, map_filter, map_zip_with) has been added to simplify processing of the MAP data type.
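
A hedged example using a few of these MAP functions from the Scala API (the column name and data are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("map-funcs").getOrCreate()
import spark.implicits._

val df = Seq(Map("a" -> 1, "b" -> 2, "c" -> 3)).toDF("m")

df.select(
  map_filter($"m", (k, v) => v > 1).as("filtered"),         // keep entries with value > 1
  transform_values($"m", (k, v) => v * 10).as("scaled"),     // multiply every value by 10
  transform_keys($"m", (k, v) => upper(k)).as("upper_keys")  // upper-case the keys
).show(false)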

Enhanced monitoring function

Apache Spark 3.0 also includes many enhancements to monitoring, which make monitoring more comprehensive and stable without a significant impact on performance. They fall into the following three areas.

Redesigned Structured Streaming UI

Structured streaming was first introduced in Spark 2.0. Spark 3.0 redesigned the UI for monitoring these streaming jobs. This new UI provides two sets of statistics:

• Aggregated information of completed stream query jobs
• Detailed statistical information of stream queries, including Input Rate, Process Rate, Input Rows, Batch Duration, Operation Duration, etc.

Enhanced EXPLAIN command

Being able to read query plans is very important for understanding and tuning queries. The existing output can be confusing: the string representation of each operator can be very wide and may even be truncated. Spark 3.0 enhances this with a new FORMATTED explain mode, and also adds the ability to dump the plan to a file.
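
A hedged example, assuming an existing SparkSession named spark:

spark.range(10).createOrReplaceTempView("t")

// New SQL explain mode: a compact plan tree plus a per-operator breakdown.
spark.sql("EXPLAIN FORMATTED SELECT id FROM t WHERE id > 5").show(truncate = false)

// Equivalent Dataset API form added in Spark 3.0: explain(mode).
spark.sql("SELECT id FROM t WHERE id > 5").explain("formatted")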

Observable metrics

Continuously monitoring changes in data quality is a highly desirable feature for managing data pipelines. Spark 3.0 introduces this capability for both batch and streaming applications. Observable metrics are named arbitrary aggregate functions that can be defined on a query (DataFrame). As soon as the execution of the DataFrame reaches a completion point (for example, a batch query finishes), a named event is emitted containing the metrics for the data processed since the last completion point.
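
A hedged sketch on a batch query, assuming an existing SparkSession named spark (the metric and column names are made up):

import org.apache.spark.sql.functions.{col, count, lit, max}

val observed = spark.range(100).toDF("value")
  .observe("my_metrics", count(lit(1)).as("rows"), max(col("value")).as("max_value"))

// Reaching a completion point (the action below) emits a named event carrying
// the collected metrics; batch queries can receive it with a
// QueryExecutionListener, streaming queries with a StreamingQueryListener.
observed.collect()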

Better ANSI SQL compatibility

PostgreSQL is one of the most advanced open source databases and supports most of the major features of SQL:2011; of the 179 features required for full SQL:2011 conformance, PostgreSQL satisfies at least 160. The Spark community opened the umbrella issue SPARK-27764 to address the differences between Spark SQL and PostgreSQL, covering functional features, bug fixes and more. The work includes adding functions required by ANSI SQL and distinguishing SQL reserved keywords from built-in functions. The issue has 231 sub-issues; once this work is done, the gap between Spark SQL and PostgreSQL / ANSI SQL:2011 will be even smaller.
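
As a hedged illustration of the related ANSI switch that ships in Spark 3.0 (assuming an existing SparkSession named spark; the behavior notes are paraphrased from the SQL configuration documentation):

// ANSI mode is off by default in Spark 3.0.
spark.conf.set("spark.sql.ansi.enabled", "true")

// With ANSI mode on, ANSI reserved keywords cannot be used as identifiers and
// invalid operations such as integer overflow raise errors at runtime instead
// of silently returning wrong results, e.g.:
// spark.sql("SELECT CAST(2147483647 AS INT) + 1").show()  // ArithmeticException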

SparkR vectorized read and write

Spark has supported the R language since version 1.4. The architecture of the interaction between Spark and R at that time was as follows:

(Diagram: the original architecture of the interaction between SparkR and the JVM)


Whenever we use R to interact with a Spark cluster, the data has to go through the JVM and therefore cannot avoid being serialized and deserialized. When the amount of data is large, this performs very poorly!

Apache Spark has already applied vectorized optimizations in many places, such as its internal columnar format, Parquet/ORC vectorized reads and Pandas UDFs, and vectorization can greatly improve performance. SparkR vectorization allows users to keep their existing code as is, but when they execute R native functions or convert between Spark DataFrames and R data.frames, performance can improve by up to thousands of times. This work can be found in SPARK-26759. The new architecture is as follows:

(Diagram: the new SparkR architecture based on Apache Arrow)

As can be seen, SparkR vectorization is built on Apache Arrow, which makes data exchange between the two systems very efficient and avoids the cost of serializing and deserializing data, so the performance of the interaction between SparkR and Spark is greatly improved.

Kafka Streaming: includeHeaders

Since version 0.11.0.0, Apache Kafka has supported attaching header information to records; see KIP-82 - Add Record Headers and the corresponding issue KAFKA-4208 for details. These headers are very useful in some scenarios, and Spark 3.0.0 adds support for reading them to cover such use cases; see SPARK-23539. Usage is as follows:


val df = spark 
            .readStream 
            .format("kafka") 
            .option("kafka.bootstrap.servers", "host1:port1,host2:port2") 
            .option("subscribe", "topic1")
            .option("includeHeaders", "true")
            .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
  .as[(String, String, Array[(String, Array[Byte])])]

Other

• Spark on K8S: Spark's support for Kubernetes started in version 2.3 and was improved in Spark 2.4; Spark 3.0 adds support for Kerberos and dynamic resource allocation.
• Remote Shuffle Service: the current shuffle implementation has many problems, such as poor elasticity, a heavy impact on the NodeManager, and unsuitability for cloud environments. To solve these problems, a remote shuffle service will be introduced; see SPARK-25299 for details.
• Support for JDK 11: see SPARK-24417. JDK 11 was chosen directly because JDK 8 is approaching EOL (end of life) while JDK 9 and JDK 10 are already EOL, so the community skipped JDK 9 and JDK 10 and went straight to JDK 11. Note, however, that the Spark 3.0 preview versions still used JDK 1.8 by default.
• Support for Scala 2.11 is removed and Scala 2.12 is supported by default, see SPARK-26132 for details.
• Support for Hadoop 3.2, see SPARK-23710 for details. Hadoop 3.0 has been out for two years (Apache Hadoop 3.0.0-beta1 was officially released, with the next GA version usable in production), so supporting Hadoop 3 is only natural; the Spark 3.0 preview versions still used Hadoop 2.7.4 by default.
• Removal of Python 2.x support: as early as June 2019 the community discussed removing Python 2 support in Spark 3.0. Spark 3.0.0 now supports Python 3.x by default, see SPARK-27884.
• Spark Graph supports Cypher: Cypher is a popular graph query language, and it can now be used directly in Spark 3.0.
• Spark event logs support rolling, see "Spark 3.0 finally supports event logs rolling".

Reference links

[1] Preview release of Spark 3.0: https://spark.apache.org/news/spark-3.0.0-preview.html
[2] Preview release of Spark 3.0: https://spark.apache.org/news/spark-3.0.0-preview2.html
[3] [VOTE] Apache Spark 3.0.0 RC1: https://www.mail-archive.com/dev@spark.apache.org/msg25781.html
[4] [VOTE] Apache Spark 3.0 RC2: https://www.mail-archive.com/dev@spark.apache.org/msg26040.html
[5] [VOTE] Apache Spark 3.0 RC3: https://www.mail-archive.com/dev@spark.apache.org/msg26119.html

[6] Spark Release 3.0.0: https://spark.apache.org/releases/spark-release-3-0-0.html

[7] SPIP: Accelerator-aware scheduling: https://issues.apache.org/jira/secure/attachment/12960252/SPIP_%20Accelerator-aware%20scheduling.pdf
