New features of Spark 3.0

Reprinted from: "After nearly two years, the official version of Apache Spark 3.0.0 is finally released"

Dynamic Partition Pruning

Dynamic partition pruning uses information that is only inferred at runtime to perform additional partition pruning. For example, consider the following query:

SELECT * FROM dim_iteblog 
JOIN fact_iteblog 
ON (dim_iteblog.partcol = fact_iteblog.partcol) 
WHERE dim_iteblog.othercol > 10

Suppose the predicate dim_iteblog.othercol > 10 keeps only a small part of the dim_iteblog table. Because earlier versions of Spark could not evaluate this cost dynamically, the fact_iteblog table might scan a large amount of useless data. With dynamic partition pruning, the useless data in the fact_iteblog table can be filtered out at runtime. After this optimization, the amount of data scanned by the query is greatly reduced, and performance improves by 33 times.

Related configuration:

To enable dynamic partition pruning, set spark.sql.optimizer.dynamicPartitionPruning.enabled to true (the default). Other related parameters (a configuration sketch follows the list below):

  • spark.sql.optimizer.dynamicPartitionPruning.useStats: true (default). When true, distinct count statistics will be used for computing the data size of the partitioned table after dynamic partition pruning, in order to evaluate whether it is worth adding an extra subquery as the pruning filter if broadcast reuse is not applicable.
  • spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio: 0.5 (default). When statistics are not available or configured not to be used, this value is used as the fallback filter ratio for computing the data size of the partitioned table after dynamic partition pruning, in order to evaluate whether it is worth adding an extra subquery as the pruning filter if broadcast reuse is not applicable.
  • spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcast: true (default). When true, dynamic partition pruning will try to reuse the broadcast results from a broadcast hash join operation.
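
As an illustration, the following Scala sketch (not from the original article) builds a SparkSession with these parameters set explicitly and runs the example query; dim_iteblog and fact_iteblog are assumed to already exist as partitioned tables.

import org.apache.spark.sql.SparkSession

// Build a session with dynamic partition pruning explicitly enabled.
// These values are the Spark 3.0 defaults, shown here only for illustration.
val spark = SparkSession.builder()
  .appName("dpp-demo")
  .master("local[*]")
  .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
  .config("spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcast", "true")
  .getOrCreate()

// At runtime the join below can skip fact_iteblog partitions whose partcol values
// do not survive the othercol > 10 filter on dim_iteblog (tables assumed to exist).
spark.sql(
  """SELECT *
    |FROM dim_iteblog
    |JOIN fact_iteblog ON dim_iteblog.partcol = fact_iteblog.partcol
    |WHERE dim_iteblog.othercol > 10""".stripMargin).explain()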

For details, please refer to:
https://www.iteblog.com/archives/8589.html
https://www.iteblog.com/archives/8590.html

Adaptive Query Execution

Adaptive query execution (also called Adaptive Query Optimisation or Adaptive Optimisation) optimizes the query execution plan at runtime: the Spark planner can switch to alternative execution plans based on statistics collected while the query runs.
As early as 2015, the Spark community proposed the basic idea of adaptive execution: an interface for submitting a single map stage was added to Spark's DAGScheduler, and an attempt was made to adjust the number of shuffle partitions at runtime. However, that implementation had certain limitations. In some scenarios it introduced extra shuffles, that is, more stages, and it could not handle three tables being joined in the same stage. It was also difficult for that framework to flexibly support other adaptive-execution features, such as changing the execution plan at runtime or handling skewed joins.

The AQE framework currently provides three features:

  • Dynamically coalesce shuffle partitions;
  • Dynamically switch join strategies;
  • Dynamically optimize skew joins.

Based on a 1TB TPC-DS benchmark without statistics, Spark 3.0 speeds up q77 by a factor of 8, q5 by a factor of 2, and 26 other queries by more than 1.1x. AQE is enabled by setting spark.sql.adaptive.enabled to true; the default value of this parameter is false.
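
The following Scala sketch (my own illustration, not from the article) enables AQE and its sub-features explicitly; the two sub-feature flags already default to true once AQE itself is enabled.

import org.apache.spark.sql.SparkSession

// Turn on AQE and its sub-features.
val spark = SparkSession.builder()
  .appName("aqe-demo")
  .master("local[*]")
  .config("spark.sql.adaptive.enabled", "true")                    // AQE itself, false by default in 3.0
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true") // merge small shuffle partitions
  .config("spark.sql.adaptive.skewJoin.enabled", "true")           // split skewed partitions at join time
  .getOrCreate()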

Accelerator-aware Scheduling

Big data and machine learning are now tightly combined. Because training iterations in machine learning can take a very long time, developers generally use GPUs, FPGAs or TPUs to accelerate the computation. Apache Hadoop 3.1 introduced native support for GPUs and FPGAs, and Spark, as a general-purpose computing engine, is certainly not far behind. Engineers from Databricks, NVIDIA, Google and Alibaba are adding native GPU scheduling support to Apache Spark. This fills the gap in Spark's scheduling of GPU resources, integrates big data processing with AI applications, and expands Spark's application scenarios in deep learning, signal processing and various big data workloads.
The resource managers that Apache Spark supports, YARN and Kubernetes, already support GPUs. For Spark itself to support GPUs, two major changes need to be made at the technical level:

  • At the cluster manager level, the cluster managers need to be upgraded to support GPUs and to provide users with related APIs so that they can control how GPU resources are used and allocated.
  • Inside Spark, the scheduler needs to be modified so that it can recognize GPU requirements in user task requests and then assign GPUs according to the supply available on each executor.
Because GPU support is a relatively large feature for Apache Spark, the project is divided into several stages. Apache Spark 3.0 supports GPUs under the standalone, YARN and Kubernetes resource managers, with essentially no impact on existing normal jobs. Support for TPUs, GPU support in the Mesos resource manager, and GPU support on Windows are not goals of this release. Moreover, fine-grained scheduling inside a single GPU card is not supported in this release; Apache Spark 3.0 treats a GPU card and its memory as an inseparable unit.
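
As an illustration (my own sketch, not from the article), the snippet below shows the kind of resource configuration and task-side API that accelerator-aware scheduling exposes in Spark 3.0; the discovery-script path is hypothetical.

import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

// Request one GPU per executor and one GPU per task; the discovery script
// (hypothetical path) reports the GPU addresses visible on each worker.
val spark = SparkSession.builder()
  .appName("gpu-demo")
  .config("spark.executor.resource.gpu.amount", "1")
  .config("spark.task.resource.gpu.amount", "1")
  .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/bin/getGpus.sh")
  .getOrCreate()

// Inside a task, the GPU addresses assigned to that task can be read back through
// TaskContext, e.g. to pin the computation to those devices.
spark.sparkContext.parallelize(1 to 4, 4).foreach { _ =>
  val gpus = TaskContext.get().resources()("gpu").addresses
  println(s"running on GPU(s): ${gpus.mkString(",")}")
}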

Apache Spark DataSource V2

The Data Source API defines how to read from and write to storage systems, for example Hadoop's InputFormat/OutputFormat and Hive's SerDe. These APIs are well suited to users programming against RDDs in Spark. Although they can solve our problems, the cost for users is still quite high and Spark cannot optimize them. To address this, Spark 1.3 introduced Data Source API V1, through which we can easily read data from various sources while Spark applies some of the SQL optimizer's techniques to data source reads, such as column pruning and filter push-down.
Data Source API V1 abstracts a series of interfaces for us, and most of the scenarios can be realized by using these interfaces. However, as the number of users increased, some problems gradually emerged:

  • Some of the interfaces depend on SQLContext and DataFrame
  • Limited extensibility; it is difficult to push down other operators
  • Lack of support for columnar reads
  • Lack of partitioning and sorting information
  • Write operations do not support transactions
  • No support for stream processing

To address these problems with Data Source V1, the community introduced Data Source API V2 starting from Apache Spark 2.3.0. Besides keeping the original functionality, it also resolves several of the V1 issues: for example, it no longer depends on the upper-level APIs and its extensibility is enhanced.
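
To make the V2 interfaces concrete, here is a minimal read-only sketch against the Spark 3.0 connector API (my own illustration, not from the article; all class and package names are hypothetical). It exposes a tiny in-memory table of (id, name) rows.

package com.example.datasourcev2

import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read._
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
import org.apache.spark.sql.util.CaseInsensitiveStringMap
import org.apache.spark.unsafe.types.UTF8String

// Entry point: Spark instantiates this class when the data source is referenced by name.
class SimpleSourceV2 extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType = SimpleTable.schema

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = new SimpleTable
}

object SimpleTable {
  val schema: StructType = new StructType().add("id", IntegerType).add("name", StringType)
}

// The logical table: declares its schema and that it supports batch reads.
class SimpleTable extends Table with SupportsRead {
  override def name(): String = "simple_table"
  override def schema(): StructType = SimpleTable.schema
  override def capabilities(): util.Set[TableCapability] = util.EnumSet.of(TableCapability.BATCH_READ)

  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new ScanBuilder { override def build(): Scan = new SimpleScan }
}

// The physical read: a single input partition and a reader factory for it.
class SimpleScan extends Scan with Batch {
  override def readSchema(): StructType = SimpleTable.schema
  override def toBatch(): Batch = this
  override def planInputPartitions(): Array[InputPartition] = Array(new SimpleInputPartition)
  override def createReaderFactory(): PartitionReaderFactory = new SimpleReaderFactory
}

class SimpleInputPartition extends InputPartition

class SimpleReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
    new SimplePartitionReader
}

// Runs on the executors and produces the rows of the single partition.
class SimplePartitionReader extends PartitionReader[InternalRow] {
  private val data = Seq((1, "spark"), (2, "iteblog"))
  private var index = -1

  override def next(): Boolean = { index += 1; index < data.length }
  override def get(): InternalRow = {
    val (id, name) = data(index)
    InternalRow(id, UTF8String.fromString(name))
  }
  override def close(): Unit = {}
}

With this on the classpath, spark.read.format("com.example.datasourcev2.SimpleSourceV2").load() returns the two rows; pushdown, columnar reads, transactional writes and streaming would be added by implementing the corresponding mix-in interfaces.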

For details, please refer to:
https://www.iteblog.com/archives/2578.html
https://www.iteblog.com/archives/2579.html

Rich API and functions

  • Enhanced pandas UDF

  • A complete set of join hints
    Although the community keeps improving the compiler's intelligence, it cannot guarantee that the compiler always makes the best decision for every situation. The choice of join algorithm is based on statistics and heuristics, and when the compiler cannot make the best choice, users can still use join hints to steer the optimizer towards a better plan. Apache Spark 3.0 extends the existing join hints by adding new ones: SHUFFLE_MERGE, SHUFFLE_HASH and SHUFFLE_REPLICATE_NL (see the sketch after this list).

  • New built-in functions
    The Scala API adds 32 new built-in and higher-order functions. Among them, a set of MAP-specific built-in functions [transform_keys, transform_values, map_entries, map_filter, map_zip_with] simplifies the handling of the MAP data type.
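
A small Scala sketch (my own illustration) of both items above: a SHUFFLE_HASH join hint expressed through the DataFrame hint API, and the map_filter higher-order function used through SQL.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hints-demo").master("local[*]").getOrCreate()
import spark.implicits._

val orders = Seq((1, 100.0), (2, 50.0)).toDF("cust_id", "amount")
val customers = Seq((1, "alice"), (2, "bob")).toDF("cust_id", "name")

// Ask the optimizer to build a shuffle hash join with `customers` as the build side.
orders.join(customers.hint("shuffle_hash"), "cust_id").explain()

// map_filter keeps only the entries of a MAP whose value is greater than 1.
spark.sql("SELECT map_filter(map('a', 1, 'b', 2, 'c', 3), (k, v) -> v > 1) AS filtered").show(false)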

Enhanced monitoring features

  • Redesigned Structured Streaming UI
    Structured Streaming was originally introduced in Spark 2.0. Spark 3.0 redesigns the UI used to monitor these streaming jobs.
  • Enhanced EXPLAIN command
    Reading plans is very important for understanding and tuning queries. The existing output can be very confusing: the string representation of each operator may be very wide and may even be truncated. Spark 3.0 enhances it with a new FORMATTED mode and also provides the ability to dump the plan to a file (see the sketch after this list).
  • Observable metrics
    Continuously monitoring changes in data quality is a highly desirable feature for managing data pipelines. Spark 3.0 introduces this capability for both batch and streaming applications. Observable metrics are named arbitrary aggregate functions that can be defined on a query (a DataFrame). As soon as the execution of the DataFrame reaches a completion point (for example, a batch query finishes), a named event is emitted that contains the metrics for the data processed since the last completion point.
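
A Scala sketch (my own illustration) of the last two items: the new FORMATTED explain mode and an observable metric attached to a DataFrame.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("monitoring-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, 10.0), (2, -3.0), (3, 7.5)).toDF("id", "amount")

// New in 3.0: an explicit explain mode; FORMATTED separates the plan outline
// from the per-operator details.
df.groupBy($"id").sum("amount").explain("formatted")

// Observable metrics: name a set of aggregate expressions on the DataFrame; the
// computed values are reported through query listener events once execution
// reaches a completion point.
val observed = df.observe("amount_metrics", count(lit(1)).as("rows"), sum($"amount").as("total"))
observed.collect()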

Better ANSI SQL compatibility

PostgreSQL is one of the most advanced open source databases. It supports most of the major features of SQL:2011; of the 179 features required for full SQL:2011 conformance, PostgreSQL meets at least 160. The Spark community has an umbrella issue, SPARK-27764, to close the gaps between Spark SQL and PostgreSQL, covering missing functionality, bug fixes, and so on. The functional additions include supporting some ANSI SQL functions, distinguishing SQL reserved keywords, and adding built-in functions. This issue has 231 sub-issues; once they are resolved, the gap between Spark SQL and PostgreSQL or ANSI SQL:2011 will be even smaller.
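
Related to this effort, Spark 3.0 also ships an ANSI mode switch. A minimal Scala sketch (my own illustration; the overflow behaviour is assumed from the 3.0 ANSI compliance documentation):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ansi-demo").master("local[*]").getOrCreate()

// Switch Spark SQL into ANSI-compliant mode (false by default in 3.0).
spark.conf.set("spark.sql.ansi.enabled", "true")

// Under ANSI mode, integer overflow is expected to raise an error instead of
// silently wrapping around (assumed behaviour, hence the try/catch).
try {
  spark.sql("SELECT 2147483647 + 1").show()
} catch {
  case e: Exception => println(s"ANSI mode rejected the overflow: ${e.getMessage}")
}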

Other

  • SparkR vectorized read and write
  • Kafka streaming: the new includeHeaders option exposes Kafka record headers in the messages (see the sketch after this list)
  • Spark on K8S: Spark's support for Kubernetes started in version 2.3, was improved in Spark 2.4, and Spark 3.0 adds support for Kerberos and dynamic resource allocation
  • Remote Shuffle Service: the current shuffle has many problems, such as poor flexibility, a heavy impact on the NodeManager, and poor suitability for cloud environments. To solve these problems, a Remote Shuffle Service will be introduced; see SPARK-25299 for details
  • Support JDK 11: see SPARK-24417. JDK 11 was chosen directly because JDK 8 is approaching EOL (end of life) and JDK 9 and JDK 10 are already EOL, so the community skips JDK 9 and JDK 10 and supports JDK 11 directly. However, the Spark 3.0 preview version still uses JDK 1.8 by default
  • Remove support for Scala 2.11, support Scala 2.12 by default, see SPARK-26132 for details
  • Support Hadoop 3.2, see SPARK-23534 for details. Hadoop 3.0 has been out for two years (Apache Hadoop 3.0.0-beta1 was officially released, and the following GA release could be used in production), so supporting Hadoop 3.x is natural. However, the Spark 3.0 preview version still uses Hadoop 2.7.4 by default.
  • Remove Python 2.x support: As early as June 2019, there was a related discussion in the community regarding the removal of Python 2 support in Spark 3.0. Currently, Spark 3.0.0 supports Python 3.x by default. See SPARK-27884.
  • Spark Graph supports Cypher: Cypher is a popular graph query language, and now we can use Cypher directly in Spark 3.0.
  • Spark event logs support rolling, see "Spark 3.0 finally supports event logs rolling"
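
A Scala sketch (my own illustration) of the Kafka includeHeaders option mentioned above; the broker address and topic name are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-headers-demo").getOrCreate()

// Read a Kafka topic and expose the record headers as an extra `headers` column
// (an array of key/value structs). Broker and topic are hypothetical.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "iteblog_topic")
  .option("includeHeaders", "true")
  .load()

stream.select("key", "value", "headers").writeStream.format("console").start()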

Original article: blog.csdn.net/weixin_44455388/article/details/107782152