[Big Data] Apache Spark 3.3.0 officially released: new features explained in detail

Introduction

Development of Apache Spark 3.3.0 officially began on July 3, 2021. After nearly a year, it was officially released on June 16, 2022, simultaneously with Databricks Runtime 11.0. A total of about 1,600 issues (JIRA tickets) were resolved in this release. Thanks to the Apache Spark community for their valuable contributions to the Spark 3.3 release.


PySpark's monthly downloads on PyPI have rapidly grown to 21 million, and Python is now the most popular API language; monthly PySpark downloads have doubled compared to the same period last year. In addition, Spark sees over 24 million monthly downloads via Maven. Spark has become the most widely used scalable computing engine.

Spark 3.3 continues the goal of making Spark more unified, simpler, faster, and more scalable. Toward this goal, Spark 3.3 adds the following features:

Improve join query performance with Bloom filters, with speedups of up to 10x.
Broader Pandas API coverage; for example, this version adds datetime.timedelta and merge_asof.
Simplify migration from traditional data warehouses with improved ANSI compliance and dozens of new built-in functions.
Improve developer productivity with better error handling, autocompletion, performance improvements, and profiling.


Features in detail

Performance improvements

Bloom Filter Joins
Spark can inject and push down Bloom filters into the query plan as needed, filtering data early and reducing the size of intermediate data in shuffles and later computation. Bloom filters are row-level runtime filters that complement dynamic partition pruning (DPP) and dynamic file pruning (DFP), covering cases where dynamic file skipping is not effective or thorough enough. As shown in the figure below, the community ran the TPC-DS benchmark on three different data source variants: untuned Delta Lake, tuned Delta Lake, and raw Parquet files. With the Bloom filter feature enabled, queries ran up to about 10x faster. Where there is no storage tuning or accurate statistics, such as untuned Delta Lake sources or raw Parquet-based sources, the performance gains are even larger. In these cases, Bloom filters make query performance more robust without storage or statistics tuning. See SPARK-32268 for Bloom Filter Joins.
[Figure: TPC-DS benchmark results with Bloom filter joins, across untuned Delta Lake, tuned Delta Lake, and raw Parquet data sources]
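As a minimal sketch of trying this feature, assuming the runtime Bloom filter configuration key below (it is not enabled by default in 3.3.0) and using hypothetical tables:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bloom-filter-join-demo").getOrCreate()

# Assumption: this Spark 3.3 config injects runtime Bloom filters into join plans;
# it is off by default in 3.3.0.
spark.conf.set("spark.sql.optimizer.runtime.bloomFilter.enabled", "true")

# Hypothetical tables: a large "fact" table joined with a selective "dimension" filter.
facts = spark.range(0, 10_000_000).withColumnRenamed("id", "item_id")
dims = spark.range(0, 1_000).withColumnRenamed("id", "item_id").where("item_id % 7 = 0")

joined = facts.join(dims, "item_id")
joined.explain()  # look for runtime Bloom filter nodes in the physical plan
```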

Query Execution Improvements

There are several Adaptive Query Execution (AQE) improvements in this release:

Support propagating empty relations through Aggregate/Union, see SPARK-35442 for details;
Optimize one-row query plans in both the normal and the AQE optimizer, see SPARK-38162 for details;
Support eliminating limits in the AQE optimizer.
Whole-stage codegen has also been improved in several ways:

Full outer sort-merge join supports code generation, see SPARK-35352 for details, with performance improved by 20%~30%;
Full outer shuffled hash join supports code generation, see SPARK-32567 for details, with performance improved by 10%~20%;
Existence sort-merge join supports code generation, see SPARK-37316 for details;
Sort aggregate without grouping keys supports code generation, see SPARK-37564 for details.
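These are internal optimizer and code-generation changes rather than new user-facing APIs, but as a hedged sketch, this is how one might confirm that AQE and whole-stage codegen are active and inspect the resulting plan (both are enabled by default in recent releases):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Both features are enabled by default in Spark 3.x; set here only to make it explicit.
spark.conf.set("spark.sql.adaptive.enabled", "true")    # adaptive query execution
spark.conf.set("spark.sql.codegen.wholeStage", "true")  # whole-stage code generation

left = spark.range(0, 1_000_000)
right = spark.range(0, 1_000).where("id % 2 = 0")

# AQE rewrites the plan at runtime (e.g. coalescing shuffle partitions, switching
# join strategies); "AdaptiveSparkPlan" appears in the explain output.
left.join(right, "id").groupBy().count().explain()
```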
Spark Parquet vectorized readers support nested types
This enhancement adds support for complex types such as structs, arrays, and maps in Spark's vectorized Parquet reader. Microbenchmarks show that Spark improves performance by an average of about 15x when scanning struct fields, and by an average of about 1.5x when reading arrays whose elements are structs or maps. For the performance tests, see https://github.com/apache/spark/pull/33695; for the issue tracking nested-type support in the Parquet vectorized reader, see SPARK-34863.
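A minimal sketch, assuming the configuration key below is the one controlling the nested-column vectorized reader (to my knowledge it is off by default in 3.3.0); the output path and sample schema are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Assumption: Spark 3.3 config that enables the vectorized Parquet reader for
# nested columns (struct/array/map); off by default in 3.3.0.
spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "true")

# Hypothetical data: write a Parquet file with a struct and an array column, then scan it.
df = spark.createDataFrame(
    [(1, ("Berlin", "10115"), [1, 2, 3])],
    "id INT, address STRUCT<city: STRING, zip: STRING>, scores ARRAY<INT>",
)
df.write.mode("overwrite").parquet("/tmp/nested_demo")

spark.read.parquet("/tmp/nested_demo").select("address.city", F.size("scores")).show()
```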

Pandas enhancements

Optimized default index

With this release, the Pandas API on Spark switches the default index type from sequence to distributed-sequence, which can be optimized by the Catalyst optimizer. In a benchmark on a 5-node i3.xlarge cluster, scanning data through the Pandas API on Spark with the default index was about 2x faster. See SPARK-37649 for the optimized default index.
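A small sketch of inspecting or explicitly setting the default index type in the Pandas API on Spark (the option name below is the one used by pyspark.pandas):

```python
import pyspark.pandas as ps

# In Spark 3.3 the default is already "distributed-sequence"; setting it explicitly
# here just makes the behavior visible.
ps.set_option("compute.default_index_type", "distributed-sequence")
print(ps.get_option("compute.default_index_type"))

# Creating a pandas-on-Spark DataFrame without specifying an index uses the default index.
psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(psdf.head())
```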

Pandas API coverage

PySpark now supports datetime.timedelta natively across Spark SQL and the Pandas API on Spark (see SPARK-37275, SPARK-37525); this Python type is mapped to the day-time interval type in Spark SQL. In addition, the Pandas API on Spark gains many previously missing parameters and new APIs in this release, including ps.merge_asof (SPARK-36813), ps.timedelta_range (SPARK-37673), and ps.to_timedelta (SPARK-37701), among others.
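A brief sketch of the new timedelta support and merge_asof, using made-up data:

```python
import datetime
import pyspark.pandas as ps

# datetime.timedelta values are mapped to Spark SQL's day-time interval type.
s = ps.Series([datetime.timedelta(days=1), datetime.timedelta(hours=6)])
print(s.dtype)

# to_timedelta / timedelta_range mirror their pandas counterparts.
print(ps.to_timedelta("2 days"))
print(ps.timedelta_range(start="1 day", periods=3))

# merge_asof joins each row to the nearest earlier key, as in pandas.
quotes = ps.DataFrame({"time": [1, 3, 5], "quote": [100.0, 101.5, 102.0]})
trades = ps.DataFrame({"time": [2, 4], "qty": [10, 20]})
print(ps.merge_asof(trades, quotes, on="time").head())
```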

Migration Simplified

ANSI enhancements

This release completes support for the ANSI interval data types (SPARK-27790). Interval values can now be read from and written to tables and used with many functions/operators for date/time arithmetic, including aggregation and comparison. In ANSI mode, implicit casts now allow safe conversions between types while preventing data loss. A growing library of "try" functions, such as "try_add" and "try_multiply", complements ANSI mode, letting users keep the safety of ANSI mode rules while still writing fault-tolerant queries.
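A minimal sketch of ANSI interval arithmetic and the try_* functions, run through spark.sql for brevity:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# ANSI interval literals can be stored, compared, aggregated, and added.
spark.sql("SELECT INTERVAL '1-2' YEAR TO MONTH + INTERVAL '0-10' YEAR TO MONTH AS ym").show()
spark.sql("SELECT DATE '2022-06-16' + INTERVAL '3' DAY AS release_plus_3").show()

# try_* functions return NULL instead of failing on overflow or invalid input,
# which keeps queries fault-tolerant even with ANSI mode enabled.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT try_add(2147483647, 1) AS overflow_is_null").show()
spark.sql("SELECT try_multiply(9223372036854775807, 2) AS overflow_is_null").show()
```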

New built-in functions

In addition to the try_* functions (SPARK-35161), this release includes nine new linear-regression and statistics functions, four new string-processing functions, AES encryption and decryption functions, generalized floor and ceiling functions that accept a target scale, "to_number" formatting, and many other functions.
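A hedged sketch of a few of these functions; the format string and key length follow the SQL function documentation as I understand it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# to_number parses a formatted string into a decimal using a format pattern.
spark.sql("SELECT to_number('$12,345.67', '$99,999.99') AS parsed").show()

# floor/ceil now accept a target scale.
spark.sql("SELECT floor(3.14159, 2) AS floored, ceil(3.14159, 2) AS ceiled").show()

# AES encryption/decryption round trip; the key must be 16, 24, or 32 bytes.
spark.sql("""
  SELECT cast(
           aes_decrypt(aes_encrypt('Apache Spark', '0000111122223333'),
                       '0000111122223333') AS STRING) AS roundtrip
""").show()
```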

For more new features, please see the Apache Spark 3.3.0 release notes.

I hope this article has been helpful; remember to follow, comment, and favorite. Thank you!


Origin: blog.csdn.net/u013412066/article/details/129098642