The first stable version of Apache Spark 3.0 is released, and it can finally be used in a production environment!

Past Memory Big Data
Apache Spark 3.0.0 was officially released on June 18, 2020. It brought us many new features, a number of which speed up data processing. Unfortunately, that release was not a stable version.

But just yesterday, Apache Spark 3.0.1 was quietly released (it seems I missed the email notification)! To everyone's delight, this is a stable version, and all 3.0 users are officially recommended to upgrade to it.

Apache Spark 3.0 added many exciting new features, including Dynamic Partition Pruning, Adaptive Query Execution (AQE), Accelerator-aware Scheduling, a Data Source API with Catalog support (see SPARK-31121), vectorization in SparkR, and support for Hadoop 3, JDK 11, and Scala 2.12. For details, please refer to the Past Memory Big Data article "After nearly two years, the official version of Apache Spark 3.0.0 is finally released".
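
For readers who want to try these features, here is a minimal sketch, assuming a local Spark 3.0.x environment, of how the two headline optimizations are switched on through SparkSession configuration (the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming a local Spark 3.0.x build.
// In 3.0, Adaptive Query Execution is disabled by default and must be
// turned on explicitly; Dynamic Partition Pruning is enabled by default,
// so the second config is shown only for clarity.
val spark = SparkSession.builder()
  .appName("Spark30FeaturesDemo")
  .master("local[*]")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
  .getOrCreate()

// With AQE on, shuffle partition counts can be coalesced at runtime from
// actual data statistics rather than fixed by spark.sql.shuffle.partitions.
```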

•Apache Spark 3.0.1 release notes: https://spark.apache.org/releases/spark-release-3-0-1.html
•Full list of resolved issues: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12347862
•Apache Spark 3.0.1 download link: https://spark.apache.org/downloads.html

Noteworthy changes

•[SPARK-26905]: Revisit reserved/non-reserved keywords based on the ANSI SQL standard
•[SPARK-31220]: repartition obeys spark.sql.adaptive.coalescePartitions.initialPartitionNum when spark.sql.adaptive.enabled
•[SPARK-31703]: Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)
•[SPARK-31915]: Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs
•[SPARK-31923]: Event log cannot be generated when some internal accumulators use unexpected types
•[SPARK-31935]: Hadoop file system config should be effective in data source options
•[SPARK-31968]: write.partitionBy() creates duplicate subdirectories when user provides duplicate columns (see the sketch after this list)
•[SPARK-31983]: Tables of structured streaming tab show wrong result for duration column
•[SPARK-32003]: Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
•[SPARK-32038]: Regression in handling NaN values in COUNT(DISTINCT)
•[SPARK-32073]: Drop R < 3.5 support
•[SPARK-32092]: CrossvalidatorModel does not save all submodels (it saves only 3)
•[SPARK-32136]: Spark producing incorrect groupBy results when key is a struct with nullable properties
•[SPARK-32148]: LEFT JOIN generating non-deterministic and unexpected result (regression in Spark 3.0)
•[SPARK-32220]: Cartesian Product Hint cause data error
•[SPARK-32310]: ML params default value parity
•[SPARK-32339]: Improve MLlib BLAS native acceleration docs
•[SPARK-32424]: Fix silent data change for timestamp parsing if overflow happens
•[SPARK-32451]: Support Apache Arrow 1.0.0 in SparkR
•[SPARK-32456]: Check the Distinct by assuming it as Aggregate for Structured Streaming
•[SPARK-32608]: Script Transform DELIMIT value should be formatted
•[SPARK-32646]: ORC predicate pushdown should work with case-insensitive analysis
•[SPARK-32676]: Fix double caching in KMeans/BiKMeans
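
To make one of these fixes concrete, below is a hypothetical reproduction of SPARK-31968; the data, column names, and output path are illustrative, not taken from the report. Before 3.0.1, passing the same column to partitionBy() twice silently produced nested duplicate subdirectories such as part=1/part=1/; per the JIRA title, 3.0.1 no longer creates the duplicate layer.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical reproduction of SPARK-31968; the data, column names and
// output path are illustrative.
val spark = SparkSession.builder()
  .appName("Spark31968Demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("part", "value")

// Passing the same column twice is a user mistake, but before 3.0.1 it
// silently wrote nested directories like part=1/part=1/ on disk.
df.write
  .mode("overwrite")
  .partitionBy("part", "part")
  .parquet("/tmp/spark-31968-demo")
```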

Known issues

•[SPARK-31511]: Make BytesToBytesMap iterator() thread-safe
•[SPARK-32779]: Spark/Hive3 interaction potentially causes deadlock
•[SPARK-32788]: non-partitioned table scan should not have partition filter
•[SPARK-32810]: CSV/JSON data sources should avoid globbing paths when inferring schema (see the sketch below)
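
The last item deserves particular attention because it can affect ordinary reads. A hypothetical illustration is sketched below (the path is made up): because schema inference globs the input path, glob metacharacters such as square brackets in a directory name can be interpreted as a pattern rather than literal characters, so the intended file may not be matched.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical illustration of SPARK-32810, still open in 3.0.1; the path
// is made up. During schema inference the input path is globbed, so the
// square brackets below may be read as a character class instead of
// literal characters, and the intended file may not be found.
val spark = SparkSession.builder()
  .appName("Spark32810Demo")
  .master("local[*]")
  .getOrCreate()

val df = spark.read
  .option("header", "true")
  .csv("/data/logs/day=[2020-09-08]/events.csv")
```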

Source: blog.51cto.com/15127589/2677835