Apache Spark 2.4 review and 3.0 outlook

The material in this article comes from a talk given at the Strata Data Conference in San Francisco on 2019-03-28; see https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/72637. The talk was given by two well-known figures, Fan Wenchen and Li Xiao.

The talk covers a review of Apache Spark 2.4 and an outlook for Apache Spark 3.0. Apache Spark 2.4 is the fifth release in the 2.x line. Its main features include the following:

  • A new scheduling model, barrier scheduling, lets users embed distributed deep learning training as a Spark stage, which simplifies the distributed training workflow.
  • Added 35 built-in functions, including higher-order functions, for manipulating arrays and maps in Spark SQL.
  • Added a new native Avro data source based on Databricks' spark-avro module.
  • PySpark adds an eager evaluation mode for all operations, intended for teaching and easier debugging.
  • Spark on Kubernetes now supports PySpark and R, as well as client mode.
  • Various Structured Streaming enhancements, for example stateful operators in continuous processing.
  • Various performance improvements for built-in data sources, for example nested schema pruning for Parquet.
  • Support for Scala 2.12.
For more information about Apache Spark 2.4, see "Apache Spark 2.4 Officially Released, Detailed Introduction to Important Features".
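To make the array and map functions item above more concrete, here is a minimal plain-Python model of what four of the Spark SQL 2.4 higher-order functions (`transform`, `filter`, `exists`, `aggregate`) compute on a single array value. The Python helper names and the sample data are illustrative assumptions, not Spark APIs, and SQL NULL handling is ignored; the corresponding Spark SQL expression is shown in a comment above each helper.

```python
from functools import reduce
from typing import Any, Callable, List


# Spark SQL: transform(xs, x -> x + 1)
def transform(xs: List[Any], f: Callable[[Any], Any]) -> List[Any]:
    """Apply f to every element and return the resulting array."""
    return [f(x) for x in xs]


# Spark SQL: filter(xs, x -> x % 2 = 0)
def filter_array(xs: List[Any], pred: Callable[[Any], bool]) -> List[Any]:
    """Keep only the elements for which pred is true."""
    return [x for x in xs if pred(x)]


# Spark SQL: exists(xs, x -> x > 3)
def exists(xs: List[Any], pred: Callable[[Any], bool]) -> bool:
    """Return True if pred holds for at least one element."""
    return any(pred(x) for x in xs)


# Spark SQL: aggregate(xs, 0, (acc, x) -> acc + x)
def aggregate(xs: List[Any], start: Any, merge: Callable[[Any, Any], Any]) -> Any:
    """Fold the array into a single value, starting from start."""
    return reduce(merge, xs, start)


xs = [1, 2, 3, 4]  # hypothetical array column value
incremented = transform(xs, lambda x: x + 1)
evens = filter_array(xs, lambda x: x % 2 == 0)
has_large = exists(xs, lambda x: x > 3)
total = aggregate(xs, 0, lambda acc, x: acc + x)
print(incremented, evens, has_large, total)
```

In Spark these expressions are evaluated inside the SQL engine, for example `SELECT transform(xs, x -> x + 1) FROM t` for a hypothetical table `t`, so the array does not have to be exploded and re-collected or passed through a UDF.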
Apache Spark 3.0 also brings many important features, including:

  • GPU-aware scheduling (see "Apache Spark 3.0 Will Have Built-in Support for GPU Scheduling, with Benefits at the End of the Article").
  • Spark Graph enhancements.
  • Data Source API V2.
  • Adaptive Execution (see "How Adaptive Execution Makes Spark SQL More Efficient and Useful?" and "Apache Spark SQL Adaptive Execution Practice", https://www.iteblog.com/archives/2319.html).
  • Support for Hadoop 3.x.
  • Support for Hive 2.3.4.
  • Scala 2.12 GA.
  • Better ANSI SQL compliance.
  • Further improvements to PySpark usability.

This is only a brief overview of the features in Apache Spark 3.0. For a more detailed introduction, see the Spark+AI Summit 2019, held in San Francisco on April 23-25. The figure below shows the new architecture of Apache Spark 3.x.

[Figure: new architecture diagram of Apache Spark 3.x]
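Adaptive Execution is only named in the feature list, so as a rough illustration of one idea commonly associated with it, the sketch below merges adjacent small shuffle partitions into groups of roughly a target size, using partition sizes that would only be known at runtime. The function name, the 64 MB target, and the sample sizes are assumptions made for illustration; this is not Spark's implementation.

```python
from typing import List


def coalesce_partitions(sizes: List[int], target: int) -> List[List[int]]:
    """Group adjacent partition indices so that each group's total size
    stays at or below target. A partition larger than target forms a
    group of its own."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for idx, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical post-shuffle partition sizes in MB, observed at runtime.
sizes_mb = [10, 20, 5, 70, 64, 3, 2]
groups = coalesce_partitions(sizes_mb, target=64)
print(groups)
```

Here seven shuffle partitions collapse into four reduce-side groups, which is the kind of decision a static, up-front partition count cannot make because the sizes are not known before the shuffle runs.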
Without further ado, the full slide deck from the talk follows. To get a copy of the slides, follow the Hadoop technical blog account and reply with spark-3.

[Slide images from the talk: Apache Spark 2.4 review and 3.0 outlook]

Origin blog.51cto.com/15127589/2678466