Apache Spark 3.0 preview officially released, with a number of major new features

On November 8, 2019, Databricks engineer Xingbo Jiang sent an email to the community announcing the official release of the Apache Spark 3.0 preview. This release exists mainly so that the community can test the upcoming Apache Spark 3.0 at scale. Whether judged by its API or by its functionality, the preview is not a stable release; its main purpose is to let the community try the new features of Apache Spark 3.0 in advance. If you want to test this version, you can download it here. Apache Spark 3.0 adds many exciting new features, including Dynamic Partition Pruning, Adaptive Query Execution, Accelerator-aware Scheduling, a Data Source API with catalog support, vectorization in SparkR, support for Hadoop 3 / JDK 11 / Scala 2.12, and more. The complete list of Spark 3.0.0-preview features and major changes can be found here. Below I will walk you through some of the more important new features.

PS: Observant readers will have noticed that Spark 3.0 contains relatively few Spark Streaming / Structured Streaming related ISSUEs, probably for a couple of reasons:

- The current batch-based Spark Streaming / Structured Streaming already meets most enterprise needs; scenarios that truly require very low-latency, real-time computation are rare, so the Continuous Processing module is still in the experimental stage and there is no hurry to graduate it.
- Databricks is investing heavily in Delta Lake related work, a business that can generate revenue and is currently their focus, so naturally less development effort goes into Streaming.

Enough preamble; let's take a look at the new features of Spark 3.0.

## Dynamic Partition Pruning

Dynamic partition pruning means pruning partitions further based on information inferred at run time. For example, consider the following query:

```sql
SELECT *
FROM dim_iteblog
JOIN fact_iteblog
  ON (dim_iteblog.partcol = fact_iteblog.partcol)
WHERE dim_iteblog.othercol > 10
```

Suppose the data left in the dim_iteblog table after filtering by `dim_iteblog.othercol > 10` is relatively small. Because earlier versions of Spark could not compute this cost dynamically, the scan of the fact_iteblog table might read a large amount of useless data. With dynamic partition pruning, the unneeded data in fact_iteblog can be filtered out at run time. After this optimization the query scans far less data, and the performance improvement can reach 33x. The corresponding ISSUEs for this feature are SPARK-11150 and SPARK-28888. The iteblog_hadoop WeChat official account also covered this feature in detail a few days ago; see "Apache Spark 3.0 Dynamic Partition Pruning Introduction" and "Apache Spark 3.0 Dynamic Partition Pruning Usage".
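To make this concrete, here is a minimal sketch (not from the original announcement) of how the feature could be exercised from Scala. The configuration key `spark.sql.optimizer.dynamicPartitionPruning.enabled` and the shape of the plan output are assumptions based on the Spark 3.0 code base and may differ in the preview build; `fact_iteblog` is assumed to be a table partitioned by `partcol`, with `dim_iteblog` as a small dimension table.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dpp-demo")
  .master("local[*]")
  // Expected to be on by default in Spark 3.0; set explicitly here for clarity.
  .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
  .getOrCreate()

// fact_iteblog is assumed to be registered in the metastore and partitioned by partcol.
val q = spark.sql(
  """
    |SELECT *
    |FROM fact_iteblog f
    |JOIN dim_iteblog d ON f.partcol = d.partcol
    |WHERE d.othercol > 10
  """.stripMargin)

// When pruning applies, the scan of fact_iteblog carries a dynamic pruning
// subquery on partcol instead of reading every partition.
q.explain(true)
```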
## Adaptive Query Execution

Adaptive query execution (also called Adaptive Query Optimisation or Adaptive Optimisation) optimizes the query execution plan at run time: it allows the Spark planner to execute alternative plans that are chosen based on statistics collected while the query runs. As early as 2015 the Spark community proposed the basic idea of adaptive execution, adding an interface to Spark's DAGScheduler for submitting a single map stage and trying to adjust the number of shuffle partitions at run time. However, that implementation had some limitations: in some scenarios it introduced additional shuffles, that is, more stages, and it could not handle cases such as three tables being joined in the same stage; it was also hard to flexibly implement other adaptive features on top of that framework, such as changing the execution plan at run time or handling skewed joins. So the feature stayed in the experimental stage and its configuration parameters were never mentioned in the official documentation. The idea came mainly from experts at Intel and Baidu; see SPARK-9850 and the corresponding article "Apache Spark SQL adaptive execution in practice".

The Adaptive Query Execution in Apache Spark 3.0 is implemented based on the ideas of SPARK-9850; see SPARK-23128 for details. The goal of SPARK-23128 is to provide a flexible framework for adaptive execution in Spark SQL, initially supporting changing the number of reducers at run time. The new implementation removes all the limitations discussed above, and other features (such as changing the join strategy and handling skewed joins) will be implemented as separate functions and provided as plug-ins in later versions.
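As a rough illustration (again not part of the announcement), the sketch below shows how adaptive execution might be switched on from Scala. The configuration keys follow the Spark 3.0 code base and may be named slightly differently in the preview build.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-demo")
  .master("local[*]")
  // Adaptive execution is off by default; turn it on.
  .config("spark.sql.adaptive.enabled", "true")
  // Let AQE coalesce small post-shuffle partitions at run time.
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .getOrCreate()

val big   = spark.range(0, 1000000).selectExpr("id % 100 AS k", "id AS v")
val small = spark.range(0, 100).selectExpr("id AS k", "concat('name_', id) AS name")

// With AQE enabled, the number of reducers (and potentially the join strategy)
// is adjusted after the runtime statistics of the shuffle become available.
val joined = big.join(small, "k").groupBy("k").count()
joined.explain()   // the physical plan is wrapped in an adaptive plan node
joined.show(5)
```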
## Accelerator-aware Scheduling

Nowadays big data and machine learning are combined extensively. Because iterative computation in machine learning can take a very long time, developers usually turn to GPUs, FPGAs or TPUs to speed it up. Apache Hadoop 3.1 already ships native support for GPUs and FPGAs, and as a general-purpose computing engine Spark is certainly not going to fall behind: engineers from Databricks, NVIDIA, Google and Alibaba are adding native GPU scheduling support to Apache Spark. This work fills the gap Spark had in scheduling GPU resources, combines big data processing and AI applications organically, and extends Spark's application scenarios in deep learning, signal processing and other big data applications. The ISSUE tracking this work is SPARK-24615, and the related SPIP (Spark Project Improvement Proposal) document is "SPIP: Accelerator-aware scheduling".

![To keep up with Spark, Hadoop or HBase related articles, follow the WeChat official account: iteblog_hadoop](https://img2018.cnblogs.com/blog/377574/201911/377574-20191117114204290-1826268617.png)

YARN and Kubernetes, resource managers currently supported by Apache Spark, already support GPUs. For Spark itself to support GPUs, two major changes are needed at the technical level:

- At the cluster manager level, the cluster managers need to be upgraded to support GPUs, and users need to be given APIs through which they can control the use and allocation of GPU resources.
- Inside Spark, the scheduler needs to be modified so that it can recognize GPU demands in user task requests and then allocate GPUs on executors accordingly.

Because making Apache Spark support GPUs is a fairly large feature, the project has been split into several stages. Apache Spark 3.0 will support GPUs under the standalone, YARN and Kubernetes resource managers, with essentially no impact on existing normal jobs. Support for TPUs, GPU support under the Mesos resource manager, and GPU support on the Windows platform are not goals of this release. Furthermore, fine-grained scheduling within a single GPU card is not supported in this release; Apache Spark 3.0 treats a GPU card and its memory as an indivisible unit. For details, refer to the article "Apache Spark 3.0 will have built-in GPU scheduling support" on the iteblog_hadoop WeChat official account.
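To give a feel for how this surfaces to users, here is a hedged sketch based on the design in the SPIP. The configuration keys (`spark.executor.resource.gpu.amount`, `spark.task.resource.gpu.amount`, the discovery script) and the `TaskContext.resources()` API follow the Spark 3.0 design and may differ in the preview; the script path is purely a placeholder, and the job is assumed to be submitted to a cluster whose hosts actually expose GPUs.

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("gpu-demo")
  // Ask for one GPU per executor and one GPU per task; the discovery script
  // (placeholder path) must print the GPU addresses visible on the host.
  .config("spark.executor.resource.gpu.amount", "1")
  .config("spark.task.resource.gpu.amount", "1")
  .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
  .getOrCreate()

// Inside a task, the scheduler exposes which GPU(s) were assigned to it, so user
// code (for example a deep-learning library) can pin its work to those devices.
spark.sparkContext.parallelize(1 to 4, 4).foreach { _ =>
  val gpus = TaskContext.get().resources()("gpu").addresses
  println(s"this task was assigned GPU(s): ${gpus.mkString(",")}")
}
```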
## Apache Spark DataSource V2

The Data Source API defines how to read data from storage systems and write data back to them, for example Hadoop's InputFormat/OutputFormat or Hive's SerDe. These APIs are a good fit for users who program Spark directly against RDDs. Although programming against them gets the job done, the cost for users is still considerable, and Spark cannot optimize such jobs. To address this, Spark 1.3 introduced Data Source API V1, through which we can easily read data from various sources while the SQL engine applies some of its optimizations to the reads, such as column pruning and filter push-down.

![To keep up with Spark, Hadoop or HBase related articles, follow the WeChat official account: iteblog_hadoop](https://img2018.cnblogs.com/blog/377574/201911/377574-20191117114206251-475779410.png)

Data Source API V1 gives us a set of abstract interfaces that cover most scenarios. But as the number of users grew, several problems gradually surfaced:

- Some of the interfaces depend on SQLContext and DataFrame
- Limited extensibility; it is hard to push down other operators
- No support for columnar reads
- No partitioning and sorting information exposed
- Write operations do not support transactions
- No support for stream processing

To solve some of these problems, the community introduced Data Source API V2 starting from Apache Spark 2.3.0. While keeping the original functionality, it addresses some of the shortcomings of V1, for example by no longer depending on the upper-level APIs and by improving extensibility. The corresponding ISSUE for Data Source API V2 is SPARK-15689. Although the feature already appeared in the Apache Spark 2.x line, it was not very stable there, so the community opened two further ISSUEs for stabilizing Spark DataSource API V2 and for its new features: SPARK-25186 and SPARK-22386. The final, stable version of Spark DataSource API V2 together with its new features will be released along with Apache Spark 3.0.0 at the end of the year, and it counts as one of the major new features of Apache Spark 3.0.0. For more details about Apache Spark DataSource V2, see the two articles "Apache Spark DataSource V2 Introduction and Getting Started Programming Guide (Part 1)" and "Apache Spark DataSource V2 Introduction and Getting Started Programming Guide (Part 2)" on the iteblog_hadoop WeChat official account.
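One of the headline additions on top of V2 is the catalog API. The sketch below shows, in broad strokes, how a V2 catalog might be plugged in and queried; `com.example.MyTableCatalog` and `com.example.v2source` are hypothetical classes (a real catalog would implement the `TableCatalog` interface under `org.apache.spark.sql.connector.catalog` in Spark 3.0), the catalog name and table names simply reuse the earlier examples, and the multi-part identifier syntax is an assumption for the preview build.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dsv2-catalog-demo")
  // Register a V2 catalog named "iteblog_cat". The class is a placeholder; a real
  // one would implement org.apache.spark.sql.connector.catalog.TableCatalog.
  .config("spark.sql.catalog.iteblog_cat", "com.example.MyTableCatalog")
  .getOrCreate()

// Tables of a V2 catalog are addressed with multi-part identifiers:
// <catalog>.<namespace>.<table>
spark.sql("SHOW TABLES IN iteblog_cat.db").show()
spark.sql("SELECT * FROM iteblog_cat.db.fact_iteblog WHERE partcol = '2019-11-08'").show()

// V2 connectors remain usable through the classic DataFrameReader API as well;
// the format class below is again a placeholder for a connector built on V2.
val df = spark.read
  .format("com.example.v2source")
  .option("path", "/tmp/iteblog/data")
  .load()
df.printSchema()
```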
## Better ANSI SQL compatibility

PostgreSQL is one of the most advanced open source databases. It supports most of the major features of SQL:2011; of the 179 features required for full SQL:2011 conformance, PostgreSQL complies with at least 160. The Spark community has opened a dedicated ISSUE, SPARK-27764, to resolve the differences between Spark SQL and PostgreSQL, covering feature completion, bug fixes and other changes. Feature completion includes supporting some ANSI SQL functions, distinguishing SQL reserved keywords from identifiers, built-in functions, and so on. This ISSUE currently has 231 sub-ISSUEs; once this batch of ISSUEs is resolved, the gap between Spark SQL and PostgreSQL, or ANSI SQL:2011, will be much smaller.

## SparkR vectorization

Spark has supported the R language since version 1.4, but at that time the architecture for interaction between Spark and R looked like this:

![To keep up with Spark, Hadoop or HBase related articles, follow the WeChat official account: iteblog_hadoop](https://img2018.cnblogs.com/blog/377574/201911/377574-20191117114208065-2079502154.png)

Whenever we use the R language to interact with a Spark cluster we have to go through the JVM, so serialization and deserialization of the data cannot be avoided, and performance is very poor when the amount of data is large. Apache Spark already applies vectorization optimizations in many places, for example the internal columnar format, vectorized Parquet/ORC reads, and Pandas UDFs, and vectorization can improve performance dramatically. SparkR vectorization lets users keep their existing code unchanged, but when they execute native R functions or convert between a Spark DataFrame and an R DataFrame, performance can improve by roughly a factor of several thousand. This work is tracked in SPARK-26759. The new architecture is as follows:

![To keep up with Spark, Hadoop or HBase related articles, follow the WeChat official account: iteblog_hadoop](https://img2018.cnblogs.com/blog/377574/201911/377574-20191117114209243-150267609.png)

As you can see, SparkR vectorization uses Apache Arrow, which makes data exchange between the two systems very efficient and avoids the cost of serializing and deserializing the data, so the performance of the interaction between SparkR and Spark improves greatly.

Origin www.cnblogs.com/w397090770/p/11875829.html