Yunqi Conference | Apache Spark 3.0 and Koalas latest developments
Past Memory Big Data
This article is based on the Big Data & AI track of the Yunqi Conference held in Hangzhou on September 26, 2019. The talk, "New Developments in the Open Source Ecosystem: Apache Spark 3.0 and Koalas", was given by Li Xiao, Director of Spark R&D at Databricks.
The video of the talk follows. (Due to WeChat official-account limits, only videos under 30 minutes can be posted. For the complete video and slides, follow the Past Memory Big Data official account and reply spark_yq.)
2019 is a special year for the Spark community. Ten years ago, Matei Zaharia started a project to help his classmates compete in the Netflix Prize, the million-dollar competition launched by Netflix. That project grew into what is now Apache Spark.
The above shows the development history of Apache Spark. A preview of Apache Spark 3.0 is planned for release in September 2019, with the official Apache Spark 3.0 release early next year.
As is widely known, Spark and PySpark both rank highly on Stack Overflow: over the past ten years, Apache Spark has ranked first and Apache Hadoop second among big data technologies there, and Apache Spark and PySpark are expected to keep dominating the field.
Apache Spark 3.0 is the result of the community's joint efforts and more than a year of development. Its main features are:
- Dynamic partition pruning
- Adaptive query execution
- Spark Graph
- Accelerator-aware scheduling (GPU; for details, see "Apache Spark 3.0 will support GPU scheduling built-in")
- Spark on k8s
- DataSource API V2
- ANSI SQL compatibility
- SQL Hints
- Vectorization in SparkR
- JDK 11
- Hadoop 3
- Scala 2.12
This talk mainly introduced the query optimizations in Apache Spark 3.0.
Spark 2.x added cost-based optimization, but it does not always perform well, mainly because:
- statistics may be missing;
- statistics may be out of date;
- a general cost model is difficult to abstract.
To address these problems, Apache Spark 3.0 introduces runtime query optimization.
The first is dynamic partition pruning. Take the SQL query above: suppose the predicate t2.id < 2 filters table t2 down to very little data. Because earlier versions of Spark could not compute this at runtime, they might perform a large number of useless scans on table t1. With dynamic partition pruning, the useless t1 data can be filtered out at runtime.
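The shape of such a query can be sketched as follows (the table and column names here are illustrative, not taken from the slides):

```sql
-- t1 is a large fact table partitioned by pKey; t2 is a small dimension table.
SELECT t1.id, t2.pKey
FROM t1
JOIN t2 ON t1.pKey = t2.pKey
WHERE t2.id < 2;

-- Without dynamic partition pruning, Spark scans every partition of t1.
-- With it (spark.sql.optimizer.dynamicPartitionPruning.enabled, on by default
-- in Spark 3.0), the rows of t2 surviving the filter are used at runtime to
-- build a partition filter on t1.pKey, so only matching t1 partitions are read.
```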
After this optimization, the amount of data the query scans is greatly reduced, and performance improves by as much as 33x.

The second optimization is adaptive execution, AE (for details, see "How to use Adaptive Execution to make Spark SQL more efficient and easier to use?").
For example, for the following query, cost-based optimization cannot estimate the plan accurately. With AE, Spark collects the relevant statistics at runtime and adjusts the execution plan dynamically, for instance converting a SortMergeJoin into a BroadcastHashJoin:
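The relevant switches can be sketched as Spark SQL settings (property names are from Spark 3.0, where adaptive execution is off by default; this is a sketch, not from the slides):

```sql
-- Enable adaptive query execution:
SET spark.sql.adaptive.enabled = true;
-- Let AE also coalesce small shuffle partitions and mitigate skewed joins:
SET spark.sql.adaptive.coalescePartitions.enabled = true;
SET spark.sql.adaptive.skewJoin.enabled = true;
```

With these enabled, join strategies such as SortMergeJoin can be revised to BroadcastHashJoin once runtime statistics show one side is small enough to broadcast.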
From its very beginning, Spark has been positioned as a unified big data processing engine.
Data processing has mainly gone through the following three stages:
- Business Intelligence
- Big Data Analytics
- Data Unification + AI
Spark mainly serves two groups of users: data engineers and data scientists.
In recent years, Python's growing user base has led more and more people to use pandas for data analysis. But pandas mainly targets small data volumes: when the data grows large, its performance drops sharply, and analyzing big data has traditionally meant learning a new compute engine.
To solve this problem, Databricks open-sourced Koalas, which is seamlessly compatible with pandas (for details on Koalas, see "Koalas: Let pandas developers easily transition to Apache Spark"). Below are the daily downloads since Koalas was open-sourced:
As a data analysis tool, PySpark is also seeing its daily downloads gradually increase.
The following is the difference between PySpark DataFrame and pandas DataFrame:
Below is a comparison of usage examples: reading a CSV file, renaming a column, and finally adding a new column. As you can see, PySpark is much more cumbersome than pandas, and pandas users would have to learn a new API. With Koalas, this problem disappears:
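That workflow can be sketched in plain pandas (a minimal, self-contained sketch; the CSV contents and column names are invented, not from the slides, and with Koalas only the import and read path would change):

```python
import io
import pandas as pd

# An in-memory CSV standing in for the file read in the talk's example.
csv_data = io.StringIO("name,amount\nalice,100\nbob,200\n")

# pandas: read a CSV, rename a column, add a new column.
df = pd.read_csv(csv_data)
df = df.rename(columns={"amount": "sales"})
df["sales_with_tax"] = df["sales"] * 1.1

print(df)
```

With Koalas the same code runs on Spark by swapping the import for `import databricks.koalas as ks` and reading with `ks.read_csv(...)`, whereas the PySpark DataFrame API needs separate `withColumnRenamed` and `withColumn` calls for the same steps.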
Now let's turn to data engineering. Databricks has open-sourced Delta Lake (for details, see "Heavyweight | Delta Lake, long awaited by the Apache Spark community, is now open source"):
Delta Lake was not created out of thin air; it grew out of the pain points of thousands of users. Using Delta Lake is also very simple: just replace parquet with delta:
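In the DataFrame API this is literally `df.write.format("parquet")` becoming `df.write.format("delta")`; the SQL form is the same one-word swap (table names below are made up for illustration):

```sql
-- A table stored as plain Parquet:
CREATE TABLE events_parquet USING parquet AS SELECT * FROM staging_events;

-- The same table on Delta Lake: only the format keyword changes.
CREATE TABLE events_delta USING delta AS SELECT * FROM staging_events;
```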
The following introduces the main features of Delta Lake. Here are three typical user scenarios for Delta Lake:
Finally, here is how Databricks itself uses Delta: