Yunqi Conference | Apache Spark 3.0 and Koalas latest developments


Past Memory Big Data
This article is based on a talk at the Big Data & AI Summit track of the Yunqi Conference, held in Hangzhou on September 26, 2019. The talk, titled "New Developments in the Open Source Ecosystem: Apache Spark 3.0 and Koalas", was given by Li Xiao, Spark R&D director at Databricks.
The video of the talk is below (due to WeChat official account restrictions, only videos shorter than 30 minutes can be posted; for the complete video and slides, follow the Past Memory Big Data official account and reply with spark_yq).

2019 is a special year for the Spark community. Ten years ago, Matei Zaharia started a project to help his classmates compete for the million-dollar Netflix Prize launched by Netflix; that project has become Apache Spark.
The talk reviewed the development history of Apache Spark. The preview version of Apache Spark 3.0 will be released in September 2019, and the official 3.0 release is expected in early 2020.
Stack Overflow data from the past ten years shows that Spark and PySpark both rank high in popularity, with Apache Spark ahead of Apache Hadoop, and interest in Apache Spark and PySpark is expected to keep growing.
Apache Spark 3.0 is the result of more than a year of joint effort by the community. Its main features include:

  • Dynamic partition pruning
  • Adaptive query execution
  • Spark Graph
  • Accelerator-aware scheduling (GPU support; for details, see "Apache Spark 3.0 will have built-in support for GPU scheduling")
  • Spark on k8s
  • DataSource API V2
  • ANSI SQL compatible
  • SQL Hints
  • Vectorization in SparkR
  • JDK 11
  • Hadoop 3
  • Scala 2.12
This talk mainly covers the query optimizations in Apache Spark 3.0.
Spark 2.x added cost-based optimization (CBO), but it does not perform well in practice. The main reasons are as follows (a short sketch after the list illustrates the statistics problem):
  • Lack of statistical information;
  • Statistics are out of date;
  • It is difficult to abstract a general cost model.
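For example, CBO relies on statistics that have to be collected explicitly and kept fresh by hand. A minimal sketch of that dependency, assuming an existing SparkSession named `spark` and a hypothetical table `t1` already registered in the catalog:

```python
# Enable the cost-based optimizer (it is off by default).
spark.sql("SET spark.sql.cbo.enabled=true")

# Collect table-level statistics so the cost model has something to work with.
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")

# If t1 changes afterwards and ANALYZE is not re-run, the stored statistics
# go stale and the cost model may pick a poor plan.
```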
To solve these problems, Apache Spark 3.0 introduces runtime query optimization.
The first is dynamic partition pruning. Consider a query that joins a large table t1 with a table t2 and filters t2 with t2.id < 2. The filter keeps only a small amount of data from t2, but because earlier versions of Spark could not use this information at runtime, they might scan a large amount of useless data from t1. With dynamic partition pruning, the useless data in t1 is filtered out at runtime.
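A minimal sketch of the idea, assuming an existing SparkSession named `spark`, a large table `t1` partitioned by `pkey`, and a small table `t2` (all names hypothetical):

```python
# Dynamic partition pruning is enabled by default in Spark 3.0.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# The selective filter on t2 is turned into a runtime filter on t1's
# partition column, so only the matching partitions of t1 are scanned.
spark.sql("""
    SELECT t1.*
    FROM t1
    JOIN t2 ON t1.pkey = t2.pkey
    WHERE t2.id < 2
""").explain()
```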
After this optimization, the amount of data scanned by the query is greatly reduced, and performance improves by up to 33x.

The second optimization is adaptive execution (AE; for details, see "How to use Adaptive Execution to make Spark SQL more efficient and easier to use").
For example, for the query shown in the talk, the cost-based model cannot produce an accurate estimate.
With AE, Spark collects statistics at runtime and adjusts the execution plan dynamically, for example changing a SortMergeJoin into a BroadcastHashJoin.
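A minimal sketch of switching AE on, again using the hypothetical tables `t1` and `t2` (the feature is disabled by default in the 3.0 preview):

```python
# Enable adaptive query execution.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# The join below may be planned as a SortMergeJoin at compile time, but if
# the filtered side turns out to be small at runtime, AE can rewrite it into
# a BroadcastHashJoin after the shuffle map stage finishes.
spark.sql("""
    SELECT t1.*
    FROM t1
    JOIN t2 ON t1.pkey = t2.pkey
    WHERE t2.name LIKE '%spark%'
""").explain()
```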
From its inception, Spark has been positioned as a unified big data processing engine.
Data processing has mainly gone through the following three stages:
  • Business Intelligence
  • Big data analysis
  • Data Unification + AI
Spark mainly serves two groups of users: data engineers and data scientists.
In recent years, the growing number of Python users has led to more and more people using pandas for data analysis. However, pandas is mainly designed for small data sets; when the data volume grows large, its performance drops sharply, and analyzing large amounts of data then requires learning a new computation engine.
To solve this problem, Databricks open sourced Koalas, which is seamlessly compatible with pandas (for details on Koalas, see "Koalas: Let pandas developers easily transition to Apache Spark"). Daily downloads have grown steadily since Koalas was open sourced.

PySpark, as a data analysis tool, is also seeing its daily downloads gradually increase.
There are notable differences between PySpark DataFrames and pandas DataFrames.
Take a simple task as an example: read a CSV file, rename the columns, and add a new column. Compared to pandas, PySpark is still quite cumbersome, and pandas users would have to learn a new API. With Koalas, this problem goes away, as the sketch below shows.
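A minimal sketch of the comparison, using a hypothetical file `data.csv` and column names chosen only for illustration:

```python
# pandas: read a CSV, rename the columns, add a new column.
import pandas as pd

pdf = pd.read_csv("data.csv")
pdf.columns = ["x", "y", "z"]
pdf["x2"] = pdf["x"] * pdf["x"]

# PySpark: the same steps go through DataFrame-specific APIs.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.read.csv("data.csv", header=True, inferSchema=True)
sdf = sdf.toDF("x", "y", "z")
sdf = sdf.withColumn("x2", F.col("x") * F.col("x"))

# Koalas: keeps the pandas API but executes on Spark.
import databricks.koalas as ks

kdf = ks.read_csv("data.csv")
kdf.columns = ["x", "y", "z"]
kdf["x2"] = kdf["x"] * kdf["x"]
```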
Turning to data engineering: on this side, Databricks has open sourced Delta Lake (for details, see "The Delta Lake that the Apache Spark community has been looking forward to is open source").
Delta Lake was not created out of thin air; it grew out of the pain points of thousands of users. Using Delta Lake requires very little change to existing code: simply replace parquet with delta as the data source format.
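A minimal sketch of the change, assuming an existing DataFrame `df`, an existing SparkSession `spark`, a hypothetical path, and the Delta Lake package on the classpath:

```python
# Before: writing plain Parquet files.
df.write.format("parquet").save("/data/events")

# After: the same pipeline on Delta Lake is just a format change.
df.write.format("delta").save("/data/events")

# Reading back also only changes the format string.
events = spark.read.format("delta").load("/data/events")
```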
The talk then introduced the main features of Delta Lake, along with three typical user scenarios.
Finally, the talk showed how Delta is used inside Databricks.

Origin blog.51cto.com/15127589/2679158