Spark SQL Source Code Analysis (1): Overview of the Catalyst SQL Parsing Framework

The Spark SQL module mainly deals with SQL parsing. Put more plainly, it is about how a SQL statement gets turned into a DataFrame or an RDD job. Taking Spark 2.4.3 as an example, the Spark SQL module is divided into three sub-modules, as shown below.

(Figure: the three sub-modules of Spark SQL)

Among them, Catalyst can be regarded as the framework inside Spark dedicated to parsing and optimizing SQL; Hive relies on the similar framework Calcite for this role (Hive ultimately translates the SQL into MapReduce tasks). Catalyst splits the SQL-parsing work into several stages, which is described quite clearly in the corresponding paper, and much of this series will refer to that paper. Those interested can read the original here: Spark SQL: Relational Data Processing in Spark.

The Core module carries the main execution flow of Spark SQL and calls into Catalyst along the way. The classes most commonly used from this module include SparkSession, Dataset, and so on.
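
As a rough sketch of how those classes serve as the entry point (the app name and the query string below are placeholders chosen for illustration):

import org.apache.spark.sql.SparkSession

// Build the entry point of the core module; every SQL string or
// DataFrame operation submitted through it goes through Catalyst.
val spark = SparkSession.builder()
  .appName("catalyst-demo")      // placeholder name
  .master("local[*]")
  .getOrCreate()

val df = spark.sql("SELECT 1 AS id")   // parsed, analyzed and optimized by Catalyst
df.show()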

As for the hive module, as the name suggests it handles the Hive integration. This series barely touches it, so it will not be introduced further.

It is worth mentioning that when the paper was published, Spark was still in the 1.x stage. At that time SQL was parsed into a syntax tree by a parsing tool written in Scala; in 2.x this job was handed over to Antlr4 (this should be the biggest change between the versions). As for why it was changed, my guess is readability and ease of use, but that is only a personal guess.

In addition, this series walks through the processing flow of a SQL statement, based on Spark 2.4.3 (the sql module has not changed much since Spark 2.1). This first article introduces, as a whole, the background of Spark SQL and the problems it solves, and what the flow of the DataFrame API and Catalyst looks like; the Catalyst stages are then covered in detail one by one.

The background and problems of Spark SQL

In the earliest days, the technology for large-scale data processing was MapReduce, but writing against that framework was inefficient, and relational operations (such as join) required a lot of code. Later, frameworks such as Hive let users submit SQL statements that are automatically optimized and executed.

However, in large systems there were still two main problems. One is that ETL operations need to interface with multiple data sources. The other is that users need to perform complex analyses, such as machine learning and graph computation, which are hard to express in a traditional relational processing system.

Spark SQL provides two sub-modules to solve these problems: the DataFrame API and Catalyst.

Compared with RDDs, the DataFrame API provides a much richer set of relational operations and can be converted to and from RDDs. The focus of Spark's machine learning libraries also shifted later from the RDD-based MLlib to the DataFrame-based Spark ML (even though underneath a DataFrame is still an RDD).

The other is Catalyst, through which it is easy to add new data sources (such as JSON, or custom types defined with case classes), new optimization rules, and new data types for domains such as machine learning.
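
As a rough illustration (the people.json path and the Person case class below are made up for the example), both a semi-structured source and a typed Scala collection end up as plans that Catalyst can analyze and optimize:

// Sketch only: "people.json" is a hypothetical file, Person a made-up case class.
case class Person(name: String, age: Int)

val jsonDF = spark.read.json("people.json")   // schema inferred from the JSON records

import spark.implicits._
val personDS = Seq(Person("a", 1), Person("b", 2)).toDS()   // schema derived from the case class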

Through these two modules, Spark SQL mainly achieves the following goals:

  • Provide an easy-to-use, well-designed API, covering both reading external data sources and relational data processing (anyone who has used it will appreciate this).
  • Use established DBMS techniques to provide high performance.
  • Easily support new data sources, including semi-structured data and external databases such as MySQL (see the JDBC sketch after this list).
  • Enable extension to graph computation and machine learning.
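
As a sketch of the MySQL case just mentioned (the URL, table name and credentials are placeholders, and the MySQL JDBC driver has to be on the classpath):

// Sketch: all connection details below are placeholders.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://host:3306/db")
  .option("dbtable", "some_table")
  .option("user", "user")
  .option("password", "password")
  .load()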

The following introduces the flow of the DataFrame API and Catalyst; the main focus, of course, is Catalyst.

The unified API: DataFrame

First look at a picture provided in the paper:

(Figure: DataFrame interfaces and their relationship to Spark, from the paper)

This figure says a lot. First of all, Spark's DataFrame API is also built on top of Spark's RDD. The difference from an RDD is that a DataFrame carries a schema (roughly, the structure of the data) and supports a variety of relational operations such as select, filter, join and groupBy. From a usage point of view it feels similar to a pandas DataFrame (even the name is the same).
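
A small sketch of what such relational operations look like (the DataFrame df and its columns dept and salary are assumed here purely for illustration):

// Assuming a DataFrame df with columns "dept" and "salary" (made up for this sketch).
import org.apache.spark.sql.functions._

df.select(col("dept"), col("salary"))
  .filter(col("salary") > 1000)
  .groupBy(col("dept"))
  .agg(avg(col("salary")).as("avg_salary"))
  .show()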

At the same time, because it is built on RDDs, a DataFrame inherits many of their properties, such as distributed computation, consistency and reliability guarantees, and the ability to cache data to speed up repeated computation.

The figure also shows that a DataFrame can be reached through JDBC from an external database, operated from the console (spark-shell), or used from a user program. Put bluntly, a DataFrame can be obtained by converting an RDD or by reading an external data table.

By the way, people who meet Spark SQL for the first time are often confused by the two notions Dataset and DataFrame. In the 1.x era they were indeed somewhat different, but in 2.x they have been unified, so Dataset and DataFrame can basically be treated as equivalent.
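
In fact, in the Spark 2.x source DataFrame is only a type alias of Dataset[Row], and the two convert freely; a minimal sketch (the Item case class is made up, and spark is the SparkSession, available by default in spark-shell):

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

// The 2.x source declares in org.apache.spark.sql: type DataFrame = Dataset[Row]
case class Item(id: Int, name: String)          // made-up case class for the sketch

val ds: Dataset[Item] = Seq(Item(1, "a")).toDS()
val asDF: DataFrame   = ds.toDF()               // Dataset[Item] -> DataFrame (= Dataset[Row])
val backToDS          = asDF.as[Item]           // DataFrame -> Dataset[Item]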

Finally, let's make this concrete with some code. The following generates an RDD and then builds the corresponding DataFrame from it, from which you can see the difference between an RDD and a DataFrame:

// Generate an RDD
scala> val data = sc.parallelize(Array((1,2),(3,4)))
data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> data.foreach(println)
(1,2)
(3,4)

scala> val df = data.toDF("fir","sec")
df: org.apache.spark.sql.DataFrame = [fir: int, sec: int]

scala> df.show()
+---+---+
|fir|sec|
+---+---+
|  1|  2|
|  3|  4|
+---+---+

// Compared with the RDD, the DataFrame additionally carries a schema
scala> df.printSchema()
root
 |-- fir: integer (nullable = false)
 |-- sec: integer (nullable = false)
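
Going the other direction is just as direct: the DataFrame above exposes its data again as an RDD of Row objects:

// Reverse direction: DataFrame -> RDD[Row]
val backToRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = df.rdd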

Catalyst flow analysis

Catalyst is referred to as the Optimizer in the paper. This part is the core of the paper, but the overall flow is actually quite easy to understand. Again, here is the figure from the paper.

(Figure: phases of query planning in Catalyst, from the paper)

The main process can be roughly divided into the following steps:

  1. The SQL statement is parsed by Antlr4 into an Unresolved Logical Plan (anyone who has used Antlr4 will recognize this step);
  2. The Analyzer binds the plan against the catalog (the catalog stores the metadata) to produce a resolved Logical Plan;
  3. The Optimizer optimizes the Logical Plan into an Optimized Logical Plan;
  4. The planner (SparkPlanner) converts the Optimized Logical Plan into a Physical Plan (SparkPlan);
  5. prepareForExecution() converts the Physical Plan into an executable Physical Plan;
  6. execute() runs the executable physical plan and produces an RDD.
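
Each of these intermediate plans can actually be inspected directly from a DataFrame, which is a handy way to follow the steps above (using the df from the earlier example; the exact plan text depends on the query):

// queryExecution exposes the plan of every stage for an existing DataFrame.
val qe = df.queryExecution
qe.logical        // the logical plan before analysis
qe.analyzed       // after the Analyzer
qe.optimizedPlan  // after the Optimizer
qe.sparkPlan      // physical plan chosen by the planner
qe.executedPlan   // physical plan after prepareForExecution

// Or print all of them at once:
df.explain(true)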

To mention in advance: most of the steps above live in the org.apache.spark.sql.execution.QueryExecution class. The code below is posted only for a first look; there is no detailed analysis here, later articles will go through each of these stages.

class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan) {

  // ... other code ...
  
  // Analyzer stage
  lazy val analyzed: LogicalPlan = {
    SparkSession.setActiveSession(sparkSession)
    sparkSession.sessionState.analyzer.executeAndCheck(logical)
  }


  // Optimizer stage
  lazy val optimizedPlan: LogicalPlan = sparkSession.sessionState.optimizer.execute(withCachedData)
  
  // SparkPlan (physical planning) stage
  lazy val sparkPlan: SparkPlan = {
    SparkSession.setActiveSession(sparkSession)
    // TODO: We use next(), i.e. take the first plan returned by the planner, here for now,
    //       but we will implement to choose the best plan.
    planner.plan(ReturnAnswer(optimizedPlan)).next()
  }

  // prepareForExecution stage
  // executedPlan should not be used to initialize any SparkPlan. It should be
  // only used for execution.
  lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)

  // execute stage
  /** Internal version of the RDD. Avoids copies and has no schema */
  lazy val toRdd: RDD[InternalRow] = executedPlan.execute()

  // ... other code ...
}

It is worth mentioning that each stage is declared as a lazy val, i.e. it is evaluated lazily, as shown in the sketch below. If you are interested in this, you can take a look at my previous article Scala Functional Programming (6) Lazy Loading and Stream.
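
A tiny illustration of what lazy buys here: the body of a lazy val runs only on first access, and the result is then cached for later accesses:

// Minimal lazy-val sketch (plain Scala, not Spark code).
lazy val expensivePlan: String = {
  println("building the plan ...")
  "plan"
}

expensivePlan   // prints "building the plan ..." and returns "plan"
expensivePlan   // returns the cached value without recomputing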

The above mainly introduced the Spark SQL module, its background and the main problems it addresses, then briefly covered the DataFrame API and Catalyst, the framework inside Spark SQL that parses SQL. Follow-up articles will mainly walk through each step of Catalyst and analyze it against the source code.

That's all ~

Origin www.cnblogs.com/listenfwind/p/12724381.html