1. Overview of Spark SQL
1. The past and present of Spark SQL
Shark was a large-scale data warehouse system built for Spark and compatible with Hive. Shark reused Hive's code base, swapping out parts of Hive's physical execution plan, which let Shark users speed up Hive queries. However, Shark inherited Hive's large and complex code base, which made it difficult to optimize and maintain, and it was tied to specific Spark versions. As performance optimization hit its ceiling and more complex SQL analytics were integrated, it became clear that Hive's MapReduce-oriented design limited Shark's development. At the Spark Summit on July 1, 2014, Databricks announced that it was ending development of Shark to focus on Spark SQL.
2. What is Spark SQL
- Spark SQL is the Spark module for processing structured data. It provides a programming abstraction called DataFrame and acts as a distributed SQL query engine.
- Compared with the Spark RDD API, Spark SQL carries more information about the structure of the data and the operations performed on it. Spark SQL uses this information to perform additional optimizations, making operations on structured data more efficient and convenient.
- There are multiple ways to use Spark SQL: SQL, the DataFrames API, and the Datasets API. Whichever API or programming language you choose, they are all based on the same execution engine, so you can switch between APIs freely; each has its own characteristics, and the choice comes down to which style you prefer.
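The point above can be sketched with the same query written twice, once in the DataFrame (DSL) style and once in SQL. This is a minimal sketch, assuming a local SparkSession and a hypothetical `users.json` input file; both versions produce the same optimized plan under the hood.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("api-comparison")
  .master("local[2]")
  .getOrCreate()

// Hypothetical input file with "name" and "age" fields
val df = spark.read.json("users.json")

// DataFrame (DSL) style
df.filter(col("age") > 21).select("name").show()

// SQL style -- same execution engine, same logical plan after optimization
df.createOrReplaceTempView("users")
spark.sql("SELECT name FROM users WHERE age > 21").show()
```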
3. Why learn Spark SQL
1. Comparison between Spark SQL and Hive
- Hive translates Hive SQL into MapReduce jobs and submits them to the cluster for execution. This greatly simplifies writing MapReduce programs, but MapReduce execution is relatively slow.
- Spark SQL translates SQL into RDD operations and submits them to the cluster to run, which executes much faster.
2. Features of Spark SQL
- Easy to integrate
Seamlessly mix SQL queries with Spark programs; the API can be used from languages such as Java, Scala, Python, and R.
- Unified data access
Connect to any data source in the same way.
- Hive compatibility
Supports HiveQL syntax.
- Standard data connectivity
Industry-standard JDBC or ODBC connections can be used.
2. DataFrame
1. What is a DataFrame
- The predecessor of DataFrame was SchemaRDD, which was renamed DataFrame in Spark 1.3.0. The main difference is that DataFrame no longer inherits directly from RDD but implements most RDD functionality itself. You can still call the rdd method on a DataFrame to convert it to an RDD.
- In Spark, a DataFrame is a distributed data set built on RDDs, similar to a two-dimensional table in a traditional database. A DataFrame carries schema metadata: each column of the two-dimensional data set it represents has a name and a type, and this enables more optimizations under the hood. DataFrames can be constructed from many data sources, such as existing RDDs, structured files, external databases, and Hive tables.
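The construction routes listed above can be sketched as follows. This is a hedged sketch: the file paths and JDBC connection details are placeholder assumptions, not values from the original text.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("df-sources")
  .getOrCreate()
import spark.implicits._

// From an existing RDD of case-class objects
case class Person(id: Int, name: String, age: Int)
val fromRDD = spark.sparkContext
  .parallelize(Seq(Person(1, "zhangsan", 20), Person(2, "lisi", 29)))
  .toDF()

// From structured files (hypothetical paths)
val fromJson = spark.read.json("people.json")
val fromCsv  = spark.read.option("header", "true").csv("people.csv")

// From an external database via JDBC (all connection details are placeholders)
// val fromJdbc = spark.read.format("jdbc")
//   .option("url", "jdbc:mysql://host:3306/db")
//   .option("dbtable", "person")
//   .option("user", "u").option("password", "p")
//   .load()

// From a Hive table (requires enableHiveSupport() on the SparkSession builder)
// val fromHive = spark.sql("SELECT * FROM some_hive_table")
```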
2. The difference between DataFrame and RDD
- An RDD can be regarded as a distributed collection of objects; Spark does not know the detailed schema of those objects. A DataFrame can be regarded as a distributed collection of Row objects that carries detailed schema information describing its columns, so Spark SQL can perform certain execution optimizations. Spark SQL knows exactly which columns the data set contains and what each column's name and type are. In other words, a DataFrame carries more structural information, namely a schema. It looks like a table, and it comes with new ways to manipulate data: the DataFrame API (such as df.select()) and SQL (select id, name from xx_table where ...).
- DataFrame also introduces off-heap storage, i.e. memory outside the JVM heap that is managed directly by the operating system rather than the JVM. Spark can serialize data (excluding the structure) into off-heap memory in binary form and operate on that memory directly; because Spark knows the schema, it knows how to interpret the bytes.
- An RDD is a distributed collection of Java objects, while a DataFrame is a distributed collection of Row objects. Besides providing richer operators than RDDs, the more important gains of DataFrame are improved execution efficiency, reduced data reading, and optimized execution plans.
- With the higher-level DataFrame abstraction, we can process data more easily, even using SQL. For developers, ease of use improves greatly. Processing data through the DataFrame API or SQL is automatically optimized by Spark's optimizer (Catalyst), so even an inefficiently written program or SQL query can run very fast.
3. Advantages and disadvantages of DataFrame and RDD
- Advantages and disadvantages of RDDs:
Advantages:
(1) Compile-time type safety: type errors can be caught at compile time.
(2) Object-oriented programming style: data is manipulated directly by calling methods on objects.
Disadvantages:
(1) Serialization and deserialization overhead: whether for communication between cluster nodes or for IO operations, both the structure and the data of objects must be serialized and deserialized.
(2) GC overhead: frequent creation and destruction of objects inevitably increases garbage collection.
- Pros and cons of DataFrame
DataFrame addresses the shortcomings of RDDs by introducing a schema and off-heap memory (memory outside the JVM heap, managed by the operating system). Because Spark can understand the data through the schema, only the data itself needs to be serialized and deserialized for communication and IO; the structure can be omitted. Off-heap memory allows data to be manipulated quickly while avoiding a large amount of GC. But DataFrame gives up some RDD advantages: it is not type-safe, and its API is not object-oriented.
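The type-safety trade-off can be sketched concretely. In this sketch (data and names are made up), a misspelled column name in a DataFrame query compiles fine and only fails at runtime, while the same typo in a typed Dataset lambda is rejected by the Scala compiler:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("type-safety")
  .getOrCreate()
import spark.implicits._

case class Person(id: Int, name: String, age: Int)
val df = Seq(Person(1, "zhangsan", 20)).toDF()

// Compiles, but throws an AnalysisException at runtime: "nmae" is a typo
// df.select(col("nmae")).show()

val ds = Seq(Person(1, "zhangsan", 20)).toDS()
// Does not compile: the Scala compiler knows Person has no field "nmae"
// ds.map(p => p.nmae)

// The correct, type-checked version:
ds.map(p => p.name).show()
```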
3. Read the data source to create a DataFrame
1. Read a text file to create a DataFrame
Before Spark 2.0, SQLContext was the entry point in Spark SQL for creating DataFrames and executing SQL. HiveContext, which inherits from SQLContext, could operate on Hive table data through HiveQL statements and is compatible with Hive operations. Since Spark 2.0, these are unified in SparkSession: SparkSession encapsulates SparkContext and SQLContext, and both objects can be obtained through SparkSession.
(1) Create a file locally with three columns, id, name, age, separated by spaces, and then upload it to hdfs. The content of person.txt is:
1 zhangsan 20
2 lisi 29
3 wangwu 25
4 zhaoliu 30
5 tianqi 35
6 kobe 40
Upload data files to HDFS: hdfs dfs -put person.txt /
(2) Execute the following command in the spark shell, read the data, and divide the data of each row using the column separator
First execute spark-shell --master local[2]
val lineRDD= sc.textFile("/person.txt").map(_.split(" "))
(3) Define a case class (equivalent to the schema of the table)
case class Person(id:Int, name:String, age:Int)
(4) Associate RDD with case class
val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))
(5) Convert RDD to DataFrame
val personDF = personRDD.toDF
(6) Process the DataFrame
personDF.show
personDF.printSchema
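As an alternative to the case-class approach above, the schema can also be declared explicitly with StructType and attached via createDataFrame. This is a sketch assuming the same `/person.txt` file on HDFS and the same id/name/age fields:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("explicit-schema")
  .getOrCreate()

// Declare the schema explicitly instead of deriving it from a case class
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// Parse each line into a Row matching the schema
val rowRDD = spark.sparkContext.textFile("/person.txt")
  .map(_.split(" "))
  .map(a => Row(a(0).toInt, a(1), a(2).toInt))

val personDF = spark.createDataFrame(rowRDD, schema)
personDF.printSchema()
personDF.show()
```

This variant is useful when the schema is only known at runtime (for example, read from a configuration file) and a case class cannot be written in advance.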
4. Common operations of DataFrame
1. DSL style syntax
DataFrame provides a Domain Specific Language (DSL) for manipulating structured data.
Here are some usage examples:
(1) View the content in the DataFrame by calling the show method
personDF.show
(2) View the contents of some columns of the DataFrame View the data of the name field
personDF.select(personDF.col("name")).show
View another way of writing the name field
personDF.select("name").show
View name and age field data
personDF.select(col("name"), col("age")).show
(3) Print the Schema information of the DataFrame
personDF.printSchema
(4) Query all names and ages, and add age+1
personDF.select(col("id"), col("name"), col("age") + 1).show
This can also be done like this:
personDF.select(personDF("id"), personDF("name"), personDF("age") + 1).show
(5) To filter age greater than or equal to 25, use the filter method to filter
personDF.filter(col("age") >= 25).show
(6) Count the number of people over the age of 30
personDF.filter(col("age")>30).count()
(7) Group by age and count the number of people of the same age
personDF.groupBy("age").count().show
2. SQL-style syntax
One of the strengths of DataFrame is that we can treat it as a relational table and execute SQL queries against it with spark.sql() in a program; the result is returned as a DataFrame.
If you want to use SQL-style syntax, you need to register the DataFrame as a table first:
personDF.createOrReplaceTempView("t_person")
(registerTempTable, the pre-2.0 name for this method, is deprecated since Spark 2.0.)
(1) Query the top two oldest persons
spark.sql("select * from t_person order by age desc limit 2").show
(2) Display the schema information of the table
spark.sql("desc t_person").show
(3) Query the information of people over the age of 30
spark.sql("select * from t_person where age > 30 ").show
5. DataSet
- What is a DataSet
A DataSet is a distributed collection of data. Datasets provide strong typing, adding type constraints to each row of data on top of RDDs. DataSet is a new interface added in Spark 1.6 that combines the advantages of RDDs (strong typing and the ability to use powerful lambda functions) with Spark SQL's optimized execution engine. DataSets can be constructed from JVM objects and manipulated with functional transformations (map/flatMap/filter).
- Difference between DataSet DataFrame RDD
Suppose two rows of data in an RDD look like this:
1, Zhang San, 23
2, Li Si, 35
Then the data in a DataFrame looks like this:
| ID:String | Name:String | Age:int |
| 1         | Zhang San   | 23      |
| 2         | Li Si       | 35      |
And the data in a Dataset looks like this:
| value:String    |
| 1, Zhang San, 23 |
| 2, Li Si, 35     |
DataSet includes the functionality of DataFrame: in Spark 2.0 the two were unified, and DataFrame is simply an alias for DataSet[Row], i.e. a special case of DataSet.
(1) DataSet types are checked at compile time.
(2) It offers an object-oriented programming interface.
Compared with DataFrame, Dataset provides compile-time type checking. For distributed programs, submitting a job even once is laborious (compiling, packaging, uploading, running), and discovering errors only after the job reaches the cluster wastes a lot of time; this is an important reason for introducing Dataset.
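The compile-time checking can be sketched with typed functional transformations. In this sketch (the data is made up), each row is a plain JVM object, so the lambdas are ordinary Scala functions over Person that the compiler fully checks:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("typed-dataset")
  .getOrCreate()
import spark.implicits._

case class Person(id: Int, name: String, age: Int)

val ds = Seq(Person(1, "Zhang San", 23), Person(2, "Li Si", 35)).toDS()

// Typed transformations: p.age and p.name are checked by the compiler,
// unlike string-based column references such as col("age")
val names = ds.filter(p => p.age >= 30).map(p => p.name)
names.show()
```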
- DataFrame and DataSet can be converted into each other:
(1) DataFrame to DataSet: df.as[ElementType]
(2) DataSet to DataFrame: ds.toDF()
- Create a DataSet
(1) Through spark.createDataset
(2) Through the toDS method
(3) Through DataFrame conversion, using as[type]
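The three creation routes above can be sketched together. This is a minimal sketch with made-up data, assuming a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("create-dataset")
  .getOrCreate()
import spark.implicits._

case class Person(id: Int, name: String, age: Int)

// (1) spark.createDataset from a local collection (an RDD also works)
val ds1 = spark.createDataset(Seq(Person(1, "Zhang San", 23)))

// (2) toDS on a Seq or RDD, enabled by importing spark.implicits._
val ds2 = Seq(Person(2, "Li Si", 35)).toDS()

// (3) as[T] conversion from a DataFrame
val ds3 = ds2.toDF().as[Person]

ds1.show()
ds3.show()
```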