Spark SQL

1. Overview of Spark SQL

 1. The past and present of Spark SQL

Shark was a large-scale data warehouse system built for Spark and compatible with Hive. Shark reused Hive's code base, swapping out parts of Hive's physical execution plan, which let Shark users speed up their Hive queries. However, Shark inherited a large and complex code base from Hive, which made it hard to optimize and maintain, and it was tied to specific Spark versions. As performance optimization reached its ceiling and more complex SQL analytic capabilities were integrated, the MapReduce-oriented design inherited from Hive increasingly limited Shark's development. At the Spark Summit on July 1, 2014, Databricks announced that development of Shark would end and that the focus would shift to Spark SQL.

 2. What is Spark SQL

  • Spark SQL is the Spark module for processing structured data. It provides a programming abstraction called DataFrame and acts as a distributed SQL query engine.
  • Compared with the Spark RDD API, Spark SQL carries more information about the structure of the data and the operations performed on it. Spark SQL uses this information to apply extra optimizations, making work on structured data more efficient and convenient.

  • There are several ways to use Spark SQL, including SQL, the DataFrames API, and the Datasets API. Whichever API or programming language you choose, they all run on the same execution engine, so you can switch between them freely; each has its own characteristics, and you can pick whichever style you prefer.
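For example, here is a minimal sketch (assuming a SparkSession named spark, its implicits imported, and a hypothetical people.json file) of the same query written once with the DataFrame API and once as SQL; both run on the same engine:

    // The same query via the DataFrame API and via SQL.
    // `people.json` is a hypothetical input file.
    import spark.implicits._

    val df = spark.read.json("people.json")

    df.filter($"age" > 30).select("name", "age").show()               // DataFrame API style

    df.createOrReplaceTempView("people")
    spark.sql("select name, age from people where age > 30").show()   // SQL style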

3. Why learn Spark SQL     

    1. Comparison between Spark SQL and Hive

  • Hive converts Hive SQL into MapReduce jobs and submits them to the cluster for execution. This greatly simplifies writing MapReduce programs, but MapReduce itself executes relatively slowly.
  • Spark SQL converts SQL into RDD operations and submits them to the cluster to run, so execution is significantly faster.

    2. Features of Spark SQL

  • Easy to integrate

    SQL queries can be mixed seamlessly with Spark programs, and the API is available in Java, Scala, Python, and R.

  • Unified data access

          Connect to a wide range of data sources in the same way.

  • Compatible with Hive

          Supports HiveQL syntax.

  • Standard data connections

          Industry-standard JDBC and ODBC connections are supported.
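A rough sketch of what unified data access looks like in practice (the paths and JDBC settings below are placeholders, not real endpoints):

    // Rough sketch of unified data access: different sources, one read/write API.
    val jsonDF    = spark.read.json("hdfs:///data/people.json")
    val parquetDF = spark.read.parquet("hdfs:///data/people.parquet")
    val csvDF     = spark.read.option("header", "true").csv("hdfs:///data/people.csv")
    val jdbcDF    = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://host:3306/db")
      .option("dbtable", "person")
      .option("user", "user")
      .option("password", "pass")
      .load()

    // Writing follows the same pattern
    jsonDF.write.parquet("hdfs:///out/people.parquet")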

2. DataFrame

 1. What is a DataFrame

  • The predecessor of DataFrame is SchemaRDD; SchemaRDD was renamed DataFrame in Spark 1.3.0. The main difference is that DataFrame no longer inherits directly from RDD but implements most of RDD's functionality itself; you can still call the rdd method on a DataFrame to convert it into an RDD.
  • In Spark, a DataFrame is a distributed data set built on top of RDDs, similar to a two-dimensional table in a traditional database. A DataFrame carries schema metadata: every column of the two-dimensional data set it represents has a name and a type, which lets Spark apply more optimizations under the hood. DataFrames can be constructed from many data sources, such as existing RDDs, structured files, external databases, and Hive tables (see the sketch below).
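A brief sketch of these points (the Hive table name is a placeholder; assumes spark, sc, and spark.implicits._ as in spark-shell):

    // Building a DataFrame from an existing RDD or a Hive table,
    // and converting it back to an RDD.
    import spark.implicits._

    val fromRdd = sc.parallelize(Seq((1, "zhangsan", 20), (2, "lisi", 29)))
                    .toDF("id", "name", "age")     // from an existing RDD
    val fromHive = spark.table("person")           // from a Hive table (needs Hive support)

    val backToRdd = fromRdd.rdd                    // RDD[org.apache.spark.sql.Row]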

 2. The difference between DataFrame and RDD

  • An RDD can be regarded as a distributed collection of objects; Spark does not know the detailed schema of those objects. A DataFrame can be regarded as a distributed collection of Row objects, carrying detailed, column-level schema information, so Spark SQL can perform certain kinds of execution optimization. Spark SQL knows exactly which columns the data set contains and what each column's name and type are. In other words, a DataFrame carries extra structural information: the schema. It looks like a table and comes with new ways to manipulate data, namely the DataFrame API (such as df.select()) and SQL (select id, name from xx_table where ...).
  • DataFrame also introduces off-heap memory, i.e., memory outside the JVM heap that is managed directly by the operating system rather than the JVM. Spark can serialize the data (without its structure) into off-heap memory in binary form and operate on that memory directly. Because Spark understands the schema, it knows how to interpret the bytes.

  • An RDD is a distributed collection of Java objects, while a DataFrame is a distributed collection of Row objects. Besides offering richer operators than RDD, the more important characteristics of DataFrame are better execution efficiency, less data reading, and optimized execution plans.

  • With the higher-level DataFrame abstraction we can process data far more easily, even with plain SQL, so ease of use improves greatly for developers. Processing data through the DataFrame API or SQL is automatically optimized by Spark's optimizer (Catalyst), so even an inefficiently written program or SQL statement can still run very fast. A small comparison with the RDD API is sketched below.
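The following sketch, runnable in spark-shell, writes the same filter against the RDD API and against the DataFrame API (the Person data mirrors the person.txt example used later):

    // The same filter written against the RDD API and the DataFrame API.
    // With the RDD, Spark only sees opaque objects; with the DataFrame,
    // Spark knows the column names and types, so Catalyst can optimize.
    import spark.implicits._

    case class Person(id: Int, name: String, age: Int)
    val people = Seq(Person(1, "zhangsan", 20), Person(2, "lisi", 29))

    sc.parallelize(people).filter(_.age >= 25).collect()   // RDD style: plain Scala objects

    val df = people.toDF()
    df.filter($"age" >= 25).select("id", "name").show()    // DataFrame API style
    df.createOrReplaceTempView("person_tmp")
    spark.sql("select id, name from person_tmp where age >= 25").show()   // SQL style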

 3. Advantages and disadvantages of DataFrame and RDD

  • Advantages and disadvantages of RDDs:

          Advantages:

       (1) Compile-time type safety
                Type errors can be caught at compile time.

       (2) Object-oriented programming style
                Data is manipulated directly by calling methods on objects.

          Disadvantages:

       (1) Performance overhead of serialization and deserialization
                Whether for communication between cluster nodes or for IO operations, both the structure and the data of each object must be serialized and deserialized.

       (2) GC overhead
                Frequent creation and destruction of objects inevitably increases garbage collection.

  • Pros and cons of DataFrame

              DataFrame addresses these shortcomings of RDD by introducing a schema and off-heap memory (memory outside the JVM heap, managed by the operating system). Because Spark can read the data through the schema, only the data itself needs to be serialized and deserialized for communication and IO; the structure part can be omitted. By introducing off-heap memory, data can be manipulated quickly and large amounts of GC are avoided. The trade-off is that DataFrame loses the advantages of RDD: it is not type-safe, and the API is no longer object-oriented.

 4. Read the data source to create a DataFrame

   1. Read a text file to create a DataFrame

  Before Spark 2.0, SQLContext was the entry point in Spark SQL for creating DataFrames and executing SQL. HiveContext, which inherits from SQLContext, could be used to operate on Hive table data with Hive SQL statements and is compatible with Hive. Since Spark 2.0, these entry points have been unified in SparkSession: SparkSession encapsulates SparkContext and SQLContext, and the SparkContext and SQLContext objects can be obtained from a SparkSession.
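In spark-shell, a SparkSession named spark (and a SparkContext named sc) is provided automatically; in a standalone application you would build one yourself, roughly like this:

    // Sketch of the unified entry point in a standalone application.
    // (In spark-shell, `spark` and `sc` are already created for you.)
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("SparkSQLDemo")
      .master("local[2]")
      .enableHiveSupport()          // optional, only if Hive access is needed
      .getOrCreate()

    val sc = spark.sparkContext     // the SparkContext wrapped by the SparkSession
    import spark.implicits._        // enables toDF / toDS and the $"col" syntax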

  

 (1) Create a local file with three columns, id, name, and age, separated by spaces, then upload it to HDFS. The content of person.txt is:

1 zhangsan 20
2 lisi 29
3 wangwu 25
4 zhaoliu 30
5 tianqi 35
6 kobe 40  

    Upload data files to HDFS: hdfs dfs -put person.txt /

(2) In the spark shell, execute the following commands to read the data and split each row on the column separator

         First run spark-shell --master local[2]

        val lineRDD = sc.textFile("/person.txt").map(_.split(" "))

      

(3) Define a case class (equivalent to the schema of the table)

 case class Person(id:Int, name:String, age:Int)

     

(4) Associate RDD with case class

    val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))

     

(5) Convert RDD to DataFrame

      val personDF = personRDD.toDF

     

(6) Process the DataFrame

     personDF.show

     

    personDF.printSchema
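For the person.txt data above, the printed schema should look roughly like this (the Int fields of the case class come out as non-nullable integers):

    root
     |-- id: integer (nullable = false)
     |-- name: string (nullable = true)
     |-- age: integer (nullable = false)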

    

5. Common operations of DataFrame

    1. DSL style syntax 

      DataFrame provides a Domain Specific Language (DSL) for manipulating structured data.

     Here are some usage examples:

      (1) View the content in the DataFrame by calling the show method

        personDF.show

       

     (2) View the contents of selected columns of the DataFrame. For example, view the data of the name field:

       personDF.select(personDF.col("name")).show

       

      Another way to select the name field:

      personDF.select("name").show

     

    Select the name and age fields (the col function comes from org.apache.spark.sql.functions):

    personDF.select(col("name"), col("age")).show

    

(3) Print the Schema information of the DataFrame

         personDF.printSchema

         

(4) Query the id, name, and age columns, with age incremented by 1

         personDF.select(col("id"), col("name"), col("age") + 1).show

          

      This can also be done like this:

       personDF.select(personDF("id"), personDF("name"), personDF("age") + 1).show

       

(5) Filter for rows with age greater than or equal to 25 using the filter method

         personDF.filter(col("age") >= 25).show

        

(6) Count the number of people over the age of 30

        personDF.filter(col("age")>30).count()

        

(7) Group by age and count the number of people of the same age

         personDF.groupBy("age").count().show

         

 2. SQL-style syntax

 One of the strengths of DataFrame is that we can treat it as a relational table and execute SQL queries against it with spark.sql() in a program; the result is returned as a DataFrame.

To use SQL-style syntax, the DataFrame must first be registered as a temporary table, as follows (in Spark 2.0+, createOrReplaceTempView is the preferred, non-deprecated equivalent):

personDF.registerTempTable("t_person")

(1) Query the two oldest people

         spark.sql("select * from t_person order by age desc limit 2").show

          

(2) Display the schema information of the table

         spark.sql("desc t_person").show

         

(3) Query the information of people over the age of 30

         spark.sql("select * from t_person where age > 30 ").show

         

3. DataSet

  • What is a DataSet

         A DataSet is a distributed collection of data. Dataset provides strong typing: unlike DataFrame, each row carries a type constraint rather than being a generic Row. DataSet is a new interface added in Spark 1.6 that combines the advantages of RDDs (strong typing and the ability to use powerful lambda functions) with Spark SQL's optimized execution engine. DataSets can be constructed from JVM objects and manipulated with functional transformations (map/flatMap/filter), as sketched below.
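A small sketch, runnable in spark-shell (where spark.implicits._ is already imported):

    // A strongly typed Dataset built from JVM objects and manipulated
    // with functional transformations.
    import spark.implicits._

    case class Person(id: Int, name: String, age: Int)

    val ds = Seq(Person(1, "zhangsan", 20), Person(2, "lisi", 29)).toDS()
    ds.filter(_.age > 25)               // typed lambda, checked at compile time
      .map(p => p.name.toUpperCase)     // becomes a Dataset[String]
      .show()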

  •  Difference between DataSet, DataFrame, and RDD

            Suppose two rows of data in the RDD look like this:

1, Zhang San, 23
2, Li Si, 35

           Then the data in the DataFrame looks like this:

ID:String   Name:String   Age:int
1           Zhang San     23
2           Li Si         35

           And the data in the Dataset (here a Dataset[String], where each row is a single value column) looks like this:

value:String
1, Zhang San, 23
2, Li Si, 35

A DataSet includes all the functionality of a DataFrame. In Spark 2.0 the two were unified: a DataFrame is represented as Dataset[Row], i.e., a DataSet whose element type is Row.

       (1) A DataSet can check types at compile time

       (2) and offers an object-oriented programming interface

Compared with DataFrame, Dataset provides compile-time type checking. For a distributed program, submitting a job once is laborious (compile, package, upload, run), and discovering an error only after the job is submitted to the cluster wastes a great deal of time; this is an important reason the Dataset was introduced. A rough illustration follows.
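A rough illustration of the difference, using the Person case class from the earlier examples:

    // With a DataFrame, a wrong column name only fails at runtime with an
    // AnalysisException; with a Dataset, a wrong field name fails at compile time.
    import spark.implicits._

    case class Person(id: Int, name: String, age: Int)
    val df = Seq(Person(1, "zhangsan", 20)).toDF()
    val ds = Seq(Person(1, "zhangsan", 20)).toDS()

    df.select("namee")        // compiles, but throws AnalysisException when executed
    // ds.map(_.namee)        // does not compile: value namee is not a member of Person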

  • DataFrame and DataSet can be converted into each other

      (1) Convert DataFrame to DataSet

               df.as[ElementType] converts a DataFrame into a DataSet.

      (2) Convert DataSet to DataFrame

              ds.toDF() converts a DataSet into a DataFrame.
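Concretely, with the Person case class used earlier (a sketch assuming spark.implicits._ is imported, as in spark-shell):

    // Converting between DataFrame and Dataset.
    import spark.implicits._

    case class Person(id: Int, name: String, age: Int)

    val df = Seq(Person(1, "zhangsan", 20)).toDF()   // DataFrame, i.e. Dataset[Row]
    val ds = df.as[Person]                           // DataFrame -> Dataset[Person]
    val back = ds.toDF()                             // Dataset[Person] -> DataFrame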

  • Create a DataSet

        (1) Create with spark.createDataset

      (2) Generate a DataSet with the toDS method

   (3) Convert a DataFrame with as[Type]
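A combined sketch of the three approaches, runnable in spark-shell (where spark and spark.implicits._ are available):

    import spark.implicits._

    case class Person(id: Int, name: String, age: Int)

    // (1) spark.createDataset from a local collection (an RDD works the same way)
    val ds1 = spark.createDataset(Seq(Person(1, "zhangsan", 20), Person(2, "lisi", 29)))

    // (2) toDS on a local collection or an RDD of case class objects
    val ds2 = Seq(Person(3, "wangwu", 25)).toDS()

    // (3) converting a DataFrame with as[Type]
    val ds3 = Seq((4, "zhaoliu", 30)).toDF("id", "name", "age").as[Person]

    ds1.show()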

       
