Spark Basic Study Notes 23: DataFrame and Dataset

0. Learning Objectives of This Lecture

  1. Understand the basic concepts of Spark SQL
  2. Master the basic concepts of DataFrame
  3. Master the basic concepts of Dataset

1. Spark SQL

(1) Overview of Spark SQL

  • Spark SQL is a Spark component for structured data processing. So-called structured data refers to data that carries schema information, such as data in JSON, Parquet, Avro, and CSV formats. Unlike the low-level Spark RDD API, Spark SQL provides query and computation interfaces over structured data.

(2) Main features of Spark SQL

1. Seamlessly combine SQL queries with Spark applications

  • Spark SQL allows structured data to be queried inside Spark programs using either SQL or the familiar DataFrame API. Unlike Hive, which translates SQL into MapReduce jobs and is therefore based on MapReduce underneath, Spark SQL is based on Spark RDDs.
  • Embedding an SQL statement in a Spark application:
val results = spark.sql("SELECT * FROM users")

2. Spark SQL connects multiple data sources in the same way

  • Spark SQL provides a uniform way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, JDBC, and others.
  • For example, read a JSON file from HDFS, create a temporary view from its content, and finally join it with another table on a specified field:
// Read the JSON file
val userScoreDF = spark.read.json("hdfs://master:9000/users.json")
// Create the temporary view user_score
userScoreDF.createTempView("user_score")
// Join on the name field
val resDF = spark.sql("SELECT i.age, i.name, c.score FROM user_info i INNER JOIN user_score c ON i.name = c.name")
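  • Note that the query above joins against a second view, user_info, which the snippet does not create. A minimal sketch of how it might be registered, assuming the user info also comes from a JSON file (the file name users_info.json is hypothetical):
// Hypothetical: load user info from another JSON file and register it as a view
val userInfoDF = spark.read.json("hdfs://master:9000/users_info.json")
userInfoDF.createTempView("user_info")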

3. Run SQL or HiveQL queries on existing data warehouses

  • Spark SQL supports HiveQL syntax as well as Hive SerDes and UDFs (user-defined functions), allowing it to access existing Hive warehouses.

2. DataFrame

(1) Overview of DataFrame

  • DataFrame is a programming abstraction provided by Spark SQL. Like an RDD, it is a distributed collection of data; unlike an RDD, however, a DataFrame's data is organized into named columns, just like a table in a relational database. In addition, many kinds of data can be converted into DataFrames, such as RDDs produced during Spark computation, structured data files, Hive tables, and external databases.

(2) Convert RDD to DataFrame

  • A DataFrame adds data description information (a schema, i.e., metadata) on top of an RDD, which is why it looks more like a database table.
  • Suppose an RDD contains 3 rows of data.
  • After the RDD is converted into a DataFrame, the same data is organized into named columns (a sketch follows below).
  • Using the DataFrame API, optionally combined with SQL, to process structured data is easier than using RDDs. Moreover, when data is processed through the DataFrame API or SQL, the Spark optimizer optimizes the job automatically, so even an inefficiently written program or query can run fast.
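  • As an illustrative sketch of this conversion (the column names and sample rows are assumptions, not the original figures), an RDD of tuples can be turned into a DataFrame with toDF() in the Spark Shell:
import spark.implicits._  // enables the toDF() conversion

// An RDD holding 3 rows of (id, name, age) tuples
val rdd = spark.sparkContext.parallelize(Seq((1, "zhangsan", 23), (2, "lisi", 24), (3, "wangwu", 25)))
// Attach a schema: each tuple field becomes a named column
val df = rdd.toDF("id", "name", "age")
df.show()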

3. Dataset

(1) Dataset overview

  • Dataset is a distributed data collection and a new API added in Spark 1.6. Compared with RDD, Dataset provides strong typing, adding a type constraint to each row of data. Moreover, operations expressed through the Dataset API are also optimized by the Spark SQL optimizer, which improves program execution efficiency.

(2) Convert RDD to Dataset

  • Suppose an RDD contains 3 rows of data.
  • After converting it to a Dataset, each row carries a specific type (a sketch follows below).
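  • A comparable sketch for the Dataset conversion (the Person case class and sample rows are illustrative assumptions):
import spark.implicits._

case class Person(id: Int, name: String, age: Int)
// A typed RDD of Person objects
val personRDD = spark.sparkContext.parallelize(Seq(Person(1, "zhangsan", 23), Person(2, "lisi", 24), Person(3, "wangwu", 25)))
// toDS() keeps the element type: the result is Dataset[Person]
val personDS = personRDD.toDS()
personDS.show()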

(3) The relationship between DataFrame and Dataset

  • In Spark, a DataFrame is a Dataset whose elements are of type Row; that is, DataFrame is just a type alias for Dataset[Row].
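  • This relationship is declared directly in the org.apache.spark.sql package object of Spark's source code (Spark 2.0 and later):
type DataFrame = Dataset[Row]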

4. Simple use of Spark SQL

(1) Understanding SparkSession

  • When the Spark Shell starts, in addition to creating an instance of SparkContext named sc by default, it also creates an instance of SparkSession named spark, and this spark variable can be used directly in the Spark Shell.
  • SparkSession is just a wrapper built on top of SparkContext; the application's entry point is still the SparkContext. SparkSession allows users to write Spark programs by calling the DataFrame and Dataset APIs: it supports loading data from different data sources, converting the data into DataFrames, and then manipulating the DataFrame data with SQL statements.
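  • In a standalone application (outside the Spark Shell), a SparkSession must be built explicitly. A minimal sketch, where the application name and master URL are illustrative:
import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession; in the Spark Shell it already exists as `spark`
val spark = SparkSession.builder()
  .appName("SparkSQLDemo")  // illustrative name
  .master("local[*]")       // illustrative; point at your cluster master instead
  .getOrCreate()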

(2) Prepare data files

  • Create a file named student.txt and upload it to the /input directory in HDFS.
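  • The file's exact contents are not reproduced here; given the Student(id, name, age) schema used later, it presumably holds comma-separated records along these lines (values purely illustrative):
1,zhangsan,23
2,lisi,24
3,wangwu,25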

(3) Load data as Dataset

1. Read the text file and return the dataset

  • Calling the SparkSession API read.textFile() reads the file content at the specified path and loads it as a Dataset.
  • Execute the command: val ds = spark.read.textFile("hdfs://master:9000/input/student.txt")
  • As can be seen from the type of the ds variable, the textFile() method loads what it reads as a Dataset. Besides textFile(), you can also use methods such as csv(), jdbc(), and json() to read CSV files, JDBC data sources, JSON files, and so on.
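  • For reference, those other readers return DataFrames rather than Dataset[String]; a quick sketch with illustrative paths:
// Illustrative paths; csv() and json() return DataFrames
val csvDF = spark.read.csv("hdfs://master:9000/input/data.csv")
val jsonDF = spark.read.json("hdfs://master:9000/input/data.json")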

2. Display the content of the dataset

  • Call the show() method on the Dataset to display its contents.
  • Execute the command: ds.show()
  • As can be seen, the Dataset treats each line of the file as one element, and all the elements form a single column whose default name is value.

(4) Add metadata information to the dataset

1. Define the case class

  • Define a case class Student to hold the data description information (the schema).
  • Execute the command: case class Student(id: Int, name: String, age: Int)

2. Import implicit conversions

  • Import SparkSession's implicit conversions so that Dataset operators can be used afterward.
  • Execute the command: import spark.implicits._

3. Save the dataset content into the case class

  • Call the map() operator of the Dataset to split each element and store its fields in Student objects, as sketched below.
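  • A sketch of the conversion, assuming each line holds comma-separated id, name, and age fields (e.g. 1,zhangsan,23):
val studentDataset = ds.map(line => {
  // Split the comma-separated line into its fields
  val fields = line.split(",")
  Student(fields(0).toInt, fields(1), fields(2).toInt)
})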

4. View the dataset content

  • Execute the command: studentDataset.show()
  • As you can see, the data in studentDataset looks just like a relational database table.

Source: blog.csdn.net/howard2005/article/details/124341503