Spark SQL
Spark SQL is a Spark module for processing structured data; it does not operate on unstructured data directly.
Features
- Easy to integrate (ships with Spark, no separate installation needed)
- Unified data access (structured sources such as JDBC, JSON, Hive tables, and Parquet files can all be used as Spark SQL data sources)
- Fully compatible with Hive (data already stored in Hive can be read and queried directly from Spark SQL)
- Supports standard connectivity (JDBC/ODBC)
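All examples below are written for spark-shell, where sc (SparkContext) and spark (SparkSession) are predefined and spark.implicits._ is already imported. In a standalone application you set this up yourself; a minimal sketch (the app name and master are placeholders):
scala code
import org.apache.spark.sql.SparkSession
// Minimal setup; spark-shell does all of this automatically
val spark = SparkSession.builder().appName("SparkSQLExamples").master("local[*]").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._   // enables .toDF/.toDS and the $"column" syntax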
Create DataFrame
1. Create from a case class
grade.txt file
06140411 Mr.Wu 102 110 106 318
06140407 Mr.Zhi 60 98 80 238
06140404 Mr.Zhang 98 31 63 192
06140403 Mr.Zhang 105 109 107 321
06140406 Mr.Xie 57 87 92 236
06140408 Mr.Guo 102 102 50 254
06140402 Mr.Li 54 61 64 179
06140401 Mr.Deng 83 76 111 270
06140409 Mr.Zhang 70 56 91 217
06140412 Mr.Yao 22 119 112 253
06140410 Mr.Su 45 65 80 190
06140405 Mr.Zheng 79 20 26 125
scala code
// Define the schema with a case class
case class Info(studentID: String, studentName: String, chinese: Int, math: Int, english: Int, totalGrade: Int)
val rdd = sc.textFile("/root/grade.txt").map(_.split("\t"))
val rdd1 = rdd.map(x => Info(x(0), x(1), x(2).toInt, x(3).toInt, x(4).toInt, x(5).toInt))
val df = rdd1.toDF
df.show
result
+---------+-----------+-------+----+-------+----------+
|studentID|studentName|chinese|math|english|totalGrade|
+---------+-----------+-------+----+-------+----------+
| 06140411| Mr.Wu| 102| 110| 106| 318|
| 06140407| Mr.Zhi| 60| 98| 80| 238|
| 06140404| Mr.Zhang| 98| 31| 63| 192|
| 06140403| Mr.Zhang| 105| 109| 107| 321|
| 06140406| Mr.Xie| 57| 87| 92| 236|
| 06140408| Mr.Guo| 102| 102| 50| 254|
| 06140402| Mr.Li| 54| 61| 64| 179|
| 06140401| Mr.Deng| 83| 76| 111| 270|
| 06140409| Mr.Zhang| 70| 56| 91| 217|
| 06140412| Mr.Yao| 22| 119| 112| 253|
| 06140410| Mr.Su| 45| 65| 80| 190|
| 06140405| Mr.Zheng| 79| 20| 26| 125|
+---------+-----------+-------+----+-------+----------+
2. Create through SparkSession
grade.txt file
06140411 Mr.Wu 102 110 106 318
06140407 Mr.Zhi 60 98 80 238
06140404 Mr.Zhang 98 31 63 192
06140403 Mr.Zhang 105 109 107 321
06140406 Mr.Xie 57 87 92 236
06140408 Mr.Guo 102 102 50 254
06140402 Mr.Li 54 61 64 179
06140401 Mr.Deng 83 76 111 270
06140409 Mr.Zhang 70 56 91 217
06140412 Mr.Yao 22 119 112 253
06140410 Mr.Su 45 65 80 190
06140405 Mr.Zheng 79 20 26 125
scala code
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
// Define the schema explicitly with StructType
val mySchema = StructType(List(
  StructField("studentID", DataTypes.StringType),
  StructField("studentName", DataTypes.StringType),
  StructField("chinese", DataTypes.IntegerType),
  StructField("math", DataTypes.IntegerType),
  StructField("english", DataTypes.IntegerType),
  StructField("totalGrade", DataTypes.IntegerType)))
val rdd = sc.textFile("/root/grade.txt").map(_.split("\t"))
val rdd1 = rdd.map(x => Row(x(0), x(1), x(2).toInt, x(3).toInt, x(4).toInt, x(5).toInt))
val df = spark.createDataFrame(rdd1, mySchema)
df.show
result (the same table as in the previous example)
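The schema can also be written as a DDL string when reading files directly; a compact alternative sketch (assumes the same tab-separated /root/grade.txt and Spark 2.3+; gradeDF is an illustrative name):
scala code
// Let the CSV reader apply a DDL-style schema to the tab-separated file
val gradeDF = spark.read.schema("studentID STRING, studentName STRING, chinese INT, math INT, english INT, totalGrade INT").option("sep", "\t").csv("/root/grade.txt")
gradeDF.show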
3. Read formatted files
student.json file
{"studentID":"06140401", "studentName":"Mr.Deng"}
{"studentID":"06140402", "studentName":"Mr.Li"}
{"studentID":"06140403", "studentName":"Mr.Zhang"}
{"studentID":"06140404", "studentName":"Mr.Zhang"}
{"studentID":"06140405", "studentName":"Mr.Zheng"}
{"studentID":"06140406", "studentName":"Mr.Xie"}
{"studentID":"06140407", "studentName":"Mr.Zhi"}
{"studentID":"06140408", "studentName":"Mr.Guo"}
{"studentID":"06140409", "studentName":"Mr.Zhang"}
{"studentID":"06140410", "studentName":"Mr.Su"}
{"studentID":"06140411", "studentName":"Mr.Wu"}
{"studentID":"06140412", "studentName":"Mr.Yao"}
scala code
// The two lines below are equivalent ways to read JSON (in spark-shell the second simply rebinds df)
val df = spark.read.json("/root/temp/student.json")
val df = spark.read.format("json").load("/root/temp/student.json")
df.show
result
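Spark infers the schema from the JSON keys; you can inspect it with printSchema, which here should print both fields as strings:
scala code
df.printSchema
result
root
 |-- studentID: string (nullable = true)
 |-- studentName: string (nullable = true)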
Manipulate DataFrame
DSL statements (df1 here is the grade DataFrame built earlier)
df1.select($"studentName",$"chinese",$"math",$"english",$"totalGrade").show
df1.filter($"totalGrade">300).show
df1.groupBy($"studentName").count.show
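The DSL can also compute new columns inline; a small sketch (checkSum is just an illustrative alias) recomputing the total from the three subject scores:
scala code
df1.select($"studentName",($"chinese"+$"math"+$"english").as("checkSum"),$"totalGrade").show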
SQL statement
Here df1 holds the grade table and df2 holds the student table (the student.json data above).
Create view
df1.createOrReplaceTempView("grade")
df2.createOrReplaceTempView("student")
spark.sql("select studentID,totalGrade from grade").show
spark.sql("select count(*) as studentNum from student").show
Multi-table query (inner join)
spark.sql("select student.studentName,totalGrade from grade,student where grade.studentID = student.studentID order by grade.studentID").show
(studentName must be qualified with the view name because both views contain a studentName column.)
Spark SQL view
createGlobalTempView, createOrReplaceGlobalTempView, createOrReplaceTempView, createTempView
(1) Normal view (local view): valid only in the current session (createTempView, createOrReplaceTempView)
(2) Global view: visible across sessions. It lives in the global_temp namespace (similar to a database), so it is queried as global_temp.<viewName> (createGlobalTempView, createOrReplaceGlobalTempView); see the sketch below.
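A minimal sketch, reusing the grade DataFrame df1 from above (grade_g is an illustrative view name):
scala code
// Register a global view and query it from a brand-new session
df1.createOrReplaceGlobalTempView("grade_g")
spark.sql("select * from global_temp.grade_g").show
spark.newSession().sql("select * from global_temp.grade_g").show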
Create DataSet
1. From a sequence
scala code
case class Person(name: String,age: Int)
val ds = Seq(Person("destiny",18),Person("freedom",20)).toDS
ds.show
result
2. From JSON data
case class Student(studentID: String,studentName: String)
val df = spark.read.format("json").load("/root/temp/student.json")
df.as[Student].show
result
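After as[Student], operations take typed lambdas over the case class instead of Rows; a minimal sketch:
scala code
val ds = df.as[Student]
// Typed filter: the predicate receives Student objects
ds.filter(_.studentName.startsWith("Mr.Zh")).show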
3. From data in other formats
val ds = spark.read.text("/root/temp/spark_workCount.txt").as[String]
val word = ds.flatMap(_.split(" ")).filter(_.length > 3)
word.show
result
val word = ds.flatMap(_.split(" ")).map((_,1)).groupByKey(_._1).count
word.show
result
Manipulate DataSet
// Assuming ds is a DataSet of grades with a totalGrade column
ds.where($"totalGrade" >= 250).show
Multi-table query
case class Grade(studentID: String,chinese: Int,math: Int,english: Int,totalGrade: Int)
case class Student(studentID: String,studentName: String)
val rdd = sc.textFile("/root/temp/gradeSheet.txt").map(_.split("\t"))
val ds1 = rdd.map(x => Grade(x(0),x(1).toInt,x(2).toInt,x(3).toInt,x(4).toInt)).toDS
val rdd = sc.textFile("/root/temp/studentSheet.txt").map(_.split("\t"))
val ds2 = rdd.map(x => Student(x(0),x(1))).toDS
ds1.join(ds2,"studentID").show
ds1.join(ds2,"studentID").where("totalGrade >= 250").show
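Besides the column-name join above, joinWith keeps both sides as typed objects; a minimal sketch:
scala code
// joinWith yields a Dataset[(Grade, Student)] of typed pairs
val pairs = ds1.joinWith(ds2, ds1("studentID") === ds2("studentID"))
pairs.show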
Use data sources
load and save functions
scala code
val ds = spark.read.load("/root/temp/users.parquet")
// save writes Parquet by default
ds.select($"name",$"favorite_color").write.save("/root/temp/parquet")
val ds1 = spark.read.load("/root/temp/parquet")
ds1.show
result
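load and save default to Parquet; other formats can be requested with format(). A small sketch writing the same selection as JSON (the output path is illustrative):
scala code
// Write the same columns as JSON instead of the default Parquet
ds.select($"name",$"favorite_color").write.format("json").save("/root/temp/json_out")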
mode function
df.write.mode("overwrite").save("/root/temp/parquet")
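mode accepts overwrite, append, ignore, and error/errorifexists (the default, which fails if the target already exists). For example:
scala code
// Append to the existing directory instead of failing or overwriting
df.write.mode("append").save("/root/temp/parquet")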
saveAsTable function
df.select($"name").write.saveAsTable("table1")
spark.sql("select * from table1").show
option function
Parquet supports schema merge: DataFrames with overlapping schemas are written as partitions under one directory (key=1 and key=2 below), then read back with the mergeSchema option enabled.
val df = sc.makeRDD(1 to 6).map(x => (x,x*2)).toDF("single","double")
df.write.mode("overwrite").save("/root/temp/table/key=1")
val df1 = sc.makeRDD(7 to 10).map(x => (x,x*3)).toDF("single","triple")
df1.write.mode("overwrite").save("/root/temp/table/key=2")
val df2 = spark.read.option("mergeSchema",true).parquet("/root/temp/table")
df2.printSchema
result (the merged schema contains single, double, triple, and the partition column key)