Spark SQL
Spark SQL is a Spark module for processing structured data; it does not operate on unstructured data directly.
Features
- Easy to integrate (ships with Spark, no separate installation needed)
- Unified data access (structured sources such as JDBC, JSON, Hive tables, and Parquet files can all be used as Spark SQL data sources)
- Fully compatible with Hive (data already stored in Hive can be read and queried directly from Spark SQL)
- Supports standard connectivity (JDBC/ODBC)
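All examples below are written for spark-shell, where sc (SparkContext) and spark (SparkSession) are predefined and spark.implicits._ is already imported. In a standalone application you set this up yourself; a minimal sketch (the app name and master are placeholders):
scala code
import org.apache.spark.sql.SparkSession
// Minimal setup; spark-shell does all of this automatically
val spark = SparkSession.builder().appName("SparkSQLExamples").master("local[*]").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._   // enables .toDF/.toDS and the $"column" syntax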
Create DataFrame
1. Create from a case class
grade.txt file
06140411 Mr.Wu 102 110 106 318
06140407 Mr.Zhi 60 98 80 238
06140404 Mr.Zhang 98 31 63 192
06140403 Mr.Zhang 105 109 107 321
06140406 Mr.Xie 57 87 92 236
06140408 Mr.Guo 102 102 50 254
06140402 Mr.Li 54 61 64 179
06140401 Mr.Deng 83 76 111 270
06140409 Mr.Zhang 70 56 91 217
06140412 Mr.Yao 22 119 112 253
06140410 Mr.Su 45 65 80 190
06140405 Mr.Zheng 79 20 26 125
scala code
// Define the schema with a case class
case class Info(studentID: String, studentName: String, chinese: Int, math: Int, english: Int, totalGrade: Int)
val rdd = sc.textFile("/root/grade.txt").map(_.split("\t"))
val rdd1 = rdd.map(x => Info(x(0), x(1), x(2).toInt, x(3).toInt, x(4).toInt, x(5).toInt))
val df = rdd1.toDF
df.show
result
+---------+-----------+-------+----+-------+----------+
|studentID|studentName|chinese|math|english|totalGrade|
+---------+-----------+-------+----+-------+----------+
| 06140411| Mr.Wu| 102| 110| 106| 318|
| 06140407| Mr.Zhi| 60| 98| 80| 238|
| 06140404| Mr.Zhang| 98| 31| 63| 192|
| 06140403| Mr.Zhang| 105| 109| 107| 321|
| 06140406| Mr.Xie| 57| 87| 92| 236|
| 06140408| Mr.Guo| 102| 102| 50| 254|
| 06140402| Mr.Li| 54| 61| 64| 179|
| 06140401| Mr.Deng| 83| 76| 111| 270|
| 06140409| Mr.Zhang| 70| 56| 91| 217|
| 06140412| Mr.Yao| 22| 119| 112| 253|
| 06140410| Mr.Su| 45| 65| 80| 190|
| 06140405| Mr.Zheng| 79| 20| 26| 125|
+---------+-----------+-------+----+-------+----------+
2. Create through SparkSession
grade.txt file
06140411 Mr.Wu 102 110 106 318
06140407 Mr.Zhi 60 98 80 238
06140404 Mr.Zhang 98 31 63 192
06140403 Mr.Zhang 105 109 107 321
06140406 Mr.Xie 57 87 92 236
06140408 Mr.Guo 102 102 50 254
06140402 Mr.Li 54 61 64 179
06140401 Mr.Deng 83 76 111 270
06140409 Mr.Zhang 70 56 91 217
06140412 Mr.Yao 22 119 112 253
06140410 Mr.Su 45 65 80 190
06140405 Mr.Zheng 79 20 26 125
scala code
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
// Define the schema explicitly with StructType
val mySchema = StructType(List(
  StructField("studentID", DataTypes.StringType),
  StructField("studentName", DataTypes.StringType),
  StructField("chinese", DataTypes.IntegerType),
  StructField("math", DataTypes.IntegerType),
  StructField("english", DataTypes.IntegerType),
  StructField("totalGrade", DataTypes.IntegerType)))
val rdd = sc.textFile("/root/grade.txt").map(_.split("\t"))
val rdd1 = rdd.map(x => Row(x(0), x(1), x(2).toInt, x(3).toInt, x(4).toInt, x(5).toInt))
val df = spark.createDataFrame(rdd1, mySchema)
df.show
result (the same table as in the previous example)
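The schema can also be written as a DDL string when reading files directly; a compact alternative sketch (assumes the same tab-separated /root/grade.txt and Spark 2.3+; gradeDF is an illustrative name):
scala code
// Let the CSV reader apply a DDL-style schema to the tab-separated file
val gradeDF = spark.read.schema("studentID STRING, studentName STRING, chinese INT, math INT, english INT, totalGrade INT").option("sep", "\t").csv("/root/grade.txt")
gradeDF.show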
3. Read formatted files
student.json file
{"studentID":"06140401", "studentName":"Mr.Deng"}
{"studentID":"06140402", "studentName":"Mr.Li"}
{"studentID":"06140403", "studentName":"Mr.Zhang"}
{"studentID":"06140404", "studentName":"Mr.Zhang"}
{"studentID":"06140405", "studentName":"Mr.Zheng"}
{"studentID":"06140406", "studentName":"Mr.Xie"}
{"studentID":"06140407", "studentName":"Mr.Zhi"}
{"studentID":"06140408", "studentName":"Mr.Guo"}
{"studentID":"06140409", "studentName":"Mr.Zhang"}
{"studentID":"06140410", "studentName":"Mr.Su"}
{"studentID":"06140411", "studentName":"Mr.Wu"}
{"studentID":"06140412", "studentName":"Mr.Yao"}
scala code
// The two lines below are equivalent ways to read JSON (in spark-shell the second simply rebinds df)
val df = spark.read.json("/root/temp/student.json")
val df = spark.read.format("json").load("/root/temp/student.json")
df.show
result
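Spark infers the schema from the JSON keys; you can inspect it with printSchema, which here should print both fields as strings:
scala code
df.printSchema
result
root
 |-- studentID: string (nullable = true)
 |-- studentName: string (nullable = true)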
Manipulate DataFrame
DSL statements (df1 here is the grade DataFrame built earlier)
df1.select($"studentName",$"chinese",$"math",$"english",$"totalGrade").show
df1.filter($"totalGrade">300).show
df1.groupBy($"studentName").count.show
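The DSL can also compute new columns inline; a small sketch (checkSum is just an illustrative alias) recomputing the total from the three subject scores:
scala code
df1.select($"studentName",($"chinese"+$"math"+$"english").as("checkSum"),$"totalGrade").show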
SQL statement
Here df1 holds the grade table and df2 holds the student table (the student.json data above).
Create view
df1.createOrReplaceTempView("grade")
df2.createOrReplaceTempView("student")
spark.sql("select studentID,totalGrade from grade").show
spark.sql("select count(*) as studentNum from student").show
Multi-table query (inner join)
spark.sql("select student.studentName,totalGrade from grade,student where grade.studentID = student.studentID order by grade.studentID").show
(studentName must be qualified with the view name because both views contain a studentName column.)
Spark SQL view
createGlobalTempView, createOrReplaceGlobalTempView, createOrReplaceTempView, createTempView
(1) Normal view (local view): valid only in the current session (createTempView, createOrReplaceTempView)
(2) Global view: visible across sessions. It lives in the global_temp namespace (similar to a database), so it is queried as global_temp.<viewName> (createGlobalTempView, createOrReplaceGlobalTempView); see the sketch below.
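A minimal sketch, reusing the grade DataFrame df1 from above (grade_g is an illustrative view name):
scala code
// Register a global view and query it from a brand-new session
df1.createOrReplaceGlobalTempView("grade_g")
spark.sql("select * from global_temp.grade_g").show
spark.newSession().sql("select * from global_temp.grade_g").show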
Create DataSet
1. From a sequence
scala code
case class Person(name: String,age: Int)
val ds = Seq(Person("destiny",18),Person("freedom",20)).toDS
ds.show
result
2. From JSON data
case class Student(studentID: String,studentName: String)
val df = spark.read.format("json").load("/root/temp/student.json")
df.as[Student].show
result
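After as[Student], operations take typed lambdas over the case class instead of Rows; a minimal sketch:
scala code
val ds = df.as[Student]
// Typed filter: the predicate receives Student objects
ds.filter(_.studentName.startsWith("Mr.Zh")).show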
3. From data in other formats
val ds = spark.read.text("/root/temp/spark_workCount.txt").as[String]
val word = ds.flatMap(_.split(" ")).filter(_.length > 3)
word.show
result
val word = ds.flatMap(_.split(" ")).map((_,1)).groupByKey(_._1).count
word.show
result
Manipulate DataSet
// Assuming ds is a DataSet of grades with a totalGrade column
ds.where($"totalGrade" >= 250).show
Multi-table query
case class Grade(studentID: String,chinese: Int,math: Int,english: Int,totalGrade: Int)
case class Student(studentID: String,studentName: String)
val rdd = sc.textFile("/root/temp/gradeSheet.txt").map(_.split("\t"))
val ds1 = rdd.map(x => Grade(x(0),x(1).toInt,x(2).toInt,x(3).toInt,x(4).toInt)).toDS
val rdd = sc.textFile("/root/temp/studentSheet.txt").map(_.split("\t"))
val ds2 = rdd.map(x => Student(x(0),x(1))).toDS
ds1.join(ds2,"studentID").show
ds1.join(ds2,"studentID").where("totalGrade >= 250").show
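Besides the column-name join above, joinWith keeps both sides as typed objects; a minimal sketch:
scala code
// joinWith yields a Dataset[(Grade, Student)] of typed pairs
val pairs = ds1.joinWith(ds2, ds1("studentID") === ds2("studentID"))
pairs.show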
Use data sources
load and save functions
scala code
val ds = spark.read.load("/root/temp/users.parquet")
// save writes Parquet by default
ds.select($"name",$"favorite_color").write.save("/root/temp/parquet")
val ds1 = spark.read.load("/root/temp/parquet")
ds1.show
result
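load and save default to Parquet; other formats can be requested with format(). A small sketch writing the same selection as JSON (the output path is illustrative):
scala code
// Write the same columns as JSON instead of the default Parquet
ds.select($"name",$"favorite_color").write.format("json").save("/root/temp/json_out")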
mode function
df.write.mode("overwrite").save("/root/temp/parquet")
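mode accepts overwrite, append, ignore, and error/errorifexists (the default, which fails if the target already exists). For example:
scala code
// Append to the existing directory instead of failing or overwriting
df.write.mode("append").save("/root/temp/parquet")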
saveAsTable function
df.select($"name").write.saveAsTable("table1")
spark.sql("select * from table1").show
option function
Parquet supports schema merge: DataFrames with overlapping schemas are written as partitions under one directory (key=1 and key=2 below), then read back with the mergeSchema option enabled.
val df = sc.makeRDD(1 to 6).map(x => (x,x*2)).toDF("single","double")
df.write.mode("overwrite").save("/root/temp/table/key=1")
val df1 = sc.makeRDD(7 to 10).map(x => (x,x*3)).toDF("single","triple")
df1.write.mode("overwrite").save("/root/temp/table/key=2")
val df2 = spark.read.option("mergeSchema",true).parquet("/root/temp/table")
df2.printSchema
result (the merged schema contains single, double, triple, and the partition column key)