SparkSession as the SparkSQL driver
SparkSession is the entry point for SparkSQL and can execute both Spark SQL and HiveQL.
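In the spark-shell used below, a SparkSession named spark is created for you automatically. In a standalone application you build one yourself; a minimal sketch, in which the app name and master URL are placeholder assumptions:
import org.apache.spark.sql.SparkSession
// Minimal sketch: appName and master are placeholder assumptions
val spark = SparkSession.builder()
  .appName("SparkSQLDemo")
  .master("local[*]")
  .enableHiveSupport() // enables HiveQL support (requires Hive classes on the classpath)
  .getOrCreate()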
1. Create a DataFrame
- Read a text file
1. Create a local file with three columns (id, name, age) separated by spaces, then upload it to HDFS:
vim /root/person.txt
1 zhangsan 20
2 lisi 29
3 wangwu 25
4 zhaoliu 30
5 tianqi 35
6 kobe 40
Upload the data file to HDFS:
hadoop fs -put /root/person.txt /
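(Optional) verify the upload:
hadoop fs -cat /person.txt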
2. Execute the following commands in the spark shell to read the data and split each row on the column separator.
Open the spark-shell:
/export/servers/spark/bin/spark-shell
Create an RDD:
val lineRDD = sc.textFile("hdfs://node01:8020/person.txt").map(_.split(" ")) // RDD[Array[String]]
3. Define a case class (equivalent to a table schema)
case class Person(id:Int, name:String, age:Int)
4. Associate the RDD with the case class
val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt)) // RDD[Person]
5. Convert the RDD to a DataFrame
val personDF = personRDD.toDF //DataFrame
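If you prefer not to define a case class, a sketch of an alternative that builds the schema explicitly (reusing lineRDD from step 2; the column names mirror those above):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}
// Sketch: declare the schema explicitly instead of deriving it from a case class
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = false)))
val rowRDD = lineRDD.map(x => Row(x(0).toInt, x(1), x(2).toInt))
val personDF2 = spark.createDataFrame(rowRDD, schema)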
6. View data and schema
personDF.show
+---+--------+---+
| id|    name|age|
+---+--------+---+
|  1|zhangsan| 20|
|  2|    lisi| 29|
|  3|  wangwu| 25|
|  4| zhaoliu| 30|
|  5|  tianqi| 35|
|  6|    kobe| 40|
+---+--------+---+
// Print the schema (metadata structure)
personDF.printSchema
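With the Person case class above, the printed schema should look roughly like:
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)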
7. Register the DataFrame as a temporary table
personDF.createOrReplaceTempView("t_person")
8. Execute SQL
spark.sql("select id,name from t_person where id > 3").show
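The same query can also be written with the DataFrame DSL instead of a SQL string, for example:
// Equivalent DSL query; the $"..." column syntax comes from spark.implicits._ (already imported in spark-shell)
personDF.select($"id", $"name").filter($"id" > 3).show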
9. A DataFrame can also be built directly through SparkSession
val dataFrame = spark.read.text("hdfs://hadoop01:8020/person.txt")
dataFrame.show // Note: a text file read directly does not have complete schema information
dataFrame.printSchema
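Since spark.read.text yields a single string column named value, one sketch for recovering the three columns from it (the $ syntax again relies on spark.implicits._):
import org.apache.spark.sql.functions.split
// Sketch: split the single "value" column back into typed columns
val structured = dataFrame
  .select(split($"value", " ").as("parts"))
  .select(
    $"parts".getItem(0).cast("int").as("id"),
    $"parts".getItem(1).as("name"),
    $"parts".getItem(2).cast("int").as("age"))
structured.printSchema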
- Read json file
1. Data file
Use the json file shipped with the Spark installation package:
/export/servers/spark/examples/src/main/resources/people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
2. Execute the following command in the spark shell to read the data
val jsonDF = spark.read.json("file:///export/servers/spark/examples/src/main/resources/people.json")
3. You can then apply DataFrame operations
jsonDF.show
// Note: a json file read directly does have schema information, because the json format itself carries it, so SparkSQL can parse it automatically
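Because the schema was inferred, column operations work right away, for example:
jsonDF.filter("age > 20").select("name").show // rows whose age is null are filtered out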
- Read a parquet file
1. Data file
Use the parquet file shipped with the Spark installation package:
/export/servers/spark/examples/src/main/resources/users.parquet
2. Execute the following command in the spark shell to read the data
val parquetDF = spark.read.parquet("file:///export/servers/spark/examples/src/main/resources/users.parquet")
3. You can then apply DataFrame operations
parquetDF.show
// Note: a parquet file read directly has schema information, because the parquet format stores the column metadata
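Parquet also makes a convenient output format; a sketch that writes the personDF from section 1 back out, assuming it is still in scope (the output path is a placeholder assumption):
// Sketch: the output path is a placeholder assumption
personDF.write.mode("overwrite").parquet("hdfs://node01:8020/person_parquet")
spark.read.parquet("hdfs://node01:8020/person_parquet").printSchema // the schema is fully restored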
2. Create a Dataset
1. Create a Dataset through spark.createDataset
val fileRdd = sc.textFile("hdfs://node01:8020/person.txt") // RDD[String]
val ds1 = spark.createDataset(fileRdd) // Dataset[String]
ds1.show
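ds1 is only a Dataset[String]; a sketch for parsing it into a typed Dataset, assuming the Person(id, name, age) case class from section 1 is in scope:
// Sketch: parse each line into Person (the case class defined in section 1)
val personDS = ds1.map { line =>
  val x = line.split(" ")
  Person(x(0).toInt, x(1), x(2).toInt)
}
personDS.show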
2. Generate a Dataset through the RDD.toDS method
case class Person(name:String, age:Int)
val data = List(Person("zhangsan", 20), Person("lisi", 30)) // List[Person]
val dataRDD = sc.makeRDD(data)
val ds2 = dataRDD.toDS //Dataset[Person]
ds2.show
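Because ds2 is typed, transformations can use plain Scala lambdas with compile-time checked field access, for example:
ds2.filter(_.age > 25).show // keeps only lisi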
3. Generate a Dataset through the DataFrame.as[T] conversion
case class Person(name:String, age:Long) // age is Long because SparkSQL infers json integers as LongType
val jsonDF = spark.read.json("file:///export/servers/spark/examples/src/main/resources/people.json")
val jsonDS = jsonDF.as[Person] // Dataset[Person]
jsonDS.show
4. A Dataset can also be registered as a table and queried
jsonDS.createOrReplaceTempView("t_person")
spark.sql("select * from t_person").show