Chapter 2: Spark SQL First Experience

SparkSession is the entry point (driver) for Spark SQL.
A SparkSession can execute both Spark SQL and HiveQL.
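In the spark-shell a SparkSession is already provided as the variable spark. In a standalone application you build one yourself; the following is a minimal sketch (the app name and master value are placeholders, and enableHiveSupport is only needed if you want to run HiveQL against Hive tables):

import org.apache.spark.sql.SparkSession

// Minimal sketch: build a SparkSession in an application (the spark-shell already provides `spark`)
val spark = SparkSession.builder()
  .appName("SparkSQLFirstExperience")   // placeholder app name
  .master("local[*]")                   // placeholder master; use your cluster's master in production
  .enableHiveSupport()                  // optional: only if Hive support is needed
  .getOrCreate()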
1. Create a DataFrame
Create a text file
1. Create a file locally with three columns (id, name, and age) separated by spaces, then upload it to HDFS
vim /root/person.txt

1 zhangsan 20
2 lisi 29
3 wangwu 25
4 zhaoliu 30
5 tianqi 35
6 kobe 40

Upload the data file to HDFS:

hadoop fs -put /root/person.txt /

2. Execute the following commands in the spark-shell to read the data and split each row on the column separator (a space)

Open the spark-shell:
/export/servers/spark/bin/spark-shell

Create an RDD:

val lineRDD = sc.textFile("hdfs://node01:8020/person.txt").map(_.split(" ")) // RDD[Array[String]]

3. Define a case class (equivalent to the table schema)

case class Person(id:Int, name:String, age:Int)

4. Associate the RDD with the case class

val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt)) // RDD[Person]

5. Convert RDD to DataFrame

val personDF = personRDD.toDF //DataFrame
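The toDF conversion works here because the spark-shell imports the SparkSession implicits automatically. In your own application you would need to import them first; a minimal sketch, assuming a SparkSession named spark already exists:

// Sketch for a standalone application (not needed in the spark-shell):
import spark.implicits._            // brings toDF / toDS into scope
val personDF = personRDD.toDF()     // DataFrame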

6. View data and schema

personDF.show
+---+--------+---+
| id|    name|age|
+---+--------+---+
|  1|zhangsan| 20|
|  2|    lisi| 29|
|  3|  wangwu| 25|
|  4| zhaoliu| 30|
|  5|  tianqi| 35|
|  6|    kobe| 40|
+---+--------+---+
// metadata structure
personDF.printSchema
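For reference, printSchema should print something close to the following (nullability is inferred from the case class fields, so the exact output may differ slightly by Spark version):

root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)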

7. Register the DataFrame as a temporary view

personDF.createOrReplaceTempView("t_person")

8. Execute SQL

spark.sql("select id,name from t_person where id > 3").show
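The same query can also be expressed with the DataFrame API instead of SQL; a small sketch of the equivalent call chain (the $ column syntax is available in the spark-shell because the implicits are already imported):

// Equivalent to: select id,name from t_person where id > 3
personDF.filter($"id" > 3).select("id", "name").show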

9. A DataFrame can also be built directly through SparkSession

val dataFrame = spark.read.text("hdfs://hadoop01:8020/person.txt")
dataFrame.show // Note: a text file read directly does not have complete schema information
dataFrame.printSchema
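If you do want a full schema when starting from a plain text file, one option is to split the lines yourself and name the columns explicitly; a sketch using the same file path as above:

// Sketch: read the text file, split each line, and name the columns explicitly
val personDF2 = spark.read.textFile("hdfs://hadoop01:8020/person.txt")
  .map(_.split(" "))
  .map(a => (a(0).toInt, a(1), a(2).toInt))
  .toDF("id", "name", "age")
personDF2.printSchema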

2.2.2. Read a json file
1. Data file
Use the json file under the Spark installation package:

/export/servers/spark/examples/src/main/resources/people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

2. Execute the following command in the spark-shell to read the data

val jsonDF = spark.read.json("file:///export/servers/spark/examples/src/main/resources/people.json")

3. Next, you can use the DataFrame operations

jsonDF.show
// Note: a json file read this way already has schema information, because the json file itself contains the schema, which Spark SQL can parse automatically
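For reference, the schema inferred from people.json typically looks like the following (json numbers are inferred as long by default):

jsonDF.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)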

2.2.3. Read a parquet file
1. Data file
Use the parquet file under the Spark installation package:

/export/servers/spark/examples/src/main/resources/users.parquet

2. Execute the following command in the spark-shell to read the data

val parquetDF = spark.read.parquet("file:///export/servers/spark/examples/src/main/resources/users.parquet")

3. Next, you can use the DataFrame operations

parquetDF.show
// Note: a parquet file read directly has schema information, because parquet files store the column (schema) information
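Parquet also works in the other direction: writing a DataFrame out as parquet preserves its schema, so it can be read back without any extra work. A small sketch (the output path is a placeholder):

// Write the DataFrame built earlier out as parquet (placeholder output path), then read it back
personDF.write.mode("overwrite").parquet("hdfs://node01:8020/person_parquet")
spark.read.parquet("hdfs://node01:8020/person_parquet").printSchema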

2. Create a DataSet
1. Create a Dataset through spark.createDataset

val fileRdd = sc.textFile("hdfs://node01:8020/person.txt") // RDD[String]
val ds1 = spark.createDataset(fileRdd) // Dataset[String]
ds1.show

2. Generate a Dataset through the RDD.toDS method

case class Person(name:String, age:Int)
val data = List(Person("zhangsan",20),Person("lisi",30)) // List[Person]
val dataRDD = sc.makeRDD(data)
val ds2 = dataRDD.toDS //Dataset[Person]
ds2.show
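Because ds2 is a typed Dataset[Person], you can also apply ordinary Scala functions to it, in addition to the SQL-style operators; a small sketch:

// Typed (lambda) operations work directly on the case class fields
ds2.filter(_.age > 25).show
ds2.map(p => p.name.toUpperCase).show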

3. Generate a Dataset through the DataFrame.as[T] transformation

case class Person(name:String, age:Long)
val jsonDF = spark.read.json("file:///export/servers/spark/examples/src/main/resources/people.json")
val jsonDS = jsonDF.as[Person] // Dataset[Person]
jsonDS.show

4. A Dataset can also be registered as a table and queried

jsonDS.createOrReplaceTempView("t_person")
spark.sql("select * from t_person").show
