-
Spark SQL first experience
-
- Entry point: SparkSession
● Before Spark 2.0
SQLContext was the entry point for creating DataFrames and executing SQL.
HiveContext operated on Hive table data through HiveSQL statements and was compatible with Hive operations; HiveContext inherits from SQLContext.
● Since Spark 2.0
SparkSession encapsulates all the features of SQLContext and HiveContext, and from a SparkSession you can also obtain the SparkContext.
SparkSession can execute both SparkSQL and HiveSQL.
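Outside of spark-shell (which creates a session named `spark` automatically), a SparkSession is built explicitly. A minimal sketch, assuming Spark 2.x on the classpath; the app name is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Build a SparkSession (spark-shell does this for you and calls it `spark`)
val spark = SparkSession.builder()
  .appName("SparkSQLDemo")  // hypothetical application name
  .master("local[*]")       // local mode, as used for testing in this section
  // .enableHiveSupport()   // uncomment to get HiveContext-style HiveSQL support
  .getOrCreate()

// The underlying SparkContext is reachable from the session
val sc = spark.sparkContext
```

With `enableHiveSupport()` the same session can run HiveSQL against Hive tables, which is what replaces the old separate HiveContext.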
- Creating DataFrame
- Reading a text file
1. Create a file locally with three columns (id, name, age) separated by spaces, then upload it to HDFS
vim /opt/person.txt
1 zhangsan 20
2 lisi 29
3 wangwu 25
4 zhaoliu 30
5 tianqi 35
6 kobe 40
Upload the data file to HDFS (local mode is used here for my own testing):
hadoop fs -put /opt/person.txt /
2. Run the following commands in spark-shell to read the data, splitting each line into columns on the space delimiter
Start spark-shell:
/export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/bin/spark-shell
Create the RDD:
val lineRDD = sc.textFile("hdfs://node01:8020/person.txt").map(_.split(" "))
Or from the local filesystem:
val lineData2 = sc.textFile("file:///opt/package/person.txt").map(_.split(" "))
// RDD[Array[String]]
3. Define case class (schema corresponding to the table)
case class Person(id:Int, name:String, age:Int)
4. Associate the RDD with the case class
val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt)) // RDD[Person]
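The parsing logic of steps 2–4 can be tried in plain Scala without a cluster; a small sketch using sample lines in the same "id name age" layout as person.txt:

```scala
case class Person(id: Int, name: String, age: Int)

// Sample lines in the same space-separated layout as person.txt
val lines = Seq("1 zhangsan 20", "2 lisi 29", "3 wangwu 25")

val people = lines
  .map(_.split(" "))                              // Array(id, name, age)
  .map(x => Person(x(0).toInt, x(1), x(2).toInt)) // one Person per line

println(people.head) // Person(1,zhangsan,20)
```

This is exactly the function passed to `lineRDD.map` above; Spark simply applies it to each element of the distributed RDD instead of a local Seq.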
5. Convert the RDD into a DataFrame
val personDF = personRDD.toDF //DataFrame
6. View the data and schema
personDF.show
+---+--------+---+
| id|    name|age|
+---+--------+---+
|  1|zhangsan| 20|
|  2|    lisi| 29|
|  3|  wangwu| 25|
|  4| zhaoliu| 30|
|  5|  tianqi| 35|
|  6|    kobe| 40|
+---+--------+---+
personDF.printSchema
7. Register a temporary view
personDF.createOrReplaceTempView("t_person")
8. Execute SQL
spark.sql("select id,name from t_person where id > 3").show
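The same query can also be written with the DataFrame API instead of SQL. A sketch assuming the `personDF` built above and the implicits that spark-shell imports automatically (`$"..."` column syntax):

```scala
// Equivalent to: select id,name from t_person where id > 3
personDF.filter($"id" > 3)
        .select("id", "name")
        .show()
```

Both forms go through the same optimizer, so the choice between SQL strings and the DataFrame API is mostly a matter of style.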
9. A DataFrame can also be constructed directly through SparkSession
val dataFrame = spark.read.text("hdfs://node01:8020/person.txt")
dataFrame.show // Note: reading a plain text file directly does not give complete schema information
dataFrame.printSchema
Local mode operation: