Converting to a DataFrame (DF)
1. Convert an RDD to a DataFrame with a case class
Define the case class
scala> case class Person(ID:Int,Name:String,Age:Int)
defined class Person
Define RDD
scala> val lineRDD = sc.textFile("file:///export/person.txt").map(_.split(" "))
lineRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:24
Associate the RDD with the case class
scala> val PersonRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))
PersonRDD: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[7] at map at <console>:28
Convert the RDD to a DF
scala> val PersonDF = PersonRDD.toDF
PersonDF: org.apache.spark.sql.DataFrame = [ID: int, Name: string ... 1 more field]
Show the DF
scala> PersonDF.show
+---+--------+---+
| ID|    Name|Age|
+---+--------+---+
|  1|zhangsan| 20|
|  2|    lisi| 29|
|  3|  wangwu| 25|
|  4| zhaoliu| 30|
|  5|  tianqi| 35|
|  6|    kobe| 40|
+---+--------+---+
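The transcript above assumes person.txt holds space-separated lines of the form "id name age" (for example, "1 zhangsan 20"), matching the split(" ") call. If no case class is available, toDF can also take explicit column names; a minimal sketch reusing the same lineRDD (variable names here are illustrative):
scala> val tupleRDD = lineRDD.map(x => (x(0).toInt, x(1), x(2).toInt))
scala> val personDF2 = tupleRDD.toDF("ID", "Name", "Age")
scala> personDF2.show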
2. Read a JSON file to create a DataFrame
Spark parses the JSON data directly to create the DF
val jsonDF = spark.read.json("file:///export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/examples/src/main/resources/people.json")
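Once loaded, the schema inferred from the JSON keys and the data itself can be inspected directly; a minimal sketch (the exact output depends on the people.json file shipped with Spark's examples):
jsonDF.printSchema()
jsonDF.show()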
3. Read a Parquet (columnar storage) file to create a DataFrame
Spark parses the Parquet data directly to create the DF
val parquetDF = spark.read.parquet("file:///export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/examples/src/main/resources/users.parquet")
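As with JSON, the schema comes from the Parquet file's metadata, and a DataFrame can be written back out in the same format; a minimal sketch (the output path below is hypothetical):
parquetDF.printSchema()
parquetDF.show()
parquetDF.write.parquet("file:///tmp/users_copy.parquet")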
Implementing the above in a standalone Scala program
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

case class Person(id: Int, name: String, age: Int)

object sparkDF {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder().appName("sparkSQL").master("local[2]").getOrCreate()
    // Read the text file and split each space-separated line into fields
    val fileRDD: RDD[String] = sparkSession.sparkContext.textFile("file:///C:\\Users\\Administrator\\Documents\\tt")
    val arrayRDD: RDD[Array[String]] = fileRDD.map(_.split(" "))
    val personRDD: RDD[Person] = arrayRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))
    sparkSession.sparkContext.setLogLevel("WARN")
    // todo: the toDF method requires importing the session's implicits
    import sparkSession.implicits._
    val personDF = personRDD.toDF()
    // todo ******************************** DSL syntax *******************************
    personDF.show()
    personDF.select($"name", $"age" + 1).show()
    personDF.printSchema()
    personDF.groupBy("age").count().show()
    // todo ******************************** SQL syntax *******************************
    // todo: register the DataFrame as a table/view
    personDF.registerTempTable("person") // deprecated; prefer createOrReplaceTempView
    personDF.createTempView("person2")
    personDF.createOrReplaceTempView("person3")
    sparkSession.sql("select * from person").show()
    sparkSession.close()
    sparkSession.sparkContext.stop()
  }
}
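With the DataFrame registered as a view, ordinary SQL can be run against it; a minimal sketch querying the views registered above (column names come from the Person case class):
sparkSession.sql("select name, age from person2 where age > 25").show()
sparkSession.sql("select age, count(*) as cnt from person3 group by age").show()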
Create a DataFrame from Row objects and a StructType
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
import org.apache.spark.sql.{Row, SparkSession}
object sparkDF2 {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder().master("local[2]").appName("sparkDF").getOrCreate()
    val sparkContext = sparkSession.sparkContext
    val file: RDD[String] = sparkContext.textFile("file:///C:\\Users\\Administrator\\Documents\\tt")
    sparkContext.setLogLevel("WARN")
    val arrayFile = file.map(_.split(" "))
    // Wrap each line's fields in a Row object
    val rowRDD: RDD[Row] = arrayFile.map(x => Row(x(0).toInt, x(1), x(2).toInt))
    // Describe the schema explicitly with a StructType
    val structType = new StructType().add("id", IntegerType).add("name", StringType).add("age", IntegerType)
    val dataFrame = sparkSession.createDataFrame(rowRDD, structType)
    dataFrame.show()
    sparkContext.stop()
    sparkSession.close()
  }
}
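The same schema can also be built from explicit StructField objects, which makes nullability visible; a minimal sketch reusing rowRDD and sparkSession from the program above (the nullable flags are assumptions):
import org.apache.spark.sql.types.StructField
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))
val dataFrame2 = sparkSession.createDataFrame(rowRDD, schema)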