Converting Files to DataFrames

1. Converting an RDD to a DataFrame with a case class
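
The input is a whitespace-delimited text file, person.txt, with one record per line: an id, a name, and an age. A minimal sample that would match the output shown further below (the exact file contents are an assumption):

1 zhangsan 20
2 lisi 29
3 wangwu 25
4 zhaoliu 30
5 tianqi 35
6 kobe 40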

Define a case class

scala> case class Person(ID:Int,Name:String,Age:Int)
defined class Person

Define an RDD

scala> val lineRDD = sc.textFile("file:///export/person.txt").map(_.split(" "))
lineRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:24

Map the RDD to the case class

scala> val PersonRDD = lineRDD.map(x => Person(x(0).toInt,x(1),x(2).toInt))
PersonRDD: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[7] at map at <console>:28

Convert the RDD to a DataFrame

scala> val PersonDF = PersonRDD.toDF
PersonDF: org.apache.spark.sql.DataFrame = [ID: int, Name: string ... 1 more field]

Display the DataFrame

scala> PersonDF.show
+---+--------+---+
| ID|    Name|Age|
+---+--------+---+
|  1|zhangsan| 20|
|  2|    lisi| 29|
|  3|  wangwu| 25|
|  4| zhaoliu| 30|
|  5|  tianqi| 35|
|  6|    kobe| 40|
+---+--------+---+

2. Creating a DataFrame from a JSON file

Spark parses the JSON data and infers the schema to create the DataFrame:

val jsonDF = spark.read.json("file:///export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/examples/src/main/resources/people.json")

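As a quick check, the inferred schema and the contents can be inspected with the standard printSchema and show methods (these calls are not part of the original example):

jsonDF.printSchema()
jsonDF.show()
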
3. Creating a DataFrame from a Parquet (columnar) file

Spark parses the Parquet-format data directly to create the DataFrame:

val parquetDF = spark.read.parquet("file:///export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/examples/src/main/resources/users.parquet")

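Parquet files are self-describing (the schema is stored in the file), so the result can be inspected the same way; the select call below assumes the file has a name column, as in Spark's bundled users.parquet example:

parquetDF.printSchema()
parquetDF.select("name").show()
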
Implementing the above in a standalone program (Scala API)

import org.apache.spark.rdd.RDD
import org.apache.spark.sql
import org.apache.spark.sql.SparkSession

// the case class is defined at the top level (outside main) so that Spark can derive an Encoder for toDF
case class person(id: Int, name: String, age: Int)

object sparkDF {
  def main(args: Array[String]): Unit = {

    val sparkSession = SparkSession.builder().appName("sparkSQL").master("local[2]").getOrCreate()
    sparkSession.sparkContext.setLogLevel("WARN")
    //todo  read the text file and map each line to a person object
    val lines: RDD[String] = sparkSession.sparkContext.textFile("file:///C:\\Users\\Administrator\\Documents\\tt")
    val fields: RDD[Array[String]] = lines.map(_.split(" "))
    val personRDD: RDD[person] = fields.map(x => person(x(0).toInt, x(1), x(2).toInt))
    //todo  the toDF method requires importing the SparkSession implicits
    import sparkSession.implicits._
    val personDF = personRDD.toDF()
    //todo  ********************************DSL syntax*******************************
    personDF.show()
    personDF.select($"name", $"age" + 1).show()
    personDF.printSchema()
    personDF.groupBy("age").count().show()
    //todo  ********************************SQL syntax*******************************
    //todo  register the DataFrame as a temporary table/view
    personDF.registerTempTable("person")        // deprecated since Spark 2.0; prefer createOrReplaceTempView
    personDF.createTempView("person2")          // throws if a view with this name already exists
    personDF.createOrReplaceTempView("person3") // replaces an existing view with the same name
    sparkSession.sql("select * from person").show()
    sparkSession.close()
    sparkSession.sparkContext.stop()
  }
}

Creating a DataFrame with Row objects and a StructType schema

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object sparkDF2 {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder().master("local[2]").appName("sparkDF").getOrCreate()
    val sparkContext = sparkSession.sparkContext
    val file: RDD[String] = sparkContext.textFile("file:///C:\\Users\\Administrator\\Documents\\tt")
    sparkContext.setLogLevel("WARN")
    val arrayFile = file.map(_.split(" "))
    // build an RDD[Row] whose field order matches the schema defined below
    val row: RDD[Row] = arrayFile.map(x => Row(x(0).toInt, x(1), x(2).toInt))
    // define the schema explicitly: column names and types
    val structType = new StructType().add("id", IntegerType).add("name", StringType).add("age", IntegerType)
    val dataFrame = sparkSession.createDataFrame(row, structType)

    dataFrame.show()
    sparkContext.stop()
    sparkSession.close()
  }
}


Reposted from blog.csdn.net/weixin_44429965/article/details/107397443