Convert files to a DataFrame (DF)

1. Convert an RDD to a DF using a case class

Define the case class

scala> case class Person(ID:Int,Name:String,Age:Int)
defined class Person

Define RDD

scala> val lineRDD = sc.textFile("file:///export/person.txt").map(_.split(" "))
lineRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:24

Associate RDD with case class

scala> val PersonRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))
PersonRDD: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[7] at map at <console>:28

Convert RDD to DF

scala> val PersonDF = PersonRDD.toDF
PersonDF: org.apache.spark.sql.DataFrame = [ID: int, Name: string ... 1 more field]

Show DF

scala> PersonDF.show
+---+--------+---+
| ID|    Name|Age|
+---+--------+---+
|  1|zhangsan| 20|
|  2|    lisi| 29|
|  3|  wangwu| 25|
|  4| zhaoliu| 30|
|  5|  tianqi| 35|
|  6|    kobe| 40|
+---+--------+---+
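
As an aside, if no case class is at hand, a lighter-weight variant (a sketch, assuming the same person.txt layout) is to map each line to a tuple and pass the column names to toDF; without explicit names Spark would fall back to _1, _2, _3 as the column names:

scala> val tupleDF = lineRDD.map(x => (x(0).toInt, x(1), x(2).toInt)).toDF("ID", "Name", "Age")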

2. Read a JSON file to create a DataFrame

Spark parses the JSON data directly, inferring the schema, to create the DF:

val jsonDF = spark.read.json("file:///export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/examples/src/main/resources/people.json")
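
To verify what was loaded (assuming the people.json file bundled with the Spark examples), printSchema shows the column types inferred from the JSON records and show displays the rows:

jsonDF.printSchema()
jsonDF.show()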

3. Read a Parquet (columnar storage format) file to create a DataFrame

Spark directly parses data in Parquet format to create the DF:

val parquetDF = spark.read.parquet("file:///export/servers/spark-2.2.0-bin-2.6.0-cdh5.14.0/examples/src/main/resources/users.parquet")
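
As a quick check (assuming the users.parquet file shipped with the Spark examples, which contains name, favorite_color and favorite_numbers columns), the DataFrame can be inspected and a projection written back out in Parquet format; the output path below is only an illustration:

parquetDF.printSchema()
parquetDF.select("name", "favorite_color").write.parquet("file:///export/namesAndColors.parquet")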

Implementing the above in a standalone Scala application:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

case class Person(id: Int, name: String, age: Int)

object sparkDF {
  def main(args: Array[String]): Unit = {

    val sparkSession = SparkSession.builder().appName("sparkSQL").master("local[2]").getOrCreate()
    sparkSession.sparkContext.setLogLevel("WARN")

    // Read the text file and build an RDD of Person objects
    val lines: RDD[String] = sparkSession.sparkContext.textFile("file:///C:\\Users\\Administrator\\Documents\\tt")
    val fields: RDD[Array[String]] = lines.map(_.split(" "))
    val personRDD: RDD[Person] = fields.map(x => Person(x(0).toInt, x(1), x(2).toInt))

    // todo: the toDF method requires importing the implicits of this SparkSession
    import sparkSession.implicits._
    val personDF: DataFrame = personRDD.toDF()

    // todo ******************************** DSL syntax *******************************
    personDF.show()
    personDF.select($"name", $"age" + 1).show()
    personDF.printSchema()
    personDF.groupBy("age").count().show()

    // todo ******************************** SQL syntax *******************************
    // todo: register the DataFrame as a temporary table / view
    personDF.registerTempTable("person")          // deprecated since Spark 2.0; kept for reference
    personDF.createTempView("person2")
    personDF.createOrReplaceTempView("person3")
    sparkSession.sql("select * from person").show()

    sparkSession.sparkContext.stop()
    sparkSession.close()
  }
}
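
A note on the three registration calls in the SQL section: registerTempTable is deprecated since Spark 2.0, and createTempView throws an error if a view with that name already exists, so createOrReplaceTempView is usually the safest choice. All three produce temporary views scoped to the SparkSession that created them.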

Create the DF from Row objects and a StructType schema

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object sparkDF2 {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder().master("local[2]").appName("sparkDF").getOrCreate()
    val sparkContext = sparkSession.sparkContext
    sparkContext.setLogLevel("WARN")

    // Read the text file and turn each line into a Row of (id, name, age)
    val file: RDD[String] = sparkContext.textFile("file:///C:\\Users\\Administrator\\Documents\\tt")
    val arrayFile = file.map(_.split(" "))
    val row: RDD[Row] = arrayFile.map(x => Row(x(0).toInt, x(1), x(2).toInt))

    // Describe the schema explicitly with a StructType
    val structType = new StructType().add("id", IntegerType).add("name", StringType).add("age", IntegerType)

    // Combine the RDD[Row] with the schema to create the DataFrame
    val dataFrame = sparkSession.createDataFrame(row, structType)

    dataFrame.show()
    sparkContext.stop()
    sparkSession.close()
  }
}
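
A design note on this second approach: because the schema is just an object, it can be assembled at runtime, which is the usual reason to prefer Row plus StructType over case classes when the columns are not known in advance. A minimal sketch (reusing row and sparkSession from the program above; the column list here is only illustrative):

    // Sketch: build the schema from a runtime list of (name, type) pairs
    val columns = Seq(("id", IntegerType), ("name", StringType), ("age", IntegerType))
    val dynamicSchema = columns.foldLeft(new StructType())((schema, col) => schema.add(col._1, col._2))
    val dynamicDF = sparkSession.createDataFrame(row, dynamicSchema)
    dynamicDF.printSchema()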

Source: https://blog.csdn.net/weixin_44429965/article/details/107397443