SparkSQL-02: Two Ways to Convert an RDD to a DataFrame

Quoting the official documentation:

Interoperating with RDDs

Spark SQL supports two different methods for converting existing RDDs into Datasets.

The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

The second method for creating Datasets is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct Datasets when the columns and their types are not known until runtime.

RDD <==> DataFrame (two different methods)

1. Reflection. In Java, reflection is commonly used through the Method class and object.invoke(...); Spark SQL applies the same idea to infer a schema from the fields of a case class.

        infer schema: Spark derives column names and types by reflecting over the case class

        RDD: you must already know its schema when writing the code

        case class: defines that schema

Note: when converting between an RDD and a DataFrame via reflection, you must know the schema in advance.
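As a quick check, the schema Spark infers from a case class can be inspected via org.apache.spark.sql.Encoders. This is a minimal sketch, separate from the full example below:

import org.apache.spark.sql.Encoders

case class Person(name: String, age: Int)

// Spark reflects over the case class fields to derive column names and types
val inferred = Encoders.product[Person].schema
inferred.printTreeString()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)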

import org.apache.spark.sql.SparkSession

/**
  * RDD => DataFrame, method 1: via reflection
  * Test data:
     Andy , 12
     Tom  , 24
          , 12
     Ray  , 32
  */
object SparkSessionT {
  def main(args: Array[String]): Unit = {
    // Initialize the SparkSession
    val spark = SparkSession
      .builder()
      .appName("SparkSessionT")
      .master("local[2]")
      .getOrCreate()
    // Read the file into an RDD and map each line to a Person
    // (trim the name too -- the " , " separator leaves padding around it)
    val people = spark.sparkContext.textFile("file:///d:/people.txt")
                  .map(x => x.split(","))
                  .map(x => Person(x(0).trim, x(1).trim.toInt))
                  //.filter(x => x.age > 30)
                  //.filter(x => x.name.equals(""))
    // Convert to a DataFrame (toDF comes from spark.implicits._)
    import spark.implicits._
    val personDF = people.toDF()
    // Register a temporary view so the data can be queried with SQL
    personDF.createOrReplaceTempView("Person")
    // Show the data
    personDF.show()
    // Print the schema
    personDF.printSchema()
    // Stop the SparkSession
    spark.stop()
  }
  // The case class from which the schema is inferred
  case class Person(name: String, age: Int)
}
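The example registers the temporary view "Person" but never queries it. As a minimal sketch, a SQL query against that view (and the reverse DataFrame => RDD conversion) could be added before spark.stop():

// Query the temporary view with SQL
spark.sql("SELECT name, age FROM Person WHERE age > 20").show()

// The reverse direction: DataFrame => RDD[Row]
val backToRdd = personDF.rdd
backToRdd.take(2).foreach(println)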

2. Programmatic interface: construct a schema programmatically and apply it to an existing RDD

import org.apache.spark.sql._
import org.apache.spark.sql.types.{StringType, StructField, StructType}

/**
  * RDD => DataFrame, method 2: programmatically specifying the schema
  * Test data:
     Andy , 12
     Tom  , 24
          , 12
     Ray  , 32
  */
object SparkSessionPS {
  def main(args: Array[String]): Unit = {
    // Initialize the SparkSession
    val spark = SparkSession
      .builder()
      .appName("SparkSessionPS")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Step 1: create an RDD of Rows
    val peopleRDD = spark.sparkContext.textFile("file:///c:/people.txt")
      .map(x => x.split(","))
      .map(attributes => Row(attributes(0), attributes(1).trim))

    // Step 2: define the schema
    val schemaString = "name,age"
    val fields = schemaString.split(",")
      .map(fieldName =>
        StructField(fieldName, StringType, nullable = true))
    val schema = StructType(fields)

    // Step 3: apply the schema to the RDD[Row] to create the DataFrame
    val peopleDF = spark.createDataFrame(peopleRDD, schema)
    // Step 4: register a temporary view
    peopleDF.createOrReplaceTempView("people")
    // Step 5: query the data with SQL
    val results = spark.sql("SELECT name,age FROM people")
    results.show()
    //results.map(attributes => attributes.getAs[String]("name")).show()
    // Stop the SparkSession
    spark.stop()
  }
}
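Since the programmatic schema declares every column as StringType, age comes back as a string. A minimal sketch of casting it to an integer afterwards, assuming the peopleDF built above:

import org.apache.spark.sql.types.IntegerType

// Cast the string-typed age column to Int after the fact
val typedDF = peopleDF.withColumn("age", peopleDF("age").cast(IntegerType))
typedDF.printSchema()  // age is now integer (nullable = true)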

Usage in real-world work

import org.apache.spark.sql.SparkSession

/**
  * Test data:
id|name|phone|email
1|Burke|1-300-746-8446|[email protected]
2|Kamal|1-668-571-5046|[email protected]
3|Olga|1-956-311-1686|[email protected]
4|Belle|l-246-894-6340|[email protected]
5|Trevor|1-300-527-4967|[email protected]
6|Laurel|1-691-379-9921|[email protected]
7|Sara|1-608-140-1995|[email protected]
8|Kaseem|1-881-586-2689|[email protected]
9|Lev|1-916-367-5608|[email protected]
10|Maya|1-271-683-2698|accumsan,[email protected]
11|Emi|l-467-270-1337|[email protected]
12|Caleb|1-68B-212-0896|[email protected]
13|Florence|1-603-575-2444|[email protected]
14|Anika|1-856-828-7883|[email protected]
15|Tarik|l-398-171-2268|[email protected]
16|Amena|1-878-250-3129|[email protected]
17|Blossom|1-154-406-9596|[email protected]
18|Guy|1-869-521-32BO|[email protected]
19|Malachi|1-608-637-2772|[email protected]
20|Edward|1-711-710-6552|[email protected]
21||1-711-710-6552|[email protected]
22||1-711-710-6552|[email protected]
23|NULL|1-711-710-6552|[email protected]
  */
object SparkSessionEXP {
  def main(args: Array[String]): Unit = {
    // Initialize the SparkSession
    val spark = SparkSession
      .builder()
      .appName("SparkSessionEXP")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._
    // Read the file once and strip the header line
    // (the original read the file again for each DataFrame, so the header was never removed)
    val studentRDD = spark.sparkContext.textFile("file:///c:/student.data")
    val head = studentRDD.first()
    val dataRDD = studentRDD.filter(row => row != head)

    // Build two identical student DataFrames for the join test
    val studentDF1 = dataRDD
      .map(x => x.split("\\|"))
      .map(x => Student(x(0), x(1), x(2), x(3)))
      .toDF

    val studentDF2 = dataRDD
      .map(x => x.split("\\|"))
      .map(x => Student(x(0), x(1), x(2), x(3)))
      .toDF
      //.filter("name = '' or name = 'NULL'")
      /** To find names starting with 'M': where does substr come from?
        * In IntelliJ, double-press Shift, search for "functions", and open
        * functions.scala -- it contains the full set of SQL functions.
        * Alt+7 lists all methods in the file.
        */
      //.filter("substr(name,0,1) = 'M'")
    // Register two temporary student views
    studentDF1.createOrReplaceTempView("student1")
    studentDF2.createOrReplaceTempView("student2")
    // Some common simple operations
    //spark.sql("select * from student1").show(23, false)
    //studentDF1.filter("name = ''").show()
    //studentDF1.select(studentDF1.col("name"), studentDF1.col("id").as("mid")).show(30, false)
    //studentDF1.sort(studentDF1.col("name"), studentDF1.col("id")).show(30, false)
    //spark.sql("select * from student1").show(30, false)

    // Join test: join on id. Note: use triple equals (===); the default join type is inner.
    studentDF1.join(studentDF2, studentDF1.col("id") === studentDF2.col("id")).show(500, false)

    // Stop the SparkSession
    spark.stop()
  }
  case class Student(id: String, name: String, phone: String, email: String)
}
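The two temporary views student1 and student2 are registered but only queried in commented-out code. As a minimal sketch, the same inner join can be expressed in SQL against those views:

// Equivalent join written in SQL against the registered views
spark.sql(
  """SELECT s1.id, s1.name, s2.email
    |FROM student1 s1
    |JOIN student2 s2 ON s1.id = s2.id""".stripMargin).show(500, false)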


Reposted from blog.csdn.net/qq_15300683/article/details/80380251