Spark SQL: Converting an RDD to a Dataset by Programmatically Specifying the Schema

1. Explanation

Official docs: https://spark.apache.org/docs/latest/sql-getting-started.html
This scenario comes up all the time in practice.
When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps.

1. Create an RDD of Rows from the original RDD;
2. Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession.
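
For the "structure of records is encoded in a string" case mentioned above, the official guide derives the StructType from a schema string. A minimal sketch along those lines, reusing this article's field names (for simplicity, every field is read as a plain string here):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Suppose the schema arrives as a plain string, e.g. read from a config file
val schemaString = "ip time responseSize"

// Turn each field name into a nullable StringType column
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schemaFromString = StructType(fields)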

2. Implementation

package g5.learning

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object DFApp {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder()
      .appName("DFApp")
      .master("local[2]")
      .getOrCreate()

    //inferReflection(sparkSession)

    programmatically(sparkSession)
    sparkSession.stop()
  }

  // Method 2: specify the schema programmatically
  def programmatically(sparkSession: SparkSession) = {
    val info = sparkSession.sparkContext.textFile("file:///E:\\data.txt")

    // Step 1: create an RDD of Rows from the original RDD
    val rdd = info.map(_.split("\t")).map(x => Row(x(0), x(1), x(2).toLong))

    // Step 2: create the schema, represented by a StructType
    val struct = StructType(Array(
      StructField("ip", StringType, true),
      StructField("time", StringType, false),
      StructField("responseSize", LongType, false)
    ))

    // Step 3: apply the schema to the RDD of Rows via createDataFrame
    val df = sparkSession.createDataFrame(rdd, struct)
    df.show()
  }
}
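
For reference, programmatically() expects each line of data.txt to hold three tab-separated fields. With a made-up input line such as the following (the values are purely illustrative, not the original post's data), df.show() would print something like:

data.txt (tab-separated):
192.168.1.1	2018-12-25 10:00:01	1024

df.show():
+-----------+-------------------+------------+
|         ip|               time|responseSize|
+-----------+-------------------+------------+
|192.168.1.1|2018-12-25 10:00:01|        1024|
+-----------+-------------------+------------+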

3. Notes

1. This programmatic approach is more flexible than the reflection-based example that appears first in the official docs, and it fits most record layouts.
2. Pay attention to the operator chain used for the conversion (when I started out, picking operators caused plenty of trouble; there are many of them, so choose the ones that match your needs). A defensive sketch follows below.
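
As an example of operator choice for note 2, here is a defensive variant (hypothetical, not from the original post) that drops malformed lines before the toLong conversion, so a bad record does not crash the job:

// Keep only lines with exactly three fields and a numeric third field
val rddSafe = info.map(_.split("\t"))
  .filter(fields => fields.length == 3 && fields(2).forall(_.isDigit))
  .map(x => Row(x(0), x(1), x(2).toLong))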

4. Result

(The original post showed a screenshot of the df.show() output here.)


Reposted from blog.csdn.net/qq_43688472/article/details/85463465