Spark SQL: DataFrame, DataSet and RDD

  • Contents
    • Spark SQL
    • Spark SQL compared with Hive SQL
    • DataFrame
    • DataSet
    • RDD
    • Conversion between DataFrame, DataSet and RDD
    • The relationship between DataFrame, DataSet and RDD
    • Commonalities and differences between DataFrame, DataSet and RDD

1. Spark SQL

Spark SQL is a Spark module for processing structured data. It provides two programming abstractions, DataFrame and DataSet, and acts as a distributed SQL query engine.
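As a quick, hedged illustration of this (a minimal local sketch; the file input/user.json and its columns are assumptions for the example), a DataFrame can be registered as a temporary view and queried with ordinary SQL:

import org.apache.spark.sql.SparkSession

// Minimal local sketch: use Spark SQL as a query engine over a JSON file
// (input/user.json and its columns name/age are assumed for illustration)
val spark = SparkSession.builder().appName("SparkSQLDemo").master("local[*]").getOrCreate()

val userDF = spark.read.json("input/user.json")   // the schema is inferred from the JSON
userDF.createOrReplaceTempView("user")            // register the DataFrame as a temporary view
spark.sql("select name, age from user where age > 18").show()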

2. Spark SQL compared with Hive SQL

Hive SQL is converted into MapReduce jobs and then submitted to the cluster for execution, which greatly reduces the complexity of writing MapReduce programs, but the MapReduce computation model executes relatively slowly.

Spark SQL is converted into RDD operations and then submitted to the cluster for execution, so it runs much faster.
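One way to peek at what Spark SQL actually executes (a sketch reusing the temporary view registered in section 1) is to call explain() on a query, which prints the physical plan that will run on the cluster:

// Print the physical plan Spark SQL generates for this query
// (reuses the "user" view from the sketch in section 1)
spark.sql("select name from user where age > 18").explain()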

3. DataFrame

A DataFrame is a distributed container of data. In addition to the data itself, it also records structural information: a DataFrame provides a schema view of the data. We can treat it like a table in a database. A DataFrame is lazily executed and performs better than an RDD.
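For example (a small sketch reusing the spark session and the assumed input/user.json from section 1), the DataFrame carries its schema and can be manipulated much like a database table, and nothing is actually computed until an action such as show() is called:

// Reusing the spark session and input/user.json from the sketch in section 1
val df = spark.read.json("input/user.json")
df.printSchema()                                 // the structural (schema) information recorded with the data
df.select("name", "age").filter("age > 18")      // lazily builds a query plan; nothing runs yet
  .show()                                        // the action triggers execution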

4. DataSet

A Dataset is a strongly typed collection of data; the corresponding type information has to be provided.
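A small sketch of this (the User case class and its values are assumptions for illustration): because the element type is known, field access is checked by the compiler instead of being looked up by column name at runtime:

// Strongly typed: the element type of the Dataset is the case class User
case class User(name: String, age: Long)

import spark.implicits._
val userDS = Seq(User("zhangsan", 30), User("lisi", 17)).toDS()
userDS.filter(u => u.age > 18)    // u.age is checked at compile time
      .map(_.name)
      .show()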

5. RDD

RDD (Resilient Distributed Dataset) is Spark's most basic data abstraction. In code it is an abstract class that represents an immutable, partitionable collection of elements that can be computed in parallel.
RDDs will be covered separately later.
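Still, as a quick sketch of the RDD API (the values are made up for illustration): an RDD is built by parallelizing a collection or reading a file, transformed with functional operators, and only computed when an action is called:

// Reusing the spark session from the sketch in section 1
val sc = spark.sparkContext
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5), 2)   // an RDD with 2 partitions
val result = numbers.map(_ * 2).filter(_ > 4)         // transformations, evaluated lazily
println(result.collect().mkString(", "))              // the action triggers parallel computation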

6. Conversion between DataFrame, DataSet and RDD

  • RDD to DataSet
    Spark SQL can automatically convert an RDD that contains case classes into a DataFrame/Dataset: the case class defines the structure of the table, and its attributes become the table's column names through reflection.
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Case class describing the table structure
case class Person(name: String, age: BigInt)

// Create the configuration object
val sparkConf: SparkConf = new SparkConf().setAppName("SparkSQL").setMaster("local[*]")
// Create the session (context) object
val sparkSession: SparkSession = SparkSession
      .builder()
      .config(sparkConf)
      .getOrCreate()

// Bring the implicit conversion rules into scope
import sparkSession.implicits._

val rdd: RDD[(String, Int)] = sparkSession.sparkContext.makeRDD(Array(("zhaoliu", 20)))

// Convert the RDD into a Dataset (the Int age is implicitly widened to BigInt)
val person: Dataset[Person] = rdd.map(x => Person(x._1, x._2)).toDS()
  • DataSet to RDD
// Case class
case class Person(name: String, age: BigInt)

// Create a Dataset
val dataSet: Dataset[Person] = Seq(Person("zhansan", 32)).toDS()

// Convert the Dataset into an RDD
dataSet.rdd
  • DataFrame to DataSet
// Case class
case class Person(name: String, age: BigInt)

// Read the data and convert it into a DataFrame
val dataFrame: DataFrame = sparkSession.read.json("input/user.json")

// Convert the DataFrame into a Dataset
val dataSet: Dataset[Person] = dataFrame.as[Person]
  • DataSet to DataFrame
// Case class
case class Person(name: String, age: BigInt)

// Read the data and convert it into a DataFrame
val dataFrame: DataFrame = sparkSession.read.json("input/user.json")

// Convert the DataFrame into a Dataset
val dataSet: Dataset[Person] = dataFrame.as[Person]

// Convert the Dataset back into a DataFrame
val df: DataFrame = dataSet.toDF()

7. The relationship between DataFrame, DataSet and RDD

In Spark SQL, Spark offers us two new abstractions: DataFrame and DataSet.
How do they differ from RDD? First, looking at the version in which each was introduced:
RDD (Spark 1.0) -> DataFrame (Spark 1.3) -> Dataset (Spark 1.6)
Given the same data, all three data structures will produce the same result when computed. What differs is their efficiency and the way they execute.
In later versions of Spark, Dataset will gradually replace RDD and DataFrame to become the only API interface.

8. Commonalities and differences between DataFrame, DataSet and RDD

  • Commonalities
    • RDD, DataFrame and Dataset are all resilient distributed datasets on the Spark platform, designed to make processing very large data convenient.
    • All three are lazily evaluated: creation and transformations such as map are not executed immediately; only when an action such as foreach is encountered do they start traversing the data (a small demonstration follows the pattern-matching example below).
    • All three automatically cache results in memory according to Spark's memory management, so even with a large amount of data there is no need to worry about running out of memory.
    • All three have the concept of partitions.
    • All three share many common functions, such as filter and sort.
    • Many operations on DataFrame and Dataset require this import:
      import spark.implicits._
    • Both DataFrame and Dataset can use pattern matching to obtain the value and type of each field, for example:
import org.apache.spark.sql.Row

// Pattern matching on a DataFrame: each row is an untyped Row
dataFrame.map {
  case Row(col1: String, col2: Int) =>
    println(col1); println(col2)
    col1
  case _ =>
    ""
}

// Case class defining the field names and types
case class Coltest(col1: String, col2: Int) extends Serializable

// Pattern matching on a Dataset of Coltest
dataSet.map {
  case Coltest(col1: String, col2: Int) =>
    println(col1); println(col2)
    col1
  case _ =>
    ""
}
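To make the laziness point above concrete, here is a small sketch (the values are made up for illustration) showing that the function passed to map does not run until an action such as foreach is called:

// Lazy evaluation sketch, reusing the Coltest case class defined above
val lazyDS = Seq(Coltest("a", 1), Coltest("b", 2)).toDS()
// The println inside map is NOT executed here; map only records the transformation
val mapped = lazyDS.map { c => println(s"processing ${c.col1}"); c.col2 }
// The action triggers the whole chain, and only now does the map run
mapped.foreach(v => println(v))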

  • Differences
  1. RDD:
    1) RDD is generally used together with Spark MLlib, as in the sketch below.
    2) RDD does not support Spark SQL operations.
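A hedged sketch of point 1) above (reusing sc from the RDD sketch in section 5; the feature values are made up): the classic RDD-based MLlib API consumes RDDs directly:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Build an RDD of labeled feature vectors, the input format of RDD-based MLlib
val trainingData = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.5, 1.2)),
  LabeledPoint(0.0, Vectors.dense(1.5, 0.3))
))
// trainingData can now be passed to RDD-based algorithms such as
// org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS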

  2. DataFrame:
    1) Unlike RDD and Dataset, the type of every row of a DataFrame is fixed as Row. The value of each column cannot be accessed directly; it can only be obtained by parsing out each field, for example:

// Each row of a DataFrame is a Row; fields are looked up by name
dataFrame.foreach {
  line =>
    val col1 = line.getAs[String]("col1")
    val col2 = line.getAs[Int]("col2")
}

2) DataFrame and Dataset are generally not used together with Spark MLlib.
3) DataFrame and Dataset support Spark SQL operations such as select and groupBy, and they can also be registered as temporary tables/views and queried with SQL statements, for example:

// Register a temporary view and query it with a SQL statement
dataFrame.createOrReplaceTempView("tmp")
spark.sql("select ROW, DATE from tmp where DATE is not null order by DATE").show

4) DataFrame and Dataset support some particularly convenient ways of saving data, such as saving to CSV with a header, so that the field name of each column is clear at a glance, for example:

import org.apache.spark.sql.SaveMode

// Save: the CSV data source writes the header so each column's field name is preserved
val saveOptions = Map("header" -> "true", "delimiter" -> "\t", "path" -> "hdfs://hadoop102:9000/test")
dataFrame.write.format("csv").mode(SaveMode.Overwrite).options(saveOptions).save()

// Read the data back with the same options
val options = Map("header" -> "true", "delimiter" -> "\t", "path" -> "hdfs://hadoop102:9000/test")
val dataFrame = spark.read.options(options).format("csv").load()

Saving data this way makes it easy to relate each value to its corresponding column's field name, and the delimiter can be specified freely.
  3. Dataset:
    1) Dataset and DataFrame have exactly the same member functions; the only difference is the type of each row.
    2) A DataFrame can also be written as Dataset[Row]: the type of every row is Row and it is not parsed, so there is no way to know which fields each row has or what type each field is; a specific field can only be extracted with the getAs method or with the pattern matching mentioned in the commonalities above. In a Dataset, the type of each row is not fixed; once the corresponding case class is defined, the information in every row can be accessed freely, for example:

// Case class defining the field names and types
case class Coltest(col1: String, col2: Int) extends Serializable

/**
 * Assume rdd contains the pairs:
 * ("a", 1)
 * ("b", 1)
 * ("a", 1)
 */
val dataSet: Dataset[Coltest] = rdd.map { line =>
  Coltest(line._1, line._2)
}.toDS

// Each row is a Coltest, so its fields can be accessed directly by name
dataSet.foreach { line =>
  println(line.col1)
  println(line.col2)
}

As you can see, a Dataset is very convenient when you need to access a specific field of a row. However, when writing highly generic functions that must adapt to any schema, a Dataset does not fit well, because the row type is unknown and could be any case class; in that situation a DataFrame, that is Dataset[Row], solves the problem better.
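To make that last point concrete, here is a hedged sketch (the helper name and the column "col1" are assumptions): a function written against DataFrame, i.e. Dataset[Row], does not need to know which case class produced the data, because fields are looked up by name at runtime:

import org.apache.spark.sql.{DataFrame, Row}

// Generic helper: works for any DataFrame that has the requested column,
// regardless of which case class (if any) the data originally came from.
def columnValues(df: DataFrame, columnName: String): Array[Any] =
  df.select(columnName).collect().map((row: Row) => row.get(0))

// Usage sketch:
// columnValues(dataFrame, "col1")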


Origin blog.csdn.net/fu890310/article/details/104573294