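Converting Between RDD and Dataset in Spark

What is an RDD: The main abstraction Spark provides is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster so that it can be operated on in parallel. An RDD is created by the driver program from a file in the Hadoop file system (or any other Hadoop-supported file system), or from an existing Scala collection in the driver program, and is then transformed. Users can also ask Spark to persist an RDD in memory so that it can be reused efficiently across parallel operations.
Finally, RDDs recover automatically from node failures.

Quoted from the official documentation: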

At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.

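RDD operation types (the following is my own translation from the official documentation)

RDDs support two types of operations: transformations and actions.
A transformation creates a new dataset from an existing one.
An action runs a computation on the dataset and returns a result to the driver program.
For example, map is a transformation: it passes each element of the dataset through a function and returns a new RDD representing the results.

reduce, on the other hand, is an action: it aggregates all the elements of the RDD using some function and sends the final result to the driver program.
There is also a parallel variant, reduceByKey, which returns a distributed dataset rather than a single value.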
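The difference between the two operation types can be sketched in a few lines. This is a minimal illustration, assuming a spark-shell session where `sc` is already defined; the sample data and the names `nums`, `doubled`, `pairs` and `sums` are made up for the example.

```scala
// Assumes spark-shell, where `sc` (SparkContext) is predefined.
// All data below is illustrative only.
val nums = sc.parallelize(Seq(1, 2, 3, 4))

// Transformation: describes a new RDD; nothing is computed yet.
val doubled = nums.map(_ * 2)

// Action: triggers the computation and returns a value to the driver.
val total = doubled.reduce(_ + _)   // 2 + 4 + 6 + 8 = 20

// reduceByKey is a transformation: its result is still an RDD.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val sums = pairs.reduceByKey(_ + _)

// collect() is the action that brings the per-key sums to the driver.
val result = sums.collect().toMap   // Map(a -> 4, b -> 2)
```

Note that `reduce` returns a plain Int, while `reduceByKey` returns another RDD that still needs an action such as `collect` before the driver sees any data.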

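To summarize:

A transformation converts a dataset from one structure into another. The source RDD is not modified in the process; the transformation returns a new RDD.

An action triggers the computation on an RDD and does something with the result: it either returns it to the user or saves it to external storage.

Some properties of RDDs:

By default, each transformed RDD may be recomputed every time you run an action on it.
You can also keep an RDD in memory by calling the persist (or cache) method, in which case Spark keeps the data on the cluster
so that the next query on it returns much faster.
Persisting RDDs to disk, or replicating them across multiple nodes, is also supported.

Calling persist before running the computation stores the RDD in the location you specify; in other words, it sets the RDD's storage level.

RDD storage levels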
Storage Level	Meaning
MEMORY_ONLY	Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK	Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER (Java and Scala)	Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER (Java and Scala)	Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY	Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.	Same as the levels above, but replicate each partition on two cluster nodes.
OFF_HEAP (experimental)	Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to be enabled.
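As a rough sketch of how a storage level from the table is applied, the snippet below persists a small RDD and reuses it across two actions. It assumes a spark-shell session with `sc` available; the data and the name `words` are invented for illustration.

```scala
import org.apache.spark.storage.StorageLevel

// Assumes spark-shell, where `sc` is predefined. Sample data is illustrative.
val words = sc.parallelize(Seq("spark", "hive", "spark")).map(w => (w, 1))

// Set the storage level before the first action runs.
// For RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
words.persist(StorageLevel.MEMORY_AND_DISK)

// The first action computes the RDD and stores its partitions.
val n = words.count()

// Later actions can reuse the stored partitions instead of recomputing them.
val counts = words.reduceByKey(_ + _).collect().toMap

// Release the cached data once it is no longer needed.
words.unpersist()
```

The storage level of an RDD cannot be changed once assigned, so `unpersist` has to be called first if a different level is wanted.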

Prerequisite: the following steps are run inside spark-shell.

>>> Create two RDDs from external data sources

scala> val rdd1= sc.textFile("file:///opt/datas/stu.txt")
scala> val rdd2 = sc.textFile("file:///opt/datas/stu1.txt")

>>> Merge the two RDDs into a single RDD; this is a transformation

scala> val rdd =rdd1.union(rdd2)

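>>> Convert the RDD to a DataFrame

filter(x => (x._2 > 1)) keeps only the entries whose value is greater than 1. In Scala, for a pair x, x._1 is the first element (here the key) and x._2 is the second (here the value).

sortByKey sorts by key: passing true sorts in ascending order, false in descending order.

In this RDD the key is a String and the value is an Int. To sort by value, we therefore swap key and value, call sortByKey, and then swap them back.

scala> val lines = rdd.flatMap(x =>x.split(" ")).map(x =>(x,1)).reduceByKey((a,b) =>(a+b)).filter(x => (x._2 > 1)).map(x =>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).toDF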


lines: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
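The swap, sortByKey, swap-back pattern works, but the same ordering can also be obtained with the RDD method sortBy, which takes a function that extracts the sort key. The following is only a sketch of that alternative, assuming spark-shell with `sc` available; the counts are made-up sample values, not the contents of the files used in this post.

```scala
// Assumes spark-shell, where `sc` is predefined. Sample data is illustrative.
val counts = sc.parallelize(Seq(("hive", 7), ("spark", 8), ("java", 5)))

// Sort by the value (second tuple element) in descending order,
// without swapping key and value first.
val sorted = counts.sortBy(_._2, ascending = false)

val top = sorted.collect().toSeq
// Seq((spark,8), (hive,7), (java,5))
```

Entries with equal values have no guaranteed relative order with either approach.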

>>> If no column names are given when converting to a DataFrame, the columns default to _1 and _2. To name them explicitly:

scala> val lines = rdd.flatMap(x =>x.split(" ")).map(x =>(x,1)).reduceByKey((a,b) =>(a+b)).filter(x => (x._2 > 1)).map(x =>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).toDF("key","value")
lines: org.apache.spark.sql.DataFrame = [key: string, value: int] 

scala> lines.printSchema
root
 |-- key: string (nullable = true)
 |-- value: integer (nullable = false)
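The steps above produce an untyped DataFrame. For a typed Dataset, one common approach is to map the tuples to a case class and call toDS, and to use .rdd to go back. The sketch below assumes spark-shell (so `sc` and `spark` exist); the case class `WordCount` and the sample rows are hypothetical names and data introduced only for this example.

```scala
// Assumes spark-shell, where `sc` and `spark` are predefined.
import spark.implicits._   // already imported in spark-shell; harmless to repeat

// Hypothetical case class matching the (key, value) columns used above.
case class WordCount(key: String, value: Int)

val pairRdd = sc.parallelize(Seq(("spark", 8), ("hive", 7)))

// RDD -> Dataset: map tuples to the case class, then call toDS().
val ds = pairRdd.map { case (k, v) => WordCount(k, v) }.toDS()

// Dataset -> RDD: .rdd yields an RDD[WordCount].
val typedRdd = ds.rdd

// Dataset -> DataFrame -> RDD: going through toDF() yields an RDD[Row].
val rowRdd = ds.toDF().rdd

val typedKeys = typedRdd.map(_.key).collect().toSet
val rowKeys = rowRdd.map(_.getString(0)).collect().toSet
```

With the case class, column names come from the field names, so no arguments to toDF are needed.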

>>> Work with the data through the DataFrame:

scala> lines.select("key").show
+------+
|   key|
+------+
| spark|
|  hive|
|  java|
|  lele|
|spring|
| hbase|
+------+

>>> We can also register the resulting DataFrame as a temporary view and use spark.sql to analyze and query the data

scala> lines.createOrReplaceTempView("spark") 
scala> spark.sql("select * from spark")
res4: org.apache.spark.sql.DataFrame = [key: string, value: int]
The result is still a DataFrame; spark.sql simply takes a SQL statement as its argument.
scala> spark.sql("select * from spark").show
+------+-----+
|   key|value|
+------+-----+
| spark|    8|
|  hive|    7|
|  java|    5|
|  lele|    2|
|spring|    2|
| hbase|    2|
+------+-----+
scala> spark.sql("select count(1) from spark").show
+--------+                                                                      
|count(1)|
+--------+
|       6|
+--------+
scala> spark.sql("select key from spark").show
+------+
|   key|
+------+
| spark|
|  hive|
|  java|
|  lele|
|spring|
| hbase|
+------+
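The SQL queries above have direct equivalents in the DataFrame API. The sketch below compares the two, assuming spark-shell with `spark` available; the DataFrame `df`, the view name `word_counts` and the rows are illustrative stand-ins rather than the data from this post.

```scala
// Assumes spark-shell, where `spark` is predefined.
import spark.implicits._

// Illustrative stand-in for the `lines` DataFrame built earlier.
val df = Seq(("spark", 8), ("hive", 7), ("lele", 2)).toDF("key", "value")

// Hypothetical view name chosen for this example.
df.createOrReplaceTempView("word_counts")

// "select count(1) ..." returns a single row holding a Long.
val viaSql = spark.sql("select count(1) from word_counts").first().getLong(0)

// The same count through the DataFrame API.
val viaApi = df.count()

// "select key ..." corresponds to select("key"); as[String] makes it a Dataset[String].
val keys = df.select("key").as[String].collect().toSet
```

Both routes go through the same query engine, so the choice between SQL strings and method calls is mostly a matter of style.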
