Big Data: Spark RDDs

Spark RDDs

RDD: Resilient Distributed Dataset

Features

  • An RDD is made up of partitions; each partition is processed on a different Worker node, which is how the computation is distributed (A list of partitions)
  • An RDD provides operators (functions) that are applied to the data of each partition (A function for computing each split)
  • An RDD records its dependencies on other RDDs, which are either wide or narrow (A list of dependencies on other RDDs)
  • A custom partitioning rule can be supplied when creating a key-value RDD; see the sketch after this list (Optionally, a Partitioner for key-value RDDs (the RDD is hash-partitioned))
  • Computation is preferentially scheduled on the nodes closest to the data (Optionally, a list of preferred locations to compute each split on (block locations for an HDFS file))
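
As a rough illustration of the partitioner property (a sketch with made-up data; partitionBy and HashPartitioner are standard Spark APIs), a key-value RDD can be repartitioned with a hash partitioner:

import org.apache.spark.HashPartitioner

//a small key-value RDD (hypothetical data)
val pairs = sc.parallelize(List(("Destiny",1000), ("Freedom",2000), ("Destiny",2000)))
//hash-partition it into 3 partitions
val partitioned = pairs.partitionBy(new HashPartitioner(3))
partitioned.partitioner        //Some(HashPartitioner)
partitioned.partitions.length  //3
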
(1)RDD caching

An RDD is marked as cacheable with persist or cache. The cache is only populated when an action actually executes, so the speedup shows up from the second execution onward; a minimal persist/cache sketch follows the list of storage levels below.

Storage levels (where the cached data lives)

  • NONE
  • DISK_ONLY
  • DISK_ONLY_2
  • MEMORY_ONLY
  • MEMORY_ONLY_2
  • MEMORY_ONLY_SER
  • MEMORY_ONLY_SER_2
  • MEMORY_AND_DISK
  • MEMORY_AND_DISK_2
  • MEMORY_AND_DISK_SER
  • MEMORY_AND_DISK_SER_2
  • OFF_HEAP
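
A minimal persist/cache sketch using these levels (the input path is borrowed from the checkpoint example below; cache is shorthand for persist(StorageLevel.MEMORY_ONLY)):

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs://192.168.138.130:9000/tmp/text_Cache.txt")
rdd.persist(StorageLevel.MEMORY_AND_DISK)  //or simply rdd.cache for MEMORY_ONLY
rdd.count  //first action: reads the file and fills the cache
rdd.count  //second action: served from the cache, noticeably faster
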
(2)RDD fault tolerance

Types of RDD checkpoints

  1. Based on a local directory
  2. Based on an HDFS directory
//set the checkpoint directory (an HDFS directory here)
sc.setCheckpointDir("hdfs://192.168.138.130:9000/checkpoint")
//create the RDD that will be checkpointed
val rdd1 = sc.textFile("hdfs://192.168.138.130:9000/tmp/text_Cache.txt")
//mark the RDD for checkpointing; the data is written out when an action runs
rdd1.checkpoint
(3)RDD dependencies (wide and narrow)

Wide dependency: multiple child RDD partitions depend on the same partition of a parent RDD

Narrow dependency: each partition of a parent RDD is used by at most one partition of the child RDD

Wide dependencies (shuffles) are the basis on which a job is divided into stages
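
One way to see these dependencies (a sketch; the word-count chain is only an illustration) is to inspect rdd.dependencies and rdd.toDebugString, both standard RDD methods:

val words  = sc.textFile("/root/spark_WordCount.text").flatMap(_.split(" "))
val pairs  = words.map((_, 1))
val counts = pairs.reduceByKey(_ + _)
pairs.dependencies   //narrow: OneToOneDependency
counts.dependencies  //wide: ShuffleDependency
counts.toDebugString //prints the lineage; each shuffle starts a new stage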

Creating RDDs

(1)With the SparkContext.parallelize method
val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9),3)

(2)From an external data source

val rdd = sc.textFile("/root/spark_WordCount.text")

RDD Operators

I. Transformations

(1)map(func): applies func to every element, like a for loop, and returns a new RDD
val rdd = sc.parallelize(Array(5,6,1,2,4,3))
val rdd1 = rdd.map(_*2)
rdd1.collect

(2)filter(func): keeps only the elements for which func returns true

val rdd = sc.parallelize(Array(5,6,1,2,4,3))
val rdd1 = rdd.filter(_>2)
rdd1.collect

(3)flatMap(func): flat + map; maps each element and then flattens the results

val rdd = sc.parallelize(Array("a b c", "d e f", "g h i"))
val rdd1 = rdd.flatMap(_.split(" "))
rdd1.collect

(4)mapPartitions(func): operates on each partition of the RDD as a whole
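
A minimal mapPartitions sketch (made-up data): the function receives an Iterator over one whole partition and returns a new Iterator, one call per partition:

val rdd = sc.parallelize(List(1,2,3,4,5,6), 2)
val sums = rdd.mapPartitions(iter => Iterator(iter.sum))
sums.collect  //Array(6, 15) -- one sum per partition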

(5)mapPartitionsWithIndex(func): operates on each partition and also exposes the partition index

(6)sample(withReplacement, fraction, seed): sampling
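
A small sample sketch (made-up data): fraction is the expected proportion of elements kept, so the exact count varies between runs unless a seed is fixed:

val rdd = sc.parallelize(1 to 100)
rdd.sample(false, 0.1).collect     //roughly 10 elements, without replacement
rdd.sample(true, 0.5, 42).collect  //with replacement, fixed seed for reproducibility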

(7)union(otherDataset): set operation (union of two RDDs)

val rdd = sc.parallelize(List(1,2,3,4,5))
val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8))
val rdd2 = rdd.union(rdd1)
rdd2.collect
rdd2.distinct.collect  //remove duplicates

val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.union(rdd1)
rdd2.collect

(8)intersection(otherDataset): set operation (intersection of two RDDs)

(9)distinct([numTasks]): removes duplicates
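
A short sketch of intersection and distinct (made-up data):

val rdd  = sc.parallelize(List(1,2,3,4,5))
val rdd1 = sc.parallelize(List(3,4,5,6,7))
rdd.intersection(rdd1).collect    //Array(3, 4, 5), order not guaranteed
rdd.union(rdd1).distinct.collect  //1 through 7, duplicates removed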

(10)groupByKey([numTasks]): groups all values with the same key

val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.union(rdd1)
val rdd3 = rdd2.groupByKey
rdd3.collect

(11)reduceByKey(func, [numTasks]): merges the values of each key with func

val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.union(rdd1)
val rdd3 = rdd2.reduceByKey(_+_)
rdd3.collect

(12)aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): per-key aggregation with separate within-partition and cross-partition functions (see the advanced operators section below)

(13)sortByKey([ascending], [numTasks]): sorts a key-value RDD by key
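
A small sortByKey sketch on a key-value RDD (keys are made up):

val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000), ("Anarchy",3000)))
rdd.sortByKey().collect       //ascending by key
rdd.sortByKey(false).collect  //descending by key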

(14)sortBy(func, [ascending], [numTasks]): sorts by the value computed by func

val rdd = sc.parallelize(Array(5,6,1,2,4,3))
val rdd1 = rdd.sortBy(x => x,true)
rdd1.collect

(15)join(otherDataset, [numTasks]): inner join of two key-value RDDs on their keys
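
A minimal join sketch (made-up pairs): only keys present in both RDDs appear in the result, with their values paired up:

val rdd  = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Chaos",1000)))
rdd.join(rdd1).collect  //Array((Destiny,(1000,2000))) -- only the shared key survives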

(16)cogroup(otherDataset, [numTasks]): for each key, groups the values from both RDDs together

val rdd = sc.parallelize(List(("Destiny",1000), ("Destiny",2000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.cogroup(rdd1)
rdd2.collect

(17)cartesian(otherDataset): Cartesian product of two RDDs

(18)pipe(command, [envVars]): pipes each partition through an external shell command

(19)coalesce(numPartitions): reduces the number of partitions; no shuffle by default

(20)repartition(numPartitions): changes the number of partitions, always with a shuffle

(21)repartitionAndSortWithinPartitions(partitioner): repartitions by the given partitioner and sorts within each partition

II. Actions

(1)reduce(func)
val rdd = sc.parallelize(Array(1,2,3,4,5,6))
val sum = rdd.reduce(_+_)  //21; reduce is an action and returns a value, not an RDD

(2)collect()

(3)count()

(4)first()

(5)take(n)

(6)takeSample(withReplacement, num, [seed])

(7)takeOrdered(n, [ordering])
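
A combined sketch of count, first, take and takeOrdered on a small unsorted array (same data as the map example above):

val rdd = sc.parallelize(Array(5,6,1,2,4,3))
rdd.count           //6
rdd.first           //5 -- the first element of the first partition
rdd.take(3)         //Array(5, 6, 1) -- the first 3 elements, no sorting
rdd.takeOrdered(3)  //Array(1, 2, 3) -- the 3 smallest, by natural ordering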

(8)saveAsTextFile(path)

(9)saveAsSequenceFile(path)

(10)saveAsObjectFile(path)

(11)countByKey()
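
A short countByKey sketch (made-up pairs): it counts how many times each key occurs (not the sum of the values) and returns a Scala Map to the driver:

val rdd = sc.parallelize(List(("Destiny",1000), ("Destiny",2000), ("Freedom",2000)))
rdd.countByKey  //Map(Destiny -> 2, Freedom -> 1)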

(12)foreach(func): like map, but returns no value; func runs only for its side effects
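
A small foreach sketch (made-up data): unlike map it returns Unit, and the side effect runs on the executors, so on a cluster the println output shows up in the executor logs rather than on the driver console:

val rdd = sc.parallelize(Array(1,2,3,4,5,6))
rdd.foreach(x => println(x * 2))  //side effect only; no new RDD is produced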

Advanced RDD Operators

(1)mapPartitionsWithIndex

Operates on each partition of the RDD; the index parameter carries the partition number

val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
def func(index: Int, iter: Iterator[Int]): Iterator[String] = {
	iter.toList.map(x => "Index = "+index+", Value = "+x).iterator
}
rdd.mapPartitionsWithIndex(func).collect

(2)aggregate

Aggregation: first aggregates locally within each partition, then combines the per-partition results globally

val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
import scala.math._
rdd.aggregate(0)(max(_,_), _+_)  //9: the max of each partition (3 and 6), then summed
rdd.aggregate(0)(_+_, _+_)  //21
rdd.aggregate(10)(_+_, _+_)  //51: the zero value 10 is applied once per partition and once more when combining
val rdd = sc.parallelize(List("a","b","c","d","e","f"),2)
rdd.aggregate("*")(_+_, _+_)  //"**abc*def" or "**def*abc" -- partition result order is not deterministic

(3)aggregateByKey

Similar to aggregate, but operates on key-value data

val rdd = sc.parallelize(List(("cat",2), ("cat", 5), ("mouse", 4), ("cat", 12), ("dog", 12), ("mouse", 2)),2)
def func(index: Int, iter: Iterator[(String, Int)]): Iterator[String] = {
	iter.toList.map(x => "Index = "+index+", Value = "+x).iterator
}
rdd.mapPartitionsWithIndex(func).collect
import scala.math._
rdd.aggregateByKey(0)(math.max(_,_),_+_).collect
rdd.aggregateByKey(0)(_+_,_+_).collect
(4)coalesce

By default coalesce does not shuffle (its shuffle flag defaults to false); passing true forces a shuffle, which is needed when increasing the number of partitions

val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
val rdd1 = rdd.coalesce(3,true)
rdd1.partitions.length

(5)repartition

Always performs a shuffle

val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
val rdd1 = rdd.repartition(3)
rdd1.partitions.length

(6)Other advanced operators

http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
