Big Data: Spark RDDs

Spark RDDs

RDD: Resilient Distributed Dataset

Features

  • An RDD is made up of partitions; each partition is processed on a different Worker node, which is how the computation is distributed (A list of partitions)
  • An RDD provides operators (functions) that are applied to the data of each partition (A function for computing each split)
  • An RDD records its dependencies on other RDDs, which are either wide or narrow (A list of dependencies on other RDDs)
  • A custom partitioning rule can be supplied when creating a key-value RDD; see the sketch after this list (Optionally, a Partitioner for key-value RDDs (the RDD is hash-partitioned))
  • Computation is preferentially scheduled on the nodes closest to the data (Optionally, a list of preferred locations to compute each split on (block locations for an HDFS file))
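
As a rough illustration of the partitioner property (a sketch with made-up data; partitionBy and HashPartitioner are standard Spark APIs), a key-value RDD can be repartitioned with a hash partitioner:

import org.apache.spark.HashPartitioner

//a small key-value RDD (hypothetical data)
val pairs = sc.parallelize(List(("Destiny",1000), ("Freedom",2000), ("Destiny",2000)))
//hash-partition it into 3 partitions
val partitioned = pairs.partitionBy(new HashPartitioner(3))
partitioned.partitioner        //Some(HashPartitioner)
partitioned.partitions.length  //3
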
(1)RDD caching

An RDD is marked as cacheable with persist or cache. The cache is only populated when an action actually executes, so the speedup shows up from the second execution onward; a minimal persist/cache sketch follows the list of storage levels below.

Storage levels (where the cached data lives)

  • NONE
  • DISK_ONLY
  • DISK_ONLY_2
  • MEMORY_ONLY
  • MEMORY_ONLY_2
  • MEMORY_ONLY_SER
  • MEMORY_ONLY_SER_2
  • MEMORY_AND_DISK
  • MEMORY_AND_DISK_2
  • MEMORY_AND_DISK_SER
  • MEMORY_AND_DISK_SER_2
  • OFF_HEAP
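
A minimal persist/cache sketch using these levels (the input path is borrowed from the checkpoint example below; cache is shorthand for persist(StorageLevel.MEMORY_ONLY)):

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs://192.168.138.130:9000/tmp/text_Cache.txt")
rdd.persist(StorageLevel.MEMORY_AND_DISK)  //or simply rdd.cache for MEMORY_ONLY
rdd.count  //first action: reads the file and fills the cache
rdd.count  //second action: served from the cache, noticeably faster
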
(2)RDD fault tolerance

Types of RDD checkpoints

  1. Based on a local directory
  2. Based on an HDFS directory
//set the checkpoint directory (an HDFS directory here)
sc.setCheckpointDir("hdfs://192.168.138.130:9000/checkpoint")
//create the RDD that will be checkpointed
val rdd1 = sc.textFile("hdfs://192.168.138.130:9000/tmp/text_Cache.txt")
//mark the RDD for checkpointing; the data is written out when an action runs
rdd1.checkpoint
(3)RDD dependencies (wide and narrow)

Wide dependency: multiple child RDD partitions depend on the same partition of a parent RDD

Narrow dependency: each partition of a parent RDD is used by at most one partition of the child RDD

Wide dependencies (shuffles) are the basis on which a job is divided into stages
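
One way to see these dependencies (a sketch; the word-count chain is only an illustration) is to inspect rdd.dependencies and rdd.toDebugString, both standard RDD methods:

val words  = sc.textFile("/root/spark_WordCount.text").flatMap(_.split(" "))
val pairs  = words.map((_, 1))
val counts = pairs.reduceByKey(_ + _)
pairs.dependencies   //narrow: OneToOneDependency
counts.dependencies  //wide: ShuffleDependency
counts.toDebugString //prints the lineage; each shuffle starts a new stage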

Creating RDDs

(1)With the SparkContext.parallelize method
val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9),3)

(2)From an external data source

val rdd = sc.textFile("/root/spark_WordCount.text")

RDD Operators

I. Transformations

(1)map(func): applies func to every element, like a for loop, and returns a new RDD
val rdd = sc.parallelize(Array(5,6,1,2,4,3))
val rdd1 = rdd.map(_*2)
rdd1.collect

(2)filter(func): keeps only the elements for which func returns true

val rdd = sc.parallelize(Array(5,6,1,2,4,3))
val rdd1 = rdd.filter(_>2)
rdd1.collect

(3)flatMap(func): flat + map; maps each element and then flattens the results

val rdd = sc.parallelize(Array("a b c", "d e f", "g h i"))
val rdd1 = rdd.flatMap(_.split(" "))
rdd1.collect

(4)mapPartitions(func): operates on each partition of the RDD as a whole
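
A minimal mapPartitions sketch (made-up data): the function receives an Iterator over one whole partition and returns a new Iterator, one call per partition:

val rdd = sc.parallelize(List(1,2,3,4,5,6), 2)
val sums = rdd.mapPartitions(iter => Iterator(iter.sum))
sums.collect  //Array(6, 15) -- one sum per partition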

(5)mapPartitionsWithIndex(func): operates on each partition and also exposes the partition index

(6)sample(withReplacement, fraction, seed): sampling
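
A small sample sketch (made-up data): fraction is the expected proportion of elements kept, so the exact count varies between runs unless a seed is fixed:

val rdd = sc.parallelize(1 to 100)
rdd.sample(false, 0.1).collect     //roughly 10 elements, without replacement
rdd.sample(true, 0.5, 42).collect  //with replacement, fixed seed for reproducibility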

(7)union(otherDataset): set operation (union of two RDDs)

val rdd = sc.parallelize(List(1,2,3,4,5))
val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8))
val rdd2 = rdd.union(rdd1)
rdd2.collect
rdd2.distinct.collect  //remove duplicates

val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.union(rdd1)
rdd2.collect

(8)intersection(otherDataset): set operation (intersection of two RDDs)

(9)distinct([numTasks]): removes duplicates
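
A short sketch of intersection and distinct (made-up data):

val rdd  = sc.parallelize(List(1,2,3,4,5))
val rdd1 = sc.parallelize(List(3,4,5,6,7))
rdd.intersection(rdd1).collect    //Array(3, 4, 5), order not guaranteed
rdd.union(rdd1).distinct.collect  //1 through 7, duplicates removed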

(10)groupByKey([numTasks]): groups all values with the same key

val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.union(rdd1)
val rdd3 = rdd2.groupByKey
rdd3.collect

(11)reduceByKey(func, [numTasks]): merges the values of each key with func

val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.union(rdd1)
val rdd3 = rdd2.reduceByKey(_+_)
rdd3.collect

(12)aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): per-key aggregation with separate within-partition and cross-partition functions (see the advanced operators section below)

(13)sortByKey([ascending], [numTasks]): sorts a key-value RDD by key
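
A small sortByKey sketch on a key-value RDD (keys are made up):

val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000), ("Anarchy",3000)))
rdd.sortByKey().collect       //ascending by key
rdd.sortByKey(false).collect  //descending by key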

(14)sortBy(func, [ascending], [numTasks]): sorts by the value computed by func

val rdd = sc.parallelize(Array(5,6,1,2,4,3))
val rdd1 = rdd.sortBy(x => x,true)
rdd1.collect

(15)join(otherDataset, [numTasks]): inner join of two key-value RDDs on their keys
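
A minimal join sketch (made-up pairs): only keys present in both RDDs appear in the result, with their values paired up:

val rdd  = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Chaos",1000)))
rdd.join(rdd1).collect  //Array((Destiny,(1000,2000))) -- only the shared key survives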

(16)cogroup(otherDataset, [numTasks]): for each key, groups the values from both RDDs together

val rdd = sc.parallelize(List(("Destiny",1000), ("Destiny",2000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.cogroup(rdd1)
rdd2.collect

(17)cartesian(otherDataset): Cartesian product of two RDDs

(18)pipe(command, [envVars]): pipes each partition through an external shell command

(19)coalesce(numPartitions): reduces the number of partitions; no shuffle by default

(20)repartition(numPartitions): changes the number of partitions, always with a shuffle

(21)repartitionAndSortWithinPartitions(partitioner): repartitions by the given partitioner and sorts within each partition

II. Actions

(1)reduce(func)
val rdd = sc.parallelize(Array(1,2,3,4,5,6))
val sum = rdd.reduce(_+_)  //21; reduce is an action and returns a value, not an RDD

(2)collect()

(3)count()

(4)first()

(5)take(n)

(6)takeSample(withReplacement, num, [seed])

(7)takeOrdered(n, [ordering])
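
A combined sketch of count, first, take and takeOrdered on a small unsorted array (same data as the map example above):

val rdd = sc.parallelize(Array(5,6,1,2,4,3))
rdd.count           //6
rdd.first           //5 -- the first element of the first partition
rdd.take(3)         //Array(5, 6, 1) -- the first 3 elements, no sorting
rdd.takeOrdered(3)  //Array(1, 2, 3) -- the 3 smallest, by natural ordering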

(8)saveAsTextFile(path)

(9)saveAsSequenceFile(path)

(10)saveAsObjectFile(path)

(11)countByKey()
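
A short countByKey sketch (made-up pairs): it counts how many times each key occurs (not the sum of the values) and returns a Scala Map to the driver:

val rdd = sc.parallelize(List(("Destiny",1000), ("Destiny",2000), ("Freedom",2000)))
rdd.countByKey  //Map(Destiny -> 2, Freedom -> 1)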

(12)foreach(func): like map, but returns no value; func runs only for its side effects
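
A small foreach sketch (made-up data): unlike map it returns Unit, and the side effect runs on the executors, so on a cluster the println output shows up in the executor logs rather than on the driver console:

val rdd = sc.parallelize(Array(1,2,3,4,5,6))
rdd.foreach(x => println(x * 2))  //side effect only; no new RDD is produced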

Advanced RDD Operators

(1)mapPartitionsWithIndex

Operates on each partition of the RDD; the index parameter carries the partition number

val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
def func(index: Int, iter: Iterator[Int]): Iterator[String] = {
	iter.toList.map(x => "Index = "+index+", Value = "+x).iterator
}
rdd.mapPartitionsWithIndex(func).collect

(2)aggregate

Aggregation: first aggregates locally within each partition, then combines the per-partition results globally

val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
import scala.math._
rdd.aggregate(0)(max(_,_), _+_)  //9: the max of each partition (3 and 6), then summed
rdd.aggregate(0)(_+_, _+_)  //21
rdd.aggregate(10)(_+_, _+_)  //51: the zero value 10 is applied once per partition and once more when combining
val rdd = sc.parallelize(List("a","b","c","d","e","f"),2)
rdd.aggregate("*")(_+_, _+_)  //"**abc*def" or "**def*abc" -- partition result order is not deterministic

(3)aggregateByKey

Similar to aggregate, but operates on key-value data

val rdd = sc.parallelize(List(("cat",2), ("cat", 5), ("mouse", 4), ("cat", 12), ("dog", 12), ("mouse", 2)),2)
def func(index: Int, iter: Iterator[(String, Int)]): Iterator[String] = {
	iter.toList.map(x => "Index = "+index+", Value = "+x).iterator
}
rdd.mapPartitionsWithIndex(func).collect
import scala.math._
rdd.aggregateByKey(0)(math.max(_,_),_+_).collect
rdd.aggregateByKey(0)(_+_,_+_).collect
(4)coalesce

By default coalesce does not shuffle (its shuffle flag defaults to false); passing true forces a shuffle, which is needed when increasing the number of partitions

val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
val rdd1 = rdd.coalesce(3,true)
rdd1.partitions.length

(5)repartition

Always performs a shuffle

val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
val rdd1 = rdd.repartition(3)
rdd1.partitions.length

(6)Other advanced operators

http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
