Spark RDD
RDD: Resilient Distributed Dataset
Characteristics
- An RDD is composed of partitions; each partition runs on a different worker, which is how distributed computing is achieved (A list of partitions)
- An RDD provides an operator to process the data in each partition (A function for computing each split)
- An RDD has dependencies on other RDDs: wide and narrow dependencies (A list of dependencies on other RDDs)
- You can customize the partitioning rules used to build an RDD (Optionally, a Partitioner for key-value RDDs (e.g. the RDD is hash-partitioned))
- Nodes close to the data's location are preferred for execution (Optionally, a list of preferred locations to compute each split on (block locations for an HDFS file))
Note that an RDD can be cached (persist, cache). Caching is lazy: the first action triggers the computation and fills the cache, so only subsequent executions of the statement benefit from it
Cache storage levels (StorageLevel)
- NONE
- DISK_ONLY
- DISK_ONLY_2
- MEMORY_ONLY
- MEMORY_ONLY_2
- MEMORY_ONLY_SER
- MEMORY_ONLY_SER_2
- MEMORY_AND_DISK
- MEMORY_AND_DISK_2
- MEMORY_AND_DISK_SER
- MEMORY_AND_DISK_SER_2
- OFF_HEAP
Types of RDD checkpoints
- Based on a local directory
- Based on an HDFS directory
// Create the RDD to checkpoint
val rdd1 = sc.textFile("hdfs://192.168.138.130:9000/tmp/text_Cache.txt")
// Set the checkpoint directory
sc.setCheckpointDir("hdfs://192.168.138.130:9000/checkpoint")
// Mark the RDD for checkpointing (written out on the next action)
rdd1.checkpoint
(3) RDD dependencies (wide and narrow)
Wide dependency: partitions of multiple child RDDs depend on the same parent RDD partition
Narrow dependency: each parent RDD partition is used by at most one partition of the child RDD
Wide dependencies (shuffle boundaries) are the basis for dividing stages
RDD creation
(1) Created by the SparkContext.parallelize method
val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9), 3)
(2) Created by external data source
val rdd = sc.textFile("/root/spark_WordCount.text")
RDD operator
1. Transformation
(1) map(func) is equivalent to a for loop and returns a new RDD
val rdd = sc.parallelize(Array(5,6,1,2,4,3))
val rdd1 = rdd.map(_*2)
rdd1.collect
(2) filter (func) filtering
val rdd = sc.parallelize(Array(5,6,1,2,4,3))
val rdd1 = rdd.filter(_>2)
rdd1.collect
(3) flatMap(func) flat + map: flattens the result
val rdd = sc.parallelize(Array("a b c", "d e f", "g h i"))
val rdd1 = rdd.flatMap(_.split(" "))
rdd1.collect
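The flatMap semantics above can be mimicked with plain Scala collections (a local sketch of the behavior, not Spark code):

```scala
// Mimic of rdd.flatMap(_.split(" ")) on a local collection:
// each element maps to several elements, and the results are flattened
val data = List("a b c", "d e f", "g h i")
val flattened = data.flatMap(_.split(" "))
// flattened: List(a, b, c, d, e, f, g, h, i)
```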
(4) mapPartitions (func) operates on each partition of RDD
(5) mapPartitionsWithIndex (func) operates on each partition of RDD, you can get the partition number
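The difference between map and the partition-wise operators can be sketched with local collections, treating each sub-list as one partition (an illustration of the semantics, not the Spark API):

```scala
// Two "partitions" of data
val partitions = List(List(1,2,3), List(4,5,6))
// map-style: the function sees one element at a time
val mapped = partitions.flatten.map(_ * 2)
// mapPartitions-style: the function sees a whole partition at once, so
// per-partition setup (e.g. opening a connection) can happen once per partition
val perPartition = partitions.flatMap(part => part.map(_ * 2))
// mapPartitionsWithIndex-style: the partition number is also available
val withIndex = partitions.zipWithIndex.flatMap { case (part, idx) =>
  part.map(x => s"Index = $idx, Value = $x")
}
```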
(6) sample (withReplacement, fraction, seed) sampling
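Sampling without replacement can be sketched locally: each element is kept independently with probability `fraction`, and the seed makes the result reproducible (a plain-Scala illustration, not the Spark implementation):

```scala
import scala.util.Random

// Mimic of sample(withReplacement = false, fraction = 0.3, seed = 42)
val data = (1 to 100).toList
val rng = new Random(42) // fixed seed => reproducible sample
val sampled = data.filter(_ => rng.nextDouble() < 0.3)
```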
(7) union (otherDataset) Set operation
val rdd = sc.parallelize(List(1,2,3,4,5))
val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8))
val rdd2 = rdd.union(rdd1)
rdd2.collect
rdd2.distinct.collect // deduplicate
val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.union(rdd1)
rdd2.collect
(8) intersection (otherDataset) set operation
(9) distinct ([numTasks]) deduplication
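The set operations behave like their Scala-collection counterparts, which can serve as a local sketch:

```scala
// Local mimics of intersection and distinct
val a = List(1,2,3,4,5)
val b = List(4,5,6,7)
val inter = a.intersect(b)      // like rdd.intersection(other)
val deduped = (a ++ b).distinct // like rdd.union(other).distinct
```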
(10) groupByKey ([numTasks]) aggregation operation
val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.union(rdd1)
val rdd3 = rdd2.groupByKey
rdd3.collect
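What groupByKey produces can be mimicked locally: all values for a key are collected into one sequence (a sketch of the semantics, not Spark code):

```scala
// Local mimic of groupByKey on the union of the two pair lists above
val pairs = List(("Destiny",1000), ("Freedom",2000), ("Destiny",2000), ("Freedom",1000))
val grouped = pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
// grouped: Map(Destiny -> List(1000, 2000), Freedom -> List(2000, 1000))
```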
(11) reduceByKey (func, [numTasks]) aggregation operation
val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.union(rdd1)
val rdd3 = rdd2.reduceByKey(_+_)
rdd3.collect
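reduceByKey combines the values per key with the given function; a local sketch of the result (note that in Spark, reduceByKey also combines map-side before the shuffle, unlike groupByKey):

```scala
// Local mimic of reduceByKey(_+_)
val pairs = List(("Destiny",1000), ("Freedom",2000), ("Destiny",2000), ("Freedom",1000))
val reduced = pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(_ + _) }
// reduced: Map(Destiny -> 3000, Freedom -> 3000)
```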
(12) aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) aggregation operation
(13) sortByKey([ascending], [numTasks]) sorting operation
(14) sortBy(func, [ascending], [numTasks]) sorting operation
val rdd = sc.parallelize(Array(5,6,1,2,4,3))
val rdd1 = rdd.sortBy(x => x,true)
rdd1.collect
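The sortBy semantics match Scala's own sortBy; a local sketch:

```scala
// Local mimic of rdd.sortBy(x => x, true) and the descending variant
val data = List(5,6,1,2,4,3)
val asc  = data.sortBy(identity)         // like sortBy(x => x, true)
val desc = data.sortBy(identity).reverse // like sortBy(x => x, false)
```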
(15) join(otherDataset, [numTasks]) join operation
(16) cogroup(otherDataset, [numTasks]) grouping operation
val rdd = sc.parallelize(List(("Destiny",1000), ("Destiny",2000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.cogroup(rdd1)
rdd2.collect
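cogroup pairs, for each key, the list of values from the left side with the list from the right side; a local sketch of what the example above produces:

```scala
// Local mimic of rdd.cogroup(rdd1)
val left  = List(("Destiny",1000), ("Destiny",2000), ("Freedom",2000))
val right = List(("Destiny",2000), ("Freedom",1000))
val keys = (left.map(_._1) ++ right.map(_._1)).distinct
val cogrouped = keys.map { k =>
  k -> (left.filter(_._1 == k).map(_._2), right.filter(_._1 == k).map(_._2))
}
// Destiny -> (List(1000, 2000), List(2000)), Freedom -> (List(2000), List(1000))
```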
(17) cartesian(otherDataset) Cartesian product
(18) pipe(command, [envVars]) pipes each partition through an external command
(19) coalesce(numPartitions) reduces the number of partitions
(20) repartition(numPartitions) reshuffles the data into numPartitions partitions
(21) repartitionAndSortWithinPartitions(partitioner) repartitions and sorts within each partition
2. Action
(1) reduce(func)
val rdd = sc.parallelize(Array(1,2,3,4,5,6))
val sum = rdd.reduce(_+_) // reduce is an action: it returns a value (21), not an RDD
(2)collect()
(3)count()
(4)first()
(5)take(n)
(6)takeSample(withReplacement, num, [seed])
(7)takeOrdered(n, [ordering])
(8)saveAsTextFile(path)
(9)saveAsSequenceFile(path)
(10)saveAsObjectFile(path)
(11)countByKey()
(12) foreach(func) similar to map, but returns no value (runs for its side effects)
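The common actions above behave like the corresponding Scala-collection methods on the local data; a quick sketch of each:

```scala
// Local mimics of the common actions
val data = List(1,2,3,4,5,6)
val sum   = data.reduce(_ + _)   // reduce(func)
val n     = data.size            // count()
val head  = data.head            // first()
val take3 = data.take(3)         // take(3)
val low2  = data.sorted.take(2)  // takeOrdered(2)
data.foreach(x => ())            // foreach: side effects only, returns Unit
```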
Advanced Operators of RDD
(1) mapPartitionsWithIndex
Operates on each partition of the RDD; the partition number is available through the index parameter
val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
def func(index: Int, iter: Iterator[Int]): Iterator[String] = {
iter.toList.map(x => "Index = "+index+", Value = "+x).iterator
}
rdd.mapPartitionsWithIndex(func).collect
(2)aggregate
Aggregate operation. Aggregate the local part first, then aggregate the global part
val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
rdd.aggregate(0)(math.max(_,_), _+_) // aggregate is an action: it returns a value, not an RDD
rdd.aggregate(0)(_+_, _+_)
rdd.aggregate(10)(_+_, _+_) // 51: the zero value 10 is added once per partition and once in the global combine
val rdd = sc.parallelize(List("a","b","c","d","e","f"),2)
rdd.aggregate("*")(_+_, _+_) // "**abc*def" or "**def*abc": the order in which partition results arrive is not guaranteed
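The local-then-global combine can be traced with plain Scala collections, treating each sub-list as one partition (a sketch of the semantics, not Spark code); it reproduces the 51 from the example above:

```scala
// Local mimic of aggregate(zeroValue)(seqOp, combOp) with two "partitions":
// the zero value is applied once per partition and once more in the global combine
val parts = List(List(1,2,3), List(4,5,6))
val zero  = 10
val perPartition = parts.map(p => p.foldLeft(zero)(_ + _)) // List(16, 25)
val global = perPartition.foldLeft(zero)(_ + _)            // 10 + 16 + 25 = 51
```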
(3)aggregateByKey
Similar to aggregate, but operates on key-value data
val rdd = sc.parallelize(List(("cat",2), ("cat", 5), ("mouse", 4), ("cat", 12), ("dog", 12), ("mouse", 2)),2)
def func(index: Int, iter: Iterator[(String, Int)]): Iterator[String] = {
iter.toList.map(x => "Index = "+index+", Value = "+x).iterator
}
rdd.mapPartitionsWithIndex(func).collect
import scala.math._
rdd.aggregateByKey(0)(math.max(_,_),_+_).collect
rdd.aggregateByKey(0)(_+_,_+_).collect
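The max-then-sum example can be traced locally, using the same two partitions the mapPartitionsWithIndex call reveals (a plain-Scala sketch of the semantics):

```scala
// Local mimic of aggregateByKey(0)(math.max(_,_), _+_) with two "partitions"
val parts = List(
  List(("cat",2), ("cat",5), ("mouse",4)),
  List(("cat",12), ("dog",12), ("mouse",2))
)
val zero = 0
// seqOp within each partition: running max per key
val perPartition = parts.map { p =>
  p.groupBy(_._1).map { case (k, kvs) =>
    k -> kvs.map(_._2).foldLeft(zero)((a, b) => math.max(a, b))
  }
}
// combOp across partitions: sum the per-partition maxima per key
val result = perPartition.flatten.groupBy(_._1).map { case (k, kvs) =>
  k -> kvs.map(_._2).sum
}
// result: Map(cat -> 17, dog -> 12, mouse -> 6)
```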
(4) coalesce
By default coalesce does not shuffle (shuffle = false); pass true to shuffle the data across partitions (required when increasing the partition count)
val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
val rdd1 = rdd.coalesce(3,true)
rdd1.partitions.length
(5)repartition
Always shuffles (equivalent to coalesce(numPartitions, shuffle = true))
val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
val rdd1 = rdd.repartition(3)
rdd1.partitions.length
(6) Other advanced operators
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html