Big Data: Spark's RDD

RDD: Resilient Distributed Dataset

Characteristics

  • An RDD is composed of partitions; each partition is processed on a different worker, which is how the computation is distributed (a list of partitions)
  • An RDD provides a function that processes the data of each partition (a function for computing each split)
  • RDDs have dependencies on one another, either wide or narrow (a list of dependencies on other RDDs)
  • You can customize the partitioning rule when creating an RDD (optionally, a Partitioner for key-value RDDs, e.g. hash-partitioned)
  • Nodes close to the data are preferred for execution (optionally, a list of preferred locations to compute each split on, e.g. block locations for an HDFS file)
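
These pieces can be inspected directly from the RDD API; a minimal spark-shell sketch, assuming a live SparkContext sc:

val rdd = sc.parallelize(1 to 9, 3)        // 3 partitions, spread across workers
rdd.getNumPartitions                        // 3: the list of partitions
rdd.dependencies                            // dependencies on other RDDs (empty for a source RDD)
rdd.preferredLocations(rdd.partitions(0))   // preferred locations for the first split
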
(1) RDD cache mechanism

Calling persist or cache only marks an RDD as cacheable; the cache is actually filled when the first action runs, so it is the second and later executions of the statement that see the speedup (a sketch follows the list of storage levels below).

Cache storage levels

  • NONE
  • DISK_ONLY
  • DISK_ONLY_2
  • MEMORY_ONLY
  • MEMORY_ONLY_2
  • MEMORY_ONLY_SER
  • MEMORY_ONLY_SER_2
  • MEMORY_AND_DISK
  • MEMORY_AND_DISK_2
  • MEMORY_AND_DISK_SER
  • MEMORY_AND_DISK_SER_2
  • OFF_HEAP
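
A minimal persist/cache sketch, assuming a live SparkContext sc and that the HDFS file below exists:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs://192.168.138.130:9000/tmp/text_Cache.txt")
rdd.persist(StorageLevel.MEMORY_AND_DISK)  // mark for caching; nothing is materialized yet
rdd.count                                  // first action: reads HDFS and fills the cache
rdd.count                                  // second action: served from the cache
rdd.unpersist()                            // release the cached blocks
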
(2) Fault tolerance mechanism of RDD

Types of RDD checkpoints

  1. Based on a local directory
  2. Based on an HDFS directory
// set the checkpoint directory (must be done before calling checkpoint)
sc.setCheckpointDir("hdfs://192.168.138.130:9000/checkpoint")
// create the RDD to be checkpointed
val rdd1 = sc.textFile("hdfs://192.168.138.130:9000/tmp/text_Cache.txt")
// mark the RDD for checkpointing; the checkpoint is written when the next action runs
rdd1.checkpoint
(3) RDD dependencies (wide and narrow)

Wide dependency: multiple child RDD partitions depend on the same parent RDD partition (a shuffle is required)

Narrow dependency: each parent RDD partition is used by at most one partition of the child RDD

Wide dependencies are where the DAG is split into stages; narrow dependencies can be pipelined within a single stage
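
The lineage, and where the stage boundary falls, can be inspected with toDebugString; a small sketch reusing the word-count file path from the section below:

val words = sc.textFile("/root/spark_WordCount.text").flatMap(_.split(" "))
val pairs = words.map((_, 1))           // narrow dependency: pipelined inside one stage
val counts = pairs.reduceByKey(_ + _)   // wide dependency: the shuffle starts a new stage
counts.toDebugString                    // prints the lineage with the shuffle boundary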

RDD creation

(1) Created by the SparkContext.parallelize method
val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9),3)

(2) Created from an external data source

val rdd = sc.textFile("/root/spark_WordCount.text")

RDD operators

First, Transformations

(1) map (func) applies func to each element, like a for loop over the RDD, and returns a new RDD
val rdd = sc.parallelize(Array(5,6,1,2,4,3))
val rdd1 = rdd.map(_*2)
rdd1.collect

(2) filter (func) keeps the elements for which func returns true

val rdd = sc.parallelize(Array(5,6,1,2,4,3))
val rdd1 = rdd.filter(_>2)
rdd1.collect

(3) flatMap (func) flat + map: maps each element and flattens the results

val rdd = sc.parallelize(Array("a b c", "d e f", "g h i"))
val rdd1 = rdd.flatMap(_.split(" "))
rdd1.collect

(4) mapPartitions (func) operates on each partition of the RDD
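
A minimal sketch: func receives one iterator per partition, so per-partition work such as a sum is done in a single pass:

val rdd = sc.parallelize(List(1,2,3,4,5,6), 2)
rdd.mapPartitions(iter => Iterator(iter.sum)).collect  // Array(6, 15): one sum per partition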

(5) mapPartitionsWithIndex (func) operates on each partition of the RDD and also exposes the partition number (example in the advanced operators section below)

(6) sample (withReplacement, fraction, seed) sampling
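
For example (results vary with the seed):

val rdd = sc.parallelize(1 to 10)
rdd.sample(false, 0.5).collect     // without replacement: each element kept with probability 0.5
rdd.sample(true, 2.0, 42).collect  // with replacement: each element drawn about twice on average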

(7) union (otherDataset) set operation

val rdd = sc.parallelize(List(1,2,3,4,5))
val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8))
val rdd2 = rdd.union(rdd1)
rdd2.collect
rdd2.distinct.collect  // remove duplicates

val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.union(rdd1)
rdd2.collect

(8) intersection (otherDataset) set operation
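
For example:

val rdd = sc.parallelize(List(1,2,3,4,5))
val rdd1 = sc.parallelize(List(3,4,5,6,7))
rdd.intersection(rdd1).collect  // Array(3, 4, 5), in some order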

(9) distinct ([numTasks]) deduplication

(10) groupByKey ([numTasks]) aggregation operation: groups values with the same key

val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.union(rdd1)
val rdd3 = rdd2.groupByKey
rdd3.collect

(11) reduceByKey (func, [numTasks]) aggregation operation

val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.union(rdd1)
val rdd3 = rdd2.reduceByKey(_+_)
rdd3.collect

(12) aggregateByKey (zeroValue) (seqOp, combOp, [numTasks]) aggregation operation (example in the advanced operators section below)

(13) sortByKey ([ascending], [numTasks]) sorts by key
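
For example, reusing the key-value pairs from above:

val rdd = sc.parallelize(List(("Freedom",2000), ("Destiny",1000)))
rdd.sortByKey().collect       // ascending by key: Destiny first
rdd.sortByKey(false).collect  // descending by key: Freedom first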

(14) sortBy (func, [ascending], [numTasks]) sorts by the value computed by func

val rdd = sc.parallelize(Array(5,6,1,2,4,3))
val rdd1 = rdd.sortBy(x => x,true)
rdd1.collect

(15) join (otherDataset, [numTasks]) joins two key-value RDDs on their keys
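
For example:

val rdd = sc.parallelize(List(("Destiny",1000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
rdd.join(rdd1).collect  // Array((Destiny,(1000,2000)), (Freedom,(2000,1000))), in some order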

(16) cogroup (otherDataset, [numTasks]) groups both RDDs by key; each key maps to a pair of value collections

val rdd = sc.parallelize(List(("Destiny",1000), ("Destiny",2000), ("Freedom",2000)))
val rdd1 = sc.parallelize(List(("Destiny",2000), ("Freedom",1000)))
val rdd2 = rdd.cogroup(rdd1)
rdd2.collect

(17) cartesian (otherDataset) Cartesian product of two RDDs
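
For example:

val rdd = sc.parallelize(List(1, 2))
val rdd1 = sc.parallelize(List("a", "b"))
rdd.cartesian(rdd1).collect  // Array((1,a), (1,b), (2,a), (2,b))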

(18) pipe (command, [envVars]) pipes each partition through an external command
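
A minimal sketch, assuming the external command (here cat) exists on every worker:

val rdd = sc.parallelize(List("hello", "world"), 2)
rdd.pipe("cat").collect  // each element is written to stdin, one per line, and read back from stdout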

(19) coalesce (numPartitions) reduces the number of partitions (no shuffle by default; example in the advanced operators section below)

(20) repartition (numPartitions) redistributes the data into numPartitions partitions (always shuffles; example in the advanced operators section below)

(21) repartitionAndSortWithinPartitions (partitioner) repartitions the RDD by the given partitioner and sorts records within each partition
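
A minimal sketch for key-value data, using a HashPartitioner:

import org.apache.spark.HashPartitioner

val rdd = sc.parallelize(List((3,"c"), (1,"a"), (4,"d"), (2,"b")), 2)
rdd.repartitionAndSortWithinPartitions(new HashPartitioner(2)).collect
// keys hashed into 2 partitions, sorted within each: Array((2,b), (4,d), (1,a), (3,c))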

Second, Actions

(1)reduce(func) aggregates all elements with func

val rdd = sc.parallelize(Array(1,2,3,4,5,6))
val sum = rdd.reduce(_+_)  // reduce is an action: it returns 21 directly, not an RDD

(2)collect()

(3)count()

(4)first()

(5)take(n)

(6)takeSample(withReplacement, num, [seed])

(7)takeOrdered(n, [ordering])

(8)saveAsTextFile(path)

(9)saveAsSequenceFile(path)

(10)saveAsObjectFile(path)

(11)countByKey()

(12)foreach(func) similar to map, but it is an action with no return value (runs func for its side effects)
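
A quick sketch of several of these actions (the output path is illustrative):

val rdd = sc.parallelize(Array(5,6,1,2,4,3))
rdd.count                 // 6
rdd.first                 // 5
rdd.take(3)               // Array(5, 6, 1): the first 3 elements
rdd.takeOrdered(3)        // Array(1, 2, 3): the smallest 3
rdd.saveAsTextFile("/tmp/rdd_out")  // one output file per partition
sc.parallelize(List(("Destiny",1000), ("Destiny",2000))).countByKey  // Map(Destiny -> 2)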

Advanced Operators of RDD

(1) mapPartitionsWithIndex

Operates on each partition of the RDD; the partition number is available through the index parameter

val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
def func(index: Int, iter: Iterator[Int]): Iterator[String] = {
	iter.toList.map(x => "Index = "+index+", Value = "+x).iterator
}
rdd.mapPartitionsWithIndex(func).collect

(2)aggregate

Aggregation operation: aggregates locally within each partition first, then aggregates the partition results globally

val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
import scala.math.max
rdd.aggregate(0)(max(_,_), _+_)  // 9: per-partition max (3 and 6) summed; aggregate is an action, so no collect
rdd.aggregate(0)(_+_, _+_)       // 21
rdd.aggregate(10)(_+_, _+_)      // 51: the zero value 10 is added once per partition and once more globally
val rdd1 = sc.parallelize(List("a","b","c","d","e","f"),2)
rdd1.aggregate("*")(_+_, _+_)    // "**abc*def" or "**def*abc": the partition result order is not deterministic

(3)aggregateByKey

Similar to aggregate, but operates on key-value (pair) RDDs

val rdd = sc.parallelize(List(("cat",2), ("cat", 5), ("mouse", 4), ("cat", 12), ("dog", 12), ("mouse", 2)),2)
def func(index: Int, iter: Iterator[(String, Int)]): Iterator[String] = {
	iter.toList.map(x => "Index = "+index+", Value = "+x).iterator
}
rdd.mapPartitionsWithIndex(func).collect
import scala.math._
rdd.aggregateByKey(0)(math.max(_,_), _+_).collect  // per-key max within each partition, then summed: Array((dog,12), (cat,17), (mouse,6)), in some order
rdd.aggregateByKey(0)(_+_, _+_).collect            // plain per-key sum: Array((dog,12), (cat,19), (mouse,6)), in some order
(4)coalesce

By default coalesce does not perform a shuffle (shuffle = false); pass true to shuffle, which also allows increasing the number of partitions

val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
val rdd1 = rdd.coalesce(3,true)
rdd1.partitions.length

(5)repartition

Shuffles by default (repartition is coalesce with shuffle = true)

val rdd = sc.parallelize(List(1,2,3,4,5,6),2)
val rdd1 = rdd.repartition(3)
rdd1.partitions.length

(6) Other advanced operators

http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html

Source: blog.csdn.net/JavaDestiny/article/details/95788759