Good Programmer big data learning route: Resilient Distributed Datasets (RDD)
RDD definition
An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable (both data and metadata), partitionable collection whose elements can be computed in parallel.
RDD features: automatic fault tolerance, location-aware scheduling, and scalability.
RDD properties
1. A list of partitions (slices)
Partitions are the basic units of the dataset. Each partition of an RDD is processed by one compute task, so the number of partitions determines the granularity of parallel computation. The user can specify the number of partitions when creating an RDD; if none is specified, a default is used, which is the number of CPU cores allocated to the program.
2. A function for computing each partition
Spark computes an RDD partition by partition, and every RDD implements a compute function for this purpose. The compute function composes iterators, so the result of each intermediate step does not need to be saved.
3. Dependencies between RDDs
Each transformation of an RDD produces a new RDD, so successive RDDs form a pipeline-like chain of dependencies.
Fault tolerance: when the data of some partitions is lost, Spark uses these dependencies to recompute only the lost partitions, rather than recomputing every partition of the RDD.
4. A Partitioner
This is the partitioning function of the RDD. Spark currently implements two kinds: the hash-based HashPartitioner and the range-based RangePartitioner. Only key-value RDDs can have a Partitioner; for a non-key-value RDD the Partitioner is None. The Partitioner determines not only the number of partitions of the RDD itself, but also the number of partitions of the parent RDD's shuffle output.
5. A list of preferred locations
This list stores the preferred location for accessing each partition. -> locality principle
For an HDFS file, the list holds the locations of the blocks that make up each partition. Following the idea that "moving computation is better than moving data", Spark tries, when scheduling tasks, to assign each task to a node that stores the data block it will process.
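Several of the properties above can be inspected directly on an RDD. The following is a minimal sketch, assuming Spark is on the classpath and the program runs in local mode. The object name RddPropertiesSketch, the helper lineageDepth, and the host name hadoop02 are hypothetical and used only for illustration; the host names are scheduling hints and do not need to exist for this code to run.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddPropertiesSketch {
  // Hypothetical helper: number of RDDs on the longest dependency chain ending at `rdd`.
  def lineageDepth(rdd: RDD[_]): Int =
    1 + rdd.dependencies.map(d => lineageDepth(d.rdd)).foldLeft(0)((a, b) => math.max(a, b))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddPropertiesSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Property 1, partitions: three partitions are requested explicitly.
      val pairs = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)), 3)
      println(pairs.partitions.length)                 // 3

      // Property 3, dependencies: a transformation points back at its parent RDD.
      val doubled = pairs.mapValues(_ * 2)
      println(doubled.dependencies.head.rdd == pairs)  // true
      println(lineageDepth(doubled))                   // 2
      println(doubled.toDebugString)                   // human-readable lineage

      // Property 4, partitioner: absent until one is applied to a key-value RDD.
      println(pairs.partitioner)                       // None
      val hashed = pairs.partitionBy(new HashPartitioner(4))
      println(hashed.partitioner.isDefined)            // true
      println(hashed.partitions.length)                // 4

      // Property 5, preferred locations: this makeRDD overload attaches
      // location hints to each element (one partition per element).
      val located: RDD[Int] = sc.makeRDD(Seq((1, Seq("hadoop01")), (2, Seq("hadoop02"))))
      located.partitions.foreach(p => println(located.preferredLocations(p)))
    } finally {
      sc.stop()
    }
  }
}
```

An RDD created with parallelize has no partitioner even when its elements are key-value pairs; one appears only after partitionBy or a shuffle operation such as reduceByKey.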
RDD operator types
1. Transformation -> records the computation (the computation method and its parameters are recorded; nothing is executed yet)
Transformation | Meaning
map(func) | Returns a new RDD formed by passing each element of the source through the function func.
filter(func) | Returns a new RDD formed by the source elements for which func returns true.
flatMap(func) | Similar to map, but each input element can be mapped to zero or more output elements (so func should return a sequence rather than a single element).
mapPartitions(func) | Similar to map, but runs separately on each partition of the RDD, so on an RDD of type T, func must be of type Iterator[T] => Iterator[U].
mapPartitionsWithIndex(func) | Similar to mapPartitions, but func also takes an integer giving the index of the partition, so on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U].
sample(withReplacement, fraction, seed) | Samples the portion of the data given by fraction, with or without replacement; seed specifies the seed of the random number generator.
union(otherDataset) | Returns a new RDD containing the union of the source RDD and the argument RDD.
intersection(otherDataset) | Returns a new RDD containing the intersection of the source RDD and the argument RDD. (diff -> set difference; on an RDD the set difference is computed with subtract.)
distinct([numTasks]) | Returns a new RDD with the duplicates of the source RDD removed. numTasks can be used to change the number of partitions.
groupByKey([numTasks]) | Called on an RDD of (K, V) pairs, returns an RDD of (K, Iterable[V]) pairs.
reduceByKey(func, [numTasks]) | Called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs in which the values for each key are aggregated with the given reduce function. As with groupByKey, the number of reduce tasks can be set through the optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) |
sortByKey([ascending], [numTasks]) | Called on an RDD of (K, V) pairs where K implements the Ordered interface, returns an RDD of (K, V) pairs sorted by key.
sortBy(func, [ascending], [numTasks]) | Similar to sortByKey, but more flexible.
join(otherDataset, [numTasks]) | Called on RDDs of type (K, V) and (K, W), returns an RDD of (K, (V, W)) pairs containing all pairs of elements that share the same key.
cogroup(otherDataset, [numTasks]) | Called on RDDs of type (K, V) and (K, W), returns an RDD of type (K, (Iterable[V], Iterable[W])).
cartesian(otherDataset) | Cartesian product.
pipe(command, [envVars]) |
coalesce(numPartitions) |
repartition(numPartitions) | Repartitions the RDD.
repartitionAndSortWithinPartitions(partitioner) |
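Two of the transformations in the table, mapPartitionsWithIndex and aggregateByKey, are easier to follow with a concrete case. The following is a minimal sketch, assuming Spark is on the classpath and local mode; the object name TransformationSketch and its helper methods are hypothetical. It tags each element with its partition index, and computes a per-key (sum, count) pair, which shows why aggregateByKey takes a zero value whose type can differ from the value type.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TransformationSketch {
  // Tag every element with the index of the partition that holds it.
  def tagWithPartition(rdd: RDD[Int]): RDD[(Int, Int)] =
    rdd.mapPartitionsWithIndex((idx, it) => it.map(x => (idx, x)))

  // Per-key (sum, count), starting from the zero value (0, 0).
  def sumAndCount(rdd: RDD[(String, Int)]): RDD[(String, (Int, Int))] =
    rdd.aggregateByKey((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold one value into a partition-local accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2))   // combOp: merge accumulators from different partitions

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TransformationSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val nums = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)
      // Prints ArrayBuffer((0,1), (0,2), (0,3), (1,4), (1,5), (1,6))
      println(tagWithPartition(nums).collect().toBuffer)

      val pairs = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)), 2)
      // tom -> (3,2), jerry -> (3,1), kitty -> (2,1); the order of keys is not guaranteed
      println(sumAndCount(pairs).collect().toMap)
    } finally {
      sc.stop()
    }
  }
}
```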
2. Action -> triggers the creation of a job (each action operator corresponds to one job)
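Action | Meaning
reduce(func) | Aggregates all elements of the RDD using the function func, which must be commutative and associative.
collect() | Returns all elements of the dataset as an array at the driver program.
count() | Returns the number of elements in the RDD.
first() | Returns the first element of the RDD (similar to take(1)).
take(n) | Returns an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed]) | Returns an array of num elements sampled at random from the dataset, with or without replacement; seed specifies the seed of the random number generator.
takeOrdered(n, [ordering]) | Similar to top, but returns the elements in the opposite order to top.
saveAsTextFile(path) | Saves the elements of the dataset as a text file on HDFS or another supported file system. Spark calls toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path) | Saves the elements of the dataset in Hadoop SequenceFile format under the given directory, which can be on HDFS or any other Hadoop-supported file system.
saveAsObjectFile(path) |
countByKey() | For an RDD of type (K, V), returns a map of (K, Long) pairs giving the number of elements for each key.
foreach(func) | Runs the function func on each element of the dataset.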
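The split between transformations, which only record the computation, and actions, which trigger a job, can be observed with an accumulator. The following is a minimal sketch, assuming Spark is on the classpath and local mode; the object name LazyEvaluationSketch and the method countCalls are hypothetical. It counts how many times the function passed to map has run before and after an action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns how many elements the map function had processed
  // before and after the action was called.
  def countCalls(sc: SparkContext): (Long, Long) = {
    val calls = sc.longAccumulator("mapCalls")
    val doubled = sc.parallelize(List(24, 56, 3, 2, 1)).map { x =>
      calls.add(1L)
      x * 2
    }
    val before = calls.sum   // the transformation is only recorded, nothing has run
    doubled.count()          // action: triggers a job that evaluates the map
    val after = calls.sum
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LazyEvaluationSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(countCalls(sc))   // (0,5)
    } finally {
      sc.stop()
    }
  }
}
```

Calling a second action on the same RDD would run the map function again, because the intermediate result is not kept unless the RDD is cached.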
Creating RDDs
On Linux, start the Spark shell:
/usr/local/spark.../bin/spark-shell \
--master spark://hadoop01:7077 \
--executor-memory 512m \
--total-executor-cores 2
Or in a Maven project:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object lx03 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf()
      .setAppName("SparkAPI")
      .setMaster("local[*]")
    val sc: SparkContext = new SparkContext(conf)
    // Create an RDD by parallelizing a local collection
    val rdd1: RDD[Int] = sc.parallelize(List(24, 56, 3, 2, 1))
    // Multiply each element of rdd1 by 2, then sort in ascending order
    val rdd2: RDD[Int] = rdd1.map(_ * 2).sortBy(x => x, true)
    println(rdd2.collect().toBuffer)
    // Keep only the elements greater than or equal to 10
    // val rdd3: RDD[Int] = rdd2.filter(_ >= 10)
    // println(rdd3.collect().toBuffer)
    sc.stop()
  }
}
Exercise 2
val rdd1 = sc.parallelize(Array("a b c", "d e f", "h i j"))
// Split each element of rdd1, then flatten
val rdd2 = rdd1.flatMap(_.split(' '))
rdd2.collect
// A more complex case: a list of lists
val rdd1 = sc.parallelize(List(List("a b c", "a b b"), List("e f g", "a f g"), List("h i j", "a a b")))
// Split each string inside each element of rdd1, then flatten
val rdd2 = rdd1.flatMap(_.flatMap(_.split(" ")))
Exercise 3
val rdd1 = sc.parallelize(List(5, 6, 4, 3))
val rdd2 = sc.parallelize(List(1, 2, 3, 4))
// Union
val rdd3 = rdd1.union(rdd2)
// Intersection
val rdd4 = rdd1.intersection(rdd2)
// Remove duplicates
rdd3.distinct.collect
rdd4.collect
Exercise 4
val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
// Join: elements with the same key are combined into a new (key, (value1, value2)) pair
val rdd3 = rdd1.join(rdd2)
// Result: Array[(String,(Int,Int))] = Array((tom,(1,1)),(jerry,(3,2)))
rdd3.collect
// Left outer join and right outer join
val rdd3 = rdd1.leftOuterJoin(rdd2)
rdd3.collect
val rdd3 = rdd1.rightOuterJoin(rdd2)
rdd3.collect
// Union
val rdd4 = rdd1 union rdd2
// Group by key (collect the grouped RDD, not rdd4 itself)
rdd4.groupByKey.collect
// Word count, once with groupByKey and once with reduceByKey
val rdd3 = rdd1 union rdd2
rdd3.groupByKey().mapValues(_.sum).collect
rdd3.reduceByKey(_ + _).collect
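Difference between groupByKey and reduceByKey
reduceByKey is special: it first aggregates locally within each partition and then aggregates globally, and we only need to pass in a single aggregation function, which is used for both steps. groupByKey performs no local aggregation: every (key, value) pair is shuffled before the values are grouped.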
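The local-then-global aggregation of reduceByKey can be modelled with plain Scala collections. The following is a simplified sketch and not Spark's actual implementation: each inner Seq stands for one partition, and the object name LocalCombineSketch and its methods combineLocally and mergeGlobally are hypothetical. It shows that the same function serves both steps and that fewer records need to cross partitions after local combining.

```scala
object LocalCombineSketch {
  type Pair = (String, Int)

  // Local step: combine the values of each key inside one partition.
  def combineLocally(partition: Seq[Pair], f: (Int, Int) => Int): Map[String, Int] =
    partition.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }

  // Global step: merge the partial results of all partitions with the same function.
  def mergeGlobally(partials: Seq[Map[String, Int]], f: (Int, Int) => Int): Map[String, Int] =
    partials.flatten.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }

  def main(args: Array[String]): Unit = {
    // Two simulated partitions holding six records in total.
    val partitions = Seq(
      Seq(("tom", 1), ("tom", 2), ("jerry", 3)),
      Seq(("jerry", 2), ("tom", 1), ("shuke", 2)))

    val partials = partitions.map(p => combineLocally(p, _ + _))
    // Records that would be shuffled: 6 without local combining, 5 with it.
    println(partitions.map(_.size).sum)
    println(partials.map(_.size).sum)
    // Final counts: tom -> 4, jerry -> 5, shuke -> 2
    println(mergeGlobally(partials, _ + _))
  }
}
```

The saving grows with the number of repeated keys per partition, which is why reduceByKey is usually preferred over groupByKey followed by a sum when the data is large.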
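Exercise 5
val rdd1 = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
// cogroup
val rdd3 = rdd1.cogroup(rdd2)
// Note the difference between cogroup and groupByKey
rdd3.collect
// reduce aggregation (separate names so that rdd1 and rdd2 stay key-value RDDs)
val nums = sc.parallelize(List(1, 2, 3, 4, 5))
val sum = nums.reduce(_ + _)
// Sum the values per key; rdd4 is the input of the sort below
val rdd4 = rdd1.union(rdd2).reduceByKey(_ + _)
// Sort by value in descending order: swap to (value, key), sort by key, swap back
val rdd5 = rdd4.map(t => (t._2, t._1)).sortByKey(false).map(t => (t._2, t._1))
rdd5.collect
// Cartesian product
val rdd3 = rdd1.cartesian(rdd2)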
Counting elements
scala> val rdd1 = sc.parallelize(List(2,3,1,5,7,3,4))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> rdd1.count
res0: Long = 7
top: sorts in descending order, then takes the first N
scala> rdd1.top(3)
res1: Array[Int] = Array(7, 5, 4)
scala> rdd1.top(0)
res2: Array[Int] = Array()
scala> rdd1.top(100)
res3: Array[Int] = Array(7, 5, 4, 3, 3, 2, 1)
take: returns the first N elements of the original collection, or all of them if there are fewer than N
scala> rdd1.take(3)
res4: Array[Int] = Array(2, 3, 1)
scala> rdd1.take(100)
res5: Array[Int] = Array(2, 3, 1, 5, 7, 3, 4)
scala> rdd1.first
res6: Int = 2
takeOrdered: sorts in ascending order, then takes the first N
scala> rdd1.takeOrdered(3)
res7: Array[Int] = Array(1, 2, 3)
scala> rdd1.takeOrdered(30)
res8: Array[Int] = Array(1, 2, 3, 3, 4, 5, 7)
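top and takeOrdered differ only in the direction of the ordering they apply, and both accept an explicit Ordering. The following is a minimal sketch, assuming Spark is on the classpath and local mode; the object name TopVsTakeOrderedSketch and the method viaReversedOrdering are hypothetical. It reuses the data of rdd1 from the transcripts and obtains the result of top through takeOrdered with a reversed ordering.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TopVsTakeOrderedSketch {
  // takeOrdered with a reversed ordering returns the n largest elements, like top.
  def viaReversedOrdering(rdd: RDD[Int], n: Int): Array[Int] =
    rdd.takeOrdered(n)(Ordering[Int].reverse)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TopVsTakeOrderedSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(List(2, 3, 1, 5, 7, 3, 4))
      println(rdd1.top(3).toList)                  // List(7, 5, 4)
      println(rdd1.takeOrdered(3).toList)          // List(1, 2, 3)
      println(viaReversedOrdering(rdd1, 3).toList) // List(7, 5, 4)
    } finally {
      sc.stop()
    }
  }
}
```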
Two ways to create an RDD
1. By parallelizing a collection (two partitions by default in this setup, matching the two cores given to the shell)
Specifying the number of partitions manually
scala> val rdd1 = sc.parallelize(List(1,2,3,5))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:27
scala> rdd1.partitions.length // get the number of partitions
res9: Int = 2
scala> val rdd1 = sc.parallelize(List(1,2,3,5),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:27
scala> rdd1.partitions.length
res10: Int = 3
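To see which elements end up in which partition, glom can be used: it turns each partition into an array. The following is a minimal sketch, assuming Spark is on the classpath and local mode; the object name PartitionContentsSketch and the method partitionContents are hypothetical. It uses the same list and partition count as the transcript.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionContentsSketch {
  // One inner list per partition, in partition order.
  def partitionContents(rdd: RDD[Int]): List[List[Int]] =
    rdd.glom().collect().map(_.toList).toList

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PartitionContentsSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(List(1, 2, 3, 5), 3)
      println(rdd1.partitions.length)    // 3
      println(partitionContents(rdd1))   // List(List(1), List(2), List(3, 5))
    } finally {
      sc.stop()
    }
  }
}
```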
2. Using textFile to read data from a file storage system
scala> val rdd2 = sc.textFile("hdfs://hadoop01:9000/wordcount/input/a.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[11] at reduceByKey at <console>:27
scala> rdd2.collect // call an action to compute the RDD and show the result
res11: Array[(String, Int)] = Array((hello,6), (beijing,1), (java,1), (gp1808,1), (world,1), (good,1), (qianfeng,1))
scala> val rdd2 = sc.textFile("hdfs://hadoop01:9000/wordcount/input/a.txt",4).flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[26] at reduceByKey at <console>:27
scala> rdd2.partitions.length // the number of partitions can also be specified manually
res15: Int = 4