spark03
map: transforms each element with a function
mapPartitions: processes one whole partition (an iterator) at a time
foreach: action operator, applied to each element
foreachPartition: action operator, applied to each partition
collect
Typical data flow: nginx, flume, hdfs, hbase, spark, mysql
When inserting data into an external store (e.g. MySQL), foreachPartition is the better choice, because it only needs to establish one connection per partition rather than one per element.
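A minimal sketch of that pattern, where jdbcUrl, user, password and the table t are hypothetical placeholders (not part of these notes) and rdd is assumed to be an RDD[Int]:
rdd.foreachPartition { iter =>
  // one connection and one prepared statement per partition, not per element
  val conn = java.sql.DriverManager.getConnection(jdbcUrl, user, password)
  val stmt = conn.prepareStatement("INSERT INTO t(value) VALUES (?)")
  iter.foreach { v =>
    stmt.setInt(1, v)
    stmt.executeUpdate()
  }
  stmt.close()
  conn.close()
}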
How many jobs does a submitted application produce? One job per action operator: as many actions as you call, that many jobs (take is an exception, see below).
reduce: aggregates all elements into a single result
scala> var arr=Array(1,2,3,4,5,6) arr: Array[Int] = Array(1, 2, 3, 4, 5, 6)
scala> sc.makeRDD(arr,2) res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:27
scala> res0.reduce(_+_) res1: Int = 21
count operator: returns the number of elements
scala> res0.count res2: Long = 6
first: returns the first element
scala> res0.first res4: Int = 1
take operator: returns the first N elements
scala> sc.makeRDD(arr,3) res9: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:27
scala> res9.take(2) res10: Array[Int] = Array(1, 2)
scala> res9.take(4) res11: Array[Int] = Array(1, 2, 3, 4)
The take operator can generate multiple jobs.
Each pass of take submits a job via sc.runJob; it compares the number of elements collected so far with the number requested, and the number of partitions already scanned with the total number of partitions, and only submits another job (scanning more partitions) if more elements are still needed.
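A simplified sketch of that logic (not the actual Spark source), written for an RDD[Int] and scanning one extra partition per pass:
def takeSketch(rdd: org.apache.spark.rdd.RDD[Int], num: Int): Array[Int] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[Int]
  val totalParts = rdd.partitions.length
  var partsScanned = 0
  while (buf.size < num && partsScanned < totalParts) {
    val left = num - buf.size
    // each pass submits one more job via sc.runJob, scanning only the next partition
    val res = rdd.sparkContext.runJob(rdd, (it: Iterator[Int]) => it.take(left).toArray, Seq(partsScanned))
    res.foreach(chunk => buf ++= chunk)
    partsScanned += 1
  }
  buf.take(num).toArray
}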
top operator: returns the largest N elements
scala> var arr = Array(1,2,3,4,5,6,7,8,9)
scala> sc.makeRDD(arr,2)
scala> res9.top(3) res14: Array[Int] = Array(9, 8, 7)
top first sorts the data in descending order and then takes the first N elements
takeOrdered sorts in ascending order and then takes the first N elements
scala> res9.takeOrdered(3) res15: Array[Int] = Array(1, 2, 3)
countByKey: counts the number of values for each key
scala> var arr = Array(("a",1),("b",1),("a",1)) arr: Array[(String, Int)] = Array((a,1), (b,1), (a,1))
scala> sc.makeRDD(arr,3) res16: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[4] at makeRDD at <console>:27
scala> res16.countByKey() res17: scala.collection.Map[String,Long] = Map(a -> 2, b -> 1)
collect gathers the data to the driver, usually for testing/display; the result is an Array
collectAsMap collects the data from the executors to the driver as a Map (only available on pair RDDs)
scala> var arr = Array(1,2,3,4,5,6,7,8,9)
scala> sc.makeRDD(arr,3)
scala> res9.collect res18: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> res9.collectAsMap <console>:29: error: value collectAsMap is not a member of org.apache.spark.rdd.RDD[Int] res9.collectAsMap ^
scala> var arr=Array(("a",1),("b",1),("a",1)) arr: Array[(String, Int)] = Array((a,1), (b,1), (a,1))
scala> sc.makeRDD(arr,3) res14: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[9] at makeRDD at <console>:27
scala> res14.collectAsMap res15: scala.collection.Map[String,Int] = Map(b -> 1, a -> 1)
Note that collectAsMap keeps only one value per duplicate key, whereas countByKey counts them (Map(a -> 2, b -> 1)).
How operators are executed inside the executors (principle and process)
An RDD is a resilient distributed dataset; it is partitioned by default, and each partition is processed by one task (one thread).
Does the RDD itself hold the data? No: an RDD only describes how to compute its partitions; the data is materialized only when a job runs.
sortBy: sorts the elements
scala> res27.collect res30: Array[Int] = Array(1, 2, 3, 4, 6, 7, 9)
scala> res26.sortBy(t=>t,false,4) res31: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[27] at sortBy at <console>:29
scala> res31.partitions.size res32: Int = 4
sortBy can change the number of partitions, sort in either ascending or descending order, and it involves a shuffle
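Since res26/res27 above come from earlier, undisplayed steps, here is a self-contained sketch of the same thing (assuming a running SparkContext sc):
val nums = sc.makeRDD(Array(9, 1, 7, 3, 6, 2, 4), 2)
val sortedDesc = nums.sortBy(x => x, ascending = false, numPartitions = 4)  // descending, repartitioned to 4
sortedDesc.collect           // Array(9, 7, 6, 4, 3, 2, 1)
sortedDesc.partitions.size   // 4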
sortByKey: sorts by key
scala> var arr = Array((1,2),(2,1),(3,3),(6,9),(5,0)) arr: Array[(Int, Int)] = Array((1,2), (2,1), (3,3), (6,9), (5,0))
scala> sc.makeRDD(arr,3) res33: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[28] at makeRDD at <console>:27
scala> res33.sortByKey() res34: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[31] at sortByKey at <console>:29
scala> res34.collect res35: Array[(Int, Int)] = Array((1,2), (2,1), (3,3), (5,0), (6,9))
scala> res33.sortByKey def sortByKey(ascending: Boolean,numPartitions: Int): org.apache.spark.rdd.RDD[(Int, Int)]
scala> res33.sortByKey(false) res36: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[34] at sortByKey at <console>:29
scala> res36.collect res37: Array[(Int, Int)] = Array((6,9), (5,0), (3,3), (2,1), (1,2))
scala> res33.sortByKey(false,10) res38: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[37] at sortByKey at <console>:29
scala> res38.partitions.size res39: Int = 6
scala> res33.partitions.size res40: Int = 3
scala> res38.partitions.size res41: Int = 6
sortByKey produces a shuffle and its partitioner is a RangePartitioner; groupByKey and reduceByKey both use a HashPartitioner.
For example, sortByKey's RangePartitioner adapts to the range and amount of data: if the number of elements is greater than or equal to the requested number of partitions, the specified and actual partition counts agree; if there is less data than the requested number of partitions, the specified count and the actual count can differ (as above, where sortByKey(false,10) produced only 6 partitions).
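A quick way to check the partitioners in the same shell (a sketch; the exact printed strings will differ):
res33.sortByKey().partitioner         // Some(org.apache.spark.RangePartitioner@...)
res33.reduceByKey(_ + _).partitioner  // Some(org.apache.spark.HashPartitioner@...)
res33.groupByKey().partitioner        // Some(org.apache.spark.HashPartitioner@...)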
union intersection subtract
scala> var arr = Array(1,2,3,4,5) arr: Array[Int] = Array(1, 2, 3, 4, 5)
scala> var arr1 = Array(3,4,5,6,7) arr1: Array[Int] = Array(3, 4, 5, 6, 7)
scala> sc.makeRDD(arr) res42: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[38] at makeRDD at <console>:27
scala> sc.makeRDD(arr1) res43: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[39] at makeRDD at <console>:27
scala> res42 union res41 <console>:35: error: type mismatch; found : Int required: org.apache.spark.rdd.RDD[Int] res42 union res41 ^
scala> res42 union res43 res45: org.apache.spark.rdd.RDD[Int] = UnionRDD[40] at union at <console>:33
scala> res45.collect res46: Array[Int] = Array(1, 2, 3, 4, 5, 3, 4, 5, 6, 7)
union is only a concatenation of the two result sets, with no additional logic, so the number of partitions is the sum of the two RDDs' partition counts.
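A quick check of that claim, as a sketch with explicit partition counts (assuming a SparkContext sc):
val a = sc.makeRDD(Array(1, 2, 3, 4, 5), 2)
val b = sc.makeRDD(Array(3, 4, 5, 6, 7), 3)
(a union b).partitions.size   // 5 = 2 + 3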
scala> res42 intersection res43 res47: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[46] at intersection at <console>:33
scala> res47.collect res48: Array[Int] = Array(3, 4, 5)
scala> res42 substract res43 <console>:33: error: value substract is not a member of org.apache.spark.rdd.RDD[Int] res42 substract res43 ^
scala> res42 subtract res43 res50: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[50] at subtract at <console>:33
scala> res50.collect res51: Array[Int] = Array(1, 2)
intersection and subtract return data that was already in the original partitions, so the number of partitions does not change.
distinct
scala> var arr = Array(1,1,1,1,1,12,2,3,3,3,3,3,4,4,4,4,4) arr: Array[Int] = Array(1, 1, 1, 1, 1, 12, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)
scala> sc.makeRDD(arr,3) res52: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at makeRDD at <console>:27
scala> res52.distinct res53: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[54] at distinct at <console>:29
scala> res53.collect res54: Array[Int] = Array(3, 12, 4, 1, 2)
Implementing distinct with groupByKey:
scala> res52.map((_,null)) res55: org.apache.spark.rdd.RDD[(Int, Null)] = MapPartitionsRDD[55] at map at <console>:29
scala> res55.groupByKey() res56: org.apache.spark.rdd.RDD[(Int, Iterable[Null])] = ShuffledRDD[56] at groupByKey at <console>:31
scala> res55.map(_._1) res57: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[57] at map at <console>:31
scala> res57.collect res58: Array[Int] = Array(1, 1, 1, 1, 1, 12, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)
scala> res56.map(_._1) res59: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[58] at map at <console>:33
scala> res59.collect res60: Array[Int] = Array(3, 12, 4, 1, 2)
Implementing distinct with reduceByKey:
scala> res52.map((_,null)) res61: org.apache.spark.rdd.RDD[(Int, Null)] = MapPartitionsRDD[59] at map at <console>:29
scala> res61.reduceByKey((a,b)=>a) res62: org.apache.spark.rdd.RDD[(Int, Null)] = ShuffledRDD[60] at reduceByKey at <console>:31
scala> res62.map(_._1) res63: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[61] at map at <console>:33
scala> res63.collect res64: Array[Int] = Array(3, 12, 4, 1, 2)
scala> res52.distinct(10) res66: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[64] at distinct at <console>:29
scala> res66.partitions.size res67: Int = 10
Homework:
http://bigdata.edu360.cn/laozhang
http://bigdata.edu360.cn/laozhang
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laoduan
http://bigdata.edu360.cn/laoduan
http://javaee.edu360.cn/xiaoxu
http://javaee.edu360.cn/xiaoxu
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://bigdata.edu360.cn/laozhao
Global top N: the most-visited teachers across the whole school, regardless of subject.
Per-subject top N: the top-ranked teachers within each subject.
Approach 1: group by subject; approach 2: filter one subject at a time. A sketch of the grouping approach follows below.
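One possible sketch of the grouping approach, assuming the log lines above have been loaded as lines: RDD[String] and taking the top 3 (names such as lines, counted and perSubjectTop are illustrative only):
// parse "http://bigdata.edu360.cn/laozhang" into ((subject, teacher), 1) and count
val counted = lines.map { url =>
  val rest = url.split("://")(1)       // "bigdata.edu360.cn/laozhang"
  val subject = rest.split("\\.")(0)   // "bigdata"
  val teacher = rest.split("/")(1)     // "laozhang"
  ((subject, teacher), 1)
}.reduceByKey(_ + _)

// global top N: rank all teachers by count, regardless of subject
val globalTop = counted.sortBy(_._2, ascending = false).take(3)

// per-subject top N: group by subject, then sort each group's teachers in memory
val perSubjectTop = counted
  .map { case ((subject, teacher), cnt) => (subject, (teacher, cnt)) }
  .groupByKey()
  .mapValues(_.toList.sortBy(-_._2).take(3))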
join leftOuterJoin rightOuterJoin cogroup
The operations above are join-style relational operations; the RDDs involved must consist of key-value pairs (two-element tuples).
scala> var arr = Array(("zhangsan",200),("lisi",250),("zhaosi",300),("wangwu",400)) arr: Array[(String, Int)] = Array((zhangsan,200), (lisi,250), (zhaosi,300), (wangwu,400))
scala> var arr1 = Array(("zhangsan",30),("lisi",25),("zhaosi",12),("liuneng",5)) arr1: Array[(String, Int)] = Array((zhangsan,30), (lisi,25), (zhaosi,12), (liuneng,5))
scala> sc.makeRDD(arr,3) res68: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[65] at makeRDD at <console>:27
scala> sc.makeRDD(arr1,3) res69: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[66] at makeRDD at <console>:27
scala> res68 join res69 res70: org.apache.spark.rdd.RDD[(String, (Int, Int))] = MapPartitionsRDD[69] at join at <console>:33
scala> res70.collect res71: Array[(String, (Int, Int))] = Array((zhangsan,(200,30)), (lisi,(250,25)), (zhaosi,(300,12)))
scala> res68 leftOuterJoin res69 res72: org.apache.spark.rdd.RDD[(String, (Int, Option[Int]))] = MapPartitionsRDD[72] at leftOuterJoin at <console>:33
scala> res72.collect res73: Array[(String, (Int, Option[Int]))] = Array((zhangsan,(200,Some(30))), (wangwu,(400,None)), (lisi,(250,Some(25))), (zhaosi,(300,Some(12))))
scala> res68 rightOuterJoin res69 res74: org.apache.spark.rdd.RDD[(String, (Option[Int], Int))] = MapPartitionsRDD[75] at rightOuterJoin at <console>:33
scala> res74.collect res75: Array[(String, (Option[Int], Int))] = Array((zhangsan,(Some(200),30)), (lisi,(Some(250),25)), (zhaosi,(Some(300),12)), (liuneng,(None,5)))
scala> res68 cogroup res69 res76: org.apache.spark.rdd.RDD[(String, (Iterable[Int], Iterable[Int]))] = MapPartitionsRDD[77] at cogroup at <console>:33
scala> res76.collect res77: Array[(String, (Iterable[Int], Iterable[Int]))] = Array((zhangsan,(CompactBuffer(200),CompactBuffer(30))), (wangwu,(CompactBuffer(400),CompactBuffer())), (lisi,(CompactBuffer(250),CompactBuffer(25))), (zhaosi,(CompactBuffer(300),CompactBuffer(12))), (liuneng,(CompactBuffer(),CompactBuffer(5))))