map:
The map operator transforms the structure of the data, one element at a time.
mapPartitions:
The mapPartitions operator transforms the structure of the data in the same way, except that it processes one whole partition at a time.
It is usually more efficient than map because it reduces the number of interactions with the executor, but holding an entire partition at once may cause an out-of-memory error (OOM); see the sketch after the mapPartitions example below.
mapPartitionsWithIndex:
The mapPartitionsWithIndex operator performs the same structural transformation, but it also exposes the partition number, i.e. the index, to the function.
flatMap:
The flatMap operator applies a map-style operation and then flattens the result.
map:
val listRDD: RDD[Int] = sc.makeRDD(1 to 10)
val mapRDD: RDD[Int] = listRDD.map(_ * 2)
println(mapRDD.collect.mkString(","))
// 2,4,6,8,10,12,14,16,18,20
mapPartitions:
val listRDD: RDD[Int] = sc.makeRDD(1 to 10)
val mapPartitionsRDD = listRDD.mapPartitions(_.map(_ * 2))
println(mapPartitionsRDD.collect.mkString(","))
// 2,4,6,8,10,12,14,16,18,20
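A minimal sketch of why mapPartitions reduces per-element overhead: costly setup work runs once per partition instead of once per element. expensiveSetup() below is a hypothetical placeholder for something like opening a database connection; it is not part of the original example.
// expensiveSetup() is a hypothetical stand-in for costly per-partition initialization
def expensiveSetup(): Int => Int = x => x * 2
val partitionedRDD: RDD[Int] = sc.makeRDD(1 to 10, 2)
val resultRDD: RDD[Int] = partitionedRDD.mapPartitions { iter =>
  val f = expensiveSetup() // runs once per partition, not once per element
  iter.map(f)              // the whole partition is handled through one iterator, which is also where the OOM risk comes from
}
println(resultRDD.collect.mkString(","))
// 2,4,6,8,10,12,14,16,18,20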
mapPartitionsWithIndex:
val listRDD: RDD[Int] = sc.makeRDD(1 to 5, 3)
val tupleRDD: RDD[(Int, String)] = listRDD.mapPartitionsWithIndex {
  case (num, datas) => datas.map(_ -> s"partition:$num")
}
println(tupleRDD.collect().mkString(","))
// (1,partition:0),(2,partition:1),(3,partition:1),(4,partition:2),(5,partition:2)
flatMap:
val arrayRDD = sc.makeRDD(Array(List(1,2), List(2,3,4)))
val arrays: RDD[Int] = arrayRDD.flatMap(datas => datas)
println(arrays.collect().mkString(","))
// 1,2,2,3,4
【3】 glom
The glom operator gathers all the data of each partition into a single Array.
val arrayRDD: RDD[Int] = sc.makeRDD(1 to 8, 2)
val glomRDD: RDD[Array[Int]] = arrayRDD.glom()
glomRDD.foreach(glom => println(glom.mkString(",")))
// 1,2,3,4
// 5,6,7,8
【4】 groupBy
The groupBy operator groups the data according to a user-supplied function; it involves a shuffle.
val arrayRDD: RDD[Int] = sc.makeRDD(1 to 5)
val groupByRDD: RDD[(Int, Iterable[Int])] = arrayRDD.groupBy(_ % 2)
println(groupByRDD.collect.mkString(","))
// (0,CompactBuffer(2, 4)),(1,CompactBuffer(1, 3, 5))
【5】 filter
The filter operator filters the data with a user-supplied function, keeping only the elements for which it returns true.
val arrayRDD: RDD[Int] = sc.makeRDD(1 to 4)
val filterRDD: RDD[Int] = arrayRDD.filter(_ % 2 == 0)
println(filterRDD.collect.mkString(","))
// 2,4
【6】 distinct
The distinct operator removes duplicate elements; it involves a shuffle.
val arrayRDD: RDD[Int] = sc.makeRDD(Array(1,2,3,2,1,6,3,7,8,8))
// distinct()  or  distinct(numPartitions: scala.Int)
// distinct removes duplicates; since deduplication shrinks the data, the default number of partitions can also be changed, as in the sketch below
val distinctRDD: RDD[Int] = arrayRDD.distinct()
println(distinctRDD.collect.mkString(","))
// 6,1,7,8,2,3
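A minimal sketch of the distinct(numPartitions) overload mentioned in the comment above; the target of 2 partitions is an arbitrary choice for illustration.
val distinct2RDD: RDD[Int] = arrayRDD.distinct(2) // deduplicate and repartition to 2 partitions in one step
println("partitions after distinct(2) = " + distinct2RDD.partitions.size)
// partitions after distinct(2) = 2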
【7】 coalesce & repartition
The coalesce operator reduces the number of partitions; whether to perform a shuffle can be chosen as needed (see the sketch after the coalesce example).
The repartition operator changes the number of partitions (it can grow as well as shrink them); internally it calls coalesce with the shuffle enabled.
coalesce:
val arrayRDD: RDD[Int] = sc.makeRDD(0 to 16, 4)
println("partitions before coalesce = " + arrayRDD.partitions.size)
// coalesce does not shuffle by default; a shuffle can be requested explicitly
val coalesceRDD: RDD[Int] = arrayRDD.coalesce(3)
println("partitions after coalesce = " + coalesceRDD.partitions.size)
// partitions before coalesce = 4
// partitions after coalesce = 3
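A minimal sketch of the optional shuffle flag mentioned above: without shuffle = true, coalesce can only reduce the partition count, so growing to 6 partitions (an arbitrary number chosen for illustration) requires the flag.
val grownRDD: RDD[Int] = arrayRDD.coalesce(6, shuffle = true) // a shuffle is required to increase partitions
println("partitions after coalesce(6, shuffle = true) = " + grownRDD.partitions.size)
// partitions after coalesce(6, shuffle = true) = 6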
repartition:
val arrayRDD: RDD[Int] = sc.makeRDD(0 to 16, 4)
println("partitions before repartition = " + arrayRDD.partitions.size)
val repartitionRDD: RDD[Int] = arrayRDD.repartition(3)
println("partitions after repartition = " + repartitionRDD.partitions.size)
// partitions before repartition = 4
// partitions after repartition = 3
【8】 sortBy
The sortBy operator sorts the data. The second, Boolean parameter selects ascending (true) or descending (false) order, and the third, Int parameter is the number of partitions, which defaults to the current number of partitions.
val arrayRDD: RDD[Int] = sc.makeRDD(Array(3,4,1,2,5))
val sortByRDD: RDD[Int] = arrayRDD.sortBy(x => x)
println(sortByRDD.collect.mkString(","))
// 1,2,3,4,5
val sortByRDD2: RDD[Int] = arrayRDD.sortBy(x => x, false)
println(sortByRDD2.collect.mkString(","))
// 5,4,3,2,1
【9】 union & subtract & intersection & cartesian & zip
/**
 * union: merge the two RDDs
 */
val rdd1: RDD[Int] = sc.makeRDD(1 to 5)
val rdd2: RDD[Int] = sc.makeRDD(5 to 10)
val rdd3: RDD[Int] = rdd1.union(rdd2)
println(rdd3.collect.mkString(","))

/**
 * subtract: difference relative to the caller, keeping the elements of this RDD that do not appear in the other RDD
 */
val rdd4: RDD[Int] = sc.makeRDD(3 to 8)
val rdd5: RDD[Int] = rdd4.subtract(rdd1)
println(rdd5.collect.mkString(","))

/**
 * intersection
 */
val rdd6: RDD[Int] = sc.makeRDD(1 to 7)
val rdd8: RDD[Int] = rdd6.intersection(rdd2)
println(rdd8.collect.mkString(","))

/**
 * cartesian: Cartesian product
 */
val rdd9: RDD[(Int, Int)] = rdd1.cartesian(rdd2)
println(rdd9.collect.mkString(","))

/**
 * zip: in Spark the two RDDs must have the same number of partitions and the same number of elements in each partition
 */
val rdd10: RDD[Int] = sc.makeRDD(Array(1,2,3), 3)
val rdd11: RDD[String] = sc.makeRDD(Array("a","b","c"), 3)
val rdd12: RDD[(Int, String)] = rdd10.zip(rdd11)
println(rdd12.collect().mkString(","))
【11】 partitionBy
The partitionBy operator repartitions a key-value RDD with a user-supplied partitioner. Spark provides two partitioners out of the box, HashPartitioner and RangePartitioner; a RangePartitioner sketch follows the example below.
val rdd: RDD[(Int, String)] = sc.makeRDD(Array((1,"a"),(2,"b"),(3,"c"),(4,"d")), 4)
val partitionByRDD: RDD[(Int, String)] = rdd.partitionBy(new HashPartitioner(2))
partitionByRDD.glom().foreach(array => println(array.mkString(",")))
// (2,b),(4,d)
// (1,a),(3,c)
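The description above also mentions RangePartitioner, so here is a minimal sketch of it (the partition count of 2 is just for illustration). RangePartitioner samples the keys and assigns contiguous key ranges to partitions, so the keys come out roughly ordered across partitions.
import org.apache.spark.RangePartitioner
val rangeRDD: RDD[(Int, String)] = rdd.partitionBy(new RangePartitioner(2, rdd))
rangeRDD.glom().foreach(array => println(array.mkString(",")))
// expected: keys 1 and 2 in one partition, keys 3 and 4 in the other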
【12】groupByKey
The groupByKey operator groups values by key; it involves a shuffle.
val values: RDD[(String, Int)] = sc.makeRDD(Array(("one",1),("two",1),("three",1),("three",1)))
val groupByKey: RDD[(String, Iterable[Int])] = values.groupByKey()
println(groupByKey.collect().mkString(","))
// (three,CompactBuffer(1, 1)),(two,CompactBuffer(1)),(one,CompactBuffer(1))
【13】 reduceByKey
The reduceByKey operator combines the values that share a key using a user-supplied function.
val values: RDD[(String, Int)] = sc.makeRDD(Array(("one",1),("two",1),("two",1),("three",1)))
val reduceByKey: RDD[(String, Int)] = values.reduceByKey(_ + _)
println(reduceByKey.collect().mkString(","))
// (three,1),(two,2),(one,1)
【14】 aggregateByKey & foldByKey
The aggregateByKey operator takes three parameters: parameter 1 is the zeroValue, parameter 2 is the intra-partition function, and parameter 3 is the inter-partition function.
When the intra-partition and inter-partition functions are the same, foldByKey can be used instead.
The foldByKey operator takes two parameters: parameter 1 is the zeroValue and parameter 2 is the combining function.
aggregateByKey:
val values = sc.makeRDD(Array(("a",3),("a",2),("c",4),("b",3),("c",6),("c",8)),2)
val aggregateByKeyRDD: RDD[(String, Int)]= values.aggregateByKey(0)(math.max(_, _), _ + _)println(aggregateByKeyRDD.collect().mkString(","))//(a,3)//(b,3)//(c,12)
foldByKey:
val values = sc.makeRDD(Array(("a",3),("a",2),("c",4),("b",3),("c",6),("c",8)),2)
val foldByKeyRDD: RDD[(String, Int)]= values.foldByKey(0)(_ + _)println(foldByKeyRDD.collect().mkString(","))
【15】 combineByKey
The combineByKey operator takes three parameters: parameter 1 is a function that converts each value into the initial combiner structure, parameter 2 is the intra-partition function, and parameter 3 is the inter-partition function.
val values = sc.makeRDD(Array(("a",88),("a",91),("a",95),("b",95),("b",93),("b",98)),2)
val combineByKeyRDD:RDD[(String,(Int,Int))]= values.combineByKey(_ ->1,(x,y)=>(x._1 + y)->(x._2 +1),(x,y)=>(x._1 + y._1)->(x._2 + y._2))
combineByKeyRDD:RDD.foreach(println)//(a,(274,3))//(b,(286,3))
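A typical follow-up, not part of the original example: since each value is a (sum, count) pair, a per-key average can be derived with mapValues.
val avgRDD: RDD[(String, Double)] = combineByKeyRDD.mapValues { case (sum, count) => sum.toDouble / count }
avgRDD.foreach(println)
// roughly (a,91.33...) and (b,95.33...)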
【16】 sortByKey
The sortByKey operator takes two parameters: parameter 1 selects ascending (true) or descending (false) order, and parameter 2 is the number of partitions.
val values = sc.makeRDD(Array((3,"aa"),(6,"cc"),(2,"bb"),(1,"dd")))
val sortByKey: RDD[(Int, String)]= values.sortByKey(true)println(sortByKey.collect.mkString(","))//(1,dd),(2,bb),(3,aa),(6,cc)
【17】 mapValues
The mapValues operator transforms every value while leaving the keys unchanged.
val values = sc.makeRDD(Array((1,"a"),(1,"d"),(2,"b"),(3,"c")))
val mapValues: RDD[(Int, String)]= values.mapValues("<"+ _ +">")println(mapValues.collect.mkString(","))//(1,<a>),(1,<d>),(2,<b>),(3,<c>)
【18】 join
The join operator connects two key-value RDDs: when a key exists in both, the two values are combined into a tuple2, which is then paired with the key to form a new tuple2. Keys that fail to match are dropped from the result.
val values1 = sc.makeRDD(Array((1,"a"),(2,"b"),(3,"c")))
val values2 = sc.makeRDD(Array((1,4),(2,5),(3,6)))
val join: RDD[(Int, (String, Int))] = values1.join(values2)
println(join.collect.mkString(","))
// (1,(a,4)),(2,(b,5)),(3,(c,6))
【19】 cogroup
The cogroup operator connects two key-value RDDs: for each key, the values from each RDD are gathered into their own Iterable, the two Iterables are combined into a tuple2, and that tuple2 is paired with the key. A key that matches nothing on the other side is still emitted, paired with an empty Iterable (a sketch of this case follows the example below).
val values1 = sc.makeRDD(Array((1,"a"),(2,"b"),(3,"c")))
val values2 = sc.makeRDD(Array((1,4),(2,5),(3,6)))
val cogroup: RDD[(Int, (Iterable[String], Iterable[Int]))] = values1.cogroup(values2)
println(cogroup.collect.mkString(","))
// (1,(CompactBuffer(a),CompactBuffer(4))),(2,(CompactBuffer(b),CompactBuffer(5))),(3,(CompactBuffer(c),CompactBuffer(6)))
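A minimal sketch of the non-matching case described above, reusing values1 and adding a key (4) that exists on only one side.
val values3 = sc.makeRDD(Array((1,4),(4,7)))
val cogroup2: RDD[(Int, (Iterable[String], Iterable[Int]))] = values1.cogroup(values3)
println(cogroup2.collect.mkString(","))
// keys 2 and 3 come back with an empty Iterable on the Int side, and key 4 with an empty Iterable on the String side, e.g. (4,(CompactBuffer(),CompactBuffer(7)))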
【20】 reduce
val values = sc.makeRDD(0 to 10)
val reduce: Int = values.reduce(_ + _)
println(reduce)
// 55
【21】 collect & count & first
/**
 * first
 */
val values = sc.makeRDD(0 to 10)
val first: Int = values.first()
println(first)
// 0

/**
 * count
 */
val count: Long = values.count()
println(count)
// 11

/**
 * collect
 */
val collect: Array[Int] = values.collect()
println(collect.mkString(","))
// 0,1,2,3,4,5,6,7,8,9,10
【22】 take & takeOrdered
/**
 * take: grab the first n elements without sorting
 */
val countRDD: RDD[Int] = sc.makeRDD(Array(3,1,2,4,3,5), 2)
val take: Array[Int] = countRDD.take(3)
println(take.mkString(","))
// 3,1,2

/**
 * takeOrdered: grab the smallest n elements after sorting
 */
val takeOrdered: Array[Int] = countRDD.takeOrdered(3)
println(takeOrdered.mkString(","))
// 1,2,3
【23】 aggregate & fold
/**
 * aggregate is similar to aggregateByKey, but the zeroValue is applied differently:
 * in aggregateByKey it is used once per partition (intra-partition only),
 * while in aggregate it is used once per partition plus one more time when the
 * partition results are combined, so it appears (numPartitions + 1) times in total.
 */
val aggregateRDD: Int = values.aggregate(0)(_ + _, _ + _)
println(aggregateRDD)
// 55
val aggregateRDD2: Int = values.aggregate(10)(_ + _, _ + _)
println(aggregateRDD2)
// 85

/**
 * fold: when aggregate's intra-partition and inter-partition functions are the same,
 * fold can be used instead; it works the same way as aggregate.
 */
val foldRDD: Int = values.fold(0)(_ + _)
println(foldRDD)
// 55
val foldRDD2: Int = values.fold(10)(_ + _)
println(foldRDD2)
// 85
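A quick sanity check of the two 85s above, assuming values = sc.makeRDD(0 to 10) happened to be split into 2 partitions (the actual partition count depends on the environment's default parallelism):
// sum of 0 to 10                                       = 55
// zeroValue (10) applied once in each of 2 partitions  = 10 * 2 = 20
// zeroValue (10) applied once more across partitions   = 10
// total                                                = 55 + 20 + 10 = 85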
【24】 saveAsTextFIle & saveAsObjectFile
val values: RDD[Int] = sc.makeRDD(0 to 10, 3)

/**
 * saveAsTextFile
 */
values.saveAsTextFile("TextFileOutput")

/**
 * saveAsObjectFile
 */
values.saveAsObjectFile("ObjectFileOutput")
【25】 countByKey & foreach
val values: RDD[(Int, String)] = sc.makeRDD(Array((1,"a"),(1,"d"),(2,"b"),(3,"c")), 2)

/**
 * countByKey: count how many elements there are for each key
 */
val countByKeyRDD: collection.Map[Int, Long] = values.countByKey()
println(countByKeyRDD)

/**
 * foreach
 */
values.foreach(println)
【26】 checkpoint
// set the directory where checkpoint data will be saved
sc.setCheckpointDir("checkpointDir")
val values: RDD[Int] = sc.makeRDD(List(1,2,3,4))
val mapRDD: RDD[(Int, Int)] = values.map((_, 1))
val reduceByKeyRDD: RDD[(Int, Int)] = mapRDD.reduceByKey(_ + _)
mapRDD.checkpoint()
reduceByKeyRDD.foreach(println)
println(reduceByKeyRDD.toDebugString)