I. Transformation operators

1. map: one-to-one

The defining trait is one-to-one: one element goes in, one element comes out.

    lines.map(one=>{
      one+"#"
    }).foreach(println)

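2. flatMap: one-to-many

One element goes in, many come out. For example, reading the line "hello world" produces the two words hello and world. The anonymous function defines the logic: split each input line on spaces and emit the resulting words.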

    lines.flatMap(one=>{
      one.split(" ")
    }).foreach(println)

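3. filter

One element goes in, and only elements that satisfy the predicate come out. Here only hello is emitted; every other word is filtered out.

    // first split the lines into words
    val rdd1 = lines.flatMap(one => {
      one.split(" ")
    })
    // define the predicate, then filter and print
    rdd1.filter(one=>{
      "hello".equals(one)
    }).foreach(println)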

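4. reduceByKey and sortBy

reduceByKey is analogous to the Reduce phase of MapReduce: it brings together the values that share a key and combines them with the given function (here, summing the 1s, which counts the occurrences of each word).
After aggregating, sort the RDD. Counting and sorting both work on 2-tuples: sort by the second field of the tuple, ascending by default. For descending order, use result.sortBy(tp=>{tp._2},false). Note that sortBy returns a new RDD rather than sorting result in place, so keep the return value.

    // the words obtained by splitting each line
    val words = lines.flatMap(one => {one.split(" ")})
    // map, one-to-one: hello ---> (hello,1)
    val pairWords = words.map(one=>{(one,1)})
    // aggregate
    val result = pairWords.reduceByKey((v1:Int,v2:Int)=>{v1+v2})
    // sort by the second field of each tuple (ascending by default)
    val sorted = result.sortBy(tp=>{tp._2})
//    val sorted = result.sortBy(tp=>{tp._2},false) // descending order
    // collect() returns the elements to the Driver in sorted order
    sorted.collect().foreach(println)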
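
The word-count steps described in this section can be tried end to end with a small local sketch. The input sentences, the local master setting, and the application name are assumptions made up for illustration; the article never shows how `lines` is created.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for experimentation (master and app name are assumptions).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("wordcount-sketch"))

// Hypothetical stand-in for the `lines` RDD used in this article.
val lines = sc.parallelize(Seq("hello world", "hello spark", "hello scala spark"))

val counts = lines
  .flatMap(_.split(" "))    // one line -> many words
  .map(word => (word, 1))   // word -> (word, 1)
  .reduceByKey(_ + _)       // sum the 1s for each word

// sortBy returns a new RDD; collect() brings it to the Driver in sorted order.
val sortedDesc = counts.sortBy(_._2, ascending = false).collect()
sortedDesc.foreach(println)
```

With this input, hello (3) and spark (2) come first; world and scala both have a count of 1, so their relative order is not fixed.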

5. sortByKey

sortByKey can produce the same result as sortBy: "hello world" -> "hello" "world" -> (hello,1) (world,1)
The key step is to flip each tuple with swap: (hello,1) -> (1,hello)
Sort with sortByKey, then swap once more to restore the original layout.

    val words = lines.flatMap(one => {one.split(" ")})
    val pairWords = words.map(one=>{(one,1)})
    val result:RDD[(String,Int)] = pairWords.reduceByKey((v1:Int,v2:Int)=>{v1+v2})
    val transRDD = result.map(tp=>{tp.swap}) // swap: (String,Int) becomes (Int,String)
    val r = transRDD.sortByKey(false) // false = descending
    r.map(_.swap).foreach(println)

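6. sample

    /**
     * Sampling with the sample operator
     * true: sample with replacement (a drawn element is put back and can be drawn again)
     * 0.1: the sampling fraction (the resulting size is approximate, not exact)
     * 100L: the seed; with a fixed seed the same data is sampled on every run
     */
    val result: RDD[String] = lines.sample(true,0.1,100L)
    result.foreach(println)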

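7. join

(K,V) join (K,W) => (K,(V,W)); elements with the same key are joined together. Keys present in only one of the RDDs are dropped.

    val result = rdd1.join(rdd2)

7.1 leftOuterJoin

The left RDD drives the result: every key of rdd1 is kept, and the right-hand value is an Option that is None when rdd2 has no match.

    val result = rdd1.leftOuterJoin(rdd2)

7.2 rightOuterJoin

The right RDD drives the result: every key of rdd2 is kept, and the left-hand value is an Option that is None when rdd1 has no match.

    val result = rdd1.rightOuterJoin(rdd2)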
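
To make the three join variants concrete, here is a minimal sketch on two small pair RDDs. The names `names` and `scores`, their contents, and the local master setting are hypothetical; only `join`, `leftOuterJoin`, and `rightOuterJoin` come from the text above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for experimentation (master and app name are assumptions).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("join-sketch"))

// Hypothetical sample data: (name, age) and (name, score).
val names  = sc.parallelize(Seq(("zhangsan", 18), ("lisi", 19), ("wangwu", 20)))
val scores = sc.parallelize(Seq(("zhangsan", 100), ("lisi", 200), ("maliu", 300)))

// Inner join: only keys present on both sides survive.
val inner = names.join(scores).collect().toMap

// Left outer join: all left keys; the right value is an Option.
val left = names.leftOuterJoin(scores).collect().toMap

// Right outer join: all right keys; the left value is an Option.
val right = names.rightOuterJoin(scores).collect().toMap

println(inner)           // zhangsan and lisi only
println(left("wangwu"))  // (20,None)
println(right("maliu"))  // (None,300)
```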

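8. union

Merges two datasets; the element types must match. Duplicates are kept.

    val result = rdd1.union(rdd2)

9. intersection

Returns the intersection of two RDDs.

    val result = rdd1.intersection(rdd2)

10. subtract

Returns the difference: the elements of rdd1 that do not appear in rdd2.

    val result = rdd1.subtract(rdd2)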
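
The three set-style operators can be compared side by side with a small sketch. The RDD names `a` and `b` and their contents are made up for illustration, and the results are sorted on the Driver only to make them easy to read.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for experimentation (master and app name are assumptions).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("setops-sketch"))

// Hypothetical sample data; note the duplicate "c" in the first RDD.
val a = sc.parallelize(Seq("a", "b", "c", "c"))
val b = sc.parallelize(Seq("c", "d"))

// union keeps duplicates from both sides.
val unioned = a.union(b).collect().sorted.toList

// intersection returns each common element once.
val common = a.intersection(b).collect().sorted.toList

// subtract removes every element of `a` that also occurs in `b`.
val onlyInA = a.subtract(b).collect().sorted.toList

println(unioned)  // List(a, b, c, c, c, d)
println(common)   // List(c)
println(onlyInA)  // List(a, b)
```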

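11. mapPartitions

Similar to map, but the function is called once per partition and receives an iterator over that partition's data.

    import scala.collection.mutable.ListBuffer

    val result = rdd1.mapPartitions(iter=>{
      val listBuffer = new ListBuffer[String]()
      println("open")
      while (iter.hasNext){
        val s = iter.next()
        println("insert... "+s)
        listBuffer.append(s+"#")
      }
      println("close")
      listBuffer.iterator
    })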

12. distinct: map + reduceByKey + map

Removes duplicate elements.

    val rdd1: RDD[String] = sc.makeRDD(Array[String]("a", "b", "c", "a", "d", "e", "a", "b"))
    val result = rdd1.distinct()
    // equivalent to
    val result2 = rdd1.map(s=>{(s,1)}).reduceByKey(_+_).map(tp=>tp._1)

13. cogroup

(K,V).cogroup(K,W) => (K,(Iterable[V],Iterable[W]))
Example output: (zhangsan,(CompactBuffer(1),CompactBuffer(100)))

    val result = nameRDD.cogroup(scoreRDD)
    result.foreach(println)
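
A small sketch shows how cogroup differs from join: every key from either side appears exactly once, with all of its values from each side gathered into a collection, which is empty when that side has no such key. The contents of `nameRDD` and `scoreRDD` below are hypothetical, since the article does not show them.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for experimentation (master and app name are assumptions).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("cogroup-sketch"))

// Hypothetical sample data; "zhangsan" occurs twice on the left.
val nameRDD  = sc.parallelize(Seq(("zhangsan", 1), ("lisi", 2), ("zhangsan", 3)))
val scoreRDD = sc.parallelize(Seq(("zhangsan", 100), ("wangwu", 300)))

// Convert the Iterables to sorted Lists so the result is easy to inspect.
val grouped = nameRDD.cogroup(scoreRDD)
  .mapValues { case (vs, ws) => (vs.toList.sorted, ws.toList.sorted) }
  .collect()
  .toMap

println(grouped("zhangsan"))  // (List(1, 3),List(100))
println(grouped("lisi"))      // (List(2),List())
println(grouped("wangwu"))    // (List(),List(300))
```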

14. mapPartitionsWithIndex

index: the partition index;
iter: an iterator over the data in that partition

    val rdd2 = rdd1.mapPartitionsWithIndex((index, iter) => {
      val list = new ListBuffer[String]()
      while (iter.hasNext) {
        val one = iter.next()
        list += (s"rdd1 partition = $index ,value = $one")
      }
      list.iterator
    })
    rdd2.foreach(println) // e.g. rdd1 partition = 1 ,value = b

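15. repartition

repartition can increase or decrease the number of partitions. It is a wide-dependency operator and always triggers a shuffle.
coalesce can also change the number of partitions, but by default it is a narrow-dependency operator with no shuffle; pass true as the second argument to enable one. repartition(n) is in fact coalesce(n, true). Calling coalesce to go from fewer partitions to more without a shuffle has no effect: the partition count stays as it was.
As a rule of thumb: use repartition to increase the number of partitions and coalesce to decrease it.

    val rdd3 = rdd2.repartition(3)
    rdd3.mapPartitionsWithIndex((index, iter) => {
      val list = new ListBuffer[String]()
      while (iter.hasNext) {
        val one = iter.next()
        list += (s"rdd1 partition = $index ,value = $one")
      }
      list.iterator
    }).foreach(println)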
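
The difference between repartition and coalesce is easiest to see by checking partition counts. The following is a minimal sketch; the starting RDD of 12 integers in 4 partitions and the local master setting are assumptions chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for experimentation (master and app name are assumptions).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("repartition-sketch"))

// Hypothetical data: 12 numbers spread over 4 partitions.
val rdd = sc.parallelize(1 to 12, 4)

val more      = rdd.repartition(6)                // shuffles, 6 partitions
val fewer     = rdd.coalesce(2)                   // no shuffle, 2 partitions
val unchanged = rdd.coalesce(6)                   // no shuffle, cannot grow: stays at 4
val grown     = rdd.coalesce(6, shuffle = true)   // shuffles, same effect as repartition(6)

val partitionCounts =
  List(more, fewer, unchanged, grown).map(_.getNumPartitions)
println(partitionCounts)  // List(6, 2, 4, 6)
```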

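16. zip & zipWithIndex

zip: pairs up two RDDs element by element, e.g. (a,1). Both RDDs must have the same number of partitions and the same number of elements in each partition.

zipWithIndex: pairs each element with its index in the RDD (a Long: 0, 1, 2, ...), producing a key-value RDD, e.g. (a,0)

    rdd1.zip(rdd2).foreach(println)
    val rdd: RDD[(String, Long)] = rdd1.zipWithIndex()
    rdd.foreach(println)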
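
A short sketch of both operators on tiny inputs follows. The names `letters` and `numbers` and their contents are hypothetical. Both RDDs are built with the same element count and partition count so that zip can line them up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for experimentation (master and app name are assumptions).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("zip-sketch"))

// Hypothetical data: same length and same number of partitions on both sides.
val letters = sc.parallelize(Seq("a", "b", "c"), 2)
val numbers = sc.parallelize(Seq(1, 2, 3), 2)

// zip pairs the i-th element of one RDD with the i-th element of the other.
val zipped = letters.zip(numbers).collect().toList

// zipWithIndex pairs each element with its position (a Long) in the RDD.
val indexed = letters.zipWithIndex().collect().toList

println(zipped)   // List((a,1), (b,2), (c,3))
println(indexed)  // List((a,0), (b,1), (c,2))
```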

II. Action operators

1. count

    // count: the number of elements in the RDD (here, the number of lines in the source)
    val l = lines.count()
    println(l)

2. collect

Brings the computed results back into the memory of the Driver.

    val strings: Array[String] = lines.collect()
    strings.foreach(println)

3. first

Returns the first element. first is implemented on top of take(1).

    val result = lines.first()
    println(result)

4. take

Returns the specified number of elements.

    val result = lines.take(5)
    result.foreach(println)

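5. foreachPartition

Iterates partition by partition: the function is called once per partition with an iterator over its data, which suits per-partition setup such as opening a database connection.

    rdd1.foreachPartition(iter=>{
      println("open database connection")
      while (iter.hasNext){
        val s = iter.next()
        println("insert into database: "+s)
      }
      println("close database connection")
    })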

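6. reduce & countByKey & countByValue

reduce aggregates all elements with the given function; the example prints 15.

    val result = sc.parallelize(List[Int](1,2,3,4,5)).reduce((v1,v2)=>{v1+v2})
    println(result)

countByKey groups by key and counts how many elements each key has. For the data below the counts are (a,2), (b,1), (d,1).

    sc.parallelize(List[(String,Int)](("a",100),("b",200),("a",300),("d",400))).countByKey().foreach(println)

countByValue treats each whole element as the value and counts how many times it occurs. For the data below the output includes ((a,100),2).

    sc.parallelize(List[(String,Int)](("a",100),("b",200),("a",300),("a",100))).countByValue().foreach(println)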


墨玉浮白
