scala in spark

Couldn't sleep last night, so I got up and pulled together a summary of some basic Scala operations in Spark to pass the time.

Packaging: mvn scala:compile && mvn package

1. Creating RDDs -- textFile + parallelize (makeRDD)

// Create an RDD from a text file
val textFile = sc.textFile("README.md")
textFile.first()  // get the first element of the textFile RDD
res3: String = # Apache Spark

// Keep only the lines that contain the keyword "Spark", then count them
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark.count()
res10: Long = 19

// Find the largest word count of any line in the RDD textFile
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res12: Int = 14  // the longest line contains 14 words

// Using a Java class in the Scala shell:
import java.lang.Math
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))

// Cache the RDD linesWithSpark, then count it
linesWithSpark.cache()
res13: linesWithSpark.type =
MapPartitionsRDD[8] at filter at <console>:23
linesWithSpark.count()
res15: Long = 19

RDD creation:
makeRDD and parallelize do the same thing, but makeRDD is only available in the Scala API, while parallelize also exists in the Python and R APIs.

// Create the RDD thingsRDD from a list of words
val thingsRDD = sc.parallelize(List("spoon","fork","plate","cup","bottle"))

// Count the number of words in the RDD thingsRDD
thingsRDD.count()
res16:Long = 5
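
Since makeRDD is just the Scala-side twin of parallelize, the same RDD can also be built as below (a minimal sketch, not from the original shell session; thingsRDD2 is only an illustrative name):

// Equivalent construction with makeRDD (Scala API only)
val thingsRDD2 = sc.makeRDD(List("spoon", "fork", "plate", "cup", "bottle"))
thingsRDD2.count()  // Long = 5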

2. groupByKey

def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]

This function gathers, for each key K of an RDD[(K, V)], all of the corresponding V values into a single collection Iterable[V].

  numPartitions specifies the number of partitions;
  partitioner specifies the partitioner (partitioning function) to use.

  var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))

  println(rdd1.groupByKey().collect.toBuffer) 
  //ArrayBuffer((A,CompactBuffer(0, 2)), (B,CompactBuffer(1, 2)), (C,CompactBuffer(1)))
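
For completeness, a sketch of the two overloads listed above (assumed usage, not from the original post; the grouped contents are the same, only the partitioning changes):

  rdd1.groupByKey(2).partitions.size                                        // Int = 2
  rdd1.groupByKey(new org.apache.spark.HashPartitioner(2)).partitions.size  // Int = 2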

3. reduceByKey

def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]

This function combines, for each key K of an RDD[(K, V)], all of the corresponding V values using the given function.

  numPartitions specifies the number of partitions;
  partitioner specifies the partitioner (partitioning function) to use.

var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))

rdd1.partitions.size //Int = 15

var rdd2 = rdd1.reduceByKey((x,y)=>x+y)

rdd2.collect
//Array[(String,Int)] = Array((A,2),(B,3),(C,1))

rdd2.partitions.size
// Int = 4

var rdd2 = rdd1.reduceByKey(new org.apache.spark.HashPartitioner(2),(x,y)=>x+y)

rdd2.collect
//Array[(String, Int)] = Array((B,3),(A,2),(C,1))

rdd2.partitions.size
//Int =2
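
The numPartitions overload from the signatures above can be sketched the same way (assumed usage; rdd3 is an illustrative name and the partition count simply follows from the argument, it is not a recorded shell result):

var rdd3 = rdd1.reduceByKey((x, y) => x + y, 3)
rdd3.partitions.size
//Int = 3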

4. reduceByKeyLocally

def reduceByKeyLocally(func: (V, V) => V): Map[K, V]

This function combines the V values for each key K of an RDD[(K, V)] using the given function, but the result is returned to the driver as a Map[K, V] rather than as an RDD[(K, V)].

    var rdd1__ = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
    println(rdd1__.reduceByKeyLocally((x,y)=>x+y))
//    Map(A -> 2, B -> 3, C -> 1)
//    scala.collection.Map[String,Int] = Map(B->3,A->2,C->1)

5. groupByKey / countByKey / countByValue / reduceByKey

val d = sc.makeRDD(Array(1,2,3,4,5,1,3,5))
val dd = d.map(x=>(x,1))  // build a pair RDD, dd: RDD[(Int, Int)]
Result: ArrayBuffer((1,1), (2,1), (3,1), (4,1), (5,1), (1,1), (3,1), (5,1))


5.1  groupByKey
val dg = dd.groupByKey()  //dg :RDD[(Int, Iterable[Int])]   dg: ArrayBuffer((4,CompactBuffer(1)), (1,CompactBuffer(1, 1)), (5,CompactBuffer(1, 1)), (2,CompactBuffer(1)), (3,CompactBuffer(1, 1)))
val dgc = dg.collectAsMap //dgc:Map[Int, Iterable[Int]]   Map(2 -> CompactBuffer(1), 5 -> CompactBuffer(1, 1), 4 -> CompactBuffer(1), 1 -> CompactBuffer(1, 1), 3 -> CompactBuffer(1, 1))
val dgci = dgc.foreach(println(_))
Output:
(2,CompactBuffer(1))
(5,CompactBuffer(1, 1))
(4,CompactBuffer(1))
(1,CompactBuffer(1, 1))
(3,CompactBuffer(1, 1))

5.2 countByKey

val dk = dd.countByKey()  //dk: Map[Int, Long]
dk.foreach(println(_))
(5,2) // key 5 appears twice
(1,2) // key 1 appears twice
(2,1) // key 2 appears once
(3,2) // key 3 appears twice
(4,1) // key 4 appears once

5.3 countByValue

val dv = dd.countByValue()  // dv: Map[(Int, Int),Long]
dv.foreach(println(_))      
//output
((5,1),2)
((3,1),2)
((4,1),1)
((1,1),2)
((2,1),1)

5.4 reduceByKey

    val rdd = sc.parallelize(List(
      ("a", 1.0),
      ("a", 3.0),
      ("a", 2.0),
      ("b", 4.0)
    ))
    val reduceByKey = rdd.reduceByKey((a , b) => a+b)
    //reduceByKey:RDD[(String,Double)]
    reduceByKey.collect().foreach(println(_))
//(a,6.0)
//(b,4.0) 

6 reduce

reduce on Scala collections comes in two directional variants, reduceLeft and reduceRight: the former works from the head of the collection, the latter from the tail.

val list = List(1,2,3,4,5)

list.reduceLeft(_ + _) //15

list.reduceRight(_ + _) //15

reduceLeft(_ + _) starts at the head of the list and sums the elements pairwise; the underscores are placeholders for the two arguments. For addition this is equivalent to reduce:

1+2 = 3
3+3 = 6
6+4 = 10
10+5 = 15

reduceRight(_ + _) starts at the tail of the list and sums the elements pairwise:

4+5 = 9
3+9 = 12
2+12 = 14
1+14 = 15

For subtraction, however, the two give completely different results:

val list = List(1,2,3,4,5)

list.reduceLeft(_ - _) //-13

list.reduceRight(_ - _) //3

In Spark, reduce passes pairs of RDD elements to the input function to produce a new value; the new value and the next element of the RDD are then passed to the function again, and so on until only a single value remains.

val rdd = sc.parallelize(1 to 10)
rdd.reduce((x, y) => x + y) //55

Compare this with reduceByKey.

reduceByKey applies a reduce to the values of a key-value RDD, grouped by key: the values of all elements that share the same key are reduced to a single value, which is then paired with the original key to form a new KV pair.

val rdd = sc.parallelize(List((1,2),(3,4),(3,6)))
rdd.reduceByKey((x,y) => x + y).collect
// Array[(Int, Int)] = Array((1,2), (3,10))

The values of elements with the same key are summed, so the two elements with key 3 become (3,10).

7 fold

The fold operation is quite similar to reduce, except that fold starts from an initial "seed" value and carries that value along as the accumulator while processing each element of the collection.

scala> val list = List(1,2,3,4,5)
 
scala> list.fold(10)(_*_)
res0: Int = 1200

This fold computes the product of all elements in the list. fold takes two parameter lists: the first is the initial seed value, here 10, and the second is the accumulator function, here multiplication. When list.fold(10)(_*_) runs, the seed 10 is first multiplied by the first element 1, giving a running product of 10; that running product is then multiplied by the second element 2, giving 20; and so on, until the final result 1200. fold has two variants: foldLeft() and foldRight(). In foldLeft() the accumulator is the first parameter of the function and the collection is traversed from left to right; in foldRight() the accumulator is the second parameter and the collection is traversed from right to left.
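
To make the traversal direction visible, a small sketch (not from the original post) using subtraction, where left and right folds differ:

val list = List(1, 2, 3, 4, 5)
list.foldLeft(10)(_ - _)   // ((((10-1)-2)-3)-4)-5 = -5
list.foldRight(10)(_ - _)  // 1-(2-(3-(4-(5-10)))) = -7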

foldByKey does the same thing per key:

val rdd1 = sc.parallelize(List("dog", "wolf", "cat", "bear"), 2)
val rdd2 = rdd1.map(x => (x.length, x)) //Array[(Int, String)] = Array((3,dog), (4,wolf), (3,cat), (4,bear))
val rdd3 = rdd2.foldByKey("")(_+_) //Array[(Int, String)] = Array((4,wolfbear), (3,dogcat))

8 join

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

var rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
var rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
 
scala> rdd1.join(rdd2).collect
res10: Array[(String, (String, String))] = Array((A,(1,a)), (C,(3,c)))

join is the equivalent of an inner join in SQL: it returns only the records whose key K can be matched in both RDDs. join works on exactly two RDDs; to relate more than two, simply chain several joins, as in the sketch below.
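
A sketch of chaining joins across three RDDs (rdd3 and its contents are made up here for illustration; the output shown is what the data implies, not a recorded shell result):

var rdd3 = sc.makeRDD(Array(("A", "x"), ("B", "y")), 2)
rdd1.join(rdd2).join(rdd3).collect
// Array[(String, ((String, String), String))] = Array((A,((1,a),x)))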

scala> val visit=sc.parallelize(List(("index.html","1.2.3.4"),("about.html","3,4,5,6"),("index.html","1.3.3.1"),("hello.html","1,2,3,4")),2)
visit: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> val page=sc.parallelize(List(("index.html","home"),("about.html","about"),("hi.html","2.3.3.3")),2);  
page: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[4] at parallelize at <console>:24

scala> visit.join(page).collect
res10: Array[(String, (String, String))] = Array((about.html,(3,4,5,6,about)), (index.html,(1.2.3.4,home)), (index.html,(1.3.3.1,home)))

scala>  page.join(visit).collect
res11: Array[(String, (String, String))] = Array((about.html,(about,3,4,5,6)), (index.html,(home,1.2.3.4)), (index.html,(home,1.3.3.1)))

9 mapValues

def mapValues[U](f: (V) => U): RDD[(K, U)]

Like the basic map transformation, except that mapValues applies the function only to the V of each (K, V) pair, leaving the keys untouched.

scala> var rdd1 = sc.makeRDD(Array((1,"A"),(2,"B"),(3,"C"),(4,"D")),2)
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[17] at makeRDD at <console>:24

scala> rdd1.mapValues(x => x + "_").collect
res14: Array[(Int, String)] = Array((1,A_), (2,B_), (3,C_), (4,D_))

Example

scala> val key = sc.parallelize(List(("panda",0),("pink",3),("pirate",3),("panda",1),("pink",4)))
key: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[12] at parallelize at <console>:21

scala> key.collect
res15: Array[(String, Int)] = Array((panda,0), (pink,3), (pirate,3), (panda,1), (pink,4))

scala> key.mapValues(y=>(y,1)).collect
res16: Array[(String, (Int, Int))] = Array((panda,(0,1)), (pink,(3,1)), (pirate,(3,1)), (panda,(1,1)), (pink,(4,1)))
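
Mapping each value to (value, 1) like this is the usual first step towards a per-key average; a common follow-up (not in the original post; avg is an illustrative name) combines it with reduceByKey:

val avg = key.mapValues(v => (v, 1))
             .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
             .mapValues { case (sum, cnt) => sum.toDouble / cnt }
avg.collect
// e.g. Array((panda,0.5), (pink,3.5), (pirate,3.0)) -- element order may differ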

10 aggregate

A quick warm-up:

val rdd2_ = sc.parallelize(List("a","b","c","d","e","f"),2)

println( rdd2_.aggregate("")(_ + _, _ + _)) //defabc
println( rdd2_.aggregate("=")(_ + _, _ + _)) //==def=abc

How often the "=" shows up is determined by the number of partitions: the zero value is used once per partition by the seq function and once more when the per-partition results are combined. With the 2 partitions above, it therefore appears 3 times.

Function signature:

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

Definition from the official documentation:

    Aggregate the elements of each partition, and then the results for all the partitions, using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into a U and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are allowed to modify and return their first argument instead of creating a new U to avoid memory allocation.

In other words: the elements of each partition are first folded together with the zero value (seqOp), and the per-partition results are then combined, again starting from the zero value (combOp). The function can return a type U different from the RDD's element type T, so one operation merges a T into a U and the other merges two U's, as in scala.TraversableOnce. Both functions are allowed to modify and return their first argument instead of creating a new U, to avoid memory allocation.
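
A sketch (not from the original post) where U differs from T, computing a (sum, count) pair from an RDD of Ints:

val nums = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)
val (sum, count) = nums.aggregate((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: merge a T (Int) into a U ((Int, Int))
  (a, b) => (a._1 + b._1, a._2 + b._2)    // combOp: merge two U's
)
// sum = 21, count = 6, so the average is sum.toDouble / count = 3.5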

Example 1:

def seqOP(a:Int, b:Int):Int = {
    println("seqOp: " + a + "\t" +b )
    math.min(a,b)
}
//seqOP : (a:Int, b:Int)Int

def combOP(a:Int, b:Int):Int = {
    println("combOP: " + a + "\t" + b)
    a + b
}
//combOP:(a:Int,b:Int)Int

val z = sc.parallelize(List(1,2,3,4,5,6),2)
z.aggregate(3)(seqOP, combOP)
//output
seqOp:3     1  //partition 1:  seqOP(3,1)=>1
seqOp:3     4  //partition 2:seqOP(3,4)=>3
seqOp:1     2  //partition 1:seqOP(1,2)=>1
seqOp:3     5  //partition 2:seqOP(3,5)=>3
seqOp:1     3  //partition 1:seqOP(1,3)=>1
seqOp:3     6  //partition 2:seqOP(3,6)=>3
combOp:3    1 //combOP(3,1)=>4, 3: zero value, 1: partition 1's output
combOp:4    3  //combOP(4,3)=>7,3:partition2's output

//final output:7

Example 2:

val z = sc.parallelize(List(1, 2, 3, 4, 5, 6), 3)
val zz = z.aggregate(3)(seqOP, combOP)
seqOp: 3    3
seqOp: 3    4
seqOp: 3    5
seqOp: 3    6
seqOp: 3    1
seqOp: 1    2

combOp: 3   3  //combOp(3: zero value, 3: partition 1's output) => 6
combOp: 6   3  //combOp(6, 3: partition 2's output) => 9
combOp: 9   1  //combOp(9, 1: partition 3's output) => 10
10

Example 3:

def seqOp(a:String, b:String) : String = {
println("seqOp: " + a + "\t" + b)
math.min(a.length , b.length ).toString
}
//seqOp: (a: String, b: String)String

def combOp(a:String, b:String) : String = {
println("combOp: " + a + "\t" + b)
a + b
}
//combOp: (a: String, b: String)String

val z = sc.parallelize(List("12", "23", "345", "4567"), 2)
z.aggregate("")(seqOp, combOp)
seqOp:  345  //partition 2: ("","345") => "0"
seqOp:  12   //partition 1: ("","12") => "0"
seqOp: 0    4567 //partition 2: ("0","4567") => "1"
seqOp: 0    23   //partition 1: ("0","23") => "1"
combOp:     1 //combOp("","1") => "1"
combOp: 1   1 //combOp("1","1") => "11"
// res25: String = 11

Notes:

1. The reduce (seqOp) function and the combine (combOp) function must be commutative and associative.
2. From the definition of aggregate, the output type of the combine function must be the same as its input type.

Example of the Scala aggregate function

Let’s see if some ASCII art doesn’t help. Consider the type signature of aggregate:

def aggregate[B](z: B)(seqop: (B, A) => B, combop: (B, B) => B): B

The aggregate might work like this:
z   A   z   A   z   A   z   A
 \ /     \ /seqop\ /     \ /    
  B       B       B       B
    \   /  combop   \   /
      B _           _ B
         \ combop  /
              B


Now I have a GenSeq("This", "is", "an", "example"), and I want to know how many characters there are in it. I can write the following:

import scala.collection.GenSeq
val seq = GenSeq("This","is","an","example")
val chars = seq.aggregate(0)(_ + _.length, _ + _)

So, first it would compute this:

0 + "This".length    //4
0 + "is".length      //2
0 + "an".length      //2
0 + "example".length //7

What it does next cannot be predicted (there is more than one way of combining the results), but it might do this (like in the ASCII art above):

4 + 2 // 6
2 + 7 // 9

It then concludes with

6 + 9 // 15

which gives the final result. Now, this is a bit similar in structure to foldLeft, but it has an additional function, (B, B) => B, which fold doesn't have. It is this function that enables it to work in parallel.

Consider, for example, that each of the four initial computations is independent of the others, so they can be done in parallel. The next two (resulting in 6 and 9) can be started once the computations they depend on have finished, and these two can also run in parallel.

The 7 computations above, when parallelized, could take as little time as 3 serial computations.

Actually, with such a small collection the cost of synchronizing the computation would be big enough to wipe out any gains. Furthermore, if you folded this, it would only take 4 computations in total. Once your collections get larger, however, you start to see some real gains.
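
For instance, a rough sketch (not from the original; it assumes pre-2.13 style parallel collections, or the separate scala-parallel-collections module on Scala 2.13+) where the combining function lets per-chunk totals be merged in parallel:

val big = Vector.fill(100000)("example").par
big.aggregate(0)(_ + _.length, _ + _)  // 700000: seqop sums lengths within chunks, combop merges chunk totals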

Consider, on the other hand, foldLeft. Because it doesn’t have the additional function, it cannot parallelize any computation:

(((0 + "This".length) + "is".length) + "an".length) + "example".length

Each of the inner parenthesis must be computed before the outer one can proceed.

Reference: http://blog.csdn.net/power0405hf/article/details/50347005


Reposted from blog.csdn.net/qq_31780525/article/details/79111932