Two, Spark - Spark core principles and usage

[TOC]

One, some basic Spark terminology

RDD: Resilient Distributed Dataset, the core abstraction that Spark revolves around
operator: a function that operates on an RDD
application: a Spark program written by the user (DriverProgram + ExecutorProgram)
job: a computation triggered by an action-class operator
stage: a set of tasks; a job is divided into several mutually dependent stages
task: within a stage, several tasks perform the same operations (on different pieces of data); a task is the smallest unit of execution in the cluster

These concepts may not make much sense yet; that is fine, this is just a first impression.
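To make the terms a bit more concrete, here is a minimal word-count sketch with the terminology mapped onto it; the file path is the same sample path used later in this post:

//the whole program (driver + executors) is one application
val lines = sc.textFile("/usr/local/tmp_files/test_WordCount.txt")
val counts = lines.flatMap(_.split(" "))   //flatMap, map and reduceByKey are operators
  .map((_, 1))
  .reduceByKey(_ + _)                      //reduceByKey introduces a wide dependency, so the job splits into two stages here
counts.collect                             //collect is an action: it triggers one job, and each stage runs as one task per partition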

Two, RDD basic principles and usage

2.1 What is an RDD

RDD stands for Resilient Distributed Dataset. It is Spark's most basic data abstraction and represents an immutable, partitioned collection of elements that can be computed in parallel. RDDs have the characteristics of a dataflow model: automatic fault tolerance, location-aware scheduling, and scalability. If that is still not clear, here is an example:
Suppose I use sc.textFile(xxxx) to read a file from HDFS; the data of that file corresponds to one RDD. In reality the file's data is processed on several different worker nodes, but logically, to the Spark cluster, all of those pieces belong to one RDD. In other words, an RDD is a logical concept: it is an abstraction over the whole cluster, with its data distributed across the cluster. This is also why the RDD is the key to Spark's distributed data processing. For example:
Figure 2.1 RDD principle

2.2 RDD properties

The properties of an RDD are described by a comment in the Spark source code, as follows:

* Internally, each RDD is characterized by five main properties:
*  - A list of partitions
1. A list of partitions
Understanding: an RDD is composed of partitions, and each partition runs on a different worker; this is how distributed computation is achieved. The partition is the basic building block of the dataset. Each partition is processed by one computing task, which determines the granularity of parallelism. The user can specify the number of partitions when creating the RDD; if none is specified, a default value is used, namely the number of CPU cores allocated to the program.

*  - A function for computing each split
2. A split can be understood as a partition
An RDD has a series of functions that compute the data in each partition; these functions are called operators. Spark computes RDDs split by split, and every RDD implements a compute function for this purpose. The compute function composes iterators and does not need to store the result of each computation.
Operator types:
transformation   action

*  - A list of dependencies on other RDDs
3. RDDs depend on each other: narrow dependencies and wide dependencies.
Dependencies are used to divide stages, and tasks are executed stage by stage. Every transformation of an RDD produces a new RDD, so RDDs form pipeline-like dependencies. When some partition data is lost, Spark can use this dependency chain to recompute only the lost partitions instead of recomputing all partitions of the RDD.

*  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
4. An RDD can be created with a custom partitioning rule
When creating an RDD you can specify the partitioning, or define your own partitioning rule.
Spark currently implements two kinds of partitioner: the hash-based HashPartitioner and the range-based RangePartitioner. Only key-value RDDs have a Partitioner; for non key-value RDDs the Partitioner is None. The Partitioner determines not only the number of partitions of the RDD itself, but also the number of partitions of the parent RDD's shuffle output.

*  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
*    an HDFS file)
5. Prefer to run a task on the node closest to the data it processes.
Move the computation, not the data.
This deserves some explanation. Spark is usually built on top of HDFS and reads the data it processes from HDFS. HDFS is distributed storage: say there are three datanodes A, B and C, and the data Spark needs happens to be stored on node C. If Spark runs the task on node A or B, the data must first be read from node C and transferred over the network before it can be processed, which wastes a lot of performance. What Spark does instead is prefer to run the task on the node closest to the data, i.e. node C, which saves the time and cost of transferring data. That is "move the computation, not the data".

2.3 Creating an RDD

To create an RDD, first create a SparkContext object:
//Create the Spark configuration object. Set the app name and the master address; "local" means local mode.
//When submitting to a cluster, the master is usually not set here: the job may run on different clusters, so hard-coding it is inconvenient.
val conf = new SparkConf().setAppName("wordCount").setMaster("local")
//Create the SparkContext object
val sc = new SparkContext(conf)

(1) Create an RDD with sc.parallelize:

sc.parallelize(seq,numPartitions)
seq: a sequence object, such as a List or Array
numPartitions: the number of partitions; optional, the default is 2

Example:
val rdd1 = sc.parallelize(Array(1,2,3,4,5),3)
rdd1.partitions.length

(2) Create an RDD from an external data source:

val rdd1 = sc.textFile("/usr/local/tmp_files/test_WordCount.txt")

2.4 Operator types

Operators are divided into transformation and action types.
transformation:

Lazily evaluated (lazy): a transformation does not trigger computation; computation only happens when an action operator is encountered. Transformations merely remember the operations applied to the base dataset (for example a file); they actually run only when an action requires a result to be returned to the driver. This design lets Spark run more efficiently. A small sketch follows below.

action:

Similar to transformations, but an action triggers the computation directly instead of waiting.
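A minimal sketch of this lazy behavior:

val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5))
val rdd2 = rdd1.map(_ * 2)   //transformation: nothing runs yet, only the lineage is recorded
rdd2.collect                 //action: this is the point where the computation actually executes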

2.5 Transformation operators

For ease of demonstration, first create an RDD; the examples below use spark-shell:

scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,8,34,100,79))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

2.5.1 map(func)

map[U](f: T => U)
The argument is a function that takes a single element and returns a single value. Each element is processed by the function and the processed data is returned.

Example:
//pass in an anonymous function: multiply each value in rdd1 by 2 and return a new RDD with the results
scala> val rdd2 = rdd1.map(_*2)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:26

//collect is an action operator: it triggers the computation and prints the result
scala> rdd2.collect
res0: Array[Int] = Array(2, 4, 6, 8, 10, 16, 68, 200, 158)

2.5.2 filter

filter(f: T => Boolean)
The argument is a predicate that returns true or false for each element; it is commonly used to filter data. The elements for which it returns true are kept.

Example:
//keep the values greater than 20
scala> rdd2.filter(_>20).collect
res4: Array[Int] = Array(68, 200, 158)

2.5.3 flatMap

flatMap(f: T => U)
Map first, then flatten: flattening expands several lists (or similar collections) and merges them into one big list, and the processed data is returned. This operator is typically used when a collection's elements are themselves collections.

Example:
scala> val rdd4 = sc.parallelize(Array("a b c","d e f","x y z"))
rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:24

//logic: split each string in the array on spaces, producing several arrays, then flatten and merge them into one new array
scala> val rdd5 = rdd4.flatMap(_.split(" "))
rdd5: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at flatMap at <console>:26

scala> rdd5.collect
res5: Array[String] = Array(a, b, c, d, e, f, x, y, z)

2.5.4 Set operations

union(otherDataset): union
intersection(otherDataset): intersection
distinct([numTasks]): de-duplication

Example:
scala> val rdd6 = sc.parallelize(List(5,6,7,8,9,10))
rdd6: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:24

scala> val rdd7 = sc.parallelize(List(1,2,3,4,5,6))
rdd7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at <console>:24

//union
scala> val rdd8 = rdd6.union(rdd7)
rdd8: org.apache.spark.rdd.RDD[Int] = UnionRDD[22] at union at <console>:28

scala> rdd8.collect
res6: Array[Int] = Array(5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6)

//de-duplicate
scala> rdd8.distinct.collect
res7: Array[Int] = Array(4, 8, 1, 9, 5, 6, 10, 2, 7, 3)                         
//intersection
scala> val rdd9 = rdd6.intersection(rdd7)
rdd9: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[31] at intersection at <console>:28

scala> rdd9.collect
res8: Array[Int] = Array(6, 5)

2.5.5 Grouping operations

groupByKey([numTasks]): only groups and collects the values that share the same key
reduceByKey(f: (V, V) => V, [numTasks]): first groups the KV pairs that share the same key, then reduces their values with the given function.

scala> val rdd1 = sc.parallelize(List(("Tom",1000),("Jerry",3000),("Mary",2000)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[32] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(List(("Jerry",1000),("Tom",3000),("Mike",2000)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[33] at parallelize at <console>:24

scala> val rdd3 = rdd1 union rdd2
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[34] at union at <console>:28

scala> rdd3.collect
res9: Array[(String, Int)] = Array((Tom,1000), (Jerry,3000), (Mary,2000), (Jerry,1000), (Tom,3000), (Mike,2000))

scala> val rdd4 = rdd3.groupByKey
rdd4: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[35] at groupByKey at <console>:30

//group
scala> rdd4.collect
res10: Array[(String, Iterable[Int])] = 
Array(
(Tom,CompactBuffer(1000, 3000)), 
(Jerry,CompactBuffer(3000, 1000)), 
(Mike,CompactBuffer(2000)), 
(Mary,CompactBuffer(2000)))

Note: when grouping, groupByKey is not recommended because its performance is poor; the official recommendation is reduceByKey.
//group and aggregate
scala> rdd3.reduceByKey(_+_).collect
res11: Array[(String, Int)] = Array((Tom,4000), (Jerry,4000), (Mike,2000), (Mary,2000))

2.5.6 cogroup

The behavior of this function is hard to summarize, so look at the example directly:
scala> val rdd1 = sc.parallelize(List(("Tom",1),("Tom",2),("jerry",1),("Mike",2)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[37] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(List(("jerry",2),("Tom",1),("Bob",2)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[38] at parallelize at <console>:24

scala> val rdd3 = rdd1.cogroup(rdd2)
rdd3: org.apache.spark.rdd.RDD[(String, (Iterable[Int], Iterable[Int]))] = MapPartitionsRDD[40] at cogroup at <console>:28

scala> rdd3.collect
res12: Array[(String, (Iterable[Int], Iterable[Int]))] = 
Array(
(Tom,(CompactBuffer(1, 2),CompactBuffer(1))), 
(Mike,(CompactBuffer(2),CompactBuffer())), 
(jerry,(CompactBuffer(1),CompactBuffer(2))), 
(Bob,(CompactBuffer(),CompactBuffer(2))))

2.5.7 Sorting

sortByKey(ascending: true/false): sorts KV pairs by key
sortBy(f: T => U, ascending: true/false): general-purpose sorting; it sorts by the result of applying f, so it can be used to sort KV pairs by value

Example:
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,8,34,100,79))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val rdd2 = rdd1.map(_*2)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:26

scala> rdd2.collect
res0: Array[Int] = Array(2, 4, 6, 8, 10, 16, 68, 200, 158)

scala> rdd2.sortBy(x=>x,true)
res1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at sortBy at <console>:29

scala> rdd2.sortBy(x=>x,true).collect
res2: Array[Int] = Array(2, 4, 6, 8, 10, 16, 68, 158, 200)                      

scala> rdd2.sortBy(x=>x,false).collect
res3: Array[Int] = Array(200, 158, 68, 16, 10, 8, 6, 4, 2)

Another example:

Requirement:
We want to sort KV pairs by value, but sortByKey sorts by key.

Approach 1:
1. First swap key and value, then call sortByKey
2. Swap key and value back again
scala> val rdd1 = sc.parallelize(List(("tom",1),("jerry",1),("kitty",2),("bob",1)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[42] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(List(("jerry",2),("tom",3),("kitty",5),("bob",2)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[43] at parallelize at <console>:24

scala> val rdd3 = rdd1 union(rdd2)
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[44] at union at <console>:28

scala> val rdd4 = rdd3.reduceByKey(_+_)
rdd4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[45] at reduceByKey at <console>:30

scala> rdd4.collect
res13: Array[(String, Int)] = Array((bob,3), (tom,4), (jerry,3), (kitty,7))

//swap, sort, then swap back
scala> val rdd5 = rdd4.map(t => (t._2,t._1)).sortByKey(false).map(t=>(t._2,t._1))
rdd5: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[50] at map at <console>:32

scala> rdd5.collect
res14: Array[(String, Int)] = Array((kitty,7), (tom,4), (bob,3), (jerry,3)) 

Approach 2:
Use the sortBy operator directly, which can sort by value; see the sketch below.
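A minimal sketch of approach 2, reusing rdd4 from the example above:

//sort directly by the value field, in descending order
val sortedByValue = rdd4.sortBy(_._2, false)
sortedByValue.collect
//expected: Array((kitty,7), (tom,4), (bob,3), (jerry,3)) -- keys with equal values may appear in either order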

2.6 Action operators

reduce

Similar to reduceByKey above, but it is used to combine non-KV data, and it is an action operator.

scala> val rdd1 = sc.parallelize(List(1,2,3,4,5))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[41] at parallelize at <console>:24

scala> val rdd2 = rdd1.reduce(_+_)
rdd2: Int = 15

Other action operators (a short demo follows the list):

reduce(func): aggregates all elements of the RDD with func; func must be commutative and associative so the reduction can run in parallel
collect(): returns all elements of the dataset to the driver program as an array; often used simply to trigger computation
count(): returns the number of elements in the RDD
first(): returns the first element of the RDD (similar to take(1))
take(n): returns an array of the first n elements of the dataset
takeSample(withReplacement, num, [seed]): returns an array of num elements randomly sampled from the dataset; sampling with or without replacement can be chosen, and seed sets the random number generator seed
takeOrdered(n, [ordering]): returns an array of the first n elements of the dataset, in sorted order
saveAsTextFile(path): saves the elements of the dataset as a text file on HDFS or another supported file system; Spark calls toString on each element to turn it into a line of text
saveAsSequenceFile(path): saves the elements of the dataset in Hadoop SequenceFile format under the given directory, on HDFS or another Hadoop-supported file system
saveAsObjectFile(path): saves the elements of the dataset as serialized objects under the given path
countByKey(): for (K, V) RDDs, returns a (K, Int) map with the number of elements for each key
foreach(func): runs the function func on every element of the dataset
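A short sketch exercising a few of these actions on the rdd1 created at the start of section 2.5 (List(1,2,3,4,5,8,34,100,79)); the expected results follow from that data:

val rdd1 = sc.parallelize(List(1,2,3,4,5,8,34,100,79))
rdd1.count          // 9
rdd1.first          // 1
rdd1.take(3)        // Array(1, 2, 3)
rdd1.takeOrdered(3) // Array(1, 2, 3)
rdd1.reduce(_ + _)  // 236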

2.7 RDD caching

RDDs have a caching mechanism: an RDD can be cached in memory or on disk so it does not have to be computed twice.
Several operators are involved:

cache()   marks the RDD as cacheable; by default it is cached in memory; internally it simply calls persist()
persist() marks the RDD as cacheable; by default it is cached in memory
persist(newLevel : org.apache.spark.storage.StorageLevel) same as above, but the storage location can be specified

The available storage levels are:
val NONE = new StorageLevel(false, false, false, false)
val DISK_ONLY = new StorageLevel(true, false, false, false)
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
val MEMORY_ONLY = new StorageLevel(false, true, false, true)
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
val OFF_HEAP = new StorageLevel(true, true, true, false, 1)

These fall into three basic categories:
pure in-memory caching
pure on-disk caching
disk + memory caching

In general the default level is used, i.e. caching in memory, which performs better but consumes a lot of memory. Keep this in mind and do not cache unless you need to.
For example:

Read a large file and count the number of lines

scala> val rdd1 = sc.textFile("hdfs://192.168.109.132:8020/tmp_files/test_Cache.txt")
rdd1: org.apache.spark.rdd.RDD[String] = hdfs://192.168.109.132:8020/tmp_files/test_Cache.txt MapPartitionsRDD[52] at textFile at <console>:24

scala> rdd1.count
res15: Long = 923452 
Triggers the computation and counts the lines

scala> rdd1.cache
res16: rdd1.type = hdfs://192.168.109.132:8020/tmp_files/test_Cache.txt MapPartitionsRDD[52] at textFile at <console>:24
Marks the RDD as cacheable; this does not trigger computation

scala> rdd1.count
res17: Long = 923452 
Triggers the computation and caches the result

scala> rdd1.count
res18: Long = 923452
Reads the data directly from the cache.

One point to note: calling cache only marks the RDD so that its result can be cached when a later action triggers computation; it does not mean the current RDD is cached right away. Be clear about this.
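A minimal sketch of choosing an explicit storage level instead of the default; persist, StorageLevel and unpersist are standard Spark API:

import org.apache.spark.storage.StorageLevel

val rdd1 = sc.textFile("hdfs://192.168.109.132:8020/tmp_files/test_Cache.txt")
rdd1.persist(StorageLevel.MEMORY_AND_DISK)  //spill to disk when memory is not enough
rdd1.count                                  //triggers computation, the result is cached
rdd1.count                                  //served from the cache
rdd1.unpersist()                            //release the cache when it is no longer needed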

2.8 RDD fault tolerance: checkpoint

A Spark computation involves many intermediate RDD transformations. If the computation of some partition fails at that point, its result is lost. The simplest remedy is naturally to recompute from scratch, but that wastes time. A checkpoint saves the state of the RDD at the checkpoint when computation is triggered; if a later computation goes wrong, it can restart from the checkpoint instead.
The checkpoint is generally a path on a fault-tolerant, highly reliable file system (such as HDFS or S3) where the checkpoint data is stored. When an error occurs, the data is read back directly from the checkpoint directory. There are two modes: a local directory and a remote directory.

2.8.1 Local directory

This mode requires running in local mode; it cannot be used in cluster mode and is generally used for development and testing.

sc.setCheckpointDir(localPath)   set the local checkpoint path
rdd1.checkpoint    set the checkpoint
rdd1.count         an action operator triggers the computation and writes a checkpoint into the checkpoint directory

2.8.2 Remote directory (HDFS as an example)

This mode requires running in cluster mode and is used in production environments.

scala> sc.setCheckpointDir("hdfs://192.168.109.132:8020/sparkckpt0619")

scala> rdd1.checkpoint

scala> rdd1.count
res22: Long = 923452

The usage is the same; only the directory is different.

Note that when using checkpoint, the source code has this to say:

this function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is
persisted in memory, otherwise saving it on a file will require recomputation.

Roughly:
Call this method before the computation starts, i.e. before any action operator. It is also best to cache this RDD in memory; otherwise, saving the checkpoint will require recomputing the RDD.
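A minimal sketch that follows this advice (cache before checkpoint, checkpoint before the first action), reusing the HDFS paths from above:

sc.setCheckpointDir("hdfs://192.168.109.132:8020/sparkckpt0619")
val rdd1 = sc.textFile("hdfs://192.168.109.132:8020/tmp_files/test_Cache.txt")
rdd1.cache        //avoid recomputing the lineage when the checkpoint is written
rdd1.checkpoint   //mark the RDD to be checkpointed
rdd1.count        //the action triggers the computation and the checkpoint write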

2.9 RDD dependencies and the stage principle

2.9.1 RDD dependencies

This concept is key to the principles of how RDDs operate.
First, dependencies. A Spark computation involves transformations across multiple RDDs, so dependency relationships exist between them. The dependency between an RDD and its parent RDD(s) comes in two kinds: narrow dependency and wide dependency, as shown in the figure.
Figure 2.2 RDD wide and narrow dependencies

Wide dependency:
one partition of the parent RDD is depended on by multiple partitions of the child RDD. In effect the parent RDD's data goes through a shuffle: because one parent partition is needed by several child partitions, the parent's data has to be scattered and redistributed to multiple child partitions, and that scattering is exactly the shuffle. In practice, multiple parent partitions and multiple child partitions usually depend on each other in a criss-cross pattern.

Narrow dependency:
one partition of the parent RDD is depended on by at most one partition of the child RDD.

2.9.2 Stage division

Figure 2.3 RDD dependencies

A DAG (Directed Acyclic Graph) is formed as the original RDDs go through a series of transformations. The DAG is divided into stages according to the dependencies between RDDs. Wide and narrow dependencies are what the division is based on: between stages the dependency is wide, and inside a stage all dependencies are narrow.
For a narrow dependency, the parent and child partitions are in a one-to-one relationship, so the parent-to-child transformations can run inside a single task. In the figure above, for example, C, D and F are connected by narrow dependencies, so the C -> D -> F transformations can run inside one task (task0). Inside a stage there are only narrow dependencies.
For a wide dependency, a shuffle is involved, so all partitions of the parent RDD must be fully processed before the shuffle can run, and only then can the child RDD be processed. Because of the shuffle, the task chain can no longer be continuous and a new task chain has to be planned; this is why wide dependencies are the basis for dividing stages.
Going deeper: why divide stages at all?
Once stages are divided at wide dependencies, a stage contains only narrow dependencies. Narrow dependencies are one-to-one, so the task chain is continuous and involves no shuffle. In task0 above, for instance, the C -> D -> F transformations of one partition are all one-to-one, so they form one continuous task chain and go into one task; the other partition is handled the same way in task1. Because F -> G is a wide dependency and needs a shuffle, the task chain cannot continue across it. Such a chain of RDD transformations, followed until a wide dependency is reached, is one task, and what a task actually processes is the transformation chain of one partition's data. In Spark, the task is the smallest scheduling unit, and Spark assigns each task to a worker node close to the partition's data; so what Spark really schedules are tasks.
Back to the original question of why stages are divided: once stages are carved out by wide and narrow dependencies, it is easy to carve out tasks inside each stage, each task processes one partition's data, and Spark schedules the tasks onto the corresponding worker nodes. From dividing stages to dividing tasks, the core goal is parallel computation.
So, in one sentence: stages are divided in order to make it easier to divide tasks. The sketch below shows one way to see this in practice.
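A small sketch that makes the stage boundary visible: toDebugString prints the RDD lineage, and the ShuffledRDD in its output marks where a wide dependency, and therefore a new stage, begins.

val words = sc.textFile("/usr/local/tmp_files/test_WordCount.txt")
  .flatMap(_.split(" "))
  .map((_, 1))                            //textFile -> flatMap -> map are all narrow dependencies: one stage
val counts = words.reduceByKey(_ + _)     //wide dependency: reduceByKey shuffles, starting a new stage
println(counts.toDebugString)             //the printed lineage shows a ShuffledRDD at the stage boundary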

2.9.3 What does an RDD store?

At this point a question comes to mind: does an RDD store data? In fact it does not. What it stores is the chain of transformations over the data, more precisely the chain of transformations over each partition, i.e. the operators contained in a task. Once stages and then tasks are divided, it is clear which operators each task contains, and the computing task is then sent to a worker node for execution. This kind of computation is called the pipeline model, with the operators sitting in the pipeline.
So although RDD stands for resilient distributed dataset, it does not store the data itself; it stores the functions that operate on the data, i.e. the operators.

2.10 Advanced RDD operators

2.10.1 mapPartitionsWithIndex

def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) ⇒ Iterator[U])

Parameter description:
f is a function parameter that you define yourself.
f takes two arguments: the first is an Int, the partition number; the second, Iterator[T], iterates over all the elements of that partition.

With these two parameters you can define a function that processes a partition.
Iterator[U]: the result returned after the operation completes.

Example:
Print every element of every partition, together with its partition number.

//first create an RDD with 3 partitions
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> def fun1(index:Int,iter:Iterator[Int]):Iterator[String]={
| iter.toList.map( x => "[PartId: " + index + " , value = " + x + " ]").iterator
| }
fun1: (index: Int, iter: Iterator[Int])Iterator[String]

scala> rdd1.mapPartitionsWithIndex(fun1).collect
res0: Array[String] = Array(
[PartId: 0 , value = 1 ], [PartId: 0 , value = 2 ], 
[PartId: 1 , value = 3 ], [PartId: 1 , value = 4 ], [PartId: 1 , value = 5 ], 
[PartId: 2 , value = 6 ], [PartId: 2 , value = 7 ], [PartId: 2 , value = 8 ]
)

2.10.2 aggregate

An aggregation operation, similar to grouping (group by),
but aggregate aggregates locally first (like the combine step in MapReduce) and then aggregates globally. Its performance is better than using the reduce operator directly, because reduce aggregates globally in one step.

def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)
Parameter description:
zeroValue: the initial value; it is added to every partition, and also to the final global operation
seqOp: (U, T) ⇒ U: the local (per-partition) aggregation function
combOp: (U, U) ⇒ U: the global aggregation function

=================================================
Example 1:
The initial value is 10
scala> val rdd2 = sc.parallelize(List(1,2,3,4,5),2)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:27

//print the partition layout
scala> rdd2.mapPartitionsWithIndex(fun1).collect
res7: Array[String] = Array([PartId: 0 , value = 1 ], [PartId: 0 , value = 2 ], [PartId: 1 , value = 3 ], [PartId: 1 , value = 4 ], [PartId: 1 , value = 5 ])

//find the maximum of each partition, then in the global step add the per-partition maxima together
scala> rdd2.aggregate(10)(math.max(_,_),_+_)
res8: Int = 30

Why is the result 30?
The initial value 10 means every partition effectively gains an extra element 10.
Local step: the maximum of each partition is therefore 10.
Global step: another 10 is added, i.e. 10 (local max) + 10 (local max) + 10 (the initial value applied to the global step) = 30.

=================================================
Example 2:
Using aggregate to sum the data across all partitions; there are two ways:
1. reduce(_+_)
2. aggregate(0)(_+_,_+_)

2.10.3 aggregateByKey

Similar to the aggregate operation, except that it works on <key, value> data and only operates on the values sharing the same key: the KV pairs with the same key are first grouped locally and their values aggregated, then grouped globally and the values aggregated again.

aggregateByKey implements functionality similar to reduceByKey, but is more efficient than reduceByKey

Example:
val pairRDD = sc.parallelize(List(("cat",2),("cat",5),("mouse",4),("cat",12),("dog",12),("mouse",2)),2)
def fun1(index:Int,iter:Iterator[(String,Int)]):Iterator[String]={
iter.toList.map( x => "[PartId: " + index + " , value = " + x + " ]").iterator
}

pairRDD.mapPartitionsWithIndex(fun1).collect

scala> val pairRDD = sc.parallelize(List(("cat",2),("cat",5),("mouse",4),("cat",12),("dog",12),("mouse",2)),2)
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[16] at parallelize at <console>:27

scala> def fun1(index:Int,iter:Iterator[(String,Int)]):Iterator[String]={
| iter.toList.map( x => "[PartId: " + index + " , value = " + x + " ]").iterator
| }
fun1: (index: Int, iter: Iterator[(String, Int)])Iterator[String]

scala> pairRDD.mapPartitionsWithIndex(fun1).collect
res31: Array[String] = Array(
[PartId: 0 , value = (cat,2) ], [PartId: 0 , value = (cat,5) ], [PartId: 0 , value = (mouse,4) ],
[PartId: 1 , value = (cat,12) ], [PartId: 1 , value = (dog,12) ], [PartId: 1 , value = (mouse,2)
])

Requirement:
In each partition, find the maximum count for each animal, then sum these maxima across partitions.
pairRDD.aggregateByKey(0)(math.max(_,_),_+_).collect

Partition 0: (cat,2) and (cat,5) --> (cat,5); (mouse,4)
Partition 1: (cat,12)   (dog,12)   (mouse,2)

Sum: (cat,17)  (mouse,6)   (dog,12)

2.10.4 coalesce and repartition

Both are used to repartition an RDD (a sketch follows below):
 repartition(numPartitions)  sets the new number of partitions and always performs a shuffle
 coalesce(numPartitions, shuffleOrNot) sets the new number of partitions; by default no shuffle is performed, but one can be requested
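A minimal sketch of both operators; partition counts are checked with partitions.length:

val rdd1 = sc.parallelize(1 to 100, 4)
rdd1.partitions.length           // 4

val fewer = rdd1.coalesce(2)     //reduces the partition count, no shuffle by default
fewer.partitions.length          // 2

val more = rdd1.repartition(8)   //always shuffles, so the partition count can also grow
more.partitions.length           // 8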

For the usage of more operators, see <http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html>;
it is very detailed.

2.11 Partitioning

Spark ships with two partitioner classes:
HashPartitioner: the default partitioner, used by some operators that involve a shuffle. Operators that let you specify a minimum number of partitions also involve partitioning. These partitioners apply only to KV pairs.
RangePartitioner: partitions by key range, e.g. keys 1~100 and 101~200 go to different partitions.
Users can also define their own partitioner; the steps are:
1. Extend the Partitioner class, implement the partitioning logic in it, and so form a new partitioner class
2. rdd.partitionBy(new PartitionerClassXxx())
Example:

The data format is as follows:
192.168.88.1 - - [30/Jul/2017:12:55:02 +0800] "GET /MyDemoWeb/web.jsp HTTP/1.1" 200 239
192.168.88.1 - - [30/Jul/2017:12:55:02 +0800] "GET /MyDemoWeb/hadoop.jsp HTTP/1.1" 200 242

Requirement:
Write the access logs of each page into its own separate file.

Code:
package SparkExer

import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import scala.collection.mutable

/**
  * Custom partitioning:
  * 1. Extend the Partitioner class, implement the partitioning logic in it, and so form a new partitioner class
  * 2. rdd.partitionBy(new PartitionerClassXxx())
  */
object CustomPart {
  def main(args: Array[String]): Unit = {
    //specify the Hadoop home directory; some Hadoop libraries are needed to write files locally
    System.setProperty("hadoop.home.dir","F:\\hadoop-2.7.2")

    val conf = new SparkConf().setAppName("Tomcat Log Partitioner").setMaster("local")
    val sc = new SparkContext(conf)
    //split the file
    val rdd1 = sc.textFile("G:\\test\\tomcat_localhost_access_log.2017-07-30.txt").map(
      line => {
        val jspName = line.split(" ")(6)
        (jspName,line)
      }
    )

    //extract all keys, i.e. the page names
    val rdd2 = rdd1.map(_._1).distinct().collect()
    //partition the data
    val rdd3 = rdd1.partitionBy(new TomcatWebPartitioner(rdd2))
    //write each partition's data to a file
    rdd3.saveAsTextFile("G:\\test\\tomcat_localhost")
  }
}

class TomcatWebPartitioner(jspList:Array[String]) extends Partitioner{
  private val listMap = new mutable.HashMap[String,Int]()
  var partitionNum = 0

  //plan the total number of partitions based on the page names
  for (s<-jspList) {
    listMap.put(s, partitionNum)
    partitionNum += 1
  }

  //return the total number of partitions
  override def numPartitions: Int = listMap.size

  //return the partition number for a given key
  override def getPartition(key: Any): Int = listMap.getOrElse(key.toString, 0)
}

2.12 Serialization issues

First, understand one thing: a Spark program really consists of two parts. One part is the driver, which itself runs as a process in the cluster; the other part is the ordinary executors, which run the tasks that operate on RDDs. So only the code that operates on RDDs runs distributed, spread across multiple executors; code that does not belong to an RDD does not, it simply runs in the driver. That is the key point.
Example:

object test {
    val sc = new SparkContext()
    print("xxxx1")                        //runs only in the driver

    val rdd1 = sc.textFile(xxxx)
    rdd1.map(x => { print("xxx2"); x })   //runs inside the RDD, i.e. on the executors

}

In the example above, the print("xxx2") inside rdd1's map runs on multiple executors, because it executes inside the RDD, while the outer print("xxxx1") runs only in the driver; it is not serialized, so it could not be sent over the network anyway. This distinction must be understood. From this we know that a variable that is not defined inside an RDD cannot be obtained by the programs running on the executors. But what if we want exactly that, without having to define the variable inside the RDD? Then we need the shared variables described below.

2.13 Broadcast variables (shared variables) in Spark

Broadcast variables make it possible to define a variable in the driver and then use it inside RDD operators running on different executors, without having to define it inside the RDD operator. Commonly used connection objects, such as a MySQL database connection, can be made broadcast variables so that only one connection has to be created.
Usage example:

//define a shared variable to share the data read from MongoDB; the data has to be wrapped in the form map(mid1, [map(mid2,score), map(mid3,score)....])

    val moviesRecsMap = spark.read
      .option("uri", mongoConfig.uri)
      .option("collection", MOVIES_RECS)
      .format("com.mongodb.spark.sql")
      .load().as[MoviesRecs].rdd.map(item=> {
      (item.mid, item.recs.map(itemRecs=>(itemRecs.mid,itemRecs.socre)).toMap)
    }).collectAsMap()

    //This is the key step: broadcasting the variable
    //broadcast this variable so that it can later be used in any executor
    val moviesRecsMapBroadcast = spark.sparkContext.broadcast(moviesRecsMap)
    //because it is lazily loaded, it has to be referenced once before it is actually broadcast
    moviesRecsMapBroadcast.id
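A more minimal, self-contained sketch of the same idea, assuming only a SparkContext named sc; the lookup table and the RDD here are made up for illustration:

//a small lookup table built in the driver
val countryNames = Map("cn" -> "China", "us" -> "United States", "jp" -> "Japan")
//broadcast it once; every executor gets a read-only copy
val bc = sc.broadcast(countryNames)

val codes = sc.parallelize(List("cn", "us", "jp", "cn"))
//inside the operator, access the shared value through .value
val named = codes.map(code => bc.value.getOrElse(code, "unknown"))
named.collect   // Array(China, United States, Japan, China)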

Three, small Spark cases

3.1 Top-N most visited pages of a website

Requirement: from the website access log, compute the names of the N most visited pages.
The data format is as follows:
192.168.88.1 - - [30/Jul/2017:12:55:02 +0800] "GET /MyDemoWeb/web.jsp HTTP/1.1" 200 239
192.168.88.1 - - [30/Jul/2017:12:55:02 +0800] "GET /MyDemoWeb/hadoop.jsp HTTP/1.1" 200 242

Code:
package SparkExer

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Analyze Tomcat logs
  * Log example:
  * 192.168.88.1 - - [30/Jul/2017:12:53:43 +0800] "GET /MyDemoWeb/ HTTP/1.1" 200 259
  *
  * Count the number of visits per page
  */
object TomcatLog {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Tomcat Log analysis").setMaster("local")
    val sc = new SparkContext(conf)

    val rdd1 = sc.textFile("G:\\test\\tomcat_localhost_access_log.2017-07-30.txt")
      .map(_.split(" ")(6))
      .map((_,1))
      .reduceByKey(_+_)
      .map(t=>(t._2,t._1))
      .sortByKey(false)
      .map(t=>(t._2,t._1))
      .collect()
    //sortBy(_._2,false) could also be used to sort directly by value

    //take the first N entries of the result
    rdd1.take(2).foreach(x=>println(x._1 + ":" + x._2))
    println("=========================================")
    //take the last N entries of the result
    rdd1.takeRight(2).foreach(x=>println(x._1 + ":" + x._2))
    sc.stop()
  }
}

3.2 Custom partitioning example

See the partitioning example in section 2.11 above.

3.3 Spark connecting to MySQL

package SparkExer

import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.spark.{SparkConf, SparkContext}

object SparkConMysql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Tomcat Log To Mysql").setMaster("local")
    val sc = new SparkContext(conf)
    val rdd1 = sc.textFile("G:\\test\\tomcat_localhost_access_log.2017-07-30.txt")
      .map(_.split(" ")(6))

    rdd1.foreach(l=>{
      //JDBC operations must be contained inside the RDD so that they can be called by the executors on every worker; in other words, the RDD is used to get serialization
      val jdbcUrl = "jdbc:mysql://bigdata121:3306/test?serverTimezone=UTC&characterEncoding=utf-8"
      var conn:Connection = null
      //SQL statement object
      var ps:PreparedStatement = null

      conn = DriverManager.getConnection(jdbcUrl, "root", "wjt86912572")
      //? is a placeholder; the values are filled in later with ps.setXxx(index, value), in order
      ps = conn.prepareStatement("insert into customer values (?,?)")
      ps.setString(1,l)
      ps.setInt(2,1)
      ps.executeUpdate()
      ps.close()
      conn.close()

    })
  }
}

Note:
When Spark uses JDBC, operating on the database with JDBC directly causes serialization problems,
because in Spark's distributed framework any object that operates on an RDD must belong to the inside of the RDD
before it can be used across the distributed cluster; that is, it needs to be serialized.
Put plainly: it is the difference between 5 workers sharing one JDBC connection object and each of the 5 workers creating its own connection object.
So the JDBC connection object must be defined inside the RDD.

The approach above is too cumbersome, and it creates a JDBC connection object for every single record.
Optimization: use rdd1.foreachPartition() to operate on each partition instead of on each record;
this way only one JDBC connection object is created per partition, which saves database resources.

package SparkExer

import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.spark.{SparkConf, SparkContext}

object SparkConMysql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Tomcat Log To Mysql").setMaster("local")
    val sc = new SparkContext(conf)
    val rdd1 = sc.textFile("G:\\test\\tomcat_localhost_access_log.2017-07-30.txt")
      .map(_.split(" ")(6))

    rdd1.foreachPartition(updateMysql)
    /**
      * The approach above is too cumbersome, and every record creates a new JDBC connection object
      * Optimization: use rdd1.foreachPartition() to operate on each partition instead of on each record
      * This way, only one JDBC connection object needs to be created per partition, saving database resources
      */

  }

  def updateMysql(it:Iterator[String]) = {
    val jdbcUrl = "jdbc:mysql://bigdata121:3306/test?serverTimezone=UTC&characterEncoding=utf-8"
    var conn:Connection = null
    //SQL statement object
    var ps:PreparedStatement = null

    conn = DriverManager.getConnection(jdbcUrl, "root", "wjt86912572")
    //conn.createStatement()

    //ps = conn.prepareStatement("select * from customer")
    //? is a placeholder; the values are filled in later with ps.setXxx(index, value), in order
    ps = conn.prepareStatement("insert into customer values (?,?)")
    it.foreach(data=>{
      ps.setString(1,data)
      ps.setInt(2,1)
      ps.executeUpdate()
    })
    ps.close()
    conn.close()
  }
}

Another way is to connect to MySQL through a JdbcRDD, which wraps the connection object:

package SparkExer

import java.sql.DriverManager

import org.apache
import org.apache.spark
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.{SparkConf, SparkContext}

object MysqlJDBCRdd {
  def main(args: Array[String]): Unit = {
    val conn = () => {
      Class.forName("com.mysql.jdbc.Driver").newInstance()
      DriverManager.getConnection("jdbc:mysql://bigdata121:3306/test?serverTimezone=UTC&characterEncoding=utf-8",
      "root",
      "wjt86912572")
    }
    val conf = new SparkConf().setAppName("Tomcat Log To Mysql").setMaster("local")
    val sc = new SparkContext(conf)
    //create the JdbcRDD object
    val mysqlRdd = new JdbcRDD(sc,conn,"select * from customer where id>? and id<?", 2,7,2,r=> {
      r.getString(2)
    })

  }
}

This class is quite limited: it can only be used for select statements, the two bound values of the where clause must be supplied, and the number of partitions has to be specified.

Four, shuffle issues

4.1 Analysis of data skew caused by shuffle

https://www.cnblogs.com/diaozhaojian/p/9635829.html

1. How data skew arises
(1) During a shuffle, all records with the same key on the various nodes must be pulled to a single task on one node for processing. If the amount of data for one key is especially large, data skew occurs.
(2) The partitioning rule applied after the shuffle can leave one partition with far too much data, which also causes data skew.

2. Detecting and locating data skew
Use the Spark Web UI to see how much data each task of the currently running stage was assigned, and from that determine whether uneven task assignment is causing the skew.
Once you know in which stage the skew occurs, use the stage-division principle to work out which part of the code that stage corresponds to; that part of the code is bound to contain a shuffle-class operator. Use countByKey to inspect the distribution of the keys.

3. Solutions to data skew
Filter out the few keys that cause the skew
Increase the parallelism of the shuffle operation
Local aggregation followed by global aggregation (see the sketch below)
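A minimal sketch of the "local aggregation + global aggregation" idea (two-phase aggregation with a random salt); the salt range of 10 is an arbitrary choice for illustration:

import scala.util.Random

val pairs = sc.parallelize(List(("hot", 1), ("hot", 1), ("hot", 1), ("cold", 1)))

//phase 1: prefix each key with a random salt, so one hot key is spread across several tasks
val salted = pairs.map { case (k, v) => (s"${Random.nextInt(10)}_$k", v) }
val partial = salted.reduceByKey(_ + _)   //local aggregation on the salted keys

//phase 2: strip the salt and aggregate again to get the real totals
val result = partial
  .map { case (k, v) => (k.split("_", 2)(1), v) }
  .reduceByKey(_ + _)

result.collect   // Array((hot,3), (cold,1)) -- order may vary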

4.2 Shuffle-class operators

1. De-duplication:
def distinct()
def distinct(numPartitions: Int)

2. Aggregation
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
def groupBy[K](f: T => K, p: Partitioner):RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner):RDD[(K, Iterable[V])]
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner): RDD[(K, U)]
def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int): RDD[(K, U)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializer: Serializer = null): RDD[(K, C)]

3. Sorting
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length): RDD[(K, V)]

def sortBy[K](f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]

4. Repartitioning

def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null)

5. Set and join operations
def intersection(other: RDD[T]): RDD[T]

def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

def intersection(other: RDD[T], numPartitions: Int): RDD[T]

def subtract(other: RDD[T], numPartitions: Int): RDD[T]

def subtract(other: RDD[T], p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)]

def subtractByKey[W: ClassTag](other: RDD[(K, W)], numPartitions: Int): RDD[(K, V)]

def subtractByKey[W: ClassTag](other: RDD[(K, W)], p: Partitioner): RDD[(K, V)]

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]

def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]

Origin: blog.51cto.com/kinglab/2450770