Spark notes

val conf = new SparkConf()

val sc: SparkContext = new SparkContext(conf)

val rawRDDA = sc.parallelize(List("!! bb ## cc","%% cc bb %%","cc && ++ aa"),3) 

           

# sc.parallelize(..., 3): the data is loaded in parallel onto three machines (three Partitions)

 

var tmpRDDA1 = rawRDDA.flatMap(line=>line.split(" "))

var tmpRDDA2 = tmpRDDA1.filter(allWord=>{allWord.contains("aa") || allWord.contains("bb")})

var tmpRDDA3 = tmpRDDA2.map(word=>(word,1))

import org.apache.spark.HashPartitioner

var tmpRDDA4 = tmpRDDA3.partitionBy(new HashPartitioner(2)).groupByKey()

 

# partitionBy(new HashPartitioner(2)).groupByKey() shuffles the data from the previous 3 machines onto 2 machines

 

var tmpResultRDDA = tmpRDDA4.map((P:(String,Iterable[Int]))=>(P._1,P._2.sum))

# Sum the values that share the same key

Partition: a fixed block of data on one machine; an RDD is made up of a series of related Partitions.

                For example, tmpRDDA2 has three Partitions, while tmpResultRDDA has two Partitions
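
A quick way to check these Partition counts is to ask the RDDs directly (a small sketch, assuming the RDDs defined above are in scope):

println(tmpRDDA2.partitions.length)      // expected: 3
println(tmpResultRDDA.partitions.length) // expected: 2, after partitionBy(new HashPartitioner(2))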

 

RDD: the unit on which operations are applied uniformly; any operation (e.g. flatMap, filter, map) is executed inside every Partition of the RDD

          For example, rawRDDA -> tmpRDDA1 executes flatMap(line => line.split(" ")); all three Partitions of rawRDDA ("!! bb ## cc" on cslave0,

          "%% cc bb %%" on cslave1 and "cc && ++ aa" on cslave2) perform the flatMap operation

 

Data parallelism at the RDD level: all Partitions belonging to one RDD must perform the same operation; if those Partitions sit on different machines, the

        different machines execute at the same time, i.e. in parallel
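
To see which elements ended up in which Partition (and that every Partition ran the same operations), glom() turns each Partition into an array; a minimal sketch, assuming tmpRDDA3 from the code above:

tmpRDDA3.glom().collect().zipWithIndex.foreach { case (part, i) =>
  println(s"Partition $i: ${part.mkString(", ")}")   // prints the (word,1) pairs held by Partition i
}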

 

An RDD's parallel paradigms are mainly Map and Shuffle

        Map paradigm: operates only on the data inside one Partition; a data object's operation never spans multiple Partitions, i.e. never crosses the network.

        Shuffle paradigm: reorganizes the data across different Partitions; a data object's operation spans several or even all Partitions, i.e. crosses the network
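
A small sketch of the two paradigms (sc assumed to be the SparkContext from above): map stays inside each Partition, while groupByKey regroups data across Partitions:

val nums = sc.parallelize(1 to 12, 3)
val doubled = nums.map(_ * 2)                               // Map paradigm: still 3 Partitions, no network traffic
val grouped = doubled.map(n => (n % 4, n)).groupByKey(2)    // Shuffle paradigm: data regrouped into 2 Partitions across the network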

 

 

 

Scenario: multiple input sources

There are two raw files, rawFile1 and rawFile2. rawFile1 must be loaded evenly onto cslave3 and cslave4 and then de-duplicated;

rawFile2 must be loaded onto cslave5, and every entry that appears in rawFile2 must then be removed from the rawFile1 result

val conf = new SparkConf()

val sc: SparkContext = new SparkContext(conf)

var rawRDDB = sc.parallelize(List(("xx",99),("yy",88),("xx",99),("zz",99)),2)

var rawRDDC = sc.parallelize(List(("yy",88)),1)

var tmpResultRDDBC = rawRDDB.distinct.subtract(rawRDDC)

 

subtract() takes the difference of two RDDs, i.e. it keeps the elements of the first RDD that do not appear in the second RDD
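
With the small inputs above, the result can be traced by hand (collect() order is not guaranteed):

// rawRDDB.distinct    => ("xx",99), ("yy",88), ("zz",99)   -- the duplicate ("xx",99) is removed
// .subtract(rawRDDC)  => ("xx",99), ("zz",99)              -- ("yy",88) also appears in rawRDDC
tmpResultRDDBC.collect().foreach(println)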

       

 

Scenario: a more complex case

Multiple input sources, de-duplication, loading, and regrouping

val conf = new SparkConf()

val sc: SparkContext = new SparkContext(conf)

var rawRDDA = sc.parallelize(List("!! bb ## cc","%% cc bb %%","cc && ++ aa"),3)

var rawRDDB = sc.parallelize(List(("xx",99),("yy",88),("xx",99),("zz",99)),2)

var rawRDDC = sc.parallelize(List(("yy",88)),1)

import org.apache.spark.HashPartitioner

var tmpResultRDDA = rawRDDA.flatMap(line=>line.split(" ")).filter(allWord=>{allWord.contains("aa")||allWord.contains("bb")}).map(word=>(word,1)).partitionBy(new HashPartitioner(2)).groupByKey().map((P:(String,Iterable[Int]))=>(P._1,P._2.sum))

var tmpResultRDDBC = rawRDDB.distinct.subtract(rawRDDC)

var resultRDDABC = tmpResultRDDA.union(tmpResultRDDBC)

resultRDDABC.saveAsTextFile("HDFS path")

 

When the Map paradigm acts on an RDD, the number of Partitions does not change between the input RDD and the output RDD; when partitionBy or union acts on an RDD, the number of Partitions does change between the input and output RDDs
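
This can be checked on the pipeline above (a sketch, assuming the RDDs defined in this scenario):

println(rawRDDA.partitions.length)                         // 3
println(rawRDDA.flatMap(_.split(" ")).partitions.length)   // still 3: map/flatMap keep the Partition count
println(tmpResultRDDA.partitions.length)                   // 2: partitionBy(new HashPartitioner(2)) changed it
println(resultRDDABC.partitions.length)                    // typically 2 + 1 = 3: union combines the Partitions of both inputs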

 

When an RDD is persisted to HDFS, the RDD corresponds to one folder, and each Partition belonging to that RDD corresponds to a separate file

Intermediate data between RDDs is not written to local disk or to HDFS

Multiple RDD operations can be chained with the dot '.', e.g. RDD1.map().filter().groupBy()

 

An RDD can operate on a specified Partition without changing the other Partitions
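
One way to do this is mapPartitionsWithIndex, which exposes the Partition index; a sketch (the factor 10 is just an illustration), assuming tmpResultRDDA from above:

val onlyPart0 = tmpResultRDDA.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.map { case (k, v) => (k, v * 10) }   // only Partition 0 is transformed
  else iter                                               // other Partitions pass through unchanged
}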

 

Spark app execution flow:
1. The user calls the RDD API and writes the RDD-transformation application code

2. The job is submitted to the Master with Spark

3. The Master receives the job and tells each Worker to start Executors

4. Each Executor registers with the Driver (the user's code plus the client that submits the job are collectively called the Driver)

5. The RDD Graph organizes the user's chain of RDDs into a DAG-RDD

6. The DAGScheduler splits the DAG-RDD at Shuffle boundaries (i.e. it splits whenever it meets a Shuffle) into a series of Stage DAG-RDDs (StageDAG-RDD0 -> StageDAG-RDD1 -> StageDAG-RDD2 -> ...)

7. The RDD contacts the NameNode and loads the data blocks on the DataNodes into the RDD's Partitions

8. The TaskScheduler dispatches StageDAG-RDD0 to all Partitions belonging to that RDD for execution; during execution, the Executor on a Partition runs its own Partition first

9. The TaskScheduler dispatches StageDAG-RDD1 to all Partitions of the (now changed) RDD for execution

10. Steps 8 and 9 are repeated until all Stage DAG-RDDs have been executed
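
In code, steps 1 and 2 roughly correspond to building the RDD chain in the Driver and pointing SparkConf at the Master; a sketch with placeholder values (the app name and Master URL are assumptions, not from these notes):

val conf = new SparkConf()
  .setAppName("word-stats")             // hypothetical app name
  .setMaster("spark://master:7077")     // placeholder standalone Master URL
val sc = new SparkContext(conf)
// ... build the RDD chain; an action such as saveAsTextFile then triggers job submission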

 

 

Resource isolation

Each running Spark app has its own set of Executor processes (spread across different machines or cores), and these Executors cooperate to complete the job.

A single Executor runs all the Tasks assigned to it by that Spark app, multiplexing them over multiple threads.

An Executor belongs to exactly one Spark app; one Spark app can have several Executors

This differs from MapReduce. For example, take an ML app built from Map -> Reduce1 -> Reduce2, with ten slaves executing the job at the same time. Looking at any single slave machine,

the MapReduce framework starts a Map process, a Reduce1 process and a Reduce2 process, and the three processes run the job one after another,

whereas Spark completes all of these operations with a single Executor process.
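
Executor resources are set per Spark app; a sketch using standard configuration keys (the values below are placeholders):

val conf = new SparkConf()
  .set("spark.executor.memory", "2g")   // memory per Executor
  .set("spark.executor.cores", "2")     // cores per Executor, i.e. how many Tasks it can run in parallel as threads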

 

The Spark app itself is not aware that the cluster exists

 


Origin www.cnblogs.com/Ting-light/p/11103989.html