Spark analysis and use

I. Features of Spark, a large-scale data processing engine

What are Spark's advantages? What data structures does the Spark framework use? How does Spark change its data structures for different situations? How does data flow through Spark? How does Spark ensure the safety of its data?

1. Speed: Spark keeps intermediate data in memory. It is roughly 100 times faster than Hadoop MapReduce when data is held in memory, and roughly 10 times faster when data is stored on disk.

2. Ease of use: Spark applications can be written in Java, Scala, Python, R, and other programming languages (a minimal example appears after this list).

3. Generality: Spark provides the Spark SQL, Spark Streaming, MLlib, and GraphX functional modules, making it a very capable general-purpose engine.

Figure-1 The Spark ecosystem

4. Runs everywhere: Spark can run on Hadoop YARN, on Mesos, or in its own standalone mode, and it can read data from HDFS, Cassandra, HBase, S3, Hive, and other data sources.

Note: The above section is taken from the official website: http://spark.apache.org/
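To make the "ease of use" point concrete, here is a minimal word-count application in Scala. It is only a sketch: the application name, the local master URL, and the input path are placeholders rather than anything prescribed by Spark.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Build a SparkSession; "local[*]" is an assumption for local testing only.
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()

    // The HDFS path below is a placeholder input.
    val lines = spark.sparkContext.textFile("hdfs:///tmp/input.txt")

    val counts = lines
      .flatMap(_.split("\\s+"))   // split each line into words
      .map(word => (word, 1))     // pair each word with a count of 1
      .reduceByKey(_ + _)         // sum the counts per word

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```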

II. Design and implementation principles of Spark

1. Spark's basic data abstractions: RDD (Resilient Distributed Dataset), DataFrame, and Dataset.

1.1 RDD, the core data abstraction of Spark: an RDD is a partitioned, distributed collection of objects. Different partitions of an RDD can be stored on different nodes of the cluster, so computation can run in parallel across those nodes. With a narrow dependency (NarrowDependency), all the parent partitions of a child partition can be computed on one cluster node in a pipelined (pipeline) manner.

1.2 Advantages of RDD: (a) high fault tolerance; (b) intermediate results can be persisted in memory; (c) data can be stored as Java objects, avoiding unnecessary object serialization and deserialization.

1.3 RDD dependencies: in a narrow dependency (NarrowDependency), each partition of the child RDD depends on a constant number of parent partitions, regardless of data size (e.g. map, filter, union, and co-partitioned join); in a wide dependency (ShuffleDependency), each partition of the child RDD may depend on all partitions of the parent RDD (e.g. groupBy and non-co-partitioned join). A sketch contrasting the two follows below.
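The following sketch (assuming a local SparkSession and made-up data) contrasts the two kinds of dependency: map and filter are narrow and can be pipelined, while reduceByKey introduces a shuffle, and toDebugString shows the resulting lineage.

```scala
import org.apache.spark.sql.SparkSession

object DependencyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DependencyDemo")
      .master("local[*]")   // assumption: local mode, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    val nums = sc.parallelize(1 to 100, numSlices = 4)

    // Narrow dependencies: each child partition reads one parent partition,
    // so map and filter can be pipelined without a shuffle.
    val narrow = nums.map(_ * 2).filter(_ % 3 == 0)

    // Wide dependency: reduceByKey shuffles data, so each child partition
    // may read from every parent partition.
    val wide = nums.map(n => (n % 10, n)).reduceByKey(_ + _)

    // toDebugString prints the lineage; a ShuffledRDD marks the stage boundary.
    println(narrow.toDebugString)
    println(wide.toDebugString)

    spark.stop()
  }
}
```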

2. Stage division

2.1 Principles of Stage division: First, follow the RDD dependencies: break the chain when a wide dependency is encountered; when a narrow dependency is encountered, add the current RDD to the current Stage, so that narrowly dependent RDDs are placed in the same Stage whenever possible. Second, once the dependency chain is broken at wide dependencies, the tasks inside each Stage can run in parallel.

2.2 A Job consists of multiple Stages; the Stages of a Job are executed in sequence, and the Job finishes once all of its Stages have completed.

2.3 DAGScheduler's Stage partitioning algorithm: it analyzes the lineage backwards, starting from the RDD on which the action was triggered. It first creates a Stage for the last RDD; during the backward traversal, when it meets a narrow dependency it adds the current RDD to that Stage, and when it meets a wide dependency it breaks the chain and creates a new Stage for the widely dependent RDD, which becomes the last RDD of that new Stage. This continues until all the RDDs have been traversed (a simplified sketch follows the figure below).

Figure-2 Example of Stage partitioning
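The backward traversal can be illustrated with a simplified sketch. This is not the real DAGScheduler code: the Plan, Dep, and Stage types below are invented purely to show how wide dependencies become stage boundaries.

```scala
object StageDivisionSketch {
  sealed trait Dep
  case class Narrow(parent: Plan) extends Dep
  case class Wide(parent: Plan) extends Dep
  case class Plan(name: String, deps: List[Dep])
  case class Stage(rdds: List[String])

  // Walk backwards from the final RDD: narrow parents join the current stage,
  // wide parents break the chain and start a new (earlier) stage.
  def buildStages(last: Plan): List[Stage] = {
    def visit(plan: Plan, current: List[String], done: List[Stage]): (List[String], List[Stage]) =
      plan.deps.foldLeft((plan.name :: current, done)) {
        case ((cur, acc), Narrow(parent)) => visit(parent, cur, acc)
        case ((cur, acc), Wide(parent)) =>
          val (parentStage, parentAcc) = visit(parent, Nil, acc)
          (cur, Stage(parentStage) :: parentAcc)
      }
    val (lastStage, earlier) = visit(last, Nil, Nil)
    earlier :+ Stage(lastStage)
  }

  def main(args: Array[String]): Unit = {
    // a --narrow--> b --wide--> c : expect Stage(a, b) followed by Stage(c)
    val a = Plan("a", Nil)
    val b = Plan("b", List(Narrow(a)))
    val c = Plan("c", List(Wide(b)))
    buildStages(c).foreach(println)
  }
}
```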

3. Spark also belongs to the MapReduce family of computing models, but it is not limited to Map and Reduce operations: it offers many kinds of dataset operations, so its programming model is more flexible than Hadoop MapReduce.

Figure-3 MapReduce vs Spark

4. Application scenarios of the various components of the Spark ecosystem:

Figure-4 Application scenarios

5. The Spark running framework

(1) Basic concepts:

1. RDD (Resilient Distributed Dataset): a distributed memory abstraction that provides a highly restricted shared-memory model.

2. DAG (Directed Acyclic Graph): reflects the dependencies between RDDs and serves as the basis of the task scheduling and execution mechanism.

3. Executor: a process running on a worker node, responsible for running Tasks.

4. Application: a Spark program written by the user; an Application consists of one Driver (the driver program) and several Jobs.

5. Task: the unit of work that runs on an Executor. Tasks come in two kinds, ShuffleMapTask and ResultTask, and the Task is the basic unit for running an Application. Task scheduling and management are handled by the TaskScheduler.

6. Job: within an Application, each action operation generates a Job. A Job consists of multiple Stages, and it contains multiple RDDs together with the various operations applied to those RDDs.

7. Stage: the basic scheduling unit of a Job. A Job is divided into groups of Tasks; each group is called a Stage, also known as a TaskSet, and represents a set of related tasks with no shuffle (wide) dependencies among them.
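A small, hypothetical snippet relating these concepts: one run of the program is one Application with one Driver, each action triggers a Job, the shuffle in reduceByKey marks a Stage boundary, and each Stage runs as a TaskSet with one Task per partition.

```scala
import org.apache.spark.sql.SparkSession

object JobStageTaskDemo {
  def main(args: Array[String]): Unit = {
    // One running instance of this program is one Application with one Driver.
    val spark = SparkSession.builder()
      .appName("JobStageTaskDemo")
      .master("local[*]")   // assumption: local mode, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2)
      .map(w => (w, 1))
      .reduceByKey(_ + _)   // the shuffle here is a Stage boundary

    // Each action below triggers one Job; each Job is split into Stages at the
    // shuffle, and each Stage runs as a TaskSet with one Task per partition.
    println(counts.count())            // action #1 -> Job 0
    counts.collect().foreach(println)  // action #2 -> Job 1

    spark.stop()
  }
}
```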

(2) Composition of the Spark running framework

1. The Spark running framework consists of: the cluster resource manager (Cluster Manager) + the task control node (Driver) + worker nodes (Worker Node) + execution processes (Executor).

2. The cluster resource manager (Cluster Manager) is mainly responsible for allocating and managing resources. The resources it allocates constitute a first-level allocation: it assigns the memory, CPU, and other resources of each Worker Node to applications, but it is not responsible for allocating resources to individual Executors.

3. Driver Program: runs the Application's main() function, converts the user program into RDDs and a DAG, and communicates with the Cluster Manager for scheduling.

4. Worker Node: manages a compute node, creates and starts Executors, further allocates resources and tasks to the Executors, and reports resource information back to the Cluster Manager.

5. Executor: a process of a given Application running on a Worker Node. It is mainly responsible for executing tasks and for synchronizing information with the Worker and the Driver.

6. Advantages of the Executor: (1) it uses multiple threads to execute tasks, reducing task startup overhead; (2) its BlockManager storage module uses both memory and disk as storage devices, effectively reducing IO overhead (see the sketch after the figure below).

Figure-5 The Spark running framework
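As an illustration of the second point, the sketch below (local mode, made-up data) persists an RDD with StorageLevel.MEMORY_AND_DISK, which asks each Executor's BlockManager to use memory first and spill to disk, so that later actions reuse the cached blocks instead of recomputing or re-reading them.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PersistDemo")
      .master("local[*]")   // assumption: local mode, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    val squares = sc.parallelize(1 to 100000)
      .map(x => x.toDouble * x)
      // MEMORY_AND_DISK asks each Executor's BlockManager to keep partitions
      // in memory and spill them to disk when memory runs short.
      .persist(StorageLevel.MEMORY_AND_DISK)

    println(squares.count())  // first action computes and caches the partitions
    println(squares.sum())    // later actions reuse the cached blocks, saving IO

    squares.unpersist()
    spark.stop()
  }
}
```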

(3) Execution flow of the Spark running framework

1. Overview: when an Application is executed, the Driver requests resources from the cluster manager, starts Executors, and sends the application code and files to them. Tasks are then executed on the Executors; when execution finishes, the results are returned to the Driver, or written to HDFS (a distributed file system) or another data store.

2. Detailed execution steps (a driver-side sketch follows the figure below):

2.1 The Driver requests resources from the cluster manager and handles task allocation and monitoring: the Driver creates a SparkContext.

2.2 The resource manager allocates resources for the Executors and starts the Executor processes.

2.3 The SparkContext builds a DAG (directed acyclic graph) from the RDD dependencies; the DAG is submitted to the DAGScheduler, which parses it into Stages (TaskSets).

2.4 The TaskSets are submitted to the low-level scheduler, the TaskScheduler, for processing.

2.5 The Executors request Tasks from the SparkContext; the TaskScheduler sends Tasks to the Executors to run and also provides the application code.

2.6 Tasks run on the Executors, and the execution results are reported back to the TaskScheduler and then to the DAGScheduler.

Figure-6 The Spark running process
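A hedged driver-side sketch of this flow: the configuration keys are standard Spark settings, but the values are placeholders, and in a real deployment the master URL would normally come from spark-submit rather than being hard-coded.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ResourceRequestDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ResourceRequestDemo")
      .setMaster("local[2]")                  // local master only so the sketch runs by itself
      .set("spark.executor.instances", "4")   // Executors requested from the cluster manager
      .set("spark.executor.cores", "2")       // CPU cores per Executor
      .set("spark.executor.memory", "4g")     // memory per Executor

    val sc = new SparkContext(conf)           // step 2.1: the Driver creates a SparkContext

    // Any action from here follows steps 2.3 to 2.6: SparkContext builds the DAG,
    // the DAGScheduler splits it into Stages, the TaskScheduler ships Tasks to
    // the Executors, and the results come back to the Driver.
    val result = sc.parallelize(1 to 10).map(_ * 2).reduce(_ + _)
    println(result)

    sc.stop()
  }
}
```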

 

III. Spark compilation and source code analysis

1. Maven is a project management tool that can build Java projects and manage their dependencies; the Spark source tree itself is built with Maven.

 

IV. The Spark programming model

The complete lifecycle of a Spark application, from writing the code to submission, execution, and output:

1. The user writes the Driver application using the APIs provided by SparkContext (commonly textFile, sequenceFile, runJob, stop, and so on). In addition, SQLContext, HiveContext, and StreamingContext wrap SparkContext and provide APIs for SQL, Hive, and stream processing (see the sketch after this list).

2. When the user application is submitted through SparkContext, it first uses the BlockManager and BroadcastManager to broadcast the task's Hadoop configuration. The DAGScheduler then converts the job into RDDs organized as a DAG, and the DAG is divided into Stages; a Stage consists of multiple Tasks, which are collected into a TaskSet (a TaskSet corresponds to a Stage). Finally, the TaskScheduler submits the Tasks to the cluster manager (Cluster Manager) over the Netty communication framework.

3. The cluster manager (Cluster Manager) allocates resources to the job, that is, it assigns the concrete tasks to Workers, and each Worker creates Executors to run them. Standalone, YARN, Mesos, Kubernetes, EC2, and others can all serve as Spark's cluster manager.
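As a small illustration of point 1, the sketch below uses SparkSession (which in current Spark versions wraps the SQLContext/HiveContext functionality) to run a SQL query over a made-up in-memory table; the table name and rows are invented for the example.

```scala
import org.apache.spark.sql.SparkSession

object SqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SqlDemo")
      .master("local[*]")   // assumption: local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    val people = Seq(("Alice", 34), ("Bob", 28)).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // The SQL API runs on the same engine: this query is planned into a DAG of
    // Stages and Tasks just like the RDD examples earlier.
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```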

Figure-7 The Spark programming model
