Big Data Technology: Principles and Applications, Part 3: Big Data Processing and Analysis (III) Spark

1. Introduction to Spark

Spark was originally developed in 2009 by the AMP Lab at the University of California, Berkeley (UC Berkeley). It is a parallel big data computing framework based on in-memory computation, used to build large-scale, low-latency data analysis applications.

1.1 Features of Spark

Fast execution: Spark uses a DAG execution engine that supports cyclic data flow and in-memory computation.

Easy to use: Spark supports programming in Scala, Java, Python, and R, and offers interactive programming through the Spark Shell.

General-purpose: Spark provides a complete and powerful technology stack, including components for SQL queries, stream computing, machine learning, and graph algorithms.

Flexible deployment: Spark can run in standalone cluster mode, on Hadoop, or in cloud environments such as Amazon EC2, and can access data sources such as HDFS, Cassandra, HBase, and Hive.

1.2 A Brief Introduction to Scala

Scala is a modern multi-paradigm programming language that runs on the Java platform (the JVM) and is compatible with existing Java programs.

Scala is Spark's primary programming language, although Spark also supports Java, Python, and R.

A key advantage of Scala is its REPL (Read-Eval-Print Loop, an interactive interpreter), which improves development productivity.

Key characteristics:

Scala has strong concurrency support and embraces functional programming, which makes it a good fit for distributed systems.

Scala's syntax is concise and lends itself to elegant APIs (a short sketch of this style follows the list).

Scala is compatible with Java, runs fast, and integrates well with the Hadoop ecosystem.
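To make the point about conciseness concrete, here is a minimal sketch in plain Scala (no Spark involved); the object name and word list are made up purely for illustration.

```scala
// A minimal, self-contained example of Scala's concise functional style:
// immutable collections plus higher-order functions do the work in one expression.
object ScalaStyleDemo {
  def main(args: Array[String]): Unit = {
    val words = List("spark", "scala", "hadoop", "spark")
    // Group equal words together, then count each group
    val counts = words.groupBy(identity).map { case (w, ws) => w -> ws.size }
    counts.foreach { case (w, n) => println(s"$w -> $n") }
  }
}
```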

1.3 Spark Compared with Hadoop

Although Hadoop is the de facto standard for big data technology, its main drawback is the high latency of the MapReduce computing model, which cannot meet the needs of real-time, fast computation; it is only suitable for offline batch-processing scenarios.

Hadoop has the following shortcomings:

Limited expressiveness: every computation must be cast into Map and Reduce operations.

Heavy disk I/O overhead: data is read from disk at each step, and intermediate results are also written back to disk.

High latency: the hand-off between tasks involves I/O overhead, and no task can start before the preceding one finishes, which makes complex, multi-stage computations hard to handle.

Compared with Hadoop MapReduce, Spark has the following main advantages:

Spark's computing model still belongs to the MapReduce family, but it is not limited to Map and Reduce operations; it provides many more dataset operation types, making its programming model more flexible than Hadoop MapReduce's.

Spark offers in-memory computation: intermediate results can be kept in memory, which makes iterative computations much more efficient (see the sketch below).

Spark's DAG-based task scheduling and execution mechanism is superior to Hadoop MapReduce's iterative execution mechanism.
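The following hedged sketch illustrates why keeping intermediate results in memory helps iterative workloads: the parsed RDD is cached once and reused in every iteration instead of being re-read from disk. The HDFS path, master URL, and iteration logic are assumptions chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeCacheSketch").setMaster("local[*]"))

    val values = sc.textFile("hdfs:///data/values.txt")   // assumed input path
      .map(_.trim.toDouble)
      .cache()                                             // keep the parsed data in memory across iterations

    var threshold = 0.0
    for (i <- 1 to 5) {
      // each pass reuses the cached RDD rather than re-reading and re-parsing the file
      threshold = values.filter(_ > threshold).mean()
      println(s"iteration $i: threshold = $threshold")
    }
    sc.stop()
  }
}
```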

Compared with Hadoop MapReduce, Spark's main drawback is:

Hadoop can use inexpensive, heterogeneous machines for distributed storage and computation, whereas Spark places higher demands on hardware.

1.4 The Spark Ecosystem

In practice, big data processing falls mainly into the following three types:

Complex batch processing: time spans typically range from tens of minutes to several hours; traditionally handled by MapReduce.

Interactive queries over historical data: time spans range from tens of seconds to several minutes; traditionally handled by Impala.

Processing of real-time data streams: time spans range from hundreds of milliseconds to several seconds; traditionally handled by Storm.

Spark's design follows the philosophy of "one software stack for different application scenarios":

It has gradually grown into a complete ecosystem that provides an in-memory computing framework while also supporting SQL ad-hoc queries, real-time stream computing, machine learning, and graph computation.

Spark can be deployed on top of the YARN resource manager, providing a one-stop big data solution.

Therefore, the Spark ecosystem is sufficient for all three scenarios above, supporting batch processing, interactive queries, and stream processing at the same time (a brief code sketch follows the table below).

The Spark ecosystem mainly consists of components such as Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX. The table below maps application scenarios to these components and to other frameworks addressing the same scenario.

| Application scenario | Time span | Other frameworks | Spark ecosystem component |
| --- | --- | --- | --- |
| Complex batch processing | Hours | MapReduce, Hive | Spark |
| Interactive queries over historical data | Minutes to seconds | Impala, Dremel, Drill | Spark SQL |
| Processing of real-time data streams | Milliseconds to seconds | Storm, S4 | Spark Streaming |
| Data mining over historical data | - | Mahout | MLlib |
| Graph-structured data processing | - | Pregel, Hama | GraphX |
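As a small illustration of the "one stack, many scenarios" idea, the sketch below uses a single SparkSession to drive both batch-style RDD processing and an SQL query. The file paths, JSON schema, and column names are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession

object OneStackSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("OneStackSketch")
      .master("local[*]")
      .getOrCreate()

    // Batch processing with the core RDD API
    val lines = spark.sparkContext.textFile("data/events.log")   // assumed log file
    println(s"batch view: ${lines.count()} log lines")

    // Interactive-style querying with Spark SQL on the same engine
    val events = spark.read.json("data/events.json")             // assumed JSON data with a "user" field
    events.createOrReplaceTempView("events")
    spark.sql("SELECT user, COUNT(*) AS cnt FROM events GROUP BY user").show()

    spark.stop()
  }
}
```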

2. Spark Runtime Architecture

2.1 Basic Concepts

RDD: short for Resilient Distributed Dataset, an abstraction of distributed memory that provides a highly restricted shared-memory model.

DAG: short for Directed Acyclic Graph, which captures the dependency relationships among RDDs.

Executor: a process running on a worker node (Worker Node), responsible for running Tasks.

Application: a Spark application written by the user.

Task: a unit of work that runs on an Executor.

Job: a Job consists of multiple RDDs and the various operations applied to them.

Stage: the basic scheduling unit of a Job. A Job is divided into groups of Tasks; each group is called a Stage, or a TaskSet, and represents a set of related tasks with no Shuffle dependencies among them (a small sketch relating these concepts follows the list).
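The following small sketch, under assumed local settings, relates these terms: the program is one Application with one Driver (the SparkContext); each action launches one Job; and the shuffle introduced by sortByKey splits each Job into two Stages of Tasks.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ConceptSketch {
  def main(args: Array[String]): Unit = {
    // One Application, one Driver: this SparkContext
    val sc = new SparkContext(new SparkConf().setAppName("ConceptSketch").setMaster("local[2]"))

    val pairs  = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3)), numSlices = 2)
    val sorted = pairs.sortByKey()            // shuffle -> a Stage boundary inside each Job

    println(sorted.count())                   // action #1 -> Job 1
    sorted.collect().foreach(println)         // action #2 -> Job 2
    sc.stop()
  }
}
```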

2.2 Architecture Design

The Spark runtime architecture comprises the cluster resource manager (Cluster Manager), the worker nodes (Worker Node) that run job tasks, the task control node (Driver) of each application, and the execution processes (Executor) on each worker node that carry out the concrete tasks. The resource manager can be Spark's built-in one or a resource management and scheduling framework such as Mesos or YARN.

Compared with the Hadoop MapReduce computing framework, Spark's Executor has two advantages: first, it uses multiple threads to execute tasks, reducing task startup overhead; second, each Executor contains a BlockManager storage module that uses both memory and disk as storage, effectively reducing I/O overhead.

An Application consists of one Driver and several Jobs, a Job consists of multiple Stages, and a Stage consists of multiple Tasks with no Shuffle dependencies among them.

When an Application is executed, the Driver requests resources from the cluster manager, starts the Executors, and sends the application code and files to them; the Tasks then run on the Executors. After execution finishes, the results are returned to the Driver or written to HDFS or another data store.

2.3 Basic Spark Execution Process

(1) First, the basic runtime environment for the application is set up: the Driver creates a SparkContext, which requests resources and assigns and monitors tasks.

(2) The resource manager allocates resources for the Executors and starts the Executor processes.

(3) The SparkContext builds a DAG from the RDD dependencies and submits it to the DAG Scheduler, which parses it into Stages; each TaskSet is then submitted to the underlying Task Scheduler. Executors request Tasks from the SparkContext, and the Task Scheduler dispatches Tasks to the Executors to run, together with the application code.

(4) Tasks run on the Executors and report their results to the Task Scheduler, which in turn reports to the DAG Scheduler; once the run completes, the data is written out and all resources are released (a minimal sketch of this flow follows).
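A minimal sketch of steps (1) through (4), with an assumed local master: creating the SparkContext corresponds to step (1); the transformations only extend the DAG; and the collect() action hands the DAG to the DAG Scheduler and Task Scheduler, which run the Tasks on Executors and return the result to the Driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RunProcessSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RunProcessSketch").setMaster("local[2]"))  // step (1)

    val nums    = sc.parallelize(1 to 1000, numSlices = 4)
    val squares = nums.map(n => n.toLong * n)     // recorded in the DAG, not executed yet
    val evens   = squares.filter(_ % 2 == 0)      // still lazy

    val result = evens.collect()                  // action: Stages and Tasks are scheduled and run now
    println(s"driver received ${result.length} values")
    sc.stop()
  }
}
```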

The Spark runtime architecture has the following characteristics:

(1) Each Application has its own dedicated Executor processes, which stay resident for the lifetime of the Application and run Tasks in a multi-threaded manner.

(2) Spark's execution process is independent of the specific resource manager, as long as Executor processes can be obtained and communication with them can be maintained.

(3) Tasks use optimization mechanisms such as data locality and speculative execution.

2.4 Design and Operating Principles of RDDs

Design concept

RDDs were created to meet this need (efficient reuse of intermediate results across computations): they provide an abstract data structure so that users need not worry about the distributed nature of the underlying data. Application logic only has to be expressed as a series of transformations, and the transformation operations between different RDDs form dependencies that can be pipelined, avoiding the storage of intermediate results.

The RDD concept

An RDD is a distributed collection of objects, essentially a read-only collection of partitioned records. Each RDD can be divided into multiple partitions, each partition being a fragment of the dataset, and different partitions of an RDD can be stored on different nodes of the cluster, so that computation can proceed in parallel across the cluster.

An RDD provides a highly restricted shared-memory model: it is a read-only collection of record partitions and cannot be modified directly. An RDD can only be created from a dataset in stable physical storage, or by applying deterministic transformation operations (such as map, join, and groupBy) to other RDDs.

RDDs provide a rich set of operations to support common data manipulation, divided into two types: "Actions" and "Transformations".

The transformation interfaces provided by RDDs are very simple: coarse-grained data transformations such as map, filter, groupBy, and join, rather than fine-grained modifications of individual data items (which is why RDDs are not well suited to applications such as web crawlers that need fine-grained updates).
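To make the coarse-grained style concrete, here is a sketch applying filter, groupByKey, and join to whole datasets at once; the sample records are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoarseGrainedOpsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CoarseGrainedOpsSketch").setMaster("local[*]"))

    val orders = sc.parallelize(Seq((1, 25.0), (2, 80.0), (1, 12.5)))  // (userId, amount)
    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))         // (userId, name)

    val bigOrders = orders.filter { case (_, amount) => amount > 20.0 } // applied to every record, not item-by-item updates
    val byUser    = bigOrders.groupByKey()                              // coarse-grained regrouping of the whole dataset
    val joined    = byUser.join(users)                                  // join two RDDs by key

    joined.collect().foreach { case (id, (amounts, name)) =>
      println(s"$name ($id): ${amounts.mkString(", ")}")
    }
    sc.stop()
  }
}
```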

On the surface, RDD functionality seems limited and not very powerful; in practice, however, RDDs have been shown to efficiently express the programming models of many frameworks (such as MapReduce, SQL, and Pregel).

Spark implements the RDD API in Scala, and programmers can perform various operations on RDDs by calling this API.

Typical RDD execution process

An RDD is created by reading data from an external data source.

The RDD then goes through a series of Transformation operations, each producing a new RDD that feeds the next transformation.

The final RDD is processed by an "Action" operation and the result is output to an external data source.

Note:

Transformations do not perform any real computation; they only record the transformation lineage. An Action triggers the actual computation from start to finish and produces the result.

Advantages: lazy invocation, pipelining, no synchronous waiting, no need to store intermediate results, and each individual operation stays simple (see the sketch below).
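The sketch below illustrates lazy evaluation as described above: the transformations only record lineage, and the single action at the end triggers the whole pipeline. The log file path is an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyEvalSketch").setMaster("local[*]"))

    val lines  = sc.textFile("data/app.log")          // creation from an external source (nothing is read yet)
    val errors = lines.filter(_.contains("ERROR"))    // transformation: only lineage is recorded
    val firsts = errors.map(_.split(" ").headOption.getOrElse(""))  // still lazy

    // Only this action triggers reading the file and running the pipeline end to end
    println(s"distinct leading tokens in ERROR lines: ${firsts.distinct().count()}")
    sc.stop()
  }
}
```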

RDD characteristics

RDDs are inherently fault-tolerant: through lineage, lost partitions are simply recomputed, with no need to roll back the whole system; recomputation of different partitions proceeds in parallel on different nodes, and only coarse-grained operations need to be recorded.

Intermediate results are kept in memory: data is passed between RDD operations in memory, avoiding unnecessary disk read/write overhead.

The stored data can be Java objects, avoiding unnecessary object serialization and deserialization.
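A small sketch of the two properties above, with made-up data: persist() keeps the intermediate RDD in memory as deserialized Java objects, and toDebugString prints the lineage that Spark would replay to recompute a lost partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LineagePersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LineagePersistSketch").setMaster("local[*]"))

    val raw     = sc.parallelize(1 to 100000, numSlices = 4)
    val cleaned = raw.filter(_ % 3 == 0).map(_ * 2L)
    cleaned.persist(StorageLevel.MEMORY_ONLY)   // intermediate result stays in memory between the two actions below

    println(cleaned.toDebugString)               // lineage: parallelize -> filter -> map
    println(s"count = ${cleaned.count()}, max = ${cleaned.max()}")
    sc.stop()
  }
}
```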

Dependencies between RDDs

A narrow dependency means that a partition of a parent RDD corresponds to a single partition of a child RDD, or that partitions of several parent RDDs correspond to a single partition of a child RDD.

A wide dependency means that a single partition of a parent RDD corresponds to multiple partitions of a child RDD.

Whether a Shuffle operation is involved is the criterion that distinguishes narrow dependencies from wide dependencies.
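The following sketch contrasts the two dependency types: map keeps a one-to-one partition mapping (narrow, no shuffle), while reduceByKey regroups records by key (wide, shuffle), which is exactly where Spark will later cut a stage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DependencySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DependencySketch").setMaster("local[2]"))

    val words  = sc.parallelize(Seq("spark", "hadoop", "spark", "hive"), numSlices = 2)
    val pairs  = words.map(w => (w, 1))        // narrow dependency
    val counts = pairs.reduceByKey(_ + _)      // wide dependency: shuffle, future stage boundary

    println(counts.toDebugString)              // the ShuffledRDD in the lineage marks the wide dependency
    counts.collect().foreach(println)
    sc.stop()
  }
}
```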

Stage division

Based on the RDD dependencies in the DAG, Spark divides a job into multiple stages. The division is driven by narrow and wide dependencies: narrow dependencies can be pipelined, which is good for optimizing the job, whereas wide dependencies involve a Shuffle and cannot be processed in a pipelined way.

The concrete division method:

Traverse the DAG in reverse; whenever a wide dependency is encountered, break the graph there.

Whenever a narrow dependency is encountered, add the current RDD to the current Stage.

Group narrow dependencies into the same Stage as far as possible, so that pipelined computation is achieved and data can be exchanged directly in memory, avoiding disk I/O overhead.

How an RDD job runs

(1) Create the RDD objects;

(2) The SparkContext computes the dependencies between the RDDs and builds the DAG;

(3) The DAGScheduler decomposes the DAG into multiple Stages, each containing multiple Tasks, and each Task is dispatched by the TaskScheduler to an Executor on a Worker Node for execution.
