Spark Series (1) - Spark Introduction

I. Introduction

Spark was born in 2009 in the AMPLab at the University of California, Berkeley. It was donated to the Apache Software Foundation in 2013 and became a top-level Apache project in February 2014. Compared with MapReduce batch computing, Spark can deliver up to a hundred-fold performance improvement, which has made it the most widely used distributed computing framework after MapReduce.

II. Characteristics

Apache Spark has the following characteristics:

  • Uses an advanced DAG scheduler, a query optimizer, and a physical execution engine to guarantee performance;
  • Multi-language support: Java, Scala, Python, and R are currently supported;
  • Provides more than 80 high-level operators, making it easy to build applications (a minimal word-count sketch follows this list);
  • Supports batch processing, stream processing, and complex business analytics;
  • Rich library support: including SQL, MLlib, GraphX, and Spark Streaming, all of which can be combined seamlessly;
  • Multiple deployment options: supports local mode and its own standalone cluster mode, and can also run on Hadoop, Mesos, and Kubernetes;
  • Multiple data source support: can access HDFS, Alluxio, Cassandra, HBase, Hive, and hundreds of other data sources.
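
As a taste of these high-level operators, below is a minimal word-count sketch in Scala. It is illustrative only: it assumes a local Spark installation, and the input path "input.txt" is a hypothetical placeholder.

```scala
// Minimal word count using Spark's high-level operators.
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")  // local mode; on a cluster this is set by spark-submit
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile("input.txt")       // hypothetical input path
      .flatMap(_.split("\\s+"))    // split each line into words
      .map(word => (word, 1))      // pair each word with a count of 1
      .reduceByKey(_ + _)          // sum the counts per word

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```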


III. Cluster Architecture

Term             Meaning
Application      A user program built on Spark, consisting of a Driver on one node and Executors on multiple nodes of the cluster.
Driver program   The process that runs the application's main() method and creates the SparkContext.
Cluster manager  An external service that manages cluster resources (e.g., Standalone Manager, Mesos, YARN).
Worker node      A cluster node that runs computing tasks.
Executor         A process launched for an application on a worker node; it runs Tasks and keeps output data in memory or on disk.
Task             A unit of work sent to one Executor.

Execution process:

  1. After the user program creates a SparkContext, it connects to the cluster resource manager, which allocates computing resources for the program and starts the Executors;
  2. The Driver divides the computation into stages and multiple Tasks, and then sends the Tasks to the Executors;
  3. The Executors run the Tasks, report their execution status to the Driver, and also report the resource usage of their node to the cluster resource manager (a sketch of this flow follows the list).
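
A minimal sketch of this flow, assuming a standalone cluster manager at the placeholder address spark://host:7077: creating the SparkContext triggers step 1, and calling an action such as sum() triggers steps 2 and 3.

```scala
// Illustrative only: the master URL below is a placeholder.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ArchitectureDemo")
  .setMaster("spark://host:7077")  // cluster resource manager address (placeholder)
val sc = new SparkContext(conf)    // step 1: resources are allocated, Executors start

val rdd = sc.parallelize(1 to 1000000, numSlices = 8)  // 8 partitions -> 8 Tasks per stage
val total = rdd.map(_ * 2).sum()   // steps 2-3: Driver ships Tasks, Executors run them
println(total)
sc.stop()
```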

IV. Core Components

Spark extends Spark Core with four core components, each of which addresses the computing needs of a different domain.


4.1 Spark SQL

Spark SQL is mainly used for processing structured data. It has the following features:

  • Seamlessly mixes SQL queries with Spark programs, allowing you to query structured data using either SQL or the DataFrame API (see the sketch after this list);
  • Supports a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC;
  • Supports HiveQL syntax and user-defined functions (UDFs), allowing you to access existing Hive warehouses;
  • Supports standard JDBC and ODBC connections;
  • Supports features such as an optimizer, columnar storage, and code generation to improve query efficiency.
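
A minimal sketch of mixing the two query styles, assuming a hypothetical people.json file with name and age fields:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLDemo")
  .master("local[*]")
  .getOrCreate()

val df = spark.read.json("people.json")  // hypothetical input; one of many supported sources

// DataFrame API
df.filter(df("age") > 21).select("name", "age").show()

// The equivalent query in plain SQL over the same data
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()
```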

4.2 Spark Streaming

Spark Streaming is mainly used to quickly build scalable, high-throughput, fault-tolerant stream processing programs. It supports reading data from HDFS, Flume, Kafka, Twitter, and ZeroMQ and processing it.


At its core, Spark Streaming is micro-batching: it splits the data stream at a very fine granularity into many small batches, thereby approximating the effect of true stream processing.
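
A minimal micro-batch sketch, assuming text lines arrive on a local socket (for example, one fed by nc -lk 9999); the 1-second batch interval is the micro-batch granularity:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: one thread for the receiver, one for processing
val conf = new SparkConf().setAppName("StreamingDemo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))  // batch interval = micro-batch size

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)  // word count within each 1-second batch
counts.print()

ssc.start()            // start receiving and processing
ssc.awaitTermination()
```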


4.3 MLlib

MLlib is Spark's machine learning library. Its design goal is to make machine learning simple and scalable. It provides the following tools:

  • Common machine learning algorithms, such as classification, regression, clustering, and collaborative filtering;
  • Featurization: feature extraction, transformation, dimensionality reduction, and selection;
  • Pipelines: tools for constructing, evaluating, and tuning ML pipelines (see the sketch after this list);
  • Persistence: saving and loading algorithms, models, and pipelines;
  • Utilities: linear algebra, statistics, data handling, and more.
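
A minimal pipeline sketch chaining featurization and a classifier, assuming an existing SparkSession named spark; the tiny inline training set is purely illustrative:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Illustrative training data: (id, text, label)
val training = spark.createDataFrame(Seq(
  (0L, "spark makes big data simple", 1.0),
  (1L, "legacy batch jobs are slow", 0.0)
)).toDF("id", "text", "label")

// Featurization stages followed by a classifier
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)  // fit the whole pipeline at once

model.write.overwrite().save("/tmp/spark-lr-model")  // persistence: save the fitted model
```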

4.4 GraphX

Spark GraphX is a new component for graph computation and graph-parallel computation. At a high level, GraphX extends the RDD by introducing a new graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX provides a set of basic operators (such as subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
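
A minimal GraphX sketch, assuming an existing SparkContext named sc: it builds a small property graph and runs the built-in PageRank algorithm.

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices carry a String property; so do the edges
val vertices = sc.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Carol")
))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "follows")
))

val graph = Graph(vertices, edges)                // directed multigraph with properties
val ranks = graph.pageRank(tol = 0.001).vertices  // one of GraphX's built-in algorithms
ranks.collect().foreach { case (id, rank) => println(s"$id -> $rank") }
```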

More articles in this big data series can be found in the GitHub open-source project: Big Data Getting Started.
