I. Introduction
Spark was born in 2009 at the University of California, Berkeley's AMPLab, was donated to the Apache Software Foundation in 2013, and became a top-level Apache project in February 2014. Compared with MapReduce batch computing, Spark can deliver up to a hundredfold performance improvement, which made it the most widely used distributed computing framework after MapReduce.
II. Characteristics
Apache Spark has the following characteristics:
- Uses an advanced DAG scheduler, a query optimizer, and a physical execution engine to guarantee performance;
- Multi-language support: currently Java, Scala, Python, and R;
- Provides more than 80 high-level operators, making it easy to build applications;
- Supports batch processing, stream processing, and complex business analytics;
- Rich library support: includes Spark SQL, MLlib, GraphX, and Spark Streaming, and these libraries can be seamlessly combined;
- Multiple deployment options: supports local mode and its own standalone cluster mode, and can also run on Hadoop, Mesos, and Kubernetes;
- Multiple data source support: can access HDFS, Alluxio, Cassandra, HBase, Hive, and hundreds of other data sources.
III. Cluster Architecture
Term | Meaning
---|---
Application | A Spark application, consisting of a Driver on one node and multiple Executors on cluster nodes |
Driver program | The process that runs the application's main() method and creates the SparkContext |
Cluster manager | The cluster resource manager (for example, Standalone Manager, Mesos, YARN) |
Worker node | A cluster node that performs computation tasks |
Executor | A process launched for the application on a worker node, responsible for running tasks and keeping output data in memory or on disk |
Task | The unit of work sent to an Executor |
Execution process:
- After the user program creates a SparkContext, it connects to the cluster resource manager, which allocates computing resources for the program and launches the Executors;
- The Driver divides the program into execution stages and multiple Tasks, then sends the Tasks to the Executors;
- The Executors run the Tasks and report their execution status to the Driver; they also report the current node's resource usage to the cluster resource manager.
IV. Core Components
Spark extends Spark Core with four core components, each serving the computing needs of a different domain.
4.1 Spark SQL
Spark SQL is mainly used for processing structured data. It has the following features:
- Seamlessly mixes SQL queries with Spark programs, letting you query structured data with either SQL or the DataFrame API;
- Supports multiple data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC;
- Supports HiveQL syntax and user-defined functions (UDFs), allowing you to access existing Hive warehouses;
- Supports standard JDBC and ODBC connections;
- Supports features such as a query optimizer, columnar storage, and code generation to improve query efficiency.
4.2 Spark Streaming
Spark Streaming is mainly used to quickly build scalable, high-throughput, fault-tolerant stream processing programs. It supports reading and processing data from HDFS, Flume, Kafka, Twitter, and ZeroMQ.
Spark Streaming is essentially micro-batch processing: it splits the data stream into very fine-grained batches, achieving an effect close to true stream processing.
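The micro-batch idea can be shown with a toy example in plain Python (no Spark): the "stream" is chopped into small batches, and each batch is then processed with ordinary batch logic. Spark Streaming does the same, but batches by time interval and distributes the work:

```python
def micro_batches(stream, batch_size):
    """Split an (unbounded) stream into small fixed-size batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Each micro-batch is handled as a small batch job (here: a sum).
results = [sum(b) for b in micro_batches(range(10), batch_size=4)]
print(results)  # [0+1+2+3, 4+5+6+7, 8+9] = [6, 22, 17]
```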
4.3 MLlib
MLlib is Spark's machine learning library. Its design goal is to make machine learning simple and scalable. It provides the following tools:
- Common machine learning algorithms, such as classification, regression, clustering, and collaborative filtering;
- Featurization: feature extraction, transformation, dimensionality reduction, and selection;
- Pipelines: tools for building, evaluating, and tuning ML pipelines;
- Persistence: saving and loading algorithms, models, and pipelines;
- Utilities: linear algebra, statistics, data handling, and more.
4.4 GraphX
Spark GraphX is a component for graph computing and graph-parallel computation. At a high level, GraphX extends the RDD abstraction by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX provides a set of fundamental operators (such as subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
More articles in this big data series can be found in the GitHub open source project: Big Data Getting Started.