Key Spark Concepts

Spark:

A fast, unified analytics engine

  • Features
  1. Speed (performs well for both batch and stream processing)
  2. Ease of Use (applications can be written in Java, Scala, Python, R, and SQL)
  3. Generality (combines SQL, streaming, and complex analytics)
  4. Runs Everywhere (runs on Hadoop, Apache Mesos, standalone, or in the cloud)

RDD

Resilient Distributed Dataset, the basic abstraction in Spark; it represents an immutable, partitioned collection of elements that can be operated on in parallel.

  1. Resilient: if the computation of an RDD partition fails, it can be recomputed from the parent RDD
  2. Distributed: the data is partitioned and processed in parallel
  • Five main properties (illustrated in the sketch after this list)
  1. A list of partitions
  2. A function for computing each split
  3. A list of dependencies on other RDDs
  4. Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
  5. Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
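A minimal Scala sketch of these properties, assuming a spark-shell session where `sc` is the provided SparkContext:

```scala
// Create an RDD with an explicit number of partitions (property 1)
val nums = sc.parallelize(1 to 100, 4)

// Each split is computed by a function (property 2); the lineage back to
// the parent RDD (property 3) lets a lost partition be recomputed
val squares = nums.map(n => n * n)

// Key-value RDDs may carry a Partitioner (property 4)
import org.apache.spark.HashPartitioner
val pairs = squares.map(n => (n % 10, n)).partitionBy(new HashPartitioner(4))

println(pairs.getNumPartitions) // 4
```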

SparkContext

  1. The main entry point for Spark functionality
  2. Represents the connection to a Spark cluster
  3. Can be used to create RDDs, accumulators, and broadcast variables on that cluster (see the sketch below)
  4. Only one SparkContext may be active per JVM
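A short sketch of point 3, again assuming `sc` from spark-shell:

```scala
// Create an RDD from a local collection
val rdd = sc.parallelize(Seq(1, -2, 3, -4))

// Accumulator: tasks on executors add to it; the driver reads the total
val negatives = sc.longAccumulator("negatives")
rdd.foreach(n => if (n < 0) negatives.add(1))
println(negatives.value) // 2

// Broadcast variable: a read-only value shipped once to each executor
val factor = sc.broadcast(10)
println(rdd.map(_ * factor.value).collect().mkString(", "))
```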
SparkConf

  1. The configuration for a Spark application
  2. Sets Spark parameters as key-value pairs
  3. Created with new SparkConf(), which also loads values from any spark.* Java system properties set in the application (example below)
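For example (the app name, master URL, and memory value here are illustrative placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Key-value configuration; any spark.* Java system properties are also loaded
val conf = new SparkConf()
  .setAppName("DemoApp")              // illustrative name
  .setMaster("local[*]")              // placeholder; on a cluster this is usually set by spark-submit
  .set("spark.executor.memory", "2g") // a real spark.* key; the value is illustrative

val sc = new SparkContext(conf)       // at most one active SparkContext per JVM
```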
Cluster Concepts

Official docs: http://spark.apache.org/docs/latest/cluster-overview.html

| Term | Meaning | Summary |
| --- | --- | --- |
| Application | User program built on Spark. Consists of a driver program and executors on the cluster. | An application running on Spark, made up of a driver program and executors. |
| Application jar | A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however, these will be added at runtime. | The jar for a Spark program, containing the application and its dependencies; the Hadoop and Spark libraries are only added at runtime. |
| Driver program | The process running the main() function of the application and creating the SparkContext | A process that runs the main() function and creates the SparkContext (the entry point of a Spark program). |
| Cluster manager | An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN) | An external service that acquires resources on the cluster. |
| Deploy mode | Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster. | Determines where the driver process runs: in cluster mode the framework launches the driver inside the cluster; in client mode the submitter launches the driver outside the cluster. |
| Worker node | Any node that can run application code in the cluster | Any node in the cluster that can run application code. |
| Executor | A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors. | A process launched for an application on a worker node; it runs tasks and keeps data in memory or on disk. Each application has its own executors. |
| Task | A unit of work that will be sent to one executor | A unit of work sent to one executor. |
| Job | A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs. | A parallel computation made up of multiple tasks, spawned when an action is triggered. |
| Stage | Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs. | Each job is divided into smaller sets of tasks called stages that depend on each other; each shuffle introduces a new stage (see the sketch below). |
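A small sketch of how job, stage, and task relate, using a word-count style computation in spark-shell:

```scala
// One action (collect) spawns one job. reduceByKey needs a shuffle, so this
// job splits into two stages, and each stage runs one task per partition.
val counts = sc.parallelize(Seq("a b a", "b c"), 2)
  .flatMap(_.split(" "))   // narrow transformations stay in stage 1
  .map(word => (word, 1))
  .reduceByKey(_ + _)      // shuffle boundary -> stage 2
  .collect()               // the action that triggers the job
```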
  • Spark execution flow (figure omitted; see the cluster overview page linked above, and the minimal driver program after this list)
  1. Initialize the SparkContext. To run on a cluster, the SparkContext connects to a cluster manager; once connected, Spark acquires executors on the cluster's nodes (processes that run computations and store data for the application)
  2. Next, the application code is sent to the executors
  3. Finally, the SparkContext sends tasks to the executors to run
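Putting the flow together, a minimal self-contained driver program (names and the master URL are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MinimalApp {
  def main(args: Array[String]): Unit = {
    // 1. Initialize the SparkContext: connect to a cluster manager, acquire executors
    val conf = new SparkConf().setAppName("MinimalApp").setMaster("local[2]") // placeholder master URL
    val sc = new SparkContext(conf)

    // 2./3. Application code ships to the executors; the action below makes
    // the SparkContext send tasks to them
    val total = sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)
    println(s"total = $total")

    sc.stop() // release the executors
  }
}
```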
Transformations

| Transformation | Meaning | Note |
| --- | --- | --- |
| map(func) | Return a new distributed dataset formed by passing each element of the source through a function func. | Applies func to every element and returns a new dataset; the number of elements is unchanged. |
| filter(func) | Return a new dataset formed by selecting those elements of the source on which func returns true | Returns a new dataset containing only the elements that pass the predicate. |
| flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item). | Like map, but each input element maps to 0..n output elements, whereas map is always 1:1. |
| mapPartitions(func) | Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T. | Like map, but runs once per partition instead of once per element. |
| groupByKey([numPartitions]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks. | Returns (K, Iterable<V>) pairs: the values for each key are collected but not aggregated. The default parallelism depends on the parent RDD's partition count. |
| reduceByKey(func, [numPartitions]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument. | Aggregates the values of each key with func: (K, v1), (K, v2), ... => (K, func(v1, v2, ...)). |
| coalesce(numPartitions) | Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset. | Reduces the number of partitions, typically without a full shuffle; useful after filtering a large dataset. |
| repartition(numPartitions) | Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network. | Randomly reshuffles the data into more or fewer roughly balanced partitions; always shuffles all data over the network. |
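A short Scala sketch of several of these transformations, assuming `sc` from spark-shell:

```scala
val lines = sc.parallelize(Seq("a b a", "b c"))

val words = lines.flatMap(_.split(" "))  // "a b a" -> "a", "b", "a" (0..n outputs per input)
val pairs = words.map(w => (w, 1))       // map is strictly 1:1

// groupByKey collects values without aggregating: ("a", Iterable(1, 1)), ...
val grouped = pairs.groupByKey()

// reduceByKey pre-aggregates map-side, so it is usually much cheaper
// than groupByKey for sums or averages: ("a", 2), ("b", 2), ("c", 1)
val counts = pairs.reduceByKey(_ + _)

val fewer = counts.coalesce(1)    // shrink partitions, no full shuffle
val more  = counts.repartition(8) // always shuffles all data
```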

Reposted from blog.csdn.net/huonan_123/article/details/84937149