Spark
1. Four features of Spark
- High speed
  - Two reasons why Spark is faster than MapReduce:
  - Memory based
    1. In MapReduce, every job writes its output to disk, and later jobs that depend on that output must perform a large amount of disk I/O to read it back, so performance is low.
    2. In Spark, a job's result can be kept in memory, and later jobs that depend on it can read it directly from memory, avoiding disk I/O, so performance is higher.
    Note that both Spark and MapReduce programs have a shuffle stage, and in both cases the data produced during the shuffle is written to disk.
  - Processes vs. threads
    1. MapReduce tasks run as processes in a YARN cluster: with 100 map tasks, each task needs its own process, so 100 processes must be started.
    2. Spark tasks run as threads inside a process: with 100 tasks, each task corresponds to one thread.
- Ease of use
  - You can quickly write Spark programs in different languages: Java, Scala, Python, R, etc. (see the word-count sketch after this list)
- Generality
- Compatibility
  - A Spark program can run in a variety of modes
    - standalone
      - Spark's built-in cluster mode; the Master of the Spark cluster is responsible for resource allocation for the whole application
    - yarn
      - A Spark program can be submitted to run on YARN; the ResourceManager in YARN is then responsible for resource allocation for the whole application
    - mesos
      - An Apache open-source resource scheduling platform similar to YARN
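As a minimal illustration of the points above, here is a word-count sketch in Scala; the app name, file paths, and master URL are placeholders, not taken from these notes:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs locally; "spark://host:7077" would use standalone mode,
    // "yarn" would use YARN (placeholder values)
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///input/words.txt") // placeholder path
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // cache() keeps the result in memory, so later jobs that reuse it
    // avoid the disk I/O described in the "Memory based" point above
    counts.cache()
    counts.saveAsTextFile("hdfs:///output/wordcount")   // placeholder path

    sc.stop()
  }
}
```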
2. Spark cluster architecture
- Driver
  - It runs the main method of the client-written program and builds the SparkContext object (the entry point of every Spark program)
- Application
  - A Spark application; it includes the client's code and the resource information needed to run the tasks
- ClusterManager
  - An external service that provides computing resources for the program
    - standalone
      - Spark's built-in cluster mode; the Master of the Spark cluster is responsible for resource allocation for the whole application
    - yarn
      - A Spark program can be submitted to run on YARN; the ResourceManager in YARN is then responsible for resource allocation for the whole application
    - mesos
      - An Apache open-source resource scheduling platform similar to YARN
- Master
  - The master node of a Spark standalone cluster; responsible for allocating resources for tasks
- Worker
  - A slave node of the Spark cluster; responsible for running the computation tasks
- Executor
  - A process started on a worker node
- Task
  - A Spark task runs as a task thread inside an executor process on a worker node (the configuration sketch below ties these terms together)
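A sketch of how these architecture terms map onto configuration, assuming a YARN deployment (all values are illustrative): each executor is a JVM process on a worker node, and tasks run as threads inside it, so spark.executor.cores bounds the number of concurrent task threads per executor.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ArchitectureDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ArchitectureDemo")       // this program is the Application
      .setMaster("yarn")                    // YARN's ResourceManager allocates resources
      .set("spark.executor.instances", "4") // executor processes started on worker nodes
      .set("spark.executor.cores", "2")     // concurrent task threads per executor
      .set("spark.executor.memory", "1g")   // memory per executor process

    // The Driver runs main() and builds the SparkContext, the entry point
    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}
```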
3. What is an RDD
- RDD (Resilient Distributed Dataset) is the most basic abstraction in Spark. It represents an immutable, partitionable collection of elements that can be computed in parallel.
  - Resilient: an RDD's data can be stored either on disk or in memory
  - Distributed: an RDD's data is stored in a distributed fashion, which is convenient for later distributed computation
  - Dataset: a collection that can hold a large amount of data
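A short sketch of the two usual ways to create an RDD, reusing a SparkContext sc like the one built in the word-count sketch above (the file path is a placeholder):

```scala
// An RDD can be built from an in-memory collection...
val fromCollection = sc.parallelize(1 to 100)

// ...or from an external dataset such as a file on HDFS (placeholder path)
val fromFile = sc.textFile("hdfs:///data/input.txt")

// RDDs are immutable: a transformation returns a new RDD,
// it never modifies the existing one
val doubled = fromCollection.map(_ * 2)
```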
4. The five properties of an RDD
- A list of partitions
  - A list of partitions, the basic units that make up the dataset
  - This means an RDD may have multiple partitions, and each partition stores part of the RDD's data. Spark tasks run as task threads, and one task corresponds to one partition.
- A function for computing each split
  - A function that computes each partition
  - In Spark, computation on an RDD is performed in units of partitions
- A list of dependencies on other RDDs
  - An RDD depends on a number of other RDDs
  - This means RDDs have dependencies on one another; Spark's fault-tolerance mechanism is built on this property (lineage)
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
  - A Partitioner, i.e. the RDD's partitioning function (optional)
  - Spark implements two kinds of partitioning functions: 1. the hash-based HashPartitioner (key.hashCode % number of partitions = partition id); 2. the range-based RangePartitioner. Only key-value RDDs that produce a shuffle have a Partitioner; for non-key-value RDDs the Partitioner is None (see the sketch after this list).
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
  - A list of preferred locations for storing each partition (optional)
  - When scheduling tasks, Spark prefers to start computation on the nodes that already hold the data, to reduce network transfer of data and improve computing efficiency.
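A sketch of properties 1 and 4 in code, reusing the SparkContext sc from the earlier sketches (values are illustrative):

```scala
import org.apache.spark.HashPartitioner

// Property 1: an RDD is split into partitions (here, 4 of them)
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
println(pairs.partitions.length)             // 4

// Property 4: a key-value RDD can carry a Partitioner; HashPartitioner
// assigns each record to partition key.hashCode % numPartitions
val hashed = pairs.partitionBy(new HashPartitioner(2))
println(hashed.partitioner)                  // Some(HashPartitioner@...)

// A non key-value RDD has no partitioning function
println(sc.parallelize(1 to 10).partitioner) // None
```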
5. RDD operator categories
- transformation
  - Produces a new RDD from an existing RDD; transformations are lazily evaluated and are not executed immediately
  - For example
    - map, flatMap, reduceByKey
- action
  - Triggers the actual execution of the job (see the sketch after this list)
  - Returns the computed RDD result data to the Driver, or saves it to an external storage medium (disk, memory, HDFS)
  - For example
    - collect, saveAsTextFile
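A sketch of the lazy-transformation / eager-action distinction, reusing sc (paths are placeholders):

```scala
// Transformations are lazy: no job runs when these lines execute
val words  = sc.textFile("hdfs:///data/words.txt").flatMap(_.split(" "))
val counts = words.map((_, 1)).reduceByKey(_ + _) // still nothing computed

// Actions trigger execution: only now does Spark actually run the job
val result = counts.collect()                     // brings data back to the Driver
counts.saveAsTextFile("hdfs:///out/counts")       // writes to external storage
```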
6. Common RDD operators
6.1 Transformation operators
Transformation | Meaning |
---|---|
map(func) | Returns a new RDD formed by passing each element of the source RDD through the function func |
filter(func) | Returns a new RDD formed by the elements of the source RDD for which func returns true |
flatMap(func) | Similar to map, but each input element can be mapped to zero or more output elements (so func should return a sequence rather than a single element) |
mapPartitions(func) | Similar to map, but runs independently on each partition of the RDD; when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U] |
mapPartitionsWithIndex(func) | Similar to mapPartitions, but func takes an extra integer parameter giving the index of the partition; when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U] |
union(otherDataset) | Returns a new RDD that is the union of the source RDD and the argument RDD |
intersection(otherDataset) | Returns a new RDD that is the intersection of the source RDD and the argument RDD |
distinct([numTasks]) | Returns a new RDD containing the distinct elements of the source RDD |
groupByKey([numTasks]) | Called on an RDD of (K, V) pairs; returns an RDD of (K, Iterable&lt;V&gt;) pairs |
reduceByKey(func, [numTasks]) | Called on an RDD of (K, V) pairs; returns an RDD of (K, V) pairs in which the values for each key are aggregated using the given reduce function func. As with groupByKey, the number of reduce tasks can be configured through the second optional argument |
sortByKey([ascending], [numTasks]) | Called on an RDD of (K, V) pairs in which K implements the Ordered interface; returns an RDD of (K, V) pairs sorted by key |
sortBy(func, [ascending], [numTasks]) | Similar to sortByKey, but more flexible: a custom sort function func can be supplied |
join(otherDataset, [numTasks]) | Called on RDDs of type (K, V) and (K, W); returns an RDD of (K, (V, W)) pairs with all pairs of elements for each key |
cogroup(otherDataset, [numTasks]) | Called on RDDs of type (K, V) and (K, W); returns an RDD of (K, (Iterable&lt;V&gt;, Iterable&lt;W&gt;)) tuples |
coalesce(numPartitions) | Decreases the number of partitions of the RDD to the specified value |
repartition(numPartitions) | Repartitions the RDD into the given number of partitions |
repartitionAndSortWithinPartitions(partitioner) | Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by key |
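A few of the transformations above in use, reusing sc (inputs are illustrative; each line yields a new RDD, and nothing is computed until an action is called):

```scala
val nums  = sc.parallelize(Seq(1, 2, 3, 4, 5))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

nums.map(_ * 10)                            // 10, 20, 30, 40, 50
nums.filter(_ % 2 == 0)                     // 2, 4
sc.parallelize(Seq("x y", "z"))
  .flatMap(_.split(" "))                    // "x", "y", "z"
pairs.reduceByKey(_ + _)                    // ("a", 4), ("b", 2)
pairs.groupByKey()                          // ("a", [1, 3]), ("b", [2])
pairs.sortByKey()                           // sorted by key
pairs.join(sc.parallelize(Seq(("a", "X")))) // ("a", (1, "X")), ("a", (3, "X"))
nums.repartition(4)                         // shuffle into 4 partitions
  .coalesce(2)                              // then shrink to 2
```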
6.2 Action operators
Action | Meaning |
---|---|
reduce(func) | Passes the first two elements of the RDD to the input function to produce a new value; that new value and the next element of the RDD (the third element) are then passed to the function again, and so on until only one value remains |
collect() | Returns all elements of the dataset to the driver program as an array |
count() | Returns the number of elements in the RDD |
first() | Returns the first element of the RDD (similar to take(1)) |
take(n) | Returns an array consisting of the first n elements of the dataset |
takeOrdered(n, [ordering]) | Returns the first n elements in natural order or a custom order |
saveAsTextFile(path) | Saves the elements of the dataset as a text file to HDFS or another supported file system; for each element, Spark calls toString to convert it to a line of text in the file |
saveAsSequenceFile(path) | Saves the elements of the dataset in Hadoop SequenceFile format to the specified directory, on HDFS or another Hadoop-supported file system |
saveAsObjectFile(path) | Saves the elements of the dataset, using Java serialization, to the specified directory |
countByKey() | For an RDD of type (K, V), returns a map of (K, Int) pairs giving the number of elements for each key |
foreach(func) | Runs the function func on each element of the dataset |
foreachPartition(func) | Runs the function func on each partition of the dataset |
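And a few of the actions above in use, reusing sc (the output path is a placeholder); unlike transformations, each of these lines triggers a job:

```scala
val nums  = sc.parallelize(Seq(3, 1, 4, 1, 5))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

nums.reduce(_ + _)      // 14: folds the elements pairwise down to one value
nums.collect()          // Array(3, 1, 4, 1, 5), returned to the driver
nums.count()            // 5
nums.first()            // 3 (like take(1))
nums.take(2)            // Array(3, 1)
nums.takeOrdered(2)     // Array(1, 1): the smallest two in natural order
pairs.countByKey()      // Map("a" -> 2, "b" -> 1)
nums.foreach(println)   // note: runs on the executors, not on the driver
nums.saveAsTextFile("hdfs:///out/nums") // placeholder path
```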