Spark Basics and RDDs


1. Four properties of Spark

  1. Speed
    • Two reasons Spark is faster than MapReduce:
      1. Memory-based computation
        1. In MapReduce, every job writes its output to disk. When later jobs depend on the output of earlier jobs, a large amount of disk I/O is needed, so performance is low.
        2. In Spark, a job's result can be kept in memory. When later jobs depend on the output of earlier jobs, they can read it directly from memory, avoiding the disk I/O, so performance is higher. (See the sketch after this list.)

        Note: both Spark and MapReduce programs have a shuffle stage, and the data produced during the shuffle is written to disk in both cases.
      2. Processes vs. threads
        1. MapReduce tasks run as processes in a YARN cluster: with 100 map tasks, each task needs its own process, so 100 processes must be started.
        2. Spark tasks run as threads inside a process: with 100 tasks, each task corresponds to one thread.
  2. Ease of use
    • Spark programs can be written quickly in several languages: Java, Scala, Python, R, etc.
  3. Generality
    • Spark provides a unified stack: Spark SQL, Spark Streaming, MLlib, and GraphX can be combined in one application.
  4. Compatibility
    • A Spark program can run in several modes
      • standAlone
        • Spark's built-in cluster mode; the Master of the Spark cluster is responsible for allocating resources for the whole job
      • yarn
        • The Spark program is submitted to YARN to run; the ResourceManager in YARN is responsible for allocating resources for the whole job
      • mesos
        • An open-source Apache resource-scheduling platform similar to YARN
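
A minimal sketch of the memory-based point above (the app name and data are made up for illustration): caching an RDD keeps it in memory after the first job computes it, so later jobs reuse it without recomputation or disk I/O.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheDemo {
  def main(args: Array[String]): Unit = {
    // local[2]: run locally with 2 threads (Spark tasks run as threads, not processes)
    val conf = new SparkConf().setAppName("cache-demo").setMaster("local[2]")
    val sc   = new SparkContext(conf)

    val words = sc.parallelize(Seq("a b", "b c", "c d")).flatMap(_.split(" "))

    words.cache() // keep this RDD in memory once it has been computed

    println(words.count())            // job 1: computes words, then caches it
    println(words.distinct().count()) // job 2: reuses the in-memory data

    sc.stop()
  }
}
```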

2. Spark cluster architecture

Spark cluster components

  • Driver
    • Runs the main method of the client program and builds the SparkContext object, the entry point of every Spark program (see the sketch after this list)
  • Application
    • A Spark application: the client's code together with the resource information needed to run its tasks
  • ClusterManager
    • The external service that provides computing resources to the program
      • standAlone
        • Spark's built-in cluster mode; the Master of the Spark cluster is responsible for allocating resources for the whole job
      • yarn
        • The Spark program is submitted to YARN to run; the ResourceManager in YARN is responsible for allocating resources for the whole job
      • mesos
        • An open-source Apache resource-scheduling platform similar to YARN
  • Master
    • The master node of a Spark standalone cluster, responsible for allocating resources to tasks
  • Worker
    • A slave node of the Spark cluster, responsible for running task computations
  • Executor
    • A process started on a worker node that runs tasks
  • Task
    • A unit of work that runs as a thread inside an executor process on a worker node
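
A small spark-shell-style sketch of how the Driver builds a SparkContext and how the master URL selects the ClusterManager (host names and ports below are placeholders, not real addresses):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The master URL chooses the ClusterManager that allocates resources:
//   "spark://node01:7077" -> standAlone (the Spark Master; placeholder host)
//   "yarn"                -> YARN (the ResourceManager)
//   "mesos://node01:5050" -> Mesos (placeholder host)
//   "local[2]"            -> no cluster, 2 local threads (for testing)
val conf = new SparkConf()
  .setAppName("cluster-demo")
  .setMaster("spark://node01:7077")

val sc = new SparkContext(conf) // the Driver builds the SparkContext
// ... define RDDs and trigger jobs here ...
sc.stop()
```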

3. What is an RDD

  • An RDD (Resilient Distributed Dataset) is the most basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be computed in parallel.
    • Resilient: the RDD's data can be stored either on disk or in memory
    • Distributed: the RDD's data is stored in a distributed way, which facilitates later distributed computation
    • Dataset: a collection that can hold a large amount of data
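
Two common ways to create an RDD, as a short sketch (assuming an existing SparkContext sc; the HDFS path is a placeholder):

```scala
// From an in-memory collection, split into 4 partitions:
val nums = sc.parallelize(1 to 100, 4)

// From a file (placeholder HDFS path); by default one partition per HDFS block:
val lines = sc.textFile("hdfs://node01:8020/data/words.txt")

println(nums.count()) // 100
```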

4. The five attributes of an RDD

  • A list of partitions
    • A list of partitions, the basic units of the dataset
      • This means an RDD may have multiple partitions, and each partition stores part of the RDD's data. Spark tasks run as threads, and one task corresponds to one partition.
  • A function for computing each split
    • A compute function for each partition
      • Spark computes RDDs partition by partition
  • A list of dependencies on other RDDs
    • An RDD may depend on several other RDDs
      • This means there are dependency relationships between RDDs; Spark's fault-tolerance mechanism is built on this property (lineage)
  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)

    • A Partitioner, i.e. the RDD's partitioning function (optional)

          Spark implements two kinds of partitioning functions:
          1. The hash-based HashPartitioner (key.hashCode % number of partitions = partition id)
          2. The range-based RangePartitioner

          Only key-value RDDs that go through a shuffle have a Partitioner;
          for a non-key-value RDD the Partitioner is None.
  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

    • A list of preferred locations for each partition (optional)

      When scheduling tasks, Spark prefers to start the computation on nodes that already hold the data, reducing network transfer and improving efficiency. (The sketch after this list inspects these attributes.)
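
A sketch that inspects these attributes on a concrete RDD (assuming an existing SparkContext sc; the printed values are indicative):

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

// 1. A list of partitions
println(pairs.partitions.length) // 2

// 3. Dependencies on other RDDs (lineage, used for fault tolerance)
val summed = pairs.reduceByKey(_ + _)
println(summed.toDebugString)    // prints the lineage graph
println(summed.dependencies)     // a ShuffleDependency

// 4. Partitioner: only shuffled key-value RDDs have one
println(pairs.partitioner)       // None
println(summed.partitioner)      // Some(HashPartitioner(...))
println(pairs.partitionBy(new HashPartitioner(4)).partitioner) // Some(HashPartitioner(4))

// 5. Preferred locations of a partition (empty for a parallelized collection)
println(pairs.preferredLocations(pairs.partitions(0)))
```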

5. RDD operator categories

  • transformation
    • Builds a new RDD from an existing one; transformations are lazily evaluated, so they are not executed immediately (see the sketch after this list)
    • Examples
      • map, flatMap, reduceByKey
  • action
    • Triggers the actual execution of the job
      • Returns the RDD's computed result to the Driver side, or saves it to an external storage medium (disk, memory, HDFS)
    • Examples
      • collect, saveAsTextFile
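
A quick sketch of the lazy/eager distinction (assuming an existing SparkContext sc):

```scala
val nums = sc.parallelize(1 to 5)

// transformation: only recorded in the lineage, nothing runs yet
val doubled = nums.map { x => println(s"computing $x"); x * 2 }

// action: triggers the job; only now do the "computing ..." lines print
// (in local mode the executor output appears in the same console)
val result = doubled.collect()
println(result.mkString(",")) // 2,4,6,8,10
```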

6. Common RDD operators

6.1 Transformation operators

| Transformation | Meaning |
| --- | --- |
| map(func) | Returns a new RDD formed by passing each element of the source through the function func |
| filter(func) | Returns a new RDD formed by the elements of the source for which func returns true |
| flatMap(func) | Similar to map, but each input element may be mapped to zero or more output elements (func should return a sequence rather than a single element) |
| mapPartitions(func) | Similar to map, but runs independently on each partition of the RDD; when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U] |
| mapPartitionsWithIndex(func) | Similar to mapPartitions, but func also receives an integer index of the partition; when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U] |
| union(otherDataset) | Returns a new RDD containing the union of the source RDD and the argument RDD |
| intersection(otherDataset) | Returns a new RDD containing the intersection of the source RDD and the argument RDD |
| distinct([numTasks]) | Returns a new RDD containing the deduplicated elements of the source RDD |
| groupByKey([numTasks]) | Called on an RDD of (k, v) pairs; returns an RDD of (k, Iterable<v>) pairs |
| reduceByKey(func, [numTasks]) | Called on an RDD of (k, v) pairs; returns an RDD of (k, v) pairs in which the values for each key are aggregated using the given reduce function func. As with groupByKey, the number of reduce tasks can be set through the optional second argument |
| sortByKey([ascending], [numTasks]) | Called on an RDD of (k, v) pairs where k implements the Ordered interface; returns an RDD of (k, v) pairs sorted by key |
| sortBy(func, [ascending], [numTasks]) | Similar to sortByKey, but more flexible: the sort key is computed by the custom function func |
| join(otherDataset, [numTasks]) | Called on RDDs of types (k, v) and (k, w); returns an RDD of (k, (v, w)) pairs with all pairs of elements for each matching key |
| cogroup(otherDataset, [numTasks]) | Called on RDDs of types (K, V) and (K, W); returns an RDD of type (K, (Iterable<V>, Iterable<W>)) |
| coalesce(numPartitions) | Reduces the number of partitions of the RDD to the given value |
| repartition(numPartitions) | Repartitions the RDD into the given number of partitions |
| repartitionAndSortWithinPartitions(partitioner) | Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts the records by key |
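
A short word-count sketch chaining several of the transformations above (assuming an existing SparkContext sc; the data is inlined rather than read from a file):

```scala
val lines = sc.parallelize(Seq("spark is fast", "spark is easy"))

val counts = lines
  .flatMap(_.split(" "))           // one word per element
  .map(word => (word, 1))          // (word, 1) pairs
  .reduceByKey(_ + _)              // aggregate the counts per key (shuffle)
  .sortBy(_._2, ascending = false) // most frequent words first

// nothing has executed yet; collect() is an action and triggers the job
counts.collect().foreach(println)  // e.g. (spark,2), (is,2), ...
```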

6.2 Action operators

| Action | Meaning |
| --- | --- |
| reduce(func) | Passes the first two elements of the RDD to the input function, producing a new value; that value and the next element of the RDD are then passed to the function again, and so on until only a single value remains |
| collect() | Returns all elements of the dataset to the driver program as an array |
| count() | Returns the number of elements in the RDD |
| first() | Returns the first element of the RDD (similar to take(1)) |
| take(n) | Returns an array of the first n elements of the dataset |
| takeOrdered(n, [ordering]) | Returns the first n elements, in natural order or by a custom ordering |
| saveAsTextFile(path) | Saves the elements of the dataset as a text file to HDFS or another supported file system; for each element, Spark calls toString to convert it to a line of text in the file |
| saveAsSequenceFile(path) | Saves the elements of the dataset in Hadoop SequenceFile format to the given directory, on HDFS or another Hadoop-supported file system |
| saveAsObjectFile(path) | Saves the elements of the dataset, using Java serialization, to the given directory |
| countByKey() | For RDDs of type (k, v), returns a (k, Int) map giving the number of elements for each key |
| foreach(func) | Runs the function func on each element of the dataset |
| foreachPartition(func) | Runs the function func on each partition of the dataset |
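
A few of these actions in use, as a sketch (assuming an existing SparkContext sc; the output path is a placeholder):

```scala
val nums  = sc.parallelize(Seq(3, 1, 4, 1, 5))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

println(nums.reduce(_ + _))                // 14
println(nums.count())                      // 5
println(nums.first())                      // 3
println(nums.take(3).mkString(","))        // 3,1,4
println(nums.takeOrdered(3).mkString(",")) // 1,1,3
println(pairs.countByKey())                // Map(a -> 2, b -> 1)

// placeholder output path; Spark writes one text file per partition
nums.saveAsTextFile("hdfs://node01:8020/out/nums")
```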
