Spark running architecture & Storm stream data processing

Review: the main elements of the Spark programming model: driver program, input, transformation, action, cache, shared variables
RDD features: partitions, dependencies, compute function, partitioning strategy for (K, V) RDDs, locality strategy
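A quick sketch (mine, not from the original notes) showing where each of these RDD features surfaces in the RDD API; it assumes an existing SparkContext sc and a placeholder input file data.txt:

    val rdd = sc.textFile("data.txt").map(line => (line, 1))
    rdd.partitions.length                     // partitions
    rdd.dependencies                          // dependencies on parent RDDs
    rdd.partitioner                           // Option[Partitioner], set for (K, V) RDDs
    rdd.preferredLocations(rdd.partitions(0)) // locality hints for one partition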

Spark running architecture:
Job: a parallel computation consisting of multiple tasks, spawned by a Spark action
Stage: the unit a job is divided into for scheduling, corresponding to a TaskSet
TaskSet: the group of tasks that make up one stage, submitted together to the task scheduler
Task: the unit of work sent to an executor
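To make these terms concrete, a minimal sketch (mine, not from the notes): one action spawns one job, and a shuffle splits the job into two stages, each submitted as one TaskSet:

    import org.apache.spark.{SparkConf, SparkContext}

    object JobStageDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("job-stage-demo").setMaster("local[2]"))
        val nums = sc.parallelize(1 to 100, 4)
        // count() is an action: it spawns one job with a single stage (no shuffle),
        // executed as 4 tasks, one per partition.
        nums.count()
        // reduceByKey introduces a shuffle, so this action spawns a job with two
        // stages; each stage is handed to the task scheduler as one TaskSet.
        nums.map(x => (x % 10, 1)).reduceByKey(_ + _).collect()
        sc.stop()
      }
    }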

  DAGScheduler builds stages
  Records which RDD or stage outputs have been materialized (materialize = save a computed result because it is expensive to recompute)
  Resubmits stages whose shuffle output has been lost
  Passes each TaskSet to the underlying scheduler, chosen by deploy mode (see the sketch after this list):
              -spark standalone cluster: TaskScheduler
              -yarn-cluster: YarnClusterScheduler
              -yarn-client: YarnClientClusterScheduler
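As a rough illustration (using the Spark 1.x master strings matching the class names above; setting the master in code is just one option besides spark-submit), the master URL chosen when building the SparkContext is what selects the underlying scheduler:

    import org.apache.spark.{SparkConf, SparkContext}

    // The master string decides which scheduler backs the DAGScheduler:
    // "spark://host:7077" -> TaskScheduler, "yarn-client" -> YarnClientClusterScheduler
    val conf = new SparkConf()
      .setAppName("scheduler-selection")
      .setMaster("yarn-client")
    val sc = new SparkContext(conf)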

    TaskScheduler (configuration sketch after this block):
            Builds a TaskSetManager instance for each TaskSet to manage that TaskSet's lifecycle
            Uses data locality to decide the best location for each task (process-local, node-local, rack-local, then any)
            Submits the TaskSet (a group of tasks) to the cluster for execution and monitors it
            Speculative execution: a straggler task (a task that is stuck) is retried on another node
            Reports a "fetch failed" error when shuffle output is lost
        Driver -> ExecutorBackend:
            Driver -> (action triggers runJob) SparkContext -> (submits the job, runJob) DAGScheduler -> (splits each stage into tasks, submitTasks)
            TaskScheduler -> (adds the tasks to the queue, reviveOffers) SchedulerBackend -> (dispatches a task to the chosen executor, launchTask) ExecutorBackend
        ExecutorBackend -> Driver:
            1. task finishes successfully, statusUpdate -> SchedulerBackend -> (2. task finished successfully, statusUpdate) -> TaskScheduler
             -> (3. remove the task, removeRunningTask) -> TaskSetManager...
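The locality and speculation behaviour described above is controlled by standard Spark configuration keys; a hedged sketch:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("scheduler-knobs")
      .set("spark.speculation", "true")           // re-launch straggler tasks on other nodes
      .set("spark.speculation.quantile", "0.75")  // fraction of tasks done before speculating
      .set("spark.locality.wait", "3s")           // wait before relaxing process->node->rack->any
    val sc = new SparkContext(conf)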
    Two kinds of tasks: ShuffleMapTask and ResultTask; most executed tasks are ShuffleMapTasks
         ResultTask: the task of the finalStage; what it returns to the driver is the computation result itself
                    If the result is small enough, it is placed directly inside a DirectTaskResult object
                    If it exceeds a certain size (10 MB by default), the executor first serializes the DirectTaskResult, stores the serialized result as a block in the BlockManager, then puts the block ID returned by the BlockManager into an IndirectTaskResult object and returns that to the driver
         ShuffleMapTask: returns a MapStatus object to the DAGScheduler. The MapStatus object records the storage information of the ShuffleMapTask's output in the ShuffleBlockManager, not the result itself; this location information is what the tasks of the next stage use to fetch their input data. The number of partitions is determined by the partitioner object in the ShuffleDependency. The Spark kernel provides a pluggable shuffle interface.
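To illustrate the last point, a minimal sketch (mine, not from the notes): the partitioner handed to reduceByKey becomes the partitioner of the ShuffleDependency and fixes the number of shuffle output partitions:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PartitionerDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partitioner-demo").setMaster("local[2]"))
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
        // The HashPartitioner becomes the partitioner of the ShuffleDependency,
        // so the shuffle produces exactly 8 reduce-side partitions.
        val counts = pairs.reduceByKey(new HashPartitioner(8), _ + _)
        println(counts.partitions.length) // 8
        sc.stop()
      }
    }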

     More details: BlockManager, Akka (message-passing component), Netty (network I/O component)...

     Example (word count):
     val lines = sc.textFile(args(1))
     val words = lines.flatMap(x => x.split(" "))
     val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)


    (multiple sets, one chain per partition)
    HadoopRDD -> MappedRDD(String) -> FlatMappedRDD(String) -> pair RDD -> MapPartitionsRDD -(one-to-many)-> ShuffledRDD -> MapPartitionsRDD -> PairRDDFunctions
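This lineage can be printed for the word-count example above with toDebugString (the exact RDD class names vary by Spark version):

    println(wordCounts.toDebugString)
    // e.g. ShuffledRDD <- MapPartitionsRDD <- MapPartitionsRDD <- HadoopRDD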

Storm stream data processing
Solution: the process runs permanently and the data stays in memory
Common stream processing systems: Storm, Trident, S4, Spark Streaming

Storm introduction:
Nimbus: resource allocation and task scheduling; writes task-related information into the corresponding directory in ZooKeeper
Supervisor: accepts tasks assigned by Nimbus; starts and stops the worker processes under its own management
Each spout/bolt thread is called a task


    topology: the running form of a real-time application in Storm; the messages flowing between components form a logical topology (see the sketch after this list)
    spout: the component that produces the source data stream in a topology; active
    bolt: the component that receives data in a topology and processes it; passive

     tuple: the unit of message passing
     stream: an unbounded, continuously delivered sequence of tuples
     stream grouping: how messages are partitioned among tasks... shuffle, fields, hash
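A minimal sketch of these concepts using Storm's Java API from Scala (Storm 1.x package names; the spout and bolt are toy classes of mine, not from the notes):

    import java.util.{Map => JMap}
    import org.apache.storm.{Config, LocalCluster}
    import org.apache.storm.spout.SpoutOutputCollector
    import org.apache.storm.task.TopologyContext
    import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
    import org.apache.storm.topology.base.{BaseBasicBolt, BaseRichSpout}
    import org.apache.storm.tuple.{Fields, Tuple, Values}

    // Spout: actively produces the source stream of tuples.
    class SentenceSpout extends BaseRichSpout {
      private var out: SpoutOutputCollector = _
      override def open(conf: JMap[_, _], ctx: TopologyContext, c: SpoutOutputCollector): Unit = { out = c }
      override def nextTuple(): Unit = { out.emit(new Values("the quick brown fox")); Thread.sleep(100) }
      override def declareOutputFields(d: OutputFieldsDeclarer): Unit = d.declare(new Fields("sentence"))
    }

    // Bolt: passively receives tuples and processes them.
    class SplitBolt extends BaseBasicBolt {
      override def execute(t: Tuple, out: BasicOutputCollector): Unit =
        t.getStringByField("sentence").split(" ").foreach(w => out.emit(new Values(w)))
      override def declareOutputFields(d: OutputFieldsDeclarer): Unit = d.declare(new Fields("word"))
    }

    object DemoTopology {
      def main(args: Array[String]): Unit = {
        val builder = new TopologyBuilder
        builder.setSpout("sentences", new SentenceSpout, 2)
        // shuffle grouping: tuples go to random SplitBolt tasks;
        // fieldsGrouping("split", new Fields("word")) would hash-partition by a field instead.
        builder.setBolt("split", new SplitBolt, 4).shuffleGrouping("sentences")
        new LocalCluster().submitTopology("demo", new Config, builder.createTopology())
      }
    }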
 Multi-language programming: Java, Ruby, Python; support for another language can be added by implementing a simple Storm communication protocol
 Fault tolerance: ensures a computation keeps running
 Horizontal scaling: computation proceeds in parallel across multiple threads, processes, and servers
 Fast
 System reliability: built on ZooKeeper; a large amount of metadata about the system's runtime state is serialized in ZK
