
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Background:

  1. Data growth -> single-machine processing power could not keep up -> multi-machine, cluster processing;
  2. Programming a cluster of machines is not the same as programming a single one -> new programming frameworks: Hadoop, F1, MillWheel, Storm, Impala, Pregel, BSP

Existing problems:

The systems above are specialized, each dedicated to one class of problem, and they leave many issues unresolved:

  1. Each one partially re-implements distributed processing, so much of the development is duplicated; (Spark provides a common framework: systems built on Spark need not handle distribution themselves, eliminating the duplicated development)
  2. Solving a big problem requires several systems to cooperate, and the data import/export needed for that collaboration is expensive; (Spark's general-purpose representation of distributed datasets removes the need to import and export data between multiple systems, which is very efficient)
  3. These systems do not solve the problem of adapting existing code; (to tell the truth, Spark does not solve this problem either)
  4. When multiple systems cooperate, deploying and managing the integrated environment also requires extra development; (Spark keeps this comparatively simple)


RDD design:

  1. An RDD can only be built from stable storage or from other RDDs, and once constructed it is immutable;
  2. An RDD exposes the following facets (see the sketch after this list):
    1. transformations: e.g. map, filter
    2. actions: e.g. collect
    3. persistence: memory, disk, or any other storage
    4. partitioning: how the data in the RDD is partitioned
  3. Only coarse-grained transformations are supported; fine-grained changes to a single entry are not;
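
As a concrete illustration of these facets, here is a minimal sketch against the standard Spark Scala API; the file name errors.log is a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

    // Build an RDD from stable storage ("errors.log" is a placeholder path).
    val lines = sc.textFile("errors.log")

    // Transformations are lazy: they only record lineage, no work happens yet.
    val errors = lines.filter(_.contains("ERROR")) // transformation
    val fields = errors.map(_.split("\t"))         // transformation

    // Persistence: ask Spark to keep this RDD in memory across actions.
    fields.persist(StorageLevel.MEMORY_ONLY)

    // An action forces evaluation and returns a result to the driver.
    println(fields.count())                        // action

    sc.stop()
  }
}
```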

RDD advantages (compared with Distributed Shared Memory):

  1. Supporting only coarse-grained transformations reduces the difficulty of fault tolerance: a lost partition can simply be re-run, and stragglers can be handled with speculative re-execution as in MapReduce, at the granularity of a single partition;
  2. It limits what users can do: applications that need fast fine-grained updates cannot be implemented with RDDs (see the sketch below);

Therefore RDDs are aimed at bulk computations over large amounts of data;
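
A hypothetical sketch of the limitation in point 2 (assuming an existing SparkContext sc): because RDDs are immutable, "updating one entry" has to be expressed as a coarse-grained map that produces a whole new RDD:

```scala
val counts = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))

// There is no in-place mutation like counts("b") = 42; instead the whole
// dataset is rewritten through a bulk (coarse-grained) transformation:
val updated = counts.map {
  case ("b", _) => ("b", 42) // rewrite the one entry we care about
  case other    => other     // pass every other entry through unchanged
}
```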

RDD representation:

An RDD computation is represented as a DAG; each RDD implements a small common interface, through which new transformations can be defined (a simplified sketch follows this list):

  1. partitions: the list of partitions the RDD is divided into;
  2. dependencies: which parent RDDs this RDD depends on;
  3. preferredLocations(p): the preferred nodes for computing partition p (data locality);
  4. iterator(p, parentIters): compute partition p given iterators over its parent partitions;
  5. partitioner: how the RDD's data is partitioned;

map, union, reading HDFS files, and the other operations are all implemented on top of, or in terms of, these interfaces;
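
A simplified, hypothetical Scala sketch of this interface (names follow the paper, but this is not Spark's actual source; for brevity, iterator here pulls from its parent directly rather than taking parentIters):

```scala
trait Partition { def index: Int }
trait Dependency[T] { def rdd: SimpleRDD[T] }

abstract class SimpleRDD[T] {
  def partitions: Array[Partition]                  // how the data is split
  def dependencies: Seq[Dependency[_]]              // parent RDDs
  def preferredLocations(p: Partition): Seq[String] // locality hints (host names)
  def iterator(p: Partition): Iterator[T]           // compute one partition
  def partitioner: Option[AnyRef]                   // e.g. a hash/range partitioner

  // map can be built purely on top of this interface:
  def map[U](f: T => U): SimpleRDD[U] = {
    val parent = this
    new SimpleRDD[U] {
      def partitions = parent.partitions             // same partitioning as parent
      def dependencies = Seq(new Dependency[T] { def rdd = parent })
      def preferredLocations(p: Partition) = parent.preferredLocations(p)
      def iterator(p: Partition) = parent.iterator(p).map(f)
      def partitioner = None
    }
  }
}
```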

Spark implementation:

Scheduler:

Similar to Dryad; Spark divides the dependencies between RDDs into Narrow Dependencies (map, union, ...) and Wide Dependencies (e.g. reduceByKey), for the following reasons (a sketch follows this list):

  1. In a narrow dep the mapping between partitions is one-to-one, while in a wide dep it is many-to-many; chains of narrow deps are therefore easy to pipeline and execute concurrently: as soon as an upstream partition is ready, the downstream partition can continue its computation, without waiting for the whole upstream RDD to be done;
  2. Error recovery under a narrow dep only requires recovering the upstream partitions actually depended on, not all of them as under a wide dep; the DAG of RDDs is divided into multiple Stages: within a Stage, as many narrow-dep-connected RDDs as possible are pipelined together, and reaching a wide dep starts a new Stage; Stages that do not have to wait for each other can be executed concurrently;
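
A sketch of these dependency types as they appear in user code (assuming an existing SparkContext sc; input.txt is a placeholder): the narrow-dep flatMap/map pair is pipelined into one Stage, and the wide-dep reduceByKey causes a shuffle and hence a Stage boundary:

```scala
val words  = sc.textFile("input.txt")  // placeholder path
  .flatMap(_.split(" "))               // narrow dep
  .map(w => (w, 1))                    // narrow dep: pipelined with flatMap

val counts = words.reduceByKey(_ + _)  // wide dep: shuffle => Stage boundary

val upper  = counts.map { case (w, n) => (w.toUpperCase, n) } // narrow: next Stage

upper.collect()                        // action: triggers the whole DAG
```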

Interpreter:

Scala supports an interactive interpreter: the interactive shell takes the user's Scala code, compiles each line into a class, and hands it to the JVM for execution. Spark modifies the Scala interpreter in two ways (a conceptual sketch follows this list):

  1. It serves the generated classes over HTTP, making it possible to execute code across the Driver and the Workers;
  2. It optimizes Scala's code-generation logic so that not every line is compiled into its own class; only what is necessary is compiled into a class;
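
A purely conceptual sketch of the line-to-class wrapping (this is not the interpreter's actual generated code):

```scala
// Suppose the user types into the shell:
//   val x = lines.count()
// Conceptually, the interpreter wraps that line in a generated class:
class Line1 {
  object instance {
    val x = 42L // stands in for the value of lines.count()
  }
}
// Later lines reference earlier ones through such wrappers, and the HTTP
// class server lets Workers fetch classes generated on the Driver.
```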

Memory Management:

RDDs can be persisted in three ways (see the sketch after this list):

  1. in-memory Java objects: the JVM can access them directly, fast
  2. in-memory serialized data: cannot be accessed directly, but occupies less memory
  3. on-disk storage
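
The three options map directly onto Spark storage levels; a sketch assuming an existing RDD named fields (only one level can be active per RDD, so the alternatives are commented out):

```scala
import org.apache.spark.storage.StorageLevel

fields.persist(StorageLevel.MEMORY_ONLY)        // 1. deserialized Java objects in memory
// fields.persist(StorageLevel.MEMORY_ONLY_SER) // 2. serialized bytes in memory: smaller, slower
// fields.persist(StorageLevel.DISK_ONLY)       // 3. on-disk only
```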

If there is not enough storage for a new RDD's new partition, previously stored partitions are evicted using an LRU policy;

Tachyon is used for sharing RDDs between Spark instances, improving the memory utilization of Spark clusters;

Checkpoint:

Reason: for a DAG with long lineage (descent), recomputing a lost RDD is too time-consuming, so such RDDs are checkpointed (see the sketch below);
Advantage: RDDs are immutable, so no snapshot mechanism is needed; the checkpoint can simply be written slowly in the background;
Todo: automatic checkpointing of RDDs; the system could collect statistics on RDD computation times, so it knows which RDDs need to be checkpointed;
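
A minimal sketch of checkpointing a long-lineage RDD with the standard API (assuming an existing SparkContext sc; the directory is a placeholder):

```scala
sc.setCheckpointDir("/tmp/spark-checkpoints") // where checkpoint files are written

var scores = sc.parallelize(1 to 1000).map(_.toDouble)
for (_ <- 1 to 50) {   // many iterations build up a long lineage
  scores = scores.map(_ * 0.99)
}

scores.checkpoint()    // mark the RDD to be checkpointed
scores.count()         // the next action materializes the checkpoint,
                       // after which the lineage is truncated
```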
