Spark paper study: a thorough understanding of RDDs

2.2 The RDD Abstraction

2.2.1 RDD Description

  • What is an RDD?
      Formally, an RDD is a read-only, partitioned collection of records: a fault-tolerant, parallel data structure. It lets users explicitly keep data on disk or in memory, control its partitioning, and manipulate it with a rich set of operators.

  • How is an RDD created?
      An RDD can be created in exactly two ways, both deterministic transformations:

    • by applying a deterministic operation to data in stable storage
    • by applying a deterministic operation to other RDDs
  • What motivated RDDs?
      As mentioned earlier, the Spark researchers found that the biggest problem with existing processing systems is that data cannot be shared efficiently (sharing happens through external storage); even the systems that optimize for this support only specific patterns.

  • How do general-purpose in-memory storage abstractions achieve fault tolerance?
      By replicating data across nodes, or by logging updates across nodes. For data-intensive applications this is expensive: on one hand, large amounts of data must be copied over the cluster network, whose bandwidth is far lower than that of RAM; on the other hand, the replicas consume a great deal of storage.

  • How are RDDs fault-tolerant?
      RDDs provide an interface based on coarse-grained transformations, in which the same operation is applied to many data records. Spark therefore only needs to record the transformations that produced an RDD, officially called its lineage, rather than the actual data. As a result, whenever a partition is lost, the RDD holds enough information about how it was derived from other RDDs to recompute just the lost partition.

  • Other points
      An RDD has enough information about how it was derived from other datasets (its lineage) to compute its partitions from data in stable storage. This is a powerful property: in essence, a program cannot reference an RDD that it cannot reconstruct after a failure.
      Users can control two other aspects of RDDs: persistence and partitioning. They can choose a storage strategy (e.g., in memory or on disk) for an RDD based on its future use, and they can also partition an RDD's elements across machines based on a key in each record.
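The two creation paths and the two user-controlled aspects above can be sketched with Spark's Scala API (a minimal sketch: the HDFS path is a placeholder, and spark is the cluster connection object used in the paper's examples):

```scala
// Path 1: create an RDD from data in stable storage.
val lines  = spark.textFile("hdfs://.../data.txt")
// Path 2: create an RDD by transforming another RDD.
val errors = lines.filter(_.startsWith("ERROR"))

// User-controlled persistence: keep this RDD in memory for reuse.
errors.persist()
// User-controlled partitioning: key each record, then let a
// shuffle operation hash-partition the result across machines.
val byKey = errors.map(line => (line.split(" ")(0), line))
                  .groupByKey(10)   // 10 partitions, hash-partitioned by key
```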

2.2.2 Spark Programming Interface

  Spark exposes RDDs through a language-integrated API similar to DryadLINQ and FlumeJava, in which each RDD is represented as an object and transformations are invoked as methods on these objects.
  Spark's operations fall into two classes: transformations (e.g., map, filter) and actions (e.g., count, collect, save). Users typically create RDDs from data in stable storage through transformations, then use those datasets in actions. Like DryadLINQ, Spark evaluates RDDs lazily: computation begins only when an RDD is first needed by an action. This lets Spark pipeline all the transformations.
  If an RDD will be reused later, Spark provides a persist interface to make it persistent. By default, Spark keeps persistent RDDs in RAM and spills them to disk when RAM is insufficient. Spark also offers other persistence strategies: through flags to persist, users can choose to persist only to disk, or to replicate the RDD across machines. In addition, users can set a persistence priority on each RDD to decide which in-memory data should spill to disk first.
  
  Let's look at how this model is used through the Spark API, and how it achieves fault tolerance:

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

Note the third line: it persists errors in memory so that subsequent queries can share the data. Also note that up to this point Spark has not performed any work. Only when an action is executed, e.g. errors.count(), does the user actually use the RDD. With the partitions of errors stored in memory, subsequent computations based on this RDD are greatly accelerated.

Figure 2.1 RDD lineage graph



  In this example, the Spark scheduler pipelines the two transformations into a single stage, then sends a set of tasks to the nodes that hold cached partitions of errors. If a partition of errors is lost, Spark rebuilds it by applying filter(_.startsWith("ERROR")) only to the corresponding partition of lines.
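The original paper continues this example with further queries over the persisted errors RDD, for instance:

```scala
// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning HDFS as an array
// (assuming time is field number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
```

Both queries reuse the in-memory partitions of errors, which is exactly the data sharing that RDDs are designed to make cheap.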

2.2.3 Advantages of the RDD Model

Comparing RDDs with distributed shared memory (DSM):
        Figure 3.1 Comparison of RDDs and distributed shared memory


1. The main difference between RDDs and DSM is that RDDs can only be created through coarse-grained transformations, while DSM allows reads and writes at arbitrary memory locations. This restricts RDD programs to bulk writes, but allows much more efficient fault tolerance: RDDs recover via the lineage graph, with no checkpointing overhead; only the lost partitions need to be recomputed; the recomputation can run in parallel across nodes; and the program never has to roll back.
2. A second benefit of RDDs is their immutability, which lets the system mitigate stragglers by running backup copies of slow tasks on other nodes.
3. RDDs have two further advantages over DSM. First, because operations on RDDs are bulk operations, the runtime can schedule tasks based on data locality to improve performance. Second, for scan-based operations, RDDs degrade gracefully when memory is insufficient: partitions that do not fit in RAM spill to disk.

2.2.4 Applications Not Suited to RDDs


  First, to be clear, RDDs target batch applications, i.e., those that apply the same operation to every element of a dataset. In this case, an RDD can efficiently record each transformation as one step in its lineage graph, so it does not need to log large amounts of data to recover lost partitions. RDDs are therefore not suitable for applications that make asynchronous, fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler.


2.3 Spark Programming Interface


1. Why was Spark written in Scala?

  Official explanation: because Scala combines conciseness (convenient for interactive use) with efficiency (static typing).
2. How is it used?
  Developers write a driver program that connects to a cluster of workers. The driver defines RDDs and invokes actions on them; the Spark code on the driver also tracks the RDDs' lineage. The workers are long-lived processes that can store partitions in RAM across operations.
  Users supply arguments to RDD operations such as map by passing closures. Scala represents each closure as a Java object, which can be serialized and loaded onto another node so that closures can be shipped across the network. Scala also saves any variables bound in the closure as fields in the Java object.
  RDDs are statically typed objects parameterized by an element type; e.g., RDD[Int] is an RDD of integers. Thanks to Scala's type inference, the element type can usually be omitted.
  Although the idea of exposing RDDs through Scala is conceptually simple, the challenge lay in handling the issues that come with processing Scala closure objects via reflection. A lot of work was also needed to make Spark usable from the Scala interpreter.

2.3.1 RDD Operations in Spark


Figure 5.1 Some of the RDD operations available in Spark



Note: transformations are lazily executed operations.
  Note that some operations, such as join, are available only on RDDs of key-value pairs. The function names also match those in Scala and other functional languages: for example, map is a one-to-one mapping, while flatMap maps each input to one or more outputs.
  Besides persisting an RDD with persist, users can obtain an RDD's partitioning, represented by a Partitioner class, and partition another dataset according to that Partitioner. Operations such as groupByKey, reduceByKey, and sort automatically produce a hash- or range-partitioned RDD.
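For example, a word count shows the map/flatMap distinction and a shuffle operation that yields a hash-partitioned result (a small sketch; it assumes lines is an RDD of strings as in the earlier example):

```scala
val words  = lines.flatMap(_.split(" "))   // one line  -> many words
val ones   = words.map(w => (w, 1))        // one word  -> one (key, value) pair
val counts = ones.reduceByKey(_ + _)       // shuffle: result is hash-partitioned
// counts now carries a Partitioner, so a later join with another RDD
// partitioned the same way requires no additional shuffle.
```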

2.3.2 Application Examples

  • Logistic regression
      This is a common classification algorithm that searches for the hyperplane that best separates two sets of points. The algorithm uses gradient descent: w starts at a random value, and on each iteration a function of w is summed over the data to move w in a direction that improves it.
val points = spark.textFile(...).map(parsePoint).persist()
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
  }.reduce((a, b) => a + b)
  w -= gradient
}

  The map transformation parses each line of a text file into a Point object, and persist makes the result a persistent RDD. The map and reduce over points then run on every iteration to compute the gradient (by summing a function of the current w). Keeping points persisted in memory yields a 20x speedup.

  • PageRank
    PageRank's data sharing pattern is slightly more complex. The algorithm iteratively updates a rank for each document by adding up the contributions of the documents that link to it. On each iteration, each document sends a contribution of r/n to its neighbors, where r is the document's rank and n is the number of its neighbors. Each document then updates its rank to a/N + (1-a) * sum(ci), where the sum is over the contributions it received, N is the total number of documents, and a is a tunable parameter.
// Load graph as an RDD of (URL, outlinks) pairs
val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  // Sum contributions by URL and get new ranks
  ranks = contribs.reduceByKey((x, y) => x + y)
                  .mapValues(sum => a/N + (1-a)*sum)
}

Let's look at PageRank's lineage graph:

Figure 6.1 PageRank lineage graph



  On each iteration, a new ranks dataset is created from the previous iteration's ranks, the static links dataset, and the contribs dataset. The lineage graph therefore grows longer as the iterations proceed. To reduce recovery time, it is necessary to reliably save some versions of ranks; the user can do this by calling persist with a RELIABLE flag. Note, however, that the links dataset does not need to be replicated, because its partitions can be rebuilt efficiently by rerunning a map on blocks of the input file. links is also typically much larger than ranks, since each document has many links but only one rank, so recovering it through lineage saves far more time than checkpointing the program's entire in-memory state would.
  Finally, communication in PageRank can be optimized by controlling the partitioning of the RDDs. If we partition links and ranks in the same way, the join between them requires no communication. We can also write a custom Partitioner class to group pages that link to each other, e.g., grouping URLs by domain. Both can be done with partitionBy:

links = spark.textFile(...).map(...).partitionBy(myPartFunc).persist()

  After this operation, the join between links and ranks automatically aggregates the contributions for each URL on the machine holding its link list, computes its new rank there, and joins it with its links. This kind of consistent partitioning across iterations is one of the main optimizations in specialized frameworks such as Pregel; RDDs let users express this goal directly.
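Such a domain-based partitioner might look like the following sketch in today's Spark API (the class name and hashing scheme are illustrative, not from the paper):

```scala
import org.apache.spark.Partitioner

// Groups URLs by their host so that the pages of one site land in
// the same partition. Illustrative sketch; keys are URL strings.
class DomainPartitioner(val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    val host = new java.net.URI(key.toString).getHost
    math.abs(host.hashCode) % numPartitions
  }
  // Equal partitioners let Spark skip the shuffle on a join.
  override def equals(other: Any): Boolean = other match {
    case p: DomainPartitioner => p.numPartitions == numPartitions
    case _                    => false
  }
}
```

Passing an instance of this class as myPartFunc in the snippet above would then co-locate links and ranks entries for the same domain.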

Figure 6.2 The interface used to represent RDDs in Spark


2.4 Representing RDDs

  One challenge in providing RDDs as an abstraction is choosing a representation that can trace lineage across a rich set of transformations. At the same time, a system implementing RDDs should offer as many transformation operators as possible and let users compose them arbitrarily. Spark uses a simple graph-based representation of RDDs that meets these goals: it supports a rich set of transformations without adding special logic to the scheduler for each one, which greatly simplifies the system design.

 Spark represents each RDD through a common interface that exposes five pieces of information:

  • A set of partitions, the atomic pieces of the dataset
  • A set of dependencies on parent RDDs
  • A function for computing the dataset from its parents
  • Metadata about its partitioning scheme
  • Metadata about its data placement

 For example, an RDD representing an HDFS file has a partition for each block of the file and knows which machine each block is on. The result of a map on this RDD has the same partitions, but applies the map function to the parent's data when computing its elements.
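The five pieces of information can be sketched as a Scala trait (a simplified sketch following the interface table in the original paper; the Partition and Dependency types are left abstract):

```scala
// Simplified sketch of the common RDD interface from the paper.
trait RDD[T] {
  def partitions: Seq[Partition]          // atomic pieces of the dataset
  def dependencies: Seq[Dependency]       // dependencies on parent RDDs
  def iterator(p: Partition,              // compute the elements of partition p
               parentIters: Seq[Iterator[_]]): Iterator[T]
  def partitioner: Option[Partitioner]    // partitioning-scheme metadata
  def preferredLocations(p: Partition): Seq[String] // data placement hints
}
```

A MappedRDD, for instance, only has to override iterator to apply its function to the parent's iterator; everything else is inherited unchanged.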
Classifying dependencies into just two categories proved both sufficient and useful:

Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD.
Wide dependency: each partition of the parent RDD may be used by multiple partitions of the child RDD.


Figure 7.1 Narrow and wide dependencies. Each box is an RDD, and shaded rectangles are partitions.

For example, map leads to a narrow dependency, while join leads to wide dependencies (unless the parent RDDs are hash-partitioned).

  This distinction is useful in two ways. First, narrow dependencies allow pipelined execution on one cluster node, which can compute all the parent partitions; for example, one can apply map element by element and then apply filter. In contrast, wide dependencies require data from all parent partitions to be available and to be shuffled across nodes using a MapReduce-like operation. Second, narrow dependencies make recovery after a node failure more efficient, since only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes. In a lineage graph with wide dependencies, by contrast, a single failed node may cause the loss of some partition from every ancestor of an RDD, requiring a complete re-execution.
  This common interface made it possible to implement most transformations in Spark in less than 20 lines of code each; implementing a transformation with these interfaces requires no knowledge of the scheduler's internals.
  HDFS files: the input RDDs in the examples are files in HDFS. Each block of the file is a partition (the block's offset is stored in the partition object). preferredLocations gives the nodes a block is on, and iterator reads the block.
  map: calling map on any RDD returns a MappedRDD object, which has the same partitions and preferred locations as its parent, but applies the function passed to map to the parent's records in its iterator method.
  union: calling union on two RDDs returns an RDD whose partitions are the union of the parents' partitions; each child partition is computed through a narrow dependency on one parent partition.
  sample: sampling is similar to map, except that the RDD stores a random-number-generator seed for each partition to deterministically sample the parent's records.
  join: joining two RDDs may lead to either two narrow dependencies (if both RDDs are partitioned by the same hash/range partitioner), two wide dependencies, or a mix (if one parent has a partitioner and the other does not). In any case, the output has a partitioner (either inherited from a parent or a default hash partitioner).

2.5 Implementation

  Spark is implemented in roughly 34,000 lines of Scala. It runs on a variety of cluster managers, including Apache Mesos, Hadoop YARN, Amazon EC2, and its built-in cluster manager. Each Spark program runs as a separate application on the cluster, with its own driver (master) and workers; sharing of cluster resources between applications is handled by the cluster manager.
  Spark can read data from any Hadoop data source using Hadoop's existing plugin APIs, and runs on an unmodified version of Scala.

2.5.1 Job Scheduling

  The Spark scheduler also uses the RDD representation.
  Spark's scheduler is similar to Dryad's, except that it takes into account which partitions of persistent RDDs are available in memory. When the user runs an action on an RDD (e.g., count or save), the scheduler examines that RDD's lineage graph and builds a DAG (directed acyclic graph) of stages to execute. Each stage contains as many pipelined transformations with narrow dependencies as possible. A stage boundary appears wherever a wide dependency requires a shuffle, or wherever already-computed partitions can short-circuit the computation of a parent RDD; the operations after the boundary fall into the next stage. The scheduler then launches tasks to compute the missing partitions of each stage until the target RDD has been computed.
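In the word-count shape, for example, the stage boundary falls exactly at the shuffle (a sketch; in current Spark versions, counts.toDebugString prints this lineage with its shuffle boundary):

```scala
// Stage 1: textFile -> flatMap -> map are pipelined
// together, since all of these are narrow dependencies.
val pairs  = spark.textFile("hdfs://...")
                  .flatMap(_.split(" "))
                  .map((_, 1))
// Stage 2 begins here: reduceByKey is a wide dependency and needs a shuffle.
val counts = pairs.reduceByKey(_ + _)
counts.collect()   // the action that makes the scheduler build and run the DAG
```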
Figure 2.5 Example of Spark computing stages for a job

  In the figure, solid outlines are RDDs and shaded rectangles are partitions, shown in black if they are already in memory. To run an action on RDD G, the scheduler builds stages with wide dependencies as boundaries and pipelines the narrow transformations inside each stage. In this case, the results of stage 1 are already in memory, so only stage 2 and stage 3 are run.
  
  The Spark scheduler uses a delay scheduling policy and sends tasks to machines based on data locality. If a task needs to process a partition that is available in memory on some node, the task is sent to that node. If a task processes a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), the scheduler sends it to those locations.
  For wide dependencies, intermediate records are currently materialized on the nodes holding the parent partitions in order to simplify fault recovery, much as MapReduce materializes map outputs.
  If a task fails, the scheduler reruns it on another node, as long as its stage's parents are still available. If some stages have become unavailable (e.g., the map-side outputs of a shuffle were lost), the scheduler resubmits tasks to recompute the missing partitions in parallel. Failures of the scheduler itself are not yet tolerated, although replicating the RDD lineage graph would be straightforward.
  If a task runs slowly (i.e., is a straggler), the scheduler speculatively launches a backup copy on another node, as MapReduce does, and takes the output of whichever copy finishes first.
  Although all computations in Spark currently run in response to actions called in the driver program, we are also experimenting with letting tasks on the cluster call the lookup operation, which provides random access to elements of hash-partitioned RDDs by key. In this case, tasks would need to tell the scheduler to compute the required partition if it is missing.

2.5.2 Multi-tenancy

  The RDD model splits computation into independent, fine-grained tasks, so it allows multi-tenant resource sharing on a cluster. During execution, each RDD application can scale up and down dynamically; applications take turns accessing each machine and can be preempted by higher-priority applications. Most Spark tasks run in between 50 ms and a few seconds, enabling highly responsive sharing.

  • Within each Spark application, multiple threads may submit jobs concurrently; resources are allocated through hierarchical fair scheduling similar to the Hadoop Fair Scheduler. This feature is mainly used to build multi-user applications over the same in-memory data, such as the server mode of Spark SQL, where multiple users can run queries at the same time. Fair scheduling isolates jobs from one another, so short jobs return quickly even when long jobs occupy the whole cluster.
  • Spark's fair scheduler also uses delay scheduling to preserve data locality while maintaining fairness, by letting tasks take turns reading the data on each machine. Spark supports several levels of data locality, including memory, disk, and rack, to capture the different costs of data access across the cluster.
  • Because tasks are independent of one another, the scheduler supports cancelling jobs to make room for higher-priority jobs.
  • Across Spark applications, Spark still supports fine-grained sharing through the resource-offer concept in Mesos, which lets different applications use the same API to launch fine-grained tasks on the cluster. This allows dynamic resource sharing between Spark applications, and between Spark applications and other frameworks such as Hadoop. Delay scheduling for data locality still works under the resource-offer model.
  • Spark has also been extended to perform distributed scheduling with the Sparrow system, which lets multiple Spark applications enqueue work on the same cluster in a decentralized way while providing data locality, low latency, and fairness. Distributed scheduling greatly improves scalability when many applications submit jobs concurrently, by avoiding a centralized master.

  Because most clusters are multi-tenant, and the workloads running on them are becoming more and more interactive, these capabilities give Spark a significant performance advantage over static partitioning of the cluster.

2.5.3 Interpreter Integration

  Scala, like Ruby and Python, includes an interactive shell. Given the low latency of in-memory data, Spark aims to let users run interactive queries on large datasets through the interpreter.
Figure 2.6 Example of how the Spark interpreter translates user-typed lines into Java objects

  The Scala interpreter compiles each line typed by the user into a class, loads it into the JVM, and invokes a function on it. The class contains a singleton object holding the variables or functions on that line, and runs the line's code in an initializer. For example, if the user types var x = 5 followed on the next line by println(x), the interpreter defines a class called Line1 containing x, and compiles the second line into println(Line1.getInstance().x).

  Spark makes two changes to the interpreter:
  1. Class shipping: to let worker nodes fetch the bytecode of the classes created for each line, Spark serves these classes over HTTP.
  2. Modified code generation: normally, the singleton object for each line of code is accessed through a static method on its class. This means that when a closure references a variable defined on a previous line, such as x above, Java's object-graph tracing during serialization will not ship the Line1 instance wrapping x, so the worker node never receives x. Spark modifies the code-generation logic to reference the object instance of each line directly.

2.5.4 Memory Management

  Spark provides three options for storing persistent RDDs: as deserialized Java objects in memory, as serialized data in memory, and on disk. The first option gives the fastest performance, because the JVM can access each RDD element natively. When space is limited, the second option gives a more memory-efficient representation than Java object graphs, at the cost of somewhat slower access. The third option is useful when the RDD is too large to keep in RAM but too costly to recompute on each use.
  To manage the limited memory available, we use an LRU eviction policy at the level of RDDs. When a new RDD partition has been computed and there is not enough memory to store it, we evict a partition from the least recently used RDD, unless that is the same RDD as the one with the new partition. In that case, we keep the old partition in memory, to prevent partitions of the same RDD from cycling in and out. Because most operations run tasks over an entire RDD, a partition already in memory is quite likely to be needed again. This default policy has worked well in all our applications so far, but Spark also gives users further control through a "persistence priority" for each RDD.
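The eviction rule can be illustrated with a toy, single-process sketch (not Spark's actual memory store; keys are illustrative (rddId, partitionId) pairs):

```scala
import scala.collection.mutable

// Toy LRU store illustrating "evict a partition of the least recently
// used RDD, unless it belongs to the same RDD as the new partition".
class PartitionStore(capacity: Int) {
  // Insertion order approximates recency: oldest entries come first.
  private val lru = mutable.LinkedHashMap[(Int, Int), Array[Byte]]()

  def put(rddId: Int, partId: Int, data: Array[Byte]): Unit = {
    if (lru.size >= capacity) {
      // Evict the oldest partition NOT belonging to the incoming RDD, if any.
      lru.keys.find(_._1 != rddId).foreach(lru.remove)
    }
    if (lru.size < capacity) lru((rddId, partId)) = data
  }

  def get(rddId: Int, partId: Int): Option[Array[Byte]] = {
    // Re-insert on access so the entry moves to the "recent" end.
    lru.remove((rddId, partId)).map { d => lru((rddId, partId)) = d; d }
  }
}
```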
  Currently, each Spark instance on a cluster has its own separate memory space; in future work, we plan to share RDDs across Spark instances through a unified memory manager.

2.5.5 Support for Checkpointing

  Although the lineage graph can always be used to recover RDDs after a failure, recovery is time-consuming when the lineage chains are long. It can therefore be helpful to checkpoint some RDDs to stable storage.
  In general, checkpointing is useful for RDDs with long lineage graphs containing wide dependencies, such as the ranks dataset in the earlier PageRank example. In that case, the failure of a node in the cluster may lose some slice of data from each parent RDD, requiring a full recomputation. In contrast, for RDDs with narrow dependencies on data in stable storage, such as points in the logistic regression example and links in PageRank, checkpointing is not worthwhile: if a node fails, the partitions lost from these RDDs can be recomputed in parallel on other nodes, at a fraction of the cost of replicating the whole RDD.
  Spark provides an API for setting checkpoints (a REPLICATE flag to persist), but leaves the decision of which data to checkpoint to the user. We are also investigating automatic checkpointing: because the Spark scheduler knows the size of each dataset and the time it took to compute it the first time, it should be able to select an optimal set of RDDs to checkpoint so as to minimize system recovery time.
  Finally, the read-only nature of RDDs makes checkpointing them simpler than checkpointing shared memory: because consistency is not a concern, an RDD can be written out in the background without requiring program pauses or distributed snapshot schemes.
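In current Spark releases the same idea is exposed through an explicit checkpoint API rather than a persist flag (a sketch; the directory is a placeholder, and checkpointing every 10 iterations is an illustrative choice, not a recommendation from the paper):

```scala
// Stable storage for checkpoint files:
sc.setCheckpointDir("hdfs://.../checkpoints")
// Inside the PageRank loop, cut the growing lineage periodically:
if (i % 10 == 0) ranks.checkpoint()
```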

2.6 Evaluation

This section presents some highlights of Spark's measured performance, along with the official summary.

  • For iterative machine learning and graph computations, Spark is up to 80x faster than Hadoop, mainly because Spark stores data in memory as Java objects, avoiding I/O and object deserialization costs.
  • In particular, data analysis with Spark typically runs 40x faster than on Hadoop.
  • When a node fails, Spark can recover quickly by recomputing only the lost partitions.
  • Spark can be used to query 1 TB of data interactively, with latencies of only 5-7 s.

2.6.1 Iterative Machine Learning Applications

Figure 2.7 Time to run logistic regression and k-means on 100 GB of data on a 100-node cluster, for Spark, Hadoop, and HadoopBinMem

HadoopBinMem: a Hadoop deployment that, in a first pass, converts the input data into a binary format (to eliminate text parsing in later iterations) and stores it in HDFS.
Note: k-means is compute-intensive, while logistic regression is not, so logistic regression is more sensitive to I/O and deserialization time.

  • First iteration: all systems start by reading a text file from HDFS. Spark is faster than Hadoop, though not dramatically so; the gap is mainly due to Hadoop's signaling overhead from heartbeat communication between its master and workers. HadoopBinMem is the slowest, because it runs an extra MapReduce job to convert the data to binary and stores the result replicated across the cluster nodes (HDFS storage).
  • Later iterations: as the figure shows, for logistic regression Spark is 85x and 70x faster than Hadoop and HadoopBinMem respectively; for k-means it is 26x and 21x faster. The RDD model's advantage is greatest where the other systems spend heavily on I/O and serialization.
  • Why Spark is faster than these frameworks: the other systems pay
    - the minimum overhead of the Hadoop software stack,
    - the overhead of HDFS while serving data, and
    - the cost of converting binary records into usable in-memory Java objects.
    In the official measurements, a Hadoop job that performs no computation at all takes at least 25 s from job setup through completion and cleanup. As for HDFS overhead, HDFS performs multiple memory copies and computes a checksum while serving each block. And in the logistic regression workload, the binary deserialization step took longer than the computation itself, which explains why HadoopBinMem is the slowest.

2.6.2 PageRank

  We ranked websites with both Hadoop and Spark on a 54 GB Wikipedia dump, running the PageRank algorithm for ten iterations to process a link graph of roughly 4 million articles.
Figure 2.8 Performance of Hadoop and Spark on the PageRank algorithm

  With 30 nodes, in-memory storage alone made Spark 2.4x faster than Hadoop. Controlling the partitioning of the RDDs so that it stays consistent across iterations raised the speedup to 7.4x. The processing speed also scaled near-linearly as the cluster was expanded.

2.6.3 Fault Recovery

Figure 2.9 Iteration times for k-means in the presence of a failure: a node is killed at the start of the 6th iteration, triggering reconstruction of some partitions through the lineage graph

  This evaluation simulated the overhead of reconstructing RDD partitions from the lineage graph after a node failure, using the k-means algorithm running 10 iterations on a 75-node cluster. Without failures, each iteration consisted of 400 tasks processing 100 GB of data.

  After a node is killed, the tasks on it stop and the partitions stored on it are lost. Spark then reruns those tasks in parallel on other nodes, which re-read the corresponding input data and rebuild the lost RDD partitions according to the lineage graph. (Note that Spark takes data locality into account when choosing the nodes for re-execution.)
  Note that with a checkpoint-based recovery mechanism, recovery would require rerunning at least several iterations, depending on the frequency of the checkpoints. The system would also need to replicate 100 GB of data across the network, and would either consume twice the memory to hold a copy of the checkpoint in RAM, or have to wait for the data to be written to disk. In contrast, the lineage graphs of the RDDs in this example were all smaller than 10 KB. (The advantage of lineage-based failure recovery is evident: it both reduces the storage requirement and limits recomputation to the lost partitions.)

2.6.4 Performance with Insufficient Memory

Figure 2.10 Performance of logistic regression on 100 GB of data on 25 machines, with varying fractions of the dataset in memory

2.6.5 Interactive Data Mining

Figure 2.11 Response times of interactive queries in Spark on 100 machines as the input dataset grows

  We analyzed 1 TB of Wikipedia page-view logs on a cluster of 100 machines, each with 8 cores and 68 GB of RAM; each query scanned the entire input. Figure 2.11 shows that even when querying 1 TB of data, Spark's response time was only 5-7 s. The Exact Match query looked for pages whose title exactly matched the input; Substring Match looked for pages whose title partially matched the input; Total View queried all pages. This is more than an order of magnitude faster than working from disk; querying 1 TB of data from disk, for example, took 170 s. This shows that RDDs make Spark a powerful tool for interactive data mining.


2.7 Discussion

  This section focuses on which programming models the RDD abstraction can express, and why it is so broadly applicable. From the official description, we can indeed see that RDDs can express the models of most of today's cluster computing frameworks. Just as importantly, RDDs let users combine different models in one program (for example, run a MapReduce operation to build a graph, then run Pregel on it) and share data between those models.

2.7.1 Expressing Existing Programming Models

Where does the efficiency come from?
  1) Persisting specific datasets in memory.
  2) Partitioning data to reduce access costs.
  3) Recomputing only the lost partitions when a failure occurs.
Models that can be expressed with RDDs include:
  MapReduce: expressible with flatMap and groupByKey in Spark, or with reduceByKey.
  DryadLINQ: this system provides a wider range of operations than MapReduce, but they are all bulk operations that map directly onto Spark's RDD transformations.
  SQL: parallel operations on datasets, all achievable with RDD transformations.
  Pregel: a specialized model from Google for iterative graph applications. A program runs as a series of supersteps; in each superstep, every vertex runs the same user function, which can update state associated with the vertex, change the graph topology, and send messages to other vertices for the next superstep. This model can express many algorithms, such as shortest paths, bipartite matching, and PageRank.
The key to expressing this model with RDDs is that each iteration applies the same function to every vertex. We can keep the vertex states in an RDD, apply the user function with a bulk transformation (flatMap) to produce an RDD of messages, and then join the vertex states with the messages to perform the exchange. Just as importantly, RDDs let us keep the vertex states in memory as Pregel does, control their partitioning to reduce access costs, and recover only a small fraction of the data on failure.
  Iterative MapReduce: systems such as HaLoop and Twister, where the user provides a series of MapReduce jobs that form a loop. These systems keep data partitioned consistently across iterations, and Twister can also keep data in memory. Both optimizations are part of the RDD model, so these systems can be expressed with RDDs.
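A single Pregel superstep under this encoding might be sketched as follows (an illustrative sketch, not Pregel's or Spark's actual API; State, Msg, and userFunc are assumed names, and userFunc returns the new vertex state plus outgoing messages):

```scala
// vertices: RDD[(vertexId, State)]; messages: RDD[(vertexId, Msg)]
def superstep(vertices: RDD[(Long, State)],
              messages: RDD[(Long, Msg)])
    : (RDD[(Long, State)], RDD[(Long, Msg)]) = {
  // Group this superstep's messages by destination vertex,
  // join them with the vertex states, and apply the user function.
  val grouped = messages.groupByKey()
  val updated = vertices.leftOuterJoin(grouped).mapValues {
    case (state, msgs) => userFunc(state, msgs.getOrElse(Iterable.empty))
  }
  val newVertices = updated.mapValues(_._1)                 // new states
  val outMessages = updated.flatMap { case (_, (_, out)) => out } // messages
  (newVertices, outMessages)
}
```

Persisting and partitioning the vertex RDD then plays the role of Pregel's in-memory, co-located vertex state.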

2.7.2 Explaining the Expressiveness of RDDs

  • Although RDDs can only be created through bulk transformations, this does not keep them from expressing most programming models, because many parallel computations apply the same operation to a large amount of data, which RDDs express naturally.
  • Nor is the immutability of RDDs an obstacle, because one can create multiple RDDs to represent different versions of a dataset. Moreover, existing MapReduce applications run over file systems, such as HDFS, that do not allow updates to files.
    If RDDs are this powerful, did the designers of the earlier specialized systems really never think of them? The Spark developers' explanation: those systems were designed around the specific problems of their target applications, rather than around the more general cause of those problems, namely data sharing across jobs.

2.7.3 Debugging with RDDs

  By logging the lineage graph of a job's RDDs, one can later reconstruct these RDDs and query them interactively, and rerun any task from the job in a single-process debugger by recomputing the partitions it depends on. Unlike traditional replay debuggers for distributed systems, which must capture or infer the ordering of events across multiple nodes, a debugger that records only the RDD lineage graph has near-zero logging overhead.


2.8 Related Work

  • Cluster programming models
      First, dataflow models such as MapReduce, Dryad, and CIEL provide rich operators for processing data, but they share data through stable storage. RDDs offer a more efficient sharing abstraction because they avoid the costs of data replication, I/O, and serialization.
      Second, dataflow systems with high-level programming interfaces, such as DryadLINQ and FlumeJava, let users manipulate "parallel datasets" through language-integrated operators like map and join. However, these parallel datasets represent either files on disk or ephemeral datasets expressing a query plan. Although such a system can pipeline multiple operators within one query (e.g., one map followed by another), it cannot share data efficiently across queries. We based the Spark API on this parallel-dataset model for its convenience, and claim no novelty for the language-integrated interface; the novelty lies in providing RDDs as the storage abstraction behind it, which lets Spark support a broad range of applications.
      Third, there are systems that provide high-level interfaces for specific classes of applications requiring data sharing. For example, Pregel supports iterative graph computations, while Twister and HaLoop provide iterative MapReduce runtimes. The problem is that these systems share data only for the computation patterns they support, and provide no general abstraction that lets users share data of their choice between operations of their choice. For example, a user cannot use Pregel or Twister to load a dataset into memory and then decide which queries to run on it. RDDs provide a distributed storage abstraction explicitly, and can therefore support applications these specialized systems cannot, such as interactive data mining.
      Finally, some systems expose shared mutable state to let users perform in-memory computation. For example, Piccolo lets users run parallel functions that read and update cells in a distributed hash table; distributed shared memory (DSM) systems and key-value stores such as RAMCloud offer similar models. RDDs differ from these systems in two ways. First, RDDs provide a higher-level programming interface based on operators such as map, sort, and join, whereas Piccolo and DSM only read and update table cells. Second, Piccolo and DSM provide fault tolerance through checkpoints and rollback, which in many applications is more expensive than the lineage-based strategy of RDDs. RDDs also have an advantage over DSM in straggler mitigation.
  • Caching systems
      Nectar can reuse intermediate results across DryadLINQ jobs by identifying common subexpressions through program analysis; adding this capability to an RDD-based system would be powerful. However, Nectar does not support in-memory caching (it places data on a distributed file system), nor does it let users explicitly control which datasets to persist or how to partition them. CIEL and FlumeJava can likewise cache task results, but also do not support in-memory caching or explicit control over which data is cached.
      Ananthanarayanan et al. have proposed adding an in-memory cache to distributed file systems to exploit the temporal and spatial locality of data access. This speeds up access to data already in the file system, but it is still less efficient than sharing intermediate results within an application via RDDs, because it still requires applications to write their results to the file system between stages.
  • Lineage
      Capturing lineage and provenance information for data has long been a research topic in scientific computing and databases, for applications such as explaining results, reproducing them, and recomputing data when a bug is found in a workflow or a dataset is lost. Whereas capturing fine-grained lineage is generally expensive, RDDs provide a parallel programming model in which capturing lineage is cheap, so it can be used for failure recovery.
      The lineage-based recovery mechanism is similar to the recovery used within MapReduce and Dryad jobs, which track dependencies among a DAG of tasks. In those systems, however, the lineage information is lost when a job ends, so a replicated storage system is needed to share data across computations. In contrast, RDDs use lineage to persist in-memory data efficiently across computations, without the cost of replication and disk I/O.
  • Relational databases
      RDDs are conceptually similar to views in a database, and persistent RDDs resemble materialized views. However, like DSM systems, databases typically allow fine-grained read-write access to all records, which requires logging operations and data for fault tolerance, plus extra overhead to maintain data consistency. These costs are unnecessary under the coarse-grained transformation model of RDDs.

2.9 Summary

  This chapter presented resilient distributed datasets (RDDs), an efficient, general-purpose, fault-tolerant abstraction for sharing data in cluster applications. RDDs can express a wide range of applications, including the iterative computations targeted by current specialized systems as well as applications those systems do not handle. Unlike existing cluster storage abstractions, which require data replication for fault tolerance, RDDs offer an API based on coarse-grained transformations that lets users recover data efficiently from the lineage graph. Spark implements RDDs; it runs iterative computations up to 80x faster than Hadoop and can be used to query hundreds of gigabytes of data interactively.

Note: this article is the author's own condensed summary, based on reading the original Spark paper and on experience using Spark. It aims to trace Spark's origins and understand its background; the figures in the article are taken from the original paper.

Origin blog.csdn.net/weixin_43878293/article/details/90256290