The general idea of RDDs in the Spark papers

5.1 Introduction

  As is well known, the RDD is an in-memory abstraction. This abstraction reduces latency, and its computation model is rich enough that an RDD-based system can match specialized systems while remaining more general and efficient. This article aims to explain the internal reasons for this clearly.
  That said, it should be recognized from the outset that although Spark is widely used because it meets most needs, it is by no means the best choice for everything. For applications highly sensitive to network latency, Spark is not suitable: its quasi-real-time model cannot satisfy them. Still, Spark certainly qualifies as a very good computational framework within big data, and its abstract data structure is well worth studying. Here the discussion expands along two perspectives on RDDs.
  First, expressiveness: as said before, RDDs can simulate any distributed system apart from those with near-real-time requirements, and in most scenarios they do so efficiently. That efficiency comes from providing data sharing between MapReduce-style batch jobs. Second, from the system perspective, RDDs give the user control over the most common resource bottlenecks, especially network and storage I/O.

5.2 The expressiveness perspective

  The student surpasses the master: RDDs inherit the thinking of MapReduce, but also attach a pair of wings to it.

5.2.1 What MapReduce can do

  In fact, MapReduce can express any distributed computation, because a distributed computation is, simply put, local computation on nodes plus timely exchange of information between them. Looking at the MapReduce model, map provides the local computation and reduce provides the information exchange among all nodes. Any computation can then be carried out step by step until the steps combine into a unified result; MapReduce can be applied at each step, and after a series of MapReduce rounds the final result is formed. In that sense, MapReduce can perform any distributed computation.
  But the problem is that this way of computing is not efficient enough. Each MapReduce round must write its results to external storage, in multiple replicas, so that the next round can read them, and this writing is very slow. In addition, a MapReduce job carries latency of its own, an inherent characteristic mentioned in an earlier article.
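
  To make the two roles concrete, here is a minimal word-count sketch in Spark's RDD API (the data is made up): flatMap/map plays the local-computation role and reduceByKey plays the information-exchange role; one such round corresponds to one "step" of the argument above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapThenExchange {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount").setMaster("local[*]"))

    val lines = sc.parallelize(Seq("a b a", "b c"))

    // Local computation: map runs independently on each node's partition.
    val ones = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Information exchange: reduce brings together the values for each key
    // from all nodes; this is the cross-node communication step.
    val counts = ones.reduceByKey(_ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}
```
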
  A Spark system based on RDDs addresses this in two ways. First, it keeps intermediate results in memory as much as possible for data sharing across a series of batch jobs, avoiding the slow replicated writes. Second, it has no high inherent delay: as the paper mentions, on commercial clusters of more than 100 nodes, Spark adds only about 100 ms of latency.
  Where does this inherent latency come from? Before a task obtains the CPU and begins executing, it passes through scheduling, submission, and startup phases, and the duration of this process varies by implementation, so different systems exhibit different latencies. For Spark, this roughly 100 ms delay is entirely adequate for compute-intensive batch jobs.
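
  The in-memory data sharing is visible directly in the API: persist (or cache) keeps a dataset in memory so that several jobs reuse it without a round trip through replicated external storage. A minimal sketch, with a placeholder input path and a local master:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheSharing {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-sharing").setMaster("local[*]"))

    // Parse the input once; keep the result in memory instead of
    // writing it back to replicated external storage between jobs.
    val points = sc.textFile("points.txt") // hypothetical input path
      .map(_.split(",").map(_.toDouble))
      .persist(StorageLevel.MEMORY_ONLY)

    // Two separate jobs reuse the same in-memory dataset.
    val count = points.count()
    val firstDimSum = points.map(_(0)).sum()

    println(s"n=$count, sum of first dimension=$firstDimSum")
    sc.stop()
  }
}
```
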

5.2.2 Lineage graphs and fault tolerance

  First, to be clear: at every step along a chain of RDD operators, the lineage graph the RDD carries is fixed and deterministic.
In Spark 2.4.0, an RDD is mainly characterized by the following five attributes (sketched in code after the list):

  • A list of partitions
  • A function for computing each split (partition)
  • A list of dependencies on other RDDs
  • Optional: a partitioner for key-value RDDs
  • Optional: a list of preferred locations for computing each split
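
  The shape below is abridged from Spark's RDD base class (org.apache.spark.rdd.RDD in the 2.4.x line); the method names follow the real source, while bodies and surrounding plumbing are omitted:

```scala
import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Abridged sketch of the five core properties of an RDD.
abstract class SketchRDD[T] {
  // 1. A list of partitions.
  protected def getPartitions: Array[Partition]

  // 2. A function for computing each split (partition).
  def compute(split: Partition, context: TaskContext): Iterator[T]

  // 3. A list of dependencies on other RDDs (the edges of the lineage graph).
  protected def getDependencies: Seq[Dependency[_]]

  // 4. Optional: a partitioner for key-value RDDs (e.g. hash-partitioned).
  val partitioner: Option[Partitioner] = None

  // 5. Optional: preferred locations for computing each split,
  //    used by the scheduler for data locality.
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}
```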

  To be fault tolerant, one first wants to save the intermediate states of different versions; these states are the results of the different transformations applied to the upstream data. Second, one wants to preserve the data by copying it over the network. In the face of a wide dependency, the map-side state of the upstream data must be computed and transferred to the reducers for further processing. If a program performs dozens of transformation steps, among them wide-dependency operations, the content that would need to be stored becomes very large, and correspondingly the price of fault tolerance becomes rather high. So Spark also provides an asynchronous checkpoint facility to bound this growth for applications with very long chains. The paper gives a rough standard: when checkpoint bandwidth is 10 times slower than memory bandwidth, set a checkpoint roughly once every 10 transformation operations.
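
  A minimal sketch of applying that 1-in-10 rule with Spark's real checkpoint API; the checkpoint directory and the 50-step loop are illustrative, and note that a checkpoint is only materialized when the next action runs:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CheckpointEveryTen {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ckpt").setMaster("local[*]"))
    sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical directory

    var state: RDD[Long] = sc.parallelize(0L until 1000L)
    for (step <- 1 to 50) {
      state = state.map(_ + 1) // one transformation per step; lineage keeps growing
      if (step % 10 == 0) {    // the rough 1-in-10 rule from the paper
        state.checkpoint()     // truncate the lineage chain here
        state.count()          // an action forces the checkpoint to be written
      }
    }
    println(state.sum())
    sc.stop()
  }
}
```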

5.2.3 Comparison with BSP

  BSP is the bulk synchronous parallel model. Its character is that it captures the most salient aspects of the hardware (communication latency and expensive synchronization) while remaining simple enough for mathematical analysis, and its cost can be tuned through a few parameters (the number of communication steps, the amount of local computation in each step, and the amount of traffic each processor transmits in each step). Parallel algorithms are commonly designed against this model, and Google's Pregel is based on it. But Pregel only supports fault tolerance by checkpointing and rolling back the entire system; as ordinary cluster sizes grow and failures become routine, this holds everything up. RDDs are more restrictive than Pregel, but by recording lineage information in a log they allow lost state to be recovered in parallel, and this is exactly where RDDs shine.
On the other hand, because RDDs expose a composable, iterable interface, it is easy to write different computation libraries that combine into organized pipelines, which makes composing programs convenient. RDDs also provide higher-level abstractions that let users compose correctly (for example, splitting state into multiple partitioned RDDs, or declaring dependencies as narrow or wide so that not every step requires all-to-all network communication), while keeping the simple generic interface.
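
  A small sketch of the narrow/wide distinction, and of using a partitioner so that a join avoids the all-to-all shuffle; the data is made up:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object NarrowVsWide {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("deps").setMaster("local[*]"))

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Narrow dependency: each output partition depends on exactly one
    // input partition; no network communication is needed.
    val doubled = pairs.mapValues(_ * 2)

    // Wide dependency: reduceByKey needs the records for each key from
    // every input partition, i.e. an all-to-all shuffle.
    val sums = doubled.reduceByKey(_ + _)

    // Pre-partitioning both sides the same way turns the subsequent join
    // into a narrow dependency: co-partitioned data joins without a shuffle.
    val p = new HashPartitioner(4)
    val left  = sums.partitionBy(p)
    val right = sc.parallelize(Seq(("a", "x"), ("b", "y"))).partitionBy(p)
    val joined = left.join(right)

    joined.collect().foreach(println)
    sc.stop()
  }
}
```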

5.3 The system perspective

  The main question here is what the system's bottlenecks are; in large data clusters, the chief bottlenecks must be counted as communication and storage. Through data partitioning and data locality, RDDs give applications sufficient control to optimize their use of these bottleneck resources.

5.3.1 Bottleneck resources

The following set of figures helps in understanding the hardware characteristics of commercial clusters:

  • Each node has local memory bandwidth of about 50 GB/s and many hard disks; a Hadoop cluster node usually has 12 to 24, which typically means about 2 GB/s of aggregate disk bandwidth (assuming 20 disks at 100 MB/s each).
  • Each node has a 10 Gbps (1.3 GB/s) external link, which is about 40 times slower than its memory bandwidth and about 2 times slower than its aggregate disk bandwidth.
  • Nodes are organized into racks of 20 to 40, with 20-40 Gbps of bandwidth between racks, about 10 times slower than the network within a rack.

  The characteristics listed above mainly emphasize that network communication and data placement deserve attention in big-data computation. As already mentioned, the high-level RDD abstraction lets an application move computation rather than data at run time, just as MapReduce's map does, and RDDs define a "preferred locations" API for this. RDDs also give the user control over partitioning and over how datasets are co-partitioned. Data sharing in MapReduce always implies transmitting data across the network; by contrast, an RDD touches the network only when the application invokes an operation that crosses cluster nodes or checkpoints a dataset.
  From the system perspective, if most applications are bandwidth-bound, then efficiency within a node matters far less than control over network traffic. In a Spark application, if memory is large enough to hold the data, then memory bandwidth is the bottleneck; if it is not, the bottleneck becomes storage and network I/O, and the most important factor determining performance is data locality during task scheduling. This is easy to understand: a task runs best on the node where its data happens to be stored. As mentioned several times before, the only cost RDDs add is some uncertainty in network latency. Of course, Spark has heavily optimized its MapReduce-style engine, making this latency indicator very small, small enough to support streaming.
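
  The "preferred locations" hook can be seen from the public API: makeRDD accepts per-element location hints, and preferredLocations reads them back, so the scheduler can move the computation to the data rather than the reverse. A minimal sketch with hypothetical hostnames:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PreferredLocations {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("prefloc").setMaster("local[*]"))

    // makeRDD accepts location preferences (hostnames) for each element,
    // so each partition advertises where it would like to be computed.
    val blocks = sc.makeRDD(Seq(
      ("block-0", Seq("host1")), // hypothetical hostnames
      ("block-1", Seq("host2"))
    ))

    // The scheduler consults these hints when placing tasks.
    blocks.partitions.foreach { p =>
      println(s"partition ${p.index} prefers ${blocks.preferredLocations(p)}")
    }
    sc.stop()
  }
}
```
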

5.3.2 The cost of fault tolerance

  In MapReduce, if data were distributed to the reducers immediately as each map completed, that is, as soon as a map finishes the reduce side pulls its data over, then in the best case, by the time the last map finished processing, all of the pulling work would already be done.
  Spark's implementation instead has a barrier: the reduce tasks of a wide-dependency stage start only after all map tasks have completed. This avoids the complexity of fault-tolerant recovery that would arise if records were pushed in a pipelined manner from the mappers directly to the reducers.
However, Spark is not competitive in this particular respect. But the possible gain is modest, far less important than avoiding replication of intermediate state and placing map tasks near the nodes holding the data. Furthermore, if some other component dominates run time (for example, the mappers take a long time to read data from disk, or the shuffle is slower than the computation), the benefit of mappers transmitting output directly to reducers shrinks greatly.
  Although failures do not always happen, dividing a Spark job into fine-grained, independent tasks brings other important benefits in a cluster setting. One is straggler mitigation, since stragglers are rather common on clusters. Another is dynamic resource sharing among multiple tenants, so that each user gets better interactive performance. Spark was designed with fault tolerance and elasticity as the most important considerations, and for multi-tenant and large-cluster query workloads, these features make it more easily scalable.

5.4 Limitations and extensions

The limitations of Spark, and where it does not apply, have already been mentioned several times.

5.4.1 Latency

  Where does this latency come from? RDD operations are deterministic and synchronized across the cluster, so there is inherent delay at the start of each time step of computation, and computations with many time steps can therefore be relatively slow.
  So although the official Spark Streaming position is that its latency is low enough to suit many applications, those applications operate more on the time scale of human events, and RDD latency as low as 100 ms is far below that scale.
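
  A minimal Spark Streaming sketch with a 1-second batch interval: each batch is a small RDD job, so the roughly 100 ms scheduling overhead stays a minor fraction of the batch. The socket source and port are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatch {
  def main(args: Array[String]): Unit = {
    // local[2]: one core for the receiver, one for processing.
    val conf = new SparkConf().setAppName("microbatch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches

    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
    lines.flatMap(_.split(" "))
         .map((_, 1))
         .reduceByKey(_ + _)
         .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```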

5.4.2 All-to-all communication patterns

  RDDs, like MapReduce, can use point-to-point reduce communication anywhere in a cluster. Under normal circumstances this suffices, but real networks also offer other higher-level primitives, such as in-network broadcast and aggregation, which play a role in larger data-intensive applications. Implementing such primitives out of point-to-point reduce messages would be too costly, so RDDs have been extended directly to support them; Spark's broadcast operation is an example.
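
  A minimal sketch of Spark's broadcast primitive: a read-only lookup table is shipped to every node once, instead of being re-sent with each task over point-to-point connections. The table contents are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("bcast").setMaster("local[*]"))

    // Distributed once to all nodes via the broadcast mechanism.
    val countryNames = sc.broadcast(Map("CN" -> "China", "US" -> "United States"))

    val codes = sc.parallelize(Seq("CN", "US", "CN"))
    val named = codes.map(c => countryNames.value.getOrElse(c, "unknown"))

    named.collect().foreach(println)
    sc.stop()
  }
}
```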

5.4.3 Asynchrony

  The reduce side of an RDD all-node shuffle is synchronous, in order to guarantee deterministic operation. Because of this, if the amount of data per node or the load across nodes is uneven, the computation slows down. Developers may find something interesting in this issue: if RDDs also supported asynchronous computation with fault tolerance, nodes could send messages asynchronously, meaning the batch computation could continue even when some node is slow. For example, a node could log the IDs of the messages received in each iteration rather than the messages themselves, at much smaller cost; and in the sense that after a failure the state could be rebuilt on a different node, nothing would be lost, even though subsequent computation may already have used that state.
  For algorithms such as statistical optimization, computation can even continue with corrupted state rather than abandoning the whole schedule. Such algorithms can run on an RDD-based fault-tolerant system without any special recovery mode: it suffices to checkpoint the results that subsequent computation uses.

5.4.4 Fine-grained updates

  RDDs are not suitable for fine-grained update operations, since recording a lineage-graph entry for every such operation costs too much. For example, for a distributed key-value store, an RDD-based system is unsuitable. Even with batched update operations, Spark's latency is not low enough for a coarse-grained key-value store. If updates could be batched and executed deterministically as transactions, as in the Calvin distributed database, better performance and reliability at scale might be achieved.
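
  To see why fine-grained updates fit poorly, here is a sketch (toy data) in which "updating" a single key means deriving an entirely new RDD through a coarse-grained transformation:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PointUpdate {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kv").setMaster("local[*]"))

    val store = sc.parallelize(Seq(("k1", 1), ("k2", 2), ("k3", 3)))

    // The coarse-grained map touches every record and adds another node
    // to the lineage graph just to change one value; a key-value store's
    // per-key update rate would drown in this overhead.
    val updated = store.map { case (k, v) => if (k == "k2") (k, 99) else (k, v) }

    updated.collect().foreach(println)
    sc.stop()
  }
}
```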

5.4.5 Immutability and version tracking

  As mentioned earlier, immutability is used together with lineage-graph tracking so that state can be restored from old versions of a dataset. However, because of immutability, RDD-based systems need to copy more data, which raises costs. One could therefore use other mechanisms at run time to track such dependencies over mutable state, as RDDs do for immutable state; and, as also mentioned before, copies could be replaced by increments. If that were implemented, fine-grained updates and stream processing might become feasible.
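
  Both halves of this trade-off are visible in the API: transformations never mutate an RDD but return a new "version", and toDebugString prints the lineage graph that makes old versions recoverable. A tiny sketch with made-up data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ImmutableLineage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage").setMaster("local[*]"))

    val v1 = sc.parallelize(1 to 5)
    val v2 = v1.map(_ * 10)   // v1 is untouched; v2 is a new version
    val v3 = v2.filter(_ > 20)

    // The lineage graph recorded for v3, from which lost partitions
    // (or old versions) can be recomputed without copying the data.
    println(v3.toDebugString)
    println(v1.collect().mkString(",")) // the original version is intact
    sc.stop()
  }
}
```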

5.5 Summary

  This article has focused on the generality RDDs exhibit in terms of expressiveness and at the system level. Most parallel computing systems strive to define a new model close to the real machine, and to generalize MapReduce so as to support more efficient computation. Yet data sharing, latency, fault tolerance, and straggler handling always differ among them. In short, no system is perfect: every parallel computing system is a trade-off between cost and efficiency.

Origin: blog.csdn.net/weixin_43878293/article/details/91643408