Comparison of Spark and Hadoop

Spark is a distributed computing framework positioned as a counterpart to Hadoop's MapReduce. MapReduce is suited to offline batch processing, with processing delays on the order of minutes. Spark can do the same offline batch processing and can also do real-time processing (Spark Streaming).

  ① Spark combines batch processing, real-time stream processing, interactive queries, machine learning, and graph computation in a single framework.

  ② Spark implements a distributed memory abstraction called the Resilient Distributed Dataset (RDD). During execution, an RDD lets the user explicitly cache one or more working sets in memory so that subsequent queries can reuse them, which greatly improves query speed.
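The effect of explicitly caching a working set can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's API: `ToyDataset`, `expensive_load`, and `compute_calls` are hypothetical names made up for the illustration, and everything runs in one process rather than across a cluster.

```python
from typing import Callable, Iterable, List, Optional


class ToyDataset:
    """Hypothetical stand-in for an RDD: lazily computed, optionally cached."""

    def __init__(self, compute: Callable[[], Iterable[int]]) -> None:
        self._compute = compute                  # how to rebuild the data from scratch
        self._cached: Optional[List[int]] = None
        self._cache_enabled = False
        self.compute_calls = 0                   # how many times the data was rebuilt

    def cache(self) -> "ToyDataset":
        """Ask for the working set to be kept in memory after first use."""
        self._cache_enabled = True
        return self

    def _materialize(self) -> List[int]:
        if self._cached is not None:
            return self._cached                  # reuse the in-memory working set
        self.compute_calls += 1
        data = list(self._compute())
        if self._cache_enabled:
            self._cached = data
        return data

    def count(self) -> int:
        return len(self._materialize())

    def total(self) -> int:
        return sum(self._materialize())


def expensive_load() -> Iterable[int]:
    # Stands in for reading and parsing a large input.
    return (n * n for n in range(1000))


uncached = ToyDataset(expensive_load)
cached = ToyDataset(expensive_load).cache()

# Run the same two queries against each dataset.
for dataset in (uncached, cached):
    dataset.count()
    dataset.total()

print("rebuilds without cache:", uncached.compute_calls)
print("rebuilds with cache:", cached.compute_calls)
```

Without the cache, every query rebuilds the data; with it, the second query reuses what the first one already produced.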

A Hadoop job usually goes through the following steps:

  ① Read the input data from HDFS.

  ② In the Map stage, apply the user-defined mapper function; the results are then spilled to disk.

  ③ In the Reduce stage, read the intermediate results from each machine that ran the Map stage, apply the user-defined reduce function, and finally write the result back, usually to HDFS.
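The three steps above can be sketched as a single-machine word count. This is a toy model under loud assumptions, not Hadoop code: local temporary files stand in for HDFS and for the map-side spill, and `mapper`, `reducer`, `run_job`, and `demo` are hypothetical names chosen for the illustration.

```python
import json
import os
import tempfile
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """User-defined map function: emit (word, 1) for every word."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word: str, counts: List[int]) -> Tuple[str, int]:
    """User-defined reduce function: sum the counts for one word."""
    return word, sum(counts)


def run_job(input_path: str, work_dir: str, output_path: str) -> None:
    # Step 1: read the input (a local file stands in for HDFS).
    with open(input_path, encoding="utf-8") as f:
        lines = f.read().splitlines()

    # Step 2: run the mapper and spill its output to disk.
    spill_path = os.path.join(work_dir, "map_spill.jsonl")
    with open(spill_path, "w", encoding="utf-8") as spill:
        for line in lines:
            for key, value in mapper(line):
                spill.write(json.dumps([key, value]) + "\n")

    # Step 3: read the spilled intermediate results back, group by key,
    # run the reducer, and write the final result out.
    groups: Dict[str, List[int]] = defaultdict(list)
    with open(spill_path, encoding="utf-8") as spill:
        for record in spill:
            key, value = json.loads(record)
            groups[key].append(value)

    with open(output_path, "w", encoding="utf-8") as out:
        for key in sorted(groups):
            word, total = reducer(key, groups[key])
            out.write(f"{word}\t{total}\n")


def demo() -> Dict[str, int]:
    with tempfile.TemporaryDirectory() as work_dir:
        input_path = os.path.join(work_dir, "input.txt")
        output_path = os.path.join(work_dir, "output.tsv")
        with open(input_path, "w", encoding="utf-8") as f:
            f.write("spark hadoop spark\nhadoop hdfs\n")
        run_job(input_path, work_dir, output_path)
        with open(output_path, encoding="utf-8") as f:
            rows = (row.split("\t") for row in f.read().splitlines())
            return {word: int(count) for word, count in rows}


result = demo()
print(result)
```

Even in this small sketch the data touches disk three times (input, spill, output), which is the pattern the following paragraph criticizes.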

The problem with Hadoop is that a single job performs multiple disk reads and writes, either to the machine's local disk or to a distributed file system (the latter involves both disk I/O and network transfer). Reading from disk is several orders of magnitude slower than reading from memory, so an architecture that depends this heavily on disk reads and writes has a performance bottleneck. In addition, some workloads, such as iterative algorithms (logistic regression, for example), reuse the result of an earlier job; each reuse triggers a recomputation and brings a lot of disk I/O.

Spark does not rely on disk reads and writes the way Hadoop does. It switches to memory, which has much higher performance, to hold the input data and the intermediate results of processing. In big-data scenarios many computations are iterative, so Spark allows output to be buffered in memory, where the result of one job is immediately available to the next. Its performance is therefore naturally better than Hadoop MapReduce.
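The difference for iterative workloads can be made concrete by counting disk reads. The sketch below is a toy model in plain Python, not Spark or Hadoop code: `DiskStore`, `step`, `iterate_from_disk`, and `iterate_in_memory` are hypothetical names, and a simple gradient-descent estimate of the mean stands in for an algorithm such as logistic regression.

```python
import os
import tempfile
from typing import List


class DiskStore:
    """Counts how often the input file is read from disk."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.reads = 0

    def load(self) -> List[float]:
        self.reads += 1
        with open(self.path, encoding="utf-8") as f:
            return [float(row) for row in f]


def step(estimate: float, data: List[float], rate: float = 0.5) -> float:
    """One gradient-descent step toward the mean of the data."""
    gradient = sum(estimate - x for x in data) / len(data)
    return estimate - rate * gradient


def iterate_from_disk(store: DiskStore, iterations: int) -> float:
    """MapReduce-style: every iteration reloads the input from disk."""
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(estimate, store.load())
    return estimate


def iterate_in_memory(store: DiskStore, iterations: int) -> float:
    """Spark-style: load once, keep the working set in memory, reuse it."""
    data = store.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(estimate, data)
    return estimate


with tempfile.TemporaryDirectory() as work_dir:
    path = os.path.join(work_dir, "points.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("1.0\n2.0\n3.0\n4.0\n")

    disk_store = DiskStore(path)
    memory_store = DiskStore(path)
    disk_result = iterate_from_disk(disk_store, iterations=10)
    memory_result = iterate_in_memory(memory_store, iterations=10)

print("disk reads, reload each iteration:", disk_store.reads)
print("disk reads, cached in memory:", memory_store.reads)
print("same answer:", disk_result == memory_result)
```

Both variants compute the same estimate; the only difference is how many times the input is fetched from disk, which grows with the number of iterations in the first variant and stays at one in the second.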

Origin www.cnblogs.com/xuange1/p/12222742.html