What Is Spark? The Difference Between Hadoop and Spark

Spark is a general-purpose, in-memory parallel computing framework developed at the AMP (Algorithms, Machines, People) Lab at the University of California, Berkeley.

Spark entered the Apache Incubator in June 2013 and became a top-level Apache project eight months later.

Thanks to its advanced design, Spark quickly became a popular community project. Components such as Spark SQL, Spark Streaming, MLlib, and GraphX were built around the Spark core, gradually forming a one-stop platform for big data processing.

Hadoop and Spark

Hadoop has become the de facto standard in big data technology, and Hadoop MapReduce is well suited to batch processing of large-scale data sets, but it has some flaws of its own. In particular, MapReduce's latency is too high for real-time, fast computation, and it is inefficient for workloads that require multiple passes over the data or iterative algorithms.

By examining the Hadoop MapReduce workflow, we can identify some of its shortcomings.

1) Hadoop MapReduce has limited expressive power.

Every computation must be converted into the two operations Map and Reduce, which does not fit all scenarios and makes complex data processing hard to express.

2) Disk I/O overhead is large.

Hadoop MapReduce must serialize intermediate data to disk between steps, so its I/O costs are high. This imposes a large overhead on interactive analysis and iterative algorithms, and almost all optimization and machine learning algorithms are iterative. Hadoop MapReduce is therefore poorly suited to interactive analysis and machine learning.

3) Computation latency is high.

To accomplish more complex work, a series of MapReduce jobs must be chained together and executed in sequence. Each job has high latency, and a job can start only after the previous one has finished. As a result, Hadoop MapReduce cannot handle complex, multi-stage computation well.

Spark evolved from Hadoop MapReduce technology: it inherits the advantages of distributed parallel computing while remedying MapReduce's obvious flaws.

Spark is implemented in Scala, an object-oriented, functional programming language, and it lets you operate on distributed data sets as easily as on local collections. Spark is fast, easy to use, general-purpose, and runs anywhere; its specific advantages are as follows.

1) Spark provides in-memory computing. Intermediate results are kept in memory, which makes iterative computation more efficient. By supporting parallel computation over directed acyclic graphs (DAGs) in its programming framework, Spark reduces the need to write data to disk between iterations, improving processing efficiency.
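To make this concrete, here is a minimal Scala sketch of iterative computation over a cached RDD. Everything in it (the local master, the generated data, the step size) is invented for illustration; it simply shows that once the data set is cached, each pass reads from memory rather than from disk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeDemo {
  def main(args: Array[String]): Unit = {
    // local[*] is used only so this sketch runs on a single machine.
    val sc = new SparkContext(
      new SparkConf().setAppName("IterativeDemo").setMaster("local[*]"))

    // cache() keeps the RDD in memory after the first action,
    // so the ten passes below do not re-read the source data.
    val data = sc.parallelize((1 to 10000).map(_.toDouble)).cache()
    val n = data.count()

    // Simple gradient descent toward the mean of the data:
    // the gradient of 1/2 * sum((w - x)^2) with respect to w is sum(w - x).
    var w = 0.0
    for (_ <- 1 to 10) {
      val grad = data.map(x => w - x).sum()
      w -= 0.5 * grad / n   // each pass scans the cached, in-memory data
    }
    println(s"w after 10 in-memory passes: $w") // approaches the mean, 5000.5
    sc.stop()
  }
}
```

In Hadoop MapReduce, each of those ten passes would be a separate job re-reading its input from HDFS; here the disk is touched only once.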

2) Spark provides a comprehensive, unified framework to manage big data processing needs across data sets of different natures (text data, graph data, etc.) and data sources (batch data or real-time streaming data).

Spark extends the MapReduce programming paradigm to support more types of computation models, and it can cover a wide range of workflows that previously had to be implemented as separate, specialized systems on top of Hadoop.

Spark uses in-memory caching to improve performance, which makes it fast enough for interactive analysis; caching also speeds up iterative algorithms, which makes Spark ideally suited to data-intensive tasks, especially machine learning.

3) Spark is more general-purpose than Hadoop. Hadoop offers only the two operations Map and Reduce, whereas Spark provides a richer set of operations on data sets and can therefore support more types of applications.

Spark's computation model still belongs to the MapReduce family, but it is not limited to Map and Reduce: it also provides a variety of transformations, including Map, Filter, FlatMap, Sample, GroupByKey, ReduceByKey, Union, Join, Cogroup, MapValues, Sort, and PartitionBy, as well as actions such as Count, Collect, Reduce, Lookup, and Save.
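As a small illustration (the input lines are made up), the sketch below chains several of these transformations and then triggers them with actions; transformations are lazy and only describe the execution plan, while actions actually run it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddOpsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("RddOpsDemo").setMaster("local[*]"))

    val lines = sc.parallelize(
      Seq("spark is fast", "hadoop is stable", "spark is general"))

    // Transformations are lazy: they only build the DAG.
    val counts = lines
      .flatMap(_.split(" "))   // FlatMap: split each line into words
      .filter(_.nonEmpty)      // Filter: drop empty tokens
      .map(word => (word, 1))  // Map: pair each word with a count of 1
      .reduceByKey(_ + _)      // ReduceByKey: sum the counts per word

    // Actions trigger the actual computation.
    counts.collect().foreach(println)             // Collect: results to the driver
    println(s"Distinct words: ${counts.count()}") // Count: number of records
    sc.stop()
  }
}
```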

4) Spark's DAG-based task scheduling mechanism is superior to the iterative execution mechanism of Hadoop MapReduce.

The communication model between Spark's processing nodes is no longer restricted to Hadoop's single Shuffle pattern: developers can use DAGs to build complex multi-step data pipelines and control the storage and partitioning of intermediate results.
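For example (with invented data and an arbitrarily chosen hash partitioner), a developer can co-partition two data sets and pick the storage level of an intermediate result explicitly, neither of which the fixed map-shuffle-reduce pattern of Hadoop MapReduce exposes:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PipelineDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("PipelineDemo").setMaster("local[*]"))

    val events   = sc.parallelize(Seq(("user1", 3), ("user2", 5), ("user1", 2)))
    val profiles = sc.parallelize(Seq(("user1", "CN"), ("user2", "US")))

    // Co-partition both RDDs so the join below needs no extra shuffle.
    val partitioner = new HashPartitioner(4)
    val totals = events
      .reduceByKey(partitioner, _ + _)       // step 1: aggregate per user
      .persist(StorageLevel.MEMORY_AND_DISK) // explicit intermediate storage choice

    val joined = totals.join(profiles.partitionBy(partitioner)) // step 2: enrich

    joined.collect().foreach(println) // e.g. (user1,(5,CN))
    sc.stop()
  }
}
```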

Figure 1 compares the execution processes of Hadoop and Spark.

Figure 1: Comparison of the Hadoop and Spark execution processes

As the figure shows, Hadoop is ill-suited to iterative computation: each iteration must read its data from disk and write intermediate results back to disk, and every task reads its input from disk and writes its processed output to disk, so the disk I/O overhead is large. Spark, by contrast, loads the data into memory, and later iterations can use intermediate results held in memory, avoiding frequent reads from disk.

The same is true for multidimensional ad hoc queries. When making hundreds or thousands of queries over several dimensions of data stored on HDFS, Hadoop must read the data from disk for every query, whereas Spark needs to read from disk only once and can keep the intermediate results in memory for the repeated queries.
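A sketch of this cache-once, query-many pattern (the HDFS path and column layout are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachedQueryDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("CachedQueryDemo").setMaster("local[*]"))

    // Read and parse once; cache() pins the parsed records in memory.
    // The path and column positions are invented for this sketch.
    val records = sc.textFile("hdfs:///data/sales.csv")
      .map(_.split(","))
      .cache()

    // Each query scans the in-memory copy instead of re-reading HDFS.
    println("east region rows: " + records.filter(f => f(1) == "east").count())
    println("2018 rows: " + records.filter(f => f(2) == "2018").count())
    sc.stop()
  }
}
```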

In 2014, Spark broke the record that Hadoop had held in the Sort Benchmark: using 206 nodes, Spark finished sorting 100 TB of data in 23 minutes, whereas Hadoop had used 2,000 nodes and taken 72 minutes to sort the same data. In other words, with only about one tenth of the computing resources, Spark was about three times faster than Hadoop.

Although Spark has great advantages over Hadoop, it cannot completely replace Hadoop.

Because Spark processes data in memory, it is not well suited to scenarios where the data volume is particularly large and real-time requirements are low. In addition, Hadoop clusters can be built from inexpensive commodity servers, while Spark's hardware requirements are relatively high, especially for memory and CPU.

Spark Application Scenarios

In summary, big data processing scenarios fall into the following types.

1) Complex batch processing

The focus is on the ability to process huge volumes of data; processing speed is less critical, and typical turnaround times range from tens of minutes to several hours.

2) Interactive queries over historical data

Typical times range from tens of seconds to tens of minutes.

3) Data processing based on real-time data streams

Typical times range from a few hundred milliseconds to several seconds.

Mature processing frameworks already exist for each of these three scenarios:

  • Hadoop MapReduce is used for batch processing of massive data.
  • Impala is used for interactive queries.
  • The distributed stream processing framework Storm is used for processing real-time streaming data.

These three frameworks are relatively independent of one another, so the cost of maintaining them all is high, whereas Spark alone can meet all of these needs in one stop.

From the above analysis, the scenarios Spark is suited to can be summarized as follows.

1) Spark is a memory-based iterative computing framework, suited to applications that need to operate on a particular data set many times. The more often the operations are repeated and the more data must be read, the greater the benefit; when the data volume is small but the computation is intensive, the benefit is relatively small.

2) Spark suits scenarios where the data volume is not especially large but real-time statistical analysis is required.

3) Because of the nature of RDDs, Spark is not suitable for applications that perform asynchronous, fine-grained updates of state, such as the storage layer of a web service or an incremental web crawler and indexer; that is, it is not suited to the incremental-modification application model.



Source: blog.csdn.net/yuidsd/article/details/92170801