Can Spark replace Hadoop in big data?

The idea of using Spark to replace Hadoop has been around for a long time, and it stems from the differences between the two. First, both are open source, which allows them to be applied at scale in big data analysis and extended in many directions. Second, Spark is built on Scala and provides a high-performance computing framework; unlike Hadoop, it is not tied to HDFS, and it outperforms Hadoop when computing over and mining massive data sets. Spark also performs better in the increasingly popular field of machine learning. Together, these factors have made Spark more and more popular with users.
But this does not mean that Hadoop no longer has advantages. Hadoop has a strong ecosystem: as a distributed system architecture, it is suited to low-cost, large-scale data analysis environments and can handle massive data storage and computation. Although Spark improves on many of MapReduce's algorithms, in practice it acts more as a complement to Hadoop.
To understand the relationship between the two in depth, you first need a detailed understanding of Hadoop:

What problems can Hadoop solve?
Hadoop solves the reliable storage and processing of big data (data so large that a single computer cannot store it, or cannot process it within the required time).
HDFS provides highly reliable file storage on clusters of ordinary PCs, coping with server or hard-disk failures by keeping multiple replicas of each block.
MapReduce provides a programming model based on the simple abstractions of Mapper and Reducer, with which large data sets can be processed concurrently and in a distributed fashion on an unreliable cluster of dozens or hundreds of PCs, while details such as the distributed computing model (for example, machine-to-machine communication) and failure recovery stay hidden. The Mapper and Reducer abstractions are the basic elements into which all kinds of complex data processing can be decomposed: a complex computation is broken down into a directed acyclic graph (DAG) of multiple Jobs (each containing a Mapper and a Reducer), and each Mapper and Reducer is then executed on the Hadoop cluster to obtain the result.
In MapReduce, Shuffle is a very important process. Because Shuffle is handled invisibly, developers writing data-processing code on MapReduce can remain almost entirely unaware of distribution and concurrency.
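To make the Mapper, Reducer, and Shuffle roles concrete, here is a conceptual word count written with plain Scala collections. This only models the semantics of the three phases; a real Hadoop job implements Mapper and Reducer classes against the Hadoop API and runs distributed across the cluster.

```scala
// Conceptual word count expressed as the three MapReduce phases.
// This models the semantics only; it is not the Hadoop API.
object ConceptualMapReduce {
  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "to do or not to do")

    // Map phase: each input line is turned into (word, 1) pairs.
    val mapped: Seq[(String, Int)] =
      lines.flatMap(line => line.split(" ").map(word => (word, 1)))

    // Shuffle phase: pairs are grouped by key (hidden from the developer in Hadoop).
    val shuffled: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (key, pairs) => (key, pairs.map(_._2)) }

    // Reduce phase: each group of values is aggregated into one result.
    val reduced: Map[String, Int] =
      shuffled.map { case (word, counts) => (word, counts.sum) }

    reduced.foreach(println)  // e.g. (to,4), (be,2), ...
  }
}
```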

So, what are the limitations of Hadoop?
However, MapReduce has the following limitations that make it difficult to use.
1. The abstraction level is low; even simple tasks require hand-written code, which makes it hard to use;
2. Only two operations, Map and Reduce, are provided, which limits expressiveness;
3. A Job has only two phases (Map and Reduce); complex computations therefore require a large number of Jobs, and the dependencies between them must be managed by the developers themselves;
4. The processing logic is buried in code details, with no overall view of the data flow;
5. Intermediate results are also written to the HDFS file system;
6. A ReduceTask can only start after all MapTasks have completed, so latency is high; MapReduce is suitable only for batch processing and offers weak support for interactive and real-time data processing;
7. Performance for iterative data processing is poor (see the sketch after this list).
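Limitations 5 and 7 are easiest to see in iterative workloads, where every MapReduce pass writes its intermediate results back to HDFS and re-reads them in the next pass. As a contrast, and a preview of the improvements discussed below, here is a minimal Spark sketch in Scala; the SparkContext `sc`, the input path, and the toy computation are assumptions for illustration:

```scala
import org.apache.spark.SparkContext

// A toy iterative computation: the parsed data is cached in memory once and
// reused across ten passes, instead of being written back to HDFS and
// re-read after every pass, as a chain of MapReduce jobs would require.
def iterativeSketch(sc: SparkContext): Double = {
  val numbers = sc.textFile("hdfs:///data/numbers.txt")  // hypothetical path
    .map(_.toDouble)
    .cache()                                             // keep the parsed data in memory

  var factor = 1.0
  for (_ <- 1 to 10) {                                   // ten passes over the same cached RDD
    factor = numbers.map(_ * factor).sum() / numbers.count()
  }
  factor
}
```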

Spark addresses these limitations with a number of improvements:
In terms of performance, Spark runs fast. It can also handle batch workloads, but it is especially good at streaming workloads, interactive queries, and machine learning.
Compared with MapReduce's disk-based batch processing engine, Spark is known for its real-time data processing capabilities. Spark is compatible with Hadoop and its modules; in fact, Spark is listed as a module on the Hadoop project page. Spark has its own page because, while it can run in a Hadoop cluster via YARN (Yet Another Resource Negotiator), it also has a standalone mode: it can run as a Hadoop module or as an independent solution. The main difference between MapReduce and Spark is that MapReduce relies on persistent storage, whereas Spark uses Resilient Distributed Datasets (RDDs).
The reason Spark is so fast is that it processes everything in memory. That said, it can also use disk for data that does not all fit into memory.
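A minimal sketch of that behavior, assuming an existing SparkContext named `sc` and a placeholder HDFS path; `StorageLevel.MEMORY_AND_DISK` tells Spark to keep partitions in memory and spill only those that do not fit to local disk:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Keep the data in memory where possible, spilling overflow partitions to disk.
// The path and the surrounding context are illustrative assumptions.
def persistedWords(sc: SparkContext): Unit = {
  val words = sc.textFile("hdfs:///logs/access.log")
    .flatMap(_.split("\\s+"))
    .persist(StorageLevel.MEMORY_AND_DISK)   // memory first, disk as overflow

  // Both actions reuse the persisted data instead of re-reading it from HDFS.
  println(s"total words:  ${words.count()}")
  println(s"longest word: ${words.map(_.length).max()}")
}
```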
Spark's in-memory processing enables near-real-time analytics on data from many sources: marketing campaigns, machine learning, IoT sensors, log monitoring, security analytics, and social media sites. MapReduce, by contrast, uses batch processing and was never designed for blazing speed; its original goal was to continuously collect information from websites, where the data does not need to be real-time or near real-time.

In terms of ease of use, Spark supports Scala (its native language), Java, Python, and Spark SQL. Spark SQL is very similar to SQL-92, so there is almost no learning curve to get started. Spark also has an interactive mode, so both developers and users can get immediate feedback on queries and other operations. MapReduce has no interactive mode, although add-on modules such as Hive and Pig make it easier for adopters to work with.
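For instance, anyone who knows SQL can query data in Spark right away. A minimal sketch, assuming a local Spark installation and a hypothetical newline-delimited JSON file of user records:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSqlExample")
      .master("local[*]")          // standalone local mode; could also run on YARN
      .getOrCreate()

    // Hypothetical input: one JSON object per line, e.g. {"name":"ann","age":30}
    val users = spark.read.json("users.json")
    users.createOrReplaceTempView("users")

    // Plain SQL instead of hand-written MapReduce code.
    spark.sql("SELECT name, age FROM users WHERE age >= 18 ORDER BY age DESC").show()

    spark.stop()
  }
}
```

The same statements can be typed into the interactive `spark-shell` for immediate feedback.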

In terms of fault tolerance, MapReduce and Spark approach the problem from two different directions. MapReduce uses TaskTracker nodes that send heartbeats to the JobTracker node. If a heartbeat is missed, the JobTracker reschedules all pending and in-progress operations to another TaskTracker. This approach is effective at providing fault tolerance, but it can significantly increase the completion time of operations even when only a single failure occurs.
Spark uses Resilient Distributed Datasets (RDDs), fault-tolerant collections of elements that can be operated on in parallel. RDDs can reference data sets in external storage systems such as shared file systems, HDFS, HBase, or any data source that offers a Hadoop InputFormat. Spark can create RDDs from any storage source supported by Hadoop, including the local file system or any of the file systems listed above.
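A short sketch of those creation paths, assuming an existing SparkContext named `sc`; all paths are placeholders:

```scala
import org.apache.spark.SparkContext

// RDDs can be created from any Hadoop-supported storage source.
// All paths below are illustrative placeholders.
def buildRdds(sc: SparkContext): Unit = {
  // From HDFS (or any Hadoop-compatible file system).
  val fromHdfs  = sc.textFile("hdfs:///data/events.txt")

  // From the local file system.
  val fromLocal = sc.textFile("file:///tmp/events.txt")

  // From an in-memory collection, distributed across the cluster.
  val fromSeq   = sc.parallelize(Seq(1, 2, 3, 4, 5))

  // Each RDD supports the same parallel operations and can be recomputed from
  // its lineage if a partition is lost, which is how RDDs provide fault tolerance.
  println(fromHdfs.count() + fromLocal.count() + fromSeq.count())
}
```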

Therefore, based on the discussion above, the author believes that Spark complements Hadoop well, and to some extent the two can work side by side: Hadoop provides the distributed file system, while Spark handles efficient data computation, together forming an ideal big data processing platform.
 
