Similarities and differences between Hadoop and Spark

First, both Hadoop and Apache Spark are big data frameworks, but they exist for different purposes. Hadoop is essentially a distributed data infrastructure: it spreads huge data sets across the nodes of a cluster of commodity machines for storage, which means you don't need to buy and maintain expensive server hardware.

At the same time, Hadoop also indexes and keeps track of that data, which makes big data processing and analysis far more efficient. Spark, by contrast, is a tool for processing data held in distributed storage; it does not provide distributed storage itself.

The two can be combined

Beyond the distributed storage everyone associates with HDFS, Hadoop also provides a data processing component called MapReduce. So we can skip Spark entirely and use Hadoop's own MapReduce to do the processing.

Conversely, Spark does not depend on Hadoop to survive. But as noted above, it provides no file management system of its own, so it must be paired with a distributed file system: Hadoop's HDFS, or some other cloud-based data platform. Still, Spark is most often run on top of Hadoop, since the two are widely regarded as the best combination.
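As a minimal illustration of that pairing, here is a PySpark sketch that reads a file stored on HDFS. The host name, port, and path are placeholders for your own cluster.

    from pyspark import SparkConf, SparkContext

    # A minimal sketch: Spark does the processing, HDFS does the storage.
    # "namenode", the port, and the path below are placeholders.
    conf = SparkConf().setAppName("HdfsExample")
    sc = SparkContext(conf=conf)

    data = sc.textFile("hdfs://namenode:9000/datasets/input.txt")
    print(data.count())  # a simple action that pulls the data through Spark
    sc.stop()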

The following, excerpted from the Internet, is the most concise and clear explanation of MapReduce:

  We're going to count all the books in the library. You count shelf 1, I count shelf 2. This is "Map". The more of us there are, the faster the books get counted.

  Now we get together and add up everyone's tallies. This is "Reduce".
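To make the analogy concrete, here is a word-count sketch written as a pair of scripts for Hadoop Streaming, which lets MapReduce run ordinary executables as the mapper and reducer; the file names are illustrative.

    #!/usr/bin/env python3
    # mapper.py: the "Map" step. Each worker counts its own "shelf",
    # emitting a (word, 1) pair for every word it sees on stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

And the matching reducer:

    #!/usr/bin/env python3
    # reducer.py: the "Reduce" step. Hadoop sorts the mapper output by
    # key, so all counts for one word arrive together and can be summed.
    import sys

    current_word, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{total}")
            current_word, total = word, 0
        total += int(count)
    if current_word is not None:
        print(f"{current_word}\t{total}")

The more mapper tasks run in parallel, the faster the "counting" goes, exactly as in the library analogy.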

Spark's data processing speed leaves MapReduce far behind

Spark is much faster than MapReduce because it processes data differently. MapReduce works in discrete steps: "Read data from the cluster, perform an operation, write the result back to the cluster, read the updated data from the cluster, perform the next operation, write the result back to the cluster, and so on," explains Kirk Borne, a data scientist at Booz Allen Hamilton.

In contrast, Spark does the entire analysis in memory in near real time: "Read the data from the cluster, perform all the necessary analytical operations, write the results back to the cluster, done," Borne says. Spark's batch processing is roughly 10 times faster than MapReduce's, and its in-memory analytics can be nearly 100 times faster.
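The difference is easy to see in code. In the hedged PySpark sketch below (paths are hypothetical), three chained steps run as one in-memory pipeline with no intermediate results written back to the cluster, whereas the same work as a chain of MapReduce jobs would stage each step's output on disk.

    from pyspark import SparkContext

    sc = SparkContext(appName="PipelineSketch")

    # Three logical steps, one in-memory pipeline: nothing is written to
    # the cluster until the final save. Paths are placeholders.
    events = sc.textFile("hdfs:///logs/events")
    errors = events.filter(lambda line: "ERROR" in line)
    counts = (errors
              .map(lambda line: (line.split()[0], 1))  # key on the first field
              .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("hdfs:///logs/error_counts")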

If the data to be processed and the required results are mostly static, and you can afford to wait for a batch job to finish, MapReduce's approach is perfectly acceptable.

But if you need to analyze streaming data, such as readings collected from sensors on a factory floor, or if your application requires multiple passes over the same data, then Spark is probably the better choice; a streaming sketch follows below.
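Here is a small example using Spark's DStream API (newer applications would typically use Structured Streaming instead); the host, port, and the "OVERHEAT" marker are invented for the example.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="SensorStreamSketch")
    ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

    # A hypothetical text feed of sensor readings, one per line.
    readings = ssc.socketTextStream("sensor-gateway", 9999)
    alerts = readings.filter(lambda line: "OVERHEAT" in line)
    alerts.pprint()  # print a sample of each batch's alerts

    ssc.start()
    ssc.awaitTermination()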

Most machine learning algorithms require multiple passes over the data. Spark is also commonly used in scenarios such as real-time marketing campaigns, online product recommendation, network security analytics, and machine log monitoring.
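Those multiple passes are where Spark's in-memory model pays off most. In this sketch (the dataset and the update rule are invented purely for illustration), the data is cached once and reused across iterations, where MapReduce would re-read the input from disk on every pass.

    from pyspark import SparkContext

    sc = SparkContext(appName="IterativeSketch")

    # Hypothetical input: one numeric value per line. cache() keeps the
    # parsed data in memory across all the passes below.
    points = sc.textFile("hdfs:///data/values.txt").map(float).cache()

    estimate = 0.0
    for _ in range(10):
        # Nudge the estimate toward the mean, a stand-in for the
        # gradient step an ML algorithm would take on each iteration.
        error = points.map(lambda x: x - estimate).mean()
        estimate += 0.5 * error

    print(estimate)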

Disaster recovery

The two take very different approaches to disaster recovery, but both work well. Because Hadoop writes data to disk after each step of processing, it is inherently resilient to system failures.

Spark stores its data objects across the cluster in what it calls Resilient Distributed Datasets (RDDs). "These data objects can be held in memory or on disk, so RDDs can also provide full disaster recovery," Borne notes.
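In PySpark this looks like the sketch below (the path is hypothetical): persist() chooses where partitions live, and if a node is lost, Spark rebuilds the missing partitions from the RDD's lineage, the recorded chain of transformations that produced it.

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="RddRecoverySketch")

    # A hypothetical transformation chain; the path is a placeholder.
    rdd = (sc.textFile("hdfs:///data/events")
             .map(lambda line: line.upper()))

    # Keep partitions in memory, spilling to disk when memory runs short.
    # If an executor dies, lost partitions are recomputed from lineage.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    print(rdd.count())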
