Spark vs. Hadoop: Strengths and Weaknesses

Disclaimer: This is an original article by the blogger, licensed under the CC 4.0 BY-SA copyright agreement. If you reproduce it, please attach a link to the original source and this statement.

Original link: https://blog.csdn.net/wwdede/article/details/100046027

Spark has replaced Hadoop as the most active open source big data project. When choosing a big data framework, however, companies should not favor one over the other for that reason alone. Recently, the well-known big data expert Bernard Marr analyzed the similarities and differences between Hadoop and Spark in an article (http://www.forbes.com/sites/bernardmarr/2015/06/22/spark-or-hadoop-which-is-the-best-big-data-framework/).

Hadoop and Spark are both big data frameworks that provide tools for performing common big data tasks. Strictly speaking, though, the tasks they perform are not the same, and the two are not mutually exclusive. Although Spark is said to be up to 100 times faster than Hadoop in certain circumstances, it does not have a distributed storage system of its own. Distributed storage is the foundation of many of today's big data projects. It allows petabyte-scale data sets to be stored across a virtually unlimited number of ordinary computer hard disks, and it scales well: as the data set grows, you only need to add more disks. Spark therefore needs a third-party distributed storage system. For this reason, many big data projects install Spark on top of Hadoop, so that Spark's advanced analytics applications can use the data stored in HDFS.

　　Compared with Hadoop, Spark's real advantage is speed. Most Spark operations run in memory, whereas Hadoop's MapReduce system writes all data back to the physical storage medium after each operation. This is done to ensure full recovery if something goes wrong, but Spark's resilient distributed datasets can provide this kind of recovery as well.
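The difference between the two execution styles can be sketched in a few lines of plain Python. This is only a toy model, not the Spark or Hadoop API: the function names `run_disk_backed` and `run_in_memory`, the JSON stage files, and the three arithmetic steps are all assumptions made for illustration. It shows the same pipeline run once with every intermediate result written to disk and read back, and once with the data kept in memory between steps.

```python
import json
import os
import tempfile


def run_disk_backed(records, steps, workdir):
    """MapReduce-style toy: each step's output is written to disk and read back."""
    path = os.path.join(workdir, "stage_0.json")
    with open(path, "w") as f:
        json.dump(records, f)
    for i, step in enumerate(steps, start=1):
        # Read the previous stage from disk before applying the next step.
        with open(path) as f:
            data = json.load(f)
        data = [step(x) for x in data]
        # Persist this stage so it could be recovered after a failure.
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(data, f)
    with open(path) as f:
        return json.load(f)


def run_in_memory(records, steps):
    """Spark-style toy: intermediate results stay in memory between steps."""
    data = list(records)
    for step in steps:
        data = [step(x) for x in data]
    return data


# Hypothetical three-step pipeline over a tiny data set.
steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
records = list(range(5))

with tempfile.TemporaryDirectory() as workdir:
    disk_result = run_disk_backed(records, steps, workdir)
    stage_files = sorted(os.listdir(workdir))

memory_result = run_in_memory(records, steps)

print(disk_result == memory_result)  # both styles compute the same answer
print(stage_files)                   # one file per stage in the disk-backed run
```

Both runs produce the same result. The disk-backed run pays for a write and a read at every stage, which is the cost the paragraph above describes, in exchange for having every intermediate stage available for recovery.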

　　In addition, Spark is better than Hadoop at advanced data processing, such as real-time stream processing and machine learning. In Bernard's view, this, together with its speed advantage, is the real reason Spark is becoming more popular. Real-time processing means that data can be fed into an analytical application the moment it is captured, with feedback returned immediately. This kind of processing is used in more and more big data applications, for example recommendation engines for retailers and performance monitoring of industrial machinery in manufacturing. The Spark platform's speed and its ability to process streaming data also make it well suited to machine learning algorithms. Such algorithms learn and improve on their own until they find an ideal solution to a problem. This technology is at the core of the most advanced manufacturing systems (for example, identifying when parts are damaged) and of driverless cars. Spark has its own machine learning library, MLlib, whereas a Hadoop system needs a third-party machine learning library such as Apache Mahout.
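To make the idea of an algorithm that keeps improving until it finds a good solution more concrete, here is a minimal sketch in plain Python. It does not use MLlib or Mahout; the data points, the learning rate, and the `fit_slope` function are assumptions chosen for illustration. It fits a single weight by gradient descent, making repeated passes over the same small data set held in memory, which is the access pattern that makes fast in-memory processing attractive for machine learning.

```python
def fit_slope(points, learning_rate=0.05, tolerance=1e-9, max_iterations=1000):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        # Each iteration is one full pass over the same in-memory data.
        gradient = sum(2 * (w * x - y) * x for x, y in points) / len(points)
        if abs(gradient) < tolerance:
            break
        w -= learning_rate * gradient
    return w, iterations


# Hypothetical training data lying exactly on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, iterations = fit_slope(points)
print(round(w, 6), iterations)
```

Every iteration rereads the full data set, so an engine that had to reload it from disk on each pass would spend most of its time on input and output rather than on learning.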

　　In fact, although Hadoop and Spark overlap somewhat in functionality, neither is a commercial product, so there is no real competition between them. Companies that make money by providing technical support for such free systems often offer both services. For example, Cloudera provides both Spark services and Hadoop services, and recommends whichever best fits the customer's needs.

　　Bernard believes that although Spark has developed rapidly, it is still in its infancy, and its security and technical support infrastructure are still underdeveloped. In his view, the rising activity around Spark in the open source community indicates that enterprise users are looking for innovative ways to use their stored data.
