Big Data: Is Spark Better Than the Hadoop Framework?

For anyone entering the world of big data, "big data" and "Hadoop" have become almost synonymous. As people study the ecosystem, its tools, and how large-scale data processing actually works, they come to a better understanding of what big data really means and the role Hadoop plays in that ecosystem.

Wikipedia explains big data this way: big data is a broad term for data sets so large and complex that traditional data processing applications cannot handle them.

In short, as the amount of data grows, conventional approaches become time-consuming and expensive.

Inspired by Google's MapReduce and GFS white papers, Doug Cutting founded Hadoop in 2005. Hadoop is an open-source software framework for distributed storage and distributed processing of large data sets. In other words, it was designed to reduce the time and cost of processing big data.

Hadoop, with its distributed file system (HDFS) and its distributed processing module (MapReduce), became the de facto standard for big data computing. The term Hadoop refers not only to these base modules but also to the broader ecosystem of software packages built to work with it.

Over time, as the volume of generated data surged, so did the demand for processing it. Big data computing now has to satisfy a wide variety of needs, and not all of them can be met by Hadoop alone.

Most data analysis is iterative in nature. Iterative processing can be done with MapReduce, but the data must be re-read from storage at every iteration. Normally that is no problem; when each pass reads 100 GB or several TB, however, it becomes slow, and people grow impatient.
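To make that cost concrete, here is a minimal PySpark sketch (the HDFS path is a placeholder): nothing is cached, so every action re-scans the input from storage, which is exactly what each job in an iterative MapReduce chain does by design.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-naive").getOrCreate()
sc = spark.sparkContext

# One number per line; the path is hypothetical.
points = sc.textFile("hdfs:///data/points.txt").map(float)

for i in range(10):
    # Each reduce() is a separate action; with nothing cached, the
    # input file is read and parsed again on every iteration.
    total = points.map(lambda x: x * (i + 1)).reduce(lambda a, b: a + b)
    print(i, total)
```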

Many people believe that data analysis is an art rather than a science. In any art form, the artist creates one small piece of a puzzle, fits it into the larger puzzle, and watches the whole thing grow. Translated to our field: a data analyst wants to see the result of one step before deciding how to proceed to the next. In other words, much of data analytics is interactive in nature. Traditionally, shaped by Structured Query Language (SQL), analysts have been accustomed to interactive analysis: write a query, run it against the data in a database, look at the answer. Hadoop has comparable products (Hive and Pig), but these are time-consuming too, because each query takes a long time to process the data.
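As a sketch of what interactive analysis looks like on Spark (the CSV file and its columns are hypothetical), an analyst can register a table once and then ask ad-hoc SQL questions against the same session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("interactive-sql").getOrCreate()

# Hypothetical input: a CSV with columns such as region and amount.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)
sales.createOrReplaceTempView("sales")

# Each ad-hoc query runs against the same registered table,
# without launching a fresh batch job per question.
spark.sql("SELECT region, SUM(amount) FROM sales GROUP BY region").show()
spark.sql("SELECT COUNT(*) FROM sales WHERE amount > 1000").show()
```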

These obstacles led to the creation of Spark, a new processing module that supports both iterative programming and interactive analysis. Spark introduced an in-memory primitive: data is loaded into memory once and then queried repeatedly. This makes Spark ideal for big data analytics and machine learning algorithms.
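A minimal sketch of that in-memory model, reusing the earlier example: a single cache() call keeps the parsed data in executor memory after the first pass, so the remaining iterations never go back to disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-cached").getOrCreate()
sc = spark.sparkContext

# cache() marks the RDD to be kept in memory once it is first computed.
points = sc.textFile("hdfs:///data/points.txt").map(float).cache()

for i in range(10):
    # Only the first iteration reads from storage; later ones hit memory.
    total = points.map(lambda x: x * (i + 1)).reduce(lambda a, b: a + b)
    print(i, total)
```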

Note that Spark only provides the distributed processing module. For storage, it still depends on Hadoop's distributed file system (HDFS) or another store for efficient, distributed data storage; Spark does not ship a storage layer of its own.

Spark brought blistering speed to the big data ecosystem, running 10 to 100 times faster than MapReduce (the larger gains when the data fits in memory, the smaller ones on disk). Many people think this could be the end of MapReduce.

Ease of use

Compared with MapReduce, Spark is simple to work with, even convenient. A simple piece of logic or an algorithm that takes on the order of 100 lines in MapReduce can be finished in a few lines of Spark code. This ease of use is a key factor behind its wide adoption. Many advanced algorithms for machine learning or graph problems that are impractical in MapReduce can be done in Spark, which has made Spark's adoption rate quite high.
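Word count is the classic illustration. The canonical MapReduce version runs to dozens of lines of Java boilerplate; a PySpark equivalent (input and output paths are placeholders) fits in a handful of lines:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("hdfs:///data/input.txt")       # read the input
            .flatMap(lambda line: line.split())       # split into words
            .map(lambda word: (word, 1))              # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))         # sum counts per word

counts.saveAsTextFile("hdfs:///data/wordcount-out")
```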

MapReduce has no interactive mode. Although Hive and Pig include command-line interfaces, their performance still depends on MapReduce underneath. For batch processing, MapReduce remains very popular.

Spark processes data in memory, whereas MapReduce writes data back to disk after each processing step. So Spark generally outperforms MapReduce.

In 2014, Spark entered the Daytona GraySort benchmark and came out on top. For the uninitiated, Daytona GraySort is a third-party benchmark that measures how fast a system can sort 100 TB of data (one trillion records).

Spark used 206 AWS EC2 machines and sorted 100 TB of on-disk data in 23 minutes. The previous record holder was MapReduce, which used 2,100 machines and took 72 minutes in total. Under comparable conditions, Spark was three times faster than MapReduce while using roughly one tenth the number of machines.

Spark uses a lot of memory. If other memory-intensive services run alongside Spark, its performance may suffer. However, we can safely say that Spark prevails at iterative processing, that is, workloads that require multiple passes over the same data.
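Memory pressure is also tunable rather than all-or-nothing. As a sketch, persist() can be told to spill partitions that do not fit in memory out to local disk, using one of Spark's standard storage levels:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-demo").getOrCreate()
sc = spark.sparkContext

points = sc.textFile("hdfs:///data/points.txt").map(float)

# MEMORY_AND_DISK keeps partitions in memory when possible and spills
# the rest to local disk, easing pressure on memory-constrained nodes.
points.persist(StorageLevel.MEMORY_AND_DISK)
```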

Cost

Both have very similar hardware requirements in terms of computing power, disk, and network. For Spark, the more memory, the better the performance. Both run on commodity servers.
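For example, the memory given to Spark is a per-application setting; here is a minimal sketch using a standard configuration key (the size is an arbitrary example, and in many deployments this is passed at submit time instead):

```python
from pyspark.sql import SparkSession

# spark.executor.memory is a standard Spark configuration key;
# the 8g value is an arbitrary example.
spark = (SparkSession.builder
         .appName("memory-tuning")
         .config("spark.executor.memory", "8g")
         .getOrCreate())
```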

MapReduce programming takes real effort, and experts in it are scarce on the market. Spark experts are scarcer still, but only because Spark is a younger product; learning Spark is much easier than learning MapReduce programming.

Spark without Hadoop

Spark does not really need Hadoop in order to run. If we are not reading data from the Hadoop distributed file system (HDFS), Spark can run on its own. It can also read and write data in other stores such as S3 and Cassandra. In that architecture, Spark runs in standalone mode and needs no Hadoop components at all.
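A sketch of that Hadoop-free setup: point the session at a standalone master and read from S3 instead of HDFS (the master URL and bucket name are hypothetical, and S3 access assumes the hadoop-aws connector and AWS credentials are configured):

```python
from pyspark.sql import SparkSession

# "spark://host:7077" is the standalone cluster manager URL format;
# the host and bucket names below are placeholders.
spark = (SparkSession.builder
         .master("spark://master-host:7077")
         .appName("no-hadoop")
         .getOrCreate())

# Reading from S3 instead of HDFS; requires the hadoop-aws connector
# and AWS credentials to be set up.
logs = spark.read.text("s3a://my-bucket/logs/")
print(logs.count())
```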

Production use

Recent surveys show a surge in users running Spark in production. Many run Spark with Cassandra, or Spark with Hadoop, or Spark on Apache Mesos. Although the number of Spark users has grown, it has not caused panic in the big data community. MapReduce usage may be declining, but by how much is unclear.

Many predict that Spark will spur the development of another, better stack. But that new stack may end up looking very similar to the Hadoop ecosystem and its packages.

Spark's biggest advantage is simplicity. But it will not eliminate MapReduce entirely, because many people are still using MapReduce. Even if Spark is the big winner, unless a new distributed file system emerges, we will continue to use Hadoop and Spark together to process data.

Source: blog.csdn.net/bigagag/article/details/90347907