Parallel Computing and MapReduce

2019-12-01 21:17:38

Reference: https://www.iteye.com/blog/xuyuanshuaaa-1172511

MapReduce/Hadoop and related data-processing technologies are very popular right now, so I would like to summarize the advantages of MapReduce by briefly comparing it with the traditional parallel computing model based on HPC clusters. This also serves as a review of the MapReduce material I studied a while ago.

As the amount of data on the Internet keeps growing, the demands on data-processing capability keep rising. When the amount of computation exceeds what a single machine can handle, turning to parallel computing is the natural solution. Long before MapReduce appeared, mature parallel computing frameworks such as MPI already existed, so why did Google need MapReduce, and what advantages does MapReduce have over traditional parallel computing frameworks? That is the question this article addresses.

The article begins with a comparison table of traditional parallel computing and the MapReduce framework, followed by an item-by-item analysis.

[Table: comparison of the merits of traditional parallel computing on HPC clusters vs. MapReduce]

In traditional parallel computing, the computing resources are generally presented as a single, logically unified computer. An HPC cluster built from multiple blades and a SAN still appears to the programmer as one computer, just one with a large number of CPUs and enormous main memory and disk capacity. Physically, however, the computing resources and the storage resources are two separate parts: data travels from the storage nodes to the compute nodes over a data bus or a high-speed network. For compute-intensive processing of small amounts of data, this is not a problem. For data-intensive processing, the I/O between compute nodes and storage nodes becomes the performance bottleneck of the entire system: the centrally located, shared data store is what causes the I/O bottleneck. Furthermore, because the cluster components are tightly coupled and strongly interdependent, the cluster's fault tolerance is poor.

In fact, large-scale data exhibits a degree of locality, so storing it in one unified place and reading it out centrally is not the best approach. MapReduce is dedicated to solving large-scale data-processing problems, so the principle of data locality was taken into account early in its design, and the overall problem is divided and conquered along locality lines. A MapReduce cluster is built from ordinary PCs and uses a shared-nothing architecture. Before processing, the data set is distributed across the nodes. During processing, each node reads and processes the data stored nearest to it, i.e. locally (Map); the processed data is then pre-merged (Combine), sorted and regrouped by key (shuffle and sort), and finally distributed to the Reduce nodes. This avoids transmitting large amounts of data and improves processing efficiency. Another benefit is that the shared-nothing architecture, combined with a replication strategy, gives the cluster good fault tolerance: if some nodes go down, the normal operation of the rest of the cluster is not affected.
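To make those phases concrete, here is a minimal single-process sketch of the Map / Combine / shuffle-and-sort / Reduce pipeline, using word count as the example job. The function names and the toy input are illustrative only, not part of any real framework API.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit (word, 1) for every word read from "local" data.
    for word in document.split():
        yield (word, 1)

def combine_phase(pairs):
    # Combine: pre-aggregate on the map side to cut network traffic.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts.items()

def shuffle_and_sort(all_pairs):
    # Shuffle and sort: group values by key across all map outputs.
    grouped = defaultdict(list)
    for key, value in all_pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key, values):
    # Reduce: final aggregation per key.
    return (key, sum(values))

# Each "node" maps and combines its local split before anything is shuffled.
splits = ["big data big compute", "big data little latency"]
combined = [pair for s in splits for pair in combine_phase(map_phase(s))]
for key, values in shuffle_and_sort(combined):
    print(reduce_phase(key, values))   # e.g. ('big', 3), ('data', 2), ...
```

In a real cluster each split lives on a different machine and the shuffle crosses the network; the point of the Combine step is exactly to shrink what has to cross it.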

Hardware / price / scalability

A traditional HPC cluster is built from high-end hardware and is very expensive. To improve its performance, the usual approach is vertical scaling: switch to faster CPUs, add blades, add memory, expand the disks, and so on. But this approach cannot sustain long-term growth in computing demand (the ceiling is reached quickly), and upgrades are expensive. Compared with a MapReduce cluster, an HPC cluster therefore scales poorly.

A MapReduce cluster is built from ordinary PCs, which have a much better price-performance ratio, so for the same computing power a MapReduce cluster costs far less. Moreover, the nodes of a MapReduce cluster are connected over Ethernet, which gives good horizontal scalability: throughput can be improved simply by adding PC nodes. Yahoo! runs the world's largest Hadoop cluster, with more than 4,000 nodes (Google's MapReduce clusters are presumably larger, but no specific figures seem to have been released; if any reader knows, please share).

  Programming / learning curve 

The traditional parallel computing model shares its logic with the multi-threading model, and the biggest problem with this programming model is that the behavior of the program is hard to control. To guarantee correct results, access to shared resources must be carefully controlled, which gave rise to a whole series of synchronization techniques such as mutexes, semaphores, and locks; these in turn bring problems such as contention, starvation, and deadlock. When using the traditional parallel computing model, the programmer must consider not only what to do (describing the problem to be solved in terms of the parallel model) but also how to do it (the many synchronization and communication details of program execution), which makes parallel programming difficult. Existing programming models such as MPI, OpenCL, and CUDA provide only fairly low-level abstractions, leaving many implementation details for the programmer to handle.
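As a toy illustration of why shared-state parallel programming is error-prone, here is a short sketch (constructed for this post, not from the original article): four threads increment a shared counter, and without a mutex the read-modify-write sequence can interleave and lose updates.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n, use_lock):
    global counter
    for _ in range(n):
        if use_lock:
            with lock:          # mutual exclusion: one thread at a time
                counter += 1
        else:
            counter += 1        # racy read-modify-write, updates can be lost

def run(use_lock):
    global counter
    counter = 0
    threads = [threading.Thread(target=increment, args=(100_000, use_lock))
               for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    return counter

print("without lock:", run(False))   # timing-dependent, often < 400000
print("with lock:   ", run(True))    # always 400000
```

And this is the easy case: with several locks, acquisition order must also be kept consistent across all threads, or the program can deadlock. It is exactly this class of detail that MapReduce takes off the programmer's hands.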

MapReduce does more: it is not only a programming model but also a runtime environment for executing MapReduce programs. The many details of parallel execution, such as data distribution, merging, synchronization, and monitoring, are all handled by the framework. With MapReduce, the programmer only needs to think about how to describe the problem in the MapReduce model (the "what") and does not need to worry about how the program is executed (the "how"), which makes MapReduce easy to use.
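Hadoop Streaming makes this "what, not how" split very visible: the programmer supplies just two small scripts, and the framework does everything else. Here is a word-count sketch in that style (the scripts are my own illustration, not from the original article): the mapper reads raw input lines from stdin and emits key/value pairs, and the reducer receives lines already grouped and sorted by key from the shuffle phase.

```python
#!/usr/bin/env python3
# mapper.py -- the "what": emit (word, 1) for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- input arrives sorted by key, so a running total per key works.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```

Such a job is launched with a command along the lines of `hadoop jar hadoop-streaming.jar -input in -output out -mapper mapper.py -reducer reducer.py` (the exact jar path and options for shipping the scripts depend on the installation). Everything between the two scripts, i.e. splitting the input, scheduling tasks near the data, shuffling, sorting, and retrying failed tasks, is the framework's responsibility.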

Applicable scenarios

Having said so many good things about MapReduce, is MapReduce a one-size-fits-all solution?

The answer is no. Whatever else, one should not forget what MapReduce was designed for: solving large-scale, non-real-time data-processing problems. "Large-scale" implies that the data has locality and can be partitioned, so it can be processed in batches; "non-real-time" means the response time may be long, leaving ample time for the job to run. Typical operations include the following:

1. Updating a search engine's ranking (running the PageRank algorithm over the entire web graph; a map/reduce sketch of one iteration follows this list)

2. Computing recommendations (recommendation results do not need to be updated in real time, so a fixed point in time can be set at which they are periodically refreshed)
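As a concrete illustration of item 1, here is a minimal single-process sketch of one PageRank iteration expressed as map and reduce steps. The toy three-node graph, the damping value, and the simplifications (no dangling-node handling, graph held in memory) are all mine, not from the article.

```python
from collections import defaultdict

DAMPING = 0.85

def pagerank_map(node, rank, out_links):
    # Each node sends an equal share of its rank to every out-link,
    # and re-emits its adjacency list so the graph survives the round.
    for target in out_links:
        yield (target, rank / len(out_links))
    yield (node, out_links)

def pagerank_reduce(node, values, num_nodes):
    # Sum the incoming rank shares and apply the damping factor.
    rank, links = 0.0, []
    for v in values:
        if isinstance(v, list):
            links = v
        else:
            rank += v
    return node, (1 - DAMPING) / num_nodes + DAMPING * rank, links

graph = {"a": (1 / 3, ["b", "c"]), "b": (1 / 3, ["c"]), "c": (1 / 3, ["a"])}
grouped = defaultdict(list)
for node, (rank, links) in graph.items():
    for key, value in pagerank_map(node, rank, links):
        grouped[key].append(value)
for node in sorted(grouped):
    print(pagerank_reduce(node, grouped[node], len(graph)))
```

On a real web graph, each iteration is one MapReduce job and the algorithm runs as a chain of such jobs until the ranks converge, which is exactly the kind of long-running batch workload MapReduce targets.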

The birth of MapReduce has its historical background. With the development of the web, and especially of social networks and the Internet of Things, the data produced by users and by all kinds of sensors has been growing explosively. Data that merely sits in storage is dead data; only after analysis and processing can the information inherent in the data be extracted, and knowledge in turn be distilled from that information. The data itself is therefore important, but so is the ability to process it. Parallel computing based on traditional HPC clusters could no longer keep up with the rapidly growing demand for data processing, and so MapReduce, built from ordinary PCs and offering low cost, high performance, high scalability, and high reliability, came into being.
