Xiaobai is learning cloud computing - MapReduce

One: Introduction
           MapReduce is a programming model and associated implementation for processing and generating very large data sets in cloud computing. It takes the massive data supplied by the client, processes it, and outputs the resulting data files. MapReduce can be deployed on a cluster built from a large number of ordinary, commodity-configured computers to achieve parallel processing. For programmers without experience in parallel computing or distributed systems, the MapReduce framework makes it possible to use the abundant resources of a distributed system effectively.
 Two: Usage (programming model)
            MapReduce programming is organized around two functions. The user first writes a Map function, which takes a set of input key/value pairs and produces a set of intermediate key/value pairs. The MapReduce library then groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function.
            The user also defines a Reduce function (in practice, both the map function and the reduce function are written before the job runs), which accepts an intermediate key and the set of values associated with that key. The Reduce function merges these values to form a possibly smaller set of values, which is returned to the user.
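            To make the model concrete, below is a minimal single-process sketch of the classic word-count example in Python. The names wc_map, wc_reduce and run_mapreduce are illustrative only and are not part of any particular MapReduce library.

                from collections import defaultdict

                def wc_map(document_name, text):
                    # Emit an intermediate (word, 1) pair for every word in the document.
                    for word in text.split():
                        yield (word, 1)

                def wc_reduce(word, counts):
                    # Merge all counts for a single word into one total.
                    yield (word, sum(counts))

                def run_mapreduce(inputs, map_fn, reduce_fn):
                    # Toy driver: group intermediate values by key, then hand each key
                    # and its list of values to the reduce function.
                    intermediate = defaultdict(list)
                    for name, text in inputs:
                        for key, value in map_fn(name, text):
                            intermediate[key].append(value)
                    results = []
                    for key in sorted(intermediate):
                        results.extend(reduce_fn(key, intermediate[key]))
                    return results

                if __name__ == "__main__":
                    docs = [("doc1", "cloud computing map reduce"),
                            ("doc2", "map reduce on the cloud")]
                    print(run_mapreduce(docs, wc_map, wc_reduce))

            In a real deployment the grouping step is performed by the distributed MapReduce library rather than by a single in-memory dictionary, but the Map and Reduce functions the user writes have essentially this shape.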
Three: Implementation
            There are many possible implementations of MapReduce, each suited to a different operating environment. Described here is an implementation targeted at the computing environment widely used within Google: a large cluster of ordinary PCs connected by switched Ethernet.
                       This environment includes:
                                 1. Machines with x86 processors running the Linux operating system.
                                 2. Commodity networking hardware.
                                 3. Clusters of hundreds of ordinary machines, so machine failure is the norm rather than the exception.
                                 4. Storage provided by inexpensive IDE disks. The underlying GFS (Google File System) layer keeps the data valid and reliable even on top of unreliable hardware.
                                 5. Users submit jobs to a scheduling system; each job consists of multiple tasks, and the scheduler maps these tasks onto the available machines in the cluster.

           The execution of a MapReduce job can be divided into five steps:
                       Step 1: After the client supplies the input data, the MapReduce library splits the input files into M pieces (the value of M is determined by the user), typically 16 MB to 64 MB each. The user program then starts up many copies of the program on the cluster, each responsible for different pieces of the input.
                       Step 2: One of these copies is special - the master; the rest are workers. The master assigns tasks to the workers and monitors their execution status. There are M Map tasks and R Reduce tasks to be assigned to idle workers.
                       Step 3: A worker assigned a map task reads the corresponding input split, parses key/value pairs out of it, and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
                       Step 4: The buffered pairs are written to local disk, partitioned into R regions by the partitioning function (see the sketch after these steps). The locations of these regions are then passed back to the master.
                       Step 5: A worker assigned a reduce task obtains the locations of the intermediate files from the master and reads the key/value pair data. The worker processes the grouped data with the Reduce function and writes the final key/value pairs to one of the R output files.
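           The partitioning in Step 4 is typically done by hashing the intermediate key so that all pairs with the same key end up in the same region. A minimal sketch is shown below; the name default_partition is illustrative.

                def default_partition(key, num_reduce_tasks):
                    # Assign an intermediate key to one of the R reduce regions.
                    # Real systems use a stable hash so that the same key always
                    # lands in the same region across machines.
                    return hash(key) % num_reduce_tasks

                # Example: with R = 4, every occurrence of "cloud" in this process
                # is routed to the same region, and hence to the same reduce task.
                region = default_partition("cloud", 4)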

Four: Implementation details
           1. Besides assigning tasks to the workers at the start, the master also records the locations of the intermediate data they produce. During execution, the master periodically communicates with each worker, so that a crashed worker program or a failure of the machine it runs on does not silently cause the job to run abnormally or produce wrong results.
           2. For a failed worker, the master reassigns its tasks to idle, healthy workers based on the state it has recorded, and the intermediate files produced by the failed worker are no longer read. When the worker recovers, it becomes idle again and waits to be scheduled.
           3. If the master program itself fails, MapReduce also provides a handling mechanism; however, because the master is unique, recovery would take a long time. In practice the computation is simply aborted and the client program is notified, so the client can re-run the job if needed.
           4. In the MapReduce operating environment, network bandwidth is the scarcest resource, so the system tries to schedule tasks on machines that already hold the input data on local disk in order to save network bandwidth.
           5. The overall speed of a job is limited by the slowest worker. To alleviate this, MapReduce provides a backup-task mechanism: when most tasks have completed, backup copies of the remaining in-progress tasks are started, and whichever copy finishes first wins (see the sketch after this list). This mechanism significantly improves job completion time.
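           Below is a minimal sketch of the backup-task idea, assuming a simple in-process task list; the names Task and schedule_backups are illustrative and not part of any real MapReduce implementation.

                from dataclasses import dataclass

                @dataclass
                class Task:
                    task_id: int
                    done: bool = False
                    running_copies: int = 1

                def schedule_backups(tasks, completion_threshold=0.9):
                    # Once most tasks have finished, launch a backup copy of each task
                    # that is still in progress; whichever copy finishes first "wins".
                    finished = sum(t.done for t in tasks)
                    if finished / len(tasks) < completion_threshold:
                        return []
                    backups = [t for t in tasks if not t.done and t.running_copies == 1]
                    for t in backups:
                        t.running_copies += 1
                    return backups

                # Example: with 9 of 10 tasks done, the one straggler gets a backup copy.
                tasks = [Task(i, done=(i < 9)) for i in range(10)]
                print([t.task_id for t in schedule_backups(tasks)])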
Five: Summary
            MapReduce is the core computational model of cloud computing. Understanding how MapReduce computes and tolerates faults is a prerequisite for using distributed computing platforms such as Hadoop, and an important reference for large-scale data computation and data mining. Cloud computing is a hot direction in the development of today's Internet and an important direction of the times. Only by understanding MapReduce deeply can we truly enter the world of cloud computing and keep pace with future developments.

 
