Good Programmer big data sharing: the MapReduce job submission flow

One, MapReduce definition

MapReduce is a computing model, framework, and platform for the parallel processing of large data sets.

Its core idea rests on two operations: Map (mapping) and Reduce (reduction). The three points below describe the roles MapReduce plays.

1) MapReduce is a cluster-based, high-performance parallel computing platform

2) MapReduce is a software framework for running parallel computations

3) MapReduce is a parallel programming model and method
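
To make the Map and Reduce idea concrete, here is a minimal single-process sketch of a word count. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the grouping step stands in for the work a real framework does between the two phases across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key (done by the framework in a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling `word_count(["a b a", "b c"])` counts each word across both lines. In a cluster, each call to `map_phase` could run on a different machine, which is what makes the model parallel.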

Two, main functions of MapReduce

1) Data partitioning and scheduling of computing tasks

2) Data/code co-location (running the code on or near the machine that holds the data)

3) System optimization

4) Error detection and recovery
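
Data/code co-location can be illustrated with a small scheduling sketch: a task is preferably sent to a machine that already stores the block it needs, and only falls back to another machine when no such machine has capacity. This is a simplified illustration, not the scheduler Hadoop actually uses; the function name `assign_tasks` and the slot-counting model are assumptions.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one task per block, preferring nodes that store the block.

    block_locations: dict mapping block id -> list of nodes holding a replica
    free_slots: dict mapping node -> number of tasks it can still accept
    Returns a dict mapping block id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, holders in block_locations.items():
        # First choice: a node that already holds the data and has a free slot.
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, is_local = max(local, key=lambda n: slots[n]), True
        else:
            # Fallback: any node with a free slot; the data must travel to it.
            candidates = [n for n, free in slots.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free slots left for block %r" % (block,))
            node, is_local = max(candidates, key=lambda n: slots[n]), False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment
```

For example, with three blocks stored only on `n1`, and `n1` having two free slots while `n2` has one, the first two tasks run locally on `n1` and the third is sent to `n2`, where the block has to be read over the network.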

Three, the job submission process for computing tasks

When studying this topic, we run into several questions:

1) How is the data distributed?

2) By what method is a large file split before it is placed on different machines?

3) Once the file has been split, how are the pieces sent to different machines?

4) Which machine is assigned which task, and how is the assignment made?

5) How does a machine obtain its task and carry it out?

With these questions in mind, we can study the job submission process and look for the answers there.

The job submission process can be summarized in words as follows:
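
As a first intuition for the splitting question, a large file can be described as a list of fixed-size pieces, each of which becomes the input of one map task. The sketch below is a simplification under stated assumptions: the function name `compute_splits` is hypothetical, the 128 MB default is only an illustrative value, and real input formats take more factors into account.

```python
def compute_splits(file_size, split_size=128 * 1024 * 1024):
    """Return a list of (offset, length) pairs that together cover file_size bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_size:
        # The last piece may be shorter than split_size.
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits
```

A 300-byte file with a 128-byte split size yields three pieces, the last one shorter than the others. Only these offsets and lengths need to be recorded as split information; the file contents stay where they are stored.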

1, The client submits a job to the ResourceManager (RM).

2, The RM places the job in a queue and returns the job ID and file path information.

3, The client uploads the resources required for the computation to the HDFS storage path (including the job information and the split information).

4, The client reports back to the RM that the resources are ready. The job enters the queue, the RM is told that the job may start, and the job waits to be scheduled by the RM.

5, Before scheduling the job, the RM requests resources from a NodeManager (NM). The NM starts a Container, and the node that receives the task fetches the job's resources from HDFS into the Container. It then interacts with the client, which has already worked out the resource requirements, and the client sends the command to start the ApplicationMaster (AM).

6, Once the AM is running, it parses the split information and applies to the RM for computing resources for the MapTasks.

7, On receiving the request, the RM checks the resources available on the NMs and uses load balancing to choose the machines it needs. On every heartbeat, each NM queries for the job description information of tasks assigned to it. A machine that has received a task fetches the computing resources from HDFS and interacts with the AM, and the AM sends the command to start the MapTask.

8, When a MapTask finishes, it notifies the AM and then releases its resources. The AM sends a message to the RM to apply for resources for the ReduceTasks.

9, The RM allocates the resources, and the ReduceTasks are started.

10, Each ReduceTask collects the data produced by the completed MapTasks and runs the reduce logic. After execution, it notifies the AM and releases its resources. The AM then notifies the RM and releases its own resources.
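
The ten steps above can be replayed as an ordered message trace, which makes it easier to see who talks to whom and when. The sketch below does not call any Hadoop or YARN API: the function name `run_job`, the job ID `job_0001`, and the message wording are all invented for illustration, and each step is reduced to a single log line.

```python
def run_job(num_splits, num_reducers=1):
    """Return the ordered list of messages for one job, following the ten steps."""
    job_id = "job_0001"  # hypothetical ID; a real RM generates its own
    log = []
    log.append("client -> RM: submit job")                                     # step 1
    log.append("RM -> client: %s and file path" % job_id)                      # step 2
    log.append("client -> HDFS: upload job info and %d splits" % num_splits)   # step 3
    log.append("client -> RM: resources ready")                                # step 4
    log.append("NM: start Container, launch AM")                               # step 5
    log.append("AM -> RM: request %d MapTask containers" % num_splits)         # step 6
    for i in range(num_splits):                                                # step 7
        log.append("AM -> NM: start MapTask %d" % i)
    for i in range(num_splits):                                                # step 8
        log.append("MapTask %d -> AM: done, resources released" % i)
    log.append("AM -> RM: request %d ReduceTask containers" % num_reducers)
    for r in range(num_reducers):                                              # step 9
        log.append("AM -> NM: start ReduceTask %d" % r)
    for r in range(num_reducers):                                              # step 10
        log.append("ReduceTask %d -> AM: done, resources released" % r)
    log.append("AM -> RM: job finished, AM resources released")
    return log
```

Printing `"\n".join(run_job(2))` shows the whole conversation for a job with two splits and one reducer. Note that the number of MapTasks follows from the number of splits uploaded in step 3, which is why the AM parses the split information before asking the RM for resources.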

Origin www.cnblogs.com/gcghcxy/p/10980290.html