Hadoop MapReduce Overview

1.MapReduce defined

  MapReduce is a programming framework for distributed computing programs; it is the core framework users rely on to develop Hadoop-based data analysis applications;

  MapReduce's core function is to combine the user's business-logic code with its own built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster;

2.MapReduce advantages and disadvantages

  2.1 Advantages

    2.1.1 Easy to program

      By simply implementing a few interfaces you can complete a distributed program, and that program can be distributed to a large number of low-cost PC machines to run; in other words, writing a distributed program feels exactly the same as writing a simple serial program; this feature is what has made MapReduce programming so popular;

    2.1.2 Good scalability

      When your computing resources are no longer sufficient, you can extend computing power simply by adding machines;

    2.1.3 High fault tolerance

      MapReduce was designed from the start to run on inexpensive PC machines, which requires high fault tolerance; for example, if one machine goes down, the computing tasks running on it are transferred to another node so that the job does not fail; this process needs no human intervention and is handled entirely inside Hadoop;

    2.1.4 Suitable for offline processing of massive data (PB scale and above)

      Thousands of servers in a cluster can work concurrently to provide the data processing capacity;

  2.2 Disadvantages

    2.2.1 Not good at real-time computing

      MapReduce cannot return results within milliseconds or seconds the way MySQL can;

    2.2.2 Not good at stream computing

      In stream computing the input data is dynamic, whereas the input data set of a MapReduce job is static and cannot change while the job runs; MapReduce's own design requires the data source to be static;

    2.2.3 Not good at DAG (directed acyclic graph) computation

      Here multiple applications depend on one another, and the input of each application is the output of the previous one; MapReduce can do this, but the output of every MapReduce job is written to disk, which causes a large amount of disk I/O and very poor performance;
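The disk round trip between dependent jobs can be illustrated with a minimal sketch. It uses plain JDK file I/O as a stand-in for HDFS and does not call any Hadoop API; the class name `ChainedJobsSketch`, its method names, and the sample input are all hypothetical. The second "job" can only start by reading back what the first one wrote.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: local temp files stand in for HDFS; no Hadoop API is used.
class ChainedJobsSketch {

    // Job 1: count words, then write "word<TAB>count" lines to disk.
    static void countWordsJob(Path input, Path output) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : Files.readAllLines(input)) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            lines.add(e.getKey() + "\t" + e.getValue());
        }
        Files.write(output, lines);
    }

    // Job 2: read Job 1's output back from disk and count how many
    // distinct words occur with each frequency.
    static Map<Integer, Integer> frequencyHistogramJob(Path input) throws IOException {
        Map<Integer, Integer> histogram = new TreeMap<>();
        for (String line : Files.readAllLines(input)) {
            int count = Integer.parseInt(line.split("\t")[1]);
            histogram.merge(count, 1, Integer::sum);
        }
        return histogram;
    }

    // Runs the two jobs serially; the intermediate file is the only link between them.
    static Map<Integer, Integer> runChain(List<String> inputLines) {
        try {
            Path in = Files.createTempFile("job1-in", ".txt");
            Path mid = Files.createTempFile("job1-out", ".txt");
            Files.write(in, inputLines);
            countWordsJob(in, mid);                               // intermediate result written to disk
            Map<Integer, Integer> result = frequencyHistogramJob(mid); // and read back again
            Files.deleteIfExists(in);
            Files.deleteIfExists(mid);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runChain(List.of("a b a", "c b a")));
    }
}
```

With more stages in the chain, every additional stage adds another write and another read of the full intermediate result, which is the cost this subsection describes.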

3.MapReduce core idea

    

  3.1 A distributed computing program usually needs to be divided into at least two phases;

  3.2 The concurrent MapTask instances of the first phase run fully in parallel and are independent of each other;

  3.3 The concurrent ReduceTask instances of the second phase are also independent of each other, but their data depends on the output of all the concurrent MapTask instances of the previous phase;

  3.4 The MapReduce programming model can contain only one map phase and one reduce phase; if the user's business logic is very complex, the only option is to write multiple MapReduce programs and run them serially;
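The two phases above can be sketched in memory with plain Java. This is an illustration only, not how Hadoop schedules tasks: the class `TwoPhaseSketch`, its methods, and the word-count workload are assumptions made for the example, and a parallel stream stands in for concurrent MapTask instances.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory sketch; none of these names are Hadoop APIs.
class TwoPhaseSketch {

    // Phase 1: one "map task" per input split; emits a (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> mapTask(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Phase 2: one "reduce task" call per key; sums the values of that key.
    static int reduceTask(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> splits) {
        // Map tasks share no state, so the splits can be processed in parallel.
        List<List<Map.Entry<String, Integer>>> mapOutputs = splits.parallelStream()
                .map(TwoPhaseSketch::mapTask)
                .collect(Collectors.toList());

        // collect() returns only after every map task has finished, which mirrors
        // the rule that the reduce phase depends on all map output. Group by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> kv : output) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }

        // Each key group is reduced independently of the others.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduceTask(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, mapreduce=1}
        System.out.println(run(List.of("hello hadoop", "hello mapreduce")));
    }
}
```

The grouping step in the middle is written as a simple loop here; it only serves to hand each reduce call all values of one key.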

4.MapReduce processes

  A complete MapReduce program running in distributed mode has three types of instance processes:

    4.1 MrAppMaster: responsible for scheduling and state coordination of the whole program;

    4.2 MapTask: responsible for the entire data processing flow of the map phase;

    4.3 ReduceTask: responsible for the entire data processing flow of the reduce phase;

5.Common data serialization types

    

6.MapReduce programming specification

  6.1 Mapper stage

    6.1.1 The user-defined Mapper must extend its parent class;

    6.1.2 The Mapper's input data is in the form of KV pairs;

    6.1.3 The Mapper's business logic is written in the map() method;

    6.1.4 The Mapper's output data is in the form of KV pairs;

    6.1.5 The map() method is called once for each KV pair;

  6.2 Reducer stage

    6.2.1 The user-defined Reducer must extend its parent class;

    6.2.2 The Reducer's input data types correspond to the Mapper's output data types, and are also KV pairs;

    6.2.3 The Reducer's business logic is written in the reduce() method;

    6.2.4 The ReduceTask process calls the reduce() method once for each group of KV pairs that share the same K;

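  6.3 Driver stage

    6.3.1 The user-defined Mapper and Reducer must both extend their respective parent classes;

    6.3.2 The whole program needs a Driver to submit it; what is submitted is a Job object that describes all the necessary information;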
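The shape of this specification can be sketched without the Hadoop libraries. The parent classes below (`SketchMapper`, `SketchReducer`) and the job object (`SketchJob`) are hypothetical stand-ins written for this example, not the real Hadoop `Mapper`, `Reducer`, or `Job` classes, and the "city,temperature" input is invented sample data. The sketch only shows where user code goes and how often map() and reduce() are called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Hypothetical stand-ins for the framework's parent classes; not the Hadoop API.
abstract class SketchMapper<KI, VI, KO, VO> {
    abstract void map(KI key, VI value, BiConsumer<KO, VO> emit);
}

abstract class SketchReducer<K, V, R> {
    abstract R reduce(K key, List<V> values);
}

// User code: extend the parent class and put the business logic in map().
class MaxTempMapper extends SketchMapper<Integer, String, String, Integer> {
    @Override
    void map(Integer lineNo, String line, BiConsumer<String, Integer> emit) {
        String[] parts = line.split(",");            // input line: "city,temperature"
        emit.accept(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }
}

// User code: the reducer's input types match the mapper's output types.
class MaxTempReducer extends SketchReducer<String, Integer, Integer> {
    @Override
    Integer reduce(String city, List<Integer> temps) {
        int max = Integer.MIN_VALUE;
        for (int t : temps) {
            max = Math.max(max, t);
        }
        return max;
    }
}

// Job description: bundles everything the run needs into one object.
class SketchJob {
    final SketchMapper<Integer, String, String, Integer> mapper;
    final SketchReducer<String, Integer, Integer> reducer;
    final List<String> inputLines;

    SketchJob(SketchMapper<Integer, String, String, Integer> mapper,
              SketchReducer<String, Integer, Integer> reducer,
              List<String> inputLines) {
        this.mapper = mapper;
        this.reducer = reducer;
        this.inputLines = inputLines;
    }

    Map<String, Integer> submit() {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (int i = 0; i < inputLines.size(); i++) {
            // map() is called once per input KV pair (line number, line text).
            mapper.map(i, inputLines.get(i),
                    (k, v) -> groups.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            // reduce() is called once per group of values sharing the same key.
            result.put(e.getKey(), reducer.reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}

// Driver: builds the job object and submits it.
class SketchDriver {
    static Map<String, Integer> run(List<String> lines) {
        SketchJob job = new SketchJob(new MaxTempMapper(), new MaxTempReducer(), lines);
        return job.submit();
    }

    public static void main(String[] args) {
        // prints {beijing=35, shanghai=28}
        System.out.println(run(List.of("beijing,30", "shanghai,28", "beijing,35")));
    }
}
```

In a real Hadoop program the user classes would extend the framework's own Mapper and Reducer parent classes and use its serializable key and value types; the sketch keeps ordinary Java types so that it runs with the JDK alone.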

 


Origin www.cnblogs.com/wnwn/p/12569183.html