Hadoop MapReduce Introduction

This article first introduces batch mode, then MapReduce, a typical batch-mode framework, and finally describes the Map and Reduce functions.

Batch Mode

Batch mode was the pioneer of large-scale data processing. Batch processing mainly operates on large static data sets and returns the result only after the entire data set has been processed. It is ideal for computing work that requires access to the complete data set.

For example, when calculating totals and averages, the data set must be treated as a whole rather than as a collection of individual records. Such operations require the process to maintain its own state throughout the computation.

Tasks that process large amounts of data are usually best suited to batch mode; batch systems are designed with the volume of data fully in mind and can provide sufficient processing resources.

Because batch processing handles large volumes of persistent data extremely well, it is often used to analyze historical data.

To improve processing efficiency, batch processing of large data sets needs the help of distributed parallel programs.


A traditional program is essentially a single-instruction, single-data stream executed sequentially. Such a program is relatively simple to develop and matches people's habits of thought, but its performance is limited by that of a single computer, and it can be difficult to complete the task within the given time.

A distributed parallel program runs on a cluster made up of a large number of computers. It can use many computers concurrently to complete one data-processing task, improving processing efficiency, and its computing power can be expanded by adding new computers to the cluster.

Google was the first to implement the distributed parallel processing model MapReduce, and it disclosed the working principle in a paper in 2004; Hadoop MapReduce is an open-source implementation of it. Hadoop MapReduce runs on top of HDFS.

A Brief Explanation of MapReduce

As shown in Figure 1, suppose we want to know how many hearts there are in a fairly thick stack of cards. The most intuitive approach is to examine the cards one by one and count the hearts. The drawback of this approach is that it is too slow: when the number of cards is particularly large, getting the result takes a very long time.

Figure 1  Finding out how many hearts there are

The MapReduce approach follows these rules:

  • Distribute the stack of cards among all the players present.
  • Have each player count how many hearts are among the cards in their own hand, then report that number.
  • Add up the numbers reported by all the players to get the final answer.

Obviously, by letting all the players examine their cards in parallel, the MapReduce approach finds the number of hearts in the stack far faster.
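The three steps above can be sketched in a few lines of Python. The deck, the number of players, and the dealing scheme are hypothetical stand-ins chosen for illustration, not part of any real MapReduce API.

```python
# A hypothetical 52-card deck: each card is a (suit, rank) tuple.
deck = [(suit, rank)
        for suit in ("hearts", "spades", "diamonds", "clubs")
        for rank in range(1, 14)]

def deal(cards, n_players):
    """Split the deck into n_players roughly equal hands (the distribution step)."""
    return [cards[i::n_players] for i in range(n_players)]

def count_hearts(hand):
    """Each player counts the hearts in their own hand (the per-player step)."""
    return sum(1 for suit, _ in hand if suit == "hearts")

hands = deal(deck, 4)
reports = [count_hearts(h) for h in hands]  # conceptually done in parallel
total = sum(reports)                        # add up all the reported numbers
print(total)  # 13 hearts in a standard deck
```

On a real cluster each call to count_hearts would run on a different machine; here the list comprehension merely stands in for that parallelism.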

The MapReduce method applies the ideas of two classical functions: mapping and reduction.

1) Mapping (Map)

Apply the same operation to each element of a collection. For instance, if you want to multiply every cell in a table by two, applying that operation separately to each cell belongs to mapping (Map).

2) Reduction (Reduce)

Traverse the elements of a collection and return a combined result. For instance, the task of finding the sum of all the numbers in a table and outputting it as a single number belongs to reduction (Reduce).
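Python's built-in map and functools.reduce mirror these two classical functions directly; the following minimal sketch doubles every cell of a small table row and then sums the results (the numbers are made up for illustration).

```python
from functools import reduce

cells = [3, 1, 4, 1, 5]  # a hypothetical row of table cells

# Mapping: apply the same operation (doubling) to each element independently.
doubled = list(map(lambda x: 2 * x, cells))

# Reduction: traverse the collection and combine it into one summary value.
total = reduce(lambda acc, x: acc + x, doubled, 0)
print(doubled, total)  # [6, 2, 8, 2, 10] 28
```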

Let us revisit the earlier example of finding the total number of hearts in the stack of cards; it uses the basic MapReduce method of distributed data analysis. In this example, the players represent computers, and because they work at the same time, they form a cluster.

Distributing the cards to multiple players and having each of them count their own cards means the operations execute in parallel, with every player counting at the same time. This turns the work into a distributed task, because a number of different people solving the same problem do not need to know what their neighbors are doing.

Telling everyone to count is in fact a mapping task that examines every card. Rather than having the players hand the heart cards back, we have them reduce what we want to a single number.

Note that the cards must be distributed evenly. If one player is dealt far more cards than the others, the process of counting them may be much slower, which would hold back the progress of the whole count.

We can also ask more interesting questions, such as "What is the average value of the stack of cards?". We can answer it by merging the answers to two other questions: "What is the sum of the values of all the cards?" and "How many cards do we have?". Dividing this sum by the number of cards gives the average.
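Computing the average this way can be sketched as follows: each player reports a (sum, count) pair, the pairs are merged, and a single division happens at the end. The hand values are hypothetical.

```python
# Hypothetical card values held by three players.
hands = [[2, 9, 11], [4, 4, 7, 13], [1, 8]]

# Each player reports a (sum, count) pair for their own hand.
partials = [(sum(h), len(h)) for h in hands]

# Merge the pairs, then divide only once at the end.
total, count = 0, 0
for s, c in partials:
    total += s
    count += c
average = total / count
print(total, count, average)
```

Averaging the per-player averages instead would give the wrong answer when hands have different sizes; merging sums and counts first avoids that pitfall.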

The mechanics of a real MapReduce algorithm are far more complicated than counting cards, but the main idea is the same: analyze large amounts of data through distributed computation. Whether at Google, Baidu, Tencent, NASA, or a small startup, MapReduce is currently the mainstream method for analyzing Internet-scale data.

The Basic Ideas of MapReduce

The basic idea of using MapReduce to handle big data operates on three levels. First, adopt a divide-and-conquer strategy for big data. For big-data computations whose parts have no dependencies on one another, the most natural approach is divide and conquer.

Second, raise the divide-and-conquer idea to an abstract model. To overcome the lack of a high-level parallel programming model in MPI and similar parallel computing methods, MapReduce borrows from the functional language Lisp and provides a high-level parallel programming abstraction built on two functions, Map and Reduce.

Finally, raise the divide-and-conquer idea to the architecture level, using a unified framework to hide system-level implementation details from the programmer.

Parallel computing methods such as MPI lack the support of a unified computing framework: the programmer must deal with data storage, partitioning, distribution, result collection, error recovery, and many other details. MapReduce therefore designs and provides a unified computing framework that hides most of these system-level details from the programmer.

1. The big-data processing idea: divide and conquer

The first important question in parallel computing is how to divide the computing task or its data so that the resulting sub-tasks or data blocks can be computed simultaneously. However, in some problems the data items have strong front-to-back dependencies on one another; such problems cannot be divided and can only be computed serially.

Tasks or data with such inseparable dependencies cannot be computed in parallel. But if a large data set can be divided into blocks that all undergo the same computation, with no data dependency between the blocks, the best way to increase processing speed is parallel computation.

For example, suppose there is a huge two-dimensional array, too large to fit into one computer's memory, as shown in Figure 2, and we are asked to cube each element. Because the treatment of every element is identical and there is no data dependency between elements, the array can be divided into sub-arrays that a group of computers process in parallel.
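A small stand-in for this scenario: the array below is tiny, but the block split shows how sub-arrays could be handed to different machines. The block boundaries are arbitrary choices for illustration.

```python
matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]

# Divide: split the array into row blocks that could live on different machines.
blocks = [matrix[0:2], matrix[2:3]]

def cube_block(block):
    # Cubing is element-wise with no dependency between elements,
    # so each block can be processed by a separate worker.
    return [[x ** 3 for x in row] for row in block]

processed = [cube_block(b) for b in blocks]  # conceptually in parallel

# Combine: concatenating the processed blocks restores the full result.
result = [row for block in processed for row in block]
print(result)  # [[1, 8, 27], [64, 125, 216], [343, 512, 729]]
```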

Figure 2  The divide-and-conquer idea of MapReduce

2. Constructing abstract models: the Map and Reduce functions

The functional programming language Lisp is a list-processing language. Lisp defines operations that can be applied to an entire list of elements. For example, (add #(1 2 3 4) #(4 3 2 1)) yields #(5 5 5 5).

Lisp also provides Map and Reduce functions with similar behavior. For example:

  • (map 'vector #'+ #(1 2 3 4 5) #(10 11 12 13 14)) defines an addition Map operation that adds two vectors; the result is #(11 13 15 17 19).
  • (reduce #'+ #(11 13 15 17 19)) merges the elements by addition; the accumulated result is 75.
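Python's map and functools.reduce behave the same way as these Lisp calls, which may make the semantics easier to follow if Lisp is unfamiliar:

```python
from functools import reduce

# Equivalent of (map 'vector #'+ #(1 2 3 4 5) #(10 11 12 13 14)):
# map accepts several sequences and applies the function element-wise.
summed = list(map(lambda a, b: a + b, [1, 2, 3, 4, 5], [10, 11, 12, 13, 14]))
print(summed)  # [11, 13, 15, 17, 19]

# Equivalent of (reduce #'+ #(11 13 15 17 19)): merge by addition.
total = reduce(lambda a, b: a + b, summed)
print(total)  # 75
```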

The Map function applies some repeated processing to a group of data elements; the Reduce function further consolidates the intermediate results produced by the Map function.

Borrowing this idea from Lisp, MapReduce defines two abstract programming interfaces, Map and Reduce, giving programmers a clear abstract description of the operations; users then implement them in code.

1) Map: <k1,v1> → List(<k2,v2>)

Input: data represented as key-value pairs <k1,v1>.

Processing: data records are passed to the Map function in key-value-pair form; the Map function processes these pairs and outputs intermediate results as key-value pairs of another form, List(<k2,v2>).

Output: a group of intermediate data represented as key-value pairs List(<k2,v2>).

2) Reduce: <k2,List(v2)> → List(<k3,v3>)

Input: the key-value pairs List(<k2,v2>) output by Map are merged so that the different values under the same key are gathered into one list List(v2); the input of Reduce is therefore <k2,List(v2)>.

Processing: the incoming list of intermediate results is consolidated or further processed to produce the final output List(<k3,v3>).

Output: the final result List(<k3,v3>).

The MapReduce-based parallel computing model is shown in Figure 3. Each Map function processes its assigned partition of the data in parallel, producing different intermediate results from different input data.

The Reduce functions also compute in parallel, each responsible for different intermediate results. Before Reduce processing can begin, all of the Map functions must have completed.

A synchronization barrier is therefore needed before entering the Reduce phase. This stage also collects and organizes the intermediate results of the Map functions so that the Reduce functions can compute the final results more efficiently; aggregating the outputs of all Reduce functions then yields the final result.
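This data flow can be simulated in a few lines of Python with the classic word-count example. It is a sketch of the model only: the input splits are hypothetical, and real Map and Reduce tasks would run on different nodes rather than in one process.

```python
from collections import defaultdict

# Hypothetical input splits, one per Map task.
splits = ["the quick brown fox", "the lazy dog", "the fox"]

def map_fn(line):
    # Map: <k1,v1> -> List(<k2,v2>); v1 is a line, each word becomes (word, 1).
    return [(word, 1) for word in line.split()]

intermediate = [map_fn(s) for s in splits]  # each Map task runs independently

# Synchronization barrier + shuffle: only after every Map task has finished
# are the intermediate pairs grouped by key into <k2, List(v2)>.
grouped = defaultdict(list)
for pairs in intermediate:
    for key, value in pairs:
        grouped[key].append(value)

def reduce_fn(key, values):
    # Reduce: <k2,List(v2)> -> <k3,v3>; here, sum the counts per word.
    return (key, sum(values))

result = dict(reduce_fn(k, vs) for k, vs in grouped.items())
print(result)  # each word mapped to its total count
```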

Figure 3  The MapReduce-based parallel computing model

3. Rising to the architecture level: automatic parallelization with the low-level details hidden

MapReduce provides a unified computing framework that takes care of dividing and scheduling computing tasks; distributing, storing, and partitioning data; synchronizing data processing with computing tasks; collecting and organizing result data; and system-level concerns such as communication, load balancing, performance optimization, node failure detection, and failure recovery.

Through its abstract model and computing framework, MapReduce separates what needs to be done from how to do it, offering programmers an abstract, high-level programming interface and framework. Programmers need only concern themselves with the concrete computing problem of their application layer, writing a small amount of code that addresses the computation itself.

The many system-level details involved in actually carrying out the parallel computation are hidden and handed to the computing framework, from the execution of distributed code to the automatic scheduling of clusters ranging from a single node up to many thousands.

The main capabilities provided by the MapReduce computing framework include the following.

1) Task scheduling

A submitted computing job (Job) is divided into many computing tasks (Tasks).

Task scheduling is responsible for allocating and scheduling compute nodes (Map nodes or Reduce nodes) for these divided tasks, monitoring the execution status of the nodes, controlling the synchronization of Map node execution, and performing certain performance optimizations. For example, the slowest computing tasks may be executed as multiple backup copies, with the result taken from whichever copy finishes first.

2) Data/program co-location

To reduce the volume of data communication, a basic principle is localized data processing: a compute node processes, as far as possible, the data distributed on its own local disk, realizing the migration of code toward data.

When such localized processing is impossible, the framework looks for other available nodes and transfers the data to one of them over the network (migrating data toward code), preferring available nodes on the rack where the data resides in order to reduce communication latency.

3) Error handling

In a large MapReduce cluster built from low-end commodity servers, node hardware failures (host, disk, memory, etc.) and software defects are the norm. The MapReduce framework must therefore be able to detect and isolate failed nodes, and to schedule new nodes to take over the computing tasks of the failed ones.

4) Distributed data storage and file management

Processing massive data requires the support of a good distributed data storage and file management system, one able to distribute the data across the local disks of the cluster nodes while logically keeping the whole data set a single complete file.

To make the data storage fault tolerant, this system must also manage multiple replicated copies of each data block.

5) Combiner and Partitioner

To reduce data communication overhead, intermediate results are merged (Combine) before entering the Reduce nodes: data items with the same key are merged together so that they are not transmitted repeatedly.

The data processed by one Reduce node may come from multiple Map nodes. The intermediate results output by the Map nodes must therefore be appropriately divided (Partition) according to some policy, guaranteeing that related data is sent to the same Reduce node.
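A minimal sketch of both ideas, using word-count style pairs. The data, the number of reducers, and the hash-based rule are illustrative assumptions (Hadoop's default partitioner is likewise hash-based).

```python
from collections import Counter

# Intermediate output of a single Map task (word-count style).
map_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)]

# Combiner: merge pairs with the same key locally on the Map node,
# so fewer records have to cross the network.
combined = list(Counter(k for k, _ in map_output).items())

# Partitioner: decide which Reduce task receives each key. A hash-based
# rule guarantees the same key always reaches the same Reduce node.
NUM_REDUCERS = 2
def partition(key):
    return hash(key) % NUM_REDUCERS

shards = {r: [] for r in range(NUM_REDUCERS)}
for key, value in combined:
    shards[partition(key)].append((key, value))
print(combined, shards)
```

After combining, only two records leave this Map node instead of five.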

The Map and Reduce Functions

MapReduce is an easy-to-use software framework. Applications written with it can run on large clusters of commodity servers and process terabyte-scale data sets in parallel, in a reliable and fault-tolerant way.

MapReduce abstracts the complex parallel computing process running on a large cluster into two simple functions: the Map function and the Reduce function.

Simply put, a Map function performs a specified operation on every element of a conceptual list of independent elements.

For example, to raise the salary of every employee in a salary list by 10%, you can define a "+10%" Map function to do the job, as shown in Figure 4.

In fact, each element is operated on independently and the original list is not changed; a new list is created to hold the new answers. In other words, Map operations are highly parallelizable, which is very useful for applications demanding high performance and for the needs of the parallel computing field.

Figure 4  Hadoop's MapReduce and HDFS cluster architecture

In Figure 4, the table of 18 employees is split into 3 blocks of 6 employees each, each block handled by one Map function, which makes the processing three times as efficient as sequential processing. Within every Map function, the operation performed on each employee's salary is exactly the same: increase it by 10%.
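The "+10%" Map step might look like this in Python; the salary figures are made up. Note that the original list is left untouched, which is what makes the operation safe to run in parallel.

```python
salaries = [3000, 4000, 5000, 6000, 7000, 8000]  # hypothetical figures

# Map: apply "+10%" to every salary independently; a new list holds the
# results, and the original list is not modified.
raised = [round(s * 1.10, 2) for s in salaries]
print(raised)
```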

A Reduce operation appropriately merges the elements of a list.

For example, to find the average employee salary, you can define a Reduce function that adds each element of the list to its neighbor, halving the length of the list; applying this recursively until only one element remains, and then dividing that element by the number of employees, gives the average salary.
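The pairwise-addition Reduce described here can be sketched as follows; the salary figures are hypothetical.

```python
salaries = [3300, 4400, 5500, 6600, 7700, 8800]  # hypothetical figures
headcount = len(salaries)

# Reduce by adding adjacent elements, halving the list each round,
# until a single sum remains.
values = salaries[:]
while len(values) > 1:
    paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
    if len(values) % 2:            # carry an unpaired trailing element forward
        paired.append(values[-1])
    values = paired

average = values[0] / headcount
print(average)  # 6050.0
```

Each round of pairwise additions could itself run in parallel, which is why this halving scheme suits a cluster better than a single left-to-right accumulation.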

Although the Reduce function is not as parallel as the Map function, because it always produces a simple answer and the large-scale operations are relatively independent, the Reduce function is also useful in highly parallel environments.

Both the Map function and the Reduce function take <key,value> pairs as input and, according to certain mapping rules, transform them into another <key,value> pair, or a batch of pairs, for output, as shown in Table 1.

Table 1  The Map and Reduce functions

  • Map: input <k1,v1>; output List(<k2,v2>). The input data set is decomposed into a batch of <key,value> pairs, which are then processed; for every input <k1,v1>, Map outputs a batch of <k2,v2> pairs.
  • Reduce: input <k2,List(v2)>; output <k3,v3>. The MapReduce framework groups the output of Map by key into <k2,List(v2)>, where List(v2) is a batch of values belonging to the same k2.

The input data of the Map function comes from file blocks in HDFS. These file blocks can be of any format: documents, numbers, or binary. A file block is a collection of elements, and the elements, too, can be of any type.

The Map function first converts an input data block into key-value pairs of the form <key,value>; the types of the key and the value are likewise arbitrary.

The role of the Map function is to map each input key-value pair to one or a batch of new key-value pairs. The keys of the output pairs may differ from the keys of the input pairs.

Note that the output format of the Map function and the input format of the Reduce function are not the same: the former is List(<k2,v2>), while the latter is <k2,List(v2)>. The output of the Map function therefore cannot be used directly as the input of the Reduce function.

The MapReduce framework classifies the output of the Map function by key, merging all key-value pairs with the same key into <k2,List(v2)>, where List(v2) is a batch of values belonging to the same k2.

The task of the Reduce function is to combine, in some way, a series of inputs that share the same key, and to output the processed key-value pairs; the output is typically merged into one file.

To improve processing efficiency, the user can also specify the number of Reduce tasks; that is, multiple Reduce tasks may perform the reduction concurrently.

The MapReduce framework routes each input key to the corresponding Reduce task according to the configured rules. In this case, MapReduce produces multiple output files.

In general, these output files do not need to be merged, because they may serve as input to the next MapReduce task.


Origin blog.csdn.net/yuidsd/article/details/92010901