MapReduce
Split first, then combine: divide and conquer
The idea of MapReduce
Map: apply the same processing to every element of a data set
Reduce: consolidate (aggregate) the intermediate results produced by Map
- MapReduce data type: data flows through every stage in the form of key-value (k-v) pairs
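To make the two roles concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the helper names `map_phase` and `reduce_phase`, and the toy "sum of squares by parity" task, are assumptions chosen only for illustration. It shows Map applying one function to every element and emitting key-value pairs, and Reduce consolidating the values collected under each key.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the same mapper to every record; each call yields (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def reduce_phase(pairs, reducer):
    """Group the intermediate pairs by key, then consolidate each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


# Hypothetical task: square every number and key the result by parity.
def square_by_parity(n):
    yield ("even" if n % 2 == 0 else "odd", n * n)


# Hypothetical reducer: add up all values that share a key.
def total(key, values):
    return sum(values)


intermediate = map_phase([1, 2, 3, 4], square_by_parity)
result = reduce_phase(intermediate, total)
print(intermediate)
print(result)
```

In a real cluster the records would be spread over many machines and the grouping would be done by the framework; the sketch only mirrors the data flow.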
Phase composition:
A MapReduce program contains exactly one Map stage and one Reduce stage, or a Map stage only
If the business logic is too complex to fit that shape, it has to be split into multiple MapReduce programs that run serially, each one reading the output of the previous one:
Mapper > Reducer > Mapper > Reducer
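The serial chain can be sketched as two jobs where the output of the first becomes the input of the second. This is a single-process Python illustration under stated assumptions, not Hadoop's API: `run_job` and the two toy jobs (count words, then group words by their count) are hypothetical names invented for the example.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """One Map stage followed by one Reduce stage (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Job 1: count how often each word occurs.
def count_mapper(line):
    for word in line.split():
        yield word, 1


def count_reducer(word, ones):
    yield word, sum(ones)


# Job 2: regroup the (word, count) pairs so that the count becomes the key.
def invert_mapper(pair):
    word, count = pair
    yield count, word


def invert_reducer(count, words):
    yield count, sorted(words)


lines = ["a b a", "b a c"]
job1_output = run_job(lines, count_mapper, count_reducer)        # Mapper > Reducer
job2_output = run_job(job1_output, invert_mapper, invert_reducer)  # > Mapper > Reducer
print(job1_output)
print(job2_output)
```

The second job cannot start until the first has finished, which is why complex pipelines built this way run strictly one job after another.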
MapReduce instance processes
MRAppMaster: responsible for scheduling the processes and coordinating the state of the entire MR program
MapTask: responsible for all data processing in the Map phase
ReduceTask: responsible for all data processing in the Reduce phase
Advantages and disadvantages of MapReduce
- Advantages of MapReduce
Easy to program: implementing a few interfaces is enough to complete a distributed program
Good scalability: computing power can be expanded by adding machines
High fault tolerance: the failure of any single node does not prevent the whole job from completing
Well suited to offline processing of massive data sets
- Disadvantages of MapReduce
Poor real-time performance: it is mainly used for offline jobs and cannot return results within seconds
No stream computing: MapReduce mainly targets static, offline data sets
MapReduce programming case: WordCount (word frequency statistics)
Implementation ideas
- The core of the Map stage: split the input data into words and mark each word with 1, so the output is <word, 1> pairs
- The core of the Shuffle stage: using the sorting and grouping built into the MR framework, gather the pairs that share the same key into one group, forming a new k-v pair for each key
- The core of the Reduce phase: process each shuffled group, which holds all the key-value pairs for one word. Summing all the 1s gives the total number of occurrences of that word
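The three stages above can be sketched in plain Python. This is a single-process illustration, not the Hadoop implementation: the function names `map_words`, `shuffle`, and `reduce_counts`, and the sample lines, are assumptions made up for the example.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Map: split each line into words and emit a (word, 1) pair per word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: sort by key, then group the values that share a key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(ordered, key=itemgetter(0)):
        yield word, [one for _, one in group]


def reduce_counts(grouped):
    """Reduce: sum the 1s of each group to get the word's total count."""
    for word, ones in grouped:
        yield word, sum(ones)


lines = ["hello hadoop", "hello mapreduce hello"]  # hypothetical input
mapped = list(map_words(lines))
shuffled = list(shuffle(mapped))
counts = dict(reduce_counts(shuffled))
print(counts)
```

Sorting before grouping mirrors what the framework does between Map and Reduce: every reducer receives its keys in sorted order, each with the full list of values for that key.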
Steps
- Create a new file on which to run word frequency statistics, and enter some content
- Upload the file to the HDFS file system
- Run wordcount, an example built into Hadoop. It is packaged in the jar named hadoop-mapreduce-examples-*.jar under the path $HADOOP_HOME/share/hadoop/mapreduce
- Run the instance:
hadoop jar hadoop-mapreduce-examples-2.7.1.jar wordcount /input /output
The wordcount argument selects the word frequency example, /input is the path of the file on which to run word frequency statistics, and /output is the output path for the results. The output path does not need to be created manually (the job fails if it already exists)
- View the results