Hadoop: MapReduce architecture concepts

Copyright notice: Creative Commons license, Attribution-ShareAlike. Others may create derivative works based on this article, but must distribute them under the same Creative Commons license as the original.

First, an example:
Suppose you are asked to count the cars in a yard: how many cars of each brand are there? How would you go about it?
Ferritin comes along and says "so easy". He counts from the first car to the last and ends up dog-tired.
Now scale it up: count the cars in a whole county or district. How would you do that?
Ferritin calls in his buddies (Iron Post, Tiger, Niu, Flower, and so on). Each person takes charge of one yard. Once everyone has finished counting, the results are aggregated.
Isn't that fast and simple?
Ferritin even boasts that, with enough friends, more than one county would be no problem at all.
What we have here is nothing more than distributing the work and summarizing at the end. Now, on to the topic.

MapReduce

Why is it called MapReduce and not, say, Little XXOO or apache ss?
MapTask: the parallel part, like Ferritin's friends all counting cars at the same time.
ReduceTask: the summary part. Once everyone has finished counting, the counts are aggregated into the final overall result.
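
The two roles above can be sketched in plain Python. This is only an illustration of the idea, not Hadoop code: the yard names and car brands are made-up data, a thread pool stands in for parallel MapTasks, and `map_task` / `reduce_task` are hypothetical names.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Made-up data: each yard is a list of the car brands parked there.
yards = {
    "yard_a": ["Toyota", "Ford", "Toyota"],
    "yard_b": ["Ford", "BMW"],
    "yard_c": ["Toyota", "BMW", "BMW"],
}

def map_task(cars):
    """One friend counts the brands in a single yard."""
    return Counter(cars)

def reduce_task(partial_counts):
    """Aggregate every friend's partial counts into the final result."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

# The map tasks run in parallel, one per yard.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_task, yards.values()))

# The reduce task runs only after all map tasks have finished.
totals = reduce_task(partials)
print(dict(totals))
```

Each friend returns a partial count here for brevity; in a real job a map would normally emit one (brand, 1) pair per car and leave the adding up to the reduce side.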
[Figure: MapTask and ReduceTask data flow]
1.1 Horizontal view of the figure
There is an input end and an output end, and the output of the MapTasks is the input of the ReduceTasks. The ReduceTasks can only do their aggregation once the MapTasks have finished counting. The two stages are linearly dependent.
1.2 Vertical view of the figure
There are three maps and two reduces. More maps is like Ferritin having more friends: the more friends, the faster the job. There can also be multiple reduces, depending on our needs.
1.3 Process
HDFS -> map -> reduce -> HDFS
Notes:
One split (slice) corresponds to one map. A split is a logical range. By default: one block, one split, one map.
But suppose the unit you need to analyze is 128 MB while the block size is 64 MB. Then two blocks make up one split, and that split feeds one map. Splits are flexible.
A split is read in units of records (by default one line is one record), so a split may contain many records.
Input (formatted as (k, v)) -> map, which maps the input data set to an intermediate (K, V) data set -> reduce
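
A small sketch of the record flow just described, in plain Python rather than the Hadoop API. It assumes one line per record, with the line's byte offset as the input key (similar in spirit to how Hadoop's default text input presents lines); the sample text and the names `read_records` and `map_record` are made up for illustration.

```python
def read_records(text):
    """Yield (byte offset, line) pairs, one record per line."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))

def map_record(key, value):
    """Map one input record (k, v) to an intermediate pair (K, V)."""
    brand = value.split(",")[0]  # the input key (offset) is not needed here
    return brand, 1

# Made-up contents of one split: "brand,color" per line.
split_text = "Toyota,red\nFord,blue\nToyota,white\n"

records = list(read_records(split_text))
intermediate = [map_record(k, v) for k, v in records]

print(records)
print(intermediate)
```

The input key and the intermediate key are different things: the first only locates the record, the second is what the data will later be grouped by.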
(1) Number of maps
This is determined by the splits, and the splits are decided by the characteristics of the data and by how it needs to be computed.
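
As a rough sketch of how the split size drives the number of maps, the arithmetic can be written out in Python. The rule `max(min_size, min(max_size, block_size))` is loosely modeled on the one used by Hadoop's FileInputFormat; this simplified version ignores details such as the small tolerance Hadoop allows on the last split, and the file size and function names are assumptions for illustration.

```python
import math

MB = 1024 * 1024

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Pick the split size from the block size and the configured bounds."""
    return max(min_size, min(max_size, block_size))

def count_splits(file_size, split_size):
    """Number of splits, and therefore maps, for one file (simplified)."""
    return math.ceil(file_size / split_size)

file_size = 256 * MB   # hypothetical input file
block_size = 64 * MB   # block size used in the example above

default_split = compute_split_size(block_size)                     # block : split = 1:1
larger_split = compute_split_size(block_size, min_size=128 * MB)   # block : split = N:1
smaller_split = compute_split_size(block_size, max_size=32 * MB)   # block : split = 1:N

print(count_splits(file_size, default_split))  # 4
print(count_splits(file_size, larger_split))   # 2
print(count_splits(file_size, smaller_split))  # 8
```

Raising the minimum split size above the block size merges blocks into one split (fewer maps); lowering the maximum below the block size cuts a block into several splits (more maps).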
(2) Number of reduces
First example: Ferritin's friends have finished counting, and the totals need to go to Ferritin. Everyone is supposed to hand him their numbers, but Ferritin is lazy and brings in a helper, Cub, to collect numbers for him. Some people give their numbers to Cub and some give them to Ferritin. In the end neither of them holds a complete set of statistics. Looking at our figure: did the reduce do its job? No. So the number of reduces is decided by what the job requires.
Still don't get it? Don't worry.
Second example: we ask Ferritin to count the men and women in a county. Doing the final tally alone would wear him out, so Cub tallies the men and Ferritin tallies the women. Isn't that easier and faster? This takes two reduces. Would three help? No: one takes the men, one takes the women, and the third can only sit empty. If the third did receive data, we would be back to the first example.
The number of maps and reduces is determined by the data being computed. For example, to count how many cars there are in total, a single reduce is enough.
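
The men-and-women example can be sketched as a partitioning step in plain Python. This is an illustration under stated assumptions, not Hadoop's partitioner: `gender_partitioner` is a hypothetical custom rule that sends men to reduce 0 and women to reduce 1, and the records are made up.

```python
def gender_partitioner(key, num_reduces):
    """Hypothetical rule: men go to reduce 0, women to reduce 1."""
    return {"male": 0, "female": 1}[key] % num_reduces

def partition(pairs, num_reduces, partitioner):
    """Route every (key, value) pair to exactly one reduce."""
    buckets = [[] for _ in range(num_reduces)]
    for key, value in pairs:
        buckets[partitioner(key, num_reduces)].append((key, value))
    return buckets

# Made-up map output: one (gender, 1) pair per person.
pairs = [("male", 1), ("female", 1), ("male", 1), ("female", 1), ("female", 1)]

one = partition(pairs, 1, gender_partitioner)    # a single reduce sees everything
two = partition(pairs, 2, gender_partitioner)    # one reduce per gender
three = partition(pairs, 3, gender_partitioner)  # the third reduce stays empty

for n, buckets in ((1, one), (2, two), (3, three)):
    print(n, [len(b) for b in buckets])
```

Because the reduce index depends only on the key, all records for one key always land on the same reduce, which is exactly what the first example failed to guarantee.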
Important to understand:
Records with the "same" key form a group. The reduce method is called once per group, and inside the method you iterate over that group's data to compute the result.
What does that mean?
For example, take one reduce that counts men and women. The two keys are male and female. The reduce method is called once for the whole group of men and once for the whole group of women, one group per call.
With two reduces, if the men go to one of them, the other must not receive any men. Otherwise we would break the rule that reduce is called with the group as its unit.
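
The "one call per group" rule can be sketched in plain Python as follows. It is a simplified model of what happens inside a single reduce, with made-up data; `reduce_fn`, `run_reduce` and the `calls` list (used only to record how often the method is invoked) are hypothetical names.

```python
from itertools import groupby
from operator import itemgetter

calls = []  # records each key that reduce_fn was called with

def reduce_fn(key, values):
    """Called once per group; iterates over that group's values."""
    calls.append(key)
    total = 0
    for v in values:
        total += v
    return key, total

def run_reduce(pairs):
    """Sort by key, group equal keys together, call reduce_fn per group."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        reduce_fn(key, (v for _, v in group))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]

# Made-up input of one reduce: both genders arrive interleaved.
pairs = [("male", 1), ("female", 1), ("male", 1), ("female", 1), ("female", 1)]

result = run_reduce(pairs)
print(result)  # [('female', 3), ('male', 2)]
print(calls)   # ['female', 'male']
```

Five records come in, but the method runs only twice, once per key. Sorting first is what makes equal keys adjacent so that they can be handed over as one group.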
Summary
Block > Split
1:1 default
N:1 enlarged split (several blocks per split)
1:N split cut down to a range of records
Split > Map
1:1 always
Map > Reduce
N:1 one reduce produces a single total
N:N multiple groups of data
1:1
1:N for example, 1000 records that contain four groups of data
Group (Key) > Partition
1:1
N:1
N:N
1:N violates the rule? (a group must not be split across partitions)
Partition > outputfile
Process:

Origin blog.csdn.net/power_k/article/details/92395267