MapReduce: Parsing the Entire Execution Process


 

Foreword

Earlier we covered the MapReduce programming model. We know it completes a task in two stages: in the first stage, map, the data is processed; in the second stage, reduce, the results computed in the map stage are aggregated.

We also wrote the classic WordCount example, which is to MapReduce what HelloWorld is to Java. Today we use that code as the basis for walking through the entire MapReduce execution process.

Let me say up front that this knowledge is very, very important. Of five companies I interviewed with, three asked about this process, and the other two asked about Yarn's operating mechanism, which we will cover later. You should at least know the general flow; if you can work out every detail, so much the better.

From feeding the data into the processing program to storing the finished results, the whole process can be roughly divided into the following five stages:

  • Input Split or Read stage

     Input Split describes this step from the data's side: the input data is sliced and fed to the processing program. Read describes the same step from the program's side: the data is read from files into the handler. This stage expresses where our data comes from, and it is the start of the whole process.

  • Map stage

    When the input data arrives, it is processed in the Map stage. For example, each line is split into words, and each word is emitted with a count of 1.

  • Shuffle stage

    The Shuffle stage is the core of MapReduce; it sits between the Map stage and the Reduce stage. Spark has the same concept, and once you understand it here, it will help you a great deal when you later study the principles of other big-data computing frameworks, since most of them share the same idea. We will explain this process in detail below.

  • Reduce stage

    After the data has been processed by the Map stage and passed through the Shuffle stage, it finally arrives at Reduce, where records with the same key are sent to the same Reduce task for the final aggregation.

  • Output stage

    This stage stores the results computed in the Reduce stage somewhere. This is the end of the process.

     

A flowchart of the entire execution

A picture is worth a thousand words:

 

 

If the image is not clear, I have uploaded a full-size version to GitHub at this address:

(https://raw.githubusercontent.com/heyxyw/bigdata/master/bigdatastudy/doc/img/mapreduce/mr-Implementation-process.png)

 

Of course, if you do not understand it yet or are new to this, the picture may look baffling at first; it did to me too, when I started. Below we break it down piece by piece, and by the end it should mostly make sense.

 

Input Split stage

Input Split, as the name suggests, means slicing the input. Why do we slice the input? Because before the Map computation runs, MapReduce splits the input files: since we want distributed computing, we first have to work out how many pieces the data should be cut into, and then assign a task to process each piece.

Each input split corresponds to one Map task. The split does not store the data itself; it records the slice's length and its position within the data, and it is usually associated with an HDFS Block.

Suppose we set the HDFS block size to 128M and we have three files of 10M, 129M, and 200M. MapReduce splits the 10M file into one slice, the 129M file into two slices, and the 200M file into two slices. That gives five slices in total, requiring five Map tasks, even though the data is unevenly distributed among them.
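As a simplified sketch, the slice counts above come from a ceiling division over the block size. Note this deliberately ignores the 1.1 "slop" factor that Hadoop's real FileInputFormat applies (covered by the links at the end of the article), under which a 129M file would actually become a single split:

```java
// Simplified slice counting: one slice per started 128M block.
// NOTE: real Hadoop FileInputFormat applies a 1.1 slop factor, so a
// 129M file actually yields one split there; this sketch follows the
// simplified picture used in the text.
public class SplitCount {
    static final long BLOCK = 128L * 1024 * 1024; // 128M HDFS block size

    static long slices(long fileSize) {
        // ceiling division: every started block becomes one slice
        return (fileSize + BLOCK - 1) / BLOCK;
    }

    public static void main(String[] args) {
        long m = 1024L * 1024;
        System.out.println(slices(10 * m));   // 10M  -> 1 slice
        System.out.println(slices(129 * m));  // 129M -> 2 slices
        System.out.println(slices(200 * m));  // 200M -> 2 slices
    }
}
```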

If there are many small files, many Map tasks will be created, and processing efficiency will be very low.

This stage uses the InputFormat component, an interface whose default implementation is TextInputFormat, which creates a record reader to read the data.

This is also a very important **MapReduce tuning point**, and I was asked about it in interviews. How do you solve the small-file problem?

  • The best approach: at the front of the data processing pipeline (during pre-ingestion), merge the small files first, and only then upload them to HDFS.

  • The remedy: if HDFS already contains a large number of small files, you can use a different InputFormat component, CombineFileInputFormat, to solve it. It slices differently from TextInputFormat: it logically plans multiple small files into a single slice, so that many small files can be handed to one Map task.
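As a sketch of that remedy, a driver can switch the job to the text-oriented subclass CombineTextInputFormat; the 128M cap here is an illustrative value, not a recommendation:

```java
// Hypothetical driver snippet: pack many small files into fewer splits
// by using CombineTextInputFormat instead of the default TextInputFormat.
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFileJobConfig {
    public static void configure(Job job) {
        job.setInputFormatClass(CombineTextInputFormat.class);
        // no combined split will exceed this size (illustrative: 128M)
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}
```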

     

 Map stage

The output of the Map stage is the input of the Reduce stage, and the process in between is the Shuffle. This is the most important part of all of MapReduce.

MapReduce generally processes huge volumes of data, so the Map output cannot all be held in memory. When we call the context.write() method inside the map function, it invokes the OutputCollector component, which writes the data into an in-memory structure called the ring buffer.
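To make the map-side output concrete, here is a plain-Java simulation (not the Hadoop Mapper API) of what the WordCount map logic emits for one input line:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of the WordCount map step: split one line into
// words and emit a (word, 1) pair for each, the way map() would call
// context.write(word, one) for every token.
public class MapSideSketch {
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            out.add(new SimpleEntry<>(word, 1)); // context.write(word, 1)
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("hello world hello"));
        // [hello=1, world=1, hello=1]
    }
}
```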

The ring buffer is 100M by default, but only 80% of it is written to. When the map starts producing output, a daemon thread is started; once the data reaches 80% of the buffer, the daemon thread begins flushing it, writing the data to disk. This process is called the spill.

As data is written into the ring buffer, it is sorted by key by default, so the data within each partition is in order. The default partitioner is HashPartitioner; of course, we can also customize the partitioning.

Each spill produces a file, and after the map finishes there is a step that merges these files. Partitioning here is actually quite similar to input splitting (Input Split) in the Map stage: one partition corresponds to one Reduce task. If there is only one reduce operation, there is only one partition; if there are multiple reduce operations, there are multiple partitions. The number of partitions is determined by the number of Reduce tasks, which can be set with job.setNumReduceTasks().
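The default partition assignment can be sketched in plain Java; this mirrors the rule Hadoop's HashPartitioner uses (the key's hash, masked non-negative, modulo the number of Reduce tasks):

```java
// Plain-Java sketch of HashPartitioner's choice: all records with the
// same key hash to the same partition, so they reach the same Reduce task.
public class HashPartitionSketch {
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        // the same key always maps to the same reducer
        System.out.println(getPartition("hello", reducers)
                == getPartition("hello", reducers)); // true
    }
}
```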

There is also an optional component, the Combiner. If a Combiner is configured, then when data spills, records with the same key are first added together; its logic is the same as the reduce logic. The precondition is that this merging must not change the business result. The benefit is that far less data with identical keys needs to be transmitted, which improves efficiency.

For example, suppose the spilled data looks like this: <a, 1>, <a, 2>, <c, 4>. By default no Combiner is used and it stays as is. With a Combiner component, the data becomes: <a, 3>, <c, 4>. Records with the same key have been merged.
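The Combiner's effect on those spilled records can be simulated with a sorted map (plain Java, not the Hadoop Combiner API):

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the Combiner's effect on spilled records:
// <a,1>, <a,2>, <c,4>  ->  <a,3>, <c,4>
public class CombinerSketch {
    static Map<String, Integer> combine(String[] keys, int[] counts) {
        Map<String, Integer> merged = new TreeMap<>(); // keys kept sorted
        for (int i = 0; i < keys.length; i++) {
            // sum counts that share a key, like the reduce logic would
            merged.merge(keys[i], counts[i], Integer::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(combine(new String[]{"a", "a", "c"},
                                   new int[]{1, 2, 4}));
        // {a=3, c=4}
    }
}
```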

 

Reduce stage

Before Reduce runs, each Reduce task pulls the partitions it is responsible for to the local machine, and then performs a merge sort to combine them.

The reduce method of the Reduce stage is also logic we implement ourselves, just like the map method in the Map stage, except that when reduce runs, the values for a given key arrive as an iterator over the whole group. In the WordCount example, we iterate over these values and sum them, and finally call the context.write function to output each word with its total count.
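The WordCount reduce logic amounts to summing the values iterator for one key; as a plain-Java sketch (not the Hadoop Reducer API):

```java
import java.util.Arrays;

// Plain-Java simulation of the WordCount reduce method: for one key,
// iterate over its grouped values and sum them, as reduce() does before
// calling context.write(key, total).
public class ReduceSideSketch {
    static int reduce(Iterable<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v; // accumulate the counts for this key
        }
        return total;
    }

    public static void main(String[] args) {
        // values grouped under one key, e.g. "hello", after the shuffle
        System.out.println(reduce(Arrays.asList(1, 1, 1))); // 3
    }
}
```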

 

Output stage

When context.write is called in the reduce function, it invokes the OutputFormat component, whose default implementation is TextOutputFormat, to write the data to the target storage, commonly HDFS.
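TextOutputFormat writes one key-value pair per line, with the key and value separated by a tab by default; as a plain-Java sketch of that line format:

```java
// Plain-Java sketch of the line TextOutputFormat writes for each
// key-value pair: key, a tab separator, then the value.
public class OutputLineSketch {
    static String formatLine(String key, int value) {
        return key + "\t" + value; // default separator is a tab
    }

    public static void main(String[] args) {
        System.out.println(formatLine("hello", 3)); // hello<TAB>3
    }
}
```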

 

Going further

We have just explained the general process above. Here are a few questions to leave you with; they are also frequently asked in interviews.

1. How exactly is a file split into slices? How many slices will a given file end up with? What is the algorithm?

2. How is the number of Map tasks determined?

 

For these questions, here are two links:

MapReduce Input Split explained in detail:

https://blog.csdn.net/dr_guo/article/details/51150278

Source-level analysis of the MapReduce job slicing (Split) process:

https://blog.csdn.net/u010010428/article/details/51469994

Summary

That just about completes our explanation of the MapReduce execution process. I hope you can draw the big picture above yourself, understand the general flow, and grasp the key link: Shuffle. You will hear this word again in other big-data components.

Next we will walk you through Yarn's operating mechanism in rough terms, and then run the whole WordCount process for you.

Stay tuned.

 


 

 


Origin www.cnblogs.com/justdojava/p/11271223.html