A summary of MapReduce program development

MapReduce programming follows a largely fixed pattern. There is little room for customization apart from the following extension points:

 

1. Input data interface: InputFormat ---> FileInputFormat (the general base class for reading file-based data), DBInputFormat (the general class for reading database data)

The default implementation class is TextInputFormat: job.setInputFormatClass(TextInputFormat.class)

What TextInputFormat does: it reads the text one line at a time, returning the starting byte offset of the line as the key and the content of the line as the value
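The offset-as-key, line-as-value behavior described above can be imitated in plain Java. This is only a sketch: the class LineOffsetSketch and the method readRecords are made-up names, it assumes "\n" line endings and UTF-8 text, and it does not use Hadoop's real record reader.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration only; this is not Hadoop's actual record reader.
class LineOffsetSketch {

    // Splits "\n"-separated text into (starting byte offset -> line content) records.
    static Map<Long, String> readRecords(String text) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;   // byte offset of the current line, used as the key
        int start = 0;     // character index where the current line begins
        while (start < text.length()) {
            int end = text.indexOf('\n', start);
            if (end < 0) {
                end = text.length();   // last line without a trailing newline
            }
            String line = text.substring(start, end);
            records.put(offset, line);
            // Advance by the line's UTF-8 byte length plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
            start = end + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        readRecords("hello world\nfoo bar\nbaz")
                .forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

For the three-line input in main, the keys are 0, 12 and 20, which is what a map() method would receive as its input key for each line.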

2. Logic processing interface: Mapper

The user implements map(), setup() and cleanup() as the business logic requires

3. During the shuffle phase, the map output is partitioned and sorted; two interfaces can be customized here:

Partitioner

The default implementation is HashPartitioner, whose logic returns a partition number computed from the key and numReduces: (key.hashCode() & Integer.MAX_VALUE) % numReduces

Usually the default HashPartitioner is sufficient; if the business has special requirements, you can write a custom partitioner
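A minimal plain-Java sketch of the hash partitioning arithmetic follows. The class name HashPartitionSketch is made up for illustration, and the method takes a plain Object rather than Hadoop's key and value types. The parentheses matter: in Java, % binds more tightly than &, so the sign bit must be masked off before taking the remainder.

```java
// Plain-Java illustration of the hash partitioning formula; not the Hadoop class itself.
class HashPartitionSketch {

    static int getPartition(Object key, int numReduces) {
        // Clear the sign bit first, then take the remainder,
        // so the result always falls in [0, numReduces).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("a", 4));   // "a".hashCode() is 97
        System.out.println(getPartition(-7, 3));    // Integer -7 hashes to -7
        System.out.println(-7 % 3);                 // plain remainder, for contrast
    }
}
```

With a negative hash code such as -7 and 3 reduce tasks, the masked form yields partition 1, while the plain remainder -7 % 3 is -1, which is not a valid partition number.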

Comparable

When a user-defined object is used as the output key, it must implement the WritableComparable interface and override its compareTo() method
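As a rough sketch of what such a key class looks like: the class OrderKey and its fields orderId and amount are hypothetical. To compile without Hadoop on the classpath, the sketch implements only java.lang.Comparable; in a real job the class would be declared as implementing WritableComparable<OrderKey>, which requires the same write(DataOutput), readFields(DataInput) and compareTo() methods.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical key type. In a real job: "implements WritableComparable<OrderKey>".
class OrderKey implements Comparable<OrderKey> {
    String orderId = "";
    double amount;

    // A no-argument constructor is kept so the framework could instantiate the key.
    OrderKey() {}

    OrderKey(String orderId, double amount) {
        this.orderId = orderId;
        this.amount = amount;
    }

    // Serialization: fields are written in a fixed order...
    public void write(DataOutput out) throws IOException {
        out.writeUTF(orderId);
        out.writeDouble(amount);
    }

    // ...and must be read back in exactly the same order.
    public void readFields(DataInput in) throws IOException {
        orderId = in.readUTF();
        amount = in.readDouble();
    }

    // Sort order used for keys: by orderId, then by amount ascending.
    @Override
    public int compareTo(OrderKey other) {
        int byId = orderId.compareTo(other.orderId);
        return byId != 0 ? byId : Double.compare(amount, other.amount);
    }
}
```

The ordering defined in compareTo() is the one the sort step applies to map output keys, so changing it changes the order in which records reach the reducer.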

 

4. Reduce-side grouping comparison interface: GroupingComparator

After a reduce task receives its input data (all the data of one partition), it first groups the data. By default, records with the same key form one group. reduce() is then called once per group of key-value pairs: the key of the first pair in the group is passed as the key parameter, and an iterator over the values of the group is passed as the values parameter

Using this mechanism, we can efficiently implement "maximum value per group" logic:

Wrap the data in a custom bean and override its compareTo() method so that, within the same business id, records sort in descending order of value

Then write a custom GroupingComparator that groups the bean objects by business id (for example, the order number)

This way, the maximum value we want is the key passed into the reduce() method
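The trick above can be simulated in plain Java without a cluster. Everything here is an illustrative assumption: the names GroupMaxSketch, OrderBean, GROUP_BY_ID and maxPerGroup are made up, Collections.sort stands in for the shuffle-phase sort, and an ordinary Comparator stands in for the grouping comparator (in a real job, a custom comparator would be registered with job.setGroupingComparatorClass).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of "sort descending, group by id, take the first key".
class GroupMaxSketch {

    static class OrderBean implements Comparable<OrderBean> {
        final String orderId;
        final double amount;

        OrderBean(String orderId, double amount) {
            this.orderId = orderId;
            this.amount = amount;
        }

        // Same orderId stays adjacent; within one orderId, larger amounts come first.
        @Override
        public int compareTo(OrderBean other) {
            int byId = orderId.compareTo(other.orderId);
            return byId != 0 ? byId : Double.compare(other.amount, amount);
        }
    }

    // Stand-in for the grouping comparator: beans with equal orderId form one group.
    static final Comparator<OrderBean> GROUP_BY_ID =
            (a, b) -> a.orderId.compareTo(b.orderId);

    // Returns the first key of each group, i.e. what reduce() would receive as its key.
    static List<OrderBean> maxPerGroup(List<OrderBean> mapOutput) {
        List<OrderBean> sorted = new ArrayList<>(mapOutput);
        Collections.sort(sorted);   // stands in for the shuffle-phase sort
        List<OrderBean> firstKeys = new ArrayList<>();
        OrderBean previous = null;
        for (OrderBean current : sorted) {
            // A new group starts whenever the grouping comparator sees a different id.
            if (previous == null || GROUP_BY_ID.compare(previous, current) != 0) {
                firstKeys.add(current);
            }
            previous = current;
        }
        return firstKeys;
    }
}
```

Because the sort already places the largest amount first within each order id, the reducer does not need to scan the values to find the maximum; it only reads the key it was handed.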

 

5. Logic processing interface: Reducer

The user implements reduce(), setup() and cleanup() as the business logic requires

 

6. Output data interface: OutputFormat ---> a family of subclasses: FileOutputFormat, DBOutputFormat, ...

The default implementation class is TextOutputFormat, which writes each key-value pair as one line of the target text file
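The one-pair-per-line layout can be pictured with a small plain-Java sketch. The class TextLineOutputSketch and the method formatRecords are made-up names, the key and value types are chosen only for the example, and the tab between key and value reflects the default separator; none of this is the Hadoop class itself.

```java
import java.util.Map;

// Plain-Java illustration of "one key-value pair per text line".
class TextLineOutputSketch {

    // Renders each pair as: key, a tab character, value, then a newline.
    static String formatRecords(Map<String, Integer> records) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> e : records.entrySet()) {
            out.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        }
        return out.toString();
    }
}
```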

 


Origin blog.csdn.net/zengmingen/article/details/104583777