MapReduce overview:
Core idea:
Divide and conquer: split first, then combine
Scenario:
Complex tasks whose subtasks do not depend on each other, processed in parallel for efficiency
How the idea shows in the process:
Map first, then reduce
map: split the complex task into subtasks, compute each one, and produce partial (local) results
reduce: globally aggregate the partial results of the map phase into the final result
MapReduce design ideas:
How to process big data?
Split first, then combine: divide and conquer
Abstract model of two functions:
Both the input and the output are key-value (kv) pairs
map:
split the complex task into subtasks, compute each one, and produce partial (local) results
reduce:
globally aggregate the partial results of the map phase into the final result
Separating "how to do it" from "what to do":
The framework is responsible for "how to do it" (the technical complexity)
The user is responsible for "what to do" (the business logic)
The two combined make a complete MR program
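The abstract two-function model above can be sketched in plain Java. This is a toy in-memory illustration of the idea only, not the Hadoop API: map turns one sub-task (a line) into local (word, 1) pairs, and reduce globally aggregates all pairs per key.

```java
import java.util.*;

// Toy in-memory illustration of the map/reduce model (not the Hadoop API):
// map emits local (word, 1) pairs; reduce aggregates all pairs per key globally.
public class MapReduceSketch {
    // map: split one line (a sub-task) into local (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // reduce: globally sum the partial results per key
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : new String[] {"hello world", "hello mapreduce"}) {
            all.addAll(map(line)); // each line could be mapped in parallel
        }
        System.out.println(reduce(all)); // {hello=2, mapreduce=1, world=1}
    }
}
```

Note that each map() call is independent of the others, which is exactly why the map phase parallelizes.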
MapReduce programming framework and specifications:
Code level:
A class extends Mapper and overrides map() ----- business logic of the map phase
A class extends Reducer and overrides reduce() ----- business logic of the reduce phase
A client-side main class (main) ----- sets the MR-related properties and submits the program
The three above are packaged into one jar
Runtime level:
MapTask: a running task of the map phase
ReduceTask: a running task of the reduce phase
MapReduceApplicationMaster (MrAppMaster): the master of a running MR program; monitors every task and negotiates resources with YARN
Case study: WordCount:
Development-environment version issues:
Local execution environment: Apache 2.7.4 vs. CDH 2.6.0
Data types and serialization mechanism:
Writable (interface): Java's built-in serialization is considered too bloated and ill-suited to transferring large volumes of data over the network
Key point (the MR execution flow):
Serialization mechanism:
Concept of the serialization mechanism:
Turning data into a byte stream for inter-process network transfer
Writable:
Serialization method: write(out)
Deserialization method: readFields(in)
Note: whatever is serialized first must be deserialized first (same field order)
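The write/readFields pattern can be sketched with plain java.io streams. The Point bean here is hypothetical (not a real Hadoop class); the point is that readFields must read the fields back in exactly the order write wrote them.

```java
import java.io.*;

// Sketch of the Writable pattern using plain java.io streams (hypothetical
// Point bean): fields must be read back in exactly the order they were written.
public class WritableSketch {
    static class Point {
        int x; String label;
        void write(DataOutput out) throws IOException { // serialization
            out.writeInt(x);
            out.writeUTF(label);
        }
        void readFields(DataInput in) throws IOException { // deserialization, same order
            x = in.readInt();
            label = in.readUTF();
        }
    }

    static Point roundTrip(Point p) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            p.write(new DataOutputStream(buf));            // serialize first
            Point q = new Point();
            q.readFields(new DataInputStream(              // then deserialize
                new ByteArrayInputStream(buf.toByteArray())));
            return q;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Point p = new Point(); p.x = 7; p.label = "demo";
        Point q = roundTrip(p);
        System.out.println(q.x + " " + q.label); // 7 demo
    }
}
```

Swapping the two reads in readFields would corrupt the data, which is what the "serialize first, deserialize first" note warns about.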
Custom Sort:
Essence (compareTo):
0: equal
positive: greater than
negative: less than
Note: the sign tells which operand is the larger one
Reverse (descending) order:
Trick: flip the sign: greater ---> negative, less ---> positive
The key object implements the interface:
Comparable | WritableComparable<Bean>
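The compareTo contract and the sign-flip trick can be sketched with a plain Comparable bean (the Bean class and its amount field are hypothetical; Hadoop's WritableComparable defines the same compareTo contract):

```java
import java.util.*;

// Sketch of the compareTo contract and the sign-flip trick for descending order.
public class SortSketch {
    static class Bean implements Comparable<Bean> {
        long amount;
        Bean(long amount) { this.amount = amount; }
        // Descending order: flip the sign of the natural comparison,
        // so the "greater" object returns negative and sorts first.
        public int compareTo(Bean other) {
            return -Long.compare(this.amount, other.amount);
        }
    }

    public static void main(String[] args) {
        List<Bean> beans = new ArrayList<>();
        beans.add(new Bean(10)); beans.add(new Bean(30)); beans.add(new Bean(20));
        Collections.sort(beans); // uses compareTo
        for (Bean b : beans) System.out.print(b.amount + " "); // 30 20 10
    }
}
```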
Custom Partitioning:
Definition of a partition:
Determines which ReduceTask a key-value pair output by map is sent to
Default partition rule:
HashPartitioner: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
Implementing a custom partitioner:
Extend Partitioner and override getPartition(); the return value is the partition number
Making the custom partitioner take effect:
job.setPartitionerClass()
Relationship between the number of partitions and the number of ReduceTasks:
They should be equal
More partitions than ReduceTasks: "illegal partition" error
Fewer partitions than ReduceTasks: some ReduceTasks do no work and produce empty output files
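The default rule above can be sketched as a standalone function (the sign-bit mask mirrors what Hadoop's HashPartitioner does so that negative hashCodes still yield a valid partition index):

```java
// Sketch of the default HashPartitioner rule: mask the sign bit, then take
// the remainder modulo the number of ReduceTasks.
public class PartitionSketch {
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        for (String key : new String[] {"hadoop", "hive", "spark"}) {
            System.out.println(key + " -> reduceTask " + getPartition(key, numReduceTasks));
        }
    }
}
```

Because the rule is deterministic, every occurrence of the same key always lands on the same ReduceTask, which is what makes the reduce-side aggregation correct.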
Combiner (local aggregation):
A component that partially aggregates the output of each map before the shuffle
Optimizes network IO
It is essentially a reduce, only with local scope instead of global scope
Disabled by default
Caution: use carefully, because it changes the number (and possibly the order) of records and can change the final result.
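The caution can be made concrete with a small standalone example (the values standing in for two map tasks' outputs are hypothetical): pre-aggregating is safe for a sum, but averaging per-map averages gives a different answer than averaging all the raw values.

```java
import java.util.*;

// Why a Combiner must be used with care: pre-aggregation is safe for sum
// (associative), but it changes the result for average.
public class CombinerSketch {
    // Correct: average over all raw values from both map tasks.
    static double averageOfAll(List<Integer> a, List<Integer> b) {
        int sum = 0, n = 0;
        for (int v : a) { sum += v; n++; }
        for (int v : b) { sum += v; n++; }
        return (double) sum / n;
    }

    // Wrong: averaging the per-map averages loses the per-map counts,
    // which is what a naive averaging Combiner would do.
    static double averageOfAverages(List<Integer> a, List<Integer> b) {
        double avgA = a.stream().mapToInt(Integer::intValue).average().orElse(0);
        double avgB = b.stream().mapToInt(Integer::intValue).average().orElse(0);
        return (avgA + avgB) / 2;
    }

    public static void main(String[] args) {
        List<Integer> mapA = Arrays.asList(1, 2, 3); // avg 2.0
        List<Integer> mapB = Arrays.asList(10);      // avg 10.0
        System.out.println(averageOfAll(mapA, mapB));      // 4.0
        System.out.println(averageOfAverages(mapA, mapB)); // 6.0 (wrong)
    }
}
```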
Parallelism mechanism:
Concept: parallelism means several tasks running at the same time
MapTask parallelism (logical input splits): determined by the number of files and the split size relative to file size
ReduceTask parallelism: set in code; anything that needs a single global aggregate should use it with caution
shuffle mechanism:
Concept: the whole process
from the moment map starts producing output until reduce accepts the data as its input
It sits between map and reduce and crosses the network; it is the core of an MR program and the main reason it is slow.
Data compression:
Goal: reduce the amount of network traffic and the disk space occupied by the final output
Compression points:
Compress the map output (less network transfer)
Compress the reduce output (less disk space occupied)
Compression algorithms:
Recommended: snappy
Which codecs Hadoop supports depends on how it was compiled
Check native-library support with: hadoop checknative
Best paired with a Hadoop build compiled with the compression codecs you need.
Ways to enable compression:
Directly in the driver program via conf.set() ----- affects only this MR job
Modify the mapred-site.xml configuration file ----- globally effective
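The per-job conf.set() route might look like the following config fragment (property names from Hadoop 2.x; it assumes snappy was compiled into the Hadoop build):

```java
// Per-job compression set in the driver (Hadoop 2.x property names).
Configuration conf = new Configuration();
// Compress map output (cuts shuffle network traffic):
conf.set("mapreduce.map.output.compress", "true");
conf.set("mapreduce.map.output.compress.codec",
         "org.apache.hadoop.io.compress.SnappyCodec");
// Compress final reduce output (cuts disk usage):
conf.set("mapreduce.output.fileoutputformat.compress", "true");
conf.set("mapreduce.output.fileoutputformat.compress.codec",
         "org.apache.hadoop.io.compress.SnappyCodec");
```

Setting the same property names in mapred-site.xml instead makes them the cluster-wide defaults.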
Tuning parameters:
Cover resources, fault tolerance, stability, etc. ------ see the xxx-default.xml files in the official Hadoop docs (watch out for Deprecated Properties)
Joins between files --- (like joins between tables in Hive, chosen by table size)
reduce join: use the join field as the key, so all records with the same value are sent to the same reduce for processing
Drawback: heavy pressure on the reduce side; data skew may occur
map join: complete the join in the map phase
A map join has no reduce phase (numReduceTasks = 0)
so its output files are named part-m-00000
Distributed cache:
Can ship specified files (archives, jars) to every MapTask of the current program
setup() initialization method:
Loads the cached small file into the memory of the running MapTask
Stores the small file's records in data structures of whatever type the job needs
Handling the small-files scenario:
Default split mechanism: one small file -> one split ----> one split -> one MapTask
CombineTextInputFormat: a split mechanism
that packs multiple small files into one split
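The map-join-via-cache pattern can be sketched without the Hadoop API (the product/order data and field layout are hypothetical): the small table is loaded into memory once, standing in for setup() reading the distributed-cache file, then every big-table record is joined inside map() with no reduce phase at all.

```java
import java.util.*;

// Sketch of a map-side join: a small table is held in memory (the role of
// setup() plus the distributed cache), and each big-table record is joined
// in map() with no reduce phase.
public class MapJoinSketch {
    static Map<String, String> productNames = new HashMap<>(); // small-table cache

    static void setup() { // stands in for loading the cached small file
        productNames.put("p1", "apple");
        productNames.put("p2", "banana");
    }

    // orderLine format (assumed): "orderId,productId"
    static String map(String orderLine) {
        String[] f = orderLine.split(",");
        return f[0] + "," + productNames.getOrDefault(f[1], "UNKNOWN");
    }

    public static void main(String[] args) {
        setup();
        for (String line : new String[] {"o1,p1", "o2,p2"}) {
            System.out.println(map(line)); // o1,apple / o2,banana
        }
    }
}
```

Because the join happens entirely map-side, no join key is ever shuffled, which is why this pattern avoids the data skew of a reduce join.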
Custom grouping:
Where it takes effect:
Before reduce() is called
Default grouping:
Adjacent rows of the sorted data are compared by key for equality (equal or not)
Custom object as the key:
The grouping class extends WritableComparator; note: the bean implements WritableComparable<OrderBean> for sorting
It is used to group by key
It runs in the ReduceTask; the default grouping behavior comes from the grouping comparator, and it can be customized
WritableComparator is the basis for secondary sort (by inheriting it), to meet different business needs
For example a GroupingComparator (
grouping comparator
) works in the ReduceTask on data written to disk and sorted by key: it decides whether the next key is "the same", and all values whose keys belong to the same group are passed to one reduce() call
Making custom grouping take effect:
job.setGroupingComparatorClass(OrderGroupingComparator.class);
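What a grouping comparator does can be sketched without the Hadoop API (the Order bean and its fields are hypothetical): over records already sorted by (orderId, price desc), the group comparator looks only at orderId, so consecutive records with the same orderId form one group, which is what a single reduce() call would receive.

```java
import java.util.*;

// Sketch of GroupingComparator behavior: walk sorted records, comparing only
// the grouping field; each run of "equal" records is one reduce() group.
public class GroupingSketch {
    static class Order {
        String orderId; double price;
        Order(String id, double p) { orderId = id; price = p; }
    }

    // Group comparator: records are in the same group iff orderId is equal.
    static int groupCompare(Order a, Order b) {
        return a.orderId.compareTo(b.orderId);
    }

    // Walk the sorted list; start a new group whenever groupCompare != 0.
    static List<List<Order>> group(List<Order> sorted) {
        List<List<Order>> groups = new ArrayList<>();
        for (Order o : sorted) {
            if (groups.isEmpty()
                    || groupCompare(groups.get(groups.size() - 1).get(0), o) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(o);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Order> sorted = Arrays.asList(
            new Order("o1", 30.0), new Order("o1", 10.0), new Order("o2", 25.0));
        System.out.println(group(sorted).size()); // 2 groups: o1 and o2
    }
}
```

This is the basis of the classic secondary-sort pattern: sort by (orderId, price desc), group by orderId only, and the first value in each group is that order's maximum price.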