MapReduce summary

MapReduce idea:

    Core:

        Divide and conquer: split first, then combine

    Scenario:

        Complex tasks whose subtasks do not depend on each other, so parallel processing improves efficiency

    How the idea shows up:

        Map first, then reduce
                map: split the complex task into subtasks and compute partial results locally
                reduce: aggregate the partial map results globally into the final result

MapReduce design ideas:

    How to process big data?

            Split first, then merge: divide and conquer

    Abstract model of two functions:

            Both take key-value (kv) pairs as input and produce kv pairs as output
            map: split the complex task into subtasks and compute partial results locally
            reduce: aggregate the partial map results globally into the final result

    Splitting "how to do it" from "what to do":

            The framework is responsible for the "how" (technical complexity)
            Users are responsible for the "what" (business logic)
            Merging the two yields a complete MR program
 

MapReduce programming framework and specification:

Code level:

    A Mapper subclass overrides map() ----- business logic of the map phase
    A Reducer subclass overrides reduce() ---- business logic of the reduce phase
    A client main class (main) ----- specifies MR-related attributes and submits the program
    The three above are packaged together as a jar
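
    A minimal sketch of this three-part specification, using the classic WordCount; the class names, input format, and whitespace tokenization are illustrative choices, not mandated by these notes:

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {

            // map phase: split each line into words and emit <word, 1>
            public static class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
                private static final IntWritable ONE = new IntWritable(1);
                private final Text word = new Text();

                @Override
                protected void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    for (String w : value.toString().split("\\s+")) {
                        if (!w.isEmpty()) {
                            word.set(w);
                            context.write(word, ONE);
                        }
                    }
                }
            }

            // reduce phase: sum the partial counts for each word
            public static class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
                @Override
                protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable v : values) {
                        sum += v.get();
                    }
                    context.write(key, new IntWritable(sum));
                }
            }

            // client main class: specify MR-related attributes and submit the program
            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "word count");
                job.setJarByClass(WordCount.class);
                job.setMapperClass(WcMapper.class);
                job.setReducerClass(WcReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }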
 

Run angle:

    MapTask: the running task of the map phase
    ReduceTask: the running task of the reduce phase
    MapReduceApplicationMaster (MrAppMaster): the master of the running program; monitors each task of the MR program and negotiates resources with YARN
 

Case WordCount:

    Development environment version issues:
        e.g. Apache Hadoop 2.7.4 for the local execution environment vs. CDH 2.6.0 on the cluster
    Data types and serialization mechanism:
        Writable (interface); Java's built-in serialization is considered too bloated and not well suited to transferring big data over the network

Key point (MR execution flow):

 

Serialization mechanism:

    Concept:
        Turning data into a byte stream so it can be transferred between processes over the network
    Writable:
        Serialization method: write(out)
        Deserialization method: readFields(in)
    Note: fields must be deserialized in the same order they were serialized
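
    A minimal custom Writable sketch; the FlowBean name and its two fields are hypothetical examples, not from these notes:

        import java.io.DataInput;
        import java.io.DataOutput;
        import java.io.IOException;

        import org.apache.hadoop.io.Writable;

        public class FlowBean implements Writable {
            private long upFlow;
            private long downFlow;

            public FlowBean() {}  // no-arg constructor needed for reflective instantiation

            // serialization: write the fields to the byte stream
            @Override
            public void write(DataOutput out) throws IOException {
                out.writeLong(upFlow);    // written first ...
                out.writeLong(downFlow);
            }

            // deserialization: read the fields back in exactly the same order
            @Override
            public void readFields(DataInput in) throws IOException {
                upFlow = in.readLong();   // ... so it must be read first
                downFlow = in.readLong();
            }
        }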
 

Custom sort:

    Essence (compareTo):

            0: equal
            positive: greater than
            negative: less than
    Note: whichever side the positive return value favors is treated as the larger one

    Descending order:

            Trick: "deceive" the framework: return a negative value where the larger element would normally return a positive one, and vice versa

    The key object implements the interface:

            Comparable | WritableComparable<Bean>
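
    A sketch of the descending-order trick, assuming a hypothetical SortBean key with a single total field; inverting the usual sign convention in compareTo() makes the framework sort large values first:

        import java.io.DataInput;
        import java.io.DataOutput;
        import java.io.IOException;

        import org.apache.hadoop.io.WritableComparable;

        public class SortBean implements WritableComparable<SortBean> {
            private long total;

            public SortBean() {}

            // inverted sign convention: the larger total compares as "smaller",
            // so the sort comes out descending
            @Override
            public int compareTo(SortBean other) {
                return Long.compare(other.total, this.total);
            }

            @Override
            public void write(DataOutput out) throws IOException {
                out.writeLong(total);
            }

            @Override
            public void readFields(DataInput in) throws IOException {
                total = in.readLong();
            }
        }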
 

Custom partitioning:

    Partition definition:
        Determines which ReduceTask each key-value pair output by map is sent to
    Default partition rule:
        HashPartitioner ((key.hashCode() & Integer.MAX_VALUE) % numReduceTasks)
    Implementing a custom partitioner (see the sketch after this list):
        Inherit the Partitioner class and override getPartition(); its return value is the partition number
    Making the custom partitioner take effect:
        job.setPartitionerClass()
    Relationship between the number of partitions and the number of ReduceTasks:
        They should be kept equal
        If there are more partitions than ReduceTasks: "Illegal partition" error
        If there are fewer partitions than ReduceTasks: extra empty output files are produced
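
    A custom partitioner sketch, assuming WordCount-style <Text, IntWritable> map output and a hypothetical rule (words starting with "a" go to partition 0, everything else to partition 1):

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Partitioner;

        public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
            // the return value is the partition (ReduceTask) number
            @Override
            public int getPartition(Text key, IntWritable value, int numReduceTasks) {
                return key.toString().startsWith("a") ? 0 : 1;
            }
        }

    In the driver: job.setPartitionerClass(FirstLetterPartitioner.class); plus job.setNumReduceTasks(2); to keep the ReduceTask count equal to the partition count.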
 

Combiner (local aggregation):

        A component that performs a first, local aggregation on each map's output
        Optimizes network IO
        It is essentially a reduce, only over a local range rather than the global one
        Not enabled by default
  Note: use with caution, because it changes the number (and order) of records flowing into reduce and can change the final result (summing is safe; averaging is not).
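
    Enabling a combiner in the driver is a one-liner; this sketch reuses the WcReducer from the WordCount example above, which is safe because summation is associative:

        job.setCombinerClass(WcReducer.class);  // local sum on each map's output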
 

Parallelism mechanism:

    Concept: so-called parallelism means multiple tasks running at the same time
    MapTask parallelism (logical split mechanism): determined by the number of input files, their sizes, and the split size they are cut into
    ReduceTask parallelism: set in code; it interacts with global counting and partitioning, so use with caution
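
    ReduceTask parallelism is set in the driver; a single ReduceTask is the simple way to get one globally aggregated output file:

        job.setNumReduceTasks(1);  // one reducer -> one global result file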
 

Shuffle mechanism:

    Concept: the process that runs from the moment map outputs its data until reduce accepts that data as input
    It spans the map and reduce stages and crosses the network; it is the core of an MR program and the main reason MR is slow.
 

Data compression:

    Purpose of compression: reduce the amount of network traffic and the disk space the final output occupies
    Compression points:
           Compressing map output (reduces network transfer)
           Compressing reduce output (reduces disk space occupied)
    Compression algorithms:
            Recommended: snappy
            Which codecs Hadoop supports depends on how it was compiled
                      Check native library support with: hadoop checknative
                    Ideally, use a Hadoop build compiled with support for the compression algorithms you need.

    Compression settings:
            Directly in the program via conf.set() ----- applies only to this MR job
            In the mapred-site.xml configuration file ----- globally valid
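
    A sketch of the per-job route via conf.set(), using the standard Hadoop 2.x property names and the Snappy codec:

        import org.apache.hadoop.conf.Configuration;

        // in the driver, before the Job is created from this Configuration
        Configuration conf = new Configuration();

        // compress map output (cuts shuffle network traffic)
        conf.set("mapreduce.map.output.compress", "true");
        conf.set("mapreduce.map.output.compress.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");

        // compress final reduce output (cuts disk usage)
        conf.set("mapreduce.output.fileoutputformat.compress", "true");
        conf.set("mapreduce.output.fileoutputformat.compress.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");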
 

Optimization parameters:

       Cover resources, fault tolerance, stability, etc. ------ see the official Hadoop *-default.xml references (and watch out for Deprecated Properties)
        

Join operations between a big file and a small file --- (like joins between a big table and a small table in Hive):

        reduce join: send all data to reduce with the join field as the key, so records with the same key are processed together
                Drawback: heavy pressure on reduce; data skew may occur
        map join: complete the join in the map phase (see the sketch after this list)
            No reduce phase is used (job.setNumReduceTasks(0)); output files are named part-m-00000, etc.
                    Distributed cache:
                            Ships specified files (or archives/jars) to every MapTask of the current program
                    setup() initialization method:
                            Loads the small cached file into the memory of the currently running MapTask
                            Stores the small file's records in a data structure suited to the data (e.g. a HashMap)
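
    A map-side join sketch: the small file is shipped through the distributed cache, loaded into a HashMap in setup(), and each big-file record is joined in map() with no reduce phase. The file name small.txt, the comma-separated fields, and the join layout are all illustrative assumptions:

        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.io.IOException;
        import java.util.HashMap;
        import java.util.Map;

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
            private final Map<String, String> smallTable = new HashMap<>();

            // load the cached small file into memory once per MapTask
            @Override
            protected void setup(Context context) throws IOException {
                // the file was added in the driver with job.addCacheFile(uri) and is
                // available in the task's working directory under its own name
                try (BufferedReader reader = new BufferedReader(new FileReader("small.txt"))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        String[] fields = line.split(",");
                        smallTable.put(fields[0], fields[1]);  // joinKey -> value
                    }
                }
            }

            // join each big-file record against the in-memory small table
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(",");
                String joined = value + "," + smallTable.getOrDefault(fields[0], "NULL");
                context.write(new Text(joined), NullWritable.get());
            }
        }

        // In the driver:
        //   job.addCacheFile(new URI("/cache/small.txt"));
        //   job.setNumReduceTasks(0);  // map-only job, output part-m-00000...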
 

    Handling the many-small-files scenario:

           Default split mechanism: one small file ----> one split ----> one MapTask
           CombineTextInputFormat: an alternative split mechanism that packs many small files into one split
           Use it when the input consists of many small files (see the sketch below)
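
    A driver sketch for CombineTextInputFormat; the 4 MB maximum split size is an illustrative value:

        import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);  // 4 MB per split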
 

Custom grouping:

    Stage where it takes effect:
        Just before each call to the reduce() method
    Default grouping:
        Adjacent rows of the sorted data are compared pairwise by key (equal or not equal)
    Custom object as a key:
        Inherit the WritableComparator class for grouping; note that the key itself still implements the WritableComparable<OrderBean> interface for sorting
                It is used to group by key
                It runs in the ReduceTask; the default is the GroupingComparator, which can also be customized
                WritableComparator provides the basis for secondary sort (by inheriting it) to meet different business needs
                For example, the GroupingComparator (grouping comparator) takes the key-sorted data spilled to disk in the ReduceTask and decides whether each key and the next are "the same"; keys of the same group are passed to one reduce() call

      Making the custom grouping take effect:

            job.setGroupingComparatorClass(OrderGroupingComparator.class);
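
      A sketch of such a grouping comparator, assuming a hypothetical OrderBean key with a getOrderId() accessor: rows with the same orderId land in one reduce() call even when the sorting compareTo() also looks at other fields:

            import org.apache.hadoop.io.WritableComparable;
            import org.apache.hadoop.io.WritableComparator;

            public class OrderGroupingComparator extends WritableComparator {

                public OrderGroupingComparator() {
                    super(OrderBean.class, true);  // true: create key instances to compare
                }

                // 0 means "same group": compare only the grouping field, not the full key
                @Override
                public int compare(WritableComparable a, WritableComparable b) {
                    OrderBean left = (OrderBean) a;
                    OrderBean right = (OrderBean) b;
                    return left.getOrderId().compareTo(right.getOrderId());
                }
            }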

 


