MapReduce distributed computing framework


    MapReduce is one of the core components of the Hadoop system. It is a computing model, framework, and platform for parallel processing of big data: it is designed for computation over massive data sets and is one of the most widely used distributed computing models.

1. The core idea of MapReduce: divide and conquer

Steps when using MapReduce to process massive data:

  • Each MapReduce program is initialized as a job
  • Each job is divided into two stages: Map and Reduce
    • Map stage: responsible for decomposing the task, that is, breaking the overall task into several simple subtasks that are processed in parallel; the premise is that these subtasks have no necessary dependencies on each other and can be executed independently.
    • Reduce stage: responsible for merging results, that is, globally summarizing the output of the Map stage.
2. The MapReduce programming model: used for parallel operations on large-scale data sets

    Programming is done by implementing the map() and reduce() functions.
    In terms of data format, the map() function receives a key-value pair and also outputs key-value pairs; the reduce() function takes the key-value pairs output by map() as its input, aggregates the values that share the same key, and outputs new key-value pairs. (The whole process is a sequence of key-value pair conversions; a skeleton of the two functions is sketched below.)
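In Hadoop's Java API (the org.apache.hadoop.mapreduce package) this pair of functions is written by subclassing Mapper and Reducer. A minimal skeleton, with class names and concrete types chosen only for illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper<K1, V1, K2, V2>: turns one input record into intermediate key-value pairs
    class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("k2"), value);          // emit <K2, V2>
        }
    }

    // Reducer<K2, V2, K3, V3>: receives <K2, {V2, ...}> and emits the final pairs
    class MyReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text v : values) {
                context.write(key, v);                     // emit <K3, V3>
            }
        }
    }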

A simple MapReduce data model:

  • Parse the raw data into key-value pairs <K1, V1>
  • Pass the parsed key-value pairs <K1, V1> to the map() function, which maps each <K1, V1> into a series of intermediate key-value pairs <K2, V2> according to the mapping rules
  • The intermediate key-value pairs <K2, V2> are grouped into the form <K2, {V2, …}> and passed to the reduce() function, which merges the values that share the same key and generates new key-value pairs <K3, V3>; these <K3, V3> pairs are the final output, as in the example below
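As a concrete illustration (the sample input here is made up), a word-frequency job whose input is the single line "hello world hello" goes through the following conversions:

    <K1, V1>         : <0, "hello world hello">                  (byte offset, line content)
    map() output     : <"hello", 1>, <"world", 1>, <"hello", 1>  (intermediate <K2, V2>)
    grouped input    : <"hello", {1, 1}>, <"world", {1}>         (<K2, {V2, ...}>)
    reduce() output  : <"hello", 2>, <"world", 1>                (final <K3, V3>)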

Note: a job can have a Map phase without a Reduce phase, in which case the data produced by the Map phase is written directly to HDFS; conversely, if the task is complex, there can be multiple Reduce tasks.
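In code this is controlled through the number of reduce tasks; a short usage sketch (the job variable is assumed to be an already-configured org.apache.hadoop.mapreduce.Job):

    job.setNumReduceTasks(0);    // map-only job: Map output is written directly to HDFS
    // job.setNumReduceTasks(4); // or several ReduceTasks when the job needs them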

3. The main components of the MapReduce architecture
  • Client
  • JobTracker
  • TaskTracker
  • Task


4. MapReduce running modes
  • Local running mode: simulates the MapReduce execution environment inside the current development environment; the data to be processed and the output results are on the local file system (no cluster needs to be built)
  • Cluster running mode: the MapReduce program is packaged as a jar and submitted to a YARN cluster, which is responsible for resource management and task scheduling; the program is distributed to the nodes of the cluster and executed concurrently, so the data to be processed and the output results are in the HDFS file system (a submission command is sketched below)
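For the cluster mode, the packaged jar is typically submitted with the hadoop jar command; in the sketch below the jar name, driver class, and HDFS paths are placeholders:

    hadoop jar wordcount.jar WordCount /input/words /output/wordcount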
5. MapReduce programming example: word frequency statistics

(Figure: schematic diagram of word frequency statistics)
(Figure: specific steps)
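A minimal WordCount sketch against the org.apache.hadoop.mapreduce API (the package-less class names and the reuse of the Reducer as Combiner are choices made here for brevity):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map stage: split each line into words and emit <word, 1>
        public static class WordCountMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String w : value.toString().split("\\s+")) {
                    if (!w.isEmpty()) {
                        word.set(w);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce stage: sum the counts of each word
        public static class WordCountReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class);   // optional map-side aggregation
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

After packaging, it can be run in cluster mode with the hadoop jar command shown above; note that the output directory must not already exist.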

6. Serialization of MapReduce basic data types

    Data types used in MapReduce must implement the Writable interface so that values of these types can be serialized for network transmission and file storage; this serialization is more compact and efficient than Java's built-in serialization.

  • Boolean => BooleanWritable
  • Byte => ByteWritable
  • Double => DoubleWritable
  • Float => FloatWritable
  • Integer => IntWritable
  • Long => LongWritable
  • String => Text
  • Null => NullWritable

Note: custom serializable types (classes that implement Writable) can also be used in an application, as sketched below.
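A minimal custom Writable sketch (the class and field names are invented for illustration); write() and readFields() must handle the fields in the same order:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class WordStat implements Writable {
        private long count;
        private double ratio;

        public WordStat() { }                      // Hadoop requires a no-argument constructor

        public void set(long count, double ratio) {
            this.count = count;
            this.ratio = ratio;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(count);                  // serialize fields for network transfer / storage
            out.writeDouble(ratio);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            count = in.readLong();                 // deserialize in exactly the same order
            ratio = in.readDouble();
        }
    }

If such a type is also used as a key, it should implement WritableComparable instead, so that the shuffle can sort by it.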

7. MapReduce working principle
  • Sharding and formatting data sources
    • Sharding operation: the source file is divided into data blocks of equal size (128 MB by default in Hadoop 2.x), that is, splits; Hadoop builds one Map task for each split, and that task runs the user-defined map() function on every record in the split
    • Formatting operation: each split is formatted into key-value pairs (the key is the byte offset, the value is the content of each line)
  • Execute MapTask: each Map task has a memory buffer (100 MB by default). The intermediate results produced by processing the input split are written into this buffer; once it reaches the spill threshold (80%, i.e. 80 MB), a background thread spills the overflowing data to disk while the map keeps writing new intermediate results into the buffer. During the spill the MapReduce framework sorts the data by key; large intermediate results produce several spill files, and at the end all remaining buffer data is flushed to disk and any multiple spill files are merged into a single file (the relevant configuration is sketched after this list)
  • Execute the Shuffle process: Shuffle distributes the result data output by the MapTasks to the ReduceTasks, partitioning and sorting the data by key during the distribution
  • Execute ReduceTask
  • Write file
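The buffer size and spill threshold mentioned above correspond to the Hadoop 2.x properties mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent; a sketch that sets them explicitly to their usual defaults:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpillConfigExample {
        public static Job newJob() throws java.io.IOException {
            Configuration conf = new Configuration();
            // In-memory sort buffer of each MapTask, in MB (100 is the default)
            conf.setInt("mapreduce.task.io.sort.mb", 100);
            // Start spilling to disk once the buffer is 80% full (the default threshold)
            conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
            return Job.getInstance(conf, "spill-config-example");
        }
    }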
1. MapTask working process

It is divided into five stages:

  • Read stage
  • Map stage
  • Collect stage
  • Spill stage
  • Combiner stage
2. ReduceTask working process

It is divided into five stages:

  • Copy stage
  • Merge stage
  • Sort stage
  • Reduce stage
  • Write stage
3. Shuffle working process

    Shuffle is the core of MapReduce: it ensures that the input of each reduce is sorted by key, and its performance directly determines the performance of the entire MapReduce program. Both the map side and the reduce side take part in the shuffle mechanism, whose main role is to handle the intermediate results.
The shuffle process:

  • Combine (combining): the Combiner does part of the aggregation work on the map side, which reduces the amount of work left for the reduce side
  • Partition (partitioning): classifies the data by key, deciding which ReduceTask each record is sent to

By default, MapReduce classifies records with the hash-based HashPartitioner; a custom Partitioner can also be used, as in the sketch below.
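The default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. A custom partitioner is a small subclass; the routing rule below is invented purely for illustration:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Send keys that start with a digit to reducer 0 and spread the rest by hash
    public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            if (numReduceTasks <= 1) {
                return 0;
            }
            String k = key.toString();
            if (!k.isEmpty() && Character.isDigit(k.charAt(0))) {
                return 0;
            }
            return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
        }
    }

It is registered on the job with job.setPartitionerClass(FirstCharPartitioner.class).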

8. MapReduce programming components
  • InputFormat component: mainly used to describe the format of the input data. It provides two functions: splitting the input data and supplying the input records to the Mapper. Commonly used methods include:
    • addInputPath()
    • addInputPaths()
    • setInputPaths(Job, String commaSeparatedPaths)
    • setInputPaths(Job, Path... inputPaths)
  • Mapper component: the Mapper class provided by Hadoop is the base class for implementing a Map task; it provides a map() method that subclasses override.
  • Reducer component: the key-value pairs output by the Map phase are merged and processed by the Reducer component, which finally outputs the result in some form.
  • Partitioner component: the Partitioner component partitions the keys on the Map side so that records can be dispatched to different ReduceTasks according to their keys; its purpose is to distribute the keys evenly across the ReduceTasks.
  • Combiner component: the Combiner component performs a local merge of the duplicate keys output by the Map stage and passes the new (key, value) pairs on as the input of the Reduce stage.
  • OutputFormat component: OutputFormat is an abstract class used to describe the output format and specification of a MapReduce program. The sketch below shows how these components are registered on a Job.
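A minimal wiring sketch, reusing the WordCount and FirstCharPartitioner classes from the earlier sketches and hypothetical HDFS paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class ComponentsWiring {
        public static Job wire() throws java.io.IOException {
            Job job = Job.getInstance(new Configuration(), "components-example");
            job.setInputFormatClass(TextInputFormat.class);         // InputFormat: split and read the input
            FileInputFormat.addInputPath(job, new Path("/example/input"));
            job.setMapperClass(WordCount.WordCountMapper.class);    // Mapper
            job.setCombinerClass(WordCount.WordCountReducer.class); // Combiner (optional map-side merge)
            job.setPartitionerClass(FirstCharPartitioner.class);    // Partitioner
            job.setReducerClass(WordCount.WordCountReducer.class);  // Reducer
            job.setOutputFormatClass(TextOutputFormat.class);       // OutputFormat
            FileOutputFormat.setOutputPath(job, new Path("/example/output"));
            return job;
        }
    }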