Hadoop Big Data Principles (3) - The MapReduce Distributed Computing Framework

1. General Computing for Big Data

  Distributed computing existed before Hadoop, but at that time each distributed system was purpose-built and could only handle a specific kind of computation, such as sorting large-scale data. Such a system could not be reused in other big data scenarios: every application required developing and maintaining its own dedicated system, so there was nothing general-purpose about it.

  General-purpose programming for big data only became possible with the arrival of Hadoop MapReduce. As long as we follow the MapReduce programming model, we only need to write the business-processing logic; it can then run on a distributed Hadoop cluster without us having to care about how the distributed computation is carried out.

2. MapReduce programming model

  The core idea of big data computing is to move the computation to the data rather than move the data to the computation wherever possible, and Hadoop realizes this idea through the MapReduce programming model.

  The MapReduce programming model is not original to Hadoop; Google first applied it to big data computing, and Hadoop's implementation makes many seemingly complex big data computations, such as machine learning, data mining, and SQL processing, much simpler and more general.

  The MapReduce programming model includes only two phases, Map and Reduce:

  • Map takes a key-value pair <key, value> as input and, after the Map computation, outputs <key, value> pairs;
  • Reduce takes <key, value collection> as input and, after its computation, outputs zero or more <key, value> pairs (a toy sketch of this model follows the list).
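
  To make the shape of the model concrete, here is a toy, single-process sketch written in plain Java. It is not Hadoop code, and the class and method names (MiniMapReduce, run, wordMap) are purely illustrative; it only shows how map emits <key, value> pairs, how the framework groups them by key, and how reduce folds each key's value collection into a result.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

// Toy, in-memory illustration of the MapReduce programming model (not Hadoop code).
public class MiniMapReduce {

    // map: record -> list of (key, value); reduce: (key, list of values) -> result
    public static <I, K, V, R> Map<K, R> run(
            List<I> input,
            Function<I, List<Map.Entry<K, V>>> mapFn,
            BiFunction<K, List<V>, R> reduceFn) {

        // "Shuffle": group every emitted value under its key.
        Map<K, List<V>> grouped = new HashMap<>();
        for (I record : input) {
            for (Map.Entry<K, V> kv : mapFn.apply(record)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }

        // Reduce: collapse each <key, value collection> into <key, result>.
        Map<K, R> output = new HashMap<>();
        grouped.forEach((k, vs) -> output.put(k, reduceFn.apply(k, vs)));
        return output;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Hello World", "Bye World", "Hello Hadoop", "Bye Hadoop");

        // Map: split a line into words and emit <word, 1> for each.
        Function<String, List<Map.Entry<String, Integer>>> wordMap = line -> {
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (String w : line.split("\\s+")) {
                pairs.add(Map.entry(w, 1));
            }
            return pairs;
        };

        // Reduce: sum the 1s collected for each word.
        BiFunction<String, List<Integer>, Integer> sum =
                (word, ones) -> ones.stream().mapToInt(Integer::intValue).sum();

        System.out.println(run(lines, wordMap, sum));  // e.g. {Bye=2, Hello=2, World=2, Hadoop=2}
    }
}
```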

  MapReduce is very powerful. Whether it is relational algebra operations (SQL computing) or matrix operations (graph computing), almost all computing needs in the field of big data can be realized through MapReduce programming.

  The following uses an example of word frequency statistics to explain the calculation process of MapReduce.

Raw data
Hello World
Bye World
Hello Hadoop
Bye Hadoop

  Perform MapReduce calculation on the above data:

  1. Run Map on the input and output one key-value pair of the form <word, 1> for each word;
    • <Hello,1>, <World,1>
    • <Bye,1>, <World,1>
    • <Hello,1>, <Hadoop,1>
    • <Bye,1>, <Hadoop,1>
  2. Collect the <word, 1> pairs that share the same word from the Map outputs, forming <word, <1,1,1...>> data, i.e. <key, value collection> pairs;
    • <Hello, <1,1>>, <World, <1,1>>, <Bye, <1,1>>, <Hadoop, <1,1>>
  3. Run Reduce with the <key, value collection> data produced by the Maps as input. The computation sums the 1s in each collection and then combines the word (word) and this sum (sum) into one <key, value> pair, i.e. <word, sum>. Each output pair is a word together with its frequency count (a Hadoop version of this example is sketched after the list).
    • <Hello, 2>, <World, 2>, <Bye, 2>, <Hadoop, 2>
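
  For reference, here is what this example looks like when written against Hadoop's Java API. It is a minimal sketch closely following the standard Hadoop WordCount tutorial; the class names TokenizerMapper and IntSumReducer are conventional in that tutorial but otherwise arbitrary, and error handling and the job driver are omitted (a driver sketch appears in section 3.1).

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for every word in an input line, emit <word, 1>.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

// Reduce: sum the 1s collected for each word and emit <word, sum>.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```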

  Up to this point, the idea really is no different from traditional application development.

  In a real big data scenario, the data is divided into many blocks and each Map processes only part of it. The MapReduce computing framework assigns a Map task to every data block, thereby realizing distributed computation over big data.

  For a better understanding, suppose there are two data blocks of text that need word frequency statistics. The MapReduce calculation process is as follows:

Figure: MapReduce process with 2 shards

3. MapReduce computing framework

  To execute the MapReduce program above in a distributed environment and process massive amounts of data, we also need a computing framework that can schedule the MapReduce program and run it in parallel across a distributed cluster. This computing framework is also called MapReduce.

  There are two key issues to address in this process:

  1. How to assign a Map task to each data block, including: how the code is sent to the server where the data block resides, how it is started once delivered, and how, after starting, it knows which part of the file it needs to process (how to obtain the BlockID); a sketch of how a Map task can inspect its input split follows this list.

  2. How to gather the <key, value> pairs with the same key from the Map outputs on different servers and send them to the same Reduce task for processing.
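
  As a hedged illustration of the first issue, a Map task in the org.apache.hadoop.mapreduce API can ask its task context for the input split assigned to it, which records which file, starting offset, and length it should process. A minimal sketch, assuming a file-based input format whose splits are FileSplit objects (the class name SplitAwareMapper is illustrative):

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch: a Mapper looking up which part of which file it was assigned.
class SplitAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void setup(Context context) {
        // The framework hands each Map task exactly one input split,
        // typically corresponding to one HDFS block of the input file.
        FileSplit split = (FileSplit) context.getInputSplit();
        System.out.println("file   = " + split.getPath());
        System.out.println("start  = " + split.getStart());   // byte offset within the file
        System.out.println("length = " + split.getLength());  // number of bytes to process
    }
}
```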

  These two key issues correspond to the two "MapReduce framework processing" stages in the figure below: MapReduce job startup and execution, and MapReduce data merging and joining.
Figure: MapReduce framework processing

3.1 Three types of key processes

  The following introduces the three key types of processes in MapReduce, which will help with the discussion that follows.

Big Data Application Process

  This is the entry point that starts the MapReduce program. It mainly specifies the Map and Reduce classes and the input and output file paths, and submits the job to the Hadoop cluster, that is, to the JobTracker process described below. It is the MapReduce program process started by the user; a driver sketch follows.
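
  A minimal driver sketch, modeled on the standard Hadoop WordCount example and reusing the TokenizerMapper and IntSumReducer classes sketched in section 2 (the class name WordCountDriver and the command-line argument layout are illustrative assumptions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of the application (driver) process: configure the job and submit it.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // The Mapper/Reducer classes from the earlier WordCount sketch.
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // optional local aggregation
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input/output paths are taken from the command line, e.g.
        //   hadoop jar wordcount.jar WordCountDriver /input /output
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to the cluster and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

  Running such a class with the hadoop jar command submits the job to the cluster, which corresponds to steps 1 and 2 of the job startup mechanism described in section 3.2.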

JobTracker process

  Based on the amount of input data to be processed, this process instructs the TaskTracker processes described below to start the corresponding number of Map and Reduce tasks, and it manages task scheduling and monitoring over the entire job life cycle. It is a resident process of the Hadoop cluster, and note that the JobTracker process is globally unique within the whole cluster.

TaskTracker process

  This process is responsible for starting and managing Map and Reduce processes. Because each data block needs a corresponding Map task, the TaskTracker process is usually started on the same server as the HDFS DataNode process. In other words, most servers in a Hadoop cluster run a DataNode process and a TaskTracker process at the same time.

  From this introduction we can see that MapReduce also uses a master-slave architecture: the master is the JobTracker and the slaves are the TaskTrackers.

3.2 Job startup and operation mechanism

  1. The application process (JobClient) stores the user's job JAR package in HDFS; these JAR packages will later be distributed to the servers in the Hadoop cluster that perform the MapReduce computation;

  2. The application submits the job to the JobTracker;

  3. The JobTracker creates a JobInProgress tree according to the job scheduling strategy, and each job has its own JobInProgress tree;

  4. JobInProgress creates the corresponding number of TaskInProgress objects according to the number of input splits (usually the number of data blocks) and the configured number of Reduce tasks;

  5. The TaskTracker process and the JobTracker process communicate regularly;

  6. If a TaskTracker has free computing resources (an idle CPU), the JobTracker assigns it tasks. When assigning a task, it uses the TaskTracker's server name to match the task against data blocks stored on that same machine, so that the task it starts processes data that is already local, realizing the principle mentioned at the beginning that "moving computation is cheaper than moving data";

  7. After the TaskTracker receives a task, it starts the corresponding Map/Reduce process according to the task type (Map or Reduce) and the task parameters (the job JAR path, the input file path, the start offset and length of the data to be processed within the file, the host names of the DataNodes holding the replicas of the data block, and so on);

  8. After the Map/Reduce process starts, it checks whether the JAR file needed to execute the task is already present locally; if not, it downloads it from HDFS, then loads the Map/Reduce code and begins execution;

  9. If it is a Map process, it reads its data from HDFS (the data blocks to be read are usually stored on the local machine);

  10. If it is a Reduce process, it writes the result data to HDFS.

3.3 Data Merging and Joining Mechanism

  In the WordCount example in section 2, we want to count how many times each word appears across all of the input data, but a single Map only processes part of the data, and a common word will appear in almost every Map's output. This means occurrences of the same word must be merged together before counting in order to get a correct result.

  A simple case like WordCount only needs to merge values by key. A more complex case, such as a database join operation, needs to join two (or more) kinds of data records on a shared key, as the reduce-side sketch below illustrates.
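
  As a hedged sketch of the more complex case, a common reduce-side join pattern is for the Map side to tag each record with the table it came from, so that all records sharing a join key meet in the same Reduce call. The tags "A"/"B", the tab-separated value layout, and the class name JoinReducer below are all illustrative assumptions, not a fixed Hadoop API:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a reduce-side join. The Map side is assumed to have emitted
// <joinKey, "A\t<record>"> for one table and <joinKey, "B\t<record>"> for the other.
class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> left = new ArrayList<>();   // records tagged "A"
        List<String> right = new ArrayList<>();  // records tagged "B"
        for (Text v : values) {
            String[] parts = v.toString().split("\t", 2);
            if (parts.length < 2) {
                continue;  // skip malformed records in this sketch
            }
            if ("A".equals(parts[0])) {
                left.add(parts[1]);
            } else {
                right.add(parts[1]);
            }
        }
        // Emit the cross product of the two sides for this join key.
        for (String l : left) {
            for (String r : right) {
                context.write(key, new Text(l + "\t" + r));
            }
        }
    }
}
```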

Shuffle process

  Between the Map output and the Reduce input, the MapReduce computing framework handles the data merging and joining operations; this stage is called shuffle.

  The <key, value> results computed by each Map task are written to the local file system. As a Map task nears completion, the MapReduce computing framework starts the shuffle process: within the Map task process it calls a Partitioner interface to select a Reduce partition for each <key, value> pair the Map produced, and the pair is then sent to the corresponding Reduce process over HTTP.

  No matter which server node a Map runs on, pairs with the same key are sent to the same Reduce process. The Reduce task process sorts and merges the <key, value> pairs it receives, grouping those with the same key into one <key, value collection>, which is then passed to the reduce function for execution.

  The default Partitioner of the MapReduce framework takes the hash of the key modulo the number of Reduce tasks, so the same key is always sent to the same Reduce task ID; a sketch of this logic follows.
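
  A sketch of that default logic, mirroring the behavior of Hadoop's built-in HashPartitioner (the class name HashLikePartitioner and the Text/IntWritable key and value types are illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Mirrors the default partitioning rule: hash of the key modulo the number of Reduce tasks.
class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is always non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```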

  To restate what shuffle is: distributed computing needs to bring related data scattered across different servers together for the next step of the computation, and that process is shuffle. Shuffle is also the hardest and most performance-intensive part of the entire MapReduce process.
