[Repost] MapReduce & YARN

Introduction

Apache Hadoop is an open-source software framework that can be installed on a cluster of commodity machines, enabling machines to communicate with each other and work together to collectively store and process large amounts of data in a highly distributed fashion. Originally, Hadoop consisted of two main components: the Hadoop Distributed File System (HDFS) and a distributed computing engine that supported implementing and running programs as MapReduce jobs.

MapReduce is a simple programming model popularized by Google that is useful for processing large data sets in a highly parallel and scalable manner. MapReduce is inspired by functional programming, where users express their computations as map and reduce functions, processing data as key-value pairs. Hadoop provides a high-level API to implement custom map and reduce functions in various languages.
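For example, a word count expresses its entire logic as these two functions. Below is a minimal sketch against the Java org.apache.hadoop.mapreduce API (the class names are illustrative, not from the original article):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: for every word in the input line, emit the key-value pair (word, 1).
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum all the 1s collected for each word and emit (word, total).
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }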

Hadoop also provides the software infrastructure to run MapReduce jobs as a series of map and reduce tasks. Map tasks call the map function on subsets of the input data. Once those calls complete, reduce tasks call the reduce function on the intermediate data generated by the map function, producing the final output. Map and reduce tasks run independently of each other, which enables parallel and fault-tolerant computation.

Most importantly, the Hadoop infrastructure handles all the complex aspects of distributed processing: parallelization, scheduling, resource management, inter-machine communication, software and hardware failure handling, and more. Thanks to this clean abstraction, implementing distributed applications that process terabytes of data across hundreds (or even thousands) of machines has never been easier, even for developers with no prior experience with distributed systems.

MR architecture


(Figure: MapReduce process diagram)

The job is divided into a map side and a reduce side. Three processes cooperate: the JobClient, the JobTracker, and the TaskTrackers.


(Figure: MR architecture)
  1. The JobClient asks the JobTracker for a new job ID.
  2. The job's output specification is checked (for example, that the output directory does not already exist).
  3. The input splits for the job are computed.
  4. The resources needed to run the job (the job JAR file, the configuration file, and the computed input splits) are copied to the JobTracker's filesystem, in a directory named after the job ID.
  5. The JobClient tells the JobTracker that the job is ready to execute by calling the JobTracker's submitJob() method.
  6. When the JobTracker receives the submitJob() call, it places the job in an internal queue from which the job scheduler picks it up and initializes it.
  7. To create the list of tasks to run, the job scheduler first retrieves from the shared filesystem the input splits computed by the JobClient (step 6 in the figure), then creates one map task per split (one split corresponds to one map task, so there are exactly as many map tasks as splits).
  8. Each TaskTracker runs a simple loop that periodically sends heartbeat calls to the JobTracker.
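On the client side, this submission flow is kicked off by a small driver program. Below is a hedged sketch using the standard org.apache.hadoop.mapreduce API and reusing the illustrative WordCount classes from the earlier example (input/output paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);      // the job JAR shipped in step 4
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // splits are computed from this input (step 3)
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // checked in step 2: must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);       // submits the job and polls for completion
        }
    }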

Shuffle and combine

The overall shuffle process has three parts: the map-side shuffle, the sort stage, and the reduce-side shuffle. In other words, the shuffle spans both the map and reduce sides, with the sort phase in between: it is the process that carries data from map task output to reduce task input.

Sort and combine happen on the map side. The combiner is essentially an early reduce, and it must be configured explicitly.

In a Hadoop cluster, most map tasks and reduce tasks execute on different nodes, so in many cases the reduce side must pull map task results across nodes. If many jobs are running in the cluster, normal task execution can consume a great deal of network bandwidth inside the cluster. Part of this consumption is necessary; the goal is to minimize the unnecessary part. Within a node, disk I/O (compared to memory) also has a considerable impact on job completion time. From these basic requirements, the performance-tuning goals for the shuffle process of a MapReduce job can be stated as follows:

  • Completely pull data from the map task side to the reduce side.
  • Minimize unnecessary consumption of bandwidth when pulling data across nodes.
  • Reduce the impact of disk IO on task execution.

Generally speaking, optimizing the shuffle process mainly means reducing the amount of data pulled and using memory instead of disk as much as possible.
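In practice, these optimizations come down to a handful of configuration knobs set on the job's Configuration object in the driver. A sketch with Hadoop 2.x property names (the values are illustrative, and defaults differ across versions):

    Configuration conf = new Configuration();
    // A larger in-memory sort buffer means fewer spills to disk (default 100 MB).
    conf.setInt("mapreduce.task.io.sort.mb", 200);
    // Spill when the buffer is 80% full, so collection can continue during the spill.
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
    // Compress map output to cut cross-node bandwidth during the reduce-side copy.
    conf.setBoolean("mapreduce.map.output.compress", true);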

Map Shuffle


(Figure: map shuffle)
  1. Input
    When a map task executes, its input source is an HDFS block, but the map task reads only a split. The correspondence between splits and blocks may be many-to-one; by default it is one-to-one.

  2. Partition
    Which reducer each piece of the mapper's output goes to is decided by the Partitioner interface that MapReduce provides: by default the key is hashed and then taken modulo the number of reduce tasks, assigning the record to a specific reducer (see the Partitioner sketch after this list).
    The data is then written into a memory buffer. The buffer's job is to collect map results in batches and lessen the impact of disk I/O. Both the key/value pair and the partition result are written to the buffer; keys and values are serialized into byte arrays before being written.

  3. Spill
    Because the memory buffer has a size limit (100 MB by default), it can overflow when a map task produces a lot of output, so under certain conditions the buffered data must be temporarily written to disk and the buffer reused. This process of writing data from memory to disk is called a spill.
    The spill is performed by a separate thread and does not block the thread writing map results into the buffer.
    A spill is triggered when the buffer reaches a configurable fill ratio (spill.percent), which defaults to 0.8.
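The default partitioning described in step 2 amounts to the following (a sketch that mirrors Hadoop's built-in HashPartitioner; the class name is illustrative, and a custom subclass like this would be registered with job.setPartitionerClass):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hash the key, then take it modulo the number of reduce tasks;
    // every record with the same key therefore lands in the same reducer.
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }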

The combiner merges key/value pairs that share the same key (for example, by summing their values) to reduce the amount of data spilled to disk. When is a combiner applicable? Since the combiner's output becomes the reducer's input, the combiner must not change the final result. So in most cases, a combiner fits scenarios where its input and output key/value types are identical and the operation does not affect the final result (such as accumulation or taking a maximum).
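For a pure accumulation such as word count, the reducer already satisfies this constraint and can double as the combiner; one extra line in the driver sketch above is enough:

    // Runs on the map side after sorting, merging (word, 1) pairs before they are spilled to disk.
    job.setCombinerClass(WordCountReducer.class);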

  4. Merge
    When the map output is large, each spill produces a spill file, so there may be many spill files, yet the final output must be a single file. Before the final output is produced, the spill files generated by the intermediate process are merged together; this process is the merge.

    The merge combines the results for the same key. (If a combiner has been configured, it is also applied when merging identical keys.)

Reduce Shuffle


(Figure: reduce shuffle)

Before the reduce task runs, it continuously pulls the final output of every map task in the current job and merges the data pulled from different places, eventually forming one file that serves as the reduce task's input.

  1. Copy
    The reduce process starts several data-copying threads (Fetchers) that request, over HTTP, the map task output files from the TaskTracker where each map task ran. Because the map tasks have already finished, these files are managed by the TaskTracker on local disk.

  2. Merge
    The copied data is first placed in a memory buffer. The buffer size here is more flexible than on the map side: it is based on the JVM heap size because, since the reducer does not run during the shuffle stage, most of the heap can be devoted to the shuffle. Note that the merge takes three forms: 1) memory to memory, 2) memory to disk, 3) disk to disk. The first form is not enabled by default, which may come as a surprise. When the amount of data in memory reaches a certain threshold, the memory-to-disk merge starts; as on the map side, this is a spill process, and if a combiner is configured it is applied here too, producing many spill files on disk. This second merge form keeps running until no map-side data remains, at which point the third, disk-to-disk merge begins and produces the final file.

  3. Reducer input
    The merge ultimately produces one file which, in most cases, resides on disk but needs to be brought into memory. Once the reducer's input file is determined, the entire shuffle phase is over. The reducer then executes and writes its result to HDFS.
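The reduce-side buffering and merging described above is likewise tunable. A sketch with Hadoop 2.x property names, set on the driver's Configuration object as before (the values shown are the usual defaults):

    // Number of parallel copy threads (Fetchers) pulling map output over HTTP.
    conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 5);
    // Fraction of the reducer's heap used to buffer copied map output.
    conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
    // Start the memory-to-disk merge once the buffer reaches this fill level.
    conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);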

YARN

YARN (Yet Another Resource Negotiator) is the name of the next-generation MapReduce framework, generally called MRv2 (MapReduce version 2) for short. The framework is no longer the traditional MapReduce framework; in fact, it need not have anything to do with MapReduce at all. It is a general-purpose runtime framework: users can write their own computing frameworks and run them in this runtime environment. A framework you write yourself is used as a library on the client side and can be packaged when the application submits the job.

Why YARN instead of MR

Disadvantages of MR

The most severe limitations of classic MapReduce relate primarily to scalability, resource utilization, and support for workloads other than MapReduce. In the MapReduce framework, job execution is controlled by two types of processes:

  • A main process called JobTracker that coordinates all jobs running on the cluster, assigning map and reduce tasks to run on the TaskTracker.
  • A number of subordinate processes, called TaskTrackers, that run assigned tasks and periodically report progress to the JobTracker.

Large Hadoop clusters exhibit scalability bottlenecks caused by the single JobTracker.
Additionally, Hadoop clusters both small and large have never used their computing resources with maximum efficiency. In Hadoop MapReduce, the computing resources on each slave node are divided by the cluster administrator into a fixed number of map and reduce slots, and these slots are not interchangeable. Once the numbers of map slots and reduce slots are set, a node can never run more map tasks than it has map slots, even if no reduce tasks are running. This hurts cluster utilization, because when all map slots are taken (and more are needed) we cannot use any reduce slots, even if they are available, and vice versa.
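That fixed division was configured per TaskTracker, roughly as follows (MRv1 property names; the slot counts are illustrative):

    // MRv1: each TaskTracker advertises a fixed number of slots,
    // no matter how much CPU or memory its tasks actually need.
    conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);    // map slots on this node
    conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2); // reduce slots on this node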
Hadoop was designed to run MapReduce jobs only. With the advent of alternative programming models (such as the graph processing provided by Apache Giraph), there was a growing need to support programming models other than MapReduce that could run on the same cluster and share its resources in an efficient and fair manner.

Disadvantages of the original MapReduce framework
  • The JobTracker is a centralized processing point for cluster transactions and is a single point of failure
  • The JobTracker has too many duties: it must maintain both job state and the task state of every job, resulting in excessive resource consumption
  • On the TaskTracker side, treating map/reduce task counts as the measure of resources is too simplistic; it does not take CPU, memory, and other resource conditions into account. When two tasks that each consume a large amount of memory are scheduled together, an OOM is likely
  • Resources are rigidly divided into map slots and reduce slots. When there are only map tasks, the reduce slots cannot be used; when there are only reduce tasks, the map slots cannot be used. This can leave resources underutilized

Solving the scalability problem

In Hadoop MapReduce, the JobTracker has two different responsibilities:

  • Manage the computing resources in the cluster, which involves maintaining the list of live nodes and the lists of available and occupied map and reduce slots, and allocating available slots to appropriate jobs and tasks according to the chosen scheduling policy
  • Coordinate all tasks running on the cluster, which involves instructing TaskTrackers to start map and reduce tasks, monitoring task execution, restarting failed tasks, speculatively re-running slow tasks, totaling job counter values, and more

Piling this many responsibilities onto a single process causes significant scalability issues, especially on larger clusters, where the JobTracker must constantly track thousands of TaskTrackers, hundreds of jobs, and tens of thousands of map and reduce tasks. By contrast, each TaskTracker typically runs only a dozen or so tasks, assigned to it by the hard-working JobTracker.

To solve the scalability problem, a simple but brilliant idea emerged: reduce the responsibilities of the single JobTracker and delegate some of them to the TaskTrackers, since there are many of them in the cluster. In the new design, this idea is realized by splitting the JobTracker's dual responsibilities (cluster resource management and task coordination) into two distinct types of processes.

Advantages of YARN

  1. Faster MapReduce computations
  2. Support for multiple frameworks
  3. Framework upgrades are easier

In YARN:
  • a ResourceManager replaces the cluster manager
  • an ApplicationMaster replaces a dedicated and short-lived JobTracker
  • a NodeManager replaces the TaskTracker
  • a distributed application replaces a MapReduce job

A global ResourceManager runs as the master background process, usually on a dedicated machine, arbitrating the available cluster resources among the various competing applications.
When a user submits an application, a lightweight process instance called the ApplicationMaster is started to coordinate the execution of all tasks within that application. This includes monitoring tasks, restarting failed tasks, speculatively running slow tasks, and totaling application counter values. Interestingly, the ApplicationMaster can run any kind of task inside a container.
The NodeManager is a more general and efficient version of the TaskTracker. Instead of a fixed number of map and reduce slots, the NodeManager holds many dynamically created resource containers, as the sketch below illustrates.
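Instead of slots, a NodeManager advertises a pool of memory and CPU out of which containers of varying sizes are carved. A sketch with real YARN property names but illustrative values:

    // Total resources this NodeManager may hand out as containers.
    conf.setInt("yarn.nodemanager.resource.memory-mb", 8192);
    conf.setInt("yarn.nodemanager.resource.cpu-vcores", 8);
    // A MapReduce job on YARN then requests per-task containers of whatever size it needs.
    conf.setInt("mapreduce.map.memory.mb", 1024);
    conf.setInt("mapreduce.reduce.memory.mb", 2048);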




Author: HarperKoo
Link: http://www.jianshu.com/p/c97ff0ab5f49
Source: Jianshu. Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please indicate the source.
