Hadoop YARN and Hadoop enterprise optimization

Yarn

Architecture differences between Hadoop 1.x and Hadoop 2.x

In the Hadoop 1.x era, MapReduce handled both business-logic computation and resource scheduling at the same time, resulting in tight coupling.

In the Hadoop 2.x era, YARN was introduced: YARN is responsible only for resource scheduling, while MapReduce is responsible only for computation.

Yarn overview

YARN is a resource scheduling platform responsible for providing server compute resources to computing programs. It is equivalent to a distributed operating system platform, while computing programs such as MapReduce are equivalent to applications running on top of that operating system.

Yarn basic architecture

YARN is mainly composed of components such as ResourceManager, NodeManager, ApplicationMaster and Container.

Yarn working mechanism

Glossary

1. Resources: in the YARN context, "resources" refers specifically to compute resources: CPU and memory. Every process on a machine occupies some CPU and memory, and a task must first be granted resources by the RM before it is allowed to start its process on an NM.

2. Queues: YARN divides the cluster's resources into queues, and every user's task must be submitted to a specified queue. Each queue's size is limited to prevent one user's tasks from occupying the entire cluster and affecting other users.

3. Vcore & Mem: logical CPU and logical memory. Each NM reports to the RM how many vcores and how much memory it has available; the exact values are configured by the cluster administrator. For example, a machine with 48 cores and 128 GB of RAM might be configured with 40 vcores and 120 GB of memory, meaning it offers that much to the cluster; the numbers can be adjusted to the actual situation. The logical resources of all NMs added together make up the cluster's total capacity.

4. MinResources & MaxResources: so that every queue gets some resources while idle cluster resources are not wasted, queue resources are configured "elastically". Each queue has a min and a max value: as long as demand reaches min, the cluster is guaranteed to provide that much; if demand exceeds min while the cluster still has idle resources, it can still be satisfied; but to keep one queue from requesting resources without bound and affecting other tasks, the allocation never exceeds max.

5. Container: the processes a task starts on NMs after being granted resources are collectively called containers. In MapReduce a container may run a Mapper or a Reducer; in Spark, a Driver or an Executor.
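The elastic min/max rule described above can be sketched as a small allocation function (a simplified illustration with made-up numbers, not actual YARN scheduler logic):

```python
def grant(demand, queue_min, queue_max, cluster_free):
    """Simplified elastic queue allocation.

    A queue is guaranteed its min when demand reaches it; beyond min it
    may use idle cluster resources, but never more than its max.
    """
    guaranteed = min(demand, queue_min)          # always satisfiable up to min
    extra_wanted = max(0, min(demand, queue_max) - guaranteed)
    extra = min(extra_wanted, cluster_free)      # elastic part needs idle capacity
    return guaranteed + extra

# Demand below min is fully granted; above max it is capped.
print(grant(demand=30, queue_min=50, queue_max=100, cluster_free=0))    # 30
print(grant(demand=80, queue_min=50, queue_max=100, cluster_free=20))   # 70
print(grant(demand=200, queue_min=50, queue_max=100, cluster_free=500)) # 100
```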
  • Simplified version of the working mechanism
    1. The user submits a job to the RM through a client, specifying the target queue and the resources required. Users may set parameters for each compute engine; if none are specified, defaults are used.
    2. After the RM receives the submission, it first selects an NM based on whether resources and queue satisfy the requirements, and tells it to start a special container called the ApplicationMaster (AM); the AM drives the rest of the process.

  3. After the AM registers with the RM, it requests containers from the RM according to the needs of its task, specifying the number of containers, the resources each requires, and location preferences.

  4. If the queue has enough resources, the RM assigns the containers to NMs with sufficient free resources, and the AM notifies those NMs to start the containers.

  5. Once started, each container runs its specific task and processes the data assigned to it. Besides starting containers, the NM also monitors their resource usage and abnormal exits; if a container actually uses more memory than it requested, it is killed to ensure other containers can run normally.

  6. Each container reports its progress to the AM. When all containers have finished, the AM unregisters the job with the RM and exits; the RM tells the NMs to kill the corresponding containers, and the job ends.
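The resource-allocation step of the flow above can be illustrated with a toy placement loop (all names and structures here are illustrative, not YARN APIs):

```python
def allocate(request, node_managers):
    """Assign each requested container to the first NM that can hold it."""
    placements = []
    for _ in range(request["num_containers"]):
        for nm in node_managers:
            if nm["free_mem"] >= request["mem"] and nm["free_vcores"] >= request["vcores"]:
                nm["free_mem"] -= request["mem"]        # reserve the resources
                nm["free_vcores"] -= request["vcores"]
                placements.append(nm["name"])
                break
        else:
            raise RuntimeError("queue/cluster lacks resources; request must wait")
    return placements

nms = [{"name": "nm1", "free_mem": 8, "free_vcores": 4},
       {"name": "nm2", "free_mem": 16, "free_vcores": 8}]
print(allocate({"num_containers": 3, "mem": 6, "vcores": 2}, nms))
# ['nm1', 'nm2', 'nm2'] -- nm1 can only fit one 6 GB container
```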

How many resources are appropriate for the container?

If a container's memory is set too low relative to what it actually uses, YARN may kill it mid-run and it cannot finish normally. If a container runs many concurrent threads but requests few vcores, it may be placed on a heavily loaded machine and run slowly. Therefore, estimate the memory needed for the amount of data a single container processes, and request at least as many vcores as the number of concurrent threads.
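As a back-of-the-envelope illustration of this sizing advice (the planning factors below are assumptions you would measure for your own job, not Hadoop parameters):

```python
def container_request(input_mb, mem_per_mb=0.004, overhead_gb=1.0, threads=4):
    """Estimate a container request from the data volume one container processes.

    mem_per_mb and overhead_gb are illustrative planning factors measured
    for a specific job, not Hadoop configuration values.
    """
    mem_gb = input_mb * mem_per_mb + overhead_gb   # working set + JVM overhead
    vcores = threads                               # never below the thread count
    return {"memory_gb": round(mem_gb, 1), "vcores": vcores}

# A container processing 512 MB of input with 4 worker threads:
print(container_request(512, threads=4))  # {'memory_gb': 3.0, 'vcores': 4}
```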

  • YARN detailed working mechanism
    • The MR program is submitted to the node where the client is located.
    • YarnRunner applies to the ResourceManager for an application.
    • The RM returns the application's resource path to YarnRunner.
    • The program submits the resources required for running to HDFS.
    • After the resources are submitted, the client applies to run MrAppMaster.
    • The RM wraps the user's request into a task.
    • One of the NodeManagers picks up the task.
    • That NodeManager creates a container and starts MrAppMaster.
    • The container copies the resources from HDFS to local disk.
    • MrAppMaster applies to the RM for resources to run the map tasks.
    • The RM assigns the map tasks to two other NodeManagers, which each receive the tasks and create containers.
    • MrAppMaster sends the program startup script to the two NodeManagers that received the tasks; each starts a map task, which partitions and sorts its data.
    • After all map tasks finish, MrAppMaster applies to the RM for containers and runs the reduce tasks.
    • Each reduce task fetches the data of its partition from the map tasks.
    • After the program finishes, MrAppMaster applies to the RM to unregister itself.

Job submission process

Detailed explanation of the whole process of job submission

  • Assignment submission

    Step 0: The client calls the job.waitForCompletion method to submit MapReduce jobs to the entire cluster.

    Step 1: The client applies for a job id from RM.

    Step 2: RM returns the submission path and job id of the job resource to the client.

    Step 3: The client submits the jar package, input split information and configuration files to the specified resource submission path.

    Step 4: After the client submits the resources, it applies to the RM to run MrAppMaster.

  • Job initialization

    Step 5: When RM receives the client's request, it adds the job to the capacity scheduler.

    Step 6: An idle NM receives the job.

    Step 7: The NM creates a Container and generates MRAppmaster.

    Step 8: Download the resources submitted by the client to the local.

  • Task Assignment

    Step 9: MrAppMaster applies to the RM for resources to run multiple map tasks.

    Step 10: The RM assigns the map tasks to two other NodeManagers, which each receive the tasks and create containers.

  • Task run

    Step 11: MrAppMaster sends the program startup script to the NodeManagers that received the tasks; each starts a map task, which partitions and sorts its data.

    Step 12: After all map tasks finish, MrAppMaster applies to the RM for containers and runs the reduce tasks.

    Step 13: Each reduce task fetches the data of its partition from the map tasks.

    Step 14: After the program finishes, MrAppMaster applies to the RM to unregister itself.

  • Progress and status updates

    A YARN task reports its progress and status (including counters) to the ApplicationMaster, and the client polls the ApplicationMaster for progress every second (set by mapreduce.client.progressmonitor.pollinterval) and displays it to the user.

  • Job completion

    In addition to polling the ApplicationMaster for progress, the client checks whether the job has finished by calling waitForCompletion() every 5 seconds; the interval is set by mapreduce.client.completion.pollinterval (default 5000 ms). After the job completes, the ApplicationMaster and containers clean up their working state, and the job information is stored by the JobHistory Server for later inspection by the user.
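The two polling intervals can be sketched as a client-side loop (a simplified illustration; get_progress and is_complete stand in for RPCs to the ApplicationMaster, and the defaults mirror the two pollinterval settings):

```python
import time

def wait_for_completion(get_progress, is_complete,
                        progress_poll_s=1.0, completion_poll_s=5.0):
    """Sketch of the client-side polling described above.

    progress_poll_s mirrors mapreduce.client.progressmonitor.pollinterval
    (default 1000 ms); completion_poll_s mirrors
    mapreduce.client.completion.pollinterval (default 5000 ms).
    """
    last_completion_check = 0.0
    while True:
        print(f"progress: {get_progress():.0%}")       # shown to the user
        now = time.monotonic()
        if now - last_completion_check >= completion_poll_s:
            last_completion_check = now
            if is_complete():                           # the waitForCompletion check
                return True
        time.sleep(progress_poll_s)

# Demo with stub RPCs and tiny intervals:
state = {"progress": 0.0}
def fake_progress():
    state["progress"] = min(1.0, state["progress"] + 0.5)
    return state["progress"]

print(wait_for_completion(fake_progress, lambda: state["progress"] >= 1.0,
                          progress_poll_s=0.01, completion_poll_s=0.0))
```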

Resource scheduler

At present, there are three main Hadoop job schedulers: FIFO, Capacity Scheduler and Fair Scheduler. The current default resource scheduler is the Capacity Scheduler.

For the specific setting, see the yarn-default.xml file:

<property>
    <description>The class to use as the resource scheduler.</description>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
    </value>
</property>

First-in-first-out scheduler (FIFO)

Advantages: the scheduling algorithm is simple, and the JobTracker (which receives jobs after submission) stays lightweight.

Disadvantages: it ignores the differing requirements of jobs. For example, if a long-running job such as a statistical analysis over massive data occupies compute resources, interactive jobs submitted later may go unprocessed for a long time, hurting the user experience.

  • Capacity Scheduler ===> developed by Yahoo
    Supports multiple queues; each queue uses FIFO internally.

    • To prevent the jobs of a single user from monopolizing the resources in a queue, the scheduler limits the share of resources that jobs submitted by the same user may occupy.
    • First, it computes, for each queue, the ratio between the number of running tasks and the compute resources the queue should be allocated, and selects the queue with the smallest ratio (the most under-served queue).
    • Next, within that queue, tasks are ordered by job priority and submission time, subject to user resource limits and memory limits.
    • The queues execute in parallel, each in task order. For example, if job1, job21 and job31 are at the heads of their queues, they are run first, and they run simultaneously.

    By default the scheduler does not honor priority, but this option can be enabled in the configuration file; with priorities enabled, the algorithm is FIFO with priority.

    Priority preemption is not supported: once a job is executing, its resources are not preempted by higher-priority jobs before it finishes.

    The percentage of queue resources that jobs submitted by the same user can obtain is limited, so jobs belonging to one user cannot monopolize the queue's resources.
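The queue-selection step above (smallest ratio of running tasks to allocated capacity) can be sketched as follows (illustrative only, not the actual CapacityScheduler implementation):

```python
def pick_queue(queues):
    """Pick the most under-served queue.

    queues: {name: (running_tasks, capacity_share)} -- the queue with the
    smallest running/capacity ratio is selected next.
    """
    return min(queues, key=lambda q: queues[q][0] / queues[q][1])

queues = {
    "A": (15, 20.0),   # ratio 0.75
    "B": (10, 30.0),   # ratio 0.33 -> most under-served
    "C": (40, 50.0),   # ratio 0.80
}
print(pick_queue(queues))  # B
```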

  • Fair Scheduler ===> developed by Facebook
    1. Supports multiple queues and multiple users; the amount of resources in each queue is configurable, and jobs in the same queue share all of the queue's resources fairly.

    2. For example, with three queues A, B and C, jobs in each queue are allocated resources by priority: the higher the priority, the more resources a job receives, but every job is allocated some resources to ensure fairness. When resources are limited, there is a gap between the compute resources a job would receive in the ideal case and the resources it actually receives; this gap is called the deficit. Within the same queue, the job with the larger resource deficit is granted resources first, and jobs execute in order of their deficit, which is why multiple jobs can be seen running simultaneously in a queue.
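The deficit ordering described above can be sketched as follows (an illustration; the real Fair Scheduler uses a more elaborate fair-share computation):

```python
def serve_order(jobs):
    """Order jobs by deficit, largest first.

    jobs: {name: (fair_share, actual_allocation)} -- the deficit is the gap
    between what a job should get and what it currently has.
    """
    return sorted(jobs, key=lambda j: jobs[j][0] - jobs[j][1], reverse=True)

jobs = {"job1": (10.0, 9.0),   # deficit 1
        "job2": (10.0, 4.0),   # deficit 6 -> served first
        "job3": (10.0, 7.0)}   # deficit 3
print(serve_order(jobs))  # ['job2', 'job3', 'job1']
```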

Speculative execution of tasks

Speculative Execution addresses a common situation when running MapReduce on a cluster: because of program bugs, uneven load or other problems, the tasks of a job run at very different speeds; some tasks have finished while others are perhaps only 10% done. By the bucket principle, these stragglers determine the completion time of the whole job. If speculative execution is enabled, then to shore up this shortcoming Hadoop starts a backup task for the straggler, so that the speculative task and the original task process the same slice of data simultaneously; whichever finishes first provides the final result, and the other task is killed once the winner completes.

  • The job completion time depends on the slowest task completion time

    A job consists of several Map tasks and Reduce tasks. Due to hardware aging, software bugs, etc., some tasks may run very slowly.

    Typical case: 99% of the map tasks in the system are complete, but a few maps remain slow and never finish. What then?

  • Speculative execution mechanism:

    Detect the straggler tasks, i.e. tasks running much slower than the average task speed. Start a backup task for each straggler and run both simultaneously; whichever finishes first wins, and its result is adopted.

  • Prerequisites for performing speculative tasks

    • Each task can have only one backup task;
    • At least 5% (0.05) of the current job's tasks must have completed;
    • The speculative-execution parameters are turned on (enabled by default in the mapred-site.xml file):
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some map tasks
               may be executed in parallel.</description>
</property>

<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some reduce tasks
               may be executed in parallel.</description>
</property>

When speculative execution should not be enabled:

  • There is serious load skew between tasks;
  • Tasks with side effects, such as tasks that write data to a database.
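The straggler-detection idea behind speculative execution can be sketched as follows (the 50% slowdown threshold is an assumption for illustration; Hadoop's actual speculation estimator is more sophisticated):

```python
def find_stragglers(progress_rates, slowdown=0.5, min_finished_frac=0.05, finished=0):
    """Return indices of tasks progressing far slower than the mean rate.

    Mirrors the prerequisites above: speculation only kicks in once at
    least 5% of the job's tasks have completed.
    """
    total = len(progress_rates)
    if total == 0 or finished / total < min_finished_frac:
        return []                                   # too early to speculate
    avg = sum(progress_rates) / total
    return [i for i, r in enumerate(progress_rates) if r < avg * slowdown]

rates = [1.0, 0.9, 1.1, 0.2, 1.0]   # task 3 crawls at 0.2 progress units/s
print(find_stragglers(rates, finished=4))  # [3] -> start a backup task for it
```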

Hadoop Enterprise Optimization

Reasons for the slow running of MapReduce

The efficiency bottleneck of a MapReduce program lies in two areas:

  • Computer performance
    • CPU, memory, disk health, network
  • I/O operation optimization
    • Data skew
    • The number of map and reduce settings is unreasonable
    • Map running time is too long, causing reduce to wait too long
    • Too many small files
    • A large number of large, unsplittable files
    • Too many spills
    • Too many merge times, etc.

MapReduce optimization method

​ The MapReduce optimization method is mainly considered from six aspects: data input, Map phase, Reduce phase, IO transmission, data skew issues and commonly used tuning parameters.

  • data input

    • Merge small files: merge small files before executing the MR task. A large number of small files produces a large number of map tasks, and because loading each task is time-consuming, the MR job runs slower.
    • Use CombineTextInputFormat as input to solve a large number of small file scenarios at the input end.
  • Map stage

    • Reduce the number of spills: increase the io.sort.mb and io.sort.spill.percent parameter values to raise the memory ceiling that triggers a spill, reducing the spill count and therefore disk I/O.
    • Reduce the number of merges: increase the io.sort.factor parameter so that more files are merged at once, reducing the number of merge passes and shortening MR processing time.
    • After the map, without affecting the business logic, combine processing first to reduce I/O.
  • Reduce phase

    • Set the numbers of map and reduce tasks reasonably: neither too few nor too many. Too few makes tasks wait and prolongs processing time; too many causes resource contention between map and reduce tasks and errors such as processing timeouts.
    • Let map and reduce coexist: adjust the slowstart.completedmaps parameter so that once the maps have progressed far enough, the reduces also start running, reducing reduce wait time.
    • Avoid using reduce where possible: the shuffle that joins data sets for reduce generates heavy network traffic.
    • Set the reduce-side buffer reasonably: by default, when buffered data reaches a threshold it is spilled to disk, and reduce then reads all of its data back from disk; buffer and reduce are not directly connected, so there are repeated write-to-disk -> read-from-disk cycles. To mitigate this, set mapred.job.reduce.input.buffer.percent (default 0.0): when it is greater than 0, the specified fraction of memory is reserved to hold buffer data that reduce consumes directly. Memory is then needed for the buffer, for reading data, and for the reduce computation itself, so tune the value according to the job's runtime behavior.
  • IO transmission

    • Use data compression to reduce network IO time. Install and use Snappy and LZO compression encoders.
    • Use SequenceFile binary file.
  • Data skew

    • Data skew

      Data frequency skew — some areas contain far more records than others.

      Data size skew — some records are far larger than the average.

    • How to collect skew data

      Add code to the reduce method that records the details of the map output keys.
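One way to sketch this in plain Python (standing in for a reducer; this is not Hadoop API code) is to count records per key during the reduce and report the heaviest keys:

```python
from collections import Counter

def reduce_with_skew_log(grouped):
    """grouped: {key: [values]}. Returns (results, heaviest keys).

    The per-key record counts reveal which map output keys are skewed.
    """
    key_counts = Counter({k: len(v) for k, v in grouped.items()})
    results = {k: sum(v) for k, v in grouped.items()}   # the actual reduce logic
    return results, key_counts.most_common(2)           # heaviest keys first

grouped = {"a": [1] * 1000, "b": [1] * 10, "c": [1] * 500}
results, hot = reduce_with_skew_log(grouped)
print(hot)  # [('a', 1000), ('c', 500)] -> key 'a' is the skew culprit
```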

HDFS small file optimization method

  • Disadvantages of HDFS small files

    Every file on HDFS is indexed on the NameNode, and each index entry takes roughly 150 bytes. With many small files, the index entries both occupy a large amount of NameNode memory and, as the index grows, slow down lookups.

  • solution

    • Hadoop Archive:

      It is a file archiving tool that efficiently puts small files into HDFS blocks. It can pack multiple small files into one HAR file, thus reducing the memory usage of namenode.

    • Sequence file:

      The sequence file is composed of a series of binary key/value. If the key is the file name and the value is the file content, a large number of small files can be merged into one large file.

    • CombineFileInputFormat:

      CombineFileInputFormat is a new inputformat used to combine multiple files into a single split. In addition, it will consider the storage location of the data.

    • Enable JVM reuse

      For a large number of small file jobs, you can turn on JVM reuse and reduce the running time by 45%.

      JVM reuse explained: normally each map task runs in its own JVM; with reuse enabled, after one map finishes, the same JVM goes on to run other maps.

      Specific setting: mapreduce.job.jvm.numtasks, with a value between 10 and 20.
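The 150-bytes-per-entry figure quoted above makes the memory cost easy to estimate (illustrative arithmetic; actual NameNode usage varies):

```python
BYTES_PER_ENTRY = 150  # approximate NameNode memory per file index entry

def namenode_mb(num_files):
    """Rough NameNode memory (MB) consumed by indexing num_files files."""
    return num_files * BYTES_PER_ENTRY / 1024 / 1024

# Ten million small files vs. the same data packed into 10,000 archive files:
print(f"{namenode_mb(10_000_000):.0f} MB")  # ~1431 MB of NameNode memory
print(f"{namenode_mb(10_000):.2f} MB")      # ~1.43 MB after archiving
```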

Origin blog.csdn.net/qq_45092505/article/details/105458433