Common MapReduce interview questions for big data

1. The execution process of MapReduce

The overall execution process of MR (YARN mode):

  • 1. The files to be processed are stored in the input directory from which the MapReduce program reads them.
  • 2. Before the submit() method is executed, the client program obtains information about the data to be processed and, according to the cluster's parameter configuration, forms a task allocation plan (the input splits).
  • 3. The client submits the split information (together with the job jar and configuration) to YARN, and the ResourceManager in YARN starts the MRAppMaster.
  • 4. After the MRAppMaster starts, it calculates the number of MapTask instances required according to the job description, and then applies to the cluster to start the corresponding number of MapTask processes.
  • 5. Each MapTask reads its data using the InputFormat specified by the user and forms the input KV pairs.
  • 6. The MapTask passes each input KV pair to the user-defined map() method for its processing logic.
  • 7. When the map() method returns, the output KV pairs are collected into the MapTask's buffer.
  • 8. Shuffle stage
    • 1) The MapTask collects the KV pairs output by the map() method and puts them into a ring buffer.
    • 2) The KV pairs are partitioned and sorted by key, then continuously spilled to local disk files; several spill files may be produced.
    • 3) The multiple spill files are merged into one large file.
    • 4) During the spill and merge processes, partitioning and sorting by key are performed continuously.
    • 5) Each ReduceTask fetches the data of its own partition from every MapTask machine, according to its partition number.
    • 6) A ReduceTask fetches the result files of the same partition from different MapTasks and merges and sorts these files again.
    • 7) Once they are merged into one large file, the shuffle process is over, and the ReduceTask's processing logic begins (it takes one group of key-value pairs from the file and calls the user-defined reduce() method).
  • 9. After the MRAppMaster detects that all MapTask processes have finished, it starts the number of ReduceTask processes specified by the user and tells each ReduceTask which data partition to process.
  • 10. After a ReduceTask process starts, it fetches the output files of several MapTasks from the machines where they ran (according to the locations told by the MRAppMaster), merges and sorts them again locally, groups the KV pairs by key, and calls the user-defined reduce() method for its processing logic.
  • 11. After the ReduceTask finishes its computation, it calls the OutputFormat specified by the user to write the result data out (a minimal word-count sketch of this whole flow follows this list).
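
As a concrete reference point, the classic word-count job wires all of these steps together. Below is a minimal sketch, assuming the standard org.apache.hadoop.mapreduce API; class names and paths are illustrative and not from the original article.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Steps 5-7: the InputFormat feeds <offset, line> pairs to map(), which emits <word, 1>.
    public static class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text word = new Text();
        private final IntWritable one = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, one);   // collected into the ring buffer, then shuffled
            }
        }
    }

    // Steps 10-11: values with the same key arrive as one group; reduce() sums them.
    public static class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Steps 1-3: the driver plans the splits and submits the job to YARN.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WcMapper.class);
        job.setReducerClass(WcReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // the input directory from step 1
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit() runs inside waitForCompletion
    }
}
```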

2. Have you written MapReduce? What are the key classes? What methods does the Mapper have? What does the setup method do? Is it called every time a row of data is read?

1. Key classes

• GenericOptionsParser is a utility class in the Hadoop framework for parsing command-line parameters.
• The InputFormat interface; its implementations include FileInputFormat, ComposableInputFormat, etc. It is mainly responsible for reading input files and splitting them.
• Mapper maps the input KV pairs into a set of intermediate KV pairs, i.e. it converts input records into intermediate records.
• Reducer merges the intermediate data set by key into a smaller result set.
• Partitioner partitions the data according to the key.
• OutputCollector collects the output KV pairs produced by the Mapper or Reducer (old API).
• Combiner performs local aggregation, i.e. a localized reduce.

2. The Mapper's methods are setup, map, cleanup, and run

• The setup method manages resources for the Mapper's life cycle and performs initialization work. It runs once per map task, after the Mapper has been constructed and before the first map() call, so it is not called for every row of data read.
• The map method is where the main processing logic is written; it is called once for every input record.
• The cleanup method does finishing work after all map() calls have completed, such as closing files or emitting accumulated key-value state. It also runs once per task, which makes it suitable for tasks such as computing a per-task (global) maximum.
• The run method drives the whole process described above: it first calls setup, then calls map() for each input record, and finally calls cleanup (see the life-cycle sketch after this list).
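
A minimal sketch of that life cycle, assuming the new-API Mapper whose default run() drives exactly this order (setup once, map per record, cleanup once). The class, field, and configuration property names below are illustrative, not from the original article.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private long max = Long.MIN_VALUE;
    private Text outKey;

    @Override
    protected void setup(Context context) {
        // runs once per map task, after construction and before the first map() call:
        // load resources, read configuration, open connections, etc.
        // "max.output.key" is a made-up property used only for illustration
        outKey = new Text(context.getConfiguration().get("max.output.key", "max"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // runs once for every input record: here we only track a running maximum
        max = Math.max(max, Long.parseLong(value.toString().trim()));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // runs once per map task, after the last map() call: emit the single per-task result
        context.write(outKey, new LongWritable(max));
    }
}
```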

3. There is a requirement to shuffle all the data into the same partition with a single setting. If you use MapReduce, how do you write it?

  • Set the number of reducers to 1 in the Driver class: job.setNumReduceTasks(1), as shown below.
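
A minimal driver fragment, assuming a standard org.apache.hadoop.mapreduce.Job object named job:

```java
// One reducer means one partition and therefore a single output file (part-r-00000).
job.setNumReduceTasks(1);
```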

4. Hadoop Shuffle principle (the more detailed the better)?

  • 1. The process between the map method and the reduce method is called Shuffle.
  • 2. After the map method, the data first goes through the partition method, which marks each record with its partition, and is then written into the ring buffer. The ring buffer's default size is 100 MB; when it is 80% full, a spill to disk happens. Before each spill the data is sorted by key (in the key's dictionary order, using quicksort). Spilling produces a large number of spill files, which must be merged and sorted. A Combiner can also be applied to the spill files, provided the operation is an aggregation such as a sum; it is not suitable for averages. Finally, the files are stored on disk by partition, waiting for the Reduce side to pull them.
  • 3. Each ReduceTask pulls the data of its corresponding partition from the Map side. The pulled data is kept in memory first and spilled to disk if memory is not enough. After all the data has been pulled, merge sort is used to sort the data in memory and on disk together. Before entering the reduce method, the data can also be grouped.

The relevant details are as follows:

  • 1. The MapTask runs; its output data is collected and written into the ring buffer, and the starting offset is recorded.
  • 2. The ring buffer's default size is 100 MB; when the data reaches 80 MB, the end offset is recorded.
  • 3. The data is partitioned (by default, partition = hash of the key % number of reduce tasks) and quick-sorted within each partition.
  • 4. After partitioning and sorting, the data is flushed to disk (during this process, new output data is written into the remaining 20% of the ring buffer, and a new starting offset is recorded).
  • 5. After the MapTask finishes, the multiple small spill files are merge-sorted into one large file.
  • 6. Once a MapTask has finished, the ReduceTasks start.
  • 7. Each ReduceTask pulls the data belonging to its own partition from the machines where MapTasks have completed.
  • 8. The ReduceTask groups the pulled data and calls the reduce() method once for each group.
  • 9. The reduce logic runs and the result is written to a file (a sketch of the default partitioning from step 3 follows this list).
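
The default partitioning mentioned in step 3 behaves roughly like Hadoop's HashPartitioner; this is a minimal sketch of that logic, not the actual Hadoop source:

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Roughly what the default HashPartitioner does: partition = hash(key) % numReduceTasks.
// The bitwise AND with Integer.MAX_VALUE keeps the result non-negative for negative hash codes.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```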

5. What is the role of the combine function?

  • The Combiner runs on the map side (and again during the merge on the reduce side). Its function is to merge the key-value pairs that share the same key, and it can be customized. The combine function merges the <key, value> pairs produced by a map function (multiple pairs with the same key) into a new <key2, value2> pair, which is then fed into the reduce function; value2 can also be thought of as values, because there are several of them. The purpose of this merging is to reduce the amount of data transferred over the network (see the sketch below).
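
In the word-count case, the reducer itself can serve as the combiner because summation is associative; a sketch of how it might be registered in the driver (WcReducer refers to the reducer sketched under question 1):

```java
// Registering the reducer as a combiner runs a local reduce over each MapTask's spill/merge output,
// shrinking the data before it is shuffled over the network.
job.setCombinerClass(WcReducer.class);
```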

6. Briefly list a few MapReduce tuning methods

MapReduce optimization methods are mainly considered from six aspects: data input, the Map phase, the Reduce phase, IO transmission, data skew, and commonly used tuning parameters.

1. Data input

  • 1) Merge small files before executing the MR task. A large number of small files produces a large number of map tasks, which increases task loading, and task loading is time-consuming, so the MR job runs more slowly;
  • 2) Use CombineTextInputFormat as the input to handle scenarios with a large number of small input files.

2. Map stage

  • 1) Reduce the number of spills: by adjusting the io.sort.mb and io.sort.spill.percent parameters, increase the memory limit for spilling and reduce the number of spills, thereby reducing disk IO;
  • 2) Reduce the number of merges: by adjusting the io.sort.factor parameter, increase the number of files merged in one pass and reduce the number of merge passes, thereby reducing the MR processing time;
  • 3) After the map phase, without affecting the business logic, apply a Combiner first to reduce IO (see the sketch after this list).
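
A sketch of what these knobs might look like in the driver, assuming the Hadoop 2.x parameter names (in 1.x they were io.sort.mb, io.sort.spill.percent, and io.sort.factor); the values are illustrative, not recommendations:

```java
// In the driver, before Job.getInstance(conf, ...).
Configuration conf = new Configuration();
conf.setInt("mapreduce.task.io.sort.mb", 200);            // ring buffer size, default 100 (MB)
conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f); // spill threshold, default 0.80
conf.setInt("mapreduce.task.io.sort.factor", 50);         // streams merged in one pass, default 10
```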

3. Reduce phase

  • 1) Set the numbers of map and reduce tasks reasonably. Neither should be too small or too large: too few causes tasks to wait and prolongs processing time; too many causes resource competition between map and reduce tasks and errors such as processing timeouts;
  • 2) Let map and reduce coexist: adjust the mapreduce.job.reduce.slowstart.completedmaps parameter so that once the maps have progressed to a certain point the reduces also start running, thereby reducing the reduce tasks' waiting time;
  • 3) Avoid using reduce where possible, because reduce generates a lot of network traffic when it is used to join data sets;
  • 4) Set the reduce-side buffer reasonably. By default mapreduce.reduce.input.buffer.percent is 0.0, so all fetched map output goes through disk; when the value is greater than 0, the specified fraction of memory is used to keep the buffered data in memory and deliver it directly to reduce, thereby reducing IO overhead (both parameters are sketched below).
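
A sketch of these two settings, assuming the Hadoop 2.x parameter names and an existing Configuration object conf; the values are illustrative only:

```java
// In the driver, before the Job is created.
conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f); // start reducers when 80% of maps have finished (default 0.05)
conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.5f);         // fraction of heap that may hold fetched map output for reduce (default 0.0)
```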

4. IO transmission

  • 1) Use data compression to reduce the job's IO time;
  • 2) Use SequenceFile binary files (a sketch of both follows).
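
A sketch combining both ideas in the driver, assuming conf and job already exist and that SnappyCodec (org.apache.hadoop.io.compress) and SequenceFileOutputFormat (org.apache.hadoop.mapreduce.lib.output) are available on the cluster; any installed codec would work:

```java
// Compress the intermediate (shuffle) data.
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);
// Write the final result as a compressed binary SequenceFile.
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
```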

7. Which processes run in Hadoop and what are their roles?

  • NameNode: manages the file system's metadata, records the location of every data block of each file, and performs operations on the file system namespace such as opening, closing, and renaming files and directories. An HDFS cluster has only one active NameNode; there can be other secondary/standby metadata nodes.
  • SecondaryNameNode: merges the NameNode's edit logs into the fsimage file, helping the NameNode persist the metadata held in memory.
  • NodeManager: the per-node agent in YARN. It manages a single compute node in the Hadoop cluster, including keeping in touch with the ResourceManager, supervising the life cycle of Containers, monitoring each Container's resource usage (memory, CPU, etc.), tracking node health, and managing logs and the auxiliary services used by different applications.
  • DataNode: the data storage node. It stores and retrieves blocks (file blocks), serves read and write requests from file system clients, and performs block creation, deletion, and similar operations.
  • ResourceManager: in YARN, the ResourceManager is responsible for the unified management and allocation of all cluster resources. It receives resource reports from each node (NodeManager) and allocates resources to applications (more precisely, to their ApplicationMasters) according to a certain policy. The RM works together with the NodeManagers (NMs) on each node and the ApplicationMasters (AMs) of each application.

8. YARN job submission process


1. Job submission

  • 1) The client calls the job.waitForCompletion method to submit the MapReduce job to the cluster.
  • 2) The client applies for a job id from the ResourceManager.
  • 3) The ResourceManager returns the job resource submission path (an HDFS path) and the job id to the client. Each job has a unique id.
  • 4) The client sends the jar package, split information, and configuration files to the specified resource submission path.
  • 5) After the client has submitted the resources, it applies to the ResourceManager to run the MrAppMaster (the job's ApplicationMaster).

2. Job initialization

  • 6) When the ResourceManager receives the client's request, it adds the job to the capacity scheduler (resource scheduler).
  • 7) An idle NodeManager receives the job.
  • 8) That NodeManager creates a Container and starts the MrAppMaster in it.
  • 9) The MrAppMaster downloads the resources submitted by the client to the local node and, according to the split information, plans the MapTasks and ReduceTasks.

3. Task assignment

  • 10) The MrAppMaster applies to the ResourceManager for resources to run the MapTasks.
  • 11) The ResourceManager assigns the MapTasks to idle NodeManagers, and each NodeManager that receives a task creates a Container for it.

4. Task running

  • 12) The MrAppMaster sends the program startup scripts to the NodeManagers that received the tasks; each of them starts a MapTask, and the MapTasks process and sort the data.
  • 13) After all MapTasks have finished, the MrAppMaster applies to the ResourceManager for Containers and runs the ReduceTasks.
  • 14) After the program finishes, the MrAppMaster applies to the ResourceManager to deregister itself.
  • 15) Progress and status updates. Tasks in YARN return their progress and status (including counters) to the ApplicationMaster. The client requests progress updates from the ApplicationMaster every second (set by mapreduce.client.progressmonitor.pollinterval) and displays them to the user. The YARN Web UI can also be used to view task execution status.

5. Job completion

  • In addition to requesting progress from the ApplicationMaster, the client checks whether the job has completed every five seconds via waitForCompletion(); the interval can be set by mapreduce.client.completion.pollinterval. After the job completes, the ApplicationMaster and the Containers clean up their working state. The job information is stored by the JobHistory server for later inspection by the user (both polling intervals are sketched below).
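
Both polling intervals can be adjusted through configuration; a sketch, assuming an existing Configuration object conf (the values are illustrative):

```java
conf.setLong("mapreduce.client.progressmonitor.pollinterval", 2000L); // progress poll, default 1000 ms
conf.setLong("mapreduce.client.completion.pollinterval", 10000L);     // completion poll, default 5000 ms
```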

9. The block size is 128 MB and there is a 260 MB file. How many input splits will it be divided into when splitting?

  • 2 splits, because of the 1.1x redundancy rule: each time a split is cut, the framework checks whether the remaining part is larger than 1.1 times the split size; if it is not, the remainder becomes a single split. Here the 260 MB file yields a 128 MB split plus a 132 MB split, since 132/128 ≈ 1.03 < 1.1 (see the sketch below).
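
A small sketch of the arithmetic behind the 1.1x rule; it mirrors, but is not, the SPLIT_SLOP check in FileInputFormat:

```java
// Simplified illustration for a 260 MB file with a 128 MB split size.
long splitSize = 128L * 1024 * 1024;           // 128 MB
long remaining = 260L * 1024 * 1024;           // 260 MB file
int splits = 0;
while ((double) remaining / splitSize > 1.1) { // keep cutting while the remainder is > 1.1 x splitSize
    remaining -= splitSize;
    splits++;
}
if (remaining > 0) {
    splits++;                                  // the leftover 132 MB becomes the last split
}
System.out.println(splits);                    // prints 2
```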

10. List the components in MR that developers can intervene in (describe the principle of each component in detail, e.g. combine)

  • combine: equivalent to a reduce on the map side, applied to the files generated by each MapTask.
  • partition: partitioning; by default, partition = hash of the key % number of reduce tasks. A custom partitioner extends the Partitioner class and overrides the getPartition() method. Custom partitioning can effectively mitigate data skew.
  • group: grouping; extend the WritableComparator class and override the compare() method to customize grouping (i.e. define how the data entering reduce is grouped).
  • sort: sorting; implement the WritableComparable interface and override the compareTo() method to sort the results by a custom ordering.
  • split: the input split size can be adjusted on the client via blocksize, minSize, and maxSize (a custom Partitioner sketch follows this list).
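
As an example of intervening in partitioning, here is a hypothetical custom Partitioner that isolates a known hot key to ease data skew; the key name and partition layout are made up for illustration, and it assumes at least two reduce tasks.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if ("hot-key".equals(key.toString())) {
            return 0;                                       // dedicated partition for the hot key
        }
        // spread all other keys over partitions 1 .. numReduceTasks-1
        return (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1) + 1;
    }
}
```

It would be registered in the driver with job.setPartitionerClass(SkewAwarePartitioner.class) together with a matching job.setNumReduceTasks(...).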

11. What is the difference between an input split and a block?

  • An input split is a logical concept, and splitting involves redundancy (the 1.1x rule).
  • A block is a physical concept: the data is physically divided, with no redundancy.

12. What are the responsibilities of the ResourceManager?

• Resource Scheduling.
• Resource monitoring.
• Application submission.

13. What are the responsibilities of the NodeManager?

  • Mainly resource management on the node: starting Containers to run task computations, reporting node resources and Container status to the RM, and reporting task processing status to the AM.

14. Briefly describe the Hadoop scheduler

  • Hadoop currently has three popular schedulers: FIFO, the Capacity Scheduler, and the Fair Scheduler. Hadoop 2.7 uses the Capacity Scheduler by default.

1. FIFO (first in first out scheduler)

  • The default scheduler in Hadoop 1.x is FIFO. FIFO uses a single queue and serves jobs in the order they were submitted: the job at the head of the queue needs a certain number of MapTasks and ReduceTasks, and whenever an idle node is found, it is assigned to that job until the job finishes.

2. Capacity Scheduler (capacity scheduler)

The default scheduler in Hadoop 2.x is the Capacity Scheduler.

  • 1) Supports multiple queues. Each queue can be configured with a certain amount of resources, and each queue uses FIFO scheduling internally.
  • 2) To prevent the jobs of a single user from monopolizing the resources in a queue, the scheduler limits the resources that jobs submitted by the same user can occupy.
  • 3) When assigning a new job, first compute, for each queue, the ratio of the number of running tasks to the amount of resources allocated to that queue, and choose the queue with the smallest ratio. For example, if queue A has 15 tasks and 20% of the resources, its ratio is 15/0.2 = 75; queue B with 25 tasks and 50% is 25/0.5 = 50; queue C with 25 tasks and 30% is 25/0.3 ≈ 83.3. So queue B, with the smallest ratio, is chosen.
  • 4) Then, within the queue, jobs are sorted and executed according to priority and submission time, also taking the user's resource and memory limits into account.
  • 5) Multiple queues execute in parallel, each following the order of its own task queue. For example, if job11, job21, and job31 are at the head of their respective queues, the three jobs execute at the same time.

3. Fair Scheduler (fair scheduler)

  • 1) Supports multiple queues. Each queue can be configured with certain resources, and the jobs in each queue fairly share all the resources of that queue.
  • 2) Jobs in a queue are allocated resources according to priority: the higher the priority, the more resources are allocated, but to guarantee fairness every job gets some resources. The priority is determined by the gap between the amount of resources a job should ideally receive and the amount it actually receives; the larger the gap, the higher the priority.

15. Can I remove the reduce stage when developing a job?

  • Yes, set the number of reducers to 0: job.setNumReduceTasks(0). The job then runs as a map-only job.

Origin blog.csdn.net/sun_0128/article/details/108564793