The most complete Hadoop interview questions in history: Nien Big Data Interview Collection Topic 1

A note up front:

"Nien's Big Data Interview Collection" is a companion volume to "Nien's Java Interview Collection".

A special note: the 41 topic PDFs of "Nien's Java Interview Collection" (available at the end of this article) have, since their release, collected thousands of questions across more than 4,000 pages and helped many readers land high-salary offers. It has become a must-read for Java study and interviews.

Therefore, the Nien team is striking while the iron is hot and launching "Nien's Big Data Interview Collection".

"Nien's Big Data Interview Collection" will be continuously upgraded and iterated, and it will become a must-read for studying and interviewing in the field of big data.

Note: This article is continuously updated as a PDF. The latest PDF files of Nien's architecture notes and interview questions can be obtained from [Technical Freedom Circle] at the end of the article.

About the authors:

First author: Mark, senior big data architect and Java architect, with nearly 20 years of experience in Java and big data architecture and development. A senior architecture mentor who has successfully guided many intermediate and senior Java engineers in transitioning to architect positions.

Second author: Nien, senior system architect, senior writer in the IT field, and well-known blogger. Over the past 20 years he has worked on three-tier architecture research, system architecture, system analysis, and core code development in the fields of high-performance web platforms, high-performance communication, high-performance search, and data mining. A senior architecture mentor who has successfully guided many intermediate and senior Java engineers in transitioning to architect positions.

Article directory

Topic 1: The most complete Hadoop interview questions in history

1. Chat: The main bottlenecks of a Hadoop cluster

The most important bottlenecks of Hadoop clusters may include the following aspects:

  1. Network bandwidth : Data in a Hadoop cluster usually needs to be transmitted between different nodes. If the network bandwidth is insufficient, the data transmission speed may slow down, thereby affecting the performance of the entire cluster.
  2. Storage performance : Hadoop clusters usually use distributed file systems to store data. If the storage performance is insufficient, data read and write speeds may slow down, thereby affecting the performance of the entire cluster.
  3. Computing resources : Computing tasks in a Hadoop cluster usually need to be executed on different nodes. If the computing resources of some nodes are insufficient, the task execution speed may slow down, thereby affecting the performance of the entire cluster.
  4. Data skew : In some cases, the data in the Hadoop cluster may be skewed, that is, the amount of data in some nodes is too large, which causes the execution time of computing tasks of these nodes to be too long, thus affecting the performance of the entire cluster.
  5. Scheduling strategy : Tasks in a Hadoop cluster usually need to be assigned and scheduled by the scheduler. If the scheduling strategy is unreasonable, some nodes may be overloaded, thereby affecting the performance of the entire cluster.

2. Chat: Hadoop operating modes

Local mode, pseudo-distributed mode, fully distributed mode

The local mode means that Hadoop runs on a single machine, and the data is stored in the local file system. It is suitable for situations where the amount of data is small, and is mainly used for development and testing.

Fully distributed mode means that Hadoop runs on multiple machines; the data is divided into blocks, stored on different nodes, and the nodes communicate and coordinate over the network. It is suitable for processing large-scale data and is mainly used in production environments. Besides fully distributed mode, Hadoop also provides the local (standalone) mode and the pseudo-distributed mode.

Pseudo-distributed mode means that Hadoop runs on a single machine, but each component runs in different processes, simulating a distributed environment, suitable for development and testing.

3. Talk about the components of the Hadoop ecosystem and give a brief description

1) Zookeeper : It is an open source distributed application coordination service. Based on zookeeper, it can realize synchronization service, configuration maintenance, and naming service.

2) Flume : A highly available, highly reliable, and distributed massive log collection, aggregation, and transmission system.

3) HBase : a distributed, column-oriented open-source database that uses Hadoop HDFS as its storage system.

4) Hive : a data-warehouse tool built on Hadoop that maps structured data files to database tables and provides simple SQL query capability, converting SQL statements into MapReduce jobs for execution.

5) Sqoop : Import data from a relational database into Hadoop's HDFS, or import HDFS data into a relational database.

4. Talk about the two concepts of "hadoop" and "hadoop ecosystem"

Hadoop refers to the Hadoop framework itself; the Hadoop ecosystem includes not only Hadoop but also the other frameworks that keep the Hadoop framework running normally and efficiently, such as ZooKeeper, Flume, HBase, Hive, Sqoop, and other auxiliary frameworks.

5. Chat: In a normally working Hadoop cluster, which processes need to be started, and what are their functions?

1) NameNode : the master server in Hadoop; it manages the file system namespace and access to the files stored in the cluster, and holds the metadata.

2) SecondaryNameNode : not a redundant or standby daemon for the NameNode; it performs periodic checkpoints and cleanup, helping the NameNode merge the edit log and reducing NameNode startup time.

3) DataNode : It is responsible for managing the storage connected to the node (there can be multiple nodes in a cluster). Each node that stores data runs a datanode daemon.

4) ResourceManager (JobTracker in Hadoop 1.x): responsible for resource management and for scheduling work onto the worker nodes. In Hadoop 1.x, the JobTracker scheduled work to the TaskTrackers running alongside the DataNodes, which performed the actual work.

5) NodeManager (TaskTracker in Hadoop 1.x): executes tasks on each worker node.

6) DFSZKFailoverController : in an HA setup, it monitors the state of the NameNode and writes that state to ZooKeeper in a timely manner. A dedicated thread periodically calls a specific interface on the NameNode to obtain its health status. The ZKFC also has the right to decide which NameNode becomes Active; since there are at most two NameNode nodes, the current election strategy is relatively simple (first come, first served / rotation).

7) JournalNode : stores the NameNode's edit log in an HA setup.

6. Let’s chat: How many copies of each block does HDFS keep by default?

In HDFS, each block is stored in three copies by default; these copies are called replicas.

This copy mechanism is called data redundancy, and its purpose is to improve data reliability and availability. When a copy is unavailable, HDFS can use other copies to ensure data availability.

At the same time, HDFS will periodically verify the copies to ensure their consistency.

The number of replicas can be changed by modifying the configuration of HDFS.

However, it should be noted that increasing the number of replicas will take up more storage space and network bandwidth, so a trade-off needs to be made according to the actual situation.
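As a minimal sketch (assuming a reachable cluster and a hypothetical path /data/example.txt), the replication factor can be set cluster-wide via dfs.replication, or per file through the FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default, normally set in hdfs-site.xml (dfs.replication = 3)
        conf.setInt("dfs.replication", 3);

        try (FileSystem fs = FileSystem.get(conf)) {
            // Change the replication factor of one existing file to 2
            // (the path below is hypothetical, used only for illustration)
            fs.setReplication(new Path("/data/example.txt"), (short) 2);
        }
    }
}
```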

7. Let’s talk: What is the default BlockSize of HDFS?

The default BlockSize of HDFS (Hadoop Distributed File System) is 128MB.

This value can be modified through the HDFS configuration file, but generally speaking, it is not recommended to modify this value arbitrarily, because it will affect the performance of HDFS and the efficiency of data storage. If the files you need to store are relatively small, you can consider merging multiple small files into one large file to achieve better storage efficiency.
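As a small sketch (assuming a reachable HDFS cluster), the effective block size can be inspected through the FileSystem API; dfs.blocksize is the configuration key that controls it:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Default block size used for new files under the given path
            // (134217728 bytes = 128 MB on Hadoop 2.x and later)
            long blockSize = fs.getDefaultBlockSize(new Path("/"));
            System.out.println("Default block size: " + blockSize + " bytes");
        }
    }
}
```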

8. Let’s talk: Which part is responsible for HDFS data storage?

The DataNode is responsible for data storage.

9. Chat: What is the purpose of SecondaryNameNode?

Its purpose is to help the NameNode merge the edit logs and reduce NameNode startup time.

10. Let’s talk: Since which version has Hadoop's block size been 128 MB?

In Hadoop 1.x the default block size is 64 MB; starting with Hadoop 2.x it is 128 MB.

11. Chat: HDFS read and write process

Nien's hint:

This is a core question. Candidates have seen it countless times, and interviewers have asked it countless times.

Yet many interviewees still cannot explain it completely, so please commit it to memory.

In addition, many follow-up questions are derived from the HDFS read and write process.

HDFS write process:

The client sends an upload request and establishes communication with the NameNode through RPC. The NameNode checks whether the user has upload permission and whether a file with the same name already exists in the target HDFS directory. If either check fails, an error is returned directly; if both pass, the NameNode tells the client that it may upload;

The client splits the file into blocks (128 MB by default). After splitting, it sends a request to the NameNode asking which servers the first block should be uploaded to;

After receiving the request, the NameNode allocates storage according to the network topology, rack awareness, and the replication policy, and returns the addresses of the available DataNodes;

Note: Hadoop is designed with data safety and efficiency in mind. Data files are stored in three copies on HDFS by default. The placement strategy is: one copy on the local node, one copy on another node in the same rack, and the last copy on a node in a different rack.

After receiving the addresses, the client communicates with one node in the list, say A; this is essentially an RPC call to establish a pipeline. After receiving the request, A calls B, and B calls C, completing the establishment of the whole pipeline, which is then acknowledged back to the client level by level;

The client starts sending the first block to A (data is first read from disk into a local memory cache) in units of packets (64 KB each). When A receives a packet it forwards it to B, and B forwards it to C; after each packet is transmitted, A puts it into an ack queue to wait for the response;

The data is divided into packets and transmitted sequentially along the pipeline; acks flow back one by one in the reverse direction, and finally the first DataNode in the pipeline, A, sends the pipeline ack to the client;

When one block has been transmitted, the client asks the NameNode again where to upload the next block, and the NameNode once more selects three DataNodes for it.
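From the client's point of view, the whole write pipeline above is hidden behind FileSystem.create(). A minimal sketch, assuming fs.defaultFS points at the target NameNode and the path /user/demo/hello.txt is just an example:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // create() performs the RPC to the NameNode described above;
            // the returned stream pushes 64 KB packets through the DataNode pipeline.
            try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
```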

HDFS read process:

The client sends an RPC request to the NameNode asking for the locations of the file's blocks;

After receiving the request, the NameNode checks the user's permissions and whether the file exists. If the checks pass, it returns part or all of the block list as appropriate; for each block, the NameNode returns the addresses of the DataNodes holding a copy of it. The returned DataNode addresses are sorted by two rules based on the cluster topology: the DataNode closest to the client in the network topology is ranked first, and DataNodes whose heartbeats have timed out (state STALE) are ranked last;

The client selects the top-ranked DataNode to read each block. If the client itself is a DataNode, it reads the data directly from the local disk (the short-circuit read feature);

Under the hood, a socket stream (FSDataInputStream) is established, and the read method of the parent class DataInputStream is called repeatedly until all the data of the block has been read;

After the current block list has been read, if the end of the file has not yet been reached, the client asks the NameNode for the next batch of block locations;

After a block is read, a checksum verification is performed. If an error occurs while reading from a DataNode, the client notifies the NameNode and then continues reading from the next DataNode that has a copy of the block;

The read method reads blocks in parallel rather than one by one; the NameNode only returns the addresses of the DataNodes holding the requested blocks, not the block data itself;

Finally, all the blocks that were read are merged into a complete final file.
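The read side is equally simple from the client's perspective: FileSystem.open() triggers the block-location lookup, and the returned stream handles DataNode selection and checksum verification. A minimal sketch, assuming the same hypothetical file as above:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);  // block data fetched from the nearest DataNode
            }
        }
    }
}
```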

12. Chat: When HDFS is reading a file, what should I do if one of the blocks is suddenly damaged?

After the client reads a block from a DataNode, it performs checksum verification, comparing the block it read with the checksum recorded on HDFS. If the verification results are inconsistent, the client notifies the NameNode and then continues reading from the next DataNode that has a copy of the block.

That is: if a block in HDFS is corrupted while reading, HDFS will try to read data from other replicas. In HDFS, each block has multiple copies, which are usually stored on different nodes. If one replica is corrupted, HDFS will try to read data from other replicas to ensure data integrity and availability.

In HDFS, the number of replicas of a block is configurable, the default is three. If one replica is damaged, HDFS will try to read data from other replicas. If all replicas are corrupted, the block will be marked as "lost", and data recovery or re-uploading will be required.

13. Let’s chat: What should I do if one of the DataNodes suddenly hangs up when HDFS is uploading files?

When the client uploads files, a pipeline is established with the DataNodes. In the forward direction of the pipeline, the client sends data packets to the DataNodes; in the reverse direction, the DataNodes send ack confirmations back to the client, i.e., after correctly receiving a packet, a DataNode sends a confirmation that it has been received.

If a DataNode suddenly goes down and the client fails to receive its ack, the client notifies the NameNode. The NameNode detects that the number of replicas of the block no longer matches the configured value, takes the failed DataNode offline, and no longer lets it participate in file uploads or downloads.

14. Chat: What operations will NameNode do when it starts up

NameNode data is kept in memory and on the local disk; the on-disk data consists of the fsimage image file and the edits edit log file.

(1) Start the NameNode for the first time:

①First format the NameNode and generate the fsimage image file;

②Start the NameNode:

  • Read the fsimage image file, load the file content into the memory,
  • Wait for DataNode registration and block report;

③Start the DataNodes:

  • First register with NameNode
  • send block report
  • Check whether the blocks recorded in the fsimage are consistent with those in the block report;

④Operate the file system:

  • At this time, the change information such as creating a directory, uploading a file, etc. will be recorded in the edits editing log file;

(2) Subsequent NameNode startups (not the first time):

① Read fsimage and edits files;

② Apply the operations recorded in the edits log to the fsimage loaded in memory, and generate a new fsimage image file;

③Create a new edits edit log file;

④ Start the DataNode node.

15. Let’s chat: Do you understand the Secondary NameNode? What is its working mechanism?

The SecondaryNameNode has two functions: one is image backup, and the other is the periodic merging of the edit log and the image. Together, these two processes are called a checkpoint.

From its name, the Secondary NameNode sounds like a backup of the NameNode, but it is not. Many Hadoop beginners wonder what the Secondary NameNode actually does and why it exists in HDFS.

Review the NameNode first

NameNode is mainly used to store metadata information of HDFS, such as namespace information, block information, etc.

When the NameNode is running, this information is stored in memory. But this information can also be persisted to disk.

The NameNode saves its metadata to disk in two different files:

  • fsimage - it is a snapshot of the entire filesystem at NameNode startup
  • edit logs - it is the sequence of changes made to the file system after the NameNode started

Only when the NameNode restarts will the edit logs be merged into the fsimage file, resulting in an up-to-date snapshot of the file system. But in the production cluster, the NameNode is rarely restarted, which also means that when the NameNode runs for a long time, the edit logs file will become very large.

In this case the following problems arise:

  • The edit logs file will become very large, how to manage this file is a challenge.
  • The restart of the NameNode can take a long time because there are many changes to be merged into the fsimage file.

If the NameNode dies, then we lose a lot of changes because the fsimage file is very old at this time.

So in order to overcome this problem, we need an easy-to-manage mechanism to help us reduce the size of the edit logs file and get a latest fsimage file, which will also reduce the pressure on the NameNode. This is very similar to the recovery point of Windows. The recovery point mechanism of Windows allows us to take a snapshot of the OS, so that when a problem occurs in the system, we can roll back to the latest recovery point.

Now we understand the function of the NameNode and the challenge it faces - keeping the metadata of the file system up to date.

So, what does this have to do with the Secondary NameNode?

What the Secondary NameNode does is set a checkpoint in the file system to help the NameNode work better.

It is not intended to replace the NameNode, nor is it a backup of the NameNode. As stated above, the SecondaryNameNode has two functions: image backup and the periodic merging of the edit log and the image.

Together, these two processes are called a checkpoint.

The role of image backup is to back up the fsimage (the fsimage is written to a file when the metadata reaches a checkpoint). The role of periodically merging the log and the image is to merge the NameNode's edits log into the fsimage, so that the next time the NameNode starts it does not have to load an old fsimage into memory and replay an edits log that is often very large, which would be very time-consuming. (This is also part of the NameNode's fault-tolerance mechanism.)

SecondaryNameNode working process

The SecondaryNameNode checkpoint is controlled by three parameters: fs.checkpoint.period controls the checkpoint interval (in seconds, default 3600); fs.checkpoint.size triggers a merge when the edit log exceeds this size (in bytes, default 64 MB); and dfs.http.address specifies the HTTP address, which must be set when the SecondaryNameNode runs on a separate node.
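A small sketch of how these parameters could be set programmatically (in practice they belong in the site configuration files; the names above are the legacy Hadoop 1.x names, and Hadoop 2.x+ renamed them, e.g. dfs.namenode.checkpoint.period):

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Legacy parameter names as described above (Hadoop 1.x era)
        conf.setLong("fs.checkpoint.period", 3600);            // checkpoint once per hour
        conf.setLong("fs.checkpoint.size", 64L * 1024 * 1024); // or when edits exceed 64 MB

        System.out.println("period = " + conf.getLong("fs.checkpoint.period", 3600));
        System.out.println("size   = " + conf.getLong("fs.checkpoint.size", 67108864L));
    }
}
```

The checkpoint process itself proceeds in the following steps: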

  1. The SecondaryNameNode notifies the NameNode to prepare to submit the edits file. At this time, the primary node records the new write operation data into a new file edits.new.
  2. The SecondaryNameNode obtains the fsimage and edits files of the NameNode through HTTP GET (the temp.check-point or previous-checkpoint directories can be seen in the same-level directory of the SecondaryNameNode's current, and these directories store the image files copied from the NameNode).
  3. SecondaryNameNode starts to merge the above two files obtained to generate a new fsimage file fsimage.ckpt.
  4. SecondaryNameNode sends fsimage.ckpt to NameNode in HTTP POST mode.
  5. The NameNode renames the fsimage.ckpt and edits.new files to fsimage and edits respectively, and then updates fstime, and the entire checkpoint process ends here.

As can be seen from this working process, the important role of the SecondaryNameNode is to periodically merge the edit log into the namespace image, preventing the edit log from growing too large. The SecondaryNameNode generally needs to run on a separate machine, because the merge operation requires a lot of CPU time and as much memory as the NameNode. It keeps a copy of the merged namespace image, which can be used for recovery if the NameNode fails.

16. Let’s chat: The Secondary NameNode cannot restore all of the NameNode's data, so how is the safety of the NameNode's metadata ensured?

This question is about the high availability of NameNode, that is, NameNode HA.

A single NameNode is a single point of failure, so two NameNodes are configured. There are two key points in the configuration: the metadata of the two NameNodes must be kept in sync, and when one NameNode goes down, the other must take over immediately.

Metadata synchronization uses the "shared storage" of the HA solution: every time the edit log is written, it must also be written synchronously to the shared storage, and only when this step succeeds is the write considered successful. The standby node then periodically synchronizes the log from the shared storage so that it is ready for an active/standby switchover.

ZooKeeper is used to monitor NameNode status: the status of both NameNode nodes is stored in ZooKeeper, and each NameNode host also runs a monitoring process (ZKFC) that reads the NameNode status from ZooKeeper to determine whether the current Active NameNode is down. If the ZKFC on the Standby NameNode finds that the primary node has failed, it forcibly sends a shutdown request to the original Active NameNode and then switches the Standby NameNode to Active.

Tip: If the interviewer asks how the shared storage in HA is implemented, do you know?

It can be explained as follows: there are many shared-storage solutions for the NameNode, such as Linux HA, VMware FT, and QJM. The community has merged the QJM (Quorum Journal Manager) based solution contributed by Cloudera into the HDFS trunk and made it the default shared-storage implementation.

The shared storage system based on QJM is mainly used to save EditLog, not FSImage file. The FSImage file is still on the local disk of the NameNode.

The basic idea of QJM shared storage comes from the Paxos algorithm: a cluster of nodes called JournalNodes stores the EditLog, and each JournalNode keeps an identical copy. Every time the NameNode writes the EditLog, besides writing to its local disk, it also sends the write request in parallel to every JournalNode in the cluster; as long as a majority of the JournalNodes return success, the write to the JournalNode cluster is considered successful. With 2N+1 JournalNodes, by the majority principle up to N JournalNode failures can be tolerated.

17. Chat: Secondary namenode working mechanism

1) Phase 1: NameNode startup

(1) On the first startup, after formatting the NameNode, the fsimage and edits files are created. On subsequent startups, the image file and edit log are loaded directly into memory.

(2) The client requests to add, delete, or modify metadata.

(3) NameNode records the operation log and updates the rolling log.

(4) NameNode adds, deletes, modifies and checks data in memory.

2) The second stage: Secondary NameNode work

(1) The Secondary NameNode asks the NameNode whether a checkpoint is needed and directly receives the NameNode's answer.

(2) Secondary NameNode requests to execute checkpoint.

(3) The NameNode rolls the edits log that is currently being written.

(4) The pre-roll edit log and the image file are copied to the Secondary NameNode.

(5) The Secondary NameNode loads the edit log and the image file into memory and merges them.

(6) Generate a new image file fsimage.chkpoint.

(7) Copy fsimage.chkpoint to NameNode.

(8) NameNode renames fsimage.chkpoint to fsimage.

What is the difference and connection between NameNode and SecondaryNameNode?

1) Differences :

(1) NameNode is responsible for managing the metadata of the entire file system and the data block information corresponding to each path (file).

(2) The SecondaryNameNode is mainly used to periodically merge the namespace image with its edit log.

2) Connection :

(1) A mirror image file (fsimage) and edit log (edits) consistent with the namenode are saved in the SecondaryNameNode.

(2) When the primary namenode fails (assuming the data is not backed up in time), the data can be recovered from the SecondaryNameNode.

18. Chat: In NameNode HA, can a split-brain problem occur? How is split-brain solved?

How does HA NameNode work?

Main responsibilities of ZKFailoverController

1) Health monitoring : zkfc periodically sends health-check commands to the NameNode it monitors to determine whether that NameNode is healthy. If the machine is down or the heartbeat fails, zkfc marks it as unhealthy.

2) Session management : if the NameNode is healthy, zkfc keeps a session open in ZooKeeper. If the NameNode is also the Active one, zkfc additionally holds an ephemeral znode in ZooKeeper. When this NameNode goes down, the znode is deleted, the standby NameNode acquires the lock, is promoted to the primary, and its status is marked Active.

3) When the failed NameNode starts up again, it registers with ZooKeeper again; if it finds that the znode lock already exists, it automatically switches to the Standby state. This cycle ensures high reliability. Note that currently at most two NameNodes can be configured.

4) Master election : as mentioned above, a preemptive lock mechanism is implemented by maintaining an ephemeral znode in ZooKeeper, which determines which NameNode is in the Active state.

What is a split brain?

Assume NameNode1 is currently Active and NameNode2 is Standby. If the ZKFailoverController process of NameNode1 "freezes" (appears dead) at some moment, the ZooKeeper server will think NameNode1 is down, and according to the failover logic above, NameNode2 will replace it and enter the Active state. But at this moment NameNode1 may actually still be Active and running normally. Now both NameNode1 and NameNode2 are in the Active state and both can serve requests. This condition is called split-brain.

Split-brain is catastrophic for a system like the NameNode that requires very high data consistency; the data would become disordered and unrecoverable. The community's solution to this problem is called fencing, i.e., isolation: find a way to isolate the old Active NameNode so that it can no longer provide services to the outside world.

During fencing, the following operations are performed:

First try to call the transitionToStandby method of the HAServiceProtocol RPC interface of the old Active NameNode to see if it can be converted to the Standby state.

If the call to the transitionToStandby method fails, execute the predefined isolation measures in the Hadoop configuration file. Hadoop currently provides two isolation measures, and sshfence is usually selected:

sshfence: Log in to the target machine through SSH, and execute the command fuser to kill the corresponding process;

shellfence: Execute a user-defined shell script to isolate the corresponding process.

19. Let’s chat: What harm do too many small files cause, and how can they be avoided?

A large amount of HDFS metadata is stored in the NameNode's memory, so too many small files will certainly overwhelm the NameNode's memory.

Each metadata object occupies about 150 bytes, so if there are 10 million small files and each file occupies a block, the NameNode needs about 2G space. If 100 million files are stored, the NameNode needs 20G space.

The obvious solution to this problem is to merge small files. You can either merge them on the client side according to some strategy before uploading, or use Hadoop's CombineFileInputFormat<K,V> to pack many small files into larger input splits, as sketched below.
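A minimal sketch of the CombineFileInputFormat approach, using the concrete CombineTextInputFormat subclass for text input (the 128 MB split size is just an example value):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class CombineSmallFilesExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");

        // Pack many small files into a few large splits instead of one split per file,
        // so that far fewer MapTasks are launched.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // 128 MB per split
    }
}
```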

20. Let’s chat: Please talk about the organizational structure of HDFS

The architecture mainly consists of four parts, namely HDFS Client, NameNode, DataNode and Secondary NameNode. Below we introduce these four components separately.

1) Client : Client

(1) File splitting: when a file is uploaded to HDFS, the Client splits the file into blocks one by one and then uploads them for storage

(2) Interact with NameNode to obtain the location information of the file

(3) Interact with DataNode to read or write data

(4) Client provides some commands to manage HDFS, such as starting and closing HDFS, accessing HDFS directory and content, etc.

2) NameNode : Name node, also known as master node, stores metadata information of data and does not store specific data

(1) Manage the namespace of HDFS

(2) Management data block (Block) mapping information

(3) Configure the copy strategy

(4) Handle client read and write requests

3) DataNode : data node, also called slave node. NameNode issues commands and DataNode performs actual operations

(1) Store the actual data block

(2) Execute the read/write operation of the data block

4) Secondary NameNode : It is not the hot standby of NameNode. When the NameNode hangs up, it cannot immediately replace the NameNode and provide services

(1) Auxiliary NameNode to share its workload

(2) Regularly merge Fsimage and Edits and push to NameNode

(3) In an emergency, it can assist in recovering the NameNode

21. Chat: Please talk about the working mechanism of Map Task in MR

Brief overview :

The input file is cut into multiple splits, and a RecordReader reads the content line by line and passes it to the map method (the processing logic written by the user). After the data is processed by map, it is handed to the OutputCollector, which partitions the result key (HashPartitioner by default) and then writes it into a buffer. Each map task has an in-memory ring buffer that stores the map output; when the buffer is nearly full, its data is spilled to a temporary file on disk. After the whole map task finishes, all the temporary files it produced on disk are merged into the final output file, which then waits to be pulled by the reduce tasks.

Detailed steps :

The input-reading component InputFormat (TextInputFormat by default) uses its getSplits method to logically slice the files in the input directory into splits; the number of splits determines how many MapTasks are started.

After the input is divided into splits, it is read by a RecordReader object (LineRecordReader by default), which uses \n as the delimiter and returns one line at a time as a <key,value> pair: the key is the byte offset of the first character of the line, and the value is the text content of the line.

Each <key,value> pair returned by the RecordReader is passed into the user's Mapper subclass, and the user's overridden map function is executed once for each line read.

After the Mapper logic finishes, each Mapper result is collected via context.write. During collection it is first partitioned, with HashPartitioner used by default.

Next, the data is written into memory, into an area called the ring buffer (100 MB by default). The buffer's job is to collect Mapper results in batches and reduce the impact of disk I/O. The key/value pairs and their partition results are written into the buffer; before being written, the keys and values are serialized into byte arrays.

When the data in the ring buffer reaches the spill ratio (0.8 by default), i.e. 80 MB, the spill thread starts, and the keys in that 80 MB of space are sorted. Sorting is the default behavior of the MapReduce model, and the sort here operates on the serialized bytes.

Merging spill files: each spill produces a temporary file on disk (with the Combiner applied before writing, if one is configured). If the Mapper output is large and there are many spills, there will be many temporary files on disk. After all the data has been processed, a merge combines the temporary files on disk into a single final output file, and an index file is created for it recording the offset of the data belonging to each reduce.
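A minimal word-count-style Mapper, as a sketch of the steps above: the framework feeds it the <offset, line> pairs produced by TextInputFormat/LineRecordReader, and every context.write goes through the partitioner and the ring buffer described above:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line content
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                // collected, partitioned (HashPartitioner by default), then spilled from the ring buffer
                context.write(word, ONE);
            }
        }
    }
}
```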

22. Chat: Please talk about the working mechanism of Reduce Task in MR

Brief description :

Reduce is roughly divided into three stages: copy, sort, and reduce, with the focus on the first two stages.

In the copy phase, an eventFetcher obtains the list of completed maps, and Fetcher threads copy the data. During this process two merge threads are started, inMemoryMerger and onDiskMerger, which merge in-memory data to disk and merge disk files with other disk files, respectively. When the data copy is finished, the copy phase ends.

Then the sort phase starts; it mainly executes the finalMerge operation and is a pure sorting stage. After it completes, the reduce phase calls the user-defined reduce function to process the data.

Detailed steps :

Copy phase: simply pulling data. The reduce process starts several data-copy threads (Fetchers) that request the map tasks' output files over HTTP (the partitions of a map task's output identify which reduce task each piece of data belongs to; reduce task IDs start from 0 by default).

Merge phase: while copying data remotely, the ReduceTask starts two background threads that merge files in memory and on disk, to prevent excessive memory usage or too many files on disk.

There are three forms of merge: memory to memory, memory to disk, and disk to disk. The first form is not enabled by default. When the amount of data in memory reaches a certain threshold, the memory-to-disk merge starts; similar to the map side, this is also a spill process, and if a Combiner is configured it is applied here as well, so many spill files are produced on disk. The memory-to-disk merge keeps running until there is no more data from the map side, and then the third form, the disk-to-disk merge, starts and produces the final file.

Merge sort: after the scattered data has been merged into one large data set, the merged data is sorted again.

Calling the reduce method on the sorted key-value pairs: the reduce method is called once for each group of key-value pairs with equal keys; each call produces zero or more key-value pairs, and these output pairs are finally written to a file on HDFS.
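A matching word-count-style Reducer, sketching the final step: after copy, merge, and sort, reduce() is invoked once per key with all of that key's values grouped together:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();   // aggregate all counts for this key
        }
        total.set(sum);
        context.write(key, total); // written to HDFS by the output format
    }
}
```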

23. Chat: The core steps of the Shuffle stage in MR

The shuffle phase is divided into four steps: partitioning, sorting, combining (local reduction), and grouping. The first three steps are completed in the map phase, and the last step is completed in the reduce phase.

Shuffle is the core of MapReduce and spans both its map stage and reduce stage. In general, the process from the map producing output to the reduce obtaining that output as input is called shuffle.

Collect stage : Output the results of MapTask to the ring buffer with a default size of 100M, which stores key/value, Partition partition information, etc.

Spill stage : when the amount of data in memory reaches a certain threshold, the data is written to the local disk. Before being written, the data is sorted once, and if a combiner is configured it is also applied; data with the same partition number is sorted by key.

Merge in the MapTask stage : perform a merge operation on all overflowed temporary files to ensure that a MapTask will only generate one intermediate data file in the end.

Copy stage : the ReduceTask starts Fetcher threads to copy its own partition of the data from the nodes where MapTasks have completed. This data is stored in a memory buffer by default; when the memory buffer reaches a certain threshold, the data is written to disk.

Merge in the ReduceTask stage : while the ReduceTask is copying data remotely, two background threads are started to merge the data files from memory onto local disk.

Sort stage : while merging the data, a sort is also performed. Since the MapTask stage has already partially sorted its output, the ReduceTask only needs to ensure that the copied data is globally ordered in the end.

The size of the buffer used in shuffle affects the execution efficiency of a MapReduce program: in principle, the larger the buffer, the fewer disk I/O operations, and the faster the execution.

The buffer size can be adjusted with the parameter mapreduce.task.io.sort.mb (default 100 MB), as in the sketch below.
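A small sketch of tuning these shuffle knobs in job code (the values are illustrative; in practice they usually live in mapred-site.xml):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Size of the map-side ring buffer (default 100 MB)
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        // Spill threshold as a fraction of the buffer (default 0.80)
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);

        Job job = Job.getInstance(conf, "shuffle-tuning-demo");
        // ... set mapper/reducer/input/output as usual before submitting
    }
}
```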

24. Chat: Do you understand the data compression mechanism in the Shuffle stage?

In the shuffle stage, data is copied extensively: the output of the map stage must be copied across the network to the reduce stage, which involves a large amount of network I/O. If the data is compressed, the amount sent over the network is much smaller.

Compression algorithms supported in Hadoop:

gzip, bzip2, LZO, LZ4, and Snappy. Weighing compression ratio against compression and decompression speed, Google's Snappy is usually the best overall choice, so Snappy compression is generally used; being a Google product, it has a solid reputation.
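A minimal sketch of enabling Snappy compression for the map output that is shuffled over the network (this assumes the native Snappy library is available on the cluster nodes):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class MapOutputCompressionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Compress map output before it is copied to the reducers
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "snappy-shuffle-demo");
        // ... set mapper/reducer/input/output as usual before submitting
    }
}
```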

25. Let’s chat: When writing MR, under what circumstances can a Combiner be used?

A Combiner performs a partial aggregation and must not change the final result of the job. It is suitable for operations like summation, but not for averaging. If the input types of the reduce function are the same as its output types, the Reducer class itself can be used as the Combiner; you only need to register the Combiner class in the driver, as shown in the sketch below.
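A short sketch of wiring a Combiner in the driver, reusing the word-count Reducer sketched earlier (summation is associative, so the Reducer can double as the Combiner; an averaging Reducer could not be reused this way):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerSetupExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");

        job.setMapperClass(WordCountMapper.class);    // Mapper sketched earlier
        job.setReducerClass(WordCountReducer.class);  // Reducer sketched earlier
        job.setCombinerClass(WordCountReducer.class); // partial sums on the map side

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // ... set input/output paths as usual before submitting
    }
}
```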

26. Chat: Architecture and working principle of YARN cluster

The basic design idea of ​​YARN is to split the JobTracker in MapReduce V1 into two independent services: ResourceManager and ApplicationMaster.

ResourceManager is responsible for resource management and allocation of the entire system, and ApplicationMaster is responsible for the management of a single application.

ResourceManager : RM is a global resource manager responsible for resource management and allocation of the entire system. It mainly consists of two parts: Scheduler and Application Manager.

The Scheduler allocates the system's resources to running applications according to constraints such as capacity and queues; under the premise of guaranteeing capacity, fairness, and service levels, it optimizes cluster resource utilization so that all resources are fully used. The Applications Manager is responsible for managing all applications in the system, including accepting application submissions, negotiating with the Scheduler for the resources to start each ApplicationMaster, monitoring the ApplicationMaster's running status, and restarting it on failure.

ApplicationMaster : An application submitted by a user corresponds to an ApplicationMaster, and its main functions are:

Negotiate with the RM scheduler to obtain resources, and resources are represented by Containers.

The resources it obtains are further assigned to its internal tasks.

Communicate with NM to start/stop tasks.

Monitor the status of all internal tasks, and re-apply for resources for the task to restart the task when the task fails to run.

NodeManager : the NodeManager is the resource and task manager on each node. On one hand, it periodically reports to the RM the resource usage on its node and the running status of each Container; on the other hand, it receives and processes Container start/stop requests from the AM.

Container : a Container is the resource abstraction in YARN, encapsulating various resources. An application is allocated Containers, and it can only use the resources described in those Containers. Unlike the slot-based resource encapsulation in MapReduce V1, a Container is a dynamically divided unit of resources, which allows resources to be fully utilized.

27. Chat: What is the task submission process of YARN

When the job client submits an application to YARN, YARN runs the application in two stages: the first stage starts the ApplicationMaster; in the second stage, the ApplicationMaster creates the application, applies for resources for it, and monitors its execution until it finishes. The specific steps are as follows:

The user submits an application to YARN, and specifies the ApplicationMaster program, the command to start the ApplicationMaster, and the user program.

RM allocates the first Container for this application, and communicates with the corresponding NM, asking it to start the application ApplicationMaster in this Container.

The ApplicationMaster registers with the RM, then splits the application into its internal subtasks, applies for resources for each internal task, and monitors the running of these tasks until they finish.

AM uses polling to apply for and receive resources from RM.

RM allocates resources to AM and returns them in the form of Container.

After the AM applies for the resource, it communicates with the corresponding NM and asks the NM to start the task.

NodeManager sets up the running environment for the task, writes the task start command into a script, and starts the task by running the script.

Each task reports its status and progress to the AM, so that the task can be restarted when the task fails.

When the application has finished, the ApplicationMaster unregisters from the ResourceManager and shuts itself down.

28. Chat: Do you understand the three models of YARN resource scheduling?

There are three schedulers to choose from in Yarn: FIFO Scheduler, Capacity Scheduler, and Fair Scheduler.

The Apache version of Hadoop uses the Capacity Scheduler by default, while the CDH distribution uses the Fair Scheduler by default; the scheduler can be switched via configuration, as in the sketch below.
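As a sketch, the scheduler is selected with the yarn.resourcemanager.scheduler.class property (normally set in yarn-site.xml; shown here in code only to illustrate the property name and values):

```java
import org.apache.hadoop.conf.Configuration;

public class SchedulerConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Fair Scheduler
        conf.set("yarn.resourcemanager.scheduler.class",
                "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
        // Capacity Scheduler would instead be:
        // org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler

        System.out.println(conf.get("yarn.resourcemanager.scheduler.class"));
    }
}
```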

FIFO Scheduler (first come, first served):

The FIFO Scheduler arranges applications into a queue in the order they are submitted, a first-in-first-out queue. When allocating resources, it first serves the application at the head of the queue; once that application's requirements are met, it moves on to the next one, and so on.

The FIFO Scheduler is the simplest and easiest scheduler to understand and requires no configuration, but it is not suitable for shared clusters. A large application may occupy all cluster resources, causing other applications to be blocked. For example, if a large task is executing and occupying all resources when a small task is submitted, the small task will be blocked indefinitely.

Capacity Scheduler:

For the Capacity Scheduler, there is a dedicated queue for running small tasks, but setting aside a queue for small tasks pre-occupies a certain amount of cluster resources, which causes the execution time of large tasks to lag behind what it would be under the FIFO Scheduler.

Fair Scheduler:

With the Fair Scheduler, we do not need to reserve system resources in advance; the Fair Scheduler dynamically adjusts resources across all running jobs.

For example, when the first large job is submitted and it is the only job running, it obtains all the cluster resources; when a second, small job is submitted, the Fair Scheduler allocates half of the resources to that small job, so that the two jobs share the cluster resources fairly.

Note that with the Fair Scheduler there is a certain delay between the submission of the second job and it obtaining resources, because it has to wait for the first job to release the Containers it occupies. After the small job finishes, it releases the resources it occupies, and the large job again obtains all the system resources. The net effect is that the Fair Scheduler achieves both high resource utilization and timely completion of small jobs.

A note at the end:

Continuous iteration and continuous upgrading are the tenets of the Nien team.

Continuous iteration and continuous upgrading are also the soul of "Nien's Big Data Interview Collection" and "Nien's Java Interview Collection".

More real big data interview questions will be collected in the future. If you run into big data interview problems, you can come to Nien's community, "Technical Freedom Circle" (formerly "Crazy Maker Circle"), to discuss them and ask for help.

Our goal is to create the best big data interview book in the world.

The realization path of technical freedom PDF:

Realize your architectural freedom:

" Have a thorough understanding of the 8-figure-1 template, everyone can do the architecture "

" 10Wqps review platform, how to structure it? This is what station B does! ! ! "

" Alibaba Two Sides: How to optimize the performance of tens of millions and billions of data?" Textbook-level answers are coming "

" Peak 21WQps, 100 million DAU, how is the small game "Sheep a Sheep" structured? "

" How to Scheduling 10 Billion-Level Orders, Come to a Big Factory's Superb Solution "

" Two Big Factory 10 Billion-Level Red Envelope Architecture Scheme "

… more architecture articles, being added

Realize your responsive freedom:

" Responsive Bible: 10W Words, Realize Spring Responsive Programming Freedom "

This is the old version of " Flux, Mono, Reactor Combat (the most complete in history) "

Realize your spring cloud freedom:

" Spring cloud Alibaba Study Bible "

" Sharding-JDBC underlying principle and core practice (the most complete in history) "

" Get it done in one article: the chaotic relationship between SpringBoot, SLF4j, Log4j, Logback, and Netty (the most complete in history) "

Realize your linux freedom:

" Linux Commands Encyclopedia: 2W More Words, One Time to Realize Linux Freedom "

Realize your online freedom:

" Detailed explanation of TCP protocol (the most complete in history) "

" Three Network Tables: ARP Table, MAC Table, Routing Table, Realize Your Network Freedom!" ! "

Realize your distributed lock freedom:

" Redis Distributed Lock (Illustration - Second Understanding - The Most Complete in History) "

" Zookeeper Distributed Lock - Diagram - Second Understanding "

Realize your king component freedom:

" King of the Queue: Disruptor Principles, Architecture, and Source Code Penetration "

" The King of Cache: Caffeine Source Code, Architecture, and Principles (the most complete in history, 10W super long text) "

" The King of Cache: The Use of Caffeine (The Most Complete in History) "

" Java Agent probe, bytecode enhanced ByteBuddy (the most complete in history) "

Realize your interview questions freely:

4,000 pages of "Nien's Java Interview Collection", 40 topics

Please go to the following [Technical Freedom Circle] to get the PDF file update of Nien’s architecture notes and interview questions↓↓↓


Origin blog.csdn.net/crazymakercircle/article/details/131283719