HDFS—Common interview questions

1. HDFS write process

  1. The client asks the NameNode, through the DistributedFileSystem module, to upload a file; the NameNode checks whether the target file already exists and whether its parent directory exists.
  2. The NameNode replies whether the file can be uploaded.
  3. The client asks which DataNode servers the first block should be uploaded to.
  4. The NameNode returns 3 DataNodes: dn1, dn2, and dn3.
  5. The client asks dn1, through the FSDataOutputStream module, to start the upload. When dn1 receives the request it calls dn2, and dn2 in turn calls dn3, completing the establishment of the communication pipeline.
  6. dn1, dn2, and dn3 acknowledge the client level by level.
  7. The client starts uploading the first block to dn1 (data is first read from disk into a local in-memory cache) in units of Packets. dn1 forwards each packet to dn2, and dn2 forwards it to dn3; for every packet it sends, dn1 puts the packet into an acknowledgement queue and waits for the reply.
  8. When a block has finished transmitting, the client again asks the NameNode which servers to upload the second block to, and steps 3-7 are repeated. A minimal client-side write example is sketched below.
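From the application's point of view, the whole pipeline above is hidden behind a single output stream. A minimal write sketch using the standard Hadoop FileSystem API; the NameNode URI and file path are made-up placeholders:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; replace with your cluster's fs.defaultFS.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // create() contacts the NameNode (steps 1-4), then streams packets
        // through the DataNode pipeline (steps 5-8) behind the scenes.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            out.writeUTF("hello hdfs");
        }
        fs.close();
    }
}
```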

2. HDFS read process

  1. The client asks the NameNode, through DistributedFileSystem, to download a file; the NameNode queries its metadata to find the DataNode addresses where the file's blocks are located.
  2. The client picks a DataNode (nearest first, then random) and requests to read the data.
  3. The DataNode starts streaming data to the client (it reads the data from disk into an input stream and sends it in units of Packets, with checksum verification).
  4. The client receives the packets, caches them locally first, and then writes them to the target file. A minimal client-side read example is sketched below.
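Mirroring the write path, a minimal read sketch with the FileSystem API; the NameNode URI and file path are again placeholders:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; replace with your cluster's fs.defaultFS.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // open() asks the NameNode for the block locations (step 1), then the
        // stream reads packets from the closest DataNode holding each block.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```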

3. Under what circumstances will data on a DataNode not be backed up?

  1. When the number of replicas is set to 1, no backup copies are made.
  2. Extension: where is the number of replicas configured in Hadoop, and in which field? The dfs.replication property in hdfs-site.xml. A configuration sketch follows below.
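A sketch of the two usual ways to control replication: the cluster-wide dfs.replication property (normally set in hdfs-site.xml, shown here through the Configuration API), and a per-file change after creation. The URI and path are placeholders:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to setting dfs.replication to 1 in hdfs-site.xml:
        // with a replication factor of 1, each block is stored once and no backup copies are made.
        conf.setInt("dfs.replication", 1);
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // The replication factor of an existing file can also be changed afterwards.
        fs.setReplication(new Path("/user/demo/hello.txt"), (short) 3);
        fs.close();
    }
}
```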

4. Problems caused by a large number of small files in HDFS and solutions

Problem:

In Hadoop, every directory, file, and block is held as an object in the NameNode's memory, and each object takes roughly 150 bytes. A large number of small files therefore consumes a great deal of NameNode memory, slows down metadata reads, lengthens startup time, and, because so much memory is occupied, also increases GC time. For example, 10 million single-block files means roughly 20 million objects, i.e. about 3 GB of NameNode heap for metadata alone.

Solution:

There are two angles: one is to avoid generating small files at the source; the other, if that cannot be solved, is to merge them.

Start with the data source: for example, change extraction from once an hour to once a day so that more data accumulates per file.

If small files are unavoidable, they are usually handled by merging: for example, an MR job can read all the small files in a directory and rewrite them into one large file. A simplified single-process version of the same idea is sketched below.
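A simplified, single-process sketch of that merge idea using the FileSystem API (a real job would parallelize this with MapReduce; the directory and file names are hypothetical):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class MergeSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path inputDir = new Path("/user/demo/small-files");    // hypothetical input directory
        Path merged   = new Path("/user/demo/merged/big.dat"); // hypothetical output file

        // Concatenate every small file in the directory into one large file, so the
        // NameNode tracks one file (and a few blocks) instead of thousands of objects.
        try (FSDataOutputStream out = fs.create(merged)) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isFile()) {
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        IOUtils.copyBytes(in, out, 4096, false);
                    }
                }
            }
        }
        fs.close();
    }
}
```

Hadoop Archives (HAR files) and SequenceFiles are other common ways to pack many small files into fewer NameNode objects.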

5. What are the three core components of HDFS and what are their roles?

  1. NameNode: the core of the cluster and the management node of the entire file system. It maintains the file system's directory tree and metadata, as well as the mapping between files and their lists of data blocks.
  2. DataNode: the node that stores the actual data blocks. It is mainly responsible for reading and writing data and sends heartbeats to the NameNode periodically.
  3. SecondaryNameNode: a secondary node that synchronizes the metadata held by the NameNode and assists the NameNode in merging the fsimage and edits log.

6. What are fsimage and editlogs used for?

  1. The fsimage file stores a snapshot of the Hadoop file system metadata. If the NameNode fails, the latest fsimage file is loaded into memory to rebuild the metadata up to that checkpoint, and then every transaction recorded in the edit logs after that point is replayed.
  2. When a file system client performs a write operation, the transaction is first recorded in the edit log.
  3. While the NameNode is running, every client write to HDFS is saved in the edits file. Over time the edits file grows very large; this has no effect on the running NameNode, but when the NameNode is restarted it must map the contents of the fsimage into memory and then replay the operations in the edits file one by one, so an oversized edits file makes the restart very slow. Therefore the edit logs and fsimage must be merged regularly while the NameNode is running (this is the checkpoint performed by the SecondaryNameNode; see question 11).

7. The block size in Linux is 4KB, why is the block size in HDFS 64MB or 128MB?

A block is the smallest unit of data stored in a file system. If Hadoop stored its data with a 4 KB block size, a huge number of blocks would be required, which would greatly increase the time needed to locate a block and reduce read/write efficiency. In addition, a map or reduce task processes one block at a time; if blocks are small, the number of MapReduce tasks becomes large, and the switching between tasks adds overhead and reduces efficiency. A rough comparison of block counts is sketched below.
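A back-of-the-envelope comparison of how many blocks (and therefore NameNode metadata objects) a single 1 TB file would produce under the two block sizes:

```java
public class BlockCountComparison {
    public static void main(String[] args) {
        long fileSize   = 1L << 40;             // 1 TB
        long linuxBlock = 4L * 1024;            // 4 KB
        long hdfsBlock  = 128L * 1024 * 1024;   // 128 MB

        // 1 TB / 4 KB   = 268,435,456 blocks to track and seek through
        // 1 TB / 128 MB = 8,192 blocks
        System.out.println("4 KB blocks:   " + fileSize / linuxBlock);
        System.out.println("128 MB blocks: " + fileSize / hdfsBlock);
    }
}
```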

8. Is it possible to write HDFS files concurrently?

No. Once a client has obtained permission from the NameNode to write to a data block, that block is locked until the write operation completes, so another client cannot write to the same block at the same time.

9. HDFS placement strategy

With the default rack-aware placement policy and a replication factor of 3: the first replica is written to the node where the client runs (or to a random node if the client is outside the cluster); the second replica is placed on a node in a different rack; the third replica is placed on another node in the same rack as the second replica. A sketch for inspecting where a file's blocks actually landed follows below.
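To check where the replicas of a particular file actually landed, the block locations can be queried through the FileSystem API; the NameNode URI and file path below are placeholders:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // For each block of the file, print which DataNodes hold its replicas.
        Path file = new Path("/user/demo/hello.txt");
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " -> " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```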
 

10. What is the difference and connection between NameNode and SecondaryNameNode?

Differences:

  1. The NameNode is responsible for managing the metadata of the entire file system and the data block information corresponding to each path (file).
  2. The SecondaryNameNode is mainly used to periodically merge the namespace image (fsimage) with the edit logs.

Connection:

  1. The SecondaryNameNode keeps an image file (fsimage) and edit log (edits) consistent with the NameNode's.
  2. When the primary NameNode fails (assuming the data has not been backed up in time), data can be restored from the SecondaryNameNode.

11. The working mechanism of the NameNode

Phase 1: NameNode startup

  1. When the NameNode is started for the first time after formatting, it creates the fsimage and edits files. On subsequent starts, it loads the edit log and image file directly into memory.
  2. The client requests to add, delete, or modify metadata.
  3. The NameNode records the operation in the log and rolls the log.
  4. The NameNode adds, deletes, or changes the metadata in memory.

Phase 2: SecondaryNameNode work

  1. The SecondaryNameNode asks the NameNode whether a checkpoint is needed and gets the NameNode's answer back directly (the trigger conditions are configurable; see the sketch after this list).
  2. The SecondaryNameNode requests that a checkpoint be executed.
  3. The NameNode rolls the edits log that is currently being written.
  4. The edit log and image file from before the roll are copied to the SecondaryNameNode.
  5. The SecondaryNameNode loads the edit log and image file into memory and merges them.
  6. A new image file, fsimage.chkpoint, is generated.
  7. fsimage.chkpoint is copied back to the NameNode.
  8. The NameNode renames fsimage.chkpoint to fsimage.
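The checkpoint trigger in step 1 is controlled by configuration. These properties normally live in hdfs-site.xml; they are set through the Configuration API here only to keep the examples in one language, and the values shown are the commonly cited defaults:

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Checkpoint at least once per hour (value in seconds)...
        conf.setLong("dfs.namenode.checkpoint.period", 3600);
        // ...or earlier, once this many transactions have accumulated in the edits log.
        conf.setLong("dfs.namenode.checkpoint.txns", 1_000_000);
        // How often the SecondaryNameNode polls the NameNode's transaction count (seconds).
        conf.setLong("dfs.namenode.checkpoint.check.period", 60);

        System.out.println("checkpoint period: " + conf.get("dfs.namenode.checkpoint.period") + " s");
    }
}
```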

12. DataNode working mechanism

  1. On a DataNode, a data block is stored as files on disk: one file holds the data itself, and the other holds the metadata, including the block's length, its checksum, and a timestamp.
  2. After a DataNode starts, it registers with the NameNode; once accepted, it periodically (e.g. every hour) reports all of its block information to the NameNode.
  3. A heartbeat is sent every 3 seconds, and each heartbeat reply carries commands from the NameNode for the DataNode, such as copying a block to another machine or deleting a block. If no heartbeat is received from a DataNode for more than 10 minutes, the node is considered unavailable (these intervals are configurable; see the sketch after this list).
  4. Machines can safely join and leave the cluster while it is running.
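The intervals above map to a handful of configuration properties (normally set in hdfs-site.xml), and the "more than 10 minutes" rule of thumb is derived from them. A sketch, using commonly cited defaults:

```java
import org.apache.hadoop.conf.Configuration;

public class DataNodeTimeouts {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Heartbeat every 3 seconds (value in seconds).
        conf.setLong("dfs.heartbeat.interval", 3);
        // NameNode re-check interval for dead DataNodes: 5 minutes (milliseconds).
        conf.setLong("dfs.namenode.heartbeat.recheck-interval", 5 * 60 * 1000);
        // Full block report interval, e.g. once per hour as mentioned above (milliseconds).
        conf.setLong("dfs.blockreport.intervalMsec", 60 * 60 * 1000);

        // A DataNode is declared dead after
        //   2 * recheck-interval + 10 * heartbeat-interval
        // = 2 * 5 min + 10 * 3 s = 10 minutes 30 seconds,
        // which is where the "more than 10 minutes" figure comes from.
        long timeoutMs = 2 * (5 * 60 * 1000L) + 10 * 3 * 1000L;
        System.out.println("DataNode dead-node timeout: " + timeoutMs / 1000 + " s");
    }
}
```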

13. What do you think is unreasonable about Hadoop's design?

  1. Concurrent writes to a file and random modification of file contents are not supported.
  2. Low-latency data access is not supported (HDFS is designed for high-throughput batch access rather than low-latency reads).
  3. Storing a large number of small files occupies a large amount of NameNode memory, and for small files the seek time exceeds the read time.
  4. Setting up a Hadoop environment is relatively complicated.
  5. Data cannot be processed in real time.


Origin blog.csdn.net/sanmi8276/article/details/113079301