Hadoop study notes (4): HDFS

1. HDFS Architecture

1 HDFS assumptions

  Streaming data access

  Very large data sets

  Simple coherency model (write-once, read-many)

  Moving computation is cheaper than moving data

  Portability across heterogeneous hardware and software platforms

2 Design goals of HDFS

  Very large distributed file system

  Runs on commodity hardware

  Optimized for batch processing

  Clients can run on heterogeneous operating systems

  A single namespace across the cluster

  Data consistency

  Files are divided into blocks

  Smart clients

  Computation is scheduled by the principle of "data locality": tasks are assigned to nodes close to the data

  Clients do not cache file data

3 HDFS Architecture

 

1 HDFS Architecture - File

  Files are split into blocks (default size 64 MB); the block is the unit of storage. Each block has multiple replicas stored on different machines, and the replication factor can be specified when the file is created (default 3).
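The block arithmetic above can be sketched as follows. This is an illustrative model, not Hadoop's implementation; the function name is invented for this note.

```python
# Illustrative sketch: split a file into fixed-size blocks, as HDFS does.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size cited above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs for each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# A 150 MB file occupies three blocks; the last block holds only the
# remaining 22 MB rather than a full 64 MB.
print(split_into_blocks(150 * 1024 * 1024))
```

Note that a file smaller than the block size does not waste a full block on disk; only the actual data length is stored.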

  The NameNode is the master node. It stores file metadata: the file name, directory structure, and file attributes (creation time, replica count, permissions), as well as each file's block list and the DataNodes on which each block is located.

  DataNodes store file block data, together with block checksums, in the local file system.

  Files can be created, deleted, moved, or renamed, but file contents cannot be modified once a file has been created, written, and closed.

2 HDFS file permissions

  Similar to Linux file permissions.

  r: read; w: write; x: execute. The x permission is ignored for files; for directories it controls whether their contents may be accessed.

  If the Linux user zhangsan uses the hadoop command to create a file, the owner of that file in HDFS is zhangsan.

  The purpose of HDFS permissions: to stop good people from doing the wrong thing, not to stop bad people from doing bad things. HDFS trusts the identity you present: you tell it who you are, and it takes your word for it.

3 HDFS Architecture - Component Functions

NameNode                                        DataNode
Stores metadata                                 Stores file contents
Metadata is kept in memory                      File contents are kept on disk
Maps files to blocks and blocks to DataNodes    Maps block IDs to local files
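The NameNode's two mappings can be sketched as a hypothetical in-memory model. All paths, block IDs, and node names below are invented for illustration; the real NameNode uses far more elaborate data structures.

```python
# Hypothetical in-memory model of the NameNode's two mappings:
# file path -> list of block IDs, and block ID -> list of DataNodes.
namespace = {
    "/logs/app.log": ["blk_1001", "blk_1002"],
}
block_locations = {
    "blk_1001": ["dn1", "dn2", "dn3"],
    "blk_1002": ["dn2", "dn3", "dn4"],
}


def locate(path):
    """Return, for each block of a file, the DataNodes that hold it."""
    return [(b, block_locations[b]) for b in namespace[path]]


print(locate("/logs/app.log"))
```

This mirrors the read path: the client asks the NameNode only for locations, then fetches block data directly from the DataNodes.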

  NameNode:

  A central server and a single node (which simplifies the system's design and implementation), responsible for managing the file system namespace and client access to files.

  For file operations, the NameNode handles file metadata while DataNodes handle read and write requests for file content. Data streams carrying file content never pass through the NameNode; clients only ask it which DataNode to contact. Otherwise the NameNode would become a system bottleneck.

  The NameNode controls which DataNodes store each replica, making block placement decisions based on the global state of the cluster. When a file is read, the NameNode tries to direct the client to the nearest replica first, reducing bandwidth consumption and read latency.

  The NameNode is solely responsible for the replication of data blocks. It periodically receives heartbeat signals and block reports (BlockReport) from each DataNode in the cluster. A heartbeat means the DataNode is working properly; a block report lists all the data blocks on that DataNode.

  DataNode:

  A data block is stored as files on the DataNode's disk: one file holds the data itself, and a second holds metadata including the block's length, the checksum of the block data, and a timestamp.

  After starting, a DataNode registers with the NameNode; once accepted, it periodically (every hour) reports all of its block information to the NameNode.

  Heartbeats are sent every 3 seconds, and the heartbeat response carries commands from the NameNode to the DataNode, such as copying a block to another machine or deleting a block. If no heartbeat is received from a DataNode for more than 10 minutes, the node is considered unavailable.

  Machines can safely join and leave the cluster while it is running.
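The liveness rule above (3-second heartbeats, 10-minute dead timeout) can be sketched as a minimal check. The function name and timestamps are assumptions for illustration.

```python
# Sketch of the NameNode's liveness rule: a DataNode heartbeats every
# 3 seconds; if nothing is heard for over 10 minutes, it is marked dead.
HEARTBEAT_INTERVAL = 3   # seconds between heartbeats
DEAD_TIMEOUT = 10 * 60   # seconds of silence before a node is declared dead


def is_alive(last_heartbeat, now):
    """True if the DataNode's last heartbeat is within the dead timeout."""
    return (now - last_heartbeat) <= DEAD_TIMEOUT

print(is_alive(last_heartbeat=0, now=300))  # heard 5 minutes ago -> True
print(is_alive(last_heartbeat=0, now=700))  # silent for 11+ minutes -> False
```

Once a node is declared dead, the NameNode re-replicates its blocks elsewhere to restore the replication factor.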

4 HDFS replica placement strategy

            Before Hadoop 0.17                          After Hadoop 0.17
Replica 1:  A different node on the client's rack       The same node as the client
Replica 2:  Another node on the same rack               A node on a different rack
Replica 3:  A node on a different rack                  Another node on the same rack as replica 2
Others:     Randomly selected                           Randomly selected
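The post-0.17 policy in the table can be sketched as follows. The rack topology, node names, and function name are assumptions for illustration; the real placement policy also weighs load and free space.

```python
import random

# Sketch of the post-0.17 replica placement policy for 3 replicas.
def place_replicas(racks, client_node):
    """racks: dict of rack -> list of nodes; returns 3 distinct nodes."""
    node_rack = {node: rack for rack, nodes in racks.items() for node in nodes}
    first = client_node                          # replica 1: same node as the client
    other_rack = random.choice(
        [r for r in racks if r != node_rack[first]])
    second = random.choice(racks[other_rack])    # replica 2: a different rack
    third = random.choice(                       # replica 3: same rack as replica 2
        [n for n in racks[other_rack] if n != second])
    return [first, second, third]


racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas(racks, "n1"))
```

Placing replicas 2 and 3 on one remote rack survives a whole-rack failure while keeping cross-rack write traffic to a single transfer.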

5 HDFS data corruption handling

  When a DataNode reads a block, it computes the block's checksum.

  If the computed checksum differs from the value recorded when the block was created, the block is corrupt.

  The client then reads the block from another DataNode.

  The NameNode marks the block as corrupt and re-replicates it until the file's expected replication factor is restored.

  Each DataNode also periodically re-verifies its blocks' checksums (by default, three weeks after a block file is created).
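The verification step above can be sketched with a CRC32 checksum, which is the checksum family HDFS uses; this simplified version computes one checksum over the whole block rather than per chunk, and the function names are invented for this note.

```python
import zlib

# Sketch of block corruption detection: compare the checksum computed
# at read time with the checksum stored when the block was created.
def checksum(data):
    return zlib.crc32(data)


def is_corrupt(block_data, stored_checksum):
    """True if the block's current checksum no longer matches the stored one."""
    return checksum(block_data) != stored_checksum


data = b"block contents"
stored = checksum(data)          # recorded at block creation time
print(is_corrupt(data, stored))                 # intact block -> False
print(is_corrupt(b"flipped bits!!", stored))    # altered data -> True
```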

6 HDFS Architecture - Client & Secondary NameNode

Client:

  Splits files into blocks

  Interacts with the NameNode to obtain file location information

  Interacts with DataNodes to read or write data

  Provides commands to manage HDFS

  Provides access to HDFS

Secondary NameNode (SNN):

  Not a hot standby for the NameNode

  Assists the NameNode and shares part of its workload

  Periodically merges the fsimage and fsedits files and pushes the result to the NameNode

  In an emergency, can assist in recovering the NameNode
