Getting Started with HDFS

Original author: Zh_Y_G

Original address: blog.csdn.net/sanmi8276/article/details/113063240

Table of Contents

What is HDFS?

Design goals

Installation configuration

Basic components of HDFS

HDFS read and write flowchart

CheckPoint


What is HDFS?

  1. A distributed file system that is easy to scale out
  2. Runs on large numbers of ordinary, inexpensive machines and provides fault-tolerance mechanisms
  3. Serves a large number of users with good access performance

Design goals:

  1. Automatic, rapid detection of and recovery from hardware failures
  2. Streaming data access: data is read in a streaming fashion, and the design is aimed at batch processing
  3. Disadvantages: not suitable for storing large numbers of small files, not suitable for low-latency data access, and no support for multiple concurrent writers or arbitrary modification of files
  4. Move the computation rather than the data (a basic principle of big data: trade space for time)
  5. Simple consistency model
  6. Portability across heterogeneous platforms

Installation configuration

An HDFS URI has the form hdfs://host:port/ (protocol scheme, hostname, port number).

Have a look at ${HADOOP_HOME}/bin and ${HADOOP_HOME}/sbin.

# learn to use help (help documentation is available almost everywhere)
hdfs dfsadmin -help
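
For reference, a few everyday file-system commands (the paths here are only placeholders for illustration):

# list the root of HDFS
hdfs dfs -ls /
# create a directory and upload a local file
hdfs dfs -mkdir -p /user/test
hdfs dfs -put localfile.txt /user/test/
# download a file back to the local file system
hdfs dfs -get /user/test/localfile.txt ./
# per-command help is also available
hdfs dfs -help put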

Basic components of HDFS

  1. namenode: manages the metadata of the entire file system, maintains the directory structure, and responds to client requests
  2. datanode: stores and manages the users' file data blocks, and reports back to the namenode through the heartbeat mechanism and block reports
  3. secondarynamenode: the namenode's assistant; it helps merge the metadata and can help recover data in emergencies (such as the namenode going down)
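
To see these daemons and the overall cluster state on a running installation, the following commands give a quick picture (output depends on your cluster):

# list the Java daemons running on the current node (NameNode, DataNode, SecondaryNameNode, ...)
jps
# ask the namenode for a summary of capacity and the state of each datanode
hdfs dfsadmin -report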

HDFS read and write flowchart

1. Data writing process
[Figure: HDFS data write flow]
 

Data is transferred between node servers over the network in the form of packets (in step 8, when uploading data, packets are first placed in a cache queue; if a packet fails at this point, it is retransmitted, by default up to 4 times).
A side note on some of the problems here (failures can occur between the parts of a distributed system, and the unreliability of the network is something designers have to take into account): communication can use sockets (long connections), HTTP (short connections), or other mechanisms such as pipes, FIFOs, and message queues.
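
A minimal way to observe the result of the write pipeline from the client side is to upload a file and then ask fsck where its blocks ended up (paths are illustrative):

# upload a file; the client splits it into blocks and streams packets through the datanode pipeline
hdfs dfs -put bigfile.log /user/test/bigfile.log
# show the blocks of the file and which datanodes hold each replica
hdfs fsck /user/test/bigfile.log -files -blocks -locations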

1) Why use long connections?

Put simply, in a distributed system the connection stays alive: nodes are not just accessed briefly and then forgotten, and the long connection is what keeps the heartbeat mechanism going.

2) What is the heartbeat mechanism?

When the namenode starts, it loads the metadata (data about the data, similar to a table's index) and receives block reports: each datanode periodically counts the blocks it holds and reports them (the interval can be set in the configuration file, which is why the nodes' clocks must be synchronized). Through this heartbeat mechanism the namenode tracks the availability of the whole cluster. If a block report fails to arrive, the namenode does not update the metadata, and the stale entries are removed during a later block report.
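
The effect of heartbeats and block reports can be checked from the command line; on recent Hadoop versions, for example:

# datanodes that are still sending heartbeats
hdfs dfsadmin -report -live
# datanodes the namenode considers dead (no heartbeat within the timeout)
hdfs dfsadmin -report -dead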

3) Safe mode

When does the namenode enter safe mode? Right after startup: the namenode first loads the fsimage into memory, replays the edits log in memory, and stays in safe mode while the block reports come in; once the reported blocks reach the threshold, it exits safe mode about 30 seconds later. It remains in safe mode while the fraction of reported blocks is below the threshold of 0.999f (the default) or the number of live datanodes is below the configured minimum (default 0).

4) How to leave safe mode?

  1. Format the cluster (the path configured for the namenode's name directory has to be deleted first); this method is basically never used
  2. Force the namenode to leave safe mode: hdfs dfsadmin -safemode leave
  3. Check whether any files, nodes, or blocks in the cluster have problems: hdfs fsck /
  4. Delete the blocks of corrupted files: hdfs fsck / -delete
  5. Lower the threshold (the safe-mode threshold in the configuration file; see the sketch after this list)
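
A small sketch of the usual safe-mode workflow (the threshold property name assumes Hadoop 2.x or later):

# check whether the namenode is currently in safe mode
hdfs dfsadmin -safemode get
# wait for safe mode to end on its own, or force-leave it
hdfs dfsadmin -safemode wait
hdfs dfsadmin -safemode leave
# the block threshold is dfs.namenode.safemode.threshold-pct (default 0.999f) in hdfs-site.xml
hdfs getconf -confKey dfs.namenode.safemode.threshold-pct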

2. Data reading process
[Figure: HDFS data read flow]
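
On the client side the read path is just as easy to exercise (paths are again illustrative):

# stream a file's contents; the client reads block by block from a datanode holding each replica
hdfs dfs -cat /user/test/bigfile.log | head
# or copy the file back to the local file system
hdfs dfs -get /user/test/bigfile.log ./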
 

Disk failure

Handled by the multiple-replica strategy.

namenode downtime

Simple solution: take the fsimage file kept by the secondarynamenode and copy it into the namenode's metadata storage directory.

Better solution: mount multiple disks on the namenode and configure dfs.namenode.name.dir with the directories on the different disks, separated by commas.
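
dfs.namenode.name.dir accepts a comma-separated list of directories, and the namenode writes its metadata to every one of them. A hypothetical example value, and a way to check the effective setting:

# example value in hdfs-site.xml (two directories on different disks, comma separated):
#   file:///disk1/hadoop/name,file:///disk2/hadoop/name
# confirm the value the namenode will actually use
hdfs getconf -confKey dfs.namenode.name.dir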

CheckPoint

Triggering conditions:

  1. 1,000,000 transactions (default)
  2. 1 hour (default)
<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>/hadoop/data/name</value>
</property>
<!-- checkpoint directory for the edits log files -->
<property>
  <name>dfs.namenode.checkpoint.edits.dir</name>
  <value>/hadoop/data/edits</value>
</property>
<!-- period: one hour -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>
<!-- transaction count reaches 1,000,000 -->
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
</property>

[Figure: secondarynamenode checkpoint process]
Note: the namenode stores the metadata; when the secondarynamenode performs a checkpoint, it downloads the edits and fsimage files from the namenode and merges them into a new fsimage.
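
Besides waiting for the time or transaction trigger, the namespace can also be saved to a fresh fsimage by hand on the namenode, for example during maintenance (a sketch; this briefly requires safe mode):

# enter safe mode, save the current in-memory namespace to a new fsimage, then leave safe mode
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave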
Note: a few points about the client and server sides:

  1. The client's configuration file, not the server's, determines the number of replicas
  2. Files are stored on the server side as blocks (the client decides how the file is split and what block size to use), as the example below shows
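
Because these are client-side settings, they can be overridden per command without touching the server configuration (the values here are only examples):

# upload with 2 replicas and a 64 MB block size for this file only
hdfs dfs -D dfs.replication=2 -D dfs.blocksize=67108864 -put localfile.txt /user/test/
# change the replication factor of a file that is already in HDFS
hdfs dfs -setrep -w 3 /user/test/localfile.txt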


Origin blog.csdn.net/sanmi8276/article/details/113063240