Hadoop HDFS Summary (Part 4)

HDFS working mechanism

HDFS data flow

File writing
1) The client requests to upload a file to the namenode, and the namenode checks whether the target file already exists and whether the parent directory exists.

2) The namenode returns whether it can be uploaded.

3) The client asks the namenode which datanode servers the first block should be uploaded to.

4) Namenode returns 3 datanode nodes, namely dn1, dn2, and dn3.

5) The client requests dn1 to upload data; upon receiving the request, dn1 calls dn2, and dn2 in turn calls dn3, completing the establishment of the communication pipeline.

6) dn1, dn2, and dn3 acknowledge the client step by step.

7) The client starts to upload the first block to dn1 (data is first read from disk into a local memory cache) in units of packets. dn1 forwards each packet to dn2, and dn2 forwards it to dn3; each packet passed along is placed in a response queue to await acknowledgment.

8) After the transfer of a block is completed, the client again requests the namenode to upload the second block to the server. (Repeat steps 3-7)
1. The client creates a new file by calling the create method of DistributedFileSystem.

2. DistributedFileSystem uses RPC to call namenode to create a new file that is not associated with blocks. Before creation, namenode will do various checks, such as whether the file exists, whether the client has permission to create it, etc. If the verification passes, namenode will record the new file, otherwise it will throw an IO exception.

3. After the first two steps complete, an FSDataOutputStream object is returned. As when reading files, FSDataOutputStream wraps a DFSOutputStream, which coordinates the namenode and datanodes. The client writes data to DFSOutputStream, which cuts the data into small packets and arranges them into a data queue.

4. The DataStreamer processes the data queue. It first asks the namenode which datanodes are most suitable to store the new block (for example, with a replication factor of 3, it finds the 3 most suitable datanodes) and arranges them into a pipeline. The DataStreamer sends the queued packets to the first datanode in the pipeline, the first datanode forwards them to the second, and so on.

5. DFSOutputStream also maintains a queue called the ack queue, likewise composed of packets, which waits for acknowledgments from the datanodes. Only when every datanode in the pipeline has confirmed receipt of a packet is that packet removed from the ack queue.
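The interplay of the data queue and ack queue can be sketched as follows (a simplified, single-threaded model; the real DFSOutputStream runs a separate DataStreamer thread, and the class and method names here are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of DFSOutputStream's two queues: packets move from the
// data queue to the ack queue when sent, and are discarded only once every
// datanode in the pipeline has acknowledged them. Illustrative sketch only.
public class PacketQueues {
    private final Deque<String> dataQueue = new ArrayDeque<>();
    private final Deque<String> ackQueue = new ArrayDeque<>();

    public void enqueue(String packet) { dataQueue.addLast(packet); }

    // The DataStreamer takes a packet off the data queue, sends it to the
    // first datanode, and parks it on the ack queue until the pipeline
    // confirms it.
    public String sendNext() {
        String packet = dataQueue.pollFirst();
        if (packet != null) ackQueue.addLast(packet);
        return packet;
    }

    // On a successful ack from the last datanode, drop the packet for good.
    public String ackOne() { return ackQueue.pollFirst(); }

    // On a datanode failure, packets still awaiting acks are pushed back
    // onto the data queue so they are not lost when the pipeline is rebuilt.
    public void requeueUnacked() {
        while (!ackQueue.isEmpty()) dataQueue.addFirst(ackQueue.pollLast());
    }

    public int pending() { return dataQueue.size() + ackQueue.size(); }
}
```

Note how a packet is never dropped between `sendNext` and `ackOne`: it always sits in exactly one of the two queues, which is what makes the failure recovery below possible.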

If a datanode fails during the write, the following steps are taken:
1) The pipeline is closed;
2) To prevent packet loss, the packets in the ack queue are moved back into the data queue;
3) The block currently being written but not yet finished on the failed datanode is deleted;
4) The remainder of the block is written to the two remaining healthy datanodes;
5) The namenode finds another datanode on which to create a replica of this block. All of these operations are transparent to the client.

6. After the client finishes writing data, call the close method to close the writing stream.

7. The DataStreamer flushes the remaining packets into the pipeline and then waits for the ack messages. After receiving the last ack, it notifies the namenode to mark the file as complete.

Note: After the client performs a write operation, only blocks that have finished writing are visible (this corresponds to the consistency model below); a block that is still being written is invisible to the client. Only by calling the sync method can the client be sure that all writes to the file have completed; when the client calls close, sync is invoked by default. Whether you need to call it manually depends on your program's trade-off between data robustness and throughput.
Network topology

In a local network, what does it mean for two nodes to be "close to each other"? In massive data processing, the main limiting factor is the data transfer rate between nodes: bandwidth is scarce. The idea here is to use the bandwidth between two nodes as a measure of distance.

Node distance: the sum of the distances from each of the two nodes to their nearest common ancestor.

For example, suppose there is a node n1 in rack r1 of data center d1. This node can be written as /d1/r1/n1. Using this notation, here are four example distances.

Distance(/d1/r1/n1, /d1/r1/n1)=0 (process on the same node)

Distance(/d1/r1/n1, /d1/r1/n2)=2 (different nodes on the same rack)

Distance(/d1/r1/n1, /d1/r3/n2)=4 (nodes on different racks in the same data center)

Distance(/d1/r1/n1, /d2/r4/n2)=6 (nodes in different data centers)
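The four examples above can be checked with a small helper that computes this distance directly from the path strings (an illustrative sketch, not Hadoop's actual NetworkTopology class; it assumes both paths have the same /datacenter/rack/node depth):

```java
// Computes the node-distance metric described above: the sum of each node's
// hops up to their nearest common ancestor in the /datacenter/rack/node tree.
public class NodeDistance {
    public static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        int depth = pa.length;            // both paths assumed to have depth 3
        int common = 0;                   // length of the shared prefix
        while (common < depth && pa[common].equals(pb[common])) common++;
        // each node is (depth - common) hops above the shared ancestor,
        // and the distance is the sum of both nodes' hops
        return 2 * (depth - common);
    }
}
```

For instance, /d1/r1/n1 and /d1/r3/n2 share only the data center d1, so each node is two hops from the common ancestor, giving a distance of 4.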

Rack awareness

  • Official documentation:

http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/RackAwareness.html

http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Data_Replication

  • Replica node selection (older Hadoop versions)

The first replica is placed on the node where the client resides; if the client is outside the cluster, a node is chosen at random.

The second replica is placed on a random node in a different rack from the first.

The third replica is placed in the same rack as the second, on a random node.

  • Replica node selection (newer Hadoop versions)

    The first replica is placed on the node where the client resides; if the client is outside the cluster, a node is chosen at random.

    The second replica is placed in the same rack as the first, on a random node.

    The third replica is placed in a different rack, on a random node.
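The newer placement policy above can be sketched as a small rack-selection helper (an illustration only, not Hadoop's BlockPlacementPolicy; rack names are arbitrary strings, and the choice of "random node within the rack" is left out):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the newer replica-placement policy: replica 1 on the client's
// rack (on the client node itself when the client is inside the cluster),
// replica 2 on the same rack, replica 3 on a different rack.
public class ReplicaPlacement {
    public static List<String> chooseRacks(String clientRack, List<String> allRacks) {
        List<String> chosen = new ArrayList<>();
        chosen.add(clientRack);            // replica 1: client's rack
        chosen.add(clientRack);            // replica 2: same rack, another node
        for (String rack : allRacks) {     // replica 3: first different rack
            if (!rack.equals(clientRack)) { chosen.add(rack); break; }
        }
        return chosen;
    }
}
```

The design trade-off is visible here: two replicas in one rack keep the write pipeline short (low inter-rack traffic), while the third replica in another rack survives a whole-rack failure.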

2. HDFS read process

1) The client requests the namenode to download the file, and the namenode finds the address of the datanode where the file block is located by querying the metadata.
2) Pick a datanode (the nearest principle, then random) server and request to read the data.
3) The datanode starts to transmit data to the client (read data from the disk and put it into the stream, and use the packet as the unit for verification).
4) The client receives it in packets, caches it locally, and then writes it to the target file.
1. First call the open method of the FileSystem object, which is actually an instance of DistributedFileSystem.

2. DistributedFileSystem obtains the locations of the first block of the file via RPC. The same block returns multiple locations according to the replication factor; these locations are sorted according to the Hadoop topology, with the one closest to the client ranked first.

3. The first two steps will return an FSDataInputStream object, which will be encapsulated by a DFSInputStream object. DFSInputStream can conveniently manage datanode and namenode data streams. The client calls the read method, and DFSInputStream will find the datanode closest to the client and connect.

4. Data flows continuously from the datanode to the client.

5. When the first block has been read completely, the connection to that block's datanode is closed, and reading of the next block begins. These operations are transparent to the client, which simply sees a continuous stream.

6. When the first batch of blocks has been read, DFSInputStream asks the namenode for the locations of the next batch of blocks, and reading continues. Once all blocks have been read, all streams are closed.

7. If DFSInputStream encounters an error communicating with a datanode while reading, it tries the next-closest datanode holding the block, records which datanode failed, and skips that datanode when reading the remaining blocks.
DFSInputStream also verifies block checksums. If a corrupt block is found, it is first reported to the namenode, and DFSInputStream then reads a replica of the block from another datanode.

8. The design has the client connect directly to the datanodes to retrieve data, while the namenode is responsible for providing the optimal datanode for each block. The namenode only handles block location requests, and this information is held entirely in the namenode's memory, so HDFS can withstand concurrent access from a large number of clients by spreading the data traffic across the datanode cluster.

3. Consistency model

Debug the following code:

package HDFS_01;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class HDFS01 {

    public static void main(String[] args) throws Exception {

        // 1. Create the configuration object
        Configuration conf = new Configuration();
        // 2. Get the file system
        FileSystem fs = FileSystem.get(new URI("hdfs://linux1:9000"), conf, "root");
        // 3. Create an output stream for the file
        FSDataOutputStream fos = fs.create(new Path("/data/A.txt"));
        // 4. Write data
        fos.write("aaaaaa".getBytes());
        // 5. Consistency flush (uncomment to make the data visible to other clients)
        // fos.hflush();
        // fos.hsync();
        // 6. Close the stream
        fos.close();
    }
}

Summary

When writing data, if you want the data written so far to be immediately visible to other clients, call the following method:

FSDataOutputStream.hflush(); // Flush the client buffer so the data becomes immediately visible to other clients

NameNode working mechanism

1. The working mechanism of NameNode and Secondary NameNode

Phase 1: namenode startup

  • When the namenode starts for the first time after formatting, it creates the fsimage and edits files. If it is not the first start, it loads the edit log (edits) and the image file (fsimage) directly into memory
  • Client request to add, delete, or modify metadata
  • namenode records operation logs and updates rolling logs
  • namenode performs the add, delete, modify, and query operations on the metadata in memory
  • Phase 2: Secondary NameNode operation
    • The Secondary NameNode asks the namenode whether a checkpoint is needed, and brings the namenode's answer directly back.
    • The Secondary NameNode requests to perform a checkpoint.
    • namenode rolls the edits log currently being written
    • Copy the pre-roll edit log and the image file to the Secondary NameNode
    • The Secondary NameNode loads the edit log and image file into memory and merges them.
    • Generate a new image file, fsimage.chkpoint
    • Copy fsimage.chkpoint to the namenode
    • namenode renames fsimage.chkpoint to fsimage
  • chkpoint check time parameter setting
    • By default, the SecondaryNameNode performs a checkpoint every hour.
#[hdfs-default.xml]

<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>

The number of operations is checked once a minute; when it reaches 1 million, the SecondaryNameNode performs a checkpoint.

<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
  <description>Number of operations (transactions)</description>
</property>

<property>
  <name>dfs.namenode.checkpoint.check.period</name>
  <value>60</value>
  <description>Check the operation count once a minute</description>
</property>
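The two triggers configured above combine as follows (a simplified sketch; the constant names mirror the hdfs-site.xml properties, and the real SecondaryNameNode logic has more moving parts):

```java
// Models the checkpoint triggers: every dfs.namenode.checkpoint.check.period
// seconds (60), the SecondaryNameNode checks whether either
// dfs.namenode.checkpoint.period (3600 s) has elapsed since the last
// checkpoint or dfs.namenode.checkpoint.txns (1,000,000) transactions have
// accumulated, and runs a checkpoint if so.
public class CheckpointTrigger {
    static final long PERIOD_SECONDS = 3600;
    static final long TXN_LIMIT = 1_000_000;

    public static boolean shouldCheckpoint(long secondsSinceLast, long txnsSinceLast) {
        return secondsSinceLast >= PERIOD_SECONDS || txnsSinceLast >= TXN_LIMIT;
    }
}
```

Either condition alone is sufficient: a busy cluster checkpoints on transaction count, a quiet one on elapsed time.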

2. Mirror file and edit log file

  • concept
  • After the namenode is formatted, the following files are generated in the /opt/module/hadoop-2.8.4/data/dfs/name/current directory. Note that these files can only be found on the node where the NameNode is located

You can execute find . -name 'edits*' to find the files

edits_0000000000000000000
fsimage_0000000000000000000.md5
seen_txid
VERSION

(1) Fsimage file: a permanent checkpoint of the HDFS file system metadata, containing the serialization information of all directories and file inodes in the HDFS file system.

(2) Edits file: stores all update operations of the HDFS file system. All write operations performed by file system clients are first recorded in the edits file.

(3) The seen_txid file saves a number: the transaction id of the last edits_ file

(4) Every time the namenode starts, it reads the fsimage file into memory and applies the update operations in the edits files in order, from the transaction after the fsimage checkpoint up to the number recorded in seen_txid, to ensure the metadata in memory is up to date and synchronized. This can be seen as merging the fsimage and edits files at namenode startup.
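The startup replay described in (4) can be sketched as a toy model (edits here are just transaction-id/operation pairs; the real namenode parses binary edit records):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simulates namenode startup: apply every edit whose transaction id is
// greater than the fsimage checkpoint's txid, up to and including the txid
// recorded in seen_txid. Returns how many edits were applied.
public class EditsReplay {
    public static int replay(long fsimageTxid, long seenTxid, Map<Long, Runnable> edits) {
        int applied = 0;
        for (Map.Entry<Long, Runnable> e : edits.entrySet()) {
            long txid = e.getKey();
            if (txid > fsimageTxid && txid <= seenTxid) {
                e.getValue().run();   // apply this operation to the in-memory metadata
                applied++;
            }
        }
        return applied;
    }
}
```

With a checkpoint at txid 2 and seen_txid at 5, exactly the edits 3, 4, and 5 are replayed, which is what keeps the in-memory metadata consistent with the on-disk logs.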

oev: view edits files

  • Basic syntax
- hdfs oev -p <file type> -i <edits file> -o <output path>
-p	--processor <arg>   conversion type: binary, xml (default), or stats
-i	--inputFile <arg>   input edits file; an .xml suffix means XML format, otherwise binary
-o 	--outputFile <arg>  output file; overwritten if it already exists

Case practice

[itstar@bigdata111 current]$ hdfs oev -p XML -i edits_0000000000000000135-0000000000000000135 -o /opt/module/hadoop-2.8.4/edits.xml

[itstar@bigdata111 current]$ cat /opt/module/hadoop-2.8.4/edits.xml

Each RECORD records one operation.

OP_ADD represents an add-file operation, and OP_MKDIR a create-directory operation. Each record also contains:

File path (PATH)

Modification time (MTIME)

Access time (ATIME)

Client name (CLIENT_NAME)

Client address (CLIENT_MACHINE)

and other useful information, such as permissions (PERMISSION_STATUS)

3. Rolling the edit log

Under normal circumstances, the edit log rolls when the HDFS file system is updated. You can also force an edit-log roll with a command.

  • Rolling the edit log (prerequisite: the cluster is started)
[dingshiqi@bigdata111 current]$ hdfs dfsadmin -rollEdits

Example: original file name edits_inprogress_0000000000000000321

After executing the following command

[root@bigdata111 current]# hdfs dfsadmin -rollEdits

Successfully rolled edit logs.

New segment starts at txid 323

edits_inprogress_0000000000000000321 => edits_inprogress_0000000000000000323
  • When is the image file generated?

    The image file and edit logs are loaded (and merged) when the NameNode starts

4. NameNode version number

  • View the namenode version number

View VERSION in the directory /opt/module/hadoop-2.8.4/data/dfs/name/current

namespaceID=1778616660

clusterID=CID-bc165781-d10a-46b2-9b6f-3beb1d988fe0

cTime=1552918200296

storageType=NAME_NODE

blockpoolID=BP-274621862-192.168.1.111-1552918200296

layoutVersion=-63

Namenode version number explained

(1) namespaceID: HDFS can have multiple Namenodes; different Namenodes have different namespaceIDs, and each separately manages a set of blockpoolIDs.

(2) clusterID cluster id, globally unique

(3) The cTime attribute marks the creation time of the namenode storage system. For the newly formatted storage system, this attribute is 0; but after the file system is upgraded, the value will be updated to the new timestamp.

(4) The storageType attribute indicates that this storage directory contains the namenode's data structures.

(5) blockpoolID: a block pool id identifies a block pool and is globally unique across clusters. When a new Namespace is created (as part of formatting), a unique ID is created and persisted. Building a globally unique BlockPoolID during creation is more reliable than manual configuration. The NN persists the BlockPoolID to disk and loads and uses it again on subsequent startups.

(6) layoutVersion is a negative integer. This version number is usually updated only when new features are added to HDFS.

(7) storageID (storage ID): the ID of the DataNode, which is not unique
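A VERSION file like the one shown above is a plain key=value properties file, so it can be inspected with a few lines of code (an illustrative helper, not part of Hadoop's API; the real file also starts with a timestamp comment line):

```java
import java.util.HashMap;
import java.util.Map;

// Parses the key=value lines of a VERSION file. Blank lines and '#'
// comment lines are skipped.
public class VersionFile {
    public static Map<String, String> parse(String contents) {
        Map<String, String> props = new HashMap<>();
        for (String line : contents.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            int eq = line.indexOf('=');
            if (eq > 0) props.put(line.substring(0, eq), line.substring(eq + 1));
        }
        return props;
    }
}
```

Comparing the clusterID and blockpoolID values parsed from a NameNode's and a DataNode's VERSION files is a quick way to diagnose the common "incompatible clusterIDs" startup error after a re-format.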

5. SecondaryNameNode directory structure

The Secondary NameNode is an auxiliary daemon that monitors the state of HDFS and obtains snapshots of the HDFS metadata at regular intervals.

Check the SecondaryNameNode directory structure in the directory /opt/module/hadoop-2.8.4/data/dfs/namesecondary/current.

edits_0000000000000000001-0000000000000000002
fsimage_0000000000000000002
fsimage_0000000000000000002.md5
VERSION

The layout of the SecondaryNameNode's namesecondary/current directory is identical to that of the primary namenode's current directory.

Benefit: When the primary namenode fails (assuming that the data is not backed up in time), data can be restored from the secondary namenode.

Method 1: Copy the data in the SecondaryNameNode to the directory where the namenode stores the data;

Method 2: Use the -importCheckpoint option to start the namenode daemon, thereby copying the data in the SecondaryNameNode to the namenode directory.

  • Case practice (1)

    Simulate namenode failure, and use method one to restore namenode data

(1) Kill the namenode process: kill -9 <namenode pid>

(2) Delete the data stored by the namenode (/opt/module/hadoop-2.8.4/data/dfs/name):
	rm -rf /opt/module/hadoop-2.8.4/data/dfs/name/*

	Note: at this point, stopping the NN with hadoop-daemon.sh stop namenode
	and then restarting it with hadoop-daemon.sh start namenode shows that the 50070 web page fails to start.

(3) Copy the data from the SecondaryNameNode to the namenode's data directory:
	cp -r /opt/module/hadoop-2.8.4/data/dfs/namesecondary/* \
	  /opt/module/hadoop-2.8.4/data/dfs/name/

(4) Restart the namenode:
	sbin/hadoop-daemon.sh start namenode

  • Case practice (2)

Simulate namenode failure, and use method two to restore namenode data

  • Modify the configuration in hdfs-site.xml. The unit of the value is seconds; the default is 3600, i.e. 1 hour. Only one entry is required
 <property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>120</value>
</property>

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/opt/module/hadoop-2.8.4/data/dfs/name</value>
</property>
  • Kill the namenode process: kill -9 <namenode pid>

  • Delete the data stored by namenode (/opt/module/hadoop-2.8.4/data/dfs/name)

    rm -rf /opt/module/hadoop-2.8.4/data/dfs/name/*

  • If the SecondaryNameNode is not on the same host as the Namenode, you need to copy the directory where the SecondaryNameNode stores data to the same-level directory where the Namenode stores data.

[dingshiqi@bigdata111 dfs]$ pwd
/opt/module/hadoop-2.8.4/data/dfs
[dingshiqi@bigdata111 dfs]$ ls
data  name  namesecondary

Import the checkpoint data (wait a while, then press ctrl+c to end)

bin/hdfs namenode -importCheckpoint

Start namenode

sbin/hadoop-daemon.sh start namenode

If you are prompted that the file is locked, you can delete in_use.lock

rm -rf /opt/module/hadoop-2.8.4/data/dfs/namesecondary/in_use.lock

6. Cluster safe mode operation

  • Overview

    When the Namenode starts, it first loads the image file (fsimage) into the memory and executes various operations in the edit log (edits). Once the file system metadata image is successfully created in memory, a new fsimage file and an empty edit log are created. At this point, the namenode starts to monitor datanode requests. But at this moment, the namenode is running in safe mode, that is, the file system of the namenode is read-only for the client.

    The location of the data block in the system is not maintained by the namenode, but is stored in the datanode in the form of a block list. During the normal operation of the system, the namenode will keep the mapping information of all block locations in memory. In the safe mode, each datanode will send the latest block list information to the namenode. After the namenode knows enough block location information, it can run the file system efficiently.

    If the "minimum replica condition" is met, the namenode exits safe mode after 30 seconds. The minimum replica condition means that 99.9% of the blocks in the entire file system meet the minimum replication level (default: dfs.replication.min=1). When starting a newly formatted HDFS cluster, the namenode does not enter safe mode, because there are no blocks in the system yet.
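The minimum replica condition can be expressed directly (a sketch, not the namenode's actual implementation; the 99.9% threshold corresponds to the dfs.namenode.safemode.threshold-pct default in hdfs-default.xml):

```java
// Models the safe-mode exit rule: leave safe mode (after the 30 s extension)
// once at least 99.9% of blocks meet the minimum replication level.
public class SafeModeCheck {
    static final double THRESHOLD = 0.999;

    public static boolean minimumReplicaConditionMet(long blocksMeetingMin, long totalBlocks) {
        // A freshly formatted cluster has no blocks, so the condition holds
        // trivially and the namenode never waits in safe mode.
        if (totalBlocks == 0) return true;
        return (double) blocksMeetingMin / totalBlocks >= THRESHOLD;
    }
}
```

With 1000 blocks, 999 sufficiently replicated blocks are enough to satisfy the condition, while 998 are not.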

  • Basic syntax

    While the cluster is in safe mode, it cannot perform important operations (write operations). After the cluster starts up, it exits safe mode automatically.

(1) bin/hdfs dfsadmin -safemode get		(check the safe mode status)
(2) bin/hdfs dfsadmin -safemode enter	(enter safe mode)
(3) bin/hdfs dfsadmin -safemode leave	(leave safe mode)
(4) bin/hdfs dfsadmin -safemode wait	(wait for safe mode to end)

Case study: waiting for safe mode

  • Enter safe mode first
bin/hdfs dfsadmin -safemode enter

Execute the following script

Edit a script (Note: environment variables must already be set; otherwise, write absolute paths)

#!/bin/bash
hdfs dfsadmin -safemode wait
hadoop fs -put /opt/BBB /

Open another window and execute

bin/hdfs dfsadmin -safemode leave

7. NameNode multi-directory configuration

  • The namenode's local directory can be configured as multiple directories, each storing the same content, which increases reliability.
  • The specific configuration is as follows
#hdfs-site.xml

<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///${hadoop.tmp.dir}/dfs/name1,file:///${hadoop.tmp.dir}/dfs/name2</value>
</property>
#1. Stop the cluster and delete data and logs: rm -rf data/* logs/*
#2. hdfs namenode -format
#3. start-dfs.sh
#4. Verify the result
#https://blog.csdn.net/qq_39657909/article/details/85553525

Experiment summary:
Thinking 1: If you format (hdfs namenode -format) on a non-Namenode node,
will the name1 and name2 directories be generated just as on the NN node?

Answer: As long as the configuration above is in place, name1 and name2 will likewise be generated on that node
  • Specific explanation:
    What does formatting do?

    On the NameNode node there are two critical paths, used to store the metadata information and the operation log respectively. These two paths come from

    the configuration file; their corresponding properties are dfs.name.dir and dfs.name.edits.dir, and their default path is /tmp/hadoop/dfs/name. When formatting, the NameNode clears all files in both directories; after that, formatting creates new files in the dfs.name.dir directory

    Configuring hadoop.tmp.dir causes dfs.name.dir and dfs.name.edits.dir to generate their files under a single directory

    Thinking 2: If name1 and name2 are generated on a non-NN node, is there any difference between them and those generated on the NN?

    **Answer:** There is a difference. On the NN node, new edits_XXX files are generated and the fsimage is updated; a non-NN node does neither. It only holds the fsimage obtained from initialization, generates no edits, and no log rolling ever occurs.


Origin blog.csdn.net/qq_45092505/article/details/105315599