Hadoop framework: Detailed explanation of the working mechanism of DataNode


One, Working mechanism

1. Basic description


Data blocks on a DataNode are stored as files on disk. Each block consists of two files: one holds the data itself, and the other holds the block metadata, including the data length, checksums, and timestamp;
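
As a hedged illustration, the two files can be seen directly on a DataNode's disk. The listing below assumes the default hadoop.tmp.dir (/tmp/hadoop-root for the root user); the BP-* block pool ID, subdirectories, and block IDs are cluster-specific placeholders:

# illustrative listing of a DataNode storage directory
[root@hop01 hadoop2.7]# ls /tmp/hadoop-root/dfs/data/current/BP-*/current/finalized/subdir0/subdir0
blk_1073741825            # the block data itself
blk_1073741825_1001.meta  # block metadata: length, checksums, timestamp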

After a DataNode starts, it registers with the NameNode and thereafter periodically reports all of its block metadata to the NameNode;

There is a heartbeat mechanism between the DataNode and the NameNode: every 3 seconds the DataNode sends a heartbeat, and the heartbeat response carries commands from the NameNode to the DataNode, such as instructions to replicate or delete blocks. If no heartbeat is received from a DataNode for a prolonged period (10 minutes 30 seconds with the default settings, see below), the node is considered unavailable.

2. Customizing the timeout

The timeout and heartbeat interval can be modified through the hdfs-site.xml configuration file. The unit of dfs.namenode.heartbeat.recheck-interval is milliseconds, and the unit of dfs.heartbeat.interval is seconds. The commonly used timeout formula is: timeout = 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval.

<property>
    <name>dfs.namenode.heartbeat.recheck-interval</name>
    <value>600000</value>
</property>
<property>
    <name>dfs.heartbeat.interval</name>
    <value>6</value>
</property>
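
With these values, the formula above works out as follows (with the defaults of a 5-minute recheck interval and a 3-second heartbeat, it gives 10 minutes 30 seconds):

timeout = 2 × 600000 ms + 10 × 6 s = 1200 s + 60 s = 1260 s = 21 minutes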

3. Bringing a new node online

The current cluster consists of the nodes hop01, hop02, and hop03; a new node, hop04, is added on this basis.

The basic steps are as follows:

Clone the current service node to obtain the hop04 environment;

Modify the basic CentOS 7 configuration, and delete the data and log files left over from the clone;

Start the DataNode so that it joins the cluster, as shown in the sketch below;
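
A minimal sketch of the last step on hop04, using the daemon scripts from the Hadoop 2.7 sbin directory (paths follow the install layout used elsewhere in this article):

# start the DataNode and NodeManager on the new node
[root@hop04 hadoop2.7]# sbin/hadoop-daemon.sh start datanode
[root@hop04 hadoop2.7]# sbin/yarn-daemon.sh start nodemanager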

4. Multi-directory configuration

Distribute this configuration to all services in the cluster, format and start HDFS and YARN, and upload files to test.

<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///${hadoop.tmp.dir}/dfs/data01,file:///${hadoop.tmp.dir}/dfs/data02</value>
</property>
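
After restarting the DataNode, each configured directory receives its own subset of blocks; the directories are not replicas of one another. A quick hedged check, assuming the default hadoop.tmp.dir of /tmp/hadoop-root:

# each data directory holds a different set of blocks
[root@hop01 hadoop2.7]# ls /tmp/hadoop-root/dfs/data01/current
[root@hop01 hadoop2.7]# ls /tmp/hadoop-root/dfs/data02/current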

Two, Whitelist and blacklist configuration

1. Whitelist setting

Configure the whitelist and distribute it to the cluster services; only hosts listed in the whitelist are allowed to connect to the NameNode as DataNodes;

[root@hop01 hadoop]# pwd
/opt/hadoop2.7/etc/hadoop
[root@hop01 hadoop]# vim dfs.hosts
hop01
hop02
hop03

Configure hdfs-site.xml and distribute it to the cluster services;

<property>
    <name>dfs.hosts</name>
    <value>/opt/hadoop2.7/etc/hadoop/dfs.hosts</value>
</property>

Refresh NameNode

[root@hop01 hadoop2.7]# hdfs dfsadmin -refreshNodes

Refresh ResourceManager

[root@hop01 hadoop2.7]# yarn rmadmin -refreshNodes
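
After the refresh, hosts absent from the whitelist can no longer register as DataNodes. A quick hedged check of the live node list:

# the report should list only hop01, hop02, and hop03 as live DataNodes
[root@hop01 hadoop2.7]# hdfs dfsadmin -report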

2. Blacklist setting

Configure the blacklist and distribute it to the cluster services;

[root@hop01 hadoop]# pwd
/opt/hadoop2.7/etc/hadoop
[root@hop01 hadoop]# vim dfs.hosts.exclude
hop04

Configure hdfs-site.xml and distribute it to the cluster services;

<property>
    <name>dfs.hosts.exclude</name>
    <value>/opt/hadoop2.7/etc/hadoop/dfs.hosts.exclude</value>
</property>

Refresh NameNode

[root@hop01 hadoop2.7]# hdfs dfsadmin -refreshNodes

Refresh ResourceManager

[root@hop01 hadoop2.7]# yarn rmadmin -refreshNodes
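
Unlike a whitelist removal, a blacklisted node is not dropped immediately: the NameNode decommissions it first, re-replicating its blocks to the remaining nodes before retiring it. Progress can be watched as follows (output wording varies slightly by version):

# hop04 moves through "Decommission in progress" to "Decommissioned"
[root@hop01 hadoop2.7]# hdfs dfsadmin -report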

Three, File archiving

1. Basic description

HDFS storage is suited to large files and massive data. If every file is very small, a large amount of metadata is generated, occupying too much NameNode memory, and interaction between the NameNode and DataNodes slows down.


HDFS can archive small files into HAR files, which can be understood as packed storage: many small files are aggregated into a single archive (the files are packed, not compressed), which reduces NameNode memory consumption and interaction overhead. At the same time, the archived small files remain individually accessible, improving overall efficiency.

2. Operation process

Create two directories

# directory for the small files
[root@hop01 hadoop2.7]# hadoop fs -mkdir -p /hopdir/harinput
# directory for the archive output
[root@hop01 hadoop2.7]# hadoop fs -mkdir -p /hopdir/haroutput

Upload the test files

[root@hop01 hadoop2.7]# hadoop fs -moveFromLocal LICENSE.txt /hopdir/harinput
[root@hop01 hadoop2.7]# hadoop fs -moveFromLocal README.txt /hopdir/harinput

Archive operation

[root@hop01 hadoop2.7]# bin/hadoop archive -archiveName output.har -p /hopdir/harinput /hopdir/haroutput

View archive files

[root@hop01 hadoop2.7]# hadoop fs -lsr har:///hopdir/haroutput/output.har
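
The archive command itself runs as a MapReduce job, and the resulting .har is in fact a directory on HDFS holding index files plus the packed data. Viewed without the har:// scheme (illustrative output):

# the .har directory contains the MapReduce job marker, two index
# files, and the packed data
[root@hop01 hadoop2.7]# hadoop fs -ls /hopdir/haroutput/output.har
# _SUCCESS  _index  _masterindex  part-0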


Once the archive has been verified, the original small files can be deleted.

Unarchive files

# perform the extraction
[root@hop01 hadoop2.7]# hadoop fs -cp har:///hopdir/haroutput/output.har/* /hopdir/haroutput
# view the files
[root@hop01 hadoop2.7]# hadoop fs -ls /hopdir/haroutput
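
The -cp above copies the files back serially. For large archives, a parallel alternative is distcp, which runs the copy as a MapReduce job; a sketch using the same har:// path syntax (the target directory here is hypothetical):

# parallel unarchive via MapReduce; /hopdir/haroutput2 is a placeholder target
[root@hop01 hadoop2.7]# hadoop distcp har:///hopdir/haroutput/output.har/* /hopdir/haroutput2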

Four, Recycle bin mechanism

1. Basic description

If the recycle bin function is enabled, deleted files can be restored within a configured time window, guarding against accidental deletion. Internally, HDFS starts a background thread named Emptier in the NameNode; it manages and monitors the files under the system recycle bin, and files whose life cycle in the recycle bin has expired are deleted automatically.

2. Enabling the configuration

This configuration needs to be synchronized to all services in the cluster;

[root@hop01 hadoop]# vim /opt/hadoop2.7/etc/hadoop/core-site.xml
# add the following content
<property>
    <name>fs.trash.interval</name>
    <value>1</value>
</property>

fs.trash.interval=0 disables the recycle bin; any positive value enables it and sets how long deleted files are retained, in minutes (here, 1 minute).
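
A hedged walkthrough of the mechanism, assuming the root user: the trash path is /user/<user>/.Trash/Current plus the original path, and with an interval of 1 the file is kept for only one minute.

# delete a file: it is moved into the trash rather than removed
[root@hop01 hadoop2.7]# hadoop fs -rm /hopdir/harinput/README.txt
# the file now sits under the user's trash directory
[root@hop01 hadoop2.7]# hadoop fs -ls /user/root/.Trash/Current/hopdir/harinput
# restore it by moving it back before the interval expires
[root@hop01 hadoop2.7]# hadoop fs -mv /user/root/.Trash/Current/hopdir/harinput/README.txt /hopdir/harinput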

Five, Source code address

GitHub address
https://github.com/cicadasmile/big-data-parent
GitEE address
https://gitee.com/cicadasmile/big-data-parent


Origin: blog.51cto.com/14439672/2542604