Using HDFS, Hadoop's distributed file storage system

1. Composition of HDFS

1. NameNode

  • Stores the metadata of the entire HDFS cluster - an index of the directories and files stored across the cluster
  • Manages the entire HDFS cluster
  • Receives client requests
  • Responsible for node failover

2. DataNode

  • Data is stored in the form of blocks.
  • By default the block size is 128 MB.
  • How the default block size is derived:
    • Seek time: the time it takes to locate a file when reading it; the ideal case is for the seek time to be about 1% of the transfer time; the average HDFS seek time is about 10 ms
    • Data transfer rate: about 100 MB/s
    • So the ideal block size is 10 ms ÷ 1% × 100 MB/s ≈ 100 MB, rounded up to the power-of-two value 128 MB (a rough calculation of this is sketched below)
  • Responsible for periodically summarizing the block information stored on the node and reporting it to the NN.
  • Responsible for communicating with the client to perform file read and write operations.
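
As a quick back-of-the-envelope check of the block-size rule above (a sketch using only the rule-of-thumb numbers quoted in this section, not anything taken from Hadoop's source):

public class BlockSizeEstimate {
    public static void main(String[] args) {
        double seekTimeSec = 0.010;        // average seek time, about 10 ms
        double seekRatio = 0.01;           // seek time should be about 1% of transfer time
        double transferRateMBps = 100.0;   // transfer rate, about 100 MB/s

        double transferTimeSec = seekTimeSec / seekRatio;             // 1 s
        double idealBlockSizeMB = transferTimeSec * transferRateMBps; // 100 MB
        System.out.println("Ideal block size is about " + idealBlockSizeMB
                + " MB, rounded up to the power-of-two value 128 MB");
    }
}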

3. SecondaryNameNode

  • Assists the NameNode with merging the edits log and the fsimage image file.

4. Client: command line / Java API

  • Responsible for communicating with the HDFS cluster to create, delete, modify, and query files.
  • Responsible for splitting files into blocks

2. Basic use of HDFS

HDFS is a distributed file storage system that stores file data. Since HDFS is a file system, it supports uploading, downloading and deleting files, creating folders, and so on.

HDFS provides two ways to operate on it: ① the command line, and ② the Java API.

1. Command line operation

Command-line operations take the form: hdfs dfs -xxxx xxxxx   or   hadoop fs -xxxx xxxxx
List: hdfs dfs -ls /
Create a directory: hdfs dfs -mkdir /demo
Upload: hdfs dfs -put jdk-8u371-linux-x64.tar.gz /demo
Upload and delete the local Linux copy: hdfs dfs -moveFromLocal hadoop-3.1.4.tar.gz /demo
Download: hdfs dfs -get /demo/hadoop-3.1.4.tar.gz /opt/software
Download: hdfs dfs -copyToLocal /demo/hadoop-3.1.4.tar.gz /opt/software
Delete: hdfs dfs -rm -r /demo

2. Java API operation

Introduce Hadoop programming dependencies (hadoop-client, hadoop-hdfs) into pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.kang</groupId>
  <artifactId>hdfs-study</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>hdfs-study</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>8</maven.compiler.source>
    <maven.compiler.target>8</maven.compiler.target>
    <hadoop.version>3.1.4</hadoop.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>compile</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
  </dependencies>
</project>
  • Basic operations on HDFS
/**
 * HDFS programming workflow
 *   1. Create a Hadoop Configuration object holding the HDFS connection settings
 *      (equivalent to the xxx.xml configuration files under hadoop's etc/hadoop directory)
 *   2. Obtain a connection to HDFS based on the configuration
 *   3. Use the connection to operate on the HDFS cluster
 */

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class Demo {

    public static void main(String[] args) throws URISyntaxException, IOException, InterruptedException {
        //1. Create the Hadoop Configuration object
        Configuration conf = new Configuration();
        //2. Get the HDFS connection (FileSystem) from the configuration
        FileSystem system = FileSystem.get(new URI("hdfs://192.168.31.104:9000"), conf, "root");
        //3. Use the FileSystem to operate on the HDFS cluster
        RemoteIterator<LocatedFileStatus> listedFiles = system.listFiles(new Path("/"), false);
        while (listedFiles.hasNext()) {
            LocatedFileStatus fileStatus = listedFiles.next();
            System.out.println("File path: " + fileStatus.getPath());
            System.out.println("File owner: " + fileStatus.getOwner());
            System.out.println("File permissions: " + fileStatus.getPermission());
            System.out.println("File block size: " + fileStatus.getBlockSize());
        }
    }
}

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * Unit tests
 */
public class HDFSTest {

    public FileSystem fileSystem;

    @Before
    public void init() throws URISyntaxException, IOException, InterruptedException {
        //1. Create the Hadoop Configuration object
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "1");   // write files with a single replica
        //2. Get the HDFS connection (FileSystem) from the configuration
        fileSystem = FileSystem.get(new URI("hdfs://192.168.31.104:9000"), conf, "root");
    }

    /**
     * 1. Upload a file via the Java API
     */
    @Test
    public void test01() throws IOException {
        fileSystem.copyFromLocalFile(new Path("D:\\2023PracticalTraining\\software\\InstallPackage\\PowerDesginer16.5.zip"), new Path("/demo"));
        System.out.println("Upload succeeded!");
        //fileSystem.copyToLocalFile();
    }

    /**
     * 2. Download a file
     * When operating HDFS remotely from Windows, or running MapReduce code on Windows, some cases
     * require a Hadoop software environment on Windows as well.
     * The cluster's Hadoop is installed on Linux, so the Hadoop placed on Windows is really just a stub environment.
     * An error like exitcode=-107xxxxxxx means the machine is missing the C runtime libraries.
     */
    @Test
    public void test02() throws IOException {
        fileSystem.copyToLocalFile(new Path("/jdk-8u371-linux-x64.tar.gz"), new Path("D:\\Desktop"));
        System.out.println("Download succeeded!");
    }

    /**
     * 3. Delete a file or directory
     */
    @Test
    public void test03() throws IOException {
        boolean delete = fileSystem.delete(new Path("/demo"), true);
        System.out.println(delete);
    }
}

test01: the upload completes and prints "Upload succeeded!"; the file appears under /demo.

test02:

An error message will be displayed: HADOOP_HOME and hadoop.home.dir are unset.

When operating HDFS remotely from Windows, or running MapReduce code on Windows, some scenarios require a Hadoop software environment on Windows as well.

But the cluster's Hadoop is installed on Linux, so the Hadoop placed on Windows is really just a stub environment.

First extract the hadoop-3.1.4.tar.gz package (the same one installed on Linux) on Windows: decompress it to hadoop-3.1.4.tar and then extract that into a hadoop-3.1.4 directory.

An error may be reported during extraction; it can be ignored.

After extraction, replace all files in the bin directory with the Windows-specific binaries (winutils.exe, hadoop.dll and related files), which can be downloaded online.

Configure the environment variables: set HADOOP_HOME to the extracted hadoop-3.1.4 directory, and add %HADOOP_HOME%\bin to Path in the system variables.

Run the program again and it will be successful!

test03: the /demo directory is deleted recursively and the call returns true.

package com.kang;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * List the information of all files and folders under a given HDFS path
 */
public class Demo01 {

    public static void main(String[] args) throws URISyntaxException, IOException, InterruptedException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.104:9000"), conf, "root");

        // listStatus returns the files and directories directly under the given path
        FileStatus[] fileStatuses = fs.listStatus(new Path("/"));
        for (FileStatus fileStatus : fileStatuses) {
            System.out.println(fileStatus.getPath());
            System.out.println(fileStatus.getBlockSize());
            System.out.println(fileStatus.getPermission());
            System.out.println(fileStatus.getOwner());
            System.out.println(fileStatus.getGroup());
        }
    }
}

package com.kang;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * HDFS Java API operations that test the type of a path
 */
public class Demo02 {

    public static void main(String[] args) throws URISyntaxException, IOException, InterruptedException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.104:9000"), conf, "root");

        boolean b = fs.isDirectory(new Path("/demo"));   // is the path a directory?
        System.out.println(b);
        boolean b1 = fs.isFile(new Path("/demo"));       // is the path a regular file?
        System.out.println(b1);
        boolean exists = fs.exists(new Path("/a"));      // does the path exist?
        System.out.println(exists);
    }
}
package com.kang;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * HDFS Java API operations that create directories and files
 */
public class Demo03 {

    public static void main(String[] args) throws URISyntaxException, IOException, InterruptedException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.104:9000"), conf, "root");

        boolean mkdirs = fs.mkdirs(new Path("/a/b"));             // create nested directories
        System.out.println(mkdirs);

        boolean newFile = fs.createNewFile(new Path("/a/a.txt")); // create an empty file
        System.out.println(newFile);
    }
}
package com.kang;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * HDFS also lets you read and write data with plain Java IO streams:
 *   upload with fs.create(), download with fs.open()
 */
public class Demo04 {

    public static void main(String[] args) throws URISyntaxException, IOException, InterruptedException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.104:9000"), conf, "root");

        FSDataInputStream inputStream = fs.open(new Path("/jdk-8u371-linux-x64.tar.gz"));
        // Skip the first block (128 MB) so the read starts at the second block
        inputStream.seek(128 * 1024 * 1024);
        FileOutputStream fos = new FileOutputStream("D:\\Desktop\\block2");
        int read = 0;
        while ((read = inputStream.read()) != -1) {
            fos.write(read);
        }
        fos.close();
        inputStream.close();
        System.out.println("Finished reading from the second block onwards");
    }
}
  • HDFS is not suitable for storing large numbers of small files, and files already stored in HDFS cannot be modified.

3. HDFS workflows (how HDFS works internally)

1. HDFS upload data process

When the client establishes a connection with a DN, it connects to the DN closest to it.
How is the distance between a DN and the client judged? By the network-topology (rack-awareness) rule.
This assumes the client is on the same network as the HDFS nodes. (A minimal client-side write example is sketched after the steps below.)

  • The client asks the NameNode to upload a file; the NameNode checks whether the target file already exists and whether its parent directory exists.
  • The NameNode replies whether the upload is allowed.
  • The client asks which DataNode servers the first block should be uploaded to.
  • The NameNode returns 3 DataNodes, namely dn1, dn2 and dn3.
  • The client asks dn1 to receive the data; dn1 in turn calls dn2, and dn2 calls dn3, establishing the transfer pipeline.
  • dn1, dn2 and dn3 acknowledge back to the client step by step.
  • The client starts uploading the first block to dn1 (it first reads the data from disk into a local memory cache). Data is sent in packet units: when dn1 receives a packet it forwards it to dn2, and dn2 forwards it to dn3; every packet dn1 sends is placed in an acknowledgement queue to wait for a response.
  • When one block has been transferred, the client asks the NameNode again which servers should receive the next block (repeating steps 3-7).
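
A minimal client-side sketch of a write that goes through this pipeline (the cluster URI and user follow the earlier examples in this article; the local and HDFS paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.FileInputStream;
import java.net.URI;

public class UploadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.104:9000"), conf, "root");

        // fs.create() asks the NameNode for the target DataNodes; the returned stream
        // splits the data into packets and pushes them through the DataNode pipeline.
        try (FileInputStream in = new FileInputStream("D:\\Desktop\\local.txt");
             FSDataOutputStream out = fs.create(new Path("/demo/local.txt"))) {
            IOUtils.copyBytes(in, out, 4096);
        }
        fs.close();
    }
}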

2. HDFS download data process

  • The client asks the NameNode to download a file; by querying its metadata, the NameNode finds the DataNode addresses holding the file's blocks.
  • The client picks a DataNode (nearest first, then random) and requests the data.
  • The DataNode starts streaming the data to the client (reading it from disk into the stream and verifying it packet by packet).
  • The client receives the data in packet units, caches it locally, and then writes it to the target file. (A matching client-side read sketch follows.)
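
A minimal client-side read sketch, which triggers the steps above (again, the paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.FileOutputStream;
import java.net.URI;

public class DownloadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.104:9000"), conf, "root");

        // fs.open() asks the NameNode for the block locations, then streams packets
        // from the nearest DataNode holding each block.
        try (FSDataInputStream in = fs.open(new Path("/demo/hadoop-3.1.4.tar.gz"));
             FileOutputStream out = new FileOutputStream("D:\\Desktop\\hadoop-3.1.4.tar.gz")) {
            IOUtils.copyBytes(in, out, 4096);
        }
        fs.close();
    }
}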

3. HDFS replica (backup) mechanism

When data is uploaded, blocks are replicated according to the configured replication factor. Which nodes are chosen to hold the replicas?
Replica placement follows the rack-awareness rule.

Replica node selection in older Hadoop versions:

The first replica is placed on the node where the client runs; if the client is outside the cluster, a random node is chosen.
The second replica is placed on a random node on a different rack from the first.
The third replica is placed on the same rack as the second, on a random node.

Replica node selection in Hadoop 2.8.5:

The first replica is placed on the node where the client runs; if the client is outside the cluster, a random node is chosen.
The second replica is placed on the same rack as the first, on a random node.
The third replica is placed on a different rack, on a random node.

4. The working mechanism of NameNode and SecondaryNameNode in HDFS

This working mechanism describes how the NameNode manages its metadata.

Metadata: the directory-like structure of the files/folders stored in HDFS. It records each file's size, timestamps and number of blocks, plus the list of nodes each block is stored on, and so on. By default the NameNode reserves 1000 MB of memory for metadata, which is enough to store and manage the metadata of roughly a million blocks.

Two files are related to the metadata; they are the mechanism used to restore the metadata when HDFS restarts:

edits (edit log) file: records the clients' write and modify operations against the HDFS cluster

fsimage (image) file: can be understood as a persisted checkpoint file of the HDFS metadata

HDFS safe mode (safemode)

After HDFS starts, it first enters safe mode. Safe mode covers the period during which the edits and fsimage files are loaded into the NN's memory and the DataNodes register with the NN.

The cluster cannot be written to while it is in safe mode. Safe mode exits automatically once the NN has finished loading the metadata into memory and enough DataNodes have registered for the cluster to operate.

The function of the SNN is to perform checkpoint operations for the NN (the checkpoint mechanism).

  • When is a checkpoint triggered?

    • The checkpoint interval has elapsed - 1 hour by default
      (dfs.namenode.checkpoint.period = 3600 s)
    • 1,000,000 transactions have accumulated on HDFS since the last checkpoint
      (dfs.namenode.checkpoint.txns = 1000000)
  • The SNN asks the NN every minute whether a checkpoint should be performed
    (dfs.namenode.checkpoint.check.period = 60 s)

How to recover metadata after NameNode failure

  • Because the core of the metadata is the edits and fsimage files, and the SNN copies these files from the NN while it works, if the NN's metadata is lost we can copy the files from the SNN back into the NN's directory to recover it (the recovery may lose part of the metadata).
    SNN directory: ${hadoop.tmp.dir}/dfs/namesecondary/current
    NN directory:  ${hadoop.tmp.dir}/dfs/name/current

  • There is another way to protect the metadata: configure the NameNode to store its data in multiple directories (the HDFS edit logs and image files are written identically to each directory). This only works on the same node; if the entire node goes down, having multiple dfs.namenode.name.dir paths does not help recovery.

    hdfs-site.xml
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/opt/app/hadoop/data/dfs/name1,/opt/app/hadoop/data/dfs/name2</value>
    </property>
    
  • After changing this setting it is best to reformat HDFS or create the new directories manually

  • HA high availability mode

5. Working mechanism between NameNode and DataNode in HDFS

  • Detailed process
    • A data block is stored on the DataNode's disk as files: one file is the data itself, and the other holds metadata including the block's length, the checksum of the block data, and a timestamp.
    • After a DataNode starts, it registers with the NameNode; once registered, it periodically reports all of its block information to the NameNode (every 6 hours by default; see below).
    • A heartbeat is sent every 3 seconds, and the heartbeat response carries commands from the NameNode to the DataNode, such as copying a block to another machine or deleting a block. If no heartbeat is received from a DataNode for more than 10 minutes (10 min 30 s counting the recheck interval, see below), the node is considered unavailable.
    • Machines can safely be added to or removed from the cluster while it is running.

In addition to the data itself, a block stored on a DataNode also carries the data length, a data checksum, a timestamp, and so on.

The data checksum guarantees the integrity and consistency of the block. When a block is created, a checksum is computed from the data itself; afterwards, every time the DN summarizes its blocks it recomputes the checksum, and if the two values are inconsistent the block is considered damaged. (A simplified sketch of this idea follows.)
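
A simplified illustration of the verify-on-rescan idea (HDFS itself computes CRC checksums over small chunks of each block and stores them in a separate .meta file; this sketch only shows the compare logic):

import java.util.zip.CRC32;

public class ChecksumSketch {
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some block data".getBytes();
        long storedChecksum = checksum(block);   // computed when the block is created

        // Later, when the DataNode summarizes/scans its blocks, it recomputes and compares.
        boolean damaged = checksum(block) != storedChecksum;
        System.out.println(damaged ? "block damaged" : "block intact");
    }
}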

DataNode-to-NameNode heartbeat: by default a heartbeat is sent every three seconds; the default can be adjusted via
dfs.heartbeat.interval (3 s by default)

<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>300000</value> <!-- in milliseconds -->
</property>
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value> <!-- in seconds -->
</property>

HDFS must be stopped (and restarted) when modifying this configuration, but it does not need to be reformatted.
The heartbeat serves two purposes: 1. to detect whether the DN is still alive; 2. to deliver to the DN the commands the NN wants it to execute.

How does the NN know a DN is offline/dead/down (the dead-node time limit)? If the NN misses one heartbeat from a DN, it does not immediately consider the DN dead; it keeps waiting for further heartbeats. Only when no heartbeat has succeeded within the dead-node time limit does the NN mark the DN as dead and start the replica recovery mechanism.
The time limit is computed by the formula:
timeout = 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval
dfs.namenode.heartbeat.recheck-interval: heartbeat recheck interval, 5 minutes by default
dfs.heartbeat.interval: heartbeat interval, 3 s by default
By default, if no heartbeat from a DN has been received for more than 10 min 30 s, the DN is considered dead.
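
A quick sanity check of that default, just applying the arithmetic of the formula above:

public class DeadNodeTimeout {
    public static void main(String[] args) {
        long recheckIntervalMs = 300000;  // dfs.namenode.heartbeat.recheck-interval, 5 min
        long heartbeatIntervalMs = 3000;  // dfs.heartbeat.interval, 3 s

        long timeoutMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;
        System.out.println(timeoutMs + " ms = " + timeoutMs / 60000 + " min "
                + (timeoutMs % 60000) / 1000 + " s");   // 630000 ms = 10 min 30 s
    }
}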

By default, the DataNode reports the information of all blocks on the node to the NameNode every 6 hours:
dfs.blockreport.intervalMsec = 21600000 ms: report the DN's block information to the NN every 6 hours
dfs.datanode.directoryscan.interval = 21600 s: the DN scans its own block information on disk every 6 hours

4. Commissioning new nodes and decommissioning old nodes in HDFS and YARN (configured in the Hadoop installation on the NameNode host)

1. Concept

HDFS is a distributed file storage system and, like most big-data services, it basically runs non-stop, 24/7. If the capacity of the HDFS cluster is no longer enough, a new data node has to be added; because HDFS cannot be stopped, the data node must be added dynamically while the cluster is running (commissioning a new node). Old nodes are retired in the same dynamic way (decommissioning).

2. Commissioning a new node

Before commissioning a new node, you need to create a new virtual machine and configure Java, the Hadoop environment, passwordless SSH login, the IP address, host mappings, and the hostname.

1. Create a dfs.hosts file in the Hadoop configuration directory and list in it the hostnames of the cluster's worker (slave) nodes, including the new node.

2. In Hadoop's hdfs-site.xml file, add a dfs.hosts property whose value is the path of that file:

<!-- dfs.hosts is a whitelist: only the hosts listed in this file may access and communicate with the NameNode -->
<property>
  <name>dfs.hosts</name>
  <value>/opt/app/hadoop-3.1.4/etc/hadoop/dfs.hosts</value>
</property>


3. With HDFS running, refresh the worker-node information:
hdfs dfsadmin -refreshNodes
yarn rmadmin -refreshNodes

4. On the new node, simply start the datanode and nodemanager daemons and the node is commissioned successfully.

hadoop-daemon.sh start datanode

yarn-daemon.sh start nodemanager

3. Decommissioning old nodes (if the exclude file is added for the first time, the HDFS cluster must be restarted)

1. Create a dfs.hosts.exclude file in the Hadoop configuration directory and write in it the hostnames of the nodes to be decommissioned.

2. In Hadoop's hdfs-site.xml configuration file, declare the path of the exclude file with the dfs.hosts.exclude property:

<!-- dfs.hosts.exclude is the NameNode's blacklist: the data nodes to be decommissioned.
If a node in the blacklist also appears in the dfs.hosts file, it does not go offline immediately;
its blocks are first moved to other data nodes and only then is the node decommissioned. -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/opt/app/hadoop-3.1.4/etc/hadoop/dfs.hosts.exclude</value>
</property>

3. At the same time, remove the decommissioned node from the dfs.hosts whitelist file.

4. Refresh the node information:
hdfs dfsadmin -refreshNodes
yarn rmadmin -refreshNodes

[Note] During decommissioning, the blocks on the retiring nodes are first copied to the nodes that are not being decommissioned, and only then do the nodes go offline. Make sure that the number of nodes remaining in the cluster after decommissioning is still greater than or equal to the replication factor.
