Big Data Technology: Hadoop (HDFS)

Chapter 1 HDFS Overview

1.1 HDFS Background and Definition

  • HDFS background

As data volumes grow larger and larger, a single operating system can no longer hold all of the data, so the data gets spread across disks managed by many operating systems. But files scattered across machines are inconvenient to manage and maintain, so a system is urgently needed to manage files on multiple machines: this is a distributed file management system. HDFS is one kind of distributed file management system.

  • HDFS definition

HDFS (Hadoop Distributed File System) is, first, a file system: it stores files and locates them through a directory tree. Second, it is distributed: many servers combine to provide its functionality, and each server in the cluster plays its own role.

HDFS usage scenarios: suited to write-once, read-many workloads. Once a file has been created, written, and closed, it does not need to change.

1.2 Advantages and disadvantages of HDFS

1.2.1 Advantages of HDFS

  1. High fault tolerance

     • Data is automatically saved in multiple copies; adding copies improves fault tolerance.

     • After a copy is lost, it can be recovered automatically.

  2. Suitable for handling big data

     • Data scale: can handle data sets at the GB, TB, and even PB level.

     • File scale: can handle file counts above the million scale.

  3. Can be built on cheap machines, with reliability improved through the multi-copy mechanism.

1.2.2 Disadvantages of HDFS

  1. Not suitable for low-latency data access; millisecond-level access to stored data, for example, is impossible.

  2. Cannot efficiently store large numbers of small files.

     • Storing a large number of small files consumes a large amount of NameNode memory for file directory and block information, which is not advisable because NameNode memory is always limited.

     • The seek time for small files can exceed the read time, which violates HDFS's design goal.

  3. Concurrent writes and random file modification are not supported.

     • A file can be written by only one writer at a time; multiple threads writing simultaneously are not allowed.

     • Only appends (append) are supported; random modification of files is not.
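To see why many small files strain the NameNode, a rough back-of-the-envelope calculation helps. The figure of ~150 bytes of heap per namespace object is a commonly cited rule of thumb, not an exact number:

```java
public class NameNodeMemoryEstimate {
    // Assumed rule of thumb: each file entry and each block entry costs the
    // NameNode on the order of 150 bytes of heap (illustrative, not exact).
    static final long BYTES_PER_OBJECT = 150;

    // Heap needed for `files` small files, each occupying exactly one block.
    static long heapBytes(long files) {
        long objects = files * 2;          // one file entry + one block entry
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long files = 100_000_000L;         // 100 million small files
        // tens of GB of NameNode heap just for metadata
        System.out.println(heapBytes(files) / (1024L * 1024 * 1024) + " GB");
    }
}
```

Whether each file is 1 KB or 128 MB, it costs the NameNode the same metadata, which is why a few large files are far cheaper than many small ones.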

1.3 HDFS structure

  1. NameNode (nn): the Master; it is the supervisor and manager.

     • Manages the HDFS namespace;

     • Decides the replica placement strategy;

     • Manages data block (Block) mapping information;

     • Handles client read and write requests.

  2. DataNode: the Slave. The NameNode issues the commands, and the DataNodes perform the actual operations.

     • Store the actual data blocks;

     • Perform read/write operations on data blocks.

  3. Client: the client.

     • File splitting: when a file is uploaded to HDFS, the client splits it into blocks and then uploads them one by one;

     • Interacts with the NameNode to obtain file location information;

     • Interacts with DataNodes to read or write data;

     • Provides commands to manage HDFS, such as formatting the NameNode;

     • Provides commands to access HDFS, such as create, delete, modify, and query operations.

  4. Secondary NameNode: not a hot standby for the NameNode. When the NameNode goes down, it cannot immediately replace the NameNode and provide service.

     • Assists the NameNode and shares its workload, for example by periodically merging the Fsimage and Edits and pushing the result to the NameNode;

     • In an emergency, it can assist in recovering the NameNode.

1.4 HDFS file block size (interview focus)

Think:

  • Why can the block size be set neither too small nor too large?

  • If the HDFS block is set too small, seek time increases: the program spends its time locating the start of blocks.

  • If the block is set too large, the time to transfer the data from disk will be significantly longer than the time needed to locate the start of the block, and the program will be slow when processing the block's data.

Summary: the HDFS block size setting depends mainly on the disk transfer rate.
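This trade-off can be made concrete with the common rule of thumb that seek time should be about 1% of transfer time. The numbers below (10 ms seek, 100 MB/s disk) are illustrative assumptions:

```java
public class BlockSizeEstimate {
    // If seek time should be ~1% of transfer time, a block should take
    // 100x the seek time to transfer.
    static double idealBlockMb(double seekMs, double mbPerSec) {
        double transferSec = (seekMs / 1000.0) * 100.0; // target transfer time
        return mbPerSec * transferSec;                  // MB moved in that time
    }

    public static void main(String[] args) {
        // 10 ms seek at 100 MB/s -> ~100 MB, close to HDFS's 128 MB default
        System.out.println(idealBlockMb(10, 100) + " MB");
    }
}
```

On faster storage (e.g. 200-300 MB/s), the same rule pushes the ideal block size toward 256 MB, which is why the setting tracks the disk transfer rate.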


Chapter 2 Shell Operation of HDFS

2.1 Basic syntax

hadoop fs [specific command] OR hdfs dfs [specific command]

The two are completely equivalent.

2.2 Common commands

2.2.1 Upload

  1. -moveFromLocal: cut and paste from local to HDFS

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs  -moveFromLocal  ./shuguo.txt  /sanguo
  1. -copyFromLocal: copy files from the local file system to the HDFS path

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -copyFromLocal weiguo.txt /sanguo
  1. -put: equivalent to copyFromLocal; production environments tend to use put

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -put ./wuguo.txt /sanguo
  1. -appendToFile: append a file to the end of an existing file

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -appendToFile liubei.txt /sanguo/shuguo.txt

2.2.2 Download

  1. -copyToLocal: copy from HDFS to local

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -copyToLocal /sanguo/shuguo.txt ./
  1. -get: equivalent to copyToLocal; production environments tend to use get

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -get /sanguo/shuguo.txt ./shuguo2.txt

2.2.3 HDFS Direct Operation

  1. -ls: display directory information

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -ls /sanguo
  1. -cat: display file content

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -cat /sanguo/shuguo.txt
  1. -chgrp, -chmod, -chown: same usage as in the Linux file system; modify a file's group, permissions, or owner

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs  -chmod 666  /sanguo/shuguo.txt
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs  -chown  atguigu:atguigu   /sanguo/shuguo.txt
  1. -mkdir: create path

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -mkdir /jinguo
  1. -cp: copy from one path of HDFS to another path of HDFS

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -cp /sanguo/shuguo.txt /jinguo
  1. -mv: move files in HDFS directory

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -mv /sanguo/wuguo.txt /jinguo
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -mv /sanguo/weiguo.txt /jinguo
  1. -tail: display the last 1 KB of a file's data

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -tail /jinguo/shuguo.txt
  1. -rm: delete a file or folder

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -rm /sanguo/shuguo.txt
  1. -rm -r: Recursively delete directories and their contents

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -rm -r /sanguo
  1. -du: show folder size statistics

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -du -s -h /jinguo
  1. -setrep: set the number of copies of files in HDFS

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -setrep 10 /jinguo/shuguo.txt

2.3 API operation of HDFS

2.3.1 HDFS file upload

@Test
public void testCopyFromLocalFile() throws IOException, InterruptedException, URISyntaxException {

        // 1 Get the file system
        Configuration configuration = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:8020"), configuration, "atguigu");

        // 2 Upload the file
        fs.copyFromLocalFile(new Path("e:/banzhang.txt"), new Path("/banzhang.txt"));

        // 3 Close the resource
        fs.close();
}

2.3.2 HDFS file download

@Test
public void testCopyToLocalFile() throws IOException, InterruptedException, URISyntaxException{

        // 1 Get the file system
        Configuration configuration = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:8020"), configuration, "atguigu");

        // 2 Perform the download
        // boolean delSrc: whether to delete the source file
        // Path src: the HDFS path of the file to download
        // Path dst: the local path to download the file to
        // boolean useRawLocalFileSystem: whether to use RawLocalFileSystem (true skips writing a local .crc checksum file)
        fs.copyToLocalFile(false, new Path("/banzhang.txt"), new Path("e:/banhua.txt"), true);

        // 3 Close the resource
        fs.close();
}

2.3.3 Parameter priority

Parameter priority, from highest to lowest: (1) values set in client code > (2) user-defined configuration files on the ClassPath > (3) the server's default configuration.
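The lookup order can be modeled as a simple chain of fallbacks. This is a toy model of the priority rule only, not Hadoop's actual Configuration class:

```java
import java.util.Map;

public class ConfPriority {
    // Toy model: code-level value > classpath config file > server default.
    static String lookup(String key, Map<String, String> code,
                         Map<String, String> classpathXml,
                         Map<String, String> serverDefault) {
        if (code.containsKey(key)) return code.get(key);
        if (classpathXml.containsKey(key)) return classpathXml.get(key);
        return serverDefault.get(key);
    }

    public static void main(String[] args) {
        Map<String, String> code = Map.of("dfs.replication", "2");
        Map<String, String> xml  = Map.of("dfs.replication", "5");
        Map<String, String> def  = Map.of("dfs.replication", "3");
        // The value set in client code wins over both config layers
        System.out.println(lookup("dfs.replication", code, xml, def)); // prints 2
    }
}
```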


Chapter 3 HDFS reading and writing process (interview focus)

3.1 HDFS write data process

3.1.1 Analysis file writing

specific process:

(1) The client requests the NameNode, through the DistributedFileSystem module, to upload a file; the NameNode checks whether the target file already exists and whether its parent directory exists.

(2) NameNode returns whether it can be uploaded.

(3) The client asks the NameNode which DataNode servers the first Block should be uploaded to.

(4) NameNode returns three DataNode nodes, namely dn1, dn2, and dn3.

(5) The client requests dn1 to upload data through the FSDataOutputStream module. After receiving the request, dn1 will continue to call dn2, and then dn2 will call dn3 to complete the establishment of the communication channel.

(6) dn1, dn2, and dn3 respond to the client step by step.

(7) The client starts uploading the first Block to dn1 (data is first read from disk into a local memory cache) in units of Packets. dn1 passes each Packet to dn2, and dn2 passes it on to dn3. For every Packet dn1 transmits, it places the Packet into an acknowledgment queue to wait for the ack.

(8) When one Block has finished transmitting, the client again asks the NameNode for the servers to upload the second Block to (steps 3-7 repeat).
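Steps (5)-(7) can be sketched as a toy simulation: each Packet is forwarded down the pipeline, every node keeps a copy, and the client's pending-ack queue drains as acks come back. This is an illustration of the flow, not real HDFS code:

```java
import java.util.*;

public class WritePipeline {
    // dn -> packets it has stored
    static Map<String, List<String>> stored = new LinkedHashMap<>();

    static List<String> sendBlock(List<String> packets, List<String> pipeline) {
        for (String dn : pipeline) stored.put(dn, new ArrayList<>());
        Deque<String> ackQueue = new ArrayDeque<>(); // packets awaiting ack
        List<String> acked = new ArrayList<>();
        for (String p : packets) {
            ackQueue.add(p);                                  // client waits for ack
            for (String dn : pipeline) stored.get(dn).add(p); // forwarded hop by hop
            acked.add(ackQueue.remove());                     // ack returned up the chain
        }
        return acked;
    }

    public static void main(String[] args) {
        List<String> acks = sendBlock(List.of("p1", "p2"), List.of("dn1", "dn2", "dn3"));
        System.out.println(acks);                     // [p1, p2]
        System.out.println(stored.get("dn3").size()); // 2 - every node stored both packets
    }
}
```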

3.1.2 Rack awareness (replica storage node selection)

  • Rack Awareness Instructions

  • official description

For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on the local machine if the writer is on a datanode, otherwise on a random datanode, another replica on a node in a different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.
  • Hadoop 3.1.3 replica node selection

  • The first replica is placed on the node where the Client resides; if the client is outside the cluster, a node is chosen at random.

  • The second replica is placed on a random node on a different rack.

  • The third replica is placed on a random node on the same rack as the second replica.
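The three rules above can be expressed as a deterministic sketch. Real HDFS picks nodes at random; here the first eligible rack is taken so the result is reproducible:

```java
import java.util.ArrayList;
import java.util.List;

public class ReplicaPlacement {
    // Returns the rack chosen for each of the 3 replicas under the Hadoop 3.x policy.
    static List<String> placeReplicas(String clientRack, List<String> racks) {
        List<String> placement = new ArrayList<>();
        placement.add(clientRack);                 // replica 1: the client's rack
        String other = racks.stream()
                .filter(r -> !r.equals(clientRack))
                .findFirst().orElse(clientRack);   // replica 2: a different rack
        placement.add(other);
        placement.add(other);                      // replica 3: same rack as replica 2
        return placement;
    }

    public static void main(String[] args) {
        System.out.println(placeReplicas("rack1", List.of("rack1", "rack2", "rack3")));
        // [rack1, rack2, rack2]
    }
}
```

Placing replicas 2 and 3 on the same remote rack is what cuts inter-rack write traffic while still surviving the loss of a whole rack.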

3.2 HDFS read data process

Specific process:

(1) The client requests the NameNode, via DistributedFileSystem, to download a file; the NameNode looks up its metadata to find the DataNode addresses where the file's blocks are stored.

(2) A DataNode server is selected (nearest first, then at random) and a read request is sent to it.

(3) The DataNode begins transmitting data to the client (it reads the data from disk as an input stream and verifies it in units of Packets).

(4) The client receives Packets, caches them locally first, and then writes them to the target file.
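The "nearest first" selection in step (2) can be sketched as: prefer a DataNode on the client's rack, otherwise fall back to the first one listed. Real HDFS randomizes among equally distant nodes; this toy version is deterministic:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReadSelection {
    // dnRacks maps DataNode name -> rack name.
    static String pickDataNode(String clientRack, Map<String, String> dnRacks) {
        for (Map.Entry<String, String> e : dnRacks.entrySet())
            if (e.getValue().equals(clientRack)) return e.getKey(); // same rack wins
        return dnRacks.keySet().iterator().next();                  // otherwise first listed
    }

    public static void main(String[] args) {
        Map<String, String> dns = new LinkedHashMap<>();
        dns.put("dn1", "rack2");
        dns.put("dn2", "rack1");
        System.out.println(pickDataNode("rack1", dns)); // dn2 - on the client's rack
    }
}
```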


Chapter 4 NameNode and Secondary NameNode

4.1 How the NN and 2NN Work

1) Phase 1: NameNode startup

(1) When the NameNode starts for the first time after formatting, the Fsimage and Edits files are created. On later startups, the edit log and image file are loaded directly into memory.

(2) The client issues requests that add, delete, or modify metadata.

(3) The NameNode records the operations in the edit log and rolls the log.

(4) The NameNode applies the metadata changes in memory.

2) Phase 2: Secondary NameNode operation

(1) The Secondary NameNode asks the NameNode whether a CheckPoint is needed and brings back the NameNode's answer.

(2) The Secondary NameNode requests that a CheckPoint be performed.

(3) The NameNode rolls the Edits log it is currently writing.

(4) The pre-roll edit log and the image file are copied to the Secondary NameNode.

(5) The Secondary NameNode loads the edit log and image file into memory and merges them.

(6) A new image file, fsimage.chkpoint, is generated.

(7) fsimage.chkpoint is copied back to the NameNode.

(8) The NameNode renames fsimage.chkpoint to fsimage.
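The merge in steps (5)-(6) can be illustrated with a toy model: the fsimage is a snapshot of the namespace, and the edit log is a list of operations replayed on top of it. The real files are binary; this only shows the idea:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CheckpointMerge {
    // Replay edit-log entries ("put", path, block) / ("delete", path) onto the image.
    static Map<String, String> merge(Map<String, String> fsimage, List<String[]> edits) {
        Map<String, String> merged = new HashMap<>(fsimage);
        for (String[] op : edits) {
            if (op[0].equals("put")) merged.put(op[1], op[2]);
            else if (op[0].equals("delete")) merged.remove(op[1]);
        }
        return merged; // plays the role of fsimage.chkpoint
    }

    public static void main(String[] args) {
        Map<String, String> fsimage = Map.of("/a.txt", "blk_1");
        List<String[]> edits = List.of(
                new String[]{"put", "/b.txt", "blk_2"},
                new String[]{"delete", "/a.txt"});
        System.out.println(merge(fsimage, edits)); // {/b.txt=blk_2}
    }
}
```

Offloading this merge to the 2NN is the whole point: the NameNode keeps serving requests while the expensive replay happens elsewhere.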

4.2 Fsimage and Edits Explained

4.3 Viewing the Fsimage File with oiv

  1. View the oiv and oev commands

[atguigu@hadoop102 current]$ hdfs

oiv apply the offline fsimage viewer to an fsimage

oev apply the offline edits viewer to an edits file

  2. Basic syntax

# fsimage
hdfs oiv -p <file type> -i <fsimage file> -o <output path for the converted file>
# edits file
hdfs oev -p <file type> -i <edits file> -o <output path for the converted file>

4.4 CheckPoint Timing

  1. By default, the SecondaryNameNode performs a checkpoint every hour.

[hdfs-default.xml]

<!-- unit: seconds -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>

  2. The operation count is checked once a minute; when it reaches one million operations, the SecondaryNameNode performs a checkpoint.

<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
<description>number of operations</description>
</property>

<property>
  <name>dfs.namenode.checkpoint.check.period</name>
  <value>60</value>
<description>check the operation count once per minute</description>
</property>

Chapter 5 DataNode

5.1 DataNode Working Mechanism

Specific process:

(1) A data block is stored on a DataNode's disk as a pair of files: one holds the data itself, and the other holds metadata, including the block's length, a checksum of the block data, and a timestamp.

(2) After a DataNode starts, it registers with the NameNode; once registered, it periodically (every 6 hours) reports all of its block information to the NameNode.

(3) A heartbeat is sent every 3 seconds, and the heartbeat response carries any commands from the NameNode for that DataNode, such as copying a block to another machine or deleting a block. If no heartbeat is received from a DataNode for more than 10 minutes + 30 seconds, the node is considered unavailable.

(4) Machines can safely join and leave the cluster while it is running.

5.2 DataNode Data Integrity

How a DataNode guarantees data integrity:

(1) When a DataNode reads a Block, it computes the CheckSum.

(2) If the computed CheckSum differs from the value recorded when the Block was created, the Block is corrupted.

(3) The Client then reads the Block from another DataNode.

(4) Common checksum algorithms: CRC (32-bit), MD5 (128-bit), SHA-1 (160-bit).

(5) A DataNode periodically verifies CheckSums after its files are created.
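The read-side check in steps (1)-(2) can be demonstrated with CRC-32 from the JDK. (HDFS itself checksums data per 512-byte chunk with CRC32/CRC32C; this only shows the detection idea.)

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumDemo {
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "block data".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(block);     // recorded when the block was written

        block[0] ^= 1;                     // simulate a single corrupted bit
        long recomputed = checksum(block); // recomputed when the block is read

        // A mismatch means the block is corrupt: read another replica instead
        System.out.println(stored != recomputed); // true
    }
}
```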

5.3 Offline Time Limit Parameters

Note the units of the following settings in hdfs-site.xml:

  • dfs.namenode.heartbeat.recheck-interval is in milliseconds

  • dfs.heartbeat.interval is in seconds

<property>
    <name>dfs.namenode.heartbeat.recheck-interval</name>
    <value>300000</value>
</property>

<property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value>
</property>
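The "10 minutes + 30 seconds" timeout quoted in 5.1 follows directly from these two parameters: TimeOut = 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval. With the defaults above:

```java
public class HeartbeatTimeout {
    // TimeOut = 2 * recheck interval (ms) + 10 * heartbeat interval (s)
    static long timeoutSeconds(long recheckIntervalMs, long heartbeatIntervalSec) {
        return 2 * recheckIntervalMs / 1000 + 10 * heartbeatIntervalSec;
    }

    public static void main(String[] args) {
        // defaults: 300000 ms recheck, 3 s heartbeat -> 630 s = 10 min 30 s
        System.out.println(timeoutSeconds(300_000, 3));
    }
}
```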


Origin blog.csdn.net/m0_57126939/article/details/129261091