Big Data Study Notes 02: Hadoop HDFS

HDFS distributed file system


Hadoop = HDFS (distributed file system) + MapReduce (distributed computing framework) + YARN (resource scheduling framework) + Common module. This article mainly organizes the key knowledge points of HDFS.

1. Introduction to HDFS

HDFS (full name: Hadoop Distributed File System) is a core component of Hadoop and provides distributed storage services.
Distributed file systems span multiple computers and have broad application prospects in the era of big data; they
provide the scalability needed to store and process ultra-large-scale data.
HDFS is one kind of distributed file system.

2. Important Concepts of HDFS

HDFS locates files through a unified namespace directory tree. It is also distributed: many servers combine to provide its functionality, and the servers in the cluster each have their own roles (the work is split up, and each node performs its own duties).

  • Typical Master/Slave architecture

The architecture of HDFS is a typical Master/Slave structure.
An HDFS cluster usually consists of one NameNode plus multiple DataNodes (an HA architecture has two NameNodes, and there is also a federation mechanism).
The NameNode is the master node of the cluster, and the DataNodes are the slave nodes of the cluster.

  • Block storage (block mechanism)

Files in HDFS are physically stored in blocks, and the block size can be specified by a configuration parameter; the default block size in Hadoop 2.x is 128 MB.
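To make the block mechanism concrete, here is a minimal client-side sketch (my addition, not from the original notes) that overrides dfs.blocksize before writing a file. It assumes a Hadoop 2.9.x client dependency on the classpath, the hdfs://linux121:9000 address used elsewhere in these notes, the root user, and a hypothetical target path /test/blocksize.txt.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.blocksize controls the block size of files written by this client;
        // 134217728 bytes (128 MB) is the Hadoop 2.x default shown above.
        conf.set("dfs.blocksize", "134217728");

        FileSystem fs = FileSystem.get(new URI("hdfs://linux121:9000"), conf, "root");

        // Write a small file; the block size above applies to the blocks it is split into.
        try (FSDataOutputStream out = fs.create(new Path("/test/blocksize.txt"))) {
            out.writeUTF("hello hdfs block size");
        }
        fs.close();
    }
}
```

Note that the block size only affects files written with this configuration; files that already exist keep the block size they were written with.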

  • Namespace (NameSpace)

HDFS supports the traditional hierarchical file organization structure. Users or applications can create directories and save files in these directories. The hierarchical structure of the file system namespace is similar to most existing file systems: users can create, delete, move, or rename files.
The NameNode is responsible for maintaining the namespace of the file system. Any modification to the file system namespace or its attributes is recorded by the NameNode.
HDFS presents clients with a single abstract directory tree. The access format is hdfs://<namenode hostname>:<port>/<path>, for example: hdfs://linux121:9000/test/input
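As an illustration of the namespace operations above (create, list, rename, delete), here is a rough Java client sketch, my addition rather than part of the original notes. It assumes a Hadoop 2.9.x client dependency and uses a hypothetical /test/demo directory so as not to touch paths used elsewhere in these notes.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceDemo {
    public static void main(String[] args) throws Exception {
        // The URI follows the access format above: hdfs://<namenode hostname>:<port>
        FileSystem fs = FileSystem.get(new URI("hdfs://linux121:9000"), new Configuration(), "root");

        // Work with the namespace like a normal hierarchical file system:
        fs.mkdirs(new Path("/test/demo"));                             // create a directory
        for (FileStatus status : fs.listStatus(new Path("/test"))) {
            System.out.println(status.getPath());                      // e.g. hdfs://linux121:9000/test/demo
        }
        fs.rename(new Path("/test/demo"), new Path("/test/demo2"));    // rename / move
        fs.delete(new Path("/test/demo2"), true);                      // delete recursively
        fs.close();
    }
}
```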

  • NameNode metadata management

The directory structure and the file block location information are called metadata. The NameNode's metadata records the block information for each file (the id of each block and the DataNode nodes where it is located).

  • DataNode data storage

The actual storage and management of each file block is handled by the DataNode nodes. A block is stored on multiple DataNodes, and each DataNode periodically reports the block information it holds to the NameNode.
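To see the block-to-DataNode mapping that the NameNode serves from its metadata, a sketch along the following lines could be used. It assumes the /test/bigdata/hadoop.txt file created in the shell demos later in these notes; the class name and connection details are illustrative.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://linux121:9000"), new Configuration(), "root");

        FileStatus status = fs.getFileStatus(new Path("/test/bigdata/hadoop.txt"));
        // Ask the NameNode for the file's blocks and the DataNodes holding each block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```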

  • Replica mechanism

For fault tolerance, every block of a file has replicas. The block size and replication factor of each file are configurable. An application can specify the number of replicas of a file; the replication factor can be set when the file is created and changed later.
The default replication factor is 3.
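A hedged sketch of how a client might control the replication factor, both for newly written files (via dfs.replication) and for an existing file (via setReplication). The file path reuses the one from the shell demos; everything else is an assumption for illustration.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Replication factor used for files created by this client (the default is 3).
        conf.set("dfs.replication", "2");

        FileSystem fs = FileSystem.get(new URI("hdfs://linux121:9000"), conf, "root");

        // The factor of an existing file can also be changed afterwards:
        fs.setReplication(new Path("/test/bigdata/hadoop.txt"), (short) 3);

        // Read back the current replication factor recorded in the NameNode metadata.
        System.out.println(fs.getFileStatus(new Path("/test/bigdata/hadoop.txt")).getReplication());
        fs.close();
    }
}
```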

  • Write once, read many times

HDFS is designed for write-once, read-many scenarios and does not support random modification of files (it supports appending, but not random updates).
Because of this, HDFS is suitable as low-level storage for big data analysis, and is not suitable for applications such as network drives (modification is inconvenient, latency is high, network overhead is large, and the cost is too high).
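Correspondingly, the client API exposes an append() call but no random-write call. A minimal sketch, assuming append is enabled (dfs.support.append, which is on by default in Hadoop 2.x) and reusing the demo file path:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://linux121:9000"), new Configuration(), "root");

        // Appending to an existing file is allowed; overwriting bytes in the middle is not.
        try (FSDataOutputStream out = fs.append(new Path("/test/bigdata/hadoop.txt"))) {
            out.write("one more line\n".getBytes("UTF-8"));
        }
        fs.close();
    }
}
```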

3. HDFS Architecture

(Figure: HDFS architecture diagram)

NameNode (nn): the manager of the HDFS cluster, the Master

  • Maintains and manages the HDFS namespace (NameSpace)
  • Maintains the replica policy
  • Records the mapping of file blocks (Block)
  • Handles client read and write requests

DataNode: the Slave node; the NameNode issues commands, and the DataNode performs the actual operations

  • Stores the actual data blocks
  • Handles reading and writing of data blocks

Client: the client

  • When uploading a file to HDFS, the client is responsible for splitting the file into blocks and uploading them
  • Interacts with the NameNode to obtain file location information
  • Interacts with DataNodes to read or write files
  • The client can use commands to manage or access HDFS

Operating principle: keep this architecture diagram in mind; having the picture in your head deepens your understanding of how HDFS works.

4. HDFS Client Operations

  • Operating HDFS from the shell command line
    Basic syntax: bin/hadoop fs <command> OR bin/hdfs dfs <command>
    Command list:
    [root@linux121 hadoop-2.9.2]# bin/hdfs dfs (running this with no arguments prints the command usage)
    Usage: hadoop fs [generic options]
    [-appendToFile <localsrc> ... <dst>]
    [-cat [-ignoreCrc] <src> ...]
    [-checksum <src> ...]
    [-chgrp [-R] GROUP PATH...]
    [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
    [-chown [-R] [OWNER][:[GROUP]] PATH...]
    [-copyFromLocal [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
    [-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] <path> ...]
    [-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
    [-createSnapshot <snapshotDir> [<snapshotName>]]
    [-deleteSnapshot <snapshotDir> <snapshotName>]
    [-df [-h] [<path> ...]]
    [-du [-s] [-h] [-x] <path> ...]
    [-expunge]
    [-find <path> ... <expression> ...]
    [-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-getfacl [-R] <path>]
    [-getfattr [-R] {-n name | -d} [-e en] <path>]
    [-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
    [-help [cmd ...]]
    [-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...]]
    [-mkdir [-p] <path> ...]
    [-moveFromLocal <localsrc> ... <dst>]
    [-moveToLocal <src> <localdst>]
    [-mv <src> ... <dst>]
    [-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
    ... (omitted)

  • HDFS command demonstrations
    Demonstrations of some common HDFS commands

  1. Start the Hadoop cluster (to facilitate subsequent tests)
[root@linux121 hadoop-2.9.2]$ sbin/start-dfs.sh
[root@linux122 hadoop-2.9.2]$ sbin/start-yarn.sh

2. -help: output the help for a command

[root@linux121 hadoop-2.9.2]$ hadoop fs -help rm
  3. -ls: display directory information
[root@linux121 hadoop-2.9.2]$ hadoop fs -ls /
  4. -mkdir: create a directory on HDFS
[root@linux121 hadoop-2.9.2]$ hadoop fs -mkdir -p /test/bigdata
  5. -moveFromLocal: cut and paste from local to HDFS
[root@linux121 hadoop-2.9.2]$ touch hadoop.txt
[root@linux121 hadoop-2.9.2]$ hadoop fs  -moveFromLocal ./hadoop.txt  /test/bigdata
  6. -cat: display the contents of a file
[root@linux121 hadoop-2.9.2]$ hadoop fs -cat /test/bigdata/hadoop.txt
  7. -chgrp, -chmod, -chown: same usage as in the Linux file system; modify a file's group, permissions, or owner
[root@linux121 hadoop-2.9.2]$ hadoop fs  -chmod  777 /test/bigdata/hadoop.txt
[root@linux121 hadoop-2.9.2]$ hadoop fs  -chown root:root /test/bigdata/hadoop.txt
  8. -copyFromLocal: copy a file from the local file system to an HDFS path
[root@linux121 hadoop-2.9.2]$ hadoop fs -copyFromLocal README.txt /
  9. -copyToLocal: copy from HDFS to local
[root@linux121 hadoop-2.9.2]$ hadoop fs -copyToLocal /test/bigdata/hadoop.txt ./
  10. -get: equivalent to copyToLocal, i.e. download a file from HDFS to the local file system
[root@linux121 hadoop-2.9.2]$ hadoop fs -get /test/bigdata/hadoop.txt ./
  11. -put: equivalent to copyFromLocal
[root@linux121 hadoop-2.9.2]$ hadoop fs -mkdir -p /user/root/test/
# create yarn.txt on the local file system
[root@linux121 hadoop-2.9.2]$ vim yarn.txt
resourcemanager nodemanager
[root@linux121 hadoop-2.9.2]$ hadoop fs -put ./yarn.txt /user/root/test
  12. -setrep: set the replication factor of a file in HDFS
[root@linux121 hadoop-2.9.2]$ hadoop fs -setrep 10 /test/bigdata/hadoop.txt

The replication factor set here is only recorded in the NameNode's metadata. Whether there really are that many replicas depends on the number of DataNodes. Since there are currently only 3 machines, there can be at most 3 replicas; only when the number of nodes grows to 10 can the replica count actually reach 10.

5. HDFS Read and Write Analysis

5.1 HDFS read data process
(Figure: HDFS read data flow)

  1. The client requests the NameNode, through DistributedFileSystem, to download a file; by querying its metadata, the NameNode finds the addresses of the DataNodes where the file's blocks are stored.
  2. The client picks a DataNode (nearest first, then random) and requests to read the data.
  3. The DataNode starts transmitting data to the client (it reads the data from disk as an input stream and sends it in units of Packets, with checksum verification).
  4. The client receives the Packets, first caches them locally, and then writes them to the target file.
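The same read path, seen from the Java client side: a rough sketch (not from the original notes) that opens a file through DistributedFileSystem and streams it to a local copy. The local file name hadoop_copy.txt is made up; the HDFS path reuses the earlier demo file, and a Hadoop 2.9.x client dependency is assumed.

```java
import java.io.FileOutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://linux121:9000"), new Configuration(), "root");

        // open() goes through DistributedFileSystem: the NameNode is asked for the
        // block locations, and the returned stream then reads each block from a DataNode.
        try (FSDataInputStream in = fs.open(new Path("/test/bigdata/hadoop.txt"));
             FileOutputStream out = new FileOutputStream("hadoop_copy.txt")) {
            IOUtils.copyBytes(in, out, 4096, false);  // copy in 4 KB buffers
        }
        fs.close();
    }
}
```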

5.2 HDFS write data process
(Figure: HDFS write data flow)

  1. The client, through the DistributedFileSystem module, asks the NameNode to upload a file; the NameNode checks whether the target file already exists and whether its parent directory exists.
  2. The NameNode replies whether the upload is allowed.
  3. The client asks which DataNode servers the first Block should be uploaded to.
  4. The NameNode returns 3 DataNode nodes: dn1, dn2, and dn3.
  5. The client, through FSDataOutputStream, asks dn1 to upload the data. When dn1 receives the request it calls dn2, and dn2 in turn calls dn3, establishing the communication pipeline (to ensure data safety).
  6. dn1, dn2, and dn3 acknowledge the client level by level.
  7. The client starts uploading the first Block to dn1 (first reading the data from disk into a local memory cache), sending it in units of Packets; dn1 passes each Packet it receives to dn2, and dn2 passes it to dn3. For every Packet it sends, dn1 places it in an acknowledgement queue to wait for the ack.
  8. After one Block has finished transmitting, the client again asks the NameNode which servers to upload the next Block to (repeating steps 3 to 7).
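And the corresponding write path from the client side: a hedged sketch in which create() performs steps 1 to 4 against the NameNode, and the returned FSDataOutputStream streams Packets through the DataNode pipeline (steps 5 to 8). The target name yarn_api.txt is hypothetical; yarn.txt is the local file created in the shell demo above.

```java
import java.io.FileInputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://linux121:9000"), new Configuration(), "root");

        // create() asks the NameNode for permission and target DataNodes (steps 1-4);
        // writing to the stream then sends Packets through the dn1 -> dn2 -> dn3 pipeline (steps 5-8).
        try (FileInputStream in = new FileInputStream("yarn.txt");
             FSDataOutputStream out = fs.create(new Path("/user/root/test/yarn_api.txt"))) {
            IOUtils.copyBytes(in, out, 4096, false);
        }
        fs.close();
    }
}
```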

6. NN and 2NN (NameNode and Secondary NameNode)

6.1 HDFS metadata management mechanism (critical)

Question 1: How does the NameNode manage and store metadata?
A computer can store data in two places: memory or disk.
If metadata were stored only on disk: disk cannot give clients the fast, low-latency responses they need for metadata operations, although it is durable.
If metadata were stored only in memory: memory can serve queries efficiently and respond quickly to client requests, but if the process goes down, all of the data in memory is lost.
Solution: memory + disk, i.e. NameNode memory + an FsImage file on disk.
New problem: how should the metadata be divided between disk and memory?
Should the two copies be exactly the same, or should the two combine to form the complete data?
If the two copies were exactly the same: whenever a client adds, deletes, or modifies metadata, both copies must be kept consistent, and rewriting the FsImage file for every operation is inefficient.
Solution:
The two combine to form the complete data: the NameNode introduces an edits file (a log file that is append-only). The edits file records the client's add, delete, and modify operations, so the NameNode no longer dumps its in-memory data into a new FsImage file on every change (that kind of operation is too resource-intensive).

(Figure: NN and 2NN metadata management / checkpoint mechanism)

Stage one: NameNode startup

  • When the NameNode is started for the first time after formatting, it creates the Fsimage and Edits files. If it is not the first start, it loads the edit log and the image file directly into memory.
  • A client requests to add, delete, or modify metadata.
  • The NameNode records the operation in the operation log and updates the rolling log.
  • The NameNode applies the add, delete, or modify operation to the metadata in memory.

Stage two: Secondary NameNode operation

  • The Secondary NameNode asks the NameNode whether a CheckPoint is needed, and brings back the NameNode's answer as to whether to perform the checkpoint operation.

  • The Secondary NameNode requests to execute CheckPoint.

  • The NameNode rolls the Edits log that is currently being written.

  • The pre-roll edit logs and the image file are copied to the Secondary NameNode.

  • The Secondary NameNode loads the edit log and image file into the memory and merges them.

  • Generate a new image file fsimage.chkpoint.

  • Copy fsimage.chkpoint to NameNode.

  • The NameNode renames fsimage.chkpoint to fsimage.

6.2 Fsimage and Edits file analysis

After the NameNode has been formatted, the following files are generated in the /opt/lagou/servers/hadoop-2.9.2/data/tmp/dfs/name/current directory:

  • Fsimage file: a mirror of the metadata in the NameNode, generally called a checkpoint; it contains all the directory and file information of the HDFS file system (number of blocks, replication factor, permissions, etc.)
  • Edits file: stores the client's update operations on the HDFS file system; every update operation (not including queries) performed by a client is recorded in the Edits file
  • seen_txid: this file stores a number that corresponds to the transaction id in the name of the last Edits file
  • VERSION: this file records version information about the NameNode, such as clusterID, namespaceID, etc.

Question: how does the NameNode determine which Edits files to load when it starts?
On startup, the NN needs to load the fsimage file plus the edits files that have not yet been merged by the 2NN. How does the NN know which edits have already been merged?
Answer: it can tell from the transaction id in the fsimage file's own name which edits have already been merged.

7. Hadoop Quotas, Archives, and Cluster Safe Mode

HDFS file quota configuration
HDFS file quotas allow us to limit, for a given directory, either the number of files or the total size of the data stored there, in order to impose a cap similar to the per-user upload limits of network drives such as Baidu Netdisk.
1. Quantity limit

hdfs dfs -mkdir -p /user/root/test     # create an HDFS directory
hdfs dfsadmin -setQuota 2 /user/root/test       # allow at most 2 entries under this directory; when uploading, you will find only one file can be uploaded
hdfs dfsadmin -clrQuota /user/root/test   # clear the file-count quota

2. Space size limit

hdfs dfsadmin -setSpaceQuota 4k /user/root/test  # limit the space of the directory to 4 KB
# uploading a file larger than 4 KB reports that the quota has been exceeded
hdfs dfs -put /export/softwares/xxx.tar.gz /user/root/test
hdfs dfsadmin -clrSpaceQuota /user/root/test  # clear the space quota
# check the quota settings of an HDFS directory
hdfs dfs -count -q -h /user/root/test
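For completeness, the quota and usage numbers printed by hdfs dfs -count -q can also be read programmatically through ContentSummary; a small sketch of my own, with the same connection assumptions as the earlier Java examples:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class QuotaDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://linux121:9000"), new Configuration(), "root");

        // Equivalent information to `hdfs dfs -count -q`: quotas and current usage.
        ContentSummary summary = fs.getContentSummary(new Path("/user/root/test"));
        System.out.println("name quota  = " + summary.getQuota());       // -1 means no quota set
        System.out.println("space quota = " + summary.getSpaceQuota());
        System.out.println("files       = " + summary.getFileCount());
        System.out.println("space used  = " + summary.getSpaceConsumed());
        fs.close();
    }
}
```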

HDFS safe mode
Safe mode is a special state of HDFS. In this state, the file system only accepts read requests and does not accept change requests such as deletions or modifications. When the NameNode master node starts, HDFS first enters safe mode. As the DataNodes start, they report their available blocks and other state to the NameNode. Once the whole system reaches the safety threshold, HDFS automatically leaves safe mode. While HDFS is in safe mode, file blocks cannot be replicated; the required minimum number of replicas is determined from the DataNodes' state at startup, and no extra replication is performed during startup (so that the minimum replica requirement is met). When an HDFS cluster has just started, it stays in safe mode for 30 seconds by default; only after those 30 seconds does the cluster leave safe mode, after which it can be operated on normally.

hdfs dfsadmin -safemode get      # query safe mode status (enter / leave / wait are also supported)

Hadoop archiving technology
Archiving mainly solves the problem of large numbers of small files in an HDFS cluster!
Because a large number of small files consumes NameNode memory, storing lots of small files in HDFS wastes NameNode memory resources!
A Hadoop archive file (HAR file) is a more efficient file archiving tool. A HAR file is created from a set of files by the archive tool. It reduces the NameNode's memory usage while still allowing the files to be accessed transparently. In plain terms, a HAR file counts as a single file to the NameNode, which reduces memory waste, but for actual file operations each archived file is still an independent file.
Case study

  1. Start the YARN cluster
[root@linux121 hadoop-2.9.2]$ start-yarn.sh
  2. Archive files
     Archive all the files in the /user/root/input directory into an archive file called input.har, and store the archive under the /user/root/output path.
[root@linux121 hadoop-2.9.2]$ bin/hadoop archive -archiveName input.har -p /user/root/input /user/root/output
  3. View the archive
[root@linux121 hadoop-2.9.2]$ hadoop fs -lsr /user/root/output/input.har
[root@linux121 hadoop-2.9.2]$ hadoop fs -lsr har:///user/root/output/input.har
  4. Unarchive files
[root@linux121 hadoop-2.9.2]$ hadoop fs -cp har:///user/root/output/input.har/* /user/root

Summary

The key things to master are the HDFS read and write processes and the HDFS metadata management mechanism.


Origin blog.csdn.net/weixin_43900780/article/details/114047833