Introduction to HDFS and a Reference of HDFS Shell Commands

1 Overview of HDFS

1.1 Background and definition of HDFS

1.1.1 Background of HDFS

As data volumes grow, a single machine's operating system can no longer hold all of the data, so the data is spread across disks managed by many operating systems. Data scattered this way is inconvenient to manage and maintain, which creates the need for a system that manages files across multiple machines: a distributed file system. HDFS is one such distributed file system.

1.1.2 Definition of HDFS

HDFS (Hadoop Distributed File System) is first of all a file system: it stores files and locates them through a directory tree. Second, it is distributed: many servers work together to provide its functions, and each server in the cluster has its own role.

Usage scenarios of HDFS: HDFS suits write-once, read-many workloads and does not support modifying files in place. This makes it a good fit for data analysis but a poor fit for network disk applications.

1.2 Advantages and disadvantages of HDFS

1.2.1 Advantages

1 High fault tolerance

1.1 Data is automatically stored as multiple replicas; adding replicas improves fault tolerance.

1.2 When a replica is lost, it is restored automatically.

2 Suitable for processing big data

2.1 Data scale: it can handle data at the GB, TB, or even PB level;

2.2 File count: it can handle more than a million files.

3 It can be built on inexpensive machines, and the multi-replica mechanism provides the reliability.
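
The replica mechanism in item 1 can be made concrete with a little arithmetic. The sketch below assumes a replication factor of 3 (the stock `dfs.replication` default) and a hypothetical 10 GB file; the function names are illustrative and not part of any Hadoop API.

```python
def raw_storage_gb(logical_gb: float, replication: int) -> float:
    """Raw disk space consumed when every block is stored `replication` times."""
    return logical_gb * replication


def tolerable_replica_losses(replication: int) -> int:
    """How many replicas of a block can be lost while one copy still survives."""
    return max(replication - 1, 0)


# Assumption: replication factor 3 and a 10 GB file (both illustrative).
replication = 3
print(raw_storage_gb(10, replication))        # raw GB consumed across the cluster
print(tolerable_replica_losses(replication))  # replicas that may be lost per block
```

Fault tolerance is paid for in disk space: each extra replica adds one more full copy of the data.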

1.2.2 Disadvantages

1 It is not suitable for low-latency data access; millisecond-level access, for example, is not achievable.

2 It cannot efficiently store a large number of small files.

2.1 A large number of small files consumes a lot of NameNode memory for directory and block information. This is undesirable because NameNode memory is always limited;

2.2 For small files, the addressing time exceeds the read time, which defeats the design goal of HDFS.

3 It does not support concurrent writes or random modification of files.

3.1 A file can have only one writer at a time; multiple threads cannot write to it simultaneously;

3.2 Only appending data (append) is supported; random modification of files is not.
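
To see why small files are costly (item 2 above), a rough estimate helps. The sketch assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per file or block object, and a 128 MB block size. Both figures are assumptions for illustration, not numbers from this article, and `namenode_objects` is a hypothetical helper.

```python
import math

BYTES_PER_OBJECT = 150   # assumed rule of thumb: heap per file or block object
BLOCK_SIZE_MB = 128      # assumed block size


def namenode_objects(total_mb: int, file_size_mb: int) -> int:
    """Count file objects plus block objects needed to hold total_mb of data."""
    files = math.ceil(total_mb / file_size_mb)
    blocks_per_file = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return files + files * blocks_per_file


one_tb_mb = 1024 * 1024
for size_mb in (1024, 1):  # the same 1 TB stored as 1 GB files, then as 1 MB files
    objects = namenode_objects(one_tb_mb, size_mb)
    heap_mb = objects * BYTES_PER_OBJECT / (1024 * 1024)
    print(f"{size_mb:>5} MB files: {objects:>9} objects, ~{heap_mb:.1f} MB of heap")
```

Under these assumptions the same terabyte needs over two hundred times as many NameNode objects when it is stored as 1 MB files instead of 1 GB files.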

1.3 HDFS structure

1 NameNode (nn): the Master, which supervises and manages the cluster.

1.1 Manage the HDFS namespace;

1.2 Configure the replica policy;

1.3 Manage data block (Block) mapping information;

1.4 Handle client read and write requests.

2 DataNode: the Slave. The NameNode issues commands, and the DataNode performs the actual operations.

2.1 Store the actual data blocks;

2.2 Perform read/write operations on data blocks.

3 Client: the client through which users access HDFS.

3.1 File splitting. When a file is uploaded to HDFS, the Client splits it into blocks and uploads them one by one;

3.2 Interact with the NameNode to obtain the location information of a file;

3.3 Interact with DataNodes to read or write data;

3.4 The Client provides commands to manage HDFS, such as formatting the NameNode;

3.5 The Client provides commands to access HDFS, such as creating, deleting, querying, and modifying files.

4 Secondary NameNode: not a hot standby for the NameNode. When the NameNode goes down, it cannot immediately take over and provide service.

4.1 Assist the NameNode by taking over part of its workload, such as periodically merging Fsimage and Edits and pushing the result to the NameNode;

4.2 In an emergency, it can assist in recovering the NameNode.
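
The file splitting in item 3.1 can be sketched as follows. This is a simplified model, not the HDFS client code: it assumes a fixed block size of 128 MB and only computes block boundaries, whereas a real client also streams each block to DataNodes chosen by the NameNode. The name `split_into_blocks` is hypothetical.

```python
from typing import List, Tuple

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


mb = 1024 * 1024
for offset, length in split_into_blocks(300 * mb):
    print(f"offset={offset // mb} MB, length={length // mb} MB")
```

A 300 MB file yields two full 128 MB blocks and a final 44 MB block; the last block occupies only as much space as its data.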


1.4 HDFS file block size

Files in HDFS are physically stored in blocks (Block). The block size is set by the configuration parameter dfs.blocksize; the default is 128 MB in Hadoop 2.x and 64 MB in older versions.

Suppose the addressing time for a block in the cluster is about 10 ms, that is, it takes 10 ms to locate the target block. Addressing time at 1% of transfer time is considered the optimal ratio, so transfer time = 10 ms / 0.01 = 1000 ms = 1 s. Disk transfer rates are currently around 100 MB/s, so block size = 1 s * 100 MB/s = 100 MB.

1 If the HDFS block size is set too small, addressing time increases, and the program spends its time locating the start of blocks;

2 If the block size is set too large, the time to transfer the data from disk is far longer than the time needed to locate the start of the block, so the program processes that block very slowly.

In summary, the HDFS block size depends mainly on the disk transfer rate and is generally set to about the amount of data the disk can transfer in 1 s.
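
The block-size reasoning above reduces to a few lines of arithmetic. The sketch reuses the article's figures (10 ms addressing time, addressing time at 1% of transfer time, 100 MB/s disk rate). Rounding up to a power of two is an added assumption that happens to land on the 128 MB default, and both function names are hypothetical.

```python
def ideal_block_size_mb(seek_ms: float, seek_ratio: float, disk_mb_per_s: float) -> float:
    """Block size at which addressing time is `seek_ratio` of the transfer time."""
    transfer_s = (seek_ms / 1000) / seek_ratio
    return transfer_s * disk_mb_per_s


def next_power_of_two(value: float) -> int:
    """Smallest power of two that is greater than or equal to value."""
    power = 1
    while power < value:
        power *= 2
    return power


size = ideal_block_size_mb(seek_ms=10, seek_ratio=0.01, disk_mb_per_s=100)
print(size, next_power_of_two(size))
```

With a faster disk, say 200 MB/s, the same reasoning gives 200 MB, which would suggest a 256 MB block.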

2 Shell operation of HDFS

2.1 Basic syntax

HDFS can be operated from the shell in two forms. They differ only in syntax and behave identically:

  • hadoop fs + specific commands
  • hdfs dfs + specific commands
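
When these commands are driven from a script, both forms can be assembled the same way. The helper below is a hypothetical sketch: `build_command` and `run` are not part of Hadoop. It only builds the argument list so the result can be inspected first; actually running a command requires a configured Hadoop client on the PATH.

```python
import subprocess
from typing import List

# The two equivalent command prefixes described above.
PREFIXES = {"hadoop": ["hadoop", "fs"], "hdfs": ["hdfs", "dfs"]}


def build_command(action: str, *args: str, form: str = "hadoop") -> List[str]:
    """Assemble an argument list such as ['hadoop', 'fs', '-ls', '/'] without running it."""
    return PREFIXES[form] + [f"-{action}", *args]


def run(command: List[str]) -> str:
    """Run the command and return its stdout; needs a working Hadoop installation."""
    return subprocess.run(command, check=True, capture_output=True, text=True).stdout


print(build_command("ls", "/"))
print(build_command("ls", "/", form="hdfs"))
```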

2.2 Common commands

2.2.1 Upload

  1. -moveFromLocal: cut and paste from the local file system to HDFS

hadoop fs -moveFromLocal LocalFilePath HDFSPath

or

hdfs dfs -moveFromLocal LocalFilePath HDFSPath

  2. -copyFromLocal: copy a file from the local file system to an HDFS path

hadoop fs -copyFromLocal LocalFilePath HDFSPath

  3. -appendToFile: append a local file to the end of an existing HDFS file

hadoop fs -appendToFile LocalFilePath HDFSFilePath

  4. -put: equivalent to -copyFromLocal

hadoop fs -put LocalFilePath HDFSPath

2.2.2 Download

  1. -copyToLocal: copy from HDFS to the local file system

hadoop fs -copyToLocal HDFSFilePath LocalPath

  2. -get: equivalent to -copyToLocal; downloads files from HDFS to the local file system
  3. -getmerge: merge multiple files and download them as one file, for example when /user/test in HDFS contains several files: log1.txt, log2.txt, ...

hadoop fs -getmerge /user/test/* ./zaiyiqi.txt

2.2.3 HDFS Direct Operation

  1. -ls: display directory information
  2. -mkdir: create a directory on HDFS

hadoop fs -mkdir -p Path

  3. -cat: display file content
  4. -chmod, -chown, -chgrp: used as in the Linux file system, to change a file's permissions, owner, and group respectively.

hadoop fs -chmod [-R] Mode Path

hadoop fs -chown [-R] [Owner][:[Group]] Path

hadoop fs -chgrp [-R] Group Path

  5. -cp: copy from one HDFS path to another HDFS path.
  6. -mv: move files between HDFS directories.
  7. -tail: display the last 1 KB of a file
  8. -rm: delete a file or folder
  9. -rmdir: remove empty directories
  10. -du: show size statistics for a folder
  11. -setrep: set the number of replicas of a file in HDFS

hadoop fs -setrep 10 Path


Origin blog.csdn.net/meng_xin_true/article/details/126038850