07-HDFS Getting Started and Shell Commands

1 File system

  • A file system is a method of storing and organizing data that makes files easy to access and query.
  • The abstract logical concepts of files and tree-structured directories replace the data-block concept used by physical devices such as disks. Users who save data through the file system do not need to care where the underlying data is stored on the disk; they only need to remember the file's directory and name.
  • File systems typically use storage devices such as disks and optical discs and maintain the physical location of files on the device.
  • A file system is a set of abstract data types that implements operations such as data storage, hierarchical organization, access, and retrieval.

File name

In the DOS operating system, a file name consists of a main name and an extension, separated by a dot.

File names are used to locate storage positions and to distinguish different files; the computer accesses files by name.

Certain symbols have special meanings and are generally not allowed to appear in file names.

Metadata

Metadata is also called descriptive data: data that records information about other data.

File system metadata generally refers to information such as file size, last modification time, underlying storage location, attributes, owner, and permissions.
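As a concrete illustration, the same kinds of metadata fields can be inspected on a local Linux file system with GNU stat; the path /tmp/meta_demo.txt below is a made-up throwaway example:

```shell
# Create a small file and inspect its metadata (size, permissions, owner,
# modification time). /tmp/meta_demo.txt is a hypothetical example path.
echo "hello" > /tmp/meta_demo.txt
chmod 644 /tmp/meta_demo.txt
stat -c 'size=%s bytes  perms=%a  modified=%y' /tmp/meta_demo.txt
```

HDFS exposes analogous fields (size, replication, block size, owner, permissions) through hadoop fs -ls and hadoop fs -stat.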

File system classification

  • disk-based file system

This is the classic way of storing files on non-volatile media (disks, optical discs), used to preserve file contents between sessions. It includes well-known file systems such as ext2/3/4, xfs, ntfs, and iso9660. On a Linux system they can be viewed with df -Th.

  • virtual file system

A file system generated inside the kernel, such as proc.

The proc file system is a virtual file system that provides a new way for user space to communicate with the Linux kernel.

  • network file system

A network file system (NFS, network file system) is a mechanism for mounting partitions (directories) on a remote host into the local system over the network.

It allows the local computer to access data on another computer; operations on files in such a file system are performed over the network connection.

2 Distributed file system HDFS

2.1 Introduction to HDFS

  • HDFS (Hadoop Distributed File System) is the Hadoop distributed file system. It is one of the core components of Apache Hadoop and serves as the lowest-level distributed storage service in the big data ecosystem.
  • A distributed file system solves the problem of how to store big data. "Distributed" means a storage system that spans multiple computers.
  • HDFS is a distributed file system that can run on commodity hardware. It is highly fault-tolerant and suited to applications with large data sets, making it well suited to storing big data (TB or PB scale).
  • HDFS uses multiple computers to store files while providing a unified access interface, so using the distributed file system feels just like accessing an ordinary file system.

2.2 HDFS design goals

  • **Hardware failure** is the norm. An HDFS cluster may consist of hundreds or thousands of servers, and any component may fail. Therefore, fault detection and automatic fast recovery are core architectural goals of HDFS.
  • Applications on HDFS mainly use streaming data access (Streaming Data Access). HDFS is designed for batch processing rather than interactive use; it emphasizes high throughput of data access over low response time.
  • Typical HDFS files range from GB to TB in size. Therefore, HDFS is tuned to support Large Data Sets: it should provide high aggregate data bandwidth, scale to hundreds of nodes in a cluster, and support tens of millions of files in a cluster.
  • Most HDFS applications need a write-once-read-many access model for files. Once a file is created, written, and closed, it does not need to be modified. This assumption simplifies the data consistency problem and enables high-throughput data access.
  • Moving computation is cheaper than moving data. A computation requested by an application is more efficient the closer it runs to the data it operates on, so it is clearly better to move the computation near the data than to move the data to the application's location.
  • HDFS is designed to be easily portable from one platform to another, which encourages its widespread adoption as the platform of choice for a large number of applications.

2.3 HDFS application scenarios

Applicable scenarios

  • Large files
  • Streaming data access
  • Write once, read many
  • Low-cost deployment on cheap PCs
  • High fault tolerance

Not applicable scenarios

  • Small files
  • Interactive data access
  • Frequent, arbitrary modifications
  • Low-latency processing

2.4 Important features of HDFS

1. Master-slave architecture

HDFS adopts master/slave architecture. Generally, an HDFS cluster consists of a NameNode and a certain number of DataNodes.

NameNode is the HDFS master node and DataNode is the HDFS slave node. The two roles each perform their own duties and coordinate to provide the distributed file storage service.

2. Block storage mechanism

Files in HDFS are physically stored in blocks. The block size is determined by the configuration parameter dfs.blocksize in hdfs-default.xml; the default is 128 MB (134217728 bytes).

3. Replication mechanism

All blocks of a file are replicated. Each file's block size (dfs.blocksize) and replication factor (dfs.replication) are configurable. The replication factor can be specified when the file is created and can be changed later by command.

The default value of dfs.replication is 3, which means the original block plus two additional copies, 3 replicas in total.
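As a minimal sketch, a deployment could override these two parameters in hdfs-site.xml; the values shown below simply restate the defaults described above:

```xml
<!-- hdfs-site.xml: sketch of the two parameters discussed above -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>            <!-- replication factor: 3 copies in total -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>    <!-- block size: 128 MB -->
  </property>
</configuration>
```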

4. Namespace

HDFS supports traditional hierarchical file organization structure . Users can create directories and then save files in these directories. The file system namespace hierarchy is similar to that of most existing file systems: users can create, delete, move, or rename files.

The NameNode is responsible for maintaining the file system namespace; any modification to the namespace or its attributes is recorded by the NameNode.

HDFS presents the client with a unified abstract directory tree, and the client accesses files by path.

For example: hdfs://namenode:port/dir-a/dir-b/dir-c/file.data

5. Metadata management

In HDFS, there are two types of metadata managed by NameNode:

  • The file's own attribute information

File name, permissions, modification time, file size, replication factor, data block size

  • File block location mapping information

Records the mapping between blocks and DataNodes, i.e., which blocks are located on which nodes.

6. Data block storage

The concrete storage and management of each block of a file is handled by the DataNodes. Each block can be stored on multiple DataNodes.

7. HDFS block size

Files in HDFS are physically stored in blocks (Block). The block size can be set through the configuration parameter dfs.blocksize. The default size is 128 MB in Hadoop 2.x/3.x and was 64 MB in Hadoop 1.x.

For example:

If the seek time is 10 ms and, ideally, the seek time should be about 1% of the transfer time, then the transfer time = 10 ms / 0.01 = 1 s. If the disk transfer rate is 100 MB/s, the block size = 1 s × 100 MB/s = 100 MB, so setting the block size to 128 MB is appropriate.

  • If the HDFS block size is set too small, seek time increases (seek time exceeds transfer time) and the program spends its time looking for the starting positions of blocks.
  • If the HDFS block size is set too large, the time to transfer the data from disk will be significantly longer than the time needed to locate the start of the block (transfer time far exceeds seek time), which makes processing the data very slow.
  • The optimal state is when the seek time is about 1% of the transfer time.
  • The HDFS block size setting therefore depends mainly on the disk transfer rate.
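The rule of thumb above can be written out as a small calculation; all the numbers are the example's assumptions (10 ms seek, 1% ratio, 100 MB/s disk):

```shell
# Block-size estimate from the "seek time = 1% of transfer time" rule above.
seek_ms=10                                         # assumed average seek time
transfer_ms=$(( seek_ms * 100 ))                   # 1% rule -> transfer = 1000 ms
rate_mb_per_s=100                                  # assumed disk transfer rate
block_mb=$(( transfer_ms / 1000 * rate_mb_per_s )) # 1 s * 100 MB/s = 100 MB
echo "estimated block size: ${block_mb} MB (rounded up to 128 MB in practice)"
```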

3 HDFS shell CLI

Hadoop provides a shell command line client for the file system. The usage is as follows:

hdfs [options] subcommand [subcommand options] 
	
subcommand: admin commands / client command / daemon commands

The command related to file system reading and writing is hdfs dfs [generic options]

  • The HDFS Shell CLI supports operating on multiple file systems, including the local file system (file:///) and the distributed file system (hdfs://nn:8020)
  • Which file system is operated on depends on the scheme prefix in the URL
  • If no prefix is specified, the fs.defaultFS property is read from the configuration and its value is used as the default file system
hdfs dfs -ls file:///      # operate on the local file system (the client machine)
hdfs dfs -ls hdfs://node1:8020/    # operate on the HDFS distributed file system
hdfs dfs -ls /     # bare path with no scheme: the fs.defaultFS default value is loaded
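The default file system itself is configured through the fs.defaultFS property in core-site.xml; a minimal sketch, with node1:8020 taken from the example above:

```xml
<!-- core-site.xml: fs.defaultFS is the file system used when no scheme
     is given; node1:8020 is the example cluster's NameNode address -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node1:8020</value>
  </property>
</configuration>
```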

3.1 HDFS Shell CLI Client

The difference between hadoop dfs, hdfs dfs and hadoop fs

  • hadoop dfs can only operate on the HDFS file system (plus the local FS via file:/// paths), but it has been deprecated
  • hdfs dfs can only operate on the HDFS file system (plus the local FS via file:/// paths); commonly used
  • hadoop fs can operate on any file system (not just HDFS, so it has the widest range of application)

The current version officially recommends using hadoop fs

3.2 HDFS Shell common commands

-mkdir creates directory

hadoop fs -mkdir [-p] <path>
# path: the directory to create
# -p: create parent directories along the path as needed

-ls View the contents of the specified directory

hadoop fs -ls [-h] [-R] [<path> ... ]
# path: the directory to list
# -h: display file sizes in human-readable form
# -R: recursively list the directory and its subdirectories

-put upload files to the specified directory

hadoop fs -put [-f] [-p] <localsrc>... <dst>
# -f: overwrite the destination file if it already exists
# -p: preserve access and modification times, ownership and permissions
# localsrc: the local file system (the client machine)
# dst: the destination file system (HDFS)

-copyFromLocal copies files from the local file system to HDFS

(Equivalent to -put; put is customarily used)

hadoop fs -copyFromLocal <localsrc>... <dst>

-moveFromLocal moves local file system files to HDFS

(i.e. cut files from local to HDFS)

hadoop fs -moveFromLocal <localsrc>... <dst>
# similar to -put, except the source files are deleted once the upload completes


-cat / -head / -tail view HDFS file contents

hadoop fs -cat <src> ...
# be cautious when reading large files this way
hadoop fs -head <file>
# shows the first 1 kB of the file
hadoop fs -tail [-f] <file>
# shows the last 1 kB of the file
# -f: keep the output open and display content as it is appended to the file

-get / -copyToLocal / -getmerge download HDFS files

(Copy files from HDFS to the local file system)

hadoop fs -get [-f] [-p] <src>... <localdst>
# download files to the specified local directory; localdst must be a directory
# -f: overwrite the destination file if it already exists locally
# -p: preserve access and modification times, ownership and permissions

hadoop fs -copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>
# equivalent to -get; get is customarily used
hadoop fs -getmerge [-nl] [-skip-empty-file] <src> <localdst>
# download multiple files and merge them into a single local file
# -nl: add a newline at the end of each file

-cp copies HDFS files

hadoop fs -cp [-f] <src> ... <dst>
# -f: overwrite the destination file if it already exists

-appendToFile appends data to HDFS files

hadoop fs -appendToFile <localsrc>... <dst>
# append the contents of the given local files to the dst file
# if the dst file does not exist, it will be created
# if <localsrc> is -, the input is read from stdin

-df View HDFS disk space

hadoop fs -df [-h] [<path>...]
# display the capacity, free space, and used space of the file system

-du View the amount of space used by HDFS files

hadoop fs -du [-s] [-h] <path>...
# -s: show an aggregate summary for the given path instead of individual files
# -h: display sizes in human-readable form

-mv move HDFS data

hadoop fs -mv <src> ... <dst>
# move files into the specified directory
# can be used to move data or to rename files

-rm -r delete files/folders

hadoop fs -rm -r <path>
# -r: recursive

-setrep changes the number of HDFS file copies

hadoop fs -setrep [-R] [-w] <rep> <path>...
# change the replication factor of the specified files
# -R: recursive, apply to the directory and everything under it
# -w: wait for the replication to complete before returning

Note:

The replication factor set here is only recorded in the NameNode's metadata. Whether that many replicas actually exist depends on the number of DataNodes: if the cluster currently has only 3 DataNodes, there can be at most 3 replicas; only when the number of nodes grows to 10 can the replica count reach 10.

-chgrp / -chmod / -chown modify file permissions

-chgrp, -chmod, -chown: same usage as in Linux file systems

hadoop fs  -chmod 666 <path>
hadoop fs  -chown  nhk:nhk <path>


Origin blog.csdn.net/weixin_56058578/article/details/132260470