Hadoop Distributed File System
Introduction to HDFS
HDFS (Hadoop Distributed File System) is a highly reliable storage system in the big data field. It stores very large files in a distributed manner, but it is not suitable for storing a large number of small files. HDFS is also the data storage layer for Hadoop and other components. It runs on clusters built from cheap commodity machines, whose probability of failure is relatively high, so HDFS adopts various mechanisms in its design to guarantee data integrity in the event of hardware failure.
Overall, HDFS should achieve the following goals:
- Compatible with cheap hardware devices: ensures data integrity even in the event of hardware failure
- Streaming data reads and writes: random reads and writes are not supported
- Large data sets: the amount of data is typically at the GB or TB level
- Simple file model: write once, read many
- Strong cross-platform compatibility: implemented in Java
However, HDFS also has the following limitations:
- Not suitable for low-latency data access: HDFS is designed mainly for batch processing of large-scale data. It uses streaming data reads and achieves a high data throughput rate, but this also means high latency. HDFS is therefore not suitable for applications that require low latency (for example, tens of milliseconds); for such applications, HBase is a better choice;
- Unable to efficiently store a large number of small files : A small file refers to a file whose size is smaller than one block. HDFS cannot efficiently store and process a large number of small files. Too many small files will bring many problems to system scalability and performance:
- HDFS uses the name node (NameNode) to manage the metadata of the file system. This metadata is kept in memory so that clients can quickly obtain the actual storage locations of files. Each file, directory, and block typically occupies about 150 bytes. With 10 million files, each corresponding to one block, there are roughly 20 million metadata objects, so the name node needs at least 3 GB of memory just to hold the metadata (see the quick calculation after this list). At that scale, metadata retrieval is already relatively inefficient, and finding a file's actual storage location takes more time. If the system grows to billions of files, the memory the name node needs for metadata increases enormously, and with current hardware it is impossible to keep that much metadata in memory;
- When using MapReduce to process a large number of small files, too many Map tasks will be generated, and the thread management overhead will be greatly increased, so the speed of processing a large number of small files is far lower than the speed of processing large files of the same size;
- The speed of accessing a large number of small files is much lower than the speed of accessing large files, because accessing a large number of small files requires continuous jumping from one data node to another, seriously affecting performance.
- Does not support multi-user writing and arbitrarily modifying files : HDFS only allows one writer for a file, does not allow multiple users to write to the same file, and only allows append operations to files, not random write operations.
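A quick back-of-envelope check of the name node memory estimate above (a sketch only, assuming roughly 150 bytes per metadata object and one block per file):
files=10000000                      # 10 million files
bytes=$((files * 2 * 150))          # one file object + one block object per file, ~150 bytes each
echo "$bytes bytes = about $((bytes / 1000 / 1000 / 1000)) GB"   # prints: 3000000000 bytes = about 3 GB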
Architecture of HDFS
HDFS adopts the master-slave (Master/Slave) structure model. An HDFS cluster includes a name node (NameNode) and several data nodes (DataNode) .
- As the central server, the name node is responsible for managing the namespace of the file system and the client's access to files.
- The data node is responsible for processing the read/write request of the file system client, and performs operations such as creation, deletion and replication of data blocks under the unified scheduling of the name node.
- Each data node periodically sends **"heartbeat" information** to the name node to report its status. A data node that fails to send its heartbeat on time is marked as down and is no longer assigned any I/O requests.
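The heartbeat interval is governed by the dfs.heartbeat.interval property (3 seconds by default). As a small sketch, assuming the Hadoop environment is already configured, the effective value can be queried with:
hdfs getconf -confKey dfs.heartbeat.interval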
When using HDFS, users can still use file names to store and access files as in ordinary file systems.
In fact, within the system a file is divided into several data blocks, and these blocks are stored across several data nodes. When a client needs to access a file, it first sends the file name to the name node. The name node looks up the data blocks corresponding to that file name (a file may consist of multiple blocks), determines from each block's information which data nodes actually store it, and sends those data node locations back to the client. Finally, the client accesses the data nodes directly to fetch the data. Throughout the access, the name node does not participate in the data transfer. This design allows the data of a single file to be accessed concurrently on different data nodes, greatly improving data access speed.
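To observe this block layout directly, the fsck tool reports how a file is split into blocks and which data nodes hold each replica. A minimal sketch, with /path/to/file standing in for any file already stored in HDFS:
hdfs fsck /path/to/file -files -blocks -locations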
HDFS usage and basic commands
1. Start the HDFS-related processes of Hadoop
First, set a root password (if needed) and switch to the root user:
sudo passwd root
su
Switch to the /opt/hadoop/sbin directory:
cd /opt/hadoop/sbin/
Open each file with vim and edit it as follows:
- press i to enter insert mode and add the content (the snippets below belong at the top of each file);
- press the Esc key to stop editing;
- type :wq and press Enter to save and exit.
vim /opt/hadoop/sbin/start-dfs.sh
vim /opt/hadoop/sbin/stop-dfs.sh
vim /opt/hadoop/sbin/start-yarn.sh
vim /opt/hadoop/sbin/stop-yarn.sh
Add the following lines at the top of both start-dfs.sh and stop-dfs.sh:
#!/usr/bin/env bash
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
Similarly, add the following at the top of start-yarn.sh and stop-yarn.sh:
#!/usr/bin/env bash
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
Start Hadoop's HDFS service by executing the following command as the root user:
./start-dfs.sh
Make sure the ssh service is started first, since the script launches the daemons over ssh.
When the HDFS daemons start, the following warning message may appear:
WARNING: HADOOP_SECURE_DN_USER has been replaced by HDFS_DATANODE_SECURE_USER. Using value of HADOOP_SECURE_DN_USER.
Solution:
vim /opt/hadoop/sbin/start-dfs.sh
vim /opt/hadoop/sbin/stop-dfs.sh
# In start-dfs.sh and stop-dfs.sh, change the following lines:
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
# to the following:
HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
2. View the HDFS processes
Load the Java environment:
source /etc/profile
Enter the jps command to view all Java processes:
jps
3. Verify the running status of HDFS
Create a directory on hdfs and execute the following command to verify that it can be created successfully:
hadoop fs -mkdir /myhadoop1
If the creation succeeds, execute the following command to list the root directory of the hdfs file system; you should see the /myhadoop1 directory:
hadoop fs -ls /
4. ls command
To list the directories and files under the root directory of the hdfs file system, execute the command as follows:
hadoop fs -ls /
To list all directories and files in the hdfs file system, execute the command as follows:
hadoop fs -ls -R /
The output lists, for each entry, its permissions, replication factor (shown as "-" for directories), owner, group, size, modification time, and path.
5. put command
1) Copy files
Upload a local file to hdfs. The command format is as follows:
hadoop fs -put <local file> <hdfs file>
The parent directory of <hdfs file> must exist, otherwise the command fails. For example, to upload the file README.txt under /opt/hadoop to the root directory of the hdfs file system:
hadoop fs -put /opt/hadoop/README.txt /
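Note that put fails if the target file already exists. Recent Hadoop versions provide a -f option to overwrite the target; a sketch reusing the file above:
hadoop fs -put -f /opt/hadoop/README.txt /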
2) Copy a directory
Upload a local folder to an hdfs folder. The command format is as follows:
hadoop fs -put <local dir> <hdfs dir>
The parent directory of <hdfs dir> must exist, otherwise the command fails. For example, to upload the logs folder under /opt/hadoop/ to the root directory of the hdfs file system:
hadoop fs -put /opt/hadoop/logs /
3) Check whether the copy succeeded
To check whether the uploaded file or directory exists, execute the following command:
hadoop fs -ls <hdfs file/hdfs dir>
For example, to check whether the README.txt file and logs directory just uploaded exist in the hdfs root directory:
hadoop fs -ls /
6. moveFromLocal command
1) Move files or directories
Upload local files/folders to hdfs; the local files/folders are deleted afterwards (a move rather than a copy). The command format is as follows:
hadoop fs -moveFromLocal <local src> <hdfs dst>
For example, execute the following command to upload local files/folders to hdfs:
hadoop fs -moveFromLocal /opt/hadoop/NOTICE.txt /myhadoop1
hadoop fs -moveFromLocal /opt/hadoop/logs /myhadoop1
2) Check whether the move succeeded
To check whether the uploaded file or directory exists, execute the following command:
hadoop fs -ls <hdfs file/hdfs dir>
For example, to check whether the NOTICE.txt file and logs directory just uploaded exist in the /myhadoop1 directory of the hdfs file system:
hadoop fs -ls /myhadoop1
7. get command
1) Copy files or directories to the local machine
Download files/folders from the hdfs file system to the local machine. The command format is as follows:
hadoop fs -get <hdfs file or dir> <local file or dir>
For example, to download NOTICE.txt and logs under the /myhadoop1 directory of the hdfs file system to the local /opt/hadoop directory:
hadoop fs -get /myhadoop1/NOTICE.txt /opt/hadoop/
hadoop fs -get /myhadoop1/logs /opt/hadoop/
Notice:
- When copying multiple files or directories to the local machine, the local path should be a folder
- The local file cannot have the same name as the hdfs file, otherwise it will report that the file already exists
- If the user is not the root user, the local path should be under that user's home folder, otherwise there will be permission problems
2) Check whether the copy to the local machine succeeded
To check whether the copied NOTICE.txt file and logs directory are present in the local /opt/hadoop directory, execute the following commands:
cd /opt/hadoop
ls -l
8. rm command
1) Delete one or more files
Delete one or more files in the hdfs file system. The command format is as follows:
hadoop fs -rm <hdfs file> ...
For example, to delete the README.txt file in the root directory of the hdfs file system:
hadoop fs -rm /README.txt
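If the trash feature is enabled (fs.trash.interval set to a value greater than 0 in core-site.xml), rm moves files into the current user's .Trash directory instead of deleting them immediately; the -skipTrash option bypasses the trash and deletes permanently. A sketch:
hadoop fs -rm -skipTrash /README.txt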
2) Delete one or more directories
Delete one or more directories in the hdfs file system. The command format is as follows:
hadoop fs -rm -r <hdfs dir> ...
For example, to delete the logs directory under the root directory of the hdfs file system:
hadoop fs -rm -r /logs
3) Check whether the deletion succeeded
To check whether the README.txt file and logs directory just deleted still exist in the hdfs root directory, the command is as follows:
hadoop fs -ls /
If the deletion was successful, you will no longer see /README.txt and /logs.
9. mkdir command
1) Create a new directory
Use the following command to create a directory in the hdfs file system. This command can only create directories level by level. If the parent directory does not exist, an error will be reported:
hadoop fs -mkdir <hdfs path>
For example, to create a test directory under the /myhadoop1 directory of the hdfs file system:
hadoop fs -mkdir /myhadoop1/test
2) Create a new directory (-p option)
Use the following command to create a directory in the hdfs file system; if the parent directory does not exist, it is created as well:
hadoop fs -mkdir -p <hdfs dir> ...
For example, to create the /myhadoop2/test directory in the hdfs file system:
hadoop fs -mkdir -p /myhadoop2/test
3) Query the directories
To check whether the newly created /myhadoop1/test and /myhadoop2/test directories exist, the commands are as follows:
hadoop fs -ls /
hadoop fs -ls /myhadoop1
hadoop fs -ls /myhadoop2
10. cp command
Use the following command to copy a file or directory on the hdfs file system. The target file must not already exist, otherwise the command fails; copying to a new name is equivalent to saving the file under a different name, and the source file still exists:
hadoop fs -cp <hdfs file or dir>... <hdfs dir>
Follow the steps below to use the cp command to copy /LICENSE.txt to the /myhadoop1 directory:
1) Copy a local file to the root directory of HDFS
Upload the LICENSE.txt file in the local /opt/hadoop directory to the root directory of the hdfs file system:
hadoop fs -put /opt/hadoop/LICENSE.txt /
To check whether LICENSE.txt now exists in the root directory of the hdfs file system:
hadoop fs -ls /
2) Copy this file to the /myhadoop1 directory
Use the cp command to copy the LICENSE.txt file from the root directory of the hdfs file system to the /myhadoop1 directory:
hadoop fs -cp /LICENSE.txt /myhadoop1
3) Check the /myhadoop1 directory
Use the following command to check whether the LICENSE.txt file is now in the /myhadoop1 directory of the hdfs file system:
hadoop fs -ls /myhadoop1
11. mv command
Use the following command to move a file or directory on the hdfs file system. The target file must not already exist, otherwise the command fails; moving to a new name is equivalent to renaming the file, and the source file no longer exists afterwards. When there are multiple source paths, the target path must be a directory and must exist:
hadoop fs -mv <hdfs file or dir>... <hdfs dir>
**Note:** Movement across file systems (local to hdfs or vice versa) is not allowed.
Follow the steps below to use the mv command to move /myhadoop1/LICENSE.txt to the /myhadoop2 directory:
1) Move an HDFS file
Use the mv command to move the LICENSE.txt file from the /myhadoop1 directory of the hdfs file system to the /myhadoop2 directory:
hadoop fs -mv /myhadoop1/LICENSE.txt /myhadoop2
2) Query the /myhadoop2 directory
Use the following command to check whether the LICENSE.txt file is now in the /myhadoop2 directory of the hdfs file system:
hadoop fs -ls /myhadoop2
12. count command
Use the following command to count the number of directories, the number of files, and the total size of files under the path corresponding to hdfs:
hadoop fs -count <hdfs path>
For example, to view the number of directories, the number of files, and the total file size under the /myhadoop1/logs directory:
hadoop fs -count /myhadoop1/logs
13. du command
- Display the size of each folder and file under the corresponding path of hdfs
hadoop fs -du <hdfs path>
- Display the sum of all file sizes under the hdfs corresponding path
hadoop fs -du -s <hdfs path>
- Display the size of each folder and file under the corresponding path of hdfs. The size of the file is expressed in an easy-to-read form, for example, use 64M instead of 67108864
hadoop fs -du -h <hdfs path>
For example, execute the following commands to view the size of each folder and file under the /myhadoop2 directory of the hdfs file system, and the total size of all files:
hadoop fs -du /myhadoop2
hadoop fs -du -s /myhadoop2
hadoop fs -du -h /myhadoop2
hadoop fs -du -s -h /myhadoop2
Execution result description:
- The first column: the total size of the files in the directory
- The second column: the total storage consumed on the cluster by all replicas of the files. This depends on the replication factor; with the default of 3 replicas, the second column is three times the first (second column = file size × number of replicas)
- The third column: the queried path
14. setrep command
Use the following command to change the number of replicas of a file in the hdfs file system, where the number 3 is the replication factor to set. The -R option recursively applies the change to all directories and files under a directory:
hadoop fs -setrep -R 3 <hdfs path>
For example, to recursively set the replication factor to 3 for all directories and files under the /myhadoop1 directory of the hdfs file system:
hadoop fs -setrep -R 3 /myhadoop1
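To verify the change, the stat command introduced in the next section can print a file's replication factor; a sketch using the LICENSE.txt file copied to /myhadoop1 earlier:
hadoop fs -stat %r /myhadoop1/LICENSE.txt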
15. stat command
Use the following command to view the status information of the corresponding path:
hadoop fs -stat [format] <hdfs path>
Among them, the optional [format] parameters are:
- %b: file size
- %o: block size
- %n: file name
- %r: number of replicas
- %y: last modified date and time
For example, to view the size of the /myhadoop2/LICENSE.txt file in the hdfs file system:
hadoop fs -stat %b /myhadoop2/LICENSE.txt
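The format string may also combine several specifiers at once; a sketch printing the name, size, replication factor, and modification time together:
hadoop fs -stat "%n %b %r %y" /myhadoop2/LICENSE.txt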
16. balancer command
This command is mainly used when the administrator finds that some DataNodes store too much data while others store relatively little; in that case, the following command manually starts the internal balancing process:
hadoop balancer
or
hdfs balancer
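The balancer also accepts a -threshold parameter (a percentage of disk capacity, 10 by default) that controls how closely the data nodes' utilization must converge; a sketch:
hdfs balancer -threshold 10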
17. dfsadmin command
Administrators mainly use the dfsadmin command to manage HDFS:
1) Use the -help parameter to view the related help:
hdfs dfsadmin -help
2) Use the -report parameter to view basic data about the file system:
hdfs dfsadmin -report
3) Use the -safemode parameter to operate on safe mode:
hdfs dfsadmin -safemode <enter | leave | get | wait>
where:
- enter: enter safe mode
- leave: leave safe mode
- get: check whether safe mode is enabled
- wait: wait until safe mode is exited
For example, to enter safe mode, execute the command as follows:
hdfs dfsadmin -safemode enter
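While the cluster is in safe mode, the namespace is read-only and write operations will fail. A sketch of checking the state and then leaving safe mode:
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode leave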
18. cat command
Use the cat command to view the contents of a text file in the hdfs file system. For example, to view the contents of the demo.txt file in the root directory:
hadoop fs -cat /demo.txt
hadoop fs -tail -f /demo.txt
After running the hadoop fs -tail -f command, the terminal tracks the file by its file descriptor; when the file is renamed or deleted, the tracking stops. In the terminal:
- To pause the refresh, use Ctrl+S (S for sleep)
- To resume refreshing the terminal, use Ctrl+Q (Q for quiet)
- To exit the tail command, use Ctrl+C, or alternatively Ctrl+Z
Ctrl+C and Ctrl+Z are both interrupt commands, but their functions are different. Ctrl+C is more forceful: it sends an interrupt signal to the current program. For example, if a search is running through files, Ctrl+C forcibly ends the current process. Ctrl+Z instead suspends the current program, pausing its execution. For example, if you are in the mysql terminal and need to jump out to perform other file operations without exiting mysql (re-entering would require typing the user name and password again, which is troublesome), you can use Ctrl+Z to suspend mysql, perform the other operations, and then type fg and press Enter to return to the mysql terminal. You can also suspend many processes to the background and run fg <job number> to bring a suspended process back to the current terminal. Using the bg and fg commands together makes it easy to switch between the foreground and background.
19. appendToFile command
Append the content of the local file to the text file in the hdfs file system, the command format is as follows:
hadoop fs -appendToFile <local file> <hdfs file>
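For example, the following sketch appends a locally created file to the /demo.txt file used in the cat section above (the local path /tmp/append.txt is just an illustrative name):
echo "appended line" > /tmp/append.txt    # create a small local file
hadoop fs -appendToFile /tmp/append.txt /demo.txt
hadoop fs -cat /demo.txt                  # the appended line should now be visible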
20. chown command
Use the chown command to change the owner and group of a file in the hdfs file system, and the chmod command to modify its read, write, and execute permissions. Example commands are as follows:
hadoop fs -chown user:group /datawhale
hadoop fs -chmod 777 /datawhale
Among them, the commands are described as follows:
- chown: defines who owns the file
- chmod: defines what can be done with the file
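To verify the changes, listing the path shows the mode, owner, and group that chmod and chown set (assuming the /datawhale path from the example above exists, and user:group is a placeholder for a real user and group):
hadoop fs -ls /datawhale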