Big data 02-HDFS usage and basic commands

Hadoop Distributed File System

Introduction to HDFS

  HDFS (Hadoop Distributed File System) is a highly reliable storage system in the field of big data. It stores very large data files in a distributed manner, but it is not suitable for storing a large number of small files. HDFS is also the data storage layer underlying Hadoop's other components. It runs on clusters built from cheap commodity machines, and cheap machines fail relatively often, so HDFS adopts various mechanisms in its design to guarantee data integrity in the face of hardware failure.

  Overall, HDFS should achieve the following goals:

  • Compatible with cheap hardware devices: ensures data integrity even in the event of hardware failure
  • Streaming data reads and writes: does not support random read and write operations
  • Large data sets: the amount of data is typically at the GB or TB level
  • Simple file model: write once, read many
  • Strong cross-platform compatibility: implemented in Java

  However, HDFS also has the following limitations:

  • Not suitable for low-latency data access: HDFS is designed mainly for large-scale batch processing of data. It uses streaming data reads and achieves a high data throughput rate, but this also means high latency. HDFS is therefore not suitable for applications that require low latency (on the order of tens of milliseconds); for such applications, HBase is a better choice;
  • Unable to efficiently store a large number of small files: a small file is a file smaller than one block. HDFS cannot efficiently store and process a large number of small files, and too many of them cause problems for system scalability and performance:
    1. HDFS uses the name node (NameNode) to manage the metadata of the file system. This metadata is kept in memory so that clients can quickly obtain the actual storage location of a file. Each file, directory, and block occupies about 150 bytes, so with 10 million files, each corresponding to one block, the name node needs at least 3 GB of memory just for the metadata (see the rough estimate after this list). At that scale metadata retrieval is already inefficient, and finding the actual storage location of a file takes more time. If the system grew to billions of files, the memory needed by the name node to hold the metadata would increase far beyond what current hardware can keep in memory;
    2. When MapReduce processes a large number of small files, too many Map tasks are generated and thread-management overhead increases greatly, so processing many small files is far slower than processing large files of the same total size;
    3. Accessing a large number of small files is much slower than accessing large files, because the reader must keep jumping from one data node to another, which seriously hurts performance.
  • Does not support multiple writers or arbitrary file modification: HDFS allows only one writer per file at a time, does not allow multiple users to write to the same file, and supports only append operations on files, not random writes.
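  A rough estimate behind the 3 GB figure above (assuming the commonly cited ~150 bytes per namespace object):

10,000,000 files × 2 objects each (1 file entry + 1 block) × 150 bytes ≈ 3,000,000,000 bytes ≈ 3 GB of name node memory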

Architecture of HDFS

  HDFS adopts the master-slave (Master/Slave) structure model. An HDFS cluster includes a name node (NameNode) and several data nodes (DataNode) .

  • As the central server, the name node is responsible for managing the namespace of the file system and the client's access to files.
  • The data node is responsible for processing the read/write request of the file system client, and performs operations such as creation, deletion and replication of data blocks under the unified scheduling of the name node.
  • Each data node periodically sends **"heartbeat" messages** to the name node to report its status. A data node that fails to send heartbeats on time is marked as down and is no longer assigned any I/O requests.

  When using HDFS, users can still use file names to store and access files as in ordinary file systems.

  In fact, within the system a file is split into several data blocks, and these blocks are distributed across several data nodes. When a client needs to access a file, it first sends the file name to the name node. The name node looks up the corresponding data blocks for that file name (a file may consist of multiple blocks), determines from the block information which data nodes actually store each block, and sends those data node locations back to the client. The client then reads the data directly from those data nodes. Throughout the entire access process, the name node does not participate in the transfer of data. This design allows the blocks of a file to be accessed concurrently on different data nodes, greatly improving data access speed.
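  To see this file-to-block-to-data-node mapping for yourself, you can inspect a file with the fsck tool (a quick check, assuming a running cluster and an already-uploaded file such as the README.txt used later in this article):

hdfs fsck /README.txt -files -blocks -locations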

HDFS usage and basic commands

1. Start the HDFS-related processes of Hadoop

Switch to the root user:

sudo passwd root
su

Switch to the /opt/hadoop/sbin path:

cd /opt/hadoop/sbin/

Open each file with vim and add the content:

Press i to enter insert mode and add the content;
press the Esc key to exit editing;
type :wq and press Enter to save and exit.

vim /opt/hadoop/sbin/start-dfs.sh

vim /opt/hadoop/sbin/stop-dfs.sh

vim /opt/hadoop/sbin/start-yarn.sh

vim /opt/hadoop/sbin/stop-yarn.sh

Add the following parameters to the top of the two files start-dfs.sh and stop-dfs.sh:

#!/usr/bin/env bash
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root

Similarly, add the following to the top of start-yarn.sh and stop-yarn.sh:

#!/usr/bin/env bash
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root

  Start Hadoop's HDFS service by executing the following command as the root user:

./start-dfs.sh
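  If you also want the YARN processes (the start-yarn.sh and stop-yarn.sh scripts were edited above for this purpose), they can be started the same way; this is not required for the HDFS examples below:

./start-yarn.sh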

Start the SSH service (the Hadoop start/stop scripts use SSH to log in and launch the daemons):
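  A minimal sketch, assuming an Ubuntu/Debian-style environment where sshd is installed but not yet running:

service ssh start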

When the HDFS daemons start, the following warning message may appear:

WARNING: HADOOP_SECURE_DN_USER has been replaced by HDFS_DATANODE_SECURE_USER. Using value of HADOOP_SECURE_DN_USER.

Solution:

vim /opt/hadoop/sbin/start-dfs.sh

vim /opt/hadoop/sbin/stop-dfs.sh
# Change the following lines in the start-dfs.sh and stop-dfs.sh files:

HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs 
HDFS_NAMENODE_USER=root 
HDFS_SECONDARYNAMENODE_USER=root


# to the following lines:

HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs 
HDFS_NAMENODE_USER=root 
HDFS_SECONDARYNAMENODE_USER=root

2. View the HDFS processes

Load the environment variables (including the Java environment):

source /etc/profile

Enter the jps command to view all Java processes:

jps
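  If HDFS started successfully, the list should include at least the NameNode, DataNode, and SecondaryNameNode processes (plus Jps itself); the process IDs shown before the names will differ from machine to machine.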


3. Verify the running status of HDFS

  Create a directory on hdfs by executing the following command, and verify that it is created successfully:

hadoop fs -mkdir /myhadoop1

  If the creation is successful, execute the following command to query the root directory of the hdfs file system, and you will see the /myhadoop1 directory:

hadoop fs -ls /


4. ls command

  To list the directories and files under the root directory of the hdfs file system, execute the command as follows:

hadoop fs -ls /

  To list all directories and files in the hdfs file system, execute the command as follows:

hadoop fs -ls -R /

5. put command

1) Copy a file
  Upload a local file to hdfs. The command format is as follows:

hadoop fs -put <local file> <hdfs file>

  The parent directory of <hdfs file> must already exist, otherwise the command fails. For example, to upload the file /opt/hadoop/README.txt to the root directory of the hdfs file system, the command is as follows:

hadoop fs -put /opt/hadoop/README.txt /

2) Copy a directory
  Upload a local folder to an hdfs folder. The command format is as follows:

hadoop fs -put <local dir> <hdfs dir>

  The parent directory of <hdfs dir> must already exist, otherwise the command fails. For example, to upload the /opt/hadoop/logs folder to the root directory of the hdfs file system, the command is as follows:

hadoop fs -put /opt/hadoop/logs / 

3) Check whether the copy succeeded
  To check whether the file or directory was uploaded successfully, execute the following command:

hadoop fs -ls <hdfs file/hdfs dir>

  For example, to check whether the README.txt file and logs directory just uploaded exist in the hdfs root directory, the command is as follows:

hadoop fs -ls /

6. moveFromLocal command

1) Move files or directories
  Upload local files/folders to hdfs; note that the local files/folders are deleted afterwards. The command format is as follows:

hadoop fs -moveFromLocal <local src> <hdfs dst>

  For example, execute the following command to upload local files/folders to hdfs:

hadoop fs -moveFromLocal /opt/hadoop/NOTICE.txt /myhadoop1
hadoop fs -moveFromLocal /opt/hadoop/logs /myhadoop1

2) Check whether the move succeeded
  To check whether the file or directory was uploaded successfully, execute the following command:

hadoop fs -ls <hdfs file/hdfs dir>

  For example, to check whether the NOTICE.txt file and logs directory just uploaded exist in the /myhadoop1 directory of the hdfs file system, the command is as follows:

hadoop fs -ls /myhadoop1

7. get command

1) Copy files or directories to the local machine
  Download files/folders from the hdfs file system to the local machine. The command format is as follows:

hadoop fs -get <hdfs file or dir> <local file or dir>

  For example, to download NOTICE.txt and logs under the /myhadoop1 directory of the hdfs file system to the local /opt/hadoop directory, execute the commands as follows:

hadoop fs -get /myhadoop1/NOTICE.txt /opt/hadoop/
hadoop fs -get /myhadoop1/logs /opt/hadoop/

Notice:

  1. When copying multiple files or directories to the local machine, the local path should be a folder
  2. The local file cannot have the same name as the hdfs file, otherwise the command reports that the file already exists (see the workaround sketched after this list)
  3. If the user is not the root user, the local path should be under that user's home folder, otherwise there will be permission problems
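  For note 2, one workaround is to save the download under a different local name (the target file name here is hypothetical):

hadoop fs -get /myhadoop1/NOTICE.txt /opt/hadoop/NOTICE-copy.txt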

2) Check whether the copy to the local machine succeeded
  To check whether the copied NOTICE.txt file and logs directory are present in the local /opt/hadoop directory, execute the following commands:

cd /opt/hadoop
ls -l

8. rm command

1) Delete one or more files
  In the hdfs file system, delete one or more files, the command format is as follows:

hadoop fs -rm <hdfs file> ...

  For example, to delete the README.txt file in the root directory of the hdfs file system, the command is as follows:

hadoop fs -rm /README.txt

2) Delete one or more directories
  In the hdfs file system, delete one or more directories, the command format is as follows:

hadoop fs -rm -r <hdfs dir> ...

  For example, to delete the logs directory under the root directory of the hdfs file system, the command is as follows:

hadoop fs -rm -r /logs
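  If the HDFS trash feature is enabled on your cluster, rm only moves the path into the current user's trash directory; the -skipTrash option deletes it immediately, for example:

hadoop fs -rm -r -skipTrash /logs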

3) Check whether the deletion succeeded
  To check whether the README.txt file and logs directory just deleted still exist in the hdfs root directory, the command is as follows:

hadoop fs -ls /

  If the deletion was successful, you will no longer see /logs and /README.txt.

9. mkdir command

1) Create a new directory
  Use the following command to create a directory in the hdfs file system. This command creates directories only level by level; if the parent directory does not exist, an error is reported:

hadoop fs -mkdir <hdfs path>

  For example, to create a test directory under the /myhadoop1 directory of the hdfs file system, the command is as follows:

hadoop fs -mkdir /myhadoop1/test

2) Create a new directory (-p option)
  Use the following command to create a directory in the hdfs file system; if the parent directory does not exist, it is created as well:

hadoop fs -mkdir -p <hdfs dir> ...

  For example, to create the directory /myhadoop2/test in the hdfs file system, where the parent /myhadoop2 does not exist yet, the command is as follows:

hadoop fs -mkdir -p /myhadoop2/test

3) Query the directories
  To check whether the newly created /myhadoop1/test and /myhadoop2/test directories exist, the commands are as follows:

hadoop fs -ls /
hadoop fs -ls /myhadoop1
hadoop fs -ls /myhadoop2

10. cp command

  Use the following command to copy a file or directory within the hdfs file system. If the target is an existing directory, the source is copied into it; if the target does not exist, the copy is saved under the target name, which amounts to copying with a rename. The source file still exists afterwards:

hadoop fs -cp <hdfs file or dir>... <hdfs dir>

  Follow the steps below to use the cp command to copy /LICENSE.txt to the /myhadoop1 directory:
1) Copy a local file to the root directory of HDFS
  Upload the LICENSE.txt file in the local /opt/hadoop directory to the root directory of the hdfs file system. The command is as follows:

hadoop fs -put /opt/hadoop/LICENSE.txt /

  To check whether LICENSE.txt exists in the root directory of the hdfs file system, the command is as follows:

hadoop fs -ls /

2) Copy this file to the /myhadoop1 directory
  Use the cp command to copy the LICENSE.txt file in the root directory of the hdfs file system to the /myhadoop1 directory. The command is as follows:

hadoop fs -cp /LICENSE.txt /myhadoop1

3) Check the /myhadoop1 directory
  Use the following command to check whether the LICENSE.txt file is present in the /myhadoop1 directory of the hdfs file system:

hadoop fs -ls /myhadoop1

11. mv command

  Use the following command to move a file or directory within the hdfs file system. If the target does not exist, the source is saved under the target name, which amounts to a rename, and the source no longer exists afterwards. When there are multiple source paths, the target path must be a directory and must exist:

hadoop fs -mv <hdfs file or dir>... <hdfs dir>

**Note:** Movement across file systems (local to hdfs or vice versa) is not allowed.

  Follow the steps below to use the mv command to move /myhadoop1/LICENSE.txt to the /myhadoop2 directory:
1) Move an HDFS file
  Use the mv command to move the LICENSE.txt file under the /myhadoop1 directory of the hdfs file system to the /myhadoop2 directory. The command is as follows:

hadoop fs -mv /myhadoop1/LICENSE.txt /myhadoop2

2) Query the /myhadoop2 directory
  Use the following command to check whether the LICENSE.txt file is present in the /myhadoop2 directory of the hdfs file system:

hadoop fs -ls /myhadoop2


12. count command

  Use the following command to count the number of directories, the number of files, and the total file size under a given hdfs path:

hadoop fs -count <hdfs path>

  For example, to view the number of directories, the number of files, and the total file size under the /myhadoop1/logs directory, the command is as follows:

hadoop fs -count /myhadoop1/logs
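  The output columns are, in order: the directory count, the file count, the total content size in bytes, and the path name.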

13. du command

  • Display the size of each folder and file under the given hdfs path:

hadoop fs -du <hdfs path>

  • Display the sum of all file sizes under the given hdfs path:

hadoop fs -du -s <hdfs path>

  • Display the size of each folder and file under the given hdfs path in an easy-to-read form, for example 64M instead of 67108864:

hadoop fs -du -h <hdfs path>

  For example, execute the following commands to view the size of each folder and file under the /myhadoop2 directory of the hdfs file system, and the total size of all files:

hadoop fs -du /myhadoop2
hadoop fs -du -s /myhadoop2
hadoop fs -du -h /myhadoop2
hadoop fs -du -s -h /myhadoop2

  Execution result description:

  • The first column: the file size in this directory
  • The second column: the total space the file occupies on the cluster across all replicas. This is related to the replication factor; with the default of 3 replicas, the second column is three times the first (second column = file size × number of replicas)
  • The third column: the path that was queried

14. setrep command

  Use the following command to change the replication factor of files in the hdfs file system; the number 3 is the replication factor being set. The -R option recursively applies the change to all directories and files under a directory:

hadoop fs -setrep -R 3 <hdfs path>

  For example, to recursively set all directories and files under the /myhadoop1 directory of the hdfs file system to 3 replicas, the command is as follows:

hadoop fs -setrep -R 3 /myhadoop1
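  To confirm that the new replication factor took effect, you can query a file's replica count with the stat command introduced below (NOTICE.txt was moved into /myhadoop1 earlier in this article):

hadoop fs -stat %r /myhadoop1/NOTICE.txt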

15. stat command
  Use the following command to view the status information of the corresponding path:

hadoop fs -stat [format] <hdfs path>

  Among them, the optional [format] parameters are:

  • %b: file size
  • %o: block size
  • %n: file name
  • %r: replication factor
  • %y: last modified date and time

  For example, to view the size of the /myhadoop2/LICENSE.txt file in the hdfs file system, the command is as follows:

hadoop fs -stat %b /myhadoop2/LICENSE.txt

16. balancer command

  This command is mainly used when the administrator finds that some DataNodes store too much data while others store relatively little; in that case, the internal balancing process can be started manually with either of the following commands:

hadoop balancer
hdfs balancer
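  Optionally, you can pass a utilization threshold in percent; the balancer then moves blocks until each DataNode's usage is within that many percentage points of the cluster average, for example:

hdfs balancer -threshold 5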

17. dfsadmin command

  Administrators mainly use the dfsadmin command to manage HDFS:

1) Use the -help parameter to view the related help:

hdfs dfsadmin -help

2) Use the -report parameter to view the basic statistics of the file system:

hdfs dfsadmin -report


3) Use the -safemode parameter to operate safe mode:

hdfs dfsadmin -safemode <enter | leave | get | wait>

where:

  • enter: enter safe mode
  • leave: leave safe mode
  • get: Check whether safe mode is enabled
  • wait: wait to leave safe mode

  For example, to enter safe mode, execute the command as follows:

hdfs dfsadmin -safemode enter
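  While in safe mode the namespace is read-only, so remember to leave it again once you are done:

hdfs dfsadmin -safemode leave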


18. cat command

  Use the cat command to view the contents of a text file in the hdfs file system; for example, to view the contents of the demo.txt file in the root directory, and to keep following it as content is appended with tail -f:

 hadoop fs -cat /demo.txt
 hadoop fs -tail -f /demo.txt


  After running the hadoop fs -tail -f command, the terminal keeps tracking the file by its file descriptor; when the file is renamed or deleted, the tracking stops. Terminal operations while tracking:

  • To pause the refreshing, use Ctrl+S (S stands for sleep)
  • To resume refreshing the terminal, use Ctrl+Q (Q stands for quiet)
  • To exit the tail command, use Ctrl+C directly, or use Ctrl+Z
    Ctrl+C and Ctrl+Z are both interrupt keys, but their functions differ:
    1. Ctrl+C is the more forceful one: it sends an interrupt signal to the current program. For example, if a search over files is running, Ctrl+C forcibly ends the process
    2. Ctrl+Z suspends the current program, pausing its execution. For example, when working in the mysql terminal you may need to jump out to do some file operations without exiting mysql (re-entering would require typing the user name and password again, which is troublesome); you can press Ctrl+Z to suspend mysql, do the other work, then type fg and press Enter to return to the mysql terminal. You can also suspend several processes to the background and run fg <job number> to bring a particular suspended one back to the current terminal. Using bg and fg together makes it easy to switch between foreground and background

19. appendToFile command

  Append the content of the local file to the text file in the hdfs file system, the command format is as follows:

hadoop fs -appendToFile <local file> <hdfs file>
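  For example, to append the content of a local file to a file in hdfs (both paths here are illustrative; the target must be a plain text file, and it is created if it does not exist):

hadoop fs -appendToFile /opt/hadoop/NOTICE.txt /demo.txt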

20. chown and chmod commands

  Use the chown command to change the owner of a file in the hdfs file system, and the chmod command to change its read, write, and execute permissions. Example commands are as follows:
hadoop fs -chown user:group /datawhale
hadoop fs -chmod 777 /datawhale

Among them, the parameters are described as follows:

  • chown: defines who owns the file
  • chmod: defines what can be done with the file

Learning reference

https://github.com/datawhalechina/juicy-bigdata


Origin: https://blog.csdn.net/weixin_45735391/article/details/128982173