Dark Horse Programmer - Big Data from Introduction to Practice - HDFS Distributed Storage

1. Why do we need distributed storage

① The amount of data is too large and the storage capacity of a single machine has an upper limit, so the problem has to be solved by adding more machines.
② Adding machines also brings more network bandwidth, disk I/O, CPU, and memory; combined into a distributed system, they achieve an effect of 1 + 1 > 2.

2. Distributed infrastructure analysis

① Decentralized mode: there is no designated center; the nodes coordinate the work among themselves
② Centralized mode: there is a clear central node, and work is distributed from that central node (this is the mode Hadoop uses)

3. HDFS infrastructure

  • NameNode: the master role; manages the HDFS cluster and the DataNode roles
  • DataNode: the slave role; responsible for storing the data
  • SecondaryNameNode: an auxiliary role; assists the NameNode in organizing metadata

4. HDFS cluster environment deployment

4.1 Deployment in VMware virtual machine

4.1.1 Cluster Planning

Roles in Hadoop HDFS:

  • NameNode
  • DataNode
  • SecondaryNameNode

Services assigned to each node:

  • node1: NameNode, DataNode, SecondaryNameNode
  • node2: DataNode
  • node3: DataNode

4.1.2 Upload and decompress

  1. Upload the Hadoop installation package to the node1 node
  2. Unzip the installation package to /export/server/
tar -zxvf hadoop-3.3.6.tar.gz -C /export/server/
  3. Build a soft link
cd /export/server
ln -s /export/server/hadoop-3.3.6 hadoop
  4. Enter the Hadoop installation directory
cd hadoop
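
A quick sanity check (an optional sketch, assuming the package was unpacked to /export/server/ as above) is to confirm that the soft link points at the unpacked directory:

# verify the unpacked directory and the soft link (run on node1)
ls -ld /export/server/hadoop-3.3.6 /export/server/hadoop
# the soft link should resolve to /export/server/hadoop-3.3.6
readlink /export/server/hadoop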

4.1.3 Hadoop installation package directory structure

  • bin: stores Hadoop's various executable programs
  • etc: stores Hadoop's configuration files
  • sbin: administrator programs, such as the cluster start/stop scripts
  • lib: stores the Linux dynamic link libraries (.so files) used by Hadoop
  • libexec: stores the script files (.sh and .cmd) that configure the Hadoop system

4.1.4 Modify configuration file

Configuring the HDFS cluster mainly involves modifying the following files:

  1. Configure the workers file
cd etc/hadoop  # enter the configuration file directory
vim workers    # edit the workers file
# fill in the following content
node1
node2
node3
  2. Configure the hadoop-env.sh file
# fill in the following content
export JAVA_HOME=/export/server/jdk
export HADOOP_HOME=/export/server/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
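
The hadoop-env.sh above assumes a JDK is already installed (or symlinked) at /export/server/jdk from the earlier environment setup; a quick, optional check before moving on:

# confirm the JDK path referenced by JAVA_HOME exists (the path is the course's assumed location)
ls -ld /export/server/jdk
/export/server/jdk/bin/java -version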
  3. Configure the core-site.xml file
# fill in the following content inside the file
<configuration>
  <property>
    <name>fs.defaultFS</name>          <!-- network communication path of the HDFS file system -->
    <value>hdfs://node1:8020</value>
  </property>

  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>              <!-- buffer size for file I/O operations -->
  </property>
</configuration>
  • hdfs://node1:8020 is the internal communication address of the whole HDFS cluster; the protocol is hdfs:// (Hadoop's built-in protocol)
  • It means the DataNodes will communicate with port 8020 on node1, and node1 is the machine where the NameNode runs
  • This setting therefore fixes node1 as the node that must start the NameNode process (see the short example after this list)
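
As a small illustration (not part of the original steps, and only usable after the cluster is running), the default file system and the full hdfs:// URI are interchangeable when fs.defaultFS is set as above:

# these two commands are equivalent when fs.defaultFS is hdfs://node1:8020
hadoop fs -ls /
hadoop fs -ls hdfs://node1:8020/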

  4. Configure the hdfs-site.xml file
# fill in the following content inside the file
<configuration>
  <property>
    <name>dfs.datanode.data.dir.perm</name>   <!-- default permissions of the DataNode data directories -->
    <value>700</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>        <!-- storage location of the NameNode metadata -->
    <value>/data/nn</value>
  </property>
  <property>
    <name>dfs.namenode.hosts</name>           <!-- which nodes' DataNodes may connect to the NameNode (i.e. may join the cluster) -->
    <value>node1,node2,node3</value>
  </property>
  <property>
    <name>dfs.blocksize</name>                <!-- default HDFS block size -->
    <value>268435456</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>   <!-- number of concurrent handler threads in the NameNode -->
    <value>100</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>        <!-- data storage directory on the DataNode slave nodes -->
    <value>/data/dn</value>
  </property>
</configuration>
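
After the configuration is in place (and distributed, see 4.1.6), one optional way to double-check that a value was picked up is hdfs getconf, shown here with two of the keys above:

# print the effective value of a configuration key
hdfs getconf -confKey dfs.blocksize
hdfs getconf -confKey dfs.namenode.name.dir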

4.1.5 Prepare the data directory

  1. On the node1 node:
mkdir -p /data/nn
mkdir /data/dn
  2. On the node2 and node3 nodes:
mkdir -p /data/dn
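
These directories must be writable by the account that will run HDFS. An optional extra step (assuming, as in section 4.1.8, that a dedicated hadoop user runs the cluster) is to hand ownership of the data directories to that user on every node:

# run as root on node1, node2 and node3; assumes a 'hadoop' user already exists
chown -R hadoop:hadoop /data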

4.1.6 Distributing Hadoop Folders

  1. Distribute the Hadoop directory from node1 to node2 and node3
# run the following commands on node1
cd /export/server
scp -r hadoop-3.3.6 node2:`pwd`/
scp -r hadoop-3.3.6 node3:`pwd`/
  2. Execute on node2 to configure the soft link for Hadoop
# run the following command on node2
ln -s /export/server/hadoop-3.3.6 /export/server/hadoop
  3. Execute on node3 to configure the soft link for Hadoop
# run the following command on node3
ln -s /export/server/hadoop-3.3.6 /export/server/hadoop
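
A quick way to confirm the distribution succeeded (an optional check; it assumes passwordless SSH between the nodes, which the scp step above also relies on):

# check from node1 that the directory and soft link exist on the other nodes
ssh node2 "ls -ld /export/server/hadoop-3.3.6 /export/server/hadoop"
ssh node3 "ls -ld /export/server/hadoop-3.3.6 /export/server/hadoop"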

4.1.7 Configure environment variables

  1. vim /etc/profile
# append the following content at the bottom of /etc/profile
export HADOOP_HOME=/export/server/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
  2. Configure the same environment variables on node2 and node3
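
To make the new variables take effect in the current shell and confirm that the Hadoop commands resolve, something like the following can be run on each node (a minimal check, not part of the original steps):

# reload the profile and verify that the Hadoop binaries are on PATH
source /etc/profile
hadoop version
which start-dfs.sh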

4.1.8 Format the entire file system

  1. Format the NameNode
# make sure this is executed as the hadoop user
su - hadoop
# format the NameNode
hadoop namenode -format
  2. Start and stop
# start the HDFS cluster with one command
start-dfs.sh
# stop the HDFS cluster with one command
stop-dfs.sh

# If you get a "command not found" error, the environment variables are not configured correctly;
# in that case the scripts can be run with their absolute paths
/export/server/hadoop/sbin/start-dfs.sh
/export/server/hadoop/sbin/stop-dfs.sh
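
After start-dfs.sh finishes, an optional verification (assuming the cluster planning from section 4.1.1) is to run jps on each node and look for the expected daemons:

# on node1 you should see NameNode, DataNode and SecondaryNameNode (plus Jps itself)
jps
# on node2 and node3 you should see DataNode
ssh node2 jps
ssh node3 jps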

5. Shell operation of HDFS

5.1 Process start and stop management

5.1.1 One-click start and stop script

  1. One-click start of the HDFS cluster
$HADOOP_HOME/sbin/start-dfs.sh

Execution principle:

  • On the machine executing this script, the SecondaryNameNode is started
  • It reads core-site.xml (the fs.defaultFS entry) to determine which machine hosts the NameNode, and starts the NameNode there
  • It reads the workers file to determine which machines host DataNodes, and starts all of the DataNodes
  2. One-click shutdown of the HDFS cluster
$HADOOP_HOME/sbin/stop-dfs.sh

Execution principle:

  • On the machine executing this script, the SecondaryNameNode is shut down
  • It reads core-site.xml (the fs.defaultFS entry) to determine which machine hosts the NameNode, and shuts down the NameNode there
  • It reads the workers file to determine which machines host DataNodes, and shuts down all of the DataNodes

5.1.2 Single process start and stop

These commands start and stop a single daemon process on the machine where they are executed.

  1. $HADOOP_HOME/sbin/hadoop-daemon.sh
hadoop-daemon.sh (start|status|stop) (namenode|secondarynamenode|datanode)
  2. $HADOOP_HOME/bin/hdfs
hdfs --daemon (start|status|stop) (namenode|secondarynamenode|datanode)
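
For example (an illustrative use of the syntax above), to restart only the DataNode on the current machine:

# stop and then start just the DataNode daemon on this node, then check its status
hdfs --daemon stop datanode
hdfs --daemon start datanode
hdfs --daemon status datanode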

5.2 File system operation commands

  1. Create a folder
hadoop fs -mkdir [-p] <path>
hdfs dfs -mkdir [-p] <path>
  • -p: create parent directories along the path as needed
  2. View the contents of a specified directory
hadoop fs -ls [-h] [-R] [<path> ...]
hdfs dfs -ls [-h] [-R] [<path> ...]
  • -h: display file sizes in a human-readable format
  • -R: recursively list the specified directory and its subdirectories
  3. Upload files to a specified HDFS directory
hadoop fs -put [-f] [-p] <localsrc> <dst>
hdfs dfs -put [-f] [-p] <localsrc> <dst>
  • -f: overwrite the target file if it exists
  • -p: preserve access and modification times, ownership and permissions
  • localsrc: path on the local file system
  • dst: destination path on the target file system (HDFS)
  4. View the content of an HDFS file
hadoop fs -cat <src> ... | more
hdfs dfs -cat <src> ... | more
  5. Download an HDFS file
hadoop fs -get [-f] [-p] <src> ... <localdst>
hdfs dfs -get [-f] [-p] <src> ... <localdst>
  • -f: overwrite the target file if it exists
  • -p: preserve access and modification times
  6. Copy HDFS files
hadoop fs -cp [-f] <src> ... <dst>
hdfs dfs -cp [-f] <src> ... <dst>
  • -f: overwrite the target file if it exists
  7. Append data to an HDFS file
hadoop fs -appendToFile <localsrc> ... <dst>
hdfs dfs -appendToFile <localsrc> ... <dst>
  8. Move HDFS data
hadoop fs -mv <src> ... <dst>
hdfs dfs -mv <src> ... <dst>
  9. Delete HDFS data
hadoop fs -rm [-f] [-skipTrash] <URI> ...
hdfs dfs -rm [-f] [-skipTrash] <URI> ...
  • -skipTrash: skip the trash (recycle bin) and delete directly
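
A short end-to-end example tying these commands together (illustrative only; the local file words.txt and the HDFS path /tmp/demo are made up for this sketch):

# create a directory, upload a local file, inspect it, download it, then delete it
echo "hello hdfs" > words.txt
hadoop fs -mkdir -p /tmp/demo
hadoop fs -put -f words.txt /tmp/demo/
hadoop fs -ls -h /tmp/demo
hadoop fs -cat /tmp/demo/words.txt | more
hadoop fs -get -f /tmp/demo/words.txt ./words_copy.txt
hadoop fs -rm -skipTrash /tmp/demo/words.txt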

5.3 HDFS permissions

(omitted)

5.4 HDFS client

(omitted)

6. The storage principle of HDFS

6.1 Storage principle

(omitted)

6.2 fsck command

6.2.1 Configuring the number of HDFS block replicas

  • Configure in hdfs-site.xml:
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>
  • Or specify it for a single command when uploading:
hadoop fs -D dfs.replication=2 -put test.txt /tmp/
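
For files already stored in HDFS, the replication factor can also be changed afterwards with -setrep (an additional illustration; /tmp/test.txt follows the example above):

# change the replication factor of an existing file to 2; -w waits until re-replication completes
hadoop fs -setrep -w 2 /tmp/test.txt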

6.2.2 Checking a file's replicas with the fsck command

hdfs fsck path [-files [-blocks [-locations]]]
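
For example (an illustrative run against the file uploaded above), listing its blocks and where each replica is stored:

# show the files, blocks and block locations for the given path
hdfs fsck /tmp/test.txt -files -blocks -locations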

6.2.3 Block size configuration

<property>
    <name>dfs.blocksize</name>
    <value>268435456</value>
    <description>Set the HDFS block size, in bytes</description>
</property>

6.3 NameNode metadata

(omitted)

6.4 HDFS data read and write process

6.4.1 Data writing process

[Figure: HDFS data write flow]

  1. The client initiates a write request to the NameNode
  2. The NameNode checks permissions and remaining space; if the conditions are met, it allows the write and tells the client which DataNode address to write to
  3. The client sends data packets to the specified DataNode
  4. The DataNode that receives the data completes the replication of the data copies at the same time, distributing the data it receives to other DataNodes
  5. As shown in the figure above, DataNode1 replicates to DataNode2, and then DataNode2 replicates on to DataNode3 and DataNode4
  6. After the write is complete, the client notifies the NameNode, and the NameNode records the metadata

6.4.2 Data reading process

[Figure: HDFS data read flow]

  1. The client applies to the NameNode to read a file
  2. After the NameNode checks the client's permissions and other details, it allows the read and returns the block list of the file
  3. With the block list, the client can locate the DataNodes and read the blocks by itself

Origin: blog.csdn.net/m0_68111267/article/details/131734908