Copyright Notice
- The content of this blog is based on my personal study notes from the Dark Horse Programmer course. I hereby declare that all copyrights belong to Dark Horse Programmers or related rights holders. The purpose of this blog is only for personal learning and communication, not commercial use.
- I try my best to ensure accuracy when organizing my study notes, but I cannot guarantee the completeness and timeliness of the content. The content of this blog may become outdated over time or require updating.
- If you are from Dark Horse Programmers or a related rights holder and believe any content infringes your copyright, please contact me promptly and I will delete it immediately or make the necessary modifications.
- Other readers, please abide by relevant laws, regulations, and ethical principles when reading this blog, treat the content with caution, and bear any resulting risks and responsibilities yourselves. Some of the views and opinions in this blog are my own and do not represent the position of Dark Horse Programmers.
1. The origin of distributed storage
- The volume of data is too large, and a single machine's storage capacity has an upper limit, so the problem must be solved with more machines.
- Scaling out brings combined improvements in network transfer, disk I/O, CPU, memory, and other resources; a distributed combination achieves a 1+1>2 effect.
2. Distributed infrastructure
2.1 Big data architecture model
In the big data system, there are two main types of architecture models for distributed scheduling:
- Decentralized model
- Centralized model
2.2 Master-slave mode
- Big data frameworks and most infrastructure follow the centralized model. That is, a central node (server) coordinates the work of the other servers, providing unified command and unified dispatch to avoid chaos. This mode is also called one-master-many-slaves, or master-slave mode (Master and Slaves) for short.
3. HDFS infrastructure
HDFS role composition
NameNode:
- The master role of the HDFS system; runs as an independent process
- Responsible for managing the entire HDFS file system
- Responsible for managing DataNode
SecondaryNameNode:
- The NameNode's assistant; runs as an independent process
- Mainly helps the NameNode organize metadata (handles the auxiliary chores)
DataNode:
- The slave role of the HDFS system; runs as an independent process
- Mainly responsible for data storage, i.e. writing data in and reading data out
- A typical HDFS cluster consists of 1 NameNode plus several (at least one) DataNodes
4. HDFS cluster environment deployment
4.1 Download the installation package
- official website
- Prepare three VMware virtual machines as the environment
The roles of Hadoop HDFS include:
- NameNode, master node manager
- DataNode, slave node worker
- SecondaryNameNode, master node assistant
- Upload the Hadoop installation package to the node1 node
- Unzip the installation package into /export/server/
tar -zxvf hadoop-3.3.6.tar.gz -C /export/server
- Create a soft link
cd /export/server
ln -s /export/server/hadoop-3.3.6 hadoop
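The soft-link step above just gives the versioned folder a stable name. As a minimal illustration of what `ln -s` does, the Python sketch below reproduces the same link in a throwaway temp directory (the paths here are illustrative, not the real /export/server):

```python
import os
import tempfile

# Recreate the effect of `ln -s hadoop-3.3.6 hadoop` in a temp directory
# (illustrative paths only; the real deployment uses /export/server)
base = tempfile.mkdtemp()
target = os.path.join(base, "hadoop-3.3.6")
os.mkdir(target)
link = os.path.join(base, "hadoop")
os.symlink(target, link)  # equivalent of: ln -s hadoop-3.3.6 hadoop

is_link = os.path.islink(link)
resolves = os.path.realpath(link) == os.path.realpath(target)
print(is_link, resolves)  # True True
```

The benefit is the same as in the deployment: scripts can always reference the stable `hadoop` path, and upgrading only means re-pointing the link.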
- Enter the Hadoop installation directory
cd hadoop
4.2 Directory structure of Hadoop installation package
- cd into the Hadoop installation directory and inspect its internal structure with the ls -l command
4.3 Modify the configuration file and apply customized settings
Configuring the HDFS cluster mainly involves modifying the following files (all in the hadoop/etc/hadoop directory):
- workers: Configure slave nodes (DataNode)
- hadoop-env.sh: Configure Hadoop-related environment variables
- core-site.xml: Hadoop core configuration file
- hdfs-site.xml: HDFS core configuration file
These files are in the $HADOOP_HOME/etc/hadoop folder.
- ps: $HADOOP_HOME is an environment variable we will set later; it refers to the Hadoop installation folder /export/server/hadoop
- Configure the workers file: filling in node1, node2, node3 records three slave nodes (DataNodes) for the cluster
# Enter the configuration file directory
cd etc/hadoop
# Edit the workers file
vim workers
# Fill in the following content
node1
node2
node3
- Configure the /export/server/hadoop/etc/hadoop/hadoop-env.sh file
# Fill in the following content
export JAVA_HOME=/export/server/jdk8
export HADOOP_HOME=/export/server/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
- JAVA_HOME, indicates the location of the JDK environment
- HADOOP_HOME, indicates the Hadoop installation location
- HADOOP_CONF_DIR, indicates the location of the Hadoop configuration file directory
- HADOOP_LOG_DIR, indicates the location of the Hadoop running log directory
- Configure the core-site.xml file and fill in the following content inside the file
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node1:8020</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
</configuration>
- key: fs.defaultFS
  - Meaning: network communication path of the HDFS file system
  - Value: hdfs://node1:8020
    - The protocol is hdfs://
    - The namenode is node1
    - The namenode communication port is 8020
- key: io.file.buffer.size
  - Meaning: buffer size for file I/O operations
  - Value: 131072 (bytes, i.e. 128 KB)
- hdfs://node1:8020 is the internal communication address of the entire HDFS, and the protocol is hdfs:// (Hadoop's built-in protocol)
  - This means DataNodes communicate with port 8020 of node1, the machine where the NameNode runs
  - This configuration fixes node1 as the node that must start the NameNode process
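As a sanity check on the configuration above, a Hadoop *-site.xml file can be parsed with nothing but the standard library. The sketch below embeds the core-site.xml content as a string (the `read_conf` helper is hypothetical, written only for this illustration) and extracts the protocol, host, and port from fs.defaultFS:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# core-site.xml content from the section above, embedded as a string
CORE_SITE = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node1:8020</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
"""

def read_conf(xml_text):
    """Return a {name: value} dict from a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}

conf = read_conf(CORE_SITE)
fs = urlparse(conf["fs.defaultFS"])
print(fs.scheme, fs.hostname, fs.port)  # hdfs node1 8020
```

This is the same protocol/host/port breakdown described in the bullets above: hdfs:// is the protocol, node1 is the NameNode host, and 8020 is its communication port.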
- Configure hdfs-site.xml file
# Fill in the following content inside the file
<configuration>
<property>
<name>dfs.datanode.data.dir.perm</name>
<value>700</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/data/nn</value>
</property>
<property>
<name>dfs.namenode.hosts</name>
<value>node1,node2,node3</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>268435456</value>
</property>
<property>
<name>dfs.namenode.handler.count</name>
<value>100</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/data/dn</value>
</property>
</configuration>
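For intuition about dfs.blocksize above: 268435456 bytes is 256 MiB, and HDFS stores a file as ceil(size / blocksize) blocks, with the last block left unpadded. A small arithmetic sketch (the file sizes are made-up examples):

```python
import math

# dfs.blocksize from hdfs-site.xml: 268435456 bytes = 256 MiB
BLOCK_SIZE = 268435456

def num_blocks(file_size_bytes):
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

# Illustrative file sizes only
print(num_blocks(1 * 1024**3))    # 1 GiB file   -> 4 blocks
print(num_blocks(300 * 1024**2))  # 300 MiB file -> 2 blocks
print(num_blocks(1024))           # 1 KiB file   -> 1 block (not padded to 256 MiB)
```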
- We need to manually create the relevant directories
- On the node1 node:
mkdir -p /data/nn
mkdir /data/dn
- On node2 and node3 nodes:
mkdir -p /data/dn
4.4 Distribute Hadoop folders
- At this point, the configuration of Hadoop has been basically completed. You can remotely copy the hadoop installation folder from node1 to node2 and node3 for distribution.
# Execute the following commands on node1
cd /export/server
scp -r hadoop-3.3.6 node2:`pwd`/
scp -r hadoop-3.3.6 node3:`pwd`/
- Execute on node2, configure soft link for hadoop
# Execute the following command on node2
ln -s /export/server/hadoop-3.3.6 /export/server/hadoop
- Execute on node3, configure soft link for hadoop
# Execute the following command on node3
ln -s /export/server/hadoop-3.3.6 /export/server/hadoop
4.5 Configure environment variables
- In order to facilitate the operation of Hadoop, you can configure some Hadoop scripts and programs into PATH for subsequent use.
- There are many scripts and programs in the bin and sbin folders in the Hadoop folder. Now let’s configure the environment variables.
- Edit environment variable configuration file
vim /etc/profile
# Append the following content to the bottom of /etc/profile
export HADOOP_HOME=/export/server/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
- Configure the same environment variables on node2 and node3
- Make environment variables effective
source /etc/profile
4.6 Authorize as hadoop user
- The preparations for the Hadoop deployment are basically complete. For security, the Hadoop system is not started as the root user; instead, we start the entire Hadoop service as the ordinary user hadoop. Therefore, file permissions must now be granted.
- As root, execute the following commands on the three servers node1, node2, and node3.
# As root, execute on all three servers
chown -R hadoop:hadoop /data
chown -R hadoop:hadoop /export
4.7 Format the entire file system
- Format namenode
# Make sure to run as the hadoop user
su - hadoop
# Format the namenode
hadoop namenode -format
- Start and stop
# Start the HDFS cluster with one command
start-dfs.sh
# Stop the HDFS cluster with one command
stop-dfs.sh
# If you get a command-not-found error, the environment variables are not configured; run with absolute paths instead
/export/server/hadoop/sbin/start-dfs.sh
/export/server/hadoop/sbin/stop-dfs.sh
- Check the running status of three roles
# Status on node1
[hadoop@node1 ~]$ jps
25937 Jps
25063 NameNode
25207 DataNode
25543 SecondaryNameNode
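The jps check above can also be scripted. The sketch below embeds the sample output shown above as a string (the `running_roles` helper is hypothetical, written for this illustration) and verifies that all three HDFS roles are present:

```python
# Sample `jps` output from node1, as shown in this section
JPS_OUTPUT = """\
25937 Jps
25063 NameNode
25207 DataNode
25543 SecondaryNameNode
"""

def running_roles(jps_text):
    """Extract process names from `jps` output (one 'pid name' pair per line)."""
    return {line.split(maxsplit=1)[1]
            for line in jps_text.splitlines() if line.strip()}

expected = {"NameNode", "DataNode", "SecondaryNameNode"}
missing = expected - running_roles(JPS_OUTPUT)
print("all roles up" if not missing else f"missing: {missing}")  # all roles up
```

On node2 and node3 only a DataNode is expected, so the `expected` set would shrink accordingly.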
- View the HDFS Web UI: once the cluster is running, the NameNode serves a management web page (for Hadoop 3.x, at http://node1:9870 by default)