Hadoop HDFS distributed file system cluster deployment

Copyright Notice

  • The content of this blog is based on my personal study notes from the Dark Horse Programmer course. All copyrights belong to Dark Horse Programmers or the relevant rights holders. This blog is intended only for personal learning and exchange, not for commercial use.
  • I do my best to ensure accuracy when organizing my study notes, but I cannot guarantee that the content is complete or up to date; it may become outdated or require updating over time.
  • If you are Dark Horse Programmers or a related rights holder and believe any content infringes your copyright, please contact me promptly and I will delete it or make the necessary modifications immediately.
  • Other readers should observe the relevant laws, regulations and ethical principles when reading this blog, treat the content as a reference only, and bear any resulting risks and responsibilities themselves. Some views and opinions in this blog are my own and do not represent the position of Dark Horse Programmers.

1. The origin of distributed storage

  • The amount of data is too large and the storage capacity of a single machine has an upper limit, so the problem must be solved with more machines.
  • Scaling out in quantity brings across-the-board improvements in network transmission, disk reading and writing, CPU, memory and so on; a distributed combination can achieve an effect of 1+1 > 2.

2. Distributed infrastructure

2.1 Big data architecture model

In big data systems, there are two main architecture models for distributed coordination:

  • Decentralized model
  • Centralized model

2.2 Master-slave mode

  • Most big data frameworks and infrastructure follow the centralized model: there is a central node (server) that coordinates the work of the other servers, providing unified command and dispatch to avoid chaos. This mode is also called the one-master-many-slaves mode, or master-slave mode (Master and Slaves) for short.

3. HDFS infrastructure


HDFS role composition

NameNode:

  • The master role of the HDFS system; runs as an independent process
  • Responsible for managing the entire HDFS file system
  • Responsible for managing DataNode

SecondaryNameNode:

  • The auxiliary role of the NameNode; runs as an independent process
  • Mainly helps the NameNode complete metadata organization work (handling miscellaneous tasks)

DataNode:

  • The slave role of the HDFS system; runs as an independent process
  • Mainly responsible for data storage, that is, storing and retrieving data


  • A typical HDFS cluster consists of 1 NameNode plus several (at least one) DataNodes

4. HDFS cluster environment deployment

4.1 Download the installation package
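If the installation package is not already on hand, it can be downloaded from the Apache release archive first. A minimal sketch (the archive URL and the 3.3.4 version are assumptions; use whichever release and mirror you actually deploy):

    # Download the Hadoop binary release from the Apache archive (version 3.3.4 assumed)
    wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz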


The roles of Hadoop HDFS include:

  • NameNode, master node manager
  • DataNode, slave node worker
  • SecondaryNameNode, auxiliary of the master node


  1. Upload the Hadoop installation package to the node1 node
  2. Unzip the installation package into /export/server/
    tar -zxvf hadoop-3.3.4.tar.gz -C /export/server
    
  3. Build the soft link
    cd /export/server
    ln -s /export/server/hadoop-3.3.4 hadoop
  4. Enter the hadoop installation directory
    cd hadoop
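As an optional sanity check, the bundled hadoop command can report its version; a minimal sketch, assuming JAVA_HOME is already configured from the earlier JDK setup and the soft link above was created:

    # Confirm the unpacked release works (run from /export/server/hadoop)
    ./bin/hadoop version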

4.2 Directory structure of Hadoop installation package

  • cd into the Hadoop installation package and view the internal structure of the folder through the ls -l command
  • bin: Hadoop programs/commands
  • etc: Hadoop configuration files
  • include: C header files
  • lib: Linux dynamic link libraries (.so files)
  • libexec: Hadoop system configuration scripts (.sh and .cmd files)
  • licenses-binary: license files
  • sbin: administrator programs ("super bin")
  • share: Java jar packages (binaries)

These all sit directly under the hadoop root directory.

4.3 Modify the configuration files and apply customized settings

Configuring the HDFS cluster mainly involves modifying the following files (all located in the hadoop/etc/hadoop directory):

  • workers: Configure slave nodes (DataNode)
  • hadoop-env.sh: Configure Hadoop-related environment variables
  • core-site.xml: Hadoop core configuration file
  • hdfs-site.xml: HDFS core configuration file

These files are located in the $HADOOP_HOME/etc/hadoop folder.

  • PS: $HADOOP_HOME is an environment variable that will be set later; it refers to the Hadoop installation folder /export/server/hadoop

  • Configure the workers file: filling in node1, node2, node3 tells the cluster that it has three slave nodes (DataNodes)
# Enter the configuration directory
cd etc/hadoop
# Edit the workers file
vim workers
# Fill in the following content
node1
node2
node3
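The start-dfs.sh script used later starts the DataNodes on these workers over SSH, so it can be worth confirming that node1 reaches each of them without a password. An optional check, assuming passwordless SSH was already set up during environment preparation:

# Optional: each worker listed above should answer over passwordless SSH
for host in node1 node2 node3; do ssh "$host" hostname; done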

  • Configure the /export/server/hadoop/etc/hadoop/hadoop-env.sh file
# Fill in the following content
export JAVA_HOME=/export/server/jdk8
export HADOOP_HOME=/export/server/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
  • JAVA_HOME, indicates the location of the JDK environment
  • HADOOP_HOME, indicates the Hadoop installation location
  • HADOOP_CONF_DIR, indicates the location of the Hadoop configuration file directory
  • HADOOP_LOG_DIR, indicates the location of the Hadoop running log directory
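A quick check that the paths referenced in hadoop-env.sh actually exist can save a failed startup later; a minimal sketch, assuming the JDK was installed to /export/server/jdk8 as above:

# Verify the JDK binary and the Hadoop configuration directory exist
ls /export/server/jdk8/bin/java /export/server/hadoop/etc/hadoop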

  • Configure the core-site.xml file and fill in the following content inside the file
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node1:8020</value>
  </property>

  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
  • key: fs.defaultFS

  • Meaning: network communication path of the HDFS file system

  • Value: hdfs://node1:8020

    • The protocol is hdfs://
    • The NameNode is node1
    • The NameNode communication port is 8020
  • key: io.file.buffer.size

  • Meaning: buffer size for file I/O operations

  • Value: 131072 bytes (128 KB)

  • hdfs://node1:8020 is the internal communication address of the whole HDFS cluster, and the protocol is hdfs:// (Hadoop's built-in protocol)

  • It means the DataNodes will communicate with port 8020 on node1, the machine where the NameNode runs

  • This configuration therefore fixes that the NameNode process must be started on node1


  • Configure the hdfs-site.xml file and fill in the following content inside the file
<configuration>
  <!-- Permission mode for the DataNode data directories -->
  <property>
    <name>dfs.datanode.data.dir.perm</name>
    <value>700</value>
  </property>
  <!-- Directory where the NameNode stores its metadata -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/nn</value>
  </property>
  <!-- DataNode hosts that the NameNode allows to connect -->
  <property>
    <name>dfs.namenode.hosts</name>
    <value>node1,node2,node3</value>
  </property>
  <!-- HDFS block size: 268435456 bytes = 256 MB -->
  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value>
  </property>
  <!-- Number of concurrent handler threads on the NameNode -->
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>100</value>
  </property>
  <!-- Directory where each DataNode stores data blocks -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/dn</value>
  </property>
</configuration>
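Once the two XML files are saved, the effective values can be read back with hdfs getconf; this is an optional sketch using absolute paths, since the PATH setup only happens later in section 4.5:

# Read back the effective configuration values from the XML files
/export/server/hadoop/bin/hdfs getconf -confKey fs.defaultFS
/export/server/hadoop/bin/hdfs getconf -confKey dfs.blocksize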


  • We need to manually create the relevant directories
  • On the node1 node:
    mkdir -p /data/nn
    mkdir /data/dn
    
  • On node2 and node3 nodes:
    mkdir -p /data/dn
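Alternatively, the directories on node2 and node3 can be created remotely from node1; an optional sketch, assuming SSH access between the nodes:

    # Create the DataNode data directories on node2 and node3 from node1
    ssh node2 "mkdir -p /data/dn"
    ssh node3 "mkdir -p /data/dn"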
    

4.4 Distribute Hadoop folders

  • At this point, the basic configuration of Hadoop is complete. The hadoop installation folder can now be remotely copied (distributed) from node1 to node2 and node3.
    # Execute the following commands on node1
    cd /export/server
    scp -r hadoop-3.3.4 node2:`pwd`/
    scp -r hadoop-3.3.4 node3:`pwd`/
    
  • Execute on node2 to configure the soft link for hadoop
    # Execute the following command on node2
    ln -s /export/server/hadoop-3.3.4 /export/server/hadoop
    
  • Execute on node3 to configure the soft link for hadoop
    # Execute the following command on node3
    ln -s /export/server/hadoop-3.3.4 /export/server/hadoop
    

4.5 Configure environment variables

  • To make Hadoop easier to operate, some Hadoop scripts and programs can be added to PATH for later use.
  • There are many scripts and programs in the bin and sbin folders inside the Hadoop folder; configure the environment variables as follows.
  1. Edit the environment variable configuration file
vim /etc/profile
# Append the following content at the bottom of /etc/profile
export HADOOP_HOME=/export/server/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
  2. Configure the same environment variables on node2 and node3
  3. Make the environment variables take effect
source /etc/profile
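After sourcing the profile, a quick check confirms that the Hadoop commands now resolve through PATH; an optional sketch:

# Confirm the hadoop command and the start script are found on PATH
which hadoop start-dfs.sh
hadoop version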

4.6 Authorize as hadoop user

  • The preparation for the hadoop deployment is basically complete. To ensure security, the Hadoop system is not started as the root user; instead the whole Hadoop service is started as the ordinary user hadoop. Therefore, file permissions need to be granted now.
  • As root, execute the following commands on all three servers node1, node2 and node3 (if the hadoop user does not exist yet, see the sketch after this block).
# Execute as root on all three servers
chown -R hadoop:hadoop /data
chown -R hadoop:hadoop /export
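If the ordinary hadoop user does not exist yet on a server, it can be created before running the chown commands above; a minimal sketch, run as root on each node:

# Create the hadoop user and set its password (skip if the user already exists)
useradd hadoop
passwd hadoop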

4.7 Format the entire file system

  • Format the NameNode
    # Make sure to execute as the hadoop user
    su - hadoop
    # Format the NameNode
    hadoop namenode -format
    
  • Start and stop
    # Start the HDFS cluster with one command
    start-dfs.sh
    # Stop the HDFS cluster with one command
    stop-dfs.sh
    # If a "command not found" error appears, the environment variables are not configured; run the scripts by absolute path instead
    /export/server/hadoop/sbin/start-dfs.sh
    /export/server/hadoop/sbin/stop-dfs.sh
    
  • Check the running status of the three roles
    # Status on node1
    [hadoop@node1 ~]$ jps
    25937 Jps
    25063 NameNode
    25207 DataNode
    25543 SecondaryNameNode
    
  • View the HDFS Web UI (the NameNode web interface; for Hadoop 3.x it is served by default at http://node1:9870)
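After the processes are up, a quick functional test confirms that data can actually be written to and read from HDFS; a minimal sketch, run as the hadoop user (the paths are arbitrary examples):

    # Simple smoke test: create a directory, upload a file, list it
    hdfs dfs -mkdir -p /test
    hdfs dfs -put /etc/hostname /test/
    hdfs dfs -ls /test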

Source: blog.csdn.net/yang2330648064/article/details/132368218