Overview and installation of Hadoop

Hadoop technology grew out of three papers published by Google (big data software generally needs to run 24/7 without downtime).

Hadoop solves the two core problems of big data: storing massive amounts of data and computing over massive amounts of data.

1. Three core components inside Hadoop

1. HDFS: Distributed file storage system

Applies distributed design to solve the problem of storing massive amounts of data.

Composed of three core components

  • NameNode: master node
    • Stores the metadata (directory structure) of the entire HDFS cluster
    • Manages the entire HDFS cluster
  • DataNode: data node / slave node
    • Stores the actual data; DataNodes store files in the form of blocks
  • SecondaryNameNode: the NameNode's assistant
    • Helps the NameNode merge its edit logs into the metadata (checkpointing)

2. YARN: Distributed resource scheduling system

Solves resource allocation and task monitoring for distributed computing programs.

Mesos: a distributed resource management system (an alternative to YARN)

Composed of two core components

  • ResourceManager: master node
    • Manages the entire YARN cluster and is responsible for overall resource allocation.
  • NodeManager: slave node
    • The nodes that actually provide the resources (CPU and memory)

3. MapReduce: Distributed offline computing framework

Applies distributed design to solve the problem of computing over massive amounts of data.

4. Hadoop Common (shared utility libraries; a general understanding is enough)

2. The ecosystem that grew up around Hadoop

  • Data collection and storage: Flume, Kafka, HBase, HDFS
  • Data cleaning and preprocessing: MapReduce, Spark
  • Data statistics and analysis: Hive, Pig
  • Data migration: Sqoop
  • Data visualization: ECharts
  • Distributed coordination: ZooKeeper

3. We mainly focus on learning the Apache distribution of Hadoop

Official website: https://hadoop.apache.org

Apache Hadoop release lines

  • Hadoop 1.x
  • Hadoop 2.x
  • Hadoop 3.x
    • hadoop-3.1.4 (the version used in this article)

4. Four modes of Hadoop installation

Within Hadoop, HDFS and YARN are each distributed systems, and both follow a master-slave architecture.

The first: local (standalone) mode. Only MapReduce can be used; HDFS and YARN are not available. Basically never used.

The second: pseudo-distributed mode. The master and slave daemons of both HDFS and YARN are all installed on the same node.

The third: fully distributed mode. The master and slave components of HDFS and YARN are installed on different nodes.

The fourth: HA (high-availability) mode. The master and slave components of HDFS and YARN are installed on different nodes, and two or three extra master nodes are added as standbys, but only one master node serves clients at any given time. This can be achieved with the help of ZooKeeper.

Configuration files to modify: hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-env.sh, mapred-site.xml, yarn-site.xml, yarn-env.sh, workers, log4j.properties, capacity-scheduler.xml, dfs.hosts, dfs.hosts.exclude

5. Hadoop pseudo-distributed installation process

1. Install the JDK on Linux first, since Hadoop is implemented in Java.

  • There are two main places to configure environment variables:
/etc/profile: system-wide environment variables
~/.bash_profile: per-user environment variables
After editing an environment file you must reload it:
source <path to the environment file>
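
As a minimal sketch (assuming the JDK was extracted to /opt/app/jdk1.8.0_371, the same path used later in hadoop-env.sh), the JDK environment variables could look like this:

# append to /etc/profile (system-wide) or ~/.bash_profile (per-user)
export JAVA_HOME=/opt/app/jdk1.8.0_371
export PATH=$PATH:$JAVA_HOME/bin

# reload the file and verify
source /etc/profile
java -version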

2. Configure the hostname mapping for the current host and passwordless SSH login
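
A rough sketch of this step (the IP address 192.168.31.100 is illustrative; the hostname single matches the one used in the configuration files below):

# map the hostname to this host's IP in /etc/hosts
echo "192.168.31.100 single" >> /etc/hosts

# generate a key pair and authorize it for the local host (passwordless SSH)
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id single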

3. Install the local version of Hadoop

  • Upload: use Xftp to transfer hadoop-3.1.4.tar.gz (downloaded on Windows) to the /opt/software directory
  • Extract: tar -zxvf hadoop-3.1.4.tar.gz -C /opt/app
  • Configure environment variables
    • vim /etc/profile
    • export HADOOP_HOME=/opt/app/hadoop-3.1.4
      export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    • source /etc/profile
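
To check the local installation, a quick sketch (the bundled examples jar ships with the distribution; its exact name depends on the version you extracted):

# confirm the environment variables took effect
hadoop version

# run a bundled MapReduce example in local mode (no HDFS/YARN needed yet)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.4.jar pi 2 10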

4. Install the pseudo-distributed version of Hadoop

This only requires modifying the various Hadoop configuration files below.

  • hadoop-env.sh configures the Java path and the users the daemons run as
vim hadoop-env.sh
# line 54
export JAVA_HOME=/opt/app/jdk1.8.0_371
# line 58
export HADOOP_HOME=/opt/app/hadoop-3.1.4
# line 68
export HADOOP_CONF_DIR=/opt/app/hadoop-3.1.4/etc/hadoop
# at the end of the file
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root


  • core-site.xml configures common configuration items for HDFS and YARN
    • Configure the HDFS NameNode address
    • Configure the directory where HDFS cluster data is stored
vim core-site.xml
<!-- Add the following inside the configuration tag -->
<configuration>
        <!-- Address of the NameNode in HDFS -->
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://single:9000</value>
        </property>
        <!-- Directory for files Hadoop generates at runtime (where HDFS data is kept) -->
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/opt/app/hadoop-3.1.4/metaData</value>
        </property>
        <!-- Proxy-user settings for later Hive integration -->
        <property>
                <name>hadoop.proxyuser.root.hosts</name>
                <value>*</value>
        </property>
        <property>
                <name>hadoop.proxyuser.root.groups</name>
                <value>*</value>
        </property>

</configuration>


  • hdfs-site.xml configures HDFS-related components
    • Configure the web UI addresses of the NameNode, DataNode and SecondaryNameNode, the replica count, the metadata directories, etc.
vim hdfs-site.xml
<configuration>
        <!-- Number of HDFS replicas -->
        <property>
                <name>dfs.replication</name>
                <!-- number of copies of each block stored by the DataNodes -->
                <value>1</value>
        </property>
        <!-- Disable HDFS user permission checks -->
        <property>
                <name>dfs.permissions.enabled</name>
                <value>false</value>
        </property>
        <property>
                <name>dfs.namenode.http-address</name>
                <value>0.0.0.0:9870</value>
                <!-- 50070 in Hadoop 2.x, 9870 in 3.x -->
        </property>
        <property>
                <name>dfs.datanode.http-address</name>
                <value>0.0.0.0:9864</value>
                <!-- 50075 in Hadoop 2.x, 9864 in 3.x -->
        </property>
        <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>0.0.0.0:9868</value>
                <!-- 50090 in Hadoop 2.x, 9868 in 3.x -->
        </property>
        <!-- Directories where the NameNode stores its metadata -->
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>/opt/app/hadoop-3.1.4/metaData/dfs/name1,/opt/app/hadoop-3.1.4/metaData/dfs/name2</value>
        </property>
</configuration>


  • mapred-env.sh configures the paths of the associated software (Java, YARN) used when MR programs run
vim mapred-env.sh
# at the end of the file
export JAVA_HOME=/opt/app/jdk1.8.0_371


  • mapred-site.xml configures the MR program running environment
    • Configure the MR program to run on YARN
vim mapred-site.xml
<!-- Run MR programs on YARN -->
  <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
  </property>

  <!-- Environment variables for the MR ApplicationMaster (required in Hadoop 3.x) -->
  <property>
      <name>yarn.app.mapreduce.am.env</name>
      <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
  <!-- Environment variables for the map phase (required in Hadoop 3.x) -->
  <property>
      <name>mapreduce.map.env</name>
      <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
  <!-- Environment variables for the reduce phase (required in Hadoop 3.x) -->
  <property>
      <name>mapreduce.reduce.env</name>
      <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
  <!-- Memory limits and JVM heap sizes for map and reduce tasks (small values for a test VM) -->
  <property>
      <name>mapreduce.map.memory.mb</name>
      <value>250</value>
  </property>
  <property>
      <name>mapreduce.map.java.opts</name>
      <value>-Xmx250M</value>
  </property>
  <property>
      <name>mapreduce.reduce.memory.mb</name>
      <value>300</value>
  </property>
  <property>
      <name>mapreduce.reduce.java.opts</name>
      <value>-Xmx300M</value>
  </property>
  <!-- JobHistory server RPC and web addresses -->
  <property>
      <name>mapreduce.jobhistory.address</name>
      <value>single:10020</value>
  </property>
  <property>
      <name>mapreduce.jobhistory.webapp.address</name>
      <value>single:19888</value>
  </property>


  • yarn-env.sh configures the paths of components associated with YARN
vim yarn-env.sh
# at the end of the file
export JAVA_HOME=/opt/app/jdk1.8.0_371


  • yarn-site.xml configures YARN-related components
    • Configure the ResourceManager host, the runtime classpath, log aggregation, the web UI addresses of the RM and NM, etc.
vim yarn-site.xml
<!-- How reducers fetch data (the shuffle auxiliary service) -->
  <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
  </property>

  <!-- Address of the YARN ResourceManager -->
  <property>
        <name>yarn.resourcemanager.hostname</name>
        <!-- host on which the RM component is installed -->
        <value>single</value>
  </property>
  <property>
      <name>yarn.application.classpath</name>
      <!-- classpath entries YARN needs at runtime -->
      <value>
      /opt/app/hadoop-3.1.4/etc/hadoop,
      /opt/app/hadoop-3.1.4/share/hadoop/common/*,
      /opt/app/hadoop-3.1.4/share/hadoop/common/lib/*,
      /opt/app/hadoop-3.1.4/share/hadoop/hdfs/*,
      /opt/app/hadoop-3.1.4/share/hadoop/hdfs/lib/*,
      /opt/app/hadoop-3.1.4/share/hadoop/mapreduce/*,
      /opt/app/hadoop-3.1.4/share/hadoop/mapreduce/lib/*,
      /opt/app/hadoop-3.1.4/share/hadoop/yarn/*,
      /opt/app/hadoop-3.1.4/share/hadoop/yarn/lib/*
      </value>
  </property>
<!-- yarn.resourcemanager.webapp.address is the RM web UI address -->
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Keep aggregated logs for 7 days -->
<property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
</property>
<!-- URL of the log server (the JobHistory server) -->
<property>
       <name>yarn.log.server.url</name>
       <value>http://single:19888/jobhistory/logs</value>
</property>
<!-- Disable YARN's virtual-memory limit check -->
<property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
</property>


  • workers (named slaves in Hadoop 2.x) configures the HDFS and YARN slave-node hosts
    • Specifies which nodes the DataNode and NodeManager are installed on; replace the default localhost with single
vim workers
single


  • log4j.properties configures the log output directory used while Hadoop runs
vim log4j.properties
# line 19
hadoop.log.dir=/opt/app/hadoop-3.1.4/logs
# directory where Hadoop writes its runtime logs


6. Format HDFS cluster

hdfs namenode -format
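
After formatting succeeds, the metadata directories configured in hdfs-site.xml should have been initialized. A quick sanity check (the exact file names may vary slightly by version):

ls /opt/app/hadoop-3.1.4/metaData/dfs/name1/current
# expect files such as VERSION and fsimage_*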

7. Start HDFS and YARN

  • HDFS

    • start-dfs.sh

    If start-dfs.sh reports an error when run as root, the solution is to define the run-as users in /etc/profile:

    vim /etc/profile
    # add the following at the end of the file
    # Hadoop 3.x also requires these settings
    export HDFS_NAMENODE_USER=root
    export HDFS_DATANODE_USER=root
    export HDFS_SECONDARYNAMENODE_USER=root
    export YARN_RESOURCEMANAGER_USER=root
    export YARN_NODEMANAGER_USER=root
    # then reload the file so it takes effect
    source /etc/profile
    


    • stop-dfs.sh
    • Provides a web UI for monitoring the status of the entire HDFS cluster:
      http://ip:9870 (Hadoop 3.x)
      http://ip:50070 (Hadoop 2.x)
  • YARN

    • start-yarn.sh
    • stop-yarn.sh
    • Provides a web UI for monitoring the status of the entire YARN cluster:
      http://ip:8088
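
Once both start scripts have run, a quick sketch for verifying the pseudo-distributed cluster (again, the examples jar name depends on the extracted version):

# jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager
jps

# try a few HDFS operations and submit a bundled example job to YARN
hdfs dfs -mkdir -p /test
hdfs dfs -ls /
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.4.jar pi 2 10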

8. Fully distributed installation of Hadoop

1. Clone the virtual machines

All three virtual machines need to be configured with an IP address, a hostname, host-to-IP mappings, passwordless SSH login, time-server installation and synchronization, and a yum repository switched to a domestic mirror source.
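
As a rough sketch of the host mapping and passwordless SSH setup (the 192.168.31.x addresses are illustrative; node1/node2/node3 match the hostnames used in the configuration below):

# /etc/hosts on every node
192.168.31.101 node1
192.168.31.102 node2
192.168.31.103 node3

# on each node: generate a key pair and push it to all three nodes
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id node1
ssh-copy-id node2
ssh-copy-id node3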


Install and synchronize the chrony time server

yum install -y chrony


First configure the main server (node1):

vim /etc/chrony.conf

On line 7, add: allow 192.168.31.0/24


Then configure the two slave servers (node2 and node3):

vim /etc/chrony.conf

Delete the default server lines (lines 3-6), then add the line: server node1 iburst


Start the service
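
A minimal sketch for this step (assuming the systemd service name chronyd; on node2 and node3 the sources output should list node1 as the time source):

# on every node: start chronyd and enable it at boot
systemctl enable --now chronyd

# verify synchronization
chronyc sources -v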


2. Install JDK

Omitted here; see the previous blog post if needed.

3. Install Hadoop fully distributed

  • hdfs-site.xml
<configuration>
      <!-- Number of HDFS replicas -->
      <property>
          <name>dfs.replication</name>
          <value>3</value>
      </property>
      <!-- SecondaryNameNode address -->
      <property>
          <name>dfs.namenode.secondary.http-address</name>
          <value>node3:9868</value>
      </property>
      <!-- Disable HDFS user permission checks -->
      <property>
          <name>dfs.permissions.enabled</name>
          <value>false</value>
      </property>
      <!-- Check DataNode registration by resolving its IP to a hostname -->
      <property>
          <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
          <value>true</value>
      </property>
</configuration>


  • yarn-site.xml
<configuration>
      <!-- How reducers fetch data (the shuffle auxiliary service) -->
      <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
      </property>
      <!-- Address of the YARN ResourceManager -->
      <property>
          <name>yarn.resourcemanager.hostname</name>
          <value>node2</value>
      </property>
      <!-- Classpath entries YARN needs at runtime -->
      <property>
          <name>yarn.application.classpath</name>
          <value>
          /opt/app/hadoop-3.1.4/etc/hadoop,
          /opt/app/hadoop-3.1.4/share/hadoop/common/*,
          /opt/app/hadoop-3.1.4/share/hadoop/common/lib/*,
          /opt/app/hadoop-3.1.4/share/hadoop/hdfs/*,
          /opt/app/hadoop-3.1.4/share/hadoop/hdfs/lib/*,
          /opt/app/hadoop-3.1.4/share/hadoop/mapreduce/*,
          /opt/app/hadoop-3.1.4/share/hadoop/mapreduce/lib/*,
          /opt/app/hadoop-3.1.4/share/hadoop/yarn/*,
          /opt/app/hadoop-3.1.4/share/hadoop/yarn/lib/*
          </value>
      </property>
</configuration>


  • mapred-site.xml
 <!-- 指定mr运行在yarn上 -->
      <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
      </property>
      <property>
          <name>yarn.app.mapreduce.am.env</name>
          <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
      </property>
      <property>
          <name>mapreduce.map.env</name>
          <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
      </property>
      <property>
          <name>mapreduce.reduce.env</name>
          <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
      </property>


A total of nine related files need to be configured


Then copy /opt/app from node1 to /opt on the node2 and node3 nodes:

scp -r /opt/app root@node2:/opt
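
The same command is repeated for node3. A sketch (the /etc/profile copy is only needed if the environment variables were configured on node1 alone):

scp -r /opt/app root@node3:/opt

# optionally sync the environment variables as well, then reload them on each node
scp /etc/profile root@node2:/etc/profile
scp /etc/profile root@node3:/etc/profile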

4. Format HDFS

Run the format command on the node where the NameNode is located (node1):

hdfs namenode -format

5. Start HDFS and YARN

1. HDFS is started on the node where the NameNode is located (node1)

2. YARN is started on the node where the ResourceManager is located (node2)
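
A sketch of this step (assuming the role layout described above, with all three hosts listed in the workers file: NameNode on node1, ResourceManager on node2, SecondaryNameNode on node3):

# on node1
start-dfs.sh

# on node2
start-yarn.sh

# then run jps on each node; expected daemons:
# node1: NameNode, DataNode, NodeManager
# node2: ResourceManager, DataNode, NodeManager
# node3: SecondaryNameNode, DataNode, NodeManager
jps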
