Overview and installation of Hadoop
- 1. Three core components inside Hadoop
- 2. The ecosystem that grew up around Hadoop
- 3. This guide focuses on the Apache distribution of Hadoop
- 4. Four modes of Hadoop installation
- 5. Hadoop pseudo-distribution installation process
- 6. Format HDFS cluster
- 7. Start HDFS and YARN
- 8. Fully distributed installation of Hadoop
Hadoop grew out of three Google papers (big data software generally must run 24/7 without downtime)
Hadoop solves the two core problems of big data: storing massive amounts of data, and computing over massive amounts of data
1. Three core components inside Hadoop
1. HDFS: Distributed file storage system
Applies a distributed design to solve the problem of storing massive data across machines
Composed of three core components
- NameNode: master node
- Stores metadata (directory structure) for the entire HDFS cluster
- Manages the entire HDFS cluster
- DataNode: data node/slave node
- Stores the actual data; a DataNode stores files as blocks
- SecondaryNameNode: the NameNode's assistant
- Helps the NameNode merge edit logs into the metadata image
2. YARN: Distributed resource scheduling system
Solves resource allocation and task monitoring for distributed computing programs
Mesos: a distributed resource management system (an alternative to YARN)
Composed of two core components
- ResourceManager: master node
- Manages the entire YARN cluster and is responsible for overall resource allocation.
- NodeManager: slave node
- The nodes that actually provide the resources
3. MapReduce: Distributed offline computing framework
Applies a distributed design to solve the problem of computing over massive data
4. Hadoop Common (shared utilities; a passing familiarity is enough)
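MapReduce's split→map→shuffle→reduce flow can be illustrated with an ordinary shell pipeline. This is a conceptual analogy only, not how Hadoop actually executes jobs:

```shell
# A word count in the MapReduce style, written as a plain shell pipeline.
counts=$(echo 'hdfs yarn hdfs mapreduce hdfs yarn' |
  tr ' ' '\n' |   # "map": emit one record (word) per line
  sort |          # "shuffle": bring identical keys together
  uniq -c)        # "reduce": aggregate each group into a count
echo "$counts"
```

In real MapReduce the map and reduce stages run in parallel across the DataNodes, but the data flow is the same shape.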
2. The ecosystem that grew up around Hadoop
Data collection and storage
Flume, Kafka, HBase, HDFS
Data cleaning and preprocessing
MapReduce, Spark
Statistical analysis
Hive, Pig
Data migration
Sqoop
Data visualization
ECharts
Distributed coordination
ZooKeeper
3. This guide focuses on the Apache distribution of Hadoop
Official website: https://hadoop.apache.org
Apache Hadoop release lines
- hadoop1.x
- hadoop2.x
- hadoop3.x
- hadoop3.1.4
4. Four modes of Hadoop installation
Within Hadoop, HDFS and YARN are each distributed systems, and both follow a master-slave architecture.
First: local (standalone) mode — only MapReduce is usable; HDFS and YARN are not. Rarely used.
Second: pseudo-distributed mode — all of the HDFS and YARN master and slave daemons run on a single node.
Third: fully distributed mode — the HDFS and YARN master and slave components are spread across different nodes.
Fourth: HA (high-availability) mode — like fully distributed, but with two or three additional master nodes, of which only one serves clients at any moment. Implemented with the help of ZooKeeper.
Configuration files to modify: hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-env.sh, mapred-site.xml, yarn-site.xml, yarn-env.sh, workers, log4j.properties, capacity-scheduler.xml, dfs.hosts, dfs.hosts.exclude
5. Hadoop pseudo-distribution installation process
1. Install the JDK on Linux first; Hadoop is implemented in Java.
- Environment variables can be configured in two main places:
/etc/profile: system-wide environment variables
~/.bash_profile: per-user environment variables
After editing, the file must be reloaded:
source <path to the environment file>
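As a concrete sketch, the JDK entries added to one of these files look like the following (the path matches the install location used later in this guide; adjust it to your own layout):

```shell
# JDK install path used in this guide -- adjust to your own location.
export JAVA_HOME=/opt/app/jdk1.8.0_371
export PATH=$PATH:$JAVA_HOME/bin

# After editing /etc/profile or ~/.bash_profile, reload it in the
# current shell:
#   source /etc/profile
echo "$JAVA_HOME"
```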
2. Configure the hostname-IP mapping for the current host and passwordless SSH login
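A minimal sketch of the passwordless-login setup. A temporary directory is used here purely for illustration; in practice the key pair lives in ~/.ssh, and `ssh-copy-id` pushes the public key to the target host:

```shell
# Generate an RSA key pair with an empty passphrase (illustrative
# directory; normally this is ~/.ssh).
KEYDIR=$(mktemp -d)
ssh-keygen -q -t rsa -N '' -f "$KEYDIR/id_rsa"
# In practice, authorize the key on the target host ("single" is the
# hostname used in this guide):
#   ssh-copy-id -i "$KEYDIR/id_rsa.pub" root@single
ls "$KEYDIR"
```

After `ssh-copy-id`, `ssh root@single` should log in without prompting for a password.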
3. Install the local version of Hadoop
- Upload: use Xftp to transfer hadoop-3.1.4.tar.gz (downloaded on Windows) to the /opt/software directory
- Unzip:
tar -zxvf hadoop-3.1.4.tar.gz -C /opt/app
- Configure environment variables
vim /etc/profile
export HADOOP_HOME=/opt/app/hadoop-3.1.4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source /etc/profile
4. Install the pseudo-distributed version of Hadoop
Only the various Hadoop configuration files need to be modified
- hadoop-env.sh configures the Java path
vim hadoop-env.sh
# line 54
export JAVA_HOME=/opt/app/jdk1.8.0_371
# line 58
export HADOOP_HOME=/opt/app/hadoop-3.1.4
# line 68
export HADOOP_CONF_DIR=/opt/app/hadoop-3.1.4/etc/hadoop
# append at the end of the file
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
- core-site.xml configures some common configuration items for HDFS and YARN
- Configure HDFS NameNode path
- Configure the file path for HDFS cluster storage
vim core-site.xml
<!-- Add the following inside the configuration tag -->
<configuration>
<!-- Address of the HDFS NameNode -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://single:9000</value>
</property>
<!-- Storage directory for files Hadoop produces at runtime (HDFS-related files) -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/app/hadoop-3.1.4/metaData</value>
</property>
<!-- Proxy-user settings for Hive integration -->
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
</configuration>
- hdfs-site.xml configures HDFS related components
- Configures the web UI addresses of the NameNode, DataNode, and SecondaryNameNode, etc.
vim hdfs-site.xml
<configuration>
<!-- Number of HDFS replicas -->
<property>
<name>dfs.replication</name>
<!-- number of copies of each block stored on the DataNodes -->
<value>1</value>
</property>
<!-- Disable HDFS user-permission checking -->
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>0.0.0.0:9870</value>
<!-- 50070 in Hadoop 2.x, 9870 in 3.x -->
</property>
<property>
<name>dfs.datanode.http-address</name>
<value>0.0.0.0:9864</value>
<!-- 50075 in Hadoop 2.x, 9864 in 3.x -->
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>0.0.0.0:9868</value>
<!-- 50090 in Hadoop 2.x, 9868 in 3.x -->
</property>
<!-- Metadata storage directories for the NameNode -->
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/app/hadoop-3.1.4/metaData/dfs/name1,/opt/app/hadoop-3.1.4/metaData/dfs/name2</value>
</property>
</configuration>
- mapred-env.sh configures paths to the software MR programs depend on at run time (Java, YARN)
vim mapred-env.sh
# last line
export JAVA_HOME=/opt/app/jdk1.8.0_371
- mapred-site.xml configures the MR program running environment
- Configure the MR program to run on YARN
vim mapred-site.xml
<!-- Run MR programs on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- Environment variables needed by the MR ApplicationMaster (required in Hadoop 3.x) -->
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<!-- Environment variables needed by the map stage (required in Hadoop 3.x) -->
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<!-- Environment variables needed by the reduce stage (required in Hadoop 3.x) -->
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>250</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx250M</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>300</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx300M</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>single:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>single:19888</value>
</property>
- yarn-env.sh configures the component path associated with YARN
vim yarn-env.sh
# last line
export JAVA_HOME=/opt/app/jdk1.8.0_371
- yarn-site.xml configures YARN related components
- Configure the web access paths of RM and NM, etc.
vim yarn-site.xml
<!-- How reducers obtain data (the shuffle service) -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Address of YARN's ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<!-- the host the RM component is installed on -->
<value>single</value>
</property>
<property>
<name>yarn.application.classpath</name>
<!-- classpath entries YARN needs at runtime -->
<value>
/opt/app/hadoop-3.1.4/etc/hadoop,
/opt/app/hadoop-3.1.4/share/hadoop/common/*,
/opt/app/hadoop-3.1.4/share/hadoop/common/lib/*,
/opt/app/hadoop-3.1.4/share/hadoop/hdfs/*,
/opt/app/hadoop-3.1.4/share/hadoop/hdfs/lib/*,
/opt/app/hadoop-3.1.4/share/hadoop/mapreduce/*,
/opt/app/hadoop-3.1.4/share/hadoop/mapreduce/lib/*,
/opt/app/hadoop-3.1.4/share/hadoop/yarn/*,
/opt/app/hadoop-3.1.4/share/hadoop/yarn/lib/*
</value>
</property>
<!-- yarn.resourcemanager.webapp.address would set the RM web UI address -->
<!-- Enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- Retain logs for 7 days -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://single:19888/jobhistory/logs</value>
</property>
<!-- Disable YARN's virtual-memory limit check -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
- workers (slaves in Hadoop 2.x) configures the HDFS and YARN slave-node hosts
- That is, which nodes the DataNode and NodeManager daemons run on
vim workers
(replace the default localhost with single)
single
- log4j.properties - configure the log output directory during Hadoop operation
vim log4j.properties
# line 19
hadoop.log.dir=/opt/app/hadoop-3.1.4/logs
# directory for Hadoop's runtime log output
6. Format HDFS cluster
hdfs namenode -format
7. Start HDFS and YARN
- HDFS
- start-dfs.sh
If an error is reported, the fix is:
vim /etc/profile
# append at the end of the file (additionally required for Hadoop 3.x)
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
# then reload the file
source /etc/profile
- stop-dfs.sh
- Provides a web UI for monitoring the status of the entire HDFS cluster:
http://ip:9870 (Hadoop 3.x)
http://ip:50070 (Hadoop 2.x)
- YARN
- start-yarn.sh
- stop-yarn.sh
- Provides a web UI for monitoring the status of the entire YARN cluster:
http://ip:8088
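After both start scripts succeed, `jps` should show the NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager daemons on the single node. A small helper (hypothetical, not part of Hadoop) can check a `jps` listing for all five:

```shell
# Check a jps listing for the five daemons a pseudo-distributed node runs.
check_daemons() {
  for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    echo "$1" | grep -qw "$d" || { echo "missing: $d"; return 1; }
  done
  echo "all daemons running"
}
# On the live node this would be: check_daemons "$(jps)"
# Here a sample listing stands in for real jps output:
result=$(check_daemons '1 NameNode
2 DataNode
3 SecondaryNameNode
4 ResourceManager
5 NodeManager')
echo "$result"
```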
8. Fully distributed installation of Hadoop
1. Clone a virtual machine
The three virtual machines each need an IP address, hostname, host-IP mappings, passwordless SSH login, time-server installation and synchronization, and the yum repository switched to a domestic mirror source.
Install and synchronize the chrony time server
yum install -y chrony
Configure the main server first
vim /etc/chrony.conf
On line 7, add: allow 192.168.31.0/24
Then configure the two slave servers
vim /etc/chrony.conf
Delete the default server lines (lines 3-6), then add: server node1 iburst
Start service
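The slave-side edit above can be sketched with sed on a throwaway copy of the file (the sample content stands in for the real /etc/chrony.conf; node1 is this guide's time server):

```shell
# Work on a throwaway copy rather than the real /etc/chrony.conf.
CONF=$(mktemp)
printf 'server 0.pool.ntp.org iburst\nserver 1.pool.ntp.org iburst\n' > "$CONF"
sed -i '/^server /d' "$CONF"           # remove the default server lines
echo 'server node1 iburst' >> "$CONF"  # point at the master node
cat "$CONF"
# Then on every node: systemctl enable --now chronyd
```

`chronyc sources` on a slave should afterwards list node1 as its time source.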
2. Install JDK
Omitted here; see the earlier post if needed
3. Install Hadoop fully distributed
- hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<!-- SecondaryNameNode address -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>node3:9868</value>
</property>
<!-- Disable HDFS user-permission checking -->
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>true</value>
</property>
</configuration>
- yarn-site.xml
<configuration>
<!-- How reducers obtain data (the shuffle service) -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Address of YARN's ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node2</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>
/opt/app/hadoop-3.1.4/etc/hadoop,
/opt/app/hadoop-3.1.4/share/hadoop/common/*,
/opt/app/hadoop-3.1.4/share/hadoop/common/lib/*,
/opt/app/hadoop-3.1.4/share/hadoop/hdfs/*,
/opt/app/hadoop-3.1.4/share/hadoop/hdfs/lib/*,
/opt/app/hadoop-3.1.4/share/hadoop/mapreduce/*,
/opt/app/hadoop-3.1.4/share/hadoop/mapreduce/lib/*,
/opt/app/hadoop-3.1.4/share/hadoop/yarn/*,
/opt/app/hadoop-3.1.4/share/hadoop/yarn/lib/*
</value>
</property>
</configuration>
- mapred-site.xml
<!-- Run MR programs on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
A total of nine related files need to be configured, as in the pseudo-distributed install
Then send /opt/app on node1 to /opt on node2 and node3 nodes
scp -r /opt/app root@node2:/opt
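node3 needs the same copy, so a loop covers both slave nodes. Shown here as a dry run that only prints the commands; drop the `echo` wrapper to actually transfer:

```shell
# Dry run: build the copy commands for both slave nodes
# (hostnames node2/node3 follow this guide).
plan=$(for host in node2 node3; do
  echo "scp -r /opt/app root@$host:/opt"
done)
echo "$plan"
```

Remember that the /etc/profile entries for HADOOP_HOME also need to reach each node, either by repeating the edit there or by copying the file as well.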
4. Format HDFS
Format on the node where the NameNode is located (node1)
hdfs namenode -format
5. Start HDFS and YARN
1. HDFS is started with start-dfs.sh on the node where the NameNode is located (node1)
2. YARN is started with start-yarn.sh on the node where the ResourceManager is located (node2)