1. Preparation
1. Preliminary installation package
I have put all the packages needed on Baidu Netdisk:
link: https://pan.baidu.com/s/1NHxweoK7zYf5hqP1aLIHAw Extraction code: ip4c
hadoop-2.8.5.tar.gz, hbase-2.1.1-bin.tar.gz, apache-hive-2.3.4-bin.tar.gz, jdk-8u102-linux-x64.tar.gz, mysql-community-*.rpm, xshell, xftp, CentOS-7-x86_64-Minimal-1804.ISO, mysql-connector-java-8.0.13.jar
Note that MySQL is not strictly necessary: Hive now ships with Derby, so MySQL can be skipped, and the setup is actually easier without it. Xshell and Xftp are used to connect to the virtual machines, which is more convenient. For virtualization I use VMware; its installer is not uploaded here, so obtain it yourself.
2. Install the virtual machines
Install three minimal CentOS virtual machines, then install Xshell and Xftp on Windows. Search online for how to install them; it is very simple.
Static address setting
After installing the virtual machines, set each one's address to a static address. Otherwise the address will change periodically, which causes unnecessary trouble. To set a static address, first find out the gateway of the virtual network.
In VMware, click [Edit] ==> [Virtual Network Editor], select NAT mode, then open [NAT Settings] as shown in the figure. There you will find the subnet IP, subnet mask, and gateway IP; use them to set up the static IP.
As the root user, do the following:
vi /etc/sysconfig/network-scripts/ifcfg-ens33
Then modify it as follows:
TYPE="Ethernet"
PROXY_METHOD="none"
BROWSER_ONLY="no"
BOOTPROTO="none" # this is the line that was changed (it was "dhcp")
DEFROUTE="yes"
IPV4_FAILURE_FATAL="no"
IPV6INIT="yes"
IPV6_AUTOCONF="yes"
IPV6_DEFROUTE="yes"
IPV6_FAILURE_FATAL="no"
IPV6_ADDR_GEN_MODE="stable-privacy"
NAME="ens33"
UUID="006888a8-6385-4dda-b8e8-9d6f89b07a4f"
DEVICE="ens33"
ONBOOT="yes"
# The four lines below are newly added. IPADDR is the static IP we assign to the VM;
# each host needs a different one, and only the last octet should change,
# so set them to 129, 130, and 131 respectively.
IPADDR=192.168.208.129
GATEWAY=192.168.208.2
NETMASK=255.255.255.0
DNS1=192.168.208.2
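Before restarting the network it can help to sanity-check the file. The helper below is my own convenience sketch (not part of the original steps, and `check_ifcfg` is a hypothetical name); it verifies that the four added fields are present and that BOOTPROTO was switched off DHCP:

```shell
# Sanity-check an ifcfg file for the static-IP fields added above.
check_ifcfg() {
  local f="$1" key
  for key in IPADDR GATEWAY NETMASK DNS1; do
    # each of the four added keys must appear at the start of a line
    grep -q "^${key}=" "$f" || { echo "missing ${key}"; return 1; }
  done
  # BOOTPROTO must be "none", otherwise DHCP will override the static address
  grep -q '^BOOTPROTO="none"' "$f" || { echo 'BOOTPROTO should be "none"'; return 1; }
  echo "static config looks complete"
}
```

Usage: `check_ifcfg /etc/sysconfig/network-scripts/ifcfg-ens33`.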
Then restart the network service
service network restart
You can then use the ip addr command to view the local IP:
[fay@master ~]$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:0c:29:96:b6:0a brd ff:ff:ff:ff:ff:ff
inet 192.168.208.129/24 brd 192.168.208.255 scope global noprefixroute ens33
valid_lft forever preferred_lft forever
inet6 fe80::5914:336a:4dde:d580/64 scope link noprefixroute
valid_lft forever preferred_lft forever
You can see 192.168.208.129, the static IP just configured. Next, use Xshell to connect to the three hosts and Xftp to upload the pre-downloaded installation packages to one of them. That one will be the master, and the other two its workers, slave1 and slave2. Configure the /etc/hosts file on each host as follows:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.208.129 master
192.168.208.130 slave1
192.168.208.131 slave2
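If you would rather script this edit on each host, a small idempotent helper (my own sketch, not from the original steps; `add_host` is a hypothetical name) avoids duplicate lines when re-run:

```shell
# Append "ip name" to a hosts file only if that exact entry is not already there,
# so running the script repeatedly never creates duplicates.
add_host() {
  local ip="$1" name="$2" file="$3"
  grep -qE "^${ip}[[:space:]]+${name}$" "$file" \
    || printf '%s %s\n' "$ip" "$name" >> "$file"
}
```

Usage (as root): `for e in "192.168.208.129 master" "192.168.208.130 slave1" "192.168.208.131 slave2"; do add_host $e /etc/hosts; done`.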
Then create a user fay on each host, give that user ownership of the /opt directory with chown fay -R /opt, and switch to that user. The user can also be created while installing the virtual machine (if it is VMware, just follow the wizard); do not work directly as root. Everything that follows is installed under /opt.
Turn off the firewall
systemctl stop firewalld
systemctl disable firewalld
Time synchronization
The clocks of the three virtual machines may drift apart, which will affect HBase, so it is recommended to synchronize them. There are two methods, and they can be used together.
The first: configure the virtual machine
In VMware, open [Options] ==> [VMware Tools] ==> [Synchronize guest time with host] ==> [OK]
The second: install the ntp tools on the virtual machines
yum install -y ntpdate
cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
ntpdate -u ntp.api.bz
The second method is recommended here. If you often suspend the virtual machines, the first does not seem to help much, while the second can resynchronize the clock afterwards.
All subsequent operations are performed as the fay user.
Install java
Java is required:
su fay
tar -zxvf jdk-8u102-linux-x64.tar.gz -C /opt
Configure environment variables
vi ~/.bashrc
export JAVA_HOME=/opt/jdk1.8.0_102
export PATH=$PATH:$JAVA_HOME/bin
# exit the editor, then
source ~/.bashrc
# test whether java installed successfully
java -version
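To confirm that JAVA_HOME really points at the unpacked JDK, a quick check like the one below can help. This is my own hedged sketch (`check_java_home` is a hypothetical helper, not part of the tutorial):

```shell
# Verify that a JAVA_HOME-style path actually contains an executable bin/java.
check_java_home() {
  local home="${1:-$JAVA_HOME}"
  if [ -x "$home/bin/java" ]; then
    echo "JAVA_HOME ok: $home"
  else
    echo "no executable java under $home"
    return 1
  fi
}
```

Usage: `check_java_home /opt/jdk1.8.0_102`, or just `check_java_home` after sourcing ~/.bashrc.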
2. Hadoop
Keyless SSH login
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
Then try a passwordless ssh localhost; if it works, hand your public key to the two slaves:
ssh-copy-id fay@slave1
ssh-copy-id fay@slave2
Run the same commands on the other two slaves, handing each public key to the other two machines, so that all three virtual machines can ssh to each other without a password.
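The fan-out above can be wrapped in one loop per node. This is a sketch of my own (`copy_keys` is a hypothetical helper); with DRY_RUN=1 it only prints the commands, which is handy for confirming the host list before anything runs:

```shell
# Distribute this node's public key to every host given as an argument.
# Set DRY_RUN=1 to print the ssh-copy-id commands instead of executing them.
copy_keys() {
  local host
  for host in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "ssh-copy-id fay@${host}"
    else
      ssh-copy-id "fay@${host}"
    fi
  done
}
```

Usage on master: `copy_keys slave1 slave2`; on slave1: `copy_keys master slave2`; and so on.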
Then unpack Hadoop and enter its etc/hadoop directory to modify the configuration files:
tar -zxvf hadoop-2.8.5.tar.gz -C /opt
cd /opt/hadoop-2.8.5/etc/hadoop/
Modify core-site.xml. Of these xml files, some already exist and some only ship as a .template or .default version; in the latter case, copy the template to the filename given here and edit that.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/fay/tmp</value>
</property>
</configuration>
Modify hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.max.xcievers</name>
<value>4096</value>
<description>A DataNode has an upper limit on the number of files it serves concurrently; it should be at least 4096</description>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>master:9001</value>
</property>
<property> <!-- set to true so HDFS can be browsed in a browser at IP:port -->
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
Modify mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<!-- configure the actual hostname and port -->
<value>master:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master:19888</value>
</property>
</configuration>
Modify yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- how long to keep aggregated logs; 86400 seconds = one day -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>86400</value>
</property>
<property> <!-- address the ResourceManager exposes to clients -->
<name>yarn.resourcemanager.address</name>
<value>master:8032</value>
</property>
<property> <!-- address the ResourceManager exposes to ApplicationMasters -->
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property> <!-- address the ResourceManager exposes to NodeManagers -->
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8031</value>
</property>
<property> <!-- address the ResourceManager exposes to administrators -->
<name>yarn.resourcemanager.admin.address</name>
<value>master:8033</value>
</property>
<property> <!-- the ResourceManager web UI address, viewable in a browser -->
<name>yarn.resourcemanager.webapp.address</name>
<value>master:8088</value>
</property>
</configuration>
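With this many properties spread across files, it is handy to read one back to confirm it took. The helper below is my own assumption (`get_prop` is a hypothetical name) and relies on the one-tag-per-line layout used in the snippets above; a real XML parser would be more robust:

```shell
# Print the <value> of one named property from a Hadoop *-site.xml file.
# Assumes <name> and <value> each sit on their own line, as in this tutorial.
get_prop() {
  local name="$1" file="$2"
  awk -v n="$name" '
    $0 ~ "<name>" n "</name>" { found=1; next }   # remember we saw the name
    found && /<value>/ {                          # next <value> line is ours
      sub(/.*<value>/, ""); sub(/<\/value>.*/, ""); print; exit
    }' "$file"
}
```

Usage: `get_prop fs.defaultFS /opt/hadoop-2.8.5/etc/hadoop/core-site.xml`.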
Modify yarn-env.sh and hadoop-env.sh
# put this line where JAVA_HOME appears; sometimes these scripts simply
# refuse to pick up the JAVA_HOME environment variable you configured
export JAVA_HOME=/opt/jdk1.8.0_102
Modify the slaves file
# delete localhost
slave1
slave2
Add hadoop to the environment variable, modify ~/.bashrc
export JAVA_HOME=/opt/jdk1.8.0_102
export HADOOP_HOME=/opt/hadoop-2.8.5
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
then source ~/.bashrc
Setting up the other two machines is simple: just copy everything over directly
scp -r /opt fay@slave1:/
scp -r /opt fay@slave2:/
In theory this step just works, but if you did not grant fay ownership of /opt on the slaves beforehand you will hit permission errors, so fix the permissions first.
Then copy the environment variable file directly
scp ~/.bashrc fay@slave1:/home/fay/
scp ~/.bashrc fay@slave2:/home/fay/
Run source ~/.bashrc on both machines. Then, on the master node, format the NameNode:
hdfs namenode -format
Then start hadoop
start-dfs.sh
start-yarn.sh
Enter jps on the master node and you should see the following:
[fay@master hadoop-2.8.5]$ jps
35184 SecondaryNameNode
34962 NameNode
35371 ResourceManager
35707 Jps
Two slave nodes:
[fay@slave1 ~]$ jps
16289 Jps
16035 DataNode
16152 NodeManager
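To avoid eyeballing the jps output on every node, a listing can be checked against the daemon names each node should run. `check_daemons` is my own sketch, not a Hadoop tool:

```shell
# Read a jps listing on stdin and verify every expected daemon name appears.
check_daemons() {
  local listing missing="" d
  listing=$(cat)
  for d in "$@"; do
    # -w avoids e.g. SecondaryNameNode matching a check for NameNode
    printf '%s\n' "$listing" | grep -qw "$d" || missing="$missing $d"
  done
  if [ -z "$missing" ]; then
    echo "all daemons up"
  else
    echo "missing:$missing"
    return 1
  fi
}
```

Usage: on master, `jps | check_daemons NameNode SecondaryNameNode ResourceManager`; on a slave, `jps | check_daemons DataNode NodeManager`.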
At this point, Hadoop is basically installed. Of course, you may not be so lucky, and there may be errors; search online for solutions based on the error messages. If it succeeds on the first try, that only means I wrote it too well and you followed along carefully.
Test Hadoop: open 192.168.208.129:8088 in a browser on Windows. If the YARN web interface appears, YARN should be fine.
Of course, you should still run the MapReduce example that comes with Hadoop.
$ cd /opt/hadoop-2.8.5
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/fay
$ hdfs dfs -put etc/hadoop input
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar grep input output 'dfs[a-z.]+'
# if no Java errors were reported, it's OK; check the results in the output directory:
$ hdfs dfs -cat output/*
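For reference, the example Grep job searches the input files for matches of the regex dfs[a-z.]+ and counts their occurrences, much like running grep over the files. A local preview of what the pattern matches (my own illustration, not part of the job):

```shell
# The same pattern the example job uses, applied locally with grep -oE.
printf 'dfs.replication\ndfs.webhdfs.enabled\nyarn.nodemanager\n' | grep -oE 'dfs[a-z.]+'
# prints:
# dfs.replication
# dfs.webhdfs.enabled
```

The third line produces no match because it does not contain "dfs"; the job's output is these matched strings with their counts.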
Okay, if all of the above works, Hadoop itself should be fine. The installation of HBase and Hive will be covered in a later update.
For the installation and configuration of HBase and Hive, see the follow-up: The most detailed Hadoop+HBase+Hive fully distributed environment setup tutorial (2)