Detailed steps to build a big data environment (installation and configuration of Hadoop, Hive, Zookeeper, Kafka, Flume, Hbase, Spark, etc.)

Big data environment installation and configuration (Hadoop 2.7.7, Hive 2.3.4, ZooKeeper 3.4.10, Kafka 2.1.0, Flume 1.8.0, HBase 2.1.1, Spark 2.4.0, etc.)

 

Preface: This article walks through the basic steps for building, on top of Hadoop, the various environments you may need, including Hadoop, Hive, ZooKeeper, Kafka, Flume, HBase, Spark, and so on. You will probably not need all of them in practice, so pick the parts that fit your needs.
Note: Some of these components depend on each other, so pay attention to the order in which you set them up and start them. For example, Hive relies on Hadoop, so the Hadoop cluster must be built and started before Hive is built and used; HBase depends on both Hadoop and ZooKeeper, so those two clusters must be built and started before HBase. Be careful!
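For orientation, once everything below is installed, a typical start-up order on this cluster could look like the sketch below. Treat it only as a reminder of the dependencies; each command is explained in its own section later, and the paths assume the layout used in this article:

# Rough start-up order (sketch only)
# 1. Hadoop (HDFS + YARN) on the master node
[root@master ~]# /usr/local/hadoop/hadoop-2.7.7/sbin/start-dfs.sh
[root@master ~]# /usr/local/hadoop/hadoop-2.7.7/sbin/start-yarn.sh
# 2. ZooKeeper on every node
[root@master ~]# zkServer.sh start
# 3. Services that depend on the above, e.g. HBase (needs HDFS and ZooKeeper) or Kafka (needs ZooKeeper)
[root@master ~]# /usr/local/hbase/hbase-2.1.1/bin/start-hbase.sh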
In addition, for readers' convenience, so that you don't have to hunt down each installation package yourself, I have uploaded all the packages used in this article to Baidu Cloud
(Baidu Cloud link: https://pan.baidu.com/s/1jKgua2U1yacbrbgQk4-wFg, extraction code: r93h).
You can download everything to your host in one go and then upload it to your own virtual machines (WinSCP works well for copying files from the host into a VM).

System information

  • System: CentOS 7.6
  • Node information:
    node     ip
    master   192.168.185.150
    slave1   192.168.185.151
    slave2   192.168.185.152

Detailed setup steps

1. Node basic configuration

1. Configure the network on each node

# Note: since CentOS 7 the default NIC name is ens33 rather than the eth0 I use here. I am used to eth0, so I renamed the interface during installation. If your CentOS NIC is called ens33, that is fine: just replace every eth0 below with ens33; it makes no difference to the later steps.

[root@master ~]# vim /etc/sysconfig/network-scripts/ifcfg-eth0
TYPE="Ethernet"
BOOTPROTO="static"
NAME="eth0"
DEVICE="eth0"
ONBOOT="yes"
IPADDR=192.168.185.150
NETMASK=255.255.255.0
GATEWAY=192.168.185.2

[root@master ~]# vim /etc/resolv.conf
nameserver 192.168.185.2

# Do the same on the other two slave nodes; only the IPADDR value differs, fill in the IP of the corresponding node

2. Modify each node's hostname and add the hostname mappings

# On the other two child nodes, put slave1 and slave2 respectively in the hostname file
[root@master ~]# vim /etc/hostname
master

[root@master ~]# vim /etc/hosts
192.168.185.150 master
192.168.185.151 slave1
192.168.185.152 slave2
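After saving /etc/hosts on all three nodes, an optional quick check is to ping each node by its hostname to confirm that the mappings resolve (a sanity check only, not required):

# Optional: confirm the hostname mappings work
[root@master ~]# ping -c 2 slave1
[root@master ~]# ping -c 2 slave2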

3. Turn off the firewall and SELinux

# Do this on all three nodes

# Set the SELINUX value to disabled
[root@master ~]# vim /etc/selinux/config
SELINUX=disabled

[root@master ~]# systemctl stop firewalld
[root@master ~]# systemctl disable firewalld
[root@master ~]# systemctl status firewalld

4. Reboot for the changes to take effect

[root@master ~]# reboot
[root@master ~]# ping www.baidu.com

# Note: if pinging Baidu fails after the reboot, it is probably because the nameserver was overwritten automatically on reboot; if so, just rewrite resolv.conf as above
[root@master ~]# vim /etc/resolv.conf
nameserver 192.168.185.2

# It should work now; try pinging Baidu again
[root@master ~]# ping www.baidu.com
PING www.a.shifen.com (119.75.217.109) 56(84) bytes of data.
64 bytes from 119.75.217.109: icmp_seq=1 ttl=128 time=30.6 ms
64 bytes from 119.75.217.109: icmp_seq=2 ttl=128 time=30.9 ms
64 bytes from 119.75.217.109: icmp_seq=3 ttl=128 time=30.9 ms

5. Configure SSH password-free login between nodes

[root@master ~]# ssh-keygen -t rsa
# For the command above, just press Enter at every prompt

# Copy the key to all three nodes
[root@master ~]# ssh-copy-id master
[root@master ~]# ssh-copy-id slave1
[root@master ~]# ssh-copy-id slave2

# After finishing on the master node, repeat the above steps on the other two nodes

After everything is done, use the ssh command to test logging in between the nodes:

[root@master ~]# ssh slave1
# You will find that you have logged in to slave1 from the master node without a password; type logout to leave slave1


6. Install java

# From here on, all environment packages go under /usr/local/

# Create a java directory and copy the downloaded JDK tarball into it (you can download it directly inside CentOS, or download it on your host and upload it to the CentOS VM)
[root@master ~]# cd /usr/local
[root@master local]# mkdir java
[root@master local]# cd java
[root@master java]# tar -zxvf jdk-8u191-linux-x64.tar.gz 

# Configure the environment variables: append the Java variables at the end of the profile file
[root@master ~]# vim /etc/profile
export JAVA_HOME=/usr/local/java/jdk1.8.0_191
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin

[root@master ~]# source /etc/profile
[root@master ~]# java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)

# Repeat the above steps on the other two nodes

So far, the basic configuration is over.

2. Hadoop installation and configuration

– Introduction:
Hadoop is a distributed system infrastructure developed by the Apache Foundation. The core design of the Hadoop framework is: HDFS and MapReduce. HDFS provides storage for massive amounts of data, while MapReduce provides calculations for massive amounts of data.
HDFS, the Hadoop Distributed File System, is a distributed file system that stores files across all storage nodes in a Hadoop cluster. It consists of a NameNode and a number of DataNodes. The NameNode provides metadata services for HDFS: it manages the file system namespace, controls access by external clients, and decides how files are mapped to DataNodes. The DataNodes provide the storage blocks for HDFS and serve read and write requests from HDFS clients.
MapReduce is a programming model for parallel processing of large-scale data sets. Its core ideas are "Map" and "Reduce": a user-specified Map function transforms a set of key-value pairs into a set of intermediate key-value pairs, and a user-specified Reduce function then merges all intermediate values that share the same key.

1. Download and unzip

# Create a hadoop folder under /usr/local, upload the downloaded hadoop-2.7.7 tarball into it and extract it
[root@master ~]# cd /usr/local
[root@master local]# mkdir hadoop
[root@master local]# cd hadoop
[root@master hadoop]# tar -zxvf hadoop-2.7.7.tar.gz

2. Configure environment variables

[root@master hadoop]# vim /etc/profile
export JAVA_HOME=/usr/local/java/jdk1.8.0_191
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

[root@master hadoop]# source /etc/profile

3. Configure core-site.xml

# The configuration files live under hadoop-2.7.7/etc/hadoop
[root@master hadoop]# cd hadoop-2.7.7/etc/hadoop

# Modify the <configuration> block of this file as follows
[root@master hadoop]# vim core-site.xml
<configuration>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/data</value>
</property>
</configuration>

# The /usr/local/data path in the configuration above is used to store temporary files, so this folder must be created manually
[root@master hadoop]# mkdir /usr/local/data

4. Configure hdfs-site.xml

[root@master hadoop]# vim hdfs-site.xml
<configuration>
<property>
  <name>dfs.name.dir</name>
  <value>/usr/local/data/namenode</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/usr/local/data/datanode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
</configuration>

5. Configure mapred-site.xml

# Rename the file first
[root@master hadoop]# mv mapred-site.xml.template mapred-site.xml

[root@master hadoop]# vim mapred-site.xml
<configuration>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
</configuration>

6. Configure yarn-site.xml

[root@master hadoop]# vim yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>master</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>              
  <value>mapreduce_shuffle</value>     
</property>
</configuration>

7. Modify slaves

[root@master hadoop]# vim slaves
slave1
slave2

8. Modify the hadoop-env.sh file

# On the line starting with "export JAVA_HOME=", change the Java path to your own
[root@master hadoop]# vim hadoop-env.sh
export JAVA_HOME=/usr/local/java/jdk1.8.0_191

9. Copy the configured hadoop directory to the same location on the other two child nodes

[root@master hadoop]# cd /usr/local
[root@master local]# scp -r hadoop [email protected]:/usr/local/
[root@master local]# scp -r hadoop [email protected]:/usr/local/

10. Don't forget the steps on the other two child nodes

# Don't forget: the data directory must also be created under /usr/local/ on both child nodes.

# Don't forget: repeat step 2 on both child nodes to configure the hadoop environment variables.

11. Test whether everything works

# Start everything from the master node only; the process may be a little slow, so be patient

# Format the NameNode first
[root@master ~]# hdfs namenode -format

# Start HDFS
[root@master ~]# cd /usr/local/hadoop/hadoop-2.7.7/
[root@master hadoop-2.7.7]# sbin/start-dfs.sh

# Start YARN
[root@master hadoop-2.7.7]# sbin/start-yarn.sh

Run the jps command on the master node; if you see processes such as NameNode, SecondaryNameNode, and ResourceManager, the master side is up. Run jps on the two child nodes; they should show DataNode and NodeManager. Finally, open the web UI in a browser: http://192.168.185.150:50070
So far, the Hadoop configuration is done.
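As an optional end-to-end check of HDFS and YARN, you can run the wordcount example that ships with Hadoop. This is just a sketch, using core-site.xml as sample input and assuming the paths used in this article:

# Optional: run the bundled wordcount job on a small input file
[root@master hadoop-2.7.7]# hdfs dfs -mkdir /input
[root@master hadoop-2.7.7]# hdfs dfs -put etc/hadoop/core-site.xml /input
[root@master hadoop-2.7.7]# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount /input /output
# View the word counts produced by the job
[root@master hadoop-2.7.7]# hdfs dfs -cat /output/*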

3. Hive installation and configuration

– Introduction:

Hive is a data warehouse tool built on top of Hadoop. It maps structured data files to database tables, provides simple SQL-like query functionality, and translates SQL statements into MapReduce jobs for execution. Its advantage is a low learning curve: simple MapReduce statistics can be produced quickly with HiveQL, a SQL-like language, without developing dedicated MapReduce applications, which makes it very suitable for statistical analysis on a data warehouse. At the same time, developers familiar with MapReduce can plug in custom mappers and reducers to handle complex analysis tasks that the built-in ones cannot cover.
Hive has no special data format of its own. All Hive data is stored in a Hadoop-compatible file system (such as HDFS). Hive does not modify the data while loading it; it simply moves the files into the directory Hive manages in HDFS. As a consequence, Hive does not support rewriting or appending to individual records; the data is fixed at load time.

1. Environment configuration

# Note: Hive only needs to be installed and configured on the master node

[root@master ~]# cd /usr/local
[root@master local]# mkdir hive
[root@master local]# cd hive
[root@master hive]# tar -zxvf apache-hive-2.3.4-bin.tar.gz 
[root@master hive]# mv apache-hive-2.3.4-bin hive-2.3.4

# Add the Hive environment variables
[root@master hive]# vim /etc/profile               
export JAVA_HOME=/usr/local/java/jdk1.8.0_191
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HIVE_HOME=/usr/local/hive/hive-2.3.4
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin

[root@master hive]# source /etc/profile

2. Modify hive-site.xml

[root@master hive]# cd hive-2.3.4/conf
[root@master conf]# mv hive-default.xml.template   hive-site.xml

# In hive-site.xml, find the properties with the following names and change their values
# A reminder: hive-site.xml has several thousand lines, so searching for each property by name is inconvenient. Two suggestions:
# 1. Copy the xml file to your host, edit it there with a tool such as Notepad++, and upload it back to the corresponding location on CentOS
# 2. The Baidu Cloud link given earlier also contains an already-edited hive-site.xml; if your versions match mine, you can use it directly

[root@master conf]# vim hive-site.xml 

 <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://master:3306/hive_metadata?createDatabaseIfNotExist=true</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
 </property>
 
 <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
 </property>
  
 <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>Username to use against metastore database</description>
 </property>
    
 <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
    <description>password to use against metastore database</description>
 </property>

 <property>
    <name>hive.querylog.location</name>
    <value>/usr/local/hive/hive-2.3.4/tmp/hadoop</value>
    <description>Location of Hive run time structured log file</description>
  </property>
 
  <property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/usr/local/hive/hive-2.3.4/tmp/hadoop/operation_logs</value>
    <description>Top level directory where operation logs are stored if logging functionality is enabled</description>
  </property>
  
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/usr/local/hive/hive-2.3.4/tmp/hadoop</value>
    <description>Local scratch space for Hive jobs</description>
  </property>
  
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/usr/local/hive/hive-2.3.4/tmp/${hive.session.id}_resources</value>
    <description>Temporary local directory for added resources in the remote file system.</description>
  </property>
  
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
    <description>
      Enforce metastore schema version consistency.
      True: Verify that version information stored in is compatible with one from Hive jars.  Also disable automatic
            schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
            proper metastore schema migration. (Default)
      False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
    </description>
  </property>
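If you do edit hive-site.xml directly on the VM, one convenient way to locate each property in the several-thousand-line file is to look up its line number with grep before opening vim; a small sketch:

# Optional: find the line numbers of the properties to change
[root@master conf]# grep -n 'javax.jdo.option.ConnectionURL' hive-site.xml
[root@master conf]# grep -n 'hive.exec.local.scratchdir' hive-site.xml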

3. Modify the hive-env.sh file

[root@master conf]# mv hive-env.sh.template hive-env.sh

# Find the following lines and make the corresponding changes
[root@master conf]# vim hive-env.sh 

# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7

# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=/usr/local/hive/hive-2.3.4/conf

# Folder containing extra libraries required for hive compilation/execution can be controlled by:
# export HIVE_AUX_JARS_PATH=
export JAVA_HOME=/usr/local/java/jdk1.8.0_191
export HIVE_HOME=/usr/local/hive/hive-2.3.4

4. Copy the downloaded mysql-connector-java.jar into /usr/local/hive/hive-2.3.4/lib/ (the jar is also included in the Baidu Cloud link)

5. Install and configure MySQL (Hive's metadata is stored in MySQL)

[root@master ~]# cd /usr/local/src/
[root@master src]# wget http://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
[root@master src]# rpm -ivh mysql-community-release-el7-5.noarch.rpm
[root@master src]# yum install mysql-community-server

# This takes a while, be patient...

# After the installation completes, restart the service
[root@master src]# service mysqld restart
[root@master src]# mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.6.42 MySQL Community Server (GPL)
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>

# MySQL installed successfully

6. Create the Hive metadata database in MySQL, create a hive account, and grant privileges

# Execute the following commands in MySQL, one after another:
# create database if not exists hive_metadata;
# grant all privileges on hive_metadata.* to 'hive'@'%' identified by 'hive';
# grant all privileges on hive_metadata.* to 'hive'@'localhost' identified by 'hive';
# grant all privileges on hive_metadata.* to 'hive'@'master' identified by 'hive';
# flush privileges;
# use hive_metadata;

[root@master src]# mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.6.42 MySQL Community Server (GPL)
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create database if not exists hive_metadata;
Query OK, 1 row affected (0.00 sec)

mysql> grant all privileges on hive_metadata.* to 'hive'@'%' identified by 'hive';
Query OK, 0 rows affected (0.00 sec)

mysql> grant all privileges on hive_metadata.* to 'hive'@'localhost' identified by 'hive';
Query OK, 0 rows affected (0.00 sec)

mysql> grant all privileges on hive_metadata.* to 'hive'@'master' identified by 'hive';
Query OK, 0 rows affected (0.00 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

mysql> use hive_metadata;
Database changed
mysql> exit
Bye

7. Initialization

[root@master src]# schematool -dbType mysql -initSchema  
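If the initialization succeeds, the metastore tables should now exist in MySQL; an optional way to confirm this (using the hive/hive account created in step 6) is:

# Optional: confirm that schematool created the metastore tables
[root@master src]# mysql -u hive -phive -e "use hive_metadata; show tables;"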

8. Test and verify hive

# First create a txt file with some data that we will load into Hive shortly
[root@master src]# vim users.txt
1,浙江工商大学
2,杭州
3,I love
4,ZJGSU
5,加油哦

# Enter hive; if the prompt appears, the earlier setup was successful
[root@master src]# hive
hive>

# Create the users table; "row format delimited fields terminated by ','" means the file we are about to load uses a comma to separate fields,
# which is why the fields in users.txt above are separated by commas
hive> create table users(id int, name string) row format delimited fields terminated by ',';
OK
Time taken: 7.29 seconds

# Load the data
hive> load data local inpath '/usr/local/src/users.txt' into table users;
Loading data to table default.users
OK
Time taken: 1.703 seconds

# Query
hive> select * from users;
OK
1       浙江工商大学
2       杭州
3       I love
4       ZJGSU
5       加油哦
Time taken: 2.062 seconds, Fetched: 5 row(s)

# OK, the test succeeded!
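As mentioned in the introduction, Hive simply moves loaded files into its warehouse directory on HDFS. If you are curious, you can check this; the path below assumes hive.metastore.warehouse.dir was left at its default value:

# Optional: users.txt should now sit under Hive's warehouse directory in HDFS
[root@master src]# hdfs dfs -ls /user/hive/warehouse/users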

So far, the Hive configuration is done. Hive setup is admittedly rather tedious; take it slowly and don't rush.

4. ZooKeeper installation and configuration

– Introduction:
ZooKeeper is a distributed application coordination service and an important component of Hadoop and Hbase. It is a software that provides consistent services for distributed applications. The functions provided include: configuration maintenance, domain name services, distributed synchronization, group services, etc. Its goal is to encapsulate key services that are complex and error-prone, and provide users with simple and easy-to-use interfaces and systems with high performance and stable functions.
So what can ZooKeeper do? A simple example: suppose we have 20 search engine servers (each responsible for part of the search against the overall index), a master server (responsible for distributing search requests to those 20 servers and merging the result sets), a standby master server (which takes over when the master goes down), and a web CGI front end (which sends search requests to the master server). Fifteen of the search engine servers are serving searches while five are building indexes, and servers frequently switch between the two roles. ZooKeeper lets the master server automatically discover how many servers are currently serving searches and send requests only to them, and it automatically activates the standby master when the master server goes down.

1. Environment configuration

[root@master local]# mkdir zookeeper
[root@master local]# cd zookeeper

# Upload the downloaded zookeeper tarball here and extract it
[root@master zookeeper]# tar -zxvf zookeeper-3.4.10.tar.gz 

# Configure the zookeeper environment variables
[root@master zookeeper]# vim /etc/profile
export JAVA_HOME=/usr/local/java/jdk1.8.0_191
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HIVE_HOME=/usr/local/hive/hive-2.3.4
export ZOOKEEPER_HOME=/usr/local/zookeeper/zookeeper-3.4.10
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$ZOOKEEPER_HOME/bin

[root@master zookeeper]# source /etc/profile

2. Configure the zoo.cfg file

[root@master zookeeper]# cd zookeeper-3.4.10/conf
[root@master conf]# mv zoo_sample.cfg zoo.cfg

# Change the dataDir line to your own path, and append the three server lines at the end of the file
[root@master conf]# vim zoo.cfg 

dataDir=/usr/local/zookeeper/zookeeper-3.4.10/data

server.0=master:2888:3888 
server.1=slave1:2888:3888 
server.2=slave2:2888:3888

3. Configure the myid file

[root@master conf]# cd ..
[root@master zookeeper-3.4.10]# mkdir data
[root@master zookeeper-3.4.10]# cd data
[root@master data]# vim myid
0

4. Configure the other two nodes

# Copy the zookeeper folder configured above directly to both child nodes
[root@master data]# cd ../../..
[root@master local]# scp -r zookeeper [email protected]:/usr/local/
[root@master local]# scp -r zookeeper [email protected]:/usr/local/

# Note: on the two child nodes, replace the 0 in the myid file with 1 and 2 respectively

# Note: on the two child nodes, configure the zookeeper environment variables in /etc/profile as in step 1, and don't forget to source the file after saving

5. Test it

# On each of the three nodes, start the service: zkServer.sh start

# On each of the three nodes, check the status: zkServer.sh status
# Expected result: one of the three nodes is the leader and the other two are followers

# On each of the three nodes, run: jps
# and check that the QuorumPeerMain process is present on all three nodes
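Beyond jps, an optional functional check is to connect with the ZooKeeper command-line client and create, read, and delete a test znode; a small sketch (the znode name /zk_test is arbitrary):

# Optional: exercise the ensemble through the ZooKeeper CLI
[root@master ~]# zkCli.sh -server master:2181
[zk: master:2181(CONNECTED) 0] create /zk_test "hello"
[zk: master:2181(CONNECTED) 1] get /zk_test
[zk: master:2181(CONNECTED) 2] delete /zk_test
[zk: master:2181(CONNECTED) 3] quit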

So far, the ZooKeeper configuration is done; this part should not be difficult.

5. Kafka installation and configuration

– Introduction:
Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the activity stream data of a consumer-scale website. A Producer sends messages to the Kafka cluster; before sending, each message is assigned to a category, called a Topic, so that consumers only need to pay attention to the topics they care about. A Consumer maintains a long-lived connection to the Kafka cluster, continuously pulls messages from it, and then processes them.

1. Install Scala

Kafka is written in Scala and Java, so we first need to install and configure Scala:

[root@master ~]# cd /usr/local
[root@master local]# mkdir scala
[root@master local]# cd scala/
# Upload the downloaded scala tarball here and extract it
[root@master scala]# tar -zxvf scala-2.11.8.tgz

# Configure the environment variables
[root@master scala]# vim /etc/profile
export JAVA_HOME=/usr/local/java/jdk1.8.0_191
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HIVE_HOME=/usr/local/hive/hive-2.3.4
export ZOOKEEPER_HOME=/usr/local/zookeeper/zookeeper-3.4.10
export SCALA_HOME=/usr/local/scala/scala-2.11.8
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$ZOOKEEPER_HOME/bin:$SCALA_HOME/bin

[root@master scala]# source /etc/profile

# Verify
[root@master scala-2.11.8]# scala -version
Scala code runner version 2.11.8 -- Copyright 2002-2018, LAMP/EPFL and Lightbend, Inc.

# Then repeat the above steps on the remaining two child nodes!

2. Install and configure Kafka

# Create the directory, then upload the downloaded tarball and extract it
[root@master local]# mkdir kafka
[root@master local]# cd kafka
[root@master kafka]# tar -zxvf kafka_2.11-2.1.0.tgz 
[root@master kafka]# mv kafka_2.11-2.1.0 kafka-2.1.0

# Configure the environment variables
[root@master kafka]# vim /etc/profile
export JAVA_HOME=/usr/local/java/jdk1.8.0_191
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HIVE_HOME=/usr/local/hive/hive-2.3.4
export ZOOKEEPER_HOME=/usr/local/zookeeper/zookeeper-3.4.10
export KAFKA_HOME=/usr/local/kafka/kafka-2.1.0
export SCALA_HOME=/usr/local/scala/scala-2.11.8
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$ZOOKEEPER_HOME/bin:$KAFKA_HOME/bin:$SCALA_HOME/bin

[root@master kafka]# source /etc/profile

# Modify the server.properties file: find the corresponding entries and change them as follows
[root@master kafka]# vim kafka-2.1.0/config/server.properties
broker.id=0
listeners=PLAINTEXT://192.168.185.150:9092
advertised.listeners=PLAINTEXT://192.168.185.150:9092
zookeeper.connect=192.168.185.150:2181,192.168.185.151:2181,192.168.185.152:2181

# Copy the whole kafka folder configured on the master node to the other two child nodes
[root@master kafka]# cd /usr/local
[root@master local]# scp -r kafka [email protected]:/usr/local/
[root@master local]# scp -r kafka [email protected]:/usr/local/

# On the other two nodes, a few entries in server.properties must be changed:
# broker.id: change to 1 and 2 respectively
# listeners: change the IP to that of the corresponding child node, i.e. PLAINTEXT://192.168.185.151:9092 and PLAINTEXT://192.168.185.152:9092
# advertised.listeners: likewise change the IP to that of the corresponding child node, i.e. PLAINTEXT://192.168.185.151:9092 and PLAINTEXT://192.168.185.152:9092
# zookeeper.connect: no change needed
# Don't forget to configure the kafka environment variables on the other two nodes as well

3. Test

# Start kafka on all three nodes
[root@master local]# cd kafka/kafka-2.1.0/
[root@master kafka-2.1.0]# nohup kafka-server-start.sh /usr/local/kafka/kafka-2.1.0/config/server.properties & 

# Create the topic TestTopic on the master node
[root@master kafka-2.1.0]# kafka-topics.sh --zookeeper 192.168.185.150:2181,192.168.185.151:2181,192.168.185.152:2181 --topic TestTopic --replication-factor 1 --partitions 1 --create

# Start a producer on the master node
[root@master kafka-2.1.0]# kafka-console-producer.sh --broker-list 192.168.185.150:9092,192.168.185.151:9092,192.168.185.152:9092 --topic TestTopic

# Start a consumer on each of the other two nodes
[root@slave1 kafka-2.1.0]# kafka-console-consumer.sh --bootstrap-server 192.168.185.151:9092 --topic TestTopic --from-beginning
[root@slave2 kafka-2.1.0]# kafka-console-consumer.sh --bootstrap-server 192.168.185.152:9092 --topic TestTopic --from-beginning

# Type anything at the producer prompt on the master node:
> hello world

# You will then see the same line appear on the two consumer nodes, i.e. the data has been consumed
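You can also list the topics and inspect the partition/replica placement of TestTopic to confirm the cluster state; an optional sketch using the same ZooKeeper connection string as above:

# Optional: list topics and describe TestTopic
[root@master kafka-2.1.0]# kafka-topics.sh --zookeeper 192.168.185.150:2181,192.168.185.151:2181,192.168.185.152:2181 --list
[root@master kafka-2.1.0]# kafka-topics.sh --zookeeper 192.168.185.150:2181,192.168.185.151:2181,192.168.185.152:2181 --describe --topic TestTopic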

So far, the kafka configuration is over.

6. Flume installation and configuration

– Introduction:
Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and moving large volumes of log data, originally provided by Cloudera. Flume lets you customize data senders inside a logging system to collect data, and it can do simple processing on the data before writing it to any of a variety of (customizable) data receivers. Flume supports collecting data from many kinds of sources, such as console, RPC (Thrift-RPC), text files, tail (UNIX tail), syslog (supporting both TCP and UDP), and exec (command execution).
Using Flume, we can quickly move data gathered from multiple servers into Hadoop, for example efficiently storing log information collected from many web servers in HDFS/HBase.

Note: Flume only needs to be configured on the master node, not on the other nodes

1. Environment configuration

# Create the directory, then upload the downloaded tarball and extract it
[root@master local]# mkdir flume
[root@master local]# cd flume/
[root@master flume]# tar -zxvf apache-flume-1.8.0-bin.tar.gz 
[root@master flume]# mv apache-flume-1.8.0-bin flume-1.8.0

# Environment variables
[root@master flume]# vim /etc/profile
export JAVA_HOME=/usr/local/java/jdk1.8.0_191
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HIVE_HOME=/usr/local/hive/hive-2.3.4
export ZOOKEEPER_HOME=/usr/local/zookeeper/zookeeper-3.4.10
export KAFKA_HOME=/usr/local/kafka/kafka-2.1.0
export SCALA_HOME=/usr/local/scala/scala-2.11.8
export FLUME_HOME=/usr/local/flume/flume-1.8.0
export FLUME_CONF_DIR=$FLUME_HOME/conf
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$ZOOKEEPER_HOME/bin:$KAFKA_HOME/bin:$SCALA_HOME/bin:$FLUME_HOME/bin

[root@master flume]# source /etc/profile

2. Modify the flume-conf.properties file

[root@master flume]# cd flume-1.8.0/conf
[root@master conf]# mv flume-conf.properties.template flume-conf.properties

# Append the following at the end of the file
[root@master conf]# vim flume-conf.properties 
# agent1 is the name of the agent
agent1.sources=source1
agent1.sinks=sink1
agent1.channels=channel1
# configure source1
agent1.sources.source1.type=spooldir
agent1.sources.source1.spoolDir=/usr/local/flume/logs
agent1.sources.source1.channels=channel1
agent1.sources.source1.fileHeader = false
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = timestamp
# configure channel1
agent1.channels.channel1.type=file
agent1.channels.channel1.checkpointDir=/usr/local/flume/logs_tmp_cp
agent1.channels.channel1.dataDirs=/usr/local/flume/logs_tmp
# configure sink1
agent1.sinks.sink1.type=hdfs
agent1.sinks.sink1.hdfs.path=hdfs://master:9000/logs
agent1.sinks.sink1.hdfs.fileType=DataStream
agent1.sinks.sink1.hdfs.writeFormat=TEXT
agent1.sinks.sink1.hdfs.rollInterval=1
agent1.sinks.sink1.channel=channel1
agent1.sinks.sink1.hdfs.filePrefix=%Y-%m-%d


# As the configuration above shows, the folder watched by agent1.sources.source1.spoolDir is /usr/local/flume/logs, so we have to create it manually
[root@master conf]# cd ../..
[root@master flume]# mkdir logs

# The configuration also sets agent1.sinks.sink1.hdfs.path=hdfs://master:9000/logs, i.e. files detected under /usr/local/flume/logs are uploaded automatically to /logs in HDFS, so that HDFS directory also has to be created manually
[root@master flume]# hdfs dfs -mkdir /logs 

3. Test

# Start the service
[root@master flume]# flume-ng agent -n agent1 -c conf -f /usr/local/flume/flume-1.8.0/conf/flume-conf.properties -Dflume.root.logger=DEBUG,console

# First check the /logs directory in HDFS; there is nothing there yet
[root@master flume]# hdfs dfs -ls -R /


# Create an arbitrary file under /usr/local/flume/logs
[root@master flume]# cd logs
[root@master logs]# vim flume_test.txt
hello world !
guang
浙江工商大学

# We then find that the file we just created has been uploaded automatically under /logs in HDFS
[root@master logs]# hdfs dfs -ls -R /


[root@master logs]# hdfs dfs -cat  /logs/2018-12-31.1546242551842
hello world !
guang
浙江工商大学

So far, the flume configuration is over.

7. HBase installation and configuration

– Introduction:
HBase (Hadoop Database) is a highly reliable, high-performance, column-oriented, scalable distributed storage system. Hadoop HDFS provides highly reliable underlying storage for HBase, Hadoop MapReduce provides high-performance computing capabilities, and ZooKeeper provides stable coordination services and a failover mechanism.

1. Environment configuration

# Create the directory, then upload the downloaded tarball and extract it
[root@master local]# mkdir hbase
[root@master local]# cd hbase
[root@master hbase]# tar -zxvf hbase-2.1.1-bin.tar.gz

# Environment variables
[root@master hbase]# vim /etc/profile
export JAVA_HOME=/usr/local/java/jdk1.8.0_191
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HIVE_HOME=/usr/local/hive/hive-2.3.4
export ZOOKEEPER_HOME=/usr/local/zookeeper/zookeeper-3.4.10
export KAFKA_HOME=/usr/local/kafka/kafka-2.1.0
export SCALA_HOME=/usr/local/scala/scala-2.11.8
export FLUME_HOME=/usr/local/flume/flume-1.8.0
export FLUME_CONF_DIR=$FLUME_HOME/conf
export HBASE_HOME=/usr/local/hbase/hbase-2.1.1
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$ZOOKEEPER_HOME/bin:$KAFKA_HOME/bin:$SCALA_HOME/bin:$FLUME_HOME/bin:$HBASE_HOME/bin

[root@master hbase]# source /etc/profile

2. Modify the hbase-env.sh file

[root@master hbase]# cd hbase-2.1.1/conf
[root@master conf]# vim hbase-env.sh 
export JAVA_HOME=/usr/local/java/jdk1.8.0_191
export HBASE_LOG_DIR=${HBASE_HOME}/logs 
export HBASE_MANAGES_ZK=false

3. Modify the hbase-site.xml file

[root@master conf]# vim hbase-site.xml 
<configuration>
<property> 
    <name>hbase.rootdir</name> 
    <value>hdfs://master:9000/hbase</value> 
  </property> 
  <property> 
    <name>hbase.cluster.distributed</name> 
    <value>true</value> 
  </property> 
  <property> 
    <name>hbase.zookeeper.quorum</name> 
    <value>master,slave1,slave2</value> 
  </property> 
  <property> 
    <name>hbase.zookeeper.property.dataDir</name> 
    <value>/usr/local/zookeeper/zookeeper-3.4.10/data</value> 
  </property> 
  <property>
    <name>hbase.tmp.dir</name>
    <value>/usr/local/hbase/data/tmp</value>
  </property>
  <property> 
    <name>hbase.master</name> 
    <value>hdfs://master:60000</value> 
  </property>
  <property>
    <name>hbase.master.info.port</name>
    <value>16010</value>
  </property>
  <property>
    <name>hbase.regionserver.info.port</name>
    <value>16030</value>
  </property>
</configuration>

4. Modify the regionservers file

[root@master conf]# vim regionservers 
master
slave1
slave2

5. Configuration of the other two child nodes

# Copy the whole hbase folder configured above to the other nodes
[root@master conf]# cd ../../..
[root@master local]# scp -r hbase [email protected]:/usr/local/
[root@master local]# scp -r hbase [email protected]:/usr/local/

# Don't forget to configure the environment variables in /etc/profile on the other two nodes as well, and source it to make them take effect!
# On all nodes, manually create the /usr/local/hbase/data/tmp directory, i.e. the value of the hbase.tmp.dir property above, which is used for temporary files.

6. Test

# Note: ZooKeeper and Hadoop must be started before testing HBase
[root@master local]# cd hbase/hbase-2.1.1
[root@master hbase-2.1.1]# bin/start-hbase.sh   
[root@master hbase-2.1.1]# jps
# Expected result: the master node shows HMaster, and the child nodes show HRegionServer
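Beyond jps, an optional functional test is to create a small table in the HBase shell, insert a row, scan it, and clean up; a sketch (the table name test_table is arbitrary):

# Optional: quick functional test in the HBase shell
[root@master hbase-2.1.1]# hbase shell
hbase(main):001:0> create 'test_table', 'cf'
hbase(main):002:0> put 'test_table', 'row1', 'cf:a', 'value1'
hbase(main):003:0> scan 'test_table'
hbase(main):004:0> disable 'test_table'
hbase(main):005:0> drop 'test_table'
hbase(main):006:0> exit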

Visit the web UI in the host browser: http://192.168.185.150:16010
So far, the HBase configuration is done.

8. Spark installation and configuration

– Introduction:
Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. It is a general parallel framework similar to Hadoop MapReduce. Spark keeps the advantages of Hadoop MapReduce, but unlike MapReduce, intermediate job results can be kept in memory, so there is no need to read and write HDFS between stages; this makes Spark better suited to the iterative algorithms needed in data mining and machine learning. Spark complements Hadoop and can run in parallel on the Hadoop file system.

1. Environment configuration

# Create the directory, then upload the downloaded tarball and extract it
[root@master local]# mkdir spark
[root@master local]# cd spark
[root@master spark]# tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz 
[root@master spark]# mv spark-2.4.0-bin-hadoop2.7 spark-2.4.0

# Configure the environment variables
[root@master spark]# vim /etc/profile
export JAVA_HOME=/usr/local/java/jdk1.8.0_191
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HIVE_HOME=/usr/local/hive/hive-2.3.4
export ZOOKEEPER_HOME=/usr/local/zookeeper/zookeeper-3.4.10
export KAFKA_HOME=/usr/local/kafka/kafka-2.1.0
export SCALA_HOME=/usr/local/scala/scala-2.11.8
export FLUME_HOME=/usr/local/flume/flume-1.8.0
export FLUME_CONF_DIR=$FLUME_HOME/conf
export HBASE_HOME=/usr/local/hbase/hbase-2.1.1
export SPARK_HOME=/usr/local/spark/spark-2.4.0
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$ZOOKEEPER_HOME/bin:$KAFKA_HOME/bin:$SCALA_HOME/bin:$FLUME_HOME/bin:$HBASE_HOME/bin:$SPARK_HOME/bin

[root@master spark]# source /etc/profile

2. Modify the spark-env.sh file

[root@master spark]# cd spark-2.4.0/conf/
[root@master conf]# mv spark-env.sh.template spark-env.sh
[root@master conf]# vim spark-env.sh 
export JAVA_HOME=/usr/local/java/jdk1.8.0_191
export SCALA_HOME=/usr/local/scala/scala-2.11.8
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.7.7/etc/hadoop

3. Modify the slaves file

[root@master conf]# mv slaves.template slaves
[root@master conf]# vim slaves 
master
slave1
slave2

4. Operate on the remaining two child nodes

# Copy the whole spark folder configured above to the other nodes
[root@master conf]# cd ../../..
[root@master local]# scp -r spark [email protected]:/usr/local/
[root@master local]# scp -r spark [email protected]:/usr/local/

# Don't forget to configure the environment variables in /etc/profile on the other two nodes as well, and source it to make them take effect!

5. Start

[root@master local]# cd spark/spark-2.4.0/                
[root@master spark-2.4.0]# sbin/start-all.sh 

After the startup is complete, visit the web UI in the host browser: http://192.168.185.150:8080/
If the page loads, everything worked; so far, the Spark configuration is done! Now let's test it by running the Pi-calculation example that ships with Spark:

[root@master spark-2.4.0]# ./bin/spark-submit  \
--class  org.apache.spark.examples.SparkPi  \
--master  local  \
examples/jars/spark-examples_2.11-2.4.0.jar

The computed result appears in the console output, in a line that starts with "Pi is roughly".
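Since the Hadoop cluster is already running, you can also submit the same example to YARN instead of running it in local mode; an optional sketch (assuming HDFS and YARN are up; the trailing 10 is just the number of partitions for the example):

# Optional: run SparkPi on YARN instead of local mode
[root@master spark-2.4.0]# ./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
examples/jars/spark-examples_2.11-2.4.0.jar 10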

Summary

The above is a detailed walkthrough of the steps for building a Hadoop-based big data environment (installation and configuration of Hadoop, Hive, ZooKeeper, Kafka, Flume, HBase, Spark, and so on). Work through it patiently; when you hit a problem, don't panic, take your time, and keep going!
This article has grown long enough; consider it a New Year's gift for 2019. Time for a break!

Original article: blog.csdn.net/litianquan/article/details/108566150