Linux virtual machine: building a basic environment for big data clusters (Hadoop, Spark, Flink, Hive, Zookeeper, Kafka, Nginx)

Basic information: CentOS-7.9, Java-1.8, Python-3.9, Scala-2.12, Hadoop-3.2.1, Spark-3.1.2, Flink-1.13.1, Hive-3.1.3, Zookeeper-3.8.0, Kafka-3.2.0, Nginx-1.23.1

All of the configurations below are for personal learning; adjust them as appropriate for a production installation.

1. Relevant file download address

  • CentOS-7.9
    • http://mirrors.aliyun.com/centos/7.9.2009/isos/x86_64
  • Java-1.8
    • https://www.oracle.com/java/technologies/downloads/#java8
  • Python-3.9
    • https://www.python.org/ftp/python/3.9.6/Python-3.9.6.tgz
  • Scala-2.12
    • https://www.scala-lang.org/download/2.12.12.html
  • Hadoop-3.2.1
    • http://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
  • Spark-3.1.2
    • http://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
  • Flink-1.13.1
    • http://archive.apache.org/dist/flink/flink-1.13.1/flink-1.13.1-bin-scala_2.12.tgz
  • Hive-3.1.3
    • http://archive.apache.org/dist/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
  • Zookeeper-3.8.0
    • http://archive.apache.org/dist/zookeeper/zookeeper-3.8.0/apache-zookeeper-3.8.0-bin.tar.gz
  • Kafka-3.2.0
    • http://archive.apache.org/dist/kafka/3.2.0/kafka_2.12-3.2.0.tgz
  • Nginx-1.23.1
    • https://nginx.org/download/nginx-1.23.1.tar.gz

2. Virtual machine basic configuration

  • Modify static IP
    • vi /etc/sysconfig/network-scripts/ifcfg-eth0
    • Restart the network after modification
      • systemctl restart network
    • Adjust the following values to match your own machine
BOOTPROTO="static"
ONBOOT="yes"
GATEWAY="10.211.55.1"
IPADDR="10.211.55.101"
NETMASK="255.255.255.0"
DNS1="114.114.114.114"
DNS2="8.8.8.8"
  • create user
    • create
      • useradd -m ac_cluster
    • password
      • passwd ac_cluster
    • sudo permissions
      • vi /etc/sudoers
      • Add an entry for the new user below the root line (see the sketch at the end of this section)
  • Modify yum source
    • configuration location
      • /etc/yum.repos.d
    • download wget
      • sudo yum -y install wget
    • Get the repo file
      • wget http://mirrors.aliyun.com/repo/Centos-7.repo
    • Backup the original repo file
      • mv CentOS-Base.repo CentOS-Base.repo.bak
    • Rename the downloaded repo file to replace the original
      • mv Centos-7.repo CentOS-Base.repo
    • Refresh the yum cache
      • yum clean all
      • yum makecache
  • download vim
    • yum -y install vim
  • modify hostname
    • vim /etc/hostname
    • reboot
  • turn off firewall
    • systemctl stop firewalld
    • systemctl disable firewalld
  • Modify Domain Name Mapping
    • vim /etc/hosts
  • Configure ssh password-free
    • ssh-keygen -t rsa
      • Press Enter three times to accept the defaults
    • ssh-copy-id hybrid01
      • Run this once for every node in the cluster, changing the hostname each time (see the sketch at the end of this section)
  • Configure Time Synchronization
    • yum -y install ntpdate
    • ntpdate ntp1.aliyun.com
    • Optionally schedule automatic time synchronization with cron
      • crontab -e, then add: */1 * * * * sudo /usr/sbin/ntpdate ntp1.aliyun.com
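As referenced in the sudoers, hosts, and SSH steps above, here is a minimal sketch tying them together, assuming a three-node cluster named hybrid01/hybrid02/hybrid03 on the example 10.211.55.0/24 network (the second and third addresses are placeholders; adjust everything to your own machines):

# /etc/sudoers: entry for the new user, added below the root line
ac_cluster    ALL=(ALL)       ALL

# /etc/hosts: map each node's IP to its hostname (example addresses)
10.211.55.101   hybrid01
10.211.55.102   hybrid02
10.211.55.103   hybrid03

# Distribute the SSH public key to every node (run on the node that generated the key)
for host in hybrid01 hybrid02 hybrid03; do
  ssh-copy-id "$host"
done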

3. Language environment installation

1. Java environment installation

  • After downloading the installation package, extract it to the specified directory
    • tar -zxvf xxx -C /xx/xx
  • Or download directly with wget
    • wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz
  • Environment configuration
    • /etc/profile or ~/.bash_profile in the user directory
    • After changing, remember to source
export JAVA_HOME=/xx/xx
export PATH=$JAVA_HOME/bin:$PATH
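Putting the Java steps together, a minimal sketch assuming the 8u131 tarball from the wget command above and a placeholder install directory /xxx/program:

# Extract the JDK to the target directory
tar -zxvf jdk-8u131-linux-x64.tar.gz -C /xxx/program

# Append the environment variables and reload the shell configuration
echo 'export JAVA_HOME=/xxx/program/jdk1.8.0_131' >> ~/.bash_profile
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bash_profile
source ~/.bash_profile

# Verify the installation
java -version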

2. Python environment installation

  • Download the source package, or fetch it with wget
    • wget https://www.python.org/ftp/python/3.9.6/Python-3.9.6.tgz
  • Unzip to the specified directory
    • tar -zxvf xxx -C /xx/xx
  • Install build dependencies
    • sudo yum -y install vim unzip net-tools wget bzip2 gcc gcc-c++ zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libglvnd-glx
  • Configure before compiling
    • ./configure --prefix=/xxx/program/python3
  • Compile and install
    • make && make install
  • Configure environment variables, or create a python3 soft link in /usr/bin (see the sketch below)
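If you take the soft-link route instead of environment variables, a minimal sketch assuming the --prefix used in the configure step above:

# Link the freshly built interpreter and pip into /usr/bin
sudo ln -s /xxx/program/python3/bin/python3 /usr/bin/python3
sudo ln -s /xxx/program/python3/bin/pip3 /usr/bin/pip3

# Verify
python3 --version
pip3 --version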

3. Scala environment installation

  • After downloading the installation package, extract it to the specified directory
    • tar -zxvf xxx -C /xx/xx
  • Environment configuration
    • /etc/profile or ~/.bash_profile in the user directory
    • After changing, remember to source
export SCALA_HOME=/xx/xx
export PATH=$SCALA_HOME/bin:$PATH

4. Big data component installation

1. Hadoop cluster installation

  • decompress
    • tar -zxvf xx -C /xx/xx
  • Enter the Hadoop directory to modify the files under etc/hadoop
  • Modify hadoop-env.sh
    • export JAVA_HOME=/xxx
  • Modify core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hybrid01:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/xxx/runtime/hadoop_repo</value>
    </property>
</configuration>
  • Modify hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hybrid01:50090</value>
    </property>
</configuration>
  • Modify mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
  • Modify yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hybrid02</value>
  </property>
</configuration>
  • Modify the workers file to configure the DataNodes
    • List the hostname of each DataNode node, one per line
  • Modify the hdfs startup script: start-dfs.sh, stop-dfs.sh
    • Add the following content
HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
  • Modify yarn start and stop scripts: start-yarn.sh, stop-yarn.sh
    • Add the following content
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
  • Environment configuration
    • /etc/profile or ~/.bash_profile in the user directory
    • After changing, remember to source
export HADOOP_HOME=/xxx/hadoop-3.2.1
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_CLASSPATH=`hadoop classpath`
  • Distribute to each node
    • scp -r hadoop-xxx user@hybrid01:$PWD
  • format namenode
    • hdfs namenode -format
  • Start the cluster
    • Start everything at once
      • start-all.sh
    • Start individual daemons
      • hdfs --daemon start/stop namenode/datanode/secondarynamenode (replaces the deprecated hadoop-daemon.sh in Hadoop 3)
    • start yarn
      • start-yarn.sh
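After the cluster is up, a quick smoke test, a sketch assuming the example jar that ships with Hadoop 3.2.1:

# Confirm the expected daemons are running on each node
jps

# Check HDFS health and do a simple write/read
hdfs dfsadmin -report
hdfs dfs -mkdir -p /tmp/test
hdfs dfs -put $HADOOP_HOME/etc/hadoop/core-site.xml /tmp/test
hdfs dfs -ls /tmp/test

# Run the bundled MapReduce example on YARN
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 2 10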

2. MySQL installation

Several of the later components need MySQL, so install it first. I use Docker here rather than installing from a package.

  • Docker installation
    • sudo yum -y install yum-utils device-mapper-persistent-data lvm2
    • sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
    • sudo yum -y install docker-ce docker-ce-cli containerd.io
    • sudo service docker start
    • systemctl enable docker
    • If docker commands require sudo, add your user to the docker group
      • sudo gpasswd -a <username> docker
      • newgrp docker
  • MySQL installation
    • Create the mount directories: data and conf
    • docker pull mysql:5.7
    • docker run -d --name=mysql -p 3306:3306 --restart=always --privileged=true -v /xxx/metadata/docker/mysql/data:/var/lib/mysql -v /xxx/metadata/docker/mysql/conf:/etc/mysql/conf.d -e MYSQL_ROOT_PASSWORD=123456 mysql:5.7
    • Enable remote login (the Docker image's root user usually already allows remote connections)
      • Create a remote user: CREATE USER 'root'@'%' IDENTIFIED WITH mysql_native_password BY '123456';
      • Grant privileges: GRANT ALL PRIVILEGES ON *.* TO 'root'@'%';
    • Modify the character encoding
      • alter database <database name> character set utf8;
      • Restart MySQL: docker restart mysql
    • MySQL driver download
      • wget http://ftp.ntu.edu.tw/MySQL/Downloads/Connector-J/mysql-connector-java-5.1.48.tar.gz
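A quick check that the containerized MySQL is reachable, a sketch using the root password set in the docker run command above:

# Connect from inside the container
docker exec -it mysql mysql -uroot -p123456

# Inside the mysql shell (optional, hive-site.xml later uses createDatabaseIfNotExist=true):
#   CREATE DATABASE hive DEFAULT CHARACTER SET utf8;
#   SHOW DATABASES;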

3. Spark installation

I did not set up a standalone Spark cluster here, since jobs are usually submitted to Hadoop YARN for execution; just unpack the package and configure the environment variables.

  • decompress
    • tar -zxvf xxx -C /xx/xx
  • Environment configuration
    • /etc/profile or ~/.bash_profile in the user directory
    • After changing, remember to source
export SPARK_HOME=/xxx/xx
export PATH=$SPARK_HOME/bin:$PATH
  • Configure the Hive environment
    • Copy hive-site.xml from Hive's conf directory into Spark's conf directory
    • Copy the MySQL driver jar into Spark's jars directory
    • test
      • start spark-shell
      • spark.sql("show databases").show()
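Beyond the spark-shell test, a sketch of submitting the bundled SparkPi example to YARN, which is how jobs are normally run in this setup (the jar name assumes the Spark 3.1.2 / Scala 2.12 build):

# Submit the bundled SparkPi example to YARN
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.1.2.jar 100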

4. Flink installation

  • decompress
    • tar -zxvf flink-xxx -C /xx/xx
  • Modify flink-conf.yaml
    • Change jobmanager.rpc.address from localhost to the JobManager's hostname
  • List the TaskManager hostnames in the workers file
  • List the JobManager hostname in the masters file
  • Distribute to the cluster nodes
    • scp -r flink-xxx user@hybrid01:$PWD
  • Environment configuration
    • /etc/profile or ~/.bash_profile in the user directory
    • After changing, remember to source
export FLINK_HOME=/xxx/xx
export PATH=$FLINK_HOME/bin:$PATH
  • start up
    • start-cluster.sh
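To verify the Flink cluster, a sketch using the streaming WordCount example bundled with the distribution:

# Submit the bundled example job; output goes to the TaskManager stdout log
flink run $FLINK_HOME/examples/streaming/WordCount.jar

# List all jobs (the web UI on port 8081 shows the same information)
flink list -a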

5. Hive installation

  • decompress
    • tar -zxvf hive-xxx -C /xx/xx
  • hive-env.sh
    • HADOOP_HOME=/xx/hadoop
    • export HIVE_CONF_DIR=/xx/hive/conf
  • hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <!-- JDBC connection URL -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hybrid03:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
  </property>
  <!-- JDBC connection driver -->
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <!-- JDBC connection username -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <!-- JDBC connection password -->
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
  <!-- Hive's default warehouse directory on HDFS -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <!-- Address of the metastore service -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hybrid03:9083</value>
  </property>
</configuration>
  • MySQL driver
    • Copy the MySQL driver jar downloaded during the MySQL installation into Hive's lib directory
  • Guava package conflict
    • Replace the old guava jar in Hive's lib directory with guava-27.0-jre.jar from Hadoop's share/hadoop/common/lib
  • Initialize the database
    • schematool -dbType mysql -initSchema
  • Environment configuration
    • /etc/profile or ~/.bash_profile in the user directory
    • After changing, remember to source
export HIVE_HOME=/xxx/xx
export PATH=$HIVE_HOME/bin:$PATH
  • start service
    • nohup hive --service metastore &
  • start interaction
    • hive
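A small smoke test once the metastore and CLI work, a sketch using a throwaway database and table name:

# Run a few statements non-interactively (or type them in the hive prompt)
hive -e "CREATE DATABASE IF NOT EXISTS test_db;
         CREATE TABLE test_db.t_demo (id INT, name STRING);
         INSERT INTO test_db.t_demo VALUES (1, 'ac'), (2, 'cluster');
         SELECT * FROM test_db.t_demo;"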

6. Zookeeper installation

The subsequent operations are similar, so I won’t write them in detail.

  • decompress
    • tar -zxvf xxx -C /xx/xx
  • zoo.cfg
dataDir=/acware/data/zookeeper
dataLogDir=/acware/logs/zookeeper
server.1=hybrid01:2888:3888
server.2=hybrid02:2888:3888
server.3=hybrid03:2888:3888
  • Create data and log directories
  • Create a myid file under the dataDir directory and set each node's id to match its server.N entry (see the sketch below)
  • Distribute the package to each node and create the corresponding directory and myid
  • Configure environment variables
  • start up
    • zkServer.sh start
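A sketch of the myid and verification steps, assuming the dataDir above and the three hybrid nodes:

# On each node, create the directories and write that node's id into dataDir/myid
# (1 on hybrid01, 2 on hybrid02, 3 on hybrid03, matching server.N in zoo.cfg)
mkdir -p /acware/data/zookeeper /acware/logs/zookeeper
echo 1 > /acware/data/zookeeper/myid   # use 2 / 3 on the other nodes

# Start on every node, then check which one became the leader
zkServer.sh start
zkServer.sh status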

7. Kafka installation

  • decompress
  • Configure environment variables
  • Create log directory
  • Modify server.properties
# Globally unique broker id; must not be duplicated across the cluster
broker.id=0
# Directory where Kafka log data is stored
log.dirs=/acware/logs/kafka
# Zookeeper cluster connection addresses
zookeeper.connect=hybrid01:2181,hybrid02:2181,hybrid03:2181
  • Distribute to the other nodes and modify broker.id on each
  • start up
    • kafka-server-start.sh $KAFKA_HOME/config/server.properties
  • Notice
    • You need to set delete.topic.enable=true in server.properties to completely delete the Topic, otherwise it is just marked for deletion
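A quick end-to-end check once all brokers are up, a sketch using a throwaway topic name:

# Create a test topic (Kafka 3.x addresses the brokers directly via --bootstrap-server)
kafka-topics.sh --create --topic test_topic --partitions 3 --replication-factor 2 \
  --bootstrap-server hybrid01:9092,hybrid02:9092,hybrid03:9092

# Produce a few messages (type lines, Ctrl+C to exit)
kafka-console-producer.sh --topic test_topic --bootstrap-server hybrid01:9092

# Consume them from the beginning in another terminal
kafka-console-consumer.sh --topic test_topic --from-beginning --bootstrap-server hybrid01:9092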

8. Nginx installation

  • Install the required build dependencies
    • yum -y install gcc zlib zlib-devel pcre-devel openssl openssl-devel
  • Download the package from the official site, or fetch it with wget
    • wget https://nginx.org/download/nginx-1.23.1.tar.gz
  • Configure before compiling
    • I enable a few extra modules for my own needs; a basic installation usually only needs --prefix
    • ./configure --prefix=/xxx/nginx-1.23.1 --with-openssl=/xxx/openssl --with-http_stub_status_module --with-http_ssl_module --with-http_realip_module --with-stream --with-http_auth_request_module
  • Compile and install
    • make && make install
  • Configure environment variables
  • start up
    • nginx
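After starting, a sketch of the usual sanity checks plus a minimal reverse-proxy server block for conf/nginx.conf; the upstream here is just an example (the Flink web UI on port 8081):

# Validate the configuration and control the running instance
nginx -t
nginx -s reload    # reload after configuration changes
nginx -s stop      # stop the server

# Minimal reverse-proxy block to place inside the http {} section of nginx.conf
server {
    listen 80;
    server_name hybrid01;

    location / {
        proxy_pass http://hybrid01:8081;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}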

5. Problems in the process

1. Environment variables misconfigured, so basic commands cannot be found

  • Temporarily restore the default PATH
    • export PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

2. To be updated
