Building a Hadoop and Spark Cluster on AWS EC2

Foreword

This article demonstrates how to build a fully distributed cluster using the AWS EC2 cloud service. Of course, EC2 is only one way to get a fully distributed cluster of machines; there are several alternatives. One is to run multiple virtual machines locally: the benefit is that it is free and easy to manipulate, the downside is that it places high demands on the host machine. Mine is an ordinary notebook that cannot handle even two or three virtual machines. Another option is AWS EMR, Amazon's managed cluster platform, which can start a cluster quickly and is highly flexible and scalable, making it easy to add machines. Its drawback, however, is that you can only use the default software selection, as shown below:

If you want to install additional software, you need to use a Bootstrap script (see https://docs.aws.amazon.com/zh_cn/emr/latest/ManagementGuide/emr-plan-software.html?shortFooter=true ), which is not an easy thing; I remember trying to install Tencent's Angel on top of EMR before and never managing to get it installed. In addition, if you shut down a cluster on EMR, the files and configuration are not saved and everything has to be set up again the next time, so EMR is better suited to one-off scenarios.

In summary, building the cluster manually on plain EC2 is neither subject to local resource limits nor short on flexibility: you are free to configure and install whatever software you like. The downside is that the manual setup takes more time, and some operations in the cloud differ from their local counterparts; one wrong step can leave you stuck for a long time. Since there is little information online about building a cluster this way with EC2, I am writing this article to record the main steps.

Create EC2 instances

Let's get started. Here we set up three machines (instances): one master node and two slave nodes. First create the instances: choose the Ubuntu Server 16.04 LTS (HVM) image, and for the instance type choose the inexpensive t2.medium. If this is your first time, do not choose a type that is too expensive, or an operational mistake could leave you with a bill you cannot afford at the end of the month.

In step 3, since we want to launch three machines at the same time, Number of Instances can be set directly to 3. If instead you launch them one at a time, the Subnet below must be set to the same availability zone, or the machines will not be able to communicate with each other; for details see https://docs.aws.amazon.com/zh_cn/AWSEC2/latest/UserGuide/using-regions-availability-zones.html .

Step 4 sets the size of the disk. If you intend to keep using the cluster and install other software on it, you may need to increase the capacity here; I increased it to 15 GB:

Steps 5 and 6 can simply be skipped with Next. After Launch in step 7, select or create a new key pair, and the three instances are created. You can give them names for easy reference, such as master, slave01 and slave02:

Open three terminal windows and ssh into the three instances, e.g. ssh -i xxxx.pem ubuntu@ec2-xx-xxx-xxx-xx.us-west-2.compute.amazonaws.com, where xxxx.pem is the name of your local key file and ec2-xx-xxx-xxx-xx.us-west-2.compute.amazonaws.com is the public DNS hostname of the instance, different for each instance. One thing worth explaining, because it differs from running local virtual machines: EC2 instances have both a public IP and a private IP. The private IP is used for communication between instances in the cloud, while the public IP is used for communication between your local machine and the instances, which is why the public IP (DNS) is used for the ssh connection here. In the cluster setup steps below you will also need to fill in public and private IPs; be careful not to mix them up. For the difference between the two, see https://docs.aws.amazon.com/zh_cn/AWSEC2/latest/UserGuide/using-instance-addressing.html?shortFooter=true#using-instance-addressing-common .
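If you are ever unsure which address is which, one optional check (an aside, not part of the original steps) is to query the standard EC2 instance metadata service from inside an instance:

$ curl http://169.254.169.254/latest/meta-data/local-ipv4    # private IP, used for communication between instances
$ curl http://169.254.169.254/latest/meta-data/public-ipv4   # public IP, used from your local machine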

Create a new user and install the Java environment

The following takes the master node as an example. After logging in to the instance, the default user is ubuntu; you first need to create a hadoop user:

$ sudo useradd -m hadoop -s /bin/bash   # add the hadoop user
$ sudo passwd hadoop                    # set a password; you need to enter it twice
$ sudo adduser hadoop sudo              # grant administrator privileges to the hadoop user
$ su hadoop                             # switch to the hadoop user; requires the password
$ sudo apt-get update                   # update the apt sources

After this is done, the user name shown in the terminal changes to hadoop, and a new hadoop folder is created under the /home directory.

Hadoop relies on a Java environment, so the next step is to install the JDK. Download it directly from the official website; here I used the Linux x64 version, jdk-8u231-linux-x64.tar.gz, and transferred it to the master machine with scp. Note that it can only be transferred to the ubuntu user; transferring to the hadoop user will be rejected for insufficient permissions.

$ scp -i xxx.pem jdk-8u231-linux-x64.tar.gz ubuntu@ec2-xx-xxx-xxx-xx.us-west-2.compute.amazonaws.com:/home/ubuntu/  # run this command locally

This article assumes that all software is installed under the /usr/lib directory:

$ sudo mv /home/ubuntu/jdk-8u231-linux-x64.tar.gz /home/hadoop         # move the file to the hadoop user's home directory
$ sudo tar -zxf /home/hadoop/jdk-8u231-linux-x64.tar.gz -C /usr/lib/   # extract the JDK archive to /usr/lib
$ sudo mv /usr/lib/jdk1.8.0_231  /usr/lib/java                         # rename the java folder
$ vim ~/.bashrc                                                        # configure environment variables; EC2 seems to only have vim available

Add the following:

export JAVA_HOME=/usr/lib/java
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

Then run:

$ source ~/.bashrc   # make the configuration take effect
$ java -version      # check whether Java was installed successfully

If output like the following appears, the installation was successful:

After completing the steps above on the master node, repeat the same two steps (creating the hadoop user and installing the Java environment) on the two slave nodes.

Network Configuration

This step is to enable network communication between the Master and Slave nodes; make sure you are logged in as the hadoop user before configuring. First, modify the hostname of each node by running sudo vim /etc/hostname: on the master node, change ip-xxx-xx-xx-xx to Master. Do the same on the other nodes, changing it to Slave01 on the slave01 node and Slave02 on the slave02 node.

Then run sudo vim /etc/hosts to add IP-to-hostname mappings for the nodes. Taking the master node as an example, add entries for all three nodes; note that the IP addresses here are the private IPs mentioned above:
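A minimal sketch of the entries to add (the Master private IP below matches the one used later in the Spark configuration; the two slave IPs are placeholders to be replaced with your own instances' private IPs):

172.31.40.68   Master
172.31.40.69   Slave01
172.31.40.70   Slave02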

Then add the same entries to the hosts files of the two slave nodes. After a reboot, logging in as the hadoop user shows that the machine name has changed (it becomes Master):

For EC2 instances, you also need to configure security groups (Security groups) so that the instances can access each other:

Select the highlighted security group. Because I created the three instances at the same time, they share the same security group; if they were not created together, there may be three separate security groups to configure.

After entering, click Inbound, then Edit, then Add Rule, select All Traffic, and save and exit:

After all three instances are set up, ping each other to test connectivity. If the pings fail, the later steps will not succeed:

$ ping Master -c 3   # run these three commands on each of the 3 machines
$ ping Slave01 -c 3
$ ping Slave02 -c 3



Next, install the SSH server. SSH is a network protocol for encrypted logins between computers. After installing SSH, we want the Master node to be able to log in to each Slave node via SSH without a password. Run on the Master node:

$ sudo apt-get install openssh-server
$ ssh localhost                                         # log in to this machine with ssh; you need to type yes and the password
$ exit                                                  # exit the ssh localhost session; be careful not to log out of the hadoop user
$ cd ~/.ssh/                                            # if this directory does not exist, run ssh localhost once first
$ ssh-keygen -t rsa                                     # generate a key pair with ssh-keygen; just keep pressing Enter at the prompts
$ cat ./id_rsa.pub >> ./authorized_keys                 # add the key to the authorized keys
$ scp ~/.ssh/id_rsa.pub Slave01:/home/hadoop/           # copy the public key to the Slave01 node
$ scp ~/.ssh/id_rsa.pub Slave02:/home/hadoop/           # copy the public key to the Slave02 node

Then on the Slave01 and Slave02 nodes, add the transferred public key to the ssh authorization:

$ mkdir ~/.ssh       # create this folder if it does not exist; skip if it already exists
$ cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
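If passwordless login still fails afterwards, a common cause (a precaution added here, not part of the original steps) is overly permissive directory permissions, since sshd ignores keys stored in world-accessible locations; on the slave nodes you can tighten them:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys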

Now the Master node can SSH to each Slave node without a password. You can test this on the Master node with the following commands; the prompt changes to Slave01 as shown below, and exit returns you to the Master:
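A quick check of the passwordless login, using the hostnames configured above:

$ ssh Slave01   # should log in without asking for a password
$ exit          # return to the Master node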

So far the network configuration is complete.

Hadoop installation

Go to the mirror site https://archive.apache.org/dist/hadoop/core/ to download Hadoop; I downloaded hadoop-2.8.4.tar.gz and, as with the JDK, transferred it to the instance with scp. Run on the Master node:

$ sudo tar -zxf /home/ubuntu/hadoop-2.8.4.tar.gz -C /usr/lib     # extract to /usr/lib
$ cd /usr/lib/
$ sudo mv ./hadoop-2.8.4/ ./hadoop                               # rename the folder to hadoop
$ sudo chown -R hadoop ./hadoop                                  # change ownership to the hadoop user

Add the Hadoop directory to the environment variables so that the hadoop and hdfs commands can be used directly from any directory. Run vim ~/.bashrc and add this line:

export PATH=$PATH:/usr/lib/hadoop/bin:/usr/lib/hadoop/sbin

Save, then run source ~/.bashrc to make the configuration take effect.
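As a quick optional check, the hadoop command should now be available from any directory:

$ hadoop version   # should print the Hadoop version, here 2.8.4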

Next, modify the Hadoop configuration files (this also configures YARN along the way). First run cd /usr/lib/hadoop/etc/hadoop; a total of six files need to be modified: hadoop-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml.

1. In the file hadoop-env.sh, change export JAVA_HOME=${JAVA_HOME} to export JAVA_HOME=/usr/lib/java, i.e. the Java installation path.

2. In the file slaves, replace the default localhost with Slave01 and Slave02, as sketched below.
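The slaves file then contains only the two slave hostnames (assuming the hostnames configured earlier):

Slave01
Slave02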

3. core-site.xml reads as follows:

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://Master:9000</value>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>file:/usr/lib/hadoop/tmp</value>
                <description>Abase for other temporary directories.</description>
        </property>
</configuration>

4. hdfs-site.xml reads as follows:

<configuration>
        <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>Master:50090</value>
        </property>
        <property>
                <name>dfs.replication</name>
                <value>2</value>
        </property>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>file:/usr/lib/hadoop/tmp/dfs/name</value>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>file:/usr/lib/hadoop/tmp/dfs/data</value>
        </property>
</configuration>

5. The file mapred-site.xml (you may need to rename it from the default mapred-site.xml.template; see the command after the XML below):

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
        <property>
                <name>mapreduce.jobhistory.address</name>
                <value>Master:10020</value>
        </property>
        <property>
                <name>mapreduce.jobhistory.webapp.address</name>
                <value>Master:19888</value>
        </property>
</configuration>
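The rename mentioned above can be done, for example, like this (it mirrors the way the Spark config templates are renamed later; whether you need it depends on which file your distribution ships):

$ cd /usr/lib/hadoop/etc/hadoop
$ mv mapred-site.xml.template mapred-site.xml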

6. The file yarn-site.xml:

<configuration>
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>Master</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
</configuration>

Once configured, copy the /usr/lib/hadoop folder on the Master to each slave node. Run on the Master node:

$ cd /usr/lib
$ tar -zcf ~/hadoop.master.tar.gz ./hadoop   # compress first, then copy
$ scp ~/hadoop.master.tar.gz Slave01:/home/hadoop
$ scp ~/hadoop.master.tar.gz Slave02:/home/hadoop

Then run on both slave nodes:

$ sudo tar -zxf ~/hadoop.master.tar.gz -C /usr/lib
$ sudo chown -R hadoop /usr/lib/hadoop

After the installation is complete, the NameNode must be formatted on the Master node before the first start:

$ hdfs namenode -format       # initialization is only required for the first run, not afterwards

If it succeeds, you will see "successfully formatted" and "Exitting with status 0"; if it shows "Exitting with status 1", something went wrong.



Then you can start Hadoop and YARN. The start commands need to be run on the Master node:

$ start-dfs.sh
$ start-yarn.sh
$ mr-jobhistory-daemon.sh start historyserver

The jps command shows the processes started on each node. If everything is correct, the Master node shows the NameNode, ResourceManager, SecondaryNameNode and JobHistoryServer processes, as shown below:

The Slave nodes show the DataNode and NodeManager processes, as shown below:

The command hdfs dfsadmin -report shows the status of the cluster; Live datanodes (2) means both slave nodes have started normally, while 0 means the setup was unsuccessful:

Hadoop's web UIs can be viewed at the following three addresses, where ec2-xx-xxx-xxx-xx.us-west-2.compute.amazonaws.com is the public DNS hostname of the instance, and 50070, 8088 and 19888 are the default ports of the HDFS NameNode, YARN ResourceManager and JobHistoryServer web UIs respectively:

ec2-xx-xxx-xxx-xx.us-west-2.compute.amazonaws.com:50070
ec2-xx-xxx-xxx-xx.us-west-2.compute.amazonaws.com:8088
ec2-xx-xxx-xxx-xx.us-west-2.compute.amazonaws.com:19888

Running a distributed Hadoop example

$ hadoop fs -mkdir -p /user/hadoop   # create the hadoop user directory on HDFS
$ hadoop fs -mkdir input
$ hadoop fs -put /usr/lib/hadoop/etc/hadoop/*.xml input  # copy the hadoop configuration files into HDFS
$ hadoop jar /usr/lib/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'  # run the example

If it succeeds, you will see output like the following:
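The example writes its results to the output directory on HDFS; to print them (an extra step not in the original), you can use:

$ hadoop fs -cat output/*   # print the results of the grep example stored in HDFS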

Finally, to shut down the Hadoop cluster, run the following commands:

$ stop-yarn.sh
$ stop-dfs.sh
$ mr-jobhistory-daemon.sh stop historyserver

Spark installation

Go to the mirror site https://archive.apache.org/dist/spark/ to download Spark. Since Hadoop has already been installed, I downloaded the Hadoop-free build, i.e. spark-2.3.3-bin-without-hadoop.tgz. Run on the Master node:

$ sudo tar -zxf /home/ubuntu/spark-2.3.3-bin-without-hadoop.tgz -C /usr/lib  # extract to /usr/lib
$ cd /usr/lib/
$ sudo mv ./spark-2.3.3-bin-without-hadoop/ ./spark   # rename the folder to spark
$ sudo chown -R hadoop ./spark                        # change ownership to the hadoop user

Add the spark directory to the environment variables: run vim ~/.bashrc and add the following:

export SPARK_HOME=/usr/lib/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Save, then run source ~/.bashrc to make the configuration take effect.

Then two files need to be configured; run cd /usr/lib/spark/conf.

1. Configure the slaves file

mv slaves.template slaves  # rename slaves.template to slaves

The slaves file sets the slave nodes. Edit slaves and replace the default localhost with the two slave node names:

Slave01
Slave02

2. Configure the spark-env.sh file

mv spark-env.sh.template spark-env.sh

Edit spark-env.sh and add the following:

export SPARK_DIST_CLASSPATH=$(/usr/lib/hadoop/bin/hadoop classpath)
export HADOOP_CONF_DIR=/usr/lib/hadoop/etc/hadoop
export SPARK_MASTER_IP=172.31.40.68   # note: this is the private IP of the Master node
export JAVA_HOME=/usr/lib/java
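As an optional sanity check, the command referenced in SPARK_DIST_CLASSPATH above can be run by itself; it should print a long, non-empty list of Hadoop jar and configuration paths:

$ /usr/lib/hadoop/bin/hadoop classpath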

Once configured, copy the /usr/lib/spark folder on the Master to each slave node. Run on the Master node:

$ cd /usr/lib
$ tar -zcf ~/spark.master.tar.gz ./spark
$ scp ~/spark.master.tar.gz Slave01:/home/hadoop
$ scp ~/spark.master.tar.gz Slave02:/home/hadoop

Then run on both slave nodes:

$ sudo tar -zxf ~/spark.master.tar.gz -C /usr/lib
$ sudo chown -R hadoop /usr/lib/spark

Before starting the Spark cluster, make sure the Hadoop cluster has been started:

$ start-dfs.sh
$ start-yarn.sh
$ mr-jobhistory-daemon.sh start historyserver
$ start-master.sh  # start the Spark master node
$ start-slaves.sh  # start the Spark slave (worker) nodes
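After this, jps can be used again to confirm the Spark daemons are running (an extra check, assuming the standalone mode set up above): the Master node should now also show a Master process, and each slave node a Worker process.

$ jps   # Master node: a "Master" process in addition to the Hadoop daemons; slave nodes: "Worker"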

You can access the Spark web UI at ec2-xx-xxx-xxx-xx.us-west-2.compute.amazonaws.com:8080.

Running a distributed Spark example

1. Submit a JAR package from the command line:

$ spark-submit --class org.apache.spark.examples.SparkPi --master spark://Master:7077 /usr/lib/spark/examples/jars/spark-examples_2.11-2.3.3.jar 100 2>&1 | grep "Pi is roughly" 

Output like the following indicates success:

2. Run the program remotely from IDEA:

With IDEA you can write code locally and submit it to the remote cloud machine for execution, which makes debugging much more convenient. Note that the Master address here uses the public IP (or public DNS) of the cloud machine. The following Word2Vec program is used as an example; it converts sentences into vector form:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

object Word2Vec {
  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)  // 控制输出信息
    Logger.getLogger("com").setLevel(Level.ERROR)

    val conf = new SparkConf()
      .setMaster("spark://ec2-54-190-51-132.us-west-2.compute.amazonaws.com:7077")  // 填公有DNS或公有IP地址都可以
      .setAppName("Word2Vec")
      .set("spark.cores.max", "4")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)

    val spark = SparkSession
      .builder
      .appName("Word2Vec")
      .getOrCreate()

    val documentDF = spark.createDataFrame(Seq(
      "Hi I heard about Spark".split(" "),
      "I wish Java could use case classes".split(" "),
      "Logistic regression models are neat".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    val word2Vec = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(3)
      .setMinCount(0)
    val model = word2Vec.fit(documentDF)

    val result = model.transform(documentDF)
    result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
      println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n") }
  }
}

IDEA console output:

To shut down the Spark and Hadoop clusters, run the following commands:

$ stop-master.sh
$ stop-slaves.sh
$ stop-yarn.sh
$ stop-dfs.sh
$ mr-jobhistory-daemon.sh stop historyserver

Finally and most importantly, do not forget to stop the EC2 instances when you are done, or they will keep incurring charges around the clock.
