Spark in Action: Building Our Spark Distributed Architecture

Spark distributed architecture

As we know, Spark's strength lies not only in its powerful data-processing capabilities but also in its well-designed distributed architecture. Take the earlier example, "Spark in Action: finding the most frequent visitor among 500 million visits": I used a four-node Spark setup to find the ID that appears most often in 500 million visit records. Even so, the job took more than 40 minutes, a runtime that would be unbearable for an ordinary program, although in big-data processing it is often taken for granted. Of course, in practice we cannot let a job spend that much time on a simple pass over 500 million visits, which is why we turn to a distributed architecture. (PS: although the example processes 500 million visits, what we actually used was four nodes on three machines, i.e. a pseudo-distributed setup.)

Distributed architecture benefits

Take the same example, "Spark in Action: finding the most frequent visitor among 500 million visits". If four nodes (actually three machines) need about forty minutes, then in theory one thousand nodes with the same configuration would shorten the time to roughly 2.4 minutes. If we also improve each node's performance (memory, CPU performance, number of cores, and so on), the theoretical value shrinks even further. In practice the theoretical value is never reached, which is hardly surprising: Spark's scheduling, task distribution, networking and so on all cost time, so the real figure will be larger than 2.4 minutes. Still, compared with 47 minutes it is very satisfying. To get this kind of performance out of Spark we usually have to tune it, but that is a topic for another time.

Building our Spark platform from scratch

1. Prepare the CentOS environment

To build a true cluster environment and achieve a highly available architecture, we need at least three virtual machines as cluster nodes. So I bought three Alibaba Cloud servers to act as our cluster nodes.

Hostname  IP              RAM  CPU
master    172.19.101.111  4G   1 core
slave1    172.19.77.91    4G   1 core
slave2    172.19.131.1    4G   1 core

Note that master is the master node, while the slaves, as the name suggests, are the nodes that do the work assigned by the master. In fact, in our cluster the distinction between master and slave is not that sharp, because in practice they are all "working hard". Still, when building the cluster we should keep the concept clear.
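
Since every configuration file below refers to the machines by hostname (master, slave1, slave2), those names should resolve to the private IPs from the table above on all three machines. A minimal sketch, assuming the listed IPs are reachable between the nodes and that /etc/hosts has no conflicting entries:

# Run on all three machines; hostnames and IPs are taken from the table above
cat >> /etc/hosts <<'EOF'
172.19.101.111 master
172.19.77.91   slave1
172.19.131.1   slave2
EOF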

2. Download the JDK

  • 1. Download the JDK 1.8 tar.gz package
wget https://download.oracle.com/otn-pub/java/jdk/8u201-b09/42970487e3af4f5aa5bca3f542482c60/jdk-8u201-linux-x64.tar.gz

  • 2. Extract it
tar -zxvf jdk-8u201-linux-x64.tar.gz

After extraction you get the jdk1.8.0_201 directory.

  • 3. Configure the environment variables

Modify profile

vi /etc/profile

Add the following

export JAVA_HOME=/usr/local/java1.8/jdk1.8.0_201
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
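
The JAVA_HOME above assumes the extracted jdk1.8.0_201 directory was moved under /usr/local/java1.8; adjust the path if you keep it elsewhere. A minimal sketch (the same pattern applies to the ZooKeeper, Hadoop and Spark archives later on):

# Move the extracted JDK to the location JAVA_HOME points at (assumed layout)
mkdir -p /usr/local/java1.8
mv jdk1.8.0_201 /usr/local/java1.8/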


Source it so the changes take effect

source /etc/profile

Check whether it has taken effect

java -version


If java -version prints the JDK 1.8 version information, the installation has succeeded.

The above steps are exactly the same on all three virtual machines!

3. Install ZooKeeper

  • Download the ZooKeeper package
wget https://mirrors.tuna.tsinghua.edu.cn/apache/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz


  • Extract it
tar -zxvf zookeeper-3.4.13.tar.gz


  • Enter the ZooKeeper configuration directory
cd zookeeper-3.4.13/conf
  • Copy the configuration file template
cp zoo_sample.cfg zoo.cfg


  • Edit the copied zoo.cfg so that it contains the following
dataDir=/home/hadoop/data/zkdata
dataLogDir=/home/hadoop/log/zklog

server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888


  • Configure the environment variables (again in /etc/profile)
export ZOOKEEPER_HOME=/usr/local/zookeeper/zookeeper-3.4.13
export PATH=$PATH:$ZOOKEEPER_HOME/bin 


  • Make the environment variables take effect
source /etc/profile
  • Note this line from the configuration file above, which sets the data directory
dataDir=/home/hadoop/data/zkdata
  • We create this directory manually, cd into it, and write the node id into a file named myid (shown here on slave2)
cd /home/hadoop/data/zkdata/
echo 3 > myid


  • Pay particular attention to this: on master the command is
echo 1 > myid
  • The value corresponds to the following configuration, so on master we echo 1, on slave1 echo 2, and on slave2 echo 3 (see the consolidated sketch after these lines)
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888
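
Putting the last few steps together, here is a minimal per-host sketch; the directory paths come from zoo.cfg above, and the id written to myid must match the server.N lines:

# Run on every node: create the data and log directories from zoo.cfg
mkdir -p /home/hadoop/data/zkdata /home/hadoop/log/zklog
# Write this node's id: 1 on master, 2 on slave1, 3 on slave2
echo 1 > /home/hadoop/data/zkdata/myid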


  • After configuring, start it up to test
zkServer.sh start


  • After starting, check whether it started successfully
zkServer.sh status


Perform the above steps on all three virtual machines! Only the echo value differs.
  • Check the state on master after it starts

  • Check the state on slave1 after it starts

The Mode is not the same on each node: typically one reports Mode: leader and the others Mode: follower. This is ZooKeeper's election mechanism at work; how that mechanism runs is not covered here and will be explained separately later. At this point, the ZooKeeper cluster is fully set up.
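
As an optional sanity check (not in the original walkthrough), you can connect to the ensemble with the CLI that ships with ZooKeeper and list the root znode:

# From any node; the three server addresses come from zoo.cfg
zkCli.sh -server master:2181,slave1:2181,slave2:2181
# Inside the zkCli prompt:
ls /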

4. Install Hadoop

  • 1. Download hadoop-2.7.7.tar.gz with wget
wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
  • 2. After downloading, extract it

Extracting it produces a hadoop-2.7.7 directory

tar -zxvf hadoop-2.7.7.tar.gz


  • 3. Configure the Hadoop environment variables

Modify profile

vi /etc/profile
  • Add the Hadoop environment variables
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin


  • Make the environment variables take effect
source /etc/profile
  • Once configured, check whether it has taken effect
 hadoop version


  • Enter hadoop-2.7.7/etc/hadoop

  • Edit core-site.xml

vi core-site.xml 
  • Add the following configuration
<configuration>
    <!-- Set the HDFS nameservice to myha01 -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://myha01/</value>
    </property>

    <!-- Hadoop temporary directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/data/hadoopdata/</value>
    </property>

    <!-- ZooKeeper quorum addresses -->
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>master:2181,slave1:2181,slave2:2181</value>
    </property>

    <!-- Timeout for Hadoop's connection to ZooKeeper -->
    <property>
        <name>ha.zookeeper.session-timeout.ms</name>
        <value>1000</value>
        <description>ms</description>
    </property>
</configuration>


  • Copy mapred-site.xml.template
cp mapred-site.xml.template mapred-site.xml


  • Edit mapred-site.xml
vi mapred-site.xml
  • Add the following
<configuration>
    <!-- Use YARN as the MapReduce framework -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>

    <!-- MapReduce JobHistory server address -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
    </property>

    <!-- Web address of the job history server -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
    </property>
</configuration>
  • Edit hdfs-site.xml
vi hdfs-site.xml 
  • Add the following
<configuration>

    <!-- Replication factor -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

    <!-- Working/data directories of the NameNode and DataNode -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/data/hadoopdata/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/data/hadoopdata/dfs/data</value>
    </property>

    <!-- Enable WebHDFS -->
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>

    <!-- Set the HDFS nameservice to myha01; it must match the value in core-site.xml.
         dfs.ha.namenodes.[nameservice id] assigns a unique identifier to every NameNode in the nameservice:
         a comma-separated list of NameNode IDs, which is how the DataNodes recognize all of the NameNodes.
         For example, with "myha01" as the nameservice ID, "nn1" and "nn2" can be used as the NameNode identifiers.
    -->
    <property>
        <name>dfs.nameservices</name>
        <value>myha01</value>
    </property>

    <!-- myha01 has two NameNodes: nn1 and nn2 -->
    <property>
        <name>dfs.ha.namenodes.myha01</name>
        <value>nn1,nn2</value>
    </property>

    <!-- RPC address of nn1 -->
    <property>
        <name>dfs.namenode.rpc-address.myha01.nn1</name>
        <value>master:9000</value>
    </property>
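
The hdfs-site.xml listing above stops at nn1's RPC address and the closing </configuration> tag is missing. For orientation only, here is a hedged sketch of how the rest of a typical Hadoop 2.7 HA configuration looks; the property names are standard, but the concrete values below (slave1 as nn2, port 9000, and the JournalNode addresses) are assumptions rather than values taken from the original post:

    <!-- RPC address of nn2 (assumed to be slave1) -->
    <property>
        <name>dfs.namenode.rpc-address.myha01.nn2</name>
        <value>slave1:9000</value>
    </property>

    <!-- Web UI addresses of the two NameNodes (assumed default port) -->
    <property>
        <name>dfs.namenode.http-address.myha01.nn1</name>
        <value>master:50070</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.myha01.nn2</name>
        <value>slave1:50070</value>
    </property>

    <!-- Shared edits directory on the JournalNodes (assumed addresses) -->
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://master:8485;slave1:8485;slave2:8485/myha01</value>
    </property>

    <!-- Failover proxy provider for the myha01 nameservice -->
    <property>
        <name>dfs.client.failover.proxy.provider.myha01</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
</configuration>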


  • Edit yarn-site.xml
vi yarn-site.xml 
  • Add the following
<configuration>
    <!-- Enable ResourceManager HA -->
    <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
    </property>

    <!-- Cluster id of the RM -->
    <property>
        <name>yarn.resourcemanager.cluster-id</name>
        <value>yrc</value>
    </property>

    <!-- Logical names of the RMs -->
    <property>
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm1,rm2</value>
    </property>

    <!-- Hostnames of the two RMs -->
    <property>
        <name>yarn.resourcemanager.hostname.rm1</name>
        <value>slave1</value>
    </property>

    <property>
        <name>yarn.resourcemanager.hostname.rm2</name>
        <value>slave2</value>
    </property>

    <!-- ZooKeeper cluster addresses -->
    <property>
        <name>yarn.resourcemanager.zk-address</name>
        <value>master:2181,slave1:2181,slave2:2181</value>
    </property>

    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>

    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>86400</value>
    </property>

    <!-- Enable automatic recovery -->
    <property>
        <name>yarn.resourcemanager.recovery.enabled</name>
        <value>true</value>
    </property>

    <!-- Store the ResourceManager state in the ZooKeeper cluster -->
    <property>
        <name>yarn.resourcemanager.store.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
    </property>
</configuration>


  • Finally, edit the slaves file
master
slave1
slave2


The above steps are exactly the same on all three virtual machines!
  • Then you can start hadoop

  • First start the journalnode on the three nodes; remember that this has to be done on all three

hadoop-daemon.sh start journalnode
  • When that is done, check with the jps command; you should see


  • QuorumPeerMain is the ZooKeeper process, and JournalNode is the one we just started

  • Next, format the namenode on the master node

hadoop namenode -format


  • In the output, look for the line confirming that the storage directory has been successfully formatted

  • After formatting, look at the contents of the /home/hadoop/data/hadoopdata directory


  • Copy the contents of this directory to slave1. slave1 is our standby node; we need it to support the high-availability mode, so that when master goes down, slave1 can immediately take over and keep working.
cd ..
scp -r hadoopdata/ root@slave1:/home/hadoop/data/


  • This ensures that the active and standby nodes hold exactly the same formatted content

Now we can start Hadoop

  • First, start HDFS on the master node
start-dfs.sh 


  • Next run start-yarn.sh; note that it has to be started on slave2 (see the note after the command about the second ResourceManager)
start-yarn.sh 
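
A hedged note, not from the original post: in Hadoop 2.7, start-yarn.sh brings up a ResourceManager only on the node where it is run (slave2 here, i.e. rm2) plus the NodeManagers listed in slaves. If the second ResourceManager configured as rm1 (slave1 in yarn-site.xml) is wanted as well, it usually has to be started by hand:

# On slave1: start the other ResourceManager of the HA pair
yarn-daemon.sh start resourcemanager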


  • Check each of the three hosts (master, slave1, and slave2) with jps

  • Notice that both master and slave1 have a namenode; in fact only one of them is in the active state and the other is in standby. How can we verify this? Enter master:50070 in a browser, and it is accessible


  • Enter slave1:50070 in a browser, and it is also accessible


  • Another way is to query the state of the two NameNodes we configured; the commands below print "active" or "standby"
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
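
As a small aside not in the original post, the ResourceManager pair defined in yarn-site.xml can be checked the same way:

# Prints "active" or "standby" for each ResourceManager
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2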


5. Install Spark

  • Download Spark
wget http://mirrors.shu.edu.cn/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
  • Extract it
tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz


  • Enter Spark's configuration directory
cd spark-2.4.0-bin-hadoop2.7/conf
  • Copy the configuration template spark-env.sh.template
cp spark-env.sh.template spark-env.sh


Edit spark-env.sh

vi spark-env.sh
  • Add the following
export JAVA_HOME=/usr/local/java1.8/jdk1.8.0_201
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.7.7/etc/hadoop
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=master:2181,slave1:2181,slave2:2181 -Dspark.deploy.zookeeper.dir=/spark"
export SPARK_WORKER_MEMORY=300m
export SPARK_WORKER_CORES=1


The Java and Hadoop variables should be copied from your system environment variables. SPARK_WORKER_MEMORY is the amount of memory a Spark worker may use, and SPARK_WORKER_CORES is the number of CPU cores it may use.

The above steps are exactly the same on all three virtual machines!
  • Configure the system environment variables
 vi /etc/profile
  • Add the following
export SPARK_HOME=/usr/local/spark/spark-2.4.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin 


  • Copy the slaves.template file
cp slaves.template slaves
  • Make the environment variables take effect
source  /etc/profile
  • Edit slaves
vi slaves
  • Add the following
master
slave1
slave2


  • Finally we start Spark. Note that even though the Spark environment variables are configured, Spark's start-all.sh conflicts with Hadoop's start-all.sh, so we must go into Spark's startup directory to run the script that starts everything.

  • Enter the startup directory

cd spark-2.4.0-bin-hadoop2.7/sbin
  • Run the startup script
./start-all.sh 


  • When it finishes, use jps to check the processes on each of the three nodes (master, slave1, and slave2)

Note that all three nodes run a Spark Worker process, and only master runs the Master process.
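
Because spark-env.sh enables ZooKeeper recovery mode (spark.deploy.recoveryMode=ZOOKEEPER), a standby Master can optionally be started on another node; a hedged sketch, not part of the original walkthrough, assuming the SPARK_HOME layout configured above:

# On slave1, for example: start a standby Spark Master
cd /usr/local/spark/spark-2.4.0-bin-hadoop2.7/sbin
./start-master.sh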

Visit master:8080, the Spark Master web UI.


At this point our Spark environment is officially up.

6. Try it out

Since we have configured the environment variables, we can start spark-shell directly.

 spark-shell 


Now we are inside the spark-shell.

Then enter some code

val lise = List(1,2,3,4,5)
val data = sc.parallelize(lise)
data.foreach(println)
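
Building on this, here is a minimal sketch of the "most frequent visitor" idea from the introduction, run on a tiny in-memory sample instead of the real 500-million-record data set (the sample IDs below are made up for illustration):

// Count visits per ID and take the ID with the highest count
val visits = sc.parallelize(List("user1", "user2", "user1", "user3", "user1", "user2"))
val top = visits.map(id => (id, 1)).reduceByKey(_ + _).sortBy(_._2, ascending = false).first()
println(top)   // (user1,3)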


Alternatively, we can enter PySpark

pyspark


View the SparkContext

