Spark cluster installation and integration with a Hadoop cluster

Foreword

  I have recently been working with Hadoop + Spark + Python, so I set up a local Hadoop environment. The base environment setup is described in "hadoop2.7.7 distributed cluster installation and configuration".

  This post mainly explains how to build a Spark cluster and integrate it with the Hadoop cluster.

Installation Process

  Installing Spark requires installing Scala first. During installation, make sure the Scala version matches the Spark version, and that the Spark build matches the Hadoop version; the details can be checked on the download page of the official Spark website.

Download and install Scala

https://www.scala-lang.org/files/archive/scala-2.11.11.tgz
tar zxf scala-2.11.11.tgz

Move and modify permissions

chown hduser:hduser -R scala-2.11.11
mv /root/scala-2.11.11 /usr/local/scala

Configure environment variables

vim .bashrc
#scala var
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin

Once the installation is complete, reload the environment (e.g. source ~/.bashrc) and you can enter the Scala interactive shell by running scala.

Precautions

Note: the Spark and Hadoop versions must match, because Spark reads from Hadoop HDFS and must be able to run programs on Hadoop YARN, so the Spark package must be chosen according to the Hadoop version currently installed.
I am running hadoop2.7.7, so I chose the package "Pre-built for Apache Hadoop 2.7 and later".

Download and install Spark

http://mirror.bit.edu.cn/apache/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.7.tgz
tar zxf spark-2.3.3-bin-hadoop2.7.tgz

Move and modify permissions

chown -R hduser:hduser spark-2.3.3-bin-hadoop2.7
mv spark-2.3.3-bin-hadoop2.7 /usr/local/spark

Configure environment variables

vim .bashrc
#spark var
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin

Enter the Spark interactive shell by running pyspark

By default pyspark uses Python 2.7.x, which is fairly old. You can modify pyspark to select a different Python version (provided that version of Python is installed on every server).

Edit spark-env.sh on the master  # if the file does not exist: cp spark-env.sh.template spark-env.sh
Add the following as the last line
export PYSPARK_PYTHON=/usr/bin/python3
Edit the pyspark script in the Spark bin directory on the master
Change
    PYSPARK_PYTHON=python
to
    PYSPARK_PYTHON=python3
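
After making these changes, a quick way to confirm that pyspark is really running on Python 3 is to start pyspark again and check the interpreter version from inside the shell (this only inspects the driver-side interpreter):

    # inside the pyspark shell: print the Python version the driver is using
    import sys
    print(sys.version)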

# Disable INFO-level log output
Copy the log4j template file in the conf directory to log4j.properties
Change
    log4j.rootCategory=INFO, console
to
    log4j.rootCategory=WARN, console

Testing and results

Running Spark locally

pyspark  --master local[4]

    Spark reading a local file; the file must exist on every node
    textFile=sc.textFile("file:/usr/local/spark/README.md")
    Spark reading an HDFS file
    textFile2=sc.textFile("hdfs://hadoop-master-001:9000/wordcount/input/LICENSE.txt")
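
As a quick end-to-end check of transformations and actions, the word count below can be typed into the same pyspark session; it is a minimal sketch that uses the README shipped with Spark, as above:

    # split lines into words, pair each word with 1, then sum the counts per word
    words = sc.textFile("file:/usr/local/spark/README.md") \
              .flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
    # show the 10 most frequent words
    words.takeOrdered(10, key=lambda pair: -pair[1])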

Running Spark on Hadoop YARN

HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
    textFile = sc.textFile("hdfs://hadoop-master-001:9000/wordcount/input/LICENSE.txt")
    textFile.count()
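
The same job can also be submitted to YARN as a stand-alone script instead of being typed into the shell. The sketch below is only an illustration: the file name wordcount.py is hypothetical, and the submit command in the comment assumes the same HADOOP_CONF_DIR as above.

    # wordcount.py -- a hypothetical example script; submit it with, for example:
    #   HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop spark-submit --master yarn --deploy-mode client wordcount.py
    from pyspark import SparkContext

    sc = SparkContext(appName="LicenseLineCount")
    textFile = sc.textFile("hdfs://hadoop-master-001:9000/wordcount/input/LICENSE.txt")
    print(textFile.count())
    sc.stop()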

Running on a Spark Standalone Cluster

Edit spark-env.sh  # in $SPARK_HOME/conf
    export SPARK_MASTER_HOST=hadoop-master-001        # IP or hostname of the master
    export SPARK_WORKER_CORES=1                       # CPU cores used by each worker
    export SPARK_WORKER_MEMORY=512m                   # memory used by each worker
    export SPARK_WORKER_INSTANCES=4                   # number of worker instances

Pack up the Spark directory on the master and copy it to every slave node.

Configure the Spark Standalone Cluster servers (on the master)
    vim /usr/local/spark/conf/slaves   # add the worker IPs or hostnames
    hadoop-data-001
    hadoop-data-002
    hadoop-data-003

Starting and stopping

/usr/local/spark/sbin/start-all.sh

/usr/local/spark/sbin/stop-all.sh

pyspark --master spark://hadoop-master-001:7077 --num-executors 1 --total-executor-cores 3 --executor-memory 512m
    textFile = sc.textFile("file:/usr/local/spark/README.md")
    textFile.count()
    Note: in cluster mode (e.g. yarn-client or Spark standalone), the program is distributed across different servers, so when reading a local file you must make sure the file exists on every machine, otherwise an error occurs.
    Recommendation: in a cluster it is best to read files from HDFS, so that missing-file errors cannot occur.
    text2=sc.textFile("hdfs://hadoop-master-001:9000/wordcount/input/LICENSE.txt")
    text2.count()
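
To see that the standalone cluster is really splitting the work across workers, you can also inspect how the HDFS file is partitioned from the same session (a small sketch; the exact partition count depends on the file size and the HDFS block size):

    # how many partitions the HDFS file was split into
    text2.getNumPartitions()
    # total number of characters in the file, computed on the executors
    text2.map(len).sum()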

Spark Web UI

The standalone master's web UI can be viewed at http://hadoop-master-001:8080, and each running application exposes its own UI on port 4040.

Troubleshooting

Error message when running pyspark on Hadoop YARN:
ERROR SparkContext: Error initializing SparkContext. org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master

Solution
Open http://hadoop-master-001:8088/cluster/app/ , click History on the most recent application, and check the diagnostics:
"Diagnostics: Container [pid=29708,containerID=container_1563435447194_0007_02_000001] is running beyond virtual memory limits. Current usage: 55.6 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container."

Modify yarn-site.xml on all nodes, adding the following to disable the physical and virtual memory checks that killed the container:
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
On the master node, run stop-yarn.sh and then start-yarn.sh to restart YARN on all nodes.

 


Origin www.cnblogs.com/charles1ee/p/11240158.html