Foreword
I have recently been working with Hadoop + Spark + Python, so I set up a local Hadoop environment. For the base environment, see the post "Hadoop 2.7.7 distributed cluster installation and configuration".
This post mainly explains how to install Spark and integrate it into that Hadoop cluster.
Installation Process
Installing Spark requires Scala to be installed first. During installation, make sure the Scala version corresponds to the Spark version, and that the Spark build corresponds to the Hadoop version. The exact combinations can be checked on the download page of the official Spark website.
Download and install Scala
https://www.scala-lang.org/files/archive/scala-2.11.12.tgz
tar zxf scala-2.11.12.tgz
Move and modify permissions
chown -R hduser:hduser scala-2.11.12
mv /root/scala-2.11.12 /usr/local/scala
Configure environment variables
vim .bashrc
#scala var
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
Once the installation is complete and .bashrc has been reloaded (source .bashrc), you can enter the interactive shell by running scala
Precautions
Note: the Spark and Hadoop versions must match, because Spark reads from Hadoop HDFS and must be able to run programs on Hadoop YARN. Choose the Spark package according to the Hadoop version currently installed.
I am running Hadoop 2.7.7, so I chose "Pre-built for Apache Hadoop 2.7 and later"
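The version check described above can be sketched in a few lines of Python. This is only an illustration, not part of Spark: the function names are hypothetical, and it assumes the package follows the usual spark-<version>-bin-hadoop<major>.<minor>.tgz naming and that "2.7 and later" means the same major version with an equal or higher minor version.

```python
import re


def hadoop_line_of_spark_package(filename):
    """Return the (major, minor) Hadoop version a pre-built Spark archive targets."""
    match = re.search(r"-bin-hadoop(\d+)\.(\d+)", filename)
    if match is None:
        raise ValueError("not a pre-built-for-Hadoop archive name: %s" % filename)
    return int(match.group(1)), int(match.group(2))


def package_fits_hadoop(filename, hadoop_version):
    """Assumed rule: same major version, installed minor >= package minor."""
    pkg_major, pkg_minor = hadoop_line_of_spark_package(filename)
    major, minor = (int(part) for part in hadoop_version.split(".")[:2])
    return major == pkg_major and minor >= pkg_minor


if __name__ == "__main__":
    # Hadoop 2.7.7 with the "Pre-built for Apache Hadoop 2.7 and later" package.
    print(package_fits_hadoop("spark-2.3.3-bin-hadoop2.7.tgz", "2.7.7"))
```

The official download page remains the authority on which combinations are supported. The check only catches an obviously wrong download.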
Download and install Spark
http://mirror.bit.edu.cn/apache/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.7.tgz
tar zxf spark-2.3.3-bin-hadoop2.7.tgz
Move and modify permissions
chown -R hduser:hduser spark-2.3.3-bin-hadoop2.7
mv spark-2.3.3-bin-hadoop2.7 /usr/local/spark
Configure environment variables
vim .bashrc
#spark var
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
Enter the Spark interactive shell by running pyspark
PySpark defaults to Python 2.7.x, which is an old version. You can point pyspark at a different version, provided that version of Python is already installed on the servers.
Edit spark-env.sh on the master #if the file does not exist: cp spark-env.sh.template spark-env.sh
Add the following as the last line
export PYSPARK_PYTHON=/usr/bin/python3
Edit the pyspark script in the Spark bin directory on the master
In the file, change
PYSPARK_PYTHON=python
to
PYSPARK_PYTHON=python3
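Before pointing PYSPARK_PYTHON at an interpreter, it helps to confirm that the interpreter really exists on each server and reports the expected version. Here is a minimal sketch using only the standard library. The function name is hypothetical, and /usr/bin/python3 is simply the path used in this post.

```python
import subprocess
import sys


def python_version_of(interpreter):
    """Run the given interpreter and return its 'major.minor' version string."""
    result = subprocess.run(
        [interpreter, "-c", "import sys; print('%d.%d' % sys.version_info[:2])"],
        capture_output=True,
        text=True,
        check=True,  # raise if the interpreter exits with an error
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # On a cluster node this would be the PYSPARK_PYTHON path, e.g. /usr/bin/python3.
    print(python_version_of(sys.executable))
```

If the path is missing on a node, subprocess.run raises FileNotFoundError, which is easier to diagnose than a failure inside a Spark job.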
#Suppress INFO log output
Copy the log4j template file in the conf directory (log4j.properties.template) to log4j.properties
In the file, change
log4j.rootCategory=INFO, console
to
log4j.rootCategory=WARN, console
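The copy-and-edit step above is easy to script when several nodes need the same change. The sketch below is my own helper, not something shipped with Spark. It assumes the template is named log4j.properties.template and contains the exact line shown above.

```python
import shutil
from pathlib import Path


def quiet_spark_logging(conf_dir):
    """Create log4j.properties from the template and lower the root level to WARN."""
    conf = Path(conf_dir)
    template = conf / "log4j.properties.template"
    target = conf / "log4j.properties"
    if not target.exists():
        # Keep the template untouched and work on a copy.
        shutil.copyfile(template, target)
    text = target.read_text()
    text = text.replace(
        "log4j.rootCategory=INFO, console",
        "log4j.rootCategory=WARN, console",
    )
    target.write_text(text)
    return target


# Example call on the master (path taken from this post):
# quiet_spark_logging("/usr/local/spark/conf")
```

Running it twice is harmless, because the second run finds no INFO line left to replace.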
Testing and results
Run Spark locally
pyspark --master local[4]
Spark reads a local file; the file must exist on all nodes
textFile=sc.textFile("file:/usr/local/spark/README.md")
Spark reads a file from HDFS
textFile2=sc.textFile("hdfs://hadoop-master-001:9000/wordcount/input/LICENSE.txt")
Run Spark on Hadoop YARN
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
textFile = sc.textFile("hdfs://hadoop-master-001:9000/wordcount/input/LICENSE.txt")
textFile.count()
Run on a Spark Standalone Cluster
Edit spark-env.sh #located in $SPARK_HOME/conf
export SPARK_MASTER_HOST=hadoop-master-001 # IP or hostname of the master
export SPARK_WORKER_CORES=1 # CPU cores used by each worker
export SPARK_WORKER_MEMORY=512m # memory used by each worker
export SPARK_WORKER_INSTANCES=4 # number of worker instances per node
Package the spark directory on the master and copy it to every slave node.
Configure the Spark Standalone Cluster servers (on the master)
vim /usr/local/spark/conf/slaves and add the IPs or hostnames
hadoop-data-001
hadoop-data-002
hadoop-data-003
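To see what these settings add up to, the arithmetic can be sketched as follows. The function names are hypothetical. The sketch assumes the cores and memory values apply to each worker instance, and that every host listed in the slaves file starts the same number of instances.

```python
def memory_to_mb(value):
    """Convert a spark-env.sh style size such as '512m' or '2g' to megabytes."""
    factors = {"m": 1, "g": 1024}
    number, unit = value[:-1], value[-1].lower()
    if unit not in factors:
        raise ValueError("unsupported memory unit in %r" % value)
    return int(number) * factors[unit]


def standalone_capacity(hosts, instances, cores_per_worker, memory_per_worker):
    """Total workers, cores, and memory offered by the standalone cluster."""
    workers = len(hosts) * instances
    return {
        "workers": workers,
        "cores": workers * cores_per_worker,
        "memory_mb": workers * memory_to_mb(memory_per_worker),
    }


if __name__ == "__main__":
    slaves = ["hadoop-data-001", "hadoop-data-002", "hadoop-data-003"]
    print(standalone_capacity(slaves, instances=4, cores_per_worker=1,
                              memory_per_worker="512m"))
```

With the values from this post, each node offers 4 cores and 2048 MB, so the three slave nodes together offer 12 workers, 12 cores, and 6144 MB. Make sure each machine actually has that much to spare.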
Start and stop
/usr/local/spark/sbin/start-all.sh
/usr/local/spark/sbin/stop-all.sh
pyspark --master spark://hadoop-master-001:7077 --num-executors 1 --total-executor-cores 3 --executor-memory 512m
textFile = sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
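Note: in cluster mode, such as yarn-client or Spark standalone, the program is distributed across different servers. When reading a local file you must therefore make sure every machine has that file, otherwise an error occurs.
Recommendation: in cluster mode it is best to read files from HDFS, which avoids the missing-file problem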
text2=sc.textFile("hdfs://hadoop-master-001:9000/wordcount/input/LICENSE.txt")
text2.count()
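The difference between the two kinds of path comes down to the URI scheme. A small standard-library sketch can flag paths that need to exist on every node before they are passed to sc.textFile. The function name and the returned labels are my own. Treating a path with no scheme as "resolved by the configured default filesystem" is an assumption about how the cluster is set up.

```python
from urllib.parse import urlparse


def classify_input_path(path):
    """Label a textFile() path by where the data has to live."""
    scheme = urlparse(path).scheme
    if scheme == "file":
        return "local"       # must exist at the same path on every node
    if scheme == "hdfs":
        return "hdfs"        # shared storage, visible to all executors
    return "default-fs"      # no or other scheme: depends on the cluster configuration


if __name__ == "__main__":
    for p in ("file:/usr/local/spark/README.md",
              "hdfs://hadoop-master-001:9000/wordcount/input/LICENSE.txt"):
        print(p, "->", classify_input_path(p))
```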
spark web ui
Troubleshooting
Error message when running pyspark on Hadoop YARN:
ERROR SparkContext: Error initializing SparkContext. org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master
Solution
Open http://hadoop-master-001:8088/cluster/app/, click history on the most recent application, and check the details
"Diagnostics: Container [pid=29708,containerID=container_1563435447194_0007_02_000001] is running beyond virtual memory limits. Current usage: 55.6 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container."
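The diagnostics line shows which limit was hit. The sketch below pulls the numbers out of the message. The function names and the regular expression are my own and only cover the wording shown above. As far as I know, the 2.1 GB virtual limit is the 1 GB physical allocation multiplied by yarn.nodemanager.vmem-pmem-ratio, whose default is 2.1.

```python
import re

UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4}
USAGE = re.compile(
    r"([\d.]+) (B|KB|MB|GB|TB) of ([\d.]+) (B|KB|MB|GB|TB) "
    r"(physical|virtual) memory used"
)


def parse_memory_usage(diagnostics):
    """Map 'physical'/'virtual' to (used_bytes, limit_bytes)."""
    usage = {}
    for used, used_unit, limit, limit_unit, kind in USAGE.findall(diagnostics):
        usage[kind] = (float(used) * UNITS[used_unit],
                       float(limit) * UNITS[limit_unit])
    return usage


def exceeded_limits(usage):
    """Return the kinds of memory whose usage is above the limit."""
    return [kind for kind, (used, limit) in usage.items() if used > limit]


if __name__ == "__main__":
    message = ("Current usage: 55.6 MB of 1 GB physical memory used; "
               "2.2 GB of 2.1 GB virtual memory used. Killing container.")
    print(exceeded_limits(parse_memory_usage(message)))
```

For the message above, only the virtual memory limit is exceeded. Physical usage is far below its 1 GB limit.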
Edit yarn-site.xml on all nodes and add the following
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
On the master node, run stop-yarn.sh and then start-yarn.sh to restart YARN on all nodes