Spark/Hadoop/Zeppelin Upgrade (2)
1 Install Hadoop 2.6.4
> wget http://mirrors.ibiblio.org/apache/hadoop/common/hadoop-2.6.4/hadoop-2.6.4-src.tar.gz
The build from source fails on the hadoop-annotations module:
> mvn package -Pdist,native -DskipTests -Dtar
This is probably due to the Java version, CMake, or other missing build dependencies, so I choose to download the Hadoop 2.6.4 binary directly instead.
> wget http://mirror.nexcess.net/apache/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz
Configure and set it up the same way as 2.7.2.
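The unpack step can be sketched as follows; this is an assumption based on the /opt/hadoop path used by the configuration below:

```shell
# Unpack the Hadoop 2.6.4 binary release and link it at /opt/hadoop
# (assumption: /opt is the install root used throughout this post).
tar zxvf hadoop-2.6.4.tar.gz
sudo mv hadoop-2.6.4 /opt/hadoop-2.6.4
sudo ln -s /opt/hadoop-2.6.4 /opt/hadoop
```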
> cat core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://ubuntu-master:9000</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/opt/hadoop/temp</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>
</configuration>
Edit hadoop-env.sh to set JAVA_HOME:
export JAVA_HOME="/opt/jdk"
> cat hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>ubuntu-master:9001</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/opt/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/opt/hadoop/dfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
> cat slaves
ubuntu-dev1
ubuntu-dev2
> cat yarn-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>ubuntu-master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>ubuntu-master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>ubuntu-master:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>ubuntu-master:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>ubuntu-master:8088</value>
</property>
</configuration>
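The 2.7.2 setup I am mirroring normally also carries a mapred-site.xml that points MapReduce at YARN; a minimal sketch (assumption: this matches the earlier 2.7.2 configuration, it is not shown in this post):

```xml
<?xml version="1.0"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
```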
> mkdir /opt/hadoop/temp
> mkdir -p /opt/hadoop/dfs/data
> mkdir -p /opt/hadoop/dfs/name
Do the same on ubuntu-dev1 and ubuntu-dev2.
Hadoop is done.
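Before the very first start, the NameNode has to be formatted once; a sketch, assuming this is a fresh cluster (formatting erases any existing HDFS metadata):

```shell
# One-time NameNode format on ubuntu-master. Do NOT rerun this on a
# cluster that already holds data; it wipes the HDFS metadata.
cd /opt/hadoop
bin/hdfs namenode -format
```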
1 HDFS
cd /opt/hadoop
sbin/start-dfs.sh
http://ubuntu-master:50070/dfshealth.html#tab-overview
2 YARN
cd /opt/hadoop
sbin/start-yarn.sh
http://ubuntu-master:8088/cluster
2 Installation of Spark
Build Spark with Maven
> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.4 -Phive -DskipTests clean package
Build Spark with SBT
> build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.4 -Phive assembly
Here is the command to build the binary distribution:
> ./make-distribution.sh --name spark-1.6.1 --tgz -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.4 -Phive
Build success. I get the binary file spark-1.6.1-bin-spark-1.6.1.tgz.
Spark YARN Settings
On ubuntu-master
>cat conf/spark-env.sh
HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
This command starts the shell on YARN:
> MASTER=yarn-client bin/spark-shell
We can also use spark-submit to submit jobs to the remote cluster.
http://sillycat.iteye.com/blog/2103457
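A spark-submit invocation against the same YARN cluster might look like this; the examples jar name is an assumption based on the build flags above, so check the lib/ directory of your distribution:

```shell
# Submit the bundled SparkPi example to YARN in client mode
# (jar file name is an assumption; verify it under /opt/spark/lib).
cd /opt/spark
bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-client \
  lib/spark-examples-1.6.1-hadoop2.6.4.jar \
  10
```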
3 Zeppelin Installation
http://sillycat.iteye.com/blog/2286997
> git clone https://github.com/apache/incubator-zeppelin.git
> git checkout tags/v0.5.6
> mvn clean package -DskipTests -Pspark-1.6 -Dspark.version=1.6.1 -Phadoop-2.6 -Dhadoop.version=2.6.4
> mvn clean package -Pbuild-distr -DskipTests -Pspark-1.6 -Dspark.version=1.6.1 -Phadoop-2.6 -Dhadoop.version=2.6.4
Build success. The binary is generated here: /home/carl/install/incubator-zeppelin/zeppelin-distribution/target
Unzip it and check the configuration
> cat zeppelin-env.sh
# export HADOOP_CONF_DIR
# yarn-site.xml is located in configuration directory in HADOOP_CONF_DIR.
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop/"
# export SPARK_HOME
# (required) When it is defined, load it instead of Zeppelin embedded Spark libraries
export SPARK_HOME="/opt/spark"
. ${SPARK_HOME}/conf/spark-env.sh
# export ZEPPELIN_CLASSPATH="${SPARK_CLASSPATH}"
Start the Server
> bin/zeppelin-daemon.sh start
Then visit the console:
http://ubuntu-master:8080/#/
Error Message:
ERROR [2016-04-01 13:58:49,540] ({qtp1232306490-35} NotebookServer.java[onMessage]:162) - Can't handle message
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.cancel(RemoteInterpreter.java:248)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.cancel(LazyOpenInterpreter.java:99)
at org.apache.zeppelin.notebook.Paragraph.jobAbort(Paragraph.java:229)
at org.apache.zeppelin.scheduler.Job.abort(Job.java:232)
at org.apache.zeppelin.socket.NotebookServer.cancelParagraph(NotebookServer.java:695)
More error messages in the file zeppelin-carl-ubuntu-master.out:
16/04/01 14:10:40 WARN netty.NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(1,Container killed by YARN for exceeding memory limits. 2.1 GB of 2.1 GB virtual memory used. Consider boosting spark.yarn.executor.memoryOverhead.)] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
On the Hadoop slaves, in the log file yarn-carl-nodemanager-ubuntu-dev2.log:
2016-04-01 15:28:54,525 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 2229 for container-id container_1459541332549_0002_02_000001: 124.6 MB of 1 GB physical memory used; 2.1 GB of 2.1 GB virtual memory used
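The numbers in that log line follow from YARN's virtual-memory check: the vmem cap is the container's physical limit multiplied by yarn.nodemanager.vmem-pmem-ratio, which defaults to 2.1. A quick sanity check in shell:

```shell
# YARN vmem cap = physical container limit * yarn.nodemanager.vmem-pmem-ratio.
# The default ratio is 2.1; integer math is scaled by 10 to avoid floats.
pmem_mb=1024                               # 1 GB physical limit from the log
ratio_x10=21                               # default vmem-pmem ratio 2.1
vmem_cap_mb=$(( pmem_mb * ratio_x10 / 10 ))
echo "${vmem_cap_mb} MB"                   # prints "2150 MB", the 2.1 GB cap in the log
```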
Solution:
http://www.wdong.org/wordpress/blog/2015/01/08/spark-on-yarn-where-have-all-my-memory-gone/
http://stackoverflow.com/questions/21005643/container-is-running-beyond-memory-limits
Adding this configuration to yarn-site.xml fixed the problem:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
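Besides disabling the check, the log's own suggestion also works: raise spark.yarn.executor.memoryOverhead. In Spark 1.6 the default overhead is max(10% of executor memory, 384 MB), which JVM off-heap usage can easily exceed; a quick calculation of that default:

```shell
# Spark 1.6 default for spark.yarn.executor.memoryOverhead:
# max(executorMemory * 0.10, 384) MB.
executor_mb=1024                           # e.g. --executor-memory 1g
overhead_mb=$(( executor_mb * 10 / 100 ))
[ "$overhead_mb" -lt 384 ] && overhead_mb=384
echo "${overhead_mb} MB"                   # prints "384 MB" for a 1 GB executor
```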
After we start a task in Zeppelin, we can visit the Spark context from this console:
http://ubuntu-master:4040/
References:
http://sillycat.iteye.com/blog/2286997
Reposted from sillycat.iteye.com/blog/2288141