Spark/Hadoop/Zeppelin Upgrade(2)

1 Install Hadoop 2.6.4
> wget http://mirrors.ibiblio.org/apache/hadoop/common/hadoop-2.6.4/hadoop-2.6.4-src.tar.gz

The build fails on the annotations package when running:
> mvn package -Pdist,native -DskipTests -Dtar
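
A quick sanity check of the toolchain helps narrow this down; the native build of Hadoop 2.6 is known to require protoc 2.5.0, and the Java or CMake versions can also trip it up:
> java -version
> mvn -version
> cmake --version
> protoc --version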

It is probably because of the versions of Java, CMake, or other packages, so I choose to download the Hadoop 2.6.4 binary directly instead.
> wget http://mirror.nexcess.net/apache/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz
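
Unpack it into /opt so that the /opt/hadoop paths used in the configs below resolve; the versioned directory plus symlink is just my convention:
> tar zxvf hadoop-2.6.4.tar.gz
> sudo mv hadoop-2.6.4 /opt/hadoop-2.6.4
> sudo ln -s /opt/hadoop-2.6.4 /opt/hadoop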

Configure and set it up the same way as 2.7.2.
> cat core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ubuntu-master:9000</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/opt/hadoop/temp</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
  </property>
</configuration>

Edit hadoop-env.sh
export JAVA_HOME="/opt/jdk"

> cat hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>ubuntu-master:9001</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>

> cat slaves
ubuntu-dev1
ubuntu-dev2
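
The start scripts will ssh into every host listed in slaves, so passwordless SSH from the master is assumed; if it is not in place yet, something like this sets it up (the carl user is from my environment):
> ssh-keygen -t rsa
> ssh-copy-id carl@ubuntu-dev1
> ssh-copy-id carl@ubuntu-dev2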

> cat yarn-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>ubuntu-master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>ubuntu-master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>ubuntu-master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>ubuntu-master:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>ubuntu-master:8088</value>
  </property>
</configuration>

> mkdir /opt/hadoop/temp

> mkdir -p /opt/hadoop/dfs/data

> mkdir -p /opt/hadoop/dfs/name

Do the same on ubuntu-dev1 and ubuntu-dev2.
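
If this is a fresh cluster rather than one reusing the old 2.7.2 metadata, format the namenode once before the first start; do NOT run this on an existing HDFS, it wipes the metadata:
> bin/hdfs namenode -format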

Hadoop is done.
1 HDFS
cd /opt/hadoop
sbin/start-dfs.sh
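
To verify HDFS is up, jps should show NameNode and SecondaryNameNode on the master (DataNode on the slaves), and the report should list both datanodes:
> jps
> bin/hdfs dfsadmin -report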

http://ubuntu-master:50070/dfshealth.html#tab-overview

2 YARN
cd /opt/hadoop
sbin/start-yarn.sh
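
Likewise, this should list both NodeManagers once YARN is up:
> bin/yarn node -list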

http://ubuntu-master:8088/cluster

2 Installation of Spark
Build Spark with Maven:
> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.4 -Phive -DskipTests clean package

Build Spark with SBT:
> build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.4 -Phive assembly

Here is the command to build the binary distribution:
> ./make-distribution.sh --name spark-1.6.1 --tgz -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.4 -Phive

Build success. I get this binary file: spark-1.6.1-bin-spark-1.6.1.tgz
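
I unpack it to /opt/spark, which is the SPARK_HOME used later; again, the versioned directory plus symlink is my convention:
> tar zxvf spark-1.6.1-bin-spark-1.6.1.tgz
> sudo mv spark-1.6.1-bin-spark-1.6.1 /opt/spark-1.6.1
> sudo ln -s /opt/spark-1.6.1 /opt/spark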

Spark YARN Settings
On ubuntu-master
> cat conf/spark-env.sh
HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop

This command starts the Spark shell on YARN:
> MASTER=yarn-client bin/spark-shell

We can also use spark-submit to submit a job to the remote cluster, as shown below.
http://sillycat.iteye.com/blog/2103457
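
For example, a SparkPi smoke test on YARN looks roughly like this; the examples jar name varies with the build, so the glob is an assumption:
> bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client lib/spark-examples*.jar 10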

3 Zeppelin Installation
http://sillycat.iteye.com/blog/2286997

> git clone https://github.com/apache/incubator-zeppelin.git

> cd incubator-zeppelin
> git checkout tags/v0.5.6

> mvn clean package -DskipTests -Pspark-1.6 -Dspark.version=1.6.1 -Phadoop-2.6 -Dhadoop.version=2.6.4

> mvn clean package -Pbuild-distr -DskipTests -Pspark-1.6 -Dspark.version=1.6.1 -Phadoop-2.6 -Dhadoop.version=2.6.4

Build success. The binary is generated here: /home/carl/install/incubator-zeppelin/zeppelin-distribution/target

Unzip it and check the configuration.
> cat zeppelin-env.sh

# export HADOOP_CONF_DIR
# yarn-site.xml is located in configuration directory in HADOOP_CONF_DIR.
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop/"

# export SPARK_HOME
# (required) When it is defined, load it instead of Zeppelin embedded Spark libraries
export SPARK_HOME="/opt/spark"
. ${SPARK_HOME}/conf/spark-env.sh
# export ZEPPELIN_CLASSPATH="${SPARK_CLASSPATH}"

Start the Server
> bin/zeppelin-daemon.sh start
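
The same script can confirm the server is actually running; if it is not, the logs/ directory is the place to look:
> bin/zeppelin-daemon.sh status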

Then visit the console:
http://ubuntu-master:8080/#/

Error Message:
ERROR [2016-04-01 13:58:49,540] ({qtp1232306490-35} NotebookServer.java[onMessage]:162) - Can't handle message
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.cancel(RemoteInterpreter.java:248)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.cancel(LazyOpenInterpreter.java:99)
        at org.apache.zeppelin.notebook.Paragraph.jobAbort(Paragraph.java:229)
        at org.apache.zeppelin.scheduler.Job.abort(Job.java:232)
        at org.apache.zeppelin.socket.NotebookServer.cancelParagraph(NotebookServer.java:695)

More error messages are in the log file zeppelin-carl-ubuntu-master.out:

16/04/01 14:10:40 WARN netty.NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(1,Container killed by YARN for exceeding memory limits. 2.1 GB of 2.1 GB virtual memory used. Consider boosting spark.yarn.executor.memoryOverhead.)] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
        at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)

On the Hadoop slaves, the log yarn-carl-nodemanager-ubuntu-dev2.log shows:

2016-04-01 15:28:54,525 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 2229 for container-id container_1459541332549_0002_02_000001: 124.6 MB of 1 GB physical memory used; 2.1 GB of 2.1 GB virtual memory used

Solution:
http://www.wdong.org/wordpress/blog/2015/01/08/spark-on-yarn-where-have-all-my-memory-gone/

http://stackoverflow.com/questions/21005643/container-is-running-beyond-memory-limits

This configuration in yarn-site.xml fixed the problem.
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
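
Alternatively, instead of disabling the check, the warning itself suggests raising spark.yarn.executor.memoryOverhead; in conf/spark-defaults.conf that would look like this (512 MB is only a starting point, tune it to the executor size):
spark.yarn.executor.memoryOverhead   512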

After we start a task in Zeppelin, we can visit the Spark context from this console:
http://ubuntu-master:4040/

References:
http://sillycat.iteye.com/blog/2286997
