Setting up pseudo-distributed Spark and Hive on Spark

  I have been using the Spark big data platform for some time, mostly for experiments. Besides building a fully distributed Spark cluster, I also set up a pseudo-distributed deployment for testing. I am writing this post to record how the environment was built, so that I do not forget it later, and so that beginners and anyone who has hit the same pitfalls can use it as a reference.

   Hive on Spark means running Hive on Spark, i.e., Spark is used as Hive's execution engine instead of the default MapReduce.

  The official documentation is available at Hive on Spark: Getting Started (see the references at the end).

1. Install the basic environment

1.1 Java 1.8 environment setup

  1) Download and unzip jdk1.8:

# tar -zxvf  jdk-8u201-linux-i586.tar.gz  -C /usr/local

  2) Add the Java environment variables to /etc/profile:

export JAVA_HOME=/usr/local/jdk1.8.0_201 
export PATH=$PATH:$JAVA_HOME/bin
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib

  3) After saving, refresh the environment variables:

# source /etc/profile

  4) Check whether Java is configured correctly; if it is, the version information will be printed:

# java -version

 

1.2 Scala environment setup

  1) Download and extract the Scala package:

# tar -zxf scala-2.11.12.tgz -C /usr/local

  2) Add the Scala environment variables to /etc/profile:

export SCALA_HOME=/usr/local/scala-2.11.12
export PATH=${SCALA_HOME}/bin:$PATH

  3) Save and refresh the environment variables:

# source /etc/profile

  4) Check whether Scala is configured correctly; if it is, the version information will be printed:

# scala -version

 

1.3 Maven installation

  1) Download and install Maven

# tar -zxf apache-maven-3.6.1-bin.tar.gz -C /usr/local
# mv /usr/local/apache-maven-3.6.1 /usr/local/maven-3.6.1    # rename so the path matches MAVEN_HOME below

  2) Add Maven to the environment variables in /etc/profile:

export MAVEN_HOME=/usr/local/maven-3.6.1
export PATH=$JAVA_HOME/bin:$MAVEN_HOME/bin:$PATH 

  3) Save and refresh the environment variables:

# source /etc/profile

  4) Check whether Maven is configured correctly; if it is, the version information will be printed:

# mvn -version

  5) Replace the central repository mirror with the Aliyun central repository mirror:

# vim /usr/local/maven-3.6.1/conf/settings.xml

  Find the mirrors element and add a mirror child element inside it:

    <!-- Aliyun central repository mirror -->
    <mirror>
        <id>nexus-aliyun</id>
        <mirrorOf>*</mirrorOf>
        <name>Nexus aliyun</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public</url>
    </mirror>

  After adding the mirror, save settings.xml.

 

 

 2. Compiling the Spark 2.3.3 source

   According to the official documentation, Hive on Spark is only tested with a specific version of Spark, so a given Hive release is only guaranteed to work with that particular Spark version. Other versions of Spark may work with a given version of Hive, but this is not guaranteed. The list of Hive versions and their corresponding compatible Spark versions can be found in the official documentation.

  The versions used in this article are Hive 3.1.1 and Spark 2.3.3, and it is assumed that Hive has already been installed successfully.

  1) Download and unzip Spark source code

# wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.3.3/spark-2.3.3.tgz   
# tar -zxf spark-2.3.3.tgz

  2) Compile the Spark source

  The command below is adapted from the tutorial in the official Spark documentation. Because Spark will be used together with Hadoop (see my earlier post for a pseudo-distributed Hadoop deployment) and Hive, the following uses the Maven-based build script that ships with Spark:

# ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7"

   You can also compile directly with the Maven command:

# ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -DskipTests clean package
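
  The official building guide for Spark 2.3.3 recommends raising Maven's memory limits before compiling; the build/mvn wrapper does this automatically, but when calling mvn directly the build may otherwise fail with an OutOfMemoryError:

# export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"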

  If the script finishes without errors and produces the spark-2.3.3-bin-hadoop2-without-hive.tgz package, the build succeeded.

  Extract the compiled Spark tarball to /usr/local and rename the directory:

# tar -zxf spark-2.3.3-bin-hadoop2-without-hive.tgz -C  /usr/local
# mv /usr/local/spark-2.3.3-bin-hadoop2-without-hive /usr/local/spark-2.3.3

  3) Configure pseudo-distributed Spark

  Configure the SPARK_HOME environment variable (in /etc/profile) and refresh:

export SPARK_HOME=/usr/local/spark-2.3.3
export PATH=$PATH:$SPARK_HOME/bin

  Go to the conf directory under the Spark root directory and generate the slaves file:

# cd $SPARK_HOME/conf
# cp slaves.template slaves    # copy the template to create the slaves file; in pseudo-distributed mode it needs no changes

  Next, modify the spark-env.sh file; copy and rename the template first:

# cp spark-env.sh.template spark-env.sh
# vim spark-env.sh

  Add the following content:

export JAVA_HOME=/usr/local/jdk1.8.0_201
export SCALA_HOME=/usr/local/scala-2.11.12
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SPARK_WORKER_MEMORY=2048m
export SPARK_MASTER_IP=hadoop
export SPARK_WORKER_CORES=2
export SPARK_HOME=/usr/local/spark-2.3.3
export SPARK_LIBRARY_PATH=/usr/local/spark-2.3.3/lib
export SPARK_DIST_CLASSPATH=$(hadoop classpath)    # command substitution; see the note below
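
  For reference, hadoop classpath is a standard Hadoop command whose output is the colon-separated list of Hadoop configuration directories and jars. Because this Spark build used the hadoop-provided profile, those jars are not bundled with Spark and must be supplied through SPARK_DIST_CLASSPATH:

# hadoop classpath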

  

  4) Start Spark

  Step 1: before starting Spark, make sure Hadoop has started successfully. Check the running processes with jps first.

  If all five Hadoop processes are running and none of them has exited on its own, Hadoop started successfully.
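
  For reference, on a pseudo-distributed Hadoop deployment the five processes are typically the ones below (the process IDs are placeholders and will differ on your machine):

# jps
2321 NameNode
2450 DataNode
2663 SecondaryNameNode
2815 ResourceManager
2930 NodeManager
3102 Jps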

  Step 2: start Spark.

     Go to Spark's sbin directory and run start-all.sh to start Spark; after it starts, check the latest process list with jps, as shown below.
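
  The commands for this step are as follows (assuming SPARK_HOME is set as above); after start-all.sh completes, jps should additionally show a Master and a Worker process:

# cd $SPARK_HOME/sbin
# ./start-all.sh
# jps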

   

  Visit http://ip:8080 (the Spark Master web UI):

  

  The page shows the information of one Worker node.

  Visit http://ip:4040 to open the spark-shell web console (you must first start a SparkContext with ./bin/spark-shell); the application web UI will then be displayed.

  

  If several SparkContexts are running on the same machine, the web UI port is incremented automatically: 4041, 4042, 4043, and so on. To be able to browse event logs after an application has finished, it is enough to set spark.eventLog.enabled.
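
  A minimal sketch of enabling persistent event logs for spark-shell use; the HDFS host, port and log path are assumptions and must match your own HDFS setup (they mirror the values used in hive-site.xml later):

# hdfs dfs -mkdir -p /spark-log
# cd $SPARK_HOME/conf
# cp spark-defaults.conf.template spark-defaults.conf
# echo "spark.eventLog.enabled   true" >> spark-defaults.conf
# echo "spark.eventLog.dir       hdfs://hadoop:8020/spark-log" >> spark-defaults.conf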

  5) Verify that Spark is configured correctly

  Note: before starting Spark, make sure that the Hadoop cluster and YARN have both been started.

    • Start Spark from the $SPARK_HOME directory:
# $SPARK_HOME/sbin/start-all.sh
    • From the $SPARK_HOME directory, submit the example job that computes Pi to verify that Spark works, using the following command:
# ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client ./examples/jars/spark-examples_2.11-2.3.3.jar 10

  If there are no errors and a value of Pi is printed, the Spark cluster is working correctly.
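
  To make the result easier to spot in the YARN client output, the line that prints Pi can be filtered out (purely a convenience):

# ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client ./examples/jars/spark-examples_2.11-2.3.3.jar 10 2>&1 | grep "Pi is roughly"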

  

  6) Stop Spark

  Go to the Spark directory and run:

# cd $SPARK_HOME
# ./sbin/stop-all.sh

  7) Stop Hadoop

  Go to the Hadoop directory and run:

# cd $HADOOP_HOME
# ./sbin/stop-yarn.sh
# ./sbin/stop-dfs.sh

  (./sbin/stop-all.sh performs the same operations, but it prints a warning that the script is deprecated and that the two commands above should be used instead.)

 

 3. Hive on Spark

  1) Add the compiled Spark dependencies to the $HIVE_HOME/lib directory

# cp $SPARK_HOME/jars/* $HIVE_HOME/lib
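
  Copying every jar works, but the Hive on Spark Getting Started guide suggests that for Spark 2.x only a few jars need to be linked into Hive's lib directory. A sketch using symlinks (the globs are assumptions; check that each one matches exactly one jar in your build):

# ln -s $SPARK_HOME/jars/scala-library*.jar $HIVE_HOME/lib/
# ln -s $SPARK_HOME/jars/spark-core_*.jar $HIVE_HOME/lib/
# ln -s $SPARK_HOME/jars/spark-network-common_*.jar $HIVE_HOME/lib/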

  2) Configure hive-site.xml

   The content is the same as what would go into spark-defaults.conf, just in a different form. The following properties are appended to hive-site.xml. Note the first two in particular: without them Hive cannot use the Spark engine (a detailed description of the resulting error is given later).

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>

<property>
  <name>hive.enable.spark.execution.engine</name>
  <value>true</value>
</property>
<property>
  <name>spark.home</name>
  <value>/usr/local/spark-2.3.3</value>
</property>
<property>
  <name>spark.master</name>
  <value>yarn-client</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.eventLog.dir</name>
  <value>hdfs://hadoop:8020/spark-log</value>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>1g</value>
</property>
<property>
  <name>spark.driver.memory</name>
  <value>1g</value>
</property>
<property>
  <name>spark.executor.extraJavaOptions</name>
  <value>-XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"</value>
</property>
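
  Alternatively, the same Spark settings can be placed in a spark-defaults.conf file on Hive's classpath, which is the form described in the Hive on Spark guide. A sketch assuming $HIVE_HOME/conf and the same assumed HDFS event-log path as above (the /spark-log directory was created in the Spark section):

# cat > $HIVE_HOME/conf/spark-defaults.conf <<'EOF'
spark.master             yarn-client
spark.eventLog.enabled   true
spark.eventLog.dir       hdfs://hadoop:8020/spark-log
spark.serializer         org.apache.spark.serializer.KryoSerializer
spark.executor.memory    1g
spark.driver.memory      1g
EOF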

   3) Verify that Hive on Spark works

   Type hive on the command line to enter the Hive CLI:

  

  set hive.execution.engine=spark; (this sets the execution engine to Spark; the default is mr. The setting reverts to the default after you exit the Hive CLI; to make Spark the default engine permanently, set it in hive-site.xml as shown above.)

  Next, run a statement that creates a test table:

hive> create table test(ts BIGINT,line STRING); 

  Then run a query:

hive> select count(*) from test;
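
  Optionally, insert a row first so that the count is non-zero and a Spark job is actually submitted (a minimal check using the test table created above):

hive> insert into test values (1, 'hello');
hive> select count(*) from test;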

  

  If the whole process above runs without errors and produces the correct result, Hive on Spark has been set up successfully.

 

 4. Problems encountered

1. get rid of POM not found warning for org.eclipse.m2e:lifecycle-mapping

  The solution from Stack Overflow successfully resolved this problem; see: https://stackoverflow.com/questions/7905501/get-rid-of-pom-not-found-warning-for-org-eclipse-m2elifecycle-mapping/

2. Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile

  This error mainly appeared when the Spark SQL module failed to compile. The cause was a Scala dependency conflict in the local Maven repository: the first build did not specify a Scala version and defaulted to 2.10, and that build succeeded, but in a later build I chose Scala 2.11 and the spark-sql module then failed to compile. Searching for a solution led to the following issue: https://github.com/davidB/scala-maven-plugin/issues/215

  Delete the conflicting Scala artifacts from the local Maven repository (default location) with the following command:

# rm -r ~/.m2/repository/org/scala-lang/scala-reflect/2.1*
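
  If the two builds really did use different Scala versions, the Spark source tree also ships a helper script that rewrites the POMs for a given Scala version; running it before rebuilding may help (a suggestion beyond the original fix):

# ./dev/change-scala-version.sh 2.11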

  If the build still fails after that, add the following dependency to the pom.xml in the source root:

<dependency>
    <groupId>net.alchim31.maven</groupId>
    <artifactId>scala-maven-plugin</artifactId>
    <version>3.2.0</version>
</dependency>

3. Error: A JNI error has occurred, please check your installation and try again

  Cause: this error appears when starting the compiled Spark because the Hadoop classpath was not imported in spark-env.sh.

  Solution: run hadoop classpath in a shell terminal:

  Then add the result to spark-env.sh:
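
  Concretely, the line added to spark-env.sh is the SPARK_DIST_CLASSPATH setting shown earlier; using command substitution avoids pasting the long classpath by hand:

export SPARK_DIST_CLASSPATH=$(hadoop classpath)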

 4. Error when starting Hive: spark-assembly-*.jar is missing

  

  The main cause: the Hive startup script contains a command that, when Spark is present, loads the relevant Spark JAR. Since Spark 2.0.0, the single large JAR under lib/ has been replaced by many smaller JARs under jars/, so the spark-assembly JAR no longer exists and cannot be found. That is the source of the problem.

  

  Solution: replace the reference to spark-assembly-*.jar with jars/*.jar, and the problem no longer occurs.
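
  As a sketch of what the edit looks like, assuming the startup script still contains the line shipped in older Hive releases (check your own $HIVE_HOME/bin/hive first, since the exact line may differ):

# grep -n "spark-assembly" $HIVE_HOME/bin/hive
sparkAssemblyPath=`ls ${SPARK_HOME}/lib/spark-assembly-*.jar`

  Change that line to:

sparkAssemblyPath=`ls ${SPARK_HOME}/jars/*.jar`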

   

 

 References:

  http://spark.apache.org/docs/2.3.3/building-spark.html

  https://www.cnblogs.com/xinfang520/p/7763328.html

  https://blog.csdn.net/m0_37065162/article/details/81015096

  https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
