A Collection of Spark Problems [continuously updated]


1、Initial job has not accepted any resources

16/08/13 17:05:42 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/08/13 17:05:57 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

This message tells us that the initial job has not been able to acquire any resources. A Spark job asks for only two kinds of resources: cores and memory. So when this warning appears, one or both of them must be in short supply, and we can open the Spark UI to check:
[Screenshot: Spark UI showing that all of the cluster's cores are already in use]

From the screenshot we can see that all the cores are already taken: another application, or perhaps a spark-shell left running, is occupying them, which is why the warning above appears.

Reference:
http://www.datastax.com/dev/blog/common-spark-troubleshooting
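
If the cores are tied up by other long-running applications or an idle spark-shell, one option is to cap what each application may claim, so that new jobs can still be scheduled. Below is a minimal sketch, assuming Spark 2.x on a standalone cluster; the application name and resource values are illustrative placeholders, not settings from the original post.

import org.apache.spark.sql.SparkSession

object ResourceCappedApp {
  def main(args: Array[String]): Unit = {
    // Cap this application's resource usage so other jobs can still be scheduled.
    val spark = SparkSession.builder()
      .appName("resource-capped-app")
      .config("spark.cores.max", "4")         // total cores for this app (standalone mode)
      .config("spark.executor.memory", "2g")  // memory per executor
      .getOrCreate()

    spark.range(100).count()
    spark.stop()
  }
}

The same limits can also be passed on the command line with --total-executor-cores and --executor-memory when running spark-submit.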


2、Exception in thread "main" java.lang.ClassNotFoundException

Exception in thread "main" java.lang.ClassNotFoundException: Main
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
    at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:56)
    at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)

This exception comes up often when running spark-submit, and it can have many causes; for example, the class may genuinely be missing from your JAR. Note how it differs from other ClassNotFoundException cases: here it is the main class that cannot be found, not some referenced class. When a referenced class is missing, the classpath is usually wrong or a dependency library was not included. In my case I was puzzled that, within the same JAR, some main classes could be found while others could not. It turned out that as soon as I declared the main class inside a package, it could no longer be found; once I moved it to the root of the source tree, it was found again. So if your main class cannot be found, try moving it to the source root. At least that is what solved it for me; everyone's situation is different, so good luck to you!

Solution:
Put the main class at the root of the source tree, i.e. directly under src.
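
As a minimal sketch of that layout (the object and JAR names here are hypothetical): with the main object at the source root there is no package declaration, so the name passed to --class is just the bare class name.

// File placed directly under src, with no package declaration,
// so the fully qualified name of the main class is simply "Main".
object Main {
  def main(args: Array[String]): Unit = {
    println("main class resolved")
  }
}

// Submit with the bare class name (JAR path is hypothetical):
//   spark-submit --class Main /path/to/your-app.jar

Conversely, if the main object does live in a package, the value given to --class must be the fully qualified name (package plus class name); a mismatch between the two is another common way to hit this error.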

3、When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment

hadoop@master:~$ ./shell/spark-submit.sh 
16/09/03 10:35:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/03 10:35:46 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /jar/edu-cloud-assembly-1.0.jar
16/09/03 10:35:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
    at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:251)
    at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:228)
    at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:109)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:114)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Solution:
Edit the $SPARK_HOME/conf/spark-env.sh file:

hadoop@master:~$ vi spark-1.6.0-bin-hadoop2.4/conf/spark-env.sh

Add the following line:

HADOOP_CONF_DIR=/home/hadoop/hadoop-2.4.0/etc/hadoop/

Then push the updated file to every node in the cluster.

4、awaitResult Exception

Exception in thread "main" org.apache.spark.SparkException: Exception thrown in awaitResult

Cause:
The driver timed out while waiting for a broadcast join to finish broadcasting its table.

Solution:
Increase spark.sql.broadcastTimeout from its default of 300 seconds, for example:

spark.conf.set("spark.sql.broadcastTimeout", 1200)
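
A minimal sketch of the two usual knobs, assuming an existing SparkSession named spark and illustrative values: either give the broadcast more time, or stop Spark from trying to broadcast a table that is too large to ship within the timeout.

// Give the broadcast more time to complete (value in seconds; default is 300).
spark.conf.set("spark.sql.broadcastTimeout", 1200)

// Alternative: disable automatic broadcast joins so Spark falls back to a
// shuffle-based join instead of trying to broadcast a large table.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)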

5、Exception in thread "main" org.apache.spark.sql.AnalysisException: Both sides of this join are outside the broadcasting threshold and computing it could be prohibitively expensive. To explicitly enable it, please set spark.sql.crossJoin.enabled = true

18/01/09 20:25:33 INFO FileSourceStrategy: Planning scan with bin packing, max size: 134217728 bytes, open cost is considered as scanning 4194304 bytes.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Both sides of this join are outside the broadcasting threshold and computing it could be prohibitively expensive. To explicitly enable it, please set spark.sql.crossJoin.enabled = true;
    at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doPrepare(BroadcastNestedLoopJoinExec.scala:345)
    at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:199)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:134)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
	at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:323)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
	at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2193)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
    at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2192)
	at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2197)
    at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2197)
    at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2559)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2197)
	at org.apache.spark.sql.Dataset.collect(Dataset.scala:2173)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/01/09 20:25:34 INFO SparkContext: Invoking stop() from shutdown hook

Solution:

set spark.sql.crossJoin.enabled = true;
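
The same switch can also be flipped from code instead of a SQL SET statement; a minimal sketch, assuming an existing SparkSession named spark:

// Allow the planner to build the cross (nested-loop) join.
spark.conf.set("spark.sql.crossJoin.enabled", "true")

// Equivalent SQL form:
spark.sql("SET spark.sql.crossJoin.enabled=true")

Enabling it only removes the safety check: if the join really has no usable condition, the resulting cartesian product can still be extremely expensive, so it is worth verifying the join keys before turning this on.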
