1. Packaging a Spark jar in IDEA:
In the projectName/project/ directory, create a new file assembly.sbt with the following content:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")
The build.sbt file:
name := "ScalaTest"

version := "0.1"

scalaVersion := "2.11.8"

libraryDependencies += "mysql" % "mysql-connector-java" % "8.0.11" % "compile"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0" % "provided"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.3.0" % "provided"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.3.0" % "compile"
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.10.0" % "compile"
Assembly can fail because of conflicting jar dependencies; to resolve the conflicts, add a merge strategy to build.sbt:
assemblyMergeStrategy in assembly := {
  case PathList("javax", "servlet", xs @ _*) => MergeStrategy.last
  case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
  case PathList("org", "apache", xs @ _*) => MergeStrategy.last
  case PathList("com", "google", xs @ _*) => MergeStrategy.last
  case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
  case PathList("com", "codahale", xs @ _*) => MergeStrategy.last
  case PathList("com", "yammer", xs @ _*) => MergeStrategy.last
  case "about.html" => MergeStrategy.rename
  case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
  case "META-INF/mailcap" => MergeStrategy.last
  case "META-INF/mimetypes.default" => MergeStrategy.last
  case "plugin.properties" => MergeStrategy.last
  case "log4j.properties" => MergeStrategy.last
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
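Optionally, the output jar name and the manifest main class can also be pinned in build.sbt. The two settings below are a sketch; the class name is a placeholder for your own entry point:

assemblyJarName in assembly := "ScalaTest-assembly.jar"

mainClass in assembly := Some("com.example.StreamingApp")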
Run -> Edit Configurations -> create a new sbt Task and enter assembly in the Tasks field.
Run -> Run 'assembly'
2. Submitting a job from the local machine to an external cluster fails with:
Service 'sparkDriver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
Modify conf/spark-env.sh:
export SPARK_LOCAL_IP=localhost   (or the local machine's IP)
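With Spark 2.1 or later, the binding can also be controlled from the application itself instead of spark-env.sh. The sketch below assumes the build.sbt above; the master URL and IP addresses are placeholders for your own cluster and local machine:

import org.apache.spark.{SparkConf, SparkContext}

// Code-side alternative to SPARK_LOCAL_IP; the addresses below are placeholders.
val conf = new SparkConf()
  .setAppName("ScalaTest")
  .setMaster("spark://sparkmaster:7077")
  .set("spark.driver.bindAddress", "127.0.0.1")   // address the driver service binds to locally
  .set("spark.driver.host", "192.168.1.100")      // local IP the cluster uses to reach the driver
val sc = new SparkContext(conf)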
3. Insufficient resources:
When a Spark job is submitted and the cluster does not have enough resources, the following warning is printed repeatedly:
WARN TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
When resources are insufficient the Spark job does not end immediately; it keeps waiting. A streaming job in this state holds whatever resources it has already been given and never releases them, so other jobs pile up in the Spark cluster without ever being executed.
Check sparkmaster:8080 to see whether cores or memory are the bottleneck. If cores are the problem, the fix is:
use sparkConf.set("spark.cores.max", "2") to cap the number of cores the job may take.
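A fuller sketch of capping a streaming job's footprint on a standalone cluster; the values and batch interval are examples, not recommendations:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Cap the job's resource usage so one long-running stream cannot starve the cluster.
val sparkConf = new SparkConf()
  .setAppName("ScalaTest")
  .set("spark.cores.max", "2")          // total cores this application may take from the cluster
  .set("spark.executor.memory", "1g")   // memory per executor
val ssc = new StreamingContext(sparkConf, Seconds(5))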
References:
http://wenda.chinahadoop.cn/question/2433
http://spark.apache.org/docs/latest/spark-standalone.html (the Resource Scheduling section)