Background:
This job was developed on top of an existing Maven project in a cluster environment, so no extra environment setup was needed. These notes only cover the pitfalls I hit while writing it.
1. Since I was developing in IDEA, I simply created a folder SparkWordCount and a file SparkWordCount.scala directly inside the Maven project. After packaging with Maven and submitting with spark-submit, the job kept failing with:
19/02/20 19:34:23 ERROR yarn.ApplicationMaster: Uncaught exception: java.lang.ClassNotFoundException: com.yixia.bigdata.etl.SparkWordCount.SparkWordCount
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
It turned out the source file was missing the package declaration: package com.yixia.bigdata.etl.SparkWordCount. Without it, the fully qualified main class could not be found.
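To make the point concrete, here is a minimal sketch (using this project's names) of how the package declaration plus the object name must together match the --class argument given to spark-submit:

// SparkWordCount.scala -- the package line must be present,
// otherwise the object is compiled into the default package
package com.yixia.bigdata.etl.SparkWordCount

object SparkWordCount {
  // fully qualified name that spark-submit looks up:
  // com.yixia.bigdata.etl.SparkWordCount.SparkWordCount
}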
2. After fixing the package name, the job failed again with:
19/02/20 19:42:38 INFO yarn.ApplicationMaster: Waiting for spark context initialization...
Exception in thread "Driver" java.lang.NullPointerException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:552)
19/02/20 19:42:38 ERROR yarn.ApplicationMaster: Uncaught exception: java.lang.IllegalStateException: SparkContext is null but app is still running!
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:355)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:197)
It turned out the main class had been declared with class instead of object. In Scala, an object is a singleton whose members behave like Java static members; since the main method is the program entry point and must be static, it has to live in an object, not a class. A minimal comparison is sketched below.
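A small sketch of the difference (hypothetical names, not the project code): with class, main is an instance method and the launcher cannot invoke it statically, so the driver never gets going; with object, main is callable without an instance.

// Wrong: main inside a class is an instance method; the YARN ApplicationMaster
// cannot invoke it without an instance, so the driver never starts
class BrokenApp {
  def main(args: Array[String]): Unit = println("never reached on YARN")
}

// Right: main inside an object behaves like a static method and is found by the launcher
object WorkingApp {
  def main(args: Array[String]): Unit = println("driver starts here")
}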
3. The word count logic: read the file in, filter out empty lines, map each word to a (word, 1) pair, sum the counts with reduceByKey, and finally save the result with saveAsTextFile. I did not use split() here because each line of my test data contains exactly one word, so no splitting is needed (a variant that does split lines is sketched after the full example below). To make sure the application's resources are released, sc.stop() must be called at the end.
The complete example code is as follows:
package com.yixia.bigdata.etl.SparkWordCount

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // In yarn-cluster mode the master and app settings come from spark-submit,
    // so the no-arg constructor is enough here
    val sc = new SparkContext()
    val input = args(0) // hdfs://nameservice1/user/matrix/zy
    val output = args(1)
    val data = sc.textFile(input)
    // one word per line: drop empty lines, pair each word with 1, sum by key
    val wordcount = data.filter(it => it != "").map(word => (word, 1)).reduceByKey(_ + _)
    wordcount.saveAsTextFile(output)
    sc.stop() // release the application's resources
  }
}
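As mentioned above, if each line contained several words instead of one, the counting step would need a flatMap with split() before building the pairs. A minimal sketch of that variant, assuming whitespace-separated words (not part of the original job):

// Variant for input with several whitespace-separated words per line (assumed format)
val wordcountSplit = data
  .flatMap(line => line.split("\\s+")) // break each line into words
  .filter(word => word != "")          // drop empty tokens
  .map(word => (word, 1))
  .reduceByKey(_ + _)
wordcountSplit.saveAsTextFile(output)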
The spark-submit command used to submit the job to the cluster:
spark-submit \
--class com.yixia.bigdata.etl.SparkWordCount.SparkWordCount \
--name com.yixia.bigdata.etl.SparkWordCount.SparkWordCount \
--keytab ~/.matrix.keytab \
--principal matrix \
--master yarn \
--deploy-mode cluster \
--num-executors 50 \
--queue matrix \
./etl-1.0-SNAPSHOT-jar-with-dependencies.jar hdfs://nameservice1/user/matrix/zy hdfs://nameservice1/user/matrix/zy1234
After submitting to the YARN cluster, the job runs to completion and the results are stored on HDFS.
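For a quick sanity check on the result, one option is to read the output directory back in a spark-shell on the same cluster (a sketch under that assumption; the path is the output path used above):

// Read the saved (word, count) pairs back and print a small sample
val result = sc.textFile("hdfs://nameservice1/user/matrix/zy1234")
result.take(10).foreach(println)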