Scala版本
scala-2.10.4
说明:
之前搭建环境一直不成功,原因可能是使用了Scala-2.11.4版本导致的。Spark的官方网站明确的说Spark-1.2.0不支持Scala2.11.4版本:
Note: Scala 2.11 users should download the Spark source package and build with Scala 2.11 support.
Spark版本:
spark-1.2.0-bin-hadoop2.4.tgz
配置环境变量
export SCALA_HOME=/home/hadoop/spark1.2.0/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH export SPARK_HOME=/home/hadoop/spark1.2.0/spark-1.2.0-bin-hadoop2.4 export PATH=$SPARK_HOME/bin:$PATH
搭建Intellij Idea开发Spark程序的环境
1. 下载安装Scala插件
2. 创建 Scala的Non-SBT项目
3. 导入Spark的jar包
spark-1.2.0-bin-hadoop2.4
4.编写wordcount例子代码
package spark.examples import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object SparkWordCount { def main(args: Array[String]) { ///注意setMaster("local")这行代码,表明Spark以local运行(注意local与standalone模式的区别) val conf = new SparkConf().setAppName("SparkWordCount").setMaster("local") val sc = new SparkContext(conf) val rdd = sc.textFile("file:///home/hadoop/spark1.2.0/word.txt") rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1)).saveAsTextFile("file:///home/hadoop/spark1.2.0/WordCountResult") sc.stop } }
控制台日志:
15/01/14 22:06:34 WARN Utils: Your hostname, hadoop-Inspiron-3521 resolves to a loopback address: 127.0.1.1; using 192.168.0.111 instead (on interface eth1) 15/01/14 22:06:34 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 15/01/14 22:06:35 INFO SecurityManager: Changing view acls to: hadoop 15/01/14 22:06:35 INFO SecurityManager: Changing modify acls to: hadoop 15/01/14 22:06:35 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop) 15/01/14 22:06:36 INFO Slf4jLogger: Slf4jLogger started 15/01/14 22:06:36 INFO Remoting: Starting remoting 15/01/14 22:06:36 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:53624] 15/01/14 22:06:36 INFO Utils: Successfully started service 'sparkDriver' on port 53624. 15/01/14 22:06:36 INFO SparkEnv: Registering MapOutputTracker 15/01/14 22:06:36 INFO SparkEnv: Registering BlockManagerMaster 15/01/14 22:06:36 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20150114220636-4826 15/01/14 22:06:36 INFO MemoryStore: MemoryStore started with capacity 461.7 MB 15/01/14 22:06:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/01/14 22:06:37 INFO HttpFileServer: HTTP File server directory is /tmp/spark-19683393-0315-498c-9b72-9c6a13684f44 15/01/14 22:06:37 INFO HttpServer: Starting HTTP Server 15/01/14 22:06:38 INFO Utils: Successfully started service 'HTTP file server' on port 53231. 15/01/14 22:06:43 INFO Utils: Successfully started service 'SparkUI' on port 4040. 15/01/14 22:06:43 INFO SparkUI: Started SparkUI at http://hadoop-Inspiron-3521.local:4040 15/01/14 22:06:43 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://[email protected]:53624/user/HeartbeatReceiver 15/01/14 22:06:44 INFO NettyBlockTransferService: Server created on 46971 15/01/14 22:06:44 INFO BlockManagerMaster: Trying to register BlockManager 15/01/14 22:06:44 INFO BlockManagerMasterActor: Registering block manager localhost:46971 with 461.7 MB RAM, BlockManagerId(<driver>, localhost, 46971) 15/01/14 22:06:44 INFO BlockManagerMaster: Registered BlockManager 15/01/14 22:06:44 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=484127539 15/01/14 22:06:44 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 461.5 MB) 15/01/14 22:06:45 INFO MemoryStore: ensureFreeSpace(22692) called with curMem=163705, maxMem=484127539 15/01/14 22:06:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 461.5 MB) 15/01/14 22:06:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:46971 (size: 22.2 KB, free: 461.7 MB) 15/01/14 22:06:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0 15/01/14 22:06:45 INFO SparkContext: Created broadcast 0 from textFile at SparkWordCount.scala:40 15/01/14 22:06:45 INFO FileInputFormat: Total input paths to process : 1 15/01/14 22:06:45 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 15/01/14 22:06:45 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 15/01/14 22:06:45 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 15/01/14 22:06:45 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 15/01/14 22:06:45 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 15/01/14 22:06:46 INFO SparkContext: Starting job: saveAsTextFile at SparkWordCount.scala:43 15/01/14 22:06:46 INFO DAGScheduler: Registering RDD 3 (map at SparkWordCount.scala:43) 15/01/14 22:06:46 INFO DAGScheduler: Registering RDD 5 (map at SparkWordCount.scala:43) 15/01/14 22:06:46 INFO DAGScheduler: Got job 0 (saveAsTextFile at SparkWordCount.scala:43) with 1 output partitions (allowLocal=false) 15/01/14 22:06:46 INFO DAGScheduler: Final stage: Stage 2(saveAsTextFile at SparkWordCount.scala:43) 15/01/14 22:06:46 INFO DAGScheduler: Parents of final stage: List(Stage 1) 15/01/14 22:06:46 INFO DAGScheduler: Missing parents: List(Stage 1) 15/01/14 22:06:46 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[3] at map at SparkWordCount.scala:43), which has no missing parents 15/01/14 22:06:46 INFO MemoryStore: ensureFreeSpace(3560) called with curMem=186397, maxMem=484127539 15/01/14 22:06:46 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.5 KB, free 461.5 MB) 15/01/14 22:06:46 INFO MemoryStore: ensureFreeSpace(2528) called with curMem=189957, maxMem=484127539 15/01/14 22:06:46 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.5 KB, free 461.5 MB) 15/01/14 22:06:46 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:46971 (size: 2.5 KB, free: 461.7 MB) 15/01/14 22:06:46 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0 15/01/14 22:06:46 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838 15/01/14 22:06:46 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[3] at map at SparkWordCount.scala:43) 15/01/14 22:06:46 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 15/01/14 22:06:46 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1292 bytes) 15/01/14 22:06:46 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 15/01/14 22:06:46 INFO HadoopRDD: Input split: file:/home/hadoop/spark1.2.0/word.txt:0+29 15/01/14 22:06:46 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1895 bytes result sent to driver 15/01/14 22:06:46 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 323 ms on localhost (1/1) 15/01/14 22:06:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 15/01/14 22:06:46 INFO DAGScheduler: Stage 0 (map at SparkWordCount.scala:43) finished in 0.350 s 15/01/14 22:06:46 INFO DAGScheduler: looking for newly runnable stages 15/01/14 22:06:46 INFO DAGScheduler: running: Set() 15/01/14 22:06:46 INFO DAGScheduler: waiting: Set(Stage 1, Stage 2) 15/01/14 22:06:46 INFO DAGScheduler: failed: Set() 15/01/14 22:06:46 INFO DAGScheduler: Missing parents for Stage 1: List() 15/01/14 22:06:46 INFO DAGScheduler: Missing parents for Stage 2: List(Stage 1) 15/01/14 22:06:46 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[5] at map at SparkWordCount.scala:43), which is now runnable 15/01/14 22:06:46 INFO MemoryStore: ensureFreeSpace(2992) called with curMem=192485, maxMem=484127539 15/01/14 22:06:46 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.9 KB, free 461.5 MB) 15/01/14 22:06:46 INFO MemoryStore: ensureFreeSpace(2158) called with curMem=195477, maxMem=484127539 15/01/14 22:06:46 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.1 KB, free 461.5 MB) 15/01/14 22:06:46 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:46971 (size: 2.1 KB, free: 461.7 MB) 15/01/14 22:06:46 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0 15/01/14 22:06:46 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:838 15/01/14 22:06:46 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MappedRDD[5] at map at SparkWordCount.scala:43) 15/01/14 22:06:46 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks 15/01/14 22:06:46 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1045 bytes) 15/01/14 22:06:46 INFO Executor: Running task 0.0 in stage 1.0 (TID 1) 15/01/14 22:06:46 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks 15/01/14 22:06:46 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 12 ms 15/01/14 22:06:46 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1000 bytes result sent to driver 15/01/14 22:06:46 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 110 ms on localhost (1/1) 15/01/14 22:06:46 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 15/01/14 22:06:46 INFO DAGScheduler: Stage 1 (map at SparkWordCount.scala:43) finished in 0.106 s 15/01/14 22:06:46 INFO DAGScheduler: looking for newly runnable stages 15/01/14 22:06:46 INFO DAGScheduler: running: Set() 15/01/14 22:06:46 INFO DAGScheduler: waiting: Set(Stage 2) 15/01/14 22:06:46 INFO DAGScheduler: failed: Set() 15/01/14 22:06:46 INFO DAGScheduler: Missing parents for Stage 2: List() 15/01/14 22:06:46 INFO DAGScheduler: Submitting Stage 2 (MappedRDD[8] at saveAsTextFile at SparkWordCount.scala:43), which is now runnable 15/01/14 22:06:47 INFO MemoryStore: ensureFreeSpace(112880) called with curMem=197635, maxMem=484127539 15/01/14 22:06:47 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 110.2 KB, free 461.4 MB) 15/01/14 22:06:47 INFO MemoryStore: ensureFreeSpace(67500) called with curMem=310515, maxMem=484127539 15/01/14 22:06:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 65.9 KB, free 461.3 MB) 15/01/14 22:06:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:46971 (size: 65.9 KB, free: 461.6 MB) 15/01/14 22:06:47 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0 15/01/14 22:06:47 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:838 15/01/14 22:06:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 2 (MappedRDD[8] at saveAsTextFile at SparkWordCount.scala:43) 15/01/14 22:06:47 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks 15/01/14 22:06:47 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, PROCESS_LOCAL, 1056 bytes) 15/01/14 22:06:47 INFO Executor: Running task 0.0 in stage 2.0 (TID 2) 15/01/14 22:06:47 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks 15/01/14 22:06:47 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms 15/01/14 22:06:47 INFO FileOutputCommitter: Saved output of task 'attempt_201501142206_0002_m_000000_2' to file:/home/hadoop/spark1.2.0/WordCountResult/_temporary/0/task_201501142206_0002_m_000000 15/01/14 22:06:47 INFO SparkHadoopWriter: attempt_201501142206_0002_m_000000_2: Committed 15/01/14 22:06:47 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 824 bytes result sent to driver 15/01/14 22:06:47 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 397 ms on localhost (1/1) 15/01/14 22:06:47 INFO DAGScheduler: Stage 2 (saveAsTextFile at SparkWordCount.scala:43) finished in 0.399 s 15/01/14 22:06:47 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 15/01/14 22:06:47 INFO DAGScheduler: Job 0 finished: saveAsTextFile at SparkWordCount.scala:43, took 1.241181 s 15/01/14 22:06:47 INFO SparkUI: Stopped Spark web UI at http://hadoop-Inspiron-3521.local:4040 15/01/14 22:06:47 INFO DAGScheduler: Stopping DAGScheduler 15/01/14 22:06:48 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped! 15/01/14 22:06:48 INFO MemoryStore: MemoryStore cleared 15/01/14 22:06:48 INFO BlockManager: BlockManager stopped 15/01/14 22:06:48 INFO BlockManagerMaster: BlockManagerMaster stopped 15/01/14 22:06:48 INFO SparkContext: Successfully stopped SparkContext 15/01/14 22:06:48 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 15/01/14 22:06:48 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. Process finished with exit code 0
调整日志级别
从上面的输出中,可以看到,Spark默认是输出INFO级别的日志,为了查看全部的日志,可以设置Spark的日志输出,办法是在wordcount项目的源代码根目录创建一个log4j.properties文件,其中的内容
log4j.rootCategory=DEBUG, file log4j.appender.file=org.apache.log4j.ConsoleAppender #如果要把日志输出到某个文件中,则使用FileAppender #log4j.appender.file=org.apache.log4j.FileAppender #log4j.appender.file.file=spark.log log4j.appender.file.append=false log4j.appender.file.layout=org.apache.log4j.PatternLayout log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n # Ignore messages below warning level from Jetty, because it's a bit verbose log4j.logger.org.eclipse.jetty=WARN org.eclipse.jetty.LEVEL=WARN在Windows上搭建Spark的开发环境报错,
15/01/17 16:17:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/01/17 16:17:04 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318) at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333) at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326) at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76) at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93) at org.apache.hadoop.security.Groups.<init>(Groups.java:77) at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255) at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283) at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:43) at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:202) at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala) at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1784) at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:105) at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:180) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:292) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159) at org.apache.spark.SparkContext.<init>(SparkContext.scala:232) at spark.examples.SparkWordCount$.main(SparkWordCount.scala:39) at spark.examples.SparkWordCount.main(SparkWordCount.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)解决办法:
1. 解压一个Hadoop,在环境变量里HADOOP_HOME
HADOOP_HOME=E:/devsoftware/hadoop-2.5.2/hadoop-2.5.2如果环境变量不好使,则在代码中添加如下语句
System.setProperty("hadoop.home.dir", "E:/devsoftware/hadoop-2.5.2/hadoop-2.5.2");2. 将winutils.exe拷贝至bin目录下
E:/devsoftware/hadoop-2.5.2/hadoop-2.5.2/bin下载地址:
http://download.csdn.net/download/zxyacb2012/8193465不需要 hadoop.dll文件
下一步
经过几天的折腾,终于打破僵局,可以调试Spark的源代码了,接下来加快步伐,提升Spark的修为!