A basic Spark demo from the book 'Learning Spark'

  After a fairly time-consuming setup (most of it spent downloading a huge number of jars), the first example from the book 'Learning Spark' finally runs through.

  The source code is very simple:

/**
 * Illustrates flatMap + countByValue for wordcount.
 */
package com.oreilly.learningsparkexamples.scala

import org.apache.spark._
import org.apache.spark.SparkContext._

object WordCount {
    def main(args: Array[String]) {
      val master = args.length match {
        case x: Int if x > 0 => args(0)
        case _ => "local"
      }
      println("**spark-home :" + System.getenv("SPARK_HOME")) //null if not set
      val sc = new SparkContext(master, "WordCount", System.getenv("SPARK_HOME"))
      val input = args.length match {
        case x: Int if x > 1 => sc.textFile(args(1))
        case _ => sc.parallelize(List("pandas", "i like pandas")) // otherwise fall back to a small in-memory list as input
      }
      val words = input.flatMap(line => line.split(" "))
      args.length match {
        case x: Int if x > 2 => {
          val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y} // same as reduceByKey((x, y) => x + y)
          counts.saveAsTextFile(args(2))
        }
        case _ => { // otherwise just count occurrences of each word
          val wc = words.countByValue()
          println(wc.mkString(","))
        }
      }
    }
}
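
  As an aside, countByValue() is essentially a shorthand for the classic map/reduceByKey pattern followed by collecting the result to the driver. A minimal sketch of the equivalence (reusing the words RDD from the program above):

      // what countByValue() does, expressed with reduceByKey
      val wc = words.map(word => (word, 1L))   // pair each word with a count of 1
                    .reduceByKey(_ + _)        // sum counts per word (runs distributed)
                    .collectAsMap()            // bring the small result map back to the driver
      println(wc.mkString(","))

  This is also why the logs further down show a ShuffleMapStage followed by a ResultStage: countByValue triggers a shuffle under the hood.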

  The project layout looks like this:

(screenshot of the project directory structure, not reproduced here)
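  For reference, a minimal build.sbt that could produce the jar used below (learning-spark-examples_2.10-0.0.1.jar) might look like the following. The original project's actual build file is not shown here, so these settings are assumptions inferred from the artifact name and the Spark version visible in the logs:

      // hypothetical build.sbt, inferred from the jar name and Spark 1.4.1 in the logs
      name := "learning-spark-examples"

      version := "0.0.1"

      scalaVersion := "2.10.4"

      // "provided" because spark-submit supplies the Spark classes at runtime
      libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1" % "provided"

  Running sbt package would then emit target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar, which is the jar passed to spark-submit below.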
  Below are the running logs in local mode:

JHLinMacBook:learning-spark-src userxx$ spark-submit --verbose --class com.oreilly.learningsparkexamples.scala.WordCount target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar 
Using properties file: null
Parsed arguments:
  master                  local[*]
  deployMode              null
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          null
  driverMemory            null
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               com.oreilly.learningsparkexamples.scala.WordCount
  primaryResource         file:/Users/userxx/Cloud/Spark/learning-spark-src/target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar
  name                    com.oreilly.learningsparkexamples.scala.WordCount
  childArgs               []
  jars                    null
  packages                null
  repositories            null
  verbose                 true

Spark properties used, including those specified through
 --conf and those from the properties file null:
  

    
Main class:
com.oreilly.learningsparkexamples.scala.WordCount
Arguments:

System properties:
SPARK_SUBMIT -> true
spark.app.name -> com.oreilly.learningsparkexamples.scala.WordCount
spark.jars -> file:/Users/userxx/Cloud/Spark/learning-spark-src/target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar
spark.master -> local[*]
Classpath elements:
file:/Users/userxx/Cloud/Spark/learning-spark-src/target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar


/Users/userxx/Cloud/Spark/spark-1.4.1-bin-hadoop2.4
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/09/22 23:20:13 INFO SparkContext: Running Spark version 1.4.1
2015-09-22 23:20:13.411 java[918:1903] Unable to load realm info from SCDynamicStore
15/09/22 23:20:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/22 23:20:13 WARN Utils: Your hostname, JHLinMacBook resolves to a loopback address: 127.0.0.1; using 192.168.1.144 instead (on interface en0)
15/09/22 23:20:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/09/22 23:20:13 INFO SecurityManager: Changing view acls to: userxx
15/09/22 23:20:13 INFO SecurityManager: Changing modify acls to: userxx
15/09/22 23:20:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(userxx); users with modify permissions: Set(userxx)
15/09/22 23:20:14 INFO Slf4jLogger: Slf4jLogger started
15/09/22 23:20:14 INFO Remoting: Starting remoting
15/09/22 23:20:14 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:49613]
15/09/22 23:20:14 INFO Utils: Successfully started service 'sparkDriver' on port 49613.
15/09/22 23:20:14 INFO SparkEnv: Registering MapOutputTracker
15/09/22 23:20:14 INFO SparkEnv: Registering BlockManagerMaster
15/09/22 23:20:14 INFO DiskBlockManager: Created local directory at /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/blockmgr-acf8bb83-2c13-4df4-8301-30a8a78ebcc6
15/09/22 23:20:14 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/09/22 23:20:15 INFO HttpFileServer: HTTP File server directory is /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/httpd-38c60499-b2cf-4fe1-b94f-f3d04fcae1b3
15/09/22 23:20:15 INFO HttpServer: Starting HTTP Server
15/09/22 23:20:15 INFO Utils: Successfully started service 'HTTP file server' on port 49614.
15/09/22 23:20:15 INFO SparkEnv: Registering OutputCommitCoordinator
15/09/22 23:20:15 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/09/22 23:20:15 INFO SparkUI: Started SparkUI at http://192.168.1.144:4040
15/09/22 23:20:15 INFO SparkContext: Added JAR file:/Users/userxx/Cloud/Spark/learning-spark-src/target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar at http://192.168.1.144:49614/jars/learning-spark-examples_2.10-0.0.1.jar with timestamp 1442935215534
15/09/22 23:20:15 INFO Executor: Starting executor ID driver on host localhost
15/09/22 23:20:15 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 49616.
15/09/22 23:20:15 INFO NettyBlockTransferService: Server created on 49616
15/09/22 23:20:15 INFO BlockManagerMaster: Trying to register BlockManager
15/09/22 23:20:15 INFO BlockManagerMasterEndpoint: Registering block manager localhost:49616 with 265.4 MB RAM, BlockManagerId(driver, localhost, 49616)
15/09/22 23:20:15 INFO BlockManagerMaster: Registered BlockManager
15/09/22 23:20:16 INFO SparkContext: Starting job: countByValue at WordCount.scala:28
15/09/22 23:20:16 INFO DAGScheduler: Registering RDD 3 (countByValue at WordCount.scala:28)
15/09/22 23:20:16 INFO DAGScheduler: Got job 0 (countByValue at WordCount.scala:28) with 1 output partitions (allowLocal=false)
15/09/22 23:20:16 INFO DAGScheduler: Final stage: ResultStage 1(countByValue at WordCount.scala:28)
15/09/22 23:20:16 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
15/09/22 23:20:16 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
15/09/22 23:20:16 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at countByValue at WordCount.scala:28), which has no missing parents
15/09/22 23:20:16 INFO MemoryStore: ensureFreeSpace(3072) called with curMem=0, maxMem=278302556
15/09/22 23:20:16 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.0 KB, free 265.4 MB)
15/09/22 23:20:16 INFO MemoryStore: ensureFreeSpace(1755) called with curMem=3072, maxMem=278302556
15/09/22 23:20:16 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1755.0 B, free 265.4 MB)
15/09/22 23:20:16 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:49616 (size: 1755.0 B, free: 265.4 MB)
15/09/22 23:20:16 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:874
15/09/22 23:20:16 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at countByValue at WordCount.scala:28)
15/09/22 23:20:16 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/09/22 23:20:16 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1465 bytes)
15/09/22 23:20:16 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/09/22 23:20:16 INFO Executor: Fetching http://192.168.1.144:49614/jars/learning-spark-examples_2.10-0.0.1.jar with timestamp 1442935215534
15/09/22 23:20:17 INFO Utils: Fetching http://192.168.1.144:49614/jars/learning-spark-examples_2.10-0.0.1.jar to /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/userFiles-a8ac42d1-cf52-4428-8075-2090b6ed6c85/fetchFileTemp4094197159668194166.tmp
15/09/22 23:20:17 INFO Executor: Adding file:/private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/userFiles-a8ac42d1-cf52-4428-8075-2090b6ed6c85/learning-spark-examples_2.10-0.0.1.jar to class loader
15/09/22 23:20:17 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 879 bytes result sent to driver
15/09/22 23:20:17 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 300 ms on localhost (1/1)
15/09/22 23:20:17 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
15/09/22 23:20:17 INFO DAGScheduler: ShuffleMapStage 0 (countByValue at WordCount.scala:28) finished in 0.332 s
15/09/22 23:20:17 INFO DAGScheduler: looking for newly runnable stages
15/09/22 23:20:17 INFO DAGScheduler: running: Set()
15/09/22 23:20:17 INFO DAGScheduler: waiting: Set(ResultStage 1)
15/09/22 23:20:17 INFO DAGScheduler: failed: Set()
15/09/22 23:20:17 INFO DAGScheduler: Missing parents for ResultStage 1: List()
15/09/22 23:20:17 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at countByValue at WordCount.scala:28), which is now runnable
15/09/22 23:20:17 INFO MemoryStore: ensureFreeSpace(2304) called with curMem=4827, maxMem=278302556
15/09/22 23:20:17 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.3 KB, free 265.4 MB)
15/09/22 23:20:17 INFO MemoryStore: ensureFreeSpace(1371) called with curMem=7131, maxMem=278302556
15/09/22 23:20:17 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1371.0 B, free 265.4 MB)
15/09/22 23:20:17 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:49616 (size: 1371.0 B, free: 265.4 MB)
15/09/22 23:20:17 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874
15/09/22 23:20:17 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at countByValue at WordCount.scala:28)
15/09/22 23:20:17 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/09/22 23:20:17 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1245 bytes)
15/09/22 23:20:17 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/09/22 23:20:17 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/09/22 23:20:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 11 ms
15/09/22 23:20:17 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1076 bytes result sent to driver
15/09/22 23:20:17 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 84 ms on localhost (1/1)
15/09/22 23:20:17 INFO DAGScheduler: ResultStage 1 (countByValue at WordCount.scala:28) finished in 0.084 s
15/09/22 23:20:17 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
15/09/22 23:20:17 INFO DAGScheduler: Job 0 finished: countByValue at WordCount.scala:28, took 0.702187 s
pandas -> 2,i -> 1,like -> 1
15/09/22 23:20:17 INFO SparkContext: Invoking stop() from shutdown hook
15/09/22 23:20:17 INFO SparkUI: Stopped Spark web UI at http://192.168.1.144:4040
15/09/22 23:20:17 INFO DAGScheduler: Stopping DAGScheduler
15/09/22 23:20:17 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/09/22 23:20:17 INFO Utils: path = /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/blockmgr-acf8bb83-2c13-4df4-8301-30a8a78ebcc6, already present as root for deletion.
15/09/22 23:20:17 INFO MemoryStore: MemoryStore cleared
15/09/22 23:20:17 INFO BlockManager: BlockManager stopped
15/09/22 23:20:17 INFO BlockManagerMaster: BlockManagerMaster stopped
15/09/22 23:20:17 INFO SparkContext: Successfully stopped SparkContext
15/09/22 23:20:17 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/09/22 23:20:17 INFO Utils: Shutdown hook called
15/09/22 23:20:17 INFO Utils: Deleting directory /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2

   As you can see, I submitted the application in --verbose mode, so the parsed arguments, system properties, and classpath elements are all printed in detail. Note also the two stages in the log (ShuffleMapStage 0 and ResultStage 1): countByValue involves a shuffle, so the job is split into a map stage and a result stage.

tips:

  One thing you should know is that the ordering of spark-submit parameters is strict:

spark-submit [options] <app jar | python file> [app arguments]

   So the app jar must come after the options (if any), followed by the application arguments.

   For example, submitting a word-count app to a standalone cluster in client deploy mode:

 spark-submit  --master spark://gzsw-02:7077 --class org.apache.spark.examples.JavaWordCount --executor-memory 600m --total-executor-cores 16 --verbose --deploy-mode client /home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/lib/spark-examples-1.4.1-hadoop2.4.0.jar /home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/RELEASE 2 output-result

   Here the RELEASE file is the input; the second argument (2) is the minimum number of partitions; output-result is the directory that receives the detailed word-count results.

Supplement

  Yes, Spark really does enable a kind of 'in-memory computing'. Of course, we have used memory to speed up computation before; it just had not been formalized into a framework like this. A small illustration follows below.
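
  A minimal sketch of what in-memory computing buys you in Spark, assuming a SparkContext sc already exists and with a purely illustrative input path: caching an RDD keeps it in executor memory, so later actions reuse it instead of recomputing from the source.

      // hypothetical example: cache an RDD so repeated actions are served from memory
      val lines = sc.textFile("some-input.txt").cache() // path is illustrative, not from the original post
      val total = lines.count()                         // first action computes the RDD and caches it
      val pandas = lines.filter(_.contains("pandas")).count() // second action reuses the cached data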

Reposted from leibnitz.iteye.com/blog/2245419