Big Data Platform Operations: Spark

Spark

54. With the Spark service component deployed on the big data platform, open a Linux shell and start the spark-shell terminal. The startup output of the process is shown below.

[root@master ~]# spark-shell

17/05/07 08:44:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

17/05/07 08:44:34 INFO SecurityManager: Changing view acls to: root

17/05/07 08:44:34 INFO SecurityManager: Changing modify acls to: root

17/05/07 08:44:34 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)

17/05/07 08:44:34 INFO HttpServer: Starting HTTP Server

17/05/07 08:44:35 INFO Server: jetty-8.y.z-SNAPSHOT

17/05/07 08:44:35 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:56474

17/05/07 08:44:35 INFO Utils: Successfully started service 'HTTP class server' on port 56474.

Welcome to

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/

 

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)

Type in expressions to have them evaluated.

Type :help for more information.

17/05/07 08:44:38 INFO SparkContext: Running Spark version 1.6.2

17/05/07 08:44:38 INFO SecurityManager: Changing view acls to: root

17/05/07 08:44:38 INFO SecurityManager: Changing modify acls to: root

17/05/07 08:44:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)

17/05/07 08:44:38 INFO Utils: Successfully started service 'sparkDriver' on port 54871.

17/05/07 08:44:39 INFO Slf4jLogger: Slf4jLogger started

17/05/07 08:44:39 INFO Remoting: Starting remoting

17/05/07 08:44:39 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:58418]

17/05/07 08:44:39 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 58418.

17/05/07 08:44:39 INFO SparkEnv: Registering MapOutputTracker

17/05/07 08:44:39 INFO SparkEnv: Registering BlockManagerMaster

17/05/07 08:44:39 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-e80f6bc5-835b-42b5-9c14-d14b6330aeb6

17/05/07 08:44:39 INFO MemoryStore: MemoryStore started with capacity 511.1 MB

17/05/07 08:44:39 INFO SparkEnv: Registering OutputCommitCoordinator

17/05/07 08:44:39 INFO Server: jetty-8.y.z-SNAPSHOT

17/05/07 08:44:39 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040

17/05/07 08:44:39 INFO Utils: Successfully started service 'SparkUI' on port 4040.

17/05/07 08:44:39 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.0.14:4040

17/05/07 08:44:39 INFO Executor: Starting executor ID driver on host localhost

17/05/07 08:44:39 INFO Executor: Using REPL class URI: http://10.0.0.14:56474

17/05/07 08:44:39 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52300.

17/05/07 08:44:39 INFO NettyBlockTransferService: Server created on 52300

17/05/07 08:44:39 INFO BlockManagerMaster: Trying to register BlockManager

17/05/07 08:44:39 INFO BlockManagerMasterEndpoint: Registering block manager localhost:52300 with 511.1 MB RAM, BlockManagerId(driver, localhost, 52300)

17/05/07 08:44:39 INFO BlockManagerMaster: Registered BlockManager

17/05/07 08:44:40 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.

17/05/07 08:44:41 INFO EventLoggingListener: Logging events to hdfs:///spark-history/local-1494146679706

17/05/07 08:44:41 INFO SparkILoop: Created spark context..

Spark context available as sc.

17/05/07 08:44:42 INFO HiveContext: Initializing execution hive, version 1.2.1

17/05/07 08:44:42 INFO ClientWrapper: Inspected Hadoop version: 2.7.1.2.4.3.0-227

17/05/07 08:44:42 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.7.1.2.4.3.0-227

17/05/07 08:44:42 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore

17/05/07 08:44:42 INFO ObjectStore: ObjectStore, initialize called

17/05/07 08:44:42 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored

17/05/07 08:44:42 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored

17/05/07 08:44:42 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)

17/05/07 08:44:43 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)

17/05/07 08:45:00 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"

17/05/07 08:45:02 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.

17/05/07 08:45:02 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.

17/05/07 08:45:12 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.

17/05/07 08:45:12 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.

17/05/07 08:45:15 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY

17/05/07 08:45:15 INFO ObjectStore: Initialized ObjectStore

17/05/07 08:45:16 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0

17/05/07 08:45:16 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException

17/05/07 08:45:17 INFO HiveMetaStore: Added admin role in metastore

17/05/07 08:45:17 INFO HiveMetaStore: Added public role in metastore

17/05/07 08:45:18 INFO HiveMetaStore: No user is added in admin role, since config is empty

17/05/07 08:45:18 INFO HiveMetaStore: 0: get_all_databases

17/05/07 08:45:18 INFO audit: ugi=root  ip=unknown-ip-addr      cmd=get_all_databases

17/05/07 08:45:18 INFO HiveMetaStore: 0: get_functions: db=default pat=*

17/05/07 08:45:18 INFO audit: ugi=root  ip=unknown-ip-addr      cmd=get_functions: db=default pat=*

17/05/07 08:45:18 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.

17/05/07 08:45:20 INFO SessionState: Created local directory: /tmp/7ccacc49-6cf5-44f4-8567-2b24c42f5323_resources

17/05/07 08:45:20 INFO SessionState: Created HDFS directory: /tmp/hive/root/7ccacc49-6cf5-44f4-8567-2b24c42f5323

17/05/07 08:45:20 INFO SessionState: Created local directory: /tmp/root/7ccacc49-6cf5-44f4-8567-2b24c42f5323

17/05/07 08:45:20 INFO SessionState: Created HDFS directory: /tmp/hive/root/7ccacc49-6cf5-44f4-8567-2b24c42f5323/_tmp_space.db

17/05/07 08:45:21 INFO HiveContext: default warehouse location is /user/hive/warehouse

17/05/07 08:45:21 INFO HiveContext: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.

17/05/07 08:45:21 INFO ClientWrapper: Inspected Hadoop version: 2.7.1.2.4.3.0-227

17/05/07 08:45:21 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.7.1.2.4.3.0-227

17/05/07 08:45:21 INFO metastore: Trying to connect to metastore with URI thrift://slaver1:9083

17/05/07 08:45:21 INFO metastore: Connected to metastore.

17/05/07 08:45:22 INFO SessionState: Created local directory: /tmp/dd7d304a-45bb-4573-91d1-3fb013004624_resources

17/05/07 08:45:22 INFO SessionState: Created HDFS directory: /tmp/hive/root/dd7d304a-45bb-4573-91d1-3fb013004624

17/05/07 08:45:22 INFO SessionState: Created local directory: /tmp/root/dd7d304a-45bb-4573-91d1-3fb013004624

17/05/07 08:45:22 INFO SessionState: Created HDFS directory: /tmp/hive/root/dd7d304a-45bb-4573-91d1-3fb013004624/_tmp_space.db

17/05/07 08:45:22 INFO SparkILoop: Created sql context (with Hive support)..

SQL context available as sqlContext.

 

scala>

 

55. After starting spark-shell, load the data "1,2,3,4,5,6,7,8,9,10" in Scala, double each value, find the doubled results that are divisible by 3, and inspect the RDD lineage with the toDebugString method. The commands and output are shown below.

scala> val num=sc.parallelize(1 to 10)

num: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:27

 

scala> val doublenum = num.map(_*2)

doublenum: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:29

 

scala> val threenum = doublenum.filter(_ % 3 == 0)

threenum: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at filter at <console>:31

 

scala> threenum.collect

17/05/07 08:48:51 INFO SparkContext: Starting job: collect at <console>:34

17/05/07 08:48:51 INFO DAGScheduler: Got job 0 (collect at <console>:34) with 4 output partitions

17/05/07 08:48:51 INFO DAGScheduler: Final stage: ResultStage 0 (collect at <console>:34)

17/05/07 08:48:51 INFO DAGScheduler: Parents of final stage: List()

17/05/07 08:48:51 INFO DAGScheduler: Missing parents: List()

17/05/07 08:48:51 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at <console>:31), which has no missing parents

17/05/07 08:48:51 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.2 KB, free 2.2 KB)

17/05/07 08:48:51 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1334.0 B, free 3.5 KB)

17/05/07 08:48:51 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:52300 (size: 1334.0 B, free: 511.1 MB)

17/05/07 08:48:51 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1008

17/05/07 08:48:51 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at <console>:31)

17/05/07 08:48:51 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks

17/05/07 08:48:51 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2078 bytes)

17/05/07 08:48:51 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2078 bytes)

17/05/07 08:48:51 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, partition 2,PROCESS_LOCAL, 2078 bytes)

17/05/07 08:48:51 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, partition 3,PROCESS_LOCAL, 2135 bytes)

17/05/07 08:48:51 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)

17/05/07 08:48:51 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)

17/05/07 08:48:51 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)

17/05/07 08:48:51 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)

17/05/07 08:48:51 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 902 bytes result sent to driver

17/05/07 08:48:51 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 898 bytes result sent to driver

17/05/07 08:48:51 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 902 bytes result sent to driver

17/05/07 08:48:51 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 902 bytes result sent to driver

17/05/07 08:48:51 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 62 ms on localhost (1/4)

17/05/07 08:48:51 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 65 ms on localhost (2/4)

17/05/07 08:48:51 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 63 ms on localhost (3/4)

17/05/07 08:48:51 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 95 ms on localhost (4/4)

17/05/07 08:48:51 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool

17/05/07 08:48:51 INFO DAGScheduler: ResultStage 0 (collect at <console>:34) finished in 0.120 s

17/05/07 08:48:51 INFO DAGScheduler: Job 0 finished: collect at <console>:34, took 0.298116 s

res0: Array[Int] = Array(6, 12, 18)

 

scala> threenum.toDebugString

res1: String =

(4) MapPartitionsRDD[2] at filter at <console>:31 []

 |  MapPartitionsRDD[1] at map at <console>:29 []

 |  ParallelCollectionRDD[0] at parallelize at <console>:27 []
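The lineage above reads bottom-up: parallelize builds the base RDD, map doubles each element, and filter keeps the multiples of 3. As a minimal sketch (plain Python, no Spark cluster required; not the spark-shell session itself), the same pipeline and result can be reproduced like this:

```python
# Plain-Python equivalent of: num -> map(_*2) -> filter(_ % 3 == 0) -> collect
num = list(range(1, 11))                          # sc.parallelize(1 to 10)
doublenum = [x * 2 for x in num]                  # num.map(_*2)
threenum = [x for x in doublenum if x % 3 == 0]   # doublenum.filter(_ % 3 == 0)
print(threenum)  # [6, 12, 18], matching res0: Array(6, 12, 18)
```

Unlike the lists here, the Spark transformations are lazy: nothing runs until the collect action, which is why the DAGScheduler log only appears at that point.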

 

56. After starting spark-shell, load the key-value data "("A",1),("B",2),("C",3),("A",4),("B",5),("C",4),("A",3),("A",9),("B",4),("D",5)" in Scala, sort the data in ascending order by key, and group it by key. The commands and output are shown below.

scala> val kv1=sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5),("C",4),("A",3),("A",9),("B",4),("D",5)))

kv1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:27

 

scala> kv1.sortByKey().collect

17/05/07 11:18:38 INFO SparkContext: Starting job: sortByKey at <console>:30

17/05/07 11:18:38 INFO DAGScheduler: Got job 0 (sortByKey at <console>:30) with 4 output partitions

17/05/07 11:18:38 INFO DAGScheduler: Final stage: ResultStage 0 (sortByKey at <console>:30)

17/05/07 11:18:38 INFO DAGScheduler: Parents of final stage: List()

17/05/07 11:18:38 INFO DAGScheduler: Missing parents: List()

17/05/07 11:18:38 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at sortByKey at <console>:30), which has no missing parents

17/05/07 11:18:38 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.3 KB, free 2.3 KB)

17/05/07 11:18:38 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1408.0 B, free 3.7 KB)

17/05/07 11:18:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:33645 (size: 1408.0 B, free: 511.1 MB)

17/05/07 11:18:38 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1008

17/05/07 11:18:38 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at sortByKey at <console>:30)

17/05/07 11:18:38 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks

17/05/07 11:18:38 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2199 bytes)

17/05/07 11:18:38 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2219 bytes)

17/05/07 11:18:38 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, partition 2,PROCESS_LOCAL, 2199 bytes)

17/05/07 11:18:38 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, partition 3,PROCESS_LOCAL, 2219 bytes)

17/05/07 11:18:38 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)

17/05/07 11:18:38 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)

17/05/07 11:18:38 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)

17/05/07 11:18:38 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)

17/05/07 11:18:38 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 1165 bytes result sent to driver

17/05/07 11:18:38 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1165 bytes result sent to driver

17/05/07 11:18:38 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1169 bytes result sent to driver

17/05/07 11:18:38 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 1169 bytes result sent to driver

17/05/07 11:18:38 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 101 ms on localhost (1/4)

17/05/07 11:18:38 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 69 ms on localhost (2/4)

17/05/07 11:18:38 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 70 ms on localhost (3/4)

17/05/07 11:18:38 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 69 ms on localhost (4/4)

17/05/07 11:18:38 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool

17/05/07 11:18:38 INFO DAGScheduler: ResultStage 0 (sortByKey at <console>:30) finished in 0.122 s

17/05/07 11:18:38 INFO DAGScheduler: Job 0 finished: sortByKey at <console>:30, took 0.273612 s

17/05/07 11:18:38 INFO SparkContext: Starting job: collect at <console>:30

17/05/07 11:18:38 INFO DAGScheduler: Registering RDD 0 (parallelize at <console>:27)

17/05/07 11:18:38 INFO DAGScheduler: Got job 1 (collect at <console>:30) with 4 output partitions

17/05/07 11:18:38 INFO DAGScheduler: Final stage: ResultStage 2 (collect at <console>:30)

17/05/07 11:18:38 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)

17/05/07 11:18:38 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)

17/05/07 11:18:38 INFO DAGScheduler: Submitting ShuffleMapStage 1 (ParallelCollectionRDD[0] at parallelize at <console>:27), which has no missing parents

17/05/07 11:18:38 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.3 KB, free 6.0 KB)

17/05/07 11:18:38 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1434.0 B, free 7.4 KB)

17/05/07 11:18:38 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:33645 (size: 1434.0 B, free: 511.1 MB)

17/05/07 11:18:38 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1008

17/05/07 11:18:38 INFO DAGScheduler: Submitting 4 missing tasks from ShuffleMapStage 1 (ParallelCollectionRDD[0] at parallelize at <console>:27)

17/05/07 11:18:38 INFO TaskSchedulerImpl: Adding task set 1.0 with 4 tasks

17/05/07 11:18:38 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 4, localhost, partition 0,PROCESS_LOCAL, 2188 bytes)

17/05/07 11:18:38 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 5, localhost, partition 1,PROCESS_LOCAL, 2208 bytes)

17/05/07 11:18:38 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 6, localhost, partition 2,PROCESS_LOCAL, 2188 bytes)

17/05/07 11:18:38 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 7, localhost, partition 3,PROCESS_LOCAL, 2208 bytes)

17/05/07 11:18:38 INFO Executor: Running task 3.0 in stage 1.0 (TID 7)

17/05/07 11:18:38 INFO Executor: Running task 0.0 in stage 1.0 (TID 4)

17/05/07 11:18:38 INFO Executor: Running task 2.0 in stage 1.0 (TID 6)

17/05/07 11:18:38 INFO Executor: Running task 1.0 in stage 1.0 (TID 5)

17/05/07 11:18:38 INFO Executor: Finished task 2.0 in stage 1.0 (TID 6). 1161 bytes result sent to driver

17/05/07 11:18:38 INFO Executor: Finished task 0.0 in stage 1.0 (TID 4). 1161 bytes result sent to driver

17/05/07 11:18:38 INFO Executor: Finished task 1.0 in stage 1.0 (TID 5). 1161 bytes result sent to driver

17/05/07 11:18:38 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 6) in 61 ms on localhost (1/4)

17/05/07 11:18:38 INFO Executor: Finished task 3.0 in stage 1.0 (TID 7). 1161 bytes result sent to driver

17/05/07 11:18:38 INFO TaskSetManager: Finished task 3.0 in stage 1.0 (TID 7) in 62 ms on localhost (2/4)

17/05/07 11:18:38 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 5) in 64 ms on localhost (3/4)

17/05/07 11:18:38 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 4) in 71 ms on localhost (4/4)

17/05/07 11:18:38 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool

17/05/07 11:18:38 INFO DAGScheduler: ShuffleMapStage 1 (parallelize at <console>:27) finished in 0.072 s

17/05/07 11:18:38 INFO DAGScheduler: looking for newly runnable stages

17/05/07 11:18:38 INFO DAGScheduler: running: Set()

17/05/07 11:18:38 INFO DAGScheduler: waiting: Set(ResultStage 2)

17/05/07 11:18:38 INFO DAGScheduler: failed: Set()

17/05/07 11:18:38 INFO DAGScheduler: Submitting ResultStage 2 (ShuffledRDD[3] at sortByKey at <console>:30), which has no missing parents

17/05/07 11:18:38 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.9 KB, free 10.3 KB)

17/05/07 11:18:38 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1778.0 B, free 12.1 KB)

17/05/07 11:18:38 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:33645 (size: 1778.0 B, free: 511.1 MB)

17/05/07 11:18:38 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1008

17/05/07 11:18:38 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 2 (ShuffledRDD[3] at sortByKey at <console>:30)

17/05/07 11:18:38 INFO TaskSchedulerImpl: Adding task set 2.0 with 4 tasks

17/05/07 11:18:38 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 8, localhost, partition 0,NODE_LOCAL, 1894 bytes)

17/05/07 11:18:38 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 9, localhost, partition 1,NODE_LOCAL, 1894 bytes)

17/05/07 11:18:38 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 10, localhost, partition 2,NODE_LOCAL, 1894 bytes)

17/05/07 11:18:38 INFO TaskSetManager: Starting task 3.0 in stage 2.0 (TID 11, localhost, partition 3,NODE_LOCAL, 1894 bytes)

17/05/07 11:18:38 INFO Executor: Running task 0.0 in stage 2.0 (TID 8)

17/05/07 11:18:38 INFO Executor: Running task 2.0 in stage 2.0 (TID 10)

17/05/07 11:18:38 INFO Executor: Running task 1.0 in stage 2.0 (TID 9)

17/05/07 11:18:38 INFO Executor: Running task 3.0 in stage 2.0 (TID 11)

17/05/07 11:18:38 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:18:38 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:18:38 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:18:38 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:18:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 6 ms

17/05/07 11:18:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 4 ms

17/05/07 11:18:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 3 ms

17/05/07 11:18:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 8 ms

17/05/07 11:18:38 INFO Executor: Finished task 1.0 in stage 2.0 (TID 9). 1347 bytes result sent to driver

17/05/07 11:18:38 INFO Executor: Finished task 0.0 in stage 2.0 (TID 8). 1370 bytes result sent to driver

17/05/07 11:18:38 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 9) in 63 ms on localhost (1/4)

17/05/07 11:18:38 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 8) in 68 ms on localhost (2/4)

17/05/07 11:18:38 INFO Executor: Finished task 3.0 in stage 2.0 (TID 11). 1307 bytes result sent to driver

17/05/07 11:18:38 INFO Executor: Finished task 2.0 in stage 2.0 (TID 10). 1327 bytes result sent to driver

17/05/07 11:18:39 INFO TaskSetManager: Finished task 3.0 in stage 2.0 (TID 11) in 67 ms on localhost (3/4)

17/05/07 11:18:39 INFO TaskSetManager: Finished task 2.0 in stage 2.0 (TID 10) in 68 ms on localhost (4/4)

17/05/07 11:18:39 INFO DAGScheduler: ResultStage 2 (collect at <console>:30) finished in 0.076 s

17/05/07 11:18:39 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool

17/05/07 11:18:39 INFO DAGScheduler: Job 1 finished: collect at <console>:30, took 0.199282 s

res0: Array[(String, Int)] = Array((A,1), (A,4), (A,3), (A,9), (B,2), (B,5), (B,4), (C,3), (C,4), (D,5))

 

scala> kv1.groupByKey().collect

17/05/07 11:19:19 INFO SparkContext: Starting job: collect at <console>:30

17/05/07 11:19:19 INFO DAGScheduler: Registering RDD 0 (parallelize at <console>:27)

17/05/07 11:19:19 INFO DAGScheduler: Got job 2 (collect at <console>:30) with 4 output partitions

17/05/07 11:19:19 INFO DAGScheduler: Final stage: ResultStage 4 (collect at <console>:30)

17/05/07 11:19:19 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 3)

17/05/07 11:19:19 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 3)

17/05/07 11:19:19 INFO DAGScheduler: Submitting ShuffleMapStage 3 (ParallelCollectionRDD[0] at parallelize at <console>:27), which has no missing parents

17/05/07 11:19:19 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.9 KB, free 2.9 KB)

17/05/07 11:19:19 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1641.0 B, free 4.5 KB)

17/05/07 11:19:19 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:33645 (size: 1641.0 B, free: 511.1 MB)

17/05/07 11:19:19 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1008

17/05/07 11:19:19 INFO DAGScheduler: Submitting 4 missing tasks from ShuffleMapStage 3 (ParallelCollectionRDD[0] at parallelize at <console>:27)

17/05/07 11:19:19 INFO TaskSchedulerImpl: Adding task set 3.0 with 4 tasks

17/05/07 11:19:19 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 12, localhost, partition 0,PROCESS_LOCAL, 2188 bytes)

17/05/07 11:19:19 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 13, localhost, partition 1,PROCESS_LOCAL, 2208 bytes)

17/05/07 11:19:19 INFO TaskSetManager: Starting task 2.0 in stage 3.0 (TID 14, localhost, partition 2,PROCESS_LOCAL, 2188 bytes)

17/05/07 11:19:19 INFO TaskSetManager: Starting task 3.0 in stage 3.0 (TID 15, localhost, partition 3,PROCESS_LOCAL, 2208 bytes)

17/05/07 11:19:19 INFO Executor: Running task 1.0 in stage 3.0 (TID 13)

17/05/07 11:19:19 INFO Executor: Running task 2.0 in stage 3.0 (TID 14)

17/05/07 11:19:19 INFO Executor: Running task 3.0 in stage 3.0 (TID 15)

17/05/07 11:19:19 INFO Executor: Running task 0.0 in stage 3.0 (TID 12)

17/05/07 11:19:19 INFO Executor: Finished task 3.0 in stage 3.0 (TID 15). 1161 bytes result sent to driver

17/05/07 11:19:19 INFO TaskSetManager: Finished task 3.0 in stage 3.0 (TID 15) in 16 ms on localhost (1/4)

17/05/07 11:19:19 INFO Executor: Finished task 1.0 in stage 3.0 (TID 13). 1161 bytes result sent to driver

17/05/07 11:19:19 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 13) in 22 ms on localhost (2/4)

17/05/07 11:19:19 INFO Executor: Finished task 0.0 in stage 3.0 (TID 12). 1161 bytes result sent to driver

17/05/07 11:19:19 INFO Executor: Finished task 2.0 in stage 3.0 (TID 14). 1161 bytes result sent to driver

17/05/07 11:19:19 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID 14) in 25 ms on localhost (3/4)

17/05/07 11:19:19 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 12) in 27 ms on localhost (4/4)

17/05/07 11:19:19 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool

17/05/07 11:19:19 INFO DAGScheduler: ShuffleMapStage 3 (parallelize at <console>:27) finished in 0.028 s

17/05/07 11:19:19 INFO DAGScheduler: looking for newly runnable stages

17/05/07 11:19:19 INFO DAGScheduler: running: Set()

17/05/07 11:19:19 INFO DAGScheduler: waiting: Set(ResultStage 4)

17/05/07 11:19:19 INFO DAGScheduler: failed: Set()

17/05/07 11:19:19 INFO DAGScheduler: Submitting ResultStage 4 (ShuffledRDD[4] at groupByKey at <console>:30), which has no missing parents

17/05/07 11:19:19 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 3.9 KB, free 8.4 KB)

17/05/07 11:19:19 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 2.1 KB, free 10.5 KB)

17/05/07 11:19:19 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:33645 (size: 2.1 KB, free: 511.1 MB)

17/05/07 11:19:19 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1008

17/05/07 11:19:19 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 4 (ShuffledRDD[4] at groupByKey at <console>:30)

17/05/07 11:19:19 INFO TaskSchedulerImpl: Adding task set 4.0 with 4 tasks

17/05/07 11:19:19 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 16, localhost, partition 0,NODE_LOCAL, 1894 bytes)

17/05/07 11:19:19 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 17, localhost, partition 1,NODE_LOCAL, 1894 bytes)

17/05/07 11:19:19 INFO TaskSetManager: Starting task 2.0 in stage 4.0 (TID 18, localhost, partition 2,NODE_LOCAL, 1894 bytes)

17/05/07 11:19:19 INFO TaskSetManager: Starting task 3.0 in stage 4.0 (TID 19, localhost, partition 3,NODE_LOCAL, 1894 bytes)

17/05/07 11:19:19 INFO Executor: Running task 3.0 in stage 4.0 (TID 19)

17/05/07 11:19:19 INFO Executor: Running task 0.0 in stage 4.0 (TID 16)

17/05/07 11:19:19 INFO Executor: Running task 1.0 in stage 4.0 (TID 17)

17/05/07 11:19:19 INFO Executor: Running task 2.0 in stage 4.0 (TID 18)

17/05/07 11:19:19 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:19:19 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:19:19 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:19:19 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:19:19 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:19:19 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:19:19 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:19:19 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:19:19 INFO Executor: Finished task 2.0 in stage 4.0 (TID 18). 1743 bytes result sent to driver

17/05/07 11:19:19 INFO Executor: Finished task 1.0 in stage 4.0 (TID 17). 1743 bytes result sent to driver

17/05/07 11:19:19 INFO Executor: Finished task 3.0 in stage 4.0 (TID 19). 1689 bytes result sent to driver

17/05/07 11:19:19 INFO Executor: Finished task 0.0 in stage 4.0 (TID 16). 1680 bytes result sent to driver

17/05/07 11:19:19 INFO TaskSetManager: Finished task 1.0 in stage 4.0 (TID 17) in 20 ms on localhost (1/4)

17/05/07 11:19:19 INFO TaskSetManager: Finished task 3.0 in stage 4.0 (TID 19) in 21 ms on localhost (2/4)

17/05/07 11:19:19 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 16) in 23 ms on localhost (3/4)

17/05/07 11:19:19 INFO TaskSetManager: Finished task 2.0 in stage 4.0 (TID 18) in 25 ms on localhost (4/4)

17/05/07 11:19:19 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool

17/05/07 11:19:19 INFO DAGScheduler: ResultStage 4 (collect at <console>:30) finished in 0.026 s

17/05/07 11:19:19 INFO DAGScheduler: Job 2 finished: collect at <console>:30, took 0.072603 s

res1: Array[(String, Iterable[Int])] = Array((D,CompactBuffer(5)), (A,CompactBuffer(1, 4, 3, 9)), (B,CompactBuffer(2, 5, 4)), (C,CompactBuffer(3, 4)))
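The two actions above show sortByKey ordering pairs by key while keeping each key's values in their original order, and groupByKey collecting all values per key. As a minimal sketch (plain Python rather than the spark-shell session; `itertools.groupby` stands in for Spark's shuffle-based grouping), the same results can be reproduced:

```python
from itertools import groupby

kv1 = [("A",1),("B",2),("C",3),("A",4),("B",5),("C",4),("A",3),("A",9),("B",4),("D",5)]

# sortByKey equivalent: ascending by key; Python's sort is stable,
# so values for equal keys keep their input order, matching res0
sorted_kv = sorted(kv1, key=lambda kv: kv[0])
print(sorted_kv)

# groupByKey equivalent: collect all values per key (requires sorted input
# for groupby, unlike Spark, which shuffles by key)
grouped = {k: [v for _, v in g] for k, g in groupby(sorted_kv, key=lambda kv: kv[0])}
print(grouped)  # {'A': [1, 4, 3, 9], 'B': [2, 5, 4], 'C': [3, 4], 'D': [5]}
```

Note that in the Spark output the groups come back in shuffle-partition order (D, A, B, C), not key order: groupByKey alone does not sort.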

 

57. After starting spark-shell, load the key-value data "("A",1),("B",3),("C",5),("D",4),("B",7),("C",4),("E",5),("A",8),("B",4),("D",5)" in Scala, sort the data in ascending order by key, and sum the values for each identical key. The commands and output are shown below.

scala> val kv2=sc.parallelize(List(("A",1),("B",3),("C",5),("D",4),("B",7),("C",4),("E",5),("A",8),("B",4),("D",5)))

kv2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[5] at parallelize at <console>:27

 

scala> kv2.sortByKey().collect

17/05/07 11:24:22 INFO SparkContext: Starting job: sortByKey at <console>:30

17/05/07 11:24:22 INFO DAGScheduler: Got job 3 (sortByKey at <console>:30) with 4 output partitions

17/05/07 11:24:22 INFO DAGScheduler: Final stage: ResultStage 5 (sortByKey at <console>:30)

17/05/07 11:24:22 INFO DAGScheduler: Parents of final stage: List()

17/05/07 11:24:22 INFO DAGScheduler: Missing parents: List()

17/05/07 11:24:22 INFO DAGScheduler: Submitting ResultStage 5 (MapPartitionsRDD[7] at sortByKey at <console>:30), which has no missing parents

17/05/07 11:24:22 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 2.3 KB, free 12.8 KB)

17/05/07 11:24:22 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 1398.0 B, free 14.1 KB)

17/05/07 11:24:22 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:33645 (size: 1398.0 B, free: 511.1 MB)

17/05/07 11:24:22 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1008

17/05/07 11:24:22 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 5 (MapPartitionsRDD[7] at sortByKey at <console>:30)

17/05/07 11:24:22 INFO TaskSchedulerImpl: Adding task set 5.0 with 4 tasks

17/05/07 11:24:22 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 20, localhost, partition 0, PROCESS_LOCAL, 2199 bytes)

17/05/07 11:24:22 INFO TaskSetManager: Starting task 1.0 in stage 5.0 (TID 21, localhost, partition 1, PROCESS_LOCAL, 2219 bytes)

17/05/07 11:24:22 INFO TaskSetManager: Starting task 2.0 in stage 5.0 (TID 22, localhost, partition 2, PROCESS_LOCAL, 2199 bytes)

17/05/07 11:24:22 INFO TaskSetManager: Starting task 3.0 in stage 5.0 (TID 23, localhost, partition 3, PROCESS_LOCAL, 2219 bytes)

17/05/07 11:24:22 INFO Executor: Running task 0.0 in stage 5.0 (TID 20)

17/05/07 11:24:22 INFO Executor: Finished task 0.0 in stage 5.0 (TID 20). 1165 bytes result sent to driver

17/05/07 11:24:22 INFO Executor: Running task 1.0 in stage 5.0 (TID 21)

17/05/07 11:24:22 INFO Executor: Finished task 1.0 in stage 5.0 (TID 21). 1169 bytes result sent to driver

17/05/07 11:24:22 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 20) in 12 ms on localhost (1/4)

17/05/07 11:24:22 INFO TaskSetManager: Finished task 1.0 in stage 5.0 (TID 21) in 10 ms on localhost (2/4)

17/05/07 11:24:22 INFO Executor: Running task 3.0 in stage 5.0 (TID 23)

17/05/07 11:24:22 INFO Executor: Running task 2.0 in stage 5.0 (TID 22)

17/05/07 11:24:22 INFO Executor: Finished task 3.0 in stage 5.0 (TID 23). 1169 bytes result sent to driver

17/05/07 11:24:22 INFO TaskSetManager: Finished task 3.0 in stage 5.0 (TID 23) in 14 ms on localhost (3/4)

17/05/07 11:24:22 INFO Executor: Finished task 2.0 in stage 5.0 (TID 22). 1165 bytes result sent to driver

17/05/07 11:24:22 INFO TaskSetManager: Finished task 2.0 in stage 5.0 (TID 22) in 16 ms on localhost (4/4)

17/05/07 11:24:22 INFO DAGScheduler: ResultStage 5 (sortByKey at <console>:30) finished in 0.020 s

17/05/07 11:24:22 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool

17/05/07 11:24:22 INFO DAGScheduler: Job 3 finished: sortByKey at <console>:30, took 0.027469 s

17/05/07 11:24:22 INFO SparkContext: Starting job: collect at <console>:30

17/05/07 11:24:22 INFO DAGScheduler: Registering RDD 5 (parallelize at <console>:27)

17/05/07 11:24:22 INFO DAGScheduler: Got job 4 (collect at <console>:30) with 4 output partitions

17/05/07 11:24:22 INFO DAGScheduler: Final stage: ResultStage 7 (collect at <console>:30)

17/05/07 11:24:22 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 6)

17/05/07 11:24:22 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 6)

17/05/07 11:24:22 INFO DAGScheduler: Submitting ShuffleMapStage 6 (ParallelCollectionRDD[5] at parallelize at <console>:27), which has no missing parents

17/05/07 11:24:22 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 2.3 KB, free 16.5 KB)

17/05/07 11:24:22 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 1437.0 B, free 17.9 KB)

17/05/07 11:24:22 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:33645 (size: 1437.0 B, free: 511.1 MB)

17/05/07 11:24:22 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1008

17/05/07 11:24:22 INFO DAGScheduler: Submitting 4 missing tasks from ShuffleMapStage 6 (ParallelCollectionRDD[5] at parallelize at <console>:27)

17/05/07 11:24:22 INFO TaskSchedulerImpl: Adding task set 6.0 with 4 tasks

17/05/07 11:24:22 INFO TaskSetManager: Starting task 0.0 in stage 6.0 (TID 24, localhost, partition 0, PROCESS_LOCAL, 2188 bytes)

17/05/07 11:24:22 INFO TaskSetManager: Starting task 1.0 in stage 6.0 (TID 25, localhost, partition 1, PROCESS_LOCAL, 2208 bytes)

17/05/07 11:24:22 INFO TaskSetManager: Starting task 2.0 in stage 6.0 (TID 26, localhost, partition 2, PROCESS_LOCAL, 2188 bytes)

17/05/07 11:24:22 INFO TaskSetManager: Starting task 3.0 in stage 6.0 (TID 27, localhost, partition 3, PROCESS_LOCAL, 2208 bytes)

17/05/07 11:24:22 INFO Executor: Running task 0.0 in stage 6.0 (TID 24)

17/05/07 11:24:22 INFO Executor: Running task 2.0 in stage 6.0 (TID 26)

17/05/07 11:24:22 INFO Executor: Running task 1.0 in stage 6.0 (TID 25)

17/05/07 11:24:22 INFO Executor: Finished task 2.0 in stage 6.0 (TID 26). 1161 bytes result sent to driver

17/05/07 11:24:22 INFO TaskSetManager: Finished task 2.0 in stage 6.0 (TID 26) in 11 ms on localhost (1/4)

17/05/07 11:24:22 INFO Executor: Finished task 1.0 in stage 6.0 (TID 25). 1161 bytes result sent to driver

17/05/07 11:24:22 INFO TaskSetManager: Finished task 1.0 in stage 6.0 (TID 25) in 13 ms on localhost (2/4)

17/05/07 11:24:22 INFO Executor: Finished task 0.0 in stage 6.0 (TID 24). 1161 bytes result sent to driver

17/05/07 11:24:22 INFO Executor: Running task 3.0 in stage 6.0 (TID 27)

17/05/07 11:24:22 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 24) in 16 ms on localhost (3/4)

17/05/07 11:24:22 INFO Executor: Finished task 3.0 in stage 6.0 (TID 27). 1161 bytes result sent to driver

17/05/07 11:24:22 INFO TaskSetManager: Finished task 3.0 in stage 6.0 (TID 27) in 19 ms on localhost (4/4)

17/05/07 11:24:22 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool

17/05/07 11:24:22 INFO DAGScheduler: ShuffleMapStage 6 (parallelize at <console>:27) finished in 0.021 s

17/05/07 11:24:22 INFO DAGScheduler: looking for newly runnable stages

17/05/07 11:24:22 INFO DAGScheduler: running: Set()

17/05/07 11:24:22 INFO DAGScheduler: waiting: Set(ResultStage 7)

17/05/07 11:24:22 INFO DAGScheduler: failed: Set()

17/05/07 11:24:22 INFO DAGScheduler: Submitting ResultStage 7 (ShuffledRDD[8] at sortByKey at <console>:30), which has no missing parents

17/05/07 11:24:22 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 2.9 KB, free 20.8 KB)

17/05/07 11:24:22 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 1780.0 B, free 22.6 KB)

17/05/07 11:24:22 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:33645 (size: 1780.0 B, free: 511.1 MB)

17/05/07 11:24:22 INFO SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1008

17/05/07 11:24:22 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 7 (ShuffledRDD[8] at sortByKey at <console>:30)

17/05/07 11:24:22 INFO TaskSchedulerImpl: Adding task set 7.0 with 4 tasks

17/05/07 11:24:22 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 28, localhost, partition 0, NODE_LOCAL, 1894 bytes)

17/05/07 11:24:22 INFO TaskSetManager: Starting task 1.0 in stage 7.0 (TID 29, localhost, partition 1, NODE_LOCAL, 1894 bytes)

17/05/07 11:24:22 INFO TaskSetManager: Starting task 2.0 in stage 7.0 (TID 30, localhost, partition 2, NODE_LOCAL, 1894 bytes)

17/05/07 11:24:22 INFO TaskSetManager: Starting task 3.0 in stage 7.0 (TID 31, localhost, partition 3, NODE_LOCAL, 1894 bytes)

17/05/07 11:24:22 INFO Executor: Running task 3.0 in stage 7.0 (TID 31)

17/05/07 11:24:22 INFO Executor: Running task 1.0 in stage 7.0 (TID 29)

17/05/07 11:24:22 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:24:22 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:24:22 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:24:22 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:24:22 INFO Executor: Running task 2.0 in stage 7.0 (TID 30)

17/05/07 11:24:22 INFO Executor: Finished task 3.0 in stage 7.0 (TID 31). 1307 bytes result sent to driver

17/05/07 11:24:22 INFO TaskSetManager: Finished task 3.0 in stage 7.0 (TID 31) in 5 ms on localhost (1/4)

17/05/07 11:24:22 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:24:22 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:24:22 INFO Executor: Finished task 2.0 in stage 7.0 (TID 30). 1327 bytes result sent to driver

17/05/07 11:24:22 INFO Executor: Running task 0.0 in stage 7.0 (TID 28)

17/05/07 11:24:22 INFO Executor: Finished task 1.0 in stage 7.0 (TID 29). 1327 bytes result sent to driver

17/05/07 11:24:22 INFO TaskSetManager: Finished task 2.0 in stage 7.0 (TID 30) in 9 ms on localhost (2/4)

17/05/07 11:24:22 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:24:22 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:24:22 INFO Executor: Finished task 0.0 in stage 7.0 (TID 28). 1390 bytes result sent to driver

17/05/07 11:24:22 INFO TaskSetManager: Finished task 1.0 in stage 7.0 (TID 29) in 16 ms on localhost (3/4)

17/05/07 11:24:22 INFO TaskSetManager: Finished task 0.0 in stage 7.0 (TID 28) in 17 ms on localhost (4/4)

17/05/07 11:24:22 INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, from pool

17/05/07 11:24:22 INFO DAGScheduler: ResultStage 7 (collect at <console>:30) finished in 0.017 s

17/05/07 11:24:22 INFO DAGScheduler: Job 4 finished: collect at <console>:30, took 0.057174 s

res2: Array[(String, Int)] = Array((A,1), (A,8), (B,3), (B,7), (B,4), (C,5), (C,4), (D,4), (D,5), (E,5))
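The ordering in res2 can be reproduced outside Spark. The sketch below is a plain-Python stand-in (not Spark's implementation, which uses a range-partitioned shuffle) illustrating the semantics of sortByKey: records are ordered by key only, and because the sort is stable, values under the same key keep their original relative order.

```python
# Hypothetical stand-in for kv2.sortByKey().collect from the session above.
kv2 = [("A", 1), ("B", 3), ("C", 5), ("D", 4), ("B", 7),
       ("C", 4), ("E", 5), ("A", 8), ("B", 4), ("D", 5)]

# Sort by key only; Python's sort is stable, so values under equal keys
# stay in their original relative order, matching res2.
result = sorted(kv2, key=lambda kv: kv[0])
print(result)
```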

 

scala> kv2.reduceByKey(_+_).collect

17/05/07 11:25:07 INFO SparkContext: Starting job: collect at <console>:30

17/05/07 11:25:07 INFO DAGScheduler: Registering RDD 5 (parallelize at <console>:27)

17/05/07 11:25:07 INFO DAGScheduler: Got job 5 (collect at <console>:30) with 4 output partitions

17/05/07 11:25:07 INFO DAGScheduler: Final stage: ResultStage 9 (collect at <console>:30)

17/05/07 11:25:07 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 8)

17/05/07 11:25:07 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 8)

17/05/07 11:25:07 INFO DAGScheduler: Submitting ShuffleMapStage 8 (ParallelCollectionRDD[5] at parallelize at <console>:27), which has no missing parents

17/05/07 11:25:07 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 2.0 KB, free 24.6 KB)

17/05/07 11:25:07 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 1289.0 B, free 25.8 KB)

17/05/07 11:25:07 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:33645 (size: 1289.0 B, free: 511.1 MB)

17/05/07 11:25:07 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1008

17/05/07 11:25:07 INFO DAGScheduler: Submitting 4 missing tasks from ShuffleMapStage 8 (ParallelCollectionRDD[5] at parallelize at <console>:27)

17/05/07 11:25:07 INFO TaskSchedulerImpl: Adding task set 8.0 with 4 tasks

17/05/07 11:25:07 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 32, localhost, partition 0, PROCESS_LOCAL, 2188 bytes)

17/05/07 11:25:07 INFO TaskSetManager: Starting task 1.0 in stage 8.0 (TID 33, localhost, partition 1, PROCESS_LOCAL, 2208 bytes)

17/05/07 11:25:07 INFO TaskSetManager: Starting task 2.0 in stage 8.0 (TID 34, localhost, partition 2, PROCESS_LOCAL, 2188 bytes)

17/05/07 11:25:07 INFO TaskSetManager: Starting task 3.0 in stage 8.0 (TID 35, localhost, partition 3, PROCESS_LOCAL, 2208 bytes)

17/05/07 11:25:07 INFO Executor: Running task 1.0 in stage 8.0 (TID 33)

17/05/07 11:25:07 INFO Executor: Running task 0.0 in stage 8.0 (TID 32)

17/05/07 11:25:07 INFO Executor: Running task 2.0 in stage 8.0 (TID 34)

17/05/07 11:25:07 INFO Executor: Running task 3.0 in stage 8.0 (TID 35)

17/05/07 11:25:07 INFO Executor: Finished task 0.0 in stage 8.0 (TID 32). 1161 bytes result sent to driver

17/05/07 11:25:07 INFO Executor: Finished task 2.0 in stage 8.0 (TID 34). 1161 bytes result sent to driver

17/05/07 11:25:07 INFO Executor: Finished task 1.0 in stage 8.0 (TID 33). 1161 bytes result sent to driver

17/05/07 11:25:07 INFO TaskSetManager: Finished task 0.0 in stage 8.0 (TID 32) in 16 ms on localhost (1/4)

17/05/07 11:25:07 INFO TaskSetManager: Finished task 2.0 in stage 8.0 (TID 34) in 15 ms on localhost (2/4)

17/05/07 11:25:07 INFO Executor: Finished task 3.0 in stage 8.0 (TID 35). 1161 bytes result sent to driver

17/05/07 11:25:07 INFO TaskSetManager: Finished task 1.0 in stage 8.0 (TID 33) in 17 ms on localhost (3/4)

17/05/07 11:25:07 INFO TaskSetManager: Finished task 3.0 in stage 8.0 (TID 35) in 17 ms on localhost (4/4)

17/05/07 11:25:07 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool

17/05/07 11:25:07 INFO DAGScheduler: ShuffleMapStage 8 (parallelize at <console>:27) finished in 0.022 s

17/05/07 11:25:07 INFO DAGScheduler: looking for newly runnable stages

17/05/07 11:25:07 INFO DAGScheduler: running: Set()

17/05/07 11:25:07 INFO DAGScheduler: waiting: Set(ResultStage 9)

17/05/07 11:25:07 INFO DAGScheduler: failed: Set()

17/05/07 11:25:07 INFO DAGScheduler: Submitting ResultStage 9 (ShuffledRDD[9] at reduceByKey at <console>:30), which has no missing parents

17/05/07 11:25:07 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 2.7 KB, free 28.5 KB)

17/05/07 11:25:07 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 1609.0 B, free 30.0 KB)

17/05/07 11:25:07 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:33645 (size: 1609.0 B, free: 511.1 MB)

17/05/07 11:25:07 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:1008

17/05/07 11:25:07 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 9 (ShuffledRDD[9] at reduceByKey at <console>:30)

17/05/07 11:25:07 INFO TaskSchedulerImpl: Adding task set 9.0 with 4 tasks

17/05/07 11:25:07 INFO TaskSetManager: Starting task 0.0 in stage 9.0 (TID 36, localhost, partition 0, NODE_LOCAL, 1894 bytes)

17/05/07 11:25:07 INFO TaskSetManager: Starting task 1.0 in stage 9.0 (TID 37, localhost, partition 1, NODE_LOCAL, 1894 bytes)

17/05/07 11:25:07 INFO TaskSetManager: Starting task 2.0 in stage 9.0 (TID 38, localhost, partition 2, NODE_LOCAL, 1894 bytes)

17/05/07 11:25:07 INFO TaskSetManager: Starting task 3.0 in stage 9.0 (TID 39, localhost, partition 3, NODE_LOCAL, 1894 bytes)

17/05/07 11:25:07 INFO Executor: Running task 1.0 in stage 9.0 (TID 37)

17/05/07 11:25:07 INFO Executor: Running task 2.0 in stage 9.0 (TID 38)

17/05/07 11:25:07 INFO Executor: Running task 0.0 in stage 9.0 (TID 36)

17/05/07 11:25:07 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks out of 4 blocks

17/05/07 11:25:07 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms

17/05/07 11:25:07 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks out of 4 blocks

17/05/07 11:25:07 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:25:07 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 4 blocks

17/05/07 11:25:07 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:25:07 INFO Executor: Finished task 0.0 in stage 9.0 (TID 36). 1307 bytes result sent to driver

17/05/07 11:25:07 INFO Executor: Finished task 1.0 in stage 9.0 (TID 37). 1327 bytes result sent to driver

17/05/07 11:25:07 INFO Executor: Finished task 2.0 in stage 9.0 (TID 38). 1307 bytes result sent to driver

17/05/07 11:25:07 INFO Executor: Running task 3.0 in stage 9.0 (TID 39)

17/05/07 11:25:07 INFO TaskSetManager: Finished task 1.0 in stage 9.0 (TID 37) in 7 ms on localhost (1/4)

17/05/07 11:25:07 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 4 blocks

17/05/07 11:25:07 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:25:07 INFO Executor: Finished task 3.0 in stage 9.0 (TID 39). 1307 bytes result sent to driver

17/05/07 11:25:07 INFO TaskSetManager: Finished task 2.0 in stage 9.0 (TID 38) in 11 ms on localhost (2/4)

17/05/07 11:25:07 INFO TaskSetManager: Finished task 0.0 in stage 9.0 (TID 36) in 11 ms on localhost (3/4)

17/05/07 11:25:07 INFO TaskSetManager: Finished task 3.0 in stage 9.0 (TID 39) in 11 ms on localhost (4/4)

17/05/07 11:25:07 INFO TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have all completed, from pool

17/05/07 11:25:07 INFO DAGScheduler: ResultStage 9 (collect at <console>:30) finished in 0.014 s

17/05/07 11:25:07 INFO DAGScheduler: Job 5 finished: collect at <console>:30, took 0.055570 s

res3: Array[(String, Int)] = Array((D,9), (A,9), (E,5), (B,14), (C,9))
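reduceByKey(_+_) merges all values that share a key with the given function, so each key in res3 carries the sum of its values. The following plain-Python sketch reproduces those sums; it is a stand-in for illustration only, since Spark additionally pre-aggregates within each partition before the shuffle (map-side combine).

```python
# Hypothetical stand-in for kv2.reduceByKey(_ + _).collect from the session above.
kv2 = [("A", 1), ("B", 3), ("C", 5), ("D", 4), ("B", 7),
       ("C", 4), ("E", 5), ("A", 8), ("B", 4), ("D", 5)]

sums = {}
for k, v in kv2:
    sums[k] = sums.get(k, 0) + v  # fold each value into its key's running sum
print(sums)
```

The dictionary holds the same pairs as res3; only the ordering differs, because collect on a shuffled RDD returns records in partition order.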

 

58. After starting spark-shell, load the key-value data ("A",4),("A",2),("C",3),("A",4),("B",5),("C",3),("A",4) in Scala, remove the duplicate records with distinct (note that distinct deduplicates whole key-value pairs, not keys alone), and inspect the RDD's lineage with the toDebugString method. The commands and their output are shown below.

scala> val kv1 = sc.parallelize(List(("A",4),("A",2),("C",3),("A",4),("B",5),("C",3),("A",4)))

kv1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:27

 

scala> kv1.distinct.collect

17/05/17 09:18:17 INFO SparkContext: Starting job: collect at <console>:30

17/05/17 09:18:17 INFO DAGScheduler: Registering RDD 1 (distinct at <console>:30)

17/05/17 09:18:17 INFO DAGScheduler: Got job 0 (collect at <console>:30) with 4 output partitions

17/05/17 09:18:17 INFO DAGScheduler: Final stage: ResultStage 1 (collect at <console>:30)

17/05/17 09:18:17 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)

17/05/17 09:18:17 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)

17/05/17 09:18:17 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[1] at distinct at <console>:30), which has no missing parents

17/05/17 09:18:17 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.7 KB, free 2.7 KB)

17/05/17 09:18:17 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1626.0 B, free 4.3 KB)

17/05/17 09:18:17 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:36432 (size: 1626.0 B, free: 511.1 MB)

17/05/17 09:18:17 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1008

17/05/17 09:18:17 INFO DAGScheduler: Submitting 4 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[1] at distinct at <console>:30)

17/05/17 09:18:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks

17/05/17 09:18:18 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0, PROCESS_LOCAL, 2168 bytes)

17/05/17 09:18:18 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1, PROCESS_LOCAL, 2188 bytes)

17/05/17 09:18:18 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, partition 2, PROCESS_LOCAL, 2188 bytes)

17/05/17 09:18:18 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, partition 3, PROCESS_LOCAL, 2188 bytes)

17/05/17 09:18:18 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)

17/05/17 09:18:18 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)

17/05/17 09:18:18 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)

17/05/17 09:18:18 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)

17/05/17 09:18:18 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 1161 bytes result sent to driver

17/05/17 09:18:18 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 1161 bytes result sent to driver

17/05/17 09:18:18 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1161 bytes result sent to driver

17/05/17 09:18:18 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1161 bytes result sent to driver

17/05/17 09:18:18 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 145 ms on localhost (1/4)

17/05/17 09:18:18 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 129 ms on localhost (2/4)

17/05/17 09:18:18 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 130 ms on localhost (3/4)

17/05/17 09:18:18 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 127 ms on localhost (4/4)

17/05/17 09:18:18 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool

17/05/17 09:18:18 INFO DAGScheduler: ShuffleMapStage 0 (distinct at <console>:30) finished in 0.174 s

17/05/17 09:18:18 INFO DAGScheduler: looking for newly runnable stages

17/05/17 09:18:18 INFO DAGScheduler: running: Set()

17/05/17 09:18:18 INFO DAGScheduler: waiting: Set(ResultStage 1)

17/05/17 09:18:18 INFO DAGScheduler: failed: Set()

17/05/17 09:18:18 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[3] at distinct at <console>:30), which has no missing parents

17/05/17 09:18:18 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.3 KB, free 7.6 KB)

17/05/17 09:18:18 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1925.0 B, free 9.5 KB)

17/05/17 09:18:18 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:36432 (size: 1925.0 B, free: 511.1 MB)

17/05/17 09:18:18 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1008

17/05/17 09:18:18 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 1 (MapPartitionsRDD[3] at distinct at <console>:30)

17/05/17 09:18:18 INFO TaskSchedulerImpl: Adding task set 1.0 with 4 tasks

17/05/17 09:18:18 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 4, localhost, partition 0, NODE_LOCAL, 1894 bytes)

17/05/17 09:18:18 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 5, localhost, partition 1, NODE_LOCAL, 1894 bytes)

17/05/17 09:18:18 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 6, localhost, partition 3, NODE_LOCAL, 1894 bytes)

17/05/17 09:18:18 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 7, localhost, partition 2, PROCESS_LOCAL, 1894 bytes)

17/05/17 09:18:18 INFO Executor: Running task 0.0 in stage 1.0 (TID 4)

17/05/17 09:18:18 INFO Executor: Running task 1.0 in stage 1.0 (TID 5)

17/05/17 09:18:18 INFO Executor: Running task 2.0 in stage 1.0 (TID 7)

17/05/17 09:18:18 INFO Executor: Running task 3.0 in stage 1.0 (TID 6)

17/05/17 09:18:18 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 4 blocks

17/05/17 09:18:18 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms

17/05/17 09:18:18 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 4 blocks

17/05/17 09:18:18 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 4 ms

17/05/17 09:18:18 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks out of 4 blocks

17/05/17 09:18:18 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 17 ms

17/05/17 09:18:18 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 4 blocks

17/05/17 09:18:18 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 11 ms

17/05/17 09:18:18 INFO Executor: Finished task 2.0 in stage 1.0 (TID 7). 1161 bytes result sent to driver

17/05/17 09:18:18 INFO Executor: Finished task 3.0 in stage 1.0 (TID 6). 1307 bytes result sent to driver

17/05/17 09:18:18 INFO Executor: Finished task 0.0 in stage 1.0 (TID 4). 1307 bytes result sent to driver

17/05/17 09:18:18 INFO Executor: Finished task 1.0 in stage 1.0 (TID 5). 1327 bytes result sent to driver

17/05/17 09:18:18 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 7) in 53 ms on localhost (1/4)

17/05/17 09:18:18 INFO TaskSetManager: Finished task 3.0 in stage 1.0 (TID 6) in 54 ms on localhost (2/4)

17/05/17 09:18:18 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 4) in 59 ms on localhost (3/4)

17/05/17 09:18:18 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 5) in 57 ms on localhost (4/4)

17/05/17 09:18:18 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool

17/05/17 09:18:18 INFO DAGScheduler: ResultStage 1 (collect at <console>:30) finished in 0.066 s

17/05/17 09:18:18 INFO DAGScheduler: Job 0 finished: collect at <console>:30, took 0.448824 s

res0: Array[(String, Int)] = Array((A,4), (B,5), (A,2), (C,3))

 

scala> kv1.toDebugString

res1: String = (4) ParallelCollectionRDD[0] at parallelize at <console>:27 []
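As res0 shows, distinct removes duplicate whole (key, value) pairs rather than duplicate keys: both ("A",4) and ("A",2) survive. The plain-Python sketch below mirrors that semantics (Spark actually implements distinct with a reduceByKey-style shuffle, so its output order differs).

```python
# Hypothetical stand-in for kv1.distinct.collect from the session above.
kv1 = [("A", 4), ("A", 2), ("C", 3), ("A", 4), ("B", 5), ("C", 3), ("A", 4)]

# Keep the first occurrence of each whole (key, value) pair.
seen = set()
unique = []
for pair in kv1:
    if pair not in seen:
        seen.add(pair)
        unique.append(pair)
print(unique)
```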

 

 

59. After starting spark-shell, load two sets of key-value data in Scala, ("A",1),("B",2),("C",3),("A",4),("B",5) and ("A",1),("B",2),("C",3),("A",4),("B",5), and JOIN the two datasets on their keys. The commands and their output are shown below.

scala> val kv5 = sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5)))

kv5: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[11] at parallelize at <console>:27

 

scala> val kv6 = sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5)))

kv6: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[12] at parallelize at <console>:27

 

scala> kv5.join(kv6).collect

17/05/07 11:31:00 INFO SparkContext: Starting job:collect at <console>:32

17/05/07 11:31:00 INFO DAGScheduler: Registering RDD11 (parallelize at <console>:27)

17/05/07 11:31:00 INFO DAGScheduler: Registering RDD12 (parallelize at <console>:27)

17/05/07 11:31:00 INFO DAGScheduler: Got job 6(collect at <console>:32) with 4 output partitions

17/05/07 11:31:00 INFO DAGScheduler: Final stage:ResultStage 12 (collect at <console>:32)

17/05/07 11:31:00 INFO DAGScheduler: Parents of finalstage: List(ShuffleMapStage 10, ShuffleMapStage 11)

17/05/07 11:31:00 INFO DAGScheduler: Missing parents:List(ShuffleMapStage 10, ShuffleMapStage 11)

17/05/07 11:31:00 INFO DAGScheduler: SubmittingShuffleMapStage 10 (ParallelCollectionRDD[11] at parallelize at<console>:27), which has no missing parents

17/05/07 11:31:00 INFO MemoryStore: Block broadcast_10stored as values in memory (estimated size 1864.0 B, free 31.9 KB)

17/05/07 11:31:00 INFO MemoryStore: Blockbroadcast_10_piece0 stored as bytes in memory (estimated size 1182.0 B, free33.0 KB)

17/05/07 11:31:00 INFO BlockManagerInfo: Addedbroadcast_10_piece0 in memory on localhost:33645 (size: 1182.0 B, free: 511.1MB)

17/05/07 11:31:00 INFO SparkContext: Created broadcast10 from broadcast at DAGScheduler.scala:1008

17/05/07 11:31:00 INFO DAGScheduler: Submitting 4missing tasks from ShuffleMapStage 10 (ParallelCollectionRDD[11] at parallelizeat <console>:27)

17/05/07 11:31:00 INFO TaskSchedulerImpl: Adding taskset 10.0 with 4 tasks

17/05/07 11:31:00 INFO DAGScheduler: SubmittingShuffleMapStage 11 (ParallelCollectionRDD[12] at parallelize at<console>:27), which has no missing parents

17/05/07 11:31:00 INFO TaskSetManager: Starting task0.0 in stage 10.0 (TID 40, localhost, partition 0,PROCESS_LOCAL, 2168 bytes)

17/05/07 11:31:00 INFO TaskSetManager: Starting task1.0 in stage 10.0 (TID 41, localhost, partition 1,PROCESS_LOCAL, 2168 bytes)

17/05/07 11:31:00 INFO MemoryStore: Block broadcast_11stored as values in memory (estimated size 1864.0 B, free 34.8 KB)

17/05/07 11:31:00 INFO TaskSetManager: Starting task2.0 in stage 10.0 (TID 42, localhost, partition 2,PROCESS_LOCAL, 2168 bytes)

17/05/07 11:31:00 INFO TaskSetManager: Starting task3.0 in stage 10.0 (TID 43, localhost, partition 3,PROCESS_LOCAL, 2188 bytes)

17/05/07 11:31:00 INFO MemoryStore: Blockbroadcast_11_piece0 stored as bytes in memory (estimated size 1188.0 B, free36.0 KB)

17/05/07 11:31:00 INFO BlockManagerInfo: Addedbroadcast_11_piece0 in memory on localhost:33645 (size: 1188.0 B, free: 511.1MB)

17/05/07 11:31:00 INFO SparkContext: Created broadcast11 from broadcast at DAGScheduler.scala:1008

17/05/07 11:31:00 INFO DAGScheduler: Submitting 4missing tasks from ShuffleMapStage 11 (ParallelCollectionRDD[12] at parallelizeat <console>:27)

17/05/07 11:31:00 INFO TaskSchedulerImpl: Adding taskset 11.0 with 4 tasks

17/05/07 11:31:00 INFO Executor: Running task 1.0 instage 10.0 (TID 41)

17/05/07 11:31:00 INFO Executor: Finished task 1.0 instage 10.0 (TID 41). 1161 bytes result sent to driver

17/05/07 11:31:00 INFO Executor: Running task 3.0 instage 10.0 (TID 43)

17/05/07 11:31:00 INFO Executor: Finished task 3.0 instage 10.0 (TID 43). 1161 bytes result sent to driver

17/05/07 11:31:00 INFO TaskSetManager: Starting task0.0 in stage 11.0 (TID 44, localhost, partition 0,PROCESS_LOCAL, 2168 bytes)

17/05/07 11:31:00 INFO Executor: Running task 0.0 instage 10.0 (TID 40)

17/05/07 11:31:00 INFO TaskSetManager: Starting task1.0 in stage 11.0 (TID 45, localhost, partition 1,PROCESS_LOCAL, 2168 bytes)

17/05/07 11:31:00 INFO Executor: Running task 0.0 instage 11.0 (TID 44)

17/05/07 11:31:00 INFO Executor: Finished task 0.0 instage 10.0 (TID 40). 1161 bytes result sent to driver

17/05/07 11:31:00 INFO Executor: Running task 2.0 instage 10.0 (TID 42)

17/05/07 11:31:00 INFO TaskSetManager: Finished task1.0 in stage 10.0 (TID 41) in 19 ms on localhost (1/4)

17/05/07 11:31:00 INFO Executor: Finished task 0.0 instage 11.0 (TID 44). 1161 bytes result sent to driver

17/05/07 11:31:00 INFO TaskSetManager: Finished task3.0 in stage 10.0 (TID 43) in 19 ms on localhost (2/4)

17/05/07 11:31:00 INFO Executor: Running task 1.0 instage 11.0 (TID 45)

17/05/07 11:31:00 INFO TaskSetManager: Starting task2.0 in stage 11.0 (TID 46, localhost, partition 2,PROCESS_LOCAL, 2168 bytes)

17/05/07 11:31:00 INFO TaskSetManager: Starting task3.0 in stage 11.0 (TID 47, localhost, partition 3,PROCESS_LOCAL, 2188 bytes)

17/05/07 11:31:00 INFO Executor: Running task 2.0 instage 11.0 (TID 46)

17/05/07 11:31:00 INFO Executor: Finished task 1.0 instage 11.0 (TID 45). 1161 bytes result sent to driver

17/05/07 11:31:00 INFO Executor: Finished task 2.0 instage 11.0 (TID 46). 1161 bytes result sent to driver

17/05/07 11:31:00 INFO TaskSetManager: Finished task0.0 in stage 11.0 (TID 44) in 16 ms on localhost (1/4)

17/05/07 11:31:00 INFO TaskSetManager: Finished task1.0 in stage 11.0 (TID 45) in 16 ms on localhost (2/4)

17/05/07 11:31:00 INFO TaskSetManager: Finished task2.0 in stage 11.0 (TID 46) in 9 ms on localhost (3/4)

17/05/07 11:31:00 INFO Executor: Running task 3.0 instage 11.0 (TID 47)

17/05/07 11:31:00 INFO TaskSetManager: Finished task0.0 in stage 10.0 (TID 40) in 31 ms on localhost (3/4)

17/05/07 11:31:00 INFO Executor: Finished task 3.0 instage 11.0 (TID 47). 1161 bytes result sent to driver

17/05/07 11:31:00 INFO TaskSetManager: Finished task3.0 in stage 11.0 (TID 47) in 13 ms on localhost (4/4)

17/05/07 11:31:00 INFO TaskSchedulerImpl: RemovedTaskSet 11.0, whose tasks have all completed, from pool

17/05/07 11:31:00 INFO DAGScheduler: ShuffleMapStage11 (parallelize at <console>:27) finished in 0.032 s

17/05/07 11:31:00 INFO DAGScheduler: looking for newlyrunnable stages

17/05/07 11:31:00 INFO DAGScheduler: running:Set(ShuffleMapStage 10)

17/05/07 11:31:00 INFO DAGScheduler: waiting:Set(ResultStage 12)

17/05/07 11:31:00 INFO DAGScheduler: failed: Set()

17/05/07 11:31:00 INFO Executor: Finished task 2.0 instage 10.0 (TID 42). 1161 bytes result sent to driver

17/05/07 11:31:00 INFO TaskSetManager: Finished task2.0 in stage 10.0 (TID 42) in 90 ms on localhost (4/4)

17/05/07 11:31:00 INFO TaskSchedulerImpl: RemovedTaskSet 10.0, whose tasks have all completed, from pool

17/05/07 11:31:00 INFO DAGScheduler: ShuffleMapStage10 (parallelize at <console>:27) finished in 0.091 s

17/05/07 11:31:00 INFO DAGScheduler: looking for newlyrunnable stages

17/05/07 11:31:00 INFO DAGScheduler: running: Set()

17/05/07 11:31:00 INFO DAGScheduler: waiting: Set(ResultStage12)

17/05/07 11:31:00 INFO DAGScheduler: failed: Set()

17/05/07 11:31:00 INFO DAGScheduler: SubmittingResultStage 12 (MapPartitionsRDD[15] at join at <console>:32), which hasno missing parents

17/05/07 11:31:00 INFO MemoryStore: Block broadcast_12stored as values in memory (estimated size 3.2 KB, free 39.2 KB)

17/05/07 11:31:00 INFO MemoryStore: Blockbroadcast_12_piece0 stored as bytes in memory (estimated size 1810.0 B, free41.0 KB)

17/05/07 11:31:00 INFO BlockManagerInfo: Addedbroadcast_12_piece0 in memory on localhost:33645 (size: 1810.0 B, free: 511.1MB)

17/05/07 11:31:00 INFO SparkContext: Created broadcast12 from broadcast at DAGScheduler.scala:1008

17/05/07 11:31:00 INFO DAGScheduler: Submitting 4missing tasks from ResultStage 12 (MapPartitionsRDD[15] at join at<console>:32)

17/05/07 11:31:00 INFO TaskSchedulerImpl: Adding taskset 12.0 with 4 tasks

17/05/07 11:31:00 INFO TaskSetManager: Starting task0.0 in stage 12.0 (TID 48, localhost, partition 0,PROCESS_LOCAL, 1967 bytes)

17/05/07 11:31:00 INFO TaskSetManager: Starting task1.0 in stage 12.0 (TID 49, localhost, partition 1,PROCESS_LOCAL, 1967 bytes)

17/05/07 11:31:00 INFO TaskSetManager: Starting task2.0 in stage 12.0 (TID 50, localhost, partition 2,PROCESS_LOCAL, 1967 bytes)

17/05/07 11:31:00 INFO TaskSetManager: Starting task3.0 in stage 12.0 (TID 51, localhost, partition 3,PROCESS_LOCAL, 1967 bytes)

17/05/07 11:31:00 INFO Executor: Running task 1.0 instage 12.0 (TID 49)

17/05/07 11:31:00 INFO Executor: Running task 2.0 instage 12.0 (TID 50)

17/05/07 11:31:00 INFO Executor: Running task 3.0 instage 12.0 (TID 51)

17/05/07 11:31:00 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:31:00 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:31:00 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:31:00 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:31:00 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:31:00 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:31:00 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:31:00 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:31:00 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:31:00 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:31:00 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:31:00 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:31:00 INFO Executor: Running task 0.0 in stage 12.0 (TID 48)

17/05/07 11:31:00 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:31:00 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:31:00 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks

17/05/07 11:31:00 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/07 11:31:00 INFO Executor: Finished task 1.0 in stage 12.0 (TID 49). 1417 bytes result sent to driver

17/05/07 11:31:00 INFO Executor: Finished task 3.0 in stage 12.0 (TID 51). 1323 bytes result sent to driver

17/05/07 11:31:00 INFO Executor: Finished task 0.0 in stage 12.0 (TID 48). 1161 bytes result sent to driver

17/05/07 11:31:00 INFO Executor: Finished task 2.0 in stage 12.0 (TID 50). 1417 bytes result sent to driver

17/05/07 11:31:00 INFO TaskSetManager: Finished task 1.0 in stage 12.0 (TID 49) in 20 ms on localhost (1/4)

17/05/07 11:31:00 INFO TaskSetManager: Finished task 3.0 in stage 12.0 (TID 51) in 21 ms on localhost (2/4)

17/05/07 11:31:00 INFO TaskSetManager: Finished task 0.0 in stage 12.0 (TID 48) in 23 ms on localhost (3/4)

17/05/07 11:31:00 INFO TaskSetManager: Finished task 2.0 in stage 12.0 (TID 50) in 22 ms on localhost (4/4)

17/05/07 11:31:00 INFO TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool

17/05/07 11:31:00 INFO DAGScheduler: ResultStage 12 (collect at <console>:32) finished in 0.024 s

17/05/07 11:31:00 INFO DAGScheduler: Job 6 finished: collect at <console>:32, took 0.133590 s

res4: Array[(String, (Int, Int))] = Array((A,(1,1)), (A,(1,4)), (A,(4,1)), (A,(4,4)), (B,(2,2)), (B,(2,5)), (B,(5,2)), (B,(5,5)), (C,(3,3)))
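The res4 output above is the result of joining two pair RDDs; their definitions fall outside this excerpt. A minimal sketch that would reproduce the same result in spark-shell, assuming two hypothetical RDDs rdd1 and rdd2 with identical contents (names and data are reconstructed from res4, not taken from the original session):

```scala
// Hypothetical reconstruction of the two pair RDDs behind res4.
val rdd1 = sc.parallelize(Seq(("A", 1), ("A", 4), ("B", 2), ("B", 5), ("C", 3)))
val rdd2 = sc.parallelize(Seq(("A", 1), ("A", 4), ("B", 2), ("B", 5), ("C", 3)))

// join keeps only keys present in both RDDs and emits one (k, (v1, v2))
// pair per combination of matching values, hence (A,(1,1)), (A,(1,4)),
// (A,(4,1)), (A,(4,4)), and so on in res4.
val joined = rdd1.join(rdd2)
joined.collect().foreach(println)
```

Because key A carries two values in each RDD, the join emits 2 × 2 = 4 tuples for A, which matches the four A entries in res4.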

 

60. In spark-shell, use Scala to flatten the files in the sample-data directory with flatMap, splitting all of the data on spaces, then count the resulting tokens as Key:Value pairs (the token as Key, its occurrence count as Value). The commands and their output are shown below.

scala> val rdd4 = sc.textFile("hdfs://10.0.0.115:8020/sample-data/")

17/05/16 06:39:26 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 315.3 KB, free 315.3 KB)

17/05/16 06:39:26 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.2 KB, free 342.4 KB)

17/05/16 06:39:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:41523 (size: 27.2 KB, free: 511.1 MB)

17/05/16 06:39:26 INFO SparkContext: Created broadcast 0 from textFile at <console>:27

rdd4: org.apache.spark.rdd.RDD[String] = hdfs://10.0.0.115:8020/sample-data/ MapPartitionsRDD[1] at textFile at <console>:27

 

scala> rdd4.toDebugString

17/05/16 06:39:34 INFO FileInputFormat: Total input paths to process : 1

res1: String =

(2) hdfs://10.0.0.115:8020/sample-data/ MapPartitionsRDD[1] at textFile at <console>:27 []

 |  hdfs://10.0.0.115:8020/sample-data/ HadoopRDD[0] at textFile at <console>:27 []

 

scala> val words = rdd4.flatMap(_.split(" "))

words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:29

 

scala> val wordscount = words.map(word => (word, 1)).reduceByKey(_ + _)

wordscount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:31

 

scala> wordscount.collect

17/05/16 06:40:37 INFO SparkContext: Starting job: collect at <console>:34

17/05/16 06:40:37 INFO DAGScheduler: Registering RDD 3 (map at <console>:31)

17/05/16 06:40:37 INFO DAGScheduler: Got job 0 (collect at <console>:34) with 2 output partitions

17/05/16 06:40:37 INFO DAGScheduler: Final stage: ResultStage 1 (collect at <console>:34)

17/05/16 06:40:37 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)

17/05/16 06:40:37 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)

17/05/16 06:40:37 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at <console>:31), which has no missing parents

17/05/16 06:40:37 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.3 KB, free 346.7 KB)

17/05/16 06:40:37 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 349.0 KB)

17/05/16 06:40:37 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:41523 (size: 2.3 KB, free: 511.1 MB)

17/05/16 06:40:37 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1008

17/05/16 06:40:37 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at <console>:31)

17/05/16 06:40:37 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks

17/05/16 06:40:38 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0, ANY, 2149 bytes)

17/05/16 06:40:38 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1, ANY, 2149 bytes)

17/05/16 06:40:38 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)

17/05/16 06:40:38 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)

17/05/16 06:40:38 INFO HadoopRDD: Input split: hdfs://10.0.0.115:8020/sample-data/2jobs2min-rumen-jh.json:0+201658

17/05/16 06:40:38 INFO HadoopRDD: Input split: hdfs://10.0.0.115:8020/sample-data/2jobs2min-rumen-jh.json:201658+201659

17/05/16 06:40:38 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id

17/05/16 06:40:38 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id

17/05/16 06:40:38 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap

17/05/16 06:40:38 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition

17/05/16 06:40:38 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id

17/05/16 06:40:38 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2254 bytes result sent to driver

17/05/16 06:40:38 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2254 bytes result sent to driver

17/05/16 06:40:38 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 621 ms on localhost (1/2)

17/05/16 06:40:38 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 600 ms on localhost (2/2)

17/05/16 06:40:38 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool

17/05/16 06:40:38 INFO DAGScheduler: ShuffleMapStage 0 (map at <console>:31) finished in 0.646 s

17/05/16 06:40:38 INFO DAGScheduler: looking for newly runnable stages

17/05/16 06:40:38 INFO DAGScheduler: running: Set()

17/05/16 06:40:38 INFO DAGScheduler: waiting: Set(ResultStage 1)

17/05/16 06:40:38 INFO DAGScheduler: failed: Set()

17/05/16 06:40:38 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at <console>:31), which has no missing parents

17/05/16 06:40:38 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.7 KB, free 351.7 KB)

17/05/16 06:40:38 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1619.0 B, free 353.2 KB)

17/05/16 06:40:38 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:41523 (size: 1619.0 B, free: 511.1 MB)

17/05/16 06:40:38 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1008

17/05/16 06:40:38 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at <console>:31)

17/05/16 06:40:38 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks

17/05/16 06:40:38 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0, NODE_LOCAL, 1894 bytes)

17/05/16 06:40:38 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, partition 1, NODE_LOCAL, 1894 bytes)

17/05/16 06:40:38 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)

17/05/16 06:40:38 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)

17/05/16 06:40:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks

17/05/16 06:40:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms

17/05/16 06:40:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks

17/05/16 06:40:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

17/05/16 06:40:38 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 100625 bytes result sent to driver

17/05/16 06:40:38 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 100164 bytes result sent to driver

17/05/16 06:40:38 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 162 ms on localhost (1/2)

17/05/16 06:40:38 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 159 ms on localhost (2/2)

17/05/16 06:40:38 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool

17/05/16 06:40:38 INFO DAGScheduler: ResultStage 1 (collect at <console>:34) finished in 0.164 s

17/05/16 06:40:38 INFO DAGScheduler: Job 0 finished: collect at <console>:34, took 0.941629 s

res2: Array[(String, Int)] = Array((245438,,1), (317184,,1), (1420663,,3), (315324,,1), (1456189440,,1), (1409,,2), (309211136,,1), ("hadoop.kerberos.kinit.command",2), (819,,8), (522360,,1), ("a2115.smile.com:8032",,2), ("attempt_1369942127770_1205_m_000025_0",,1), (932,,1), (495172,,1), (309873,,1), (2507,,3), (316798,,1), ("taskID",192), (306780,,1), (11020,,1), (265201,,1), ("attempt_1369942127770_1205_m_000032_0",,1), (883386,,1), (107219,,1), (298267,1), (300513,,1), ("mapreduce.map.memory.mb",2), (3155,,1), (1108,,8), (792,1), (1230,,7), (1210,,4), ("jobName",2), (19587,,1), (1005,,7), (319750144,,1), ("fs.df.interval",2), (1135,,5), (925,,8), (1058,,2), ("4000",,2), (1382,,15), (824426,,1), (1371222227254,,2), (1439980,,3), (174844,,1), ("attempt_1369942127770_1205_m_000085_0",,...

 

scala> wordscount.toDebugString

res3: String =

(2) ShuffledRDD[4] at reduceByKey at <console>:31 []

 +-(2) MapPartitionsRDD[3] at map at <console>:31 []

    |  MapPartitionsRDD[2] at flatMap at <console>:29 []

    |  hdfs://10.0.0.115:8020/sample-data/ MapPartitionsRDD[1] at textFile at <console>:27 []

    |  hdfs://10.0.0.115:8020/sample-data/ HadoopRDD[0] at textFile at <console>:27 []
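Taken together, task 60 is the classic word count: textFile, then flatMap to split lines into tokens, then map each token to (token, 1), then reduceByKey to sum the ones. The same logic can be sanity-checked locally with plain Scala collections, no cluster needed (a sketch of the transformation chain, not the original job; the sample strings are made up):

```scala
// Spark-free sketch of the word-count pipeline above, on Scala collections.
// groupBy + sum plays the role of reduceByKey's shuffle-and-reduce.
val lines  = Seq("a b a", "c b")                       // stands in for sc.textFile(...)
val words  = lines.flatMap(_.split(" "))               // flatMap(_.split(" "))
val counts = words
  .map(w => (w, 1))                                    // map(word => (word, 1))
  .groupBy(_._1)                                       // group by key (the shuffle)
  .map { case (w, ones) => (w, ones.map(_._2).sum) }   // reduceByKey(_ + _)
// counts contains a -> 2, b -> 2, c -> 1
```

This also explains the odd-looking keys in res2 above: the input is a JSON file split on spaces, so tokens keep their trailing commas and quotes ("taskID", "245438," and so on) and are counted as distinct keys.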

 

 

61. In spark-shell, use Scala to load the search.txt file, whose column layout is described in the table below. After loading, filter out rows with fewer than 6 columns, as well as rows whose fourth column (rank) is 2 and whose fifth column (click order) is 1, then count the remaining rows. The commands and their output are shown below.

 

scala> val ardd = sc.textFile("/data/search.txt")

17/05/17 11:06:06 INFO MemoryStore: Block broadcast_14 stored as values in memory (estimated size 315.3 KB, free 1000.2 KB)

17/05/17 11:06:06 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes in memory (estimated size 27.2 KB, free 1027.3 KB)

17/05/17 11:06:06 INFO BlockManagerInfo: Added broadcast_14_piece0 in memory on localhost:36432 (size: 27.2 KB, free: 511.0 MB)

17/05/17 11:06:06 INFO SparkContext: Created broadcast 14 from textFile at <console>:27

ardd: org.apache.spark.rdd.RDD[String] = /data/search.txt MapPartitionsRDD[25] at textFile at <console>:27

 

scala> val mapardd = ardd.map(_.split('\t')).filter(_.length >= 6)

mapardd: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at filter at <console>:29

 

scala> val filterardd = mapardd.filter(_(3).toString != "2").filter(_(4).toString != "1")

filterardd: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[31] at filter at <console>:31

 

scala> filterardd.count

17/05/17 11:08:55 INFO SparkContext: Starting job: count at <console>:34

17/05/17 11:08:55 INFO DAGScheduler: Got job 12 (count at <console>:34) with 2 output partitions

17/05/17 11:08:55 INFO DAGScheduler: Final stage: ResultStage 14 (count at <console>:34)

17/05/17 11:08:55 INFO DAGScheduler: Parents of final stage: List()

17/05/17 11:08:55 INFO DAGScheduler: Missing parents: List()

17/05/17 11:08:55 INFO DAGScheduler: Submitting ResultStage 14 (MapPartitionsRDD[31] at filter at <console>:31), which has no missing parents

17/05/17 11:08:55 INFO MemoryStore: Block broadcast_17 stored as values in memory (estimated size 3.7 KB, free 1036.7 KB)

17/05/17 11:08:55 INFO MemoryStore: Block broadcast_17_piece0 stored as bytes in memory (estimated size 2026.0 B, free 1038.7 KB)

17/05/17 11:08:55 INFO BlockManagerInfo: Added broadcast_17_piece0 in memory on localhost:36432 (size: 2026.0 B, free: 511.0 MB)

17/05/17 11:08:55 INFO SparkContext: Created broadcast 17 from broadcast at DAGScheduler.scala:1008

17/05/17 11:08:55 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 14 (MapPartitionsRDD[31] at filter at <console>:31)

17/05/17 11:08:55 INFO TaskSchedulerImpl: Adding task set 14.0 with 2 tasks

17/05/17 11:08:55 INFO TaskSetManager: Starting task 0.0 in stage 14.0 (TID 30, localhost, partition 0, ANY, 2136 bytes)

17/05/17 11:08:55 INFO TaskSetManager: Starting task 1.0 in stage 14.0 (TID 31, localhost, partition 1, ANY, 2136 bytes)

17/05/17 11:08:55 INFO Executor: Running task 0.0 in stage 14.0 (TID 30)

17/05/17 11:08:55 INFO Executor: Running task 1.0 in stage 14.0 (TID 31)

17/05/17 11:08:55 INFO HadoopRDD: Input split: hdfs://master:8020/data/search.txt:57422788+57422788

17/05/17 11:08:55 INFO HadoopRDD: Input split: hdfs://master:8020/data/search.txt:0+57422788

17/05/17 11:08:55 INFO BlockManagerInfo: Removed broadcast_16_piece0 on localhost:36432 in memory (size: 2026.0 B, free: 511.0 MB)

17/05/17 11:08:55 INFO ContextCleaner: Cleaned accumulator 14

17/05/17 11:08:57 INFO Executor: Finished task 1.0 in stage 14.0 (TID 31). 2137 bytes result sent to driver

17/05/17 11:08:57 INFO Executor: Finished task 0.0 in stage 14.0 (TID 30). 2137 bytes result sent to driver

17/05/17 11:08:57 INFO TaskSetManager: Finished task 1.0 in stage 14.0 (TID 31) in 1532 ms on localhost (1/2)

17/05/17 11:08:57 INFO TaskSetManager: Finished task 0.0 in stage 14.0 (TID 30) in 1535 ms on localhost (2/2)

17/05/17 11:08:57 INFO TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks have all completed, from pool

17/05/17 11:08:57 INFO DAGScheduler: ResultStage 14 (count at <console>:34) finished in 1.535 s

17/05/17 11:08:57 INFO DAGScheduler: Job 12 finished: count at <console>:34, took 1.540537 s

res15: Long = 253772
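Note that the two chained filters drop a row when either condition matches (fourth column equals "2" or fifth column equals "1"), which is a stricter cut than dropping only rows matching both conditions at once. The chain can be checked locally on a few hypothetical tab-separated rows (search.txt itself is not shown in this excerpt, so the sample data below is made up):

```scala
// Local sketch of the filter chain from task 61 on invented rows.
val rows = Seq(
  "t1\tq1\turl\t1\t1\tx",   // dropped: click order (col 5) == 1
  "t2\tq2\turl\t2\t3\tx",   // dropped: rank (col 4) == 2
  "t3\tq3\turl\t3\t2\tx",   // kept
  "t4\tq4\tshort"           // dropped: fewer than 6 columns
)
val kept = rows.map(_.split('\t'))
  .filter(_.length >= 6)    // keep only rows with at least 6 columns
  .filter(_(3) != "2")      // drop rank == 2
  .filter(_(4) != "1")      // drop click order == 1
// kept.length == 1 (only the t3 row survives)
```

The same chain on the real RDD leaves 253772 rows, as res15 shows.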

 


Reprinted from blog.csdn.net/KamRoseLee/article/details/80280489