spark-06: MLlib

I. DAG Graphs

0. Prepare and start the cluster

Spark version:

    spark-1.5.2-bin-hadoop2.6

One master, three slaves:

    192.168.239.129  master

    192.168.239.130  slave

    192.168.239.131  slave

    192.168.239.144  slave

Configuration: /home/software/spark-1.5.2-bin-hadoop2.6/conf/spark-env.sh

    192.168.239.129  master

SPARK_LOCAL_IP=192.168.239.129
export SCALA_HOME=/home/software/scala-2.11.4
export JAVA_HOME=/home/software/jdk1.8
export HADOOP_HOME=/home/software/hadoop-2.7.1
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_JAR=/home/software/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar

    192.168.239.130  slave

SPARK_LOCAL_IP=192.168.239.130
export SCALA_HOME=/home/software/scala-2.11.4
export JAVA_HOME=/home/software/jdk1.8
export SPARK_JAR=/home/software/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar

    192.168.239.131  slave

SPARK_LOCAL_IP=192.168.239.131
export SCALA_HOME=/home/software/scala-2.11.4
export JAVA_HOME=/home/software/jdk1.8
export SPARK_JAR=/home/software/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar

    192.168.239.144  slave

SPARK_LOCAL_IP=192.168.239.144
export SCALA_HOME=/home/software/scala-2.11.4
export JAVA_HOME=/home/software/jdk1.8
export SPARK_JAR=/home/software/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar

Start:

    master: bin/spark-shell --master=spark://192.168.239.129:7077  # the master's IP address

    slave: bin/start-slave.sh spark://192.168.239.129:7077

1. Prepare the code:

map does not trigger a shuffle:
val rdd = sc.makeRDD(List(1,2,3,4,5,7,8))
rdd.map(x=>x*10)

join triggers a shuffle:
val rdd1 = sc.makeRDD(List(("1","a"),("2","b"),("3","c"),("4","d")))
val rdd2 = sc.makeRDD(List(("1","A"),("2","B"),("3","C"),("4","D")))
rdd1.join(rdd2)

A complete example (a word count written with plain Scala collections; note that scala.io.Source.fromFile reads from the local filesystem, so running it against the HDFS path below would require a local copy, and each line is assumed to hold one word):
val list = scala.io.Source.fromFile("hdfs:///spark/words.txt").getLines().toList
.map { (_,1) }
.groupBy { _._1 }
.mapValues{ _.map{ _._2 }.reduce{ _+_ } }
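
For comparison, a minimal sketch of the same word count as Spark RDD operations (same hdfs:///spark/words.txt path, one word per line assumed); reduceByKey is the step that triggers a shuffle:

val counts = sc.textFile("hdfs:///spark/words.txt")
  .map(word => (word, 1))   // narrow transformation: no shuffle
  .reduceByKey(_ + _)       // wide transformation: shuffles by key
counts.collect().foreach(println)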

A utility for inspecting partitions:
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object su {
  // print each partition's index followed by the elements it holds
  def debug[T: ClassTag](rdd: RDD[T]) = {
    rdd.mapPartitionsWithIndex((i: Int, iter: Iterator[T]) => {
      val m = scala.collection.mutable.Map[Int, List[T]]()   // partition index -> its elements
      var list = List[T]()
      while (iter.hasNext) {
        list = list :+ iter.next
      }
      m(i) = list
      m.iterator
    }).collect().foreach((x: Tuple2[Int, List[T]]) => {
      val i = x._1
      println(s"partition:[$i]")
      x._2.foreach { println }
    })
  }
}

2. map: no shuffle occurs

scala> val rdd = sc.makeRDD(List(1,2,3,4,5,7,8))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:21

scala> rdd.map(x=>x*10)
res0: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:24

scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD

scala> import scala.reflect.ClassTag
import scala.reflect.ClassTag

scala> object su {
     |   def debug[T: ClassTag](rdd: RDD[T]) = {
     |     rdd.mapPartitionsWithIndex((i: Int, iter: Iterator[T]) => {
     |       val m = scala.collection.mutable.Map[Int, List[T]]()
     |       var list = List[T]()
     |       while (iter.hasNext) {
     |         list = list :+ iter.next
     |       }
     |       m(i) = list
     |       m.iterator
     |     }).collect().foreach((x: Tuple2[Int, List[T]]) => {
     |       val i = x._1
     |       println(s"partition:[$i]")
     |       x._2.foreach { println }
     |     })
     |   }
     | }
defined module su

scala> su.debug(res0)
18/05/05 20:20:08 INFO spark.SparkContext: Starting job: collect at <console>:35
18/05/05 20:20:08 INFO scheduler.DAGScheduler: Got job 0 (collect at <console>:35) with 6 output partitions
18/05/05 20:20:08 INFO scheduler.DAGScheduler: Final stage: ResultStage 0(collect at <console>:35)
18/05/05 20:20:08 INFO scheduler.DAGScheduler: Parents of final stage: List()
18/05/05 20:20:08 INFO scheduler.DAGScheduler: Missing parents: List()
18/05/05 20:20:08 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at mapPartitionsWithIndex at <console>:27), which has no missing parents
18/05/05 20:20:09 INFO storage.MemoryStore: ensureFreeSpace(2304) called with curMem=0, maxMem=555755765
18/05/05 20:20:09 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.3 KB, free 530.0 MB)
18/05/05 20:20:09 INFO storage.MemoryStore: ensureFreeSpace(1388) called with curMem=2304, maxMem=555755765
18/05/05 20:20:09 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1388.0 B, free 530.0 MB)
18/05/05 20:20:09 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.239.129:57789 (size: 1388.0 B, free: 530.0 MB)
18/05/05 20:20:09 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:861
18/05/05 20:20:09 INFO scheduler.DAGScheduler: Submitting 6 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at mapPartitionsWithIndex at <console>:27)
18/05/05 20:20:09 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 6 tasks
18/05/05 20:20:09 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.239.130, PROCESS_LOCAL, 2029 bytes)
18/05/05 20:20:09 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 192.168.239.144, PROCESS_LOCAL, 2029 bytes)
18/05/05 20:20:09 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 192.168.239.131, PROCESS_LOCAL, 2029 bytes)
18/05/05 20:20:09 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 192.168.239.130, PROCESS_LOCAL, 2029 bytes)
18/05/05 20:20:09 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 192.168.239.144, PROCESS_LOCAL, 2029 bytes)
18/05/05 20:20:09 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 192.168.239.131, PROCESS_LOCAL, 2033 bytes)
18/05/05 20:20:19 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.239.144:43403 (size: 1388.0 B, free: 534.5 MB)
18/05/05 20:20:19 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.239.130:53716 (size: 1388.0 B, free: 534.5 MB)
18/05/05 20:20:19 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.239.131:43057 (size: 1388.0 B, free: 534.5 MB)
18/05/05 20:20:28 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 18557 ms on 192.168.239.144 (1/6)
18/05/05 20:20:28 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 18570 ms on 192.168.239.144 (2/6)
18/05/05 20:20:29 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 20147 ms on 192.168.239.130 (3/6)
18/05/05 20:20:29 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 20089 ms on 192.168.239.130 (4/6)
18/05/05 20:20:30 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 20785 ms on 192.168.239.131 (5/6)
18/05/05 20:20:30 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 20781 ms on 192.168.239.131 (6/6)
18/05/05 20:20:30 INFO scheduler.DAGScheduler: ResultStage 0 (collect at <console>:35) finished in 20.883 s
18/05/05 20:20:30 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
18/05/05 20:20:30 INFO scheduler.DAGScheduler: Job 0 finished: collect at <console>:35, took 22.654117 s
partition:[0]
10
partition:[1]
20
partition:[2]
30
partition:[3]
40
partition:[4]
50
partition:[5]
70
80

View the DAG graph (use Firefox: in this setup the QQ browser and Google Chrome could not render the DAG graph; the QQ browser failed to load it and Chrome showed only two black dots; screenshots omitted):

3. join: a shuffle occurs

scala> val rdd1 = sc.makeRDD(List(("1","a"),("2","b"),("3","c"),("4","d")))
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[3] at makeRDD at <console>:23

scala> val rdd2 = sc.makeRDD(List(("1","A"),("2","B"),("3","C"),("4","D")))
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[4] at makeRDD at <console>:23

scala> rdd1.join(rdd2)
res3: org.apache.spark.rdd.RDD[(String, (String, String))] = MapPartitionsRDD[7] at join at <console>:28

scala> res3.collect
18/05/05 21:10:38 INFO spark.SparkContext: Starting job: collect at <console>:30
18/05/05 21:10:38 INFO scheduler.DAGScheduler: Registering RDD 3 (makeRDD at <console>:23)
18/05/05 21:10:38 INFO scheduler.DAGScheduler: Registering RDD 4 (makeRDD at <console>:23)
18/05/05 21:10:38 INFO scheduler.DAGScheduler: Got job 1 (collect at <console>:30) with 6 output partitions
18/05/05 21:10:38 INFO scheduler.DAGScheduler: Final stage: ResultStage 3(collect at <console>:30)
18/05/05 21:10:38 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 1, ShuffleMapStage 2)
18/05/05 21:10:38 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 1, ShuffleMapStage 2)
18/05/05 21:10:38 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 1 (ParallelCollectionRDD[3] at makeRDD at <console>:23), which has no missing parents
18/05/05 21:10:38 INFO storage.MemoryStore: ensureFreeSpace(1520) called with curMem=0, maxMem=555755765
18/05/05 21:10:38 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 1520.0 B, free 530.0 MB)
18/05/05 21:10:42 INFO storage.MemoryStore: ensureFreeSpace(982) called with curMem=1520, maxMem=555755765
18/05/05 21:10:42 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 982.0 B, free 530.0 MB)
18/05/05 21:10:42 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.239.129:57789 (size: 982.0 B, free: 530.0 MB)
18/05/05 21:10:42 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861
18/05/05 21:10:42 INFO scheduler.DAGScheduler: Submitting 6 missing tasks from ShuffleMapStage 1 (ParallelCollectionRDD[3] at makeRDD at <console>:23)
18/05/05 21:10:42 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 6 tasks
18/05/05 21:10:42 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 2 (ParallelCollectionRDD[4] at makeRDD at <console>:23), which has no missing parents
18/05/05 21:10:43 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 6, 192.168.239.130, PROCESS_LOCAL, 2072 bytes)
18/05/05 21:10:43 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 7, 192.168.239.131, PROCESS_LOCAL, 2129 bytes)
18/05/05 21:10:43 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 8, 192.168.239.144, PROCESS_LOCAL, 2129 bytes)
18/05/05 21:10:43 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 1.0 (TID 9, 192.168.239.130, PROCESS_LOCAL, 2072 bytes)
18/05/05 21:10:43 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 1.0 (TID 10, 192.168.239.131, PROCESS_LOCAL, 2129 bytes)
18/05/05 21:10:43 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 1.0 (TID 11, 192.168.239.144, PROCESS_LOCAL, 2129 bytes)
18/05/05 21:10:43 INFO storage.MemoryStore: ensureFreeSpace(1520) called with curMem=2502, maxMem=555755765
18/05/05 21:10:43 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 1520.0 B, free 530.0 MB)
18/05/05 21:10:43 INFO storage.MemoryStore: ensureFreeSpace(984) called with curMem=4022, maxMem=555755765
18/05/05 21:10:43 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 984.0 B, free 530.0 MB)
18/05/05 21:10:43 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.239.129:57789 (size: 984.0 B, free: 530.0 MB)
18/05/05 21:10:43 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:861
18/05/05 21:10:43 INFO scheduler.DAGScheduler: Submitting 6 missing tasks from ShuffleMapStage 2 (ParallelCollectionRDD[4] at makeRDD at <console>:23)
18/05/05 21:10:43 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 6 tasks
18/05/05 21:10:43 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.239.144:43403 (size: 982.0 B, free: 534.5 MB)
18/05/05 21:10:45 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.239.131:43057 (size: 982.0 B, free: 534.5 MB)
18/05/05 21:10:46 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.239.130:53716 (size: 982.0 B, free: 534.5 MB)
18/05/05 21:10:48 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 12, 192.168.239.144, PROCESS_LOCAL, 2072 bytes)
18/05/05 21:10:48 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 2.0 (TID 13, 192.168.239.144, PROCESS_LOCAL, 2129 bytes)
18/05/05 21:10:48 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.239.144:43403 (size: 984.0 B, free: 534.5 MB)
18/05/05 21:10:48 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 2.0 (TID 14, 192.168.239.144, PROCESS_LOCAL, 2129 bytes)
18/05/05 21:10:48 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 2.0 (TID 15, 192.168.239.144, PROCESS_LOCAL, 2072 bytes)
18/05/05 21:10:48 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 12) in 624 ms on 192.168.239.144 (1/6)
18/05/05 21:10:48 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 2.0 (TID 16, 192.168.239.144, PROCESS_LOCAL, 2129 bytes)
18/05/05 21:10:48 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 2.0 (TID 13) in 629 ms on 192.168.239.144 (2/6)
18/05/05 21:10:48 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 8) in 5949 ms on 192.168.239.144 (1/6)
18/05/05 21:10:48 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 1.0 (TID 11) in 5962 ms on 192.168.239.144 (2/6)
18/05/05 21:10:48 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 2.0 (TID 14) in 464 ms on 192.168.239.144 (3/6)
18/05/05 21:10:48 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 2.0 (TID 17, 192.168.239.144, PROCESS_LOCAL, 2129 bytes)
18/05/05 21:10:48 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 2.0 (TID 15) in 454 ms on 192.168.239.144 (4/6)
18/05/05 21:10:49 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 2.0 (TID 17) in 30 ms on 192.168.239.144 (5/6)
18/05/05 21:10:49 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 2.0 (TID 16) in 94 ms on 192.168.239.144 (6/6)
18/05/05 21:10:49 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
18/05/05 21:10:49 INFO scheduler.DAGScheduler: ShuffleMapStage 2 (makeRDD at <console>:23) finished in 5.973 s
18/05/05 21:10:49 INFO scheduler.DAGScheduler: looking for newly runnable stages
18/05/05 21:10:49 INFO scheduler.DAGScheduler: running: Set(ShuffleMapStage 1)
18/05/05 21:10:49 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 3)
18/05/05 21:10:49 INFO scheduler.DAGScheduler: failed: Set()
18/05/05 21:10:50 INFO scheduler.DAGScheduler: Missing parents for ResultStage 3: List(ShuffleMapStage 1)
18/05/05 21:10:50 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 6) in 7303 ms on 192.168.239.130 (3/6)
18/05/05 21:10:50 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 1.0 (TID 9) in 7297 ms on 192.168.239.130 (4/6)
18/05/05 21:10:50 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 7) in 7687 ms on 192.168.239.131 (5/6)
18/05/05 21:10:50 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 1.0 (TID 10) in 7687 ms on 192.168.239.131 (6/6)
18/05/05 21:10:50 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
18/05/05 21:10:50 INFO scheduler.DAGScheduler: ShuffleMapStage 1 (makeRDD at <console>:23) finished in 7.695 s
18/05/05 21:10:50 INFO scheduler.DAGScheduler: looking for newly runnable stages
18/05/05 21:10:50 INFO scheduler.DAGScheduler: running: Set()
18/05/05 21:10:50 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 3)
18/05/05 21:10:50 INFO scheduler.DAGScheduler: failed: Set()
18/05/05 21:10:50 INFO scheduler.DAGScheduler: Missing parents for ResultStage 3: List()
18/05/05 21:10:50 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[7] at join at <console>:28), which is now runnable
18/05/05 21:10:50 INFO storage.MemoryStore: ensureFreeSpace(2816) called with curMem=5006, maxMem=555755765
18/05/05 21:10:50 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.8 KB, free 530.0 MB)
18/05/05 21:10:50 INFO storage.MemoryStore: ensureFreeSpace(1558) called with curMem=7822, maxMem=555755765
18/05/05 21:10:50 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1558.0 B, free 530.0 MB)
18/05/05 21:10:50 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.239.129:57789 (size: 1558.0 B, free: 530.0 MB)
18/05/05 21:10:50 INFO spark.SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:861
18/05/05 21:10:50 INFO scheduler.DAGScheduler: Submitting 6 missing tasks from ResultStage 3 (MapPartitionsRDD[7] at join at <console>:28)
18/05/05 21:10:50 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 6 tasks
18/05/05 21:10:50 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 18, 192.168.239.144, PROCESS_LOCAL, 1974 bytes)
18/05/05 21:10:50 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 3.0 (TID 19, 192.168.239.131, PROCESS_LOCAL, 1974 bytes)
18/05/05 21:10:50 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 3.0 (TID 20, 192.168.239.130, PROCESS_LOCAL, 1974 bytes)
18/05/05 21:10:50 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 3.0 (TID 21, 192.168.239.144, PROCESS_LOCAL, 1974 bytes)
18/05/05 21:10:50 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 3.0 (TID 22, 192.168.239.131, PROCESS_LOCAL, 1974 bytes)
18/05/05 21:10:50 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 3.0 (TID 23, 192.168.239.130, PROCESS_LOCAL, 1974 bytes)
18/05/05 21:10:50 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.239.131:43057 (size: 1558.0 B, free: 534.5 MB)
18/05/05 21:10:51 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 192.168.239.131:58335
18/05/05 21:10:51 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.239.130:53716 (size: 1558.0 B, free: 534.5 MB)
18/05/05 21:10:51 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.239.144:43403 (size: 1558.0 B, free: 534.5 MB)
18/05/05 21:10:51 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 196 bytes
18/05/05 21:10:52 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 192.168.239.144:49621
18/05/05 21:10:52 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 192.168.239.130:56518
18/05/05 21:10:53 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to 192.168.239.131:58335
18/05/05 21:10:53 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 175 bytes
18/05/05 21:10:53 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to 192.168.239.144:49621
18/05/05 21:10:53 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to 192.168.239.130:56518
18/05/05 21:10:54 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 18) in 3432 ms on 192.168.239.144 (1/6)
18/05/05 21:10:54 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 3.0 (TID 21) in 3509 ms on 192.168.239.144 (2/6)
18/05/05 21:10:54 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 3.0 (TID 19) in 3737 ms on 192.168.239.131 (3/6)
18/05/05 21:10:54 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 3.0 (TID 22) in 3738 ms on 192.168.239.131 (4/6)
18/05/05 21:10:54 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 3.0 (TID 23) in 3758 ms on 192.168.239.130 (5/6)
18/05/05 21:10:54 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 3.0 (TID 20) in 3982 ms on 192.168.239.130 (6/6)
18/05/05 21:10:54 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool 
18/05/05 21:10:54 INFO scheduler.DAGScheduler: ResultStage 3 (collect at <console>:30) finished in 4.009 s
18/05/05 21:10:54 INFO scheduler.DAGScheduler: Job 1 finished: collect at <console>:30, took 16.052410 s
res4: Array[(String, (String, String))] = Array((1,(a,A)), (2,(b,B)), (3,(c,C)), (4,(d,D)))

scala> 

The locality levels seen in the "Starting task" log lines above:

No.  Name            Meaning
1    PROCESS_LOCAL   the HDFS data and the worker are in the same process
2    NODE_LOCAL      the HDFS data and the worker are on the same server
3    NO_PREF         Spark cannot determine where the data is
4    RACK_LOCAL      the HDFS data and the worker are in the same rack
5    ANY             the HDFS data and the worker are not in the same rack
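
As a side note, how long the scheduler waits for a more local slot before falling back to the next level is tunable; spark.locality.wait is the real Spark setting, and the value below is only illustrative:

// illustrative sketch: wait up to 5s per locality level before downgrading
import org.apache.spark.SparkConf
val conf = new SparkConf().set("spark.locality.wait", "5s")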


II. Machine Learning

MLlib (scientific computing; it calls for a solid mathematical foundation)

1)  Turn the actual requirement into a mathematical formula to compute with; this process is data modeling.

Requirement: buying a 200 m² apartment in Haidian district, how much will it cost?

2)  Once there is a model, you can plug values into the formula and compute (machine learning).

From the scatter plot of sample prices (figure omitted):

1)  Find the line closest to the real data (the sample data).

2)  x = 200; x is called the feature value.

3)  The blue points are the sample data; they will later be used to train the model.

III. Mathematical Modeling

Goal: let the computer do the computing.

Formula: the equation of a straight line (original figure omitted; in symbols):

    h(x) = θ0·x0 + θ1·x1

x0 = 1; without the constant term the line would pass through the origin.

In the math we reduce the number of variables; the fewer the variables, the easier the computation.

Once all these variables are given values, the line naturally comes out.

Plug in countless combinations of values for the variables and find the line closest to the blue points; the formula that produces these results is called the prediction function.

To check which line ends up closest to the real values, we use a checking function, called the cost (error) function.


1) h is the prediction function, x the feature value, m = 22 the number of samples, and 2 the number of features (x0, x1).

2) y is the computed target (the real transaction price); m is the sample count.

3) We want prediction minus actual value to be as small as possible (compare 100 with 0.00000000001). Mathematically this leads to the mean squared error, which also magnifies the gaps.

Take the m squared differences (h - y)² (squaring cancels out the sign), add them all up, and divide the result by 2m.
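
Written out (the standard linear regression formulation; the symbols match the list above):

h_\theta(x) = \theta_0 x_0 + \theta_1 x_1

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2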

IV. Summary

1)  With the prediction function h and the formula, we can compute a predicted house price.

2)  The error function verifies which h is closest to the real transaction price.

Once these two formulas are fixed, the computer can do the rest.

Data modeling is precisely the process of turning a business requirement into mathematical formulas.

Evaluating the error function produces many results (one per candidate line).

V. Implementation via Gradient Descent

1)  It skips over many points; a large number of points never need to be computed, which speeds up the calculation.

2)  Add a parameter a, the descent rate (learning rate), to shrink the height of each step.

Gradient descent thus reaches the minimum.
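
The update rule (standard gradient descent for the cost above; α is the rate a):

\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}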

VI. Tools: Octave and MATLAB (for scientific computing)

They come with ready-made APIs that make it easy to draw plots, work with matrices and vectors, and generate special test data.

Matrix (multiple columns):

1 2

3 4

Vector (one column):

1

2

3

For a computer, matrix and vector operations are easier to express and faster to compute than plugging values into the formula directly.



Tools:

1. MATLAB: the commercial edition must be paid for.

A note on languages: R versus concurrency-oriented languages such as Go and Scala. R ships with many scientific computing implementations; Scala implements these mathematical formulas itself, following R's lead, and Spark, being built on RDDs, has gradually re-implemented these scientific computing formulas on top of RDDs.

2. Octave: the open-source version of MATLAB, with far fewer features.


(1) A vector is a special matrix with only one column.

Matrix × vector: multiply each element of a matrix row by the corresponding element of the vector and sum them up.

1 4

2 5

3 6

Vector:

1

2

Result:

1*1+4*2=9

2*1+5*2=12

3*1+6*2=15

Matrix × vector yields a vector.

Convention: matrices are written uppercase, vectors lowercase.

(2) Many commonly used functions are provided.

Reading a file:

cd c:		# go to drive C
cd txt 		# enter the directory
ls 			# list the files in the current directory
F = load("prices.txt")	# a full path such as c:/txt/prices.txt also works

(3) Matrix operations

Matrices can be added, subtracted, and multiplied.

The earlier prediction function is exactly each row of a matrix product:

A=[θ0,θ1;θ0,θ1;θ0,θ1]       three rows of data

B=[X0;X1]

Row 1: θ0*X0 + θ1*X1

Row 2: θ0*X0 + θ1*X1

Row 3: θ0*X0 + θ1*X1
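
In matrix form (as used in the case study below, where each row of X holds one sample's features and the parameters sit in a vector θ):

h = X\theta, \qquad h_i = \theta_0 x_{i0} + \theta_1 x_{i1}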

VII. Case Study

1. Turn everything into matrix and vector operations and build the formula

1)  Read the file

>> F = load("prices.txt")
F =

    825.0000      1.0000    135.0000      3.0000      2.0000
    997.5000      1.0000    133.0000      3.0000      2.0000
   1005.0000      1.0000    134.0000      3.0000      2.0000
    384.0000      1.0000     64.0000      3.0000      2.0000
    270.0000      1.0000     45.0000      2.0000      1.0000
    459.6000      1.0000     76.0000      1.0000      1.0000
    388.8000      1.0000     64.8000      1.0000      1.0000
    713.4000      1.0000    118.9000      1.0000      1.0000
    218.4000      1.0000     39.0000      1.0000      1.0000
   1145.5000      1.0000    145.0000      3.0000      2.0000
   1864.4000      1.0000    236.0000      4.0000      2.0000
    539.0000      1.0000     77.0000      2.0000      2.0000
    679.0000      1.0000     97.0000      3.0000      2.0000
    756.0000      1.0000    108.0000      3.0000      2.0000
    784.0000      1.0000    112.0000      3.0000      2.0000
    487.5000      1.0000     75.0000      1.0000      2.0000
    780.0000      1.0000    120.0000      2.0000      2.0000
    780.0000      1.0000    120.0000      2.0000      2.0000
    780.0000      1.0000    120.0000      3.0000      2.0000
    995.5000      1.0000    147.0000      3.0000      2.0000
   1072.5000      1.0000    165.0000      3.0000      2.0000
   1072.5000      1.0000    165.0000      3.0000      2.0000

2)  m = the number of samples

>> m=size(F,1)
m =  22

3)  x0,很多1,m个1

>> x0=ones(m,1)
x0 =

   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1

4)  x1: the third column read from the file

>> x1=F(:,3)
x1 =

   135.000
   133.000
   134.000
    64.000
    45.000
    76.000
    64.800
   118.900
    39.000
   145.000
   236.000
    77.000
    97.000
   108.000
   112.000
    75.000
   120.000
   120.000
   120.000
   147.000
   165.000
   165.000

5)  X = [x0, x1]: merge the two columns into one matrix (concatenation, not addition)

>> X=[x0,x1]
X =

     1.0000   135.0000
     1.0000   133.0000
     1.0000   134.0000
     1.0000    64.0000
     1.0000    45.0000
     1.0000    76.0000
     1.0000    64.8000
     1.0000   118.9000
     1.0000    39.0000
     1.0000   145.0000
     1.0000   236.0000
     1.0000    77.0000
     1.0000    97.0000
     1.0000   108.0000
     1.0000   112.0000
     1.0000    75.0000
     1.0000   120.0000
     1.0000   120.0000
     1.0000   120.0000
     1.0000   147.0000
     1.0000   165.0000
     1.0000   165.0000

6)  y=F(:,1)

>> y=F(:,1)
y =

    825.00
    997.50
   1005.00
    384.00
    270.00
    459.60
    388.80
    713.40
    218.40
   1145.50
   1864.40
    539.00
    679.00
    756.00
    784.00
    487.50
    780.00
    780.00
    780.00
    995.50
   1072.50
   1072.50

7)  z=[6;7]

z stands for the parameter vector; here it is assigned 6 and 7, but it keeps changing.

>> z=[6;7]
z =

   6
   7

8)  H=X*z

9)  H-y

10)  r=(H-y).^2

11)  sum(r)

12) sum(r)/2/m

Steps 8)-12) combined: sum((X*z-y).^2)/2/m

>> sum((X*z-y).^2)/2/m
ans =  3445.9
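
In symbols, that one-liner is exactly the cost function from earlier:

J(z) = \frac{1}{2m} \lVert X z - y \rVert^2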

2. Use an algorithm to get the values of θ0 and θ1; this is where the earlier cost function comes into play.

1. Theory:

An Octave function is defined as: function returnValue = functionName(arguments).

z0 is θ0; give it the range -100 to 100.

z1 is θ1; give it the range -100 to 100.

X is the matrix from before.

y is the vector from before.

function Z = costFunction(z0,z1,X,y)
	Z = zeros(length(z0),length(z1));	% matrix of zeros to collect the results
	m = length(y);							% number of samples
	% two nested loops try every combination of θ0 and θ1
	for i=1:length(z0)
		for j=1:length(z1)
			% the cost formula; fills every length(z0)*length(z1) combination into Z(i,j)
			Z(i,j) = sum((X*[z0(i);z1(j)]-y).^2)/(2*m);
		end
	end
	Z = Z';		% the single quote transposes the matrix
end

2. Create a user-defined function:

(1) Save the code above with UTF-8 encoding (to avoid garbled characters);

(2) the file name must match the function name;

(3) change the file extension to .m so that Octave can recognize the file.

Verify whether the function has been loaded:

Call the function with deliberately wrong arguments.

If the error says the function name itself is undefined, as below, the function has not been loaded:

>> costFunction(x)
error: 'costFunction' undefined near line 1 column 1

If the error instead complains about the arguments, Octave has already recognized the function:

>> costFunction(x)
error: 'z1' undefined near line 2 column 30
error: called from
    costFunction at line 2 column 4
error: evaluating argument list element number 1
error: called from
    costFunction at line 2 column 4
error: evaluating argument list element number 2
error: called from
    costFunction at line 2 column 4

3. Call the user-defined function

z0 is θ0; give it the range -100 to 100.

>> z0=[-100:1:100]   # written like this, every value from -100 to 100 in steps of 1 is printed
z0 =

 Columns 1 through 25:

  -100   -99   -98   -97   -96   -95   -94   -93   -92   -91   -90   -89   -88   -87   -86   -85   -84   -83   -82   -81   -80   -79   -78   -77   -76

 Columns 26 through 50:

   -75   -74   -73   -72   -71   -70   -69   -68   -67   -66   -65   -64   -63   -62   -61   -60   -59   -58   -57   -56   -55   -54   -53   -52   -51

 Columns 51 through 75:

   -50   -49   -48   -47   -46   -45   -44   -43   -42   -41   -40   -39   -38   -37   -36   -35   -34   -33   -32   -31   -30   -29   -28   -27   -26

 Columns 76 through 100:

   -25   -24   -23   -22   -21   -20   -19   -18   -17   -16   -15   -14   -13   -12   -11   -10    -9    -8    -7    -6    -5    -4    -3    -2    -1

 Columns 101 through 125:

     0     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24

 Columns 126 through 150:

    25    26    27    28    29    30    31    32    33    34    35    36    37    38    39    40    41    42    43    44    45    46    47    48    49

 Columns 151 through 175:

    50    51    52    53    54    55    56    57    58    59    60    61    62    63    64    65    66    67    68    69    70    71    72    73    74

 Columns 176 through 200:

    75    76    77    78    79    80    81    82    83    84    85    86    87    88    89    90    91    92    93    94    95    96    97    98    99

 Column 201:

   100

>> z0=[-100:1:100];   # with the trailing semicolon, the values are not printed

z1 is θ1; give it the range -100 to 100.

>> z1=[-100:1:100];

X is the matrix from before.

y is the vector from before.

Call the function:

>> J=costFunction(z0,z1,X,y)
J =

 Columns 1 through 10:

  8.6161e+007  8.6149e+007  8.6137e+007  8.6125e+007  8.6113e+007  8.6100e+007  8.6088e+007  8.6076e+007  8.6064e+007  8.6051e+007
  8.4569e+007  8.4557e+007  8.4545e+007  8.4533e+007  8.4521e+007  8.4508e+007  8.4496e+007  8.4484e+007  8.4472e+007  8.4460e+007
  8.2991e+007  8.2979e+007  8.2967e+007  8.2955e+007  8.2943e+007  8.2931e+007  8.2919e+007  8.2907e+007  8.2895e+007  8.2883e+007
  8.1429e+007  8.1417e+007  8.1405e+007  8.1393e+007  8.1381e+007  8.1369e+007  8.1357e+007  8.1346e+007  8.1334e+007  8.1322e+007
  7.9881e+007  7.9869e+007  7.9857e+007  7.9846e+007  7.9834e+007  7.9822e+007  7.9810e+007  7.9798e+007  7.9787e+007  7.9775e+007
  7.8348e+007  7.8336e+007  7.8325e+007  7.8313e+007  7.8301e+007  7.8290e+007  7.8278e+007  7.8266e+007  7.8255e+007  7.8243e+007
  7.6830e+007  7.6818e+007  7.6807e+007  7.6795e+007  7.6784e+007  7.6772e+007  7.6760e+007  7.6749e+007  7.6737e+007  7.6726e+007
  7.5326e+007  7.5315e+007  7.5304e+007  7.5292e+007  7.5281e+007  7.5269e+007  7.5258e+007  7.5246e+007  7.5235e+007  7.5224e+007
  7.3838e+007  7.3827e+007  7.3815e+007  7.3804e+007  7.3793e+007  7.3781e+007  7.3770e+007  7.3759e+007  7.3747e+007  7.3736e+007
  7.2364e+007  7.2353e+007  7.2342e+007  7.2331e+007  7.2320e+007  7.2308e+007  7.2297e+007  7.2286e+007  7.2275e+007  7.2264e+007
  7.0906e+007  7.0894e+007  7.0883e+007  7.0872e+007  7.0861e+007  7.0850e+007  7.0839e+007  7.0828e+007  7.0817e+007  7.0806e+007
  6.9462e+007  6.9451e+007  6.9440e+007  6.9429e+007  6.9418e+007  6.9407e+007  6.9396e+007  6.9385e+007  6.9374e+007  6.9363e+007
  6.8033e+007  6.8022e+007  6.8011e+007  6.8000e+007  6.7989e+007  6.7978e+007  6.7968e+007  6.7957e+007  6.7946e+007  6.7935e+007
  6.6618e+007  6.6608e+007  6.6597e+007  6.6586e+007  6.6576e+007  6.6565e+007  6.6554e+007  6.6543e+007  6.6533e+007  6.6522e+007
  6.5219e+007  6.5209e+007  6.5198e+007  6.5187e+007  6.5177e+007  6.5166e+007  6.5155e+007  6.5145e+007  6.5134e+007  6.5124e+007
  6.3835e+007  6.3824e+007  6.3814e+007  6.3803e+007  6.3793e+007  6.3782e+007  6.3772e+007  6.3761e+007  6.3751e+007  6.3740e+007
  6.2465e+007  6.2455e+007  6.2444e+007  6.2434e+007  6.2423e+007  6.2413e+007  6.2403e+007  6.2392e+007  6.2382e+007  6.2371e+007
  6.1110e+007  6.1100e+007  6.1090e+007  6.1079e+007  6.1069e+007  6.1059e+007  6.1049e+007  6.1038e+007  6.1028e+007  6.1018e+007
  5.9770e+007  5.9760e+007  5.9750e+007  5.9740e+007  5.9730e+007  5.9720e+007  5.9709e+007  5.9699e+007  5.9689e+007  5.9679e+007
  5.8445e+007  5.8435e+007  5.8425e+007  5.8415e+007  5.8405e+007  5.8395e+007  5.8385e+007  5.8375e+007  5.8365e+007  5.8355e+007
  5.7135e+007  5.7125e+007  5.7115e+007  5.7105e+007  5.7095e+007  5.7085e+007  5.7075e+007  5.7065e+007  5.7056e+007  5.7046e+007
  5.5840e+007  5.5830e+007  5.5820e+007  5.5810e+007  5.5800e+007  5.5791e+007  5.5781e+007  5.5771e+007  5.5761e+007  5.5751e+007

Generate plots for a direct visual check:

plot3(J)		% 3D plot; rendering is slow, wait a moment
plot(J)			% 2D plot



4. Define the gradient descent function and call it

The idea, in words:

Loop many times, constantly changing θ1 and watching J(θ1): if the value is 3000, keep looping; still 3000, keep looping; still 3000 and no longer changing, then we can take the minimum as found.

function [z,J_his] = descentFunction(z,X,y,a,iters)
	J_his = zeros(iters,1);	% cost history, one entry per iteration
	m = length(y);		% number of samples
	n = length(z);		% number of parameters
	t = zeros(n,1);		% per-parameter update amounts
	for iter = 1:iters
		for i = 1:n
			% gradient step for parameter i
			t(i) = (a/m)*(X*z-y)'*X(:,i);
		end;
		for i = 1:n
			z(i) = z(i) - t(i);
		end;
		J_his(iter) = sum((X*z-y).^2)/(2*m);	% record the cost after this iteration
	end
end
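
The line t(i) = (a/m)*(X*z-y)'*X(:,i) is the learning rate times the partial derivative of the cost:

t_i = \alpha \, \frac{\partial J}{\partial \theta_i} = \frac{\alpha}{m} (X\theta - y)^{\top} X_{:,i}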
After saving the code above as a .m file, call it as follows:
>> cd c:
>> dir
$360Section                Documents and Settings     PerfLogs                   Users                      grldr
$Recycle.Bin               Drivers                    Program Files              Windows                    mm.cfg
360SANDBOX                 InstallConfig.ini          Program Files (x86)        bootmgr                    offline_FtnInfo.txt
Boot                       Intel                      ProgramData                ckcore.txt                 pagefile.sys
Config.Msi                 MSOCache                   RECYCLER                   costFunction.m
DRMsoft                    OEMSF                      System Volume Information  descentFunction.m
>> descentFunction(1)
error: 'iters' undefined near line 2 column 16
error: called from
    descentFunction at line 2 column 8
error: evaluating argument list element number 1
error: called from
    descentFunction at line 2 column 8
>> z=[0;6] 
z =

   0
   6

>> A = load("prices.txt")
A =

    825.0000      1.0000    135.0000      3.0000      2.0000
    997.5000      1.0000    133.0000      3.0000      2.0000
   1005.0000      1.0000    134.0000      3.0000      2.0000
    384.0000      1.0000     64.0000      3.0000      2.0000
    270.0000      1.0000     45.0000      2.0000      1.0000
    459.6000      1.0000     76.0000      1.0000      1.0000
    388.8000      1.0000     64.8000      1.0000      1.0000
    713.4000      1.0000    118.9000      1.0000      1.0000
    218.4000      1.0000     39.0000      1.0000      1.0000
   1145.5000      1.0000    145.0000      3.0000      2.0000
   1864.4000      1.0000    236.0000      4.0000      2.0000
    539.0000      1.0000     77.0000      2.0000      2.0000
    679.0000      1.0000     97.0000      3.0000      2.0000
    756.0000      1.0000    108.0000      3.0000      2.0000
    784.0000      1.0000    112.0000      3.0000      2.0000
    487.5000      1.0000     75.0000      1.0000      2.0000
    780.0000      1.0000    120.0000      2.0000      2.0000
    780.0000      1.0000    120.0000      2.0000      2.0000
    780.0000      1.0000    120.0000      3.0000      2.0000
    995.5000      1.0000    147.0000      3.0000      2.0000
   1072.5000      1.0000    165.0000      3.0000      2.0000
   1072.5000      1.0000    165.0000      3.0000      2.0000

>> X=A(:,2:3)
X =

     1.0000   135.0000
     1.0000   133.0000
     1.0000   134.0000
     1.0000    64.0000
     1.0000    45.0000
     1.0000    76.0000
     1.0000    64.8000
     1.0000   118.9000
     1.0000    39.0000
     1.0000   145.0000
     1.0000   236.0000
     1.0000    77.0000
     1.0000    97.0000
     1.0000   108.0000
     1.0000   112.0000
     1.0000    75.0000
     1.0000   120.0000
     1.0000   120.0000
     1.0000   120.0000
     1.0000   147.0000
     1.0000   165.0000
     1.0000   165.0000

>> y=A(:,1)
y =

    825.00
    997.50
   1005.00
    384.00
    270.00
    459.60
    388.80
    713.40
    218.40
   1145.50
   1864.40
    539.00
    679.00
    756.00
    784.00
    487.50
    780.00
    780.00
    780.00
    995.50
   1072.50
   1072.50

>> a=0.0001
a =   1.0000e-004
>> iters=1
iters =  1
>> [z,J]=descentFunction(z,X,y,a,iters)
z =

   0.0091700
   7.4036164

J =  4837.5
>>

Recap:

1)  Turn the requirement into mathematical formulas: a prediction function and an error function.

2)  Solve the requirement through those formulas; plug in all the variable values and compute.

3)  Plugging in every possible value means far too many computations and slow speed.

4)  Gradient descent reduces the number of computations by skipping many values (they get denser toward the bottom); a is the gradient descent rate (the machine learning rate), e.g. a=0.0001.

5)  The Octave tool handles the computation: vector and matrix operations are easy to display and fast to compute.

VIII. Implementing the Case in Spark

1. Spark wraps all of this into a function library

The linear regression function: LinearRegressionWithSGD

For machine learning with Spark, the key is knowing which library function the requirement ultimately maps to.

The data is massive; Spark cannot read it all into memory at once, so it reads one record at a time.


1)  Each record is wrapped into an object, a LabeledPoint, built on top of RDDs.

2)  Vector: the features x0, x1 are called the feature points.

3)  y is the label.

4)  In machine learning, every value is of type Double.

5)  Feed in the sample data and pick the matching function from the library (the linear regression algorithm) to train with.

6)  The result of training is called the model.

7)  With the model, feed in a feature value and the prediction comes straight back: x1=200, result as sketched below.
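
A minimal sketch of that last step, assuming model is the LinearRegressionModel trained in the session below (predict takes a features Vector; the leading 1.0 is the constant x0 column used throughout this example):

// sketch: predict the price of a 200 m² apartment
val predictedPrice = model.predict(Vectors.dense(1.0, 200.0))
println(predictedPrice)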

2. Dense and sparse vectors

The vector (1.0,0.0,0.0,0.0,0.0,0.0,3.0)

in dense format is [1.0,0.0,0.0,0.0,0.0,0.0,3.0],

in sparse format is (7,[0,6],[1.0,3.0]).

The 7 is the number of elements; [0,6] are the positions (the 1st and the 7th); [1.0,3.0] means the 1st position holds 1.0 and the 7th holds 3.0.

For storage, a sparse vector takes up less space than a dense one.

The difference is in how zero-valued elements are handled; real-world data contains a great many zeros.
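
In MLlib this example looks as follows (both constructors also appear in the session below; sparse takes the length, the index array, and the value array):

import org.apache.spark.mllib.linalg.Vectors

val dense  = Vectors.dense(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0)
val sparse = Vectors.sparse(7, Array(0, 6), Array(1.0, 3.0))   // length 7, non-zeros at indices 0 and 6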


3. Prepare the data

4. Use Spark to estimate the price of an apartment of a given size

scala> import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionModel

scala> import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LabeledPoint

scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors

scala> import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

scala> Vectors.dense(1,2,3)
res5: org.apache.spark.mllib.linalg.Vector = [1.0,2.0,3.0]

scala> Vectors.sparse(5,Array(0,1),Array(2,3))
res6: org.apache.spark.mllib.linalg.Vector = (5,[0,1],[2.0,3.0])

scala> 

scala> val v = Vectors.dense(1,2,3);
v: org.apache.spark.mllib.linalg.Vector = [1.0,2.0,3.0]

scala> val y = 4
y: Int = 4

scala> val l = LabeledPoint(y, v)
l: org.apache.spark.mllib.regression.LabeledPoint = (4.0,[1.0,2.0,3.0])

scala> 

scala> val rdd = sc.textFile("hdfs:///spark/prices.txt")
18/05/06 05:32:33 INFO storage.MemoryStore: ensureFreeSpace(216696) called with curMem=83219, maxMem=555755765
18/05/06 05:32:33 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 211.6 KB, free 529.7 MB)
18/05/06 05:32:33 INFO storage.MemoryStore: ensureFreeSpace(19939) called with curMem=299915, maxMem=555755765
18/05/06 05:32:33 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 19.5 KB, free 529.7 MB)
18/05/06 05:32:33 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.239.129:57789 (size: 19.5 KB, free: 530.0 MB)
18/05/06 05:32:33 INFO spark.SparkContext: Created broadcast 5 from textFile at <console>:27
rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at textFile at <console>:27

scala> rdd.first
18/05/06 05:32:36 INFO mapred.FileInputFormat: Total input paths to process : 1
18/05/06 05:32:39 INFO spark.SparkContext: Starting job: first at <console>:30
18/05/06 05:32:39 INFO scheduler.DAGScheduler: Got job 2 (first at <console>:30) with 1 output partitions
18/05/06 05:32:39 INFO scheduler.DAGScheduler: Final stage: ResultStage 4(first at <console>:30)
18/05/06 05:32:39 INFO scheduler.DAGScheduler: Parents of final stage: List()
18/05/06 05:32:39 INFO scheduler.DAGScheduler: Missing parents: List()
18/05/06 05:32:39 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[11] at textFile at <console>:27), which has no missing parents
18/05/06 05:32:39 INFO storage.MemoryStore: ensureFreeSpace(3152) called with curMem=319854, maxMem=555755765
18/05/06 05:32:39 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 3.1 KB, free 529.7 MB)
18/05/06 05:32:39 INFO storage.MemoryStore: ensureFreeSpace(1827) called with curMem=323006, maxMem=555755765
18/05/06 05:32:39 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 1827.0 B, free 529.7 MB)
18/05/06 05:32:39 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on 192.168.239.129:57789 (size: 1827.0 B, free: 530.0 MB)
18/05/06 05:32:39 INFO spark.SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:861
18/05/06 05:32:39 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[11] at textFile at <console>:27)
18/05/06 05:32:39 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
18/05/06 05:32:39 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 24, 192.168.239.131, ANY, 2146 bytes)
18/05/06 05:32:46 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on 192.168.239.131:43057 (size: 1827.0 B, free: 534.5 MB)
18/05/06 05:32:48 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.239.131:43057 (size: 19.5 KB, free: 534.5 MB)
18/05/06 05:33:01 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 24) in 21657 ms on 192.168.239.131 (1/1)
18/05/06 05:33:01 INFO scheduler.DAGScheduler: ResultStage 4 (first at <console>:30) finished in 21.703 s
18/05/06 05:33:01 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool 
18/05/06 05:33:01 INFO scheduler.DAGScheduler: Job 2 finished: first at <console>:30, took 22.290016 s
res8: String = 825.0	1	135.00	3	2

scala> val rdd2 = rdd.map{ x=>x.split("\\t") }
rdd2: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[12] at map at <console>:29

scala> rdd2.collect
18/05/06 05:35:56 INFO spark.SparkContext: Starting job: collect at <console>:32
18/05/06 05:35:56 INFO scheduler.DAGScheduler: Got job 3 (collect at <console>:32) with 2 output partitions
18/05/06 05:35:56 INFO scheduler.DAGScheduler: Final stage: ResultStage 5(collect at <console>:32)
18/05/06 05:35:56 INFO scheduler.DAGScheduler: Parents of final stage: List()
18/05/06 05:35:56 INFO scheduler.DAGScheduler: Missing parents: List()
18/05/06 05:35:56 INFO scheduler.DAGScheduler: Submitting ResultStage 5 (MapPartitionsRDD[12] at map at <console>:29), which has no missing parents
18/05/06 05:35:56 INFO storage.MemoryStore: ensureFreeSpace(3312) called with curMem=324833, maxMem=555755765
18/05/06 05:35:56 INFO storage.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 3.2 KB, free 529.7 MB)
18/05/06 05:35:56 INFO storage.MemoryStore: ensureFreeSpace(1889) called with curMem=328145, maxMem=555755765
18/05/06 05:35:56 INFO storage.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 1889.0 B, free 529.7 MB)
18/05/06 05:35:56 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on 192.168.239.129:57789 (size: 1889.0 B, free: 530.0 MB)
18/05/06 05:35:56 INFO spark.SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:861
18/05/06 05:35:56 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 5 (MapPartitionsRDD[12] at map at <console>:29)
18/05/06 05:35:56 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 2 tasks
18/05/06 05:35:56 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 5.0 (TID 25, 192.168.239.144, ANY, 2146 bytes)
18/05/06 05:35:56 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 5.0 (TID 26, 192.168.239.130, ANY, 2146 bytes)
18/05/06 05:36:07 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on 192.168.239.130:53716 (size: 1889.0 B, free: 534.5 MB)
18/05/06 05:36:11 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on 192.168.239.144:43403 (size: 1889.0 B, free: 534.5 MB)
18/05/06 05:36:13 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.239.130:53716 (size: 19.5 KB, free: 534.5 MB)
18/05/06 05:36:19 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.239.144:43403 (size: 19.5 KB, free: 534.5 MB)
18/05/06 05:36:40 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 5.0 (TID 26) in 43728 ms on 192.168.239.130 (1/2)
18/05/06 05:36:43 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 5.0 (TID 25) in 46563 ms on 192.168.239.144 (2/2)
18/05/06 05:36:43 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool 
18/05/06 05:36:43 INFO scheduler.DAGScheduler: ResultStage 5 (collect at <console>:32) finished in 46.583 s
18/05/06 05:36:43 INFO scheduler.DAGScheduler: Job 3 finished: collect at <console>:32, took 46.599154 s
res9: Array[Array[String]] = Array(Array(825.0, 1, 135.00, 3, 2), Array(997.5, 1, 133.00, 3, 2), Array(1005.0, 1, 134.00, 3, 2), Array(384.00, 1, 64.00, 3, 2), Array(270.00, 1, 45.00, 2, 1), Array(459.6, 1, 76.00, 1, 1), Array(388.8, 1, 64.8, 1, 1), Array(713.4, 1, 118.9, 1, 1), Array(218.4, 1, 39.0, 1, 1), Array(1145.5, 1, 145.0, 3, 2), Array(1864.4, 1, 236.00, 4, 2), Array(539.0, 1, 77.00, 2, 2), Array(679.0, 1, 97.00, 3, 2), Array(756.00, 1, 108.00, 3, 2), Array(784.00, 1, 112.00, 3, 2), Array(487.5, 1, 75.0, 1, 2), Array(780.0, 1, 120.00, 2, 2), Array(780.0, 1, 120.00, 2, 2), Array(780.0, 1, 120.00, 3, 2), Array(995.5, 1, 147.0, 3, 2), Array(1072.5, 1, 165.00, 3, 2), Array(1072.5, 1, 165.00, 3, 2))

scala> val rdd3 = rdd2.map{ x=>LabeledPoint(x(0).toDouble, Vectors.dense(x(1).toDouble, x(2).toDouble)) }
rdd3: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[13] at map at <console>:31

scala> val model = LinearRegressionWithSGD.train(rdd3, 1000, 0.0001)
18/05/06 05:39:15 INFO spark.SparkContext: Starting job: first at GeneralizedLinearAlgorithm.scala:206
18/05/06 05:39:15 INFO scheduler.DAGScheduler: Got job 4 (first at GeneralizedLinearAlgorithm.scala:206) with 1 output partitions
18/05/06 05:39:15 INFO scheduler.DAGScheduler: Final stage: ResultStage 6(first at GeneralizedLinearAlgorithm.scala:206)
18/05/06 05:39:15 INFO scheduler.DAGScheduler: Parents of final stage: List()
18/05/06 05:39:15 INFO scheduler.DAGScheduler: Missing parents: List()
18/05/06 05:39:15 INFO scheduler.DAGScheduler: Submitting ResultStage 6 (MapPartitionsRDD[14] at map at GeneralizedLinearAlgorithm.scala:206), which has no missing parents
18/05/06 05:39:15 INFO storage.MemoryStore: ensureFreeSpace(3872) called with curMem=330034, maxMem=555755765
18/05/06 05:39:15 INFO storage.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 3.8 KB, free 529.7 MB)
18/05/06 05:39:15 INFO storage.MemoryStore: ensureFreeSpace(2191) called with curMem=333906, maxMem=555755765
18/05/06 05:39:15 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 2.1 KB, free 529.7 MB)
18/05/06 05:39:15 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.239.129:57789 (size: 2.1 KB, free: 530.0 MB)
18/05/06 05:39:15 INFO spark.SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:861
18/05/06 05:39:15 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 6 (MapPartitionsRDD[14] at map at GeneralizedLinearAlgorithm.scala:206)
18/05/06 05:39:15 INFO scheduler.TaskSchedulerImpl: Adding task set 6.0 with 1 tasks
18/05/06 05:39:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 6.0 (TID 27, 192.168.239.144, ANY, 2146 bytes)
18/05/06 05:39:15 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.239.144:43403 (size: 2.1 KB, free: 534.5 MB)
18/05/06 05:39:16 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 6.0 (TID 27) in 1137 ms on 192.168.239.144 (1/1)
18/05/06 05:39:16 INFO scheduler.DAGScheduler: ResultStage 6 (first at GeneralizedLinearAlgorithm.scala:206) finished in 1.145 s
18/05/06 05:39:16 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool 
18/05/06 05:39:16 INFO scheduler.DAGScheduler: Job 4 finished: first at GeneralizedLinearAlgorithm.scala:206, took 1.172036 s
18/05/06 05:39:16 WARN regression.LinearRegressionWithSGD: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.
18/05/06 05:39:16 INFO spark.SparkContext: Starting job: count at GradientDescent.scala:195
18/05/06 05:39:16 INFO scheduler.DAGScheduler: Got job 5 (count at GradientDescent.scala:195) with 2 output partitions
18/05/06 05:39:16 INFO scheduler.DAGScheduler: Final stage: ResultStage 7(count at GradientDescent.scala:195)
18/05/06 05:39:16 INFO scheduler.DAGScheduler: Parents of final stage: List()
18/05/06 05:39:16 INFO scheduler.DAGScheduler: Missing parents: List()
18/05/06 05:39:16 INFO scheduler.DAGScheduler: Submitting ResultStage 7 (MapPartitionsRDD[15] at map at GeneralizedLinearAlgorithm.scala:292), which has no missing parents
18/05/06 05:39:16 INFO storage.MemoryStore: ensureFreeSpace(3584) called with curMem=336097, maxMem=555755765
18/05/06 05:39:16 INFO storage.MemoryStore: Block broadcast_9 stored as values in memory (estimated size 3.5 KB, free 529.7 MB)
18/05/06 05:39:17 INFO storage.MemoryStore: ensureFreeSpace(2030) called with curMem=339681, maxMem=555755765
18/05/06 05:39:17 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 2030.0 B, free 529.7 MB)
18/05/06 05:39:17 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on 192.168.239.129:57789 (size: 2030.0 B, free: 530.0 MB)
18/05/06 05:39:17 INFO spark.SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:861
18/05/06 05:39:17 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 7 (MapPartitionsRDD[15] at map at GeneralizedLinearAlgorithm.scala:292)
18/05/06 05:39:17 INFO scheduler.TaskSchedulerImpl: Adding task set 7.0 with 2 tasks
18/05/06 05:39:17 INFO spark.ContextCleaner: Cleaned accumulator 5
18/05/06 05:39:17 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 7.0 (TID 28, 192.168.239.144, ANY, 2146 bytes)
18/05/06 05:39:17 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 7.0 (TID 29, 192.168.239.130, ANY, 2146 bytes)
18/05/06 05:39:18 INFO storage.BlockManagerInfo: Removed broadcast_8_piece0 on 192.168.239.129:57789 in memory (size: 2.1 KB, free: 530.0 MB)
18/05/06 05:39:18 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on 192.168.239.144:43403 (size: 2030.0 B, free: 534.5 MB)
18/05/06 05:39:19 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on 192.168.239.130:53716 (size: 2030.0 B, free: 534.5 MB)
18/05/06 05:39:19 INFO storage.BlockManagerInfo: Removed broadcast_8_piece0 on 192.168.239.144:43403 in memory (size: 2.1 KB, free: 534.5 MB)
18/05/06 05:39:19 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 7.0 (TID 28) in 1412 ms on 192.168.239.144 (1/2)
18/05/06 05:39:19 INFO spark.ContextCleaner: Cleaned accumulator 7
18/05/06 05:39:19 INFO storage.BlockManagerInfo: Removed broadcast_7_piece0 on 192.168.239.129:57789 in memory (size: 1889.0 B, free: 530.0 MB)
18/05/06 05:39:19 INFO storage.BlockManagerInfo: Removed broadcast_7_piece0 on 192.168.239.144:43403 in memory (size: 1889.0 B, free: 534.5 MB)
18/05/06 05:39:19 INFO storage.BlockManagerInfo: Removed broadcast_7_piece0 on 192.168.239.130:53716 in memory (size: 1889.0 B, free: 534.5 MB)
18/05/06 05:39:19 INFO spark.ContextCleaner: Cleaned accumulator 6
18/05/06 05:39:19 INFO storage.BlockManagerInfo: Removed broadcast_6_piece0 on 192.168.239.129:57789 in memory (size: 1827.0 B, free: 530.0 MB)
18/05/06 05:39:19 INFO storage.BlockManagerInfo: Removed broadcast_6_piece0 on 192.168.239.131:43057 in memory (size: 1827.0 B, free: 534.5 MB)
18/05/06 05:39:21 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 7.0 (TID 29) in 3087 ms on 192.168.239.130 (2/2)
18/05/06 05:39:21 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, from pool 
18/05/06 05:39:21 INFO scheduler.DAGScheduler: ResultStage 7 (count at GradientDescent.scala:195) finished in 3.097 s
18/05/06 05:39:21 INFO scheduler.DAGScheduler: Job 5 finished: count at GradientDescent.scala:195, took 4.086191 s
18/05/06 05:39:26 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
18/05/06 05:39:26 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
18/05/06 05:39:26 INFO storage.MemoryStore: ensureFreeSpace(72) called with curMem=325468, maxMem=555755765
18/05/06 05:39:26 INFO storage.MemoryStore: Block broadcast_10 stored as values in memory (estimated size 72.0 B, free 529.7 MB)
18/05/06 05:39:26 INFO storage.MemoryStore: ensureFreeSpace(138) called with curMem=325540, maxMem=555755765
18/05/06 05:39:26 INFO storage.MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 138.0 B, free 529.7 MB)
18/05/06 05:39:26 INFO storage.BlockManagerInfo: Added broadcast_10_piece0 in memory on 192.168.239.129:57789 (size: 138.0 B, free: 530.0 MB)
18/05/06 05:39:26 INFO spark.SparkContext: Created broadcast 10 from broadcast at GradientDescent.scala:221
18/05/06 05:39:27 INFO spark.SparkContext: Starting job: treeAggregate at GradientDescent.scala:225
18/05/06 05:39:27 INFO scheduler.DAGScheduler: Got job 6 (treeAggregate at GradientDescent.scala:225) with 2 output partitions
18/05/06 05:39:27 INFO scheduler.DAGScheduler: Final stage: ResultStage 8(treeAggregate at GradientDescent.scala:225)
18/05/06 05:39:27 INFO scheduler.DAGScheduler: Parents of final stage: List()
18/05/06 05:39:27 INFO scheduler.DAGScheduler: Missing parents: List()
18/05/06 05:39:27 INFO scheduler.DAGScheduler: Submitting ResultStage 8 (MapPartitionsRDD[17] at treeAggregate at GradientDescent.scala:225), which has no missing parents
18/05/06 05:39:27 INFO storage.MemoryStore: ensureFreeSpace(5656) called with curMem=325678, maxMem=555755765
18/05/06 05:39:27 INFO storage.MemoryStore: Block broadcast_11 stored as values in memory (estimated size 5.5 KB, free 529.7 MB)
18/05/06 05:39:27 INFO storage.MemoryStore: ensureFreeSpace(2995) called with curMem=331334, maxMem=555755765
18/05/06 05:39:27 INFO storage.MemoryStore: Block broadcast_11_piece0 stored as bytes in memory (estimated size 2.9 KB, free 529.7 MB)
18/05/06 05:39:27 INFO storage.BlockManagerInfo: Added broadcast_11_piece0 in memory on 192.168.239.129:57789 (size: 2.9 KB, free: 530.0 MB)
18/05/06 05:39:27 INFO spark.SparkContext: Created broadcast 11 from broadcast at DAGScheduler.scala:861
18/05/06 05:39:27 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 8 (MapPartitionsRDD[17] at treeAggregate at GradientDescent.scala:225)
18/05/06 05:39:27 INFO scheduler.TaskSchedulerImpl: Adding task set 8.0 with 2 tasks
18/05/06 05:39:27 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 8.0 (TID 30, 192.168.239.131, ANY, 2255 bytes)
18/05/06 05:39:27 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 8.0 (TID 31, 192.168.239.144, ANY, 2255 bytes)
18/05/06 05:39:27 INFO storage.BlockManagerInfo: Added broadcast_11_piece0 in memory on 192.168.239.144:43403 (size: 2.9 KB, free: 534.5 MB)
18/05/06 05:39:28 INFO storage.BlockManagerInfo: Added broadcast_11_piece0 in memory on 192.168.239.131:43057 (size: 2.9 KB, free: 534.5 MB)
18/05/06 05:39:44 INFO storage.BlockManagerInfo: Added broadcast_10_piece0 in memory on 192.168.239.144:43403 (size: 138.0 B, free: 534.5 MB)
18/05/06 05:39:45 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 8.0 (TID 31) in 17678 ms on 192.168.239.144 (1/2)
18/05/06 05:39:47 INFO storage.BlockManagerInfo: Added broadcast_10_piece0 in memory on 192.168.239.131:43057 (size: 138.0 B, free: 534.5 MB)
18/05/06 05:39:48 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 8.0 (TID 30) in 21035 ms on 192.168.239.131 (2/2)
18/05/06 05:39:48 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool 
18/05/06 05:39:48 INFO scheduler.DAGScheduler: ResultStage 8 (treeAggregate at GradientDescent.scala:225) finished in 21.034 s
18/05/06 05:39:48 INFO scheduler.DAGScheduler: Job 6 finished: treeAggregate at GradientDescent.scala:225, took 21.088037 s
18/05/06 05:39:48 INFO storage.MemoryStore: ensureFreeSpace(72) called with curMem=334329, maxMem=555755765
18/05/06 05:39:48 INFO storage.MemoryStore: Block broadcast_12 stored as values in memory (estimated size 72.0 B, free 529.7 MB)
18/05/06 05:39:48 INFO storage.MemoryStore: ensureFreeSpace(138) called with curMem=334401, maxMem=555755765
18/05/06 05:39:48 INFO storage.MemoryStore: Block broadcast_12_piece0 stored as bytes in memory (estimated size 138.0 B, free 529.7 MB)
18/05/06 05:39:48 INFO storage.BlockManagerInfo: Added broadcast_12_piece0 in memory on 192.168.239.129:57789 (size: 138.0 B, free: 530.0 MB)
18/05/06 05:39:48 INFO spark.SparkContext: Created broadcast 12 from broadcast at GradientDescent.scala:221
18/05/06 05:39:48 INFO spark.SparkContext: Starting job: treeAggregate at GradientDescent.scala:225
18/05/06 05:39:48 INFO scheduler.DAGScheduler: Got job 7 (treeAggregate at GradientDescent.scala:225) with 2 output partitions
18/05/06 05:39:48 INFO scheduler.DAGScheduler: Final stage: ResultStage 9(treeAggregate at GradientDescent.scala:225)
18/05/06 05:39:48 INFO scheduler.DAGScheduler: Parents of final stage: List()
18/05/06 05:39:48 INFO scheduler.DAGScheduler: Missing parents: List()
18/05/06 05:39:48 INFO scheduler.DAGScheduler: Submitting ResultStage 9 (MapPartitionsRDD[19] at treeAggregate at GradientDescent.scala:225), which has no missing parents
18/05/06 05:39:48 INFO storage.MemoryStore: ensureFreeSpace(5656) called with curMem=334539, maxMem=555755765
18/05/06 05:39:48 INFO storage.MemoryStore: Block broadcast_13 stored as values in memory (estimated size 5.5 KB, free 529.7 MB)
18/05/06 05:39:48 INFO storage.MemoryStore: ensureFreeSpace(2996) called with curMem=340195, maxMem=555755765
18/05/06 05:39:48 INFO storage.MemoryStore: Block broadcast_13_piece0 stored as bytes in memory (estimated size 2.9 KB, free 529.7 MB)
18/05/06 05:39:48 INFO storage.BlockManagerInfo: Added broadcast_13_piece0 in memory on 192.168.239.129:57789 (size: 2.9 KB, free: 530.0 MB)
18/05/06 05:39:48 INFO spark.SparkContext: Created broadcast 13 from broadcast at DAGScheduler.scala:861
18/05/06 05:39:48 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 9 (MapPartitionsRDD[19] at treeAggregate at GradientDescent.scala:225)
18/05/06 05:39:48 INFO scheduler.TaskSchedulerImpl: Adding task set 9.0 with 2 tasks
18/05/06 05:39:48 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 9.0 (TID 32, 192.168.239.130, ANY, 2255 bytes)
18/05/06 05:39:48 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 9.0 (TID 33, 192.168.239.144, ANY, 2255 bytes)
18/05/06 05:39:49 INFO storage.BlockManagerInfo: Added broadcast_13_piece0 in memory on 192.168.239.130:53716 (size: 2.9 KB, free: 534.5 MB)
18/05/06 05:39:49 INFO storage.BlockManagerInfo: Added broadcast_13_piece0 in memory on 192.168.239.144:43403 (size: 2.9 KB, free: 534.5 MB)
18/05/06 05:39:49 INFO storage.BlockManagerInfo: Added broadcast_12_piece0 in memory on 192.168.239.144:43403 (size: 138.0 B, free: 534.5 MB)
18/05/06 05:39:49 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 9.0 (TID 33) in 1028 ms on 192.168.239.144 (1/2)
18/05/06 05:39:57 INFO storage.BlockManagerInfo: Added broadcast_12_piece0 in memory on 192.168.239.130:53716 (size: 138.0 B, free: 534.5 MB)
18/05/06 05:39:57 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 9.0 (TID 32) in 9286 ms on 192.168.239.130 (2/2)
18/05/06 05:39:57 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have all completed, from pool 
18/05/06 05:39:57 INFO scheduler.DAGScheduler: ResultStage 9 (treeAggregate at GradientDescent.scala:225) finished in 9.342 s
18/05/06 05:39:57 INFO scheduler.DAGScheduler: Job 7 finished: treeAggregate at GradientDescent.scala:225, took 9.356678 s
18/05/06 05:39:58 INFO storage.MemoryStore: ensureFreeSpace(72) called with curMem=343191, maxMem=555755765
18/05/06 05:39:58 INFO storage.MemoryStore: Block broadcast_14 stored as values in memory (estimated size 72.0 B, free 529.7 MB)
18/05/06 05:39:58 INFO storage.MemoryStore: ensureFreeSpace(138) called with curMem=343263, maxMem=555755765
18/05/06 05:39:58 INFO storage.MemoryStore: Block broadcast_14_piece0 stored as bytes in memory (estimated size 138.0 B, free 529.7 MB)
18/05/06 05:39:58 INFO storage.BlockManagerInfo: Added broadcast_14_piece0 in memory on 192.168.239.129:57789 (size: 138.0 B, free: 530.0 MB)
18/05/06 05:39:58 INFO spark.SparkContext: Created broadcast 14 from broadcast at GradientDescent.scala:221
18/05/06 05:39:58 INFO spark.SparkContext: Starting job: treeAggregate at GradientDescent.scala:225
18/05/06 05:39:58 INFO scheduler.DAGScheduler: Got job 8 (treeAggregate at GradientDescent.scala:225) with 2 output partitions
18/05/06 05:39:58 INFO scheduler.DAGScheduler: Final stage: ResultStage 10(treeAggregate at GradientDescent.scala:225)
18/05/06 05:39:58 INFO scheduler.DAGScheduler: Parents of final stage: List()
18/05/06 05:39:58 INFO scheduler.DAGScheduler: Missing parents: List()
18/05/06 05:39:58 INFO scheduler.DAGScheduler: Submitting ResultStage 10 (MapPartitionsRDD[21] at treeAggregate at GradientDescent.scala:225), which has no missing parents
18/05/06 05:39:58 INFO storage.MemoryStore: ensureFreeSpace(5656) called with curMem=343401, maxMem=555755765
18/05/06 05:39:58 INFO storage.MemoryStore: Block broadcast_15 stored as values in memory (estimated size 5.5 KB, free 529.7 MB)
18/05/06 05:39:58 INFO storage.MemoryStore: ensureFreeSpace(2995) called with curMem=349057, maxMem=555755765
18/05/06 05:39:58 INFO storage.MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 2.9 KB, free 529.7 MB)
18/05/06 05:39:58 INFO storage.BlockManagerInfo: Added broadcast_15_piece0 in memory on 192.168.239.129:57789 (size: 2.9 KB, free: 530.0 MB)
18/05/06 05:39:58 INFO spark.SparkContext: Created broadcast 15 from broadcast at DAGScheduler.scala:861
18/05/06 05:39:58 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 10 (MapPartitionsRDD[21] at treeAggregate at GradientDescent.scala:225)
18/05/06 05:39:58 INFO scheduler.TaskSchedulerImpl: Adding task set 10.0 with 2 tasks
18/05/06 05:39:58 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 10.0 (TID 34, 192.168.239.130, ANY, 2255 bytes)
18/05/06 05:39:58 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 10.0 (TID 35, 192.168.239.144, ANY, 2255 bytes)
18/05/06 05:39:58 INFO storage.BlockManagerInfo: Added broadcast_15_piece0 in memory on 192.168.239.130:53716 (size: 2.9 KB, free: 534.5 MB)
18/05/06 05:39:58 INFO storage.BlockManagerInfo: Added broadcast_15_piece0 in memory on 192.168.239.144:43403 (size: 2.9 KB, free: 534.5 MB)
18/05/06 05:39:58 INFO storage.BlockManagerInfo: Added broadcast_14_piece0 in memory on 192.168.239.130:53716 (size: 138.0 B, free: 534.5 MB)
18/05/06 05:39:58 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 10.0 (TID 34) in 383 ms on 192.168.239.130 (1/2)
18/05/06 05:39:58 INFO storage.BlockManagerInfo: Added broadcast_14_piece0 in memory on 192.168.239.144:43403 (size: 138.0 B, free: 534.5 MB)
18/05/06 05:39:58 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 10.0 (TID 35) in 545 ms on 192.168.239.144 (2/2)
18/05/06 05:39:58 INFO scheduler.DAGScheduler: ResultStage 10 (treeAggregate at GradientDescent.scala:225) finished in 0.546 s
18/05/06 05:39:58 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 10.0, whose tasks have all completed, from pool 
18/05/06 05:39:58 INFO scheduler.DAGScheduler: Job 8 finished: treeAggregate at GradientDescent.scala:225, took 0.556249 s
18/05/06 05:39:58 INFO storage.MemoryStore: ensureFreeSpace(72) called with curMem=352052, maxMem=555755765
18/05/06 05:39:58 INFO storage.MemoryStore: Block broadcast_16 stored as values in memory (estimated size 72.0 B, free 529.7 MB)
18/05/06 05:39:58 INFO storage.MemoryStore: ensureFreeSpace(138) called with curMem=352124, maxMem=555755765
18/05/06 05:39:58 INFO storage.MemoryStore: Block broadcast_16_piece0 stored as bytes in memory (estimated size 138.0 B, free 529.7 MB)
18/05/06 05:39:58 INFO storage.BlockManagerInfo: Added broadcast_16_piece0 in memory on 192.168.239.129:57789 (size: 138.0 B, free: 530.0 MB)
18/05/06 05:39:58 INFO spark.SparkContext: Created broadcast 16 from broadcast at GradientDescent.scala:221
18/05/06 05:39:58 INFO spark.SparkContext: Starting job: treeAggregate at GradientDescent.scala:225
18/05/06 05:39:58 INFO scheduler.DAGScheduler: Got job 9 (treeAggregate at GradientDescent.scala:225) with 2 output partitions
18/05/06 05:39:58 INFO scheduler.DAGScheduler: Final stage: ResultStage 11(treeAggregate at GradientDescent.scala:225)
18/05/06 05:39:58 INFO scheduler.DAGScheduler: Parents of final stage: List()
18/05/06 05:39:58 INFO scheduler.DAGScheduler: Missing parents: List()
18/05/06 05:39:58 INFO scheduler.DAGScheduler: Submitting ResultStage 11 (MapPartitionsRDD[23] at treeAggregate at GradientDescent.scala:225), which has no missing parents
18/05/06 05:39:58 INFO storage.MemoryStore: ensureFreeSpace(5656) called with curMem=352262, maxMem=555755765
18/05/06 05:39:58 INFO storage.MemoryStore: Block broadcast_17 stored as values in memory (estimated size 5.5 KB, free 529.7 MB)
18/05/06 05:39:58 INFO storage.MemoryStore: ensureFreeSpace(2997) called with curMem=357918, maxMem=555755765
18/05/06 05:39:58 INFO storage.MemoryStore: Block broadcast_17_piece0 stored as bytes in memory (estimated size 2.9 KB, free 529.7 MB)
18/05/06 05:39:58 INFO storage.BlockManagerInfo: Added broadcast_17_piece0 in memory on 192.168.239.129:57789 (size: 2.9 KB, free: 530.0 MB)
18/05/06 05:39:58 INFO spark.SparkContext: Created broadcast 17 from broadcast at DAGScheduler.scala:861
18/05/06 05:39:58 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 11 (MapPartitionsRDD[23] at treeAggregate at GradientDescent.scala:225)
18/05/06 05:39:58 INFO scheduler.TaskSchedulerImpl: Adding task set 11.0 with 2 tasks
18/05/06 05:39:58 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 11.0 (TID 36, 192.168.239.144, ANY, 2255 bytes)
18/05/06 05:39:58 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 11.0 (TID 37, 192.168.239.131, ANY, 2255 bytes)
18/05/06 05:39:58 INFO storage.BlockManagerInfo: Added broadcast_17_piece0 in memory on 192.168.239.144:43403 (size: 2.9 KB, free: 534.5 MB)
18/05/06 05:39:58 INFO storage.BlockManagerInfo: Added broadcast_17_piece0 in memory on 192.168.239.131:43057 (size: 2.9 KB, free: 534.5 MB)
18/05/06 05:39:58 INFO storage.BlockManagerInfo: Added broadcast_16_piece0 in memory on 192.168.239.144:43403 (size: 138.0 B, free: 534.5 MB)
18/05/06 05:39:58 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 11.0 (TID 36) in 275 ms on 192.168.239.144 (1/2)
18/05/06 05:39:59 INFO storage.BlockManagerInfo: Added broadcast_16_piece0 in memory on 192.168.239.131:43057 (size: 138.0 B, free: 534.5 MB)
18/05/06 05:39:59 INFO scheduler.DAGScheduler: ResultStage 11 (treeAggregate at GradientDescent.scala:225) finished in 0.375 s
18/05/06 05:39:59 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 11.0 (TID 37) in 375 ms on 192.168.239.131 (2/2)
18/05/06 05:39:59 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 11.0, whose tasks have all completed, from pool 
18/05/06 05:39:59 INFO scheduler.DAGScheduler: Job 9 finished: treeAggregate at GradientDescent.scala:225, took 0.406524 s
18/05/06 05:39:59 INFO storage.MemoryStore: ensureFreeSpace(72) called with curMem=360915, maxMem=555755765
18/05/06 05:39:59 INFO storage.MemoryStore: Block broadcast_18 stored as values in memory (estimated size 72.0 B, free 529.7 MB)
18/05/06 05:39:59 INFO storage.MemoryStore: ensureFreeSpace(138) called with curMem=360987, maxMem=555755765
18/05/06 05:39:59 INFO storage.MemoryStore: Block broadcast_18_piece0 stored as bytes in memory (estimated size 138.0 B, free 529.7 MB)
18/05/06 05:39:59 INFO storage.BlockManagerInfo: Added broadcast_18_piece0 in memory on 192.168.239.129:57789 (size: 138.0 B, free: 530.0 MB)
18/05/06 05:39:59 INFO spark.SparkContext: Created broadcast 18 from broadcast at GradientDescent.scala:221
18/05/06 05:39:59 INFO spark.SparkContext: Starting job: treeAggregate at GradientDescent.scala:225
18/05/06 05:39:59 INFO scheduler.DAGScheduler: Got job 10 (treeAggregate at GradientDescent.scala:225) with 2 output partitions
18/05/06 05:39:59 INFO scheduler.DAGScheduler: Final stage: ResultStage 12(treeAggregate at GradientDescent.scala:225)
18/05/06 05:39:59 INFO scheduler.DAGScheduler: Parents of final stage: List()
18/05/06 05:39:59 INFO scheduler.DAGScheduler: Missing parents: List()
18/05/06 05:39:59 INFO scheduler.DAGScheduler: Submitting ResultStage 12 (MapPartitionsRDD[25] at treeAggregate at GradientDescent.scala:225), which has no missing parents
18/05/06 05:39:59 INFO storage.MemoryStore: ensureFreeSpace(5656) called with curMem=361125, maxMem=555755765
18/05/06 05:39:59 INFO storage.MemoryStore: Block broadcast_19 stored as values in memory (estimated size 5.5 KB, free 529.7 MB)
18/05/06 05:39:59 INFO storage.MemoryStore: ensureFreeSpace(2996) called with curMem=366781, maxMem=555755765
18/05/06 05:39:59 INFO storage.MemoryStore: Block broadcast_19_piece0 stored as bytes in memory (estimated size 2.9 KB, free 529.7 MB)
18/05/06 05:39:59 INFO storage.BlockManagerInfo: Added broadcast_19_piece0 in memory on 192.168.239.129:57789 (size: 2.9 KB, free: 530.0 MB)
18/05/06 05:39:59 INFO spark.SparkContext: Created broadcast 19 from broadcast at DAGScheduler.scala:861
18/05/06 05:39:59 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 12 (MapPartitionsRDD[25] at treeAggregate at GradientDescent.scala:225)
18/05/06 05:39:59 INFO scheduler.TaskSchedulerImpl: Adding task set 12.0 with 2 tasks
18/05/06 05:39:59 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 12.0 (TID 38, 192.168.239.131, ANY, 2255 bytes)
18/05/06 05:39:59 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 12.0 (TID 39, 192.168.239.130, ANY, 2255 bytes)
18/05/06 05:39:59 INFO storage.BlockManagerInfo: Added broadcast_19_piece0 in memory on 192.168.239.131:43057 (size: 2.9 KB, free: 534.5 MB)
18/05/06 05:39:59 INFO storage.BlockManagerInfo: Added broadcast_19_piece0 in memory on 192.168.239.130:53716 (size: 2.9 KB, free: 534.5 MB)
18/05/06 05:39:59 INFO storage.BlockManagerInfo: Added broadcast_18_piece0 in memory on 192.168.239.131:43057 (size: 138.0 B, free: 534.5 MB)
18/05/06 05:39:59 INFO storage.BlockManagerInfo: Added broadcast_18_piece0 in memory on 192.168.239.130:53716 (size: 138.0 B, free: 534.5 MB)
18/05/06 05:39:59 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 12.0 (TID 39) in 594 ms on 192.168.239.130 (1/2)
18/05/06 05:39:59 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 12.0 (TID 38) in 719 ms on 192.168.239.131 (2/2)
18/05/06 05:39:59 INFO scheduler.DAGScheduler: ResultStage 12 (treeAggregate at GradientDescent.scala:225) finished in 0.717 s
18/05/06 05:39:59 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool 
18/05/06 05:39:59 INFO scheduler.DAGScheduler: Job 10 finished: treeAggregate at GradientDescent.scala:225, took 0.890308 s
18/05/06 05:39:59 INFO optimization.GradientDescent: GradientDescent.runMiniBatchSGD finished. Last 10 stochastic losses 361485.19500000007, 87684.419014825, 3489.5361094534996, 3279.8174503807445, 3275.7530596492475
18/05/06 05:39:59 WARN regression.LinearRegressionWithSGD: The input data was not directly cached, which may hurt performance if its parent RDDs are also uncached.
model: org.apache.spark.mllib.regression.LinearRegressionModel = org.apache.spark.mllib.regression.LinearRegressionModel: intercept = 0.0, numFeatures = 2

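Two things stand out in the training logs above. First, the stochastic losses drop from roughly 361485 to about 3275, so mini-batch SGD converged quickly on this data. Second, the WARN from LinearRegressionWithSGD notes that the input RDD was not cached: SGD launches one job per iteration over the same RDD, so an uncached input may be re-read on every iteration. Below is a minimal sketch of how the training call would typically be written with caching added; the file path, line format, parsing logic, and iteration count are illustrative assumptions, not the exact code used above.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Hypothetical input format: "label,feature1 feature2" per line
val training = sc.textFile("hdfs:///spark/lr-data.txt")
  .map { line =>
    val Array(label, features) = line.split(',')
    LabeledPoint(label.toDouble, Vectors.dense(features.split(' ').map(_.toDouble)))
  }
  .cache()  // cache before training: each SGD iteration runs a job over this RDD

// 10 iterations with the default step size; both should be tuned for real data
val model = LinearRegressionWithSGD.train(training, 10)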
scala> model.weights
res10: org.apache.spark.mllib.linalg.Vector = [0.04801929334645109,6.942515869951296]

scala> model.predict(Vectors.dense(1,80))
res11: Double = 555.4492888894501

scala> model.predict(Vectors.dense(1,200))
res12: Double = 1388.5511932836057
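
For a linear regression model, predict is just the dot product of the weight vector with the feature vector plus the intercept (0.0 here). A quick manual check in plain Scala, using the weights returned by res10 above:

// Manual check: prediction = weights · features + intercept
val w = Array(0.04801929334645109, 6.942515869951296)
val x = Array(1.0, 80.0)
val prediction = w.zip(x).map { case (wi, xi) => wi * xi }.sum + 0.0
// prediction ≈ 555.4492888894, matching model.predict(Vectors.dense(1,80))

The second weight (≈6.94) dominates, so the prediction scales almost linearly with the second feature: raising it from 80 to 200 moves the prediction from ≈555.45 to ≈1388.55, exactly as res11 and res12 show.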

Summary: machine learning with Spark

1) Training is very fast: even on this small three-worker cluster, each SGD iteration finished in well under a second once the broadcast data was warm.

2) Spark's machine-learning libraries are steadily growing richer.

3) The mathematical modeling step matters most, and it requires real mathematical knowledge.



Reposted from blog.csdn.net/tansuoliming/article/details/80212930