After learning about Spark RDDs and RDD operations, don't you want to write a quick program to consolidate what you have learned about Spark? And how could anyone learning big data development skip WordCount?
Without further ado, here is the code:
val lines = sc.textFile("data/dataset.txt")   // read the input file
val result = lines.flatMap(_.split(" "))      // split each line into words
  .map((_, 1))                                // pair each word with a count of 1
  .reduceByKey(_ + _)                         // sum the counts for each word
result.collect()                              // bring the results back to the driver
  .foreach(x => println(x._1 + " = " + x._2))
Output:
Deer = 2
Bear = 2
Car = 3
River = 2
Just a few lines of code and WordCount is done. Compared with writing WordCount as a MapReduce program, Spark really is easy to use.
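The snippet above assumes sc is the SparkContext that the spark-shell provides. As a minimal self-contained sketch (the local[2] master URL is an assumption for local testing), the same program packaged as a standalone application would look like this:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[2] runs Spark locally with two worker threads (an assumption for this sketch)
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val result = sc.textFile("data/dataset.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    result.collect().foreach { case (word, count) => println(word + " = " + count) }
    sc.stop()
  }
}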
Next, let's use the program we just wrote to explore a few other things...
First of all, we can use the toDebugString method that Spark provides to inspect the RDD's lineage:
result.toDebugString
Output:
(2) ShuffledRDD[4] at reduceByKey at WordCount.scala:21 []
+-(2) MapPartitionsRDD[3] at map at WordCount.scala:20 []
| MapPartitionsRDD[2] at flatMap at WordCount.scala:19 []
| data/dataset.txt MapPartitionsRDD[1] at textFile at WordCount.scala:17 []
| data/dataset.txt HadoopRDD[0] at textFile at WordCount.scala:17 []
From the output we can see the RDDs that the WordCount program created. Let's read it from the bottom up.
The call to sc.textFile creates a MapPartitionsRDD. Internally, textFile actually creates a HadoopRDD first and then applies a map operation to it, which finally yields the MapPartitionsRDD. The subsequent flatMap, map, and reduceByKey transformations each produce an RDD that depends on the one before it. Finally, the collect action is called, which generates a Job. At this point the Spark scheduler creates a physical execution plan for the RDD computation the user defined: starting from the RDD on which collect() was called, it works backwards through all the RDDs that need to be computed, visiting the RDD's parents, the parents' parents, and so on, recursively generating the physical plan for all the ancestor RDDs that must be computed.
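To make this backward walk concrete, here is a minimal sketch (not Spark's actual scheduler code) that follows an RDD's lineage recursively through its public dependencies field:

import org.apache.spark.rdd.RDD

// Print an RDD and all of its ancestors, indented by depth in the lineage.
def printLineage(rdd: RDD[_], depth: Int = 0): Unit = {
  println(("  " * depth) + rdd)
  rdd.dependencies.foreach(dep => printLineage(dep.rdd, depth + 1))
}

printLineage(result)  // walks from the ShuffledRDD back up to the HadoopRDD

This is essentially what toDebugString renders for us, and it mirrors how the scheduler discovers every ancestor RDD before building the physical plan.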
DAG generation
A series of transformations on the original RDD(s) forms a DAG. The dependencies between RDDs record which parent RDD(s) an RDD was derived from and which partitions of those parent RDD(s) it depends on. These dependencies form the RDD's lineage. With lineage, Spark can guarantee that before an RDD is computed, the parent RDDs it depends on have already been computed. Lineage also makes RDDs fault tolerant: if some or all of the partitions of an RDD are lost, only the lost partitions need to be recomputed, rather than recomputing all of the data.
How the DAG generates computing tasks
The DAG is divided into different stages (Stage) according to the dependencies. For narrow dependencies, since the partition-level dependencies are deterministic, the partition transformations can be completed within the same thread, so Spark groups narrow dependencies into the same stage. For wide dependencies, a shuffle is involved: the downstream computation can only begin after the parent RDD(s) have been fully processed, so Spark uses wide dependencies as the boundaries for dividing stages. Within a stage, each partition is assigned a computing task (Task), and these tasks can execute in parallel. The dependencies between stages in turn form a coarser-grained DAG, which is executed in order from front to back. In other words, a stage can execute only when it has no parent stage, or when all of its parent stages have finished executing.
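We can observe this distinction on the WordCount RDDs themselves. In the sketch below, a ShuffleDependency marks a wide dependency (and hence a stage boundary), while anything else, such as the OneToOneDependency produced by map and flatMap, is narrow and stays inside a stage:

import org.apache.spark.ShuffleDependency

result.dependencies.foreach {
  case s: ShuffleDependency[_, _, _] => println("wide (shuffle) dependency on " + s.rdd)
  case d                             => println("narrow dependency on " + d.rdd)
}

For result (the ShuffledRDD produced by reduceByKey) this prints a wide dependency, which is exactly where the Job is split into stages.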
In the WordCount example above, the reduceByKey transformation triggers a shuffle when executed. The whole Job is therefore split at this point, forming two stages. In the Spark Web UI (http://127.0.0.1:4040) we can see the two stages of the job.
Stage 1 depends on Stage 0, so Stage 0 must finish executing before Stage 1 can execute. Next, let's look at the relationship between the RDD transformations in Stage 0 and Stage 1.
Within Stage 0, execution is pipelined from reading the file through to the final MapPartitionsRDD. Stage 0 therefore generates two tasks, one per partition.
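The task count follows directly from the partition count; the (2) prefix in the toDebugString output above shows the same number. A quick check in the shell:

println(lines.getNumPartitions)   // => 2, so Stage 0 runs two tasks in parallel
println(result.getNumPartitions)  // => 2, so Stage 1 also runs two tasks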
The tasks in Stage 1 take the output of the Stage 0 tasks as their input and produce the final result.
Summary
After a series of RDD transformations, an action is called on the last RDD, which generates a Job. The Job is divided into a number of computing tasks (Tasks), and these Tasks are submitted to the compute nodes of the cluster for execution. Spark divides Tasks into two kinds: ShuffleMapTask and ResultTask. The final stage of the DAG (Stage 1 in our example) generates a ResultTask for each result partition; all the remaining stages generate ShuffleMapTasks. The generated Tasks are sent to the Executors that have already been started, and the Executors carry out the computation.
Spark's execution process:
- User code defines a directed acyclic graph (DAG) of RDDs. Operations on RDDs create new RDDs, and child RDDs hold references to their parent RDDs, which forms the DAG.
- An action forces the directed acyclic graph to be translated into an execution plan. When an action is called on an RDD, that RDD must be computed, which in turn requires its parent RDDs to be computed. The Spark scheduler submits a job to compute all the required RDDs. The job comprises one or more steps, each step corresponding to a batch of computing tasks; a step also corresponds to one or more RDDs in the DAG (pipelined execution).
- Tasks are scheduled and executed on the cluster. Each task processes one partition of data. After an action is called on an RDD, many tasks are generated, and the Spark scheduler schedules these tasks to run on the Workers. Once the last step of the job finishes, the action's result is returned.
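To watch these stages and tasks run, one option is to register a listener through Spark's public listener API before triggering the action. A hedged sketch (listener output goes to the driver's stdout):

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

sc.addSparkListener(new SparkListener {
  override def onStageCompleted(event: SparkListenerStageCompleted): Unit =
    println(s"Stage ${event.stageInfo.stageId} completed with ${event.stageInfo.numTasks} tasks")
})

result.collect()  // expect two completed stages, each with two tasks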
Reproduced from: https://www.jianshu.com/p/8d2bf49cf97d