https://www.youtube.com/watch?v=Wg2boMqLjCg
Understanding the Shuffle in Spark
-Common causes of inefficiency
Understanding when code runs on the driver vs. the workers
-Common causes of errors
How to factor your code
-For reuse between batch and streaming
Part 1: Understanding the Shuffle in Spark
A shuffle occurs when all records with the same key must be transferred to the same worker node.
1) reduceByKey vs. groupByKey
> reduceByKey is more efficient
> groupByKey can cause out-of-disk (and out-of-memory) problems
> reduceByKey, aggregateByKey, foldByKey, and combineByKey are preferred over groupByKey because they perform map-side combining before the network transfer, so they shuffle less data and use less disk
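The map-side combine can be sketched in plain Python (no Spark required); the partition contents and counts below are invented for illustration, but the comparison shows why pre-aggregating per partition shrinks the shuffle:

```python
# Simulation: how many records cross the "network" under a
# groupByKey-style shuffle vs. a reduceByKey-style shuffle.
from collections import defaultdict

partitions = [
    [("a", 1), ("a", 1), ("b", 1)],   # records held by worker 1
    [("a", 1), ("b", 1), ("b", 1)],   # records held by worker 2
]

# groupByKey: every record is shuffled as-is.
group_by_key_shuffled = sum(len(p) for p in partitions)

# reduceByKey: each worker combines locally first (map-side combine),
# so at most one record per (partition, key) pair is shuffled.
reduce_by_key_shuffled = 0
for p in partitions:
    local = defaultdict(int)
    for k, v in p:
        local[k] += v                      # local pre-aggregation
    reduce_by_key_shuffled += len(local)   # one combined record per key

print(group_by_key_shuffled, reduce_by_key_shuffled)  # 6 4
```

With skewed keys the gap grows: millions of identical keys collapse to one combined record per partition before the transfer.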
2) Join a large table with a small table
ShuffledHashJoin vs. BroadcastHashJoin
> ShuffledHashJoin: all the data will be shuffled
> BroadcastHashJoin: broadcast the small table to all worker nodes
> Use .toDebugString() or EXPLAIN to double check
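The broadcast idea can be simulated in plain Python; the table contents and names below are invented. Each "worker" holds a full copy of the small table and joins its partition of the large table locally, so the large table never moves:

```python
# Simulation of a broadcast hash join: the small table is copied to
# every worker; each worker probes it while scanning its own partition.
small_table = {"us": "United States", "fr": "France"}   # broadcast copy

large_partitions = [
    [("us", 100), ("fr", 200)],
    [("us", 300), ("de", 400)],   # "de" has no match in the small table
]

joined = []
for partition in large_partitions:       # runs independently per worker,
    for key, value in partition:         # no shuffle of the large side
        if key in small_table:           # inner-join semantics
            joined.append((key, value, small_table[key]))

print(joined)
```

In real Spark the same effect comes from broadcasting the small side (e.g. a broadcast hint or a small enough table under the auto-broadcast threshold); the plan check with EXPLAIN confirms which join was chosen.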
3) Join a medium table with a large table
Before the join, transform/filter the large table down to only the rows that can match, then shuffle the (now smaller) result
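A minimal sketch of the filter-before-shuffle idea, in plain Python with invented keys and row counts: only rows whose key can possibly match the medium table survive into the shuffle.

```python
# Simulation: pre-filtering the large table by the medium table's key set
# (a semi-join style pruning) before the expensive shuffle/join.
medium_keys = {"a", "b"}                  # keys present in the medium table

large_table = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("a", 5)]

# Filter first: rows with keys absent from the medium table are dropped
# locally, so they are never shuffled.
filtered = [row for row in large_table if row[0] in medium_keys]

print(len(large_table), len(filtered))  # 5 3
```

In Spark this corresponds to a filter (or left-semi join against the medium table's keys) applied to the large table before the full join.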
4) In Practice: Detecting Shuffle Problems
Check the Spark UI pages for task level detail about your Spark Job.
Things to Look for:
> Tasks that take much longer to run than others
> Speculative tasks being launched
> Partitions (shards) with much more input or shuffle output than others
Part 2: Execution on the Driver vs. Workers
1)
The main program is executed on the Spark driver.
Transformations are executed on the Spark workers.
Actions may transfer data from the workers to the driver.
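A common cause of errors here is mutating a driver-side variable inside a transformation: each worker runs against a serialized copy of the closure, so the driver's variable never changes. The plain-Python simulation below (invented names, deep copies standing in for closure serialization) shows the pattern:

```python
# Simulation: workers receive a *copy* of driver-side variables captured
# in a closure; mutating the copy does not update the driver.
import copy

counter = {"n": 0}                       # lives on the driver

def run_on_worker(partition, closure_vars):
    local = copy.deepcopy(closure_vars)  # stand-in for closure serialization
    for _ in partition:
        local["n"] += 1                  # updates the worker's copy only
    return local["n"]

partitions = [[1, 2, 3], [4, 5]]
worker_counts = [run_on_worker(p, counter) for p in partitions]

print(worker_counts, counter["n"])       # [3, 2] 0 -- driver counter unchanged
```

In real Spark, driver-visible aggregation like this should use an Accumulator (or an action such as count()) rather than a closed-over variable.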
2) collect()
collect() sends all the partitions to the single driver
Don't call collect() on a large RDD
> use count() or take(N) instead
> or write the results out with saveAsTextFile()
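The difference can be sketched in plain Python (invented partition layout): collect() materializes every row on the driver, while take(n) stops as soon as it has n rows, so driver memory stays bounded.

```python
# Simulation: collect() pulls every partition to the driver;
# take(n) scans partitions in order and exits early.
partitions = [list(range(1000)), list(range(1000, 2000))]

def collect(parts):
    return [row for p in parts for row in p]      # all rows hit the driver

def take(parts, n):
    out = []
    for p in parts:                               # scan partitions in order
        for row in p:
            out.append(row)
            if len(out) == n:
                return out                        # early exit
    return out

print(len(collect(partitions)), take(partitions, 3))  # 2000 [0, 1, 2]
```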