Spark Tuning Guide

1. The Process in Detail

The most important thing in Spark is the shuffle process, the so-called MR (map -> reduce) process. There are really two ideas for reducing shuffle:

Try not to change the key, so that the work completes locally, reducing network IO, disk IO, and CPU overhead.
Reduce the size of the data being shuffled.

```
1. Deduplicate when using union
2. mapValues is better than map (it keeps the key untouched and preserves the partitioner):
A.map { case (a, ((b, c), (d, e))) => (a, (b, c, e)) }
changed to:
A.mapValues { case ((b, c), (d, e)) => (b, c, e) }
3. Use broadcast + filter instead of join
This is really about joining a small RDD against a large RDD (see the sketch after this block):
A.map { case (name, age, sex) => (age, (name, sex)) }
 .join(B)
 .map { case (age, ((name, sex), title)) => (name, age, sex) }
You can imagine what happens when this executes: the large RDD A is broken up and redistributed across the nodes. Even worse, to restore the original (name, age, sex) structure another map is made, and that can lead to yet another shuffle. Two shuffles, crazy. However, if you write:
val b = sc.broadcast(B.collectAsMap)
A.filter { case (name, age, sex) => b.value.contains(age) }
```
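To make tip 3 concrete, here is a minimal self-contained sketch of the broadcast + filter pattern; the names `people` and `ages` and the local-mode setup are assumptions for illustration, not from the original post:

```
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastFilterExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-filter").setMaster("local[*]"))

    // Large RDD of (name, age, sex) records -- hypothetical data.
    val people = sc.parallelize(Seq(("alice", 30, "f"), ("bob", 25, "m"), ("carol", 30, "f")))
    // Small RDD of (age, title) pairs -- hypothetical data.
    val ages = sc.parallelize(Seq((30, "thirty")))

    // Instead of re-keying `people` by age and joining (which shuffles the
    // large RDD), collect the small RDD and broadcast it to every node.
    val ageMap = sc.broadcast(ages.collectAsMap())

    // The filter now runs locally on each partition: no shuffle at all.
    val matched = people.filter { case (_, age, _) => ageMap.value.contains(age) }

    matched.collect().foreach(println)
    sc.stop()
  }
}
```

Only the small side is collected to the driver and broadcast, so this pattern works only when the small RDD fits comfortably in memory.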

As for skewed key data, put another way: when we read and aggregate data, we need to understand which data appears on the master and which on the slaves. Concretely, a small keyed dataset can simply be read into memory on the master, as the example below shows:
```
A.foreach(println)
You want to print the RDD's contents one by one, but nothing shows up in stdout, because the line above executes on the slaves. In fact you only need collect; the contents are then loaded into memory on the master and printed there:
A.collect.foreach(println)
```
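Note that collect pulls the entire RDD into master (driver) memory, so for a large RDD a sampled print such as A.take(20).foreach(println) avoids that risk.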
Use reduceByKey instead of groupByKey, as sketched below.
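
A minimal sketch of the difference, in spark-shell style; the word data and the SparkContext `sc` are assumptions for illustration. reduceByKey combines values inside each partition before the shuffle, while groupByKey ships every raw value across the network:

```
// Hypothetical word data; `sc` is the usual spark-shell SparkContext.
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))

// groupByKey shuffles every single (word, 1) pair, then sums on the reduce side.
val countsSlow = words.groupByKey().mapValues(_.sum)

// reduceByKey sums within each partition first (map-side combine),
// so far less data crosses the network.
val countsFast = words.reduceByKey(_ + _)

countsFast.collect().foreach(println)
```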

2. Parameter Tuning

3. SQL Tuning

Origin: www.cnblogs.com/corx/p/11519444.html