Spark API - Spark partition

First, the concept of a partition

A partition is the unit of parallel computation inside an RDD: the RDD's data set is logically divided into multiple slices, and each slice is called a partition. The number of partitions determines the granularity of the parallel computation, and the values of each partition are computed in a single task. Therefore the number of tasks is also decided by the number of partitions of the RDD (to be exact, of the last RDD in the job).

Second, why partition?

Partitioning data in a distributed cluster greatly reduces the cost of network communication, and cutting network traffic can greatly improve performance. In the MapReduce framework, the performance overhead is mainly IO and network transmission. The IO cost comes from reading and writing a large number of files and is unavoidable; network transmission, however, can be reduced, for example by compressing large files into smaller ones so that less data crosses the network, at the price of extra CPU work.

IO is just as unavoidable inside Spark, but Spark does optimize the network transmission:

Spark splits an RDD into partitions (slices) that are computed in parallel across the cluster. Suppose an RDD has 100 partitions and the cluster has 10 nodes, so each node holds about 10 partitions. For a sum-type computation, each partition first computes its own sum, and only those partial sums are sent to the driver program to produce the global sum, so the network traffic of a sum-type computation is very small. For a join-type computation, however, the data itself has to be shuffled, which is a large network overhead.
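As a minimal sketch of the sum case (spark-shell style, where the built-in SparkContext sc already exists; the numbers are invented):

val nums = sc.parallelize(1 to 1000000, 10)   // 10 partitions spread over the cluster

val total = nums
  .mapPartitions(iter => Iterator(iter.map(_.toLong).sum))   // partial sum inside each partition
  .collect()                                                 // only 10 Longs travel to the driver
  .sum

println(total)   // 500000500000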

How does Spark optimize this problem?

When Spark partitions a key-value RDD, it partitions by the hashCode of the key and guarantees that the same key is stored on the same node, so aggregating that RDD by key does not need a shuffle. Why does MapReduce have to shuffle before it can aggregate? The network transmission in MapReduce happens mainly in the shuffle stage, and the root cause is that records with the same key sit on different nodes; to aggregate by key, it has no choice but to shuffle. The shuffle is what hits the network: all the data must be mixed across the network before records with the same key can come together. Whether a shuffle is needed is decided by how the data is stored.

Spark draws its inspiration from this: it partitions by key, that is, by the key's hashCode. The same key always has the same hashCode, so when, say, 100 TB of data is split into 10 partitions of 10 TB each, a given key is guaranteed to fall into exactly one partition, and therefore the same key is guaranteed to be stored on the same node. For example, if an RDD is divided into 100 parts on a cluster of 10 nodes, each node stores 10 parts, and each part is called a partition; Spark's guarantee that the same key lives on the same node really means that it lives in the same partition.

Because keys are unevenly distributed, some partitions end up large and some small; the partitions cannot be guaranteed to be exactly equal, but they stay within a close range. So work that MapReduce has to do in its shuffle does not need a shuffle in Spark at all; this is the underlying principle of how Spark cuts down network transmission.
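Hash partitioning can be sketched in plain Scala as follows; the function name is made up purely for illustration, but the modulo-on-hashCode idea is the same one a hash partitioner relies on:

def partitionIndex(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod   // hashCode can be negative; keep the index >= 0
}

println(partitionIndex("user-42", 10))   // always the same index, on whichever node this runs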

When two tables are joined, it is usually impossible for both of them to be pre-partitioned; normally the large, frequently used table is partitioned ahead of time, and when a small table is joined with it, only the small table goes through the shuffle.

The big table does not need to be shuffled.
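A minimal sketch of that pattern (assuming an existing SparkContext sc; the table contents are invented):

import org.apache.spark.HashPartitioner

val bigTable = sc.parallelize(Seq(("u1", "profileA"), ("u2", "profileB")))
  .partitionBy(new HashPartitioner(100))   // partition the big, frequently used table up front
  .persist()                               // keep it, so the partitioning work is reused

val smallTable = sc.parallelize(Seq(("u1", "order1"), ("u2", "order2")))

// bigTable already carries a partitioner, so only smallTable is shuffled to line up with it.
val joined = bigTable.join(smallTable)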
  

Transformations that need to shuffle data between nodes benefit greatly from partitioning. Such transformations are cogroup, groupWith, join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey, combineByKey and lookup.

Partitioning is configurable, as long as the RDD is a key-value RDD.
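For instance, a sketch of configuring a partitioner on a key-value RDD and letting a later reduceByKey reuse it (again assuming sc; the data is invented):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
  .partitionBy(new HashPartitioner(4))
  .persist()

println(pairs.partitioner)              // Some(org.apache.spark.HashPartitioner@...)

val counts = pairs.reduceByKey(_ + _)   // reuses the same partitioner, so no further shuffle
println(counts.partitioner)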

Third, Spark partitioning principles and methods

The principle of RDD partitioning: make the number of partitions as close as possible to the number of cores in the cluster.

Whether in local mode, Standalone mode, YARN mode or Mesos mode, we can configure the default number of partitions through spark.default.parallelism; if this value is not set, it is determined by the cluster environment.

3.1 Local Mode

(1) default

With no explicit setting, the program simply falls back to the default number of partitions.
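A sketch of what that looks like (a small standalone program; the application name and data are invented, and the comments state the values a typical Spark build produces):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("partition-demo").setMaster("local"))

val rdd = sc.parallelize(1 to 100)   // no partition count requested anywhere
println(sc.defaultParallelism)       // 1 with plain local (a single worker thread)
println(rdd.getNumPartitions)        // follows the default parallelism, so also 1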

 


 

 

(2) manual setting

However many partitions you set explicitly is exactly how many you get.
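For example, continuing with the sc from the sketch above:

val rdd4 = sc.parallelize(1 to 100, 4)   // explicitly ask for 4 partitions
println(rdd4.getNumPartitions)           // 4 -- exactly what was requested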

 


 

 

(3) Related to local[n]

With local[n], the default number of partitions equals n; with local[*], it equals the number of CPU cores of the machine.
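A sketch (application names invented; each context is stopped before the next one is created):

import org.apache.spark.{SparkConf, SparkContext}

val sc4 = new SparkContext(new SparkConf().setAppName("demo-n").setMaster("local[4]"))
println(sc4.defaultParallelism)                       // 4
println(sc4.parallelize(1 to 100).getNumPartitions)   // 4
sc4.stop()

val scAll = new SparkContext(new SparkConf().setAppName("demo-star").setMaster("local[*]"))
println(scAll.defaultParallelism)                     // the number of CPU cores of this machine
scAll.stop()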

 

 


 

To check the number of CPU cores on a Windows machine: right-click My Computer, then Manage, then Device Manager, then Processors.

 

 

 

(4) Parameter Control
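The default can also be forced through the spark.default.parallelism parameter; a sketch (names invented):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("demo-parallelism")
  .setMaster("local[4]")
  .set("spark.default.parallelism", "8")   // the explicit parameter wins over local[4]

val sc = new SparkContext(conf)
println(sc.defaultParallelism)                        // 8
println(sc.parallelize(1 to 100).getNumPartitions)    // 8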

 

 


 

3.2 YARN mode

 

 

When spark.default.parallelism is not set, we can trace through the Spark source to see where the value comes from. Start by stepping into the defaultParallelism method of SparkContext.
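Paraphrased from the Spark source (the exact code differs slightly between versions), the method simply delegates to the task scheduler:

// org.apache.spark.SparkContext (paraphrased)
def defaultParallelism: Int = {
  assertNotStopped()
  taskScheduler.defaultParallelism
}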

 

Continue stepping into that defaultParallelism call on the task scheduler.

 

 

 

TaskScheduler is a trait, so press Ctrl + H to find its implementation class.

 

In the implementation class TaskSchedulerImpl, find the defaultParallelism method.
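There the call is forwarded straight to the scheduler backend (again paraphrased):

// org.apache.spark.scheduler.TaskSchedulerImpl (paraphrased)
override def defaultParallelism(): Int = backend.defaultParallelism()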

 

 

Continue into that defaultParallelism; the backend is again a trait (SchedulerBackend), so we have to look at its implementation classes.

 

 

 

Press Ctrl + H on the SchedulerBackend trait to see its implementation classes.

 

 

Enter CoarseGrainedSchedulerBackend and find defaultParallelism.
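Its logic is roughly the following (paraphrased; the wording varies across Spark versions):

// org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend (paraphrased)
override def defaultParallelism(): Int = {
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}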

 

 

totalCoreCount.get() is the total number of cores across all executors; it is compared with 2, and the larger of the two is used.

If spark.default.parallelism has been set, then whatever value you set is the value you get.

 

Fourth, the partitioner

(1) If the data is read from HDFS, a partitioner is not needed, because HDFS has already split the data into many blocks.

We can control the number of partitions, but there is no need for a partitioner.
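For example (assuming an existing SparkContext sc; the HDFS path is hypothetical):

val logs = sc.textFile("hdfs://namenode:9000/logs/access.log", 8)   // minPartitions hint
println(logs.getNumPartitions)   // usually at least 8, ultimately driven by the HDFS splits
println(logs.partitioner)        // None -- a plain text RDD carries no partitioner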

(2) When partitioning a non-key-value RDD, there is no need to specify a partitioner.

import org.apache.spark.HashPartitioner

val testRDD = sc.textFile("C:\\Users\\Administrator\\IdeaProjects\\myspark\\src\\main\\hello.txt")
  .flatMap(line => line.split(","))
  .map(word => (word, 1)).partitionBy(new HashPartitioner(2))

  

There is no need to set one here, but it still works if you set one anyway.

(3) When the RDD is in key-value form, a partitioner becomes necessary.

HashPartitioner

val resultRDD = testRDD.reduceByKey(new HashPartitioner(2), (x: Int, y: Int) => x + y)
// If no HashPartitioner is set, the number of partitions follows spark.default.parallelism
println(resultRDD.partitioner)
println("resultRDD" + resultRDD.getNumPartitions)

RangePartitioner

 

import org.apache.spark.RangePartitioner

val resultRDD = testRDD.reduceByKey((x: Int, y: Int) => x + y)
val newresultRDD = resultRDD.partitionBy(new RangePartitioner[String, Int](3, resultRDD))
println(newresultRDD.partitioner)
println("newresultRDD" + newresultRDD.getNumPartitions)

  

Note: RangePartitioner partitions by range. If the key is a string, the ranges are divided in dictionary order; if the key is numeric, the ranges are divided according to the numeric range of the data.

Custom Partitioning

To write a custom partitioner, we need to implement two methods:

import java.net.URL
import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}

// A custom partitioner that partitions URLs by their host name.
class MyPartitoiner(val numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts
  override def getPartition(key: Any): Int = {
    val domain = new URL(key.toString).getHost
    val code = domain.hashCode % numParts
    if (code < 0) {
      code + numParts
    } else {
      code
    }
  }
}

object DomainNamePartitioner {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("word count").setMaster("local")

    val sc = new SparkContext(conf)

    val urlRDD = sc.makeRDD(Seq(("http://baidu.com/test", 2),
      ( "http://baidu.com/index", 2), ( "http://ali.com",. 3), ( "http://baidu.com/tmmmm",. 4), 
      ( "HTTP: //baidu.com/test ",. 4))) 
    // the Array [the Array [(String, Int)]] 
    // = the Array (the Array (), 
    // the Array ((http://baidu.com/index, 2), (http://baidu.com/tmmmm,4), 
    // (http://baidu.com/test,4), (http://baidu.com/test,2), (HTTP: //ali.com,3))) 
    Val hashPartitionedRDD = urlRDD.partitionBy (new new HashPartitioner (2)) 
    hashPartitionedRDD.glom (). the collect () 

    // use of spark-shell --jar manner jar package where the partitioner introduction to, and then test the following code 
    // --master the shell-Spark Spark: // Master: 7077 --jars Spark-1.0-SNAPSHOT.jar RDD- 
    Val partitionedRDD = urlRDD.partitionBy (new new MyPartitoiner (2)) 
    Val partitionedRDD.glom = Array () the collect (. ) 

  } 
}

  

 

 

 
