there are several nice techniques in spark,eg. in user api side.here will dive into it check how does spark implement them.
1.abstract(functions in RDD)
group | function | feature | principle | ||
1 | first() | retrieve the first element in this rdd,if it's more than one partitons,the first partition will be taken by priority. esp,it will call take(1) internally. |
runs a job pairition by partition untill the total amount reaches the expected number |
||
take(n) | extract the first n elements in this rdd.it's the equivalent of first() if n is 1 | ||||
2 | top(n)(order) | extract the top (max by default) N elements. calls takeOrdered(num)(ord.reverse) internally.several search engine says solr will use similar technique to figure out it. |
concurrently spawns all tasks to do the same operation on respective partiton,ie each ,ie each tasks will try to retrieve the 'n' elements. |
||
max()(order) | retrieve the max element.though they are different algorithms with top(n) internally,both are the same effect(performance?) in finally. uses rdd.reduce(ord.max) internally. |
||||
3 | min()(order) | it's in the opposite of max(), uses rdd.reduce(ord.min) internally. | simiar to top(n) | ||
takeOrdered(n)(order) | in the opposite of top(n).similar to min() but take N minimum items. | ||||
4 | collect() | retrieve all the results for this rdd computation.so OOM exception will occur occasionally | similar to top(n),but here each task will not be limited to be n instead of 'max' |
2.techniques
a.lazy computation & compuates range by range
eg. in terms of take(n),spark can act as a lazy-worker:action when need! that is spark will try to use least resources as far as possible.see below for details:
/**-estimates partitions step by step to decrease resource consumption.ie lazy copmutation. * Take the first num elements of the RDD. It works by first scanning one partition, and use the * results from that partition to estimate the number of additional partitions needed to satisfy * the limit.-loop runs(continued jobs) to estimate whether several partitons's results are satisfied the target num. * the result returned is sorted by partiton sequnce. * @note due to complications in the internal implementation, this method will raise * an exception if called on an RDD of `Nothing` or `Null`. */ def take(num: Int): Array[T] = withScope { if (num == 0) { new Array[T](0) } else { val buf = new ArrayBuffer[T] val totalParts = this.partitions.length var partsScanned = 0 while (buf.size < num && partsScanned < totalParts) { //-loop to check whether results is satisfied the target //1 -compute what partitions range to run // The number of partitions to try in this iteration. It is ok for this number to be // greater than totalParts because we actually cap it at totalParts in runJob. var numPartsToTry = 1 if (partsScanned > 0) { log.info(s"-step to next loop,numPartsToTry=${numPartsToTry}") // If we didn't find any rows after the previous iteration, quadruple/四位相乘 and retry. // Otherwise, interpolate/篡改 the number of partitions we need to try, but overestimate/高估 // it by 50%. We also cap/覆盖 the estimation in the end. if (buf.size == 0) { //-no any data in prevous scanned partitons,ranges into more partitions numPartsToTry = partsScanned * 4 } else { // the left side of max is >=1 whenever partsScanned >= 2 //-estimate the remain parts to compute.but estimated total parts=num/buf.size * partsScanned * 1.5 numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1) numPartsToTry = Math.min(numPartsToTry, partsScanned * 4) //-narrow down the partiton range } } val left = num - buf.size //-step(range) to next run val p = partsScanned until math.min(partsScanned + numPartsToTry, totalParts) //-2 proceed with scanning the remain size of each specified partitions.similar to solr's group query val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true) //-3 add up to total buf;note:here doesn't take num-buf.size per partiton,but the remain size as the buf is mutable res.foreach(buf ++= _.take(num - buf.size)) //-change buf size per partition partsScanned += numPartsToTry } buf.toArray }
again ,spark will try to estimate the partitons to be computated by current scanned items amount.ie numPartsToTry.
of course ,this feature is dependented on the parted-computation utility in spark.
b.lazy load by iterator
by diving into takeOrdered(n),some nice stuffs are shown here,
/**-similar to a search engine,asign the request 'num' to each partition,then merge all partitions' result.so here * the mapPartitons() is called. * Returns the first k (smallest) elements from this RDD as defined by the specified * implicit Ordering[T] and maintains the ordering. This does the opposite of [[top]]. * For example: * {{{ * sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1) * // returns Array(2) * * sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2) * // returns Array(2, 3) * }}} * * @param num k, the number of elements to return * @param ord the implicit ordering for T * @return an array of top elements */ def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope { if (num == 0) { Array.empty } else { //1 retrieve top n items per partition val mapRDDs = mapPartitions { items => // Priority keeps the largest elements, so let's reverse the ordering. //-restore to small to large order:as the ord is used to trim the smallest element,ie element count is limited // in here val queue = new BoundedPriorityQueue[T](num)(ord.reverse) queue ++= util.collection.Utils.takeOrdered(items, num)(ord) //-keep large to samll order to limit items Iterator.single(queue) } //2 merge all the results into final n items if (mapRDDs.partitions.length == 0) { Array.empty } else { //-merge the individual partition's sub-result mapRDDs.reduce { (queue1, queue2) => queue1 ++= queue2 queue1 //-always keep left one to comulate; element count is implemented by queue,see above }.toArray.sorted(ord) //-resort by raw ord(reverse order) } }
note:items is a Iterator,which means that only a reference to the underlying storage is cost other than a concreate a Array or Seq!
for more clearly ,we can demostrate some snippets:
a.driver api
val maprdd = fmrdd.map((_,1)) //-MapPartitionsRDD[3]
b.rdd internal
/** * Return a new RDD by applying a function to all elements of this RDD. */ def map[U: ClassTag](f: T => U): RDD[U] = withScope { val cleanF = sc.clean(f) new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF)) //-so 'this' will be parent rdd
c. then dive into iter.map()
def map[B](f: A => B): Iterator[B] = new AbstractIterator[B] { def hasNext = self.hasNext def next() = f(self.next()) }
so we know that,every key-value pari will be read in per loop(callbac func 'f()').
ie. embeded processure: Fn->...->F2(F1(read root rdd's kv pair1))),then Fn(F..(kv pair2))
also,since every RDD#iterator()(bedides HadoopRDD's one ) will produce a new Iterator(see above),so no any 'no more elements exception ' will rise for follow RDD's calls.