A Brief Look at PriorityQueue in MapReduce

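Preface:

In Merger.java, PriorityQueue.java is mainly used to build a min-heap. Elements are added with PriorityQueue.put, which in effect runs upHeap(), one of the two core functions of PriorityQueue. Once the min-heap is built, the caller can take the top element (the minimum) with PriorityQueue.pop, which removes the top and then immediately calls downHeap(), the other core function of PriorityQueue.

Let's walk through the code flow:

1. First, the code that builds the min-heap (bottom-up, via upHeap). In Merger.java this happens inside the merge(...) method of MergeQueue. MergeQueue extends the abstract class PriorityQueue and only implements abstract boolean lessThan(Object a, Object b); everything else is inherited.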

//feed the streams to the priority queue
initialize(segmentsToMerge.size()); // set the capacity of the PriorityQueue
clear(); // empty the PriorityQueue
for (Segment<K, V> segment : segmentsToMerge) {
  // Call put() in a loop to build the min-heap. After each put(), every
  // parent node (i) is ordered before its children (i*2 and (i*2)+1).
  // Heap indices start at 1.
  put(segment);
}
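To make the put/upHeap and pop/downHeap mechanics concrete, here is a minimal sketch of a 1-indexed binary min-heap with an abstract lessThan hook. It is not Hadoop's source: the names SketchQueue, IntMinQueue, and HeapDemo are hypothetical, and the structure only follows what is described above (index 1 is the root, the children of node i are i*2 and i*2+1).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Hadoop's PriorityQueue: a 1-indexed binary min-heap.
abstract class SketchQueue<T> {
    private Object[] heap; // heap[0] is unused; the root lives at heap[1]
    private int size;

    // Subclasses define the ordering, the way MergeQueue supplies lessThan.
    protected abstract boolean lessThan(T a, T b);

    void initialize(int maxSize) {
        heap = new Object[maxSize + 1];
        size = 0;
    }

    @SuppressWarnings("unchecked")
    private T at(int i) { return (T) heap[i]; }

    int size() { return size; }

    T top() { return size > 0 ? at(1) : null; }

    // Append at the bottom, then sift the new element up.
    void put(T element) {
        heap[++size] = element;
        upHeap();
    }

    // Remove the root, move the last element to the root, then sift it down.
    T pop() {
        if (size == 0) return null;
        T result = at(1);
        heap[1] = heap[size];
        heap[size--] = null;
        downHeap();
        return result;
    }

    // Call after the root's key has changed in place.
    void adjustTop() { downHeap(); }

    // Moves the last element toward the root while it is smaller than its parent.
    private void upHeap() {
        int i = size;
        T node = at(i);
        for (int parent = i >>> 1; parent > 0 && lessThan(node, at(parent)); parent = i >>> 1) {
            heap[i] = heap[parent];
            i = parent;
        }
        heap[i] = node;
    }

    // Moves the root toward the leaves while a child is smaller than it.
    private void downHeap() {
        if (size == 0) return;
        int i = 1;
        T node = at(i);
        while (true) {
            int child = i << 1;
            if (child > size) break;
            if (child + 1 <= size && lessThan(at(child + 1), at(child))) child++;
            if (!lessThan(at(child), node)) break;
            heap[i] = heap[child];
            i = child;
        }
        heap[i] = node;
    }
}

class IntMinQueue extends SketchQueue<Integer> {
    @Override
    protected boolean lessThan(Integer a, Integer b) { return a < b; }
}

class HeapDemo {
    // Puts every value, then pops until empty; the result comes out ascending.
    static List<Integer> drain(int... values) {
        IntMinQueue q = new IntMinQueue();
        q.initialize(values.length);
        for (int v : values) q.put(v);
        List<Integer> out = new ArrayList<>();
        while (q.size() > 0) out.add(q.pop());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain(5, 3, 8, 1, 9, 2)); // [1, 2, 3, 5, 8, 9]
    }
}
```

As in the article's description, the subclass contributes nothing but the comparison; all heap maintenance stays in the base class.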

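2. Taking the minimum (the heap top). The top segment is read; if that segment is exhausted it is popped and size drops by 1, and in either case the heap is re-adjusted top-down (downHeap) into a new min-heap. The code flow again starts in Merger.java, this time in next(). Note: each heap node is a segment, and segments are ordered by comparing their current first key, whereas each call to next() returns a single k/v pair from a segment.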

public boolean next() throws IOException {
  if (size() == 0) {
    resetKeyValue();
    return false;
  }
  if (minSegment != null) {
    //minSegment is non-null for all invocations of next except the first
    //one. For the first invocation, the priority queue is ready for use
    //but for the subsequent invocations, first adjust the queue

    // Re-adjust the min-heap after the previous top was consumed.
    // If the segment has no more values, it is simply pop()'ed.
    // If it hasNext, its new key is not necessarily the smallest,
    // so adjustTop() => downHeap() is called to restore the min-heap.
    adjustPriorityQueue(minSegment);
    if (size() == 0) { // every segment has been exhausted
      minSegment = null;
      resetKeyValue();
      return false;
    }
  }
  minSegment = top(); // each call takes the heap top (the minimum)
  long startPos = minSegment.getReader().bytesRead;
  key = minSegment.getKey();
  if (!minSegment.inMemory()) {
    //When we load the value from an inmemory segment, we reset
    //the "value" DIB in this class to the inmem segment's byte[].
    //When we load the value bytes from disk, we shouldn't use
    //the same byte[] since it would corrupt the data in the inmem
    //segment. So we maintain an explicit DIB for value bytes
    //obtained from disk, and if the current segment is a disk
    //segment, we reset the "value" DIB to the byte[] in that (so
    //we reuse the disk segment DIB whenever we consider
    //a disk segment).
    minSegment.getValue(diskIFileValue);
    value.reset(diskIFileValue.getData(), diskIFileValue.getLength());
  } else {
    minSegment.getValue(value);
  }
  long endPos = minSegment.getReader().bytesRead;
  totalBytesProcessed += endPos - startPos;
  mergeProgress.set(totalBytesProcessed * progPerByte);
  return true;
}
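The loop that next() drives (take the segment with the smallest current key, emit one record, advance that segment, fix the heap) can be sketched as a small k-way merge. This is only an illustration under stated assumptions: SketchSegment and MergeSketch are hypothetical names, keys are plain integers, and the heap is java.util.PriorityQueue from the JDK rather than Hadoop's class. Because the JDK queue has no adjustTop(), the sketch removes the top with poll() and re-adds the segment if it still has keys, instead of adjusting the top in place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for a segment: a sorted run of keys plus a cursor.
class SketchSegment {
    private final Iterator<Integer> it;
    private Integer current; // the key this segment is currently positioned on

    SketchSegment(List<Integer> sortedKeys) {
        this.it = sortedKeys.iterator();
    }

    // Advances to the next key; returns false once the run is exhausted.
    boolean nextKey() {
        if (!it.hasNext()) {
            current = null;
            return false;
        }
        current = it.next();
        return true;
    }

    int key() { return current; }
}

class MergeSketch {
    // Merges sorted runs by always consuming from the segment whose
    // current key is smallest.
    static List<Integer> merge(List<List<Integer>> runs) {
        // Heap nodes are segments, ordered by their current key.
        PriorityQueue<SketchSegment> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.key(), b.key()));
        for (List<Integer> run : runs) {
            SketchSegment s = new SketchSegment(run);
            if (s.nextKey()) queue.add(s); // empty runs never enter the heap
        }
        List<Integer> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            SketchSegment min = queue.poll();  // segment holding the smallest key
            out.add(min.key());                // emit one record
            if (min.nextKey()) queue.add(min); // still has keys: back into the heap
            // otherwise the segment is dropped and the queue shrinks by one
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = Arrays.asList(
            Arrays.asList(1, 4, 9),
            Arrays.asList(2, 3, 10),
            Arrays.asList(5, 6));
        System.out.println(merge(runs)); // [1, 2, 3, 4, 5, 6, 9, 10]
    }
}
```

Adjusting the top in place, as the adjustTop() path described above does, needs a single sift-down, whereas poll() followed by add() does a sift-down and then a sift-up; the merged output is the same.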

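Summary: the two core functions of PriorityQueue are upHeap() and downHeap(), and every priority-queue operation is built around them. The "up" in upHeap() names the destination: the element actually travels from the bottom of the heap toward the top. Likewise, the "down" in downHeap() names the destination: the element travels from the top toward the bottom. upHeap() builds the min-heap as elements are inserted; downHeap() re-adjusts the nodes after the top changes so that the structure is a min-heap again.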
