我从网上拉过来一张图,加上自己的标注来详细记录一下该过程,以便后期优化代码做一个记录
mapreduce整个执行过程如下如所示
其中1、2、3、4....是我自己加上的以便一步一步来分析,下面我们来根据源代码分析这一步一步的过程,在此我跟踪的源代码是 hadoop-1.2.1 版本
1:inputSplit 这个过程我们看JobClient 类的writeNewSplits 方法,此方法为根据获得到的输入文件,将文件分块放入map中,关键的一句代码
List<InputSplit> splits = input.getSplits(job);
我们跟踪进去FileInputFormat 的 getSplits方法
long blockSize = file.getBlockSize(); long splitSize = computeSplitSize(blockSize, minSize, maxSize); long bytesRemaining = length; while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) { int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining); splits.add(new FileSplit(path, length-bytesRemaining, splitSize, blkLocations[blkIndex].getHosts())); bytesRemaining -= splitSize; } if (bytesRemaining != 0) { splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining, blkLocations[blkLocations.length-1].getHosts())); }
这段代码也是划分文件块的关键所在,首先获得文件块大小,在配置 block.size中可以配置,默认为64MB,然后计算出split的大小,默认也是64MB(可跟踪computeSplitSize方法查看原因),然后开始划分文件(while循环),比如64M文件则默认为一个块,65m文件则为两个块。
2:这一步自不必说,就是map计算的过程
3:这一步中,map计算的输出结果首先是写到一个缓冲区中,当缓冲区数据大小超过一定阀值之后,则进行spill 溢写操作,即将缓冲区中的数据写入到本地磁盘,在此过程中还根据key的hash值进行键值对的排序和合并操作,核心实现代码在MapTask的MapOutputBuffer 类中的collect方法,该方法主要用于收集map输出数据并写入缓冲区,当缓冲区超出临界值则开启溢写线程。
final boolean kvsoftlimit = ((kvnext > kvend) ? kvnext - kvend > softRecordLimit : kvend - kvnext <= kvoffsets.length - softRecordLimit); if (kvstart == kvend && kvsoftlimit) { LOG.info("Spilling map output: record full = " + kvsoftlimit); startSpill(); }
4:这一步是溢写的过程,在这个过程中还进行partition和sort
该Thread会检查内存中的输出缓存区,在满足一定条件的时候将缓冲区中的内容spill到硬盘上。这是一个标准的生产者-消费者模型,MapTask的collect方法是生产者,spillThread是消费者,它们之间同步是通过spillLock(ReentrantLock)和spillLock上的两个条件变量(spillDone和spillReady)完成的。当kvstart == kvend条件成立时,表示没有要spill的记录。
protected class SpillThread extends Thread { @Override public void run() { spillLock.lock(); spillThreadRunning = true; try { while (true) { spillDone.signal(); while (kvstart == kvend) { spillReady.await(); } try { spillLock.unlock(); //此处是进行排序溢写的核心方法 sortAndSpill(); } catch (Exception e) { sortSpillException = e; } catch (Throwable t) { sortSpillException = t; String logMsg = "Task " + getTaskID() + " failed : " + StringUtils.stringifyException(t); reportFatalError(getTaskID(), t, logMsg); } finally { spillLock.lock(); if (bufend < bufindex && bufindex < bufstart) { bufvoid = kvbuffer.length; } kvstart = kvend; bufstart = bufend; } } } catch (InterruptedException e) { Thread.currentThread().interrupt(); } finally { spillLock.unlock(); spillThreadRunning = false; } } }
SpillThread线程的run方法中调用sortAndSpill把缓存中的输出写到格式为+ '/spill' + spillNumber + '.out'的spill文件中。索引(kvindices)保持在spill{spill号}.out.index中,数据保存在spill{spill号}.out中 创建SpillRecord记录,输出文件和IndexRecord记录,然后,需要在kvoffsets上做排序,排完序后顺序访问kvoffsets,也就是按partition顺序访问记录。按partition循环处理排完序的数组,如果没有combiner,则直接输出记录,否则,调用combineAndSpill,先做combin然后输出。循环的最后记录IndexRecord到SpillRecord。
private void sortAndSpill() throws IOException, ClassNotFoundException, InterruptedException { //approximate the length of the output file to be the length of the //buffer + header lengths for the partitions long size = (bufend >= bufstart ? bufend - bufstart : (bufvoid - bufend) + bufstart) + partitions * APPROX_HEADER_LENGTH; FSDataOutputStream out = null; try { // create spill file final SpillRecord spillRec = new SpillRecord(partitions); final Path filename = mapOutputFile.getSpillFileForWrite(numSpills, size); //创建溢出文件 格式为+ '/spill' + spillNumber + '.out out = rfs.create(filename); final int endPosition = (kvend > kvstart) ? kvend : kvoffsets.length + kvend; sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter); int spindex = kvstart; IndexRecord rec = new IndexRecord(); InMemValBytes value = new InMemValBytes(); for (int i = 0; i < partitions; ++i) { IFile.Writer<K, V> writer = null; try { long segmentStart = out.getPos(); writer = new Writer<K, V>(job, out, keyClass, valClass, codec, spilledRecordsCounter); if (combinerRunner == null) {//如果为空则直接write // spill directly DataInputBuffer key = new DataInputBuffer(); while (spindex < endPosition && kvindices[kvoffsets[spindex % kvoffsets.length] + PARTITION] == i) { final int kvoff = kvoffsets[spindex % kvoffsets.length]; getVBytesForOffset(kvoff, value); key.reset(kvbuffer, kvindices[kvoff + KEYSTART], (kvindices[kvoff + VALSTART] - kvindices[kvoff + KEYSTART])); writer.append(key, value); ++spindex; } } else {如果不为空则先combiner在排序再输出 int spstart = spindex; while (spindex < endPosition && kvindices[kvoffsets[spindex % kvoffsets.length] + PARTITION] == i) { ++spindex; } // Note: we would like to avoid the combiner if we've fewer // than some threshold of records for a partition if (spstart != spindex) { combineCollector.setWriter(writer); RawKeyValueIterator kvIter = new MRResultIterator(spstart, spindex); combinerRunner.combine(kvIter, combineCollector); } } // close the writer writer.close(); // record offsets rec.startOffset = segmentStart; rec.rawLength = writer.getRawLength(); rec.partLength = writer.getCompressedLength(); spillRec.putIndex(rec, i); writer = null; } finally { if (null != writer) writer.close(); } } if (totalIndexCacheMemory >= INDEX_CACHE_MEMORY_LIMIT) { // create spill index file Path indexFilename = mapOutputFile.getSpillIndexFileForWrite(numSpills, partitions * MAP_OUTPUT_INDEX_RECORD_LENGTH); spillRec.writeToFile(indexFilename, job); } else { indexCacheList.add(spillRec); totalIndexCacheMemory += spillRec.size() * MAP_OUTPUT_INDEX_RECORD_LENGTH; } LOG.info("Finished spill " + numSpills); ++numSpills; } finally { if (out != null) out.close(); } }
注:在此系统存放文件的方式使用的是二级索引,在此没有做研究。
5:在此步骤将磁盘中的多个map输出文件具有相同key 的进行合并
核心代码
@SuppressWarnings("unchecked") RawKeyValueIterator kvIter = Merger.merge(job, rfs, keyClass, valClass, codec, segmentList, job.getInt("io.sort.factor", 100), new Path(mapId.toString()), job.getOutputKeyComparator(), reporter, null, spilledRecordsCounter); //write merged output to disk long segmentStart = finalOut.getPos(); Writer<K, V> writer = new Writer<K, V>(job, finalOut, keyClass, valClass, codec, spilledRecordsCounter); if (combinerRunner == null || numSpills < minSpillsForCombine) { Merger.writeFile(kvIter, writer, reporter, job); } else { combineCollector.setWriter(writer); combinerRunner.combine(kvIter, combineCollector); } //close writer.close();
6:在这一步中reduce任务通过http方式将map的输出结果复制到reduce执行的节点上来,在此开始关注 RecudeTask 类中方法
URL url = mapOutputLoc.getOutputLocation(); HttpURLConnection connection = (HttpURLConnection)url.openConnection(); //通过http方式拉取map输出数据 InputStream input = setupSecureConnection(mapOutputLoc, connection);
下面的核心代码是对reduce输入数据进行混淆,涉及到的操作类似map段 进行合并和排序
MapOutput mapOutput = null; if (shuffleInMemory) {//判断混淆能在缓存中进行(此种方式效率比较高) if (LOG.isDebugEnabled()) { LOG.debug("Shuffling " + decompressedLength + " bytes (" + compressedLength + " raw bytes) " + "into RAM from " + mapOutputLoc.getTaskAttemptId()); } mapOutput = shuffleInMemory(mapOutputLoc, connection, input, (int)decompressedLength, (int)compressedLength); } else { if (LOG.isDebugEnabled()) { LOG.debug("Shuffling " + decompressedLength + " bytes (" + compressedLength + " raw bytes) " + "into Local-FS from " + mapOutputLoc.getTaskAttemptId()); } mapOutput = shuffleToDisk(mapOutputLoc, input, filename, compressedLength); } mapOutput.decompressedSize = decompressedLength; return mapOutput;
7:这一步进行merge 合并数据 不做过多的关注了
8和9是reduce真正运行的逻辑过程并将最终结果输出
上述步骤中涉及到的混淆 shuffle过程为3、4、5、6、7,优化方面有很多方式,其中最主要的优化面就是reduce通过http获取map输出的过程。