This article briefly explains how to run the MapReduce examples that ship with Hadoop.
It targets Hadoop 2.6.5, whose examples jar is named hadoop-mapreduce-examples-2.6.5.jar and lives under share/hadoop/mapreduce.
In short, to run an example, simply execute:

```shell
hadoop jar hadoop-mapreduce-examples-2.6.5.jar wordcount /input /output
```

assuming the hadoop command has been configured on your PATH.
So how does this execution actually work? Put differently: unpacking the jar shows that a WordCount.class file exists, so would the command still succeed if we used some other name, or capitalized the argument as WordCount?
The answer is: no.
Why not?
Let's analyze the whole execution path carefully. The goal is to understand it well enough to add our own classes to the mapreduce-examples jar, with our own run logic.
The command itself is simple. For the entry point, look at the hadoop launcher script:
```shell
elif [ "$COMMAND" = "jar" ] ; then
  CLASS=org.apache.hadoop.util.RunJar
```
Clearly, the class to look at next is RunJar, and the script spells out exactly where it lives. The jar command effectively executes RunJar's main method:
```java
/**
 * Run a Hadoop job jar. If the main class is not in the jar's manifest,
 * then it must be provided on the command line.
 */
public static void main(String[] args) throws Throwable {
  new RunJar().run(args);
}
```
The Javadoc makes it even clearer: when running some other program, even a jar we built ourselves on a whim, we can simply append a class name after the jar on the command line and get a successful run. Note that the class name must be fully qualified.
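As a sketch of that, here is a minimal class (the name MyMain and its describe helper are hypothetical, not part of Hadoop) that could be packaged into a jar without a Main-Class manifest entry and launched as `hadoop jar my.jar MyMain /input /output`:

```java
public class MyMain {
    // Joins the pass-through arguments, mirroring how RunJar hands
    // everything after the class name to main unchanged.
    static String describe(String[] args) {
        return args.length + " argument(s): " + String.join(" ", args);
    }

    public static void main(String[] args) {
        System.out.println(describe(args));
    }
}
```

Because the jar's manifest defines no Main-Class, RunJar falls back to treating the first argument after the jar as the class to run.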
Next, let's look at RunJar's run method:
```java
public void run(String[] args) throws Throwable {
  String usage = "RunJar jarFile [mainClass] args...";

  if (args.length < 1) {
    System.err.println(usage);
    System.exit(-1);
  }

  int firstArg = 0;
  String fileName = args[firstArg++];
  File file = new File(fileName);
  if (!file.exists() || !file.isFile()) {
    System.err.println("Not a valid JAR: " + file.getCanonicalPath());
    System.exit(-1);
  }

  String mainClassName = null;
  JarFile jarFile;
  try {
    jarFile = new JarFile(fileName);
  } catch (IOException io) {
    throw new IOException("Error opening job jar: " + fileName).initCause(io);
  }

  Manifest manifest = jarFile.getManifest();
  if (manifest != null) {
    mainClassName = manifest.getMainAttributes().getValue("Main-Class");
  }
  jarFile.close();

  if (mainClassName == null) {
    if (args.length < 2) {
      System.err.println(usage);
      System.exit(-1);
    }
    mainClassName = args[firstArg++];
  }
  mainClassName = mainClassName.replaceAll("/", ".");

  File tmpDir = new File(System.getProperty("java.io.tmpdir"));
  ensureDirectory(tmpDir);

  final File workDir;
  try {
    workDir = File.createTempFile("hadoop-unjar", "", tmpDir);
  } catch (IOException ioe) {
    // If user has insufficient perms to write to tmpDir, default
    // "Permission denied" message doesn't specify a filename.
    System.err.println("Error creating temp dir in java.io.tmpdir " + tmpDir
        + " due to " + ioe.getMessage());
    System.exit(-1);
    return;
  }

  if (!workDir.delete()) {
    System.err.println("Delete failed for " + workDir);
    System.exit(-1);
  }
  ensureDirectory(workDir);

  ShutdownHookManager.get().addShutdownHook(new Runnable() {
    @Override
    public void run() {
      FileUtil.fullyDelete(workDir);
    }
  }, SHUTDOWN_HOOK_PRIORITY);

  unJar(file, workDir);

  ClassLoader loader = createClassLoader(file, workDir);

  Thread.currentThread().setContextClassLoader(loader);
  Class<?> mainClass = Class.forName(mainClassName, true, loader);
  Method main = mainClass.getMethod("main", new Class[] {
      Array.newInstance(String.class, 0).getClass() });
  String[] newArgs = Arrays.asList(args)
      .subList(firstArg, args.length).toArray(new String[0]);
  try {
    main.invoke(null, new Object[] { newArgs });
  } catch (InvocationTargetException e) {
    throw e.getTargetException();
  }
}
```
The code is easy to follow end to end: it mainly parses our arguments and unpacks the submitted jar. One passage deserves attention:
```java
Manifest manifest = jarFile.getManifest();
if (manifest != null) {
  mainClassName = manifest.getMainAttributes().getValue("Main-Class");
}
```
Matching what we saw above: if the packaged jar has a MANIFEST file that contains a Main-Class attribute, then the main method of that Main-Class is the one that runs.
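To see this Main-Class lookup in isolation, here is a small sketch (the ManifestDemo class is illustrative, not Hadoop code) that writes a jar containing only a manifest and then reads the attribute back with the same JarFile/Manifest API that RunJar uses:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class ManifestDemo {
    // Writes a jar whose only content is a MANIFEST with the given
    // Main-Class, then reads the attribute back as RunJar does.
    static String roundTrip(String mainClass) throws Exception {
        File jar = File.createTempFile("manifest-demo", ".jar");
        jar.deleteOnExit();

        Manifest mf = new Manifest();
        // Manifest-Version is mandatory; without it the main
        // attributes are silently not written.
        mf.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        mf.getMainAttributes().put(Attributes.Name.MAIN_CLASS, mainClass);
        try (JarOutputStream out =
                 new JarOutputStream(new FileOutputStream(jar), mf)) {
            // no entries needed; the manifest alone is enough
        }

        try (JarFile jf = new JarFile(jar)) {
            Manifest read = jf.getManifest();
            return read == null ? null
                    : read.getMainAttributes().getValue("Main-Class");
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("org.apache.hadoop.examples.ExampleDriver"));
    }
}
```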
```java
Class<?> mainClass = Class.forName(mainClassName, true, loader);
Method main = mainClass.getMethod("main", new Class[] {
    Array.newInstance(String.class, 0).getClass() });
```
The final run logic is as follows:
```java
try {
  main.invoke(null, new Object[] { newArgs });
} catch (InvocationTargetException e) {
  throw e.getTargetException();
}
```
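This reflective launch can be demonstrated standalone. Below is a minimal sketch (the ReflectMain and Target classes are hypothetical) that resolves a class by name, looks up its static main(String[]), and invokes it the way RunJar does:

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

public class ReflectMain {
    // A target class standing in for whatever Main-Class names.
    public static class Target {
        static String last;
        public static void main(String[] args) {
            last = String.join(",", args);
        }
    }

    // Resolves the class, looks up main(String[]), and invokes it.
    // The receiver is null because main is static; the String[] is
    // wrapped in an Object[] so reflection does not unpack it as a
    // varargs list of separate arguments.
    static void launch(String className, String[] args) throws Throwable {
        Class<?> mainClass = Class.forName(className);
        Method main = mainClass.getMethod("main", String[].class);
        try {
            main.invoke(null, new Object[] { args });
        } catch (InvocationTargetException e) {
            // Re-throw the program's own exception, not the wrapper.
            throw e.getTargetException();
        }
    }

    public static void main(String[] args) throws Throwable {
        launch("ReflectMain$Target", new String[] { "/input", "/output" });
        System.out.println(Target.last);
    }
}
```

Unwrapping InvocationTargetException matters: otherwise the user would see a reflection wrapper instead of the exception their own job threw.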
So the launch goes through reflection. Setting that aside for now: according to this code, what runs is the Main-Class defined in the MANIFEST. So, does the examples jar actually contain a MANIFEST file?
It does. For Hadoop 2.6.5, its contents are:
```
Manifest-Version: 1.0
Archiver-Version: Plexus Archiver
Created-By: Apache Maven
Built-By: YYZYHC
Build-Jdk: 1.8.0_131
Main-Class: org.apache.hadoop.examples.ExampleDriver
```
Crystal clear. Next we need to see what this ExampleDriver actually is:
```java
public class ExampleDriver {

  public static void main(String argv[]) {
    int exitCode = -1;
    ProgramDriver pgd = new ProgramDriver();
    try {
      pgd.addClass("wordcount", WordCount.class,
          "A map/reduce program that counts the words in the input files.");
      pgd.addClass("wordmean", WordMean.class,
          "A map/reduce program that counts the average length of the words in the input files.");
      pgd.addClass("wordmedian", WordMedian.class,
          "A map/reduce program that counts the median length of the words in the input files.");
      pgd.addClass("wordstandarddeviation", WordStandardDeviation.class,
          "A map/reduce program that counts the standard deviation of the length of the words in the input files.");
      pgd.addClass("aggregatewordcount", AggregateWordCount.class,
          "An Aggregate based map/reduce program that counts the words in the input files.");
      pgd.addClass("aggregatewordhist", AggregateWordHistogram.class,
          "An Aggregate based map/reduce program that computes the histogram of the words in the input files.");
      pgd.addClass("grep", Grep.class,
          "A map/reduce program that counts the matches of a regex in the input.");
      pgd.addClass("randomwriter", RandomWriter.class,
          "A map/reduce program that writes 10GB of random data per node.");
      pgd.addClass("randomtextwriter", RandomTextWriter.class,
          "A map/reduce program that writes 10GB of random textual data per node.");
      pgd.addClass("sort", Sort.class,
          "A map/reduce program that sorts the data written by the random writer.");
      pgd.addClass("pi", QuasiMonteCarlo.class, QuasiMonteCarlo.DESCRIPTION);
      pgd.addClass("bbp", BaileyBorweinPlouffe.class, BaileyBorweinPlouffe.DESCRIPTION);
      pgd.addClass("distbbp", DistBbp.class, DistBbp.DESCRIPTION);
      pgd.addClass("pentomino", DistributedPentomino.class,
          "A map/reduce tile laying program to find solutions to pentomino problems.");
      pgd.addClass("secondarysort", SecondarySort.class,
          "An example defining a secondary sort to the reduce.");
      pgd.addClass("sudoku", Sudoku.class, "A sudoku solver.");
      pgd.addClass("join", Join.class,
          "A job that effects a join over sorted, equally partitioned datasets");
      pgd.addClass("multifilewc", MultiFileWordCount.class,
          "A job that counts words from several files.");
      pgd.addClass("dbcount", DBCountPageView.class,
          "An example job that count the pageview counts from a database.");
      pgd.addClass("teragen", TeraGen.class, "Generate data for the terasort");
      pgd.addClass("terasort", TeraSort.class, "Run the terasort");
      pgd.addClass("teravalidate", TeraValidate.class, "Checking results of terasort");
      exitCode = pgd.run(argv);
    } catch (Throwable e) {
      e.printStackTrace();
    }
    System.exit(exitCode);
  }
}
```
This code is also easy to read: it builds a ProgramDriver, a class that lives in the hadoop-common module. I won't paste the whole class; here are the parts that ExampleDriver uses:
```java
/**
 * A description of a program based on its class and a human-readable
 * description.
 */
Map<String, ProgramDescription> programs;

public ProgramDriver() {
  programs = new TreeMap<String, ProgramDescription>();
}
```
There is a single constructor, which creates a TreeMap whose keys are Strings and whose values are ProgramDescription objects. ProgramDescription holds nothing fancy: mainly a main Method and a description String.
In ExampleDriver, many classes are added to the ProgramDriver, and finally its run method is called, so next let's look at ProgramDriver's run method:
```java
/**
 * This is a driver for the example programs. It looks at the first command
 * line argument and tries to find an example program with that name. If it
 * is found, it calls the main method in that class with the rest of the
 * command line arguments.
 *
 * @param args The argument from the user. args[0] is the command to run.
 * @return -1 on error, 0 on success
 * @throws NoSuchMethodException
 * @throws SecurityException
 * @throws IllegalAccessException
 * @throws IllegalArgumentException
 * @throws Throwable Anything thrown by the example program's main
 */
public int run(String[] args) throws Throwable {
  // Make sure they gave us a program name.
  if (args.length == 0) {
    System.out.println("An example program must be given as the"
        + " first argument.");
    printUsage(programs);
    return -1;
  }

  // And that it is good.
  ProgramDescription pgm = programs.get(args[0]);
  if (pgm == null) {
    System.out.println("Unknown program '" + args[0] + "' chosen.");
    printUsage(programs);
    return -1;
  }

  // Remove the leading argument and call main
  String[] new_args = new String[args.length - 1];
  for (int i = 1; i < args.length; ++i) {
    new_args[i - 1] = args[i];
  }
  pgm.invoke(new_args);
  return 0;
}
```
Here it is very clear: the first argument selects the corresponding class, which is then executed.
Likewise, the wordcount we typed is really just a key in the TreeMap above. It must be specified exactly for the program to run, which is why WordCount with a capital W fails.
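The exact-match dispatch can be sketched with a stripped-down driver (MiniDriver is illustrative, and uses Consumer&lt;String[]&gt; in place of ProgramDescription's reflective Method):

```java
import java.util.TreeMap;
import java.util.function.Consumer;

public class MiniDriver {
    // Program name -> main entry point, as in ProgramDriver's TreeMap.
    private final TreeMap<String, Consumer<String[]>> programs = new TreeMap<>();

    void addProgram(String name, Consumer<String[]> main) {
        programs.put(name, main);
    }

    // Looks up args[0] exactly; "WordCount" will not match "wordcount",
    // which is why the example names are case-sensitive.
    int run(String[] args) {
        if (args.length == 0 || !programs.containsKey(args[0])) {
            return -1; // no program name, or unknown program
        }
        // Drop the leading program name, pass the rest through.
        String[] rest = new String[args.length - 1];
        System.arraycopy(args, 1, rest, 0, rest.length);
        programs.get(args[0]).accept(rest);
        return 0;
    }
}
```

A TreeMap is used rather than a HashMap so that the usage listing printed on error comes out in alphabetical order.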
To sum up, there are the following ways to run your own code alongside Hadoop's built-in mapreduce-examples:
- Change the packaging: remove the Main-Class entry from the MANIFEST, or point Main-Class at the name you want.
- For a jar whose MANIFEST defines no Main-Class, simply pass your own fully qualified class name after the jar, which is the quickest option.