YARN's DistributedShell source code analysis

Preparation

Hadoop officially provides two example YARN applications. This article works with distributedshell, which is found under
hadoop-2.7.6-src\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-applications
Copy this module to your workspace, then import it into IDEA.

When importing, keep every option at its default. If you don't understand an option, don't change it; even the project name should not be changed casually.


Run mvn package -Dmaven.test.skip=true to skip the tests and package the module directly. I uploaded the resulting jar to yarn-demo in my home directory and ran the version I had packaged myself.

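Related reading: "Maven skips unit tests: the difference between maven.test.skip and skipTests". In short, maven.test.skip skips both compiling and running the tests, while skipTests still compiles them but does not run them.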

$HADOOP_HOME/bin/hadoop jar ~/yarn-demo/hadoop-yarn-applications-distributedshell-2.7.6.jar \
org.apache.hadoop.yarn.applications.distributedshell.Client \
   --jar ~/yarn-demo/hadoop-yarn-applications-distributedshell-2.7.6.jar \
   --shell_command "touch /tmp/hello-world" \
   --num_containers 3 \
   --container_memory 350 \
   --master_memory 350 \
   --priority 10

After a successful run, you can find the newly created file /tmp/hello-world. Once this preparation has confirmed that both the development environment and the runtime environment work, you can move on to the code with confidence.

Code analysis

There is a lot of code and my expertise is limited, so only part of it is analyzed here. Treat this as a starting point for your own reading.

Client

Start with the main method of the main class that was submitted.

  public static void main(String[] args) {
    boolean result = false;
    // Note how the try-catch blocks pair up: the outer try-catch starts here
    try {
      Client client = new Client();
      LOG.info("Initializing Client");
      // The inner try-catch starts here
      try {
        boolean doRun = client.init(args); // initialize
        if (!doRun) {
          System.exit(0);
        }
      } catch (IllegalArgumentException e) { // invalid arguments: exit
        System.err.println(e.getLocalizedMessage());
        client.printUsage();
        System.exit(-1);
      }
      // The inner try-catch ends here
      result = client.run(); // run
    } catch (Throwable t) { // error while running: exit
      LOG.fatal("Error running Client", t);
      System.exit(1);
    }
    // The outer try-catch ends here
    // The run has ended, but not necessarily successfully; check the result below
    if (result) {
      LOG.info("Application completed successfully"); // success
      System.exit(0);
    }
    LOG.error("Application failed to complete successfully");
    System.exit(2);
  }
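
The nested try-catch above maps each outcome to a different exit code. As a minimal sketch of that mapping (not the Hadoop code itself), the version below returns the code instead of calling System.exit, so the flow can be exercised directly. ClientSteps and ExitCodeSketch are hypothetical names introduced only for this illustration.

```java
/** Hypothetical stand-in for the two Client steps that main() calls. */
interface ClientSteps {
    boolean init(String[] args);     // false means "nothing to run"
    boolean run() throws Exception;  // true means the application succeeded
}

final class ExitCodeSketch {
    /** Mirrors the control flow of Client.main, returning the exit code. */
    static int exitCodeFor(ClientSteps client, String[] args) {
        boolean result = false;
        try {
            try {
                if (!client.init(args)) {
                    return 0;        // init decided there is nothing to run
                }
            } catch (IllegalArgumentException e) {
                return -1;           // invalid arguments (usage is printed in the original)
            }
            result = client.run();
        } catch (Throwable t) {
            return 1;                // any error raised while running
        }
        return result ? 0 : 2;       // the run ended; 2 means it did not succeed
    }
}
```

Seen this way, the four outcomes are: 0 for success or "nothing to run", -1 for bad arguments, 1 for an exception during the run, and 2 for a run that finished without succeeding. On POSIX shells an exit status of -1 is reported as 255.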

The two methods that matter here are init() and run(), which are analyzed below.

The constructor is also important, but it mainly prepares a long list of parameters and binds the ApplicationMaster, so reading that pile of parameters directly may not teach you much.

init()

The code is long, so only the main logic is shown. It is easy to see that it parses the command-line arguments, validates them, and assigns them to the object's fields.

public boolean init(String[] args) throws ParseException {
    CommandLine cliParser = new GnuParser().parse(opts, args);
    if (args.length == 0) {throw new IllegalArgumentException("No args specified for client to initialize");}
    // Options handled by the client itself
    if (cliParser.hasOption("log_properties")) {...}
    if (cliParser.hasOption("help"))  {...}
    if (cliParser.hasOption("debug")) {...}
    if (cliParser.hasOption("keep_containers_across_application_attempts")) {...}
    // Application name and ApplicationMaster-related options
    appName = cliParser.getOptionValue("appname", "DistributedShell");
    amPriority = Integer.parseInt(cliParser.getOptionValue("priority", "0"));
    amQueue = cliParser.getOptionValue("queue", "default");
    amMemory = Integer.parseInt(cliParser.getOptionValue("master_memory", "10"));
    amVCores = Integer.parseInt(cliParser.getOptionValue("master_vcores", "1"));
    if (amMemory < 0) {...}
    if (amVCores < 0) {...}
    // The jar
    if (!cliParser.hasOption("jar")) {...}
    appMasterJar = cliParser.getOptionValue("jar");
    // Shell-related options
    if (!cliParser.hasOption("shell_command") && !cliParser.hasOption("shell_script")) {
      throw new IllegalArgumentException(
          "No shell command or shell script specified to be executed by application master");
    } else if (cliParser.hasOption("shell_command") && cliParser.hasOption("shell_script")) {
      throw new IllegalArgumentException("Can not specify shell_command option " +
          "and shell_script option at the same time");
    } else if (cliParser.hasOption("shell_command")) {
      shellCommand = cliParser.getOptionValue("shell_command");
    } else {
      shellScriptPath = cliParser.getOptionValue("shell_script");
    }
    if (cliParser.hasOption("shell_args")) {shellArgs = cliParser.getOptionValues("shell_args");}
    if (cliParser.hasOption("shell_env")) { ... }
    shellCmdPriority = Integer.parseInt(cliParser.getOptionValue("shell_cmd_priority", "0"));
    // Container-related options
    containerMemory = Integer.parseInt(cliParser.getOptionValue("container_memory", "10"));
    containerVirtualCores = Integer.parseInt(cliParser.getOptionValue("container_vcores", "1"));
    numContainers = Integer.parseInt(cliParser.getOptionValue("num_containers", "1"));
    if (containerMemory < 0 || containerVirtualCores < 0 || numContainers < 1) {...}
    // Node label
    nodeLabelExpression = cliParser.getOptionValue("node_label_expression", null);
    // Timeout
    clientTimeout = Integer.parseInt(cliParser.getOptionValue("timeout", "600000"));
    // Validity interval used when counting failed attempts
    attemptFailuresValidityInterval = Long.parseLong(cliParser.getOptionValue("attempt_failures_validity_interval", "-1"));
    log4jPropFile = cliParser.getOptionValue("log_properties", ""); // logging configuration
    // Get timeline domain options
    if (cliParser.hasOption("domain")) {...}
    return true;
  }
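
The pattern in init() is "read an option with a default, then validate it". The sketch below reproduces that pattern with the standard library only, so it runs without commons-cli. InitSketch and its three helper methods are hypothetical names for illustration, and the tiny "--name value" parser is an assumption that stands in for GnuParser; it is not how the real parser works internally.

```java
import java.util.HashMap;
import java.util.Map;

final class InitSketch {
    /** Collects "--name value" pairs into a map (hypothetical stand-in for commons-cli). */
    static Map<String, String> parse(String[] args) {
        if (args.length == 0) {
            throw new IllegalArgumentException("No args specified for client to initialize");
        }
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("Every option needs a value");
        }
        Map<String, String> opts = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("--")) {
                throw new IllegalArgumentException("Expected an option, got: " + args[i]);
            }
            opts.put(args[i].substring(2), args[i + 1]);
        }
        return opts;
    }

    /** Reads an integer option and falls back to a default, as init() does. */
    static int intOption(Map<String, String> opts, String name, int defaultValue) {
        return Integer.parseInt(opts.getOrDefault(name, String.valueOf(defaultValue)));
    }

    /** Applies the same rule as init(): exactly one of shell_command / shell_script. */
    static String shellSource(Map<String, String> opts) {
        boolean hasCommand = opts.containsKey("shell_command");
        boolean hasScript = opts.containsKey("shell_script");
        if (!hasCommand && !hasScript) {
            throw new IllegalArgumentException("No shell command or shell script specified");
        }
        if (hasCommand && hasScript) {
            throw new IllegalArgumentException(
                "Can not specify shell_command and shell_script at the same time");
        }
        return hasCommand ? opts.get("shell_command") : opts.get("shell_script");
    }
}
```

For example, parsing --shell_command date --num_containers 3 and then calling intOption(opts, "container_memory", 10) falls back to the default of 10, just as the command in the preparation step would have done had it not passed --container_memory 350.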


run()

The run method is even longer, so let's go through it piece by piece.

    // The client is starting
    LOG.info("Running Client");
    yarnClient.start();
    // Get the YARN cluster metrics and log them
    YarnClusterMetrics clusterMetrics = yarnClient.getYarnClusterMetrics();
    LOG.info("Got Cluster metric info from ASM" 
        + ", numNodeManagers=" + clusterMetrics.getNumNodeManagers());
    // Get the node reports and log them
    List<NodeReport> clusterNodeReports = yarnClient.getNodeReports(
        NodeState.RUNNING);
    LOG.info("Got Cluster node info from ASM");
    for (NodeReport node : clusterNodeReports) {
      LOG.info("Got node report from ASM for"
          + ", nodeId=" + node.getNodeId() 
          + ", nodeAddress" + node.getHttpAddress()
          + ", nodeRackName" + node.getRackName()
          + ", nodeNumContainers" + node.getNumContainers());
    }
    // Get the queue information and log it
    QueueInfo queueInfo = yarnClient.getQueueInfo(this.amQueue);
    LOG.info("Queue info"
        + ", queueName=" + queueInfo.getQueueName()
        + ", queueCurrentCapacity=" + queueInfo.getCurrentCapacity()
        + ", queueMaxCapacity=" + queueInfo.getMaximumCapacity()
        + ", queueApplicationCount=" + queueInfo.getApplications().size()
        + ", queueChildQueueCount=" + queueInfo.getChildQueues().size());		

This information is easy to find in the log of the run above. The method then sets many more parameters in the same way, such as the AM container settings below (the rest is omitted).

    // Set up the container launch context for the application master
    ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
      localResources, env, commands, null, null, null);
      ...
    appContext.setAMContainerSpec(amContainer);

Once these parameters are recorded in appContext, it is submitted to YARN.

    LOG.info("Submitting application to ASM");

    yarnClient.submitApplication(appContext); // submit

    // TODO
    // Try submitting the same request again
    // app submission failure?

    // Monitor the application
    return monitorApplication(appId);

After submission, the client still has to monitor the application's execution state and act on each state accordingly.

 private boolean monitorApplication(ApplicationId appId)
      throws YarnException, IOException {

    while (true) {

      // Check app status every 1 second.
      try {
        Thread.sleep(1000);
      } catch (InterruptedException e) {
        LOG.debug("Thread sleep in monitoring loop interrupted");
      }

      // Get application report for the appId we are interested in 
      ApplicationReport report = yarnClient.getApplicationReport(appId);

      ... // log the report: Got application report from ASM for ...

      YarnApplicationState state = report.getYarnApplicationState();
      FinalApplicationStatus dsStatus = report.getFinalApplicationStatus();
      if (YarnApplicationState.FINISHED == state) {
        if (FinalApplicationStatus.SUCCEEDED == dsStatus) {
          LOG.info("Application has completed successfully. Breaking monitoring loop");
          return true;        
        }
        else {
          LOG.info("Application did finished unsuccessfully."
              + " YarnState=" + state.toString() + ", DSFinalStatus=" + dsStatus.toString()
              + ". Breaking monitoring loop");
          return false;
        }			  
      }
      else if (YarnApplicationState.KILLED == state	|| YarnApplicationState.FAILED == state) {...}			
      
      if (System.currentTimeMillis() > (clientStartTime + clientTimeout)) {...}
    }			

  }

Looking back at the earlier log, it is easy to see that the block of heavily repeated output corresponds to the report logged here, printed once per second.
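
The monitoring loop is a plain poll-until-terminal-or-timeout pattern. Below is a minimal sketch of it using only the standard library. MonitorSketch, its simplified State enum, and the Supplier that stands in for yarnClient.getApplicationReport are assumptions made for illustration. Unlike the original, this sketch stops when the thread is interrupted rather than logging and continuing.

```java
import java.util.function.Supplier;

final class MonitorSketch {
    /** Simplified states; the real code combines YarnApplicationState and FinalApplicationStatus. */
    enum State { RUNNING, FINISHED_OK, FINISHED_BAD, KILLED, FAILED }

    /**
     * Polls the state source until it reports a terminal state or the timeout elapses.
     * Returns true only when the application finished successfully.
     */
    static boolean monitor(Supplier<State> reportSource, long pollMillis, long timeoutMillis) {
        long start = System.currentTimeMillis();
        while (true) {
            try {
                Thread.sleep(pollMillis);            // the original sleeps 1000 ms
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // keep the interrupt visible to callers
                return false;
            }
            State state = reportSource.get();
            if (state == State.FINISHED_OK) {
                return true;
            }
            if (state != State.RUNNING) {
                return false;                        // finished badly, killed, or failed
            }
            if (System.currentTimeMillis() > start + timeoutMillis) {
                return false;                        // give up once the client timeout has passed
            }
        }
    }
}
```

Because a report is fetched on every pass, a long-running application produces one report per poll interval, which is why the log shows the same message repeated.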

ApplicationMaster

This class is more complicated. It has three inner classes: two callback handler classes and one Runnable that is run on multiple threads.

This is one of the reasons the official Hadoop documentation stresses that YARN applications should be written by experienced developers. According to the official documentation, three interactions deserve attention:

  • Client<-->ResourceManager
    This is the Client analyzed above.
  • ApplicationMaster<-->ResourceManager
    This corresponds to the AM's inner class RMCallbackHandler, which implements the AMRMClientAsync.CallbackHandler interface.
  • ApplicationMaster<-->NodeManager
    This corresponds to the AM's inner class NMCallbackHandler, which implements the NMClientAsync.CallbackHandler interface.
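
To make the "callback handler plus Runnable" structure concrete, here is a heavily reduced sketch using only the standard library. It is not the YARN API: AllocationCallback is a hypothetical two-method interface that stands in for the RM-side callback interface, container ids are plain strings, and the Runnable only increments a counter where a real launcher would start a container.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class CallbackSketch {
    /** Hypothetical, much-reduced stand-in for an RM callback interface. */
    interface AllocationCallback {
        void onContainersAllocated(List<String> containerIds);
        void onContainersCompleted(List<String> containerIds);
    }

    /** Reacts to allocations by launching each container on its own thread. */
    static final class Handler implements AllocationCallback {
        final AtomicInteger launched = new AtomicInteger();
        final AtomicInteger completed = new AtomicInteger();
        private final List<Thread> launchThreads = new ArrayList<>();

        @Override
        public void onContainersAllocated(List<String> containerIds) {
            for (String id : containerIds) {
                // The lambda is the Runnable; it plays the role of the per-container launcher.
                Thread t = new Thread(() -> launched.incrementAndGet(), "launch-" + id);
                launchThreads.add(t);
                t.start();
            }
        }

        @Override
        public void onContainersCompleted(List<String> containerIds) {
            completed.addAndGet(containerIds.size());
        }

        /** Waits for the launcher threads, then reports whether every container completed. */
        boolean finish(int expectedContainers) {
            for (Thread t : launchThreads) {
                try {
                    t.join();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return completed.get() == expectedContainers;
        }
    }
}
```

The counters are AtomicInteger because the callbacks and the launcher threads touch them concurrently. The same concern, scaled up, is a large part of what makes the real ApplicationMaster harder to read than the Client.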

Again, start with the main method, which turns out to be very similar to the Client's.

 public static void main(String[] args) {
    boolean result = false;
    try {
      ApplicationMaster appMaster = new ApplicationMaster();
      LOG.info("Initializing ApplicationMaster");
      boolean doRun = appMaster.init(args); // initialize
      if (!doRun) {
        System.exit(0);
      }
      appMaster.run(); // run
      result = appMaster.finish(); // wait for execution to end and get the result
    } catch (Throwable t) {
      LOG.fatal("Error running ApplicationMaster", t);
      LogManager.shutdown();
      ExitUtil.terminate(1, t);
    }
    if (result) {
      LOG.info("Application Master completed successfully. exiting");
      System.exit(0);
    } else {
      LOG.info("Application Master failed. exiting");
      System.exit(2);
    }
  }

RMCallbackHandler

NMCallbackHandler

To learn more, I recommend reading "YARN programming examples: distributedshell source code analysis".


Origin blog.csdn.net/weixin_44112790/article/details/112728406