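Yarn Source Code Notes, Prequel: NodeManager (Part 2), Service Startup

The previous post covered NodeManager initialization; this one walks through the code that starts its services: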

@Override
	protected void serviceStart() throws Exception {
		try {
			doSecureLogin();
		} catch (IOException e) {
			throw new YarnRuntimeException("Failed NodeManager login", e);
		}
		super.serviceStart();
	}

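It looks trivial, but the call to super.serviceStart() actually takes every service that was added to serviceList during initialization and runs serviceStart on each of them in turn. We go through them in the order they were added.

The first one: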
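The start-in-insertion-order behaviour can be sketched in a few lines. This is only an illustration: SimpleService and SimpleCompositeService are hypothetical stand-ins, not the real Hadoop classes, and the service names are just labels.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a YARN service; not the real Hadoop API. */
interface SimpleService {
    String name();
    void start();
}

/** Starts child services in the order they were added. */
final class SimpleCompositeService {
    private final List<SimpleService> services = new ArrayList<>();
    private final List<String> startOrder = new ArrayList<>();

    void addService(SimpleService s) {
        services.add(s);
    }

    /** Walks the list front to back, so the last service added starts last. */
    void start() {
        for (SimpleService s : services) {
            s.start();
            startOrder.add(s.name());
        }
    }

    List<String> startOrder() {
        return startOrder;
    }

    /** Builds a service whose start() does nothing; only its name matters here. */
    static SimpleService named(String name) {
        return new SimpleService() {
            @Override public String name() { return name; }
            @Override public void start() { /* no-op for the sketch */ }
        };
    }

    /** Adds three services and returns the order in which they were started. */
    static List<String> demo() {
        SimpleCompositeService nm = new SimpleCompositeService();
        nm.addService(named("DeletionService"));
        nm.addService(named("ContainerManager"));
        nm.addService(named("NodeStatusUpdater"));
        nm.start();
        return nm.startOrder();
    }
}
```

Calling SimpleCompositeService.demo() returns the names in the same order in which addService was called, which is the property the rest of this post relies on.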

DeletionService del = createDeletionService(exec);
		addService(del);

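This is the scheduled-deletion service. It has no serviceStart of its own, because its initialization already creates the scheduled thread pool that does the work: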

@Override
	protected void serviceInit(Configuration conf) throws Exception {
		ThreadFactory tf = new ThreadFactoryBuilder().setNameFormat("DeletionService #%d").build();
		if (conf != null) {
			sched = new ScheduledThreadPoolExecutor(conf.getInt(YarnConfiguration.NM_DELETE_THREAD_COUNT,
					YarnConfiguration.DEFAULT_NM_DELETE_THREAD_COUNT), tf);
			debugDelay = conf.getInt(YarnConfiguration.DEBUG_NM_DELETE_DELAY_SEC, 0);
		} else {
			sched = new ScheduledThreadPoolExecutor(YarnConfiguration.DEFAULT_NM_DELETE_THREAD_COUNT, tf);
		}
		sched.setExecuteExistingDelayedTasksAfterShutdownPolicy(false);
		sched.setKeepAliveTime(60L, SECONDS);
		if (stateStore.canRecover()) {
			recover(stateStore.loadDeletionServiceState());
		}
		super.serviceInit(conf);
	}
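To make the role of that pool concrete, here is a minimal sketch that schedules one delayed task on a ScheduledThreadPoolExecutor with named threads. It uses only the JDK: the lambda ThreadFactory is a plain replacement for Guava's ThreadFactoryBuilder, and DelayedDeletionSketch and runOnce are hypothetical names, not part of DeletionService.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class DelayedDeletionSketch {
    /**
     * Schedules one delayed task and returns the name of the pool thread
     * that ran it, or null if it did not run within five seconds.
     */
    static String runOnce(long delayMillis) {
        AtomicInteger counter = new AtomicInteger();
        // Plain-JDK equivalent of setNameFormat("DeletionService #%d").
        ThreadFactory tf = r -> new Thread(r, "DeletionService #" + counter.getAndIncrement());
        ScheduledThreadPoolExecutor sched = new ScheduledThreadPoolExecutor(1, tf);
        // Same policy as in serviceInit: drop pending delayed tasks on shutdown.
        sched.setExecuteExistingDelayedTasksAfterShutdownPolicy(false);

        String[] threadName = new String[1];
        CountDownLatch done = new CountDownLatch(1);
        // Stand-in for a deletion task; the real service would remove files here.
        sched.schedule(() -> {
            threadName[0] = Thread.currentThread().getName();
            done.countDown();
        }, delayMillis, TimeUnit.MILLISECONDS);

        try {
            boolean finished = done.await(5, TimeUnit.SECONDS);
            return finished ? threadName[0] : null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            sched.shutdownNow();
        }
    }
}
```

Because the pool exists as soon as initialization finishes, tasks can be scheduled on it without any extra start step, which is consistent with the class not needing its own serviceStart.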

Next, look at this part:

nodeHealthChecker = new NodeHealthCheckerService();
		addService(nodeHealthChecker);
		dirsHandler = nodeHealthChecker.getDiskHandler();

Its initialization is already done, but the class defines no serviceStart method of its own. Where it is actually used is here:

		nodeStatusUpdater = createNodeStatusUpdater(context, dispatcher, nodeHealthChecker);

Here is the serviceStart method of nodeStatusUpdater:

		// NodeManager is the last service to start, so NodeId is available.
		this.nodeId = this.context.getNodeId();
		this.httpPort = this.context.getHttpPort();
		this.nodeManagerVersionId = YarnVersionInfo.getVersion();
		try {
			// Registration has to be in start so that ContainerManager can get the
			// perNM tokens needed to authenticate ContainerTokens.
			this.resourceTracker = getRMClient();
			registerWithRM();
			super.serviceStart();
			startStatusUpdater();
		} catch (Exception e) {
			String errorMessage = "Unexpected error starting NodeStatusUpdater";
			LOG.error(errorMessage, e);
			throw new YarnRuntimeException(e);
		}
	

Reading this code in isolation, nodeId and httpPort do not appear to have been set anywhere yet, which looks like a bug. Is it? No. Look at this:

addService(nodeStatusUpdater);
		((NMContext) context).setNodeStatusUpdater(nodeStatusUpdater);

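nodeStatusUpdater is the last service added to the service list, so it is also initialized and started last. We therefore skip it for now and look at the other services first: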

NodeResourceMonitor nodeResourceMonitor = createNodeResourceMonitor();
		addService(nodeResourceMonitor);

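This code is still puzzling: it seems to do nothing. There is no initialization logic and no custom serviceStart method, so it just falls back to the default behaviour. Moving on: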

containerManager = createContainerManager(context, exec, del, nodeStatusUpdater, this.aclsManager, dirsHandler);
		addService(containerManager);

Here is the serviceStart method of containerManager:

final InetSocketAddress initialAddress = conf.getSocketAddr(YarnConfiguration.NM_BIND_HOST,
				YarnConfiguration.NM_ADDRESS, YarnConfiguration.DEFAULT_NM_ADDRESS, YarnConfiguration.DEFAULT_NM_PORT);
		boolean usingEphemeralPort = (initialAddress.getPort() == 0);
		if (context.getNMStateStore().canRecover() && usingEphemeralPort) {
			throw new IllegalArgumentException("Cannot support recovery with an "
					+ "ephemeral server port. Check the setting of " + YarnConfiguration.NM_ADDRESS);
		}
		// If recovering then delay opening the RPC service until the recovery
		// of resources and containers have completed, otherwise requests from
		// clients during recovery can interfere with the recovery process.
		final boolean delayedRpcServerStart = context.getNMStateStore().canRecover();

		Configuration serverConf = new Configuration(conf);

		// always enforce it to be token-based.
		serverConf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION,
				SaslRpcServer.AuthMethod.TOKEN.toString());

		YarnRPC rpc = YarnRPC.create(conf);

		server = rpc.getServer(ContainerManagementProtocol.class, this, initialAddress, serverConf,
				this.context.getNMTokenSecretManager(), conf.getInt(YarnConfiguration.NM_CONTAINER_MGR_THREAD_COUNT,
						YarnConfiguration.DEFAULT_NM_CONTAINER_MGR_THREAD_COUNT));

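Clearly this code sets up an RPC server. Note that the default IPC port is 0. Port 0 does not make startup fail by itself: it asks the operating system to assign a free (ephemeral) port at bind time. The one case that does fail is shown above: if the state store can recover and the port is ephemeral, serviceStart throws IllegalArgumentException. So configure a fixed port in NM_ADDRESS when recovery is enabled or when you need a predictable address: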

/** address of node manager IPC. */
	public static final String NM_ADDRESS = NM_PREFIX + "address";
	public static final int DEFAULT_NM_PORT = 0;
	public static final String DEFAULT_NM_ADDRESS = "0.0.0.0:" + DEFAULT_NM_PORT;
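What binding to port 0 does can be checked with a plain JDK socket. The sketch below is independent of Hadoop; EphemeralPortSketch is a hypothetical name, and it binds to the loopback address rather than 0.0.0.0 purely to keep the example local.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;

final class EphemeralPortSketch {
    /** Binds to port 0 and returns the port the OS actually assigned. */
    static int bindToEphemeralPort() {
        // Port 0 means "pick any free port"; backlog 50 is an arbitrary value.
        try (ServerSocket socket = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            return socket.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned port is only known after the bind, which matches the way the NodeId is derived from the started server rather than from the configured address when recovery is not in play.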

Next, let's see where nodeId actually comes from:

// setup node ID
		InetSocketAddress connectAddress;
		if (delayedRpcServerStart) {
			connectAddress = NetUtils.getConnectAddress(initialAddress);
		} else {
			server.start();
			connectAddress = NetUtils.getConnectAddress(server);
		}
		NodeId nodeId = buildNodeId(connectAddress, hostOverride);
		((NodeManager.NMContext) context).setNodeId(nodeId);
		this.context.getNMTokenSecretManager().setNodeId(nodeId);
		this.context.getContainerTokenSecretManager().setNodeId(nodeId);

Given connectAddress, a NodeId is built:

private NodeId buildNodeId(InetSocketAddress connectAddress, String hostOverride) {
		if (hostOverride != null) {
			connectAddress = NetUtils.getConnectAddress(new InetSocketAddress(hostOverride, connectAddress.getPort()));
		}
		return NodeId.newInstance(connectAddress.getAddress().getCanonicalHostName(), connectAddress.getPort());
	}

NodeId.newInstance then fills in the host and port:

@Private
	@Unstable
	public static NodeId newInstance(String host, int port) {
		NodeId nodeId = Records.newRecord(NodeId.class);
		nodeId.setHost(host);
		nodeId.setPort(port);
		nodeId.build();
		return nodeId;
	}

That is how the NodeId is produced.
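As a rough illustration of the host-plus-port idea, the sketch below formats an identifier from an address and an optional host override. It is a simplification under stated assumptions: NodeIdSketch is a hypothetical class, the result is a plain "host:port" string rather than a NodeId record, and it deliberately skips the canonical host name lookup that the real buildNodeId performs.

```java
import java.net.InetSocketAddress;

final class NodeIdSketch {
    /** Returns "host:port", preferring hostOverride when one is given. */
    static String buildNodeId(InetSocketAddress connectAddress, String hostOverride) {
        // getHostString() avoids a DNS lookup; the real code resolves the canonical name.
        String host = (hostOverride != null) ? hostOverride : connectAddress.getHostString();
        return host + ":" + connectAddress.getPort();
    }

    /** Example with an unresolved address, so no network access is needed. */
    static String demo() {
        InetSocketAddress addr = InetSocketAddress.createUnresolved("nm1.example.com", 45454);
        return buildNodeId(addr, null);
    }
}
```

The host name and port in demo() are made-up values; the point is only that the port part comes from whatever address the server ended up with.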

LOG.info("ContainerManager started at " + connectAddress);
		LOG.info("ContainerManager bound to " + initialAddress);

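These log lines come at the end of the method, so the log shows the address from which the NodeId was built. Next, the web server is added: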

WebServer webServer = createWebServer(context, containerManager.getContainersMonitor(), this.aclsManager,
				dirsHandler);
		addService(webServer);

Here is how the NM monitoring webapp starts:

@Override
  protected void serviceStart() throws Exception {
    String bindAddress = WebAppUtils.getWebAppBindURL(getConfig(),
                          YarnConfiguration.NM_BIND_HOST,
                          WebAppUtils.getNMWebAppURLWithoutScheme(getConfig()));
    
    LOG.info("Instantiating NMWebApp at " + bindAddress);
    try {
      this.webApp =
          WebApps
            .$for("node", Context.class, this.nmContext, "ws")
            .at(bindAddress)
            .with(getConfig())
            .withHttpSpnegoPrincipalKey(
              YarnConfiguration.NM_WEBAPP_SPNEGO_USER_NAME_KEY)
            .withHttpSpnegoKeytabKey(
              YarnConfiguration.NM_WEBAPP_SPNEGO_KEYTAB_FILE_KEY)
            .start(this.nmWebApp);
      this.port = this.webApp.httpServer().getConnectorAddress(0).getPort();
    } catch (Exception e) {
      String msg = "NMWebapps failed to start.";
      LOG.error(msg, e);
      throw new YarnRuntimeException(msg, e);
    }
    super.serviceStart();
  }

Nothing much to say here; the main thing to note is where the address and port are loaded from:

/** NM Webapp address. **/
	public static final String NM_WEBAPP_ADDRESS = NM_PREFIX + "webapp.address";
	public static final int DEFAULT_NM_WEBAPP_PORT = 8042;
	public static final String DEFAULT_NM_WEBAPP_ADDRESS = "0.0.0.0:" + DEFAULT_NM_WEBAPP_PORT;

These settings are all defined in YarnConfiguration.

With all of that done, let's look again at:
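The lookup-with-default pattern behind these constants can be sketched without Hadoop. In the sketch a plain Map stands in for Configuration, WebAppAddressSketch and resolvePort are hypothetical names, and the key is written out with NM_PREFIX expanded to "yarn.nodemanager.".

```java
import java.util.Map;

final class WebAppAddressSketch {
    // NM_PREFIX + "webapp.address", with the prefix expanded.
    static final String NM_WEBAPP_ADDRESS = "yarn.nodemanager.webapp.address";
    static final int DEFAULT_NM_WEBAPP_PORT = 8042;
    static final String DEFAULT_NM_WEBAPP_ADDRESS = "0.0.0.0:" + DEFAULT_NM_WEBAPP_PORT;

    /** Returns the port from the configured address, or from the default if unset. */
    static int resolvePort(Map<String, String> conf) {
        String address = conf.getOrDefault(NM_WEBAPP_ADDRESS, DEFAULT_NM_WEBAPP_ADDRESS);
        // Everything after the last ':' is the port.
        int colon = address.lastIndexOf(':');
        return Integer.parseInt(address.substring(colon + 1));
    }
}
```

With an empty map the default port 8042 is used; setting the key to a value such as "nm1.example.com:9999" (a made-up address) yields 9999.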

		addService(nodeStatusUpdater);
		// NodeManager is the last service to start, so NodeId is available.
		this.nodeId = this.context.getNodeId();
		this.httpPort = this.context.getHttpPort();
		this.nodeManagerVersionId = YarnVersionInfo.getVersion();
		try {
			// Registration has to be in start so that ContainerManager can get the
			// perNM tokens needed to authenticate ContainerTokens.
			this.resourceTracker = getRMClient();
			registerWithRM();
			super.serviceStart();
			startStatusUpdater();
		} catch (Exception e) {
			String errorMessage = "Unexpected error starting NodeStatusUpdater";
			LOG.error(errorMessage, e);
			throw new YarnRuntimeException(e);
		}
	

Now it is clear: by this point nodeId and httpPort have already been set. The focus is startStatusUpdater, in particular the code in its try block:

						NodeStatus nodeStatus = getNodeStatus(lastHeartBeatID);

This method reads the basic status of the NM node:

private NodeStatus getNodeStatus(int responseId) throws IOException {

		NodeHealthStatus nodeHealthStatus = this.context.getNodeHealthStatus();
		nodeHealthStatus.setHealthReport(healthChecker.getHealthReport());
		nodeHealthStatus.setIsNodeHealthy(healthChecker.isHealthy());
		nodeHealthStatus.setLastHealthReportTime(healthChecker.getLastHealthReportTime());
		if (LOG.isDebugEnabled()) {
			LOG.debug("Node's health-status : " + nodeHealthStatus.getIsNodeHealthy() + ", "
					+ nodeHealthStatus.getHealthReport());
		}
		List<ContainerStatus> containersStatuses = getContainerStatuses();
		NodeStatus nodeStatus = NodeStatus.newInstance(nodeId, responseId, containersStatuses,
				createKeepAliveApplicationList(), nodeHealthStatus);

		return nodeStatus;
	}

The actual work is done by healthChecker, which is the NodeHealthCheckerService we saw earlier:

nodeHealthChecker = new NodeHealthCheckerService();
		addService(nodeHealthChecker);
		nodeStatusUpdater = createNodeStatusUpdater(context, dispatcher, nodeHealthChecker);

Its methods are all much alike, so let's look at just one of them:

/**
	 * @return the reporting string of health of the node
	 */
	String getHealthReport() {
		String scriptReport = (nodeHealthScriptRunner == null) ? "" : nodeHealthScriptRunner.getHealthReport();
		if (scriptReport.equals("")) {
			return dirsHandler.getDisksHealthReport(false);
		} else {
			return scriptReport.concat(SEPARATOR + dirsHandler.getDisksHealthReport(false));
		}
	}
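The combining logic in getHealthReport can be isolated as a small pure function. In this sketch HealthReportSketch and combine are hypothetical names, and the value of SEPARATOR is an assumption, since the constant's definition is not shown in the excerpt.

```java
final class HealthReportSketch {
    // Assumed value; the original constant is not shown in the excerpt.
    static final String SEPARATOR = ";";

    /**
     * Returns the disk report alone when there is no script report,
     * otherwise the script report followed by the disk report.
     */
    static String combine(String scriptReport, String disksReport) {
        String script = (scriptReport == null) ? "" : scriptReport;
        if (script.isEmpty()) {
            return disksReport;
        }
        return script + SEPARATOR + disksReport;
    }
}
```

For example, combine("", "disks ok") gives "disks ok", while combine("script failed", "disks ok") gives "script failed;disks ok" under the assumed separator.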

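The disk part of the report always comes from dirsHandler, with the script report prepended when there is one. I won't go into the details.

Once all the services have started, the NodeManager is ready for use.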
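To tie getNodeStatus back to startStatusUpdater, here is a very reduced sketch of a heartbeat loop. It rests on assumptions not shown in the excerpt: that the status is rebuilt on every iteration and that the response carries the id to use for the next heartbeat. ResourceTrackerStub and HeartbeatLoopSketch are hypothetical, and the loop is bounded so it can be run directly.

```java
import java.util.ArrayList;
import java.util.List;

final class HeartbeatLoopSketch {
    /** Hypothetical RM stub: takes a heartbeat and returns the next response id. */
    interface ResourceTrackerStub {
        int nodeHeartbeat(String nodeId, int responseId, boolean healthy);
    }

    /** Sends a fixed number of heartbeats and returns the ids that were sent. */
    static List<Integer> run(ResourceTrackerStub rm, String nodeId, int beats) {
        List<Integer> sentIds = new ArrayList<>();
        int lastHeartBeatID = 0;
        for (int i = 0; i < beats; i++) {
            // Stand-in for getNodeStatus(lastHeartBeatID): the id travels with the status.
            sentIds.add(lastHeartBeatID);
            // Assumed: the response id becomes the id of the next heartbeat.
            lastHeartBeatID = rm.nodeHeartbeat(nodeId, lastHeartBeatID, true);
        }
        return sentIds;
    }
}
```

With a stub that answers each heartbeat with responseId + 1, three iterations send ids 0, 1 and 2. A real updater would run until the service stops and would sleep between heartbeats.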


Reposted from blog.csdn.net/u013384984/article/details/80299761