Hive on Spark Source Code Analysis

1. Basic Principles of Hive on Spark
1.1 Run Modes
The earlier document on the principles of Hive on Spark already analyzed its overall execution flow.
Hive on Spark supports two run modes, local and remote:
[Figure: Hive on Spark run modes]
When the Spark master URL is set to local, local mode is used; in all other cases remote mode is used. In local mode the SparkContext runs in the same JVM as the client; in remote mode it runs in a separate JVM. Local mode is mainly used for debugging, so the analysis below focuses on remote mode.
[Figure: remote-mode architecture]
1.2 Hive Parses the HQL
Hive's SQL parser compiles each SQL statement into tasks, dispatching to a different compiler subclass to generate the Tasks depending on the configured execution engine.
[Figure: compiler selection per execution engine]
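The dispatch happens in a small factory. A simplified sketch of the selection logic, paraphrased from TaskCompilerFactory in the Hive source (details may differ between Hive versions):

// Simplified sketch: how Hive picks the engine-specific compiler
// (paraphrased from org.apache.hadoop.hive.ql.parse.TaskCompilerFactory).
public static TaskCompiler getCompiler(HiveConf conf, ParseContext parseContext) {
  String engine = HiveConf.getVar(conf, HiveConf.ConfVars.HIVE_EXECUTION_ENGINE);
  if ("tez".equals(engine)) {
    return new TezCompiler();        // hive.execution.engine=tez
  } else if ("spark".equals(engine)) {
    return new SparkCompiler();      // hive.execution.engine=spark -> Hive on Spark
  } else {
    return new MapReduceCompiler();  // default: classic MapReduce
  }
}
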
1.3 TaskCompiler Generates the Tasks
The task-generating method in TaskCompiler:
[Figure: TaskCompiler#compile invoking generateTaskTree]
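The relevant hook is an abstract method that each engine-specific compiler overrides; roughly (simplified from TaskCompiler, with the surrounding analysis and optimization steps omitted):

// Simplified sketch of the hook in TaskCompiler that engine-specific compilers override
// (the surrounding analysis/optimization in compile() is omitted).
public abstract class TaskCompiler {

  public void compile(ParseContext pCtx, List<Task<? extends Serializable>> rootTasks,
      HashSet<ReadEntity> inputs, HashSet<WriteEntity> outputs) throws SemanticException {
    List<Task<MoveWork>> mvTask = new ArrayList<Task<MoveWork>>(); // move tasks collected earlier (details omitted)
    // ... optimize the operator tree ...
    generateTaskTree(rootTasks, pCtx, mvTask, inputs, outputs);    // engine-specific task generation
    // ... optimize the task plan, set up fetch tasks, etc. ...
  }

  protected abstract void generateTaskTree(List<Task<? extends Serializable>> rootTasks,
      ParseContext pCtx, List<Task<MoveWork>> mvTask, Set<ReadEntity> inputs,
      Set<WriteEntity> outputs) throws SemanticException;
}
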
The implementation in SparkCompiler, which generates the task tree for executing the HQL:

protected void generateTaskTree(List<Task<? extends Serializable>> rootTasks, ParseContext pCtx,
    List<Task<MoveWork>> mvTask, Set<ReadEntity> inputs, Set<WriteEntity> outputs)
    throws SemanticException {
  PERF_LOGGER.PerfLogBegin(CLASS_NAME, PerfLogger.SPARK_GENERATE_TASK_TREE);

  GenSparkUtils utils = GenSparkUtils.getUtils();
  utils.resetSequenceNumber();

  ParseContext tempParseContext = getParseContext(pCtx, rootTasks);
  GenSparkProcContext procCtx = new GenSparkProcContext(
      conf, tempParseContext, mvTask, rootTasks, inputs, outputs, pCtx.getTopOps());

  // -------------------------------- First Pass ---------------------------------- //
  // Identify SparkPartitionPruningSinkOperators, and break OP tree if necessary

  Map<Rule, NodeProcessor> opRules = new LinkedHashMap<Rule, NodeProcessor>();
  opRules.put(new RuleRegExp("Clone OP tree for PartitionPruningSink",
          SparkPartitionPruningSinkOperator.getOperatorName() + "%"),
      new SplitOpTreeForDPP());

  Dispatcher disp = new DefaultRuleDispatcher(null, opRules, procCtx);
  GraphWalker ogw = new GenSparkWorkWalker(disp, procCtx);

  List<Node> topNodes = new ArrayList<Node>();
  topNodes.addAll(pCtx.getTopOps().values());
  ogw.startWalking(topNodes, null);

  // -------------------------------- Second Pass ---------------------------------- //
  // Process operator tree in two steps: first we process the extra op trees generated
  // in the first pass. Then we process the main op tree, and the result task will depend
  // on the task generated in the first pass.
  topNodes.clear();
  topNodes.addAll(procCtx.topOps.values());
  generateTaskTreeHelper(procCtx, topNodes);

  // If this set is not empty, it means we need to generate a separate task for collecting
  // the partitions used.
  if (!procCtx.clonedPruningTableScanSet.isEmpty()) {
    SparkTask pruningTask = SparkUtilities.createSparkTask(conf);
    SparkTask mainTask = procCtx.currentTask;
    pruningTask.addDependentTask(procCtx.currentTask);
    procCtx.rootTasks.remove(procCtx.currentTask);
    procCtx.rootTasks.add(pruningTask);
    procCtx.currentTask = pruningTask;

    topNodes.clear();
    topNodes.addAll(procCtx.clonedPruningTableScanSet);
    generateTaskTreeHelper(procCtx, topNodes);

    procCtx.currentTask = mainTask;
  }

  // -------------------------------- Post Pass ---------------------------------- //

  // we need to clone some operator plans and remove union operators still
  for (BaseWork w : procCtx.workWithUnionOperators) {
    GenSparkUtils.getUtils().removeUnionOperators(procCtx, w);
  }

  // we need to fill MapWork with 'local' work and bucket information for SMB Join.
  GenSparkUtils.getUtils().annotateMapWork(procCtx);

  // finally make sure the file sink operators are set up right
  for (FileSinkOperator fileSink : procCtx.fileSinkSet) {
    GenSparkUtils.getUtils().processFileSink(procCtx, fileSink);
  }

  // Process partition pruning sinks
  for (Operator<?> prunerSink : procCtx.pruningSinkSet) {
    utils.processPartitionPruningSink(procCtx, (SparkPartitionPruningSinkOperator) prunerSink);
  }

  PERF_LOGGER.PerfLogEnd(CLASS_NAME, PerfLogger.SPARK_GENERATE_TASK_TREE);
}

2. Hive on Spark Interaction Class Diagram
Note: the interaction order of the sections that follow can be traced against this diagram.
[Figure: Hive on Spark interaction class diagram]
3. Source Code Analysis
3.1 Creating the SparkTask
1. The SparkTask is created via SparkUtilities.createSparkTask (a sketch follows the snippet below).
2. The code executed when the SparkTask runs:

public int execute(DriverContext driverContext) {

  int rc = 0;
  perfLogger = SessionState.getPerfLogger();
  SparkSession sparkSession = null;
  SparkSessionManager sparkSessionManager = null;
  try {
    printConfigInfo();
    // Get the session manager singleton and obtain (or create) the SparkSession
    // bound to the current Hive session (see section 3.2).
    sparkSessionManager = SparkSessionManagerImpl.getInstance();
    sparkSession = SparkUtilities.getSparkSession(conf, sparkSessionManager);

    SparkWork sparkWork = getWork();
    sparkWork.setRequiredCounterPrefix(getOperatorCounters());
…
}
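
For reference, createSparkTask itself is small; roughly (paraphrased from SparkUtilities, the exact shape may vary across Hive versions), it wraps a new SparkWork into a SparkTask via TaskFactory:

// Sketch of SparkUtilities#createSparkTask (paraphrased; may differ slightly between versions):
// a new SparkWork, keyed by the current query id, is wrapped into a SparkTask via TaskFactory.
public static SparkTask createSparkTask(HiveConf conf) {
  return (SparkTask) TaskFactory.get(
      new SparkWork(conf.getVar(HiveConf.ConfVars.HIVEQUERYID)), conf);
}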

3.2 Creating the SparkSession
1. During execution, the SparkSession is created through SparkSessionManager.getSession.
SparkUtilities.getSparkSession:

public static SparkSession getSparkSession(HiveConf conf,
    SparkSessionManager sparkSessionManager) throws HiveException {
  SparkSession sparkSession = SessionState.get().getSparkSession();
  HiveConf sessionConf = SessionState.get().getConf();

  // If the Spark configuration was updated, close the existing session.
  // In the case of async queries, or when confOverlay is not empty,
  // sessionConf and conf are different objects.
  if (sessionConf.getSparkConfigUpdated() || conf.getSparkConfigUpdated()) {
    sparkSessionManager.closeSession(sparkSession);
    sparkSession =  null;
    conf.setSparkConfigUpdated(false);
    sessionConf.setSparkConfigUpdated(false);
  }
  sparkSession = sparkSessionManager.getSession(sparkSession, conf, true);
  SessionState.get().setSparkSession(sparkSession);
  return sparkSession;
}
2. getSession in SparkSessionManagerImpl:
public SparkSession getSession(SparkSession existingSession, HiveConf conf, boolean doOpen)
    throws HiveException {
  setup(conf);

  if (existingSession != null) {
    // Open the session if it is closed.
    if (!existingSession.isOpen() && doOpen) {
      existingSession.open(conf);
    }
    return existingSession;
  }

  SparkSession sparkSession = new SparkSessionImpl();
  if (doOpen) {
    sparkSession.open(conf);
  }

  if (LOG.isDebugEnabled()) {
    LOG.debug(String.format("New session (%s) is created.", sparkSession.getSessionId()));
  }
  createdSessions.add(sparkSession);
  return sparkSession;
}

3.3 Instantiating SparkSessionImpl
1. SparkSessionImpl is instantiated and its open method is called.
2. open creates a SparkClient.
SparkSessionImpl:

public void open(HiveConf conf) throws HiveException {
  LOG.info("Trying to open Spark session {}", sessionId);
  this.conf = conf;
  isOpen = true;
  try {
    hiveSparkClient = HiveSparkClientFactory.createHiveSparkClient(conf, sessionId);
  } catch (Throwable e) {
    // It's possible that user session is closed while creating Spark client.
    String msg = isOpen ? "Failed to create Spark client for Spark session " + sessionId :
      "Spark Session " + sessionId + " is closed before Spark client is created";
    throw new HiveException(msg, e);
  }
  LOG.info("Spark session {} is successfully opened", sessionId);
}

The implementation of createHiveSparkClient in HiveSparkClientFactory:

public static HiveSparkClient createHiveSparkClient(HiveConf hiveconf, String sessionId) throws Exception {
  Map<String, String> sparkConf = initiateSparkConf(hiveconf, sessionId);

  // Submit spark job through local spark context while spark master is local mode, otherwise submit
  // spark job through remote spark context.
  String master = sparkConf.get("spark.master");
  if (master.equals("local") || master.startsWith("local[")) {
    // With local spark context, all user sessions share the same spark context.
    return LocalHiveSparkClient.getInstance(generateSparkConf(sparkConf), hiveconf);
  } else {
    return new RemoteHiveSparkClient(hiveconf, sparkConf);
  }
}

3.4 Instantiating RemoteHiveSparkClient
1. Note that initiateSparkConf initializes a number of parameters for the Spark runtime environment; these are covered in section 3.8.1.
2. RemoteHiveSparkClient is instantiated:

RemoteHiveSparkClient(HiveConf hiveConf, Map<String, String> conf) throws Exception {
  this.hiveConf = hiveConf;
  sparkClientTimtout = hiveConf.getTimeVar(HiveConf.ConfVars.SPARK_CLIENT_FUTURE_TIMEOUT,
      TimeUnit.SECONDS);
  sparkConf = HiveSparkClientFactory.generateSparkConf(conf);
  this.conf = conf;
  createRemoteClient();
}

3.5 Creating the Remote Client
The createRemoteClient method creates the remote client:

private void createRemoteClient() throws Exception {
  remoteClient = SparkClientFactory.createClient(conf, hiveConf);

  if (HiveConf.getBoolVar(hiveConf, ConfVars.HIVE_PREWARM_ENABLED) &&
          (SparkClientUtilities.isYarnMaster(hiveConf.get("spark.master")) ||
           SparkClientUtilities.isLocalMaster(hiveConf.get("spark.master")))) {
    int minExecutors = getExecutorsToWarm();
    if (minExecutors <= 0) {
      return;
    }

    LOG.info("Prewarm Spark executors. The minimum number of executors to warm is " + minExecutors);

    // Spend at most HIVE_PREWARM_SPARK_TIMEOUT to wait for executors to come up.
    int curExecutors = 0;
    long maxPrewarmTime = HiveConf.getTimeVar(hiveConf, ConfVars.HIVE_PREWARM_SPARK_TIMEOUT,
        TimeUnit.MILLISECONDS);
    long ts = System.currentTimeMillis();
    do {
      try {
        curExecutors = getExecutorCount(maxPrewarmTime, TimeUnit.MILLISECONDS);
      } catch (TimeoutException e) {
        // don't fail on a future timeout here, since we have our own timeout for pre-warm
        LOG.warn("Timed out getting executor count.", e);
      }
      if (curExecutors >= minExecutors) {
        LOG.info("Finished prewarming Spark executors. The current number of executors is " + curExecutors);
        return;
      }
      Thread.sleep(500); // sleep half a second
    } while (System.currentTimeMillis() - ts < maxPrewarmTime);

    LOG.info("Timeout (" + maxPrewarmTime / 1000 + "s) occurred while prewarming executors. " +
        "The current number of executors is " + curExecutors);
  }
}

SparkClientFactory.createClient then constructs the SparkClient:

public static SparkClient createClient(Map<String, String> sparkConf, HiveConf hiveConf)
    throws IOException, SparkException {
  Preconditions.checkState(server != null, "initialize() not called.");
  return new SparkClientImpl(server, sparkConf, hiveConf);
}

Instantiating SparkClientImpl launches a RemoteDriver process (covered in detail in section 3.8).
3.6 Obtaining the SparkWork
Back in SparkTask: after the SparkSession has been created via SparkUtilities as shown above, getWork returns the SparkWork.
[Figure: SparkTask obtaining and submitting the SparkWork]
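Simplified (and not verbatim Hive code), the rest of execute then hands the SparkWork to the SparkSession and monitors the returned job reference:

// Simplified sketch of the next steps in SparkTask#execute (not verbatim Hive code):
// the SparkWork is handed to the SparkSession and the returned SparkJobRef is monitored.
SparkJobRef jobRef = sparkSession.submit(driverContext, sparkWork);  // see section 3.7
rc = jobRef.monitorJob();   // polls the remote job status in a loop (see the sketch in section 3.7)
if (rc != 0) {
  // on failure SparkTask logs the error and returns the non-zero code to the Driver
}
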
3.7 Submitting the SparkTask
1. The SparkWork held by the SparkTask is submitted through the SparkSession, which returns a SparkJobRef object encapsulating a reference to the Spark job.
submit in SparkSessionImpl:

public SparkJobRef submit(DriverContext driverContext, SparkWork sparkWork) throws Exception {
  Preconditions.checkState(isOpen, "Session is not open. Can't submit jobs.");
  return hiveSparkClient.execute(driverContext, sparkWork);
}

execute in RemoteHiveSparkClient, which then calls submit:

public SparkJobRef execute(final DriverContext driverContext, final SparkWork sparkWork)
    throws Exception {
  if (SparkClientUtilities.isYarnMaster(hiveConf.get("spark.master")) &&
      !remoteClient.isActive()) {
    // Re-create the remote client if not active any more
    close();
    createRemoteClient();
  }

  try {
    return submit(driverContext, sparkWork);
  } catch (Throwable cause) {
    throw new Exception("Failed to submit Spark work, please retry later", cause);
  }
}

submit in RemoteHiveSparkClient:

private SparkJobRef submit(final DriverContext driverContext, final SparkWork sparkWork) throws Exception {
  final Context ctx = driverContext.getCtx();
  final HiveConf hiveConf = (HiveConf) ctx.getConf();
  refreshLocalResources(sparkWork, hiveConf);
  final JobConf jobConf = new JobConf(hiveConf);

  //update the credential provider location in the jobConf
  HiveConfUtil.updateJobCredentialProviders(jobConf);

  // Create temporary scratch dir
  final Path emptyScratchDir = ctx.getMRTmpPath();
  FileSystem fs = emptyScratchDir.getFileSystem(jobConf);
  fs.mkdirs(emptyScratchDir);

  byte[] jobConfBytes = KryoSerializer.serializeJobConf(jobConf);
  byte[] scratchDirBytes = KryoSerializer.serialize(emptyScratchDir);
  byte[] sparkWorkBytes = KryoSerializer.serialize(sparkWork);

  JobStatusJob job = new JobStatusJob(jobConfBytes, scratchDirBytes, sparkWorkBytes);
  if (driverContext.isShutdown()) {
    throw new HiveException("Operation is cancelled.");
  }

  JobHandle<Serializable> jobHandle = remoteClient.submit(job);
  RemoteSparkJobStatus sparkJobStatus = new RemoteSparkJobStatus(remoteClient, jobHandle, sparkClientTimtout);
  return new RemoteSparkJobRef(hiveConf, jobHandle, sparkJobStatus);
}

In SparkTask:
[Figure: SparkTask calling monitorJob on the returned SparkJobRef]
SparkTask calls monitorJob on the SparkJobRef, which polls the status of the Spark job in a loop; a simplified sketch of that loop follows.
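The real loop lives in RemoteSparkJobMonitor; the sketch below is illustrative only and paraphrases its state handling (field and method names are simplified):

// Illustrative sketch of the monitoring loop (paraphrased from RemoteSparkJobMonitor#startMonitor;
// progress printing, timeout handling and logging are omitted).
private int startMonitor() throws Exception {
  int rc = 0;
  boolean done = false;
  while (!done) {
    // sparkJobStatus is the RemoteSparkJobStatus created in RemoteHiveSparkClient#submit
    JobHandle.State state = sparkJobStatus.getRemoteJobState();
    switch (state) {
      case SENT:
      case QUEUED:
      case STARTED:
        Thread.sleep(1000);   // not finished yet: print progress and poll again
        break;
      case SUCCEEDED:
        done = true;          // normal completion
        break;
      case FAILED:
      case CANCELLED:
        rc = 3;               // Hive convention: non-zero return code on failure
        done = true;
        break;
    }
  }
  return rc;
}
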
RemoteHiveSparkClient then submits the job through remoteClient.submit, which ultimately hands the work to the protocol inside SparkClientImpl:

public <T extends Serializable> JobHandle<T> submit(Job<T> job) {
  return protocol.submit(job, Collections.<JobHandle.Listener<T>>emptyList());
}

3.8 Launching the RemoteDriver
Next, a closer look at SparkClientImpl, the entry point for connecting to Spark and the class that drives Spark's startup flow.
As mentioned above, when SparkClientImpl is instantiated it launches the RemoteDriver through the startDriver method:

SparkClientImpl(RpcServer rpcServer, Map<String, String> conf, HiveConf hiveConf) throws IOException, SparkException {
  this.conf = conf;
  this.hiveConf = hiveConf;
  this.jobs = Maps.newConcurrentMap();

  String clientId = UUID.randomUUID().toString();
  String secret = rpcServer.createSecret();
  this.driverThread = startDriver(rpcServer, clientId, secret);
  this.protocol = new ClientProtocol();

  try {
    // The RPC server will take care of timeouts here.
    this.driverRpc = rpcServer.registerClient(clientId, secret, protocol).get();
  } catch (Throwable e) {
    String errorMsg = null;
    if (e.getCause() instanceof TimeoutException) {
      errorMsg = "Timed out waiting for client to connect.\nPossible reasons include network " +
          "issues, errors in remote driver or the cluster has no available resources, etc." +
          "\nPlease check YARN or Spark driver's logs for further information.";
    } else if (e.getCause() instanceof InterruptedException) {
      errorMsg = "Interruption occurred while waiting for client to connect.\nPossibly the Spark session is closed " +
          "such as in case of query cancellation." +
          "\nPlease refer to HiveServer2 logs for further information.";
    } else {
      errorMsg = "Error while waiting for client to connect.";
    }
    LOG.error(errorMsg, e);
    driverThread.interrupt();
    try {
      driverThread.join();
    } catch (InterruptedException ie) {
      // Give up.
      LOG.warn("Interrupted before driver thread was finished.", ie);
    }
    throw Throwables.propagate(e);
  }

  driverRpc.addListener(new Rpc.Listener() {
      @Override
      public void rpcClosed(Rpc rpc) {
        if (isAlive) {
          LOG.warn("Client RPC channel closed unexpectedly.");
          isAlive = false;
        }
      }
  });
  isAlive = true;
}

It also registers with the RpcServer and starts listening on the resulting RPC channel.
The key method is startDriver: it assembles all the parameters needed to run the Spark job and submits the job through Spark's SparkSubmit (spark-submit).
3.8.1 SparkClient Parameters
As analyzed earlier, the SparkClient is created through HiveSparkClientFactory.createHiveSparkClient, whose initiateSparkConf call initializes the Spark parameters:
spark.master defaults to yarn
[Figure: default Spark configuration set in initiateSparkConf]
Deploy mode: cluster
spark.app.name: Hive on Spark
Serializer: kryo
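Condensed, these defaults look roughly like this (paraphrased from HiveSparkClientFactory#initiateSparkConf; the real method also copies the relevant spark.* and hive.* settings from HiveConf):

// Condensed sketch of the defaults in HiveSparkClientFactory#initiateSparkConf (paraphrased;
// the real method additionally copies spark.*/hive.* settings over from HiveConf).
Map<String, String> sparkConf = new HashMap<String, String>();
sparkConf.put("spark.master", "yarn");                 // default master if none is configured
sparkConf.put("spark.app.name", "Hive on Spark");      // application name shown in YARN / the Spark UI
sparkConf.put("spark.serializer",
    "org.apache.spark.serializer.KryoSerializer");     // kryo serialization
// with a yarn master the deploy mode defaults to cluster, i.e. effectively yarn-cluster
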
3.8.2 Launch Parameters
SparkClientImpl sets the Spark driver memory parameters:
[Figure: driver memory settings in startDriver]
Extra classes and extra classpath entries:
[Figure: driver/executor extraClassPath settings]
The Spark job is submitted through spark-submit:
[Figure: building the spark-submit command line]
Executor cores, memory, and instance count are passed along as well:
[Figure: executor cores/memory/instances arguments]
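The figures above boil down to startDriver building a spark-submit command line. A compressed sketch, paraphrased from SparkClientImpl#startDriver (keytab/proxy-user handling and many options are omitted; the local variables stand for values computed earlier in the method):

// Compressed sketch of the argument assembly in SparkClientImpl#startDriver (paraphrased;
// sparkHome, properties, conf, jar, serverAddress and serverPort are computed earlier in the method).
List<String> argv = new ArrayList<String>();
argv.add(new File(sparkHome, "bin/spark-submit").getAbsolutePath()); // $SPARK_HOME/bin/spark-submit
argv.add("--properties-file");
argv.add(properties.getAbsolutePath());        // all spark.* settings written to a temp file
argv.add("--class");
argv.add(RemoteDriver.class.getName());        // the main class run on the Spark driver side
if (conf.containsKey("spark.executor.cores")) {
  argv.add("--executor-cores");
  argv.add(conf.get("spark.executor.cores"));
}
if (conf.containsKey("spark.executor.memory")) {
  argv.add("--executor-memory");
  argv.add(conf.get("spark.executor.memory"));
}
if (conf.containsKey("spark.executor.instances")) {
  argv.add("--num-executors");
  argv.add(conf.get("spark.executor.instances"));
}
argv.add(jar);                                 // the jar containing RemoteDriver (hive-exec)
argv.add("--remote-host");                     // application arguments parsed by RemoteDriver itself
argv.add(serverAddress);
argv.add("--remote-port");
argv.add(String.valueOf(serverPort));
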
3.8.3 Starting the RemoteDriver Process
[Figure: launching spark-submit as a child process]
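The launch itself, roughly (paraphrased from the tail of startDriver; the threads that redirect the child's output into the Hive logs are omitted):

// Sketch of how the assembled command line is launched (paraphrased from startDriver):
// spark-submit runs as a child process, and a daemon thread waits for it to exit.
final Process child = new ProcessBuilder(argv.toArray(new String[argv.size()]))
    .redirectErrorStream(true)
    .start();
// (separate threads drain the child's output into the Hive logs -- omitted here)
Thread thread = new Thread(new Runnable() {
  @Override
  public void run() {
    try {
      int exitCode = child.waitFor();      // block until spark-submit exits
      if (exitCode != 0) {
        LOG.warn("Child process exited with code " + exitCode);
      }
    } catch (InterruptedException ie) {
      LOG.warn("Waiting thread interrupted, killing child process.");
      child.destroy();
    }
  }
});
thread.setName("Driver");
thread.setDaemon(true);
thread.start();
// startDriver returns this thread; the constructor later joins it if the RPC handshake fails
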
This completes the Spark startup flow in Hive on Spark.
3.9 Submitting the Job
Next, the job submission flow; back in SparkTask:
[Figure: SparkTask submitting the SparkWork through the SparkSession]
From here the call chain is the same as in section 3.7: SparkSessionImpl.submit delegates to RemoteHiveSparkClient.execute, which (after re-creating the remote client if it is no longer active) calls the private submit method shown there.
The JobHandle returned there is a handle to the submitted job. remoteClient.submit hands the job to the client protocol:

public <T extends Serializable> JobHandle<T> submit(Job<T> job) {
  return protocol.submit(job, Collections.<JobHandle.Listener<T>>emptyList());
}

ClientProtocol.submit (an inner class of SparkClientImpl) then:
1. Creates a Promise through the RPC's thread pool.
2. Instantiates a JobHandleImpl.
3. Wraps the jobId and the job into a JobRequest, sends it through driverRpc, and keeps the promise to hold the asynchronous result.
4. Registers listeners to keep the RPC future and the promise in sync.

<T extends Serializable> JobHandleImpl<T> submit(Job<T> job, List<JobHandle.Listener<T>> listeners) {
  final String jobId = UUID.randomUUID().toString();
  final Promise<T> promise = driverRpc.createPromise();
  final JobHandleImpl<T> handle =
      new JobHandleImpl<T>(SparkClientImpl.this, promise, jobId, listeners);
  jobs.put(jobId, handle);

  final io.netty.util.concurrent.Future<Void> rpc = driverRpc.call(new JobRequest(jobId, job));
  LOG.debug("Send JobRequest[{}].", jobId);

  // Link the RPC and the promise so that events from one are propagated to the other as
  // needed.
  rpc.addListener(new GenericFutureListener<io.netty.util.concurrent.Future<Void>>() {
    @Override
    public void operationComplete(io.netty.util.concurrent.Future<Void> f) {
      if (f.isSuccess()) {
        // If the spark job finishes before this listener is called, the QUEUED status will not be set
        handle.changeState(JobHandle.State.QUEUED);
      } else if (!promise.isDone()) {
        promise.setFailure(f.cause());
      }
    }
  });
  promise.addListener(new GenericFutureListener<Promise<T>>() {
    @Override
    public void operationComplete(Promise<T> p) {
      if (jobId != null) {
        jobs.remove(jobId);
      }
      if (p.isCancelled() && !rpc.isDone()) {
        rpc.cancel(true);
      }
    }
  });
  return handle;
}

3.10 Interaction Between RemoteDriver and SparkClient
The RemoteDriver interacts with the SparkClient and submits jobs to the Spark cluster.
The RemoteDriver constructor processes its command-line arguments, initializes the environment, and copies the settings into the corresponding SparkConf.
[Figure: argument handling in the RemoteDriver constructor]
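A trimmed sketch of that argument handling (paraphrased from the RemoteDriver constructor; getArg is a small helper returning args[idx + 1]):

// Trimmed sketch of the argument handling in the RemoteDriver constructor (paraphrased).
SparkConf conf = new SparkConf();
String serverAddress = null;
int serverPort = -1;
for (int idx = 0; idx < args.length; idx += 2) {
  String key = args[idx];
  if (key.equals("--remote-host")) {
    serverAddress = getArg(args, idx);               // address of the RpcServer inside the Hive client
  } else if (key.equals("--remote-port")) {
    serverPort = Integer.parseInt(getArg(args, idx));
  } else if (key.equals("--client-id")) {
    conf.set(SparkClientFactory.CONF_CLIENT_ID, getArg(args, idx));
  } else if (key.equals("--secret")) {
    conf.set(SparkClientFactory.CONF_KEY_SECRET, getArg(args, idx));
  } else if (key.equals("--conf")) {
    String[] val = getArg(args, idx).split("[=]", 2); // arbitrary key=value passed through to SparkConf
    conf.set(val[0], val[1]);
  } else {
    throw new IllegalArgumentException("Invalid command line: " + Joiner.on(" ").join(args));
  }
}
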
It then creates the executor thread pool:

executor = Executors.newCachedThreadPool();

The configuration used by the RemoteDriver is copied into mapConf:

Map<String, String> mapConf = Maps.newHashMap();
for (Tuple2<String, String> e : conf.getAll()) {
  mapConf.put(e._1(), e._2());
  LOG.debug("Remote Driver configured with: " + e._1() + "=" + e._2());
}

The driver-side RPC client is created:

this.clientRpc = Rpc.createClient(mapConf, egroup, serverAddress, serverPort,
  clientId, secret, protocol).get();

A listener is added to clientRpc:

this.clientRpc.addListener(new Rpc.Listener() {
  @Override
  public void rpcClosed(Rpc rpc) {
    LOG.warn("Shutting down driver because RPC channel was closed.");
    shutdown(null);
  }
});

1. The SparkContext is created.
2. A JobContextImpl is instantiated to hold the runtime information for executing jobs.

try {
  JavaSparkContext sc = new JavaSparkContext(conf);
  sc.sc().addSparkListener(new ClientListener());
  synchronized (jcLock) {
    jc = new JobContextImpl(sc, localTmpDir);
    jcLock.notifyAll();
  }
} catch (Exception e) {
  LOG.error("Failed to start SparkContext: " + e, e);
  shutdown(e);
  synchronized (jcLock) {
    jcLock.notifyAll();
  }
  throw e;
}

This completes the construction of the RemoteDriver. Three inner classes take part in this process; their roles and implementations are outlined below.
1. JobWrapper
JobWrapper implements the Callable interface; its core logic is in the call method.
Step 1: call protocol.jobStarted to send a JobStarted message.
Step 2: call the wrapped Job's call method; Spark jobs submitted by it are routed through monitorJob, which sends a job-submitted message:

jc.setMonitorCb(new MonitorCallback() {
  @Override
  public void call(JavaFutureAction<?> future,
      SparkCounters sparkCounters, Set<Integer> cachedRDDIds) {
    monitorJob(future, sparkCounters, cachedRDDIds);
  }
});
T result = req.job.call(jc);

Step 3: wait for the JobEnd events by checking the jobEndReceived counter; a simplified skeleton of the whole call method is sketched below.
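Putting the three steps together, the skeleton of JobWrapper#call looks roughly like this (simplified; counter setup, cache release, and error details are omitted):

// Simplified skeleton of JobWrapper#call (counter setup, cache release and error details omitted).
public Void call() throws Exception {
  protocol.jobStarted(req.id);                       // step 1: send the JobStarted message
  try {
    jc.setMonitorCb(new MonitorCallback() {          // step 2: every Spark job submitted by the
      @Override                                      //         wrapped Job goes through monitorJob
      public void call(JavaFutureAction<?> future,
          SparkCounters sparkCounters, Set<Integer> cachedRDDIds) {
        monitorJob(future, sparkCounters, cachedRDDIds);
      }
    });
    T result = req.job.call(jc);                     // run the wrapped Job (e.g. JobStatusJob)
    synchronized (jobEndReceived) {                  // step 3: wait until a JobEnd event has been
      while (jobEndReceived.get() < jobs.size()) {   //         received for every monitored Spark job
        jobEndReceived.wait();
      }
    }
    protocol.jobFinished(req.id, result, null, sparkCounters);
  } catch (Throwable t) {
    protocol.jobFinished(req.id, null, t, sparkCounters);  // report the failure back to the client
  } finally {
    jc.setMonitorCb(null);
    activeJobs.remove(req.id);
  }
  return null;
}
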
2. ClientListener
ClientListener extends JavaSparkListener and listens for events from the Spark scheduler.
When a job starts, onJobStart is triggered and the job's stage ids are mapped to its jobId in a HashMap.
When the job ends, onJobEnd is triggered.
When a task finishes, onTaskEnd is triggered.
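A condensed sketch of ClientListener (paraphrased; metric accumulation is omitted and field names may differ):

// Condensed sketch of ClientListener (paraphrased; metric accumulation is omitted).
// It maps Spark stage ids to job ids so that task/job events can be matched to the right JobWrapper.
private class ClientListener extends JavaSparkListener {

  private final Map<Integer, Integer> stageToJobId = new HashMap<Integer, Integer>();

  @Override
  public void onJobStart(SparkListenerJobStart jobStart) {
    synchronized (stageToJobId) {
      for (int i = 0; i < jobStart.stageIds().length(); i++) {
        stageToJobId.put((Integer) jobStart.stageIds().apply(i), jobStart.jobId());
      }
    }
  }

  @Override
  public void onJobEnd(SparkListenerJobEnd jobEnd) {
    // remove the stage ids of this job from the map and notify the matching JobWrapper;
    // the wrapper's notification bumps jobEndReceived, waking up JobWrapper#call
  }

  @Override
  public void onTaskEnd(SparkListenerTaskEnd taskEnd) {
    // look up the job id via the stage id and accumulate the task's metrics for that job
  }
}
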
3. DriverProtocol
DriverProtocol defines the handlers for the different message types; the one of most interest here is JobRequest.

private void handle(ChannelHandlerContext ctx, JobRequest msg) {
  LOG.info("Received job request {}", msg.id);
  JobWrapper<?> wrapper = new JobWrapper<Serializable>(msg);
  activeJobs.put(msg.id, wrapper);
  submit(wrapper);
}

When DriverProtocol receives a JobRequest message, it wraps the message in a JobWrapper, records it in the active-job map, and submits the wrapper for execution.

When is a JobRequest received?
When a SparkTask executes and submits its job through the SparkClient.
Recall RemoteHiveSparkClient, which instantiates a JobStatusJob:

JobStatusJob job = new JobStatusJob(jobConfBytes, scratchDirBytes, sparkWorkBytes);
if (driverContext.isShutdown()) {
  throw new HiveException("Operation is cancelled.");
}

and submits it through remoteClient:

JobHandle<Serializable> jobHandle = remoteClient.submit(job);

JobWrapper then invokes the Job's call method (JobStatusJob.call):

public Serializable call(JobContext jc) throws Exception {
  JobConf localJobConf = KryoSerializer.deserializeJobConf(jobConfBytes);

  // Add jar to current thread class loader dynamically, and add jar paths to JobConf as Spark
  // may need to load classes from this jar in other threads.
  Map<String, Long> addedJars = jc.getAddedJars();
  if (addedJars != null && !addedJars.isEmpty()) {
    List<String> localAddedJars = SparkClientUtilities.addToClassPath(addedJars,
        localJobConf, jc.getLocalTmpDir());
    localJobConf.set(Utilities.HIVE_ADDED_JARS, StringUtils.join(localAddedJars, ";"));
  }

  Path localScratchDir = KryoSerializer.deserialize(scratchDirBytes, Path.class);
  SparkWork localSparkWork = KryoSerializer.deserialize(sparkWorkBytes, SparkWork.class);
  logConfigurations(localJobConf);

  SparkCounters sparkCounters = new SparkCounters(jc.sc());
  Map<String, List<String>> prefixes = localSparkWork.getRequiredCounterPrefix();
  if (prefixes != null) {
    for (String group : prefixes.keySet()) {
      for (String counterName : prefixes.get(group)) {
        sparkCounters.createCounter(group, counterName);
      }
    }
  }
  SparkReporter sparkReporter = new SparkReporter(sparkCounters);

  // Generate Spark plan
  SparkPlanGenerator gen =
    new SparkPlanGenerator(jc.sc(), null, localJobConf, localScratchDir, sparkReporter);
  SparkPlan plan = gen.generate(localSparkWork);

  jc.sc().setJobGroup("queryId = " + localSparkWork.getQueryId(), DagUtils.getQueryName(localJobConf));

  // Execute generated plan.
  JavaPairRDD<HiveKey, BytesWritable> finalRDD = plan.generateGraph();
  // We use Spark RDD async action to submit job as it's the only way to get jobId now.
  JavaFutureAction<Void> future = finalRDD.foreachAsync(HiveVoidFunction.getInstance());
  jc.monitor(future, sparkCounters, plan.getCachedRDDIds());
  return null;
}

1. Deserialize the job configuration.
2. Add the Hive-related jars to the classpath.
3. Deserialize the local scratch directory.
4. Deserialize the SparkWork.
5. Generate the Spark execution plan.
6. Build the RDD graph from the plan.
7. Submit it to the Spark cluster through an async action:

// Execute generated plan.
JavaPairRDD<HiveKey, BytesWritable> finalRDD = plan.generateGraph();
// We use Spark RDD async action to submit job as it's the only way to get jobId now.
JavaFutureAction<Void> future = finalRDD.foreachAsync(HiveVoidFunction.getInstance());

Readers who are interested can download the Hive source code and explore it further.
Hive project: https://github.com/apache/hive
git clone https://github.com/apache/hive.git

Reposted from blog.csdn.net/ASAS1314/article/details/78833793