Everything you want to know about the Hive query life cycle: hook functions!

Foreword

Whether an HQL statement arrives through the Hive CLI or HiveServer2, it must be parsed and executed by the Driver, as shown in the following figure:

(Figure: Hive architecture)

The Driver processes a statement as follows:

HQL parsing (generate AST syntax tree) => syntax analysis (get QueryBlock) => generate logical execution plan (Operator) => logical optimization (Logical Optimizer Operator) => generate physical execution plan (Task Plan) => physical optimization (Task Tree) => build execution plan (QueryPlan) => table and operation authorization => execution by the execution engine

The process covers three major aspects: HQL parsing, HQL compilation (syntax analysis, logical and physical plan generation, authorization), and executor execution. Over the entire life cycle, the following hook functions are invoked, listed here in execution order:

preDriverRun before Driver.run()

This hook is controlled by hive.exec.driver.run.hooks; multiple hook implementation classes are separated by commas. A hook must implement the org.apache.hadoop.hive.ql.HiveDriverRunHook interface, which is described as follows:

public interface HiveDriverRunHook extends Hook {
  /**
   * Invoked before Hive begins any processing of a command in the Driver,
   * notably before compilation and any customizable performance logging.
   */
  public void preDriverRun(
    HiveDriverRunHookContext hookContext) throws Exception;

  /**
   * Invoked after Hive performs any processing of a command, just before a
   * response is returned to the entity calling the Driver.
   */
  public void postDriverRun(
    HiveDriverRunHookContext hookContext) throws Exception;
}
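Registering such a hook is purely a configuration change. A minimal hive-site.xml fragment might look like this (the implementation class name below is a hypothetical example, not a class shipped with Hive):

```xml
<property>
  <name>hive.exec.driver.run.hooks</name>
  <!-- multiple implementation classes may be listed, comma separated -->
  <value>com.example.hooks.MyDriverRunHook</value>
</property>
```

The same property can also be set per session from the CLI with a SET statement.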

As you can see, the hook also provides a postDriverRun method, called after the HQL has executed but before the results are returned; it will be discussed later.

Its HiveDriverRunHookContext parameter has a default implementation in Hive, org.apache.hadoop.hive.ql.HiveDriverRunHookContextImpl, which provides two useful fields: the HiveConf and the command to be executed. The hook is invoked as follows:

HiveDriverRunHookContext hookContext = new HiveDriverRunHookContextImpl(conf, command);
// Get all the driver run hooks and pre-execute them.
List<HiveDriverRunHook> driverRunHooks;
try {
  driverRunHooks = getHooks(HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS,
      HiveDriverRunHook.class);
  for (HiveDriverRunHook driverRunHook : driverRunHooks) {
      driverRunHook.preDriverRun(hookContext);
  }
} catch (Exception e) {
  errorMessage = "FAILED: Hive Internal Error: " + Utilities.getNameMessage(e);
  SQLState = ErrorMsg.findSQLState(e.getMessage());
  downstreamError = e;
  console.printError(errorMessage + "\n"
      + org.apache.hadoop.util.StringUtils.stringifyException(e));
  return createProcessorResponse(12);
}
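The getHooks call above resolves the comma-separated class names from the configuration and instantiates them by reflection. The following is a simplified standalone sketch of that loading pattern, not Hive's actual code: the Hook interface and LoggingHook class here are stand-ins defined only for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class HookLoader {
    // Stand-in for Hive's marker interface org.apache.hadoop.hive.ql.hooks.Hook
    public interface Hook {
        void run(String command) throws Exception;
    }

    // Example hook, defined here only for demonstration
    public static class LoggingHook implements Hook {
        @Override
        public void run(String command) {
            System.out.println("pre-hook saw: " + command);
        }
    }

    /**
     * Mimics the shape of Hive's getHooks(): split the comma-separated
     * config value, load each class by name, and instantiate it via its
     * no-arg constructor.
     */
    public static List<Hook> getHooks(String confValue) throws Exception {
        List<Hook> hooks = new ArrayList<>();
        if (confValue == null || confValue.trim().isEmpty()) {
            return hooks;
        }
        for (String className : confValue.split(",")) {
            Class<?> clazz = Class.forName(className.trim());
            hooks.add((Hook) clazz.getDeclaredConstructor().newInstance());
        }
        return hooks;
    }

    public static void main(String[] args) throws Exception {
        // As if hive.exec.driver.run.hooks were set to this class name
        List<Hook> hooks = getHooks(LoggingHook.class.getName());
        for (Hook h : hooks) {
            h.run("SELECT 1");
        }
    }
}
```

Hive's real getHooks additionally caches the loaded classes and passes the HiveConf in, but the resolve-and-instantiate core is the same idea.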

preAnalyze before syntax analysis

After Driver.run() starts, the HQL is parsed and then enters the syntax-analysis step of the compilation phase; before syntax analysis it passes through the preAnalyze method of HiveSemanticAnalyzerHook. This hook is configured via hive.semantic.analyzer.hook, and must implement the org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHook interface, described as follows:

public interface HiveSemanticAnalyzerHook extends Hook {
  public ASTNode preAnalyze(
    HiveSemanticAnalyzerHookContext context,
    ASTNode ast) throws SemanticException;

  public void postAnalyze(
    HiveSemanticAnalyzerHookContext context,
    List<Task<? extends Serializable>> rootTasks) throws SemanticException;
}

As you can see, the hook class also provides a postAnalyze method, called after syntax analysis; it will be covered below.

Its HiveSemanticAnalyzerHookContext parameter has a default implementation in Hive, org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContextImpl, which provides the HQL's inputs, outputs, submitting user, HiveConf, client IP, and other information. The input and output tables and partitions are only available after syntax analysis, so they cannot be obtained in preAnalyze. The hook is invoked as follows:

List<HiveSemanticAnalyzerHook> saHooks =
    getHooks(HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK,
        HiveSemanticAnalyzerHook.class);

// Do semantic analysis and plan generation
if (saHooks != null) {
  HiveSemanticAnalyzerHookContext hookCtx = new HiveSemanticAnalyzerHookContextImpl();
  hookCtx.setConf(conf);
  hookCtx.setUserName(userName);
  hookCtx.setIpAddress(SessionState.get().getUserIpAddress());
  hookCtx.setCommand(command);
  for (HiveSemanticAnalyzerHook hook : saHooks) {
    tree = hook.preAnalyze(hookCtx, tree);
  }
  // Syntax analysis starts here; it covers generation and optimization
  // of the logical and physical execution plans
  sem.analyze(tree, ctx);
  // Update the analyzer so the subsequent postAnalyze hooks can run
  hookCtx.update(sem);
  for (HiveSemanticAnalyzerHook hook : saHooks) {
    hook.postAnalyze(hookCtx, sem.getRootTasks());
  }
} else {
  sem.analyze(tree, ctx);
}
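As with the driver-run hook, enabling a semantic-analyzer hook is just configuration; for example in hive-site.xml (the class name is a hypothetical example):

```xml
<property>
  <name>hive.semantic.analyzer.hook</name>
  <value>com.example.hooks.MySemanticAnalyzerHook</value>
</property>
```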

postAnalyze after syntax analysis

As shown in the preAnalyze discussion above, postAnalyze belongs to the same hook class, so its configuration is the same. The difference is that it runs after Hive's syntax analysis, so it can obtain the HQL's input and output tables and partitions, as well as the Tasks produced by syntax analysis. From these you can judge whether the query needs to run as a distributed task and which execution engine it uses. See the preAnalyze section above for the code and configuration.

Redactor hook before generating the execution plan

This hook runs after syntax analysis and before the QueryPlan is generated, so by the time it executes, syntax analysis has finished and the tasks to run have been determined. Its purpose is to rewrite the QueryString: if the query contains sensitive tables or fields, they can be replaced here, so that when the task is viewed in YARN's ResourceManager UI or elsewhere, the redacted HQL is displayed instead.

The hook is configured via hive.exec.query.redactor.hooks, with multiple implementation classes separated by commas. A hook must extend the org.apache.hadoop.hive.ql.hooks.Redactor abstract class and override the redactQuery method. The class is described as follows:

public abstract class Redactor implements Hook, Configurable {

  private Configuration conf;
  
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }

  /**
   * Implementations may modify the query so that when placed in the job.xml
   * and thus potentially exposed to admin users, the query does not expose
   * sensitive information.
   */
  public String redactQuery(String query) {
    return query;
  }
}

Its calling information is as follows:

public static String redactLogString(HiveConf conf, String logString)
    throws InstantiationException, IllegalAccessException, ClassNotFoundException {

  String redactedString = logString;

  if (conf != null && logString != null) {
    List<Redactor> queryRedactors = getHooks(conf, ConfVars.QUERYREDACTORHOOKS, Redactor.class);
    for (Redactor redactor : queryRedactors) {
      redactor.setConf(conf);
      redactedString = redactor.redactQuery(redactedString);
    }
  }

  return redactedString;
}
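What redactQuery actually does is up to the implementation; a common approach is to mask literal values. The following standalone sketch shows such masking logic in plain Java, outside Hive (a real hook would extend Redactor and override redactQuery; the class name and regex here are illustrative only):

```java
import java.util.regex.Pattern;

public class QueryRedactorSketch {
    // Masks quoted string literals so sensitive values do not leak
    // into job.xml or the YARN ResourceManager UI.
    private static final Pattern STRING_LITERAL = Pattern.compile("'[^']*'");

    /** Mirrors the shape of Redactor.redactQuery(String). */
    public static String redactQuery(String query) {
        if (query == null) {
            return null;
        }
        return STRING_LITERAL.matcher(query).replaceAll("'***'");
    }

    public static void main(String[] args) {
        String hql = "SELECT name FROM users WHERE ssn = '123-45-6789'";
        // Prints: SELECT name FROM users WHERE ssn = '***'
        System.out.println(redactQuery(hql));
    }
}
```

Because redactLogString applies every configured redactor in order, several such classes can be chained, each masking a different pattern.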

preExecutionHook before Task execution

After the execution plan (QueryPlan) is generated and authorization passes, the specific Tasks are executed. Before a Task runs, it passes through a hook configured via hive.exec.pre.hooks, with multiple hook implementation classes separated by commas. There are two ways to implement this hook:

1. Implement the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface

This interface receives an instance of org.apache.hadoop.hive.ql.hooks.HookContext as its parameter. That class carries the execution plan, HiveConf, lineage information, UGI, submitting user name, input and output tables and partitions, and more, which is very helpful when implementing your own functionality.

The interface description is as follows:

public interface ExecuteWithHookContext extends Hook {

  void run(HookContext hookContext) throws Exception;
}

2. Implement the org.apache.hadoop.hive.ql.hooks.PreExecute interface

This interface receives the SessionState, UGI, and the HQL's input and output tables and partitions. It is currently marked as deprecated; compared with ExecuteWithHookContext above, the information it provides may not fully meet your needs.

Its interface is described as follows:

public interface PreExecute extends Hook {

  /**
   * The run command that is called just before the execution of the query.
   *
   * @param sess
   *          The session state.
   * @param inputs
   *          The set of input tables and partitions.
   * @param outputs
   *          The set of output tables, partitions, local and hdfs directories.
   * @param ugi
   *          The user group security information.
   */
  @Deprecated
  public void run(SessionState sess, Set<ReadEntity> inputs,
      Set<WriteEntity> outputs, UserGroupInformation ugi)
    throws Exception;
}

The call information of this hook is as follows:

SessionState ss = SessionState.get();
HookContext hookContext = new HookContext(plan, conf, ctx.getPathToCS(), ss.getUserName(), ss.getUserIpAddress());
hookContext.setHookType(HookContext.HookType.PRE_EXEC_HOOK);

for (Hook peh : getHooks(HiveConf.ConfVars.PREEXECHOOKS)) {
  if (peh instanceof ExecuteWithHookContext) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());

    ((ExecuteWithHookContext) peh).run(hookContext);

    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());
  } else if (peh instanceof PreExecute) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());

    ((PreExecute) peh).run(SessionState.get(), plan.getInputs(), plan.getOutputs(),
        Utils.getUGI());

    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());
  }
}

ON_FAILURE_HOOKS when Task execution fails

If a task fails during execution, Hive calls the failure hooks, configured via hive.exec.failure.hooks, with multiple hook implementation classes separated by commas. The hook must implement the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface, described above. This hook is mainly used to react to task failures, for example by collecting statistics.

The call information of this hook is as follows:

hookContext.setHookType(HookContext.HookType.ON_FAILURE_HOOK);
// Get all the failure execution hooks and execute them.
for (Hook ofh : getHooks(HiveConf.ConfVars.ONFAILUREHOOKS)) {
  perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.FAILURE_HOOK + ofh.getClass().getName());

  ((ExecuteWithHookContext) ofh).run(hookContext);

  perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.FAILURE_HOOK + ofh.getClass().getName());
}

postExecutionHook after Task execution

This hook runs after the Tasks have finished. If a Task fails, the ON_FAILURE_HOOKS hooks run first, followed by postExecutionHook. The hook is configured via hive.exec.post.hooks, with multiple hook implementation classes separated by commas. As with the pre-execution hook, there are two ways to implement it:

1. Implement the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface

This is the same as for preExecutionHook.

2. Implement the org.apache.hadoop.hive.ql.hooks.PostExecute interface

This interface receives the SessionState, UGI, column-level LineageInfo, and the HQL's input and output tables and partitions. It is currently marked as deprecated; compared with ExecuteWithHookContext above, the information it provides may not fully meet your needs.

Its interface is described as follows:

public interface PostExecute extends Hook {

  /**
   * The run command that is called just after the execution of the query.
   *
   * @param sess
   *          The session state.
   * @param inputs
   *          The set of input tables and partitions.
   * @param outputs
   *          The set of output tables, partitions, local and hdfs directories.
   * @param lInfo
   *           The column level lineage information.
   * @param ugi
   *          The user group security information.
   */
  @Deprecated
  void run(SessionState sess, Set<ReadEntity> inputs,
      Set<WriteEntity> outputs, LineageInfo lInfo,
      UserGroupInformation ugi) throws Exception;
}

The call information of this hook is as follows:

hookContext.setHookType(HookContext.HookType.POST_EXEC_HOOK);
// Get all the post execution hooks and execute them.
for (Hook peh : getHooks(HiveConf.ConfVars.POSTEXECHOOKS)) {
  if (peh instanceof ExecuteWithHookContext) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());

    ((ExecuteWithHookContext) peh).run(hookContext);

    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());
  } else if (peh instanceof PostExecute) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());

    ((PostExecute) peh).run(SessionState.get(), plan.getInputs(), plan.getOutputs(),
        (SessionState.get() != null ? SessionState.get().getLineageState().getLineageInfo()
            : null), Utils.getUGI());

    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());
  }
}
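The execution-phase hooks are all registered the same way; for example in hive-site.xml (all class names below are hypothetical examples):

```xml
<property>
  <name>hive.exec.pre.hooks</name>
  <value>com.example.hooks.MyPreHook</value>
</property>
<property>
  <name>hive.exec.failure.hooks</name>
  <value>com.example.hooks.MyFailureHook</value>
</property>
<property>
  <name>hive.exec.post.hooks</name>
  <value>com.example.hooks.MyPostHook</value>
</property>
```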

postDriverRun after the tasks finish, before results are returned

This hook runs after the Tasks finish but before the results are returned. It is the counterpart of preDriverRun; since it belongs to the same interface, it is not described again here.

Finally

That covers every hook function in the HQL execution life cycle. The execution order and flow can be summarized as follows:

Driver.run()

=> HiveDriverRunHook.preDriverRun()(hive.exec.driver.run.hooks)

=> Driver.compile()

=> HiveSemanticAnalyzerHook.preAnalyze()(hive.semantic.analyzer.hook)

=> SemanticAnalyze(QueryBlock, LogicalPlan, PhyPlan, TaskTree)

=> HiveSemanticAnalyzerHook.postAnalyze()(hive.semantic.analyzer.hook)

=> QueryString redactor(hive.exec.query.redactor.hooks)

=> QueryPlan Generation

=> Authorization

=> Driver.execute()

=> ExecuteWithHookContext.run() || PreExecute.run() (hive.exec.pre.hooks)

=> TaskRunner

=> if failed, ExecuteWithHookContext.run()(hive.exec.failure.hooks)

=> ExecuteWithHookContext.run() || PostExecute.run() (hive.exec.post.hooks)

=> HiveDriverRunHook.postDriverRun()(hive.exec.driver.run.hooks)

You are welcome to read and reprint this article; please credit the source: https://my.oschina.net/u/2539801/blog/1514648
