Foreword
Whether an HQL statement arrives through the Hive CLI or HiveServer2, it must be parsed and executed by the Driver, as shown in the following figure:
The Driver processes a statement in these stages:
HQL parsing (generating the AST syntax tree) =>
semantic analysis (producing QueryBlocks) =>
logical execution plan generation (Operator tree) =>
logical optimization (Logical Optimizer) =>
physical execution plan generation (Task plan) =>
physical optimization (Task tree) =>
execution plan construction (QueryPlan) =>
table and operation authorization =>
execution by the engine
The process covers three major phases: HQL parsing, compilation (semantic analysis, logical and physical planning, authorization), and execution by the engine. Across this life cycle, the following hook functions fire, listed in execution order:
preDriverRun before Driver.run()
This hook is controlled by the hive.exec.driver.run.hooks property; multiple hook implementation classes are separated by commas. Each hook must implement the org.apache.hadoop.hive.ql.HiveDriverRunHook interface, which is described as follows:
public interface HiveDriverRunHook extends Hook {

  /**
   * Invoked before Hive begins any processing of a command in the Driver,
   * notably before compilation and any customizable performance logging.
   */
  public void preDriverRun(
      HiveDriverRunHookContext hookContext) throws Exception;

  /**
   * Invoked after Hive performs any processing of a command, just before a
   * response is returned to the entity calling the Driver.
   */
  public void postDriverRun(
      HiveDriverRunHookContext hookContext) throws Exception;
}
As shown, the hook also provides a postDriverRun method, invoked after the HQL has executed but before results are returned to the caller; it is discussed later. The HiveDriverRunHookContext parameter's default implementation in Hive is org.apache.hadoop.hive.ql.HiveDriverRunHookContextImpl, which exposes two useful pieces of information: the HiveConf and the command being executed. The hook is invoked as follows:
HiveDriverRunHookContext hookContext = new HiveDriverRunHookContextImpl(conf, command);
// Get all the driver run hooks and pre-execute them.
List<HiveDriverRunHook> driverRunHooks;
try {
  driverRunHooks = getHooks(HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS,
      HiveDriverRunHook.class);
  for (HiveDriverRunHook driverRunHook : driverRunHooks) {
    driverRunHook.preDriverRun(hookContext);
  }
} catch (Exception e) {
  errorMessage = "FAILED: Hive Internal Error: " + Utilities.getNameMessage(e);
  SQLState = ErrorMsg.findSQLState(e.getMessage());
  downstreamError = e;
  console.printError(errorMessage + "\n"
      + org.apache.hadoop.util.StringUtils.stringifyException(e));
  return createProcessorResponse(12);
}
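In practice, a hook of this kind is enabled purely through configuration. A minimal sketch for hive-site.xml (the class name com.example.AuditDriverRunHook is a hypothetical placeholder for your own implementation):

```xml
<!-- Register HiveDriverRunHook implementations; multiple class names
     are separated by commas. com.example.AuditDriverRunHook is hypothetical. -->
<property>
  <name>hive.exec.driver.run.hooks</name>
  <value>com.example.AuditDriverRunHook</value>
</property>
```

The hook class must be on Hive's classpath (for example via an auxiliary jar); the property can typically also be set for a single session with `set hive.exec.driver.run.hooks=...;`.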
preAnalyze before semantic analysis
Once Driver.run() has started, the HQL is parsed and then enters the semantic-analysis stage of compilation. Before semantic analysis begins, Hive calls the preAnalyze method of HiveSemanticAnalyzerHook. This hook is configured through the hive.semantic.analyzer.hook property, and must implement the org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHook interface, described as follows:
public interface HiveSemanticAnalyzerHook extends Hook {

  public ASTNode preAnalyze(
      HiveSemanticAnalyzerHookContext context,
      ASTNode ast) throws SemanticException;

  public void postAnalyze(
      HiveSemanticAnalyzerHookContext context,
      List<Task<? extends Serializable>> rootTasks) throws SemanticException;
}
As shown, the hook class also provides a postAnalyze method, called after semantic analysis; it is covered below. The HiveSemanticAnalyzerHookContext parameter's default implementation in Hive is org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContextImpl, which provides the HQL's inputs, outputs, submitting user, HiveConf, client IP, and other information. The input and output tables and partitions only become available after analysis, so they cannot be obtained in preAnalyze. The hook is invoked as follows:
List<HiveSemanticAnalyzerHook> saHooks =
    getHooks(HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK,
        HiveSemanticAnalyzerHook.class);
// Do semantic analysis and plan generation
if (saHooks != null) {
  HiveSemanticAnalyzerHookContext hookCtx = new HiveSemanticAnalyzerHookContextImpl();
  hookCtx.setConf(conf);
  hookCtx.setUserName(userName);
  hookCtx.setIpAddress(SessionState.get().getUserIpAddress());
  hookCtx.setCommand(command);
  for (HiveSemanticAnalyzerHook hook : saHooks) {
    tree = hook.preAnalyze(hookCtx, tree);
  }
  // Semantic analysis starts here; it generates and optimizes the
  // logical and physical execution plans
  sem.analyze(tree, ctx);
  // Update the context with the analyzer so the postAnalyze hooks can use it
  hookCtx.update(sem);
  for (HiveSemanticAnalyzerHook hook : saHooks) {
    hook.postAnalyze(hookCtx, sem.getRootTasks());
  }
} else {
  sem.analyze(tree, ctx);
}
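A common use of preAnalyze is to vet or rewrite statements before compilation. Below is a minimal standalone sketch of that pattern: since the real org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHook requires the Hive classpath and receives an ASTNode, the interface here is a simplified stand-in that works on the raw command string, and DropTableGuard is a hypothetical hook:

```java
import java.util.List;

public class SemanticHookSketch {

    // Simplified stand-in for HiveSemanticAnalyzerHook: the real preAnalyze
    // receives an ASTNode; a plain command string is used here for illustration.
    interface SemanticAnalyzerHook {
        String preAnalyze(String command);
    }

    // Hypothetical guard hook: abort compilation of DROP TABLE statements.
    static class DropTableGuard implements SemanticAnalyzerHook {
        @Override
        public String preAnalyze(String command) {
            if (command.trim().toUpperCase().startsWith("DROP TABLE")) {
                throw new IllegalArgumentException("DROP TABLE rejected by DropTableGuard");
            }
            return command; // hooks may also return a rewritten command
        }
    }

    public static void main(String[] args) {
        List<SemanticAnalyzerHook> hooks = List.of(new DropTableGuard());
        for (String cmd : new String[] {"SELECT * FROM t", "DROP TABLE t"}) {
            try {
                for (SemanticAnalyzerHook hook : hooks) {
                    cmd = hook.preAnalyze(cmd);
                }
                System.out.println("OK: " + cmd);
            } catch (IllegalArgumentException e) {
                System.out.println("REJECTED: " + e.getMessage());
            }
        }
    }
}
```

A real hook would inspect the AST's token type (for example TOK_DROPTABLE) rather than matching text, and would throw SemanticException to abort compilation.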
postAnalyze after semantic analysis
As the preAnalyze discussion showed, postAnalyze belongs to the same hook class, so its configuration is identical. The difference is that it runs after Hive's semantic analysis, so it can access the HQL's input and output tables and partitions, as well as the Tasks produced by analysis. From that information you can determine, for example, whether the job needs distributed execution and which execution engine it will use. The code and configuration are the same as in the preAnalyze section above.
Redactor hook before generating the execution plan
This hook runs after semantic analysis and before the QueryPlan is generated, so by the time it executes, analysis is complete and the concrete tasks to run have been determined. Its purpose is to rewrite the query string: if the HQL contains sensitive table or field information, it can be replaced here, so that when the job is viewed on YARN's ResourceManager UI (or through other channels), the redacted HQL is displayed instead.
This hook is configured through the hive.exec.query.redactor.hooks property; multiple implementation classes are separated by commas. Each hook must extend the org.apache.hadoop.hive.ql.hooks.Redactor abstract class and override the redactQuery method, described as follows:
public abstract class Redactor implements Hook, Configurable {

  private Configuration conf;

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }

  /**
   * Implementations may modify the query so that when placed in the job.xml
   * and thus potenially exposed to admin users, the query does not expose
   * sensitive information.
   */
  public String redactQuery(String query) {
    return query;
  }
}
It is invoked as follows:
public static String redactLogString(HiveConf conf, String logString)
    throws InstantiationException, IllegalAccessException, ClassNotFoundException {
  String redactedString = logString;
  if (conf != null && logString != null) {
    List<Redactor> queryRedactors = getHooks(conf, ConfVars.QUERYREDACTORHOOKS, Redactor.class);
    for (Redactor redactor : queryRedactors) {
      redactor.setConf(conf);
      redactedString = redactor.redactQuery(redactedString);
    }
  }
  return redactedString;
}
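The redactQuery contract is simple: take the query string and return a sanitized copy. A minimal standalone sketch follows, with the Redactor base class reduced to a local stand-in (the real class also implements Configurable) and a hypothetical rule that masks digit runs resembling card numbers:

```java
import java.util.regex.Pattern;

public class RedactorSketch {

    // Local stand-in for org.apache.hadoop.hive.ql.hooks.Redactor so the
    // example compiles without the Hive classpath.
    static abstract class Redactor {
        public String redactQuery(String query) {
            return query; // default: pass through unchanged
        }
    }

    // Hypothetical redactor: mask 13-16 digit runs that look like card numbers.
    static class CreditCardRedactor extends Redactor {
        private static final Pattern CARD = Pattern.compile("\\b\\d{13,16}\\b");

        @Override
        public String redactQuery(String query) {
            return CARD.matcher(query).replaceAll("****");
        }
    }

    public static void main(String[] args) {
        String hql = "SELECT * FROM payments WHERE card_no = '4111111111111111'";
        // The redacted form is what ends up in job.xml and on the RM UI.
        System.out.println(new CreditCardRedactor().redactQuery(hql));
    }
}
```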
preExecutionHook before Task execution
After the QueryPlan has been generated and authorization has passed, the concrete Tasks are executed. Before they run, a hook fires; it is configured through the hive.exec.pre.hooks property, with multiple hook implementation classes separated by commas. There are two ways to implement this hook:
1. Implement the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface
This interface receives an org.apache.hadoop.hive.ql.hooks.HookContext instance as its parameter. That class carries the execution plan, HiveConf, lineage information, UGI, submitting user name, and input/output table and partition information, which is a great help when implementing custom functionality.
The interface description is as follows:
public interface ExecuteWithHookContext extends Hook {
  void run(HookContext hookContext) throws Exception;
}
2. Implement the org.apache.hadoop.hive.ql.hooks.PreExecute interface
This interface is passed the SessionState, UGI, and the HQL's input and output tables and partitions. It is currently marked deprecated; compared with ExecuteWithHookContext above, the information it provides may not fully meet your needs.
Its interface is described as follows:
public interface PreExecute extends Hook {

  /**
   * The run command that is called just before the execution of the query.
   *
   * @param sess
   *          The session state.
   * @param inputs
   *          The set of input tables and partitions.
   * @param outputs
   *          The set of output tables, partitions, local and hdfs directories.
   * @param ugi
   *          The user group security information.
   */
  @Deprecated
  public void run(SessionState sess, Set<ReadEntity> inputs,
      Set<WriteEntity> outputs, UserGroupInformation ugi)
      throws Exception;
}
The hook is invoked as follows:
SessionState ss = SessionState.get();
HookContext hookContext = new HookContext(plan, conf, ctx.getPathToCS(),
    ss.getUserName(), ss.getUserIpAddress());
hookContext.setHookType(HookContext.HookType.PRE_EXEC_HOOK);
for (Hook peh : getHooks(HiveConf.ConfVars.PREEXECHOOKS)) {
  if (peh instanceof ExecuteWithHookContext) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());
    ((ExecuteWithHookContext) peh).run(hookContext);
    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());
  } else if (peh instanceof PreExecute) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());
    ((PreExecute) peh).run(SessionState.get(), plan.getInputs(), plan.getOutputs(),
        Utils.getUGI());
    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());
  }
}
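The getHooks call seen throughout these snippets is what turns the comma-separated class names from the configuration into hook instances. A rough standalone sketch of that resolution step, with a marker Hook interface standing in for org.apache.hadoop.hive.ql.hooks.Hook and the class-name list passed in directly rather than read from HiveConf:

```java
import java.util.ArrayList;
import java.util.List;

public class HookLoaderSketch {

    // Marker interface standing in for org.apache.hadoop.hive.ql.hooks.Hook.
    interface Hook { }

    // Trivial hook used to demonstrate the loader below.
    public static class PrintingHook implements Hook {
        @Override
        public String toString() {
            return "PrintingHook";
        }
    }

    // Simplified version of Hive's getHooks: split the comma-separated list,
    // instantiate each class reflectively, and check it against the expected type.
    static <T extends Hook> List<T> getHooks(String csv, Class<T> clazz) {
        List<T> hooks = new ArrayList<>();
        if (csv == null || csv.trim().isEmpty()) {
            return hooks;
        }
        for (String name : csv.split(",")) {
            try {
                Object hook = Class.forName(name.trim()).getDeclaredConstructor().newInstance();
                hooks.add(clazz.cast(hook));
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException("cannot load hook " + name, e);
            }
        }
        return hooks;
    }

    public static void main(String[] args) {
        // A real configuration value would look like "com.a.HookA,com.b.HookB".
        List<Hook> hooks = getHooks("HookLoaderSketch$PrintingHook", Hook.class);
        System.out.println(hooks);
    }
}
```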
ON_FAILURE_HOOKS when Task execution fails
If a task fails during execution, Hive calls the failure hooks. They are configured through the hive.exec.failure.hooks property; multiple hook implementation classes are separated by commas. Each hook must implement the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface described above. This hook is mainly used to take action when a task fails, such as collecting statistics.
The hook is invoked as follows:
hookContext.setHookType(HookContext.HookType.ON_FAILURE_HOOK);
// Get all the failure execution hooks and execute them.
for (Hook ofh : getHooks(HiveConf.ConfVars.ONFAILUREHOOKS)) {
  perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.FAILURE_HOOK + ofh.getClass().getName());
  ((ExecuteWithHookContext) ofh).run(hookContext);
  perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.FAILURE_HOOK + ofh.getClass().getName());
}
postExecutionHook after Task execution
This hook runs after the Tasks have finished. If a Task failed, the ON_FAILURE_HOOKS hooks run first, followed by postExecutionHook. It is configured through the hive.exec.post.hooks property; multiple hook implementation classes are separated by commas. As with the pre-execution hook, there are two ways to implement it:
1. Implement the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface
This works the same way as for preExecutionHook.
2. Implement the org.apache.hadoop.hive.ql.hooks.PostExecute interface
This interface is passed the SessionState, UGI, column-level LineageInfo, and the HQL's input and output tables and partitions. It is currently marked deprecated; compared with ExecuteWithHookContext above, the information it provides may not fully meet your needs.
Its interface is described as follows:
public interface PostExecute extends Hook {

  /**
   * The run command that is called just after the execution of the query.
   *
   * @param sess
   *          The session state.
   * @param inputs
   *          The set of input tables and partitions.
   * @param outputs
   *          The set of output tables, partitions, local and hdfs directories.
   * @param lInfo
   *          The column level lineage information.
   * @param ugi
   *          The user group security information.
   */
  @Deprecated
  void run(SessionState sess, Set<ReadEntity> inputs,
      Set<WriteEntity> outputs, LineageInfo lInfo,
      UserGroupInformation ugi) throws Exception;
}
The hook is invoked as follows:
hookContext.setHookType(HookContext.HookType.POST_EXEC_HOOK);
// Get all the post execution hooks and execute them.
for (Hook peh : getHooks(HiveConf.ConfVars.POSTEXECHOOKS)) {
  if (peh instanceof ExecuteWithHookContext) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());
    ((ExecuteWithHookContext) peh).run(hookContext);
    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());
  } else if (peh instanceof PostExecute) {
    perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());
    ((PostExecute) peh).run(SessionState.get(), plan.getInputs(), plan.getOutputs(),
        (SessionState.get() != null ? SessionState.get().getLineageState().getLineageInfo()
            : null), Utils.getUGI());
    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());
  }
}
postDriverRun before results are returned
This hook runs after the Tasks have finished but before results are returned to the caller. It pairs with preDriverRun; since it belongs to the same interface and configuration, it is not described again here.
Finally
That covers the hook functions across the entire HQL execution life cycle. Their order of execution can be summarized as follows:
Driver.run()
  => HiveDriverRunHook.preDriverRun() (hive.exec.driver.run.hooks)
  => Driver.compile()
  => HiveSemanticAnalyzerHook.preAnalyze() (hive.semantic.analyzer.hook)
  => SemanticAnalyze (QueryBlock, LogicalPlan, PhyPlan, TaskTree)
  => HiveSemanticAnalyzerHook.postAnalyze() (hive.semantic.analyzer.hook)
  => QueryString redactor (hive.exec.query.redactor.hooks)
  => QueryPlan Generation
  => Authorization
  => Driver.execute()
  => ExecuteWithHookContext.run() || PreExecute.run() (hive.exec.pre.hooks)
  => TaskRunner
  => if failed, ExecuteWithHookContext.run() (hive.exec.failure.hooks)
  => ExecuteWithHookContext.run() || PostExecute.run() (hive.exec.post.hooks)
  => HiveDriverRunHook.postDriverRun() (hive.exec.driver.run.hooks)
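The ordering above can be illustrated with a toy walkthrough that simply records each step in sequence; none of the names below touch real Hive classes:

```java
import java.util.ArrayList;
import java.util.List;

// Toy walkthrough of the hook firing order summarized above. Each step is
// recorded by name; the failure hook only fires when a Task fails.
public class HookOrderSketch {

    static List<String> run(boolean taskFails) {
        List<String> fired = new ArrayList<>();
        fired.add("preDriverRun");        // hive.exec.driver.run.hooks
        fired.add("preAnalyze");          // hive.semantic.analyzer.hook
        fired.add("semanticAnalysis");    // QueryBlock -> logical/physical plan
        fired.add("postAnalyze");         // hive.semantic.analyzer.hook
        fired.add("redactQuery");         // hive.exec.query.redactor.hooks
        fired.add("queryPlanGeneration");
        fired.add("authorization");
        fired.add("preExecutionHook");    // hive.exec.pre.hooks
        if (taskFails) {
            fired.add("onFailureHook");   // hive.exec.failure.hooks
        }
        fired.add("postExecutionHook");   // hive.exec.post.hooks
        fired.add("postDriverRun");       // hive.exec.driver.run.hooks
        return fired;
    }

    public static void main(String[] args) {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```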
Reprints are welcome; please credit the source: https://my.oschina.net/u/2539801/blog/1514648