Spark 3 spark-sql explain command execution process

1. SparkSQLDriver

For each SQL statement, except those handled by Hive's CommandProcessorFactory (such as dfs), a SparkSQLDriver object is created, and then its init and run methods are called.
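For orientation, here is a condensed sketch of how the CLI decides which path a statement takes. The real logic lives in SparkSQLCLIDriver.processCmd; cmd and hiveConf are assumed to be in scope, and this is not the verbatim source:

// Hive-native commands (dfs, add jar, ...) keep their own processor;
// ordinary SQL statements go through SparkSQLDriver.
val tokens = cmd.trim.split("""\s+""")
val proc = CommandProcessorFactory.get(tokens, hiveConf)
if (proc == null || proc.isInstanceOf[Driver]) {
  val driver = new SparkSQLDriver
  driver.init()
  driver.run(cmd)          // parse, analyze, execute, collect
} else {
  proc.run(cmd.trim.substring(tokens(0).length))  // e.g. dfs -ls /
}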

override def run(command: String): CommandProcessorResponse = {
  try {
    // Expand ${...} variables using the session's SQLConf.
    val substitutorCommand = SQLConf.withExistingConf(context.conf) {
      new VariableSubstitution().substitute(command)
    }
    context.sparkContext.setJobDescription(substitutorCommand)
    // Parse and analyze the statement, eagerly executing any command,
    // then wrap the resulting plan in a fresh QueryExecution.
    val execution = context.sessionState.executePlan(context.sql(command).logicalPlan)
    hiveResponse = SQLExecution.withNewExecutionId(execution) {
      hiveResultString(execution.executedPlan)
    }
    tableSchema = getResultSetSchema(execution)
    new CommandProcessorResponse(0)
  } catch {
    // error handling omitted
  }
}
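The substitution step expands ${...} references from the session configuration before the statement is parsed (SparkSqlParser applies the same VariableSubstitution). A small illustration through the public API; the variable name myvar is just an example:

// spark.sql.variable.substitute is true by default, so ${...} references
// are expanded from the session conf before parsing.
spark.conf.set("myvar", "42")
spark.sql("SELECT ${myvar} AS v").show()
// +---+
// |  v|
// +---+
// | 42|
// +---+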

The most important line is:

val execution = context.sessionState.executePlan(context.sql(command).logicalPlan)

First, context.sql(command) is executed. The context.sql method is as follows:

def sql(sqlText: String): DataFrame = sparkSession.sql(sqlText)

sparkSession.sql

In it, plan is the parsed but still unresolved logical plan.

def sql(sqlText: String): DataFrame = withActive {
  val tracker = new QueryPlanningTracker
  // Parsing only: the result may contain unresolved relations and attributes.
  val plan = tracker.measurePhase(QueryPlanningTracker.PARSING) {
    sessionState.sqlParser.parsePlan(sqlText)
  }
  Dataset.ofRows(self, plan, tracker)
}
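To see what the parser alone produces, it can be called directly (sessionState is public but marked @Unstable; the printed tree below is typical output and may differ slightly across versions, with the leading quotes marking unresolved nodes):

val parsed = spark.sessionState.sqlParser.parsePlan("SELECT a FROM t")
println(parsed)
// 'Project ['a]
// +- 'UnresolvedRelation [t]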

Dataset.ofRows

Inside ofRows, qe.assertAnalyzed() triggers analysis of the plan:

def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan, tracker: QueryPlanningTracker)
  : DataFrame = sparkSession.withActive {
  val qe = new QueryExecution(sparkSession, logicalPlan, tracker)
  qe.assertAnalyzed()
  new Dataset[Row](qe, RowEncoder(qe.analyzed.schema))
}
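This is why resolution errors surface before any job runs. For example (the exact message wording varies by Spark version):

// Analysis fails inside ofRows, before anything is executed:
spark.sql("SELECT * FROM no_such_table")
// org.apache.spark.sql.AnalysisException: Table or view not found: no_such_table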

context.sql(command).logicalPlan reads the logicalPlan field of the Dataset. The code is as follows:

@transient private[sql] val logicalPlan: LogicalPlan = {
  val plan = queryExecution.commandExecuted
  if (sparkSession.sessionState.conf.getConf(SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED)) {
    // Tag the plan with this Dataset's id so ambiguous self-joins can be detected.
    val dsIds = plan.getTagValue(Dataset.DATASET_ID_TAG).getOrElse(new HashSet[Long])
    dsIds.add(id)
    plan.setTagValue(Dataset.DATASET_ID_TAG, dsIds)
  }
  plan
}

QueryExecution

The Dataset's logicalPlan reads the queryExecution.commandExecuted field, which is lazy and therefore initialized on first use. It in turn uses analyzed, which is also lazy: analyzed converts the unresolved logical plan into a resolved logical plan, and commandExecuted then runs eagerlyExecuteCommands(analyzed).
The first access to commandExecuted produces a CommandResult object; later accesses to the Dataset's logicalPlan return that same CommandResult.

lazy val analyzed: LogicalPlan = executePhase(QueryPlanningTracker.ANALYSIS) {
  // We can't clone `logical` here, which will reset the `_analyzed` flag.
  sparkSession.sessionState.analyzer.executeAndCheck(logical, tracker)
}
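As a quick illustration of what analysis does (output abbreviated; the exact tree shape varies by version), resolving the attribute a and the relation t against the catalog assigns expression ids:

// The parser leaves 'a and 't unresolved; the analyzer binds them
// to the temp view's schema.
spark.range(3).toDF("a").createOrReplaceTempView("t")
println(spark.sql("SELECT a FROM t").queryExecution.analyzed)
// Project [a#2L]
// +- SubqueryAlias t
//    +- ...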

lazy val commandExecuted: LogicalPlan = mode match {
  case CommandExecutionMode.NON_ROOT => analyzed.mapChildren(eagerlyExecuteCommands)
  case CommandExecutionMode.ALL => eagerlyExecuteCommands(analyzed)
  case CommandExecutionMode.SKIP => analyzed
}
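This can be observed from a Dataset directly. CommandResult lives in org.apache.spark.sql.catalyst.plans.logical and is a leaf node rather than a Command, which matters later when the plan is fed into a second QueryExecution:

import org.apache.spark.sql.catalyst.plans.logical.{Command, CommandResult}

val plan = spark.sql("SHOW TABLES").queryExecution.commandExecuted
plan.isInstanceOf[CommandResult]  // true: the command already ran eagerly
plan.isInstanceOf[Command]        // false: it will not be re-executed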

eagerlyExecuteCommands executes each Command node immediately and replaces it with a CommandResult object holding the collected rows:

private def eagerlyExecuteCommands(p: LogicalPlan) = p transformDown {
  case c: Command =>
    // Run the command right away through a nested QueryExecution.
    val qe = sparkSession.sessionState.executePlan(c, CommandExecutionMode.NON_ROOT)
    val result = SQLExecution.withNewExecutionId(qe, Some(commandExecutionName(c))) {
      qe.executedPlan.executeCollect()
    }
    // Replace the Command node with a leaf carrying the collected rows.
    CommandResult(
      qe.analyzed.output,
      qe.commandExecuted,
      qe.executedPlan,
      result)
  case other => other
}
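This eager execution is why DDL takes effect as soon as spark.sql returns, with no action such as collect() required (t_demo below is just an example table name):

// Commands run eagerly inside spark.sql itself.
spark.sql("CREATE TABLE t_demo (id INT) USING parquet")
spark.catalog.tableExists("t_demo")  // true: the command already ran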

SparkSQLDriver.run

Returning to the main flow: the argument passed to context.sessionState.executePlan is now a CommandResult object.

val execution = context.sessionState.executePlan(context.sql(command).logicalPlan)
hiveResponse = SQLExecution.withNewExecutionId(execution) {
  hiveResultString(execution.executedPlan)
}
sessionState.executePlan

The default mode is CommandExecutionMode.ALL, and here plan is the CommandResult object.

def executePlan(
    plan: LogicalPlan,
    mode: CommandExecutionMode.Value = CommandExecutionMode.ALL): QueryExecution =
  createQueryExecution(plan, mode)

protected def createQueryExecution:
    (LogicalPlan, CommandExecutionMode.Value) => QueryExecution =
      (plan, mode) => new QueryExecution(session, plan, mode = mode)

So after val execution = context.sessionState.executePlan(context.sql(command).logicalPlan) runs, execution is a new QueryExecution wrapping the CommandResult. Because CommandResult is a leaf node and not a Command, the eagerlyExecuteCommands pass inside this new QueryExecution matches the case other branch and leaves the plan unchanged, so the command is not executed a second time.

SparkSQLDriver.run

hiveResponse = SQLExecution.withNewExecutionId(execution) {
  hiveResultString(execution.executedPlan)
}
tableSchema = getResultSetSchema(execution)
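hiveResultString formats the rows of the executed plan as Hive-style strings, which is what the CLI finally prints. For an EXPLAIN statement, the eagerly executed ExplainCommand's result rows are the plan text itself, so the end-to-end effect looks roughly like this (plan output abbreviated and version-dependent):

// EXPLAIN itself parses to a command, so it is executed eagerly and its
// single result column contains the formatted plan.
spark.sql("EXPLAIN SELECT 1").show(truncate = false)
// == Physical Plan ==
// *(1) Project [1 AS 1#0]
// +- *(1) Scan OneRowRelation[]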

Origin blog.csdn.net/houzhizhen/article/details/131686021