Apache Hive's SQL Execution Architecture

Foreword

This article belongs to the column "Big Data Technology System", which is original work by the author. Please cite the source when quoting. If you find any shortcomings or mistakes, please point them out in the comments section. Thank you!

For the column's table of contents and references, see Big Data Technology System.


Main text

This article introduces how Apache Hive converts SQL into MapReduce tasks. The entire compilation process can be divided into six stages:

  1. Perform lexical and syntactic analysis on the SQL, converting it into an AST Tree (Abstract Syntax Tree).
  2. Traverse the AST Tree, abstracting and structuring it further into the QueryBlock, the basic unit of SQL.
  3. Traverse the QueryBlock and convert it into an Operator Tree.
  4. Transform the Operator Tree with the logic-layer optimizer.
  5. Traverse the Operator Tree and translate it into MapReduce tasks.
  6. Transform the MapReduce tasks with the physical-layer optimizer to generate the final execution plan.

For ease of understanding, we will walk through a simple SQL statement that queries a table for the data of October 1:

select * from db.table where time = 20191001

This SQL will go through the following compilation process.

1. Following the grammar rules defined with Antlr, perform lexical and syntactic analysis on the SQL and transform it into the following AST Tree:

ABSTRACT SYNTAX TREE:
    TOK_QUERY
        TOK_FROM
            TOK_TABREF
                TOK_TABNAME
                    db
                    table
        TOK_INSERT
            TOK_DESTINATION
                TOK_DIR
                    TOK_TMP_FILE
            TOK_SELECT
                TOK_SELEXPR
                    TOK_ALLCOLREF
            TOK_WHERE
                =
                    TOK_TABLE_OR_COL
                        time
                    20191001
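
This dump can be reproduced with Hive's own parser. Below is a minimal sketch, assuming the hive-exec jar is on the classpath; the ParseDriver API has shifted slightly across Hive versions, so treat it as illustrative:

import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.ParseDriver;
import org.apache.hadoop.hive.ql.parse.ParseException;

public class AstDumpSketch {
    public static void main(String[] args) throws ParseException {
        ParseDriver pd = new ParseDriver();
        // Antlr-based lexical and syntactic analysis; returns the AST root
        ASTNode ast = pd.parse("select * from db.table where time = 20191001");
        // dump() renders the tree one token per line, as shown above
        System.out.println(ast.dump());
    }
}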

2. Traverse the AST Tree to abstract the basic unit of a query, the QueryBlock

The generated AST Tree is still quite complex and not easy to translate directly into a MapReduce program, so it is further abstracted and structured into a QueryBlock.

A QueryBlock is the most basic unit of SQL; it includes the input source, the computation, and the output.

QueryBlocks are generated by a recursive, pre-order traversal of the AST Tree; when different token nodes are encountered, their subtrees are saved into the corresponding attributes. The main steps are as follows (a simplified sketch of the resulting structure appears after this list):

  • TOK_QUERY: create a QueryBlock object and recursively process its child nodes.
  • TOK_FROM: save the table-name part of the syntax tree into attributes of the QueryBlock object, such as aliasToTabs.
  • TOK_INSERT: loop over and recursively process its child nodes.
  • TOK_DESTINATION: save the output-destination part of the syntax tree into the nameToDest attribute of the QBParseInfo object.
  • TOK_SELECT: save the query-expression part of the syntax tree into three attributes of the QBParseInfo object: destToSelExpr, destToAggregationExprs, and destToDistinctFuncExprs.
  • TOK_WHERE: save the where-clause part of the syntax tree into the destToWhereExpr attribute of the QBParseInfo object.
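
To make the result of this walk concrete, here is a simplified, hypothetical mirror of Hive's QB/QBParseInfo classes (the real ones live in org.apache.hadoop.hive.ql.parse), populated as they would be for the example query:

import java.util.HashMap;
import java.util.Map;

class QueryBlockSketch {
    // TOK_FROM: alias -> table, e.g. "table" -> "db.table"
    Map<String, String> aliasToTabs = new HashMap<>();
    // TOK_DESTINATION: destination name -> output sink (a tmp dir here)
    Map<String, String> nameToDest = new HashMap<>();
    // TOK_SELECT: destination -> select expressions ("*" for TOK_ALLCOLREF)
    Map<String, String> destToSelExpr = new HashMap<>();
    // TOK_WHERE: destination -> filter predicate
    Map<String, String> destToWhereExpr = new HashMap<>();

    // "insclause-0" is the name Hive gives the default insert destination
    static QueryBlockSketch forExampleQuery() {
        QueryBlockSketch qb = new QueryBlockSketch();
        qb.aliasToTabs.put("table", "db.table");
        qb.nameToDest.put("insclause-0", "TOK_TMP_FILE");
        qb.destToSelExpr.put("insclause-0", "*");
        qb.destToWhereExpr.put("insclause-0", "time = 20191001");
        return qb;
    }
}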

3. Traverse the QueryBlock and convert it into an Operator Tree

The Map and Reduce stages of the MapReduce tasks that Hive generates are both composed of Operator Trees.

A logical operator (Operator) performs a single, specific operation within a Map or Reduce stage.

Basic operators include TableScanOperator, SelectOperator, FilterOperator, GroupByOperator, JoinOperator, and ReduceSinkOperator. A toy version of the chain produced for the example query is sketched below.
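
In this toy illustration (not Hive's real API), each operator pushes rows to the next. The whole chain TableScan -> Filter -> Select -> FileSink fits in a single Map task, and no ReduceSinkOperator is needed since nothing has to be shuffled:

import java.util.Arrays;

public class OperatorChainSketch {
    interface Op { void process(Object[] row); }

    public static void main(String[] args) {
        // FileSink: write surviving rows to the tmp output dir (stdout here)
        Op fileSink = row -> System.out.println(Arrays.toString(row));
        // Select: TOK_ALLCOLREF, so every column passes through unchanged
        Op select = fileSink::process;
        // Filter: keep rows where time = 20191001 (column index 1 by assumption)
        Op filter = row -> {
            if (Integer.valueOf(20191001).equals(row[1])) select.process(row);
        };
        // TableScan: rows from db.table enter the tree here
        filter.process(new Object[]{"some-value", 20191001});   // kept
        filter.process(new Object[]{"other-value", 20191002});  // dropped
    }
}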

4. Optimize the Operator Tree with the logic-layer optimizer

The logic-layer optimizer transforms the Operator Tree and merges operators in order to reduce the number of MapReduce jobs and the amount of shuffled data.
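
One classic logic-layer rewrite is predicate pushdown: move the Filter as close to the TableScan as possible so rows are discarded before any further work. A toy sketch under that assumption (the pipeline encoding is illustrative, not Hive's optimizer API):

import java.util.ArrayList;
import java.util.List;

public class PushDownSketch {
    // Reorder a linear operator pipeline so the Filter ("FIL") runs
    // immediately after the TableScan ("TS")
    static List<String> pushFilterTowardScan(List<String> pipeline) {
        List<String> out = new ArrayList<>(pipeline);
        int fil = out.indexOf("FIL");
        if (fil > 1) {               // anything sitting between TS and FIL?
            out.remove(fil);
            out.add(1, "FIL");       // place Filter right after TableScan
        }
        return out;
    }

    public static void main(String[] args) {
        // Before: TS -> SEL -> FIL -> FS; after: TS -> FIL -> SEL -> FS
        System.out.println(pushFilterTowardScan(List.of("TS", "SEL", "FIL", "FS")));
    }
}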

5. Traverse the Operator Tree and translate it into MapReduce tasks, in the following steps (a toy sketch of the splitting step follows the list):

  • Generate a MoveTask for the output table.
  • Traverse depth-first downward from one of the root nodes of the Operator Tree.
  • A ReduceSinkOperator marks the boundary between Map and Reduce, as well as the boundaries between multiple jobs.
  • Traverse the other root nodes; when a JoinOperator is encountered, merge the MapReduce tasks.
  • Generate a StatsTask to update the metadata.
  • Cut the Operator relationships between the Map and Reduce stages.
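
The boundary-cutting step can be pictured with a toy sketch: walk a linear operator chain depth-first and start a new stage after every ReduceSinkOperator ("RS"). The example query contains no RS, so it compiles to a single map-only stage; the chain below therefore uses a hypothetical group-by query instead:

import java.util.ArrayList;
import java.util.List;

public class TaskSplitSketch {
    static List<List<String>> splitAtReduceSink(List<String> chain) {
        List<List<String>> stages = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String op : chain) {
            current.add(op);
            if (op.equals("RS")) {   // RS ends the Map side; Reduce side starts fresh
                stages.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) stages.add(current);
        return stages;
    }

    public static void main(String[] args) {
        // Prints [[TS, FIL, RS], [GBY, SEL, FS]]: one Map stage, one Reduce stage
        System.out.println(splitAtReduceSink(List.of("TS", "FIL", "RS", "GBY", "SEL", "FS")));
    }
}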

6. Optimize the MapReduce tasks with the physical-layer optimizer to generate the final execution plan. A typical physical-layer rule converts a common (shuffle) join into a map join when one side of the join is small enough; a toy version of that decision is sketched below.
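
The threshold below mirrors Hive's hive.mapjoin.smalltable.filesize setting (about 25 MB by default); the decision logic itself is a simplified sketch, not Hive's real resolver:

public class MapJoinSketch {
    static final long SMALLTABLE_FILESIZE = 25_000_000L; // ~25 MB, Hive's default

    // Broadcast the smaller input if it fits under the threshold;
    // otherwise shuffle both sides to the reducers
    static String chooseJoin(long leftBytes, long rightBytes) {
        long smaller = Math.min(leftBytes, rightBytes);
        return smaller <= SMALLTABLE_FILESIZE
                ? "MapJoin (broadcast the small table, no shuffle)"
                : "CommonJoin (shuffle both sides to reducers)";
    }

    public static void main(String[] args) {
        System.out.println(chooseJoin(1_000_000L, 800_000_000L)); // MapJoin
    }
}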

Source: blog.csdn.net/Shockang/article/details/127954639