In-depth understanding of the underlying principles of Apache Hive, a big data warehouse tool

Author: Phantom

Overview

After learning the basics of Apache Hive and Hive SQL, you already know that Hive works by transforming SQL statements into MR programs under the hood. To understand Hive more deeply, you also need to understand how Hive SQL is actually executed. This article walks through the underlying execution principles of Hive to help readers better understand what Hive does.

The underlying principle of Hive

After using Hive for day-to-day development work, you can roughly see that what Hive does is transform SQL statements into MR programs under the hood.

Hive execution architecture

Hive works interactively on top of Hadoop, so its overall workflow can be summarized in the following architecture diagram:

[Figure: Hive execution architecture]

Core components

As the architecture diagram above shows, Hive job execution involves the following five components:

  1. UI (User Interface): It can be seen as the command line interface where we submit SQL statements.
  2. DRIVER: The component that receives the query. This component implements the concept of a session handle.
  3. COMPILER (compiler): Responsible for converting SQL into an execution plan that the platform can execute. It performs semantic analysis of the different query blocks and query expressions, and finally generates an execution plan with the help of table and partition metadata looked up from the META STORE.
  4. META STORE: Stores all structural information for various tables and partitions in Hive.
  5. EXECUTION ENGINE: Responsible for submitting execution plans compiled in the COMPILER phase to different platforms.
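
For example, the structural information that the META STORE keeps for a table can be inspected directly from the UI. A minimal sketch, using the hivedb.sys_user table that appears in the case study later in this article:

-- Ask the metastore for the structural information it holds about a table:
-- columns and types, owner, HDFS location, input/output formats, and so on.
DESCRIBE FORMATTED hivedb.sys_user;

-- The full table definition reconstructed from the metastore:
SHOW CREATE TABLE hivedb.sys_user;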

Work process

Combined with the above figure, the normal workflow of Hive is divided into the following steps:

  • Step 1: The user writes SQL through the UI; submitting it calls the DRIVER interface;
  • Step 2: After the DRIVER receives the query task, it will create a session handle for the query, and send the query to the COMPILER to generate an execution plan;
  • Step 3: The COMPILER obtains the metadata required for this query from the META STORE;
  • Step 4: The metadata is used to type-check the expressions in the query tree and to prune partitions based on the query predicates;
  • Step 5: The execution plan generated by the compiler is a staged DAG (directed acyclic graph), where each stage may be a map/reduce job, a metadata operation, or an HDFS operation. The generated plan is sent back to the DRIVER. For a map/reduce stage, the plan includes map operator trees and a reduce operator tree, and the execution engine sends these jobs to MapReduce (see the EXPLAIN sketch after this list);
  • Step 6: The execution engine submits these stages to the appropriate components. In each task (mapper/reducer), the data associated with the table or intermediate output is read from HDFS files and passed through the associated operator tree. Finally, the data is written to a temporary HDFS file via the serializer (this happens in the mapper if the operation does not need a reduce phase). Temporary files are used to feed the later map/reduce stages of the plan.
  • Step 7: The final temporary file will be moved to the location of the relevant table, ensuring that dirty data is not read (file renaming is an atomic operation in HDFS).
  • Step 8: For the user's query, the content of the temporary file is directly read from HDFS by the execution engine, and then sent to the UI through the Driver.
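
The staged DAG mentioned in Step 5 can be observed directly with the EXPLAIN statement. Below is a minimal sketch using the sample table from the case study later in this article; the exact output differs between Hive versions and execution engines:

EXPLAIN select * from hivedb.sys_user where id = '1001';

-- Typical shape of the output (abridged and illustrative):
-- STAGE DEPENDENCIES:
--   Stage-1 is a root stage
--   Stage-0 depends on stages: Stage-1
-- STAGE PLANS:
--   Stage-1: Map Reduce      (the map/reduce job built from the operator trees)
--   Stage-0: Fetch Operator  (returns the result to the client)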

Hive SQL Compilation Principle

The most important job Hive does is compiling SQL statements into MapReduce, and this compilation happens in the COMPILER. The compilation and conversion process is divided into the following stages:

[Figure: Hive SQL compilation stages]

  1. Syntax parsing: Antlr, a language recognition tool, defines the SQL grammar rules, performs lexical and syntactic analysis of the SQL, and converts it into an abstract syntax tree (AST Tree);
  2. Semantic parsing: traverse the AST Tree and abstract the basic unit of the query, the QueryBlock;
  3. Generate the logical execution plan: traverse the QueryBlocks and translate them into the operator tree (OperatorTree);
  4. Optimize the logical execution plan: the logical-layer optimizer transforms the OperatorTree and merges operators to reduce the number of MapReduce jobs and the amount of data transferred and shuffled;
  5. Generate the physical execution plan: traverse the OperatorTree and translate it into MapReduce tasks;
  6. Optimize the physical execution plan: the physical-layer optimizer transforms the MapReduce tasks and generates the final execution plan.

Compilation case study

The process of compiling and converting Hive SQL into an MR program is fairly abstract. For better understanding, let's walk through it with a simple query:

select * from hivedb.sys_user where id = '1001';

Stage 1: Syntax parsing

  • Using the SQL grammar rules defined with Antlr, the SQL is lexically and syntactically parsed and converted into an abstract syntax tree (AST Tree):
ABSTRACT SYNTAX TREE:
TOK_QUERY
  TOK_FROM
    TOK_TABREF
      TOK_TABNAME
        hivedb
        sys_user
  TOK_INSERT
    TOK_DESTINATION
      TOK_DIR
        TOK_TMP_FILE
    TOK_SELECT
      TOK_SELEXPR
        TOK_ALLCOLREF
    TOK_WHERE
      =
        TOK_TABLE_OR_COL
          id
        '1001'
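
A tree of this shape can be reproduced on your own cluster. Depending on the Hive version, the AST is printed by EXPLAIN AST, or appears as the ABSTRACT SYNTAX TREE section of EXPLAIN EXTENDED output in older releases:

-- Newer Hive releases:
EXPLAIN AST select * from hivedb.sys_user where id = '1001';

-- Older Hive releases include the AST in the extended explain output:
EXPLAIN EXTENDED select * from hivedb.sys_user where id = '1001';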

Stage 2: Semantic parsing

  • Traverse the AST Tree and abstract the QueryBlock, the basic unit of the query. The AST Tree is still quite complex and is not easy to translate directly into a MapReduce program, so it needs to be further abstracted and structured into QueryBlocks.
  • A QueryBlock is the most basic unit of SQL and has three parts: the input source, the computation, and the output. Simply put, a QueryBlock is a subquery.
  • QueryBlocks are generated recursively: the AST Tree is traversed in order, and when different Token nodes (which can be understood as special markers) are encountered, they are saved to the corresponding attributes.
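
For example, a query that contains a subquery is abstracted into two QueryBlocks, one for the inner query and one for the outer query. A hypothetical sketch:

-- Outer QueryBlock: input = the result of the subquery, computation = count(*), output = the client
select count(*)
from (
    -- Inner QueryBlock: input = hivedb.sys_user, computation = the filter on id, output = column id
    select id from hivedb.sys_user where id = '1001'
) t;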

Stage 3: Generate the logical execution plan

Traverse the QueryBlocks and translate them into the operator tree (OperatorTree):

  • In the MapReduce tasks that Hive finally generates, both the Map stage and the Reduce stage are composed of OperatorTrees.
  • Basic operators include:
    • TableScanOperator
    • SelectOperator
    • FilterOperator
    • JoinOperator
    • GroupByOperator
    • ReduceSinkOperator
  • Operators pass data between the Map and Reduce stages as a stream: after each operator finishes processing a row of data, it passes the row to its child operator for further computation.
  • Since Join/GroupBy/OrderBy all have to be completed in the Reduce phase, a ReduceSinkOperator is generated before the corresponding operator to combine and serialize the fields into the Reduce Key/Value and the Partition Key.
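
As a sketch of what such an operator tree looks like, consider a simple aggregation. The pipeline in the comments is illustrative; the dept column is assumed, and the exact tree produced by a given Hive version may differ:

select dept, count(*) from hivedb.sys_user group by dept;

-- Rough operator tree for the plan above:
--   Map stage:    TableScanOperator -> SelectOperator -> GroupByOperator (partial aggregation)
--                 -> ReduceSinkOperator (emits dept as the Reduce Key and shuffles the rows)
--   Reduce stage: GroupByOperator (final aggregation) -> FileSinkOperator (writes the result)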

Stage 4: Optimize the logical execution plan

Logical query optimization in Hive can be roughly divided into the following categories:

  • Projection pruning
  • Deriving transitive predicates
  • Predicate pushdown
  • Merging Select-Select and Filter-Filter pairs into a single operator
  • Multi-way join (merging multiple joins that share the same join key)
  • Query rewriting to accommodate join skew on certain column values
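
Predicate pushdown and projection pruning, for example, mean that a filter written against a join result is evaluated while scanning the base table, and only the referenced columns are carried through the plan. A hedged sketch using a hypothetical sys_dept table:

-- Hypothetical tables: sys_user(id, dept_id, name) and sys_dept(dept_id, dept_name).
select u.name, d.dept_name
from hivedb.sys_user u
join hivedb.sys_dept d on u.dept_id = d.dept_id
where u.id = '1001';

-- With predicate pushdown, the filter u.id = '1001' is applied by a FilterOperator
-- directly above the TableScanOperator of sys_user, before the join rather than after it.
-- With projection pruning, only the columns actually referenced
-- (id, dept_id, name, dept_name) are kept in the OperatorTree.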

Stage 5: Generate the physical execution plan

Generating the physical execution plan converts the OperatorTree produced by the logical execution plan into MapReduce jobs, mainly through the following steps:

  • Generate a MoveTask for the output table
  • Perform a depth-first traversal downward from one of the root nodes of the OperatorTree
  • A ReduceSinkOperator marks the boundary between Map and Reduce, and the boundary between multiple jobs
  • Traverse the other root nodes; when a JoinOperator is encountered, merge the MapReduceTasks
  • Generate a StatsTask to update the metadata
  • Cut the operator links between the Map and Reduce sides
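
For a query that writes into a table, the resulting plan typically contains a map/reduce stage, a MoveTask that publishes the temporary output to the table's location, and a StatsTask that updates the metadata. An illustrative sketch, assuming a hypothetical target table hivedb.sys_user_bak:

EXPLAIN
insert overwrite table hivedb.sys_user_bak
select * from hivedb.sys_user where id = '1001';

-- Typical stage layout in the output (abridged; stage names vary by version):
--   Stage-1: Map Reduce      (scan and filter)
--   Stage-0: Move Operator   (the MoveTask: move the temporary output to the table location)
--   Stage-2: Stats Work      (the StatsTask: update table/partition statistics)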

Stage 6: Optimize the physical execution plan

Physical optimization in Hive can be roughly divided into the following categories:

  • Partition Pruning
  • Scan pruning based on partitions and buckets
  • Scan pruning if query is based on sampling
  • Apply Group By on the map side in some cases
  • Execute Join on mapper
  • Optimize Union so that Union only executes on the map side
  • In a multi-way join, decide which table to stream last according to user hints
  • Remove unnecessary ReduceSinkOperators
  • For queries with a Limit clause, reduce the number of files that need to be scanned for the table
  • For queries with a Limit clause, limit the output from the mapper by limiting what the ReduceSinkOperator produces
  • Reduce the number of Tez jobs required for user-submitted SQL queries
  • For simple fetch queries, avoid MapReduce jobs
  • For simple fetch queries with aggregations, perform aggregations without MapReduce tasks
  • Rewrite Group By query to use index table instead of original table
  • An index scan is used when the predicate above the table scan is an equality predicate and the columns in the predicate have an index
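
Several of these physical optimizations are controlled by session-level configuration parameters. A few commonly used ones are shown below as a hedged sketch; defaults and availability depend on the Hive version:

-- Convert a common join into a map-side join when the small table fits in memory:
set hive.auto.convert.join=true;

-- Perform partial aggregation on the map side for Group By:
set hive.map.aggr=true;

-- Let simple fetch queries (for example, select * with a limit) bypass MapReduce entirely:
set hive.fetch.task.conversion=more;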

Through the above six stages, SQL is parsed and mapped into MapReduce tasks that run on the cluster.

Summary

This article introduced the underlying working principles of Apache Hive. By walking through the workflow of Hive SQL execution down to the underlying execution process, it should help readers better understand what Hive does.

Origin: my.oschina.net/u/3620858/blog/5514287