Hive: Introduction, Advantages and Disadvantages, and Architecture Principles

1 What is Hive

  1. Introduction to Hive

Hive is a data analytics tool open-sourced by Facebook for processing massive amounts of structured log data.

Hive is a Hadoop-based data warehouse tool that can map structured data files into a table and provide SQL-like query functions.
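As a sketch of this mapping (the table name, columns, and HDFS path below are hypothetical, not from the original post):

```sql
-- Map a tab-delimited log file on HDFS to a table (names are illustrative).
CREATE EXTERNAL TABLE IF NOT EXISTS access_log (
    ip  STRING,
    url STRING,
    ts  BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/warehouse/logs/access_log';

-- Query the file with SQL-like syntax; no MapReduce code is written by hand.
SELECT url, COUNT(*) AS hits
FROM access_log
GROUP BY url;
```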

  1. The essence of Hive: it converts HQL into MapReduce programs


Data processed by Hive is stored in HDFS

The underlying implementation of Hive's data analysis is MapReduce

The resulting programs execute on Yarn
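The three points above can be seen from the Hive shell (the table `access_log` is hypothetical; `hive.execution.engine` is a real Hive setting whose legacy value `mr` selects the MapReduce engine):

```sql
-- Use MapReduce as the execution engine.
SET hive.execution.engine=mr;

-- Hive compiles this into a MapReduce job, submits it to Yarn,
-- and reads the table's data files from HDFS.
SELECT COUNT(*) FROM access_log;
```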

2 Advantages and disadvantages of Hive

2.1 Advantages

  1. The operation interface uses SQL-like syntax, which enables rapid development (simple and easy to use).
  2. It avoids having to write MapReduce code, reducing developers' learning cost.
  3. Hive's execution latency is relatively high, so it is typically used for data analysis and scenarios that do not require real-time results.
  4. Hive's strength lies in processing big data; it has no advantage for small data sets, because of its relatively high execution latency.
  5. Hive supports user-defined functions, so users can implement functions tailored to their own needs.
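As a sketch of how a user-defined function is plugged in (the jar path, function name, class name, and table are all hypothetical):

```sql
-- Register a hypothetical UDF packaged in a jar (path and class are illustrative).
ADD JAR hdfs:///user/hive/jars/my_udfs.jar;
CREATE TEMPORARY FUNCTION to_domain AS 'com.example.hive.ToDomainUDF';

-- Use it like a built-in function.
SELECT to_domain(url) AS domain, COUNT(*) AS hits
FROM access_log
GROUP BY to_domain(url);
```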

2.2 Disadvantages

  1. Hive's HQL has limited expressiveness
    1. Iterative algorithms cannot be expressed
    2. It is not well suited to data mining: due to the constraints of the MapReduce processing model, more efficient algorithms cannot be implemented
  2. Hive is relatively inefficient
    1. The MapReduce jobs Hive generates automatically are usually not intelligent enough
    2. Hive tuning is difficult, and the tuning granularity is coarse

3 Hive Architecture Principles


  1. User interface: Client

    CLI (command-line interface), JDBC/ODBC (JDBC access to Hive), WEBUI (browser access to Hive)

  2. Metadata: Metastore

    Metadata includes: table names, the database each table belongs to (`default` by default), table owner, column/partition fields, table type (whether it is an external table), the HDFS directory where the table data is located, etc. By default, metadata is stored in the built-in Derby database; using MySQL to store the metastore is recommended.
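Most of this metadata can be inspected from the Hive shell; for a hypothetical table `access_log`:

```sql
-- Prints the owner, database, location (HDFS directory), table type
-- (MANAGED_TABLE vs EXTERNAL_TABLE), columns, and partition fields.
DESCRIBE FORMATTED access_log;
```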

  3. Hadoop

    Use HDFS for storage and MapReduce for calculation.

  4. Driver

    Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST); this step is usually done with a third-party tool library such as Antlr. Semantic checks are then performed on the AST, e.g. whether the table exists, whether the fields exist, and whether the SQL semantics are wrong.

    Compiler (Physical Plan): compiles the AST to generate a logical execution plan.

    Optimizer (Query Optimizer): optimizes the logical execution plan.

    Execution: converts the logical execution plan into a physical plan that can run; for Hive, this means MR/Spark jobs.
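The plan the Driver produces can be observed with `EXPLAIN` (the query and table are hypothetical):

```sql
-- EXPLAIN prints the stage plan Hive's Driver generates; on the MapReduce
-- engine a GROUP BY typically compiles to a map/reduce stage pair.
EXPLAIN
SELECT url, COUNT(*) AS hits
FROM access_log
GROUP BY url;
```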


Hive receives user commands (SQL) through the interactive interfaces it provides, uses its own Driver together with the metadata (Metastore) to translate these commands into MapReduce jobs, submits them to Hadoop for execution, and finally returns the execution results to the user interface.

Origin blog.csdn.net/meng_xin_true/article/details/126039609