[Hive] A deep dive into Hive's advantages, disadvantages, and architecture

1. What is Hive

Hive is a data warehouse tool built on top of Hadoop. It maps structured data files to tables and provides SQL-like query capabilities.

In essence, Hive translates HQL into MapReduce programs:

(1) The data processed by Hive is stored in HDFS
(2) The underlying implementation of Hive analysis data is MapReduce
(3) The execution program runs on Yarn

Most HQL statements are ultimately wrappers around MR programs. The point of wrapping them in a SQL-like language is accessibility: any developer who knows SQL can use it.

Processing with raw MR: developer -> analyze requirements -> write an MR program -> get results
Processing with Hive: developer -> analyze requirements -> write SQL -> Hive converts the SQL into MR -> get results
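To make the contrast concrete, here is the classic word count, which takes a full MR program in Java but only a few lines of HQL. This is an illustrative sketch: the `docs` table and its `line` column are hypothetical.

```sql
-- Hypothetical table docs(line STRING): one line of text per row.
-- Hive translates this into a map stage (split and emit words)
-- and a reduce stage (sum the counts) automatically.
SELECT word, count(*) AS cnt
FROM docs
LATERAL VIEW explode(split(line, '\\s+')) t AS word
GROUP BY word;
```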

2. The advantages and disadvantages of Hive

2.1 Advantages

(1) The operation interface adopts SQL-like syntax to provide rapid development capabilities (simple and easy to use).
(2) Avoid writing MapReduce and reduce the learning cost of developers.
(3) Hive's execution latency is relatively high, so it is typically used for data analysis in scenarios that do not require real-time results.
(4) Hive's strength is processing big data; it has no advantage on small data sets, again because of its high execution latency.
(5) Hive supports user-defined functions, and users can implement their own functions according to their needs.
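As a hedged sketch of the UDF workflow: a custom function is typically written as a Java class (for example, one extending `org.apache.hadoop.hive.ql.exec.UDF`), packaged into a jar, and then registered in HQL. The jar path, function name, and class name below are all hypothetical.

```sql
-- Register a custom function packaged in a jar (path is hypothetical):
ADD JAR hdfs:///libs/my_udfs.jar;
CREATE TEMPORARY FUNCTION mask_phone AS 'com.example.MaskPhoneUDF';

-- Use it like a built-in function:
SELECT mask_phone(phone) FROM users;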

2.2 Disadvantages

1) Hive's HQL has limited expressive power
(1) Iterative algorithms cannot be expressed; complex logic is hard to encapsulate in HQL.
(2) Hive is not good at data mining: the MapReduce processing model underneath is slow and restrictive, so more efficient algorithms cannot be implemented on top of it.

2) Hive's efficiency is relatively low
(1) The MapReduce jobs Hive generates automatically are usually not very smart; like machine translation, the result works but is often not optimal.
(2) Hive is hard to tune, and the tuning is coarse-grained: you can only optimize at the framework level, not inside the generated MR programs.

3. Hive architecture principles

Hive itself stores nothing and computes nothing: storage is delegated to HDFS and computation to MR. Hive is only responsible for translating HQL into MR programs.

3.1 User interface: Client

CLI (command-line interface), JDBC/ODBC (programmatic access to Hive), and WebUI (browser access to Hive).

3.2 Metadata: Metastore

An HDFS data file is not automatically mapped to a table structure: the file carries no description (metadata) of the data inside it, so the mapping must be defined by the user.
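As a minimal sketch of defining that mapping yourself (table name, columns, and HDFS path are all made up), an external table declaration tells Hive how to read a plain HDFS file as a table:

```sql
-- Only this schema definition goes into the metastore;
-- the data itself stays in /user/hive/logs on HDFS.
CREATE EXTERNAL TABLE access_logs (
  ip  STRING,
  ts  STRING,
  url STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/hive/logs';
```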

Metadata includes: the database a table belongs to (default by default), the table owner, the table name and comment, fields and field comments, columns/partition fields, the table type (managed or external), the directory where the table data lives, and so on. The table data itself lives in HDFS. Many frameworks, such as Atlas, watch the table information in the metastore database to implement metadata management. A later article will cover how to use the metastore database to build an enterprise data-warehouse data dictionary, along with commonly used SQL queries.

By default, metadata is stored in the embedded Derby database (small, with many shortcomings, such as no support for concurrent connections; think of it as a lightweight stand-in for MySQL). In practice, MySQL is usually used as the metastore backend, i.e. metadata is stored in MySQL.
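When MySQL is the backend, the metastore can be inspected directly with SQL. `DBS` and `TBLS` are standard tables in the Hive metastore schema; this kind of query (run against MySQL, not inside Hive) is how tools build a data dictionary. A minimal sketch:

```sql
-- List all external tables, per database, straight from the metastore:
SELECT d.NAME AS db_name, t.TBL_NAME, t.TBL_TYPE
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
WHERE t.TBL_TYPE = 'EXTERNAL_TABLE';
```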

3.3 Combine Hadoop

Use HDFS for storage and MapReduce for calculation.

3.4 Driver: Driver

(1) Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST); this step is usually done with a third-party library such as ANTLR. The AST is then analyzed semantically: does the table exist, do the fields exist, is the SQL semantically valid?
(2) Compiler: compiles the AST into a logical execution plan.
(3) Optimizer (Query Optimizer): optimizes the logical execution plan.
(4) Executor: converts the logical execution plan into a physical plan (e.g. MR jobs) that can actually run.
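These Driver stages can be observed directly with Hive's `EXPLAIN` statement, which prints the plan produced for a query (the table and columns below are hypothetical):

```sql
-- EXPLAIN shows the stage plan the Driver generated:
-- the parsed/optimized operator tree and the MR/Tez/Spark stages.
EXPLAIN
SELECT dept, count(*) FROM employees GROUP BY dept;
```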

For Hive, the execution engine can be MR, Spark, or Tez, and Flink support may come later. The translation logic differs per engine: the MR engine translates SQL into MR jobs, while the Spark engine translates SQL into RDD programs. The Tez engine is less commonly used externally; it can be understood as a DAG-oriented, memory-based shuffle optimization on top of MR.

Through the interactive interfaces provided to users, Hive receives user commands (SQL), uses its Driver together with the metadata in the MetaStore to translate those commands into MapReduce jobs, submits them to Hadoop for execution, and finally returns the execution results to the user interface.

4. Hive compared with databases

Hive is not a database! It is just a framework! Only the usage looks similar: more than 95% of its SQL statements are wrappers around MR programs, so the two are not really comparable, even though the query syntax looks alike.

Since Hive uses the SQL-like query language HQL (Hive Query Language), it is easy to mistake Hive for a database. In fact, structurally, Hive and a database have nothing in common beyond the similar query language. This section explains the differences from several angles. Databases can serve online applications, while Hive is designed for data warehousing; keeping this in mind helps in understanding Hive's characteristics from an application perspective.

4.1 Query language

Since SQL is widely used in data warehouses, the SQL-like query language HQL was designed around Hive's characteristics, and developers familiar with SQL can pick it up easily.

4.2 Data update

Since Hive is designed for data-warehouse workloads, which are read-heavy and write-light, rewriting historical data in Hive is not recommended: all data is fixed at load time. Database data, by contrast, is modified frequently, with INSERT INTO ... VALUES to add rows and UPDATE ... SET to change them. HQL offers similar statements, but they are slow, because under the hood Hive rewrites whole files: read, modify, write back.
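A sketch of the typical write pattern (table names and paths are hypothetical): data is loaded once, and "updates" rewrite a whole table or partition rather than individual rows.

```sql
-- Data is fixed at load time:
LOAD DATA INPATH '/staging/sales_20240101' INTO TABLE sales;

-- "Updating" means rewriting the table (or a partition) wholesale,
-- not issuing row-level UPDATEs:
INSERT OVERWRITE TABLE sales
SELECT * FROM sales WHERE amount IS NOT NULL;
```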

4.3 Execution delay

When Hive queries data, it has no indexes, so it must scan the whole table (without partitions or buckets this is a brute-force full scan), which makes latency high. Another factor behind Hive's high latency is the MapReduce framework itself: since MapReduce has high startup latency, Hive queries executed on it inherit that latency. By contrast, a database's execution latency is low, but only under one condition: the data is small. Once the data grows beyond what a database can handle, Hive's parallel computation clearly shows its advantage.
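Partitioning is the usual way to avoid the full scan described above. A minimal sketch (table, columns, and partition values are hypothetical): partitioning by date maps each date to its own HDFS directory, so a filter on the partition column reads only that directory.

```sql
-- Each dt value becomes a separate directory under the table's path:
CREATE TABLE events (user_id STRING, action STRING)
PARTITIONED BY (dt STRING);

-- Only the dt='2024-01-01' partition directory is scanned:
SELECT count(*) FROM events WHERE dt = '2024-01-01';
```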

4.4 Data scale

Since Hive is built on a cluster and can use MapReduce for parallel computation, it supports very large-scale data. A database, correspondingly, supports relatively small data. If MySQL is enough for your data, you don't need Hive.


Origin blog.csdn.net/qq_43771096/article/details/109481655