What problem does Hive solve?

Why Hive emerged

The main reasons for Hive's emergence are as follows:

  • Traditional data warehouses cannot handle large-scale data: traditional data warehouses usually use relational databases as the underlying storage, and these become inefficient when processing large-scale data.
  • MapReduce is difficult to use: MapReduce is a distributed computing framework for processing large-scale data, but its programming model is complex and hard to learn.
  • A unified query interface is needed: traditional data warehouses and MapReduce each provide their own query interfaces, but those interfaces are independent of one another and difficult to manage uniformly.

To solve these problems, Facebook developed Hive in 2008. Hive is a distributed data warehouse system built on Hadoop. It provides a SQL-like query language (HiveQL) for accessing data stored in the Hadoop Distributed File System (HDFS). Hive thus addresses the inability of traditional data warehouses to handle large-scale data, simplifies the use of MapReduce, and provides a unified query interface.

The emergence of Hive has had a major impact on big data processing, making it simpler, more efficient, and more scalable.

Hive execution process

  1. Write a Hive SQL program: first, write a Hive SQL program using a tool such as the Hive CLI or the Hive Web UI. A Hive SQL program can contain various data statements, such as select, insert, update, and delete.
  2. Submit the Hive SQL program: once written, the program is submitted to the Hive server, which parses it according to its syntax and logic and generates MapReduce tasks.
  3. Execute the MapReduce tasks: the query statements in the Hive SQL program are converted into Map and Reduce tasks. Map tasks split the data into small chunks and preprocess it; Reduce tasks merge and aggregate the output of the Map tasks.
  4. Generate the query results: when the MapReduce tasks complete, the Hive server writes the query results to HDFS.
  5. Fetch the data from HDFS: finally, the query results can be retrieved from HDFS through tools such as the Hive CLI or the Hive Web UI.
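As a minimal end-to-end sketch of this flow (the page_views table and the HDFS path are hypothetical, chosen only for illustration):

-- create a table over tab-delimited text data
create table page_views (user_id int, url string)
row format delimited fields terminated by '\t';

-- load raw data from an HDFS path into the table
load data inpath '/data/page_views' into table page_views;

-- an aggregate query that Hive compiles into Map and Reduce tasks
select url, count(*) as pv from page_views group by url;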

Specifically, the execution process of a Hive SQL program is as follows:

  1. SqlParser parses the Hive SQL program into an AST (abstract syntax tree)
  2. SemanticAnalyzer performs semantic analysis on the AST
  3. Optimizer optimizes the AST
  4. Planner generates the execution plan
  5. Driver sends the execution plan to the MapReduce framework
  6. The MapReduce framework starts the Map and Reduce tasks
  7. The Map and Reduce tasks produce the query results
  8. The Hive server writes the query results to HDFS
  9. The user retrieves the query results from HDFS
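You can observe the plan this pipeline produces with Hive's explain statement, whose output lists the stages (MapReduce jobs) and the operator tree inside each stage. A small sketch, reusing the hypothetical page_views table from above:

-- print the stage graph and map/reduce operator trees for a query
explain select url, count(*) from page_views group by url;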

This process can be divided into two stages:

  • Hive SQL parsing and execution phase: the core phase of Hive SQL execution, covering parsing, optimization, planning, and execution of the Hive SQL program.
  • HDFS write and read phase: the phase in which query results are written to HDFS and then retrieved from it.

Note that the execution process of a Hive SQL program can be adjusted through the Hive server's configuration. For example, you can control the number and parallelism of MapReduce tasks by setting Hive parameters, as the sketch below shows.
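A few commonly used parameters that influence how many tasks run and how they are scheduled (the values here are illustrative, not recommendations):

-- target data volume per reducer; fewer bytes per reducer means more reducers
set hive.exec.reducers.bytes.per.reducer=268435456;
-- or force an explicit reducer count for the job
set mapreduce.job.reduces=8;
-- allow independent stages of one query to run in parallel
set hive.exec.parallel=true;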

What parts does the Hive server contain?

HiveServer2

HiveServer2 is the server-side component of Hive. It receives users' Hive SQL requests and converts them into MapReduce tasks. The conversion proceeds in the following stages:

  • Parsing stage: HiveServer2 uses an ANTLR-based parser to parse the Hive SQL request and generate an abstract syntax tree (AST). The AST is a structured representation of the request that captures its syntactic information.
  • Semantic analysis stage: HiveServer2 uses SemanticAnalyzer to check whether the request's semantics are correct, verifying variables, constants, expressions, and so on, and whether the request complies with Hive's semantic rules.
  • Optimization stage: HiveServer2 uses the Optimizer to optimize the AST and improve execution efficiency, producing a better execution plan based on the request's semantics and the data distribution.
  • Plan generation stage: HiveServer2 uses the Planner to generate the execution plan, which acts as the execution guide for the request and contains information such as the number of MapReduce tasks, partitions, and inputs and outputs.
  • Execution stage: HiveServer2 sends the execution plan to the MapReduce framework, which splits the work into multiple Map and Reduce tasks and executes them in parallel across multiple nodes.
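Clients usually reach HiveServer2 over JDBC, for example with the bundled beeline shell (the host, default port 10000, and user name below are illustrative):

# connect to HiveServer2 via its JDBC endpoint
beeline -u "jdbc:hive2://localhost:10000/default" -n hive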

Hive Metastore

Hive Metastore is Hive's metadata store. It holds metadata about Hive databases, tables, columns, partitions, and so on.
The metastore keeps this metadata in a relational database; MySQL is a common choice (an embedded Derby database is the out-of-the-box default) and offers the following advantages:

  • Scalability: MySQL is a scalable database that can support a large number of concurrent connections.
  • Reliability: MySQL supports ACID transactions, ensuring the consistency and integrity of the metadata.
  • Performance: MySQL is a high-performance database that can meet Hive's performance requirements.
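A minimal sketch of the hive-site.xml entries that point the metastore at a MySQL instance (the host, database name, and credentials are placeholders):

<!-- JDBC connection for the metastore database -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>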

Performance optimization

To minimize the number of MapReduce tasks generated, pay attention to the following points when writing Hive SQL:

  • Try to use join instead of union where the logic allows it. A union feeds the data of both tables into the MapReduce job as input, whereas a join generates only one MapReduce task.
  • Use where clauses to filter data. A where clause filters out unneeded rows and reduces the amount of data the MapReduce tasks have to process.
  • Use partitioned tables as much as possible. A partitioned table spreads data across multiple directories, so queries that filter on the partition column scan only the relevant partitions, reducing the data read and shuffled between MapReduce tasks.
  • Merge small files. Many small input files each spawn a map task; merging them into larger files reduces the task count. Note that coalesce() in HiveQL is a NULL-handling function, not a file-merging operation; file merging is controlled through Hive's hive.merge.* settings, as the example below shows.
  • Use the map join operation. A map join loads the smaller table into memory and performs the join entirely in the map phase, eliminating the reduce phase and its shuffle.

Here are some specific examples:

  • Use join instead of union:
-- with union, two MapReduce tasks are generated
select * from table1 union all select * from table2;

-- with join, one MapReduce task is generated
select * from table1 join table2 on table1.id = table2.id;
  • Use the where clause to filter data:
-- without a where clause, the full table is read
select * from table1;

-- with a where clause, only matching rows reach downstream processing
select * from table1 where id = 1;
  • Use a partitioned table:
-- unpartitioned table: the entire dataset is scanned
select * from table1;

-- partitioned table: filtering on the partition column (dt is a hypothetical
-- partition column) prunes the scan to the matching partitions only
select * from table1 where dt = '2024-01-01';
  • Merge small files:
-- note: coalesce(a, b, ...) in HiveQL returns its first non-null argument;
-- it does not merge files. Small-file merging is enabled with merge settings:
set hive.merge.mapfiles=true;            -- merge small files from map-only jobs
set hive.merge.mapredfiles=true;         -- merge small files from map-reduce jobs
set hive.merge.size.per.task=256000000;  -- target size (bytes) of merged files
  • Use the map join operation:
-- regular join: shuffles both tables through a reduce phase
select * from table1 join table2 on table1.id = table2.id;

-- map join: the mapjoin hint loads table2 into memory and joins in the map phase
select /*+ mapjoin(table2) */ * from table1 join table2 on table1.id = table2.id;

-- alternatively, let Hive convert eligible joins automatically
set hive.auto.convert.join=true;

Summary

In short, Hive translates SQL into MapReduce tasks, so developers can write SQL instead of MapReduce code. Because SQL is a widely shared skill, many data analysts already have this technology stack, which makes Hive much easier to get started with than hand-written MapReduce. For the same data-retrieval requirement, different Hive SQL formulations produce different numbers of MapReduce tasks, so writing SQL that generates as few MapReduce tasks as possible is itself a key point of performance optimization.

Origin: blog.csdn.net/xielinrui123/article/details/132772945