[Big Data] Principle and Mechanism of Hive Join

1. Overview

Hive is a Hadoop-based data warehouse solution that provides a SQL-like query language called HiveQL for working with structured data. In Hive, the JOIN operation combines rows from two or more tables so that they can be queried and analyzed together.

Joins in Hive can be divided into Common Join (the join is completed in the Reduce phase) and Map Join (the join is completed in the Map phase).

The JOIN operation in Hive is executed through MapReduce or Tez tasks. The specific execution process is as follows:

  1. Data fragmentation: Hive divides the input data of the tables participating in the JOIN into fragments (input splits). Each fragment is a subset of a table's data that can be processed in parallel.

  2. Map phase: Hive starts a Map task for each fragment. The task extracts key-value pairs from its input, where the key is the value of the column(s) in the JOIN condition and the value carries the remaining columns the query needs, tagged with the table the row came from. The map output is buffered and then partitioned by key for the Reducers.

  3. Shuffle phase: Key-value pairs with the same key are sent to the same Reducer. This process is called data shuffling, and it guarantees that all records sharing a join key are processed together.

  4. Reduce phase: Each Reduce task receives the key-value pairs for its keys from the different Map tasks and combines the records that share a key according to the JOIN condition, generating the JOIN result.

[Note] A JOIN in Hive connects two or more tables through their columns. The JOIN condition specifies which columns are used for matching. Hive supports multiple JOIN types, including INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN, and you can select the appropriate type according to your needs.

In addition, Hive also provides some optimization techniques to improve the performance of JOIN operations, such as partitioning tables and compressing intermediate results. These techniques can reduce data movement and storage overhead, and speed up the execution of JOIN operations.

[Summary] The JOIN operation in Hive is executed through MapReduce or Tez tasks, including data fragmentation, Map phase, Shuffle phase, and Reduce phase. It uses JOIN conditions to combine records with the same key, producing a JOIN result. The performance of JOIN operations can be improved by choosing an appropriate JOIN type and using optimization techniques.

2. Environment preparation

If you already have an environment, you can skip this section. If you want to deploy one quickly, you can refer to my article: Detailed tutorial on quickly deploying Hive through docker-compose

# Log in to the container
docker exec -it hive-hiveserver2 bash
# Connect to Hive
beeline -u jdbc:hive2://hive-hiveserver2:10000 -n hadoop

3. Hive JOIN types

Hive is a Hadoop-based data warehouse tool for processing large-scale data sets. In Hive, JOIN is a common operation used to associate data in two or more tables according to specified conditions.

Hive supports various JOIN types, including:

  • Inner Join (INNER JOIN, or simply JOIN): Returns only the matching rows from both tables. An inner join matches rows from two tables based on one or more conditions (usually equality conditions). Only rows that satisfy the conditions are included in the result.

Example:

SELECT *
FROM table1
JOIN table2
ON table1.id = table2.id;
  • Left Outer Join (LEFT OUTER JOIN, or simply LEFT JOIN): Returns all rows from the left table together with the matching rows from the right table. If a left row has no match in the right table, the right table's columns in the result contain NULL values.

Example:

SELECT *
FROM table1
LEFT JOIN table2
ON table1.id = table2.id;
  • Right Outer Join (RIGHT OUTER JOIN, or simply RIGHT JOIN): Returns all rows from the right table together with the matching rows from the left table. If a right row has no match in the left table, the left table's columns in the result contain NULL values.

Example:

SELECT *
FROM table1
RIGHT JOIN table2
ON table1.id = table2.id;
  • Full Outer Join (FULL OUTER JOIN, or simply FULL JOIN): Returns all rows from both tables. If a row has no match in the other table, the other table's columns in the result contain NULL values.

Example:

SELECT *
FROM table1
FULL OUTER JOIN table2
ON table1.id = table2.id;
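
To make the difference between the four JOIN types concrete, here is a minimal Python sketch of their semantics on two tiny in-memory tables. It only illustrates which rows each type keeps; it is not how Hive executes a join. The sample rows and the join function are made up for illustration, and None plays the role of NULL.

```python
# Toy rows: (id, value). The names mirror table1/table2 from the examples above.
table1 = [(1, "a"), (2, "b"), (3, "c")]
table2 = [(2, "x"), (3, "y"), (4, "z")]


def join(left, right, how="inner"):
    """Join two lists of (id, value) rows on id; None plays the role of NULL."""
    rows = []
    matched_right = set()  # indexes of right rows that found a partner
    for l_id, l_val in left:
        found = False
        for i, (r_id, r_val) in enumerate(right):
            if l_id == r_id:
                rows.append((l_id, l_val, r_id, r_val))
                matched_right.add(i)
                found = True
        # LEFT and FULL keep unmatched left rows, padding the right side with NULL.
        if not found and how in ("left", "full"):
            rows.append((l_id, l_val, None, None))
    # RIGHT and FULL keep unmatched right rows, padding the left side with NULL.
    if how in ("right", "full"):
        for i, (r_id, r_val) in enumerate(right):
            if i not in matched_right:
                rows.append((None, None, r_id, r_val))
    return rows


for how in ("inner", "left", "right", "full"):
    print(how, join(table1, table2, how))
```

With this sample data, only ids 2 and 3 appear in both tables, so the inner join returns two rows, the left and right joins each add one NULL-padded row, and the full join adds both.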

Choose among these JOIN types according to your business needs and the data involved. In Hive, you write the JOIN keyword together with the tables to be joined and the join condition, and specify the type with "INNER JOIN", "LEFT OUTER JOIN", "RIGHT OUTER JOIN", "FULL OUTER JOIN", etc.

4. The three stages: Map, Shuffle, and Reduce

The whole MapReduce process is divided into three major stages: Map, Shuffle, and Reduce. After consulting several sources, I decided to break the process down into 11 smaller steps. In the follow-up content, I will also analyze part of the source code.

1) Map stage

In the Map stage, the input data is divided into multiple splits of roughly equal size, and each split is assigned to a Map task. The Map task converts its input into a series of key-value pairs, where the key identifies the object to be processed and the value is the associated data. The output of the Map stage is saved on local disk, where it waits for the Shuffle stage.
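
As a rough illustration of the Map stage, the following Python sketch cuts some input lines into splits and lets one map function per split emit key-value pairs. The input lines, the split size, and the function names make_splits and map_task are assumptions made for this example; real splits are measured in bytes rather than in records.

```python
def make_splits(records, split_size):
    """Cut the input into consecutive chunks of at most split_size records."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def map_task(split):
    """Turn each 'id,dept' line into a (key, value) pair keyed by id."""
    pairs = []
    for line in split:
        key, value = line.split(",", 1)
        pairs.append((key, value))
    return pairs


# Hypothetical input lines in the form "id,dept".
lines = ["1,sales", "2,hr", "3,it", "4,ops", "5,qa"]
splits = make_splits(lines, split_size=2)
map_outputs = [map_task(s) for s in splits]  # one output list per Map task
print(map_outputs)
```

Five lines with a split size of 2 yield three splits, so three Map tasks would run in parallel here.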

2) Shuffle stage

In the Shuffle stage, the output of the Map tasks is assigned to the Reduce tasks according to the key. Specifically, each Map task partitions its output by the hash value of the key, and each partition corresponds to one Reduce task, so all pairs with the same key reach the same Reduce task. Because the data travels over the network during the shuffle, network bandwidth and latency determine how quickly it reaches the target nodes.
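
The routing rule can be sketched in a few lines of Python: hash the key, take it modulo the number of Reduce tasks, and send the pair there. This is only a sketch under assumptions: the number of reducers and the sample pairs are invented, and zlib.crc32 stands in for whatever hash function the real partitioner uses.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # assumed value for illustration


def partition(key, num_reducers=NUM_REDUCERS):
    """Pick a reducer index from a stable hash of the key."""
    # zlib.crc32 is a stand-in hash; Python's built-in hash() of str
    # varies between processes, which would make the sketch unstable.
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


# Output of two hypothetical Map tasks as (key, value) pairs.
map_outputs = [
    [("k1", "m0-a"), ("k2", "m0-b")],
    [("k1", "m1-a"), ("k3", "m1-b")],
]

# Shuffle: route every pair to the reducer chosen by its key.
reducer_inputs = defaultdict(list)
for task_output in map_outputs:
    for key, value in task_output:
        reducer_inputs[partition(key)].append((key, value))

for reducer, pairs in sorted(reducer_inputs.items()):
    print(reducer, pairs)
```

Whatever reducer "k1" lands on, both of its pairs land there together, even though they were produced by different Map tasks.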

3) Reduce stage

In the Reduce stage, each Reduce task sorts or aggregates the key-value pairs it receives by key and then generates its output. Unlike map output, which stays on local disk, the output of the Reduce stage is written to the job's output location (typically HDFS), and the files written by all Reduce tasks together form the final result.
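
A minimal Python sketch of what one Reduce task does with its input: sort by key, group equal keys together, and aggregate each group. Summing the values is just an assumed example of an aggregation, and the sample pairs are invented.

```python
from itertools import groupby
from operator import itemgetter

# Pairs as one hypothetical Reduce task might receive them after the shuffle.
received = [("b", 2), ("a", 1), ("b", 5), ("a", 4), ("c", 7)]


def reduce_task(pairs):
    """Sort pairs by key, group equal keys together, and sum each group."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


print(reduce_task(received))  # [('a', 5), ('b', 7), ('c', 7)]
```

The sort matters: groupby only merges neighbouring items, which is also why sorting by key is part of the process before the reduce logic runs.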

[Summary] It can be seen that the three stages in the MapReduce framework are distributed and can run in parallel on multiple computers. The MapReduce framework can effectively process large-scale data and realize efficient distributed computing. Due to the versatility and scalability of the MapReduce framework, it has been widely used in various data processing and machine learning tasks.

5. Common Join (Reduce stage)

In Hive, a Common Join is completed in the Reduce phase. Hive first runs the Map phase over the tables participating in the join, partitions and sorts the map output by the join key, and sends it to the Reduce tasks.

  • In the Reduce phase, each Reduce task receives the records of the different tables grouped by join key and pairs the records that share a key to produce the final join result.

  • Hive relies on the MapReduce framework for this: records with the same join key are distributed to the same Reduce task, which is what makes the matching possible. This distributed approach lets the join scale to large data sets.

Note that because a Common Join is completed in the Reduce phase, joining large tables can generate a large amount of intermediate data and computing overhead. Optimizing join performance is therefore an important consideration; it can be improved by adjusting Hive's configuration parameters and by selecting an appropriate join algorithm.

Take the following HQL as an example to illustrate the process:

SELECT a.id, a.dept, b.age
FROM a JOIN b
ON (a.id = b.id);

[Figure: Common Join execution flow for the query above]
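
The flow for this query can also be sketched in Python. This is a simplified simulation under assumptions, not Hive's actual implementation: the sample rows for a(id, dept) and b(id, age) are invented, and the function names map_phase, shuffle, and reduce_phase are hypothetical. The point to notice is the tag attached to each value in the Map phase, which lets the Reduce phase tell the two tables apart.

```python
from collections import defaultdict

# Hypothetical sample rows for tables a(id, dept) and b(id, age).
table_a = [(1, "sales"), (2, "hr"), (3, "it")]
table_b = [(1, 30), (3, 41), (4, 25)]


def map_phase(rows, tag):
    """Emit (join key, (source tag, remaining column)) for each row."""
    return [(row[0], (tag, row[1])) for row in rows]


def shuffle(pairs):
    """Group values by join key, as the shuffle delivers them to a reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """For each key, pair every row tagged 'a' with every row tagged 'b'."""
    result = []
    for key in sorted(grouped):
        depts = [v for tag, v in grouped[key] if tag == "a"]
        ages = [v for tag, v in grouped[key] if tag == "b"]
        for dept in depts:
            for age in ages:
                result.append((key, dept, age))  # a.id, a.dept, b.age
    return result


pairs = map_phase(table_a, "a") + map_phase(table_b, "b")
print(reduce_phase(shuffle(pairs)))  # [(1, 'sales', 30), (3, 'it', 41)]
```

Ids 2 and 4 each appear in only one table, so their groups have nothing to pair with and they drop out, which is the inner join behaviour of the query.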

6. Map Join (Map stage)

Map Join is usually used when joining a small table with a large table. The size threshold for the small table is determined by the parameter hive.mapjoin.smalltable.filesize, whose default value is 25M (25000000 bytes). If the conditions are met, Hive automatically converts the join to a Map Join during execution; you can also request a Map Join explicitly with the hint /*+ mapjoin(table) */.
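
The size check can be pictured with a deliberately simplified Python sketch. Hive's real decision takes more settings into account than this; the function name pick_join_strategy and the table sizes are hypothetical, and only the 25000000-byte threshold corresponds to hive.mapjoin.smalltable.filesize.

```python
# Threshold in bytes, mirroring the default of hive.mapjoin.smalltable.filesize.
SMALLTABLE_FILESIZE = 25_000_000


def pick_join_strategy(table_sizes, threshold=SMALLTABLE_FILESIZE):
    """Return ('map join', small_table) if the smallest table fits the threshold."""
    smallest = min(table_sizes, key=table_sizes.get)
    if table_sizes[smallest] <= threshold:
        return ("map join", smallest)
    return ("common join", None)


# Hypothetical table sizes in bytes.
print(pick_join_strategy({"a": 8_000_000_000, "b": 12_000_000}))
print(pick_join_strategy({"a": 8_000_000_000, "b": 900_000_000}))
```

In the first call table b is small enough to be held in memory by every Map task; in the second call neither table is, so the sketch falls back to a Common Join.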
[Figure: Map Join flow chart]
As shown in the flow chart above:

  • First, Task A runs locally on the client. It scans the small table b, converts the data into a HashTable data structure, writes it to a local file, and then loads that file into the DistributedCache.

  • Next, Task B is a MapReduce job without a Reduce phase. It starts Map tasks to scan the large table a. In the Map stage, each record of a is matched against the HashTable for table b in the DistributedCache, and the result is output directly. Because there is no Reduce phase, there are as many result files as there are Map tasks.
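
The two tasks can be sketched in Python as follows. This is a minimal simulation under assumptions rather than Hive's implementation: the sample rows are invented, the function names build_hash_table and map_task are hypothetical, and a plain dict stands in for the HashTable that is distributed to the Map tasks.

```python
# Hypothetical sample rows: b(id, age) is the small table, a(id, dept) the large one.
small_b = [(1, 30), (3, 41), (4, 25)]
large_a_splits = [
    [(1, "sales"), (2, "hr")],  # split read by map task 0
    [(3, "it"), (1, "ops")],    # split read by map task 1
]


def build_hash_table(rows):
    """Task A: index the small table by join key."""
    table = {}
    for key, age in rows:
        table.setdefault(key, []).append(age)
    return table


def map_task(split, hash_table):
    """Task B: probe the hash table for each row of the large table."""
    out = []
    for key, dept in split:
        for age in hash_table.get(key, []):
            out.append((key, dept, age))  # a.id, a.dept, b.age
    return out


hash_table = build_hash_table(small_b)  # the same table is given to every map task
# One output list per map task, mirroring one result file per Map task.
outputs = [map_task(split, hash_table) for split in large_a_splits]
print(outputs)  # [[(1, 'sales', 30)], [(3, 'it', 41), (1, 'ops', 30)]]
```

No shuffle and no sort are needed: each Map task finishes its part of the join on its own, which is where the performance gain over a Common Join comes from.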

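[Note] Map Join is not suitable for FULL OUTER JOIN or RIGHT OUTER JOIN. Each Map task sees only part of the large table, so when every row of the small table must be preserved (as for table b above in a RIGHT or FULL OUTER JOIN), no single Map task can tell that a small-table row has no match.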

That concludes this introduction to the principles and mechanisms of Hive Join. If you have any questions, please leave me a message, or follow my official account [Big Data and Cloud Native Technology Sharing] to join the group chat or send me a private message.

Origin blog.csdn.net/qq_35745940/article/details/130536728