Hive: How DISTINCT and JOIN Operations Execute

 

Anyone with a working knowledge of Hive should also understand how a few commonly used statements run internally. That understanding makes it easier to tune Hive and to adjust SQL queries accordingly.

 

Below, we use two types of operations, DISTINCT and JOIN, to walk through Hive's underlying MapReduce execution process in detail.

 

 

DISTINCT flow chart:

 

 

As the flow chart shows, the first stage is Mapping, which splits the input data into several fragments (input splits) and reads them in.

 

 

The second stage is Shuffling.

When map output is written from the in-memory buffer to disk, it is partitioned and sorted. Partitioning decides which partition (and therefore which reducer) a given key goes to, and the keys within each partition are then sorted.
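
The partition-then-sort step can be sketched in a few lines of Python. This is only an illustration, not Hive's or Hadoop's actual code: the function names are hypothetical, and `zlib.crc32` is an assumed stand-in for the hash that a hash-based partitioner would apply to the key.

```python
import zlib
from collections import defaultdict

def partition_for(key, num_reducers):
    # Stable hash of the key, modulo the number of reducers.
    # crc32 is an assumption for illustration; it is not Hadoop's hash function.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_and_sort(map_output, num_reducers):
    # Route every (key, value) pair to its partition.
    partitions = defaultdict(list)
    for key, value in map_output:
        partitions[partition_for(key, num_reducers)].append((key, value))
    # Sort the pairs inside each partition by key.
    return {p: sorted(pairs, key=lambda kv: kv[0]) for p, pairs in partitions.items()}

map_output = [("b", 1), ("a", 1), ("c", 1), ("a", 1)]
spilled = partition_and_sort(map_output, num_reducers=2)
```

Because the partition depends only on the key, every record with the same key ends up in the same partition, which is what lets a single reducer see all of them.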

 

 

The third stage is Reducing. Because the operation is DISTINCT, the reducer only needs to emit one record from each list of identical keys.
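
The three DISTINCT stages can be imitated in plain Python to make the flow concrete. This is a minimal sketch under simplifying assumptions (a single partition, everything in memory); the function names are made up for illustration and are not Hive internals.

```python
from itertools import groupby

def map_phase(rows):
    # Emit each value as the key; DISTINCT needs no payload, so the value is None.
    return [(row, None) for row in rows]

def shuffle_phase(pairs):
    # Sort by key so that identical keys become adjacent.
    return sorted(pairs, key=lambda kv: kv[0])

def reduce_phase(sorted_pairs):
    # Emit exactly one key per group of identical keys.
    return [key for key, _group in groupby(sorted_pairs, key=lambda kv: kv[0])]

def distinct(rows):
    return reduce_phase(shuffle_phase(map_phase(rows)))

result = distinct(["b", "a", "c", "a", "b"])
print(result)
```

The reducer never compares values across groups; deduplication falls out of the sort, since all copies of a key arrive together.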

 

 

 

 

 

JOIN flow chart:

 

JOIN, LEFT JOIN, and RIGHT JOIN all follow this process.

MAP JOIN does not follow this process, because it performs the join on the map side without the shuffle and reduce steps described here.

 

 

The Mapping stage mainly reads the data, splits it, and tags each record with the source it came from.

For example:

The data from source A is

a: [a,3,4,A]

b: [b,2,3,A]

c: [c,5,6,A]

The data from source B is

d: [d,20,B]

a: [a,2,B]

b: [b,9,B]
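
The tagging done in the Mapping stage can be sketched as follows. It is an illustrative Python sketch, not Hive's implementation: `tag_rows` is a hypothetical helper, and the join key is assumed to be the first field of each row.

```python
def tag_rows(rows, source_tag):
    # Emit (join_key, record) where the record carries its source tag at the end.
    # Assumption: the join key is the first field of the row.
    return [(row[0], list(row) + [source_tag]) for row in rows]

source_a = [("a", 3, 4), ("b", 2, 3), ("c", 5, 6)]
source_b = [("d", 20), ("a", 2), ("b", 9)]

map_output = tag_rows(source_a, "A") + tag_rows(source_b, "B")
for key, record in map_output:
    print(key, record)
```

The tag is what later allows the reducer to tell which side of the join a record belongs to once records from both sources are mixed under one key.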

 

The Shuffling stage sorts the data and brings records with the same key together.

In the Reducing stage, the records that share a key are processed together.

For example, for key a:

The list from A is

a: {[a,3,4]}

The list from B is

a: {[a,2]}

Because each source contributes only one record for this key, the relationship is one-to-one and a single joined row is produced.

For a many-to-many relationship, the reducer loops over both lists (a nested loop), so the result for that key is the Cartesian product of the matching records.
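
The reduce-side pairing, including the Cartesian product in the many-to-many case, can be sketched like this. It is a simplified in-memory illustration with hypothetical names; it assumes each record ends with its source tag ("A" or "B") and starts with the join key.

```python
from collections import defaultdict
from itertools import product

def reduce_join(tagged_pairs):
    # Group records by key, keeping the A-side and B-side lists apart.
    grouped = defaultdict(lambda: {"A": [], "B": []})
    for key, record in tagged_pairs:
        *fields, tag = record
        grouped[key][tag].append(fields)
    joined = []
    for key in sorted(grouped):
        # Nested loop over both lists: every A record pairs with every B record.
        for left, right in product(grouped[key]["A"], grouped[key]["B"]):
            joined.append(left + right[1:])  # drop the duplicated key from the B side
    return joined

example = [
    ("a", ["a", 3, 4, "A"]), ("b", ["b", 2, 3, "A"]), ("c", ["c", 5, 6, "A"]),
    ("d", ["d", 20, "B"]), ("a", ["a", 2, "B"]), ("b", ["b", 9, "B"]),
]
one_to_one = reduce_join(example)

many = [("k", ["k", i, "A"]) for i in (1, 2)] + [("k", ["k", j, "B"]) for j in (10, 20, 30)]
many_to_many = reduce_join(many)
```

With the sample data, keys a and b each yield one row, while c and d yield none because one side is empty. In the second case, 2 records from A and 3 from B under the same key produce 6 joined rows, which is why heavily duplicated join keys can blow up the output size.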

 

 

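The processes for LEFT JOIN and RIGHT JOIN are similar. Only the filtering conditions differ: a LEFT JOIN also keeps rows from the left source that have no match, and a RIGHT JOIN also keeps unmatched rows from the right source.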
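
The difference in filtering can be shown per key. The sketch below is an assumption-laden illustration rather than Hive's code: `join_group` is a hypothetical function that receives the A-side and B-side record lists for one key and represents a missing side with None (standing in for SQL NULL).

```python
from itertools import product

def join_group(a_rows, b_rows, how="inner"):
    # Both sides present: every join type pairs each A row with each B row.
    if a_rows and b_rows:
        return list(product(a_rows, b_rows))
    # Only the left side present: kept by LEFT JOIN, dropped otherwise.
    if how == "left" and a_rows:
        return [(a, None) for a in a_rows]
    # Only the right side present: kept by RIGHT JOIN, dropped otherwise.
    if how == "right" and b_rows:
        return [(None, b) for b in b_rows]
    return []

# Key c exists only in source A; key d exists only in source B.
c_inner = join_group([("c", 5, 6)], [], how="inner")
c_left = join_group([("c", 5, 6)], [], how="left")
d_right = join_group([], [("d", 20)], how="right")
d_left = join_group([], [("d", 20)], how="left")
```

The map and shuffle stages are identical for all three join types; only this final per-key decision changes.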

 

 

 

 


Origin blog.csdn.net/u010003835/article/details/105253639