Anyone with some familiarity with Hive should also understand how several commonly used Hive statements execute internally. That understanding makes it easier to tune Hive and to adjust the SQL itself.
Below, we walk through Hive's underlying MapReduce execution process in detail, using two kinds of operations:
DISTINCT
JOIN
DISTINCT flow chart:
The first stage is Mapping, which divides the input data into several splits and reads them in.
The second stage is Shuffling.
When map output is written from the in-memory buffer to disk, it is partitioned and sorted. Partitioning decides which partition each key goes to, and the keys within the same partition are then sorted.
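The partition-and-sort step can be sketched roughly as follows. This is only an illustration, not Hive's or Hadoop's actual code: it assumes a hash-style partitioner (hash of the key modulo the number of reducers), and the names partition_for and partition_and_sort, as well as the use of zlib.crc32 as the hash, are assumptions made for the example.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_reducers: int) -> int:
    # Stand-in for a hash partitioner: a stable hash of the key,
    # taken modulo the number of reducers (an assumption for this sketch).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_and_sort(pairs, num_reducers):
    # Assign every (key, value) pair to a partition based on its key.
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    # Sort the pairs inside each partition by key.
    return {p: sorted(kvs, key=lambda kv: kv[0]) for p, kvs in partitions.items()}


if __name__ == "__main__":
    pairs = [("b", 1), ("a", 1), ("c", 1), ("a", 1)]
    for partition, kvs in sorted(partition_and_sort(pairs, 2).items()):
        print(partition, kvs)
```

Because the partition depends only on the key, all records with the same key land in the same partition and so reach the same reducer.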
The third stage is Reducing. Because this is a DISTINCT, the reducer only needs to emit one record for each list of identical keys.
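The three DISTINCT stages can be imitated in a few lines of plain Python. This is a minimal single-process sketch of the idea, not what Hive actually runs; the function names map_phase, shuffle_phase, and reduce_phase are invented for the illustration.

```python
from itertools import groupby


def map_phase(rows):
    # Emit the DISTINCT column value as the key; the value carries no information.
    return [(row, None) for row in rows]


def shuffle_phase(pairs):
    # Sort by key so identical keys become adjacent, then group them.
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    ]


def reduce_phase(grouped):
    # DISTINCT: emit each key exactly once, ignoring how many values it has.
    return [key for key, _ in grouped]


def distinct(rows):
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(distinct(["b", "a", "b", "c", "a"]))  # ['a', 'b', 'c']
```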
JOIN flow chart:
JOIN, LEFT JOIN, and RIGHT JOIN all follow this process.
MAP JOIN does not follow this process.
The Mapping stage mainly reads the data, splits it, and tags each record with the source it came from.
For example:
The data from source A is:
a: [a,3,4,A]
b: [b,2,3,A]
c: [c,5,6,A]
The data from source B is:
d: [d,20,B]
a: [a,2,B]
b: [b,9,B]
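The tagging done in the Mapping stage might look like the sketch below. It assumes the first field of each row is the join key and that the tag is simply appended to the record; the helper name tag_records is hypothetical and chosen only for this example.

```python
def tag_records(rows, source_tag):
    # Emit (join_key, record_with_source_tag) pairs.
    # Assumption: the first field of every row is the join key.
    return [(row[0], row + [source_tag]) for row in rows]


source_a = [["a", 3, 4], ["b", 2, 3], ["c", 5, 6]]
source_b = [["d", 20], ["a", 2], ["b", 9]]

# Records from both sources enter the same stream, distinguished only by tag.
mapped = tag_records(source_a, "A") + tag_records(source_b, "B")

if __name__ == "__main__":
    for key, record in mapped:
        print(key, record)
```

The tag is what later lets the reducer tell which side of the join each record belongs to.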
The Shuffling stage sorts the data and brings together the records that share the same key.
In the Reducing stage, the records for each key are processed together.
For example, for key a:
The list from A is
a: {[a,3,4]}
The list from B is
a: {[a,2]}
Since each source contributes exactly one record for this key, the relationship is one-to-one.
For a many-to-many relationship, the reducer loops over the records from both sources, so each key produces the Cartesian product of its records from A and from B.
LEFT JOIN and RIGHT JOIN go through the same process. Only the rule for filtering keys that are missing from one side differs.
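The Reducing stage of such a join, including the Cartesian product and the LEFT/RIGHT variants, can be sketched as below. This is a simplified single-process illustration under stated assumptions, not Hive's implementation: the function reduce_side_join and its how parameter are invented for the example, the first field of each row is assumed to be the join key, and None stands in for SQL NULL.

```python
from collections import defaultdict


def reduce_side_join(rows_a, rows_b, how="inner"):
    # Map + shuffle: group the remaining fields of each row by join key,
    # keeping the two sources apart (this plays the role of the source tag).
    groups = defaultdict(lambda: {"A": [], "B": []})
    for row in rows_a:
        groups[row[0]]["A"].append(row[1:])
    for row in rows_b:
        groups[row[0]]["B"].append(row[1:])

    joined = []
    for key in sorted(groups):
        left, right = groups[key]["A"], groups[key]["B"]
        # LEFT JOIN keeps keys missing from B; RIGHT JOIN keeps keys missing from A.
        if how == "left" and not right:
            right = [None]
        if how == "right" and not left:
            left = [None]
        # Nested loop: the Cartesian product of the two lists for this key.
        for l in left:
            for r in right:
                joined.append((key, l, r))
    return joined


if __name__ == "__main__":
    a = [["a", 3, 4], ["b", 2, 3], ["c", 5, 6]]
    b = [["d", 20], ["a", 2], ["b", 9]]
    for how in ("inner", "left", "right"):
        print(how, reduce_side_join(a, b, how))
```

With the sample data from above, the inner join keeps only keys a and b, the left join additionally keeps c with an empty right side, and the right join additionally keeps d with an empty left side. If a key has two records in A and three in B, the nested loop emits six rows for it.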