The underlying MapReduce (MR) flow of the join operation in Hive

Hive joins fall into two kinds: map join (mapjoin) and common (ordinary) join.

A map join has no reduce phase: the join is performed entirely in the map phase. This is typically possible when one of the tables is small enough to be held in memory by each mapper.
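To make the idea concrete, here is a minimal in-memory sketch of a map-side join in Python. It is not Hive's implementation: the function name `map_join`, the dict-based rows, and the extra row with id 3 are assumptions made up for illustration. The small table is turned into a hash table, and each row of the big table is joined by a lookup, so no reduce step is needed.

```python
# Hypothetical sketch of a map-side (hash) join; not Hive's actual code.

def map_join(big_rows, small_rows, key):
    """Join two lists of dict rows on `key` without any reduce step."""
    # Build a hash table from the small table. In a real map join this
    # table would be made available to every mapper.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    # Each "mapper" streams rows of the big table and probes the hash table.
    for row in big_rows:
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Illustrative data; the row with id 3 is invented to show an unmatched key.
big = [
    {"id": 1, "name": "Xiao Wang"},
    {"id": 2, "name": "Xiao Zhang"},
    {"id": 3, "name": "Xiao Li"},
]
small = [{"id": 1, "age": 32}, {"id": 2, "age": 22}]

result = map_join(big, small, "id")
print(result)
```

The row with id 3 has no partner in the small table, so an inner join drops it.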

A common join performs the join in the reduce phase, so the whole process consists of map, shuffle, and reduce.

Take common join as an example:

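Map phase

Read the data from the source table. The map output uses the column in the join's ON condition as the key; if the join has multiple join keys, the combination of those keys is used as the key.

The map output value consists of the columns needed after the join (those used in select or where). The value also carries the table's Tag, which indicates which table the value came from.

The map output is sorted by key.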
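The map-side steps described above can be sketched in Python as follows. This is only a simulation under assumed names (`map_side`, `key_cols`, `value_cols`, and the integer tags are hypothetical), not what Hive actually generates: each row becomes a (key, (tag, value)) pair, and the output is sorted by key.

```python
# Hypothetical sketch of the map side of a common join; names are illustrative.

def map_side(rows, tag, key_cols, value_cols):
    """Emit (key, (tag, value)) pairs for one table, sorted by key."""
    output = []
    for row in rows:
        # Composite key when the join has several join columns.
        key = tuple(row[c] for c in key_cols)
        # Keep only the columns that are needed after the join.
        value = tuple(row[c] for c in value_cols)
        # The tag records which table this value came from.
        output.append((key, (tag, value)))
    output.sort(key=lambda kv: kv[0])  # sort by key only
    return output


# Rows are deliberately out of order to show the sort.
a_rows = [{"id": 2, "name": "Xiao Zhang"}, {"id": 1, "name": "Xiao Wang"}]
a_out = map_side(a_rows, tag=0, key_cols=["id"], value_cols=["name"])
print(a_out)
```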

Shuffle phase
Hash the key and push each key/value pair to a reducer chosen by that hash value. This ensures that identical keys from both tables end up on the same reducer.
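A rough Python sketch of this partitioning step is shown below. The helper names `partition_for` and `shuffle` are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash function the real framework uses; the point is that the reducer is chosen from the key alone, so the table a the pair came from does not matter.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Pick a reducer by hashing the key (crc32 is just a stand-in hash)."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Group (key, (tag, value)) pairs by their target reducer."""
    reducers = defaultdict(list)
    for key, tagged_value in map_outputs:
        reducers[partition_for(key, num_reducers)].append((key, tagged_value))
    return dict(reducers)


# Map output from both tables: tag 0 marks table a, tag 1 marks table b.
map_outputs = [
    ((1,), (0, ("Xiao Wang",))),
    ((2,), (0, ("Xiao Zhang",))),
    ((1,), (1, (32,))),
    ((2,), (1, (22,))),
]
reducers = shuffle(map_outputs, num_reducers=2)
for reducer_id, pairs in sorted(reducers.items()):
    print(reducer_id, pairs)
```

Whichever reducer a key lands on, both of its tagged values arrive there together, which is what the reduce phase relies on.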

Reduce phase
Complete the join based on the key, using the Tag to tell apart the data that came from different tables.
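As an illustration, the reduce step for an inner join might look like the Python sketch below. The function `reduce_join`, the tag values 0 and 1, and the shape of `reducer_input` are assumptions made for this example, not Hive internals; the input uses the sample rows of tables a and b from this article.

```python
from itertools import product


def reduce_join(key, tagged_values):
    """Join all values that share `key`, using the tag to split them by table."""
    left = [v for tag, v in tagged_values if tag == 0]   # values from table a
    right = [v for tag, v in tagged_values if tag == 1]  # values from table b
    # Inner join: every a-value pairs with every b-value for this key.
    return [key + l + r for l, r in product(left, right)]


# What one reducer might receive after the shuffle: values grouped by key.
reducer_input = {
    (1,): [(0, ("Xiao Wang",)), (1, (32,))],
    (2,): [(0, ("Xiao Zhang",)), (1, (22,))],
}

rows = []
for key in sorted(reducer_input):
    rows.extend(reduce_join(key, reducer_input[key]))
print(rows)
```

If a key has values from only one table, one of the two lists is empty and the inner join emits nothing for that key.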

Table a:

id name
1 Xiao Wang
2 Xiao Zhang

Table b:

id age
1 32
2 22

select a.id, name, age from a join b on a.id = b.id;



Origin blog.csdn.net/weixin_47699191/article/details/115266572