How the join keyword works in Hive and how to use it

 

1. How join works

 

Joins in Hive fall into two kinds: Common Join (the join is completed in the Reduce phase) and Map Join (the join is completed in the Map phase).

 

1.1 Hive Common Join

 

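If MapJoin is not specified, or the conditions for MapJoin are not met, Hive by default plans the query as a Common Join, that is, the join is completed in the Reduce phase. The whole process consists of the Map, Shuffle, and Reduce phases.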

 

[Figure: Common Join stage flow (Map, Shuffle, Reduce)]

 

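Map phase: reads the source tables. The Map output key is the join field; the output value takes the form <k,v>, where k identifies the source table and v holds the selected field values.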

 

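Shuffle phase: hashes each key and sends the record to a reducer based on the hash, which guarantees that rows with the same key from both tables land in the same reducer.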

 

Reduce phase: performs the join for each key, using the k component of the value to tell which table each row came from.
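
The three phases can be sketched in a few lines of Python. This is only an in-memory illustration of the idea, not Hive's implementation: the function names (map_phase, shuffle_phase, reduce_phase), the table tags "a" and "b", and the reducer count are all assumptions made for the example. The rows mirror the two test tables used later in this article.

```python
from collections import defaultdict

# Sample rows mirroring the article's two test tables; field 0 is the join key.
detail = [("01", "Maggie", "HangZhou"), ("02", "Timu", "BeiJing"), ("03", "Tom", "ShangHai")]
info = [("01", "Male", "26"), ("02", "Female", "18"), ("04", "Male", "28")]

def map_phase(rows, tag):
    """Emit (join_key, (table_tag, other_fields)) for every input row."""
    return [(row[0], (tag, row[1:])) for row in rows]

def shuffle_phase(pairs, num_reducers):
    """Send each pair to a reducer chosen by hashing the key, grouping values per key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions

def reduce_phase(partition):
    """Inside one reducer, split the values by table tag and pair them (inner join)."""
    joined = []
    for key, values in partition.items():
        left = [fields for tag, fields in values if tag == "a"]
        right = [fields for tag, fields in values if tag == "b"]
        for left_fields in left:
            for right_fields in right:
                joined.append((key, *left_fields, *right_fields))
    return joined

pairs = map_phase(detail, "a") + map_phase(info, "b")
common_join_result = sorted(
    row
    for partition in shuffle_phase(pairs, num_reducers=2)  # reducer count is arbitrary here
    for row in reduce_phase(partition)
)
print(common_join_result)
```

Because every row with a given key is routed to the same partition, each reducer can join its keys without seeing the rest of the data.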

 

1.2 Hive Map Join

 

[Figure: Map Join execution diagram]

 

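Map Join suits joins between a large table and a small table. The size limit for the small table is set by the parameter hive.mapjoin.smalltable.filesize, whose default is 25000000 bytes (about 25 MB). When the condition is met and automatic conversion is enabled (hive.auto.convert.join), Hive converts the join to a Map Join at execution time; alternatively, add the hint /*+ MAPJOIN(table) */ when writing the SQL. The process consists of two steps: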

 

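1. Task A runs locally on the client. It scans the small table b, converts it into a HashTable data structure, writes that to a local file, and then loads the file into the DistributedCache.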

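2. Task B is a MapReduce job with no Reduce phase. It launches Map tasks to scan the large table a; in the Map phase, each record of a is looked up in the HashTable for table b held in the DistributedCache, and the result is output directly. Because there is no Reduce, there are as many result files as there are Map tasks.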

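Note: Map Join is not suitable for FULL OUTER JOIN, nor for a RIGHT OUTER JOIN in which the small table b is on the right. The cached table cannot be the side whose unmatched rows must be preserved, because no single Map task sees all of the large table.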
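
The two steps can be sketched as follows. This is a simplified in-memory model under stated assumptions, not Hive code: build_hash_table and map_task are hypothetical helper names, and the large table is split by hand into two input splits to stand in for two Map tasks.

```python
def build_hash_table(small_rows):
    """Step 1 (Task A): load the small table into a hash table keyed by the join key."""
    table = {}
    for row in small_rows:
        table.setdefault(row[0], []).append(row[1:])
    return table

def map_task(big_split, hash_table):
    """Step 2 (Task B): one Map task probes the hash table per row; there is no Reduce."""
    output = []
    for row in big_split:
        for match in hash_table.get(row[0], []):
            output.append((*row, *match))
    return output

# Small table b (cached) and large table a (split across two hypothetical Map tasks).
small_b = [("01", "Male", "26"), ("02", "Female", "18"), ("04", "Male", "28")]
big_a_splits = [
    [("01", "Maggie", "HangZhou"), ("02", "Timu", "BeiJing")],
    [("03", "Tom", "ShangHai")],
]

hash_table = build_hash_table(small_b)
# Each Map task writes its own output, so there is one result "file" per split.
output_files = [map_task(split, hash_table) for split in big_a_splits]
print(output_files)
```

The sketch also shows why unmatched rows of the cached table cannot be emitted: each Map task sees only its own split of table a, so none of them can tell that key 04 never matched anywhere.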

 

2. Join types

 

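The main join types in Hive are inner join (INNER JOIN), left join (LEFT JOIN), right join (RIGHT JOIN), full join (FULL JOIN), left semi join (LEFT SEMI JOIN), and Cartesian product (CROSS JOIN). To illustrate how each one is used, two test tables are created below.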

 
-- Create the user detail table
create table user_list_detail(
user_id    string,    -- user ID
user_name  string,    -- user name
city       string)    -- home city
row format delimited fields terminated by ',';

-- Create the user info table
create table user_list_info(
user_id    string,    -- user ID
sex        string,    -- sex
age        string)    -- age
row format delimited fields terminated by ',';

-- Inspect the data
hive> select * from user_list_detail;
OK
01  Maggie  HangZhou
02  Timu    BeiJing
03  Tom ShangHai
Time taken: 0.083 seconds, Fetched: 3 row(s)

hive> select * from user_list_info;
OK
01  Male    26
02  Female  18
04  Male    28
Time taken: 0.068 seconds, Fetched: 3 row(s)
 

2.1 Inner join ([INNER] JOIN)

 

The most common kind of JOIN: only rows whose join key appears in both tables (the intersection) are returned. The INNER keyword may be omitted.

 
hive> select 
    >     a.user_id,
    >     a.user_name,
    >     a.city,
    >     b.sex,
    >     b.age
    > from 
    >     user_list_detail a
    > inner join 
    >     user_list_info b
    > on a.user_id=b.user_id;
Query ID = hive_20200102164926_15a71df5-b4d3-4bad-81df-5dc8a3db5f8f
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (Executing on YARN cluster with App id application_1576045532570_7331)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Map 2 ..........   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 9.58 s     
--------------------------------------------------------------------------------
OK
01  Maggie  HangZhou    Male    26
02  Timu    BeiJing Female  18
Time taken: 21.52 seconds, Fetched: 2 row(s)
 

2.2 Left outer join (LEFT [OUTER] JOIN)

 

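The table before the LEFT [OUTER] JOIN keyword is the main table and is joined to the other table. Every row of the main table is returned (exactly once when the join key is unique in the other table, as it is here), and fields that find no match are set to NULL. The OUTER keyword may be omitted.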

 
hive> select 
    >     a.user_id,
    >     a.user_name,
    >     a.city,
    >     b.sex,
    >     b.age
    > from 
    >     user_list_detail a
    > left outer join 
    >     user_list_info b
    > on a.user_id=b.user_id;
Query ID = hive_20200102165511_ddfd01c0-0601-443f-adc9-65edaa5cd771
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1576045532570_7331)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Map 2 ..........   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 9.42 s     
--------------------------------------------------------------------------------
OK
01  Maggie  HangZhou    Male    26
02  Timu    BeiJing Female  18
03  Tom ShangHai    NULL    NULL
Time taken: 10.155 seconds, Fetched: 3 row(s)
 

2.3 Right outer join (RIGHT [OUTER] JOIN)

 

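The opposite of the left outer join: the table after the RIGHT [OUTER] JOIN keyword is the main table and is joined to the other table. Every row of the main table is returned (exactly once when the join key is unique in the other table, as it is here), and fields that find no match are set to NULL. The OUTER keyword may be omitted.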

 
hive> select 
    >     a.user_id,
    >     a.user_name,
    >     a.city,
    >     b.sex,
    >     b.age
    > from 
    >     user_list_detail a
    > right outer join 
    >     user_list_info b
    > on a.user_id=b.user_id;
Query ID = hive_20200102165700_6af21b84-60a2-4fe7-817f-2eab9e3c355d
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1576045532570_7331)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Map 2 ..........   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 5.09 s     
--------------------------------------------------------------------------------
OK
01  Maggie  HangZhou    Male    26
02  Timu    BeiJing Female  18
NULL    NULL    NULL    Male    28
Time taken: 5.768 seconds, Fetched: 3 row(s)
 

2.4 Full outer join (FULL [OUTER] JOIN)

 

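Both tables are treated as the base: all records from both tables are returned, much like a union, and fields that find no match are NULL. The OUTER keyword may be omitted.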

 
hive> select 
    >     a.user_id,
    >     a.user_name,
    >     a.city,
    >     b.sex,
    >     b.age
    > from 
    >     user_list_detail a
    > full outer join 
    >     user_list_info b
    > on a.user_id=b.user_id;
Query ID = hive_20200102165812_cc6246ab-2444-4bc6-9c22-54a7d421f889
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1576045532570_7331)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Map 3 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 9.18 s     
--------------------------------------------------------------------------------
OK
01  Maggie  HangZhou    Male    26
02  Timu    BeiJing Female  18
03  Tom ShangHai    NULL    NULL
NULL    NULL    NULL    Male    28
Time taken: 9.887 seconds, Fetched: 4 row(s)
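
The row-level behavior of the four join types shown so far can be summarized in one small Python function. This is an illustrative sketch only: the function name join and its how parameter are assumptions for the example, None stands in for NULL, and, as in the queries above, the right table's key column is left out of the output.

```python
def join(left, right, how="inner"):
    """Join two lists of tuples on field 0; how is 'inner', 'left', 'right' or 'full'."""
    width_left, width_right = len(left[0]), len(right[0])
    rows, matched_right = [], set()
    for left_row in left:
        hits = [i for i, right_row in enumerate(right) if right_row[0] == left_row[0]]
        for i in hits:
            rows.append(left_row + right[i][1:])
            matched_right.add(i)
        # Left and full joins keep unmatched left rows, padding the right side with None.
        if not hits and how in ("left", "full"):
            rows.append(left_row + (None,) * (width_right - 1))
    # Right and full joins keep unmatched right rows, padding the left side with None.
    if how in ("right", "full"):
        for i, right_row in enumerate(right):
            if i not in matched_right:
                rows.append((None,) * width_left + right_row[1:])
    return rows

detail = [("01", "Maggie", "HangZhou"), ("02", "Timu", "BeiJing"), ("03", "Tom", "ShangHai")]
info = [("01", "Male", "26"), ("02", "Female", "18"), ("04", "Male", "28")]

for how in ("inner", "left", "right", "full"):
    print(how, len(join(detail, info, how)))
```

With the sample data, the row counts are 2, 3, 3 and 4, matching the Fetched counts in the four query outputs above.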
 

2.5 Left semi join (LEFT SEMI JOIN)

 

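The table before the LEFT SEMI JOIN keyword is the main table; the query returns the main-table records whose key also exists in the secondary table. A left semi join has two restrictions: first, it behaves like an intersection of the two tables; second, only columns from the main table can be shown. (It works like IN or EXISTS.)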

 
hive> select 
    >     a.user_id,
    >     a.user_name,
    >     a.city
    > from 
    >     user_list_detail a
    > left semi join 
    >     user_list_info b
    > on a.user_id=b.user_id;
Query ID = hive_20200102170238_7629007e-a1b6-46a3-bc1a-00c7f7e4ae98
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1576045532570_7331)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Map 2 ..........   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 6.97 s     
--------------------------------------------------------------------------------
OK
01  Maggie  HangZhou
02  Timu    BeiJing
Time taken: 7.923 seconds, Fetched: 2 row(s)
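
A short Python sketch of the semi-join semantics, under the assumption that it is modeled as a key-membership test (the function names are hypothetical). It also shows one way a semi join differs from an inner join: a duplicated key in the secondary table does not duplicate main-table rows. The duplicate row for user 01 is invented purely for this illustration.

```python
def left_semi_join(left, right):
    """Keep each left row whose key (field 0) exists in right; emit left columns only."""
    right_keys = {row[0] for row in right}
    return [row for row in left if row[0] in right_keys]

def inner_join_left_columns(left, right):
    """Inner join that projects only left columns, for comparison."""
    return [l for l in left for r in right if l[0] == r[0]]

detail = [("01", "Maggie", "HangZhou"), ("02", "Timu", "BeiJing"), ("03", "Tom", "ShangHai")]
# Hypothetical secondary table in which key "01" appears twice.
info_with_dup = [("01", "Male", "26"), ("01", "Male", "27"), ("02", "Female", "18")]

semi_rows = left_semi_join(detail, info_with_dup)
inner_rows = inner_join_left_columns(detail, info_with_dup)
print(len(semi_rows), len(inner_rows))
```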
 

2.6 Cartesian product (CROSS JOIN)

 

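Returns the Cartesian product of the two tables; no join key is needed. The left table has 3 rows and the right table has 3 rows, so 3 × 3 = 9 combinations are returned.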

 
hive> select 
    >     a.user_id,
    >     a.user_name,
    >     a.city,
    >     b.sex,
    >     b.age
    > from 
    >     user_list_detail a
    > cross join 
    >     user_list_info b;
Warning: Map Join MAPJOIN[7][bigTable=a] in task 'Map 1' is a cross product
Query ID = hive_20200102170518_16230e9a-e9ae-4189-a74a-cc0a00f9b8ef
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1576045532570_7331)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Map 2 ..........   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 5.77 s     
--------------------------------------------------------------------------------
OK
01  Maggie  HangZhou    Male    26
01  Maggie  HangZhou    Male    28
01  Maggie  HangZhou    Female  18
02  Timu    BeiJing Male    26
02  Timu    BeiJing Male    28
02  Timu    BeiJing Female  18
03  Tom ShangHai    Male    26
03  Tom ShangHai    Male    28
03  Tom ShangHai    Female  18
Time taken: 6.571 seconds, Fetched: 9 row(s)
 

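Note: in Hive, specify the join keys in the ON clause rather than in WHERE. The ON condition is evaluated as part of the join, logically before the WHERE filter is applied to the joined rows.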
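
The order of ON and WHERE matters most for outer joins. The Python sketch below is a hedged illustration of that point, not Hive's evaluation logic: left_join is a hypothetical helper, None stands in for NULL, and the filter on sex = 'Male' is an example condition that does not appear in the queries above.

```python
def left_join(left, right, on):
    """Left outer join driven by a predicate; unmatched left rows are padded with None."""
    rows = []
    for left_row in left:
        hits = [right_row for right_row in right if on(left_row, right_row)]
        if hits:
            rows.extend(left_row + right_row[1:] for right_row in hits)
        else:
            rows.append(left_row + (None,) * (len(right[0]) - 1))
    return rows

detail = [("01", "Maggie", "HangZhou"), ("02", "Timu", "BeiJing"), ("03", "Tom", "ShangHai")]
info = [("01", "Male", "26"), ("02", "Female", "18"), ("04", "Male", "28")]

# Condition placed in ON: it only decides which right rows match; all left rows survive.
filter_in_on = left_join(detail, info, on=lambda l, r: l[0] == r[0] and r[1] == "Male")

# Condition placed in WHERE: it runs on the joined rows and drops the NULL-padded ones.
joined = left_join(detail, info, on=lambda l, r: l[0] == r[0])
filter_in_where = [row for row in joined if row[3] == "Male"]

print(len(filter_in_on), len(filter_in_where))
```

With the sample data, the ON variant keeps all three users (two of them with NULL sex and age), while the WHERE variant keeps only user 01.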

 

3. Join optimization principles

 

1. Filter rows with the WHERE conditions first and join afterwards, so that as little data as possible takes part in the join.

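2. When joining a small table with a large table, put the small table first and the large table last. Alternatively, use a MAPJOIN hint to designate the small table.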

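3. When several joins share the same join condition, put them in the same job where possible, and order the joined tables from smallest to largest.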
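
The first principle can be illustrated with a small Python sketch. It is a toy model under stated assumptions: inner_join is a hypothetical hash join that also reports how many input rows it had to process, and the filter city = 'HangZhou' is an example condition. For an inner join the result is the same either way; only the amount of data entering the join changes.

```python
def inner_join(left, right):
    """Hash join on field 0; returns (joined rows, number of input rows processed)."""
    table = {}
    for right_row in right:
        table.setdefault(right_row[0], []).append(right_row)
    rows = [l + r[1:] for l in left for r in table.get(l[0], [])]
    return rows, len(left) + len(right)

detail = [("01", "Maggie", "HangZhou"), ("02", "Timu", "BeiJing"), ("03", "Tom", "ShangHai")]
info = [("01", "Male", "26"), ("02", "Female", "18"), ("04", "Male", "28")]

# Join first, filter afterwards: every row of both tables enters the join.
late_rows, late_cost = inner_join(detail, info)
late_rows = [row for row in late_rows if row[2] == "HangZhou"]

# Filter first, join afterwards: only the surviving rows enter the join.
early_rows, early_cost = inner_join([d for d in detail if d[2] == "HangZhou"], info)

print(early_rows == late_rows, early_cost, late_cost)
```

Note that this equivalence holds for inner joins; with outer joins, moving a condition between the pre-filter, ON and WHERE can change the result.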

 

4. Summary

 

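This article covered how the join keyword works in Hive and how to use it. For specific optimization techniques, see the article "Hive数据倾斜及优化方案" (Hive data skew and optimization approaches). If you find errors or omissions in this post, corrections are welcome.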


Reposted from www.cnblogs.com/Maggieli/p/12134050.html