Hive 表的连接

Hive表常用连接

对于直接在mapReduce中用join相比，hive的好处是简化了繁琐的处理工作，hive表的连接操作就是如此，本文主要讲解hive的4中主要连接：内连接、外连接、半连接、map连接。

我们用如下的sales,things表的数据来举例说明各种连接的作用，方便大家理解。

（图1 sales表）（图2 things表）

内连接

内连接是最简单的一种连接，它就是将表匹配的行显示出来。通过join关键字对表连接，然后是通过on关键字进行谓语动词的连接，等值的条件在on语句中进行限定，当然我们可以在条件中用and，or等分割限定的条件。

如：select sales.*,things.* from sales join things on (sales.id=things.id);

（图3 查询结果）

添加and限定：select sales.*,things.* from sales join things on (sales.id=things.id and sales.id>2);

（图4 查询结果）

通常单个的连接是执行一个mapredce，可以通过explain来看执行了多少个mapreduce

如：explain extended select sales.*,things.* from sales join things on (sales.id=things.id);

外连接

外连接可以显示表中不能匹配的行，外连接可以分为left outer join，right outer join，full outer join三种

left outer join

左连接是显示左表的字段，将join表的字段不能匹配的行null来显示

比如：select sales.*,things.* from sales left outer join things on (sales.id=things.id);

joe	2	shuit	2
hank	3	milk	3
wangwu	4	water	4
lisi	0	NULL	NULL
daic	2	shuit	2

right outer join

相对于left outer join相比，right outer join是交换两表的连接关系

比如：select sales.*,things.* from sales right outer join things on (sales.id=things.id);

joe	2	shuit	2
daic	2	shuit	2
wangwu	4	water	4
NULL	NULL	air	1
hank	3	milk	3

full outer join

顾名思义就是将所有表所在的行都有对应的行输出

比如：

select sales.*,things.* from sales full outer join things on (sales.id=things.id);

lisi	0	NULL	NULL
wangwu	4	water	4
NULL	NULL	air	1
joe	2	shuit	2
daic	2	shuit	2
hank	3	milk	3

半连接，半连接类似于左连接，不过并不会输出右表的值：

比如：select * from sales left semi join things on (sales.id=things.id);

joe	2
hank	3
wangwu	4
daic	2

map连接

当一个表足够小，比如sales表，适合放在内存中，就可以将其放在内存中做连接操作。如果需要指定map，就需要通过注释的方式来做。

不如：select /* + mapjoin(sales) */ sales.*,things.* from sales join things on (sales.id=things.id);

joe	2	shuit	2
hank	3	milk	3
wangwu	4	water	4
daic	2	shuit	2

最后查看下执行过程。

比如:explain select /* + mapjoin(sales) */ sales.*,things.* from sales join things on (sales.id=things.id);

STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      DagName: hadoop_20190126120909_7f4e37ab-c15f-465e-89d7-14f2b8283d6a:32
      Vertices:
        Map 2 
            Map Operator Tree:
                TableScan
                  alias: things
                  Statistics: Num rows: 1 Data size: 29 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 29 Basic stats: COMPLETE Column stats: NONE
                    Spark HashTable Sink Operator
                      keys:
                        0 id (type: string)
                        1 id (type: string)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
      DagName: hadoop_20190126120909_7f4e37ab-c15f-465e-89d7-14f2b8283d6a:31
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: sales
                  Statistics: Num rows: 1 Data size: 36 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 36 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      keys:
                        0 id (type: string)
                        1 id (type: string)
                      outputColumnNames: _col0, _col1, _col5, _col6
                      input vertices:
                        1 Map 2
                      Statistics: Num rows: 1 Data size: 39 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string), _col1 (type: string), _col5 (type: string), _col6 (type: string)
                        outputColumnNames: _col0, _col1, _col2, _col3
                        Statistics: Num rows: 1 Data size: 39 Basic stats: COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 1 Data size: 39 Basic stats: COMPLETE Column stats: NONE
                          table:
                              input format: org.apache.hadoop.mapred.TextInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Hive表常用连接

内连接

外连接

left outer join

right outer join

full outer join

map连接

猜你喜欢