Hive 表的连接

版权声明:本文为博主原创文章,未经博主允许不得转载。 https://blog.csdn.net/humanity11/article/details/86655066

Hive表常用连接

对于直接在mapReduce中用join相比,hive的好处是简化了繁琐的处理工作,hive表的连接操作就是如此,本文主要讲解hive的4中主要连接:内连接、外连接、半连接、map连接。

我们用如下的sales,things表的数据来举例说明各种连接的作用,方便大家理解。

                             

(图1 sales表)                                                                       (图2 things表)

内连接

内连接是最简单的一种连接,它就是将表匹配的行显示出来。通过join关键字对表连接,然后是通过on关键字进行谓语动词的连接,等值的条件在on语句中进行限定,当然我们可以在条件中用and,or等分割限定的条件。

如:select sales.*,things.* from sales join things on (sales.id=things.id);

(图3 查询结果)

添加and限定:select sales.*,things.* from sales join things on (sales.id=things.id and sales.id>2);

(图4 查询结果)

通常单个的连接是执行一个mapredce,可以通过explain来看执行了多少个mapreduce

如:explain extended select sales.*,things.* from sales join things on (sales.id=things.id);

外连接

外连接可以显示表中不能匹配的行,外连接可以分为left outer join,right outer join,full outer join三种

left outer join

左连接是显示左表的字段,将join表的字段不能匹配的行null来显示

比如:select sales.*,things.* from sales left outer join things on (sales.id=things.id);

joe	2	shuit	2
hank	3	milk	3
wangwu	4	water	4
lisi	0	NULL	NULL
daic	2	shuit	2

right outer join

相对于left outer join相比,right outer join是交换两表的连接关系

比如:select sales.*,things.* from sales right outer join things on (sales.id=things.id);

joe	2	shuit	2
daic	2	shuit	2
wangwu	4	water	4
NULL	NULL	air	1
hank	3	milk	3

full outer join

顾名思义就是将所有表所在的行都有对应的行输出

比如:

select sales.*,things.* from sales full outer join things on (sales.id=things.id);

lisi	0	NULL	NULL
wangwu	4	water	4
NULL	NULL	air	1
joe	2	shuit	2
daic	2	shuit	2
hank	3	milk	3

 半连接,半连接类似于左连接,不过并不会输出右表的值:

比如:select * from sales left semi join things on (sales.id=things.id);

joe	2
hank	3
wangwu	4
daic	2

map连接

当一个表足够小,比如sales表,适合放在内存中,就可以将其放在内存中做连接操作。如果需要指定map,就需要通过注释的方式来做。

不如:select /* + mapjoin(sales) */ sales.*,things.* from sales join things on (sales.id=things.id);

joe	2	shuit	2
hank	3	milk	3
wangwu	4	water	4
daic	2	shuit	2

最后查看下执行过程。

比如:explain  select /* + mapjoin(sales) */ sales.*,things.* from sales join things on (sales.id=things.id);

STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      DagName: hadoop_20190126120909_7f4e37ab-c15f-465e-89d7-14f2b8283d6a:32
      Vertices:
        Map 2 
            Map Operator Tree:
                TableScan
                  alias: things
                  Statistics: Num rows: 1 Data size: 29 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 29 Basic stats: COMPLETE Column stats: NONE
                    Spark HashTable Sink Operator
                      keys:
                        0 id (type: string)
                        1 id (type: string)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
      DagName: hadoop_20190126120909_7f4e37ab-c15f-465e-89d7-14f2b8283d6a:31
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: sales
                  Statistics: Num rows: 1 Data size: 36 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 36 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      keys:
                        0 id (type: string)
                        1 id (type: string)
                      outputColumnNames: _col0, _col1, _col5, _col6
                      input vertices:
                        1 Map 2
                      Statistics: Num rows: 1 Data size: 39 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string), _col1 (type: string), _col5 (type: string), _col6 (type: string)
                        outputColumnNames: _col0, _col1, _col2, _col3
                        Statistics: Num rows: 1 Data size: 39 Basic stats: COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 1 Data size: 39 Basic stats: COMPLETE Column stats: NONE
                          table:
                              input format: org.apache.hadoop.mapred.TextInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

猜你喜欢

转载自blog.csdn.net/humanity11/article/details/86655066