On the differences and connections between union and join

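union and join are the keywords you reach for most often when several tables have to be combined. I won't go through the formal definitions; look them up online if you need them, since I can't quote them accurately from memory either.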

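Differences first. union combines the rows of two tables, so it works vertically, and it requires both tables to have the same columns (Schema of both sides of union should match.). In other words, if table A has three rows and table B has two rows, A union all B has five rows. As for union versus union all: union merges identical records into one, while union all does not merge anything and returns every record from both sides. For example, run the following statements in MySQL: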

select * from tmp_libingxue_a;

name number

libingxue 1001

yuwen 1002

select * from tmp_libingxue_b;

name number

libingxue 1001

feiyao 1003

select * from tmp_libingxue_a union select * from tmp_libingxue_b;

libingxue 1001

yuwen 1002

feiyao 1003

select * from tmp_libingxue_a union all select * from tmp_libingxue_b;

libingxue 1001

yuwen 1002

libingxue 1001

feiyao 1003
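The MySQL session above can be replayed without a MySQL server. The following is only a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for MySQL (an assumption of this example, not what the original session used); the table names and rows are the ones shown above.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL tables above (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table tmp_libingxue_a (name text, number integer);
    create table tmp_libingxue_b (name text, number integer);
    insert into tmp_libingxue_a values ('libingxue', 1001), ('yuwen', 1002);
    insert into tmp_libingxue_b values ('libingxue', 1001), ('feiyao', 1003);
""")

# union removes rows that are duplicated across the two inputs.
union_rows = conn.execute(
    "select * from tmp_libingxue_a union select * from tmp_libingxue_b"
).fetchall()

# union all keeps every row from both inputs.
union_all_rows = conn.execute(
    "select * from tmp_libingxue_a union all select * from tmp_libingxue_b"
).fetchall()

print(len(union_rows), sorted(union_rows))
print(len(union_all_rows), sorted(union_all_rows))
```

With this data union returns three rows and union all returns four, because the libingxue record appears in both tables.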

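This does not work in Hive, though (at least in the Hive version I was using; newer releases relax this). Running select * from tmp_libingxue_a union all select * from tmp_libingxue_b; fails, because in Hive a union has to be placed inside a subquery. For example: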

select * from (select * from tmp_yuwen_a union all select * from tmp_yuwen_b) t1;

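Note that it has to be union all; with plain union Hive complains that ALL is missing. The alias t1 at the end is also mandatory. You can call it a or b instead, but it must be there, otherwise the query fails.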

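join, by contrast, leans towards combining tables horizontally. Only leans towards; details follow. Compared with union, join is more relaxed: it places no requirement on the columns of the two tables. A join without any condition is the Cartesian product of the two tables, so a join needs a condition to constrain it, and a constrained join is what gives the horizontal extension. Row pairs that satisfy the condition are returned and the rest are filtered out. The usage is flexible. Here are two simple examples of the syntax (neither has an on condition, so both return the Cartesian product):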

select * from (select * from tmp_yuwen_a) t1 join (select * from tmp_yuwen_b) t2;

select * from tmp_yuwen_a t1 join (select * from tmp_yuwen_b) t2;

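left outer join and right outer join are used the same way. The difference is that left outer join returns every row of the left table together with the matching rows of the right table, and fills the right-hand columns with NULL where nothing matches; in other words, the left table is the reference. right outer join works the same way with the right table as the reference. The differences between these three joins have been covered many times, and there are more detailed explanations online, so I won't repeat them.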
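To make the three behaviours concrete, here is a small sketch, again assuming Python's sqlite3 module as a stand-in engine (my assumption; Hive and MySQL may differ in details). It reuses the tmp_libingxue_a and tmp_libingxue_b data from the union example and joins on the name column, which is a join key chosen just for this illustration.

```python
import sqlite3

# In-memory SQLite database with the two small tables from the union example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table tmp_libingxue_a (name text, number integer);
    create table tmp_libingxue_b (name text, number integer);
    insert into tmp_libingxue_a values ('libingxue', 1001), ('yuwen', 1002);
    insert into tmp_libingxue_b values ('libingxue', 1001), ('feiyao', 1003);
""")

# No condition: every row of a is paired with every row of b (Cartesian product).
cross_rows = conn.execute(
    "select * from tmp_libingxue_a t1 join tmp_libingxue_b t2"
).fetchall()

# Inner join: only the pairs that satisfy the condition survive.
inner_rows = conn.execute("""
    select t1.name, t1.number, t2.number
    from tmp_libingxue_a t1 join tmp_libingxue_b t2 on t1.name = t2.name
""").fetchall()

# Left outer join: every row of a is kept; unmatched b columns become NULL (None).
left_rows = conn.execute("""
    select t1.name, t1.number, t2.number
    from tmp_libingxue_a t1 left outer join tmp_libingxue_b t2 on t1.name = t2.name
    order by t1.number
""").fetchall()

print(len(cross_rows))
print(inner_rows)
print(left_rows)
```

The unconstrained join returns 2 x 2 = 4 rows, the inner join keeps only libingxue, and the left outer join also keeps yuwen with an empty right-hand side.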

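Common ground: in certain specific situations a join can do the job of union all. This only holds under certain conditions, and when it does, whether to use union all with group by or a join depends on the circumstances or on which of the two is cheaper. SQL has only a handful of keywords, but it is versatile and powerful; as long as you get the result you need, use it however you like. A simplified SQL reproduction of the requirement follows: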

drop table tmp_libingxue_resource;

create external table if not exists tmp_libingxue_resource(

user_id string,

shop_id string,

auction_id string,

search_time string

)partitioned by (pt string)

row format delimited fields terminated by '\t'

lines terminated by '\n'

stored as sequencefile;

drop table tmp_libingxue_result;

create external table if not exists tmp_libingxue_result(

user_id string,

shop_id string,

auction_id string,

search_time string

)partitioned by (pt string)

row format delimited fields terminated by '\t'

lines terminated by '\n'

stored as sequencefile;

insert overwrite table tmp_libingxue_result partition (pt='20041104') select user_id, shop_id, auction_id, search_time from tmp_libingxue_resource;

sudo -u taobao hadoop dfs -rmr /group/tbads/warehouse/tmp_libingxue_result/pt=20041104

sudo -u taobao hadoop jar /home/taobao/dataqa/framework/DailyReport.jar com.alimama.loganalyzer.tool.SeqFileLoader tmp_libingxue_resource.txt hdfs://v039182.sqa.cm4:54310/group/tbads/warehouse/tmp_libingxue_result/pt=20041104/part-00000

hive> select * from tmp_libingxue_resource;

2001 0 11 101 20041104

2002 0 11 102 20041104

hive> select * from tmp_libingxue_result;

2001 0 12 103 20041104

2002 0 12 104 20041104

select user_id,shop_id,max(auction_id),max(search_time)

from

(select * from tmp_libingxue_resource

union all

select * from tmp_libingxue_result )t1

group by user_id,shop_id;

2001 0 12 103

2002 0 12 104

select t1.user_id,t1.shop_id,t2.auction_id,t2.search_time

from

(select * from tmp_libingxue_resource) t1

join

(select * from tmp_libingxue_result) t2

on t1.user_id=t2.user_id and t1.shop_id=t2.shop_id;

2001 0 12 103

2002 0 12 104
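The two queries above agree only because every (user_id, shop_id) pair appears in both tables and the larger values come from tmp_libingxue_result. The sketch below illustrates that condition. It assumes Python's sqlite3 module as a stand-in for Hive and leaves out the pt partition column; both are simplifications of this example, not part of the original setup. The extra row for user 2003 is hypothetical data added to show where the two approaches diverge.

```python
import sqlite3

# In-memory SQLite stand-in for the two Hive tables (pt partition column omitted).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table tmp_libingxue_resource
        (user_id text, shop_id text, auction_id text, search_time text);
    create table tmp_libingxue_result
        (user_id text, shop_id text, auction_id text, search_time text);
    insert into tmp_libingxue_resource values
        ('2001', '0', '11', '101'), ('2002', '0', '11', '102');
    insert into tmp_libingxue_result values
        ('2001', '0', '12', '103'), ('2002', '0', '12', '104');
""")

# Approach 1: stack both tables, then keep the max per (user_id, shop_id).
UNION_SQL = """
    select user_id, shop_id, max(auction_id), max(search_time)
    from (select * from tmp_libingxue_resource
          union all
          select * from tmp_libingxue_result) t1
    group by user_id, shop_id
    order by user_id
"""

# Approach 2: pair rows by key and take the values from the result table.
JOIN_SQL = """
    select t1.user_id, t1.shop_id, t2.auction_id, t2.search_time
    from tmp_libingxue_resource t1
    join tmp_libingxue_result t2
      on t1.user_id = t2.user_id and t1.shop_id = t2.shop_id
    order by t1.user_id
"""

def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# With the original data both approaches return the same rows.
same_before = run(UNION_SQL) == run(JOIN_SQL)

# Hypothetical extra row whose key exists only in the resource table.
conn.execute("insert into tmp_libingxue_resource values ('2003', '0', '11', '105')")
union_after = run(UNION_SQL)
join_after = run(JOIN_SQL)

print(same_before)
print(len(union_after), len(join_after))
```

Once a key exists in only one table, the union all version still reports it while the inner join drops it, so the substitution is safe only when the keys match on both sides.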

Writing this up took far longer than I expected, but running someone else's SQL and running your own feel different. Yes, very different.
