Bucketed Sampling Queries Explained

Bucketing

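Enable bucketing enforcement first; otherwise the insert will not actually split the data into buckets and everything that follows is wasted effort. (From Hive 2.0 on this property was removed and bucketing is always enforced.)
set hive.enforce.bucketing=true;
When creating the table, add clustered by (id) into 4 buckets to split the data into 4 buckets by id.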

Sample at query time: select * from tbname tablesample (bucket 1 out of 4 on id);

create table tb_user2
(
id int,
name string,
likes array<string>,
addrs map<string, string>
)
clustered by (id) into 4 buckets
row format
delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';

insert into table tb_user2 select id, name, likes, addrs from tb_user1;
At this point the tb_user2 directory in HDFS holds 4 files: 000000_0, 000001_0, 000002_0 and 000003_0.
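As a rough illustration of how rows end up in those four files, the sketch below imitates the assignment in Python. It assumes the classic rule that an int id hashes to itself and lands in bucket hash % 4, and that bucket N is written as file 00000N_0. The function name bucket_file and the sample ids are made up for this example and are not part of Hive.

```python
def bucket_file(row_id: int, num_buckets: int = 4) -> str:
    """Return the name of the bucket file a row would land in (illustrative)."""
    # Assumption: for an int column the hash is the value itself.
    # The mask keeps the hash non-negative before taking the remainder.
    bucket_index = (row_id & 0x7FFFFFFF) % num_buckets
    # Bucket files are numbered from 0: 000000_0, 000001_0, ...
    return f"{bucket_index:06d}_0"


# Hypothetical ids standing in for the rows copied from tb_user1.
sample_ids = [1, 2, 3, 4, 5, 8, 11]

layout = {}
for row_id in sample_ids:
    layout.setdefault(bucket_file(row_id), []).append(row_id)

for file_name in sorted(layout):
    print(file_name, layout[file_name])
```

Under these assumptions, ids with the same remainder modulo 4 always share a file, which is what lets a sampling query read only some of the files.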

Question 1: What is bucketing for? Can you explain it clearly?

Sampling queries:
select * from tb_user2 tablesample(bucket 3 out of 4 on id);
select * from tb_user2 tablesample(bucket 3 out of 8 on id);
select * from tb_user2 tablesample(bucket 3 out of 2 on id); -- rejected by Hive: x (3) must not be larger than y (2)
select * from tb_user2 tablesample(bucket 1 out of 2 on id);

Question 2: How should tablesample(bucket 3 out of 4 on id) be read?

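The hash of the id column is taken modulo the bucket count, 4, which gives 4 possible outcomes:

id % 4 leaves a remainder of 0, 1, 2 or 3, and each of the 4 remainders corresponds to one bucket.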

bucket 1 out of 4
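The returned rows all have an id hash that is divisible by 4, i.e. remainder 0; these are the rows stored in bucket number 1 (the first bucket).

(For an int column the hash value is the integer itself.)
bucket 3 out of 4
The returned rows all have an id hash that leaves remainder 2 when divided by 4 (remainder x - 1 in general); these are the rows stored in bucket number 3 (the third bucket).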
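To check the remainder rule by hand, here is a minimal Python sketch of the row filter behind bucket x out of y. It assumes the id is an int whose hash is the value itself; in_sample and the id range 1..12 are hypothetical names and data chosen for this example.

```python
def in_sample(row_id: int, x: int, y: int) -> bool:
    """True if the row is kept by `bucket x out of y on id` (illustrative)."""
    if not 1 <= x <= y:
        # x is 1-based and may not exceed y, so "3 out of 2" is invalid.
        raise ValueError("x must satisfy 1 <= x <= y")
    # Buckets are numbered from 1, remainders from 0, hence x - 1.
    return (row_id & 0x7FFFFFFF) % y == x - 1


ids = range(1, 13)
first_bucket = [i for i in ids if in_sample(i, 1, 4)]
third_bucket = [i for i in ids if in_sample(i, 3, 4)]

print(first_bucket)  # ids with remainder 0
print(third_bucket)  # ids with remainder 2
```

The off-by-one between the 1-based bucket number and the 0-based remainder is the detail that is easiest to get wrong.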

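Question 3: How is the data actually selected?

x out of y: split the data into y groups and take the x-th group (x must not exceed y).

For example, take a table created with 32 buckets and keep x = 3 while y varies: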


1) 3 out of 32
The source data is split into 32 buckets. With 3 out of 32, the data in bucket 3 is taken out of the 32 buckets.

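2) 3 out of 8
The source data is split into 32 buckets. With 3 out of 8, the 32 buckets are treated as 8 sampling groups, so 32 / 8 = 4 buckets of data are needed. Cut the 32 buckets into 4 runs of 8 and take the 3rd bucket of each run:
1 2 3 4 5 6 7 8 | 9 10 11 12 13 14 15 16 | 17 18 19 20 21 22 23 24 | 25 26 27 28 29 30 31 32
The 3rd bucket of each run is selected: buckets 3, 11, 19 and 27.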

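3) 3 out of 16
The source data is split into 32 buckets. With 3 out of 16, 32 / 16 = 2 buckets of data are needed. Cut the 32 buckets into 2 runs of 16 and take the 3rd bucket of each run:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
The 3rd bucket of each run is selected: buckets 3 and 19.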
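The three cases so far follow one pattern: start at bucket x and step by y until the 32 buckets run out. A small Python sketch of that rule, assuming y divides the table's bucket count evenly (buckets_read is a hypothetical helper name, not a Hive function):

```python
def buckets_read(x: int, y: int, table_buckets: int = 32) -> list[int]:
    """1-based bucket numbers covered by `bucket x out of y` (illustrative)."""
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    if table_buckets % y != 0:
        # This sketch only covers the case where y divides the bucket count.
        raise ValueError("y must divide table_buckets in this sketch")
    # Bucket x of every run of y buckets: x, x + y, x + 2y, ...
    return list(range(x, table_buckets + 1, y))


for y in (32, 16, 8):
    print(f"3 out of {y}:", buckets_read(3, y))
```

The number of buckets returned is always table_buckets / y, which matches the 32 / 8 = 4 and 32 / 16 = 2 counts worked out by hand.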

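4) 3 out of 64
The source data is split into 32 buckets. With 3 out of 64, 32 / 64 = 0.5, so only half a bucket of data is needed. Because the result is less than 1, the buckets are not cut into runs as above.
The data still comes from bucket 3, but only half of that bucket is taken.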
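Which half gets kept follows from the same remainder rule: inside bucket 3 (hash % 32 == 2), only the rows with hash % 64 == 2 qualify, and the rows with hash % 64 == 34 are skipped. The Python sketch below illustrates this under the assumption that ids are ints hashing to themselves; the id range 0..255 is made-up data.

```python
TABLE_BUCKETS = 32
X, Y = 3, 64  # bucket 3 out of 64

# Hypothetical ids; with int ids the hash is assumed to be the value itself.
ids = range(256)

# Rows physically stored in bucket 3 (1-based), i.e. remainder 2 modulo 32.
bucket3 = [i for i in ids if i % TABLE_BUCKETS == X - 1]

# Rows of that bucket kept by the sample: remainder 2 modulo 64.
sampled = [i for i in bucket3 if i % Y == X - 1]

print(len(bucket3), len(sampled))
print(sampled)
```

With evenly spread ids, exactly half of bucket 3 survives, and no row outside bucket 3 can satisfy the filter because 64 is a multiple of 32.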

This logic is easy to forget, so come back and reread it from time to time.

李大海的幸福生活
