Hive Bucketing and Sampling Queries

I. Bucketing

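Partitioning works on the storage path of the data (one directory per partition). Bucketing works on the data files themselves, which makes it the counterpart of partitioning in a Hadoop MapReduce job.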

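★ How is a bucket chosen? By default Hive hashes the bucketing column and takes the hash code modulo the number of buckets; the remainder decides which bucket each record goes into. Rows inside a bucket are sorted only if the table also declares SORTED BY, and the table as a whole has no ordering guarantee.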
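The hash-and-modulo rule can be sketched in a few lines of Python. This is only an illustration, assuming the legacy Hive behavior in which an int column hashes to its own value (newer Hive versions may use a different hash function); the name `bucket_for` is made up for this example.

```python
INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE, used to clear the sign bit


def bucket_for(hash_code, num_buckets):
    """Return the 0-based bucket index for a row's hash code."""
    # Masking keeps the value non-negative before taking the remainder.
    return (hash_code & INT_MAX) % num_buckets


if __name__ == "__main__":
    # Assumption: an int id is its own hash code.
    for student_id in (1001, 1002, 1003, 1004):
        print(student_id, "-> bucket", bucket_for(student_id, 4))
```

Under that assumption, ids that differ by a multiple of the bucket count always share a bucket.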

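Partitioning offers a convenient way to isolate data and speed up queries. However, not every data set can be partitioned sensibly, especially given the concern mentioned earlier about choosing a suitable partition size.

Bucketing is another technique for breaking a data set into parts that are easier to manage.

One purpose of bucketing is to support sampling.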

Hands-on example:

1. Create the bucketed table

create table stu_buck(id int, name string)

clustered by(id) -- column to bucket on

into 4 buckets -- number of buckets

row format delimited fields terminated by '\t';

2. Load the data (student.txt, tab-separated):

1001   ss1
1002   ss2
1003   ss3
1004   ss4
1005   ss5
1006   ss6
1007   ss7
1008   ss8

hive (default)> load data local inpath '/opt/module/datas/student.txt' into table stu_buck;

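3. Check whether the bucketed table was actually split into 4 buckets

Surprisingly, there are not 4 buckets. The reason is that a bucketed table cannot be populated with load: load only moves the file into the table's directory on HDFS without running a job, so nothing in that process reads the bucketing column or computes its hash.

★ In summary, a bucketed table has to be populated with insert ... select.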

Steps:

(1) First create an ordinary stu table

create table stu(id int, name string)

row format delimited fields terminated by '\t';

(2) Load the data into the ordinary stu table

load data local inpath '/opt/module/datas/student.txt' into table stu;

(3) Insert the data into the bucketed table with insert ... select

insert into table stu_buck

select id, name from stu;

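Note: this only works if the number of reducers equals the number of buckets (4 here), not one, because each reducer writes one bucket file.

So when necessary, set the following before running the insert (hive.enforce.bucketing was removed in Hive 2.0, where bucketing is always enforced). If the insert in step (3) has already run, use insert overwrite instead to avoid duplicate rows: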

hive (default)> set hive.enforce.bucketing=true;

hive (default)> set mapreduce.job.reduces=-1;

hive (default)> insert into table stu_buck

select id, name from stu;
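If the insert ran with one reducer per bucket, the table directory should now hold four files. The sketch below predicts which rows land in which bucket, assuming an int id hashes to its own value as in legacy Hive bucketing; `split_into_buckets` is a hypothetical helper, and the real file names are whatever Hive generates.

```python
from collections import defaultdict

# The eight rows of student.txt used in this walkthrough.
STUDENTS = [(1001, "ss1"), (1002, "ss2"), (1003, "ss3"), (1004, "ss4"),
            (1005, "ss5"), (1006, "ss6"), (1007, "ss7"), (1008, "ss8")]


def split_into_buckets(rows, num_buckets):
    """Group rows by id modulo num_buckets, mimicking one reducer per bucket."""
    buckets = defaultdict(list)
    for row_id, name in rows:
        buckets[row_id % num_buckets].append((row_id, name))
    return dict(buckets)


if __name__ == "__main__":
    for index, rows in sorted(split_into_buckets(STUDENTS, 4).items()):
        print("bucket", index, rows)
```

Each bucket should end up with two of the eight rows, which is an easy thing to compare against the files in the table directory.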

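II. Sampling Queries

For very large data sets, users sometimes need a representative result rather than the full result. Hive meets this need by sampling the table.

Query the data in table stu_buck.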

hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);
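To get a feel for what this query returns, the row filter can be simulated in Python. This is a rough sketch, assuming an int id hashes to its own value (legacy Hive hashing) and that bucket x of y keeps the rows whose hash modulo y equals x - 1; `table_sample` is a made-up name, not a Hive API.

```python
STUDENT_IDS = [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008]


def table_sample(ids, x, y):
    """Keep ids whose hash falls into the x-th of y buckets (x is 1-based)."""
    return [i for i in ids if i % y == x - 1]


if __name__ == "__main__":
    # Mirrors: tablesample(bucket 1 out of 4 on id)
    print(table_sample(STUDENT_IDS, 1, 4))
```

Under these assumptions the query keeps a quarter of the rows, namely the ids divisible by 4.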

Note: tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y [ON colname]).

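y must be a multiple or a factor of the table's total number of buckets. Hive uses y to decide the sampling ratio. For example, if the table has 4 buckets, y=2 samples 4/2 = 2 buckets' worth of data, and y=8 samples 4/8 = 1/2 of one bucket.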
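The ratio arithmetic is simple enough to check by hand, but a small sketch makes the rule explicit. The function name `buckets_sampled` is hypothetical, and the validation only mirrors the rule as stated in these notes, not Hive's own error handling.

```python
from fractions import Fraction


def buckets_sampled(total_buckets, y):
    """Return how many buckets' worth of data is read: total_buckets / y."""
    # Rule from the text: y is a multiple or a factor of the bucket count.
    if total_buckets % y != 0 and y % total_buckets != 0:
        raise ValueError("y should be a multiple or a factor of total_buckets")
    return Fraction(total_buckets, y)


if __name__ == "__main__":
    print(buckets_sampled(4, 2))  # two whole buckets
    print(buckets_sampled(4, 8))  # half of one bucket
```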

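x is the bucket to start from. If more than one bucket is needed, each subsequent bucket number is the current bucket number plus y. For example, with 4 buckets in total, tablesample(bucket 1 out of 2) samples 4/2 = 2 buckets of data: the 1st (x) and the 3rd (x+y).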
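The stepping rule (x, x+y, x+2y, ...) can be written out as a short sketch. It assumes y is a factor of the total bucket count, so whole buckets are selected; `selected_buckets` is an illustrative name, and its checks only imitate the constraints described in these notes.

```python
def selected_buckets(total_buckets, x, y):
    """List the 1-based bucket numbers x, x + y, x + 2y, ... up to total_buckets."""
    if x > y:
        raise ValueError("x must not be greater than y")
    if total_buckets % y != 0:
        raise ValueError("this sketch assumes y is a factor of total_buckets")
    return list(range(x, total_buckets + 1, y))


if __name__ == "__main__":
    # Mirrors: tablesample(bucket 1 out of 2) on a table with 4 buckets
    print(selected_buckets(4, 1, 2))
```

The length of the returned list equals total_buckets / y, which matches the sampling ratio described earlier.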

Note: x must be less than or equal to y. Otherwise Hive reports:

FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck

This error means that x is greater than y.