Hive bucket

create table buck (id string, name string)
clustered by (id)
sorted by (id)
into 4 buckets
row format delimited fields terminated by ',';

The "-ed" form (clustered by, sorted by) in the create statement only declares that the table is bucketed and sorted. It is just a format description and does not bucket or sort any data by itself.

If the data is loaded with load data local, it is still a single whole file on HDFS: the bucketed table is not automatically divided into buckets.

truncate table xx clears the table data.

-- Enable bucketing
set hive.enforce.bucketing=true;

-- Set the number of reducers to be the same as the number of buckets
set mapreduce.job.reduces=4;

hive> insert into table buck
> select id,name from p distribute by (id)
> sort by (id) ;

distribute by names the column whose hash decides which partition (reducer) each row is sent to. Since the number of reducers is set to 4, the rows are split into 4 partitions.

sort by names the column by which the rows are ordered within each partition.
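To make the hash partitioning more concrete, the query below previews which of the 4 partitions each row would fall into. It is only an illustration: hash() and pmod() are built-in Hive functions, p and id are the table and column from the insert above, and the alias bucket_no is made up here. The hash that Hive uses internally when writing buckets can differ between versions, so the numbers may not match the actual bucket files exactly.

```sql
-- Illustration only: approximate the partition (0..3) each id hashes to.
-- Assumes the table p with a column id, as in the insert statement above.
select id,
       pmod(hash(id), 4) as bucket_no   -- non-negative remainder of the hash divided by 4
from p;
```

Rows with the same id always get the same bucket_no, which is the property that bucketing relies on.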

Alternatively:

insert into table buck

select id,name from p cluster by (id);

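That is, cluster by (id) is equivalent to distribute by (id) + sort by (id).

But the latter is more flexible, because the distribute column and the sort column can be different. Note that with multiple reducers, sort by only sorts the data inside each reducer; it does not produce a global order.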
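As a sketch of that flexibility, the query below distributes on one column and sorts on another, in descending order. It assumes the same table p (id, name) used earlier and is written as a plain select so that it does not conflict with the sort order declared on buck.

```sql
-- Sketch: rows with the same id go to the same reducer,
-- but inside each reducer they are ordered by name, descending.
select id, name
from p
distribute by id
sort by name desc;

-- cluster by id would be the shorthand only for the case where both
-- columns are id and the order is ascending.
```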

A Hive partition and the MapReduce (MR) partitioner are not the same thing.

A Hive partition just places the loaded files into the directory specified for that partition.

Buckets (clustered by) are divided by hash partitioning, that is, by the MR partitioner.

If you upload the file directly with load, it is not split into buckets.

Instead, query the data from another table and insert the result; the queried rows are then placed into different buckets by the partitioner.

Bucket > Partition

1 create table ordinary (id int,name string) row format delimited fields terminated by ',' ;

Create a normal table

load data local inpath 'xxx' [overwrite] into table ordinary;

Load a local file into the normal table (overwrite is optional and replaces the existing data)

2 create table buck (id int, name string) clustered by (id) sorted by (id) into 4 buckets row format delimited fields terminated by ',' ;

Create a bucketed table

insert into table buck select id,name from ordinary cluster by (id);
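To check that the insert really produced buckets, one option is to look at the table's files from the Hive CLI. This is a sketch under assumptions: the path /user/hive/warehouse/buck is only the common default warehouse location and may differ on your cluster, so take the real path from the Location field printed by describe formatted.

```sql
-- Shows table metadata, including Location, Num Buckets and Bucket Columns.
describe formatted buck;

-- Assumed default warehouse path; replace it with the Location printed above.
-- A table filled by insert ... cluster by with 4 reducers should list
-- one file per bucket, while a table filled by load keeps the loaded file as is.
dfs -ls /user/hive/warehouse/buck;
```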

The purpose of bucketing is to make joins more efficient. Because cluster by / distribute by assigns rows by the hash of id, rows with the same id always end up in the same bucket, so matching rows can be joined bucket against bucket instead of scanning everything. The premise is that both tables are bucketed tables, bucketed on the join column.

For example, a join: select a.id, a.name, b.id, b.name from a join b on a.id = b.id;
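A minimal sketch of asking Hive to exploit the buckets for this join is shown below. It assumes that a and b were both created with clustered by (id) into the same number of buckets and were filled through insert ... select rather than load. hive.optimize.bucketmapjoin and the MAPJOIN hint are existing Hive features, but whether the optimizer actually chooses a bucket map join depends on the Hive version and other settings, so check the plan with explain.

```sql
-- Assumption: a and b are both bucketed on id into the same number of buckets.
set hive.optimize.bucketmapjoin=true;

-- Inspect the plan first to see which join strategy Hive picks.
explain
select /*+ MAPJOIN(b) */ a.id, a.name, b.id, b.name
from a join b on a.id = b.id;

-- Then run the join itself.
select /*+ MAPJOIN(b) */ a.id, a.name, b.id, b.name
from a join b on a.id = b.id;
```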

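A partition, by contrast, is a subdirectory that makes queries more convenient: a query that filters on the partition column only needs to read the matching directories.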
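For contrast with the bucketed table, here is a small sketch of a partitioned table. The table name logs and the partition column dt are made up for illustration, and 'xxx' stands for a local file path as in the load example earlier.

```sql
-- Hypothetical partitioned table: dt is a partition column, not stored in the data file.
create table logs (id string, name string)
partitioned by (dt string)
row format delimited fields terminated by ',';

-- load simply places the file in the subdirectory of the named partition,
-- for example .../logs/dt=2020-01-01/ under the table's location.
load data local inpath 'xxx' into table logs partition (dt='2020-01-01');

-- A filter on dt lets Hive read only that subdirectory.
select id, name from logs where dt = '2020-01-01';
```

Here load is enough, because the partition is chosen by the statement itself, whereas buckets require an insert ... select so that the rows can be hashed.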
