create table buck(id string, name string)
clustered by (id)
sorted by (id)
into 4 buckets
row format delimited fields terminated by ',' ;
Note the past-tense forms "clustered by" and "sorted by": they declare, at creation time, how the table's data should be bucketed and sorted. This is only metadata — a storage-format declaration, not an action that reorganizes data.
If you fill a bucketed table with load data local, the data is still stored as one whole file on HDFS — Hive will not automatically split it into buckets.
truncate table xx clears the table's data.
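A quick way to see this (file and directory paths here are illustrative, not from the original notes): load a file into the bucketed table, then list the table's warehouse directory from the Hive CLI — you will find one whole file, not four bucket files.

```sql
-- assumes bucketed table `buck` and a local file /tmp/data.txt (illustrative path)
load data local inpath '/tmp/data.txt' into table buck;

-- list the table directory from inside the Hive CLI
dfs -ls /user/hive/warehouse/buck;
-- the loaded file appears as-is; no bucket files were created
```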
# enable bucketing
set hive.enforce.bucketing=true;
# set the number of reducers to match the number of buckets
set mapreduce.job.reduces=4;
hive> insert into table buck
> select id,name from p distribute by (id)
> sort by (id) ;
distribute by (id) tells Hive which column to hash-partition on; with the number of reducers set to 4, each row is hashed into one of the 4 reducers (buckets).
sort by (id) orders the rows within each partition.
Alternatively:
insert into table buck
select id,name from p cluster by (id);
That is, cluster by (id) is equivalent to distribute by (id) + sort by (id).
But the latter form is more flexible: the distribute column and the sort column can differ. Note that with multiple reducers, sort by only orders the data within each reducer, not globally.
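A sketch of that flexibility (reusing the tables from the notes): cluster by can only hash and sort on the same column, while distribute by + sort by can split on one column and order each bucket by another.

```sql
-- equivalent to cluster by (id):
insert into table buck
select id, name from p distribute by (id) sort by (id);

-- not expressible with cluster by: hash rows by id, but order each bucket by name
insert into table buck
select id, name from p distribute by (id) sort by (name);
```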
Hive's "partition" and the MapReduce partitioner are not the same thing.
A Hive partition merely places loaded files under a specified subdirectory.
A bucket (clustered by) is divided by hash partitioning — that is, by the MapReduce partitioner.
If you upload a file directly with load, it will not be split into buckets.
Instead, you query the data from another table and insert it, and the query results are hashed into the different buckets. Bucketing is a finer-grained division than partitioning (Bucket > Partition).
1. create table ordinary (id int, name string) row format delimited fields terminated by ',' ;
Create a normal table.
load data local inpath 'xxx' into table ordinary;   (or: overwrite into table)
Load a local file into the normal table.
2. create table buck (id int, name string) clustered by (id) sorted by (id) into 4 buckets row format delimited fields terminated by ',' ;
Create a bucketed table.
insert into table buck select id, name from ordinary cluster by (id);
Bucketing exists mainly to make joins efficient: because cluster by / distribute by hashes rows on the key, equal ids are guaranteed to land in the same-numbered bucket, so matching buckets can be joined directly. The precondition is that both tables are bucketed (on the join key).
e.g. select a.id, a.name, b.id, b.name from a join b on a.id = b.id;
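For Hive to actually exploit the buckets in a join like the one above, the bucket map join optimization has to be switched on. A minimal sketch, assuming a and b are both bucketed on id with compatible bucket counts:

```sql
-- join corresponding buckets instead of shuffling whole tables
set hive.optimize.bucketmapjoin=true;

select a.id, a.name, b.id, b.name
from a join b on a.id = b.id;
```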
A partition, by contrast, is just a subdirectory; it exists to speed up queries by pruning directories.
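For contrast, a minimal partitioned-table sketch (table and path names are illustrative): each partition value becomes its own subdirectory under the table's directory, and a query that filters on the partition column only reads that subdirectory.

```sql
create table logs (id int, msg string)
partitioned by (dt string)
row format delimited fields terminated by ',' ;

load data local inpath '/tmp/log.txt' into table logs partition (dt='2024-01-01');
-- the file lands under .../warehouse/logs/dt=2024-01-01/

-- only scans the dt=2024-01-01 subdirectory
select * from logs where dt = '2024-01-01';
```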