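Preface
1. Kudu tables support hash partitioning and range partitioning. A table is divided into tablets according to a partition schema defined on its primary key columns. Each tablet is served by at least one tablet server. Ideally, a table is split into multiple tablets spread across different tablet servers to maximize parallelism.
2. Kudu currently has no mechanism for splitting or merging tablets after a table has been created.
3. A partition schema must be provided when the table is created.
4. When designing a table, choose a primary key that lets you partition the table into tablets that grow at similar rates.
5. You can partition a table with Impala's PARTITION BY clause, which supports distribution by RANGE or by HASH. A partition scheme can contain zero or more HASH definitions, followed by an optional RANGE definition. The RANGE definition can reference one or more primary key columns.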
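As a sketch of the "zero or more HASH definitions" rule in point 5, the statement below hashes on two primary key columns separately and has no RANGE level. The table `demo_multi_hash` and its columns are hypothetical and only illustrate the clause layout. The Kudu master addresses are assumed to be configured on the Impala side, so TBLPROPERTIES is left out.

```sql
-- Hypothetical table: two HASH levels, no RANGE level.
-- Each HASH level must use a different set of primary key columns.
CREATE TABLE demo_multi_hash
(
  id BIGINT,
  sku STRING,
  amount DOUBLE,
  PRIMARY KEY (id, sku)
)
PARTITION BY HASH (id) PARTITIONS 4,
             HASH (sku) PARTITIONS 2
STORED AS KUDU;
```

The tablet count is the product of the bucket counts at each level, so this table would start with 4 × 2 = 8 tablets.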
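I. PARTITION BY RANGE (range partitioning)
Pros: lets you split the table on specific values, or ranges of values, of the chosen partition key. This makes it possible to balance write parallelism against scan efficiency.
Cons: if you range-partition on a column whose values increase monotonically, the last tablet grows much larger than the others. In addition, all newly inserted rows land in a single tablet at any given time, which limits the scalability of data ingestion.
Example: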
create table apex_report.can_ci_che_dui_detail
(
order_no string,
certification_item string,
order_time string,
bill_date string,
goods_code string,
batch string,
mold string,
ssc_region_id string,
ssc_region_name string,
ssc_trade_id string,
ssc_trade_name string,
sfc_center_node string,
sfc_center_name_node string,
ssc_center_node string,
ssc_center_name_node string,
order_type string,
install_way string,
install_way_name string,
vehicle_lisence string,
vehicle_name string,
car_number string,
source_sn string,
model_source string,
model_code string,
channel_code string,
channel_name string,
product string,
big_product string,
product_desc string,
sp_code string,
center_name string,
bill_type string,
bill_cnt double,
bill_cnt_ck double,
LAST_UPDATE_TIME string,
PRIMARY KEY (order_no,certification_item,order_time,bill_date,goods_code,batch)
)
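PARTITION BY RANGE (bill_date) (
PARTITION VALUES < '2022-01-01',
PARTITION '2022-01-01' <= VALUES < '2022-07-01',
PARTITION '2022-07-01' <= VALUES < '2023-01-01',
PARTITION '2023-01-01' <= VALUES < '2023-07-01',
PARTITION '2023-07-01' <= VALUES < '2024-01-01',
PARTITION '2024-01-01' <= VALUES < '2024-07-01',
PARTITION '2024-07-01' <= VALUES < '2025-01-01',
PARTITION '2025-01-01' <= VALUES < '2025-07-01',
PARTITION '2025-07-01' <= VALUES
)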
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='10.138.xxx.xx:7051,10.138.xxx.xx:7051,10.138.xxx.xx:7051', 'kudu.table_name'='APEX_REPORT.CAN_CI_CHE_DUI_DETAIL');
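To illustrate the scan-efficiency side of range partitioning, here is a sketch of a query that filters on the range column. It assumes the table is range-partitioned on `bill_date` in half-year steps, and that dates are stored as zero-padded 'YYYY-MM-DD' strings so that string order matches date order.

```sql
-- List the range partitions that exist for the table.
SHOW RANGE PARTITIONS apex_report.can_ci_che_dui_detail;

-- A predicate on the range column lets Kudu skip tablets whose range
-- cannot contain matching rows (here, everything outside the first half of 2024).
SELECT order_no, bill_date, bill_cnt
FROM apex_report.can_ci_che_dui_detail
WHERE bill_date >= '2024-01-01'
  AND bill_date <  '2024-07-01';
```

Without a predicate on `bill_date`, the same query would have to scan every range.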
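II. PARTITION BY HASH (hash partitioning)
Pros: rows are distributed evenly across the hash buckets.
Cons: a query over a range of values will probably have to read every tablet, that is, all of the buckets you defined (4 with the `PARTITIONS 4` clause used in this article).
Example: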
create table apex_report.can_ci_che_dui_detail
(
order_no string,
certification_item string,
order_time string,
bill_date string,
goods_code string,
batch string,
mold string,
ssc_region_id string,
ssc_region_name string,
ssc_trade_id string,
ssc_trade_name string,
sfc_center_node string,
sfc_center_name_node string,
ssc_center_node string,
ssc_center_name_node string,
order_type string,
install_way string,
install_way_name string,
vehicle_lisence string,
vehicle_name string,
car_number string,
source_sn string,
model_source string,
model_code string,
channel_code string,
channel_name string,
product string,
big_product string,
product_desc string,
sp_code string,
center_name string,
bill_type string,
bill_cnt double,
bill_cnt_ck double,
LAST_UPDATE_TIME string,
PRIMARY KEY (order_no,certification_item,order_time,bill_date,goods_code,batch)
)
PARTITION BY HASH (order_no) PARTITIONS 4
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='10.138.xxx.xx:7051,10.138.xxx.xx:7051,10.138.xxx.xx:7051', 'kudu.table_name'='APEX_REPORT.CAN_CI_CHE_DUI_DETAIL');
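The trade-off of hash partitioning can be sketched with two queries. Both assume the table is hash-partitioned on `order_no` into 4 buckets. The literal order numbers are made up for illustration.

```sql
-- Equality on the hashed column: the value hashes to exactly one bucket,
-- so Kudu should only need to scan that one tablet.
SELECT order_no, bill_date, bill_cnt
FROM apex_report.can_ci_che_dui_detail
WHERE order_no = 'SO0001';   -- hypothetical order number

-- Range on the hashed column: neighbouring values hash to unrelated buckets,
-- so matching rows may sit in any of the 4 tablets and all of them are read.
SELECT COUNT(*)
FROM apex_report.can_ci_che_dui_detail
WHERE order_no BETWEEN 'SO0001' AND 'SO0999';   -- hypothetical bounds
```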
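III. Advanced partitioning
PARTITION BY HASH and RANGE
Pros: rows are distributed evenly across the hash buckets, and each range still holds a well-defined slice of the data.
Example: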
create table apex_report.can_ci_che_dui_detail
(
order_no string,
certification_item string,
order_time string,
bill_date string,
goods_code string,
batch string,
mold string,
ssc_region_id string,
ssc_region_name string,
ssc_trade_id string,
ssc_trade_name string,
sfc_center_node string,
sfc_center_name_node string,
ssc_center_node string,
ssc_center_name_node string,
order_type string,
install_way string,
install_way_name string,
vehicle_lisence string,
vehicle_name string,
car_number string,
source_sn string,
model_source string,
model_code string,
channel_code string,
channel_name string,
product string,
big_product string,
product_desc string,
sp_code string,
center_name string,
bill_type string,
bill_cnt double,
bill_cnt_ck double,
LAST_UPDATE_TIME string,
PRIMARY KEY (order_no,certification_item,order_time,bill_date,goods_code,batch)
)
PARTITION BY HASH (order_no) PARTITIONS 4,
RANGE (bill_date) (
PARTITION VALUES < '2022-01-01',
PARTITION '2022-01-01' <= VALUES < '2022-07-01',
PARTITION '2022-07-01' <= VALUES < '2023-01-01',
PARTITION '2023-01-01' <= VALUES < '2023-07-01',
PARTITION '2023-07-01' <= VALUES < '2024-01-01',
PARTITION '2024-01-01' <= VALUES < '2024-07-01',
PARTITION '2024-07-01' <= VALUES < '2025-01-01',
PARTITION '2025-01-01' <= VALUES < '2025-07-01',
PARTITION '2025-07-01' <= VALUES
)
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='10.138.xxx.xx:7051,10.138.xxx.xx:7051,10.138.xxx.xx:7051', 'kudu.table_name'='APEX_REPORT.CAN_CI_CHE_DUI_DETAIL');
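With two partition levels, the tablet count is the product of the levels: 4 hash buckets × 9 ranges = 36 tablets for the statement above. The sketch below shows a query that constrains both levels. The order number is a made-up value.

```sql
-- Equality on the hash column plus a bounded predicate on the range column:
-- Kudu can narrow the scan to one hash bucket within one range.
SELECT order_no, bill_date, bill_cnt
FROM apex_report.can_ci_che_dui_detail
WHERE order_no = 'SO0001'            -- hypothetical order number
  AND bill_date >= '2024-01-01'
  AND bill_date <  '2024-07-01';
```

A query that filters only on `bill_date` would still be limited to one range, but would read all 4 hash buckets inside it.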
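HASH partitioning helps maximize write throughput, while RANGE partitioning avoids the problem of unbounded tablet growth. Combining the two gives you both benefits and can noticeably improve Kudu performance.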
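Tablets cannot be split or merged after creation, but range partitions can be added and dropped later, and this is how a range-partitioned table keeps tablet growth bounded over time. The sketch below uses a hypothetical table, `apex_report.demo_bounded`, whose last existing range is assumed to end at '2026-01-01'. A new range must not overlap an existing one, so it would be rejected on a table whose last partition is unbounded, such as `'2025-07-01' <= VALUES`.

```sql
-- Add a range for the next half year (assumes no existing range covers it).
ALTER TABLE apex_report.demo_bounded
  ADD RANGE PARTITION '2026-01-01' <= VALUES < '2026-07-01';

-- Drop an old range. This also deletes the rows stored in that range.
ALTER TABLE apex_report.demo_bounded
  DROP RANGE PARTITION '2022-01-01' <= VALUES < '2022-07-01';
```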