Hash and Range Partitioning in Kudu

Preface

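1. Kudu tables support hash partitioning and range partitioning. A table is divided into tablets according to a partition schema defined on the primary key columns. Each tablet is served by at least one tablet server. Ideally, a table is split into multiple tablets spread across different tablet servers, so that operations run in parallel as much as possible.
2. Kudu currently has no mechanism for splitting or merging tablets after a table has been created.
3. You must provide a partition schema when you create a table.
4. When designing a table, choose the primary key and the partition schema so that the table is split into tablets that grow at similar rates.
5. You can partition a table with Impala's PARTITION BY clause, which supports RANGE or HASH distribution. A partition schema can contain zero or more HASH definitions, followed by an optional RANGE definition. The RANGE definition can reference one or more primary key columns.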
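
Point 2 is about splitting or merging existing tablets. Whole range partitions, by contrast, can be added or dropped after creation through Impala. The sketch below illustrates this on a small hypothetical table (`demo_orders` and its columns are not from the original article). It assumes the ranges are bounded, so that a new range does not overlap an existing one.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE demo_orders (
  order_no  STRING,
  bill_date STRING,
  amount    DOUBLE,
  PRIMARY KEY (order_no, bill_date)
)
-- Zero or more HASH definitions, followed by an optional RANGE definition.
PARTITION BY HASH (order_no) PARTITIONS 4,
RANGE (bill_date) (
  PARTITION '2024-01-01' <= VALUES < '2024-07-01',
  PARTITION '2024-07-01' <= VALUES < '2025-01-01'
)
STORED AS KUDU;

-- Add a range for upcoming data. Rows that fall outside every
-- existing range are rejected on insert.
ALTER TABLE demo_orders
  ADD RANGE PARTITION '2025-01-01' <= VALUES < '2025-07-01';

-- Drop an old range. This also deletes the rows stored in it.
ALTER TABLE demo_orders
  DROP RANGE PARTITION '2024-01-01' <= VALUES < '2024-07-01';
```

Because `bill_date` is a STRING here, range bounds are compared lexicographically, which matches chronological order only when every value uses the same `YYYY-MM-DD` format.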

I. PARTITION BY RANGE (range partitioning)


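Advantage: lets you split the table by specific values, or ranges of values, of the chosen partition key. This helps balance parallel writes against scan efficiency.
Disadvantage: if you range-partition on a column whose values increase monotonically, the last tablet grows much larger than the others. In addition, all newly inserted data is written to a single tablet at a time, which limits the scalability of data ingestion.
Example: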

create table apex_report.can_ci_che_dui_detail 
(
  order_no string,            
  certification_item string,  
  order_time string,          
  bill_date string,           
  goods_code string,          
  batch string,               
  mold string,                
  ssc_region_id string,       
  ssc_region_name string,     
  ssc_trade_id string,        
  ssc_trade_name string,      
  sfc_center_node string,     
  sfc_center_name_node string,
  ssc_center_node string,     
  ssc_center_name_node string,
  order_type string,          
  install_way string,         
  install_way_name string,    
  vehicle_lisence string,     
  vehicle_name string,        
  car_number string,          
  source_sn string,           
  model_source string,        
  model_code string,          
  channel_code string,        
  channel_name string,        
  product string,             
  big_product string,         
  product_desc string,        
  sp_code string,             
  center_name string,         
  bill_type string,           
  bill_cnt double,            
  bill_cnt_ck double,         
  LAST_UPDATE_TIME string,    
  PRIMARY KEY (order_no,certification_item,order_time,bill_date,goods_code,batch)
)
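PARTITION BY RANGE (bill_date) (
PARTITION VALUES < '2022-01-01',
PARTITION '2022-01-01' <= VALUES < '2022-07-01',
PARTITION '2022-07-01' <= VALUES < '2023-01-01',
PARTITION '2023-01-01' <= VALUES < '2023-07-01',
PARTITION '2023-07-01' <= VALUES < '2024-01-01',
PARTITION '2024-01-01' <= VALUES < '2024-07-01',
PARTITION '2024-07-01' <= VALUES < '2025-01-01',
PARTITION '2025-01-01' <= VALUES < '2025-07-01',
PARTITION '2025-07-01' <= VALUES
)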
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='10.138.xxx.xx:7051,10.138.xxx.xx:7051,10.138.xxx.xx:7051', 'kudu.table_name'='APEX_REPORT.CAN_CI_CHE_DUI_DETAIL');
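
To see what range partitioning buys at query time, the following sketch assumes the table is range-partitioned on `bill_date` in half-year ranges, as this section describes. The date literals are illustrative. A predicate on the range column lets the scan skip tablets whose range cannot contain matching rows.

```sql
-- List the range partitions defined on the table.
SHOW RANGE PARTITIONS apex_report.can_ci_che_dui_detail;

-- The predicate covers exactly one half-year range, so only the
-- tablet(s) for that range need to be scanned.
SELECT count(*)
FROM apex_report.can_ci_che_dui_detail
WHERE bill_date >= '2024-01-01'
  AND bill_date <  '2024-07-01';
```

Since `bill_date` is stored as a STRING, this pruning relies on all values being written in the same `YYYY-MM-DD` format, so that string order matches date order.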

II. PARTITION BY HASH (hash partitioning)

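Advantage: rows are distributed evenly across the hash buckets.
Disadvantage: a query over a range of values may have to read every tablet, that is, all of the hash buckets you defined with PARTITIONS.
Example: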

create table apex_report.can_ci_che_dui_detail
(
  order_no string,                 
  certification_item string,       
  order_time string,               
  bill_date string,                
  goods_code string,               
  batch string,                    
  mold string,                     
  ssc_region_id string,            
  ssc_region_name string,          
  ssc_trade_id string,             
  ssc_trade_name string,           
  sfc_center_node string,          
  sfc_center_name_node string,     
  ssc_center_node string,          
  ssc_center_name_node string,     
  order_type string,               
  install_way string,              
  install_way_name string,         
  vehicle_lisence string,          
  vehicle_name string,             
  car_number string,               
  source_sn string,                
  model_source string,             
  model_code string,               
  channel_code string,             
  channel_name string,             
  product string,                  
  big_product string,              
  product_desc string,             
  sp_code string,                  
  center_name string,              
  bill_type string,                
  bill_cnt double,                 
  bill_cnt_ck double,              
  LAST_UPDATE_TIME string,         
  PRIMARY KEY (order_no,certification_item,order_time,bill_date,goods_code,batch)
)
PARTITION BY HASH (order_no) PARTITIONS 4
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='10.138.xxx.xx:7051,10.138.xxx.xx:7051,10.138.xxx.xx:7051', 'kudu.table_name'='APEX_REPORT.CAN_CI_CHE_DUI_DETAIL');
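
The trade-off of hash partitioning can be sketched with two queries. This assumes the table is hash-partitioned on `order_no`, as this section describes, and the `order_no` literals are made-up values. Kudu can prune hash buckets only when the scan has an equality predicate on every hashed column.

```sql
-- Equality on the hashed column: the value hashes to exactly one
-- bucket, so the other buckets can be skipped.
SELECT *
FROM apex_report.can_ci_che_dui_detail
WHERE order_no = 'SO20240001';

-- Range on the hashed column: matching rows may live in any bucket,
-- so every tablet has to be scanned.
SELECT count(*)
FROM apex_report.can_ci_che_dui_detail
WHERE order_no >= 'SO20240001'
  AND order_no <  'SO20240100';
```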

III. Advanced partitioning

PARTITION BY HASH and RANGE
Advantage: data is distributed evenly, and each tablet still holds only a specified range of the data.
Example:

create table apex_report.can_ci_che_dui_detail
(
  order_no string,                 
  certification_item string,       
  order_time string,               
  bill_date string,                
  goods_code string,               
  batch string,                    
  mold string,                     
  ssc_region_id string,            
  ssc_region_name string,          
  ssc_trade_id string,             
  ssc_trade_name string,           
  sfc_center_node string,          
  sfc_center_name_node string,     
  ssc_center_node string,          
  ssc_center_name_node string,     
  order_type string,               
  install_way string,              
  install_way_name string,         
  vehicle_lisence string,          
  vehicle_name string,             
  car_number string,               
  source_sn string,                
  model_source string,             
  model_code string,               
  channel_code string,             
  channel_name string,             
  product string,                  
  big_product string,              
  product_desc string,             
  sp_code string,                  
  center_name string,              
  bill_type string,                
  bill_cnt double,                 
  bill_cnt_ck double,              
  LAST_UPDATE_TIME string,         
  PRIMARY KEY (order_no,certification_item,order_time,bill_date,goods_code,batch)
)
PARTITION BY HASH (order_no) PARTITIONS 4,
RANGE (bill_date) (
PARTITION VALUES < '2022-01-01',
PARTITION '2022-01-01' <= VALUES < '2022-07-01', 
PARTITION '2022-07-01' <= VALUES < '2023-01-01', 
PARTITION '2023-01-01' <= VALUES < '2023-07-01', 
PARTITION '2023-07-01' <= VALUES < '2024-01-01', 
PARTITION '2024-01-01' <= VALUES < '2024-07-01', 
PARTITION '2024-07-01' <= VALUES < '2025-01-01', 
PARTITION '2025-01-01' <= VALUES < '2025-07-01', 
PARTITION '2025-07-01' <= VALUES 
)
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='10.138.xxx.xx:7051,10.138.xxx.xx:7051,10.138.xxx.xx:7051', 'kudu.table_name'='APEX_REPORT.CAN_CI_CHE_DUI_DETAIL');

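HASH partitioning helps maximize write throughput, while RANGE partitioning avoids the problem of tablets growing without bound. Combining hash and range partitioning spreads writes across tablets while keeping each tablet bounded, which can improve Kudu performance considerably.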
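
As a rough check on the combined schema above: 4 hash buckets multiplied by 9 range partitions gives 4 × 9 = 36 tablets. The sketch below shows one way to inspect this and a query that can be pruned along both dimensions. The `order_no` literal is a made-up value, and the exact output of `SHOW PARTITIONS` for Kudu tables depends on the Impala version.

```sql
-- For a Kudu table this typically lists one row per tablet,
-- so 36 rows would be expected for the combined example.
SHOW PARTITIONS apex_report.can_ci_che_dui_detail;

-- Equality on the hashed column narrows the scan to one bucket.
-- The bill_date predicate narrows it to one half-year range.
SELECT *
FROM apex_report.can_ci_che_dui_detail
WHERE order_no = 'SO20240001'
  AND bill_date >= '2024-01-01'
  AND bill_date <  '2024-07-01';
```

When choosing the bucket count and range width, keep this multiplication in mind, because every additional range partition adds one tablet per hash bucket.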

Reposted from blog.csdn.net/Allenzyg/article/details/121955051