Hive日分区表如何快速导入到StarRocks

1、背景

业务现状:集团使用FineBI做数据呈现及报表分析工具,经过近两年的BI建设,供应链域及营销域的BI建设已初具规模并体系化。数仓规模60TB,FineBI数据集约8000个,BI挂出报表数约1600个,报表月增幅在40左右。

技术现状:数据加工链路:业务系统数据库->Hive数仓->PG导出库->FineBI抽取数据集->BI报表。该数据链路下,遇到了一些问题。本文不展开全部问题,仅讨论PG导出库数据规模过大,导致FineBI每日数据更新任务卡住问题。

问题描述:集团供应链的物料的帐龄表、库龄表按日分区,日增量100W,且存储了所有历史数据,该账龄、库龄数据在一年内单表接近160GB,FineBI每日早晨抽取数据的更新任务卡住(FineBI只配置了4个后台更新线程),导致大部分数据更新延迟。即使开放更多的后台数据更新线程(按CPU核数推荐经验值,过多会影响BI的报表查看性能),该数据的抽取对PG导出库也有压力。

解决思路:1)从报表的逻辑优化减少数据量;2)切换查询引擎,使用StarRocks替换PG,FineBI直连StarRocks,减少BI数据抽取。本文仅讨论切换查询引擎的方案。

2、最初实验

思路:

使用Broker Load导入Hive中的按日动态分区表数据导出到StarRocks中。StarRocks中的表也按日动态分区。

相关文档:

  • 参考Broker Load官方文档:Broker load研究导入配置。
  • 参考动态分区官方文档:动态分区研究StarRocks按日动态分区表建表语句。

问题:

使用Brocker Load方式遇到以下问题:

  • hdfs parquet文件有几年的分区数据,有的是月分区,有的是天分区
  • 一些分区的字段不同,有的有A BC字段,有的有ABD字段有的ABCD字段
  • StarRocks导入时,不允许StarRocks的字段比原表字段多

官方给出了以下脚本,创建一张支持动态分区的表,表名为 site_access,动态分区通过 PROPERTIES 进行配置。分区的区间为当前时间的前后 3 天,总共 6 天。

CREATE TABLE site_access(
event_day DATE,
site_id INT DEFAULT '10',
city_code VARCHAR(100),
user_name VARCHAR(32) DEFAULT '',
pv BIGINT DEFAULT '0'
)
DUPLICATE KEY(event_day, site_id, city_code, user_name)
PARTITION BY RANGE(event_day)(
PARTITION p20200321 VALUES LESS THAN ("2020-03-22"),
PARTITION p20200322 VALUES LESS THAN ("2020-03-23"),
PARTITION p20200323 VALUES LESS THAN ("2020-03-24"),
PARTITION p20200324 VALUES LESS THAN ("2020-03-25")
)
DISTRIBUTED BY HASH(event_day, site_id) BUCKETS 32
PROPERTIES(
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start" = "-3",
    "dynamic_partition.end" = "3",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.buckets" = "32"
);

如果是Hive使用经验的用户,第一次使用StarRocks,参考上面的SQL逻辑设计自己的表逻辑时,可能有疑问:PARTITION BY RANGE(event_day)(这里的内容改如何填写呢)?比如,以下需要迁移的Hive表,

 -- drop table if exists bda${db_para}.bda_inv_item_age_dtl;
create table if not exists bda${db_para}.bda_inv_item_age_dtl (
    stat_date             string comment '统计日期',
    entr_date             string comment '入库日期',
    item_id               string comment '物料id',
    itm_cd                string comment '物料编码',
    org_id                string comment '库存组织id',
    org_cd                string comment '库存组织编码',
    sub_invtr_cd          string comment '子库编码',
    bch_nbr               string comment '批次号',
    entr_qty              string comment '入库数量',
    total_qty             string comment '倒算总入库数',
    left_qty              string comment '剩余库存数',
    alloc_qty             string comment '分摊库存数',
    stock_age             string comment '库龄',
    item_cost             string comment 'pac成本',
    actual_cost           string comment '实际成本',
    item_business         string comment '最近三个月使用事业部',
    pch_big_ctg           string comment '采购品类大类',
    pch_med_ctg           string comment '采购品类中类',
    pch_sml_ctg           string comment '采购品类小类',
    ship_customer_name    string comment '订单收货客户',
    team_bu_name          string comment '战队',
    real_customer_name    string comment '真实客户',
    item_category         string comment '物料类别',
    prod_model_id         string comment '产品型号id',
    prod_model            string comment '产品型号',
    job_belong_bu         string comment '工单所属事业部',
    item_bu               string comment '近三个使用事业部',
    pac_cost              string comment 'pac成本单价',
    item_category_bg      string comment '物料大类')
comment '库龄'
partitioned by(part_dt string)
row format delimited fields terminated by '\036'
stored as parquet;

-- add by tjl 2022.03.10
-- alter table bda${db_para}.bda_inv_item_age_dtl add columns(dept_name string comment '工单所属部门') cascade;

-- 2022.04.18:xxx:新增:工单制单时间、制单人
alter table bda${db_para}.bda_inv_item_age_dtl add columns(wdj_creation_date string comment '工单制单时间') cascade;
alter table bda${db_para}.bda_inv_item_age_dtl add columns(created_by string comment '工单制单人') cascade;

请注意:这里面有表字段表更。Parquet格式的分区表使用Alter更新字段或新增字段,Parquet文件会如何变化?旧分区的历史数据要如何处理?

其hdfs文件如下:

drwxrwx--x+  - hive hive          0 2018-08-30 03:51 warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2018-08-29
#此处省略一万行
drwxrwx--x+  - hive hive          0 2022-11-26 03:55 warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2022-11-25
drwxrwx--x+  - hive hive          0 2022-11-27 03:52 warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2022-11-26

讨论:

  • StarRocks有多种数据表模型,该场景应该选择什么数据表模型呢?
  • 应该如何建StarRocks的表呢?
  • 应该如何建Brocker load任务呢?

3、问题分析

  • Parquet格式的Hive表,执行add columns的工作流程及底层执行机制如何?

1、Hive元数据表中,在表末尾,追加一列。

2、对于Hive分区表,需要cascade修饰词,管理旧分区的元数据信息。

3、如果是使用Parquet、ORC等存储格式,文件中存储的内容没有重写,则不会变动。

4、Hive如何识别就分区表中没有新增列的?

参考:[Hive] Alter Table/Partition/Column - Huawei Enterprise Support Community

简单而言,如果不使用cascade修饰词,只是修改了Hive中bda_inv_item_age_dtl表元信息(表结构)。写入新分区的数据,hive能自动识别并正常显示。对于旧分区,如果包含了新列,即使旧分区使用了INSERT OVERWRITE重写数据,如果没有使用Drop 并 recreate分区,hive查询有不能正常工作。

参考:partitioning - how to add columns to existing hive partitioned table? - Stack Overflow

  • Parquet文件中的schema变化情况

查看最早分区2016-06-30的数据结构:

[cloud@dp-master001 ~]$ sudo hdfs dfs -get warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2016-06-30/000000_0
[cloud@dp-master001 ~]$ /opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools schema -d 000000_0 >> hive20160630.sql
[cloud@dp-master001 ~]$ vcat hive20160630.sql 
-bash: vcat: command not found
[cloud@dp-master001 ~]$ cat hive20160630.sql 
message hive_schema {
  optional binary stat_date (UTF8);
  optional binary entr_date (UTF8);
  optional binary item_id (UTF8);
  optional binary itm_cd (UTF8);
  optional binary org_id (UTF8);
  optional binary org_cd (UTF8);
  optional binary sub_invtr_cd (UTF8);
  optional binary bch_nbr (UTF8);
  optional binary entr_qty (UTF8);
  optional binary total_qty (UTF8);
  optional binary left_qty (UTF8);
  optional binary alloc_qty (UTF8);
  optional binary stock_age (UTF8);
  optional binary item_cost (UTF8);
  optional binary actual_cost (UTF8);
  optional binary item_business (UTF8);
  optional binary pch_big_ctg (UTF8);
  optional binary pch_med_ctg (UTF8);
  optional binary pch_sml_ctg (UTF8);
  optional binary ship_customer_name (UTF8);
  optional binary team_bu_name (UTF8);
  optional binary real_customer_name (UTF8);
  optional binary item_category (UTF8);
  optional binary prod_model_id (UTF8);
  optional binary prod_model (UTF8);
  optional binary job_belong_bu (UTF8);
  optional binary item_bu (UTF8);
  optional binary pac_cost (UTF8);
  optional binary item_category_bg (UTF8);
}
creator: parquet-mr version 1.5.0-cdh5.15.2 (build ${buildNumber})

file schema: hive_schema
----------------------------------------------------------------------------------------------------
stat_date: OPTIONAL BINARY O:UTF8 R:0 D:1
entr_date: OPTIONAL BINARY O:UTF8 R:0 D:1
item_id: OPTIONAL BINARY O:UTF8 R:0 D:1
itm_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
org_id: OPTIONAL BINARY O:UTF8 R:0 D:1
org_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
sub_invtr_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
bch_nbr: OPTIONAL BINARY O:UTF8 R:0 D:1
entr_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
total_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
left_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
alloc_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
stock_age: OPTIONAL BINARY O:UTF8 R:0 D:1
item_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
actual_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
item_business: OPTIONAL BINARY O:UTF8 R:0 D:1
pch_big_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1
pch_med_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1
pch_sml_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1
ship_customer_name: OPTIONAL BINARY O:UTF8 R:0 D:1
team_bu_name: OPTIONAL BINARY O:UTF8 R:0 D:1
real_customer_name: OPTIONAL BINARY O:UTF8 R:0 D:1
item_category: OPTIONAL BINARY O:UTF8 R:0 D:1
prod_model_id: OPTIONAL BINARY O:UTF8 R:0 D:1
prod_model: OPTIONAL BINARY O:UTF8 R:0 D:1
job_belong_bu: OPTIONAL BINARY O:UTF8 R:0 D:1
item_bu: OPTIONAL BINARY O:UTF8 R:0 D:1
pac_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
item_category_bg: OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:190899 TS:8780200
----------------------------------------------------------------------------------------------------
stat_date:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:204/204/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
entr_date:  BINARY UNCOMPRESSED DO:0 FPO:208 SZ:249018/249018/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
item_id:  BINARY UNCOMPRESSED DO:0 FPO:249226 SZ:245992/245992/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
itm_cd:  BINARY UNCOMPRESSED DO:0 FPO:495218 SZ:409890/409890/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
org_id:  BINARY UNCOMPRESSED DO:0 FPO:905108 SZ:3641/3641/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
org_cd:  BINARY UNCOMPRESSED DO:0 FPO:908749 SZ:3637/3637/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
sub_invtr_cd:  BINARY UNCOMPRESSED DO:0 FPO:912386 SZ:187698/187698/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
bch_nbr:  BINARY UNCOMPRESSED DO:0 FPO:1100084 SZ:286227/286227/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
entr_qty:  BINARY UNCOMPRESSED DO:0 FPO:1386311 SZ:495330/495330/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
total_qty:  BINARY UNCOMPRESSED DO:0 FPO:1881641 SZ:59/59/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
left_qty:  BINARY UNCOMPRESSED DO:0 FPO:1881700 SZ:486445/486445/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
alloc_qty:  BINARY UNCOMPRESSED DO:0 FPO:2368145 SZ:536856/536856/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
stock_age:  BINARY UNCOMPRESSED DO:0 FPO:2905001 SZ:243903/243903/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
item_cost:  BINARY UNCOMPRESSED DO:0 FPO:3148904 SZ:2322210/2322210/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,PLAIN,RLE
actual_cost:  BINARY UNCOMPRESSED DO:0 FPO:5471114 SZ:2922009/2922009/1.00 VC:190899 ENC:BIT_PACKED,PLAIN,RLE
item_business:  BINARY UNCOMPRESSED DO:0 FPO:8393123 SZ:1817/1817/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
pch_big_ctg:  BINARY UNCOMPRESSED DO:0 FPO:8394940 SZ:4757/4757/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
pch_med_ctg:  BINARY UNCOMPRESSED DO:0 FPO:8399697 SZ:13669/13669/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
pch_sml_ctg:  BINARY UNCOMPRESSED DO:0 FPO:8413366 SZ:27595/27595/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
ship_customer_name:  BINARY UNCOMPRESSED DO:0 FPO:8440961 SZ:26641/26641/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
team_bu_name:  BINARY UNCOMPRESSED DO:0 FPO:8467602 SZ:12107/12107/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
real_customer_name:  BINARY UNCOMPRESSED DO:0 FPO:8479709 SZ:33/33/1.00 VC:190899 ENC:BIT_PACKED,PLAIN,RLE
item_category:  BINARY UNCOMPRESSED DO:0 FPO:8479742 SZ:399/399/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
prod_model_id:  BINARY UNCOMPRESSED DO:0 FPO:8480141 SZ:36922/36922/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
prod_model:  BINARY UNCOMPRESSED DO:0 FPO:8517063 SZ:33472/33472/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
job_belong_bu:  BINARY UNCOMPRESSED DO:0 FPO:8550535 SZ:3769/3769/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
item_bu:  BINARY UNCOMPRESSED DO:0 FPO:8554304 SZ:33/33/1.00 VC:190899 ENC:BIT_PACKED,PLAIN,RLE
pac_cost:  BINARY UNCOMPRESSED DO:0 FPO:8554337 SZ:225484/225484/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
item_category_bg:  BINARY UNCOMPRESSED DO:0 FPO:8779821 SZ:383/383/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE

查看最新分区2022-11-26,Parquet文件的数据结构

[cloud@dp-master001 ~]$ sudo hdfs dfs -get warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2022-11-24/000000_0
[cloud@dp-master001 ~]$ /opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools schema -d 000000_0 >> hive.sql
[cloud@dp-master001 ~]$ cat hive.sql 
message hive_schema {
  optional binary stat_date (UTF8);
  optional binary entr_date (UTF8);
  optional binary item_id (UTF8);
  optional binary itm_cd (UTF8);
  optional binary org_id (UTF8);
  optional binary org_cd (UTF8);
  optional binary sub_invtr_cd (UTF8);
  optional binary bch_nbr (UTF8);
  optional binary entr_qty (UTF8);
  optional binary total_qty (UTF8);
  optional binary left_qty (UTF8);
  optional binary alloc_qty (UTF8);
  optional binary stock_age (UTF8);
  optional binary item_cost (UTF8);
  optional binary actual_cost (UTF8);
  optional binary item_business (UTF8);
  optional binary pch_big_ctg (UTF8);
  optional binary pch_med_ctg (UTF8);
  optional binary pch_sml_ctg (UTF8);
  optional binary ship_customer_name (UTF8);
  optional binary team_bu_name (UTF8);
  optional binary real_customer_name (UTF8);
  optional binary item_category (UTF8);
  optional binary prod_model_id (UTF8);
  optional binary prod_model (UTF8);
  optional binary job_belong_bu (UTF8);
  optional binary item_bu (UTF8);
  optional binary pac_cost (UTF8);
  optional binary item_category_bg (UTF8);
  optional binary dept_name (UTF8);
  optional binary wdj_creation_date (UTF8);
  optional binary created_by (UTF8);
}

creator: parquet-mr version 1.5.0-cdh5.15.2 (build ${buildNumber})

file schema: hive_schema
----------------------------------------------------------------------------------------------------
stat_date: OPTIONAL BINARY O:UTF8 R:0 D:1
entr_date: OPTIONAL BINARY O:UTF8 R:0 D:1
item_id: OPTIONAL BINARY O:UTF8 R:0 D:1
itm_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
org_id: OPTIONAL BINARY O:UTF8 R:0 D:1
org_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
sub_invtr_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
bch_nbr: OPTIONAL BINARY O:UTF8 R:0 D:1
entr_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
total_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
left_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
alloc_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
stock_age: OPTIONAL BINARY O:UTF8 R:0 D:1
item_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
actual_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
item_business: OPTIONAL BINARY O:UTF8 R:0 D:1
pch_big_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1
pch_med_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1
pch_sml_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1
ship_customer_name: OPTIONAL BINARY O:UTF8 R:0 D:1
team_bu_name: OPTIONAL BINARY O:UTF8 R:0 D:1
real_customer_name: OPTIONAL BINARY O:UTF8 R:0 D:1
item_category: OPTIONAL BINARY O:UTF8 R:0 D:1
prod_model_id: OPTIONAL BINARY O:UTF8 R:0 D:1
prod_model: OPTIONAL BINARY O:UTF8 R:0 D:1
job_belong_bu: OPTIONAL BINARY O:UTF8 R:0 D:1
item_bu: OPTIONAL BINARY O:UTF8 R:0 D:1
pac_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
item_category_bg: OPTIONAL BINARY O:UTF8 R:0 D:1
dept_name: OPTIONAL BINARY O:UTF8 R:0 D:1
wdj_creation_date: OPTIONAL BINARY O:UTF8 R:0 D:1
created_by: OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:428480 TS:21535478
----------------------------------------------------------------------------------------------------
stat_date:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:381/381/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
entr_date:  BINARY UNCOMPRESSED DO:0 FPO:385 SZ:608962/608962/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
item_id:  BINARY UNCOMPRESSED DO:0 FPO:609347 SZ:774164/774164/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
itm_cd:  BINARY UNCOMPRESSED DO:0 FPO:1383511 SZ:1185783/1185783/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
org_id:  BINARY UNCOMPRESSED DO:0 FPO:2569294 SZ:49829/49829/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
org_cd:  BINARY UNCOMPRESSED DO:0 FPO:2619123 SZ:49810/49810/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
sub_invtr_cd:  BINARY UNCOMPRESSED DO:0 FPO:2668933 SZ:539203/539203/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
bch_nbr:  BINARY UNCOMPRESSED DO:0 FPO:3208136 SZ:582198/582198/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
entr_qty:  BINARY UNCOMPRESSED DO:0 FPO:3790334 SZ:1028459/1028459/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
total_qty:  BINARY UNCOMPRESSED DO:0 FPO:4818793 SZ:141/141/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
left_qty:  BINARY UNCOMPRESSED DO:0 FPO:4818934 SZ:1074391/1074391/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
alloc_qty:  BINARY UNCOMPRESSED DO:0 FPO:5893325 SZ:1063130/1063130/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
stock_age:  BINARY UNCOMPRESSED DO:0 FPO:6956455 SZ:599509/599509/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
item_cost:  BINARY UNCOMPRESSED DO:0 FPO:7555964 SZ:5625018/5625018/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,PLAIN,RLE
actual_cost:  BINARY UNCOMPRESSED DO:0 FPO:13180982 SZ:6410120/6410120/1.00 VC:428480 ENC:BIT_PACKED,PLAIN,RLE
item_business:  BINARY UNCOMPRESSED DO:0 FPO:19591102 SZ:114122/114122/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
pch_big_ctg:  BINARY UNCOMPRESSED DO:0 FPO:19705224 SZ:33/33/1.00 VC:428480 ENC:BIT_PACKED,PLAIN,RLE
pch_med_ctg:  BINARY UNCOMPRESSED DO:0 FPO:19705257 SZ:33/33/1.00 VC:428480 ENC:BIT_PACKED,PLAIN,RLE
pch_sml_ctg:  BINARY UNCOMPRESSED DO:0 FPO:19705290 SZ:33/33/1.00 VC:428480 ENC:BIT_PACKED,PLAIN,RLE
ship_customer_name:  BINARY UNCOMPRESSED DO:0 FPO:19705323 SZ:37444/37444/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
team_bu_name:  BINARY UNCOMPRESSED DO:0 FPO:19742767 SZ:17370/17370/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
real_customer_name:  BINARY UNCOMPRESSED DO:0 FPO:19760137 SZ:18033/18033/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
item_category:  BINARY UNCOMPRESSED DO:0 FPO:19778170 SZ:917/917/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
prod_model_id:  BINARY UNCOMPRESSED DO:0 FPO:19779087 SZ:148421/148421/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
prod_model:  BINARY UNCOMPRESSED DO:0 FPO:19927508 SZ:113394/113394/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
job_belong_bu:  BINARY UNCOMPRESSED DO:0 FPO:20040902 SZ:16902/16902/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
item_bu:  BINARY UNCOMPRESSED DO:0 FPO:20057804 SZ:91023/91023/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
pac_cost:  BINARY UNCOMPRESSED DO:0 FPO:20148827 SZ:791340/791340/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
item_category_bg:  BINARY UNCOMPRESSED DO:0 FPO:20940167 SZ:673/673/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
dept_name:  BINARY UNCOMPRESSED DO:0 FPO:20940840 SZ:7675/7675/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
wdj_creation_date:  BINARY UNCOMPRESSED DO:0 FPO:20948515 SZ:574110/574110/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
created_by:  BINARY UNCOMPRESSED DO:0 FPO:21522625 SZ:12857/12857/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE

4、方案验证

4.1、StarRocks建表

  • 选择什么样的StarRocks表模型?

根据BI报表的分析逻辑,主要用于监控日/月维度库龄、帐龄的趋势(影响物料/产成品跌价计提),主要是明细表做分析。故选择StarRocks明细表模型。期望按照Hive的按日分区建表。考虑到Brocker load只支持忽略源表中的字段,不支持目标表多的字段忽略/或者填充默认值的机制。根据探源发现,最早分区2016-06-30与最新分区2022-11-26字段不同(DDL中显示,2022.03.10,2022.04.18增加了三个字段。)可以考虑,03.10前,建立一个StarRocks表,之后建立一个StarRocks表。数据同步到StarRocks后,做数据汇聚。为了简单0310前的存储到一个分区表。

  • 如何建StarRocks的表

0310前一个分区表:

drop table  if  exists bda.bda_inv_item_age_dtl0310;
create table if not exists bda.bda_inv_item_age_dtl0310(
    stat_date date,
	entr_date date,
	item_id varchar(10),
	itm_cd varchar(500),
    org_id varchar(5),
    org_cd varchar(500),
    sub_invtr_cd varchar(20),
    bch_nbr varchar(50),
	entr_qty decimal,
	total_qty decimal,
	left_qty decimal,
	alloc_qty decimal,
	stock_age decimal,
	item_cost decimal,
	actual_cost decimal,
	item_business varchar(500),
	pch_big_ctg varchar(500),
	pch_med_ctg varchar(500),
	pch_sml_ctg varchar(500),
	ship_customer_name varchar(500),
	team_bu_name varchar(500),
	real_customer_name varchar(500),
	item_category varchar(500),
	prod_model_id varchar(500),
	prod_model varchar(500),
	job_belong_bu varchar(500),
	item_bu varchar(500),
	pac_cost decimal,
	item_category_bg varchar(500)
)
DUPLICATE KEY(stat_date,entr_date,item_id)
DISTRIBUTED BY HASH(entr_date, item_id) BUCKETS 32;

0310后动态分区

create table if not exists bda.bda_inv_item_age_dtl_part(
	stat_date date,
    entr_date date,
    item_id varchar(10),
    itm_cd varchar(500),
    org_id varchar(5),
	org_cd varchar(500),
    sub_invtr_cd varchar(20),
    bch_nbr varchar(50),
	entr_qty varchar(500),
	total_qty varchar(500),
	left_qty varchar(500),
	alloc_qty varchar(500),
	stock_age varchar(500),
	item_cost varchar(500),
    actual_cost varchar(500),
	item_business varchar(500),
	pch_big_ctg varchar(500),
	pch_med_ctg varchar(500),
	pch_sml_ctg varchar(500),
	ship_customer_name varchar(500),
	team_bu_name varchar(500),
	real_customer_name varchar(500),
	item_category varchar(500),
	prod_model_id varchar(500),
	prod_model varchar(500),
	job_belong_bu varchar(500),
	item_bu varchar(500),
	pac_cost varchar(500),
	item_category_bg varchar(500),
	dept_name varchar(300),
	wdj_creation_date varchar(200),
	created_by varchar(200)
)
DUPLICATE KEY(stat_date,entr_date,item_id)
PARTITION BY RANGE(stat_date)()
DISTRIBUTED BY HASH(entr_date, item_id) BUCKETS 32
PROPERTIES(
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.end" = "1",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.buckets" = "32"
);

4.1、Brocker 同步任务

  • 如何建Brocker load任务

参考文档:FileSystem (Apache Hadoop Main 3.3.4 API)

0310之前分区的抽取任务

LOAD LABEL bda.bda_inv_item_age_dtl0310_before
(
DATA INFILE("hdfs://10.21.25.161:8020/user/hive/warehouse/bda.db/bda_inv_item_age_dtl/part_dt=[2016-2021]-*/*")
INTO TABLE `bda_inv_item_age_dtl_part`
COLUMNS TERMINATED BY "\036"
FORMAT AS "parquet"
where entr_date is not null 
)
WITH BROKER broker198
(
"hadoop.security.authentication"="kerberos",
"kerberos_principal"="hdfs",
"kerberos_keytab"="/opt/StarRocks/kerberos/hdfs.keytab"
);

执行发现,依旧报错,字段缺少,通过二分法,逐一排查hdfs分区文件,发现2022-03-10之前的分区,有的是按月分区,有的是按日分区,一些分区的字段各不相同,例如2018年分区情况、2022-02分区情况:

[cloud@dp-master001 ~]$ /opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools schema -d 000000_0_2018_12_02 >> hive20181202.sql
[cloud@dp-master001 ~]$ cat hive20181202.sql 
message hive_schema {
  optional binary stat_date (UTF8);
  optional binary entr_date (UTF8);
  optional binary item_id (UTF8);
  optional binary itm_cd (UTF8);
  optional binary org_id (UTF8);
  optional binary org_cd (UTF8);
  optional binary sub_invtr_cd (UTF8);
  optional binary bch_nbr (UTF8);
  optional binary entr_qty (UTF8);
  optional binary total_qty (UTF8);
  optional binary left_qty (UTF8);
  optional binary alloc_qty (UTF8);
  optional binary stock_age (UTF8);
  optional binary item_cost (UTF8);
  optional binary actual_cost (UTF8);
  optional binary item_business (UTF8);
}

creator: parquet-mr version 1.5.0-cdh5.15.2 (build ${buildNumber})

file schema: hive_schema
----------------------------------------------------------------------------------------------------
stat_date: OPTIONAL BINARY O:UTF8 R:0 D:1
entr_date: OPTIONAL BINARY O:UTF8 R:0 D:1
item_id: OPTIONAL BINARY O:UTF8 R:0 D:1
itm_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
org_id: OPTIONAL BINARY O:UTF8 R:0 D:1
org_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
sub_invtr_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
bch_nbr: OPTIONAL BINARY O:UTF8 R:0 D:1
entr_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
total_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
left_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
alloc_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
stock_age: OPTIONAL BINARY O:UTF8 R:0 D:1
item_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
actual_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
item_business: OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:244224 TS:12749582
----------------------------------------------------------------------------------------------------
stat_date:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:263/263/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
entr_date:  BINARY UNCOMPRESSED DO:0 FPO:267 SZ:354158/354158/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
item_id:  BINARY UNCOMPRESSED DO:0 FPO:354425 SZ:746753/746753/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
itm_cd:  BINARY UNCOMPRESSED DO:0 FPO:1101178 SZ:1059151/1059151/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
org_id:  BINARY UNCOMPRESSED DO:0 FPO:2160329 SZ:52235/52235/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
org_cd:  BINARY UNCOMPRESSED DO:0 FPO:2212564 SZ:52222/52222/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
sub_invtr_cd:  BINARY UNCOMPRESSED DO:0 FPO:2264786 SZ:279733/279733/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
bch_nbr:  BINARY UNCOMPRESSED DO:0 FPO:2544519 SZ:633935/633935/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
entr_qty:  BINARY UNCOMPRESSED DO:0 FPO:3178454 SZ:690868/690868/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
total_qty:  BINARY UNCOMPRESSED DO:0 FPO:3869322 SZ:100/100/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
left_qty:  BINARY UNCOMPRESSED DO:0 FPO:3869422 SZ:686921/686921/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
alloc_qty:  BINARY UNCOMPRESSED DO:0 FPO:4556343 SZ:747099/747099/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
stock_age:  BINARY UNCOMPRESSED DO:0 FPO:5303442 SZ:345381/345381/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
item_cost:  BINARY UNCOMPRESSED DO:0 FPO:5648823 SZ:3316772/3316772/1.00 VC:244224 ENC:RLE,PLAIN,BIT_PACKED
actual_cost:  BINARY UNCOMPRESSED DO:0 FPO:8965595 SZ:3783958/3783958/1.00 VC:244224 ENC:RLE,PLAIN,BIT_PACKED
item_business:  BINARY UNCOMPRESSED DO:0 FPO:12749553 SZ:33/33/1.00 VC:244224 ENC:RLE,PLAIN,BIT_PACKED
[cloud@dp-master001 ~]$ sudo hdfs dfs -get warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2018-12-15/000000_0
get: `warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2018-12-15/000000_0': No such file or directory
[cloud@dp-master001 ~]$ sudo hdfs dfs -get warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2020-02-26/000000_0
[cloud@dp-master001 ~]$ mv 000000_0 000000_0_2020_02_26
[cloud@dp-master001 ~]$ /opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools schema -d 000000_0_2020_02_26 >> hive20200226.sql
[cloud@dp-master001 ~]$ cat hive20200226.sql 
message hive_schema {
  optional binary stat_date (UTF8);
  optional binary entr_date (UTF8);
  optional binary item_id (UTF8);
  optional binary itm_cd (UTF8);
  optional binary org_id (UTF8);
  optional binary org_cd (UTF8);
  optional binary sub_invtr_cd (UTF8);
  optional binary bch_nbr (UTF8);
  optional binary entr_qty (UTF8);
  optional binary total_qty (UTF8);
  optional binary left_qty (UTF8);
  optional binary alloc_qty (UTF8);
  optional binary stock_age (UTF8);
  optional binary item_cost (UTF8);
  optional binary actual_cost (UTF8);
  optional binary item_business (UTF8);
  optional binary pch_big_ctg (UTF8);
  optional binary pch_med_ctg (UTF8);
  optional binary pch_sml_ctg (UTF8);
}

creator: parquet-mr version 1.5.0-cdh5.15.2 (build ${buildNumber})

file schema: hive_schema
----------------------------------------------------------------------------------------------------
stat_date: OPTIONAL BINARY O:UTF8 R:0 D:1
entr_date: OPTIONAL BINARY O:UTF8 R:0 D:1
item_id: OPTIONAL BINARY O:UTF8 R:0 D:1
itm_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
org_id: OPTIONAL BINARY O:UTF8 R:0 D:1
org_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
sub_invtr_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
bch_nbr: OPTIONAL BINARY O:UTF8 R:0 D:1
entr_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
total_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
left_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
alloc_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
stock_age: OPTIONAL BINARY O:UTF8 R:0 D:1
item_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
actual_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
item_business: OPTIONAL BINARY O:UTF8 R:0 D:1
pch_big_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1
pch_med_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1
pch_sml_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:242410 TS:9017086
----------------------------------------------------------------------------------------------------
stat_date:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:263/263/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
entr_date:  BINARY UNCOMPRESSED DO:0 FPO:267 SZ:317433/317433/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
item_id:  BINARY UNCOMPRESSED DO:0 FPO:317700 SZ:562585/562585/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
itm_cd:  BINARY UNCOMPRESSED DO:0 FPO:880285 SZ:900712/900712/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
org_id:  BINARY UNCOMPRESSED DO:0 FPO:1780997 SZ:42974/42974/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
org_cd:  BINARY UNCOMPRESSED DO:0 FPO:1823971 SZ:42965/42965/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
sub_invtr_cd:  BINARY UNCOMPRESSED DO:0 FPO:1866936 SZ:294784/294784/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
bch_nbr:  BINARY UNCOMPRESSED DO:0 FPO:2161720 SZ:437366/437366/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
entr_qty:  BINARY UNCOMPRESSED DO:0 FPO:2599086 SZ:691103/691103/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
total_qty:  BINARY UNCOMPRESSED DO:0 FPO:3290189 SZ:100/100/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
left_qty:  BINARY UNCOMPRESSED DO:0 FPO:3290289 SZ:707584/707584/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
alloc_qty:  BINARY UNCOMPRESSED DO:0 FPO:3997873 SZ:754179/754179/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
stock_age:  BINARY UNCOMPRESSED DO:0 FPO:4752052 SZ:310402/310402/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
item_cost:  BINARY UNCOMPRESSED DO:0 FPO:5062454 SZ:33/33/1.00 VC:242410 ENC:RLE,PLAIN,BIT_PACKED
actual_cost:  BINARY UNCOMPRESSED DO:0 FPO:5062487 SZ:3631076/3631076/1.00 VC:242410 ENC:RLE,PLAIN,BIT_PACKED
item_business:  BINARY UNCOMPRESSED DO:0 FPO:8693563 SZ:67646/67646/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
pch_big_ctg:  BINARY UNCOMPRESSED DO:0 FPO:8761209 SZ:63343/63343/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
pch_med_ctg:  BINARY UNCOMPRESSED DO:0 FPO:8824552 SZ:83151/83151/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
pch_sml_ctg:  BINARY UNCOMPRESSED DO:0 FPO:8907703 SZ:109387/109387/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
[cloud@dp-master001 ~]$ 

这就导致了同步的数据导入的难度,需要检查每个分区的schema。为什么会出现这种情况呢?出现了这种情况如何快速同步呢?

0310以后的分区抽取任务

LOAD LABEL bda.bda_inv_item_age_dtl0310_after
(
DATA INFILE("hdfs://10.21.25.161:8020/user/hive/warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2022-[03-12]-*/*")
INTO TABLE `bda_inv_item_age_dtl_part`
COLUMNS TERMINATED BY "\036"
FORMAT AS "parquet"
where entr_date is not null 
)
WITH BROKER broker198
(
"hadoop.security.authentication"="kerberos",
"kerberos_principal"="hdfs",
"kerberos_keytab"="/opt/StarRocks/kerberos/hdfs.keytab"
);

最终方案思考:

方案1:

  修改StarRocks源码,支持StarRocks目标表字段多于原表字段时,填充默认值。

方案2:

  • 使用python 读取hdfs 分区的schema,根据schema建立StarRocks分区临时表
  • 各分区导入成功后,自动合并成StarRocks目标表

方案3:

  • 使用DataX同步Hive表,每个分区配置一个同步作业。有点不用考虑表字段问题。但是分区较多时,配置同步任务也比较麻烦。可以使用脚本自动生成。如果数据量较大,同步效率也不如Brocks Load方式。

猜你喜欢

转载自blog.csdn.net/zdsx1104/article/details/128062646