When doing data ETL, the original data may be stored in HBase, a columnar store. To clean such data, we can map the HBase table to a Hive table and then use Hive's HQL to clean and process it. The specific process is shown in the following example:
Steps
1. Create the HBase table
2. Map it to a Hive table
Step 1
Description: cf is the column family name; we only put a few test columns.
create 'cofeed_info',{NAME => 'cf', REPLICATION_SCOPE => 1}
put 'cofeed_info', '100001', 'cf:id', '101'
put 'cofeed_info', '100001', 'cf:title', 'This is test data'
put 'cofeed_info', '100001', 'cf:insert_time', '45679848161564'
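Before moving on, the test row can be read back inside the HBase shell to verify the puts (output omitted here):

```
# Full-table scan -- fine for a tiny test table, avoid on large ones.
scan 'cofeed_info'
# Fetch all columns of the test row by its rowkey.
get 'cofeed_info', '100001'
```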
Step 2
Description: Many of the columns below do not exist in the HBase table yet, but that does not matter; :key maps to the rowkey.
CREATE EXTERNAL TABLE cofeed_info
(
rowkey string,
id string,
title string,
tourl string,
content string,
data_provider string,
b_class string,
b_category string,
source string,
insert_time timestamp,
dt string
) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,cf:id,cf:title,cf:tourl,cf:content,cf:data_provider,cf:b_class,cf:b_category,cf:source,cf:insert_time,cf:dt")
TBLPROPERTIES ("hbase.table.name" = "cofeed_info");
Result
hive> desc cofeed_info;
OK
rowkey string from deserializer
id string from deserializer
title string from deserializer
tourl string from deserializer
content string from deserializer
data_provider string from deserializer
b_class string from deserializer
b_category string from deserializer
source string from deserializer
insert_time timestamp from deserializer
dt string from deserializer
Note: columns that have no (valid) data in HBase come back as NULL.
hive> select * from cofeed_info;
OK
100001 101 This is test data NULL NULL NULL NULL NULL NULL NULL NULL
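With the mapping in place, the actual cleaning can be expressed in ordinary HQL. A minimal sketch, assuming a hypothetical target table cofeed_clean and an illustrative filter rule:

```sql
-- Hypothetical target for the cleaned rows (a plain managed Hive table).
CREATE TABLE IF NOT EXISTS cofeed_clean (
  rowkey string,
  id     string,
  title  string
) STORED AS ORC;

-- Clean step: keep only rows with a non-empty id, and trim stray
-- whitespace from the title. The WHERE condition is just an example.
INSERT OVERWRITE TABLE cofeed_clean
SELECT rowkey, id, trim(title)
FROM cofeed_info
WHERE id IS NOT NULL AND id != '';
```

Because cofeed_info is an external table backed by HBase, the SELECT reads live HBase data, while the cleaned output lands in regular Hive storage.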