Mapping Hive to HBase, with a Performance Analysis

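Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Article link: https://blog.csdn.net/weixin_43680708/article/details/90314789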

The table in HBase
The table 'lxw1234' has three column families: f1, f2, and f3.

create 'lxw1234',{NAME => 'f1',VERSIONS => 1},{NAME => 'f2',VERSIONS => 1},
{NAME => 'f3',VERSIONS => 1}	

Next, set up the Hive-to-HBase mapping.

SET hbase.zookeeper.quorum=zkNode1,zkNode2,zkNode3; 
SET zookeeper.znode.parent=/hbase;
ADD jar /usr/local/apache-hive-0.13.1-bin/lib/hive-hbase-handler-0.13.1.jar;

CREATE EXTERNAL TABLE lxw1234 (
rowkey string,
f1 map<STRING,STRING>,
f2 map<STRING,STRING>,
f3 map<STRING,STRING>
) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:,f2:,f3:")
TBLPROPERTIES ("hbase.table.name" = "lxw1234");

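An external table is used here to map onto the HBase table, so dropping the table in Hive does not drop the table in HBase. With a non-external (managed) table, dropping it in Hive would drop the HBase table as well.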

In addition, apart from rowkey, each of the other three fields uses a Map structure to hold one HBase column family.

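The parameters are explained below:

hbase.zookeeper.quorum
Specifies the ZooKeeper ensemble used by HBase. The default port is 2181 and can be omitted; if a port is specified, the format is zkNode1:2222,zkNode2:2222,zkNode3:2222

zookeeper.znode.parent
Specifies the root znode that HBase uses in ZooKeeper

hbase.columns.mapping
The field mapping between the Hive table and the HBase table, matched by position: the first Hive field maps to :key (the rowkey), the second to column family f1, the third to column family f2, and the fourth to column family f3

hbase.table.name
The name of the table in HBase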
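
The positional rule behind hbase.columns.mapping can be sketched in a few lines of Python. This is only an illustration of the matching rule, not Hive's actual parser; the function name `pair_columns` and the three labels it returns are made up for this example.

```python
def pair_columns(hive_columns, mapping):
    """Pair each Hive column with its entry in hbase.columns.mapping, by position.

    Illustrative only: the function and the 'kind' labels are hypothetical.
    """
    targets = [t.strip() for t in mapping.split(",")]
    # The number of Hive columns must equal the number of mapping entries.
    if len(targets) != len(hive_columns):
        raise ValueError(
            f"{len(hive_columns)} Hive columns but {len(targets)} mapping entries"
        )
    pairs = []
    for column, target in zip(hive_columns, targets):
        if target == ":key":
            kind = "rowkey"
        elif target.endswith(":"):
            kind = "whole column family"   # e.g. "f1:" maps to a Hive map
        else:
            kind = "single column"         # e.g. "info:name"
        pairs.append((column, target, kind))
    return pairs


pairs = pair_columns(["rowkey", "f1", "f2", "f3"], ":key,f1:,f2:,f3:")
for column, target, kind in pairs:
    print(f"{column:8s} -> {target:5s} ({kind})")
```

An entry that ends with a colon, such as f1:, selects the whole column family, which is why the matching Hive field is declared as map<STRING,STRING>.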

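It is also possible to create the table in HBase at the same time as creating the table in Hive.

If the table lxw1234 has not been created in HBase beforehand, running the statement above without the EXTERNAL keyword creates it in HBase when the Hive table is created. (An external table requires the HBase table to exist already.)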

Querying the HBase table from Hive

hive> select * from lxw1234;
OK
lxw1234.com     {"c1":"name1","c2":"name2"}     {"c1":"age1","c2":"age2"}       {"c1":"job1","c2":"job2","c3":"job3"}

As shown, Hive returns a single row because there is only one rowkey; the columns and values of each column family are stored in a Map.
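
How the cells of one HBase row collapse into one Hive row can be sketched as follows. The cell list is reconstructed from the query output above, and `cells_to_hive_row` is a hypothetical helper for illustration, not part of hive-hbase-handler.

```python
def cells_to_hive_row(rowkey, cells, families):
    """Group (family:qualifier, value) cells of one rowkey into one map per family."""
    maps = {family: {} for family in families}
    for column, value in cells:
        family, qualifier = column.split(":", 1)
        maps[family][qualifier] = value
    # One Hive row: the rowkey followed by one map per column family.
    return (rowkey, *[maps[family] for family in families])


# Cells inferred from the Hive output shown above (an assumption of this sketch).
cells = [
    ("f1:c1", "name1"), ("f1:c2", "name2"),
    ("f2:c1", "age1"), ("f2:c2", "age2"),
    ("f3:c1", "job1"), ("f3:c2", "job2"), ("f3:c3", "job3"),
]
row = cells_to_hive_row("lxw1234.com", cells, ["f1", "f2", "f3"])
print(row)
```

Seven cells under a single rowkey become one Hive row with three map-typed fields.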

Inserting data into the HBase table from Hive
An INSERT statement on the Hive table can be used to insert data into the HBase table.

INSERT INTO TABLE lxw1234 
SELECT 'row1' AS rowkey,
map('c3','name3') AS f1,
map('c3','age3') AS f2,
map('c4','job3') AS f3 
FROM DUAL 
limit 1;

View the data in HBase:

hbase(main):028:0* scan 'lxw1234'
ROW            COLUMN+CELL             
row1          column=f1:c3, timestamp=1435625971410, value=name3                                                         
row1          column=f2:c3, timestamp=1435625971410, value=age3                                                          
row1          column=f3:c4, timestamp=1435625971410, value=job3                                                          
1 row(s) in 0.0420 seconds

Like any other external table, the external Hive table lxw1234 holds only metadata. The actual data lives in the HBase table, and Hive operates on it through hive-hbase-handler.

===
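Testing the query efficiency of an HBase table mapped as a Hive table
I. Preparation:
1. Write a program that writes 10 million rows into an HBase table;
2. Map that HBase table to a Hive table.
Run a command similar to the following in the Hive shell: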

hive> CREATE EXTERNAL TABLE 
IF NOT EXISTS t_hbase_person_his10(rowkey string, id string, name string, salary string, start_date string, end_date string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:id,info:name,info:salary,info:start_date,info:end_date') 
TBLPROPERTIES ('hbase.table.name' ='t_hbase_person_his10');

Copy the same data into a native Hive table, so that this copy is physically stored in Hive. The copy can be made with SQL like the following:

create table t_person_history10 as select * from t_hbase_person_his10;

II. Compare query times over Hive JDBC. The results are shown below.
Here t_hbase_person_his10 is the Hive table mapped from HBase, and t_person_history10 is the native Hive table, created with create table t_person_history10 as select * from t_hbase_person_his10;
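
A minimal sketch of how such timings might be collected. The original measurements were taken through Hive JDBC; here `run_query` is a hypothetical callable standing in for whatever client executes the SQL, and `fake_run_query` exists only so the sketch runs without a cluster.

```python
import time


def timed(run_query, sql):
    """Run one query and return (rows, elapsed time in milliseconds)."""
    start = time.perf_counter()
    rows = run_query(sql)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return rows, elapsed_ms


def fake_run_query(sql):
    # Hypothetical stand-in: a real version would send the SQL to HiveServer2
    # and fetch all result rows before returning.
    return []


sql = "select * from t_hbase_person_his10 where end_date='9999-12-31' limit 30"
rows, elapsed_ms = timed(fake_run_query, sql)
print(f"use statTime:{elapsed_ms:.0f}ms")
```

Fetching all rows inside the timed call matters, because a query that returns a lazy cursor would otherwise appear faster than it is.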

1. Query current data, returning 30 rows by default

sql = "select * from t_hbase_person_his10 where end_date='9999-12-31' limit 30";// use statTime:353ms 

sql = "select * from t_person_history10 where end_date='9999-12-31' limit 30";//use statTime:119ms

Query data for a specific date, returning 30 rows by default

sql = "select * from t_hbase_person_his10 where start_date<='2017-09-18' and end_date>='2017-09-18' and salary>990000 limit 30";//use statTime:411ms

sql = "select * from t_hbase_person_his10 where start_date<='2017-09-20' and end_date>='2017-09-20' and salary>990000 limit 30";//use statTime:908ms

sql = "select * from t_person_history10 where start_date<='2017-09-18' and end_date>='2017-09-18' and salary>990000 limit 30";//use statTime:147ms
sql = "select * from t_person_history10 where start_date<='2017-09-20' and end_date>='2017-09-20' and salary>990000 limit 30";//use statTime:266ms

order by is very slow

sql = "select * from t_hbase_person_his10 where start_date<='2017-09-20' and end_date>='2017-09-20' and salary>990000  order by salary limit 30";// use statTime:95000ms 

sql = "select * from t_person_history10 where start_date<='2017-09-20' and end_date>='2017-09-20' and salary>990000 order by salary limit 30";//use statTime:35836ms

between and

sql = "select * from t_hbase_person_his10 where end_date='9999-12-31' and salary between 500000 and 600000 limit 30";//use statTime:338ms

sql = "select * from t_person_history10 where end_date='9999-12-31' and salary between 500000 and 600000 limit 30";//use statTime:166ms

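Trace a specific user. The user name serves as the unique identifier here, which is extremely slow on the HBase-mapped table (consider using the rowkey as the unique identifier instead)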

sql = "SELECT mobile,start_date FROM t_hbase_person_his10 where name='hehe98'";//use statTime:86701ms 13901173602,2017-09-04 13201382515,2017-09-07 15107963040,2017-09-11
sql = "SELECT mobile,start_date FROM t_hbase_person_his10 where rowkey='1298'";//use statTime:316ms 

sql = "SELECT mobile,start_date FROM t_person_history10 where name='hehe98'";//use statTime:6326ms 13901173602,2017-09-04 13201382515,2017-09-07 15107963040,2017-09-11
sql = "SELECT mobile,start_date FROM t_person_history10 where rowkey='1298'";//use statTime:6288ms
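
The gap between the name filter and the rowkey filter on the HBase-mapped table can be illustrated with a toy model. HBase keeps rows sorted by rowkey, so a rowkey lookup can seek directly, while a filter on an ordinary column has to read every row. The data below is synthetic, and both helper functions are hypothetical.

```python
import bisect

# Synthetic table: 10,000 rows kept sorted by rowkey, as HBase does.
rows = sorted((str(i), {"name": f"hehe{i % 1000}"}) for i in range(10000))
keys = [key for key, _ in rows]


def get_by_rowkey(rowkey):
    """Point lookup: binary search over the sorted rowkeys."""
    i = bisect.bisect_left(keys, rowkey)
    if i < len(keys) and keys[i] == rowkey:
        return rows[i][1]
    return None


def scan_by_name(name):
    """Filter on a non-key column: every row has to be examined."""
    return [key for key, value in rows if value["name"] == name]


hit = get_by_rowkey("1298")
matches = scan_by_name("hehe98")
print(hit, len(matches))
```

The lookup touches about log2(10000), roughly 14, keys, whereas the name filter touches all 10,000 rows. The same asymmetry applies at 10 million rows.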

group by

sql = "select start_date,count(1) from t_hbase_person_his10 group by start_date";//use statTime:100330ms

sql = "select start_date,count(1) from t_person_history10 group by start_date";//use statTime:25857ms

Fuzzy (LIKE) queries

sql = "select * from t_hbase_person_his10 where name like '%hehe111%' limit 30";// use statTime:2738ms
sql = "select * from t_hbase_person_his10 where name like '%hehe111%'  and start_date>'2017-09-18' limit 10";// use statTime:2745ms
sql = "select * from t_hbase_person_his10 where rowkey like '%10059%'  and start_date>'2017-09-18' limit 10";// use statTime:665ms

sql = "select * from t_person_history10 where name like '%hehe111%'  and start_date>'2017-09-18' limit 10";// use statTime:257ms
sql = "select * from t_person_history10 where rowkey like '%10059%'  and start_date>'2017-09-18' limit 10";// use statTime:135ms
sql = "select * from t_person_history10 where name like '%hehe111%' limit 30";// use statTime:225ms

Query by a specific rowkey (this one performs well)

sql = "select * from t_hbase_person_his10 where rowkey='11123'";//use statTime:342ms

sql = "select * from t_person_history10 where rowkey='11123'";//use statTime:8386ms

Join queries on the Hive tables

sql = "select th.mobile,th.start_date,tb.mobile from t_person_history10 th, t_hbase_person_his10 tb where th.name=tb.name limit 10";//use statTime:88614ms

sql = "select th.mobile,th.start_date,tb.mobile from t_person_history10 th left outer join t_hbase_person_his10 tb on  th.name=tb.name limit 10";//use statTime:88614ms

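Summary of the results above: mapping an HBase table to a Hive table noticeably reduces query efficiency. With data on the order of 10 million rows, simple queries are not greatly affected (the filtered queries with a limit all finished in under one second), but joins, aggregations, and sorts become very slow (roughly 89 to 100 seconds in the tests above). Personally, I recommend not mapping large tables to Hive for querying, because native Hive features such as partitioning are no longer available on the mapped table.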
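
To put the measurements side by side, the sketch below computes the ratio of HBase-mapped time to native Hive time for a few of the queries above. The numbers are copied from the results in this article; the dictionary labels are shorthand invented for this example.

```python
# (HBase-mapped ms, native Hive ms), copied from the measurements above.
timings_ms = {
    "current data, limit 30": (353, 119),
    "between, limit 30": (338, 166),
    "order by salary": (95000, 35836),
    "group by start_date": (100330, 25857),
    "name = 'hehe98'": (86701, 6326),
    "rowkey = '1298'": (316, 6288),
    "rowkey = '11123'": (342, 8386),
}

# A ratio above 1 means the HBase-mapped table was slower than native Hive.
ratios = {query: hbase / hive for query, (hbase, hive) in timings_ms.items()}

for query, ratio in ratios.items():
    print(f"{query:24s} {ratio:6.2f}x")
```

The ratios are above 1 for every query except the rowkey lookups, where the HBase-mapped table is the faster of the two.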

Reposted from
http://lxw1234.com/archives/2015/06/319.htm
https://blog.csdn.net/u013850277/article/details/78472568
