Configuring and Testing the Hive Storage Plugin in Drill 1.0
drill,hive
As of this writing, the latest Apache Drill release is 1.0.0. It supports the following data sources and file formats:

- Avro
- Parquet
- Hive
- HBase
- CSV, TSV, PSV
- File system
My current requirement is querying HDFS data stored as Snappy-compressed SequenceFiles, a combination Drill does not support directly. However, Hive can query Snappy + SequenceFile, and Drill supports Hive, which raised the question: can Drill read Snappy + SequenceFile through the Hive storage plugin? It turns out it can. The configuration is as follows:

1. Enable the Hive metastore thrift service by adding the following to hive-site.xml:
``` xml
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://10.170.250.47:9083</value>
</property>
<property>
  <name>hive.metastore.local</name>
  <value>false</value>
</property>
```
Start the metastore service:

``` shell
[hadoop@gateway local]$ ../hive-1.2.1/bin/hive --service metastore &
```
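Before moving on, it can help to verify that the thrift port is actually listening. A quick connectivity probe (the host and port come from the hive-site.xml above; adjust them for your environment):

``` shell
# Probe the metastore thrift port (10.170.250.47:9083 from the config above).
# nc -z only checks that a TCP connection succeeds; -w 2 caps the wait at 2s.
nc -z -w 2 10.170.250.47 9083 && echo "metastore reachable" || echo "metastore not reachable"
```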
2. Configure the Hive plugin from Drill's web UI:
``` json
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://10.170.250.47:9083",
    "javax.jdo.option.ConnectionURL": "jdbc:mysql://xxx:3306/hive_database",
    "hive.metastore.warehouse.dir": "/user/hive/warehouse",
    "fs.default.name": "hdfs://xxx:9000",
    "hive.metastore.sasl.enabled": "false"
  }
}
```

Here `hive.metastore.uris` is the address and port of the Hive metastore thrift service, and `hive.metastore.warehouse.dir` is Hive's warehouse directory on HDFS. Note that JSON does not allow `#` comments, so keep any annotations outside the config itself.
After saving the plugin configuration, restart the drillbit service:

``` shell
[hadoop@gateway drill-1.1.0]$ bin/drillbit.sh restart
```
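To confirm the daemon actually came back up, drillbit.sh also provides a status subcommand. A small sketch (run from the Drill install directory; the guard only avoids a confusing error when the script is not at the expected relative path):

``` shell
# Report whether the drillbit daemon is running; fall back to a hint
# when not executed from a Drill installation directory.
if [ -x bin/drillbit.sh ]; then
  bin/drillbit.sh status
else
  echo "bin/drillbit.sh not found - run this from the Drill install directory"
fi
```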
3. Query a SequenceFile-backed table to test:
``` shell
[hadoop@gateway drill-1.1.0]$ bin/sqlline -u jdbc:drill:zk=10.172.171.229:2181
apache drill 1.0.0
"the only truly happy people are children, the creative minority and drill users"
0: jdbc:drill:zk=10.172.171.229:2181> use hive.ai;
+-------+--------------------------------------+
|  ok   |               summary                |
+-------+--------------------------------------+
| true  | Default schema changed to [hive.ai]  |
+-------+--------------------------------------+
1 row selected (0.188 seconds)
0: jdbc:drill:zk=10.172.171.229:2181> !table
+------------+---------------------+---------------------+-------------+----------+-----------+-------------+------------+----------------------------+-----------------+
| TABLE_CAT  | TABLE_SCHEM         | TABLE_NAME          | TABLE_TYPE  | REMARKS  | TYPE_CAT  | TYPE_SCHEM  | TYPE_NAME  | SELF_REFERENCING_COL_NAME  | REF_GENERATION  |
+------------+---------------------+---------------------+-------------+----------+-----------+-------------+------------+----------------------------+-----------------+
| DRILL      | INFORMATION_SCHEMA  | CATALOGS            | TABLE       |          |           |             |            |                            |                 |
| DRILL      | INFORMATION_SCHEMA  | COLUMNS             | TABLE       |          |           |             |            |                            |                 |
| DRILL      | INFORMATION_SCHEMA  | SCHEMATA            | TABLE       |          |           |             |            |                            |                 |
| DRILL      | INFORMATION_SCHEMA  | TABLES              | TABLE       |          |           |             |            |                            |                 |
| DRILL      | INFORMATION_SCHEMA  | VIEWS               | TABLE       |          |           |             |            |                            |                 |
| DRILL      | hive.ai             | metric_data_entity  | TABLE       |          |           |             |            |                            |                 |
| DRILL      | sys                 | boot                | TABLE       |          |           |             |            |                            |                 |
| DRILL      | sys                 | drillbits           | TABLE       |          |           |             |            |                            |                 |
| DRILL      | sys                 | memory              | TABLE       |          |           |             |            |                            |                 |
| DRILL      | sys                 | options             | TABLE       |          |           |             |            |                            |                 |
| DRILL      | sys                 | threads             | TABLE       |          |           |             |            |                            |                 |
| DRILL      | sys                 | version             | TABLE       |          |           |             |            |                            |                 |
+------------+---------------------+---------------------+-------------+----------+-----------+-------------+------------+----------------------------+-----------------+
0: jdbc:drill:zk=10.172.171.229:2181> SELECT count(1) FROM metric_data_entity where pt='2015080510';
+-----------+
|  EXPR$0   |
+-----------+
| 40455402  |
+-----------+
1 row selected (14.482 seconds)
0: jdbc:drill:zk=10.172.171.229:2181>
```
With the above, plain SequenceFile queries already work; querying Snappy-compressed files, however, fails with:
```
2015-08-05 16:34:49,067 [WorkManager-2] ERROR o.apache.drill.exec.work.WorkManager - org.apache.drill.exec.work.WorkManager$WorkerBee$1.run() leaked an exception.
java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
        at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native Method)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_85]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_85]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_85]
2015-08-05 16:39:05,781 [UserServer-1] INFO  o.a.drill.exec.work.foreman.Foreman - State change requested. RUNNING --> CANCELLATION_REQUESTED
```
Clearly the Snappy native library needs to be made visible via the LD_LIBRARY_PATH environment variable; see step 4 below.
4. Set LD_LIBRARY_PATH=/oneapm/local/hadoop-2.7.1/lib/native as a system environment variable, and add the same directory to CLASSPATH.
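A minimal sketch of this step, using the native-library path from my environment (adjust it to your own Hadoop installation). Appending these lines to the drillbit user's shell profile, or to Drill's conf/drill-env.sh, makes libhadoop/libsnappy loadable by the JVM:

``` shell
# Prepend Hadoop's native libraries so the UnsatisfiedLinkError on
# NativeCodeLoader.buildSupportsSnappy() goes away; the path is from
# this post's environment.
export LD_LIBRARY_PATH=/oneapm/local/hadoop-2.7.1/lib/native:$LD_LIBRARY_PATH
export CLASSPATH=$CLASSPATH:/oneapm/local/hadoop-2.7.1/lib/native
```

After setting the variables, restart the drillbit so the new environment takes effect.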