Problem creating a Hive external table that maps an existing HBase table

HBase table creation script:
create 'HisDiagnose',{ NAME => 'diagnoseFamily'}

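Creating an external table in Hive that maps onto an existing HBase table makes it possible to query the HBase data with Hive QL, which gives a NoSQL store like HBase a SQL interface. The script is: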
CREATE EXTERNAL TABLE HisDiagnose(key string, doctorId int, patientId int, description String, rtime int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,diagnoseFamily:doctorId,diagnoseFamily:patientId,diagnoseFamily:description,diagnoseFamily:rtime")
TBLPROPERTIES("hbase.table.name" = "HisDiagnose");
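Problem description:
Data was inserted into the HBase table HisDiagnose through the HBase client API, with the fields doctorId, patientId and rtime of type int. Running select * from HisDiagnose in Hive returned null for the three fields doctorId, patientId and rtime. The insert code is: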
/**
 * Insert data.
 * @param tablename name of the HBase table to write to
 */
public static void insertData(String tablename) {
  System.out.println("Start inserting data ....");
  HTablePool pool = new HTablePool(conf, 1000);
  HTableInterface table = pool.getTable(tablename);
  try {
   for(int i=1; i<=1; i++){
    // One Put represents one row; create another Put for a second row.
    // Each row has a unique row key, which is the value passed to the Put constructor.
    Put put = new Put(Bytes.toBytes("2013-03-0" + i));
    byte[] family = Bytes.toBytes("diagnoseFamily");
    long ts = new Date().getTime();
    put.add(family, Bytes.toBytes("doctorId"), ts, Bytes.toBytes(i));        // int, stored as 4 binary bytes
    put.add(family, Bytes.toBytes("patientId"), ts, Bytes.toBytes(i));       // int, stored as 4 binary bytes
    put.add(family, Bytes.toBytes("description"), ts, Bytes.toBytes("描述")); // string, stored as UTF-8 text
    put.add(family, Bytes.toBytes("rtime"), ts, Bytes.toBytes(ts));          // getTime() is a long, stored as 8 binary bytes
    table.put(put);
   }
  } catch (IOException e) {
   e.printStackTrace();
  } finally {
   try {
    table.close(); // release the table handle when done
   } catch (IOException e) {
    e.printStackTrace();
   }
  }
  System.out.println("Finished inserting data ....");
}
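To see what these Put calls actually store, it helps to look at the raw bytes. The sketch below does not use HBase. It assumes that Bytes.toBytes(int) and Bytes.toBytes(long) produce fixed-width big-endian values (4 and 8 bytes) and reproduces that layout with java.nio.ByteBuffer; the class and method names are hypothetical. Reading such a cell as decimal text fails, which is analogous to the null seen in the Hive query.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Stand-in for HBase's Bytes.toBytes(int)/toBytes(long), assumed here to be
// fixed-width big-endian encodings. All names are illustrative only.
class CellEncodingDemo {

    // 4-byte big-endian encoding of an int (ByteBuffer is big-endian by default).
    static byte[] encodeInt(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    // 8-byte big-endian encoding of a long.
    static byte[] encodeLong(long v) {
        return ByteBuffer.allocate(8).putLong(v).array();
    }

    // What a text-based reader expects for the same number: its decimal digits.
    static byte[] encodeAsText(int v) {
        return Integer.toString(v).getBytes(StandardCharsets.UTF_8);
    }

    // Tries to read a cell as decimal text; returns null when the bytes
    // are not a valid number.
    static Integer parseAsText(byte[] cell) {
        try {
            return Integer.valueOf(new String(cell, StandardCharsets.UTF_8));
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Binary cell read as text: not parseable.
        System.out.println(parseAsText(encodeInt(1)));
        // Text cell read as text: parseable.
        System.out.println(parseAsText(encodeAsText(1)));
        // Width of a timestamp written as a long.
        System.out.println(encodeLong(System.currentTimeMillis()).length);
    }
}
```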
Solution:
According to the official wiki, https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration, the Column Mapping section says:
There are two SERDEPROPERTIES that control the mapping of HBase columns to Hive:
(1)、hbase.columns.mapping
(2)、hbase.table.default.storage.type: Can have a value of either string (the default) or binary, this option is only available as of Hive 0.9 and the string behavior is the only one available in earlier versions
The column mapping support currently available is somewhat cumbersome and restrictive:

(1)、for each Hive column, the table creator must specify a corresponding entry in the comma-delimited hbase.columns.mapping string (so for a Hive table with n columns, the string should have n entries); whitespace should not be used in between entries since these will be interpreted as part of the column name, which is almost certainly not what you want
(2)、a mapping entry must be either :key or of the form column-family-name:[column-name][#(binary|string)] (the type specification that is delimited by # was added in Hive 0.9.0, earlier versions interpreted everything as strings)
(3)、If no type specification is given the value from hbase.table.default.storage.type will be used
(4)、Any prefixes of the valid values are valid too (i.e. #b instead of #binary)
(5)、If you specify a column as binary the bytes in the corresponding HBase cells are expected to be of the form that HBase's Bytes class yields.
(6)、there must be exactly one :key mapping (we don't support compound keys yet)
(7)、(note that before HIVE-1228 in Hive 0.6, :key was not supported, and the first Hive column implicitly mapped to the key; as of Hive 0.6, it is now strongly recommended that you always specify the key explicitly; we will drop support for implicit key mapping in the future)
(8)、if no column-name is given, then the Hive column will map to all columns in the corresponding HBase column family, and the Hive MAP datatype must be used to allow access to these (possibly sparse) columns
(9)、there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp.
(10)、Since HBase does not associate datatype information with columns, the serde converts everything to string representation before storing it in HBase; there is currently no way to plug in a custom serde per column
(11)、it is not necessary to reference every HBase column family, but those that are not mapped will be inaccessible via the Hive table; it's possible to map multiple Hive tables to the same HBase table
The next few sections provide detailed examples of the kinds of column mappings currently possible.
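
The basic rules quoted above (one mapping entry per Hive column, no whitespace between entries, exactly one :key) are easy to break in a long mapping string. A minimal sketch of a pre-flight check follows. ColumnMappingCheck is a hypothetical helper, not part of Hive, and it only covers these simple rules, not the full mapping syntax.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hive: checks a mapping string against the
// simple rules quoted above before it is pasted into a CREATE TABLE statement.
class ColumnMappingCheck {

    // Returns a list of problems; an empty list means the simple checks passed.
    static List<String> check(String mapping, int hiveColumnCount) {
        List<String> problems = new ArrayList<>();
        String[] entries = mapping.split(",", -1); // -1 keeps empty trailing entries
        if (entries.length != hiveColumnCount) {
            problems.add("expected " + hiveColumnCount + " entries, found " + entries.length);
        }
        int keyCount = 0;
        for (String entry : entries) {
            if (entry.chars().anyMatch(Character::isWhitespace)) {
                problems.add("whitespace in entry '" + entry + "'");
            }
            if (entry.equals(":key")) {
                keyCount++;
            } else if (!entry.contains(":")) {
                problems.add("entry '" + entry + "' is neither :key nor family:column");
            }
        }
        if (keyCount != 1) {
            problems.add("expected exactly one :key entry, found " + keyCount);
        }
        return problems;
    }

    public static void main(String[] args) {
        String mapping = ":key,diagnoseFamily:doctorId,diagnoseFamily:patientId,"
                + "diagnoseFamily:description,diagnoseFamily:rtime";
        // The Hive table in this post has 5 columns: key, doctorId, patientId, description, rtime.
        System.out.println(check(mapping, 5));
    }
}
```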

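From the above: when an external table is created in Hive over an existing HBase table, hbase.table.default.storage.type defaults to string, while the doctorId, patientId and rtime cells in HBase hold binary-encoded numbers (written with Bytes.toBytes), so Hive cannot read them as text and the mapped values come back as null. The fix is to drop the external table in Hive and recreate it with hbase.table.default.storage.type set to binary. Because the insert code writes rtime from new Date().getTime(), which is an 8-byte long, rtime is declared as bigint here rather than int. The recreate script is:
CREATE EXTERNAL TABLE HisDiagnose(key string, doctorId int, patientId int, description String, rtime bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,diagnoseFamily:doctorId,diagnoseFamily:patientId,diagnoseFamily:description,diagnoseFamily:rtime","hbase.table.default.storage.type"="binary")
TBLPROPERTIES("hbase.table.name" = "HisDiagnose");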
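
With binary storage the width of each cell matters as well. The insert code writes rtime from new Date().getTime(), which is a long, so the cell holds 8 bytes, whereas an int occupies 4. The sketch below assumes the big-endian layout of HBase's Bytes class and uses ByteBuffer as a stand-in (class and method names are hypothetical). It shows what comes out if only the first 4 bytes of an 8-byte timestamp are read as an int. How Hive itself reacts to such a width mismatch was not verified here, so treat this only as an illustration of why a Hive bigint is the safer match for a value written as a long.

```java
import java.nio.ByteBuffer;

// Illustrative only: ByteBuffer stands in for HBase's Bytes class,
// assumed to use fixed-width big-endian encodings.
class CellWidthDemo {

    // 8-byte big-endian encoding, as assumed for Bytes.toBytes(long).
    static byte[] encodeLong(long v) {
        return ByteBuffer.allocate(8).putLong(v).array();
    }

    // Reads all 8 bytes back as a long: the round trip is lossless.
    static long decodeAsLong(byte[] cell) {
        return ByteBuffer.wrap(cell).getLong();
    }

    // Reads only the first 4 bytes as an int, as a 4-byte reader would.
    static int decodeFirstFourAsInt(byte[] cell) {
        return ByteBuffer.wrap(cell).getInt();
    }

    public static void main(String[] args) {
        long ts = 1362096000000L; // 2013-03-01 00:00:00 UTC in milliseconds
        byte[] cell = encodeLong(ts);
        System.out.println("cell width:   " + cell.length);
        System.out.println("read as long: " + decodeAsLong(cell));
        // Only the high 32 bits of the timestamp survive here.
        System.out.println("read as int:  " + decodeFirstFourAsInt(cell));
    }
}
```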
