Fixing the "more than 32 hfiles" error in the HBase Bulk Loading tool (the limit applies to one family of one region)

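When importing metrics, if the importtsv.bulk.output directory ends up holding more than 32 hfiles, the load has to be split into several passes:

Step 1: Move the excess hfiles to another directory, so that no more than 32 hfiles remain in the bulk.output directory.

Step 2: Run hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles output wordcount to load the hfiles under the output directory into the "wordcount" HBase table.

Step 3: Move the files that were set aside back into the output directory (it still must not hold more than 32 hfiles at a time; if more remain, repeat this step) and rerun Step 2 until every file has been loaded into the "wordcount" HBase table.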
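
The three steps above can be scripted. The following is only a sketch under stated assumptions: the function name load_in_batches and the staging directory are hypothetical, hfile names are assumed to contain no spaces, and a successful LoadIncrementalHFiles run is assumed to move the loaded hfiles out of the output directory. It parses the path from the eighth column of the hdfs dfs -ls listing.

```shell
#!/usr/bin/env bash
# Sketch: bulk-load the hfiles of one column family in batches.
# load_in_batches OUTPUT_DIR FAMILY STAGING_DIR TABLE [BATCH_SIZE]
load_in_batches() {
  local output="$1" family="$2" staging="$3" table="$4" batch="${5:-32}"
  local f files

  # Step 1: park every hfile in the (hypothetical) staging directory.
  hdfs dfs -mkdir -p "$staging"
  files=$(hdfs dfs -ls "$output/$family" | awk 'NF >= 8 {print $8}')
  for f in $files; do
    hdfs dfs -mv "$f" "$staging/" || return 1
  done

  # Steps 2 and 3: move back at most $batch hfiles, load them, repeat.
  while :; do
    files=$(hdfs dfs -ls "$staging" |
      awk -v n="$batch" 'NF >= 8 && c < n {print $8; c++}')
    [ -z "$files" ] && break
    for f in $files; do
      hdfs dfs -mv "$f" "$output/$family/" || return 1
    done
    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
      "$output" "$table" || return 1
  done
}
```

A call such as load_in_batches output f output_staging wordcount would then match the example in this post, where f is the column family used in the importtsv command further down.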

Error log produced when @黄坤 ran the completebulkload tool:

----------------------------------------

Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.hbase.mapreduce.Driver.main(Driver.java:55)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Trying to load more than 32 hfiles to one family of one region
        at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:377)
        at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:960)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.main(LoadIncrementalHFiles.java:967)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
        at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
        ... 11 more
------------------------------------------

Tracing the source code turns up the following line: this.maxFilesPerRegionPerFamily = conf.getInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 32);

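Adding the parameter -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=64 when running the load (64 is chosen here to cover the number of hfiles in the bulk.output directory) imports all the data into HBase in a single pass, with no need for repeated runs.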

For example:

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=64 output wordcount

or

sudo -u hbase hadoop jar $HBASE_HOME/hbase-server-1.0.0-cdh5.4.0.jar completebulkload -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=64 output wordcount
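
To decide what value to pass instead of 64, one option is to count the hfiles in each column-family directory under bulk.output. The helper below is a sketch, not part of HBase: the name max_hfiles_per_family is hypothetical, and it only parses a recursive hdfs dfs -ls -R listing read from stdin. The count per family directory is an upper bound for a single-region table; it does not account for hfiles that the loader has to split across region boundaries.

```shell
#!/usr/bin/env bash
# Sketch: read `hdfs dfs -ls -R <bulk.output dir>` output on stdin and print
# the largest number of hfiles found in any one column-family directory.
max_hfiles_per_family() {
  awk '
    /^-/ {                                # regular files only; directories start with "d"
      n = split($NF, parts, "/")
      if (n < 2) next
      name = parts[n]; family = parts[n - 1]
      if (name ~ /^[_.]/ || family ~ /^[_.]/) next   # skip _SUCCESS and hidden entries
      count[family]++
    }
    END {
      max = 0
      for (f in count) if (count[f] > max) max = count[f]
      print max
    }'
}

# Usage (assumes the output directory and table from this post):
#   limit=$(hdfs dfs -ls -R output | max_hfiles_per_family)
#   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
#     -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily="$limit" output wordcount
```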

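How to reproduce the problem for testing:

Following the description above, either change hbase.hregion.max.filesize on the HBase RegionServers of the "panda" cluster, or pass -Dhbase.hregion.max.filesize=20971520 when running importtsv to generate the hfiles. The detailed verification steps are:

Step 1: Change hbase.hregion.max.filesize from 10 GB to 20 MB.

Step 2: Prepare a 2 GB input file to import into HBase, so that importtsv produces more than 32 hfiles. (Calculation: 20 MB * 32 = 640 MB, where 32 is the completebulkload default for hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily, so a 2 GB input is well above the threshold.)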

To set hbase.hregion.max.filesize for a single importtsv run, add -Dhbase.hregion.max.filesize=20971520 (20971520 bytes == 20 MB) to the command; alternatively, set hbase.hregion.max.filesize on the cluster's RegionServers.

The number of hfiles importtsv generates is roughly the input size (File Input Format Counters / Bytes Read) divided by the configured hbase.hregion.max.filesize value.

hadoop jar $HBASE_HOME/hbase-server-1.0.0-cdh5.4.0.jar importtsv -Dimporttsv.bulk.output=output1 -Dhbase.hregion.max.filesize=20971520 -Dimporttsv.columns=HBASE_ROW_KEY,f:data wordcountexample 2013-09-25.csv
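
The estimate described above (bytes read divided by hbase.hregion.max.filesize) can be worked through with a small helper. This is only a sketch: the function name estimate_hfiles is hypothetical, it rounds up because a partial file still counts as one hfile, and the real count may differ since hfile sizes do not match the raw input size exactly.

```shell
#!/usr/bin/env bash
# Sketch: estimated number of hfiles = ceil(bytes_read / max_filesize).
estimate_hfiles() {
  local bytes_read="$1" max_filesize="$2"
  echo $(( (bytes_read + max_filesize - 1) / max_filesize ))
}

# 2 GB input with hbase.hregion.max.filesize=20971520 (20 MB)
estimate_hfiles $((2 * 1024 * 1024 * 1024)) 20971520   # prints 103
```

With the values used in this test, the estimate is well above the default limit of 32, which is what the reproduction needs.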
