A summary of bulk loading data into HBase

I have spent the last stretch of time digging into bulk loading data into HBase. What looks like a simple problem turns out to be riddled with pitfalls...

Below I share the problems I ran into and how I solved them, in the hope that newcomers will not have to take as many detours as I did.

The usual way to insert into HBase is through an HTable object: the data is wrapped in a Put, which is constructed with the row key and then has the column family, column qualifier and value added to it. HTable.put then submits the Put to the RegionServer over an RPC request.

The ways of writing data can be grouped as follows:


   
   
  1. Single put
  2. Batch put
  3. MapReduce
  4. Bulk load

Before doing any bulk loading, you first have to connect to HBase correctly:


   
   
    static {
        conf = HBaseConfiguration.create();
        // ZooKeeper quorum of the cluster and the port ZooKeeper exposes to clients
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        conf.set("hbase.zookeeper.quorum", "192.168.137.138,192.168.137.139");
        conf.set("hbase.master", "192.168.10.138:60000");
    }
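For reference, the same connection can also be obtained through the newer Connection API that the bulkload example further down already uses for its Admin object. This is only a minimal sketch, assuming an HBase 1.x client on the classpath and the same ZooKeeper settings as above; "insertTest" is simply the table created later in this article.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Table;

    public class ConnectionDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.property.clientPort", "2181");
            conf.set("hbase.zookeeper.quorum", "192.168.137.138,192.168.137.139");
            // A Connection is heavyweight and thread-safe: create it once, share it,
            // and close it when the application is done. Table handles are cheap.
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("insertTest"))) {
                System.out.println("Connected, table: " + table.getName());
            }
        }
    }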

Next, create the table:


   
   
    // conf and ha are fields of the enclosing class (see the static block above)
    public static void createTable(String tableName) {
        try {
            ha = new HBaseAdmin(conf);
            // Drop the table first if it already exists
            if (ha.tableExists(tableName)) {
                ha.disableTable(tableName);
                ha.deleteTable(tableName);
            }
            // Define the table
            HTableDescriptor hd = new HTableDescriptor(tableName);
            // Add the column families
            hd.addFamily(new HColumnDescriptor("family1".getBytes()));
            hd.addFamily(new HColumnDescriptor("family2".getBytes()));
            ha.createTable(hd);
        } catch (Exception e) {
            System.out.println(e);
        }
    }

With this groundwork in place, we can start inserting data properly.

1. Single put

   
   
    // Insert one value: row key, column family, column qualifier, value, target table name
    public static void insertData(String rowkey, String cf,
            String column, String content, String tableName)
            throws IOException {
        htable = new HTable(conf, tableName);
        Put put = new Put(rowkey.getBytes());
        put.add(cf.getBytes(), column.getBytes(), content.getBytes());
        htable.put(put);
    }

This is the slowest way to load data in volume. It is better suited to online workloads where single records are written as they arrive, such as message or processing logs, with the HTable object released after the write. Every put submitted this way is a separate RPC request.

2. Batch put

Here each Put object is added to a List, and the whole list is submitted in one call. Compared with single puts, this noticeably improves load throughput. It is typically used when the data volume is somewhat larger, reducing the number of requests by submitting in batches (a chunked variant is sketched right after the code below).


   
   
    public static void insertData(String rowkey, String cf,
            String column, String content, String tableName)
            throws IOException {
        htable = new HTable(conf, tableName);
        List<Put> list = new ArrayList<Put>();
        Put put = new Put(rowkey.getBytes());
        put.add(cf.getBytes(), column.getBytes(), content.getBytes());
        list.add(put);
        htable.put(list);
    }
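The method above still builds a list with a single Put per call. A minimal sketch of the chunked variant mentioned earlier, assuming the same old HTable API and the static conf field used throughout this article; the chunk size of 1,000 Puts is an arbitrary illustration value, not a tuned recommendation:

    // Sketch only: accumulate Puts and submit them in chunks, so each RPC carries many rows.
    public static void insertBatch(List<String> rowkeys, String cf, String column,
                                   String content, String tableName) throws IOException {
        HTable table = new HTable(conf, tableName);
        try {
            List<Put> puts = new ArrayList<Put>();
            for (String rowkey : rowkeys) {
                Put put = new Put(rowkey.getBytes());
                put.add(cf.getBytes(), column.getBytes(), content.getBytes());
                puts.add(put);
                if (puts.size() >= 1000) {   // submit 1000 Puts in a single request
                    table.put(puts);
                    puts.clear();
                }
            }
            if (!puts.isEmpty()) {           // submit whatever is left over
                table.put(puts);
            }
        } finally {
            table.close();
        }
    }

On top of chunking, the client write buffer (HTable.setWriteBufferSize and the auto-flush settings) can reduce round trips further, at the cost of errors only surfacing when the buffer is flushed.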

Calling the insertData method above from main with the appropriate arguments gives you put-based batch insertion:


   
   
    public static void main(String[] args) throws IOException {
        createTable("insertTest");
        try {
            for (int i = 0; i < 10; i++) {
                String rowkey = UUID.randomUUID().toString();
                // Column families cannot be added on the fly, so only columns are added dynamically
                for (int j = 0; j <= 10; j++) {
                    insertData(rowkey, "family1", "column",
                            new SimpleDateFormat("yyyy-MM-dd hh:mm:ss")
                                    .format(new Date()), "insertTest");
                    insertData(rowkey, "family2", "column",
                            new SimpleDateFormat("yyyy-MM-dd hh:mm:ss")
                                    .format(new Date()), "insertTest");
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

Because the put-based approaches above are not suited to really large volumes, I tested them on a cluster I set up myself. The two approaches below were tested on the company cluster.

3. Batch insertion with MapReduce

HBase is a distributed, non-relational database that sits on top of a Hadoop cluster, and Hadoop is the classic tool for processing data at scale, so using MapReduce to bulk load HBase is a natural choice. Without further ado, here is the code:


   
   
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    import java.io.IOException;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class Hbase_MapReduceTest {

        static class BatchMapper extends Mapper<LongWritable, Text, Text, Text> {
            SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyyMMddHHmmssSS");
            Text text = new Text();

            protected void map(LongWritable key, Text value, Context context) {
                try {
                    // Input lines are "::"-delimited; build the row key from the first field plus a timestamp
                    final String[] spliteds = value.toString().split("::");
                    Date date = new Date();
                    String dateFormat = simpleDateFormat.format(date);
                    final String rowKey = spliteds[0] + "_" + dateFormat;
                    text.set(rowKey);
                    context.write(text, value);
                } catch (IOException e) {
                    e.printStackTrace();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }

        static class BatchReducer extends TableReducer<Text, Text, NullWritable> {
            protected void reduce(Text key, Iterable<Text> values, Context context) {
                for (Text tx : values) {
                    try {
                        final String[] arrays = tx.toString().split("::");
                        Put put = new Put(key.getBytes());
                        put.addColumn("info".getBytes(), "name".getBytes(),
                                arrays[1].getBytes());
                        context.write(NullWritable.get(), put);
                    } catch (IOException e) {
                        e.printStackTrace();
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            }
        }

        public static void main(String[] args) throws Exception {
            final Configuration configuration = new Configuration();
            configuration.set("hbase.zookeeper.quorum", "master");
            configuration.set("hbase.zookeeper.property.clientPort", "4180");
            // Name of the target HBase table
            configuration.set(TableOutputFormat.OUTPUT_TABLE, "HBASE_INSERT");
            configuration.set("dfs.socket.timeout", "180000");
            final Job job = new Job(configuration, "HBaseBatchImport");
            // Number of reduce tasks
            job.setNumReduceTasks(3);
            job.setMapperClass(BatchMapper.class);
            job.setReducerClass(BatchReducer.class);
            // Set the map output types; the reduce output types are not set
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setInputFormatClass(TextInputFormat.class);
            // No output path is set; set the output format class instead
            job.setOutputFormatClass(TableOutputFormat.class);
            // Input path of the data, e.g. hdfs://master:9000/input
            FileInputFormat.setInputPaths(job, args[0]);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The biggest difference from an ordinary MapReduce job lies in where the reduce output goes: instead of writing the result to HDFS, the reducer writes straight into the table, and the reduce output types do not need to be set. The two key lines are:


   
   
    // Name of the target HBase table
    configuration.set(TableOutputFormat.OUTPUT_TABLE, "HBASE_INSERT");
    // Output format class for the data
    job.setOutputFormatClass(TableOutputFormat.class);

This route ultimately goes through the TableOutputFormat class, so the underlying mechanism is still HTable's put method; the difference is that the puts are submitted from a distributed MapReduce job, which is much faster than a single-threaded client.
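As a side note, the same wiring can be done with the helper class TableMapReduceUtil, whose initTableReducerJob call sets the output format, the target table and the reducer in one go. This is only a minimal sketch, assuming it sits next to the class above (same package) so that BatchMapper and BatchReducer are visible:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class HbaseMapReduceWithUtil {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", "master");
            conf.set("hbase.zookeeper.property.clientPort", "4180");
            Job job = new Job(conf, "HBaseBatchImport");
            job.setMapperClass(Hbase_MapReduceTest.BatchMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setInputFormatClass(TextInputFormat.class);
            // Wires up TableOutputFormat, the target table and the reducer in a single call
            TableMapReduceUtil.initTableReducerJob("HBASE_INSERT",
                    Hbase_MapReduceTest.BatchReducer.class, job);
            FileInputFormat.setInputPaths(job, args[0]);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }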

This approach is not a cure-all, though. When puts are submitted too quickly they put a lot of pressure on HBase: GC pauses become frequent and region servers can drop out. That is especially likely when a table is first initialized, since there is usually a large amount of historical data to load. In that situation the bulkload approach described below is recommended, because it takes most of the load off HBase. Note also that when there is almost no processing logic, the Put can be generated directly in the map phase and handed to TableOutputFormat; when real processing is needed, the usual pattern is to do the business logic (joins, aggregation and so on) on the map side and generate the Put objects on the reduce side, as the code above does.

4. Bulk loading with bulkLoad

This is the most widely used approach, and the fastest officially sanctioned way to bulk load HBase. The HBase reference guide covers it roughly as follows:

Bulk loading


   
   
    1. Overview
       HBase includes several methods of loading data into tables. The most
       straightforward ones are to use the TableOutputFormat class from a MapReduce
       job, or to use the normal client API; however, these are not always the most
       efficient. The bulk load feature uses a MapReduce job to output table data in
       HBase's internal data format, and then loads the generated StoreFiles directly
       into a running cluster. Bulk loading uses less CPU and network than going
       through the HBase API.
    2. Bulk load limitations
       Because bulk loading bypasses the write path, the WAL is not written as part
       of the process. Replication works by reading the WAL files, so it will not see
       the bulk-loaded data, and the same applies to edits that use
       Put.setDurability(SKIP_WAL). One way to handle this is to ship the raw files
       or HFiles to the other cluster and do the other processing there as well.
    3. Bulk load architecture
       The HBase bulk load process consists of two main steps.
       1. Prepare the data via a MapReduce job
          The first step is to generate the HBase data files (StoreFiles) from a
          MapReduce job using HFileOutputFormat2. This output format writes data in
          HBase's internal storage format so that it can later be loaded into the
          cluster very efficiently. To work efficiently, HFileOutputFormat2 must be
          configured so that each output HFile fits within a single region. To do
          this, jobs whose output will be bulk loaded use Hadoop's
          TotalOrderPartitioner to partition the map output into disjoint ranges of
          the key space, corresponding to the key ranges of the regions in the
          table. HFileOutputFormat2 includes a convenience function,
          configureIncrementalLoad(), which sets up a TotalOrderPartitioner
          automatically based on the current region boundaries of the table.
       2. Complete the data load
          After the data has been prepared, either with the importtsv tool using the
          "importtsv.bulk.output" option or with some other MapReduce job using
          HFileOutputFormat, the completebulkload tool is used to import the data
          into the running cluster. This command-line tool iterates over the
          prepared data files and, for each one, determines the region it belongs
          to. It then contacts the appropriate RegionServer, which adopts the HFile,
          moves it into its storage directory and makes the data available to
          clients. If the region boundaries have changed during bulk load
          preparation, or between the preparation and completion steps, the
          completebulkload utility automatically splits the data files into pieces
          corresponding to the new boundaries. This process is not optimal, so users
          should take care to minimize the delay between preparing a bulk load and
          importing it into the cluster, especially if other clients are
          simultaneously loading data through other means.

          $ hadoop jar hbase-server-VERSION.jar completebulkload
            [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable

          The -c config-file option can be used to specify a file containing the
          appropriate hbase parameters (e.g. hbase-site.xml) if not already supplied
          on the CLASSPATH (in addition, the CLASSPATH must contain the directory
          with the zookeeper configuration files if zookeeper is not managed by
          HBase). If the target table does not already exist in HBase, this tool
          will create it automatically.
    4. See also
       For more information about the referenced utilities, see ImportTsv and
       CompleteBulkLoad, as well as the blog post "How-to: Use HBase Bulk Loading,
       and Why" for an account of the state of bulk loading.
    5. Advanced usage
       Although the importtsv tool is useful in many cases, advanced users may want
       to generate data programmatically, or import data from other formats. To get
       started doing so, dig into ImportTsv.java and check the JavaDoc for
       HFileOutputFormat. The import step of the bulk load can also be done
       programmatically.

For my test I bulk loaded the output of a WordCount job, using 2 GB of text files as input. I will skip the WordCount code itself and go straight to the bulk load code:


   
   
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class WorldCount_Hbase {

        public static class ConvertWordCountOutToHFileMapper
                extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // The WordCount output looks like "a  100", "b  20", one word per line;
                // lines are read with the byte offset as key and the line content as value
                String wordCountStr = value.toString();
                String[] wordCountArray = wordCountStr.split("\t");
                String word = wordCountArray[0];
                int count = Integer.valueOf(wordCountArray[1]);
                // Use the word as the HBase row key
                byte[] rowKey = Bytes.toBytes(word);
                ImmutableBytesWritable rowKeyWritable = new ImmutableBytesWritable(rowKey);
                byte[] family = Bytes.toBytes("cf");
                byte[] qualifier = Bytes.toBytes("count");
                byte[] hbaseValue = Bytes.toBytes(count);
                // Put handles multiple columns under a family; for a single column
                // a KeyValue could be used instead:
                // KeyValue keyValue = new KeyValue(rowKey, family, qualifier, hbaseValue);
                Put put = new Put(rowKey);
                put.add(family, qualifier, hbaseValue);
                context.write(rowKeyWritable, put);
            }
        }

        public static void main(String[] args) throws Exception {
            Long starttime = System.currentTimeMillis();
            Configuration hadoopConfiguration = new Configuration();
            hadoopConfiguration.set("fs.defaultFS", "hdfs://192.168.1.31:9000");
            // hadoopConfiguration.set("mapreduce.map.memory.mb", "512");
            // hadoopConfiguration.set("mapreduce.reduce.memory.mb", "512");
            String[] dfsArgs = new GenericOptionsParser(hadoopConfiguration, args)
                    .getRemainingArgs();
            FileSystem fs = FileSystem.get(hadoopConfiguration);
            Path input = new Path("/tmp/xmr/hbase/test");
            Path output = new Path("/tmp/xmr/resultdata/test");
            // Path input = new Path(dfsArgs[0]);
            // Path output = new Path(dfsArgs[1]);
            Job convertWordCountJobOutputToHFileJob =
                    new Job(hadoopConfiguration, "wordCount_bulkload");
            convertWordCountJobOutputToHFileJob.setJarByClass(WorldCount_Hbase.class);
            convertWordCountJobOutputToHFileJob.setMapperClass(
                    ConvertWordCountOutToHFileMapper.class);
            convertWordCountJobOutputToHFileJob.setMapOutputKeyClass(
                    ImmutableBytesWritable.class);
            convertWordCountJobOutputToHFileJob.setMapOutputValueClass(Put.class);
            // Remove any previous output directory
            if (fs.exists(output)) {
                fs.delete(output, true);
            }
            FileInputFormat.addInputPath(convertWordCountJobOutputToHFileJob, input);
            FileOutputFormat.setOutputPath(convertWordCountJobOutputToHFileJob, output);
            // Configuration for HBase
            Configuration hbaseConfiguration = HBaseConfiguration.create();
            hbaseConfiguration.set("hbase.zookeeper.quorum", "master,node001,node002");
            hbaseConfiguration.set("hbase.zookeeper.property.clientPort", "4180");
            System.out.println(hbaseConfiguration.toString());
            // Create the target table if it does not exist yet
            Admin admin = ConnectionFactory.createConnection(hbaseConfiguration).getAdmin();
            if (!admin.isTableAvailable(TableName.valueOf("wordcount"))) {
                HTableDescriptor hbaseTable =
                        new HTableDescriptor(TableName.valueOf("wordcount"));
                hbaseTable.addFamily(new HColumnDescriptor("cf"));
                admin.createTable(hbaseTable);
            }
            HTable wordCountTable = new HTable(hbaseConfiguration, "wordcount");
            HFileOutputFormat.configureIncrementalLoad(
                    convertWordCountJobOutputToHFileJob, wordCountTable);
            int convertWordCountJobOutputToHFileJobResult =
                    convertWordCountJobOutputToHFileJob.waitForCompletion(true) ? 0 : 1;
            // Use BulkLoad to move the MR output into the table
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(hbaseConfiguration);
            // First argument: the job's output directory (the HFiles); second: the target table
            loader.doBulkLoad(output, wordCountTable);
            Long endtime = System.currentTimeMillis();
            System.out.println("Elapsed time: " + (endtime - starttime));
            System.exit(convertWordCountJobOutputToHFileJobResult);
        }
    }

Points to note here:

  1. The map output types must be ImmutableBytesWritable with Put, or ImmutableBytesWritable with KeyValue: use Put when there are several columns, and KeyValue when there is only one (a KeyValue-based mapper is sketched after the snippet below).
  2. You do not write a reduce phase yourself, so there is no reduce output path to configure either.
  3. The generated HFiles are loaded into the table with the BulkLoad API:

   
   
    // Use BulkLoad to move the MR output into the table
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(hbaseConfiguration);
    // First argument: the job's output directory (the HFiles); second: the target table
    loader.doBulkLoad(output, wordCountTable);
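To illustrate note 1, here is a minimal sketch of a KeyValue-based mapper for the single-column case. It assumes the same tab-separated WordCount output as above and would slot into the class listed earlier, with org.apache.hadoop.hbase.KeyValue added to the imports and setMapOutputValueClass(KeyValue.class) set in the driver:

    public static class WordCountToKeyValueMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            byte[] rowKey = Bytes.toBytes(parts[0]);
            int count = Integer.parseInt(parts[1]);
            // One KeyValue is exactly one cell: row key, column family, qualifier, value
            KeyValue kv = new KeyValue(rowKey, Bytes.toBytes("cf"),
                    Bytes.toBytes("count"), Bytes.toBytes(count));
            context.write(new ImmutableBytesWritable(rowKey), kv);
        }
    }

configureIncrementalLoad accepts either value type, so the rest of the driver is unchanged.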

The above is a basic introduction to the common bulk load approaches; each of them was tested with a few GB of data and worked fine.

You can find plenty of nearly identical code online. But HBase is a database meant for storing massive amounts of data; in real applications, once the data volume really grows, will everything still work?

Of course not. The real pitfalls of HBase bulk loading only start here, and they tormented me to the point of questioning my life choices!

I will skip the assorted odd little problems and share the biggest pitfalls I hit while bulk loading with BulkLoad.

1. The reduce-related problem

In practice you will find that, for moderately large inputs, the map phase runs at a satisfying speed, but the reduce phase stalls badly. My confusion: my code contains no reduce phase at all, so why is there a reduce phase dragging down my load?

So I tried setting the number of reducers on the job to 0, yet after rerunning, the annoying reduce phase was still there. Since setting it to 0 had no effect, I tried raising the reducer count instead to get more parallelism, but the job still ran with a single reducer... Only later did I learn that in this setup you do not write the reducer yourself: HFileOutputFormat.configureIncrementalLoad plugs in an HBase-provided sorting reducer and sets the number of reduce tasks to the number of regions in the target table, so whatever reducer count you set is overridden. With that, it finally made sense.

2. Severely poor throughput

First I tested with 100 MB of data, and it took a full 30 seconds to load! With a few GB of data the throughput was no better, which works out to fewer than 15,000 rows per second on average, slower even than loading into MySQL and completely unacceptable in real production. Wasn't this supposed to be the fastest way in? I began to have doubts. At bottom, the cause is again that single annoying reducer: no matter how powerful the cluster, it effectively runs as a single machine, which is clearly not right. So how do you fix it?

Pre-split the table sensibly when you create it! The number of pre-split regions determines the number of reduce tasks. Simply put, within a reasonable range, a good pre-split multiplies throughput by roughly the same factor by which it multiplies the number of reducers.

Choosing a good pre-split for an HBase table is actually a fairly involved problem in its own right and not the focus of this article. If you are interested, the blog post linked below gave me a great deal of inspiration.

Blog link

Here I will only show the shell statement for pre-splitting a table at creation time and what it produces:

    create 'XUE_BULKLOAD','info',{SPLITS => ['1','2','3','4','5','6','7','8','9']}

   
   

This creates the table 'XUE_BULKLOAD' with column family 'info', pre-split into 10 regions.

Testing after the pre-split, the number of reducers turned out to be the number of split points plus one, and throughput improved dramatically: roughly 100,000 to 200,000 rows per second, which just about meets the demands of real work.
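If you prefer to pre-split from code rather than from the shell, a minimal sketch using the same HBaseAdmin API and static conf field as earlier in this article looks roughly like this; the split keys mirror the SPLITS list of the shell command:

    public static void createPreSplitTable(String tableName) throws IOException {
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            byte[][] splitKeys = new byte[9][];
            for (int i = 0; i < 9; i++) {
                splitKeys[i] = Bytes.toBytes(String.valueOf(i + 1)); // "1" .. "9"
            }
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf(tableName));
            desc.addFamily(new HColumnDescriptor("info"));
            admin.createTable(desc, splitKeys); // 9 split points -> 10 regions
        } finally {
            admin.close();
        }
    }

Calling createPreSplitTable("XUE_BULKLOAD") then yields the same ten regions as the shell statement above.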

3. Loads beyond a certain size fail outright

After all this tuning the throughput was acceptable, so I started scaling the test data up. It turned out that with a few tens of GB, the job would always error out after the MapReduce phase finished, and the table would contain not a single record. The error was:


   
   
    Trying to load more than 32 hfiles to one family of one region
    18/01/18 23:20:36 ERROR mapreduce.LoadIncrementalHFiles: Trying to load
    more than 32 hfiles to family info of region with start key
    Exception in thread "main" java.io.IOException: Trying to load more than
    32 hfiles to one family of one region
        at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:377)
        at hbase_Insert.Hbase_Insert.main(Hbase_Insert.java:241)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

The error roughly means that the job tried to load more than 32 HFiles into one region of the table, which makes the load fail. How do you get around this? It comes down to two configuration parameters.

hbase.hregion.max.filesize

The maximum store size per column family of a region; under the ConstantSizeRegionSplitPolicy a region is split automatically once it grows past this value. On this cluster the default was 1 GB.

hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily

The maximum number of HFiles allowed per region and column family during a bulk load; the default is 32. Taken together, the defaults mean a single bulk load cannot exceed 1 GB x 32 = 32 GB per region and family; beyond that the load fails.

Both parameters can be set in code or in hbase-site.xml under the conf directory of the HBase installation. To settle it once and for all I chose to set them in hbase-site.xml, as follows (a code-level sketch follows the XML):


   
   
    <property>
        <name>hbase.hregion.max.filesize</name>
        <value>10737418240</value>
    </property>
    <property>
        <name>hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily</name>
        <value>3200</value>
    </property>
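For completeness, here is a rough sketch of the in-code alternative mentioned above. The bulk-load HFile limit is read on the client by LoadIncrementalHFiles, so setting it on the Configuration is enough; as far as I know the region size limit only takes effect on the server side or per table, so the sketch raises it on the table descriptor instead (reusing the admin and splitKeys from the pre-split sketch earlier). The values simply repeat those from the XML:

    // Raise the client-side bulk load limit that LoadIncrementalHFiles checks
    hbaseConfiguration.set("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", "3200");
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(hbaseConfiguration);

    // Raise the maximum region size for this particular table at creation time
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("XUE_BULKLOAD"));
    desc.addFamily(new HColumnDescriptor("info"));
    desc.setMaxFileSize(10737418240L); // 10 GB per region before a split is triggered
    admin.createTable(desc, splitKeys);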

With that, a single bulk load can now take up to 10 GB x 3200 = 32 TB per region and family, which satisfies the company's data volumes. After restarting the cluster with this configuration the error no longer appeared, and the run looked like this:


   
   
    18/01/20 12:17:31 INFO mapreduce.Job:  map 1% reduce 0%
    18/01/20 12:17:35 INFO mapreduce.Job:  map 2% reduce 0%
    18/01/20 12:17:42 INFO mapreduce.Job:  map 3% reduce 0%
    18/01/20 12:17:45 INFO mapreduce.Job:  map 4% reduce 0%
    18/01/20 12:17:51 INFO mapreduce.Job:  map 5% reduce 0%
    18/01/20 12:17:55 INFO mapreduce.Job:  map 6% reduce 0%
    18/01/20 12:17:59 INFO mapreduce.Job:  map 7% reduce 0%
    18/01/20 12:18:03 INFO mapreduce.Job:  map 8% reduce 0%
    18/01/20 12:18:06 INFO mapreduce.Job:  map 9% reduce 0%
    18/01/20 12:18:11 INFO mapreduce.Job:  map 10% reduce 0%
    18/01/20 12:18:16 INFO mapreduce.Job:  map 11% reduce 0%
    18/01/20 12:18:20 INFO mapreduce.Job:  map 12% reduce 0%
    18/01/20 12:18:27 INFO mapreduce.Job:  map 13% reduce 0%
    18/01/20 12:18:32 INFO mapreduce.Job:  map 14% reduce 0%
    18/01/20 12:18:37 INFO mapreduce.Job:  map 15% reduce 0%
    18/01/20 12:18:42 INFO mapreduce.Job:  map 16% reduce 0%
    18/01/20 12:18:47 INFO mapreduce.Job:  map 17% reduce 0%
    18/01/20 12:18:53 INFO mapreduce.Job:  map 18% reduce 0%
    18/01/20 12:18:58 INFO mapreduce.Job:  map 19% reduce 0%
    18/01/20 12:19:03 INFO mapreduce.Job:  map 20% reduce 0%
    18/01/20 12:19:08 INFO mapreduce.Job:  map 21% reduce 0%
    18/01/20 12:19:14 INFO mapreduce.Job:  map 22% reduce 0%
    18/01/20 12:19:18 INFO mapreduce.Job:  map 23% reduce 0%
    18/01/20 12:19:23 INFO mapreduce.Job:  map 24% reduce 0%
    18/01/20 12:19:29 INFO mapreduce.Job:  map 25% reduce 0%
    18/01/20 12:19:33 INFO mapreduce.Job:  map 26% reduce 0%
    18/01/20 12:19:38 INFO mapreduce.Job:  map 27% reduce 0%
    18/01/20 12:19:43 INFO mapreduce.Job:  map 28% reduce 0%
    18/01/20 12:19:48 INFO mapreduce.Job:  map 29% reduce 0%
    18/01/20 12:19:53 INFO mapreduce.Job:  map 30% reduce 0%
    18/01/20 12:19:58 INFO mapreduce.Job:  map 31% reduce 0%
    18/01/20 12:20:04 INFO mapreduce.Job:  map 32% reduce 0%
    18/01/20 12:20:08 INFO mapreduce.Job:  map 33% reduce 0%
    18/01/20 12:20:13 INFO mapreduce.Job:  map 34% reduce 0%
    18/01/20 12:20:17 INFO mapreduce.Job:  map 35% reduce 0%
    18/01/20 12:20:21 INFO mapreduce.Job:  map 36% reduce 0%
    18/01/20 12:20:25 INFO mapreduce.Job:  map 37% reduce 0%
    18/01/20 12:20:29 INFO mapreduce.Job:  map 38% reduce 0%
    18/01/20 12:20:33 INFO mapreduce.Job:  map 39% reduce 0%
    18/01/20 12:20:37 INFO mapreduce.Job:  map 40% reduce 0%
    18/01/20 12:20:41 INFO mapreduce.Job:  map 41% reduce 0%
    18/01/20 12:20:45 INFO mapreduce.Job:  map 42% reduce 0%
    18/01/20 12:20:50 INFO mapreduce.Job:  map 43% reduce 0%
    18/01/20 12:20:54 INFO mapreduce.Job:  map 44% reduce 0%
    18/01/20 12:20:58 INFO mapreduce.Job:  map 45% reduce 0%
    18/01/20 12:21:02 INFO mapreduce.Job:  map 46% reduce 0%
    18/01/20 12:21:06 INFO mapreduce.Job:  map 47% reduce 0%
    18/01/20 12:21:10 INFO mapreduce.Job:  map 48% reduce 0%
    18/01/20 12:21:14 INFO mapreduce.Job:  map 49% reduce 0%
    18/01/20 12:21:18 INFO mapreduce.Job:  map 50% reduce 0%
    18/01/20 12:21:22 INFO mapreduce.Job:  map 51% reduce 0%
    ........
    18/01/20 12:29:12 INFO mapreduce.Job:  map 100% reduce 81%
    18/01/20 12:29:24 INFO mapreduce.Job:  map 100% reduce 82%
    18/01/20 12:29:36 INFO mapreduce.Job:  map 100% reduce 83%
    18/01/20 12:29:48 INFO mapreduce.Job:  map 100% reduce 84%
    18/01/20 12:30:00 INFO mapreduce.Job:  map 100% reduce 85%
    18/01/20 12:30:12 INFO mapreduce.Job:  map 100% reduce 86%
    18/01/20 12:30:23 INFO mapreduce.Job:  map 100% reduce 87%
    18/01/20 12:30:33 INFO mapreduce.Job:  map 100% reduce 88%
    18/01/20 12:30:45 INFO mapreduce.Job:  map 100% reduce 89%
    18/01/20 12:30:59 INFO mapreduce.Job:  map 100% reduce 90%
    18/01/20 12:31:11 INFO mapreduce.Job:  map 100% reduce 91%
    18/01/20 12:31:21 INFO mapreduce.Job:  map 100% reduce 92%
    18/01/20 12:31:33 INFO mapreduce.Job:  map 100% reduce 93%
    18/01/20 12:31:45 INFO mapreduce.Job:  map 100% reduce 94%
    18/01/20 12:31:57 INFO mapreduce.Job:  map 100% reduce 95%
    18/01/20 12:32:10 INFO mapreduce.Job:  map 100% reduce 96%
    18/01/20 12:32:28 INFO mapreduce.Job:  map 100% reduce 97%
    18/01/20 12:32:57 INFO mapreduce.Job:  map 100% reduce 98%
    18/01/20 12:33:28 INFO mapreduce.Job:  map 100% reduce 99%
    18/01/20 12:34:43 INFO mapreduce.Job:  map 100% reduce 100%
    18/01/20 12:38:02 INFO mapreduce.Job: Job job_1516347580021_0001 completed successfully
    18/01/20 12:38:02 INFO mapreduce.Job: Counters: 52
        File System Counters
            FILE: Number of bytes read=87576726096
            FILE: Number of bytes written=142193600747
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=83582905128
            HDFS: Number of bytes written=166475667426
            HDFS: Number of read operations=5468
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=39
        Job Counters
            Failed map tasks=6
            Launched map tasks=1086
            Launched reduce tasks=10
            Other local map tasks=6
            Data-local map tasks=465
            Rack-local map tasks=615
            Total time spent by all maps in occupied slots (ms)=82454392
            Total time spent by all reduces in occupied slots (ms)=47463944
            Total time spent by all map tasks (ms)=10306799
            Total time spent by all reduce tasks (ms)=5932993
            Total vcore-seconds taken by all map tasks=10306799
            Total vcore-seconds taken by all reduce tasks=5932993
            Total megabyte-seconds taken by all map tasks=84433297408
            Total megabyte-seconds taken by all reduce tasks=48603078656
        Map-Reduce Framework
            Map input records=568152966
            Map output records=568152966
            Map output bytes=228099087448
            Map output materialized bytes=54476960272
            Input split bytes=186120
            Combine input records=0
            Combine output records=0
            Reduce input groups=292435364
            Reduce shuffle bytes=54476960272
            Reduce input records=568152966
            Reduce output records=2339482912
            Spilled Records=1513624168
            Shuffled Maps =10800
            Failed Shuffles=0
            Merged Map outputs=10800
            GC time elapsed (ms)=794607
            CPU time spent (ms)=21363440
            Physical memory (bytes) snapshot=3038556569600
            Virtual memory (bytes) snapshot=9401710268416
            Total committed heap usage (bytes)=3512994889728
        Shuffle Errors
            BAD_ID=0
            CONNECTION=0
            IO_ERROR=0
            WRONG_LENGTH=0
            WRONG_MAP=0
            WRONG_REDUCE=0
        File Input Format Counters
            Bytes Read=83582349648
        File Output Format Counters
            Bytes Written=166475667426
    18/01/20 12:38:02 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x71f30c76 connecting to ZooKeeper ensemble=node003:4180,node002:4180,node001:4180,master:4180,node009:4180,node008:4180,node007:4180,node010:4180,node006:4180,node005:4180,node004:4180
    18/01/20 12:38:02 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=node003:4180,node002:4180,node001:4180,master:4180,node009:4180,node008:4180,node007:4180,node010:4180,node006:4180,node005:4180,node004:4180 sessionTimeout=90000 watcher=hconnection-0x71f30c760x0, quorum=node003:4180,node002:4180,node001:4180,master:4180,node009:4180,node008:4180,node007:4180,node010:4180,node006:4180,node005:4180,node004:4180, baseZNode=/hbase
    18/01/20 12:38:02 INFO zookeeper.ClientCnxn: Opening socket connection to server node004/192.168.1.38:4180. Will not attempt to authenticate using SASL (unknown error)
    18/01/20 12:38:02 INFO zookeeper.ClientCnxn: Socket connection established to node004/192.168.1.38:4180, initiating session
    18/01/20 12:38:02 INFO zookeeper.ClientCnxn: Session establishment complete on server node004/192.168.1.38:4180, sessionid = 0x26001af8d8190002, negotiated timeout = 40000
    18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://192.168.1.31:9000/test_demo/result/test/_SUCCESS
    18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/28e47c53edaf4616a3dfc349d0f0e02a with size: 10931823633 bytes can be problematic as it may lead to oversplitting.
    18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/5e6501ccb7554b82a2e93024d61dbe0e with size: 10931820982 bytes can be problematic as it may lead to oversplitting.
    18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/6245730468534f85a428ef7fb7acd499 with size: 10931829083 bytes can be problematic as it may lead to oversplitting.
    18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/a3b7225320b24e838559d5a5772bdd87 with size: 10931823391 bytes can be problematic as it may lead to oversplitting.
    18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/a8306bc4ef3941f5bd131d47f0b1c2c3 with size: 10931822321 bytes can be problematic as it may lead to oversplitting.
    18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/c163c568b1c24d88ac0ed7599b81ecba with size: 10931824861 bytes can be problematic as it may lead to oversplitting.
    18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/d069683ce064411793640f2a0ec6ca98 with size: 10931822990 bytes can be problematic as it may lead to oversplitting.
    18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/e5df664e18c54da7b84370b72506923b with size: 10931821709 bytes can be problematic as it may lead to oversplitting.
    18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/fba38b4d0bd34f6782b844b288780e7b with size: 10931826385 bytes can be problematic as it may lead to oversplitting.
    18/01/20 12:38:02 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:02 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:02 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:02 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:02 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:02 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:02 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:02 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/11e458c2c1f04654ae1783ec4e6576e8 first=459096918168596876155 last=4999999888024945828
    18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/78067c90799149dbb4a423ef556a4272 first=559078464243536377945 last=5999999888024945828
    18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/0049f16fd57b482aa2e68ebe21a0cb72 first=15907887724999982915 last=19999999217611496331
    18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/28e47c53edaf4616a3dfc349d0f0e02a first=80100000359202982424 last=859088818898462383266
    18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/6245730468534f85a428ef7fb7acd499 first=401000000531957283573 last=459096917941294955954
    18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/18068da4a3f5469a804eee9f6921617a first=959083192452571451003 last=99999998239977206078
    18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/5e6501ccb7554b82a2e93024d61dbe0e first=30100000359202982424 last=359081166786305137185
    18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/0cff66c092004d488db32c3bf549a1d1 first=0100000359202982424 last=0999998239977206078
    18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/c163c568b1c24d88ac0ed7599b81ecba first=10100000359202982424 last=15907887393454423668
    18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/e5df664e18c54da7b84370b72506923b first=501000000531957283573 last=559078458337340744586
    18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/88df957d66e84b758583c47c9e6aec95 first=25908421410455709356 last=29999998239977206078
    18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
    18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/a8306bc4ef3941f5bd131d47f0b1c2c3 first=60100000359202982424 last=659079145929173333600
    18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/d28c5e918b784127a7faa8afee8b364d first=359081168652388606128 last=39999999217611496331
    18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/a3b7225320b24e838559d5a5772bdd87 first=701000000531957283573 last=759089489615157841144
    18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/d069683ce064411793640f2a0ec6ca98 first=20100000359202982424 last=25908421377193754247
    18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168


Reposted from blog.csdn.net/qq_38025219/article/details/84957131