Pre-splitting Regions When Creating an HBase Table

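If you know how the row keys of an HBase table are distributed, you can pre-split the table into regions when you create it. This avoids write hotspots during large inserts and improves write throughput.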

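Background: by default, HBase creates a table with a single region whose row key range is unbounded, that is, it has no start key and no end key. All writes go to this one region. As the data grows, the region can no longer hold it and splits into 2 regions. This causes two problems: 1. All writes hit a single region, which creates a write hotspot. 2. Region splits consume valuable cluster I/O.

To avoid this, we can create several empty regions at table creation time and fix the start and end row key of each one. As long as the row key design spreads writes evenly across these regions, there is no write hotspot, and splits become far less likely. Regions will of course still split when the data grows enough. Creating a table's regions up front like this is called pre-splitting, and there are three common ways to do it.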

First, consider a table created without pre-splitting: its single region has an empty start key and an empty end key.

1. Pre-split directly with create in the HBase shell:

create 'split01','cf1',SPLITS=>['1000000','2000000','3000000']

This creates 4 regions, and each row is written to one of them according to its row key.
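To make the mapping from row key to region concrete, here is a minimal sketch in plain Java, with no HBase dependency. It assumes what HBase does for row keys in general: keys are compared byte by byte as unsigned values, and each region covers its start key (inclusive) up to its end key (exclusive). The class and method names are hypothetical and exist only for illustration.

```java
import java.nio.charset.StandardCharsets;

final class RegionLookupSketch {
    // Compares two keys as unsigned bytes, left to right.
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Returns the 0-based index of the region that would hold rowKey,
    // given split keys that are already sorted in ascending order.
    static int regionIndex(String[] splitKeys, String rowKey) {
        byte[] row = rowKey.getBytes(StandardCharsets.UTF_8);
        int index = 0;
        for (String split : splitKeys) {
            if (compareKeys(row, split.getBytes(StandardCharsets.UTF_8)) < 0) {
                break; // rowKey sorts before this split key, so it belongs to the current region
            }
            index++;
        }
        return index;
    }

    public static void main(String[] args) {
        // Same split keys as the 'split01' table above.
        String[] splits = {"1000000", "2000000", "3000000"};
        for (String key : new String[] {"0500000", "1000000", "2999999", "3000001", "25"}) {
            System.out.println(key + " -> region " + regionIndex(splits, key));
        }
    }
}
```

Note that the comparison is lexicographic, not numeric: the key "25" sorts after "2000000" and so lands in the third region. Fixed-width, zero-padded keys avoid this kind of surprise.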

2. Pre-split from a file that lists the split keys, one per line:

create 'split02','cf1',SPLITS_FILE=>'/data/hbase/split/split.txt'
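The split file can be generated rather than written by hand. The sketch below is one possible way to do it in plain Java. The key format (7-digit, zero-padded numbers), the step size, and the helper names are assumptions chosen to match the shell example, and the temporary file stands in for the path used in the create command above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class SplitFileSketch {
    // Builds numRegions - 1 split keys: step, 2 * step, ... as zero-padded strings.
    static List<String> splitKeys(int numRegions, long step) {
        List<String> keys = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            keys.add(String.format("%07d", i * step));
        }
        return keys;
    }

    // Writes one split key per line.
    static Path writeSplitFile(Path file, List<String> keys) throws IOException {
        return Files.write(file, keys, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("split", ".txt");
        writeSplitFile(file, splitKeys(4, 1_000_000L));
        System.out.println("wrote " + file + ": " + Files.readAllLines(file));
    }
}
```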

3. Pre-split with createTable in the Java API:

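The Admin interface in the HBase client package provides 4 methods for creating tables (the first three are synchronous, the fourth is asynchronous):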

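- Create a table from its descriptor only

This creates the table directly from the table descriptor, without specifying any split points.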

  /**
   * Creates a new table. Synchronous operation.
   *
   * @param desc table descriptor for table
   * @throws IllegalArgumentException if the table name is reserved
   * @throws MasterNotRunningException if master is not running
   * @throws org.apache.hadoop.hbase.TableExistsException if table already exists (If concurrent
   * threads, the table may have been created between test-for-existence and attempt-at-creation).
   * @throws IOException if a remote or network exception occurs
   */
  void createTable(HTableDescriptor desc) throws IOException;

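- Create a table from its descriptor plus a startKey, an endKey, and a region count, with key ranges assigned automatically

This creates the table from the table descriptor, a startKey, an endKey, and a number of regions. HBase creates that many regions and assigns a key range to each one, but the ranges are contiguous and evenly sized. If your business keys are dense in some ranges and sparse in others, the data will be skewed. In that case you must specify the region boundaries yourself, using the third or fourth method.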

  /**
   * Creates a new table with the specified number of regions.  The start key specified will become
   * the end key of the first region of the table, and the end key specified will become the start
   * key of the last region of the table (the first region has a null start key and the last region
   * has a null end key). BigInteger math will be used to divide the key range specified into enough
   * segments to make the required number of total regions. Synchronous operation.
   *
   * @param desc table descriptor for table
   * @param startKey beginning of key range
   * @param endKey end of key range
   * @param numRegions the total number of regions to create
   * @throws IllegalArgumentException if the table name is reserved
   * @throws MasterNotRunningException if master is not running
   * @throws org.apache.hadoop.hbase.TableExistsException if table already exists (If concurrent
   * threads, the table may have been created between test-for-existence and attempt-at-creation).
   * @throws IOException
   */
  void createTable(HTableDescriptor desc, byte[] startKey, byte[] endKey, int numRegions)
      throws IOException;
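The Javadoc above says the key range is divided with BigInteger math. The following plain-Java sketch shows the idea under stated assumptions; it is a simplified illustration, not HBase's own implementation. Because startKey ends the first region and endKey starts the last one, numRegions regions need numRegions - 1 split points, and the range between the two keys is cut into numRegions - 2 equal segments. The class name, the helpers, and the right-padding of shorter keys with zero bytes are all hypothetical choices.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EvenSplitSketch {
    // Returns numRegions - 1 split points: startKey, evenly spaced interior points, endKey.
    static List<byte[]> evenSplits(byte[] startKey, byte[] endKey, int numRegions) {
        if (numRegions < 3) {
            throw new IllegalArgumentException("need at least 3 regions");
        }
        int len = Math.max(startKey.length, endKey.length);
        // Right-pad with zero bytes so both keys have the same length, then read them
        // as unsigned integers.
        BigInteger start = new BigInteger(1, Arrays.copyOf(startKey, len));
        BigInteger end = new BigInteger(1, Arrays.copyOf(endKey, len));
        if (start.compareTo(end) >= 0) {
            throw new IllegalArgumentException("startKey must sort before endKey");
        }
        BigInteger range = end.subtract(start);
        BigInteger segments = BigInteger.valueOf(numRegions - 2);
        List<byte[]> splits = new ArrayList<>();
        for (int i = 0; i <= numRegions - 2; i++) {
            BigInteger offset = range.multiply(BigInteger.valueOf(i)).divide(segments);
            splits.add(toFixedLength(start.add(offset), len));
        }
        // Note: if the range is smaller than the number of segments, neighbouring
        // points can coincide; a real caller would have to reject that case.
        return splits;
    }

    // Converts a non-negative value back to exactly len bytes (big-endian).
    static byte[] toFixedLength(BigInteger value, int len) {
        byte[] raw = value.toByteArray(); // may carry a leading zero sign byte
        byte[] out = new byte[len];
        int copy = Math.min(raw.length, len);
        System.arraycopy(raw, raw.length - copy, out, len - copy, copy);
        return out;
    }

    public static void main(String[] args) {
        for (byte[] split : evenSplits(new byte[] {0x10}, new byte[] {0x50}, 6)) {
            System.out.println(String.format("0x%02x", split[0] & 0xff));
        }
    }
}
```

With a start key of 0x10, an end key of 0x50, and 6 regions, the sketch yields the 5 split points 0x10, 0x20, 0x30, 0x40, and 0x50.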

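- Create a table from its descriptor and custom split keys (synchronous)

This creates the table from the table descriptor and a custom set of split keys, so you choose the key range that each region covers. For example: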

byte[][] splitKeys = new byte[][] {
    Bytes.toBytes("10000"), Bytes.toBytes("20000"),
    Bytes.toBytes("30000"), Bytes.toBytes("40000") };

Passing the splitKeys above to this method creates 5 regions and assigns each one its key range. The first region has no startKey and the last region has no endKey:
Region 1: " to 10000"
Region 2: "10000 to 20000"
Region 3: "20000 to 30000"
Region 4: "30000 to 40000"
Region 5: "40000 to "

  /**
   * Creates a new table with an initial set of empty regions defined by the specified split keys.
   * The total number of regions created will be the number of split keys plus one. Synchronous
   * operation. Note : Avoid passing empty split key.
   *
   * @param desc table descriptor for table
   * @param splitKeys array of split keys for the initial regions of the table
   * @throws IllegalArgumentException if the table name is reserved, if the split keys are repeated
   * and if the split key has empty byte array.
   * @throws MasterNotRunningException if master is not running
   * @throws org.apache.hadoop.hbase.TableExistsException if table already exists (If concurrent
   * threads, the table may have been created between test-for-existence and attempt-at-creation).
   * @throws IOException
   */
  void createTable(final HTableDescriptor desc, byte[][] splitKeys) throws IOException;
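The Javadoc above rejects empty and repeated split keys and promises one more region than there are split keys. The plain-Java sketch below mirrors those rules for string keys and lists the resulting region ranges, so you can check a set of split keys before creating a table. The class and method names are hypothetical, and the sketch assumes ASCII keys, for which Java's string ordering matches byte ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RegionRangesSketch {
    // Returns one "start to end" entry per region, i.e. splitKeys.length + 1 entries.
    static List<String> regionRanges(String[] splitKeys) {
        String[] sorted = splitKeys.clone();
        Arrays.sort(sorted);
        List<String> ranges = new ArrayList<>();
        String previous = ""; // the first region has no start key
        for (String key : sorted) {
            if (key.isEmpty()) {
                throw new IllegalArgumentException("empty split key");
            }
            if (key.equals(previous)) {
                throw new IllegalArgumentException("duplicate split key: " + key);
            }
            ranges.add(previous + " to " + key);
            previous = key;
        }
        ranges.add(previous + " to "); // the last region has no end key
        return ranges;
    }

    public static void main(String[] args) {
        String[] splitKeys = {"10000", "20000", "30000", "40000"};
        for (String range : regionRanges(splitKeys)) {
            System.out.println("\"" + range + "\"");
        }
    }
}
```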

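- Create a table from its descriptor and custom split keys (asynchronous)

This is the same as the third method above, except that it runs asynchronously.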

  /**
   * Creates a new table but does not block and wait for it to come online. Asynchronous operation.
   * To check if the table exists, use {@link #isTableAvailable} -- it is not safe to create an
   * HTable instance to this table before it is available. Note : Avoid passing empty split key.
   *
   * @param desc table descriptor for table
   * @throws IllegalArgumentException Bad table name, if the split keys are repeated and if the
   * split key has empty byte array.
   * @throws MasterNotRunningException if master is not running
   * @throws org.apache.hadoop.hbase.TableExistsException if table already exists (If concurrent
   * threads, the table may have been created between test-for-existence and attempt-at-creation).
   * @throws IOException
   */
  void createTableAsync(final HTableDescriptor desc, final byte[][] splitKeys) throws IOException;
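Because the asynchronous call returns before the table is online, the caller has to wait until the table is available before using it. The sketch below shows one possible polling loop in plain Java. The BooleanSupplier is a stand-in for a call such as the isTableAvailable method mentioned in the Javadoc above; the class name, the timeout, and the polling interval are assumptions for illustration.

```java
import java.util.function.BooleanSupplier;

final class WaitForTableSketch {
    // Polls the check until it returns true or the timeout elapses.
    // Returns true if the check succeeded, false on timeout.
    static boolean waitUntil(BooleanSupplier available, long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (available.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(pollMillis);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated table that comes online on the third check.
        int[] checks = {0};
        boolean online = waitUntil(() -> ++checks[0] >= 3, 5_000, 100);
        System.out.println("online=" + online + " after " + checks[0] + " checks");
    }
}
```

In real code the supplier would wrap the availability check of the Admin instance, with its checked IOException handled inside the lambda.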
