The Right Way to Connect to HBase

What Is a Connection?

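Among HBase users, the following mistakes in using Connection are common:

(1) Implementing a custom pool of Connection objects and taking one from the pool for every operation;
(2) Creating one Connection object per thread;
(3) Creating a temporary Connection for every HBase access and calling close when done.

These practices suggest that users treat the Connection object like a connection in a single-node database. HBase, however, is a distributed database: the client must connect to several servers in different service roles, so a Connection in the HBase client does not simply correspond to one socket connection. The HBase API documentation defines Connection as:
A cluster connection encapsulating lower level individual connections to actual servers and a connection to zookeeper.
Accessing a row in HBase involves connecting to three different service roles:

(1) ZooKeeper
(2) HBase Master
(3) HBase RegionServer

The Connection in the HBase client encapsulates the socket connections to all three. The figure below shows how a Connection object maps to the actual socket connections: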

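In the HBase client code, the object that actually corresponds to a socket connection is RpcConnection. HBase stores the client's connections to HBase servers in a data structure called PoolMap. PoolMap wraps a ConcurrentHashMap whose key is a ConnectionId (which encapsulates the server address and the user ticket) and whose value is a pool of RpcConnection objects. When HBase needs to connect to a server, it first looks up the pool for the ConnectionId and then takes a connection object from that pool. HBase provides three pool implementations: Reusable, RoundRobin, and ThreadLocal.

The implementation is selected with the hbase.client.ipc.pool.type configuration key, which defaults to Reusable. The pool size is set with hbase.client.ipc.pool.size, which defaults to 1. Defaults can differ between HBase versions, so check the documentation for the version you run.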
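To make the lookup path concrete, here is a minimal JDK-only sketch of a map of per-server pools with a round-robin policy. It is an illustration under stated assumptions, not HBase's PoolMap: the class name PoolMapSketch, its methods, and the use of a plain key in place of ConnectionId are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical stand-in for a per-server connection pool map; not HBase's PoolMap. */
final class PoolMapSketch<K, V> {

    /** Round-robin pool: creates entries lazily up to maxSize, then cycles through them. */
    private static final class RoundRobinPool<T> {
        private final List<T> items = new ArrayList<>();
        private final int maxSize;
        private int next = 0;

        RoundRobinPool(int maxSize) {
            this.maxSize = maxSize;
        }

        synchronized T getOrCreate(Supplier<T> factory) {
            if (items.size() < maxSize) {
                T created = factory.get();
                items.add(created);
                return created;
            }
            T item = items.get(next);
            next = (next + 1) % items.size();
            return item;
        }
    }

    // One pool per key; the key plays the role of ConnectionId (server address + user).
    private final Map<K, RoundRobinPool<V>> pools = new ConcurrentHashMap<>();
    private final int poolSize;

    PoolMapSketch(int poolSize) {
        this.poolSize = poolSize;
    }

    /** Finds the pool for the key, then takes a connection from that pool. */
    V getOrCreate(K key, Supplier<V> factory) {
        return pools.computeIfAbsent(key, k -> new RoundRobinPool<>(poolSize)).getOrCreate(factory);
    }
}
```

With a pool size of 1, every caller that targets the same server gets the same connection object, which matches the "one connection per RegionServer" default described in this article.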

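The Correct Way to Connect to HBase

The Connection class in HBase already manages connections, so there is no need for extra management on top of it. Moreover, Connection is thread-safe while Table and Admin are not. The correct approach is therefore to share one Connection object across the whole process and to use separate Table and Admin objects in each thread.

All threads in the process share one Connection object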

connection = ConnectionFactory.createConnection(config);

Each thread uses its own Table object

Table table = connection.getTable(TableName.valueOf("test"));
try {
    ...
} finally {
    table.close();
}
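The sharing pattern can be shown without an HBase cluster. The sketch below uses hypothetical stand-in classes (HeavyConnection for Connection, LightTable for Table) to show one lazily created, process-wide object handing out cheap per-thread handles. It only models the ownership pattern and makes no claim about HBase internals.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Stand-ins for a heavyweight shared connection and lightweight per-thread handles. */
final class SharedConnectionSketch {
    static final AtomicInteger CONNECTIONS_CREATED = new AtomicInteger();

    /** Hypothetical heavyweight, thread-safe object playing the role of Connection. */
    static final class HeavyConnection {
        HeavyConnection() {
            CONNECTIONS_CREATED.incrementAndGet();
        }

        /** Cheap to create; every caller gets its own handle, like Table. */
        LightTable getTable(String name) {
            return new LightTable(name);
        }
    }

    /** Hypothetical lightweight handle that must not be shared between threads. */
    static final class LightTable implements AutoCloseable {
        final String name;

        LightTable(String name) {
            this.name = name;
        }

        @Override
        public void close() {
            // Releases only per-handle state; the shared connection stays open.
        }
    }

    /** Initialization-on-demand holder: the JVM creates INSTANCE exactly once. */
    private static final class Holder {
        static final HeavyConnection INSTANCE = new HeavyConnection();
    }

    static HeavyConnection connection() {
        return Holder.INSTANCE;
    }

    /** Starts worker threads that each open and close their own handle. */
    static int runWorkers(int threads) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try (LightTable table = connection().getTable("test")) {
                    // Per-thread reads and writes would go here.
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return CONNECTIONS_CREATED.get();
    }
}
```

However many workers run, the counter of created connections stays at one, while each worker opens and closes its own handle.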

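By default the HBase client's connection pool size is 1, that is, one connection per RegionServer. If the application needs a larger pool or a different pool type, this can be changed through configuration: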

config.set("hbase.client.ipc.pool.type",...);
config.set("hbase.client.ipc.pool.size",...);
connection = ConnectionFactory.createConnection(config);

Create a connection

Connection conn = ConnectionFactory.createConnection(conf);

Get a DDL handle: the table manager, Admin

Admin admin = conn.getAdmin();

Use the Admin API to create tables, drop tables, and modify table definitions

admin.createTable(HTableDescriptor descriptor);

public class HbaseApiDemo {
    @Test
    public void testCreateTable() throws Exception {
        // Create the HBase configuration object
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "cts02:2181,cts03:2181,cts04:2181");
        // Create the HBase connection object
        Connection conn = ConnectionFactory.createConnection(conf);
        // DDL tool
        Admin admin = conn.getAdmin();
        // Create a table descriptor
        HTableDescriptor tUser = new HTableDescriptor(TableName.valueOf("t_user"));
        // Build the column family descriptors
        HColumnDescriptor f1 = new HColumnDescriptor("f1");
        HColumnDescriptor f2 = new HColumnDescriptor("f2");
        // Add the column families to the table descriptor
        tUser.addFamily(f1);
        tUser.addFamily(f2);
        // Call the Admin create method to create the table
        admin.createTable(tUser);
        // Close the connection
        admin.close();
        conn.close();
    }

Modifying a table definition

    public void testAlterTable() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "cts02:2181,cts03:2181,cts04:2181");
        Connection conn = ConnectionFactory.createConnection(conf);
        // DDL tool
        Admin admin = conn.getAdmin();
        // Fetch the current descriptor and add a new column family to it
        HTableDescriptor user = admin.getTableDescriptor(TableName.valueOf("t_user"));
        HColumnDescriptor f3 = new HColumnDescriptor("f3");
        f3.setMaxVersions(3);
        user.addFamily(f3);
        admin.modifyTable(TableName.valueOf("t_user"), user);
        admin.close();
        conn.close();
    }

Dropping a table

    public void testDropTable() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "cts02:2181,cts03:2181,cts04:2181");
        Connection conn = ConnectionFactory.createConnection(conf);
        // DDL tool
        Admin admin = conn.getAdmin();
        // Disable the table first
        admin.disableTable(TableName.valueOf("t_user"));
        // Then delete it
        admin.deleteTable(TableName.valueOf("t_user"));
        admin.close();
        conn.close();
    }

Insert | Update

    public void testPut() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "cts02:2181,cts03:2181,cts04:2181");
        Connection conn = ConnectionFactory.createConnection(conf);
        // Get the table object
        Table table = conn.getTable(TableName.valueOf("t_user"));
        // Create a Put object; one Put operates on one row, identified by its rowkey
        Put put1 = new Put("001".getBytes());
        // Add a column, specifying the column family and the column name
        put1.addColumn("f1".getBytes(), "name".getBytes(), "张三".getBytes());
        put1.addColumn("f1".getBytes(), Bytes.toBytes("age"), Bytes.toBytes(28));
        Put put2 = new Put("002".getBytes());
        // Add a column
        put2.addColumn("f1".getBytes(), "name".getBytes(), "李四".getBytes());
        put2.addColumn("f1".getBytes(), Bytes.toBytes("age"), Bytes.toBytes(38));
        ArrayList<Put> puts = new ArrayList<>();
        puts.add(put1);
        puts.add(put2);
        // Submit both rows in one call
        table.put(puts);
        table.close();
        conn.close();
    }
} // end of HbaseApiDemo

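Ways to Write Data to HBase

Next, let us summarize the common ways of writing to HBase and the scenarios they suit, covering the cases that come up in day-to-day work, along with some of the underlying principles. A typical HBase insert uses an HTable object and wraps the data in a Put object. The rowkey is passed in when the Put is constructed, and the column family, column name, and column value are then added to it. HTable's put method submits the data to the RegionServer through an RPC request. The write paths fall into the following categories:

Single put
Batch put
Bulk load

Introduction to HTable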

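Writing to HBase inevitably involves HTable. An HTable reads from or writes to a single HBase table, and HTable objects are not thread-safe, so take care when using them from multiple threads. The table name must be supplied when an HTable is created. Internally, HTable holds a LinkedList<Row> queue named writeAsyncBuffer that buffers writes on the client side. Buffering is enabled with table.setAutoFlushTo(false). By default it is disabled, and every put causes the HTable to call flushCommits and submit to the RegionServer. With buffering enabled, the client compares the buffered size against a threshold and calls flushCommits once the threshold is exceeded. The threshold defaults to 2 MB (2097152 bytes) and can be changed through the "hbase.client.write.buffer" parameter in hbase-site.xml. Closing the HTable implicitly calls flushCommits so that all data is submitted. On submit, the client locates the RegionServer for each put by its rowkey and then sends the actions grouped per RegionServer.

Note: since BufferedMutator was introduced, it has taken over the role of HTable's setAutoFlush(false). A BufferedMutator instance is obtained from a Connection instance, and its close() method must be called when you are done with it. BufferedMutator is configured through BufferedMutatorParams. BufferedMutator is more efficient than HTable for buffered writes, so prefer it when inserting data into HBase.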
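The flush rule described above can be modeled in a few lines. The following JDK-only sketch is a simplified model, not HTable's code: the class name WriteBufferSketch and its methods are hypothetical, and a flush merely counts where a real client would send RPCs.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of a client-side write buffer. */
final class WriteBufferSketch implements AutoCloseable {
    private final boolean autoFlush;
    private final long flushThresholdBytes;
    private final List<byte[]> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private int flushCount = 0;

    WriteBufferSketch(boolean autoFlush, long flushThresholdBytes) {
        this.autoFlush = autoFlush;
        this.flushThresholdBytes = flushThresholdBytes;
    }

    /** Buffers one mutation; flushes at once in auto-flush mode or when over the threshold. */
    void put(byte[] mutation) {
        buffer.add(mutation);
        bufferedBytes += mutation.length;
        if (autoFlush || bufferedBytes > flushThresholdBytes) {
            flush();
        }
    }

    /** A real client would send the buffered mutations to the servers here. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        buffer.clear();
        bufferedBytes = 0;
        flushCount++;
    }

    /** Closing flushes whatever is still buffered, so no write is lost. */
    @Override
    public void close() {
        flush();
    }

    int flushCount() {
        return flushCount;
    }
}
```

In auto-flush mode every put costs one flush. With buffering and a 2 MB threshold, four 1 MB writes followed by close() cost two flushes: one when the third write pushes the buffer past the threshold, and one on close.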

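Single put

The simplest, most basic way to write to HBase. A typical scenario is a running online service that inserts single records, such as message logs or processing records, and releases the HTable object after writing. Every submit is one RPC request.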

table.setAutoFlushTo(true);
/**
 * Inserts one record:
 * rowkey rk001, column family f1,
 * two columns: c1 with value 001
 *              c2 with value 002
 */
public void insertPut() {
    // table, tableName and cf are assumed to be fields of the enclosing class (not shown).
    // HBaseConfiguration.create() creates a Configuration and then calls addResource
    // to load the hbase-site.xml configuration file.
    Configuration conf = HBaseConfiguration.create();
    try {
        // Old-style API: new HTable(...) and Put.add(...) are deprecated since HBase 1.0
        table = new HTable(conf, tableName);
        table.setAutoFlushTo(true); // true is the default when not set explicitly
        String rowkey = "rk001";
        Put put = new Put(rowkey.getBytes());
        put.add(cf.getBytes(), "c1".getBytes(), "001".getBytes());
        put.add(cf.getBytes(), "c2".getBytes(), "002".getBytes());
        table.put(put);
        table.close(); // close the HBase table
    } catch (IOException e) {
        e.printStackTrace();
    }
}

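Batch put

A single put is inefficient because it submits only one record at a time. Is there a way to submit several records at once and reduce the number of requests? The simplest is to use a List<Put>. It works just like a single put: add the Put objects to a list and call put(List<Put>). The process is essentially the same as for a single put. It suits scenarios with somewhat larger data volumes, where batching reduces the number of requests.

  /**
  * Batch request: submits two rows at once
  */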

 public void insertPuts() {
     Configuration conf = HBaseConfiguration.create();
     try {
         table = new HTable(conf, tableName);
         table.setAutoFlushTo(true);
         List<Put> lists = new ArrayList<Put>();
         String rowkey1 = "rk001";
         Put put1 = new Put(rowkey1.getBytes());
         put1.add(cf.getBytes(), "c1".getBytes(), "001".getBytes());
         put1.add(cf.getBytes(), "c2".getBytes(), "002".getBytes());
         lists.add(put1);
         String rowkey2 = "rk002";
         Put put2 = new Put(rowkey2.getBytes());
         put2.add(cf.getBytes(), "c1".getBytes(), "v2001".getBytes());
         put2.add(cf.getBytes(), "c2".getBytes(), "v2002".getBytes());
         lists.add(put2);
         table.put(lists);
         table.close();
     } catch (IOException e) {
         e.printStackTrace();
     }
 }
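As noted in the HTable introduction, a batch is not sent as one request: the client locates the RegionServer for each rowkey and sends one group per server. The JDK-only sketch below illustrates that grouping step. It is a hypothetical model: regions are represented by their start keys in a TreeMap, and rowkeys are compared as ASCII strings, whereas HBase compares raw bytes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of locating regions by rowkey and grouping a batch per server. */
final class GroupByServerSketch {
    // Region start key -> hosting server. The first region must start at the empty key.
    private final TreeMap<String, String> regionStartToServer = new TreeMap<>();

    void addRegion(String startKey, String server) {
        regionStartToServer.put(startKey, server);
    }

    /** The region holding a row is the one with the greatest start key <= the rowkey. */
    String locate(String rowKey) {
        return regionStartToServer.floorEntry(rowKey).getValue();
    }

    /** Splits a batch of rowkeys into one group per server, preserving input order. */
    Map<String, List<String>> group(List<String> rowKeys) {
        Map<String, List<String>> byServer = new TreeMap<>();
        for (String rowKey : rowKeys) {
            byServer.computeIfAbsent(locate(rowKey), s -> new ArrayList<>()).add(rowKey);
        }
        return byServer;
    }
}
```

A batch of four rows spread over two servers therefore results in two groups, one per server, rather than four separate requests.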

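Using BufferedMutatorParams

org.apache.hadoop.hbase.client.HTable ultimately holds a field named mutator of type BufferedMutatorImpl, and all subsequent write operations go through that mutator. We can therefore drop the HTable object entirely, build a BufferedMutator directly, and operate on HBase through it. BufferedMutatorParams collects the parameters needed to construct a BufferedMutator: the HBase table name, the client write buffer size, the maximum key-value size, the thread pool, and a callback listener for HBase operations (for example, to be notified of failed writes).

Usage:

As the parameters of BufferedMutatorParams suggest, we supply the table name, set the buffer size, initialize the mutator instance, and then submit Put objects to insert data into HBase.

Example: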

      // One Put object is one row; the row key is given in the constructor
      val put = new Put(Bytes.toBytes(MD5Util.getMD5(userId + userName)))
      put.addColumn(Bytes.toBytes("hiveData"), Bytes.toBytes("id"), Bytes.toBytes(userId))
        .addColumn(Bytes.toBytes("hiveData"), Bytes.toBytes("name"), Bytes.toBytes(userName))
        .addColumn(Bytes.toBytes("hiveData"), Bytes.toBytes("money"), Bytes.toBytes(userMoney))
      putList.add(put)
      // Use a 1 MB buffer; once it reaches 1 MB the data is flushed to HBase automatically
      val params = new BufferedMutatorParams(TableName.valueOf("test6"))
      params.writeBufferSize(1024 * 1024) // set the buffer size
      val mutator = connection.getBufferedMutator(params)
      mutator.mutate(putList)
      mutator.flush()
      putList.clear()

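Writing to HBase from Spark Streaming

How does Spark Streaming write data to HBase? First, a word about the following method.
foreachRDD(func)
The most generic output operator: it applies func to every RDD generated from the stream.
Note: this function is executed in the driver process running the streaming application.
Let us now walk through how to write data to HBase cleanly.
A common mistake when writing to external systems:
Writing to an external database usually means creating a connection and using it to send (that is, save) the data.
A developer may create the connection in the driver and then try to use it to save data in a Spark worker.
For example: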

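dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed in the driver
  rdd.foreach { record =>
    connection.send(record) // executed in the worker
  }
}

This is wrong! It requires the connection object to be serialized and sent from the driver to the worker.
Such connection objects are rarely transferable across machines. Knowing this, we can revise the code as follows: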

dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  }
}

This is not right either. It creates a connection for every record, and creating a connection is expensive.
The following approach is better:

 dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}

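The approach above uses rdd.foreachPartition to create a single connection object that is shared by all records in one RDD partition.
Better still, connection objects can be reused across RDDs, so a connection pool can be used, as follows:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}

Connections in the pool should be created lazily on demand and timed out (that is, closed) if they are not used for a while.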
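A pool with these two properties, lazy creation and idle timeout, might look like the JDK-only sketch below. All names (IdleTimeoutPoolSketch, borrow, giveBack) are hypothetical, and the clock is injected so the timeout can be exercised without sleeping. For HBase specifically, the shared Connection described earlier already fills this role, so a sketch like this applies to generic external connections.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical lazily filled pool that closes connections left idle for too long. */
final class IdleTimeoutPoolSketch<C extends AutoCloseable> {

    private static final class IdleEntry<T> {
        final T connection;
        final long idleSinceMillis;

        IdleEntry(T connection, long idleSinceMillis) {
            this.connection = connection;
            this.idleSinceMillis = idleSinceMillis;
        }
    }

    // Head holds the most recently returned connection, tail the longest idle one.
    private final Deque<IdleEntry<C>> idle = new ArrayDeque<>();
    private final Supplier<C> factory;
    private final LongSupplier clockMillis;
    private final long idleTimeoutMillis;

    IdleTimeoutPoolSketch(Supplier<C> factory, LongSupplier clockMillis, long idleTimeoutMillis) {
        this.factory = factory;
        this.clockMillis = clockMillis;
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    /** Reuses an idle connection if one is fresh enough; otherwise creates one on demand. */
    synchronized C borrow() {
        evictExpired();
        IdleEntry<C> entry = idle.pollFirst();
        return entry != null ? entry.connection : factory.get();
    }

    /** Returns a connection to the pool and records when it became idle. */
    synchronized void giveBack(C connection) {
        idle.addFirst(new IdleEntry<>(connection, clockMillis.getAsLong()));
    }

    /** Closes every connection that has been idle for at least the timeout. */
    private void evictExpired() {
        long now = clockMillis.getAsLong();
        while (!idle.isEmpty() && now - idle.peekLast().idleSinceMillis >= idleTimeoutMillis) {
            try {
                idle.pollLast().connection.close();
            } catch (Exception e) {
                // Best effort: a failed close must not block other borrowers.
            }
        }
    }
}
```

Eviction runs only when borrow() is called, which keeps the sketch free of background threads. A production pool would usually also cap its size and validate connections before reuse.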

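Practical Development Conventions

In real projects, HBase operations are usually wrapped in a dedicated HBase utility class to simplify later use.

HBase utility class example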

package com.util.hadoop
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase._
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.mutable
object HbaseUtil {
  var conf: Configuration = _
  // Shared connection: heavyweight and thread-safe, created lazily on first use
  lazy val connection: Connection = ConnectionFactory.createConnection(conf)
  lazy val admin: Admin = connection.getAdmin
  /**
    * Sets the HBase configuration
    * @param quorum ZooKeeper quorum address of HBase
    * @param port   ZooKeeper port, e.g. 2181
    */
  def setConf(quorum: String, port: String): Unit = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", quorum)
    conf.set("hbase.zookeeper.property.clientPort", port)
    this.conf = conf
  }
  /**
    * Creates the table if it does not exist
    * @param tableName namespace:table name
    * @param columnFamily column family
    */
  def createTable(tableName: String, columnFamily: String): Unit = {
    val tbName = TableName.valueOf(tableName)
    if (!admin.tableExists(tbName)) {
      val htableDescriptor = new HTableDescriptor(tbName)
      val hcolumnDescriptor = new HColumnDescriptor(columnFamily)
      htableDescriptor.addFamily(hcolumnDescriptor)
      admin.createTable(htableDescriptor)
    }
  }
  def hbaseScan(tableName: String): ResultScanner = {
    val scan = new Scan()
    val table = connection.getTable(TableName.valueOf(tableName))
    table.getScanner(scan)
//    val scanner: CellScanner = rs.next().cellScanner()
  }
  /**
    * Gets the cells of one row
    * @param tableName namespace:table name
    * @param rowKey rowkey
    * @return the cells of the row as a Buffer
    */
  def getCell(tableName: String, rowKey: String): mutable.Buffer[Cell] = {
    val get = new Get(Bytes.toBytes(rowKey))
    /*if (qualifier == "") {
      get.addFamily(family.getBytes())
    } else {
      get.addColumn(family.getBytes(), qualifier.getBytes())
    }*/
    val table = connection.getTable(TableName.valueOf(tableName))
    try {
      val result: Result = table.get(get)
      import scala.collection.JavaConverters._
      result.listCells().asScala
    } finally {
      table.close() // Table is lightweight; close it after each use
    }
    /*.foreach(cell=>{
    val rowKey=Bytes.toString(CellUtil.cloneRow(cell))
    val timestamp = cell.getTimestamp;  // timestamp
    val family = Bytes.toString(CellUtil.cloneFamily(cell))  // column family
    val qualifier  = Bytes.toString(CellUtil.cloneQualifier(cell))  // column qualifier
    val value = Bytes.toString(CellUtil.cloneValue(cell))
    println(rowKey,timestamp,family,qualifier,value)
  })*/
  }
  /**
    * Inserts a single cell
    * @param tableName namespace:table name
    * @param rowKey rowkey
    * @param family column family
    * @param qualifier column qualifier
    * @param value column value
    */
  def singlePut(tableName: String, rowKey: String, family: String, qualifier: String, value: String): Unit = {
    // Build a Put for the given row key, e.g. row01
    val put: Put = new Put(Bytes.toBytes(rowKey))
    put.addColumn(Bytes.toBytes(family), Bytes.toBytes(qualifier), Bytes.toBytes(value))
    // Get the table object
    val table: Table = connection.getTable(TableName.valueOf(tableName))
    try {
      table.put(put)
    } finally {
      table.close()
    }
  }
  /**
    * Deletes a row
    * @param tbName table name
    * @param row rowkey
    */
  def deleteByRow(tbName:String,row:String): Unit ={
    val delete = new Delete(Bytes.toBytes(row))
//    delete.addColumn(Bytes.toBytes("fm2"), Bytes.toBytes("col2"))
    val table = connection.getTable(TableName.valueOf(tbName))
    try {
      table.delete(delete)
    } finally {
      table.close() // release the table handle, as in singlePut
    }
  }
  def close(): Unit = {
    admin.close()
    connection.close()
  }
  def main(args: Array[String]): Unit = {
    setConf("ip", "2181")
    /*singlePut("kafka_offset:topic_offset_range", "gid_topic_name", "info", "partition0", "200")
    singlePut("kafka_offset:topic_offset_range", "gid_topic_name", "info", "partition1", "300")
    singlePut("kafka_offset:topic_offset_range", "gid_topic_name", "info", "partition2", "100")*/
    val cells = getCell("kafka_offset:grampus_double_groupid", "grampus_erp")
    cells.foreach(cell => {
      val rowKey = Bytes.toString(CellUtil.cloneRow(cell))
      val timestamp = cell.getTimestamp; // timestamp
      val family = Bytes.toString(CellUtil.cloneFamily(cell)) // column family
      val qualifier = Bytes.toString(CellUtil.cloneQualifier(cell)) // column qualifier
      val value = Bytes.toString(CellUtil.cloneValue(cell))
      println(rowKey, timestamp, family, qualifier, value)
    })
    /*val topics=List("")
    val resultScanner: ResultScanner = hbaseScan("kafka_offset:topic_offset_range")
    resultScanner.asScala.foreach(rs=>{
      val cells = rs.listCells()
      cells.asScala.foreach(cell => {
        val rowKey = Bytes.toString(CellUtil.cloneRow(cell))
        val qualifier = Bytes.toString(CellUtil.cloneQualifier(cell)) // column qualifier
        val value = Bytes.toString(CellUtil.cloneValue(cell))
      })
    })*/
//    deleteByRow("bi_odi:redis_ip","900150983cd24fb0d6963f7d28e17f72")
    this.close()
  }
}

Importing Hive Data into HBase

All HBase operations here go through the utility class above, and the data is written to HBase using BufferedMutatorParams.

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{BufferedMutatorParams, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object Hive2Hbase {
  def main(args: Array[String]): Unit = {
    val session: SparkSession = SparkSession.builder()
      .appName("Hive2Hbase")
      .enableHiveSupport()
      .config("spark.serializer","org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()
    // Run the query
    print("====================   job started       ========================")
    val hiveData: DataFrame = session.sql("select * from t_order")
    // Driver side: make sure the target table exists
    HbaseUtil.setConf("ip", "2181")
    HbaseUtil.createTable("test6", "hiveData")
    val hiveRdd: RDD[Row] = hiveData.rdd
    // An RDD has no foreachRDD (that is a DStream method); iterate over its partitions directly
    hiveRdd.foreachPartition { rows =>
      // Executor side: HbaseUtil is a per-JVM singleton, so each executor creates its connection once
      HbaseUtil.setConf("ip", "2181")
      val connection = HbaseUtil.connection
      // Use a 1 MB buffer; once it reaches 1 MB the data is flushed to HBase automatically
      val params = new BufferedMutatorParams(TableName.valueOf("test6"))
      params.writeBufferSize(1024 * 1024) // set the buffer size
      val mutator = connection.getBufferedMutator(params)
      try {
        rows.foreach { x =>
          // Read the user fields
          val userId = x.getAs[String]("id")
          val userName = x.getAs[String]("inamed")
          val userMoney = x.getAs[Double]("money")
          // One Put object is one row; the row key is given in the constructor
          // (MD5Util is the author's own helper and is not shown here)
          val put = new Put(Bytes.toBytes(MD5Util.getMD5(userId + userName)))
          put.addColumn(Bytes.toBytes("hiveData"), Bytes.toBytes("id"), Bytes.toBytes(userId))
            .addColumn(Bytes.toBytes("hiveData"), Bytes.toBytes("name"), Bytes.toBytes(userName))
            .addColumn(Bytes.toBytes("hiveData"), Bytes.toBytes("money"), Bytes.toBytes(userMoney))
          // The mutator buffers the Put and flushes when the buffer fills up
          mutator.mutate(put)
        }
        mutator.flush()
      } finally {
        mutator.close()
      }
    }
    session.stop()
    HbaseUtil.close()
    println("======================  job finished  ============")
  }

}
