Reading and analyzing data in HBase tables is a common requirement, and a good exercise for learning the HBase API.

Table of Contents

1. Important concepts of the HBase API
    1. Scan
    2. Result
    3. Scan with Filters
2. Case analysis: HBase API learning
    1. Requirement
    2. Approach
    3. Code


1. Important concepts of the HBase API

1. Scan

A data table in HBase is sharded by splitting it into regions. Each region is responsible for a range of RowKeys, and the data within each region is organized in lexicographic RowKey order.

Based on this design, HBase can easily serve queries of the form "given a RowKey range, return all records in that range". This type of query is called a Scan in HBase; if no range is specified, a full table scan is performed. Each scan request is an RPC call that returns a batch of results to the client. The basic usage is the four steps below (a minimal Scala sketch follows the list).

1. Build Scan, specify startRow and stopRow, if not specified, a full table scan will be performed

2. Get ResultScanner

3. Traverse the query results

4. Close ResultScanner
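
A minimal Scala sketch of these four steps; the table name and rowkey range are hypothetical:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{ConnectionFactory, ResultScanner, Scan}
import org.apache.hadoop.hbase.util.Bytes

import scala.collection.JavaConversions._

object ScanSketch {
  def main(args: Array[String]): Unit = {
    // 1. Build the Scan and (optionally) restrict the RowKey range
    val scan = new Scan()
      .withStartRow(Bytes.toBytes("row-0001")) // hypothetical start rowkey (inclusive)
      .withStopRow(Bytes.toBytes("row-0100"))  // hypothetical stop rowkey (exclusive)

    val conn = ConnectionFactory.createConnection()           // reads hbase-site.xml from the classpath
    val table = conn.getTable(TableName.valueOf("demoTable")) // hypothetical table name

    // 2. Get the ResultScanner
    val scanner: ResultScanner = table.getScanner(scan)
    try {
      // 3. Traverse the query results
      for (result <- scanner) {
        println(Bytes.toString(result.getRow))
      }
    } finally {
      // 4. Close the ResultScanner (and release the table and connection)
      scanner.close()
      table.close()
      conn.close()
    }
  }
}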

 2. Result

Each row matched by a Scan is encapsulated in a Result object and returned to the client. A Result holds all the cells of one row and exposes them through methods such as getRow, getValue and listCells.
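
For illustration, a small sketch of pulling values out of a Result; the column family "d" and qualifier "c" mirror the case below and are assumptions here:

import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes

import scala.collection.JavaConversions._

object ResultSketch {
  // Print one fixed column, then walk every cell of the row
  def dumpResult(result: Result): Unit = {
    // Fixed column: family "d", qualifier "c" (assumed names)
    val channel = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("c")))
    println(s"rowkey=${Bytes.toString(result.getRow)} channel=$channel")
    // Dynamic columns: iterate over all cells and print qualifier -> value pairs
    for (cell <- result.listCells()) {
      val qualifier = Bytes.toString(CellUtil.cloneQualifier(cell))
      val value = Bytes.toString(CellUtil.cloneValue(cell))
      println(s"  $qualifier -> $value")
    }
  }
}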

3. Scan with Filters

A Filter adds further conditions on top of a Scan's result set. The conditions can involve the RowKey, the column name or the column value, and multiple Filter conditions can be combined, usually through a FilterList; stacking many filters this way is generally not recommended, though, since it carries the risk of rows being missed from the result.
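
A hedged sketch of combining conditions with a FilterList; the rowkey prefix, column family, qualifier and value below are assumptions made for illustration:

import org.apache.hadoop.hbase.CompareOperator
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.{BinaryComparator, FilterList, PrefixFilter, SingleColumnValueFilter}
import org.apache.hadoop.hbase.util.Bytes

object FilterSketch {
  def buildScan(): Scan = {
    // Keep only rows whose key starts with a given prefix AND whose d:dt column equals "1"
    val prefixFilter = new PrefixFilter(Bytes.toBytes("host01#")) // hypothetical rowkey prefix
    val valueFilter = new SingleColumnValueFilter(
      Bytes.toBytes("d"), Bytes.toBytes("dt"),                    // hypothetical family / qualifier
      CompareOperator.EQUAL,
      new BinaryComparator(Bytes.toBytes("1")))
    // Rows that do not contain d:dt at all would otherwise pass this filter
    valueFilter.setFilterIfMissing(true)

    val filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL, prefixFilter, valueFilter)
    val scan = new Scan()
    scan.setFilter(filterList)
    scan
  }
}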

  • Each time the client sends a scan request to the RegionServer, it gets back a batch of data (the number of Results per round trip is controlled by the caching setting), which is placed into the client-side Result cache
  • Each time the application reads data, it is served from the local Result cache; once that cache has been drained, the client sends another scan request to the RegionServer to fetch more data (see the sketch after this list)
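
A short sketch of tuning that behaviour on the Scan; the numbers are illustrative assumptions, not recommendations:

import org.apache.hadoop.hbase.client.Scan

object CachingSketch {
  def tunedScan(): Scan = {
    val scan = new Scan()
    // Number of Results fetched per RPC and kept in the client-side Result cache
    scan.setCaching(200)
    // Optional: cap the number of cells returned per Result, useful for very wide rows
    scan.setBatch(100)
    // A one-off analytical scan usually should not pollute the RegionServer block cache
    scan.setCacheBlocks(false)
    scan
  }
}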

2. Case analysis: HBase API learning

1. Requirement

Analyze the data in the table below. Data is reported periodically at minute granularity; the value stored under each minute column (for example 2#1#0) is the data to be parsed and summarized. The goal is to summarize the data for the day 20200706, reading incrementally: after each analysis run a timestamp is maintained in the LastJobTime table, and only rows whose modifyTime is greater than that recorded timestamp are read as new data.

2. Approach

  • Traverse the returned Results and take the fixed fields directly out of each Result by column family and column, such as the c, ci and ct fields above
  • The dynamic fields are d:1300, d:1305 and so on, which are written to the table as time advances. Traverse the cells of each Result, take the qualifier from each cell, and parse every qualifier whose length is 4 or more; this yields the values written dynamically under the time columns

   If you have read this far, it is enough to understand the approach; there is no need to dig into the business scenario above.

 3. Code

package com.kangll.hbaseapi

import java.util
import com.winner.utils.KerberosUtil
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.{Cell, CellUtil, CompareOperator, HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Get, Result, Scan, Table}
import org.apache.hadoop.hbase.filter.{SingleColumnValueFilter, SubstringComparator}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

import scala.collection.mutable.ListBuffer

/** ******************************************
 *
 * @AUTHOR kangll
 * @DATE 2020/8/11 14:47
 * @DESC:
 * *******************************************
 */
// Case class that encapsulates the data parsed from the table
case class InOutDataHBaseTest(rowkey: String, channel: String, counterid: String, countertype: String, devicesn: String,
                              datatype: String, hostname: String, modifytime: String, datatime: String, inNum: Int, outNum: Int)

object HBaseAPI_Test_One {

  // Kerberos authentication
  KerberosUtil.kerberosAuth()

  private val spark: SparkSession = SparkSession
    .builder()
    .master("local[2]")
    .appName("spark-hbase-read")
    .getOrCreate()

  private val sc: SparkContext = spark.sparkContext

  private val hbaseConf: Configuration = HBaseConfiguration.create()
  hbaseConf.set("hbase.zookeeper.quorum", "hdp301")
  hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")

  def getOriginalData() = {
    import spark.implicits._
    import collection.mutable._

    // HBase source data table
    val HBASE_TAG_TABLE = "trafficData"
    // Table holding the maintained timestamp used for incremental reads
    val HBASE_LAST_JOBTIME = "LastJobTime"
    // Create the connection object using the configuration above
    val conn: Connection = ConnectionFactory.createConnection(hbaseConf)
    val tag_table: Table = conn.getTable(TableName.valueOf(HBASE_TAG_TABLE))
    val time_table: Table = conn.getTable(TableName.valueOf(HBASE_LAST_JOBTIME))

    // Look up lastJobTime in the HBase table by rowkey
    val get = new Get("TrafficDateTime".getBytes())
    val mdResult: Result = time_table.get(get)
    // The Get returns the saved timestamp directly
    val modifyTime: String = Bytes.toString(mdResult.getValue(Bytes.toBytes("t"), Bytes.toBytes("m")))

    // Scan the original data
    val scan = new Scan()
    // Single-column value filter: read and parse incrementally when modifyTime in the table is greater than the saved timestamp
    val mdValueFilter = new SingleColumnValueFilter(
      "d".getBytes(),
      "t".getBytes(),
      CompareOperator.GREATER_OR_EQUAL,
      new SubstringComparator(modifyTime) // greater than or equal to the incremental timestamp
    )

    // Number of rows per scan RPC (default 100); each batch is returned to the client-side Result cache
    scan.setCaching(200)
    // Set the filter; it is pushed down to the RegionServer and reduces the data returned to the client. Even better when combined with a rowkey range
    scan.setFilter(mdValueFilter)
    import collection.JavaConversions._
    val scanner = tag_table.getScanner(scan)
    val iter: util.Iterator[Result] = scanner.iterator()
    // Buffer that holds the parsed case class instances
    val basicListTmp = new ListBuffer[InOutDataHBaseTest]()

    while (iter.hasNext) {
      var rowkey = ""
      var datatime = ""
      var inNum = 0
      var outNum = 0
      val result: Result = iter.next()
      val channel = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("c")))
      val counterid = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("ci")))
      val countertype = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("ct")))
      val devicesn = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("ds")))
      val datatype = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("dt")))
      val hostname = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("h")))
      val modifytime = Bytes.toString(result.getValue(Bytes.toBytes("t"), Bytes.toBytes("md")))

      rowkey = Bytes.toString(result.getRow)
      // Get the cells of the Result; iterate over them, check the column name and take out the #-separated value
      val cells = result.listCells()
      for (cell <- cells) {
        // Use the CellUtil helper to get the column qualifier (column name)
        val cname = Bytes.toString(CellUtil.cloneQualifier(cell))
        if (cname.length >= 4) {
          datatime = rowkey.split("#")(1)+cname
          val cvalue = Bytes.toString(CellUtil.cloneValue(cell))
          val arr = cvalue.split("#")
          inNum = arr(0).toInt
          outNum = arr(1).toInt
          println(datatime + "--------" + inNum + "------" + outNum)
          // Put the parsed cell data into the list as a case class instance
          basicListTmp += InOutDataHBaseTest(rowkey, channel, counterid, countertype,
            devicesn, datatype, hostname, modifytime, datatime, inNum, outNum)
        }
      }
    }
    // Release the scanner, the tables and the connection
    scanner.close()
    tag_table.close()
    time_table.close()
    conn.close()
    // Copy basicListTmp and return it (toSet also removes duplicates)
    val basicList: ListBuffer[InOutDataHBaseTest] = basicListTmp.map(x => InOutDataHBaseTest(x.rowkey, x.channel, x.counterid, x.countertype,
      x.devicesn, x.datatype, x.hostname, x.modifytime, x.datatime, x.inNum, x.outNum))
    basicList.toSet
  }

  def main(args: Array[String]): Unit = {
    getOriginalData().foreach(println(_))
  }
}

 

The above is an example of reading and parsing one of the company's tables. The read can be optimized further through the rowkey: since the rowkey is a custom design (an MD5 hash of hostname + channel), the scan range can be restricted with withStartRow() and withStopRow(); combined with the incremental read, analysis becomes very fast.
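
A hedged sketch of that optimization, assuming (as the parsing code above implies) a rowkey of the form md5(hostname + channel) + "#" + yyyyMMdd; the helper below is hypothetical, and the incremental filter passed in would be the mdValueFilter built in getOriginalData():

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.Filter
import org.apache.hadoop.hbase.util.Bytes

object RowKeyRangeSketch {
  // Restrict the scan to one hashed hostname+channel prefix and a single day,
  // then let the single-column value filter handle the incremental modifyTime check.
  def dailyScan(hashedPrefix: String, day: String, nextDay: String, incrementalFilter: Filter): Scan = {
    val scan = new Scan()
      .withStartRow(Bytes.toBytes(hashedPrefix + "#" + day))    // e.g. "20200706", inclusive lower bound
      .withStopRow(Bytes.toBytes(hashedPrefix + "#" + nextDay)) // e.g. "20200707", exclusive upper bound
    scan.setCaching(200)
    scan.setFilter(incrementalFilter)
    scan
  }
}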

 Reference: https://www.sohu.com/a/284932698_100109711

 

Original post: blog.csdn.net/qq_35995514/article/details/108110553