Integrating Spark with HBase (a custom HBase DataSource)

Background

Spark supports a wide variety of data sources, but it does not ship with a particularly elegant API for reading and writing HBase. Since Spark and HBase are combined in many scenarios, this project implements a more convenient set of HBase operations on top of Spark's DataSource API.

Writing to HBase

When writing to HBase, the data is written with the corresponding data types according to the DataFrame's schema. Start with an example:

import spark.implicits._
import org.apache.hack.spark._

val df = spark.createDataset(Seq(("ufo", "play"), ("yy", ""))).toDF("name", "like")

// Method 1
val options = Map(
    "rowkey.filed" -> "name",
    "startKey" -> "aaaaa",
    "endKey" -> "zzzzz",
    "numReg" -> "12",
    "bulkload.enable" -> "false"
)
df.saveToHbase("hbase_table", Some("XXX:2181"), options)

// Method 2
df.write.format("org.apache.spark.sql.execution.datasources.hbase")
    .options(Map(
        "rowkey.filed" -> "name",
        "outputTableName" -> "hbase_table",
        "hbase.zookeeper.quorum" -> "XXX:2181",
        "startKey" -> "aaaaa",
        "endKey" -> "zzzzz",
        "numReg" -> "12",
        "bulkload.enable" -> "false"
    )).save()

Both approaches achieve the same result. The meaning of each option is explained below:

  • rowkey.field: which DataFrame column to use as the rowkey of the HBase table; by default the first column is used
  • outputTableName: the name of the target HBase table
  • hbase.zookeeper.quorum: the ZooKeeper address; can also be abbreviated as zk
  • startKey, endKey, numReg: these three options work together. If the target HBase table does not exist, it is created first; by default the new table has only one region, and these three options let you pre-split it by specifying the start key, the end key, and the number of regions. The column family name can be set with the family option; the default is info.
  • bulkload.enable: when writing, you can either go through the normal HBase write path or use BulkLoad to directly generate the HFiles that HBase needs, which is much faster for large amounts of data. BulkLoad is used by default (see the sketch after this list).
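
To make the pre-split and BulkLoad options concrete, here is a minimal sketch that writes a DataFrame into a new, pre-split table. It relies on the same implicit saveToHbase API shown above; the table name, ZooKeeper quorum, key range, and region count are placeholder values.

// Minimal sketch: write with BulkLoad (the default) into a pre-split table.
// Table name, quorum, key range and region count are placeholders.
import spark.implicits._
import org.apache.hack.spark._

val userDf = spark.createDataset(Seq(("u001", "2020-01-01"), ("u002", "2020-01-02")))
    .toDF("uid", "regdate")

val writeOptions = Map(
    "rowkey.filed" -> "uid",      // use the uid column as the rowkey (key spelled as in the examples above)
    "startKey" -> "u000",         // if the table has to be created,
    "endKey" -> "u999",           // pre-split it over this key range
    "numReg" -> "8",              // into 8 regions
    "bulkload.enable" -> "true"   // generate HFiles via BulkLoad (the default)
)

// Creates user_table with 8 regions if it does not exist, then bulk-loads the data.
userDf.saveToHbase("user_table", Some("XXX:2181"), writeOptions)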

Reading from HBase

The sample code is as follows:

// Method 1
import org.apache.hack.spark._

val options = Map(
    "spark.table.schema" -> "appid:String,appstoreid:int,firm:String",
    "hbase.table.schema" -> ":rowkey,info:appStoreId,info:firm"
)
spark.hbaseTableAsDataFrame("hbase_table", Some("XXX:2181"), options).show(false)

// Method 2
spark.read.format("org.apache.spark.sql.execution.datasources.hbase")
    .options(Map(
        "spark.table.schema" -> "appid:String,appstoreid:int,firm:String",
        "hbase.table.schema" -> ":rowkey,info:appStoreId,info:firm",
        "hbase.zookeeper.quorum" -> "XXX:2181",
        "inputTableName" -> "hbase_table"
    )).load.show(false)

It is not necessary to specify a schema mapping between the Spark and HBase tables. By default two columns are produced: rowkey and content, where content is a JSON string assembled from all of the HBase columns. The data type of an individual field can be set with field.type.fieldname; the default is StringType. With this approach you still have to post-process the JSON in your Spark program to get what you want, and every column is scanned, which is relatively inefficient.
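
As an illustration of this default behaviour, here is a minimal sketch that reads the table without any schema mapping. The table name and ZooKeeper quorum are placeholders, and the field.type.appStoreId entry is a hypothetical use of the field.type.fieldname option described above.

// Minimal sketch: read without a schema mapping. Every column is scanned and the
// result has two columns, rowkey and content (a JSON string of all columns).
val rawDf = spark.read.format("org.apache.spark.sql.execution.datasources.hbase")
    .options(Map(
        "inputTableName" -> "hbase_table",
        "hbase.zookeeper.quorum" -> "XXX:2181",
        // hypothetical: declare a non-String type for a single field
        "field.type.appStoreId" -> "int"
    )).load()

rawDf.printSchema()
rawDf.show(false)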

So we can customize the schema mapping to get the data:

  • hbase.table.schema: the structure of the HBase table, written as :rowkey,cm:fieldname1,cm:fieldname2 (the rowkey followed by columnfamily:qualifier pairs), which should be clear at a glance
  • spark.table.schema: the schema of the resulting Spark DataFrame, written as name_x:fieldType,name_y:fieldType,name_z:fieldType.
    Note that the two schemas correspond one to one; HBase only scans the columns listed in hbase.table.schema (see the sketch after this list).
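
To show the one-to-one correspondence, here is a sketch that lines up the two schemas entry by entry and then queries the mapped DataFrame with Spark SQL. The column names come from the example above, while the temp-view name and the filter are illustrative.

// The i-th entry of spark.table.schema maps to the i-th entry of hbase.table.schema:
//   appid:String    <-  :rowkey
//   appstoreid:int  <-  info:appStoreId
//   firm:String     <-  info:firm
val appDf = spark.read.format("org.apache.spark.sql.execution.datasources.hbase")
    .options(Map(
        "spark.table.schema" -> "appid:String,appstoreid:int,firm:String",
        "hbase.table.schema" -> ":rowkey,info:appStoreId,info:firm",
        "hbase.zookeeper.quorum" -> "XXX:2181",
        "inputTableName" -> "hbase_table"
    )).load()

// Only :rowkey, info:appStoreId and info:firm are scanned on the HBase side.
appDf.createOrReplaceTempView("app")
spark.sql("SELECT appid, firm FROM app WHERE appstoreid > 100").show(false)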

The source code is on my GitHub, and future updates will be published there. Stars are welcome.
