Spark On HBase

I. Introduction

MapReduce has long integrated with HBase, using HBase as a data source and sink for batch reads and writes. Spark, which came after MapReduce, now holds a pivotal position in the big data field, covering batch processing, stream processing, and even graph computation. Integrating Spark with HBase has therefore become a requirement for many users.

II. Spark On HBase

1. Problems that can be solved

Seamless integration of Spark and HBase means we no longer have to worry about security or the low-level details of moving data between RDDs and HBase, and it makes it easier to apply Spark's batch and stream processing capabilities. Common application scenarios include:

  1. Use HBase as storage and process streaming data with Spark.

  2. Use HBase as storage for large-scale graph or DAG computations.

  3. Perform BulkLoad operations on HBase through Spark.

  4. Analyze HBase data interactively with Spark SQL.


2. Related work in the community

There are already several implementations that integrate Spark with HBase. Here we pick three representative projects for analysis:

2.1 Huawei: Spark-SQL-on-HBase

Features:
Extends the Spark SQL parser to connect to HBase, and improves read and write performance through coprocessors and custom filters.

Advantages:

  • Extends the corresponding CLI, supporting both a Scala shell and a Python shell

  • Multiple performance optimizations, including pushing sub-plans down to coprocessors for partial aggregation

  • Java and Python APIs

  • Composite row keys

  • Common DDL and DML (including bulkload, but not update)

Disadvantages:

  • No support for queries based on timestamp or version

  • No security support

  • Row keys may be primitive types or Strings; complex data types are not supported

Example of use:

Create a table in HBase and write data

$HBase_Home/bin/hbase shell
create 'hbase_numbers', 'f'
for i in '1'..'100' do for j in '1'..'2' do put 'hbase_numbers', "row#{i}", "f:c#{j}", "#{i}#{j}" end end
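For reference, the same rows can also be written programmatically. A minimal sketch, assuming the standard HBase 1.x+ client API (Connection/Table/Put) is on the classpath and the cluster settings come from hbase-site.xml:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// Write the same 100 x 2 cells with the plain HBase client
val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("hbase_numbers"))
for (i <- 1 to 100; j <- 1 to 2) {
  val put = new Put(Bytes.toBytes(s"row$i"))
  put.addColumn(Bytes.toBytes("f"), Bytes.toBytes(s"c$j"), Bytes.toBytes(s"$i$j"))
  table.put(put)
}
table.close()
conn.close()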

Use Spark SQL to create a table and map it to the HBase table

$SPARK_HBASE_Home/bin/hbase-sql
CREATE TABLE numbers (rowkey STRING, a STRING, b STRING, PRIMARY KEY (rowkey))
MAPPED BY (hbase_numbers, COLS=[a=f.c1, b=f.c2]);

Query

select a, b from numbers where b > "980"


2.2 Hortonworks: Apache HBase Connector

Features:
Implements a standard Spark Data Source API in a straightforward way and uses the Spark Catalyst engine for query optimization. RDDs are constructed from scratch, and many common query optimizations are implemented as well.

Advantages:

  • Native Avro support

  • Predicate pushdown and partition pruning

  • Composite row keys

  • Security support

Disadvantages:

  • SQL syntax is limited; only Spark SQL's native syntax is supported

  • Only Java primitive types are supported

  • No multi-language APIs

Example of use:

Define HBase Catalog

def catalog = s"""{
        |"table":{"namespace":"default", "name":"table1"},
        |"rowkey":"key",
        |"columns":{
          |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
          |"col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
          |"col2":{"cf":"cf2", "col":"col2", "type":"double"},
          |"col3":{"cf":"cf3", "col":"col3", "type":"float"},
          |"col4":{"cf":"cf4", "col":"col4", "type":"int"},
          |"col5":{"cf":"cf5", "col":"col5", "type":"bigint"},
          |"col6":{"cf":"cf6", "col":"col6", "type":"smallint"},
          |"col7":{"cf":"cf7", "col":"col7", "type":"string"},
          |"col8":{"cf":"cf8", "col":"col8", "type":"tinyint"}
        |}
      |}""".stripMargin

Query with SQL

// Load the DataFrame
val df = withCatalog(catalog)

// SQL example
df.createOrReplaceTempView("table")
sqlContext.sql("select count(col1) from table").show
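The same mapping can also be queried through the DataFrame API; SHC can push simple filters on the row-key column (col0 here) down to HBase as scan ranges. The key value below is only illustrative, not taken from the example table:

// Select a few columns and filter on the row key; such filters can be pushed down to HBase
df.select("col0", "col1", "col4")
  .filter(df("col0") <= "row050")   // "row050" is a hypothetical key for illustration
  .show()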


2.3 Cloudera: SparkOnHBase

Features:
Connects Spark and HBase through a simple interface and supports common bulk reads and writes. The architecture is shown below:

(Architecture diagram)

Advantages:

  • Security support

  • Generates RDDs directly from Get or Scan operations, with an API for more advanced functionality

  • Composite row keys

  • A variety of bulk operations

  • Similar APIs for Spark and Spark Streaming (a streaming sketch follows the bulkPut example below)

  • Predicate pushdown optimization

Disadvantages:

  • No support for complex data types

  • SQL is limited to Spark SQL's native syntax

Example of use:

Create an RDD directly from a Scan

val sparkConf = new SparkConf()
  .setAppName("Scan_RDD")
  .set("spark.executor.memory", "2000m")
  .setMaster("spark://xx.xx.xx.xx:7077")
  .setJars(Array("/path/to/hbase.jar"))

val sc = new SparkContext(sparkConf)

val conf = HBaseConfiguration.create()

// HBaseContext is provided by the SparkOnHBase / hbase-spark module
val hbaseContext = new HBaseContext(sc, conf)

val scan = new Scan()
scan.setCaching(100)
val getRdd = hbaseContext.hbaseRDD(tableName, scan)
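The element type of the resulting RDD depends on the SparkOnHBase version: the Cloudera Labs build returns the row key together with its cell tuples, while the hbase-spark module in the HBase trunk returns (ImmutableBytesWritable, Result) pairs. Actions that do not inspect the element type work either way; a minimal sketch:

// Count the rows returned by the scan and peek at the first few elements
println("rows scanned: " + getRdd.count())
getRdd.take(5).foreach(println)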

Create an RDD and write its contents to HBase

val sc = new SparkContext(sparkConf)

// This is making an RDD of
// (RowKey, columnFamily, columnQualifier, value)
val rdd = sc.parallelize(Array(
      (Bytes.toBytes("1"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("1")))),
      (Bytes.toBytes("2"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("2")))),
      (Bytes.toBytes("3"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("3")))),
      (Bytes.toBytes("4"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("4")))),
      (Bytes.toBytes("5"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("5"))))
     ))

// Create the HBase config like you normally would, then
// pass the HBase config and SparkContext to the HBaseContext
val conf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(sc, conf)

// Now give the rdd, the table name, a function that converts an RDD record to a Put,
// and finally a flag indicating whether the puts should be batched
hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](
    rdd,
    tableName,
    // This function is really important because it allows our source RDD to have data of any type
    // Also because Puts are not serializable
    (putRecord) => {
      val put = new Put(putRecord._1)
      putRecord._2.foreach((putValue) => put.add(putValue._1, putValue._2, putValue._3))
      put
    },
    true)
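The streaming counterpart applies the same record-to-Put conversion to every micro-batch. A hedged sketch based on the Cloudera Labs write-up of streamBulkPut; the exact names and signatures should be checked against the SparkOnHBase version in use, and the DStream, column family, and qualifier below are assumptions for illustration:

// lines is assumed to be a DStream[String], e.g. from ssc.socketTextStream(...)
hbaseContext.streamBulkPut[String](
    lines,
    tableName,
    (record) => {
      // One Put per incoming record; "c1" is an illustrative qualifier
      val put = new Put(Bytes.toBytes(record))
      put.add(Bytes.toBytes(columnFamily), Bytes.toBytes("c1"), Bytes.toBytes(record))
      put
    },
    false)   // autoFlush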


2.4 Comparison

Comparing the three projects along SQL support and optimization, security, API richness and ease of use, ease of integration into HBase, and community activity:

  • Huawei: no updates in the last two years.

  • Hortonworks: relatively rich; updated within the last month.

  • Cloudera: relatively high; already integrated into the HBase trunk and continuously updated.


3. Conclusion

There is quite a bit of Spark-on-HBase work in the community, all of it aiming to provide easier-to-use and more efficient interfaces. Among these projects, Cloudera's SparkOnHBase is the most flexible and simple. It was committed to the HBase trunk in August 2015 as the HBase-Spark module and is currently slated for official release in HBase 2.0; this feature should be a highlight of the new HBase version. At the same time, Cloud HBase (云HBase) will evolve in step with the community and adopt new features including, but not limited to, Spark On HBase; everyone is welcome to try them out when they arrive. If anything in this article is described inaccurately, corrections are welcome. Thank you!

4. References

https://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
https://issues.apache.org/jira/browse/HBASE-13992
http://blog.madhukaraphatak.com/introduction-to-spark-two-part-6/
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-catalyst.html





