I. Introduction
MapReduce has long been integrated with HBase, using HBase as the data source and sink for batch reads and writes. Spark, MapReduce's successor, now holds a pivotal position in the big data field, covering batch processing, stream processing, and even graph computation, so integrating Spark with HBase has become a requirement for many users.
II. Spark On HBase
1. Problems that can be solved
With seamless integration between Spark and HBase, we no longer need to worry about security or the details of the interaction between RDDs and HBase, making it much more convenient to apply the batch and stream processing capabilities that Spark brings. Common application scenarios include:
HBase is used as storage, and streaming data is processed through Spark.
Use HBase as storage to complete large-scale graph or DAG calculations.
BulkLoad into HBase through Spark.
Interactive analysis of HBase data with Spark SQL.
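Before looking at the dedicated connectors below, it helps to recall the baseline approach: Spark can read HBase through the same TableInputFormat that MapReduce uses, with no connector at all. A minimal sketch (the table name hbase_numbers matches the example later in this article):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("HBaseRead"))
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "hbase_numbers")

// Reuse the MapReduce InputFormat: each record is (row key, Result)
val hbaseRDD = sc.newAPIHadoopRDD(conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
println(hbaseRDD.count())
```

This works, but it exposes raw HBase types and none of the optimizations the projects below provide, which is exactly the gap they aim to fill.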
2. Community related work
There are already several implementations of Spark integration with HBase. Here we select three representative projects for analysis:
2.1 Huawei: Spark-SQL-on-HBase
Features:
Extends Spark SQL's parser to access HBase, and improves read and write performance through coprocessors and custom filters.
Advantages:
Extends the CLI accordingly, supporting both the Scala shell and the Python shell
Multiple performance optimizations, including pushing sub-plans down to coprocessors for partial aggregation
Supports Java and Python APIs
Supports composite row keys
Supports common DDL and DML (including bulkload, but not update)
Disadvantages:
Does not support timestamp- or version-based queries
Does not support security
Row keys support primitive types and Strings, but not complex data types
Example of use:
Create a table in HBase and write data
$HBase_Home/bin/hbase shell
create 'hbase_numbers', 'f'
for i in '1'..'100' do
  for j in '1'..'2' do
    put 'hbase_numbers', "row#{i}", "f:c#{j}", "#{i}#{j}"
  end
end
Use Spark SQL to create tables and establish mappings with HBase tables
$SPARK_HBASE_Home/bin/hbase-sql
CREATE TABLE numbers (rowkey STRING, a STRING, b STRING, PRIMARY KEY (rowkey))
  MAPPED BY (hbase_numbers, COLS=[a=f.c1, b=f.c2]);
Query
select a, b from numbers where b > "980"
2.2 Hortonworks: Apache HBase Connector
Features:
Implements the standard Spark DataSource API in a straightforward way and uses the Spark Catalyst engine for query optimization. It also constructs RDDs from scratch and implements many common query optimizations.
Advantages:
Native Avro support
Predicate pushdown and partition pruning
Supports composite row keys
Supports security
Disadvantages:
The SQL syntax is not rich; only Spark SQL's native syntax is supported
Supports only Java primitive types
Does not support multi-language APIs
Example of use:
Define HBase Catalog
def catalog = s"""{
    |"table":{"namespace":"default", "name":"table1"},
    |"rowkey":"key",
    |"columns":{
      |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
      |"col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
      |"col2":{"cf":"cf2", "col":"col2", "type":"double"},
      |"col3":{"cf":"cf3", "col":"col3", "type":"float"},
      |"col4":{"cf":"cf4", "col":"col4", "type":"int"},
      |"col5":{"cf":"cf5", "col":"col5", "type":"bigint"},
      |"col6":{"cf":"cf6", "col":"col6", "type":"smallint"},
      |"col7":{"cf":"cf7", "col":"col7", "type":"string"},
      |"col8":{"cf":"cf8", "col":"col8", "type":"tinyint"}
    |}
  |}""".stripMargin
Query with SQL
// Load the dataframe
val df = withCatalog(catalog)

// SQL example
df.createOrReplaceTempView("table")
sqlContext.sql("select count(col1) from table").show
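Writes go through the same catalog mapping. A hedged sketch of the write path, assuming the connector's documented package name and HBaseTableCatalog options (the "5" asks the connector to pre-split a newly created table into 5 regions):

```scala
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Persist a DataFrame to HBase using the same catalog JSON as the read path
df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog,
               HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
```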
2.3 Cloudera: SparkOnHBase
Features:
Connects Spark and HBase through a simple set of interfaces and supports common bulk reads and writes. The architecture diagram is as follows:
Advantages:
Supports security
Generates RDDs directly from a get or scan, with APIs for more advanced functionality
Supports composite row keys
Supports multiple bulk operations
Provides similar APIs for Spark and Spark Streaming
Supports predicate pushdown optimization
Disadvantages:
Does not support complex data types
SQL supports only Spark SQL's native syntax
Example of use:
Create an RDD directly from a scan
val sparkConf = new SparkConf()
  .setAppName("Scan_RDD")
  .set("spark.executor.memory", "2000m")
  .setMaster("spark://xx.xx.xx.xx:7077")
  .setJars(Array("/path/to/hbase.jar"))
val sc = new SparkContext(sparkConf)

val conf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(sc, conf)

val scan = new Scan()
scan.setCaching(100)
val getRdd = hbaseContext.hbaseRDD(tableName, scan)
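Besides scan-based RDDs, the bulk operations mentioned above include point reads. A hedged sketch of bulkGet, assuming the signature from the SparkOnHBase examples (table name, batch size, an RDD of keys, a key-to-Get function, and a Result converter):

```scala
import org.apache.hadoop.hbase.client.{Get, Result}
import org.apache.hadoop.hbase.util.Bytes

// Row keys to fetch
val keyRdd = sc.parallelize(Array(
  Bytes.toBytes("1"), Bytes.toBytes("2"), Bytes.toBytes("3")))

val getRdd = hbaseContext.bulkGet[Array[Byte], String](
  tableName,
  2,                                // how many Gets are grouped per multi-get
  keyRdd,
  (record) => new Get(record),      // turn each row key into a Get
  (result: Result) => Bytes.toString(result.getRow)) // convert each Result
```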
Create an RDD and write its contents to HBase
val sc = new SparkContext(sparkConf)

// This is making an RDD of
// (RowKey, Array[(columnFamily, columnQualifier, value)])
val rdd = sc.parallelize(Array(
  (Bytes.toBytes("1"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("1")))),
  (Bytes.toBytes("2"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("2")))),
  (Bytes.toBytes("3"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("3")))),
  (Bytes.toBytes("4"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("4")))),
  (Bytes.toBytes("5"), Array((Bytes.toBytes(columnFamily), Bytes.toBytes("1"), Bytes.toBytes("5"))))
))

// Create the HBase config like you normally would, then
// pass the HBase config and SparkContext to the HBaseContext
val conf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(sc, conf)

// Now give the RDD, the table name, a function that converts an RDD record
// to a Put, and finally a flag saying whether the puts should be batched.
// The conversion function matters because it lets the source RDD hold data
// of any type, and because Puts are not serializable.
hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](
  rdd,
  tableName,
  (putRecord) => {
    val put = new Put(putRecord._1)
    putRecord._2.foreach((putValue) => put.add(putValue._1, putValue._2, putValue._3))
    put
  },
  true)
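The advantages listed above include similar APIs for Spark Streaming. A hedged sketch of what the streaming counterpart of bulkPut could look like, assuming streamBulkPut mirrors the bulkPut signature with a DStream in place of an RDD (the socket source, column family "f", and qualifier "c1" are illustrative assumptions):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical stream of "rowKey,value" lines, e.g. from a socket
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)

val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

// Same shape as bulkPut: DStream, table name, record-to-Put function,
// and an auto-flush flag
hbaseContext.streamBulkPut[String](
  lines,
  tableName,
  (record) => {
    val Array(rowKey, value) = record.split(",")
    val put = new Put(Bytes.toBytes(rowKey))
    put.add(Bytes.toBytes("f"), Bytes.toBytes("c1"), Bytes.toBytes(value))
    put
  },
  false)

ssc.start()
```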
2.4 Overall comparison
Product | SQL optimization support | Security | API richness and ease of use | Easy to integrate into HBase | Community activity
---|---|---|---|---|---
Huawei | Extensive | No | High | No | No updates in the past two years
Hortonworks | Fairly extensive | Yes | Medium | Yes | Updated within the past month
Cloudera | Limited | Yes | Fairly high | Yes | Merged into HBase trunk, continuously updated
3. Conclusion
The community has produced quite a few Spark on HBase projects, all aiming to provide easier-to-use, more efficient interfaces. Among them, Cloudera's SparkOnHBase is the most flexible and simple; it was committed to the HBase trunk in August 2015 as the HBase-Spark Module and is currently slated for official release in HBase 2.0. We believe this feature will be a highlight of the new HBase version. At the same time, Cloud HBase will evolve in step with the community, adopting new features including but not limited to Spark On HBase, and everyone is welcome to try them out then. If there are any inaccuracies in this article, corrections are most welcome. Thank you!
4. References
https://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
https://issues.apache.org/jira/browse/HBASE-13992
http://blog.madhukaraphatak.com/introduction-to-spark-two-part-6/
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-catalyst.html