HBase Practice | Using Spark to analyze cloud HBase data

Cloud HBase offers strong online storage and query capabilities, but it is relatively weak at analytics. This article introduces how to use Spark to run complex analysis over data stored in cloud HBase.

1 The current status of cloud HBase query analysis

  • HBase native API: suited to point lookups by row key, which is the query scenario HBase handles best (see the sketch after this list);

  • Phoenix: as the SQL layer on top of HBase, Phoenix uses secondary-index technology and is good at multi-condition combined queries. However, Phoenix has no compute resources of its own; complex queries such as GROUP BY must be executed through HBase coprocessors, which both performs poorly and affects the stability of the HBase cluster;

  • Spark: provides a rich set of operators for complex analysis. By using the Spark cluster's own compute resources, performance can be improved through parallelism without affecting the stability of the HBase cluster.
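
As a point of reference for the first item, a single-row lookup by key with the native client API looks roughly like the following minimal Scala sketch (the table name "user_table", column family "cf", qualifier "name", and row key are placeholders, not from the original project):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
    import org.apache.hadoop.hbase.util.Bytes

    object GetByRowKeySketch {
      def main(args: Array[String]): Unit = {
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        try {
          // Table name, column family, and row key are placeholders.
          val table = conn.getTable(TableName.valueOf("user_table"))
          val result = table.get(new Get(Bytes.toBytes("row-0001")))
          val value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))
          println(if (value == null) "not found" else Bytes.toString(value))
        } finally {
          conn.close()
        }
      }
    }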

2 Comparing the ways Spark can analyze HBase data

Spark can analyze HBase data in three ways: the RDD API, the SQL API, and direct HFile reading. They compare as follows:

  • RDD API

    • Features: uses the Hadoop community's TableInputFormat and TableRecordReader utility classes to compute splits and scan data; the concrete entry point is newAPIHadoopRDD() (a sketch follows the comparison).

    • Advantages: both Spark and Hive already integrate TableInputFormat and TableRecordReader.

    • Disadvantages: all of the table's data must be scanned into the Spark engine before filtering and complex computation; the blockcache switch and cachesize setting of the HBase scan API are not supported, so highly concurrent scans of large tables affect the stability of the HBase cluster.

  • SQL API

    • Features: Spark's SQL optimizer supports predicate pushdown, column pruning, partition pruning, and so on, pushing as much of the work as possible down to the storage side to improve performance; the SQL schema is mapped onto HBase columns, so no complex type conversions have to be written by hand; the blockcache switch and cachesize setting of the HBase scan API are supported and can be tuned per scenario to protect the stability of cloud HBase.

    • Advantages: makes full use of HBase's storage features and pushes optimizations down to the data source, improving performance.

    • Disadvantages: the scan API still increases the load and memory usage of the HBase cluster when analyzing large tables.

  • HFile

    • Features: Spark analyzes the table's HFiles directly, reading from HDFS without consuming any HBase cluster resources; compared with a scan, reading HFiles directly avoids one round of serialization and deserialization of the data, improving performance.

    • Advantages: effectively avoids the high load and memory consumption that highly concurrent scans impose on the HBase cluster; direct HFile reads perform better.

    • Disadvantages: Spark must analyze HFiles against an HBase snapshot table to guarantee consistency of the analyzed data.
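
For illustration, here is a minimal sketch of the RDD API path described above, using newAPIHadoopRDD() with TableInputFormat (the table name is a placeholder):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object NativeRddSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdd-api-scan"))

        val hbaseConf = HBaseConfiguration.create()
        hbaseConf.set(TableInputFormat.INPUT_TABLE, "user_table") // placeholder table name

        // One Spark partition per region; the whole table is shipped to Spark.
        val rdd = sc.newAPIHadoopRDD(
          hbaseConf,
          classOf[TableInputFormat],
          classOf[ImmutableBytesWritable],
          classOf[Result])

        println(s"rows scanned: ${rdd.count()}")
        sc.stop()
      }
    }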

For small tables whose data is dynamically updated, the SQL API is recommended: it can effectively optimize the analysis and reduce the impact on HBase cluster stability. For static or fully static tables, analyzing HFiles by reading HDFS directly is recommended, since it does not affect HBase cluster stability at all. The RDD API is not recommended: it performs no optimization, and highly concurrent scans of large tables will seriously affect the stability of the HBase cluster and thereby the online business.
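
A hedged sketch of the recommended SQL API path: the package name of the example class listed in section 3 (org.apache.spark.sql.execution.datasources.hbase) suggests the Hortonworks spark-hbase connector (SHC), so this sketch assumes SHC; the catalog, table, and column names are hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

    object SqlAnalyzeSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("sql-api-analyze").getOrCreate()

        // Hypothetical catalog: maps an SQL schema onto the HBase row key and columns.
        val catalog =
          """{
            |  "table":   {"namespace": "default", "name": "user_table"},
            |  "rowkey":  "key",
            |  "columns": {
            |    "id":   {"cf": "rowkey", "col": "key",  "type": "string"},
            |    "name": {"cf": "cf",     "col": "name", "type": "string"},
            |    "age":  {"cf": "cf",     "col": "age",  "type": "int"}
            |  }
            |}""".stripMargin

        val df = spark.read
          .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
          .format("org.apache.spark.sql.execution.datasources.hbase")
          .load()

        // The row-key predicate is pushed down to HBase instead of scanning everything.
        df.filter("id = 'row-0001'").select("name", "age").show()
        spark.stop()
      }
    }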


3 Using the three methods

The cloud HBase team provides a GitHub project that you can use as a reference for developing Spark programs that analyze HBase in the three ways above. Project address:

https://github.com/lw309637554/alicloud-hbase-spark-examples

  • Dependencies: download the client packages for cloud HBase and cloud Phoenix.

  • Analyzing HFiles (see the sketch at the end of this section):

    • First enable HDFS access permission for cloud HBase (refer to the documentation).

    • Generate a snapshot of the table in the HBase shell: snapshot 'sourceTable', 'snapshotName'

    • Configure your own hdfs-site.xml file in the project, then analyze the snapshot table by reading HDFS directly.

  • Concrete examples:

    • RDD API: org.apache.spark.hbase.NativeRDDAnalyze

    • SQL API: org.apache.spark.sql.execution.datasources.hbase.SqlAnalyze

    • HFile analysis: org.apache.spark.hfile.SparkAnalyzeHFILE
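
To make the HFile path concrete, here is a minimal sketch that reads the snapshot created in the shell step above through TableSnapshotInputFormat, scanning region HFiles on HDFS without touching the RegionServers (the restore directory is a hypothetical scratch path):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.{SparkConf, SparkContext}

    object SnapshotHFileSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hfile-snapshot-analyze"))

        // The configuration must carry the cluster's hbase-site.xml / hdfs-site.xml.
        val hbaseConf = HBaseConfiguration.create()

        // "snapshotName" matches the shell step above; the restore dir is a
        // hypothetical scratch path on HDFS where snapshot metadata is materialized.
        val job = Job.getInstance(hbaseConf)
        TableSnapshotInputFormat.setInput(job, "snapshotName", new Path("/tmp/snapshot_restore"))

        // Region HFiles are read from HDFS directly; no RegionServer is involved.
        val rdd = sc.newAPIHadoopRDD(
          job.getConfiguration,
          classOf[TableSnapshotInputFormat],
          classOf[ImmutableBytesWritable],
          classOf[Result])

        println(s"rows in snapshot: ${rdd.count()}")
        sc.stop()
      }
    }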



