Is HBase a suitable data source for BI analysis?

HBase is a Key-Value storage layer built on top of the Hadoop Distributed File System. It supports fast insertion, modification, and deletion of Key-Value pairs, as well as fast single-Key lookups. So is HBase a suitable data source for BI analysis? Filtering and aggregation are the basic operations of BI, so the first question is whether HBase can support fast filtering and aggregation.

MapReduce is the basic computing framework on the Hadoop stack, and HBase users can run filtering and aggregation operations through it. But the response time of a MapReduce job is generally tens of seconds to several minutes, which is too slow for interactive BI. So we investigated whether HBase Coprocessor is a better choice.

HBase Coprocessor is a simpler computing mechanism than MapReduce. A Coprocessor is roughly equivalent to a Stored Procedure running on an HBase Region Server. An HBase client can invoke (through execCoprocessor) the Coprocessor on each Region Server to perform filtering and aggregation. The Coprocessor runs locally on the Region Server; each Region Server then sends its partial results back to the client, and the final result is assembled on the client. The following figure is a schematic diagram of using a Coprocessor to do a Count operation.

Programmers can write their own Coprocessor programs, using HBase's Scan object for filtering and Java code to implement aggregation operations such as Sum() and Avg().
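To make the execution model above concrete, here is a minimal Python simulation (not real HBase API code, which would be Java against a live cluster): each "region server" filters its own rows and returns only a partial aggregate, and the client combines the partials into the final Sum() and Avg(). The region data and the filter condition are invented for the example.

```python
# Illustrative simulation of coprocessor-style aggregation (not HBase's API):
# each region computes a partial result locally; the client assembles them.

def region_aggregate(rows):
    """Runs on one 'region server': filter locally, return partial (sum, count)."""
    values = [v for v in rows if v >= 0]   # local filtering, like a Scan with a filter
    return sum(values), len(values)        # only the partial aggregate leaves the region

def client_combine(partials):
    """Runs on the client: assemble partial results into global Sum() and Avg()."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, (total / count if count else 0.0)

# Three regions hold disjoint slices of the table's rows.
regions = [[3, 5, -1, 2], [10, 4], [1, 1, 1]]
partials = [region_aggregate(r) for r in regions]
total, avg = client_combine(partials)
print(total, avg)  # 27 3.375
```

The key property is that only small partial results cross the network, not the raw rows, which is why a Coprocessor can beat shipping everything to the client.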

Since HBase's own API does not support Table Joins, we assume for this analysis that all data warehouse data is stored in a single giant HBase Table.

At the logical level, an HBase Table is a 3-dimensional Map: given (Row, Column, Timestamp), we can look up the corresponding value. In the physical implementation, the data of an HBase Table is stored cell by cell, and each cell carries other fields besides the value itself, such as the RowKey, Column identifier, and Timestamp. A large part of each cell's space is therefore spent storing this metadata. This storage format is very efficient for sparse tables, but as the data density of a table grows, its storage efficiency drops sharply. The data density of a typical data warehouse table is often close to 100%; at that point, the storage efficiency of an HBase Table is much lower than that of a simple 2-dimensional table, such as a relational database table or a CSV file.
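A back-of-the-envelope calculation makes the density argument concrete. The byte sizes below are illustrative assumptions, not the exact HBase KeyValue layout: the point is only that every cell repeats its row key, column identifier, and timestamp, while a dense 2-D format stores each value once under a shared schema.

```python
# Rough storage comparison: per-cell metadata (HBase-style) vs. values-only
# (CSV/relational-style). All byte sizes are assumed for illustration.

ROWS, COLS = 1_000_000, 20
VALUE = 8       # bytes per value
ROWKEY = 16     # bytes, repeated in every cell of the row
COLUMN_ID = 12  # column family + qualifier bytes, repeated per cell
TIMESTAMP = 8   # bytes per cell

def hbase_size(density):
    """Cell-based storage: only populated cells exist, but each carries metadata."""
    cells = int(ROWS * COLS * density)
    return cells * (ROWKEY + COLUMN_ID + TIMESTAMP + VALUE)

def dense_size():
    """2-D table storage: every value stored once, schema stored once (ignored here)."""
    return ROWS * COLS * VALUE

for density in (0.05, 1.0):
    ratio = hbase_size(density) / dense_size()
    print(f"density {density:.0%}: HBase/dense size ratio = {ratio:.2f}")
```

Under these assumed sizes, a 5%-dense table is cheaper in cell-based storage (ratio 0.28), but a 100%-dense warehouse table pays about 5.5x the space of a plain 2-D layout, which is the effect described above.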

Our tests show that when the data set is small (200-300MB), the Coprocessor is slightly slower than MySQL but faster than MapReduce. As the data set grows, the Coprocessor becomes slower than MySQL, and eventually even slower than MapReduce. And in MapReduce, for the same data, CSV is much faster than HBase Table.

In summary, the blogger believes that the storage format of HBase Table itself is not suitable for typical BI applications.

But for some simple reporting applications, such as Facebook Insight, HBase can still be used as a data source. In Facebook Insight, each user has some Count metrics, such as Click# and Impression#. The User ID (as the Key) and these Count metrics are stored in an HBase Table; Insight updates each user's metrics in real time based on web logs, and each user's metric values can also be read in real time. Since no complex filtering or aggregation is involved here, HBase can play a good role.
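This counter-style workload reduces to single-key increments and single-key reads, which is exactly what HBase is fast at (HBase even provides an atomic Increment operation for it). The sketch below simulates the access pattern with a plain dict standing in for the HBase Table; the metric names are examples, not Facebook's actual schema.

```python
# Minimal simulation of the Insight-style counter workload: a table keyed by
# user ID, with point updates and point reads only. A dict stands in for HBase.

from collections import defaultdict

table = defaultdict(lambda: {"clicks": 0, "impressions": 0})

def record_event(user_id, metric, delta=1):
    """Point update: increment one metric for one user (single-Key write)."""
    table[user_id][metric] += delta

def read_metrics(user_id):
    """Point read: fetch all metrics for one user (single-Key lookup)."""
    return dict(table[user_id])

record_event("user42", "impressions")
record_event("user42", "impressions")
record_event("user42", "clicks")
print(read_metrics("user42"))  # {'clicks': 1, 'impressions': 2}
```

No scan across users and no cross-row aggregation ever happens, which is why this application fits HBase while general BI queries do not.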
