HBase Series (1) - Introduction to HBase

One, Hadoop Limitations

HBase is a column-oriented database management system built on top of the Hadoop file system (HDFS).

To understand why HBase was created, we first need to look at the limitations of Hadoop. Hadoop's HDFS can store structured, semi-structured and even unstructured data. It complements traditional databases and is the best way to store massive amounts of data: it is optimized for storing large files and for batch and streaming access, and it addresses disaster recovery by keeping multiple replicas of the data.

The drawback of Hadoop is that it can only run batch jobs and can only access data sequentially. This means that even the simplest task has to search the entire data set; random access to data is not possible. Traditional relational databases are good at random access, but they cannot be used to store massive amounts of data. A new approach was therefore needed that solves both mass storage and random access, and HBase is one such solution (HBase, Cassandra, CouchDB, Dynamo and MongoDB can all store massive amounts of data and support random access).
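The difference between sequential and random access can be sketched in a few lines of Java. This is only an illustration and uses neither Hadoop nor HBase: the class `AccessPatterns`, its methods and the sample keys are made up for this example. A list stands in for a data set that can only be scanned, and a sorted map stands in for a store that can be addressed by key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AccessPatterns {
    /** Sequential access: records are read one by one until the key is found. */
    static String scanLookup(List<String[]> records, String key, int[] readCount) {
        for (String[] record : records) {
            readCount[0]++;                  // count every record that had to be read
            if (record[0].equals(key)) {
                return record[1];
            }
        }
        return null;
    }

    /** Random access: a sorted index is asked for the key directly. */
    static String indexedLookup(Map<String, String> index, String key) {
        return index.get(key);
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        Map<String, String> index = new TreeMap<>();
        for (int i = 0; i < 1000; i++) {
            String key = String.format("user%04d", i);
            records.add(new String[] {key, "value" + i});
            index.put(key, "value" + i);
        }

        int[] reads = new int[1];
        String scanned = scanLookup(records, "user0999", reads);
        System.out.println(scanned + " found after reading " + reads[0] + " records");
        System.out.println(indexedLookup(index, "user0999") + " found by key");
    }
}
```

In the scan, the cost of finding one record grows with the size of the data set, which is the limitation described above.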

Note: categories of data by structure:

  • Structured data: data managed in the form of relational database tables;
  • Semi-structured data: data that does not follow the relational model but has a largely fixed structure, such as log files, XML documents, JSON documents and email;
  • Unstructured data: data with no fixed structure, such as Word, PDF, PPT and Excel documents, and images and videos in various formats.

Two, Introduction to HBase

HBase is a column-oriented database management system built on top of the Hadoop file system (HDFS).

HBase uses a data model similar to Google's Bigtable. It is part of the Hadoop ecosystem and stores its data on HDFS, and clients can use HBase to get random access to data on HDFS. It has the following features:

  • Complex transactions are not supported; only row-level transactions are supported, i.e. reads and writes of a single row are atomic;
  • Because it uses HDFS as its underlying storage, it supports structured, semi-structured and unstructured data just as HDFS does;
  • It scales horizontally by adding machines;
  • It supports data sharding;
  • It supports automatic failover between RegionServers;
  • It has an easy-to-use Java client API;
  • It supports BlockCache and Bloom filters;
  • Its filters support predicate push-down.
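Sharding and failover can be pictured with a small model. The sketch below is not HBase code. It assumes that a table is split into shards (HBase calls them regions), each covering a contiguous range of row keys that begins at a start key, and that each shard is hosted by one server. The class `RegionLookupSketch`, its methods and the server names are hypothetical, and Java string order stands in for the byte order that HBase uses.

```java
import java.util.Map;
import java.util.TreeMap;

class RegionLookupSketch {
    // Start key of a shard (inclusive) -> name of the server hosting it.
    // The empty string is the start key of the first shard.
    private final TreeMap<String, String> serverByStartKey = new TreeMap<>();

    void addRegion(String startKey, String server) {
        serverByStartKey.put(startKey, server);
    }

    /** Finds the shard whose start key is the greatest one not after rowKey. */
    String serverFor(String rowKey) {
        Map.Entry<String, String> region = serverByStartKey.floorEntry(rowKey);
        if (region == null) {
            throw new IllegalStateException("no region covers row key " + rowKey);
        }
        return region.getValue();
    }

    /** Failover: every shard of the failed server is handed to a replacement. */
    void failover(String failedServer, String replacement) {
        serverByStartKey.replaceAll(
                (startKey, server) -> server.equals(failedServer) ? replacement : server);
    }

    public static void main(String[] args) {
        RegionLookupSketch table = new RegionLookupSketch();
        table.addRegion("", "regionserver-1");   // rows before "m"
        table.addRegion("m", "regionserver-2");  // rows from "m" onwards

        System.out.println(table.serverFor("alice"));
        System.out.println(table.serverFor("zoe"));

        table.failover("regionserver-2", "regionserver-3");
        System.out.println(table.serverFor("zoe"));
    }
}
```

Because rows are kept in key order, locating the shard for a row is a single lookup in a sorted map, and adding a machine only means assigning some key ranges to it.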

Three, HBase Tables

HBase is a column-oriented database management system or, to be more precise, a column-family-oriented database management system. A table schema defines only column families. A table has multiple column families, and each column family can contain any number of columns. A column is made up of cells, and a cell can store multiple versions of its data, which are distinguished by timestamp.
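One way to picture this model is as nested sorted maps: row key, then column (written here as `family:qualifier`), then timestamp, then value. The following is a minimal in-memory sketch under that assumption, not the HBase API. The class `TableModelSketch`, its methods and the sample values are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

class TableModelSketch {
    // row key -> "family:qualifier" -> timestamp (newest first) -> value bytes
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<>();

    void put(String rowKey, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(family + ":" + qualifier,
                    k -> new TreeMap<Long, byte[]>(Comparator.reverseOrder()))
            .put(timestamp, value.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns the newest version of a cell, or null if the cell was never written. */
    String getLatest(String rowKey, String family, String qualifier) {
        NavigableMap<Long, byte[]> versions = versionsOf(rowKey, family, qualifier);
        if (versions == null) {
            return null;
        }
        return new String(versions.firstEntry().getValue(), StandardCharsets.UTF_8);
    }

    /** Returns how many versions a cell holds (0 if it was never written). */
    int versionCount(String rowKey, String family, String qualifier) {
        NavigableMap<Long, byte[]> versions = versionsOf(rowKey, family, qualifier);
        return versions == null ? 0 : versions.size();
    }

    private NavigableMap<Long, byte[]> versionsOf(String rowKey, String family, String qualifier) {
        NavigableMap<String, NavigableMap<Long, byte[]>> row = rows.get(rowKey);
        return row == null ? null : row.get(family + ":" + qualifier);
    }

    public static void main(String[] args) {
        TableModelSketch table = new TableModelSketch();
        table.put("row1", "personal", "city", 1L, "Shanghai");
        table.put("row1", "personal", "city", 2L, "Beijing");   // newer version of the same cell
        table.put("row1", "office", "tel", 1L, "555-0100");

        System.out.println(table.getLatest("row1", "personal", "city"));
        System.out.println(table.versionCount("row1", "personal", "city"));
    }
}
```

Columns are created on write, which is why the schema only needs to name the column families. A cell that was never written simply has no entry in the map.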

The figure below shows an HBase table:

  • The RowKey uniquely identifies a row, and all rows are sorted in lexicographic order of their RowKey;
  • The table has two column families, personal and office;
  • The personal column family has three columns (name, city and phone), and the office column family has two columns (tel and address).

Image source: Is HBase a column-store database? https://www.iteblog.com/archives/2498.html

HBase tables have the following characteristics:

  • Large capacity: a table can have billions of rows and millions of columns;
  • Column-oriented: data is stored by column and each column is stored separately; the data itself acts as the index, and a query reads only the columns it asks for, which effectively reduces the I/O load on the system;
  • Sparsity: empty (null) columns take up no storage space, so tables can be designed to be very sparse;
  • Multiple versions: the data in each cell can have multiple versions, sorted by timestamp with the newest data first;
  • Storage type: all data is stored underneath as byte arrays (byte[]).
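Two of the points above, byte-array storage and lexicographic ordering of RowKeys, interact in a way that affects row key design. The sketch below uses only the Java standard library. The class `ByteOrderSketch` and its helpers are hypothetical, and the comparison is a hand-written unsigned byte-by-byte comparison that is assumed to match the lexicographic order described above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class ByteOrderSketch {
    /** Encodes a string as UTF-8 bytes. */
    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Encodes a long as 8 big-endian bytes. */
    static byte[] bytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    /** Lexicographic comparison of two byte arrays, treating bytes as unsigned. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;   // a shorter prefix sorts first
    }

    public static void main(String[] args) {
        // "row10" sorts before "row2" because '1' is smaller than '2'.
        System.out.println(compare(bytes("row10"), bytes("row2")) < 0);
        // Zero-padding restores the numeric order: "row02" sorts before "row10".
        System.out.println(compare(bytes("row02"), bytes("row10")) < 0);
        // A long value is just 8 bytes once stored.
        System.out.println(bytes(42L).length);
    }
}
```

Since the store sees only bytes, the application decides how values are encoded, and numeric parts of a row key are commonly padded to a fixed width so that byte order and numeric order agree.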

Four, Phoenix

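Phoenix is an open source SQL layer for HBase that lets you work with data in HBase through standard JDBC. Before Phoenix, the only way to access HBase was to call its Java API, which is overly complex compared with running a query in a single line of SQL. Phoenix's philosophy is "we put the SQL back in NoSQL": you can operate on data in HBase with standard SQL. This also means that you can work with HBase through common persistence frameworks such as Spring Data JPA or MyBatis.

Phoenix also performs very well. The Phoenix query engine turns a SQL query into one or more HBase scans and executes them in parallel to produce a standard JDBC result set. By using the HBase API directly, together with coprocessors and custom filters, it delivers performance on the order of milliseconds for small queries and on the order of seconds for queries over tens of millions of rows. Phoenix also offers features that HBase itself lacks, such as secondary indexes. Because of these advantages, Phoenix has become the best SQL layer for HBase.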
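A rough sketch of what access through JDBC can look like is given below. It relies only on `java.sql` from the JDK, so it compiles on its own, but running it requires the Phoenix client driver on the classpath and a reachable cluster. The ZooKeeper address `localhost:2181`, the table `person` and its columns are assumptions made for this example. Phoenix writes rows with `UPSERT` rather than `INSERT`.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class PhoenixJdbcSketch {
    /** Builds a Phoenix JDBC URL of the form jdbc:phoenix:<zookeeper quorum>:<port>. */
    static String jdbcUrl(String zkQuorum, int zkPort) {
        return "jdbc:phoenix:" + zkQuorum + ":" + zkPort;
    }

    public static void main(String[] args) {
        String url = jdbcUrl("localhost", 2181);   // assumed ZooKeeper address
        try (Connection conn = DriverManager.getConnection(url)) {
            // Hypothetical table, created only for this example.
            try (Statement ddl = conn.createStatement()) {
                ddl.execute("CREATE TABLE IF NOT EXISTS person ("
                        + "id VARCHAR PRIMARY KEY, name VARCHAR, city VARCHAR)");
            }

            try (PreparedStatement upsert = conn.prepareStatement(
                    "UPSERT INTO person (id, name, city) VALUES (?, ?, ?)")) {
                upsert.setString(1, "row1");
                upsert.setString(2, "Alice");
                upsert.setString(3, "Beijing");
                upsert.executeUpdate();
            }
            conn.commit();   // make the write visible; auto-commit is assumed to be off

            try (PreparedStatement query = conn.prepareStatement(
                    "SELECT name, city FROM person WHERE id = ?")) {
                query.setString(1, "row1");
                try (ResultSet rs = query.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name") + " / " + rs.getString("city"));
                    }
                }
            }
        } catch (SQLException e) {
            // Without a Phoenix driver and a running cluster the connection attempt fails here.
            System.err.println("Phoenix query failed: " + e.getMessage());
        }
    }
}
```

Because this is plain JDBC, the same connection URL is what a persistence framework would be configured with.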

More articles in this big data series can be found in the GitHub open source project Big Data Getting Started Guide.

Origin www.cnblogs.com/heibaiying/p/11403497.html