The road to learning HBase (1)

Before reading: the following combines material found online with my own understanding. If anything is wrong, corrections are welcome. Thank you ~~~

One, HBase introduction

The definition and role of HBase

  HBase is a distributed, column-oriented, non-relational database (that is, it stores data as key-value pairs).

  The role of HBase: it provides low-latency query capability for data stored on HDFS.

  Note: HBase is not a computation framework. It stores data and serves it with low latency; it cannot replace the computation done by MR (Hive).

Low latency

  A client read is answered within seconds, or even milliseconds. Data on HDFS can also be read through MapReduce or Hive, but the latency of an MR or Hive query is typically measured in minutes.

    The intermediate steps of a Hive or MR analysis are roughly as follows:

  ① write a complex HQL statement by hand ---> ② it is parsed into MR jobs ---> ③ the MR jobs are scheduled ---> ④ the query is executed ---> ⑤ the results are returned

  Most of the time is spent in steps ①②③ (the underlying layers involve a lot of disk I/O).

HBase low latency

  HBase can answer queries with low latency because its underlying layers (to be analyzed in a follow-up post) make full use of caching, together with carefully designed data structures and algorithms.

Two, HBase features

① distributed architecture

    HBase stores data across a cluster, but the data is ultimately still persisted in HDFS.

② column-oriented storage

  

  Comparing the row-oriented and column-oriented layouts, we can observe:

    1) With row-based storage, the data of one row is stored contiguously on disk.

    2) With column-based storage, the data of one row is not contiguous on disk.

    

    3) performance comparison:

      ① write performance: it is measured by the number of write operations. The fewer the writes, the higher the performance, because every write to disk requires the head to be repositioned, which costs seek time.

             So, to sum up, row storage has the advantage when writing data, because column storage incurs many random disk writes.

     ② read performance:

        If the comparison is reading the whole table, row storage performs better: it can read the entire table with fewer disk I/O operations, whereas column storage incurs more random disk accesses.

        If the comparison is reading a particular column, column storage has a distinct advantage because no redundant columns (data) are read. With row storage the redundant columns are read as well, and discarding them happens in memory. This advantage grows with the volume of data being queried, and in a production environment most queries touch only some of the columns.
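The read comparison above can be illustrated with a small in-memory sketch. It is only a model, not how HBase lays out files: the three records, the `row_layout` and `column_layout` lists, and both reader functions are made up for illustration, and "touched" simply counts how many stored values a reader has to look at.

```python
# Three hypothetical records, each with the columns (row key, name, age).
rows = [
    ("r1", "Tom", 20),
    ("r2", "Ann", 31),
    ("r3", "Bob", 25),
]

# Row-oriented layout: the values of one row sit next to each other.
row_layout = [value for row in rows for value in row]

# Column-oriented layout: the values of one column sit next to each other.
column_layout = [row[i] for i in range(3) for row in rows]


def read_age_row_layout(layout, width=3, age_index=2):
    """Walk every stored value and discard the columns that were not requested."""
    touched = 0
    ages = []
    for position, value in enumerate(layout):
        touched += 1
        if position % width == age_index:
            ages.append(value)
    return ages, touched


def read_age_column_layout(layout, row_count=3, age_index=2):
    """Jump straight to the contiguous run that holds the age column."""
    start = age_index * row_count
    ages = layout[start:start + row_count]
    return ages, len(ages)
```

Both readers return the same ages, but the row-layout reader touches all nine stored values while the column-layout reader touches only three.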

    4) Advantages of column-based storage

      ① The data in each column is homogeneous (of the same type), which avoids frequent switching between types.

      ② Because the data in each column has the same type, a more efficient compression algorithm can be used to compress it.
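To see why homogeneous columns compress well, here is a minimal sketch using run-length encoding. The text does not name a compression algorithm, so run-length encoding is chosen here purely for illustration, and the sample data and the `run_length_encode` helper are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse each run of equal adjacent values into a (value, count) pair."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical records: (row key, city).
records = [
    ("r1", "Beijing"),
    ("r2", "Beijing"),
    ("r3", "Beijing"),
    ("r4", "Shanghai"),
    ("r5", "Shanghai"),
]

# Column layout: all city values are adjacent, so equal values form long runs.
city_column = [city for _, city in records]

# Row layout: row keys and cities are interleaved, so no runs form.
row_stream = [value for record in records for value in record]

encoded_column = run_length_encode(city_column)
encoded_rows = run_length_encode(row_stream)
```

In this toy data the city column shrinks from five values to two pairs, while the interleaved row stream does not shrink at all.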

 Three, HBase column family

  HBase is column-oriented storage. The purpose of the column family mechanism is to reduce the number of data writes and so improve write performance.

  Note: when creating an HBase table, do not define too many column families. Columns with similar I/O characteristics should be placed in the same column family to avoid access across column families. For example, if the name and age columns are frequently inserted or queried together, put both columns in a single column family.

    

    The column families of an HBase table are specified when the table is created; columns can be added afterwards simply by inserting data. There is no need to worry about the storage cost of adding many columns, because when a row has no data for a column, that column takes up no extra disk space. As a result, an HBase table is often sparse.
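The sparseness described above can be modeled in a few lines. This is a simplified sketch, not HBase's actual storage format: the `SparseTable` class and its methods are hypothetical, and it only shows that column families are fixed up front, columns appear on insert, and missing cells are never stored.

```python
class SparseTable:
    """Toy model: families are fixed at creation, columns appear on insert."""

    def __init__(self, families):
        self.families = set(families)
        # Only cells that were actually written are stored.
        self.cells = {}

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.cells[(row_key, family, qualifier)] = value

    def get_row(self, row_key):
        return {
            f"{family}:{qualifier}": value
            for (key, family, qualifier), value in self.cells.items()
            if key == row_key
        }


table = SparseTable(["info"])
table.put("r1", "info", "name", "Tom")
table.put("r1", "info", "age", 20)
table.put("r2", "info", "name", "Ann")  # r2 has no age, so nothing is stored for it
```

Here `table.cells` holds three entries rather than four: the missing `info:age` of row `r2` does not occupy a slot.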

Four, HBase row key

   When inserting data into HBase, a row key must be specified for every row, and it must be unique. HBase is essentially a key-value storage system, in which the key is the row key (rowKey) and the value is the set of column data in that row.

   HBase sorts rows by RowKey in ascending lexicographic (dictionary) order.
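Lexicographic ordering compares keys byte by byte, which can be surprising for numeric suffixes. The sketch below uses Python's built-in byte-string ordering as a stand-in; the key names are made up, and zero-padding is shown as one common way to make the dictionary order match the numeric order.

```python
# Hypothetical row keys with numeric suffixes.
row_keys = [b"row2", b"row10", b"row1"]

# Byte-wise dictionary order: "row10" sorts before "row2" because "1" < "2".
sorted_keys = sorted(row_keys)

# Zero-padding the numeric part makes dictionary order match numeric order.
padded_keys = sorted(f"row{n:04d}".encode() for n in (2, 10, 1))
```

With the unpadded keys the order is `row1`, `row10`, `row2`; with padding it is `row0001`, `row0002`, `row0010`.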

  

Five, versions of historical data

  HBase stores historical versions of data. To query them, run: scan 'tabname', {RAW => true, VERSIONS => 3}

  Here VERSIONS => 3 means that up to three versions of the data are shown (the current version plus historical versions, three in total). RAW => true additionally returns delete markers and deleted cells that have not yet been cleaned up.

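  By default, older HBase releases keep at most 3 versions; since release 0.96 the default is 1, and the limit is set per column family with VERSIONS.

  Do not keep too many historical versions, because they waste a lot of storage space.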
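The effect of a version limit can be sketched with a toy cell that keeps only the newest values. The `VersionedCell` class is hypothetical and trims old versions immediately on every write, which is a simplification made here for illustration; `max_versions=3` is passed explicitly to mirror the VERSIONS => 3 example.

```python
class VersionedCell:
    """Toy model of one cell that keeps at most max_versions timestamped values."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # (timestamp, value) pairs, newest first

    def put(self, timestamp, value):
        self.versions.append((timestamp, value))
        # Keep the newest first and drop everything beyond the limit.
        self.versions.sort(key=lambda pair: pair[0], reverse=True)
        del self.versions[self.max_versions:]

    def get(self, versions=1):
        """Return up to `versions` of the newest values."""
        return self.versions[:versions]


cell = VersionedCell(max_versions=3)
for timestamp, value in [(1, "a"), (2, "b"), (3, "c"), (4, "d")]:
    cell.put(timestamp, value)
```

After four writes only the three newest versions remain; a plain `cell.get()` returns just the latest one, and `cell.get(3)` returns all three.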

  

 


Origin www.cnblogs.com/rmxd/p/11314721.html