Introduction to the HBase Column-Oriented Data Model

The data model is the key to understanding a database. This section introduces the HBase data model and the basic concepts related to it, and describes the conceptual view and the physical view of an HBase table.

Data Model Overview

HBase is a sparse, multi-dimensional, ordered map.

Each cell in the table is indexed by a row key, a column family, a column qualifier, and a timestamp. The value of each cell is an uninterpreted byte string with no data type. When user data is stored in a table, each row has a unique row key and any number of columns.

Each row of the table consists of one or more column families, and a column family can contain any number of columns. Within the same table schema, every row has the same column families, that is, the number and the names of the column families are the same, but the number of columns in each column family may differ from row to row, as shown in Figure 1.

Figure 1: Schematic of the HBase data model

HBase stores the data of one column family together. Column families support dynamic extension: new columns can be added at any time, and the number of columns does not have to be defined in advance. So although every row in a table has the same column families, different rows may have very different columns. As a result, when the table is viewed as a whole, some columns have no value in a given row, which is why HBase tables are sparse.
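The sparse, ordered map described above can be sketched in a few lines of Python. This is only an in-memory illustration, not the HBase client API; the row keys, the families info and courses, and all values are made up for the example.

```python
# Illustrative in-memory model only; this is not the HBase client API.
# Layout: row key -> column family -> column qualifier -> value (all bytes).
table = {
    b"row1": {
        b"info": {b"name": b"Ann", b"email": b"ann@example.com"},
        b"courses": {b"math": b"90"},
    },
    b"row2": {
        b"info": {b"name": b"Bob"},
        b"courses": {},  # the family exists for this row but holds no columns
    },
}


def columns_of(row):
    """Return the 'family:qualifier' names actually stored for one row."""
    return sorted(f + b":" + q for f, quals in row.items() for q in quals)


# Every column that appears anywhere in the table.
all_columns = sorted({c for row in table.values() for c in columns_of(row)})

for key in sorted(table):  # rows are visited in row-key order
    present = columns_of(table[key])
    absent = [c for c in all_columns if c not in present]
    print(key, "stored:", present, "absent:", absent)
```

Both rows have the same two column families, yet row2 stores only one column. The absent columns take up no space in the nested dictionaries, which is the sense in which the table is sparse.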

When HBase performs an update, it does not delete the old version of the data. It writes a new version, and the old version is retained.

Users can configure how many versions HBase retains. When querying, the user can choose to get the version closest to a given time, or to get all versions at once. If the query does not supply a timestamp, the system returns the version closest to the current time, that is, the most recent one.

HBase provides two ways of reclaiming old versions: one keeps only the most recent versions of the data, up to a configured number; the other keeps the versions written within a recent period, such as the last month.
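The two retention policies can be illustrated with a small sketch. The function name retain and its parameters max_versions, ttl, and now are hypothetical names chosen for this example; they are not HBase settings or API calls.

```python
def retain(versions, max_versions=None, ttl=None, now=None):
    """Return the versions that survive cleanup.

    versions: dict mapping timestamp -> value.
    max_versions: keep at most this many of the newest versions.
    ttl: keep only versions no older than `ttl` time units before `now`.
    """
    kept = sorted(versions.items(), reverse=True)  # newest first
    if ttl is not None:
        kept = [(ts, v) for ts, v in kept if now - ts <= ttl]
    if max_versions is not None:
        kept = kept[:max_versions]
    return dict(kept)


history = {10: "v1", 20: "v2", 30: "v3", 40: "v4"}

last_two = retain(history, max_versions=2)   # policy 1: newest N versions
recent = retain(history, ttl=25, now=40)     # policy 2: versions from a recent period

print(last_two)
print(recent)
```

With these inputs, the count-based policy keeps the versions at timestamps 40 and 30, while the time-based policy keeps every version written within 25 time units of now.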

Basic Concepts of the Data Model

HBase stores data in tables that have rows and columns and are organized as a multidimensional map. This section presents the basic concepts of the HBase data model in one place.

1. Table

HBase organizes data in tables. A table consists of rows and columns, and the columns are grouped into column families.

2. Row

Within a table, each row represents one data object. Each row consists of a row key and one or more columns. The row key uniquely identifies the row. It has no specific data type, is stored as binary bytes, and rows are sorted in lexicographic order of their row keys.

Because rows are stored in row-key order, row key design is very important. An important design principle is that related rows should be stored close to each other. For example, when designing a table that records websites, the row key should be the reversed domain name (for example, org.apache.www, org.apache.mail, org.apache.jira). With this design, all domain names related to apache are stored close together in the table.

There are only three ways to access the rows of a table: get a single row by its row key; get multiple rows by a range of row keys; perform a full table scan.
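The effect of reversed domain names on row ordering, together with the three access paths, can be sketched as follows. This is an in-memory approximation using a sorted key list; the helper names reverse_domain, get, scan_range, and scan_all are hypothetical and do not belong to the HBase API. The range scan is assumed to include the start key and exclude the stop key.

```python
import bisect


def reverse_domain(host):
    """'www.apache.org' -> 'org.apache.www'"""
    return ".".join(reversed(host.split(".")))


hosts = ["www.apache.org", "www.cnn.com", "mail.apache.org", "jira.apache.org"]
rows = {reverse_domain(h).encode(): {"host": h} for h in hosts}
keys = sorted(rows)  # byte-wise lexicographic order of the row keys


def get(key):
    """1. Fetch a single row by its row key."""
    return rows.get(key)


def scan_range(start, stop):
    """2. Fetch the row keys with start <= key < stop."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, stop)
    return keys[lo:hi]


def scan_all():
    """3. Full table scan in row-key order."""
    return list(keys)


# b"/" is the byte right after b".", so this range covers every key
# that starts with b"org.apache.".
apache_rows = scan_range(b"org.apache.", b"org.apache/")
print(apache_rows)
```

Because the apache hosts share the prefix org.apache. after reversal, they sit next to each other in the sorted key list and a single range scan retrieves all of them.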

3. Column

A column is identified jointly by its column family and its column qualifier, separated by a colon (":"), for example family:qualifier.

4. Column Family

Column families must be set up in advance when an HBase table is defined, and every column in the table must belong to a column family. Once defined, a column family cannot be modified easily, because doing so affects the actual physical storage structure of HBase. The column qualifiers in a column family and their values, however, can be added or deleted dynamically.

Every row in a table has the same column families, but the rows are not required to have the same column qualifiers within a column family. The table is therefore a sparse structure, which avoids redundant data to some extent.

A column family is a collection of columns. All members of a column family share the same prefix; for example, courses:history and courses:math are both members of the courses column family. The ":" is the delimiter that separates the column family prefix from the column name. Column families must be declared when the table is created, whereas new columns can be created at any time.
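The difference between fixed column families and dynamic column qualifiers can be illustrated with a small sketch. The Table class below is a hypothetical stand-in written for this example, not the HBase client; the family grades and all row keys and values are made up.

```python
class Table:
    """Toy table: column families are fixed at creation, qualifiers are not."""

    def __init__(self, families):
        self.families = frozenset(families)  # declared once, up front
        self.rows = {}

    def put(self, row_key, column, value):
        # Split "family:qualifier" at the first colon.
        family, _, qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value


t = Table(["courses"])
t.put("student1", "courses:history", "A")
t.put("student1", "courses:math", "B")
t.put("student2", "courses:art", "C")  # a new qualifier needs no schema change

try:
    t.put("student1", "grades:final", "A")  # "grades" was never declared
    rejected = False
except KeyError:
    rejected = True

print(t.rows, rejected)
```

Writing to an undeclared family fails, while a qualifier that has never been seen before is accepted without any change to the table definition.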

5. Column Qualifier

The data in a column family is addressed by column qualifier. Column qualifiers do not need to be defined in advance and do not need to be the same across rows. A column qualifier has no specific data type and is stored as binary bytes.

6. Cell

A row key, a column family, and a column qualifier together identify a cell. The data stored in a cell is called the cell value. It has no specific data type and is stored as binary bytes.

7. Timestamp

By default, each cell is versioned by the timestamp at which it was inserted.

When a cell is read without a timestamp, the latest data is returned by default. When a cell is written without a timestamp, the current time is used by default. HBase maintains the number of cell versions separately for each column family; by default it retains three versions (this was the default in older releases; since HBase 0.96 the default is one).
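The default timestamp behavior on reads and writes can be sketched with a toy cell. The Cell class and its methods are hypothetical; a read with an explicit timestamp is assumed here to match that exact version, and the limit of three versions follows the default described above.

```python
import time


class Cell:
    """Toy versioned cell; not the HBase API."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = {}  # timestamp -> value

    def put(self, value, ts=None):
        if ts is None:
            ts = int(time.time() * 1000)  # default: current time in milliseconds
        self.versions[ts] = value
        # Drop the oldest versions beyond the configured limit.
        for old in sorted(self.versions, reverse=True)[self.max_versions:]:
            del self.versions[old]

    def get(self, ts=None):
        if not self.versions:
            return None
        if ts is None:
            ts = max(self.versions)  # default: newest version
        return self.versions.get(ts)


c = Cell()
for ts, value in [(1, "a"), (2, "b"), (3, "c"), (4, "d")]:
    c.put(value, ts=ts)

print(c.get())      # newest version
print(c.get(ts=2))  # a specific older version
print(c.get(ts=1))  # already dropped: only three versions are kept
```

After four writes only the three newest versions remain, so the read at timestamp 1 finds nothing.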

Conceptual View

In the conceptual view, an HBase table can be regarded as a sparse, multidimensional map in which a particular cell is located by the format "row key + column family:column qualifier + timestamp". Because HBase tables are sparse, some columns may be empty.

Figure 2 shows the conceptual view of HBase: a fragment of a table that stores web page information. The row key is a reversed URL; for example, www.cnn.com is reversed to com.cnn.www.

The benefit of reversing the URL is that data from the same site is stored in adjacent positions, which speeds up reads of that site's data. The contents column family stores the content of the page; the anchor column family stores the links that reference the page; the mime column family stores the media type of the page.

Figure 2: HBase conceptual view

The conceptual view in Figure 2 contains only one row of data, for the site com.cnn.www. The row is uniquely identified by the row key "com.cnn.www", and each logical modification of the row corresponds to a timestamp. The table has four columns: contents:html, anchor:cnnsi.com, anchor:my.look.ca, and mime:type. The prefix of each column gives the column family it belongs to.

As Figure 2 shows, the page content has three versions, with timestamps t3, t5, and t6. The page is referenced by two pages, my.look.ca and cnnsi.com, at times t8 and t9 respectively. From t6 onward, the media type of the page is "text/html".

A cell can be located by a "three-dimensional coordinate", namely [row key, column family:column qualifier, timestamp].

For example, in Figure 2:

["com.cnn.www", anchor:cnnsi.com, t9] corresponds to the cell value "CNN".
["com.cnn.www", anchor:my.look.ca, t8] corresponds to the cell value "CNN.com".
["com.cnn.www", mime:type, t6] corresponds to the cell value "text/html".
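The coordinate lookup can be modeled as a dictionary keyed by the three coordinates. In this sketch the timestamps t3 to t9 are represented by the integers 3 to 9, the page contents are placeholder strings, and lookup is a hypothetical helper; none of these come from the HBase API.

```python
# Cells keyed by (row key, "family:qualifier", timestamp).
# Assumption: timestamps t3..t9 are modeled as the integers 3..9.
cells = {
    ("com.cnn.www", "anchor:cnnsi.com", 9): "CNN",
    ("com.cnn.www", "anchor:my.look.ca", 8): "CNN.com",
    ("com.cnn.www", "mime:type", 6): "text/html",
    ("com.cnn.www", "contents:html", 6): "<html>... (t6)",  # placeholder content
    ("com.cnn.www", "contents:html", 5): "<html>... (t5)",  # placeholder content
    ("com.cnn.www", "contents:html", 3): "<html>... (t3)",  # placeholder content
}


def lookup(row, column, ts=None):
    """Exact coordinate if ts is given, otherwise the newest version of the column."""
    if ts is not None:
        return cells.get((row, column, ts))
    stamps = [t for (r, c, t) in cells if r == row and c == column]
    return cells[(row, column, max(stamps))] if stamps else None


print(lookup("com.cnn.www", "anchor:cnnsi.com", 9))
print(lookup("com.cnn.www", "contents:html"))      # newest of three versions
print(lookup("com.cnn.www", "mime:type", 8))       # no cell at this coordinate
```

A coordinate that was never written, such as mime:type at t8, simply has no entry, so the lookup returns None.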

As Figure 2 shows, in the conceptual view of an HBase table every row contains the same column families, although a row does not need to store data in every column family. For example, in the first two rows of Figure 2, the contents and mime column families are empty. In the last three rows, the anchor column family is empty. In the last two rows, the mime column family is empty.

Physical View

Although in the conceptual view each HBase table is made up of many rows, at the physical storage level HBase uses column-based storage, organized by column family, rather than row-based storage as in a relational database. This is an important difference between HBase and relational databases.

When the conceptual view in Figure 2 is stored physically, it is stored as the three fragments shown in Figure 3. In other words, the HBase table is stored separately for each of its three column families: contents, anchor, and mime. Data belonging to the same column family is stored together, and each column family stores the row key and the timestamp alongside the data.

In the conceptual view of Figure 2, many columns are empty, that is, no value exists for them. In the physical view, these empty columns are not stored as null values; they are simply not stored at all, which saves a great deal of storage space. When one of these empty cells is requested, a null value is returned.
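The split from the conceptual view into per-family storage can be sketched as follows. The data mirrors the com.cnn.www example, with timestamps modeled as integers and page contents as placeholder strings; to_physical and read are hypothetical helpers, and real HBase store files are organized differently in detail.

```python
# Conceptual view: one entry per (row key, timestamp); empty columns are absent.
# Assumption: timestamps t3..t9 are modeled as the integers 3..9.
conceptual = [
    ("com.cnn.www", 9, {"anchor:cnnsi.com": "CNN"}),
    ("com.cnn.www", 8, {"anchor:my.look.ca": "CNN.com"}),
    ("com.cnn.www", 6, {"contents:html": "<html>...", "mime:type": "text/html"}),
    ("com.cnn.www", 5, {"contents:html": "<html>..."}),
    ("com.cnn.www", 3, {"contents:html": "<html>..."}),
]


def to_physical(rows):
    """Group cells by column family; each stored cell carries row key and timestamp."""
    stores = {}
    for row_key, ts, columns in rows:
        for column, value in columns.items():
            family = column.split(":", 1)[0]
            stores.setdefault(family, []).append((row_key, column, ts, value))
    return stores


physical = to_physical(conceptual)


def read(row_key, column, ts):
    """Return the stored value, or None when the cell was never stored."""
    family = column.split(":", 1)[0]
    for r, c, t, v in physical.get(family, []):
        if (r, c, t) == (row_key, column, ts):
            return v
    return None


for family, stored in sorted(physical.items()):
    print(family, len(stored), "cells")
print(read("com.cnn.www", "mime:type", 8))  # empty in the conceptual view
```

Only six cells are stored in total across the three families. The empty positions of the conceptual view occupy no space, and reading one of them yields None.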

Figure 3: HBase physical view
