Detailed explanation of the HBase data model

1. Introduction

HBase is a distributed, column-oriented, open source database modeled on the Google paper "Bigtable: A Distributed Storage System for Structured Data"; it is an open source implementation of Google Bigtable. It uses Hadoop HDFS as its file storage system, Hadoop MapReduce to process the massive data stored in HBase, and ZooKeeper as its coordination service.

2. HBase table structure

HBase stores data in tables. A table consists of rows and columns, and the columns are grouped into column families.

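+---------+----------------------------+-----------------------------+----------------+
| Row Key | column-family1             | column-family2              | column-family3 |
+---------+------------------+---------+---------+---------+---------+----------------+
|         | column1          | column2 | column1 | column2 | column3 | column1        |
+---------+------------------+---------+---------+---------+---------+----------------+
| key1    | t1:abc, t2:gdxdf |         |         |         |         |                |
| key2    |                  |         |         |         |         |                |
| key3    |                  |         |         |         |         |                |
+---------+------------------+---------+---------+---------+---------+----------------+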

As shown in the table above, key1, key2, and key3 are the unique row keys of three records, and column-family1, column-family2, and column-family3 are three column families, each containing several columns. For example, column-family1 contains two columns named column1 and column2. The entry t1:abc, t2:gdxdf is the cell uniquely determined by row key1 and column-family1:column1. The cell holds two values, abc and gdxdf, with different timestamps, t1 and t2; by default HBase returns the value with the latest timestamp to the requester.

The specific meanings of these terms are as follows:

(1) Row Key

As in other NoSQL databases, the row key is the primary key used to retrieve records. There are only three ways to access rows in an HBase table:

(1.1) Access by a single row key

(1.2) Range scan over row keys

(1.3) Full table scan

A row key can be any string (the maximum length is 64 KB; in practice it is usually 10 to 100 bytes). Inside HBase, the row key is stored as a byte array.

Rows are stored sorted by the lexicographic (byte) order of their row keys. When designing keys, take full advantage of this ordering and place rows that are frequently read together next to each other (locality).
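The three access paths and the effect of sorted storage can be sketched with a small in-memory stand-in. This is only a toy model in Python, not HBase client code: the rows dictionary, the key layout (user#..., video#...), and the get and scan helpers are all hypothetical names chosen for illustration. It assumes a range scan includes the start row and excludes the stop row.

```python
from bisect import bisect_left

# Toy stand-in for a table: row key (bytes) -> {column name: value}.
rows = {
    b"user#0001": {"info:name": b"Tom"},
    b"user#0002": {"info:name": b"Jim"},
    b"user#0105": {"info:name": b"Ann"},
    b"video#0001": {"info:title": b"intro"},
}
# bytes objects compare lexicographically, byte by byte, like row keys.
sorted_keys = sorted(rows)


def get(row_key):
    """Access by a single row key."""
    return rows.get(row_key)


def scan(start_row=None, stop_row=None):
    """Range scan over [start_row, stop_row); no bounds means a full table scan."""
    lo = 0 if start_row is None else bisect_left(sorted_keys, start_row)
    hi = len(sorted_keys) if stop_row is None else bisect_left(sorted_keys, stop_row)
    return [(key, rows[key]) for key in sorted_keys[lo:hi]]


print(get(b"user#0002"))
# "$" is the byte right after "#", so this range covers every key starting with "user#".
print([key for key, _ in scan(b"user#", b"user$")])
print(len(scan()))
```

Because keys that share a prefix sit next to each other in sorted order, the rows that are read together can be fetched with one short range scan instead of a full table scan.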

 

Note:

When integers are sorted lexicographically as strings, the result is 1, 10, 100, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, …, 9, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99. To preserve the natural ordering of integers, row keys must be left-padded with zeros.
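A quick way to see the effect of zero padding is to sort a few keys in Python, whose bytes type also compares byte by byte. The sample ids and the padding width of 6 digits are arbitrary choices for this sketch; in practice the width has to cover the largest value you expect.

```python
ids = [1, 2, 9, 10, 11, 100]

# Row keys built from plain decimal strings sort byte by byte, not numerically.
plain_keys = sorted(str(i).encode() for i in ids)

# Left-padding with zeros to a fixed width restores the numeric order.
padded_keys = sorted(f"{i:06d}".encode() for i in ids)

print(plain_keys)   # 1, 10, 100, 11, 2, 9
print(padded_keys)  # 000001, 000002, 000009, 000010, 000011, 000100
```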

 

A read or write of a row is an atomic operation (regardless of how many columns are read or written at once). This design decision makes it easy for users to understand how the program behaves when concurrently updating the same row.

 

(2) Column Family

Every column in an HBase table belongs to a column family. Column families are part of the table's schema (individual columns are not) and must be defined before the table is used. Column names are prefixed with the column family name. For example, courses:history and courses:math both belong to the courses column family.
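The family:qualifier naming convention is easy to illustrate in a few lines of Python. The split_column helper below is a hypothetical name for this sketch, not part of any HBase API; it simply splits at the first colon and groups columns by family.

```python
def split_column(name: str):
    """Split 'family:qualifier' at the first colon; the qualifier may be empty."""
    family, _, qualifier = name.partition(":")
    return family, qualifier


columns = ["courses:history", "courses:math"]
by_family = {}
for column in columns:
    family, qualifier = split_column(column)
    by_family.setdefault(family, []).append(qualifier)

print(by_family)  # {'courses': ['history', 'math']}
```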

 

Access control and disk and memory usage accounting are performed at the column-family level. In practice, permissions on column families help us manage different types of applications: some applications may add new base data, some may read base data and create derived column families, and some may only view existing data (and possibly not even all of it, for privacy reasons).

 

(3) Cell

In HBase, the storage unit determined by a row and a column is called a cell. A cell is uniquely identified by {row key, column (= <family> + <label>), version}. The data in a cell is untyped and is stored as raw bytes.

 

(4) Timestamp

Each cell can hold multiple versions of the same data, indexed by timestamp. A timestamp is a 64-bit integer. It can be assigned by HBase automatically when the data is written, in which case it is the current system time in milliseconds, or assigned explicitly by the client. An application that wants to avoid version conflicts must generate unique timestamps itself. Within each cell, versions are sorted in reverse chronological order, so the latest data comes first.

To avoid the management burden (both storage and indexing) of keeping too many versions, HBase provides two ways to reclaim old versions: keep only the last n versions of the data, or keep only the versions written within a recent period (such as the last seven days). Either can be set per column family.
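The versioning rules above can be made concrete with a toy model of a single cell. This is a sketch in Python under stated assumptions, not how HBase is implemented: the Cell class and its max_versions and ttl_ms parameters are hypothetical names, and old versions are simply filtered out on read here rather than reclaimed from storage.

```python
import time


class Cell:
    """Toy model of one cell: several timestamped versions of a value."""

    def __init__(self, max_versions=3, ttl_ms=None):
        self.max_versions = max_versions  # policy 1: keep the newest n versions
        self.ttl_ms = ttl_ms              # policy 2: keep versions newer than this age
        self.versions = {}                # timestamp in ms -> value bytes

    def put(self, value, ts=None):
        # Without an explicit timestamp, use the current time in milliseconds.
        ts = int(time.time() * 1000) if ts is None else ts
        self.versions[ts] = value         # writing the same timestamp replaces that version

    def get(self, now_ms, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        live = sorted(self.versions.items(), reverse=True)
        if self.ttl_ms is not None:
            live = [(ts, value) for ts, value in live if now_ms - ts <= self.ttl_ms]
        return live[: min(versions, self.max_versions)]


cell = Cell(max_versions=2)
cell.put(b"abc", ts=1)
cell.put(b"gdxdf", ts=2)
cell.put(b"xyz", ts=3)
print(cell.get(now_ms=3))              # only the newest version
print(cell.get(now_ms=3, versions=5))  # capped at max_versions
```

Note that two writes with the same timestamp collapse into one version in this model, which is why an application that needs every write preserved has to supply unique timestamps.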

 

3. Basic usage of the HBase shell

HBase provides a shell for interactive use. Run the command hbase shell to enter it, and run help to see the help text for the available commands.

The following example, a student score sheet found online, demonstrates how to use HBase.

 

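+------+-------+------------+
| name | grade | course     |
+------+-------+------+-----+
|      |       | math | art |
+------+-------+------+-----+
| Tom  | 5     | 97   | 87  |
| Jim  | 4     | 89   | 80  |
+------+-------+------+-----+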

 

Here grade is a column family with a single column of its own, and course is a column family with two columns, math and art. We can of course add more columns to the course column family as needed, such as computer and physics.

 

(1) Create a table scores with two column families, grade and course

hbase(main):001:0> create 'scores', 'grade', 'course'


Use the list command to see which tables currently exist in HBase, and the describe command to view a table's structure. (Remember that all table names and column names must be enclosed in quotation marks.)

(2) Insert values according to the designed table structure:

put 'scores','Tom','grade:','5'
put 'scores','Tom','course:math','97'
put 'scores','Tom','course:art','87'
put 'scores','Jim','grade','4'
put 'scores','Jim','course:math','89'
put 'scores','Jim','course:art','80'


The table is now populated. The structure is quite flexible: new columns can be added freely within a column family. If a column family has no named columns under it, the trailing colon may be included or omitted.

The put command is simple and has only one form:

hbase> put 't1', 'r1', 'c1', 'value', ts1

Here t1 is the table name, r1 the row key, c1 the column name, and value the cell value. ts1 is the timestamp, which is usually omitted.

(3) Query data by row key

get 'scores', 'Jim'
get 'scores', 'Jim', 'grade'

As you may have noticed, HBase shell commands follow a general pattern: the operation keyword comes first, followed by the table name, row key, column name, and so on. Any further conditions are added in curly braces.
get can be used as follows:

hbase> get 't1', 'r1'
hbase> get 't1', 'r1', {TIMERANGE => [ts1, ts2]}
hbase> get 't1', 'r1', {COLUMN => 'c1'}
hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMERANGE => [ts1, ts2]}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}
hbase> get 't1', 'r1', 'c1'
hbase> get 't1', 'r1', 'c1', 'c2'
hbase> get 't1', 'r1', ['c1', 'c2']

(4) Scan all data

scan 'scores'

You can also specify modifiers: TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH, or COLUMNS. Without any modifiers, as in the example above, all rows are displayed.

Examples are as follows:

hbase> scan '.META.'
hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
hbase> scan 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
hbase> scan 't1', {FILTER => "(PrefixFilter ('row2') AND (QualifierFilter (>=, 'binary:xyz'))) AND (TimestampsFilter (123, 456))"}
hbase> scan 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}



A filter can be specified in two ways:

a. Using a filterString – more information on this is available in the
Filter Language document attached to the HBASE-4176 JIRA
b. Using the fully qualified class name (including the package) of the filter.

There is also a CACHE_BLOCKS modifier, which turns block caching on or off for the scan. It is enabled by default (CACHE_BLOCKS => true) and can be disabled (CACHE_BLOCKS => false).

(5) Delete the specified data

delete 'scores','Jim','grade'
deleteall 'scores','Jim'

The delete command has little variation; there is only one form:

hbase> delete 't1', 'r1', 'c1', ts1

The deleteall command deletes an entire row, as in the second example above; use it with caution!
To delete all the data in a table, use the truncate command. In fact, there is no direct command for clearing a table: truncate is itself a combination of the disable, drop, and create commands.

(6) Modify the table structure

disable 'scores'
alter 'scores', NAME => 'info'
enable 'scores'


The alter command is used as follows (in versions where altering an enabled table fails, disable the table first):
a. Change or add a column family:

hbase> alter 't1', NAME => 'f1', VERSIONS => 5

b. Delete a column family:

hbase> alter 't1', NAME => 'f1', METHOD => 'delete'
hbase> alter 't1', 'delete' => 'f1'


c. You can also modify table attributes such as MAX_FILESIZE, MEMSTORE_FLUSHSIZE, READONLY, and DEFERRED_LOG_FLUSH:

hbase> alter 't1', METHOD => 'table_att', MAX_FILESIZE => '134217728'

d. You can add a table coprocessor:

hbase> alter 't1', METHOD => 'table_att', 'coprocessor' => 'hdfs:///foo.jar|com.foo.FooRegionObserver|1001|arg1=1,arg2=2'

Multiple coprocessors can be configured on a table; a sequence number is appended automatically to identify each one. The attribute value used to load a coprocessor (a kind of filter program, so to speak) must follow this format:

[coprocessor jar file location] | class name | [priority] | [arguments]
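To make the four-field format easier to read, here is a small Python sketch that splits such a value into its parts. The function name parse_coprocessor_spec and the returned dictionary keys are hypothetical and exist only for this illustration; HBase parses the attribute itself, and this sketch assumes every argument has the form name=value.

```python
def parse_coprocessor_spec(spec: str):
    """Split 'jar path|class name|priority|arguments' into named fields."""
    parts = [part.strip() for part in spec.split("|")]
    parts += [""] * (4 - len(parts))  # optional trailing fields may be missing
    jar, class_name, priority, raw_args = parts[:4]
    # Arguments are comma-separated name=value pairs (assumed well-formed).
    args = dict(pair.split("=", 1) for pair in raw_args.split(",") if pair)
    return {
        "jar": jar or None,
        "class": class_name,
        "priority": int(priority) if priority else None,
        "args": args,
    }


spec = "hdfs:///foo.jar|com.foo.FooRegionObserver|1001|arg1=1,arg2=2"
print(parse_coprocessor_spec(spec))
```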

e. Remove a table attribute or a coprocessor as follows:

hbase> alter 't1', METHOD => 'table_att_unset', NAME => 'MAX_FILESIZE'
hbase> alter 't1', METHOD => 'table_att_unset', NAME => 'coprocessor$1'

f. Multiple alter commands can be executed at one time:

hbase> alter 't1', {NAME => 'f1'}, {NAME => 'f2', METHOD =>

(7) Count the rows in a table

hbase> count 't1'
hbase> count 't1', INTERVAL => 100000
hbase> count 't1', CACHE => 1000
hbase> count 't1', INTERVAL => 10, CACHE => 1000


Counting is generally time-consuming; for large tables, consider running a MapReduce row-counting job instead. The running count is printed every 1000 rows by default (INTERVAL). Scan caching is enabled for count scans, and the default cache size is 10 rows (CACHE).

(8) disable and enable operations
Many operations require the table to be taken offline first, such as the alter operation mentioned above; dropping a table requires it as well. disable_all and enable_all operate on multiple tables at once.

(9) Dropping a table
First disable the table, then run the drop command:

hbase> disable 't1'
hbase> drop 't1'

The above is a detailed explanation of some common commands. The full set of HBase shell commands is listed below, divided into command groups. The names give a general idea of what each command does; use help "cmd" to see its detailed usage.


COMMAND GROUPS:
Group name: general
Commands: status, version

Group name: ddl
Commands: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all,
enable, enable_all, exists, is_disabled, is_enabled, list, show_filters

Group name: dml
Commands: count, delete, deleteall, get, get_counter, incr, put, scan, truncate

Group name: tools
Commands: assign, balance_switch, balancer, close_region, compact, flush, hlog_roll, major_compact,
move, split, unassign, zk_dump

Group name: replication
Commands: add_peer, disable_peer, enable_peer, list_peers, remove_peer, start_replication,
stop_replication

Group name: security
Commands: grant, revoke, user_permission


4. HBase shell scripts
Since these are shell commands, you can write a series of HBase shell commands into a file and run them in order, just like a Linux shell script. Put the commands in a file and then run:

$ hbase shell test.hbaseshell


Convenient and easy to use.
