Introduction and construction of HBase

1. Overview

    HBase is a distributed, column-oriented database built on top of Hadoop.

1. Features

    HBase originated from Google's BigTable paper; Apache later produced HBase as an open-source implementation of that design. It is a NoSQL, non-relational database and does not follow the normal forms of relational databases.

    It is suitable for storing semi-structured and unstructured data, and for storing sparse data; empty cells in sparse data occupy no storage space.

    It provides column-family-oriented storage with real-time insert, delete, update, and query capabilities, making it a true database rather than just a storage layer.

    It can store massive amounts of data with strong performance, achieving millisecond-level queries over hundreds of millions of records. However, it does not provide strict transaction control; transactions are guaranteed only at the row level.

    HBase is a highly reliable, high-performance, column-oriented, and scalable distributed storage system. Using hbase technology, a large-scale structured storage cluster can be built on inexpensive commodity PCs.

    HBase uses Hadoop HDFS as its file storage system, Hadoop MapReduce to process the massive data stored in HBase, and Zookeeper as its coordination service.

2. Logical structure

    HBase stores data through tables, but the structure of tables is very different from relational databases.

1. row key

    RowKey: the primary key of an HBase table. Every record in an HBase table must have a row key, and row keys cannot be duplicated.

    There are three ways to access data in HBase: via a single row key, via a range of row keys, or by a full table scan.

    Because the stored content is semi-structured or unstructured, only these access methods can be used for queries.

    A row key can be any string, with a maximum length of 64KB; in practice the length is generally 10-100 bytes. Internally, hbase stores the row key as a byte array.

    When storing, the data is sorted by the lexicographical (byte) order of the row key. When designing keys, take full advantage of this sorted storage: place rows that are frequently read together next to each other.

    Notice:

    The result of lexicographic sorting of int is 1, 10, 100, 11, 2, 20, 21, ..., 9, 91, 96, 97, 98, 99. To preserve the natural ordering of integers, row keys must be left-padded with 0s.
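    The zero-padding fix can be sketched in Python (a standalone illustration, not HBase code): sorting the string forms of integers reproduces the lexicographic order above, and left-padding with zeros restores the natural numeric order.

```python
# HBase compares row keys as byte strings, i.e. lexicographically.
nums = [1, 2, 9, 10, 11, 20, 21, 100]

keys = [str(n) for n in nums]
print(sorted(keys))    # lexicographic: '1', '10', '100', '11', '2', ...

# Left-padding with zeros to a fixed width restores natural numeric order.
padded = [str(n).zfill(4) for n in nums]
print(sorted(padded))  # '0001', '0002', '0009', '0010', '0011', ...
```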

    A read or write of a row is an atomic operation (regardless of how many columns are read or written at once). This design decision makes it easy for users to understand how the program behaves when concurrently updating the same row.

2. Column family

    Column Family: part of the table's metadata. Column families must be declared when the table is created and cannot simply be added while the table is in use; adding one later requires altering the table. A column family can contain one or more columns.
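    For example, adding a column family to an existing table means altering the table; in the hbase shell this is typically done with the table disabled (the table and family names here are illustrative):

```
hbase> disable 'testtable'
hbase> alter 'testtable', NAME => 'colfam3'
hbase> enable 'testtable'
```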

3. Column

    Column: columns can be added dynamically, without prior declaration, at any time they are used. They are not part of the table's metadata; each column belongs to a column family.

4. Cells and Timestamps

    Cell and timestamp: a storage unit is determined by a row and a column. Each storage unit can hold multiple versions of a piece of data, distinguished by timestamp. The unique unit of stored data determined by row, column, and timestamp is called a cell.

    All data is stored in binary form; there is no notion of data type. Empty cells take up no space.

5. Cell

    A cell is uniquely identified by {row key, column (= <family> + <label>), version}. The data in a cell has no type and is stored as raw bytes.
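    The logical model above can be sketched as a toy Python structure (an illustration only, not HBase's real storage engine): a table maps a row key and a family:qualifier column to a set of timestamped versions, and a default read returns the newest version.

```python
from collections import defaultdict

class ToyTable:
    """Toy model of HBase's logical data model (not the real storage engine)."""

    def __init__(self):
        # data[row][family:qualifier] -> {timestamp: value}
        self.data = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts):
        # A cell is addressed by {row key, column, timestamp}.
        self.data[row][column][ts] = value

    def get(self, row, column):
        # Like a plain get/scan: return the value with the newest timestamp.
        versions = self.data[row][column]
        return versions[max(versions)] if versions else None

t = ToyTable()
t.put('myrow-1', 'colfam1:q1', b'value-1', ts=1)
t.put('myrow-1', 'colfam1:q1', b'value-2', ts=2)
print(t.get('myrow-1', 'colfam1:q1'))  # the newest version
```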

2. Installation configuration

1. Prepare

    The HBase version has strict requirements for the Hadoop version, which are matched as follows:

 

                      HBase-0.92.x    HBase-0.94.x    HBase-0.96
    Hadoop-0.20.205   S               X               X
    Hadoop-0.22.x     S               X               X
    Hadoop-1.0.x      S               S               S
    Hadoop-1.1.x      NT              S               S
    Hadoop-0.23.x     X               S               NT
    Hadoop-2.x        X               S               S

    X: incompatible, S: compatible, NT: not tested

    Preconditions:

    Install jdk, Zookeeper and Hadoop, and configure environment variables.

    The demo version selection is as follows:

    jdk:1.8

    Zookeeper:3.4.7

    Hadoop:2.7.1

    Hbase:0.98.17

    Zookeeper installation see: Zookeeper cluster construction

    For Hadoop installation, please refer to: Hadoop pseudo-distributed mode construction , Hadoop fully distributed cluster construction

2. Installation

1. Standalone mode

    Unzip the installation package directly.

tar -zxvf xxxxx.tar.gz

1>hbase-site.xml

    Modify conf/hbase-site.xml.

    Configure the location of the data files used by hbase. The default is /tmp/hbase-[username]. Since /tmp is the Linux temporary directory and may be emptied by the system, it is best to change this location.

<property>
<name>hbase.rootdir</name>
<value>file:///<path>/hbase</value>
</property>

2. Pseudo-distributed mode

1>hbase-env.sh

    Modify conf/hbase-env.sh and change the value of JAVA_HOME to the same as the environment variable.

export JAVA_HOME=xxxx

2>hbase-site.xml

    Modify hbase-site.xml to configure the hdfs information to be used.

<!--Set the hdfs address-->
<property>
<name>hbase.rootdir</name>
<value>hdfs://hadoop01:9000/hbase</value>
</property>
<!--Set the number of replicas-->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

    Start hbase.

3. Fully distributed mode

1> Configuration file

hbase-env.sh

    hbase-env.sh configures relevant environment variables required when HBase starts.

    Modify conf/hbase-env.sh and change the value of JAVA_HOME to the same as the environment variable.

export JAVA_HOME=xxxx

    HBASE_MANAGES_ZK is commented out by default and defaults to true, meaning HBase manages its own Zookeeper. You need to disable this automatic management by setting the value to false. If it is not changed, Zookeeper will be started and stopped together with HBase, making other services that rely on that Zookeeper unavailable.

export HBASE_MANAGES_ZK=false

 

hbase-site.xml

    hbase-site.xml holds HBase's basic configuration. At startup, HBase uses the configuration in hbase-default.xml by default; if needed, you can modify hbase-site.xml, whose entries override those in hbase-default.xml. Configuration changes take effect only after hbase is restarted.

    Modify hbase-site.xml to enable fully distributed mode.

    Configure hbase.cluster.distributed to true.

    Configure hbase.rootdir to be the HDFS access address.

<!--Configure the hdfs connection address; Hadoop pseudo-distributed mode is used here, so only one address is configured-->
<property>
<name>hbase.rootdir</name>
<value>hdfs://hadoop01:9000/hbase</value>
</property>
<!--Configure the number of replicas-->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<!--Configure HBase to start in cluster (distributed) mode-->
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<!--Configure the Zookeeper connection addresses-->
<property>
<name>hbase.zookeeper.quorum</name>
<value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
</property>

regionservers

    Configure the region servers by modifying the conf/regionservers file. List all hbase hosts, one host name per line. When hbase starts or shuts down, the hbase instances on these hosts are started or stopped in the configured order.
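    For example, assuming the same three hosts as in the Zookeeper quorum configured above, conf/regionservers would simply list:

```
hadoop01
hadoop02
hadoop03
```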

3. Start the cluster

    The startup sequence is as follows:

    Start zookeeper->start hadoop->start hbase.

./start-hbase.sh

    After the startup completes, you can visit the http://xxxxx:60010 address to check whether startup succeeded and manage hbase through the web interface; hbase can also be accessed through the hbase shell script.

    You can start a standby master to achieve high availability. No extra configuration is needed for the standby master; simply execute the following command on the corresponding server:

hbase-daemon.sh start master

    The master hot-standby mechanism used by HBase works on the same principle as NameNode hot standby in Hadoop: both are implemented with Zookeeper.

    Shut down the cluster:

stop-hbase.sh

 

3. Basic operation

1. Command

bin/start-hbase.sh
bin/hbase shell
hbase>status
hbase>help
hbase>create 'testtable','colfam1','colfam2'
hbase>list
hbase>describe 'testtable'
hbase>put 'testtable','myrow-1','colfam1:q1','value-1'
hbase>put 'testtable','myrow-2','colfam1:q2','value-2'
hbase>put 'testtable','myrow-2','colfam1:q3','value-3'
hbase>scan 'testtable'
hbase>get 'testtable','myrow-1'
hbase>delete 'testtable','myrow-2','colfam1:q2'
hbase>scan 'testtable'
hbase>disable 'testtable'
hbase>drop 'testtable'

2. Analysis

1. create

VERSIONS can be specified when creating a table. It sets how many of the latest versions of data are kept for the column family when data is persisted to the file system; it does not affect the historical versions still held in memory.

hbase>create 'testtable',{NAME=>'colfam1',VERSIONS=>3},{NAME=>'colfam2',VERSIONS=>1}
hbase>put 'testtable','myrow-1','colfam1:q1','value-1'
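    The retention rule can be sketched in Python (an illustration under the stated assumption that persisting a column family keeps only the newest n versions of each cell):

```python
# Sketch of the VERSIONS retention rule: when a column family is flushed,
# only the newest n versions of each cell survive.
def retain_versions(versions, n):
    """versions: {timestamp: value}; keep the n newest timestamps."""
    newest = sorted(versions, reverse=True)[:n]
    return {ts: versions[ts] for ts in newest}

cell = {1: 'value-1', 2: 'value-2', 3: 'value-3', 4: 'value-4'}
print(retain_versions(cell, 3))  # timestamps 4, 3 and 2 survive
```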

2. Check

hbase> scan 'hbase:meta'
hbase> scan 'hbase:meta', {COLUMNS => 'info:regioninfo'}
hbase> scan 'ns1:t1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
hbase> scan 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
hbase> scan 't1', {REVERSED => true}
hbase> scan 't1', {FILTER => "(PrefixFilter ('row2') AND (QualifierFilter (>=, 'binary:xyz'))) AND (TimestampsFilter ( 123, 456))"}
hbase> scan 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
hbase> scan 't1', { COLUMNS => ['c1', 'c2'], ATTRIBUTES => {'mykey' => 'myvalue'}}
hbase> scan 't1', { COLUMNS => ['c1', 'c2'], AUTHORIZATIONS => ['PRIVATE','SECRET']}
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}
hbase> scan 't1', {RAW => true, VERSIONS => 10}
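    The effect of the STARTROW and LIMIT options can be sketched over an in-memory sorted key list (a toy model, not the real HBase scanner; the row names are made up):

```python
from bisect import bisect_left

# Row keys are kept in sorted (lexicographic) order, as in HBase.
rows = sorted(['row1', 'row2', 'row3', 'xyz1', 'xyz2', 'zzz1'])

def scan(rows, startrow=None, limit=None):
    # STARTROW: begin at the first key >= startrow.
    i = bisect_left(rows, startrow) if startrow else 0
    result = rows[i:]
    # LIMIT: cap the number of rows returned.
    return result[:limit] if limit else result

print(scan(rows, startrow='xyz', limit=2))  # ['xyz1', 'xyz2']
```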

    Using scan directly, without RAW=>true, returns only the latest version of the data.

hbase>scan 'testtable'
hbase>put 'testtable','myrow-1','colfam1:q1','value-2'
hbase>scan 'testtable'
hbase>put 'testtable','myrow-1','colfam1:q1','value-3'
hbase>scan 'testtable'

    Adding RAW=>true to the query exposes historical versions of the data, and VERSIONS=>3 specifies that the latest 3 versions should be returned.

hbase>scan 'testtable',{RAW=>true,VERSIONS=>3}
hbase>put 'testtable','myrow-1','colfam1:q1','value-4'
hbase>scan 'testtable'
hbase>scan 'testtable',{RAW=>true,VERSIONS=>3}
hbase>put 'testtable','myrow-1','colfam2:x1','value-1'
hbase>scan 'testtable'
hbase>put 'testtable','myrow-1','colfam2:x1','value-2'
hbase>scan 'testtable'
hbase>scan 'testtable',{RAW=>true,VERSIONS=>3}

    Restart hbase:

hbase>scan 'testtable',{RAW=>true,VERSIONS=>3}
hbase>exit
bin/stop-hbase.sh

 

3. Pay attention

    The Backspace and Delete keys cannot be used directly to erase characters on the hbase command line. You can use Ctrl+Delete to delete.

    Alternatively, modify the Xshell configuration:

    File->Properties->Terminal->Keyboard

    ->delete key sequence [VT220Del]

    ->backspace key sequence [ASCII127]

 
