HBase Basics for Beginners

Whether in the NoSQL world or the big data field, HBase is a very popular database.
This article gives a basic introduction to HBase, aimed at beginners.

I. Introduction

HBase is an open source, column-oriented, non-relational distributed database, and is currently a critical part of the Hadoop ecosystem.
HBase began as an implementation modeled on Google's Bigtable, which was described in a 2006 paper by Fay Chang and colleagues at Google, "Bigtable: A Distributed Storage System for Structured Data". Just as Bigtable is built on the Google File System, HBase is built on top of HDFS (Hadoop Distributed File System).

HBase is written in Java and incorporates several features described in the Bigtable paper, including compression, in-memory operation, and Bloom filters. These capabilities have led to wide use in scenarios that need massive data storage and high-performance reads; Facebook, for example, chose HBase in November 2010 as the storage layer for its messaging platform.
HBase is open source under the Apache License, Version 2.0, a license that is friendly to commercial use, and it is currently a top-level project of the Apache Software Foundation.

Characteristics

  • Column-oriented storage model, which achieves a high degree of data compression and saves storage costs
  • Uses an LSM (log-structured merge) mechanism instead of a B(+) tree, which makes HBase well suited to real-time writes of massive data
  • High reliability: data is kept in multiple copies (3 by default) thanks to HDFS replication, and RegionServers fail over automatically
  • High scalability: supports sharding (based on Regions) and balances data automatically
  • Strongly consistent reads and writes: all reads and writes go to the primary Region, which makes HBase a CP-type system
  • Easy to operate: HBase provides a Java API as well as REST API and Thrift API interfaces
  • Query optimization: uses the Block Cache and Bloom filters to support fast lookups in massive data
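
The LSM point in the list above can be made concrete with a toy model. The sketch below is not HBase code; the class name TinyLSM and the flush_threshold parameter are invented for illustration. It only shows the general idea: writes go to an in-memory buffer, the buffer is flushed as an immutable sorted segment, and reads check memory first and then the segments from newest to oldest.

```python
import bisect


class TinyLSM:
    """Toy log-structured merge store (illustrative only, not HBase code)."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical limit, in entries
        self.memstore = {}    # newest writes, held in memory
        self.segments = []    # immutable sorted lists of (key, value); newest last

    def put(self, key, value):
        # A write only touches memory; nothing stored earlier is updated in place.
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Dump the memstore as one sorted, immutable segment.
        if self.memstore:
            self.segments.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        # Check memory first, then segments from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

    def compact(self):
        # Merge all segments into one, keeping only the newest value per key.
        merged = {}
        for segment in self.segments:   # oldest first, so newer values overwrite
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []


store = TinyLSM(flush_threshold=2)
store.put("row1", "a")
store.put("row2", "b")   # second entry reaches the threshold and triggers a flush
store.put("row1", "c")   # newer value for row1 stays in memory
print(store.get("row1"), store.get("row2"))
```

Because a write never modifies an existing segment, writing stays cheap even as the data grows; the cost is moved to reads and to the occasional compaction, which is the trade-off the bullet describes.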

Differences from an RDBMS

For a traditional RDBMS, ACID transaction support is a basic capability. HBase uses row-level locking to make each write to a single row atomic, but it does not support transactional writes that span multiple rows; this is a trade-off made in favor of flexibility and scalability.

ACID stands for atomicity, consistency, isolation, and durability.

Overall, the differences between HBase and a traditional relational database are shown in the following table:

Characteristic | HBase | RDBMS
Hardware architecture | Hadoop-like distributed cluster on low-cost hardware | Conventional multi-core systems on expensive hardware
Fault tolerance | Achieved in the software architecture; with many nodes, the failure of one or a few nodes is not a concern | Generally requires additional hardware mechanisms to achieve HA
Database size | PB | GB, TB
Data layout | Sparse, distributed, multi-dimensional map | Organized in rows and columns
Data types | Bytes | Rich data types
Transaction support | ACID only at the single-row level | Full ACID support across rows and tables
Query language | Java API only (unless combined with other frameworks such as Phoenix or Hive) | SQL
Indexes | Row key only (unless combined with other technologies such as Phoenix or Hive) | Supported
Throughput | Millions of queries per second | Thousands of queries per second

II. Data Model

Here we use a relational database table to show how HBase differs. First, look at the table below:

ID | Device name | Status | Timestamp
1 | air conditioning | turn on | 20190712 10:05:01
2 | TV set | shut down | 20190712 10:05:08

This is status data (DeviceState) reported by some home devices; it includes the device name, status, and timestamp fields.

In HBase, data is stored grouped by column family (CF for short), and different column families are stored in separate files.
So the status table above is stored in HBase as two parts:

1. Device Name column family

Row-Key | CF:Column-Key | Timestamp | Cell Value
1 | DeviceState:device name | 20190712 10:05:01 | air conditioning
2 | DeviceState:device name | 20190712 10:05:08 | TV set

2. Status column family

Row-Key | CF:Column-Key | Timestamp | Cell Value
1 | DeviceState:state | 20190712 10:05:01 | turn on
2 | DeviceState:state | 20190712 10:05:08 | shut down

The row key here is the ID field, which uniquely locates a data row; the row key plus the CF, the column key, and the timestamp locates a single data cell.
The timestamp identifies the version of the data. HBase keeps a limited number of versions per cell: older releases kept three by default, while since HBase 0.96 the default is one (the VERSIONS => '1' in the describe output later in this article reflects that). Writes to the same cell therefore retain at most that many versions.

When a whole row is queried, HBase has to read from both column families (files) and merges the results before returning them to the client. Too many column families therefore hurt read performance, so the design requires some trade-offs.
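
The addressing scheme described above can be sketched as a nested map. This is a toy model rather than HBase's storage code: the class ToyTable and its method names are invented here, and max_versions is an illustrative stand-in for the per-column-family VERSIONS setting. It keeps one separate store per column family, locates a cell by row key, family, column key, and timestamp, and merges the families when a whole row is read.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of an HBase table: one separate store per column family."""

    def __init__(self, families, max_versions=3):
        # Illustrative value; in HBase the limit is the family's VERSIONS setting.
        self.max_versions = max_versions
        # family -> row key -> column key -> list of (timestamp, value), newest first
        self.stores = {
            cf: defaultdict(lambda: defaultdict(list)) for cf in families
        }

    def put(self, row, family, column, timestamp, value):
        versions = self.stores[family][row][column]
        # A write with an existing timestamp replaces that version.
        versions[:] = [tv for tv in versions if tv[0] != timestamp]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)   # newest first
        del versions[self.max_versions:]                     # drop the oldest

    def get_cell(self, row, family, column, timestamp=None):
        # Row key + family + column key + timestamp addresses exactly one cell;
        # without a timestamp the newest version is returned.
        for ts, value in self.stores[family].get(row, {}).get(column, []):
            if timestamp is None or ts == timestamp:
                return value
        return None

    def get_row(self, row):
        # Reading a whole row visits every family's store and merges the results.
        result = {}
        for family, store in self.stores.items():
            for column, versions in store.get(row, {}).items():
                result[f"{family}:{column}"] = versions[0][1]   # latest version
        return result


table = ToyTable(["name", "state"])
table.put("1", "name", "", 100, "air conditioning")
table.put("1", "state", "", 100, "turn on")
table.put("1", "state", "", 101, "shut down")
print(table.get_row("1"))
```

The loop in get_row is the reason the number of column families matters: every extra family is one more store that a full-row read has to visit.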

Using HBase is therefore very different from using a relational database. With HBase you have to set aside many relational habits and features, such as strong typing, secondary indexes, table joins, and triggers.

In return, HBase offers flexibility and scalability that a traditional RDBMS cannot match.

III. Installing HBase

Installing a standalone environment

  1. Prepare a JDK

Make sure a JDK is installed; run java -version to confirm:

host:/home/hbase # java -version
openjdk version "1.8.0_201"
OpenJDK Runtime Environment (build 1.8.0_201-Huawei_JDK_V100R001C00SPC060B003-b10)
OpenJDK 64-Bit Server VM (build 25.201-b10, mixed mode)
  2. Download the software

The download page on the official site is:
http://archive.apache.org/dist/hbase/

Select the appropriate version, such as 2.1.5. Download it and unpack it:

wget http://archive.apache.org/dist/hbase/2.1.5/hbase-2.1.5-bin.tar.gz
tar -xzvf hbase-2.1.5-bin.tar.gz
mkdir -p /opt/local
mv hbase-2.1.5 /opt/local/hbase

Add the HBase commands to the PATH:

export HBASE_HOME=/opt/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
  3. Configure the software

vim conf/hbase-env.sh

# JDK installation directory
export JAVA_HOME=/usr/local/jre1.8.0_201
# Let HBase manage its own ZooKeeper
export HBASE_MANAGES_ZK=true

vim conf/hbase-site.xml

<configuration>

  <!-- ZooKeeper client port -->
  <property>
      <name>hbase.zookeeper.property.clientPort</name>
      <value>2182</value>
  </property>

  <!-- HBase data storage directory -->
  <property>
    <name>hbase.rootdir</name>
    <value>file:///opt/local/hbase/data</value>
  </property>

  <!-- ZooKeeper data storage directory -->
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/local/hbase/data/zookeeper</value>
  </property>

  <!-- Temporary data storage directory -->
  <property>
    <name>hbase.tmp.dir</name>
    <value>/opt/local/hbase/temp/hbase-${user.name}</value>
  </property>
</configuration>

hbase.rootdir and hbase.zookeeper.property.dataDir specify the data storage directories. By default HBase uses the /tmp directory, which is clearly unsuitable.
Once these two paths are configured, HBase creates the corresponding directories automatically.
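
As a quick sanity check of the two settings just discussed, a configuration file of this shape can be read with Python's standard XML parser. This is only a sketch: the helper names read_properties and uses_tmp are invented here, and treating an unset key as risky is an assumption based on the statement above that the default location is under /tmp.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the configuration shown above, used as sample input.
SAMPLE = """<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///opt/local/hbase/data</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/local/hbase/data/zookeeper</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return the <name>/<value> pairs of the configuration as a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def uses_tmp(props, keys=("hbase.rootdir", "hbase.zookeeper.property.dataDir")):
    """List the given keys that are unset or point somewhere under /tmp."""
    risky = []
    for key in keys:
        value = props.get(key, "")
        path = value[len("file://"):] if value.startswith("file://") else value
        if not value or path == "/tmp" or path.startswith("/tmp/"):
            risky.append(key)
    return risky


props = read_properties(SAMPLE)
print(props)
print("settings still under /tmp:", uses_tmp(props))
```

To check a real installation, the file contents can be passed in instead of SAMPLE, for example the text of /opt/local/hbase/conf/hbase-site.xml from the setup above.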

For more configurable parameters, see the official HBase documentation listed under Reference Documents.

  4. Start the software
start-hbase.sh

Check the log file logs/hbase-root-master-host-xxx.log, which looks like this:

2019-07-11 07:37:23,654 INFO  [localhost:33539.activeMasterManager] hbase.MetaMigrationConvertingToPB: hbase:meta doesn't have any entries to update.
2019-07-11 07:37:23,654 INFO  [localhost:33539.activeMasterManager] hbase.MetaMigrationConvertingToPB: META already up-to date with PB serialization
2019-07-11 07:37:23,664 INFO  [localhost:33539.activeMasterManager] master.AssignmentManager: Clean cluster startup. Assigning user regions
2019-07-11 07:37:23,665 INFO  [localhost:33539.activeMasterManager] master.AssignmentManager: Joined the cluster in 11ms, failover=false
2019-07-11 07:37:23,672 INFO  [localhost:33539.activeMasterManager] master.TableNamespaceManager: Namespace table not found. Creating...

Check the process list; the HBase process has started:

ps -ef |grep hadoop
root     11049 11032  2 07:37 pts/1    00:00:20 /usr/local/jre1.8.0_201/bin/java -Dproc_master -XX:OnOutOfMemoryError=kill -9 %p -XX:+UseConcMarkSweepGC -XX:PermSize=128m -XX:MaxPermSize=128m -XX:ReservedCodeCacheSize=256m -Dhbase.log.dir=/opt/local/hbase/logs -Dhbase.log.file=hbase-root-master-host-192-168-138-148.log -Dhbase.home.dir=/opt/local/hbase -Dhbase.id.str=root -Dhbase.root.logger=INFO,RFA -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.master.HMaster start
root     18907 30747  0 07:50 pts/1    00:00:00 grep --color=auto hadoop

With jps (a process inspection tool that ships with the JDK) you can see the Java processes that are currently running:

# jps
5701 Jps
4826 HMaster
1311 jar

Look at the data directory; the corresponding files have been generated:

host:/opt/local/hbase/data # ls -lh .
total 36K
drwx------. 4 root root 4.0K Jul 11 08:08 data
drwx------. 4 root root 4.0K Jul 11 08:08 hbase
-rw-r--r--. 1 root root   42 Jul 11 08:08 hbase.id
-rw-r--r--. 1 root root    7 Jul 11 08:08 hbase.version
drwx------. 2 root root 4.0K Jul 11 08:08 MasterProcWALs
drwx------. 2 root root 4.0K Jul 11 08:08 oldWALs
drwx------. 3 root root 4.0K Jul 11 08:08 .tmp
drwx------. 3 root root 4.0K Jul 11 08:08 WALs
drwx------. 3 root root 4.0K Jul 11 08:08 zookeeper

About running modes
By default HBase starts in standalone mode, where ZooKeeper and the HMaster / RegionServer run in the same JVM.
HBase in standalone mode contains one HMaster, one RegionServer, and one ZooKeeper instance, and it uses the local file system directly rather than HDFS.

Setting hbase.cluster.distributed to true in conf/hbase-site.xml switches HBase to cluster mode.
In this mode you can deploy in a truly distributed environment, or in a "pseudo-distributed" multi-process environment on a single machine.

<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>

Note that when HBase is started in standalone mode, the HMaster and RegionServer ports are random and cannot be specified in the configuration file.

IV. Basic Usage

Open the HBase Shell

hbase shell

Run the status command

Version 2.1.5, r76ab087819fe82ccf6f531096e18ad1bed079651, Wed Jun  5 16:48:11 PDT 2019

hbase(main):001:0> status
1 active master, 0 backup masters, 1 servers, 0 dead, 2.0000 average load

This means that one Master is running with one RegionServer, and that the RegionServer holds two Regions.

Operating Table

  • Create the table DeviceState
hbase(main):002:0> create "DeviceState", "name:c1", "state:c2"

=> Hbase::Table - DeviceState

At this point we have created a DeviceState table containing two column families: name (device name) and state (status).

View table information:

hbase(main):003:0> list
TABLE
DeviceState
1 row(s) in 0.0090 seconds

=> ["DeviceState"]

hbase(main):003:0> describe "DeviceState"
Table DeviceState is ENABLED
DeviceState
COLUMN FAMILIES DESCRIPTION
{NAME => 'name', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'state', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
2 row(s) in 0.0870 seconds
  • Write data

The following commands write two records to DeviceState; since there are two column families, four cells have to be written:

put "DeviceState", "row1", "name", "空调"    # air conditioner
put "DeviceState", "row1", "state", "打开"   # on
put "DeviceState", "row2", "name", "电视机"  # TV set
put "DeviceState", "row2", "state", "关闭"   # off
  • Query data

Query a whole row, or a single column of a row

hbase(main):012:0> get "DeviceState","row1"
COLUMN                                      CELL
 name:                                      timestamp=1562834473008, value=\xE7\x94\xB5\xE8\xA7\x86\xE6\x9C\xBA
 state:                                     timestamp=1562834474630, value=\xE5\x85\xB3\xE9\x97\xAD
1 row(s) in 0.0230 seconds

hbase(main):013:0> get "DeviceState","row1", "name"
COLUMN                                      CELL
 name:                                      timestamp=1562834473008, value=\xE7\x94\xB5\xE8\xA7\x86\xE6\x9C\xBA
1 row(s) in 0.0200 seconds

Scan the table

hbase(main):026:0> scan "DeviceState"
ROW                                         COLUMN+CELL
 row1                                       column=name:, timestamp=1562834999374, value=\xE7\xA9\xBA\xE8\xB0\x83
 row1                                       column=state:, timestamp=1562834999421, value=\xE6\x89\x93\xE5\xBC\x80
 row2                                       column=name:, timestamp=1562834999452, value=\xE7\x94\xB5\xE8\xA7\x86\xE6\x9C\xBA
 row2                                       column=state:, timestamp=1562835001064, value=\xE5\x85\xB3\xE9\x97\xAD
2 row(s) in 0.0250 seconds

Count the rows

hbase(main):014:0> count "DeviceState"
2 row(s) in 0.0370 seconds

=> 1
  • Delete data

Delete a single column, or a whole row

delete "DeviceState", "row1", "name"
0 row(s) in 0.0080 seconds

hbase(main):003:0> deleteall "DeviceState", "row2"
0 row(s) in 0.1290 seconds

Clear all data in the table

hbase(main):021:0> truncate "DeviceState"
Truncating 'DeviceState' table (it may take a while):
 - Disabling table...
 - Truncating table...
0 row(s) in 3.5060 seconds

Delete the table (it must be disabled first)

hbase(main):006:0> disable "DeviceState"
0 row(s) in 2.2690 seconds

hbase(main):007:0> drop "DeviceState"
0 row(s) in 1.2880 seconds

V. FAQ

  • Startup reports that listening on the ZooKeeper port failed:
    Could not start ZK at requested port of 2181.  ZK was started at port: 2182.  Aborting as clients (e.g. shell) will not be able to find this ZK quorum.

Cause
HBase needs to start ZooKeeper, but local port 2181 is already in use (another ZooKeeper instance may be running).

Solution
In conf/hbase-site.xml, change the value of hbase.zookeeper.property.clientPort to 2182:

<configuration>
  <property>
      <name>hbase.zookeeper.property.clientPort</name>
      <value>2182</value>
  </property>
</configuration>
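
Before choosing a replacement port, it can help to check which ports are actually free. The sketch below uses only Python's standard socket module; the helper names port_is_free and first_free_port and the candidate list are illustrative choices, not part of HBase or ZooKeeper. It simply tries to bind a TCP socket to the port and reports whether that succeeded.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use"; other bind errors
            # (such as missing permissions) also count as not usable.
            return False
        return True


def first_free_port(candidates=(2181, 2182, 2183)):
    """Return the first free port among the candidates, or None if all are taken."""
    for port in candidates:
        if port_is_free(port):
            return port
    return None


print("2181 free:", port_is_free(2181))
print("first free candidate:", first_free_port())
```

The result is only a snapshot: another process can still take the port between this check and the HBase startup, so the value written to hbase.zookeeper.property.clientPort should be a port reserved for this purpose.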
  • Starting the HBase Shell reports java.lang.UnsatisfiedLinkError

Cause
While hbase shell is running, JRuby creates a temporary file under the "java.io.tmpdir" path, which defaults to "/tmp". If the "/tmp" directory is mounted with the noexec option, hbase shell fails to start and throws a "java.lang.UnsatisfiedLinkError" error.

Solution

  1. Remove the noexec option from /tmp (not recommended)
  2. Set the java.io.tmpdir property to point to a usable path by editing the conf/hbase-env.sh file:
export HBASE_TMP_DIR=/opt/local/hbase/temp
export HBASE_OPTS="-XX:+UseConcMarkSweepGC -Djava.io.tmpdir=$HBASE_TMP_DIR"

Reference Documents

Apache HBase Reference Guide (official)
https://hbase.apache.org/book.html#quickstart

Setting up HBase in standalone mode
https://my.oschina.net/jackieyeah/blog/712019

HBase explained in simple terms
Covers the origin and characteristics of HBase in more detail and introduces the storage mechanism of an HBase cluster; well suited to introductory reading
https://www.ibm.com/developerworks/cn/analytics/library/ba-cn-bigdata-hbase/index.html

Origin www.cnblogs.com/littleatp/p/11946199.html