HBase 2.x Environment Setup and Basic Usage

I. Introduction to HBase
- Data Model
- System Architecture
II. HBase Pseudo-Distributed Configuration
III. Basic HBase Shell Commands

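I. Introduction to HBase

HBase is a distributed, scalable database built on Hadoop for storing big data.

Use case: workloads that need random, real-time read/write access to big data
Goal: hosting very large tables with billions of rows and millions of columns
Origin: Google's paper "Bigtable: A Distributed Storage System for Structured Data"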

Correspondence of the underlying technologies:

Component	Bigtable	HBase
File storage system	GFS	HDFS
Large-scale data processing	MapReduce	Hadoop MapReduce
Coordination service	Chubby	ZooKeeper

Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.

Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables – billions of rows X millions of columns – atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

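Data Model

HBase organizes data into tables and groups tables logically with namespaces.

Namespace: similar to a database in MySQL. Two namespaces exist by default, default and hbase; user tables are placed in default unless specified otherwise.
Table: HBase organizes data into tables. A table consists of rows and columns, and the columns are grouped into column families.
Row: every HBase table consists of rows, each identified by a sortable row key.
Column: a specific column is addressed in the form column family:column qualifier.
- Column family: an HBase table is a collection of column families, which are the basic unit of access control. Column families can be added dynamically, but at least one must be specified when the table is defined, and a column family must be defined before it is used.
- Column qualifier: horizontally, a table consists of one or more column families. A column family can contain any number of columns, and data in the same column family is stored together. Data within a column family is located by its column qualifier.
Cell: in an HBase table, a row, a column family, and a column qualifier together identify a cell. The data stored in a cell has no data type and is always treated as a byte array (byte[]), so no types are declared when a table is defined, and users must convert data types themselves.
Timestamp: each cell can hold multiple versions of the same data, indexed by timestamp. An update in HBase does not delete the old version of the data; it writes a new version and keeps the old ones (this follows from HDFS allowing appends but not in-place modification).

HBase is a sparse, multi-dimensional, sorted map indexed by row key, column family, column qualifier, and timestamp. Data is stored in key-value form: Table + RowKey (ascending) + ColumnFamily + Column + Timestamp --> Value

Figure: HBase data model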
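
The key-value view of the data model can be made concrete with a small in-memory model. The Python sketch below only illustrates the logical model (it is not how HBase stores data internally); the class name ToyTable and its methods are hypothetical, and timestamps are passed in by hand to keep the example deterministic.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of one table: (row, family, qualifier, timestamp) -> bytes."""

    def __init__(self, families):
        self.families = set(families)   # column families are declared up front
        self.cells = defaultdict(dict)  # (row, family, qualifier) -> {timestamp: value}

    def put(self, row, column, value, ts):
        family, _, qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # An update adds a new version; older versions are kept.
        self.cells[(row, family, qualifier)][ts] = value

    def get(self, row, column, versions=1):
        family, _, qualifier = column.partition(":")
        history = self.cells.get((row, family, qualifier), {})
        # Newest timestamp first.
        return [history[ts] for ts in sorted(history, reverse=True)[:versions]]

    def scan(self):
        # Row keys come back in ascending order.
        return sorted({row for (row, _, _) in self.cells})


table = ToyTable(["info", "address"])
table.put("Sariel", "info:age", b"26", ts=1)
table.put("Sariel", "info:age", b"27", ts=2)  # new version, old one retained
table.put("Elvis", "address:city", b"Beijing", ts=1)

print(table.get("Sariel", "info:age"))              # latest version only
print(table.get("Sariel", "info:age", versions=2))  # both versions, newest first
print(table.scan())                                 # row keys in sorted order
```

Values are bytes on purpose: as described above, a cell carries no type information, so any conversion is left to the caller.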

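System Architecture

Figure: HBase system architecture

HBase uses a master/worker design. Underlying storage relies on HDFS and coordination relies on a ZooKeeper cluster; HMaster handles HBase management operations, and HRegionServer handles data operations.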

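Client

The client contains the interfaces for accessing HBase and caches the locations of Regions it has already accessed to speed up later requests.
- For administrative operations, the client makes RPC calls to HMaster
- For data reads and writes, the client makes RPC calls to HRegionServer
ZooKeeper server

ZooKeeper is an open-source counterpart of Google's Chubby coordination service
- Guarantees that only one Master is active in the cluster at any time, since several Masters are usually started for high availability
- Stores the addressing entry point for all Regions
- Monitors the state of Region Servers in real time and reports Region Servers coming online or going offline to HMaster
- Stores HBase metadata (schema), including which tables exist in the cluster and which column families each table has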
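Master server

The Master server is mainly responsible for managing tables and Regions; its implementation class is HMaster:
- Table operations: create, delete, alter
- RegionServer operations:
  - Load balancing across Region servers
  - Reassigning Regions after a Region splits or merges
  - Migrating Regions away from failed Region servers
Region server

The Region server is the core module of HBase. It maintains the Regions that the Master assigns to it, and its implementation class is HRegionServer. Its main components are:
- A Region server hosts multiple Regions, which share a single HLog file
- A Region consists of one or more Stores, each holding one column family
- Each Store consists of one MemStore and zero or more StoreFiles
- The MemStore lives in memory; StoreFiles are stored on HDFS
- StoreFiles are implemented on top of HFile
Its main functions are:
- Data operations: get, put, delete
- Region operations: splitRegion, compactRegion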
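
The relationship between a MemStore and its StoreFiles can be sketched as a tiny write path: writes land in an in-memory buffer, and the buffer is flushed to an immutable sorted file once it is full. This is a simplified illustration rather than HBase's actual implementation; ToyStore and the flush_threshold parameter are hypothetical names, and the write-ahead log and compaction are left out.

```python
class ToyStore:
    """Toy model of one Store: one MemStore plus zero or more StoreFiles."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical: entries allowed before a flush
        self.memstore = {}     # in-memory buffer: row key -> value
        self.store_files = []  # immutable sorted snapshots (stand-ins for HFiles)

    def put(self, row, value):
        self.memstore[row] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffer out as a sorted, immutable file and start a fresh buffer.
        self.store_files.append(tuple(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, row):
        # Check the MemStore first, then StoreFiles from newest to oldest.
        if row in self.memstore:
            return self.memstore[row]
        for store_file in reversed(self.store_files):
            for key, value in store_file:
                if key == row:
                    return value
        return None


store = ToyStore(flush_threshold=2)
store.put("r1", "a")
store.put("r2", "b")  # reaches the threshold and triggers a flush
store.put("r1", "c")  # the newer value stays in the MemStore
print(len(store.store_files), store.get("r1"), store.get("r2"))
```

Reading the MemStore before the files is what lets the newer value for r1 win even though an older value already sits in a flushed file.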

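II. HBase Pseudo-Distributed Configuration

HBase has two main installation modes:

Standalone mode: HBase does not use HDFS; it uses the local file system instead and runs all HBase daemons and a local ZooKeeper in a single JVM.
Distributed mode
- Pseudo-distributed: all daemons run on a single node.
- Fully distributed: daemons are spread across all nodes in the cluster.

This tutorial covers the pseudo-distributed setup.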

0. Preparation

Versions:

JDK 1.8
ZooKeeper 3.7.0
- Download: https://zookeeper.apache.org/releases.html
- Installation guide: https://blog.csdn.net/tangyi2008/article/details/121984758
Hadoop 2.7.7
- Download: https://hadoop.apache.org/releases.html
- Installation guide: https://blog.csdn.net/tangyi2008/article/details/121908766
HBase 2.1.9
- Download: http://hbase.apache.org/downloads.html

The packages can also be downloaded from the shared Baidu Cloud drive:

Link: https://pan.baidu.com/s/1kjcuNNCY2FxYA5o7Z2tgkQ 
Extraction code: nuli

For compatibility between HBase and Hadoop versions, see the official reference guide: http://hbase.apache.org/book.html

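1. HBase Configuration Files

All configuration files live in the conf directory and must be kept in sync across every node in the cluster.

backup-masters: does not exist by default. A plain-text file listing the hosts that run backup Master processes, one hostname or IP address per line.
hadoop-metrics2-hbase.properties: connects HBase to Hadoop's Metrics2 framework.
hbase-env.cmd and hbase-env.sh: scripts for Windows and Linux/UNIX that set up the HBase working environment, including the location of Java, Java options, and other environment variables such as JAVA_HOME and HBASE_MANAGES_ZK.

HBASE_MANAGES_ZK: when true, HBase manages its own ZooKeeper instance; otherwise a separately started ZooKeeper is used.
hbase-policy.xml: the default policy configuration file used by the RPC servers to make authorization decisions on client requests. Only used when HBase security is enabled.
hbase-site.xml: specifies options that override HBase's default configuration.

Setting	Description
hbase.tmp.dir	Temporary directory on the local file system. The default is under /tmp, which is typically cleared on system restart, so this value usually needs to be changed. Default: ${java.io.tmpdir}/hbase-${user.name}
hbase.rootdir	Directory shared by RegionServers where HBase stores its data. The path must be fully qualified; for example, the /hbase directory on an HDFS instance listening on port 9000 is written as hdfs://namenode.example.org:9000/hbase. Default: ${hbase.tmp.dir}/hbase
hbase.cluster.distributed	Whether the cluster runs in distributed mode. Default: false
hbase.zookeeper.quorum	Comma-separated list of servers in the ZooKeeper ensemble
dfs.replication	HDFS client setting for the number of replicas; can be set to 1 for a pseudo-distributed setup

log4j.properties: configuration file for HBase logging via log4j. Changing parameters in this file changes HBase's log level.
regionservers: lists all hosts running a Region Server in the HBase cluster (by default it contains the single entry localhost). A plain-text file with one hostname or IP address per line.

For more configuration details, see: https://hbase.apache.org/book.html#config.files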

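2. HBase Installation and Pseudo-Distributed Configuration

This tutorial uses the following directories and names; adjust them if yours differ:

Username: xiaobai
Installation package location: /home/xiaobai/soft
Installation directory: /home/xiaobai/opt
Hostname: node1

After completing the preparation in step 0 and downloading the matching HBase version, upload the package to ~/soft on the virtual machine, then follow the steps below to install and configure HBase.

1) Install HBase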

tar -xvf ~/soft/hbase-2.1.9-bin.tar.gz  -C ~/opt
cd ~/opt
ln -s hbase-2.1.9 hbase

Configure environment variables

vi ~/.bashrc

Add the following:

export HBASE_HOME=/home/xiaobai/opt/hbase
export PATH=$HBASE_HOME/bin:$PATH

Apply the changes

source ~/.bashrc

2) Configuration

(1) hbase-env.sh

vi ~/opt/hbase/conf/hbase-env.sh

Modify the following two settings:

export JAVA_HOME=/home/xiaobai/opt/jdk
export HBASE_MANAGES_ZK=false

(2) hbase-site.xml

vi ~/opt/hbase/conf/hbase-site.xml

Add the following inside the configuration element:

<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>

<property>
    <name>hbase.rootdir</name>
    <value>hdfs://node1:9000/hbase</value>
</property>

<property>
    <name>hbase.zookeeper.quorum</name>
    <value>node1</value>
</property>

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>

<property>
    <name>hbase.tmp.dir</name>
    <value>/home/xiaobai/opt/hbase/tmp</value>
</property>

(3) regionservers

vi ~/opt/hbase/conf/regionservers

Change the original localhost to the hostname node1
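
Before starting HBase it can help to confirm that the properties in hbase-site.xml were written as intended. The sketch below parses a Hadoop-style configuration document with Python's standard library and prints the resulting name/value pairs. The helper name read_properties is hypothetical, and the XML string simply repeats two of the values used in this tutorial.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://node1:9000/hbase</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text.strip())
    return {
        prop.findtext("name").strip(): prop.findtext("value").strip()
        for prop in root.iter("property")
    }


props = read_properties(SAMPLE)
for name, value in sorted(props.items()):
    print(f"{name} = {value}")
```

To check the real file, the contents of ~/opt/hbase/conf/hbase-site.xml can be read and passed to read_properties in place of SAMPLE. A malformed file raises a parse error, which is often the quickest way to spot an unclosed tag.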

3. Starting, Checking, and Stopping Services

1) Start

(1) Start HDFS

start-dfs.sh

(2) Start ZooKeeper

zkServer.sh start

(3) Start HBase

start-hbase.sh

2) Check

(1) Check processes

jps


(2) Web UI

http://node1:16010


III. Basic HBase Shell Commands

0. The help command

Enter the interactive shell and view the help

hbase shell
> help

Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command.

Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group.

…

SHELL USAGE:

Quote all names in HBase Shell such as table and column names. Commas delimit command parameters. Type <RETURN> after entering a command to run it. Dictionaries of configuration used in the creation and alteration of tables are Ruby Hashes. They look like this:

{'key1' => 'value1', 'key2' => 'value2', ...}

and are opened and closed with curley-braces. Key/values are delimited by the '=>' character combination. Usually keys are predefined constants such as NAME, VERSIONS, COMPRESSION, etc. Constants do not need to be quoted. Type 'Object.constants' to see a (messy) list of all constants in the environment.

If you are using binary keys or values and need to enter them in the shell, use double-quote'd hexadecimal representation. For example:

hbase> get 't1', "key\x03\x3f\xcd"

hbase> get 't1', "key\003\023\011"

hbase> put 't1', "test\xef\xff", 'f1:', "\x01\x33\x40"

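Key points for using the HBase shell:

Enter the interactive shell with hbase shell
help can show help for a single command or for a command group
Names such as table names and column names must be quoted in commands
Command parameters are separated by commas
Press Enter to run a command
A long command can be continued over several lines with the line-continuation character \
When creating or altering a table, configuration is given as a dictionary written in curly braces, with keys and values separated by =>
Constants do not need quotes; use Object.constants to list the available constants
To enter binary data on the command line, write it in hexadecimal and wrap it in double quotes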

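1. General Operations

1) Check server status

status

2) Check the HBase version

version

3) List all tables

list

2. Create, Update, Delete

1) Create a table

create 'member001','member_id','address','info'

2) Describe the table

describe 'member001'

3) Add a column family

alter 'member001', 'id'

4) Add data
In the HBase shell, data is inserted with the put command. Columns within a column family do not need to be created in advance; they are specified with : when needed. Add data as follows: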

put 'member001', 'debugo','id','11'
put 'member001', 'debugo','info:age','27'
put 'member001', 'debugo','info:birthday','1991-04-04'
put 'member001', 'debugo','info:industry', 'it'
put 'member001', 'debugo','address:city','Shanghai'

put 'member001', 'debugo','address:country','China'
put 'member001', 'Sariel', 'id', '21'
put 'member001', 'Sariel','info:age', '26'
put 'member001', 'Sariel','info:birthday', '1992-05-09'
put 'member001', 'Sariel','info:industry', 'it'
put 'member001', 'Sariel','address:city', 'Beijing'
put 'member001', 'Sariel','address:country', 'China'
put 'member001', 'Elvis', 'id', '22'
put 'member001', 'Elvis','info:age', '26'
put 'member001', 'Elvis','info:birthday', '1992-09-14'
put 'member001', 'Elvis','info:industry', 'it'
put 'member001', 'Elvis','address:city', 'Beijing'
put 'member001', 'Elvis','address:country', 'china'
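
Typing many put commands by hand is error-prone. As a convenience, the following Python sketch generates the same kind of put lines from a dictionary so that they can be pasted into the shell. The function name to_put_commands and the sample record are illustrative assumptions, and the quoting is naive: values that contain a single quote are not handled.

```python
def to_put_commands(table, rows):
    """Build HBase shell put commands from {row_key: {column: value}}."""
    commands = []
    for row_key, columns in rows.items():
        for column, value in columns.items():
            # Every argument is quoted, and every value is written as a string.
            commands.append(f"put '{table}', '{row_key}', '{column}', '{value}'")
    return commands


rows = {
    "Sariel": {"info:age": 26, "address:city": "Beijing"},
}
for line in to_put_commands("member001", rows):
    print(line)
```

The number 26 ends up as the string '26' in the generated command, which matches how the shell examples above store every value as text.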

5) View the table data

scan 'member001'

6) Delete a column family

alter 'member001', {NAME => 'member_id', METHOD => 'delete'}

7) Delete columns
a) The delete command removes the 'info:age' field of the row with a given row key; a subsequent get returns no value:

delete 'member001','debugo','info:age'
get 'member001','debugo','info:age'

b) To delete an entire row, use the deleteall command:

deleteall 'member001','debugo'
get 'member001','debugo'

8) Use enable and disable to enable or disable the table; is_enabled and is_disabled check whether the table is enabled or disabled

is_enabled 'member001'
is_disabled 'member001'

9) Use exists to check whether a table exists

exists 'member001'

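3. Queries

1) Count the rows in a table with the count command:

count 'member001'

2) get
a) Get all data for one row key:

get 'member001', 'Sariel'

b) Get all data in one column family (or one column) for a row key:

get 'member001', 'Sariel', 'info'

3) View the help for scan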

help 'scan'

Scan a table; pass table name and optionally a dictionary of scanner
specifications. Scanner specifications may include one or more of:
TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, ROWPREFIXFILTER, TIMESTAMP,
MAXLENGTH or COLUMNS, CACHE or RAW, VERSIONS, ALL_METRICS or METRICS

If no columns are specified, all columns will be scanned.
To scan all members of a column family, leave the qualifier empty as in
'col_family'.

The filter can be specified in two ways:

Using a filterString - more information on this is available in the
Filter Language document attached to the HBASE-4176 JIRA

Using the entire package name of the filter.

If you wish to see metrics regarding the execution of the scan, the
ALL_METRICS boolean should be set to true. Alternatively, if you would
prefer to see only a subset of the metrics, the METRICS array can be
defined to include the names of only the metrics you care about.

Some examples:

hbase> scan 'hbase:meta'
hbase> scan 'hbase:meta', {COLUMNS => 'info:regioninfo'}
hbase> scan 'ns1:t1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
hbase> scan 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
hbase> scan 't1', {REVERSED => true}
hbase> scan 't1', {ALL_METRICS => true}
hbase> scan 't1', {METRICS => ['RPC_RETRIES', 'ROWS_FILTERED']}
hbase> scan 't1', {ROWPREFIXFILTER => 'row2', FILTER => "
(QualifierFilter (>=, 'binary:xyz')) AND (TimestampsFilter ( 123, 456))"}
hbase> scan 't1', {FILTER =>
org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
hbase> scan 't1', {CONSISTENCY => 'TIMELINE'}
For setting the Operation Attributes
hbase> scan 't1', { COLUMNS => ['c1', 'c2'], ATTRIBUTES => {'mykey' => 'myvalue'}}
hbase> scan 't1', { COLUMNS => ['c1', 'c2'], AUTHORIZATIONS => ['PRIVATE','SECRET']}
For experts, there is an additional option – CACHE_BLOCKS – which
switches block caching for the scanner on (true) or off (false). By
default it is enabled. Examples:

hbase> scan 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}

Also for experts, there is an advanced option – RAW – which instructs the
scanner to return all cells (including delete markers and uncollected deleted
cells). This option cannot be combined with requesting specific COLUMNS.
Disabled by default. Example:

hbase> scan 't1', {RAW => true, VERSIONS => 10}

Besides the default 'toStringBinary' format, 'scan' supports custom formatting
by column. A user can define a FORMATTER by adding it to the column name in
the scan specification. The FORMATTER can be stipulated:

either as a org.apache.hadoop.hbase.util.Bytes method name (e.g, toInt, toString)

or as a custom class followed by method name: e.g. 'c(MyFormatterClass).format'.

Example formatting cf:qualifier1 and cf:qualifier2 both as Integers:
hbase> scan 't1', {COLUMNS => ['cf:qualifier1:toInt',
'cf:qualifier2:c(org.apache.hadoop.hbase.util.Bytes).toInt'] }

Note that you can specify a FORMATTER by column only (cf:qualifier). You cannot
specify a FORMATTER for all columns of a column family.

Scan can also be used directly from a table, by first getting a reference to a
table, like such:

hbase> t = get_table 't'
hbase> t.scan

Note in the above situation, you can still provide all the filtering, columns,
options, etc as described above.

4) Scan the entire table

scan 'member001'

5) Scan an entire column family

scan 'member001', {COLUMNS=>'info'}

6) Scan one specific column

scan 'member001', {COLUMNS=> 'info:birthday'}

7) Besides COLUMNS, scan also supports LIMIT (caps the number of rows returned), STARTROW (the starting row key; the scan first locates the Region by this key and then scans forward), STOPROW (the end row), TIMERANGE (restricts the timestamp range), VERSIONS (number of versions), and FILTER (filters rows by condition). For example, to start at the row key Sariel and return one row with only its latest version:

scan 'member001', {STARTROW => 'Sariel', LIMIT=>1, VERSIONS=>1}

8) FILTER is a powerful modifier that applies a set of conditions. For example, to keep only cells whose value equals 26:

scan 'member001', FILTER=>"ValueFilter(=,'binary:26')"

Values containing 6:

scan 'member001', FILTER=>"ValueFilter(=,'substring:6')"

Columns whose name starts with birth:

scan 'member001', FILTER=>"ColumnPrefixFilter('birth')"

FILTER accepts multiple conditions combined with parentheses, AND, and OR:

scan 'member001', FILTER=>"ColumnPrefixFilter('birth') AND ValueFilter(=,'substring:1988')"

PrefixFilter matches on the row key prefix, which is a very commonly used feature.

scan 'member001', FILTER=>"PrefixFilter('E')"
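
The comparators are easy to mix up, so here is a small Python model of how the filters above select cells. It only imitates the matching rules on an in-memory list and does not talk to HBase; the function names are hypothetical, and the substring match is modeled as case-insensitive on the assumption that this mirrors HBase's substring comparator.

```python
# Each cell: (row key, column qualifier, value), all as plain strings.
cells = [
    ("Elvis", "age", "26"),
    ("Elvis", "birthday", "1992-09-14"),
    ("Sariel", "age", "26"),
    ("Sariel", "birthday", "1992-05-09"),
]


def value_filter_binary(cells, expected):
    # ValueFilter(=, 'binary:X'): the whole value must equal X.
    return [c for c in cells if c[2] == expected]


def value_filter_substring(cells, needle):
    # ValueFilter(=, 'substring:X'): X may appear anywhere in the value.
    return [c for c in cells if needle.lower() in c[2].lower()]


def column_prefix_filter(cells, prefix):
    # ColumnPrefixFilter('X'): the qualifier must start with X.
    return [c for c in cells if c[1].startswith(prefix)]


def prefix_filter(cells, prefix):
    # PrefixFilter('X'): the row key must start with X.
    return [c for c in cells if c[0].startswith(prefix)]


print(value_filter_binary(cells, "26"))
print(value_filter_substring(cells, "6"))
# AND is modeled by applying one filter to the result of another.
print(value_filter_substring(column_prefix_filter(cells, "birth"), "1992"))
print(prefix_filter(cells, "E"))
```

In this sample the binary and substring filters happen to select the same two age cells; a value such as "126" would pass only the substring filter.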

4. Deleting a Table

A table must be disabled before it can be dropped.

disable 'member001'
drop 'member001'
