HBase environment setup and basic usage (a hand-holding beginner tutorial)

1. Introduction to HBase

HBase is a distributed, scalable big-data store built on top of Hadoop.

  • Usage scenarios: workloads that require random, realtime read/write access to big data

  • Goal: Support large tables with billions of rows and millions of columns

  • Source: Google's paper: "Bigtable: A Distributed Storage System for Structured Data"

  • The underlying technology correspondence:

                                  BigTable     HBase
    File storage system           GFS          HDFS
    Massive data processing       MapReduce    Hadoop MapReduce
    Coordination service          Chubby       ZooKeeper

Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.

Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Data model

HBase uses tables to organize data, and uses NameSpace to logically group tables.

  • NameSpace : a namespace, similar to a database in MySQL. Two namespaces exist by default, default and hbase; user tables go into default unless a namespace is specified.

  • Table : HBase uses tables to organize data. Tables are composed of rows and columns, and columns are divided into several column families.

  • Row : Each HBase table consists of several rows, each row is identified by a sortable row key.

  • Column : a specific column is identified in the form "column family:column qualifier".

    • Column Family : An HBase table is grouped into a collection of "column families", which are the basic unit of access control. Column families can be added dynamically, but at least one column family must be specified when a table is defined, and a column family must be defined before it can be used.
    • Column qualifier : Horizontally, a table consists of one or more column families. A column family can contain any number of columns, and data in the same column family is stored together. Data within a column family is located by its "column qualifier".
  • Cell : In an HBase table, a "cell" is determined by row, column family, and column qualifier. The data stored in a cell has no data type and is always treated as a byte array byte[], so there is no need to declare data types when defining a table; users convert data types themselves when reading and writing.

  • Timestamp : Each cell can hold multiple versions of the same data, indexed by timestamp. An update in HBase does not delete the old version of the data; it writes a new version and keeps the old one (this matches HDFS, which only allows appends, not in-place modification).

HBase is a sparse, multi-dimensional, sorted map. The map is indexed by row key, column family, column qualifier, and timestamp, and data is stored in key-value form: Table + RowKey (ascending) + ColumnFamily + Column + Timestamp --> Value
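The sorted key-value mapping just described can be sketched as a plain Python dictionary. This is only a conceptual illustration of the logical model (the row key and timestamps are made up for the example), not how HBase physically stores data:

```python
# Conceptual sketch of HBase's logical model: a map keyed by
# (row key, column family, column qualifier, timestamp) -> value.
# Values are untyped byte arrays; clients encode/decode them.
table = {}

def put(row, cf, qual, value, ts):
    table[(row, cf, qual, ts)] = value.encode("utf-8")

def get_latest(row, cf, qual):
    """Cells keep multiple versions; a read returns the newest timestamp."""
    versions = [(ts, v) for (r, f, q, ts), v in table.items()
                if (r, f, q) == (row, cf, qual)]
    return max(versions)[1] if versions else None

put("debugo", "info", "age", "26", ts=100)
put("debugo", "info", "age", "27", ts=200)  # an update adds a new version

assert get_latest("debugo", "info", "age") == b"27"  # newest version wins
assert len(table) == 2                               # the old version is kept
```

Note how the update does not overwrite anything: both versions of the cell remain, exactly as described for timestamps above.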

(Figure: HBase data model)

System architecture

(Figure: HBase system architecture)

HBase adopts a master-slave design: the underlying storage relies on HDFS, coordination relies on a ZooKeeper cluster, HMaster is responsible for HBase management operations, and HRegionServer is responsible for data operations.

  • Client (Client)

    The client includes interfaces to access HBase and caches the location information of Regions it has already accessed, to speed up subsequent data access.

    • For management operations, the Client performs RPC with HMaster

    • For data read and write operations, the Client performs RPC with HRegionServer

  • Zookeeper server

    ZooKeeper is an open-source coordination service modeled after Google's Chubby

    • Ensures that there is only one active Master in the cluster at any time (multiple Masters are started for high availability)

    • Stores the addressing entry used to locate Regions

    • Monitor the status of the Region Server in real time, and report the information on and offline of the Region Server to the HMaster.

    • Stores HBase metadata (schema), such as which tables exist in the cluster and which column families each table has.

  • Master server

    The Master server is mainly responsible for the management of tables and Regions; its implementation class is HMaster:

    • For table operations: create, delete, alter

    • For RegionServer operations:

      • Balances the distribution of Regions across the Region Servers
      • Readjusts the distribution of Regions after a Region split or merge
      • Migrates Regions off a failed Region Server
  • Region server

    The Region server is the core module of HBase. It maintains the Regions assigned to it by the Master; its implementation class is HRegionServer. The main components are as follows:

    • A Region server contains multiple Regions, and these Regions share one HLog (write-ahead log) file

    • A Region consists of one or more Stores; each Store holds one column family.

    • Each Store in turn consists of one MemStore and zero or more StoreFiles.

    • MemStore is stored in memory, StoreFile is stored in HDFS

    • The underlying implementation of StoreFile is HFile

    The main functions are as follows:

    • Operations on data: get, put, delete

    • For Region operations: splitRegion, compactRegion
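The storage hierarchy above (Region Server → Region → Store → MemStore/StoreFiles, with one shared HLog) can be sketched with a few data classes. This is a conceptual illustration only; the class and field names are ours, not HBase's:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MemStore:                 # in-memory write buffer, one per Store
    cells: dict = field(default_factory=dict)

@dataclass
class StoreFile:                # immutable flushed file, backed by an HFile on HDFS
    hfile_path: str

@dataclass
class Store:                    # one Store per column family
    column_family: str
    memstore: MemStore = field(default_factory=MemStore)
    storefiles: List[StoreFile] = field(default_factory=list)

@dataclass
class Region:                   # one or more Stores
    name: str
    stores: List[Store] = field(default_factory=list)

@dataclass
class RegionServer:             # many Regions sharing a single HLog (WAL)
    hlog_path: str
    regions: List[Region] = field(default_factory=list)

rs = RegionServer(hlog_path="/hbase/WALs/node1")
rs.regions.append(Region("member001-region-1",
                         stores=[Store("info"), Store("address")]))
assert len(rs.regions[0].stores) == 2  # one Store per column family
```

Writes first go to a Store's MemStore; when it fills, it is flushed to a new StoreFile, which is why a Store has one MemStore but zero or more StoreFiles.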

2. HBase pseudo-distributed configuration

HBase is mainly divided into the following two installation modes:

  • Standalone mode: HBase does not use HDFS; it uses the local file system instead, and all HBase daemons together with a local ZooKeeper run in the same JVM.

  • Distributed mode

    • Pseudo-distributed: all daemons run on a single node.
    • Fully distributed: daemons are distributed across the nodes of the cluster.

This tutorial mainly explains how to configure pseudo-distribution.

0. Preparations

Version:

  • JDK 1.8

  • Zookeeper 3.7.0

    • Download address: https://zookeeper.apache.org/releases.html
    • Installation reference: https://blog.csdn.net/tangyi2008/article/details/121984758
  • Hadoop 2.7.7

    • Download address: https://hadoop.apache.org/releases.html
    • Installation reference: https://blog.csdn.net/tangyi2008/article/details/121908766
  • HBase 2.1.9

    • Download address: http://hbase.apache.org/downloads.html

You can also download the packages from the shared Baidu cloud disk:

Link: https://pan.baidu.com/s/1kjcuNNCY2FxYA5o7Z2tgkQ 
Extraction code: nuli 

For compatibility issues between HBase and Hadoop versions, please refer to the official website: http://hbase.apache.org/book.html

1. Introduction to HBase configuration files

All configuration files are located in the conf directory and must be kept in sync on every node of the cluster.

  • backup-masters: does not exist by default. A plain-text file listing the hosts on which backup Master processes are started, one hostname or IP address per line.

  • hadoop-metrics2-hbase.properties: used to connect HBase to the Hadoop Metrics2 framework.

  • hbase-env.cmd and hbase-env.sh: scripts (for Windows and for Linux/UNIX, respectively) that set up HBase's working environment, including the location of Java, Java options, and other environment variables such as JAVA_HOME, HBASE_MANAGES_ZK, etc.

    When HBASE_MANAGES_ZK is true, ZooKeeper is managed by HBase itself; otherwise an independent ZooKeeper must be started.

  • hbase-policy.xml: the default policy configuration file, used by the RPC server to make authorization decisions on client requests based on its contents. Only used when HBase security is enabled.

  • hbase-site.xml: specifies configuration options that override the HBase defaults. Common options include:

    configuration item          description
    hbase.tmp.dir               Temporary directory on the local file system. By default it is under /tmp, which is emptied on every system restart, so it should be changed. Default: ${java.io.tmpdir}/hbase-${user.name}
    hbase.rootdir               The directory used by RegionServers, i.e. HBase's data storage directory. The path must be fully qualified; for example, for the /hbase directory of an HDFS file system on port 9000, write hdfs://namenode.example.org:9000/hbase. Default: ${hbase.tmp.dir}/hbase
    hbase.cluster.distributed   Whether the cluster runs in distributed mode. Default: false
    hbase.zookeeper.quorum      Comma-separated list of servers in the ZooKeeper ensemble
    dfs.replication             Replication factor used by the HDFS client; can be set to 1 in pseudo-distributed mode
  • log4j.properties: Configuration file for HBase logging via log4j. Modifying the parameters in this file can change the log level of HBase.

  • regionservers: Contains a list of all Region Server hosts running in the HBase cluster (by default, this file contains a single entry localhost). The file is a plain text file, each line is a hostname or IP address

For more configuration information, please refer to: https://hbase.apache.org/book.html#config.files

2. HBase installation and pseudo-distribution configuration

This tutorial uses the following directories and names; if yours differ, adjust the commands accordingly:

  • Username used in the tutorial: xiaobai
  • Location of the installation packages: /home/xiaobai/soft
  • Installation directory: /home/xiaobai/opt
  • Hostname: node1

After completing the preparations in step 0 and downloading the corresponding version of HBase, upload it to the ~/soft directory of the virtual machine, then complete the quick installation and configuration of HBase as follows.

1) Install HBase

tar -xvf ~/soft/hbase-2.1.9-bin.tar.gz  -C ~/opt
cd ~/opt
ln -s hbase-2.1.9 hbase

Configure environment variables

vi ~/.bashrc

Add the following:

HBASE_HOME=/home/xiaobai/opt/hbase
PATH=$HBASE_HOME/bin:$PATH

Make the environment variables take effect

source ~/.bashrc

2) Configuration

(1)hbase-env.sh

vi ~/opt/hbase/conf/hbase-env.sh

Modify the following two configuration parameters:

export JAVA_HOME=/home/xiaobai/opt/jdk
export HBASE_MANAGES_ZK=false

(2)hbase-site.xml

vi ~/opt/hbase/conf/hbase-site.xml

Add the following properties inside the <configuration> tag:

<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>

<property>
    <name>hbase.rootdir</name>
    <value>hdfs://node1:9000/hbase</value>
</property>

<property>
    <name>hbase.zookeeper.quorum</name>
    <value>node1</value>
</property>

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>

<property>
    <name>hbase.tmp.dir</name>
    <value>/home/xiaobai/opt/hbase/tmp</value>
</property>

(3)regionservers

vi ~/opt/hbase/conf/regionservers

Change the original localhost to the hostname node1

3. Start, view and stop services

1) Start

(1) Start hdfs

start-dfs.sh

(2) Start zookeeper

zkServer.sh start

(3) Start hbase

start-hbase.sh

2) View

(1) Process view

jps


(2) Web page view

http://node1:16010


3. HBase basic operation commands

0. help command

Enter the interactive interface and view help

hbase shell
> help

Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command.

Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group.

SHELL USAGE:

Quote all names in HBase Shell such as table and column names. Commas delimit command parameters. Type <RETURN> after entering a command to run it. Dictionaries of configuration used in the creation and alteration of tables are Ruby Hashes. They look like this:

{'key1' => 'value1', 'key2' => 'value2', ...}

and are opened and closed with curly braces. Key/values are delimited by the '=>' character combination. Usually keys are predefined constants such as NAME, VERSIONS, COMPRESSION, etc. Constants do not need to be quoted. Type 'Object.constants' to see a (messy) list of all constants in the environment.

If you are using binary keys or values and need to enter them in the shell, use double-quoted hexadecimal representation. For example:

hbase> get 't1', "key\x03\x3f\xcd"

hbase> get 't1', "key\003\023\011"

hbase> put 't1', "test\xef\xff", 'f1:', "\x01\x33\x40"

The following lists the relevant points for using the HBase shell:

  • Type hbase shell to enter the interactive interface

  • help shows help for a single command or a command group

  • Table names, column names, and other names must be quoted when used in commands

  • Command arguments are separated by commas

  • Press Enter to run a command

  • If a single-line command is too long, use the continuation character \ to split it across lines

  • When creating or altering a table, configuration is given in dictionary form: dictionaries are enclosed in curly braces, and keys and values are separated by =>

  • Constants do not need quotes; use Object.constants to see which constants are available

  • To use binary keys or values on the command line, write them in hexadecimal and enclose them in double quotes
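The double-quoted hexadecimal form maps directly to raw bytes. A quick Python check of what a shell string like "key\x03\x3f\xcd" denotes (illustrative only):

```python
# The HBase shell row key "key\x03\x3f\xcd" is three printable bytes
# ('k', 'e', 'y') followed by three raw bytes; Python byte literals
# use the same \xNN escape syntax.
row = b"key\x03\x3f\xcd"

assert len(row) == 6
assert row[:3] == b"key"
assert list(row[3:]) == [0x03, 0x3f, 0xcd]
```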

1. General operation

1) Query the server status

status

2) Query the HBase version

version

3) View all tables

list

2. Additions, deletions and modifications

1) Create a table

create 'member001','member_id','address','info'

2) Get the description of the table

describe 'member001'

3) Add a column family

alter 'member001', 'id'

4) Add data

In the HBase shell, data is inserted with the put command. Columns under a column family do not need to be created in advance; they can be specified in the form "column family:column qualifier" when needed. Add data as follows:

put 'member001', 'debugo','id','11'
put 'member001', 'debugo','info:age','27'
put 'member001', 'debugo','info:birthday','1991-04-04'
put 'member001', 'debugo','info:industry', 'it'
put 'member001', 'debugo','address:city','Shanghai'

put 'member001', 'debugo','address:country','China'
put 'member001', 'Sariel', 'id', '21'
put 'member001', 'Sariel','info:age', '26'
put 'member001', 'Sariel','info:birthday', '1992-05-09'
put 'member001', 'Sariel','info:industry', 'it'
put 'member001', 'Sariel','address:city', 'Beijing'
put 'member001', 'Sariel','address:country', 'China'
put 'member001', 'Elvis', 'id', '22'
put 'member001', 'Elvis','info:age', '26'
put 'member001', 'Elvis','info:birthday', '1992-09-14'
put 'member001', 'Elvis','info:industry', 'it'
put 'member001', 'Elvis','address:city', 'Beijing'
put 'member001', 'Elvis','address:country', 'china'

5) View table data

scan 'member001'

6) Delete a column family

alter 'member001', {NAME => 'member_id', METHOD => 'delete'}

7) Delete column
a) With the delete command, we can delete the 'info:age' cell of a given row key; the subsequent get then returns no value:

delete 'member001','debugo','info:age'
get 'member001','debugo','info:age'

b) To delete the entire row of values, use the deleteall command:

deleteall 'member001','debugo'
get 'member001','debugo'

8) Enable/disable a table with enable and disable; check whether a table is enabled or disabled with is_enabled and is_disabled

is_enabled 'member001'
is_disabled 'member001'

9) Use exists to check if the table exists

exists 'member001'

3. Queries

1) To find out how many rows are in the table, use the count command:

count 'member001'

2) get
a) Get all data for a row key:

get 'member001', 'Sariel'

b) Get all data of one column family for a row key:

get 'member001', 'Sariel', 'info'

3) View the help of scan

help 'scan'

Scan a table; pass table name and optionally a dictionary of scanner
specifications. Scanner specifications may include one or more of:
TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, ROWPREFIXFILTER, TIMESTAMP,
MAXLENGTH or COLUMNS, CACHE or RAW, VERSIONS, ALL_METRICS or METRICS

If no columns are specified, all columns will be scanned.
To scan all members of a column family, leave the qualifier empty as in
'col_family'.

The filter can be specified in two ways:

  1. Using a filterString - more information on this is available in the
    Filter Language document attached to the HBASE-4176 JIRA
  2. Using the entire package name of the filter.

If you wish to see metrics regarding the execution of the scan, the
ALL_METRICS boolean should be set to true. Alternatively, if you would
prefer to see only a subset of the metrics, the METRICS array can be
defined to include the names of only the metrics you care about.

Some examples:

hbase> scan 'hbase:meta'
hbase> scan 'hbase:meta', {COLUMNS => 'info:regioninfo'}
hbase> scan 'ns1:t1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
hbase> scan 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
hbase> scan 't1', {REVERSED => true}
hbase> scan 't1', {ALL_METRICS => true}
hbase> scan 't1', {METRICS => ['RPC_RETRIES', 'ROWS_FILTERED']}
hbase> scan 't1', {ROWPREFIXFILTER => 'row2', FILTER => "
(QualifierFilter (>=, 'binary:xyz')) AND (TimestampsFilter ( 123, 456))"}
hbase> scan 't1', {FILTER =>
org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
hbase> scan 't1', {CONSISTENCY => 'TIMELINE'}
For setting the Operation Attributes
hbase> scan 't1', { COLUMNS => ['c1', 'c2'], ATTRIBUTES => {'mykey' => 'myvalue'}}
hbase> scan 't1', { COLUMNS => ['c1', 'c2'], AUTHORIZATIONS => ['PRIVATE','SECRET']}
For experts, there is an additional option -- CACHE_BLOCKS -- which
switches block caching for the scanner on (true) or off (false). By
default it is enabled. Examples:

hbase> scan 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}

Also for experts, there is an advanced option -- RAW -- which instructs the
scanner to return all cells (including delete markers and uncollected deleted
cells). This option cannot be combined with requesting specific COLUMNS.
Disabled by default. Example:

hbase> scan 't1', {RAW => true, VERSIONS => 10}

Besides the default 'toStringBinary' format, 'scan' supports custom formatting
by column. A user can define a FORMATTER by adding it to the column name in
the scan specification. The FORMATTER can be stipulated:

  1. either as a org.apache.hadoop.hbase.util.Bytes method name (e.g, toInt, toString)
  2. or as a custom class followed by method name: e.g. 'c(MyFormatterClass).format'.

Example formatting cf:qualifier1 and cf:qualifier2 both as Integers:
hbase> scan 't1', {COLUMNS => ['cf:qualifier1:toInt',
'cf:qualifier2:c(org.apache.hadoop.hbase.util.Bytes).toInt'] }

Note that you can specify a FORMATTER by column only (cf:qualifier). You cannot
specify a FORMATTER for all columns of a column family.

Scan can also be used directly from a table, by first getting a reference to a
table, like such:

hbase> t = get_table 't'
hbase> t.scan

Note in the above situation, you can still provide all the filtering, columns,
options, etc as described above.

4) Query the entire table data

scan 'member001'

5) Scan an entire column family

scan 'member001', {COLUMN=>'info'}

6) Scan one specific column

scan 'member001', {COLUMNS=> 'info:birthday'}

7) In addition to COLUMNS, HBase also supports modifiers such as LIMIT (limit the number of result rows), STARTROW (the RowKey to start from; HBase first locates the Region from this key and then scans forward), STOPROW (the row to stop at), TIMERANGE (limit the timestamp range), VERSIONS (the number of versions to return), and FILTER (filter rows by condition). For example, start from the RowKey Sariel and return one row at the latest version:

scan 'member001', {STARTROW => 'Sariel', LIMIT=>1, VERSIONS=>1}
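Because HBase stores rows sorted by RowKey bytes, STARTROW plus LIMIT behaves like taking a slice of a sorted list, starting at the first key greater than or equal to STARTROW. A minimal Python sketch of that semantics, using the sample row keys from this tutorial (illustrative only, not how a Region scan is implemented):

```python
import bisect

# HBase keeps rows byte-ordered; note that byte order is case-sensitive,
# so 'Elvis' < 'Sariel' < 'debugo' (uppercase sorts before lowercase).
rows = sorted(["debugo", "Sariel", "Elvis"])

def scan(startrow, limit):
    i = bisect.bisect_left(rows, startrow)  # first row >= STARTROW
    return rows[i:i + limit]

assert rows == ["Elvis", "Sariel", "debugo"]
assert scan("Sariel", 1) == ["Sariel"]
assert scan("A", 2) == ["Elvis", "Sariel"]
```

This also explains why STARTROW does not have to be an existing RowKey: the scan simply begins at the first row at or after it.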

8) FILTER is a very powerful modifier that can apply a series of filtering conditions. For example, find cells whose value equals 26:

scan 'member001', FILTER=>"ValueFilter(=,'binary:26')"

Match values containing the substring 6:

scan 'member001', FILTER=>"ValueFilter(=,'substring:6')"

Column names prefixed with birth:

scan 'member001', FILTER=>"ColumnPrefixFilter('birth')"

Multiple filter conditions can be combined in FILTER with parentheses, AND, and OR:

scan 'member001', FILTER=>"ColumnPrefixFilter('birth') AND ValueFilter(=,'substring:1988')"

PrefixFilter filters on the RowKey prefix and is a very commonly used filter.

scan 'member001', FILTER=>"PrefixFilter('E')"

4. Delete

To delete a table, you need to disable the table first.

disable 'member001'
drop 'member001'


Origin: blog.csdn.net/tangyi2008/article/details/122593037