HBase summary

1, HBase core concepts

HBase is a distributed NoSQL database built on top of HDFS, aimed mainly at the big-data field. It supports highly concurrent reads and writes, scales out by adding nodes, and is suited to storing sparse, column-oriented data. Data is usually manipulated through the Java API.

2, HBase features

  • (1) Mass storage: it can store very large amounts of data.
  • (2) Column-oriented storage: data is stored by column family, values are stored as byte arrays, and a cell can keep multiple versions of a value.
  • (3) Easy scaling: the storage cluster can be expanded simply by adding servers (the underlying storage is HDFS).
  • (4) High concurrency: it supports highly concurrent read and write requests.
  • (5) The underlying HDFS, by itself, does not handle small files well, does not support concurrent writes or random modification of files, and has low query efficiency; HBase addresses these shortcomings.
  • Before starting HBase, the HDFS cluster and ZooKeeper must already be running.

3, HBase architecture

HMaster - HRegionServer - Region

  • Region: the smallest unit of distributed storage in an HBase cluster; each Region of a table holds part of that table's data.
  • An HBase cluster has only one meta table; this table has only one region, and that region's data is stored on one HRegionServer.
  • The meta table stores the region information of all user tables; we can run scan 'hbase:meta' to view the meta table's contents.

4, HBase data structures

rowkey (row key) - Column Family - Column - Cell - Timestamp

  • One HRegionServer manages multiple regions; a region contains several stores, and each column family corresponds to one store; a store consists of one MemStore and multiple StoreFiles; in the end the data is stored on HDFS as HFile files.

  • A cell has no data type; everything is stored as a byte array. A cell in a table can be assigned multiple times, and the timestamp of each write can be regarded as the version number of the cell's value (see the sketch after this list).

  • The memory of an HBase RegionServer is divided into two parts:

    • one part is the MemStore, mainly used for writes;
    • the other part is the BlockCache, mainly used for reads.
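
To make the multi-version behaviour concrete, here is a minimal sketch using the HBase 1.x Java API; it assumes the 'user' table with column family 'info' (created with VERSIONS => '3' in the shell examples below) and a connection set up as in section 7.

    Table table = connection.getTable(TableName.valueOf("user"));

    // each put on the same cell gets a new timestamp, i.e. a new version of the value
    table.put(new Put("rk0001".getBytes()).addColumn("info".getBytes(), "name".getBytes(), "zhangsan".getBytes()));
    table.put(new Put("rk0001".getBytes()).addColumn("info".getBytes(), "name".getBytes(), "lisi".getBytes()));

    // ask for up to 3 versions of the cell; the newest version is returned first
    Get get = new Get("rk0001".getBytes());
    get.setMaxVersions(3);
    for (Cell cell : table.get(get).rawCells()) {
        System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
    }
    table.close();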

5, HBase installation deployment

The HDFS cluster and ZooKeeper need to be started beforehand.

start-hbase.sh

Web UI: http://node01:60010

stop-hbase.sh

6, HBase shell basic operations


list
--create a table
create 'user', 'info', 'data'
create 'user', {NAME => 'info', VERSIONS => '3'},{NAME => 'data'}
--put (assign values)
put 'user', 'rk0001', 'info:name', 'zhangsan'
--query with get
get 'user', 'rk0001'
get 'user', 'rk0001', 'info:name', 'info:age'
--query with filters
 get 'user', 'rk0001', {FILTER => "ValueFilter(=, 'binary:zhangsan')"} 
get 'user', 'rk0001', {FILTER => "(QualifierFilter(=,'substring:a'))"}

--query with scan
scan 'user', {COLUMNS => 'info'}
scan 'user', {COLUMNS => 'info', RAW => true, VERSIONS => 5}
scan 'user', {COLUMNS => 'info', RAW => true, VERSIONS => 3}

scan 'user', {FILTER => "PrefixFilter('rk')"}   --prefix (fuzzy) query
scan 'user', {TIMERANGE => [1392368783980, 1392380169184]}   --time-range query

--update with alter / delete
alter 'user', NAME => 'info', VERSIONS => 5
delete 'user', 'rk0001', 'info:name'

count 'user'

disable 'user'
enable 'user'

7, HBase Java API operations

Create a Maven project and let the repositories configured in pom.xml pull in the dependencies automatically (they are downloaded from the Cloudera repository, which takes a while; be patient).

Working with the database: get a connection -> get a client object -> operate on the database -> close.

    private Connection connection;
    private final String TABLE_NAME = "myuser";
    private Table table;

    @BeforeTest
    public void initTable () throws IOException {
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("HBase.zookeeper.quorum","node01:2181,node02:2181");
        connection= ConnectionFactory.createConnection(configuration);
        table = connection.getTable(TableName.valueOf(TABLE_NAME));
    }

   @Test
 public void createData() throws IOException {
        Admin admin = connection.getAdmin();   // get the Admin object, which is used for DDL operations on the database
        TableName myuser = TableName.valueOf("myuser");   // specify the table name
        HTableDescriptor hTableDescriptor = new HTableDescriptor(myuser);
        // specify two column families
        HColumnDescriptor f1 = new HColumnDescriptor("f1");
        HColumnDescriptor f2 = new HColumnDescriptor("f2");
        hTableDescriptor.addFamily(f1);
        hTableDescriptor.addFamily(f2);
        admin.createTable(hTableDescriptor);
        admin.close();
 }

    @Test
    public void addData() throws IOException {
        // get the table
        Table table = connection.getTable(TableName.valueOf(TABLE_NAME));
        Put put = new Put("0001".getBytes());   // create a Put object and specify the rowkey
        put.addColumn("f1".getBytes(),"name".getBytes(),"zhangsan".getBytes());

        table.put(put);
    }
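
For symmetry, a minimal read sketch (not in the original post) that reuses the table field initialized in initTable above, still with the HBase 1.x API:

    @Test
    public void getData() throws IOException {
        // read one row by its rowkey and print a value from column family f1
        Get get = new Get("0001".getBytes());
        get.addFamily("f1".getBytes());
        Result result = table.get(get);
        System.out.println("name = " + Bytes.toString(result.getValue("f1".getBytes(), "name".getBytes())));
    }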

    @AfterTest
    public void close() throws IOException {
        table.close();
        connection.close();
    }

8, HBase filter query

The role of a filter is to decide, on the server side, whether the data meets the given conditions; only data that meets the conditions is returned to the client.

There are many filter types, but they can be divided into two groups: comparison filters and special filters.

  1. Comparison filters: row key filter RowFilter, column family filter FamilyFilter, column qualifier filter QualifierFilter, column value filter ValueFilter
// query which column values contain the digit 8
    @Test
    public void contains8() throws IOException {
        Scan scan = new Scan();
        SubstringComparator substringComparator = new SubstringComparator("8");
        // column value filter: keep all columns whose value contains the digit 8
        ValueFilter valueFilter = new ValueFilter(CompareFilter.CompareOp.EQUAL, substringComparator);
        scan.setFilter(valueFilter);
        ResultScanner scanner = table.getScanner(scan);
        printlReult(scanner);
    }
  2. Special filters

1) Single column value filter SingleColumnValueFilter: returns all columns of the rows whose specified column value meets the condition

2) Column value exclusion filter SingleColumnValueExcludeFilter: excludes the specified column and returns all the other columns

3) Rowkey prefix filter PrefixFilter: queries all rows whose rowkey starts with a given prefix

4) Paging filter PageFilter

  3. Combined queries with multiple filters: FilterList

  • Requirement: query the rows where, in column family f1, name is 刘备 (Liu Bei) and, at the same time, the rowkey starts with the prefix 00 (PrefixFilter)

    @Test
    public  void filterList() throws IOException {
        Scan scan = new Scan();
        SingleColumnValueFilter singleColumnValueFilter = new SingleColumnValueFilter("f1".getBytes(), "name".getBytes(), CompareFilter.CompareOp.EQUAL, "刘备".getBytes());
        PrefixFilter prefixFilter = new PrefixFilter("00".getBytes());
        
        FilterList filterList = new FilterList();
        filterList.addFilter(singleColumnValueFilter);
        filterList.addFilter(prefixFilter);
        
        scan.setFilter(filterList);
        ResultScanner scanner = table.getScanner(scan);
        printlReult(scanner);
    }
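
The two filter examples above call a helper printlReult(scanner) that the post does not show; a minimal sketch of what it might look like (the method name is kept as in the original code):

    // print every cell of every row returned by the scanner
    private void printlReult(ResultScanner scanner) {
        for (Result result : scanner) {
            for (Cell cell : result.rawCells()) {
                System.out.println(Bytes.toString(CellUtil.cloneRow(cell)) + " "
                        + Bytes.toString(CellUtil.cloneFamily(cell)) + ":"
                        + Bytes.toString(CellUtil.cloneQualifier(cell)) + " = "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }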

9, HBase in real application scenarios

1) Transportation: ship GPS information; the GPS data of ships on the whole Yangtze River amounts to roughly 10 million records stored per day.

2) Finance: consumer information, loan information, credit card and other payment information.

3) E-commerce: transaction information, logistics information, and browsing information of e-commerce websites, etc.

4) Telecom: call information, such as voice call detail records.

Summary: storage of massive detail data that later needs good query performance.

10, How HBase reads data

  • 1) The client first connects to ZooKeeper and finds from ZooKeeper which HRegionServer holds the meta table; it then connects to that HRegionServer and reads the meta table.
  • 2) Based on the query information, it looks up the region that holds the requested data, finds the RegionServer serving that region, and sends the read request.
  • 3) That RegionServer searches for and locates the corresponding Region.
  • 4) The data is first looked up in the MemStore; if it is not there, the BlockCache is read; if it is still not found, the StoreFiles are read.
  • 5) Data read from a StoreFile is not returned to the client directly; it is first written into the BlockCache to speed up subsequent queries, and only then is the result returned to the client.

11, How HBase writes data

  • 1) The client first connects to ZooKeeper and finds from ZooKeeper which HRegionServer holds the meta table; it then connects to that HRegionServer and reads the meta table.

  • 2) Based on the rowkey, it looks up the region the data belongs to, finds the RegionServer serving that region, and sends the write request.

  • 3) That RegionServer searches for and locates the corresponding Region.

  • 4) The data is written to the HLog (write-ahead log) and to the MemStore, one copy each.

  • 5) Flush: when a MemStore reaches a threshold, its data is flushed to disk, producing StoreFiles. Triggers include:

    • a MemStore of a Region reaches 128 MB;
    • the total size of all MemStores in a Region reaches block.multiplier * flush.size;
    • the number of HLogs on the RegionServer reaches its upper limit.
  • 6) Compaction (a manual-trigger sketch follows this list):

    • Minor compaction: merges several small StoreFiles into a larger StoreFile;
    • Major compaction: merges all HFiles of a Store into one HFile.
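
Flush and compaction are normally triggered automatically by the thresholds above, but they can also be requested by hand. A minimal sketch with the HBase 1.x Admin API, assuming the connection and the 'myuser' table from section 7:

    Admin admin = connection.getAdmin();
    // ask for the MemStores of the table to be flushed to StoreFiles
    admin.flush(TableName.valueOf("myuser"));
    // ask for a major compaction: all HFiles of each store get rewritten into one
    admin.majorCompact(TableName.valueOf("myuser"));
    admin.close();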

12, region splitting mechanism

A region stores the data for a range of rowkeys; when a region holds too much data, query efficiency suffers. Therefore, when a region grows too big, HBase splits it.

HBase has had several region split policies: fixed-size split (10 GB; the default before version 0.94), incrementally increasing upper-bound split (roughly (number of splits)^3 * 256 MB; the default from 0.94 up to 2.0), and stepping split (the default since version 2.0).
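
Besides the automatic policies, a split can also be requested manually at a chosen rowkey; a small sketch with the 1.x Admin API (the table name and split point are illustrative):

    Admin admin = connection.getAdmin();
    // request a split of the region containing rowkey "0005", using it as the split point
    admin.split(TableName.valueOf("myuser"), "0005".getBytes());
    admin.close();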

13, region pre-partition

Each region maintains a startRow and an endRowKey; if the rowkey of incoming data falls within the range a region maintains, the data is handed to that region.

Pre-partitioning improves read/write efficiency, prevents data skew (helps load balancing), and optimizes the number of Map tasks.

Manually specify the split points:

create 'person','info1','info2',SPLITS => ['1000','2000','3000','4000']

Use the HexStringSplit algorithm:

create 'mytable', 'base_info', 'extra_info', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
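
The same pre-splitting can be done from the Java API by passing split keys to createTable; a sketch (not in the original post) that mirrors the first shell statement above, using the 1.x API from section 7:

    Admin admin = connection.getAdmin();
    HTableDescriptor person = new HTableDescriptor(TableName.valueOf("person"));
    person.addFamily(new HColumnDescriptor("info1"));
    person.addFamily(new HColumnDescriptor("info2"));
    // the split keys define the initial region boundaries, just like SPLITS => [...] in the shell
    byte[][] splitKeys = {"1000".getBytes(), "2000".getBytes(), "3000".getBytes(), "4000".getBytes()};
    admin.createTable(person, splitKeys);
    admin.close();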

14, region merge

Region merging is mainly done for maintenance purposes; for example, after a large amount of data has been deleted, regions may become very small, and keeping many tiny regions is wasteful.

Regions can be cold-merged through the Merge class (done offline), or hot-merged online through online_merge.
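
For an online (hot) merge from the Java API, a rough sketch with the 1.x Admin API; it merges the first two regions of the 'myuser' table and is only meant to show the calls involved (java.util.List and the usual HBase imports are assumed):

    Admin admin = connection.getAdmin();
    List<HRegionInfo> regions = admin.getTableRegions(TableName.valueOf("myuser"));
    // merge two adjacent regions; forcible = false keeps the adjacency check
    admin.mergeRegions(regions.get(0).getEncodedNameAsBytes(),
            regions.get(1).getEncodedNameAsBytes(), false);
    admin.close();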

15, HBase MapReduce Integration

HBase table data is ultimately stored on HDFS, so HBase naturally supports MapReduce operations.

With MapReduce we can process the data of an HBase table directly, and the MapReduce results can also be written straight back into an HBase table.
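
A rough sketch of the read side, assuming an HBase 1.x cluster, the hbase-server dependency on the classpath, and the usual Hadoop/HBase imports; the class, table, and output-path names are illustrative:

    // mapper that reads rows of the HBase table and emits rowkey -> name
    public static class ReadUserMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context)
                throws IOException, InterruptedException {
            String name = Bytes.toString(value.getValue("f1".getBytes(), "name".getBytes()));
            context.write(new Text(Bytes.toString(key.get())), new Text(name));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node01:2181,node02:2181");
        Job job = Job.getInstance(conf, "read-myuser");
        // TableMapReduceUtil wires the HBase table in as the MapReduce input
        TableMapReduceUtil.initTableMapperJob("myuser", new Scan(), ReadUserMapper.class,
                Text.class, Text.class, job);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/hbase_mr_out"));
        job.setNumReduceTasks(0);   // map-only job: write the mapper output straight to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }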

16, HBase Hive integration

Hive provides integration with HBase, so that Hive SQL statements can be used to query HBase tables, including INSERT as well as complex queries such as JOIN and UNION; a Hive table can also be mapped onto an HBase table.

Hive vs. HBase:

  • Hive is a data warehouse tool; HBase is a column-oriented, non-relational (NoSQL) database.
  • Hive is used for data analysis and offline cleaning, with high latency; HBase has low latency and is accessed by online business systems.
  • Hive is based on HDFS and MapReduce; HBase is based on HDFS.
  • Hive is a SQL-like engine that runs MapReduce jobs; HBase is a NoSQL key/value database on Hadoop.

Copy the five HBase jar packages into Hive's lib directory.

17, HBase coprocessor

Starting with version 0.92, HBase introduced coprocessors, which make it easy to build secondary indexes, do complex filtering (predicate pushdown), and implement access control.

There are two kinds of coprocessors: observer coprocessors and endpoint coprocessors.

Observer coprocessor:

  • Similar to triggers in a traditional database; it works mainly on the server side (a minimal sketch appears at the end of this section).
  • It allows the cluster to behave differently during the normal course of client operations.
  • It can be used to implement permission management, priority settings, monitoring, DDL control, secondary indexes, and other functions.

Endpoint coprocessor:

  • Similar to stored procedures in a traditional database; it is invoked from the client side.
  • It allows the capabilities of the cluster to be extended and new operation commands to be exposed to client applications.
  • It can implement functions such as min, max, avg, sum, distinct, group by, and so on.

There are two ways to load a coprocessor:

  • Static loading: modify hbase-site.xml.
  • Dynamic loading: enable the coprocessor on a table (for example, table aggregation); it takes effect only for that specific table.
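
As a taste of the observer side, a minimal sketch of a RegionObserver for HBase 1.x that intercepts writes; the class name, the 'myuser_index' table, and the idea of indexing the name column are illustrative, not from the original post. It would be attached to a table either statically via hbase-site.xml or dynamically via an alter on that table.

    // fires before every Put on the table the coprocessor is attached to
    public class IndexObserver extends BaseRegionObserver {
        @Override
        public void prePut(ObserverContext<RegionCoprocessorEnvironment> c, Put put,
                           WALEdit edit, Durability durability) throws IOException {
            List<Cell> cells = put.get("f1".getBytes(), "name".getBytes());
            if (cells.isEmpty()) {
                return;   // this Put does not touch f1:name, nothing to index
            }
            // write a row into a (hypothetical) index table, keyed by the name value
            byte[] name = CellUtil.cloneValue(cells.get(0));
            Table index = c.getEnvironment().getTable(TableName.valueOf("myuser_index"));
            index.put(new Put(name).addColumn("f1".getBytes(), "rowkey".getBytes(), put.getRow()));
            index.close();
        }
    }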

18, HBase table rowkey design (three principles)

The length principle, the hashing principle, and the uniqueness principle.

1) Rowkey length principle: the rowkey should be as short as possible, but not so short that it loses its meaning.

2) Rowkey hashing principle: put a hashed field in the high-order part of the rowkey so that data spreads evenly across regions.
3) Rowkey uniqueness principle: the rowkey must be unique; rowkeys are stored sorted lexicographically.

19, Hotspotting

When HBase retrieves a record, it first locates the data row by its row key.
When a large number of clients access only one or a few nodes of the HBase cluster, a small number of RegionServers become overloaded while the load on the others stays very light; this is the "hotspot" phenomenon.

Hotspot solutions (pre-partitioning, salting, hashing, reversing):

1) Pre-partitioning: spread the table's data evenly across the cluster.
2) Rowkey salting: prepend a random number to the rowkey (see the sketch after this list).
3) Hashing: a deterministic form of salting, so that the same row always gets the same prefix.
4) Reversing: reverse a fixed-length or numeric rowkey, so that the frequently changing part (the least significant part) comes first.
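
A small sketch of what salting, hashing, and reversing might look like in Java; these helpers are illustrative and not from the original post:

    // salting: a random prefix spreads writes evenly, but reads must then check every possible prefix
    public static String saltRowkey(String rowkey, int buckets) {
        int prefix = new java.util.Random().nextInt(buckets);
        return prefix + "_" + rowkey;
    }

    // hashing: the prefix is derived from the rowkey itself, so the same row always gets
    // the same prefix and single-row gets still work
    public static String hashRowkey(String rowkey, int buckets) {
        int prefix = (rowkey.hashCode() & Integer.MAX_VALUE) % buckets;
        return prefix + "_" + rowkey;
    }

    // reversing: e.g. a phone number, so the frequently changing low digits come first
    public static String reverseRowkey(String rowkey) {
        return new StringBuilder(rowkey).reverse().toString();
    }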

20, HBase data backup

1) Use the class that HBase provides to export the data of a particular table to HDFS.

2) Use HBase snapshots to migrate and copy table data.

Create a snapshot of the table:
snapshot 'tableName', 'snapshotName'

Restore the snapshot:
disable 'tableName'
restore_snapshot 'snapshotName'
enable 'tableName'
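
The same snapshot operations are available from the Java Admin API; a minimal sketch with the 1.x API (the table and snapshot names are illustrative):

    Admin admin = connection.getAdmin();
    admin.snapshot("user_snapshot", TableName.valueOf("user"));   // create the snapshot
    admin.disableTable(TableName.valueOf("user"));                // the table must be disabled before restoring
    admin.restoreSnapshot("user_snapshot");
    admin.enableTable(TableName.valueOf("user"));
    admin.close();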

21, HBase secondary index

The rowkey acts as the index of an HBase table.

An HBase secondary index establishes a mapping between a table's column values and its rowkeys.

Options for building secondary indexes on HBase:

  • (1) MapReduce-based scheme
  • (2) HBase coprocessor scheme
  • (3) Solr + HBase scheme
  • (4) Elasticsearch (ES) + HBase scheme
  • (5) Phoenix + HBase scheme

Building secondary indexes with Phoenix

Phoenix builds a SQL layer on top of HBase and lets us use standard JDBC APIs, rather than the HBase client APIs, to create tables, insert data into HBase, and query data. Phoenix provides support for HBase secondary indexes.

To enable Phoenix secondary indexes, you need to modify the hbase-site.xml configuration file.

  • Global indexing: suited to business scenarios with many reads and few writes.
  • Local indexing: suited to scenarios with frequent writes and limited space.
--create a secondary index on table USER based on Global Indexing
create index USER_COOKIE_ID_INDEX on USER ("f"."cookie_id"); 

--create a secondary index on table USER based on Local Indexing
create local index USER_USER_ID_INDEX on USER ("f"."user_id");