Article Directory
- 1, HBase core concepts
- 2, HBase features
- 3, HBase architecture
- 4, HBase data structures
- 5, HBase installation and deployment
- 6, HBase shell basic operations
- 7, HBase Java API operations
- 8, HBase filter queries
- 9, HBase in practical application scenarios
- 10, How HBase reads data
- 11, How HBase writes data
- 12, Region splitting mechanism
- 13, Region pre-partitioning
- 14, Region merging
- 15, HBase MapReduce integration
- 16, HBase Hive integration
- 17, HBase coprocessors
- 18, HBase rowkey design (three principles)
- 19, Hotspotting
- 20, HBase data backup
- 21, HBase secondary indexes
1, HBase core concepts
HBase is a distributed NoSQL database built on top of HDFS, used mainly in the big data field. It supports highly concurrent reads and writes, scales out by adding nodes, stores sparse data by column, and is usually operated through its Java API.
2, HBase features
- (1) Mass storage: can store very large amounts of data.
- (2) Columnar storage: data is stored by column family, as byte arrays, and a value can have multiple versions.
- (3) Easy scaling: storage capacity can be expanded by adding servers (the underlying storage is HDFS).
- (4) High concurrency: supports highly concurrent read and write requests.
- (5) By contrast, HDFS on its own handles small files poorly, supports neither concurrent writes nor random file modification, and has low query efficiency; HBase fills these gaps.
- Before starting HBase, the HDFS cluster and ZooKeeper must already be running.
3, HBase architecture
HMaster - HRegionServer - Region
- Region: the smallest unit of distributed storage in an HBase cluster; each Region holds one part of a table's data.
- An HBase cluster has exactly one meta table (hbase:meta); that table has a single region, stored on one HRegionServer.
- The meta table holds the region information of all user tables. It can be inspected with:
scan 'hbase:meta'
4, HBase data structures
rowkey (row key) - Column Family - Column - Cell - Timestamp
- One HRegionServer manages multiple regions; a region contains multiple stores (one store per column family); each store has one MemStore and multiple StoreFiles; the data ultimately ends up in HDFS as HFile files.
- Cell data has no type; everything is stored as byte arrays. When a cell is assigned several times, each assignment carries a timestamp, which can be viewed as the version number of the cell value.
- The memory of an HBase RegionServer is divided into two parts:
  - the MemStore, used mainly for writing;
  - the BlockCache, used mainly for reading.
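As an illustration of the logical model above, here is a toy sketch in plain Java (NOT HBase's actual implementation; the class and method names are invented for this example): a value is addressed by (rowkey, column family, qualifier, timestamp), and multiple timestamps give multiple versions of the same cell.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Toy model of HBase's logical layout, for intuition only.
public class CellModel {
    // key: "rowkey/family:qualifier"; value: timestamp -> cell bytes, newest first
    private final TreeMap<String, TreeMap<Long, byte[]>> table = new TreeMap<>();

    public void put(String rowkey, String family, String qualifier, long ts, byte[] value) {
        table.computeIfAbsent(rowkey + "/" + family + ":" + qualifier,
                k -> new TreeMap<>(Comparator.reverseOrder()))
             .put(ts, value);
    }

    // like a Get with VERSIONS => 1: returns the newest version of the cell, or null
    public byte[] get(String rowkey, String family, String qualifier) {
        TreeMap<Long, byte[]> versions = table.get(rowkey + "/" + family + ":" + qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        CellModel m = new CellModel();
        m.put("rk0001", "info", "name", 1L, "zhangsan".getBytes());
        m.put("rk0001", "info", "name", 2L, "lisi".getBytes());
        // the newest timestamp wins; older versions are kept as history
        System.out.println(new String(m.get("rk0001", "info", "name"))); // lisi
    }
}
```

In real HBase the versions per cell are bounded by the column family's VERSIONS setting, and everything is bytes, exactly as in the notes above.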
5, HBase installation and deployment
The HDFS cluster and ZooKeeper need to be started first.
start-hbase.sh
Web UI: http://node01:60010
stop-hbase.sh
6, HBase shell basic operations
list
-- create a table
create 'user', 'info', 'data'
create 'user', {NAME => 'info', VERSIONS => '3'}, {NAME => 'data'}
-- put (insert or update a value)
put 'user', 'rk0001', 'info:name', 'zhangsan'
-- query with get
get 'user', 'rk0001'
get 'user', 'rk0001', 'info:name', 'info:age'
-- query with a filter
get 'user', 'rk0001', {FILTER => "ValueFilter(=, 'binary:zhangsan')"}
get 'user', 'rk0001', {FILTER => "(QualifierFilter(=,'substring:a'))"}
-- query with scan
scan 'user', {COLUMNS => 'info'}
scan 'user', {COLUMNS => 'info', RAW => true, VERSIONS => 5}
scan 'user', {COLUMNS => 'info', RAW => true, VERSIONS => 3}
scan 'user', {FILTER => "PrefixFilter('rk')"}   -- prefix (fuzzy) query
scan 'user', {TIMERANGE => [1392368783980, 1392380169184]}   -- time-range query
-- alter a column family
alter 'user', NAME => 'info', VERSIONS => 5
-- delete, count, disable/enable
delete 'user', 'rk0001', 'info:name'
count 'user'
disable 'user'
enable 'user'
7, HBase Java API operations
Create a Maven project and add the required dependencies to pom.xml (they are downloaded from the Cloudera repository, which can take a long time, so be patient).
Operating on the database follows the pattern: get a connection - get a client object (Table / Admin) - operate on the table - close.
private Connection connection;
private final String TABLE_NAME = "myuser";
private Table table;

@BeforeTest
public void initTable() throws IOException {
    Configuration configuration = HBaseConfiguration.create();
    // note: the property name is lowercase "hbase.zookeeper.quorum"
    configuration.set("hbase.zookeeper.quorum", "node01:2181,node02:2181");
    connection = ConnectionFactory.createConnection(configuration);
    table = connection.getTable(TableName.valueOf(TABLE_NAME));
}

@Test
public void createTable() throws IOException {
    Admin admin = connection.getAdmin();            // the Admin object performs DDL operations
    TableName myuser = TableName.valueOf("myuser"); // specify the table name
    HTableDescriptor hTableDescriptor = new HTableDescriptor(myuser);
    // declare two column families
    HColumnDescriptor f1 = new HColumnDescriptor("f1");
    HColumnDescriptor f2 = new HColumnDescriptor("f2");
    hTableDescriptor.addFamily(f1);
    hTableDescriptor.addFamily(f2);
    admin.createTable(hTableDescriptor);
    admin.close();
}

@Test
public void addData() throws IOException {
    // get the table
    Table table = connection.getTable(TableName.valueOf(TABLE_NAME));
    Put put = new Put("0001".getBytes());  // create a Put and specify the rowkey
    put.addColumn("f1".getBytes(), "name".getBytes(), "zhangsan".getBytes());
    table.put(put);
}

@AfterTest
public void close() throws IOException {
    table.close();
    connection.close();
}
8, HBase filter queries
Filters let the server decide whether data meets the query conditions, so that only matching data is returned to the client.
There are many filter types, broadly divided into comparison filters and special-purpose filters.
- Comparison filters: the row-key filter RowFilter, the column-family filter FamilyFilter, the column-qualifier filter QualifierFilter, and the column-value filter ValueFilter.
// query all columns whose value contains the digit 8
@Test
public void contains8() throws IOException {
    Scan scan = new Scan();
    SubstringComparator substringComparator = new SubstringComparator("8");
    // column-value filter: keep every column whose value contains the digit 8
    ValueFilter valueFilter = new ValueFilter(CompareFilter.CompareOp.EQUAL, substringComparator);
    scan.setFilter(valueFilter);
    ResultScanner scanner = table.getScanner(scan);
    printResult(scanner); // helper (not shown) that prints the scan results
}
- Special-purpose filters:
  1) Single-column value filter SingleColumnValueFilter: returns the entire row for every row whose specified column value meets the condition.
  2) Column-value exclude filter SingleColumnValueExcludeFilter: excludes the specified column and returns all the other columns.
  3) Rowkey prefix filter PrefixFilter: queries all rows whose rowkey starts with a given prefix.
  4) Page filter PageFilter: paginates scan results.
- Combining multiple filters with FilterList
  - Requirement: in column family f1, query the rows where name is 刘备 (Liu Bei) and, at the same time, the rowkey starts with the prefix 00 (PrefixFilter).
@Test
public void filterList() throws IOException {
    Scan scan = new Scan();
    SingleColumnValueFilter singleColumnValueFilter = new SingleColumnValueFilter("f1".getBytes(), "name".getBytes(), CompareFilter.CompareOp.EQUAL, "刘备".getBytes());
    PrefixFilter prefixFilter = new PrefixFilter("00".getBytes());
    // by default a FilterList requires all of its filters to pass (MUST_PASS_ALL)
    FilterList filterList = new FilterList();
    filterList.addFilter(singleColumnValueFilter);
    filterList.addFilter(prefixFilter);
    scan.setFilter(filterList);
    ResultScanner scanner = table.getScanner(scan);
    printResult(scanner); // helper (not shown) that prints the scan results
}
9, HBase in practical application scenarios
1) Transportation: ship GPS information; GPS data for ships along the whole Yangtze River amounts to roughly 10 million records per day.
2) Finance: consumer information, loan information, credit card and other payment information.
3) E-commerce: transaction information, logistics information, browsing information, etc.
4) Telecom: call information, such as voice call detail records.
Summary: storage of massive detail records, where good query performance is needed afterwards.
10, Hbase read data
- 1, the client is first connected with zk, find HRegionServer table comprising from meta zk, comprising HRegionServer this connection, reads meta data table;
- 2, according to the query information, the information data to find the corresponding region, found in the region corresponding to regionserver, and send the request
- 3, corresponding to a search and locate Region,
- 4, starting memstore find data - If you do not read from the BlockCache ---- If there is no StoreFile then read on.
- After 5, the data read from the storeFile, not directly returns the data to the client, but the data is first written to BlockCache, the goal is to speed up the subsequent query; then return the results to the client.
11, How HBase writes data
- 1) The client first connects to ZooKeeper and finds from it which HRegionServer hosts the hbase:meta table; it connects to that HRegionServer and reads the meta table.
- 2) From the rowkey it determines, via meta, which region the data belongs to and which RegionServer hosts that region, then sends the write request there.
- 3) That RegionServer searches for and locates the corresponding Region.
- 4) The data is written both to the HLog (the write-ahead log) and to the MemStore of the target store.
- 5) Flush: when a MemStore reaches a threshold, its data is flushed to disk, producing StoreFiles. Triggers include:
  - any single MemStore in a region reaching 128 MB;
  - the total size of all MemStores in a region reaching block.multiplier * flush.size;
  - the number of HLogs on a RegionServer reaching its upper limit.
- 6) Compaction:
  - Minor compaction: merges several small StoreFiles into one relatively large StoreFile.
  - Major compaction: merges all the HFiles of a store into a single HFile.
12, Region splitting mechanism
Regions store data by rowkey range; when a region accumulates too many rows, query efficiency suffers, so when a region grows too large HBase splits it.
HBase has had three main split policies: fixed-size splitting (split at 10 GB; the default before 0.94), increasing-upper-bound splitting (split at min(regionCount^3 * 256 MB, 10 GB); the default from 0.94 to 2.0), and stepping splitting (SteppingSplitPolicy; the default since 2.0).
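The increasing-upper-bound threshold can be sketched numerically. This is a plain-Java illustration of the commonly quoted formula min(regionCount^3 * 2 * flushSize, maxFileSize) with the usual defaults (flush size 128 MB, max file size 10 GB); check it against your HBase version's IncreasingToUpperBoundRegionSplitPolicy source before relying on the exact numbers.

```java
// Sketch of the increasing-upper-bound split threshold (0.94-2.0 default policy).
public class SplitThreshold {
    static final long MB = 1024L * 1024L;
    static final long FLUSH_SIZE = 128 * MB;        // hbase.hregion.memstore.flush.size
    static final long MAX_FILE_SIZE = 10240 * MB;   // hbase.hregion.max.filesize (10 GB)

    // threshold at which a region splits, given how many regions of this table
    // the RegionServer already hosts
    static long splitSize(int regionCount) {
        long grown = (long) regionCount * regionCount * regionCount * 2 * FLUSH_SIZE;
        return Math.min(grown, MAX_FILE_SIZE);
    }

    public static void main(String[] args) {
        for (int r = 1; r <= 4; r++) {
            System.out.println(r + " region(s): split at " + splitSize(r) / MB + " MB");
        }
        // 1 -> 256 MB, 2 -> 2048 MB, 3 -> 6912 MB, 4 -> capped at 10240 MB
    }
}
```

This shows why early regions split quickly (256 MB, then 2 GB) and later ones settle at the fixed 10 GB ceiling.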
13, Region pre-partitioning
Each region maintains a startRowKey and an endRowKey; if a record's rowkey falls within a region's range, the record is stored in that region.
Pre-partitioning improves read/write efficiency, prevents data skew (helping load balancing), and optimizes the number of Map tasks.
Manually specifying the split points:
create 'person','info1','info2',SPLITS => ['1000','2000','3000','4000']
Using the HexStringSplit algorithm:
create 'mytable','base_info','extra_info', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
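For intuition about what an algorithm such as HexStringSplit produces, here is a plain-Java sketch that computes evenly spaced 8-digit hex boundaries over the keyspace. The real implementation lives in HBase's RegionSplitter, so treat this as an approximation for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Evenly spaced hex split boundaries, in the spirit of HexStringSplit.
public class HexSplits {
    // numRegions regions need numRegions - 1 split keys between 00000000 and ffffffff
    static List<String> splitKeys(int numRegions) {
        List<String> keys = new ArrayList<>();
        long range = 0xFFFFFFFFL;
        for (int i = 1; i < numRegions; i++) {
            long boundary = range * i / numRegions;
            keys.add(String.format("%08x", boundary));
        }
        return keys;
    }

    public static void main(String[] args) {
        // 4 regions -> 3 boundaries
        System.out.println(splitKeys(4)); // [3fffffff, 7fffffff, bfffffff]
    }
}
```

Rowkeys must then be hex-encoded (or otherwise uniformly distributed) for these boundaries to balance the load.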
14, Region merging
Region merging is mainly a maintenance operation: after large amounts of data have been deleted, regions can become very small, and keeping many tiny regions is wasteful.
Regions can be merged offline through the Merge class (cold merge) or online through online_merge (hot merge).
15, HBase MapReduce integration
HBase tables are ultimately stored on HDFS, so HBase naturally supports MapReduce:
an MR job can read and process the data of an HBase table directly, and can also write its results straight back into an HBase table.
16, HBase Hive integration
Hive provides integration with HBase, so that HBase tables can be queried with Hive SQL statements, including inserts and complex queries such as Join and Union; a Hive table can also be mapped onto an HBase table.

Hive | HBase |
---|---|
Data warehouse | Column-oriented non-relational (NoSQL) database |
Offline data analysis and cleaning; high latency | Low latency; accessed by online business systems |
Based on HDFS and MapReduce | Based on HDFS |
A SQL-like engine that runs MapReduce jobs | A NoSQL key/value database on Hadoop |

To integrate them, copy the five required HBase jars into Hive's lib directory.
17, HBase coprocessor
After 0.92 HBase introduced coprocessor (coprocessors), a secondary index can be easily established, a complex filter (predicate pushdown), and access control.
Two coprocessor: Observer endpoint coprocessor coprocessor
Observer coprocessor:
- Similar to the traditional database triggers , mainly in the service end
- In the normal course of cluster allows client operations, there may behave differently
- You can achieve rights management, priority setting, monitoring, ddl control, the secondary index and other functions
endpoint coprocessor
- Similar to the traditional database stored procedure , the main work in the client end
- Ability to allow the expansion of the cluster, the client application and opening a new operation command
- May be implemented min, max, avg, sum, distinct, group by other functions
There are two ways to load coprocessor
- Static load modification hbase-site.xml
- Enable dynamic loading table aggregation, only take effect for a specific table.
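For static loading, the usual approach is to register the coprocessor class in hbase-site.xml under the standard region-coprocessor property; the class name below is a hypothetical placeholder, not a real implementation.

```xml
<!-- hbase-site.xml: statically load a RegionObserver for all tables -->
<property>
  <name>hbase.coprocessor.region.classes</name>
  <!-- hypothetical coprocessor class; replace with your own implementation -->
  <value>com.example.MyRegionObserver</value>
</property>
```

The jar containing the class must be on every RegionServer's classpath, and a static change only takes effect after a cluster restart, whereas dynamic loading via `alter` on a table does not require one.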
18, HBase rowkey design (three principles)
The three principles: length, hashing, uniqueness.
1) Length principle: keep the rowkey as short as possible, but not so short that it stops being meaningful.
2) Hashing principle: the high-order part of the rowkey should be evenly distributed (hashed).
3) Uniqueness principle: rowkeys must be unique; note that they are stored sorted lexicographically.
19, Hotspotting
When HBase retrieves a record, it first locates the data row by its rowkey.
When a large number of clients access only one or a few nodes of the cluster, a small number of RegionServers become overloaded while the others stay almost idle; this is the "hotspot" phenomenon.
Hotspot solutions (pre-partitioning, salting, hashing, reversing):
1) Pre-partitioning: spread the table's data evenly across the cluster in advance.
2) Rowkey salting: prepend a random prefix to the rowkey.
3) Hashing: like salting, but deterministic, so the same row always gets the same prefix and reads can reconstruct the key.
4) Reversing: reverse a fixed-length or numeric rowkey so that its most frequently changing (least significant) part comes first.
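The salting, hashing, and reversing tricks above can be sketched in plain Java; these helper names are invented for this example and are not HBase APIs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Random;

// Illustrations of the three rowkey transformations used against hotspotting.
public class RowkeyTricks {
    // 1) Salting: prepend a random bucket prefix so sequential rowkeys
    //    spread across regions (reads must then scan all buckets).
    static String salt(String rowkey, int buckets) {
        int bucket = new Random().nextInt(buckets);
        return String.format("%02d_%s", bucket, rowkey);
    }

    // 2) Hashing: deterministic prefix derived from the key itself,
    //    so the same row always maps to the same prefix and Gets still work.
    static String hashPrefix(String rowkey) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(rowkey.getBytes(StandardCharsets.UTF_8));
        return String.format("%02x_%s", digest[0] & 0xFF, rowkey);
    }

    // 3) Reversing: put the most frequently changing part (e.g. the tail of a
    //    phone number or timestamp) at the front of the key.
    static String reverse(String rowkey) {
        return new StringBuilder(rowkey).reverse().toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(salt("13800001111", 10));       // e.g. 07_13800001111
        System.out.println(hashPrefix("13800001111"));     // stable prefix per key
        System.out.println(reverse("13800001111"));        // 11110000831
    }
}
```

Salting gives the best write spread but complicates reads; hashing keeps point lookups cheap; reversing works well for monotonically increasing keys such as timestamps.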
20, HBase data backup
1) Class-based export: use the utility classes shipped with HBase to export the data of a particular table to HDFS.
2) Snapshot-based: use HBase snapshots to transfer and copy table data.
Create a snapshot of a table:
snapshot 'tableName', 'snapshotName'
Restore a snapshot:
disable 'tableName'
restore_snapshot 'snapshotName'
enable 'tableName'
21, HBase secondary index
rowkey equivalent hbase table of an index
hbase secondary index, is established HBase table mapping between the columns and rows bond
Construction of two indexing scheme hbase
- (1) MapReduce Scheme
- (2) Hbase Coprocessor (coprocessor) Program
- (3) Solr + hbase Program
- (4) ES + hbase Program
- (. 5) Phoenix program HBase +
Construction secondary index phoenix
Phoenix is to build a SQL layer on HBase, and allows us to use standard JDBC APIs rather than HBase client APIs to create tables, insert data to HBase and query the data. Phoenix has provided support for HBase secondary index.
If you want to enable phoenix secondary index, you need to modify the configuration file hbase-site.xml
- Global indexing, the global index for reading and writing less business scenarios.
- Local indexing, local index for write operations and frequent space-constrained scenario.
--给表USER创建基于Global Indexing的二级索引
create index USER_COOKIE_ID_INDEX on USER ("f"."cookie_id");
--给表USER创建基于Local Indexing的二级索引
create local index USER_USER_ID_INDEX on USER ("f"."user_id");