HBase Revisited: A Summary of Core Knowledge Points

I. HBase Introduction

 

1. Basic concepts

HBase is a Hadoop database, often described as a sparse, distributed, persistent, multidimensional sorted map, indexed by row key, column key, and timestamp. It is a platform for storing and retrieving data with random access. HBase does not restrict the types of data stored, allows a dynamic and flexible data model, and does not use a SQL-like language to enforce relationships between data. HBase is designed to run on a cluster of servers and to scale out accordingly.

 

2. HBase use cases and success stories

  • The internet search problem: crawlers collect web pages and store them in BigTable; MapReduce jobs scan the full table to build a search index; searches query BigTable for results, which are then presented to the user.

  • Capturing incremental data: for example, collecting monitoring metrics, capturing user-interaction data, telemetry, targeted advertising, and so on.

  • Content Services

  • Information exchange

     

3. Interacting with the HBase shell:

Start the shell: $ hbase shell

List all tables: hbase> list

Create a table named mytable with a single column family hb: hbase> create 'mytable', 'hb'

Insert the byte array 'hello HBase' into row 'first' of table 'mytable', at column 'hb:data':

hbase> put 'mytable', 'first', 'hb:data', 'hello HBase'

Read the contents of row 'first' from table mytable: hbase> get 'mytable', 'first'

Read the entire contents of table mytable: hbase> scan 'mytable'

 

II. Getting Started

 

1. API

The HBase API has five data-related operations: Get (read), Put (write), Delete (delete), Scan (scan), and Increment (increment a column value).
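
Of these five, Increment is the only one not demonstrated in the sections below. A minimal sketch, assuming a usersTable handle obtained as in the next section and hypothetical column names:

  // Atomically add 1 to a counter cell (hypothetical family/qualifier).
  Increment inc = new Increment(Bytes.toBytes("user1"));
  inc.addColumn(Bytes.toBytes("info"), Bytes.toBytes("loginCount"), 1L);
  usersTable.increment(inc);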

 

2. Operating on tables

First, create a configuration object:

  Configuration conf = HBaseConfiguration.create();

When running from Eclipse, the configuration files must also be added explicitly:

  conf.addResource(new Path("E:\\share\\hbase-site.xml"));

  conf.addResource(new Path("E:\\share\\core-site.xml"));

  conf.addResource(new Path("E:\\share\\hdfs-site.xml"));

Then obtain a table handle from a connection pool:

  HTablePool pool = new HTablePool(conf, 1);
  HTableInterface usersTable = pool.getTable("users");

 

3. Writing data

The command used to store data is put. To store data into a table, create a Put instance, specifying the target row:

  Put put = new Put(byte[] row);

The add method attaches data to the Put, specifying the column family, the column qualifier, and the cell value:

  put.add(byte[] family, byte[] qualifier, byte[] value);

Finally, submit the command to the table:

  usersTable.put(put);

  usersTable.close();

To modify data, simply submit a Put with the updated data.
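
Putting the pieces together, here is a minimal runnable sketch of a write. The 'users' table is from the section above; the column family 'info', the row key, and the value are hypothetical:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTableInterface;
  import org.apache.hadoop.hbase.client.HTablePool;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PutExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTablePool pool = new HTablePool(conf, 1);
          HTableInterface usersTable = pool.getTable("users");

          // The row key addresses the row; family, qualifier, and value address the cell.
          Put put = new Put(Bytes.toBytes("user1"));
          put.add(Bytes.toBytes("info"), Bytes.toBytes("email"),
                  Bytes.toBytes("user1@example.com"));

          usersTable.put(put);  // written to the WAL and the MemStore, as described below
          usersTable.close();
      }
  }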

How HBase writes work:

 

 

Every HBase write goes to two places: the write-ahead log (WAL, also called the HLog) and the MemStore (write buffer). This guarantees data durability; a write is considered complete only after the data has been written to and confirmed in both places. The MemStore is an in-memory write buffer where data accumulates before being permanently written to disk. When the MemStore fills up, its contents are flushed to disk, generating one HFile.

 

4. Reading data

Create a Get instance containing the row to query:

Get get = new Get(byte[] row);

Calling addColumn() or addFamily() constrains what is returned.

Submit the Get instance to the table; the returned Result instance contains the row's data across all columns of all column families:

Result r = usersTable.get(get);

You can retrieve a specific value from the Result:

byte[] b = r.getValue(byte[] family, byte[] qualifier);
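
A minimal sketch of a read, mirroring the write example above (same hypothetical names, same usersTable handle):

  Get get = new Get(Bytes.toBytes("user1"));
  get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"));  // restrict to one cell

  Result r = usersTable.get(get);
  byte[] b = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
  System.out.println(Bytes.toString(b));  // prints user1@example.com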

How reads work:

 

The BlockCache holds frequently accessed data read from HFiles in memory, avoiding disk reads; each column family has its own BlockCache. When HBase reads a row, it first checks the MemStore for pending modifications, then checks the BlockCache to see whether a block containing the row was recently accessed, and finally falls back to the corresponding HFiles on disk.

 

5. Deleting data

Create a Delete instance, specifying the row to delete:

Delete delete = new Delete(byte[] row);

You can delete only part of a row with the deleteFamily() and deleteColumn() methods, as sketched below.
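
A minimal sketch, again using the hypothetical names from the earlier examples:

  // Delete a single cell rather than the whole row.
  Delete delete = new Delete(Bytes.toBytes("user1"));
  delete.deleteColumn(Bytes.toBytes("info"), Bytes.toBytes("email"));
  usersTable.delete(delete);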

 

6. Scanning a table

Scan scan = new Scan(); a scan may specify start and stop rows.

The setStartRow(), setStopRow(), and setFilter() methods can be used to restrict the data returned.

The addColumn() and addFamily() methods can likewise restrict the scan to particular columns and column families, as in the sketch below.
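
A minimal sketch of a scan over the hypothetical 'info' family, iterating over the returned rows:

  Scan scan = new Scan();
  scan.addFamily(Bytes.toBytes("info"));
  scan.setStartRow(Bytes.toBytes("user0"));  // inclusive
  scan.setStopRow(Bytes.toBytes("user9"));   // exclusive

  ResultScanner scanner = usersTable.getScanner(scan);
  try {
      for (Result row : scanner) {
          System.out.println(Bytes.toString(row.getRow()));
      }
  } finally {
      scanner.close();  // scanners hold server-side resources
  }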

The HBase data model comprises:

Table: HBase organizes data into tables.

Row: within a table, data is stored in rows, and each row is uniquely identified by its row key. The row key has no data type and is treated as a byte array, byte[].

Column family: the data within a row is grouped by column family. Column families must be defined up front and are not easily modified. Every row in a table has the same column families.

Column qualifier: data within a column family is addressed by a column qualifier, or column. Column qualifiers do not need to be defined in advance.

Cell: values are stored in cells; a value is a byte array. A cell is uniquely determined by the row key, the column family, and the column qualifier together.

Time version: cell values are versioned by time; the version is a long.

An example of HBase data coordinates:

 

HBase can be viewed as a key-value database. It is designed for semi-structured data: records may contain inconsistent columns and values of indeterminate size.
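
Since the coordinate figure is not reproduced here, the idea can be sketched in code. Result.getMap() exposes the nested-map view directly: a value is located by the four coordinates (row key, column family, column qualifier, time version). Assuming the Result r from the read example:

  // family -> (qualifier -> (timestamp -> value))
  NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> map = r.getMap();
  NavigableMap<Long, byte[]> versions =
          map.get(Bytes.toBytes("info")).get(Bytes.toBytes("email"));
  // each entry in versions is one time version of the cell: timestamp -> value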

 

 

III. Distributed HBase, HDFS, and MapReduce

 

1. HBase in distributed mode

An HBase table is split into smaller units called regions, which are distributed across multiple servers. A server that hosts regions is called a RegionServer. In general, RegionServers are co-located with HDFS DataNodes on the same physical hardware; a RegionServer is essentially an HDFS client that accesses the data stored there. The HMaster assigns regions to RegionServers, and each RegionServer hosts multiple regions.

 

HBase has two special tables, -ROOT- and .META., used to find where the regions of the various tables are located. -ROOT- points to the regions of the .META. table, and .META. points to the RegionServers hosting the region being looked up.

The client lookup process behaves like a 3-layer distributed B+ tree, as shown below:

 

The top-level structure of HBase:

 

ZooKeeper is responsible for tracking the RegionServers and stores the address of the root region.

The client is responsible for contacting the ZooKeeper sub-cluster and the HRegionServers.

The HMaster is responsible for assigning all regions, including the -ROOT- and .META. tables, to the HRegionServers when HBase starts.

Each HRegionServer is responsible for opening regions and creating the corresponding HRegion instances. When an HRegion is opened, it creates a Store instance for each HColumnFamily of the table. Each Store instance contains one or more StoreFile instances, which are lightweight wrappers around the actual data files, HFiles. Each Store has a corresponding MemStore, and an HRegionServer shares a single HLog instance.

The basic lookup process:

a. The client obtains from ZooKeeper the name of the server hosting the -ROOT- region.

b. Through the server hosting -ROOT-, it queries for the name of the server hosting the relevant .META. table region.

c. Querying .META. yields the name of the region server hosting the region where the queried row key resides.

d. The data is fetched from the server hosting the region containing that row key.

 

HFile structure:

 

The Trailer holds pointers to the other blocks; the Index blocks record the offsets of the Data and Meta blocks; the Data blocks store the data, and the Meta blocks store metadata. The default block size is 64 KB. Each block contains a Magic header and a number of serialized KeyValue instances.

 

KeyValue format:

 

The structure begins with two fixed-length numbers giving the key length and the value length. The key part contains the row key, the column family name, the column qualifier, and the timestamp.

 

The write-ahead log (WAL):

Every update is written to the log, and the client is notified of success only after the write to the log succeeds; afterwards the server is free to batch or aggregate the data in memory as needed.

The flow of an edit as it is split between the MemStore and the WAL:

 

The process: the client sends, via an RPC call, KeyValue object instances to the HRegionServer hosting the matching region. These instances are then routed to the HRegion instance managing the corresponding rows; the data is written to the WAL and placed into the MemStore of the store that actually owns the records. When the MemStore reaches a certain size, the data is written out asynchronously and sequentially to the file system; the WAL guarantees that no data is lost during this process.

 

2. HBase and MapReduce

There are three ways a MapReduce application can access HBase: as a data source at the start of a job, as a data sink at the end of a job, or as a shared resource during the job's tasks.

  • Using HBase as a data source

The map phase:

protected void map(ImmutableBytesWritable rowkey, Result result, Context context) {
}

A job reading from an HBase table receives its [k1,v1] key-value pairs in the form [rowkey : scan result]; the corresponding types are ImmutableBytesWritable and Result.

Create a Scan instance that scans all rows in the table:

Scan scan = new Scan();

scan.addColumn(…);

Next, use the Scan instance in the MapReduce job:

TableMapReduceUtil.initTableMapperJob(tablename, scan, map.class,
        outputKeyType.class, outputValueType.class, job);

  • Using HBase as a data sink

The reduce phase:

protected void reduce(ImmutableBytesWritable rowkey, Iterable<Put> values, Context context) {
}

Register the reducer in the job configuration:

TableMapReduceUtil.initTableReducerJob(tablename, reduce.class, job);
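
A minimal end-to-end sketch of a job that reads one table and writes another; the table names and the mapper/reducer classes here are hypothetical:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.hbase.mapreduce.TableReducer;
  import org.apache.hadoop.mapreduce.Job;

  public class CopyTableDriver {
      static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
          @Override
          protected void map(ImmutableBytesWritable rowkey, Result result, Context context)
                  throws java.io.IOException, InterruptedException {
              // Re-emit every cell of the row as a Put keyed by the row key.
              Put put = new Put(rowkey.get());
              for (KeyValue kv : result.raw()) {
                  put.add(kv);
              }
              context.write(rowkey, put);
          }
      }

      static class CopyReducer
              extends TableReducer<ImmutableBytesWritable, Put, ImmutableBytesWritable> {
          @Override
          protected void reduce(ImmutableBytesWritable rowkey, Iterable<Put> values,
                  Context context) throws java.io.IOException, InterruptedException {
              for (Put put : values) {
                  context.write(rowkey, put);  // written into the target table
              }
          }
      }

      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          Job job = new Job(conf, "copy-table");
          job.setJarByClass(CopyTableDriver.class);

          Scan scan = new Scan();  // scan all rows of the source table
          TableMapReduceUtil.initTableMapperJob("sourceTable", scan, CopyMapper.class,
                  ImmutableBytesWritable.class, Put.class, job);
          TableMapReduceUtil.initTableReducerJob("targetTable", CopyReducer.class, job);
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }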

 

3. How HBase achieves reliability and availability

HDFS, as the underlying storage, provides a single namespace for all RegionServers in the cluster; data written by one RegionServer can be read by every other RegionServer. If a RegionServer fails, any other RegionServer can read its data from the underlying file system and begin serving from the HFiles stored in HDFS, taking over the regions that the failed RegionServer was serving.

 

IV. Optimizing HBase

 

1. Random-read-intensive workloads

Optimization direction: use the cache efficiently and index more effectively.

  • Increase the percentage of the heap used for caching, configured via the hfile.block.cache.size parameter.

  • Reduce the percentage of the heap occupied by the MemStore, tuned via hbase.regionserver.global.memstore.lowerLimit and hbase.regionserver.global.memstore.upperLimit.

  • Use smaller data blocks, which makes the index finer-grained.

  • Turn on Bloom filters to reduce the number of HFiles that must be read when looking up the KeyValue objects of a given row.

  • Enable aggressive caching to improve random-read performance.

  • Turn off caching for column families that are not used for random reads, improving the cache hit rate. (Several of these per-family settings are sketched in code after this list.)
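
The per-column-family settings above can be applied through the admin API. A sketch under stated assumptions: the table and family names are hypothetical, and cluster-wide parameters such as hfile.block.cache.size belong in hbase-site.xml on the servers rather than in client code:

  // Assumes the Configuration conf from the earlier sections.
  HBaseAdmin admin = new HBaseAdmin(conf);
  HColumnDescriptor family = new HColumnDescriptor("hot");
  family.setBlocksize(16 * 1024);                      // smaller blocks, finer-grained index
  family.setBloomFilterType(StoreFile.BloomType.ROW);  // Bloom filter on row keys
  family.setInMemory(true);                            // aggressive caching
  // family.setBlockCacheEnabled(false);               // for families never read randomly
  admin.disableTable("mytable");
  admin.modifyColumn("mytable", family);
  admin.enableTable("mytable");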

     

2. Sequential-read-intensive workloads

Optimization direction: reduce cache usage.

  • Increase the data block size so that each disk seek retrieves more data.

  • Set a higher scanner caching value so that each RPC to the scanner retrieves more rows during large sequential reads. The hbase.client.scanner.caching parameter defines the number of rows fetched when next is called on a scanner.

  • Turn off block caching for the scan, via Scan.setCacheBlocks(false), to avoid churning the cache.

  • Turn off caching for the table so that scans no longer churn the cache. (A client-side sketch follows this list.)
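
The client-side knobs above can be set per scan; a minimal sketch, reusing the usersTable handle from earlier:

  Scan scan = new Scan();
  scan.setCaching(500);        // rows per RPC; overrides hbase.client.scanner.caching
  scan.setCacheBlocks(false);  // keep one-off sequential reads out of the BlockCache
  ResultScanner scanner = usersTable.getScanner(scan);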

     

3. Write-intensive workloads

Optimization direction: avoid flushing, compacting, or splitting too frequently.

  • Raise the maximum size of the underlying store files (HStoreFiles); larger regions mean fewer splits while writing. Set via the hbase.hregion.max.filesize parameter.

  • Increase the MemStore size, tuned via hbase.hregion.memstore.flush.size. The more data flushed to HDFS at once, the larger the HFiles produced, which reduces the number of files generated during writes and hence the number of compactions.

  • Increase the share of the heap allocated to the MemStore on each RegionServer. Set upperLimit high enough to hold each region's MemStore multiplied by the expected number of regions per RegionServer.

  • Tune garbage collection in the hbase-env.sh file; a possible starting point is: -Xmx8g -Xms8g -Xmn128m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70

  • Enable the MemStore-Local Allocation Buffer feature, which helps prevent heap fragmentation. Set via the hbase.hregion.memstore.mslab.enabled parameter.

     

4. Mixed workloads

Optimization direction: repeatedly try different combinations of the above, then run tests to find the best results.

 

Other factors that affect performance include:

  • Compression: reduces I/O pressure on the cluster.

  • Good row key design.

  • Triggering major compactions manually when cluster load is expected to be lowest.

  • Tuning the RegionServer handler count.
