HBase underlying principles (system architecture, table data model, physical storage, read and write process, Region management, Master mechanism)

Table of Contents

First, the system architecture

Second, the HBase table data model

Row Key

Column Family

Column

Timestamp

Cell

VersionNum

Third, the physical storage

1. Overall structure

2. StoreFile and HFile structure

3. MemStore and StoreFile

4. HLog (WAL log)

Fourth, reading and writing process

1. Read request process

2. Write request process

Fifth, Region management

Sixth, Master mechanism


First, the system architecture

  • Client

1. Contains the interfaces for accessing HBase. The client also maintains some caches to speed up access to HBase, such as the location information of regions.

  • Zookeeper

1. Ensures that at any time there is only one active master in the cluster.

2. Stores the addressing entry point for all regions.

3. Monitors the status of Region Servers in real time and notifies the Master when a Region Server comes online or goes offline.

4. Stores the HBase schema, including which tables exist and which column families each table has.

  • Master Responsibilities

1. Assigns regions to Region Servers.

2. Is responsible for load balancing across Region Servers.

3. Detects failed Region Servers and reassigns their regions.

4. Garbage-collects files on HDFS.

5. Handles schema update requests.

  • Region Server role

1. Maintains the regions the Master has assigned to it and handles the IO requests to these regions.

2. Splits regions that grow too large during operation.

As can be seen, a client does not need the master to access data in HBase: addressing goes through ZooKeeper and the Region Servers, and data reads and writes go to the Region Servers. The master only maintains the metadata of tables and regions, so its load is low.

Second, the HBase table data model

Row Key

As in other NoSQL databases, the row key is the primary key used to retrieve records. There are only three ways to access rows in an HBase table:

1. By a single row key

2. By a row key range

3. By a full table scan

A row key can be any string (maximum length 64 KB; in practice typically 10-100 bytes). Inside HBase, the row key is stored as a byte array.

HBase stores the rows of a table sorted by row key in lexicographic (byte) order. When designing keys, take full advantage of this sorted storage and place rows that are often read together next to each other (locality).

Note:

Sorting integers lexicographically gives

1, 10, 100, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, ..., 9, 91, 92, 93, 94, 95, 96, 97, 98, 99. To keep the natural order of integers, row keys must be left-padded with zeros.
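The effect of zero-padding can be checked with a short Python sketch. The ids and the padding width of 6 are arbitrary values chosen for illustration; keys are encoded as ASCII bytes because HBase compares row keys byte by byte.

```python
# Integer ids that we want to use as row keys.
ids = [1, 2, 9, 10, 11, 100]

# Unpadded keys sort lexicographically (byte order), not numerically.
unpadded = sorted(str(i).encode() for i in ids)

# Left-padding with zeros to a fixed width makes byte order match numeric order.
WIDTH = 6  # assumed width; it must fit the largest id you expect
padded = sorted(str(i).zfill(WIDTH).encode() for i in ids)

print(unpadded)
print(padded)
```

With the unpadded keys, 10 and 100 sort before 2; with the padded keys, the byte order and the numeric order agree.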

A read or write of a row is an atomic operation (regardless of how many columns are read or written). This design decision makes it easy for users to reason about the program's behavior during concurrent updates to the same row.

Column Family

Every column in an HBase table belongs to a column family. Column families are part of the table schema (columns are not) and must be defined before the table is used.

Column names are prefixed with the column family. For example, courses:history and courses:math both belong to the courses column family.

Access control and disk and memory usage accounting are done at the column family level.

The more column families there are, the more files are involved in IO and searching when fetching a row, so do not define too many column families unless necessary.

Column

A specific column under a column family. It belongs to one column family and is similar to a concrete column created in MySQL.

Timestamp

In HBase, the storage unit determined by a row and a column is called a cell. Each cell holds multiple versions of the same data, indexed by timestamp. The timestamp is a 64-bit integer. It can be assigned by HBase automatically when data is written, in which case it is the current system time in milliseconds. It can also be assigned explicitly by the client. If an application wants to avoid version conflicts, it must generate unique timestamps itself. Within each cell, versions are sorted in reverse chronological order, so the latest data comes first.

To avoid the management burden (storage and indexing) caused by too many versions, HBase offers two ways to reclaim old versions:

  • Keep the last n versions of the data
  • Keep the versions from a recent period (by setting a time to live, TTL, for the data).

Users can configure this for each column family.
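The two reclamation modes can be illustrated with a small Python sketch. The function name prune_versions and its parameters are hypothetical and are not part of any HBase API; the sketch only shows the ordering and pruning rules described above.

```python
def prune_versions(versions, max_versions=None, ttl_ms=None, now_ms=0):
    """Return the surviving (timestamp, value) pairs of one cell, newest first.

    versions: dict mapping timestamp in milliseconds -> value bytes.
    max_versions: keep at most this many of the newest versions.
    ttl_ms: drop versions older than this many milliseconds relative to now_ms.
    """
    # Versions of a cell are ordered by timestamp, newest first.
    ordered = sorted(versions.items(), key=lambda kv: kv[0], reverse=True)
    if ttl_ms is not None:
        ordered = [(ts, v) for ts, v in ordered if now_ms - ts <= ttl_ms]
    if max_versions is not None:
        ordered = ordered[:max_versions]
    return ordered


cell = {1000: b"a", 2000: b"b", 3000: b"c", 4000: b"d"}
last_two = prune_versions(cell, max_versions=2)
recent = prune_versions(cell, ttl_ms=1000, now_ms=4500)
print(last_two)
print(recent)
```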

Cell

The unit uniquely determined by {row key, column (= <family> + <label>), version}.

The data in a cell has no type; it is all stored as bytes.

VersionNum

The version number of the data. Each piece of data can have multiple version numbers. By default the version number is the system timestamp, of type Long.

Third, the physical storage

1. Overall structure

1. All rows in a table are arranged in lexicographic order of the row key.

2. A table is divided along the row direction into multiple HRegions.

3. Regions are split by size (default 10 GB). Each table starts with only one region. As data is continuously inserted, the region grows, and when it reaches a threshold the HRegion is split into two new HRegions. As the rows of the table keep growing, there will be more and more HRegions.

4. The HRegion is the smallest unit of distributed storage and load balancing in HBase. Smallest unit means that different HRegions can be placed on different HRegionServers, but a single HRegion is never split across multiple servers.

5. Although the HRegion is the smallest unit of load balancing, it is not the smallest unit of physical storage.

In fact, an HRegion consists of one or more Stores, and each Store holds one column family.

Each Store in turn consists of one MemStore and zero or more StoreFiles, as shown in the figure.

2. StoreFile and HFile structure

StoreFiles are stored on HDFS in HFile format.

Appendix: the HFile format:

First, an HFile is variable in length; only two of its parts have a fixed length: Trailer and FileInfo. As shown in the figure, the Trailer holds pointers to the starting points of the other data blocks.

File Info records some meta information about the file, such as AVG_KEY_LEN, AVG_VALUE_LEN, LAST_KEY, COMPARATOR, MAX_SEQ_ID_KEY and so on.

The Data Index and Meta Index blocks record the starting point of each Data block and Meta block.

The Data Block is the basic unit of HBase I/O. To improve efficiency, the HRegionServer has an LRU-based Block Cache. The size of each Data block can be specified by a parameter when the table is created: large blocks favor sequential scans, small blocks favor random queries. Apart from the Magic at its beginning, each Data block is a concatenation of KeyValue pairs. The Magic content is some random numbers, used to detect data corruption.

Each KeyValue pair inside an HFile is a simple byte array. This byte array contains many fields and has a fixed structure. Let us look at the concrete structure:

It begins with two fixed-length values giving the length of the Key and the length of the Value. Next comes the Key: it begins with a fixed-length value giving the length of the RowKey, followed by the RowKey, then a fixed-length value giving the length of the Family, then the Family, then the Qualifier, and finally two fixed-length values for the Time Stamp and the Key Type (Put/Delete). The Value part has no such complex structure; it is pure binary data.
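The layout described above can be sketched with Python's struct module. The field widths used here (4-byte key and value lengths, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type) and the numeric type codes are assumptions; the text only says that these fields are fixed-length. encode_keyvalue and decode_keyvalue are hypothetical helper names.

```python
import struct

# Assumed key type codes, for illustration only.
KEY_TYPE_PUT = 4
KEY_TYPE_DELETE = 8


def encode_keyvalue(row, family, qualifier, timestamp, key_type, value):
    """Serialize one KeyValue: key length, value length, key, value."""
    key = (
        struct.pack(">H", len(row)) + row            # row length + row
        + struct.pack(">B", len(family)) + family    # family length + family
        + qualifier                                  # qualifier has no length prefix
        + struct.pack(">qB", timestamp, key_type)    # timestamp + key type
    )
    return struct.pack(">II", len(key), len(value)) + key + value


def decode_keyvalue(buf):
    """Parse the byte array produced by encode_keyvalue."""
    key_len, value_len = struct.unpack_from(">II", buf, 0)
    pos = 8
    key_end = pos + key_len
    (row_len,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    row = buf[pos:pos + row_len]
    pos += row_len
    (family_len,) = struct.unpack_from(">B", buf, pos)
    pos += 1
    family = buf[pos:pos + family_len]
    pos += family_len
    # The qualifier fills the rest of the key before the last 9 bytes
    # (8-byte timestamp + 1-byte key type).
    qualifier = buf[pos:key_end - 9]
    timestamp, key_type = struct.unpack_from(">qB", buf, key_end - 9)
    value = buf[key_end:key_end + value_len]
    return row, family, qualifier, timestamp, key_type, value


encoded = encode_keyvalue(b"row1", b"cf", b"math", 1575000000000, KEY_TYPE_PUT, b"90")
print(decode_keyvalue(encoded))
```

Because the qualifier carries no length of its own, its extent is derived from the total key length, which is why the key length has to come first.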

An HFile is divided into six parts:

Data Block section: stores the data of the table; this part can be compressed.

Meta Block section (optional): stores user-defined key-value pairs; can be compressed.

File Info section: meta information of the HFile; it is not compressed. Users can also add their own meta information to this section.

Data Block Index section: the index of the Data Blocks. The key of each index entry is the key of the first record in the indexed block.

Meta Block Index section (optional): the index of the Meta Blocks.

Trailer: this section has a fixed length. It stores the offset of each section. When an HFile is read, the Trailer is read first; it holds the starting position of each section (the Magic Number of each section is used as a sanity check). Then the DataBlock Index is read into memory, so that looking up a key does not require scanning the whole HFile: the block containing the key is found in memory, the whole block is read into memory with one disk IO, and the required key is then found within it. The DataBlock Index is evicted using an LRU mechanism.
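The index lookup described for the Trailer and DataBlock Index can be sketched as a binary search over the first key of each block. The keys and offsets below are made-up values, and find_block is a hypothetical helper, not an HBase function.

```python
import bisect

# Hypothetical in-memory block index: the first key of each data block
# and the file offset at which that block starts.
index_keys = [b"apple", b"grape", b"melon", b"peach"]
block_offsets = [0, 65536, 131072, 196608]


def find_block(key):
    """Return the offset of the only block that can contain key.

    Returns None when key sorts before the first key of the first block.
    """
    # The candidate block is the last one whose first key is <= key.
    i = bisect.bisect_right(index_keys, key) - 1
    return block_offsets[i] if i >= 0 else None


print(find_block(b"kiwi"))
print(find_block(b"aardvark"))
```

Only the one block returned by the lookup has to be read from disk and scanned for the key.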

The Data Blocks and Meta Blocks of an HFile are usually stored compressed. Compression can greatly reduce network IO and disk IO, at the cost of the CPU time needed to compress and decompress.

HFile compression currently supports two methods: Gzip and LZO.

3. MemStore and StoreFile

A region consists of multiple Stores, and each Store contains all the data of one column family.

A Store includes a MemStore in memory and StoreFiles on disk.

Writes go to the MemStore first. When the amount of data in the MemStore reaches a threshold, the HRegionServer starts a flush (flushcache) process that writes it to a StoreFile; each flush produces a separate StoreFile. When the number of StoreFiles reaches a threshold, multiple StoreFiles are merged into one large StoreFile.

When the size of a StoreFile exceeds a threshold, the current region is split into two, and the HMaster assigns them to the appropriate region servers for load balancing.

When a client retrieves data, the MemStore is searched first; if the data is not found there, the StoreFiles are searched.
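The write and lookup order described above can be modeled with a toy Store class. This is a minimal sketch, not HBase code: the threshold counts entries rather than bytes, and each StoreFile is simply an immutable snapshot of the MemStore.

```python
class Store:
    """Toy model of one Store: a MemStore in memory plus flushed StoreFiles."""

    def __init__(self, flush_threshold=3):
        self.memstore = {}
        self.storefiles = []  # oldest first; each one is never modified after creation
        self.flush_threshold = flush_threshold  # assumed to be a number of entries

    def put(self, key, value):
        # Writes always go to the MemStore first.
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Each flush produces one new StoreFile, sorted by key.
        self.storefiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, key):
        # Reads check the MemStore first, then the StoreFiles from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for storefile in reversed(self.storefiles):
            if key in storefile:
                return storefile[key]
        return None


store = Store(flush_threshold=2)
store.put(b"a", b"1")
store.put(b"b", b"2")   # reaches the threshold and triggers a flush
store.put(b"a", b"3")   # newer value stays in the MemStore
print(store.get(b"a"), store.get(b"b"), store.get(b"c"))
```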

4. HLog (WAL log)

WAL means write-ahead log (http://en.wikipedia.org/wiki/Write-ahead_logging). Similar to the MySQL binlog, it is used for disaster recovery. The HLog records all changes to the data, so that if modified data is lost it can be recovered from the log.

Each Region Server maintains one HLog, rather than one per region. The logs of different regions (from different tables) are therefore mixed together. The purpose is that continuously appending to a single file, compared with writing to many files at the same time, reduces the number of disk seeks and so improves write performance. The drawback is that if a region server goes offline, recovering its regions requires splitting the log of that region server and distributing the pieces to other region servers for recovery.

An HLog file is an ordinary Hadoop Sequence File:

  • The key of the HLog Sequence File is an HLogKey object. The HLogKey records the ownership information of the written data: besides the table and region names, it also includes a sequence number and a timestamp. The timestamp is the write time; the sequence number starts at 0, or at the last sequence number stored in the file system.
  • The value of the HLog Sequence File is an HBase KeyValue object, i.e. it corresponds to the KeyValue in the HFile described above.
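Because every log entry carries its region name and sequence number, a mixed log can be split per region during recovery. The following Python sketch shows the idea; the tuple layout and the split_log helper are assumptions made for illustration and do not reflect the real HLog file format.

```python
from collections import defaultdict


def split_log(entries):
    """Group the entries of one region server's log by region.

    entries: iterable of (region, sequence_number, edit) in append order.
    Returns a dict mapping region -> list of (sequence_number, edit),
    preserving the original order within each region.
    """
    per_region = defaultdict(list)
    for region, sequence_number, edit in entries:
        per_region[region].append((sequence_number, edit))
    return dict(per_region)


# Edits of two regions interleaved in a single log.
hlog = [
    ("r1", 1, "put a"),
    ("r2", 2, "put x"),
    ("r1", 3, "put b"),
    ("r2", 4, "delete x"),
]
by_region = split_log(hlog)
print(by_region)
```

Each per-region list can then be handed to whichever region server takes over that region and replayed in sequence order.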

Fourth, reading and writing process

1. Read request process:

HRegionServers hold both the meta table and the table data. To access table data, the client first contacts ZooKeeper and obtains from it the location of the meta table, i.e. which HRegionServer holds the meta table.

The client then uses the address it just obtained to access the HRegionServer that hosts the meta table, reads the meta table, and obtains the metadata stored in it.

Using the information in the metadata, the client accesses the corresponding HRegionServer, which scans its MemStore and StoreFiles to find the data.

Finally, the HRegionServer returns the queried data to the client.
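The lookup chain above can be traced with plain dictionaries standing in for ZooKeeper and the region servers. Everything here is a toy model: the znode path, server names, and data are illustrative assumptions, and no real client API is used.

```python
import bisect

# Step 1 data: ZooKeeper knows which server hosts the meta table.
zookeeper = {"/hbase/meta-region-server": "rs1"}  # path used for illustration

# Each server is modeled as a dict. The meta table maps the start key of
# each region to the server that hosts it, sorted by start key.
region_servers = {
    "rs1": {"hbase:meta": [(b"", "rs2"), (b"m", "rs3")]},
    "rs2": {"data": {b"alice": b"1"}},
    "rs3": {"data": {b"nina": b"2"}},
}


def read(row_key):
    # 1. Ask ZooKeeper where the meta table lives.
    meta_server = zookeeper["/hbase/meta-region-server"]
    # 2. Read the meta table from that server.
    meta = region_servers[meta_server]["hbase:meta"]
    # 3. Find the region whose start key is the largest one <= row_key.
    start_keys = [start for start, _ in meta]
    _, data_server = meta[bisect.bisect_right(start_keys, row_key) - 1]
    # 4. Query the server hosting that region.
    return region_servers[data_server]["data"].get(row_key)


print(read(b"alice"), read(b"nina"), read(b"zed"))
```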

View the meta table information:

hbase(main):011:0> scan 'hbase:meta'

2. Write request process:

The client first contacts ZooKeeper, finds the meta table, and obtains the metadata in the meta table.

It determines the HRegion and the HRegionServer that the data to be written belongs to.

The client sends a write request to that HRegionServer, which receives the request and responds.

The HRegionServer first writes the data to the HLog, to prevent data loss.

It then writes the data to the MemStore.

If the writes to both the HLog and the MemStore succeed, the write of this data succeeds.

If the MemStore reaches a threshold, its data is flushed to a StoreFile.

As StoreFiles accumulate, a compaction is triggered, merging the excess StoreFiles into one large StoreFile.

As StoreFiles grow, the region grows too. Once a threshold is reached, a split is triggered, dividing the region in two.

Detailed description:

HBase uses the MemStore and StoreFiles to store updates to a table.

When updating, data is first written to the log (WAL) and to memory (MemStore). Data in the MemStore is sorted. When the MemStore accumulates to a threshold, a new MemStore is created and the old one is added to the flush queue, where a separate thread flushes it to disk as a StoreFile. At the same time, the system records a redo point in ZooKeeper, indicating that the changes before this point have been persisted.

When the system fails unexpectedly, the data in memory (MemStore) may be lost; the log (WAL) is then used to recover the data written after the checkpoint.

StoreFiles are read-only; once created they can no longer be modified. So an update in HBase is really an append operation. When the number of StoreFiles in a Store reaches a threshold, a merge is performed (minor_compact, major_compact): the modifications to the same key are merged together, forming one large StoreFile. When the size of the StoreFile reaches a threshold, the StoreFile is split into two StoreFiles.

Because updates to a table are continuously appended, a compaction needs to access all the StoreFiles and the MemStore in the Store and merge them by row key. Since StoreFiles and the MemStore are both sorted, and StoreFiles carry in-memory indexes, the merge process is relatively fast.
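The reason sorted inputs make the merge cheap can be seen in a small sketch built on heapq.merge, which combines already-sorted sequences in a single pass. The compact function is hypothetical, and keeping only the newest version per row key is a simplification chosen for this example.

```python
import heapq


def compact(storefiles):
    """Merge sorted StoreFiles into one, keeping the newest entry per row key.

    storefiles: list of lists of (row_key, timestamp, value), each list
    sorted by row key and, within a row key, by timestamp descending.
    """
    sort_key = lambda entry: (entry[0], -entry[1])
    merged = heapq.merge(*storefiles, key=sort_key)
    result = []
    for row_key, timestamp, value in merged:
        # The first entry seen for a row key is its newest version.
        if not result or result[-1][0] != row_key:
            result.append((row_key, timestamp, value))
    return result


older = [(b"a", 1, b"old-a"), (b"c", 1, b"c")]
newer = [(b"a", 2, b"new-a"), (b"b", 2, b"b")]
compacted = compact([older, newer])
print(compacted)
```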

Fifth, Region management

  • region allocation

At any time, a region can be assigned to only one region server. The master keeps track of which region servers are currently available, which regions are assigned to which region server, and which regions are not yet assigned. When a new region needs to be assigned and a region server has capacity available, the master sends it a load request, assigning the region to that region server. After the region server receives the request, it begins serving the region.

  • region server online

The master uses ZooKeeper to track the status of region servers. When a region server starts, it first creates a znode representing itself under the server directory in ZooKeeper. Since the master subscribes to change notifications on the server directory, it gets a real-time notification from ZooKeeper whenever a file is added to or deleted from that directory. So as soon as a region server comes online, the master gets the message.

  • region server offline

When a region server goes offline, its session with ZooKeeper is disconnected, and ZooKeeper automatically releases the exclusive lock on the file representing this server. The master can then conclude that either:

1. The network between the region server and ZooKeeper is disconnected.

2. The region server has crashed.

In either case, the region server can no longer serve its regions. The master then deletes the znode representing this region server under the server directory and assigns its regions to other region servers that are still alive.
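The reassignment step can be sketched as follows. The text does not say how the master picks a target server, so the least-loaded policy used here is an assumption, and reassign is a hypothetical function name.

```python
def reassign(assignments, failed_server, live_servers):
    """Move the regions of a failed server onto the live servers.

    assignments: dict mapping region -> server.
    Returns a new dict; each orphaned region goes to the live server that
    currently hosts the fewest regions (an assumed policy).
    """
    load = {server: 0 for server in live_servers}
    for server in assignments.values():
        if server in load:
            load[server] += 1

    result = dict(assignments)
    for region in sorted(assignments):
        if assignments[region] == failed_server:
            # Ties are broken by server name to keep the result deterministic.
            target = min(load, key=lambda s: (load[s], s))
            result[region] = target
            load[target] += 1
    return result


before = {"r1": "rs1", "r2": "rs1", "r3": "rs2", "r4": "rs3"}
after = reassign(before, failed_server="rs1", live_servers=["rs2", "rs3"])
print(after)
```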

Sixth, Master mechanism

  • master online

The master starts with the following steps:

1. Acquire from ZooKeeper the unique lock that represents the active master, to prevent other masters from becoming the master.

2. Scan the server parent node in ZooKeeper to obtain the list of currently available region servers.

3. Communicate with each region server to obtain the current mapping between regions and region servers.

4. Scan the set of .META. regions, compute the regions that are not currently assigned, and put them in the list of regions to be assigned.
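Steps 3 and 4 amount to a set difference between the regions listed in the meta table and the regions the region servers report. A minimal sketch, with hypothetical names and made-up data:

```python
def unassigned_regions(meta_regions, server_reports):
    """Return the regions listed in the meta table that no server reports.

    meta_regions: iterable of all region names found by scanning the meta table.
    server_reports: dict mapping region server -> list of regions it serves.
    """
    assigned = set()
    for regions in server_reports.values():
        assigned.update(regions)
    return sorted(set(meta_regions) - assigned)


meta_regions = ["r1", "r2", "r3", "r4"]
server_reports = {"rs1": ["r1"], "rs2": ["r3"]}
to_assign = unassigned_regions(meta_regions, server_reports)
print(to_assign)
```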

  • master offline

Since the master only maintains the metadata of tables and regions and does not take part in table data IO, the master going offline only freezes all metadata modifications (tables cannot be created or deleted, table schemas cannot be modified, regions cannot be load balanced, region servers going offline cannot be handled, regions cannot be merged; the only exception is that region splits still proceed normally, because only the region server is involved). Reads and writes of table data still proceed normally. So the master being offline for a short time has no effect on the HBase cluster as a whole.

As the startup process shows, all the information the master holds is redundant (it can be collected or computed from other parts of the system).

Therefore, in a typical HBase cluster there is always one master providing service, and one or more other masters waiting for the opportunity to take its place.

Origin blog.csdn.net/qq_44065303/article/details/103533473