How HBase implements storage and read optimization

Introduction

This article introduces HBase's storage optimization and retrieval optimization, focusing on what a Bloom filter is and how to choose and use one.

1. Storage optimization

HBase uses [MemStore](https://blog.csdn.net/x541211190/article/details/108387698) and the WAL (write-ahead log) to achieve fast sequential inserts and lower write latency, building on the multi-level storage principle of the LSM tree.

This is only a summary; see the linked article above for the detailed implementation.

2. Use Bloom Filter (BloomFilter) for retrieval optimization

1. What is a Bloom filter?

A Bloom filter consists of a long binary vector (a bit array) and a series of random mapping functions (hash functions).

What does a Bloom filter do?

It is used to test whether an element is a member of a set.

What are the advantages and disadvantages of Bloom filters?

Advantages: space efficiency and query time are far better than those of general-purpose algorithms.

Disadvantages: there is a certain false-positive rate, and deleting elements is difficult.

How does the Bloom filter achieve fast lookup?

It relies on a fast lookup through multiple hash functions. It is typically used to decide whether an element belongs to a set when 100% accuracy is not required, and it is most reliable for determining that an element is not in a set.

Example: determine whether the set {x, y, z} contains the element x.

Analysis:

(Figure: how the hash functions of a Bloom filter map elements onto the bit array)

1. Initialize the bit array first, setting every bit to 0.

2. Then map each of the 3 elements in the set {x, y, z} in turn through 3 hash functions (the number of functions, n, is not fixed). Each hash function yields one value, so the element x produces 3 values. Each value corresponds to a position in the bit array, and the bit at that position is set to 1.

3. To look up x, map it through the same hash functions to three positions in the bit array. If any one of the three bits is not 1, x is definitely not in the set (a strong guarantee). If all three bits are 1, x may be in the set, because other elements could happen to have set exactly those three bits.
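The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not HBase's implementation: the class name `BloomFilter`, the 64-bit array size, and the trick of salting SHA-256 with the function index to obtain 3 hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array plus several hash functions."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # step 1: every bit starts at 0

    def _positions(self, item):
        # Derive one position per hash function by salting a single hash
        # with the function index (an illustrative choice, not HBase's).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # step 2: set the bit at each mapped position to 1
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # step 3: any 0 bit means "definitely absent";
        # all bits set means only "possibly present"
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for element in ("x", "y", "z"):
    bf.add(element)

print(bf.might_contain("x"))  # True: an added element is never reported absent
```

A `False` answer from `might_contain` is always trustworthy, while a `True` answer still has to be confirmed against the real data. That asymmetry is why the filter is mainly useful for ruling things out.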

Notes on using BloomFilter in HBase

1. Bloom filters can improve HBase performance for random reads, but they have little effect on sequential reads.

2. The Bloom filter is a column-family-level configuration property. If a Bloom filter is configured on the table, HBase includes a BloomFilter index in each StoreFile when the file is generated.

3. Enabling the Bloom filter incurs some storage and memory overhead. The larger the HFile, the longer the BloomFilter bit array and the more space it occupies. A bit array that grows too large is not suitable for loading into memory, so the HFile splits the bit array by Key order: a run of consecutive Keys shares one bit array, and a single HFile therefore contains multiple bit arrays. At query time, the Key is used to locate the corresponding bit array, and only that bit array is loaded into memory for filtering, which reduces memory usage.
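The splitting described in note 3 can be illustrated with a small sketch. It is a simplified model under stated assumptions, not HBase's on-disk format: the helper names `positions`, `build_chunks`, and `might_contain`, the chunk size of 4 keys, and the 64-bit arrays are all hypothetical. Sorted keys are cut into consecutive runs, each run gets its own small bit array, and a lookup first finds the single run that could hold the key.

```python
import bisect
import hashlib


def positions(key, num_bits, num_hashes=3):
    """Map a key to num_hashes bit positions (illustrative hashing scheme)."""
    result = []
    for i in range(num_hashes):
        digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
        result.append(int.from_bytes(digest[:8], "big") % num_bits)
    return result


def build_chunks(sorted_keys, keys_per_chunk, num_bits=64):
    """Split sorted keys into consecutive runs, one small bit array per run."""
    chunks = []
    for start in range(0, len(sorted_keys), keys_per_chunk):
        run = sorted_keys[start:start + keys_per_chunk]
        bits = [0] * num_bits
        for key in run:
            for pos in positions(key, num_bits):
                bits[pos] = 1
        chunks.append((run[0], bits))  # (first key of the run, its bit array)
    return chunks


def might_contain(chunks, key, num_bits=64):
    """Find the one chunk whose key range could hold `key`, then test only it."""
    first_keys = [first for first, _ in chunks]
    idx = bisect.bisect_right(first_keys, key) - 1
    if idx < 0:
        return False  # the key sorts before every stored key
    bits = chunks[idx][1]  # only this bit array needs to be in memory
    return all(bits[pos] for pos in positions(key, num_bits))


keys = sorted(f"row{i:03d}" for i in range(10))
chunks = build_chunks(keys, keys_per_chunk=4)
print(len(chunks))                      # 3 bit arrays for 10 keys, 4 per run
print(might_contain(chunks, "row007"))  # True: a stored key is never rejected
```

Because the keys are sorted, the binary search on first keys touches no bit array at all; only the one matching chunk is consulted.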

2. Bloom filter types

①. Row type (Row)
Filters StoreFiles by the RowKey in the Key-Value. Used for filtering when the column family and column are the same and only the RowKey differs.

Usage scenario: if reads specify only the row key, this is the better filter to set; in most cases, use the Row filter.

②. RowCol type (RowCol)
Filters StoreFiles by RowKey plus column qualifier (Qualifier). Used for filtering when the column family is the same but the column and RowKey differ.

Usage scenario: effective only for random reads that specify a column. If a read specifies only the row key and no Qualifier, the RowCol setting has no effect.

Note: RowCol is not an extension of Row; they are two different types.

(Figure: Key-Value structure in StoreFiles)

Example: with RowCol, a read for r1 (RowKey) and q1 (column) in the figure above filters out StoreFile2; a read for r3 and q2 filters out StoreFile1.
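The difference between the two types comes down to which key the filter indexes. The sketch below makes that concrete under several assumptions: the StoreFile contents are hypothetical, chosen only to be consistent with the example above; the function names are invented for illustration; and plain Python sets stand in for real Bloom filter bit arrays, so this model has no false positives.

```python
# Hypothetical (RowKey, Qualifier) pairs per StoreFile, chosen only to
# match the example above; they are not taken from the original figure.
store_files = {
    "StoreFile1": [("r1", "q1"), ("r2", "q1")],
    "StoreFile2": [("r1", "q2"), ("r3", "q2")],
}


def build_index(cells, bloom_type):
    """Collect the keys a filter of the given type would index.

    A set stands in for the bit array, so answers here are exact;
    a real Bloom filter could also return a false "maybe".
    """
    if bloom_type == "Row":
        return {row for row, _ in cells}
    if bloom_type == "RowCol":
        return {(row, qualifier) for row, qualifier in cells}
    raise ValueError(f"unknown bloom type: {bloom_type}")


def files_to_read(files, bloom_type, row, qualifier=None):
    """Return the StoreFiles that cannot be skipped for this read."""
    if bloom_type == "RowCol" and qualifier is None:
        # Without a Qualifier the row+column key cannot be formed,
        # so the filter cannot exclude any file.
        return sorted(files)
    probe = row if bloom_type == "Row" else (row, qualifier)
    return sorted(
        name for name, cells in files.items()
        if probe in build_index(cells, bloom_type)
    )


print(files_to_read(store_files, "RowCol", "r1", "q1"))  # ['StoreFile1']
print(files_to_read(store_files, "RowCol", "r3", "q2"))  # ['StoreFile2']
print(files_to_read(store_files, "Row", "r1"))           # ['StoreFile1', 'StoreFile2']
print(files_to_read(store_files, "RowCol", "r1"))        # ['StoreFile1', 'StoreFile2']
```

In this hypothetical layout r1 appears in both files, so a Row filter cannot skip either one, while RowCol can skip StoreFile2 once q1 is specified. The last call shows the case from the usage scenario: with no Qualifier, RowCol excludes nothing.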

Summary

This article explained HBase's storage and read optimization strategies, focusing on the Bloom filter, a filtering method not typically found in relational databases. For more on HBase, please read and follow the other articles in the HBase column.
