HBase knowledge points (3)

What are the characteristics of HBase?

  • Large: a single table can have billions of rows and millions of columns;
  • Schema-free: each row has a sortable primary key (row key) and an arbitrary number of columns; columns can be added dynamically as needed, and different rows in the same table can have completely different columns;
  • Column-oriented: storage and permission control are per column (family), and column (families) are retrieved independently;
  • Sparse: null columns occupy no storage space, so tables can be designed to be very sparse;
  • Multiple versions of data: the data in each cell can have multiple versions; by default the version number is assigned automatically and is the timestamp at which the cell was inserted;
  • Single data type: all data in HBase is stored as untyped byte strings.

The difference between HBase and Hive?

Hive and HBase are two different technologies built on Hadoop: Hive is a SQL-like engine that runs MapReduce jobs, while HBase is a NoSQL key/value database on top of Hadoop. The two can of course be used together, just as you might use Google for search and Facebook for social networking: Hive is suited to statistical and analytical queries, HBase to real-time queries. Data can also be written from Hive into HBase, and then written from HBase back into Hive after processing.

What kind of scenarios is HBase suitable for?

  • ① Semi-structured or unstructured data
    Data whose structure cannot be fully defined in advance, or which is disorderly and hard to extract according to a fixed schema, is a good fit for HBase. For example, if the business later needs to store an author's email, phone, and address, an RDBMS requires a schema change (often with downtime for maintenance), whereas HBase supports adding columns dynamically.
  • ② Sparse records
    In an RDBMS the number of columns per row is fixed, so null columns waste storage space. As mentioned above, HBase does not store columns that are null, which saves space and improves read performance.
  • ③ Multi-version data
    The value located by a row key and column key can have any number of versions, so HBase is very convenient for data whose change history must be kept. For example, an author's address changes over time; usually only the latest value is needed, but occasionally historical values must be queried.
  • ④ Large amounts of data
    When the data volume grows beyond what an RDBMS can handle, a read-write separation strategy usually appears first: one master dedicated to writes and multiple slaves handling reads, which doubles server cost. As the load increases, the master can no longer keep up, so the database must be split and weakly related data separated; some join queries become impossible and a middle layer is needed. As the data grows further, the number of records in a single table becomes so large that queries slow down, so tables must be sharded, for example splitting a table into multiple tables by ID to reduce the records per table. Anyone who has been through this knows how arduous the process is.
    With HBase it is simple: just add machines. HBase splits and scales horizontally automatically, and its seamless integration with Hadoop provides data reliability (HDFS) and high-performance analysis of massive data (MapReduce).

Describe the design principles of HBase's rowKey?

  • (1) Rowkey length principle

Rowkey is a binary byte stream. Many developers suggest a Rowkey length of 10 to 100 bytes, but it is recommended to keep it as short as possible, ideally no more than 16 bytes.
The reasons are as follows:
① The persisted data file HFile is stored as KeyValue pairs. If the Rowkey is too long, say 100 bytes, then 10 million rows will spend 100 bytes x 10 million = 1 billion bytes, nearly 1 GB, on Rowkeys alone, which greatly reduces HFile's storage efficiency;
② MemStore caches part of the data in memory. If the Rowkey field is too long, the effective utilization of memory drops, the system cannot cache as much data, and retrieval efficiency is reduced. Therefore, the shorter the Rowkey, the better;
③ Current operating systems are 64-bit and align memory on 8-byte boundaries. Keeping the Rowkey at 16 bytes, an integer multiple of 8, makes the best use of this alignment.

  • (2) Rowkey hashing principle

If the Rowkey increases with a timestamp, do not put the time at the front of the key. It is recommended to use the high-order bytes of the Rowkey as a hash (salt) field, generated by the program, and put the time field in the low-order bytes. This increases the probability that data is evenly distributed across RegionServers and achieves load balancing. Without a hash field, if the first field is the time itself, a hotspot occurs: all new data piles up on one RegionServer, and query load also concentrates on individual RegionServers, reducing query efficiency. (A small sketch of this salting idea follows these principles.)

  • (3) The unique principle of Rowkey

The design must guarantee that each Rowkey is unique.
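
As a concrete illustration of the hashing/salting principle (2) above, here is a minimal Java sketch. The bucket count, key layout, and names are illustrative assumptions, not a prescribed format:

```java
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedRowKey {
    private static final int BUCKETS = 16;   // ideally matches the number of pre-split regions

    // Put a short hash in the high-order bytes and the time-ordered part in the
    // low-order bytes so new rows spread across RegionServers instead of piling up.
    public static byte[] build(String userId, long timestamp) {
        int bucket = (userId.hashCode() & Integer.MAX_VALUE) % BUCKETS;   // stable hash prefix
        // e.g. "07_user123_1602641863000" -- hash bucket first, time last
        return Bytes.toBytes(String.format("%02d_%s_%d", bucket, userId, timestamp));
    }
}
```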

Describe the functions of scan and get in HBase, and their similarities and differences?

HBase provides only two ways to query data:
1) Get a single record by a specified RowKey, using the get method (org.apache.hadoop.hbase.client.Get). Gets are handled in two ways: with ClosestRowBefore set and without it. The row lock is mainly used to guarantee row-level transactionality: each get targets one row, and a row can contain many column families and columns.
2) Obtain a batch of records matching specified conditions, using the scan method (org.apache.hadoop.hbase.client.Scan). Conditional queries are implemented with scan:
(1) Scan can use the setCaching and setBatch methods to increase speed (trading space for time);
(2) Scan can use setStartRow and setStopRow to limit the range ([start, stop): start is inclusive, stop is exclusive). The smaller the range, the better the performance;
(3) Scan can add filters through the setFilter method, which is also the basis for paging and multi-condition queries.
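
For illustration, a minimal Java client sketch of both query paths follows; the table name, column family, and row keys are placeholders, and setStartRow/setStopRow are the classic (pre-2.x) method names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetScanSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {   // placeholder table

            // 1) Get: fetch the single row identified by a RowKey
            Get get = new Get(Bytes.toBytes("row-001"));
            get.addFamily(Bytes.toBytes("info"));                 // limit to one column family
            Result one = table.get(get);
            System.out.println("get: " + one);

            // 2) Scan: fetch a batch of rows in the range [start, stop)
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes("row-001"));           // inclusive (withStartRow in newer clients)
            scan.setStopRow(Bytes.toBytes("row-100"));            // exclusive
            scan.setCaching(100);                                  // rows fetched per RPC: trades memory for speed
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println("scan: " + r);
                }
            }
        }
    }
}
```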

Please describe the structure of a cell in HBase in detail?

A storage unit determined by a row and a column in HBase is called a cell.
Cell: the unit uniquely determined by {row key, column (= column family + column qualifier), version}. The data in a cell has no type and is stored entirely as raw bytes.
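
As a small illustration, the sketch below pulls those cell coordinates apart with CellUtil; the Table handle and row key are assumed to come from an ordinary client program like the Get/Scan sketch above:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CellSketch {
    // The Table handle is assumed to come from an ordinary client Connection,
    // as in the Get/Scan sketch above; "row-001" is a placeholder row key.
    static void printCells(Table table) throws IOException {
        Result result = table.get(new Get(Bytes.toBytes("row-001")));
        for (Cell cell : result.rawCells()) {
            String row       = Bytes.toString(CellUtil.cloneRow(cell));
            String family    = Bytes.toString(CellUtil.cloneFamily(cell));
            String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
            long   version   = cell.getTimestamp();          // default version = insert timestamp
            byte[] value     = CellUtil.cloneValue(cell);     // untyped bytes
            System.out.printf("%s %s:%s @%d = %s%n",
                    row, family, qualifier, version, Bytes.toString(value));
        }
    }
}
```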

Briefly describe the purpose of compaction in HBase, when it is triggered, which two types it is divided into, how they differ, and the relevant configuration parameters?

In HBase, every time MemStore data is flushed to disk a StoreFile is formed. When the number of StoreFiles reaches a certain threshold, the StoreFiles need to be compacted.

The role of Compact:

  • ① Combine files
  • ② Clear outdated and redundant version data
  • ③ Improve the efficiency of reading and writing data

Two compaction methods are implemented in HBase: minor and major. The difference between them is:

  • A minor compaction only merges some of the files and cleans up only expired versions (TTL expired, minVersion=0); it does not clean up deleted data or excess multi-version data.
  • A major compaction merges all StoreFiles of an HStore in a Region into one sorted file, and in the process removes deleted cells and excess versions.
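
The question also asks about the related configuration parameters; a few commonly tuned ones are sketched below. They are set through the Java Configuration API only for illustration (normally they live in hbase-site.xml), and the values shown are typical defaults that can vary between HBase versions, so treat them as assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CompactionConfigSketch {
    public static Configuration build() {
        Configuration conf = HBaseConfiguration.create();
        // Minor compaction is considered once a Store holds at least this many StoreFiles
        conf.setInt("hbase.hstore.compactionThreshold", 3);
        // Writes to a Store are blocked once it accumulates this many StoreFiles
        conf.setInt("hbase.hstore.blockingStoreFiles", 16);
        // Interval for automatic major compaction in milliseconds (0 disables time-based majors)
        conf.setLong("hbase.hregion.majorcompaction", 7L * 24 * 60 * 60 * 1000);
        return conf;
    }
}
```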

Tens of billions of records are stored in HBase every day. How do you ensure the data is stored correctly and that all of it is written within the specified time, with nothing left over?

Demand analysis:
1) Tens of billions of records: the data volume is very large;
2) Stored in HBase: it concerns how data is written to HBase;
3) Ensure the correctness of the data: the data structure must be designed correctly;
4) Completed within the specified time: there is a requirement on ingest speed.

Solutions:
1) What does tens of billions of records per day mean? Assume data is written during all 86,400 seconds of the day (60 x 60 x 24 = 86,400); the write rate is then over 100,000 records per second (10 billion / 86,400 ≈ 115,000). Sustaining such a rate with individual real-time puts is not realistic, so these tens of billions of records are probably not written in real time but imported in batches. BulkLoad is recommended for batch import (recommended reading: Spark reading and writing HBase); its performance is several times that of ordinary writes;

2) Stored in HBase: normal writes use the Java API put; BulkLoad is recommended for batch import;

3) Ensure the correctness of the data: this requires considering RowKey design, region pre-splitting, column family design, and similar issues;

4) Complete within the specified time: the ingest speed cannot be too slow (the faster the better), so use BulkLoad.
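
For illustration, a rough sketch of the BulkLoad route using the MapReduce HFileOutputFormat2 API follows; the mapper class, paths, and table name are hypothetical placeholders, and details differ between HBase versions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-bulkload-sketch");
        job.setJarByClass(BulkLoadSketch.class);
        job.setMapperClass(MyPutMapper.class);               // hypothetical mapper emitting (rowkey, Put)
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path("/data/raw"));        // placeholder input path
        FileOutputFormat.setOutputPath(job, new Path("/data/hfiles"));   // placeholder HFile output path

        TableName tn = TableName.valueOf("big_table");       // placeholder table name
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(tn);
             RegionLocator locator = conn.getRegionLocator(tn)) {
            // Sets the reducer, partitioner, and output format so the generated HFiles
            // line up with the table's current region boundaries
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
            job.waitForCompletion(true);
        }
        // The generated HFiles are then loaded into the table with the completebulkload
        // tool (LoadIncrementalHFiles / BulkLoadHFiles, depending on the HBase version).
    }
}
```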

How do you pre-build (pre-split) regions in HBase?

The purpose of pre-partitioning (pre-splitting) is to specify the number of regions when the table is created, planning in advance how many regions the table will have and the key range of each. Rowkeys are then stored into the region whose range they fall into, which avoids region hotspot problems.

There are usually two schemes:
Scheme 1: shell method
create 'tb_splits', {NAME => 'cf', VERSIONS => 3}, {SPLITS => ['10','20','30']}
Scheme 2: Java program control
① Sampling: first randomly generate a certain number of rowkeys and sort the sampled data in ascending order into a collection;
② Divide the whole collection evenly according to the number of pre-split regions to obtain the splitKeys;
③ HBaseAdmin.createTable(HTableDescriptor tableDescriptor, byte[][] splitKeys) specifies the splitKeys of the pre-split table, i.e. the boundary rowkeys between regions.
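
A minimal sketch of scheme 2 using the HBase 2.x Admin API (equivalent to the older HBaseAdmin.createTable call mentioned above) might look like this; the table name, column family, and split points are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableDescriptor desc = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("tb_splits"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder
                            .newBuilder(Bytes.toBytes("cf"))
                            .setMaxVersions(3)
                            .build())
                    .build();
            // Three split keys produce four regions: (-inf,'10'), ['10','20'), ['20','30'), ['30',+inf)
            byte[][] splitKeys = {
                    Bytes.toBytes("10"), Bytes.toBytes("20"), Bytes.toBytes("30")
            };
            admin.createTable(desc, splitKeys);
        }
    }
}
```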

How to deal with HRegionServer downtime?

1) ZooKeeper will monitor the online and offline status of HRegionServer, and when ZK finds that a certain HRegionServer is down, it will notify HMaster for failover;

2) The failed HRegionServer stops providing service, i.e. the regions it was responsible for temporarily stop serving requests;

3) The HMaster reassigns the regions that the failed HRegionServer was responsible for to other HRegionServers, and recovers the MemStore data on the failed server that had not yet been persisted to disk;

4) This recovery is done by WAL replay. The process is as follows:
① The WAL is a file stored under the /hbase/WALs/ path corresponding to each RegionServer.
② When a crash occurs, the WAL file under the failed RegionServer's path is read and split by region into temporary recovered.edits files.
③ When a region is assigned to a new RegionServer, that RegionServer checks for recovered.edits when opening the region and, if present, replays the edits to restore the data.

HBase read and write process?

  • read:

① The HRegionServers store the meta table and the table data. To access table data, the Client first accesses ZooKeeper and obtains the location of the meta table, i.e. which HRegionServer the meta table is stored on.

② The Client then uses that HRegionServer's address to access the HRegionServer holding the meta table, reads the meta table, and obtains the metadata stored in it.

③ Using that metadata, the Client accesses the corresponding HRegionServer, which scans its MemStore and StoreFiles to find the requested data.

④ Finally, the HRegionServer returns the queried data to the Client.

  • write:

① Client first visits zookeeper, finds the Meta table, and obtains the metadata of the Meta table.

② Determine the HRegion and HRegionServer servers corresponding to the data currently to be written.

③ The Client initiates a data write request to the HRegionServer server, and then the HRegionServer receives the request and responds.

④ The data is first written to the HLog (WAL) to prevent data loss.

⑤ Then write the data to Memstore.

⑥ If both HLog and Memstore are written successfully, the data is written successfully.

⑦ If Memstore reaches the threshold, the data in Memstore will be flushed to Storefile.

⑧ When there are more and more Storefiles, it will trigger the Compact merge operation to merge too many Storefiles into one big Storefile.

⑨ When the Storefile gets bigger and bigger, the Region will get bigger and bigger. When the threshold is reached, the Split operation will be triggered to divide the Region into two.
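
From the client's point of view, the whole write path above is hidden behind a single put call; the sketch below uses placeholder table and column names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {   // placeholder table
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            // Server side: the RegionServer writes to the HLog (WAL) first, then to the MemStore
            table.put(put);
        }
    }
}
```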

What is the internal mechanism of HBase?

HBase is a database system suited to online business:

  • Physical storage: HBase's persistent data is stored on HDFS.
  • Storage management: a table is divided into many regions, and these regions are distributed across many region servers. A region is further divided into stores, and each store contains a MemStore and StoreFiles.
  • Version management: data updates in HBase are essentially the continuous addition of new versions; files are merged across versions by compaction, and regions are split as they grow.
  • Cluster management: ZooKeeper + HMaster + HRegionServer.

What is the memstore in Hbase used for?

In order to ensure the performance of random reading, the rowkeys in hfile are ordered.

After a client request arrives at the RegionServer, the data cannot be written to an HFile immediately, because rowkeys in an HFile must be kept in sorted order; instead, each change is first kept in memory, in the MemStore.

Memstore can easily support random insertion of operations and ensure that all operations are ordered in memory.

When the memstore reaches a certain amount, the data in the memstore will be flushed to the hfile, which can make full use of hadoop's performance advantages for writing large files and improve the write performance.

Since memstore is stored in memory, if the regionserver dies for some reason, data in the memory will be lost.

So, to ensure that data is not lost, HBase writes each update to a write-ahead log (WAL) before writing it to the MemStore.

WAL files are appended and written sequentially. There is only one WAL per regionserver, and all regions on the same regionserver are written to the same WAL file.

In this way, when a regionserver fails, all operations can be reloaded into memstore through the WAL file.

What is the focus of HBase schema design? How many Column Families is it most appropriate to define in a table? Why?

The number of column families depends on the data in the table. Generally, the split is based on data access frequency: if some columns of a table are accessed frequently while others are rarely accessed, the table can be divided into two column families that are stored separately to improve access efficiency. In practice it is best to keep the number of column families small, usually one and at most two or three.

How to improve the read and write performance of the HBase client? Please give an example

① Turn on the BloomFilter; reads with it enabled can be 3 to 4 times faster than without.
② HBase has special memory requirements; if the hardware allows, allocate enough memory to it,
③ for example by modifying export HBASE_HEAPSIZE=3000 in hbase-env.sh (the default is 1000 MB).
④ Increase the number of RPC handlers by modifying the hbase.regionserver.handler.count property in hbase-site.xml; the default value of 10 is a bit small.
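
As an illustration of point ①, a minimal sketch of enabling a Bloom filter on a column family with the 2.x client API follows; the table and family names are placeholders, and BloomType.ROW is just one common choice:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomFilterSketch {
    // Admin is obtained via Connection.getAdmin() as in the earlier sketches
    static void createTableWithBloom(Admin admin) throws IOException {
        TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("user"))                  // placeholder table name
                .setColumnFamily(ColumnFamilyDescriptorBuilder
                        .newBuilder(Bytes.toBytes("info"))              // placeholder column family
                        .setBloomFilterType(BloomType.ROW)              // ROWCOL also checks qualifiers
                        .build())
                .build();
        admin.createTable(desc);
    }
}
```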

What are the precautions for HBase cluster installation?

① HBase needs HDFS support, so make sure that the Hadoop cluster is installed before installing HBase;
② HBase needs ZooKeeper cluster support, so make sure that the ZooKeeper cluster is installed before installing HBase;
③ Pay attention to the version compatibility of HBase and Hadoop;
④ Pay attention to the correct configuration of the hbase-env.sh and hbase-site.xml configuration files;
⑤ Pay attention to the modification of the regionservers configuration file;
⑥ Note that the time of each node in the cluster must be synchronized, otherwise an error will be reported when starting the HBase cluster.

If the timestamp is used directly as the row key, a write hotspot will occur on a single region. Why?

Rowkeys within a region are stored in sorted order. If rowkeys are timestamps, writes are concentrated in time and all land in one region; that region accumulates far more data than the others, and loading data into it becomes very slow. The problem is only alleviated once the region splits.

Please describe how to solve the problems caused by regions that are too large or too small in HBase?

When a region is too large, compactions occur repeatedly, reading the data and rewriting it to HDFS and consuming I/O; when a region is too small, frequent splits occur, and regions go offline during splits, affecting access to the service. The best solution is to adjust hbase.hregion.max.filesize, for example to 256 MB.

Why is it not recommended to use too many column families in HBase?

(CF is an abbreviation for ColumnFamily, i.e. column family.)

In the Hbase table, each column family corresponds to a Store in the Region. When the size of the Region reaches the threshold, it will split. Therefore, if there are multiple column families in the table, the following phenomena may occur:

1. There are multiple stores in a region. If the data volume of each CF is unevenly distributed, for example, CF1 is 1 million and CF2 is 10,000, the data volume of CF2 in each region is too small when the region is split. When querying CF2, it will span multiple regions, resulting in reduced efficiency.

2. If the data of each CF is evenly distributed, for example CF1 has 500,000 rows, CF2 has 500,000, and CF3 has 500,000, then after a region splits the data volume of each CF in a region is small, and querying a single CF is more likely to span multiple regions.

3. Multiple CFs represent multiple Stores, which means that there are multiple MemStores (2MB), which leads to increased memory consumption and decreased usage efficiency.

4. Flush and compaction are performed at the region level, so if one CF triggers a flush or compaction, the other CFs perform the same operation at the same time. With too many column families this causes frequent I/O problems.
