HBase Interview Questions (Face-to-Face) Compilation

1. What is HBase? What are the characteristics of HBase?

  1. HBase is a distributed, column-oriented database built on Hadoop's HDFS for storage and coordinated by ZooKeeper.
  2. HBase is suitable for storing semi-structured or unstructured data, i.e., data whose schema fields are not sufficiently fixed, or data that is irregular and hard to extract according to a single schema.
  3. Null cells in HBase are not stored, so tables can be sparse.
  4. An HBase table contains a rowkey, timestamps, and column families. When new data is written, a new timestamped version is added, and previous versions can still be queried.
  5. HBase is a master-slave architecture: HMaster is the master node, and HRegionServer is the slave node.

2. How is data imported into HBase?

  1. Write data in batches through the HBase API (see the sketch after this list);
  2. Use the Sqoop tool to batch-import data into the HBase cluster;
  3. Use MapReduce to batch-import;
  4. Use HBase BulkLoad.
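A minimal sketch of the first approach, batch writes via the HBase 2.x Java client; the table name `user`, column family `info`, and ZooKeeper quorum are placeholder assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // placeholder quorum

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {
            List<Put> puts = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                Put put = new Put(Bytes.toBytes("row-" + i));          // rowkey
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                        Bytes.toBytes("user-" + i));                   // cf:qualifier = value
                puts.add(put);
            }
            table.put(puts); // one batched call instead of 1000 single RPCs
        }
    }
}
```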

3. What is the storage structure of HBase?

Each table in HBase is divided by rowkey range into multiple sub-tables (HRegions). By default, an HRegion larger than 256M is split into two. HRegions are managed by HRegionServers, and HMaster decides which HRegionServer manages which HRegions. When an HRegionServer accesses a sub-table, it creates an HRegion object and then a Store instance for each column family of the table. Each Store has zero or more StoreFiles, and each StoreFile corresponds to an HFile, the actual storage file. In addition, each Store also has one MemStore instance.

4. What is the difference between HBase and Hive? What is the underlying storage of Hive and HBase? What was Hive created for, and which Hadoop shortcomings does HBase make up for?

Similarities:

  1. Both HBase and Hive are built on Hadoop, and both use Hadoop (HDFS) as the underlying storage.

Differences:

  1. Hive is a batch-processing system built on Hadoop to reduce the effort of writing MapReduce jobs; HBase exists to support projects that need to compensate for Hadoop's shortcomings in real-time operations.
  2. Imagine you are operating an RDBMS: if the workload is full table scans, use Hive + Hadoop; if it is indexed access, use HBase + Hadoop;
  3. A Hive query is a MapReduce job and can take from 5 minutes to several hours; HBase is very efficient, definitely much more efficient than Hive;
  4. Hive itself neither stores nor computes data; it relies entirely on HDFS and MapReduce, and its tables are purely logical;
  5. Hive borrows MapReduce from Hadoop to execute some of its commands;
  6. HBase tables are physical tables, not logical tables. HBase provides a large in-memory hash-table-like structure; search engines use it to store indexes to speed up query operations;
  7. HBase is column-oriented storage;
  8. HDFS is the underlying storage; HDFS is the file storage system, and HBase organizes files on top of it;
  9. Hive uses HDFS to store files and MapReduce as its computing framework.

5. Explain the principle of HBase real-time queries

Real-time query essentially means querying from memory, with response times generally within 1 second. HBase's mechanism is that data is first written to memory; when the amount of data in memory reaches a certain threshold (such as 128M), it is flushed to disk. Data in memory is never updated or merged in place, only appended. A user's write returns as soon as it reaches memory, which guarantees HBase's high I/O performance.

6. Describe the design principles of HBase's rowkey

Given the relationship between regions and rowkeys, rowkey design can follow the three principles below.

     1. Rowkey length principle
The rowkey is a binary stream and can be any string, with a maximum length of 64 KB. In practice it is usually 10-100 bytes, stored as a byte[], and generally designed with a fixed length. It is recommended to keep it as short as possible, no more than 16 bytes, for the following reasons:

The data persistence file HFile is stored as KeyValues: if the rowkey is too long, it greatly hurts HFile's storage efficiency. The MemStore caches part of the data in memory: if the rowkey field is too long, the effective utilization of memory drops, the system can cache less data, and retrieval efficiency falls.

     2. Rowkey hashing principle
If the rowkey increases by timestamp, do not put the time at the front of the binary stream. It is recommended to use the high-order bytes of the rowkey as a hash field, generated randomly by the program, and to put the time field in the low-order bytes. This improves the even distribution of data across RegionServers and the odds of load balancing. Without a hash field, if the first field is the time itself, all new data concentrates on one RegionServer; during retrieval the load then concentrates on individual RegionServers, causing hotspots and reducing query efficiency.

     3. Rowkey uniqueness principle
The rowkey must be designed to guarantee its uniqueness. Rowkeys are stored sorted in lexicographic order, so when designing a rowkey, make full use of this sorting characteristic: store frequently read data together, and put data that is likely to be accessed together in the same place. A sketch of the hashing principle above follows.
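A minimal sketch of the hashing principle, assuming a hypothetical table keyed by a userId; the 4-hex-char salt width is an illustrative choice, not from the original:

```java
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MD5Hash;

public class RowKeyUtil {
    // Hypothetical helper: high-order hash salt + business id + low-order
    // time field, keeping the whole key short and fixed-length.
    public static byte[] buildRowKey(String userId, long timestampMs) {
        // 4 hex chars of the MD5 of the business key spread writes across regions.
        String salt = MD5Hash.getMD5AsHex(Bytes.toBytes(userId)).substring(0, 4);
        return Bytes.toBytes(salt + "_" + userId + "_" + timestampMs);
    }
}
```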

7. Describe the functions of scan and get in HBase, and their similarities and differences

  1. Get retrieves the unique record for a specified rowkey. The get method (org.apache.hadoop.hbase.client.Get) handles two cases: with ClosestRowBefore set and without it. The row lock is mainly used to guarantee the transactionality of the row, i.e., each Get is keyed by one row; a row can contain many families and columns.
  2. Scan retrieves a batch of records matching specified conditions. The scan method (org.apache.hadoop.hbase.client.Scan) implements conditional queries. When using scan:
1. scan can improve speed via setCaching and setBatch (trading space for time);
2. scan can limit the range via setStartRow and setStopRow ([start, stop): start is inclusive, stop is exclusive); the smaller the range, the higher the performance;
3. scan can add filters via setFilter, which is the basis of paging and multi-condition queries.
  3. Full table scan, i.e., directly scanning all row records of the entire table. A usage sketch of Get and Scan follows.
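A usage sketch of Get and Scan, assuming the HBase 2.x client (where setStartRow/setStopRow are spelled withStartRow/withStopRow) and a hypothetical orders table:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetScanExample {
    static void demo(Connection conn) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("orders"))) {
            // Get: the unique record for one rowkey.
            Result one = table.get(new Get(Bytes.toBytes("order-0001")));

            // Scan: a rowkey range; caching trades memory for fewer RPC round-trips.
            Scan scan = new Scan();
            scan.withStartRow(Bytes.toBytes("order-0001"));  // inclusive
            scan.withStopRow(Bytes.toBytes("order-0100"));   // exclusive
            scan.setCaching(500);
            try (ResultScanner rs = table.getScanner(scan)) {
                for (Result r : rs) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```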


8. Please describe in detail the structure of a Cell in HBase

A storage unit determined by row and column in HBase is called a cell. A cell is uniquely determined by {rowkey, column (= <family> + <qualifier>), version}. The data in a cell has no type and is stored as raw bytes.
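A short sketch that prints every coordinate of the cells in a query Result, using the client API's standard CellUtil helpers:

```java
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CellDump {
    // Every cell carries the full coordinate {row, family, qualifier, timestamp}
    // next to its untyped byte[] value.
    static void dumpCells(Result result) {
        for (Cell cell : result.rawCells()) {
            System.out.printf("row=%s family=%s qualifier=%s ts=%d value=%s%n",
                    Bytes.toString(CellUtil.cloneRow(cell)),
                    Bytes.toString(CellUtil.cloneFamily(cell)),
                    Bytes.toString(CellUtil.cloneQualifier(cell)),
                    cell.getTimestamp(),
                    Bytes.toString(CellUtil.cloneValue(cell)));
        }
    }
}
```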

9. Briefly describe the purpose of compaction in HBase, when it is triggered, which two types it is divided into, what the differences are, and what the relevant configuration parameters are

In HBase, whenever a MemStore is flushed to disk, a StoreFile is formed. When the number of StoreFiles reaches a certain level, they need to be compacted. The roles of compaction are:

  1. Merge files
  2. Clear expired and redundant versions of data
  3. Improve the efficiency of reading and writing data

10. HBase implements two compaction types; the differences between minor and major compaction are:

  1. Minor compaction only merges some of the files, cleaning up expired versions where minVersion=0 and TTL is set; it does not clean up deleted data or multi-version data;
  2. Major compaction merges all StoreFiles under an HStore of the region, and the final result is a single sorted, merged file. A sketch of the relevant configuration parameters follows.
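A sketch naming the commonly cited parameters; these are server-side settings that normally live in hbase-site.xml, shown here via Configuration only to identify them, and the table name is a placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CompactionConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Server-side setting: minor compaction is considered once a store
        // holds this many StoreFiles.
        conf.set("hbase.hstore.compactionThreshold", "3");
        // Server-side setting: periodic major compaction interval in ms;
        // 0 disables it entirely.
        conf.set("hbase.hregion.majorcompaction", "0");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // With periodic major compaction off, trigger it manually when idle.
            admin.majorCompact(TableName.valueOf("orders"));
        }
    }
}
```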

11. Briefly describe the implementation principle of HBase filters. Combined with actual project experience, give a few scenarios where filters are used.

HBase provides a set of filters for filtering data. With filters, data can be filtered on multiple dimensions (row, column, data version), which means filtering can be refined down to a specific storage cell (located by rowkey, column name, and timestamp).

Examples are RowFilter and PrefixFilter. HBase filters are set on a scan, so they filter the results of scan queries. There are many filter types, but they fall into two categories: comparison filters and special filters. A filter's function is to judge on the server side whether data meets the conditions, and then return only the matching data to the client. For example, when developing an order system, we used a rowkey filter to fetch all orders of a given user, as sketched below.
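A sketch of that order scenario, assuming (purely for illustration) rowkeys of the form <userId>_<orderId>:

```java
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class OrderFilters {
    // All orders of one user: match the rowkey prefix server-side.
    static Scan ordersOfUser(String userId) {
        Scan scan = new Scan();
        scan.setFilter(new PrefixFilter(Bytes.toBytes(userId + "_")));
        return scan;
    }

    // Comparison filter: rows whose key contains a given substring.
    static Scan rowsContaining(String fragment) {
        Scan scan = new Scan();
        scan.setFilter(new RowFilter(CompareOperator.EQUAL,
                new SubstringComparator(fragment)));
        return scan;
    }
}
```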

12. What is the internal mechanism of HBase?

Whether adding new rows or modifying existing rows, the internal process in HBase is the same: HBase either saves the change after receiving the command, or the write fails and an exception is thrown. By default, a write goes to two places: the write-ahead log (WAL, also known as the HLog) and the MemStore. Recording the write in both places ensures data durability; the write is considered complete only when the change has been written and confirmed in both.

The MemStore is a write buffer in memory, where HBase data accumulates before being permanently written to disk. When the MemStore fills up, its data is flushed to disk, generating an HFile. HFile is the underlying storage format used by HBase. HFiles correspond to column families: a column family can have multiple HFiles, but one HFile cannot store data from multiple column families. On each node of the cluster, there is one MemStore per column family of each region. Hardware failures are common in large distributed systems, and HBase is no exception.

Imagine that the MemStore has not yet been flushed when the server crashes: the data in memory not yet written to disk would be lost. HBase's answer is to write to the WAL before the write action completes. Every server in the HBase cluster maintains a WAL to record changes; the WAL is a file on the underlying file system. A write is not considered successful until the new WAL record has been successfully written. This guarantees durability for HBase and the file system backing it.

In most cases, HBase uses the Hadoop Distributed File System (HDFS) as its underlying file system. If an HBase server goes down, data not yet flushed from MemStore to HFile can be recovered by replaying the WAL. You don't need to do this manually; a recovery process inside HBase handles it. Each HBase server has one WAL, shared by all tables (and their column families) on that server. You might think that skipping the WAL on writes should improve write performance, but we do not recommend disabling the WAL unless you are willing to lose data when something goes wrong. If you want to test it, code along the following lines can disable the WAL. Note: not writing the WAL increases the risk of data loss when a RegionServer fails; with the WAL off, HBase may be unable to recover data after a failure, and all written data not yet flushed to disk will be lost.
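A minimal sketch, assuming the 2.x client API where durability is set per mutation:

```java
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SkipWalExample {
    static Put unloggedPut() {
        Put put = new Put(Bytes.toBytes("row-1"));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("value"));
        // Skip the WAL for this mutation only: faster, but the edit is lost
        // if the RegionServer dies before the MemStore is flushed.
        put.setDurability(Durability.SKIP_WAL);
        return put;
    }
}
```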

13. How to deal with HBase downtime?

Downtime is divided into HMaster downtime and HRegionServer downtime.

If an HRegionServer goes down, HMaster redistributes the regions it managed to other active HRegionServers. Since the data and logs are persisted in HDFS, this operation does not cause data loss, so data consistency and safety are guaranteed.

If HMaster goes down, there is no single point of failure: multiple HMasters can be started in HBase, and ZooKeeper's Master Election mechanism ensures there is always one Master running to provide external service.

14. How to deal with HRegionServer downtime?

  1. ZooKeeper monitors the online/offline state of each HRegionServer. When ZooKeeper finds that an HRegionServer is down, it notifies HMaster to perform failover;
  2. The failed HRegionServer stops providing service, i.e., the regions it was responsible for temporarily stop serving;
  3. HMaster transfers the regions that the HRegionServer was responsible for to other HRegionServers, and recovers the MemStore data on the failed HRegionServer that had not yet been persisted to disk;
  4. This recovery is done by WAL replay. The process is as follows:
1. The WAL is actually a file, stored under the /hbase/WAL/ path corresponding to each RegionServer;
2. When the crash occurs, the WAL file under the path corresponding to that RegionServer is read and split by region into temporary recover.edits files;
3. When a region is assigned to a new RegionServer, the RegionServer checks for recover.edits files when loading the region and, if present, replays them to recover the data.


15. HBase write and read data flow

Obtaining region location information

Both writing and reading data first obtain the location information of the HBase region. The approximate steps are:

  1. Obtain the location of the .ROOT. table from ZooKeeper; its location in ZooKeeper is /hbase/root-region-server;
  2. Obtain the location of the .META. table from the information in the .ROOT. table;
  3. The .META. table stores the location of every region.

Insert data into the HBase table

The cache in HBase is divided into two layers: MemStore and BlockCache.

  1. First write to the WAL file, so that data is not lost;
  2. Then insert the data into the MemStore cache; when the MemStore reaches the set size threshold, a flush is performed;
  3. During the flush, the storage location of each region needs to be obtained.

Read data from HBase

BlockCache mainly serves reads. A read request first checks the MemStore; if the data is not found there, it checks the BlockCache; if still not found, it reads from disk and puts the result into the BlockCache.

BlockCache uses the LRU (Least Recently Used) algorithm, so when the BlockCache reaches its upper limit, an eviction mechanism removes the oldest batch of data.

A RegionServer has one BlockCache and N MemStores, and the sum of their sizes must be less than heapsize * 0.8, otherwise HBase cannot start. The default is 0.2 for BlockCache and 0.4 for MemStore. For systems that care about read response time, the BlockCache should be set larger, for example BlockCache = 0.4 and MemStore = 0.39, to increase the cache hit rate. A configuration sketch follows.
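A sketch naming the two fractions; the property names are from recent releases (older ones used hbase.regionserver.global.memstore.upperLimit), and they are server-side settings normally placed in hbase-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HeapFractionSketch {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Fraction of heap for the read-side BlockCache.
        conf.set("hfile.block.cache.size", "0.4");
        // Upper bound on the fraction of heap all MemStores together may use.
        conf.set("hbase.regionserver.global.memstore.size", "0.39");
        // The two fractions must stay below 0.8 combined, or the server won't start.
    }
}
```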

16. HBase optimization methods

Optimization mainly covers the following four aspects.

1. Reduce adjustments

How should "reducing adjustments" be understood? Several things in HBase are adjusted dynamically, such as regions (partitions) and HFiles, so some methods can be used to reduce these adjustments, which otherwise bring I/O overhead.


Region
If regions are not pre-partitioned, then as the data in a region grows, the region will split, which increases I/O overhead. The solution is to pre-split the partitions according to your rowkey design, reducing dynamic region splitting. A sketch follows.
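A pre-splitting sketch under the salted-rowkey assumption from question 6; the split points and table name are hypothetical, and `admin` is an open org.apache.hadoop.hbase.client.Admin:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    static void createPreSplit(Admin admin) throws Exception {
        // Hypothetical split points matching a hex salt prefix in the rowkey,
        // so the initial load spreads over four regions with no dynamic splits.
        byte[][] splitKeys = {
                Bytes.toBytes("4"), Bytes.toBytes("8"), Bytes.toBytes("c")
        };
        admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("orders"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                        .build(),
                splitKeys);
    }
}
```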

HFile
HFile is the underlying data storage file. Each MemStore flush generates an HFile, and when HFiles accumulate to a certain extent, the HFiles belonging to one region are merged. This step brings overhead but is inevitable; however, if the merged HFile grows larger than the configured limit, the region will split again. To reduce such unnecessary I/O overhead, it is recommended to estimate the project's data volume and set an appropriate HFile size.

2. Reduce start and stop

Database transaction mechanisms exist to better implement batch writes and reduce the overhead caused by repeatedly opening and closing the database; HBase, too, has problems caused by frequent opening and closing operations.

     1. Turn off automatic compaction, and perform compaction manually when idle

HBase has minor compaction and major compaction, i.e., the merging of HFiles. Merging means I/O reads and writes, and a large number of HFiles will certainly bring I/O overhead, even an I/O storm. To avoid such an uncontrolled accident, it is recommended to turn off automatic major compaction and run compaction during idle time.

     2. Use BulkLoad when writing batch data

Writing a large amount of data through the HBase shell or the Java API's put will certainly perform poorly and may also bring unexpected problems; it is therefore recommended to use BulkLoad when writing large volumes of offline data.

3. Reduce the amount of data

Although we are doing big data development, if we can reduce the data volume in some way while guaranteeing data accuracy, why not do it?

     1. Turn on filtering to improve query speed
Turn on the BloomFilter. The BloomFilter works at the column family level: when a StoreFile is generated, a MetaBlock is generated along with it and is used to filter data at query time.

     2. Use compression: Snappy or LZO compression is generally recommended. A column-family sketch combining both settings follows.
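A column-family sketch combining the BloomFilter and compression settings above, using the 2.x descriptor builder; the family name is a placeholder, and Snappy must also be available on the servers:

```java
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class FamilyTuningSketch {
    static ColumnFamilyDescriptor tunedFamily() {
        return ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf"))
                // ROW bloom filter: skip StoreFiles that cannot contain the rowkey.
                .setBloomFilterType(BloomType.ROW)
                // Snappy: cheap CPU-wise, shrinks data on disk.
                .setCompressionType(Compression.Algorithm.SNAPPY)
                .build();
    }
}
```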

4. Reasonable design

The design of the RowKey and ColumnFamily in an HBase table is very important. Good design can improve performance and ensure data accuracy.

  1. RowKey design: it should have the following attributes
  • Hashability: hashing ensures that identical and similar rowkeys are aggregated while different rowkeys are scattered, which benefits querying
  • Shortness: the rowkey is stored in the HFile as part of each key; a rowkey designed too long increases storage pressure
  • Uniqueness: the rowkey must be clearly distinguishable
  • Business fit: for example

If my queries have many conditions, and not on column values, then the rowkey design should support multi-condition queries.
If my queries need the most recently inserted data first, then the rowkey can use Long.MAX_VALUE - timestamp, so that rowkeys are arranged in descending time order, as sketched below.
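A minimal sketch of that newest-first trick; the 19-digit zero-padding is an illustrative choice to keep lexicographic order aligned with numeric order:

```java
import org.apache.hadoop.hbase.util.Bytes;

public class ReverseTimestampKey {
    static byte[] newestFirstKey(long timestampMs) {
        // Rowkeys sort ascending lexicographically, so storing
        // Long.MAX_VALUE - timestamp makes the newest row sort first.
        long reversed = Long.MAX_VALUE - timestampMs;
        return Bytes.toBytes(String.format("%019d", reversed));
    }
}
```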
     2. Column family design

      The design of the column families depends on the application scenario.

      Pros and cons of multi-column-family design:

Advantages:
Data in HBase is stored by column family, so when querying a column of a certain column family, there is no need to scan all the data on disk, only that column family, which reduces read I/O. In practice, the read-I/O reduction from a multi-column-family design is not very pronounced; it suits read-heavy, write-light scenarios.

Disadvantages:
Write I/O performance drops. The reason: after data is written to a store, it is first cached in the MemStore, and if the same region has multiple column families, there are multiple stores, each with its own MemStore. When one MemStore is flushed, the MemStores of all stores in the same region are flushed too, which increases I/O overhead.

17. Why is it not recommended to use too many column families in HBase?

In an HBase table, each column family corresponds to one Store in a Region. When the size of a Region reaches the threshold, it splits. Therefore, if there are multiple column families in the table, the following phenomena may occur:

  1. A Region has multiple Stores. If the data volume of each CF is unevenly distributed, for example CF1 has 1,000,000 rows and CF2 has 10,000, then when the region splits, the data volume of CF2 in each region is too small, and querying CF2 tends to span multiple Regions, reducing efficiency.
  2. If the data of each CF is evenly distributed, for example CF1, CF2, and CF3 each have 500,000 rows, then when the region splits, the data volume of each CF in a Region is smaller, and the probability that querying a single CF spans multiple Regions increases.
  3. Multiple CFs mean multiple Stores, which means multiple MemStores (2MB), leading to increased memory consumption and decreased usage efficiency.
  4. Flush and compaction in a Region are basic region-level operations, i.e., if one CF triggers a flush or compaction, the other CFs perform the same operation at the same time. With too many column families, this causes frequent I/O problems.

