HBase wide tables and tall tables: advantages and disadvantages

In HBase:

  • Wide table: many columns and few rows. Each row holds a large amount of data, and the number of rows is small.
  • Tall table (sometimes translated as "high table"): many rows and few columns. Each row holds a small amount of data, and the number of rows is large.
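As a rough illustration of the two shapes, the sketch below models the same data both ways with plain Python dictionaries. The customer IDs, metric names, and the `#` separator are hypothetical and not part of any HBase API; the point is only that in the tall layout the former column name moves into the row key.

```python
# Hypothetical data: entity -> {metric name -> value}.
readings = {
    "cust001": {"balance": "120", "loans": "2", "cards": "1"},
    "cust002": {"balance": "75", "loans": "0", "cards": "3"},
}

def to_wide(data):
    """One row per entity; every metric becomes a column in that row."""
    return {entity: dict(metrics) for entity, metrics in data.items()}

def to_tall(data, sep="#"):
    """One row per (entity, metric); the metric name moves into the row key."""
    return {
        f"{entity}{sep}{metric}": {"value": value}
        for entity, metrics in data.items()
        for metric, value in metrics.items()
    }

wide = to_wide(readings)
tall = to_tall(readings)
print(len(wide), "wide rows;", len(tall), "tall rows")
```

The wide form has few rows with many columns each; the tall form has many rows with a single column each.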

The HBase row key is the distributed index and the basis for sharding: regions are split along row-key ranges.

Within an HFile, data is arranged by row key + column family + column qualifier + timestamp + value.
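The ordering can be sketched as a sort over cell tuples. This is a simplified model, not HBase code: the sample cells are hypothetical, the value is left out of the comparison, and the newest-first timestamp order is an assumption based on how HBase is generally documented rather than something stated above.

```python
# Each cell: (row key, column family, qualifier, timestamp, value).
cells = [
    ("row2", "cf", "a", 100, "v1"),
    ("row1", "cf", "b", 100, "v2"),
    ("row1", "cf", "a", 100, "v3"),
    ("row1", "cf", "a", 200, "v4"),
]

def sort_key(cell):
    row, family, qualifier, timestamp, _value = cell
    # Assumption: newer timestamps sort first, hence the negated timestamp.
    return (row, family, qualifier, -timestamp)

ordered = sorted(cells, key=sort_key)
for cell in ordered:
    print(cell)
```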

Because of this arrangement, an HFile indexes data at the data-block level rather than the row level. This composite key is therefore the primary key of a coarse-grained (block-granularity) local index inside the HFile.
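A block-level index can be pictured as a sparse index that stores only the first key of each block. The sketch below is a toy model under that assumption; `build_blocks`, `lookup`, and the block size of 4 keys are hypothetical and do not reflect the real HFile format.

```python
import bisect

def build_blocks(sorted_keys, keys_per_block):
    """Cut the sorted keys into blocks and index only each block's first key."""
    blocks = [
        sorted_keys[i:i + keys_per_block]
        for i in range(0, len(sorted_keys), keys_per_block)
    ]
    index = [block[0] for block in blocks]
    return blocks, index

def lookup(blocks, index, key):
    """Use the sparse index to pick one block, then search inside that block."""
    pos = bisect.bisect_right(index, key) - 1
    if pos < 0:
        return None  # key sorts before the first block
    block = blocks[pos]
    if key in block:
        return pos, block.index(key)
    return None

keys = [f"row{i:03d}" for i in range(10)]
blocks, index = build_blocks(keys, keys_per_block=4)
print(index)
print(lookup(blocks, index, "row005"))
```

The index holds one entry per block, so it stays small, but a lookup still has to search inside the chosen block.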

The advantages and disadvantages of using wide and tall tables in HBase are summarized as follows:

Query performance

Tall tables are better. The query conditions are all in the row key, which is part of the global distributed index. A row in a tall table holds less data, so the query cache (BlockCache) can hold more rows, and throughput measured in rows is higher.
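A back-of-the-envelope calculation makes the cache argument concrete. The cache size and row sizes below are hypothetical, and the estimate ignores per-cell key overhead and block boundaries.

```python
def rows_cached(cache_bytes, row_bytes):
    """Rough count of whole rows that fit in a cache of the given size."""
    return cache_bytes // row_bytes

CACHE_BYTES = 1024 * 1024   # hypothetical 1 MiB of cache
TALL_ROW_BYTES = 100        # hypothetical small row
WIDE_ROW_BYTES = 10_000     # hypothetical large row

tall_rows = rows_cached(CACHE_BYTES, TALL_ROW_BYTES)
wide_rows = rows_cached(CACHE_BYTES, WIDE_ROW_BYTES)
print(tall_rows, wide_rows)
```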

Sharding ability

Tall tables shard at a finer granularity, and the shard sizes are more balanced. HBase shards by row, and a tall table has little data in each row while a wide table has a lot.
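The effect of row size on shard balance can be sketched with a simple greedy packer that never splits a row across shards. This is not HBase's actual split policy; `pack_rows`, the size limit, and the row sizes are hypothetical.

```python
def pack_rows(row_sizes, limit):
    """Fill shards in row order; a row is never split across two shards."""
    shards, current = [], 0
    for size in row_sizes:
        if current and current + size > limit:
            shards.append(current)
            current = 0
        current += size
    if current:
        shards.append(current)
    return shards

tall_shards = pack_rows([10] * 100, limit=250)            # many small rows
wide_shards = pack_rows([300, 50, 400, 250], limit=250)   # few large rows

print(tall_shards, max(tall_shards) - min(tall_shards))
print(wide_shards, max(wide_shards) - min(wide_shards))
```

Both inputs hold the same total amount of data, but the large rows force uneven shard sizes, including shards that exceed the limit because a single row cannot be divided.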

Metadata overhead

Tall tables have higher metadata overhead. A tall table has many rows and many row keys, which may result in a large number of regions and a larger data volume in the -ROOT- and .META. tables. Excessive metadata overhead may destabilize the HBase cluster and put a greater burden on the master (this aspect will be summarized later).

Transactional capability

Wide tables offer better transactional guarantees. A write to a single row (Put) is atomic in HBase: either all columns of the row are written or none are. There is no transactional guarantee across updates to multiple rows.
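The difference between the two guarantees can be shown with a toy in-memory table. `ToyTable` is a hypothetical stand-in written for this illustration, not the HBase client API, and rejecting `None` values is just a convenient way to simulate a failed write.

```python
class ToyTable:
    """In-memory stand-in for a table; not the HBase client API."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, columns):
        # Validate every column first so that a failure changes nothing.
        for qualifier, value in columns.items():
            if value is None:
                raise ValueError(f"missing value for {qualifier}")
        # Build the new row off to the side, then swap it in as one step.
        new_row = dict(self.rows.get(row_key, {}))
        new_row.update(columns)
        self.rows[row_key] = new_row

    def put_many(self, puts):
        # No rollback: rows written before a failure stay written.
        for row_key, columns in puts:
            self.put(row_key, columns)

table = ToyTable()
table.put("cust001", {"name": "A", "level": "gold"})

try:
    table.put("cust001", {"name": "B", "level": None})  # rejected as a whole
except ValueError:
    pass

try:
    table.put_many([("cust002", {"name": "C"}), ("cust003", {"name": None})])
except ValueError:
    pass

print(table.rows)
```

The failed single-row write leaves `cust001` untouched, while the failed multi-row write leaves `cust002` written and `cust003` missing.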

Data compression ratio

If we compress the data within a row, a wide table achieves a higher compression ratio. A wide-table row holds a large amount of data, which tends to contain more similar byte sequences, and that improves the compression ratio. Compression also alleviates the problems of oversized wide-table rows and uneven shard sizes. At query time, we locate the compressed data by row key and decompress it. Decompression can run on the HBase server through a coprocessor rather than on the business application's server, so that the CPU capacity of the HBase cluster is fully used.
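The compression argument can be checked with a small experiment. The sketch uses `zlib` from the Python standard library as a stand-in codec, and the field values are hypothetical; it compares compressing all values of one row together with compressing each value on its own.

```python
import zlib

# Hypothetical field values that share a lot of structure,
# as the columns of one wide row often do.
values = [
    f"status=ACTIVE;branch=0001;seq={i:05d}".encode("utf-8")
    for i in range(200)
]

# Wide layout: all values of the row are compressed together.
packed = zlib.compress(b"\n".join(values))
together = len(packed)

# Tall layout: every value is compressed on its own.
separately = sum(len(zlib.compress(v)) for v in values)

# Reading the row back means decompressing and splitting it again.
restored = zlib.decompress(packed).split(b"\n")

print(together, separately)
```

Compressing the values together lets the codec exploit repetition across fields, which per-value compression cannot see.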

Summary

When designing a table, do not pursue a purely tall table or a purely wide table. Strike a balance between the two.

Based on the query pattern, query fields that need distributed indexing and sharding and that have high selectivity (that is, the query conditions quickly narrow the search to a small range of rows) should be placed in the row key. Fields that divide the data evenly into many segments should also be placed in the row key as the basis for sharding. Query fields that have low selectivity and do not need to serve as the basis for sharding go in the column family and column qualifier instead of the row key.
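The value of putting a highly selective field in the row key can be sketched with a prefix scan over sorted keys. The field names `customer_id` and `event_date`, the `|` separator, and both helper functions are hypothetical; a sorted Python list stands in for the sorted row keys.

```python
import bisect

def make_row_key(customer_id, event_date):
    """Put the most selective field first so related rows sort together."""
    return f"{customer_id}|{event_date}"

def prefix_scan(sorted_keys, prefix):
    """Return the contiguous slice of keys that start with the prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    # "\uffff" acts as an upper bound for keys made of ordinary characters.
    end = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:end]

keys = sorted(
    make_row_key(customer, day)
    for customer in ["c001", "c002", "c003"]
    for day in ["20240101", "20240102"]
)
hits = prefix_scan(keys, "c002|")
print(hits)
```

Because the keys are sorted, a condition on the leading field reduces the search to one small contiguous range instead of a full scan.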

Wide table summary

1. The benefits of wide tables

In the current project, a large number of wide tables are used. Five of them have more than 150 fields each: the customer organization-level information table, the customer customer-manager-level information table, the customer manager information table, the group customer information table, and the strategic customer information table.

As the table names suggest, this is a CRM project, and the advantages I list here are the ones that showed up in this project.

What are the benefits of extensive use of wide tables?

  • The whole picture at a glance
  • Fast query speed (I have always had doubts about this)

The whole picture at a glance: this benefit is obvious, and report developers especially welcome it. They do not care how the data is stored, only whether the report is convenient to build and can be delivered quickly. The same goes for data analysts. Most data analysts in the banking industry hate doing table joins; they like to see the full picture of a customer in a single record.

Fast query speed: this is true in most cases. Most slow SQL is caused by table joins, so everyone tries to get results without joining. As a result, commonly used information is gathered into wide tables.

2. The inconveniences of wide tables

From the ETL point of view, the pain of using wide tables far outweighs the happiness they bring. Below I explain the problems caused by wide tables from a system perspective.

The project is a CRM system that uses warehouse data (TD) as the basis for modeling and data processing. After the data lands in the warehouse as a data mart, it is synchronized to the query server (DB2). About 16 GB of data is sent from TD to DB2 every day. The synchronization method is to export the data from TD into txt files, upload them to an FTP server, and then load them into DB2 through shell scripts.

With this background in mind, the inconveniences of using wide tables can be listed as follows:

  • The script is too long to maintain
  • The addition and deletion of fields is too complicated

Because the same data is kept in sync between two tables in the two systems, adding or deleting a field requires thinking about consistency as well as how to handle historical data. Each change goes through the following steps:

  1. Modify the TD table structure
  2. Modify the TD script
  3. Modify the export data script
  4. Modify the DB2 table structure
  5. Modify the shell script of DB2 to load data

Every change brings a lot of work. In addition, the new field often requires loading new historical data or updating the field's contents. This adds a great deal of work to the later maintenance of the project.

3. How to use wide tables gracefully

Wide tables are widely used in the summary layer of the warehouse. General entities such as customers, deposits, and loans are all designed as wide tables. Will these wide tables face the problems I mentioned above?

Obviously not. Warehouse tables rarely gain or lose fields because of individual requirements. Warehouses are also mostly responsible only for storing raw data and do not involve overly complex field-processing logic. Therefore, most summary-layer tables are designed as wide tables.

To use wide tables gracefully, I think we need to pay attention to the following:

Will fields be added or removed frequently?
For tables that are not involved in transactions and whose fields do not change frequently, a wide table design is recommended, especially for BI systems.

Origin blog.csdn.net/qq_32727095/article/details/114023121