Why it is not recommended to use too many column families in HBase

We know that an HBase table can be set up with multiple column families, but why is it better to have as few column families as possible?

 

From the official documentation:

HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low. Currently, flushing and compactions are done on a per Region basis so if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed even though the amount of data they carry is small. When many column families exist the flushing and compaction interaction can make for a bunch of needless i/o (To be addressed by changing flushing and compaction to work on a per column family basis).

 

Recall how an HBase table is laid out: each table is split into multiple regions, each region is a subset of the table's data, and the regions are distributed across the RegionServers of the HBase cluster.

The data of each region is made up of Stores, one per column family. Each Store consists of one MemStore and multiple HFiles (that is, one column family corresponds to one MemStore and N HFiles).

When a flush condition is reached, each MemStore that is flushed generates one HFile. As HFiles accumulate, a background minor compaction thread is triggered to merge them.

Here is the key point: flush and compaction are both carried out at the region level!
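To make the region-level behavior concrete, here is a minimal Python sketch. It is only a toy model, not HBase code: the `Store` and `Region` classes, the `simulate` function, and the write sizes (64 MB and 1 MB per round) are all made up for illustration. The only value taken from this article is the 128 MB flush size.

```python
from dataclasses import dataclass, field

FLUSH_SIZE_MB = 128  # hbase.hregion.memstore.flush.size default cited in this article


@dataclass
class Store:
    """Toy model of one column family in a region: one MemStore plus N HFiles."""
    family: str
    memstore_mb: float = 0.0
    hfiles_mb: list = field(default_factory=list)


@dataclass
class Region:
    """Toy model of a region holding one Store per column family."""
    stores: dict  # column family name -> Store

    def put(self, family, size_mb):
        """Buffer a write in one family's MemStore, then check the flush condition."""
        self.stores[family].memstore_mb += size_mb
        if any(s.memstore_mb >= FLUSH_SIZE_MB for s in self.stores.values()):
            self.flush()

    def flush(self):
        """Region-level flush: every non-empty MemStore is written out as a new HFile."""
        for s in self.stores.values():
            if s.memstore_mb > 0:
                s.hfiles_mb.append(s.memstore_mb)
                s.memstore_mb = 0.0


def simulate(rounds):
    """Write to a heavy and a light family (hypothetical sizes) for a number of rounds."""
    region = Region({"big": Store("big"), "small": Store("small")})
    for _ in range(rounds):
        region.put("big", 64)   # heavy family, assumed 64 MB per round
        region.put("small", 1)  # light family, assumed 1 MB per round
    return region


if __name__ == "__main__":
    region = simulate(4)
    for store in region.stores.values():
        print(store.family, store.hfiles_mb)
```

In this toy run of four rounds, the "big" family produces two 128 MB HFiles, while the "small" family is dragged along and produces two tiny HFiles of 1 MB and 2 MB, even though its own MemStore never came close to the threshold.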

 

For example, at flush time, if a region has multiple MemStores (multiple column families), then as soon as one MemStore reaches a flush condition, the other MemStores are flushed along with it even if they hold very little data, which causes a lot of unnecessary I/O overhead. The flush trigger conditions are as follows:

  1. MemStore-level limit: when any one MemStore in a region reaches the size limit (hbase.hregion.memstore.flush.size, default 128 MB), a MemStore flush is triggered.
  2. Region-level limit: when the total size of all MemStores in a region reaches the limit (hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size, default 2 * 128 MB = 256 MB), a MemStore flush is triggered.
  3. RegionServer-level limit: when the total size of all MemStores on a RegionServer reaches the limit (hbase.regionserver.global.memstore.upperLimit * hbase_heapsize, by default 40% of the JVM heap), some of the MemStores are flushed. Flushes are executed in descending order of MemStore size: the region with the largest MemStore is flushed first, then the next largest, until overall MemStore memory usage falls below the threshold (hbase.regionserver.global.memstore.lowerLimit * hbase_heapsize, by default 38% of the JVM heap).
  4. When the number of HLogs on a RegionServer reaches the upper limit (configurable via hbase.regionserver.maxlogs), the system picks the oldest HLog and flushes the one or more regions that correspond to it.
  5. HBase flushes MemStores periodically: the default period is one hour, which ensures that no MemStore goes unpersisted for a long time. To avoid the problems caused by all MemStores flushing at the same moment, the periodic flush has a random delay of around 20,000 ms.
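The arithmetic behind the first three conditions can be sketched in a few lines of Python. This is an illustration under assumptions, not HBase's actual implementation: the function names `flush_thresholds` and `regions_to_flush`, the 1000 MB heap, and the per-region MemStore sizes are hypothetical, and the defaults are simply the ones quoted in the list above.

```python
def flush_thresholds(heap_mb, flush_size_mb=128, block_multiplier=2,
                     upper_pct=40, lower_pct=38):
    """Return the thresholds (in MB) for trigger conditions 1 to 3.

    Defaults are the values quoted in this article; heap_mb is an assumption
    supplied by the caller.
    """
    return {
        "memstore": flush_size_mb,                    # condition 1
        "region": block_multiplier * flush_size_mb,   # condition 2
        "server_upper": heap_mb * upper_pct / 100,    # condition 3, start flushing
        "server_lower": heap_mb * lower_pct / 100,    # condition 3, stop flushing
    }


def regions_to_flush(region_memstore_mb, heap_mb, upper_pct=40, lower_pct=38):
    """Sketch of condition 3: pick regions largest-MemStore-first until the
    total MemStore usage drops below the lower limit."""
    total = sum(region_memstore_mb.values())
    if total < heap_mb * upper_pct / 100:
        return []  # upper limit not reached, nothing to do
    chosen = []
    by_size = sorted(region_memstore_mb.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        if total < heap_mb * lower_pct / 100:
            break
        chosen.append(name)
        total -= size
    return chosen


if __name__ == "__main__":
    print(flush_thresholds(heap_mb=1000))
    # Hypothetical per-region MemStore totals on one RegionServer, in MB.
    print(regions_to_flush({"r1": 200, "r2": 150, "r3": 60}, heap_mb=1000))
```

With the assumed 1000 MB heap, the three regions hold 410 MB in total, which is above the 400 MB upper limit. Flushing the largest region alone brings usage to 210 MB, below the 380 MB lower limit, so only that region is selected.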

 

Compaction is likewise performed at the region level, so it also generates unnecessary I/O overhead. The condition that triggers a minor compaction is:

hbase.hstore.compactionThreshold

  Description

  If more than this number of HStoreFiles in any one HStore (one HStoreFile is written per flush of memstore) then a compaction is run to rewrite all HStoreFiles files as one. Larger numbers put off compaction but when it runs, it takes longer to complete.

  Default: 3
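The effect on a small column family can be sketched as follows. This is a simplified model that follows only the description quoted above (more than the threshold number of files means all files are rewritten as one); the file selection in real HBase is more involved. The function names and the 1 MB flush size are assumptions for illustration.

```python
COMPACTION_THRESHOLD = 3  # hbase.hstore.compactionThreshold default quoted above


def flush_and_compact(store_files, flushed_mb, threshold=COMPACTION_THRESHOLD):
    """Add one flushed file to a store and compact if the file count exceeds
    the threshold.

    Returns (new_file_list, mb_rewritten_by_compaction).
    """
    files = store_files + [flushed_mb]
    if len(files) > threshold:
        merged = sum(files)
        return [merged], merged  # all files rewritten as one
    return files, 0


def simulate_small_family(num_flushes, flushed_mb=1):
    """Feed a store a series of tiny flushes (assumed 1 MB each), as happens to
    a small family that is flushed whenever a large family forces a flush."""
    files, compactions, rewritten = [], 0, 0
    for _ in range(num_flushes):
        files, cost = flush_and_compact(files, flushed_mb)
        if cost:
            compactions += 1
            rewritten += cost
    return files, compactions, rewritten


if __name__ == "__main__":
    print(simulate_small_family(7))
```

In this model, seven 1 MB flushes lead to two compactions that rewrite 11 MB in total in order to store 7 MB of data, which illustrates how tiny forced flushes turn into repeated rewrite work.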

 

Where multiple ColumnFamilies exist in a single table, be aware of the cardinality (i.e., number of rows). If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion rows, ColumnFamilyA’s data will likely be spread across many, many regions (and RegionServers). This makes mass scans for ColumnFamilyA less efficient.

The reason lies in how region splits are triggered. We know that splits in HBase are controlled by the hbase.hregion.max.filesize parameter. However, a split is not triggered when the combined size of all HFiles in the region reaches this value; it is triggered when a single HFile in the region reaches it. So when ColumnFamilyB causes a split, ColumnFamilyA is split along with it even though it holds very little data, which produces more small HDFS files scattered across more regions and more RegionServers.
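A rough back-of-the-envelope calculation shows the scale of the effect. The row counts come from the quoted example; the 1,000 bytes per row for ColumnFamilyB and the 10 GB value for hbase.hregion.max.filesize are assumptions chosen for illustration, as are the function names. The sketch also assumes rows are spread evenly over the key space, and it gives only a lower bound on the region count, since regions are smaller than the limit right after a split.

```python
def estimate_regions(rows_large, bytes_per_row_large, max_filesize_bytes):
    """Rough lower bound on the number of regions once the large column family
    has driven all the splits (integer ceiling division)."""
    total = rows_large * bytes_per_row_large
    return max(1, (total + max_filesize_bytes - 1) // max_filesize_bytes)


def small_family_rows_per_region(rows_small, regions):
    """Average number of the small family's rows in each region, assuming an
    even distribution of row keys."""
    return rows_small // regions


if __name__ == "__main__":
    regions = estimate_regions(
        rows_large=10**9,                  # ColumnFamilyB: 1 billion rows
        bytes_per_row_large=1000,          # assumed average row size
        max_filesize_bytes=10 * 10**9,     # assumed hbase.hregion.max.filesize
    )
    per_region = small_family_rows_per_region(10**6, regions)  # ColumnFamilyA
    print(regions, per_region)
```

Under these assumptions, ColumnFamilyB forces at least 100 regions, so a full scan of ColumnFamilyA has to visit all of them to read about 10,000 rows from each.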

 


Origin www.cnblogs.com/dtmobile-ksw/p/11373986.html