block locality

Because the DataNode and the RegionServer are usually deployed on the same machine, the concept of locality arises.

HBase achieves locality through HDFS block replication. When a block is replicated, the HDFS client by default chooses the replica locations as follows:

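The first replica is written to the local node;
The second replica is written to a random node on another rack;
The third replica is written to a different, randomly chosen node on the same rack as the second;
Any further replicas are written to random nodes in the cluster.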
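
The placement rules above can be sketched in a few lines of Python. This is only a toy model, not the HDFS implementation: the function name place_replicas, the racks dictionary and the node names are all hypothetical, and the sketch assumes a cluster with at least two racks of at least two nodes each.

```python
import random


def place_replicas(writer, racks, replication=3, rng=None):
    """Pick nodes for one block, following the rules listed above.

    writer: node doing the write (the RegionServer's host).
    racks:  dict mapping rack name -> list of node names (hypothetical layout).
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in racks.items() for node in nodes}

    chosen = [writer]  # 1st replica: the local node
    if replication >= 2:
        # 2nd replica: a random node on a different rack
        remote = [n for n in rack_of if rack_of[n] != rack_of[writer]]
        chosen.append(rng.choice(remote))
    if replication >= 3:
        # 3rd replica: another node on the same rack as the 2nd
        second_rack = rack_of[chosen[1]]
        peers = [n for n in racks[second_rack] if n not in chosen]
        chosen.append(rng.choice(peers))
    # Any further replicas: random nodes anywhere in the cluster
    while len(chosen) < replication:
        rest = [n for n in rack_of if n not in chosen]
        chosen.append(rng.choice(rest))
    return chosen


racks = {"rack1": ["rs1", "rs2"], "rack2": ["rs3", "rs4"]}
replicas = place_replicas("rs1", racks)
```

Whatever the random choices are, replicas[0] is always the writer itself, which is why a file written by a RegionServer starts out local to it.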


In this way, HBase achieves locality for a Region after a flush or a compaction: the RegionServer writes the new files, so their first replica lands on its own node.

After a RegionServer failover (or a rebalance or restart), a RegionServer may be assigned Regions whose StoreFiles are not local, because no replica of them is on that machine at that time. However, as new data is written to the Region, or when the table is compacted and the StoreFiles are rewritten, the Region becomes "local" to its RegionServer again.

A related metric is "data locality": the percentage of a Region's StoreFile data that is stored on the local node.
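
As a rough illustration of how such a percentage can be computed, here is a minimal sketch. The function locality_index and the block tuples are hypothetical stand-ins for what HDFS reports per block; returning 0.0 for an empty Region is a choice of this sketch, not a documented HBase behaviour.

```python
def locality_index(blocks, host):
    """Fraction of StoreFile bytes that have a replica on `host`.

    blocks: list of (size_in_bytes, hosts_holding_a_replica) tuples.
    """
    total = sum(size for size, _ in blocks)
    if total == 0:
        return 0.0
    local = sum(size for size, hosts in blocks if host in hosts)
    return local / total


# Three blocks of one Region; "rs1" holds a replica of the first and third.
blocks = [
    (128, {"rs1", "dn2", "dn3"}),
    (128, {"dn2", "dn3", "dn4"}),
    (64, {"rs1", "dn4", "dn5"}),
]
share = locality_index(blocks, "rs1")  # (128 + 64) / 320
```

Here 192 of 320 bytes are available locally, so the data locality of this Region on rs1 is 60%.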

My own understanding is this: a Region is the portion of a table's data served by one machine. That is a logical concept, though. A Region contains several HStores (one per column family), and each HStore contains several HFiles. The HFiles are the files that are actually stored; they live on HDFS as blocks, and those blocks may sit on different machines, which is why the notion of a locality rate exists.

The DataNode and the RegionServer are usually deployed on the same machine, so the data of the Regions that a RegionServer manages is written to the local node first, which saves network overhead. If block locality is low, the cluster has most likely just been balanced or restarted. After a compaction, the Region's data is written to the DataNode on the current machine, and block locality gradually approaches 100%.

In other words, when the locality rate is low, it can be raised by running a compaction.
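
To make the effect concrete, the following toy simulation follows one Region through a failover, a flush and a major compaction. It is a sketch under simplifying assumptions: the helper names, host names and file sizes are invented, and the non-local replicas are fixed rather than chosen at random.

```python
def locality(store_files, host):
    """Share of bytes in store_files that have a replica on host."""
    total = sum(size for size, _ in store_files)
    local = sum(size for size, hosts in store_files if host in hosts)
    return local / total if total else 0.0


def compact(store_files, host, other_hosts):
    """Toy major compaction: merge all files into one new file written by host.

    The first replica of the new file lands on the writer (host); the
    remaining replicas are simplified to a fixed list of other hosts.
    """
    merged_size = sum(size for size, _ in store_files)
    return [(merged_size, {host, *other_hosts})]


# The Region used to be served by rs1, so its files have a replica there.
files = [(100, {"rs1", "dn2", "dn3"}), (50, {"rs1", "dn3", "dn4"})]

# After a failover the Region is reassigned to rs5, which holds no replica.
before = locality(files, "rs5")

# A flush adds a small new file written locally by rs5.
files.append((10, {"rs5", "dn2", "dn3"}))
after_flush = locality(files, "rs5")

# A major compaction rewrites all of the data through rs5.
files = compact(files, "rs5", ["dn2", "dn3"])
after_compact = locality(files, "rs5")
```

In this model locality on rs5 starts at 0, rises only slightly after the flush (10 of 160 bytes), and reaches 1.0 once the compaction has rewritten every file locally.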
 


Origin blog.csdn.net/qq_31780525/article/details/101068977