Big Data Interview Series: HBase

HBase is a distributed, column-oriented database.
1. What are the characteristics of HBase?
1. Distributed architecture: HBase stores data across a cluster, and the data ultimately lands on HDFS.
2. It is a NoSQL (non-relational) database and does not follow the normal forms of relational databases.
3. Column-oriented storage, built on an underlying key-value structure.
4. Suited to storing semi-structured and unstructured data.
5. Suited to storing sparse data: null values take up no space.
6. Provides real-time create, read, update, and delete operations, but no strict transaction mechanism; only row-level transactions are supported.
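The key-value and sparse-storage points can be illustrated with a toy model. This is only a sketch of the logical layout, not HBase client code; the `cells` map and the `put` and `get_row` helpers are hypothetical names made up for illustration.

```python
# Toy model of HBase's logical data layout: a map of
# (row key, "family:qualifier", timestamp) -> value. Not HBase client code.
cells = {}

def put(row, column, value, ts):
    """Store one cell version; columns that are never written take no space."""
    cells[(row, column, ts)] = value

def get_row(row):
    """Return the newest value of every column that exists for this row."""
    latest = {}
    for (r, column, ts), value in sorted(cells.items()):
        if r == row:
            latest[column] = value  # ascending ts order, so the newest wins
    return latest

put("user1", "info:name", "Ann", ts=1)
put("user1", "info:email", "ann@example.com", ts=1)
put("user2", "info:name", "Bob", ts=1)    # user2 has no email: nothing is stored
put("user1", "info:name", "Anna", ts=2)   # newer version of the same cell

print(get_row("user1"))
print(get_row("user2"))
print(len(cells))
```

Because a missing column is simply an absent key, a row with few columns costs nothing for the columns it lacks.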

2. HBase architecture components and their roles
1. ZooKeeper: distributed coordination. Each RegionServer writes its own information to ZooKeeper.
2. HDFS: the underlying file system that HBase runs on.
3. RegionServer: can be thought of as the data node that stores the data.
4. Master: the RegionServers report their status to the Master in real time. Because the Master knows the global state of the RegionServers, it can coordinate RegionServer failover and Region splitting.

3. Characteristics of row-oriented and column-oriented storage
1. In row storage, a row is stored contiguously on disk; in column storage, a row is stored non-contiguously.
2. Write performance: the fewer the writes, the higher the performance, because every write requires the disk head to move, which incurs seek time. Row storage writes a record once, while column storage needs several writes for the same record, so row storage has the advantage on writes.
3. Read performance:
a. If the whole table is read, row storage performs better.
b. If only specified columns are read, row storage reads the other, redundant columns as well and has to eliminate them in memory. Column storage reads no redundant columns.
4. Data types: in row storage, the fields of a row may have different types, so data type conversion happens frequently. In column storage, the values of one column generally share a type, so frequent conversion is avoided and a compression algorithm better suited to the data can be chosen.
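The read-performance comparison can be sketched in a few lines. This is an in-memory illustration under simplified assumptions (fixed-width rows, three made-up records), and the function names are hypothetical; it only counts how many values each layout has to touch to read one column.

```python
# Three records with the same schema: (id, name, age).
records = [(1, "Ann", 30), (2, "Bob", 25), (3, "Cid", 41)]

# Row layout: each record's fields sit next to each other.
row_store = [field for record in records for field in record]

# Column layout: all values of one column sit next to each other.
column_store = {
    "id": [r[0] for r in records],
    "name": [r[1] for r in records],
    "age": [r[2] for r in records],
}

def ages_from_rows(store, width=3, age_index=2):
    """Scan every field of every row, keeping only the age column."""
    touched = len(store)
    ages = [store[i + age_index] for i in range(0, len(store), width)]
    return ages, touched

def ages_from_columns(store):
    """Read just the age column; nothing else is touched."""
    ages = store["age"]
    return ages, len(ages)

print(ages_from_rows(row_store))
print(ages_from_columns(column_store))
```

Both calls return the same ages, but the row layout touches all nine stored values while the column layout touches three.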

4. HBase concepts: row key, column family, physical model, and table design principles
Row key: built into every HBase table; each row of data corresponds to one row key.

Column family: a set of columns, specified when the table is created. Each column family is stored as separate files. The stored data is byte arrays, and a cell can hold multiple versions of its data, distinguished by timestamp.

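Physical model: an entire HBase table is split into multiple Regions. Each Region records its start row key, and the Regions are stored on different nodes, so a query can run against the nodes in parallel. When a table has many Regions, the .META. table stores the start key of each Region, and -ROOT- in turn stores the location of .META. (This describes older releases; since HBase 0.96 the -ROOT- table has been removed and the location of the meta table, renamed hbase:meta, is kept in ZooKeeper.)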
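Finding the Region for a row key amounts to a search over sorted start keys. The sketch below assumes a hypothetical four-Region table with made-up start keys and server names; it shows the lookup idea only, not how the HBase client is implemented.

```python
import bisect

# Hypothetical region table: sorted start keys and the node hosting each region.
# The first region starts at the empty key, so every row key falls somewhere.
start_keys = [b"", b"g", b"n", b"t"]
servers = ["rs1", "rs2", "rs3", "rs1"]

def locate(row_key):
    """Return (region index, server) of the region whose range holds row_key."""
    index = bisect.bisect_right(start_keys, row_key) - 1
    return index, servers[index]

print(locate(b"apple"))
print(locate(b"n"))
print(locate(b"zebra"))
```

A key equal to a start key belongs to the Region that starts there, which is why `bisect_right` is used rather than `bisect_left`.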

Rowkey design principles: keep the data balanced across column families; follow the length principle and the adjacency principle; when creating a table, configure it to be cached in the RegionServer; avoid auto-incrementing and time-based keys; use byte arrays instead of strings; the maximum length is 64 KB, but within 16 bytes is preferable; split tables by day, with two bytes for a hash and four bytes to store the hour, minute, and millisecond.
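One possible reading of the "two bytes of hash, four bytes of time" layout is sketched below. The `make_row_key` function, the device id in the middle of the key, and the choice of SHA-256 are assumptions added for illustration; the source only gives the byte budget.

```python
import hashlib
import struct

def make_row_key(device_id: str, millis_of_day: int) -> bytes:
    """2-byte hash prefix + device id + 4-byte big-endian milliseconds of the day."""
    id_bytes = device_id.encode("utf-8")
    prefix = hashlib.sha256(id_bytes).digest()[:2]  # spreads devices across Regions
    suffix = struct.pack(">I", millis_of_day)       # big-endian keeps time order when sorted
    return prefix + id_bytes + suffix

morning = make_row_key("dev-0042", 9 * 3600 * 1000)
evening = make_row_key("dev-0042", 21 * 3600 * 1000)

print(len(morning))
print(morning[:2] == evening[:2])
print(morning < evening)
```

With an 8-character id the key is 14 bytes, inside the 16-byte guideline. Milliseconds within one day fit in four bytes because a day has 86,400,000 of them, which is why the time part pairs naturally with one table per day.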

Column family design principles: use as few as possible (data is stored by column family but read by Region, so extra families cause unnecessary I/O); put frequently used and infrequently used data into different column families; keep column family names as short as possible.

5. HBase read and write process, in brief
Read:
Find the RegionServer hosting the Region where the data to be read is located, then read in the following order: look in the BlockCache first; if the data is not there, look in the MemStore; if it is not there either, read from the HFiles.
Write:
Find the RegionServer hosting the Region to write to. The data is first written to the WAL (write-ahead log), then written to the MemStore, where it waits to be flushed. Once both writes are complete, the client gets its reply.
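The order of steps above can be modeled with a small class. `ToyRegion` is a hypothetical teaching model, not HBase code: real reads merge several sources, and the cache invalidation in `put` is a simplification that only exists to keep this toy consistent.

```python
class ToyRegion:
    """Toy model of the read and write path described above; not HBase code."""

    def __init__(self):
        self.wal = []          # append-only log, written first
        self.memstore = {}     # in-memory writes waiting for a flush
        self.hfiles = []       # flushed, immutable files (newest last)
        self.block_cache = {}  # values cached from earlier HFile reads

    def put(self, key, value):
        self.wal.append((key, value))    # 1. log the edit for recovery
        self.memstore[key] = value       # 2. apply it in memory
        self.block_cache.pop(key, None)  # toy-only: avoid serving a stale value
        return "ack"                     # 3. reply to the client

    def flush(self):
        """Move the MemStore contents into a new immutable file."""
        if self.memstore:
            self.hfiles.append(dict(self.memstore))
            self.memstore.clear()

    def get(self, key):
        if key in self.block_cache:
            return self.block_cache[key], "BlockCache"
        if key in self.memstore:
            return self.memstore[key], "MemStore"
        for hfile in reversed(self.hfiles):  # newest file first
            if key in hfile:
                self.block_cache[key] = hfile[key]
                return hfile[key], "HFile"
        return None, "miss"

region = ToyRegion()
print(region.put("row1", "a"))
print(region.get("row1"))   # served from the MemStore before any flush
region.flush()
print(region.get("row1"))   # first read after the flush goes to an HFile
print(region.get("row1"))   # the repeat read is served from the cache
```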

6. Describe the problems caused by Regions that are too small or too large in HBase, and how to solve them
If a Region is too large, compaction runs many times over and the data is read from and written to HDFS again and again, tying up I/O. If a Region is too small, it splits repeatedly, and a Region goes offline while it splits, which affects access to the service. Adjust hbase.hregion.max.filesize to 256 MB.

7. HBase table design principles
1. Number of column families and column family cardinality
Keep the number of column families in an HBase table as small as possible. At present, HBase does not handle more than two or three column families well. This is because flushing and compaction in HBase are done per Region: when the data in one column family reaches the flush threshold, all column families in the table are flushed at the same time. This causes a lot of unnecessary I/O, and the more column families there are, the greater the impact.
Also consider the difference in the number of rows stored in different column families of the same table, that is, the column family cardinality. When two column families differ greatly in row count, the data of the family with fewer rows ends up spread over many Regions, and those Regions may be stored on different RegionServers. Queries and scans against that family therefore become less efficient.
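The flush behavior described above can be simulated to see why a small column family suffers. This sketch models only the rule as stated in the text (one family reaching the threshold flushes them all); the threshold value and the function name are hypothetical, and real flush policies may differ between HBase versions.

```python
FLUSH_THRESHOLD = 100  # hypothetical threshold, in arbitrary size units

def write_and_collect_flushes(family_names, writes):
    """Apply writes and return the sizes of the files each family flushes."""
    memstores = {name: 0 for name in family_names}
    files = {name: [] for name in family_names}
    for family, size in writes:
        memstores[family] += size
        if memstores[family] >= FLUSH_THRESHOLD:
            # One family hit the threshold: every family in the Region is flushed.
            for name in family_names:
                if memstores[name] > 0:
                    files[name].append(memstores[name])
                    memstores[name] = 0
    return files

# A busy family and a nearly idle one, written alternately.
writes = [("big", 50), ("small", 1)] * 10
files = write_and_collect_flushes(["big", "small"], writes)
print(files)
```

The small family ends up with as many files as the big one, each only one or two units in size, and every one of them is extra I/O at flush time and extra work for later compactions.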

2. Row key (RowKey) design
First, avoid time-series or monotonically increasing or decreasing row keys. When data arrives, HBase first uses the row key to determine where the record is stored, that is, which Region it belongs to. With time-series or monotonic row keys, continuously arriving data is all assigned to the same Region while the other Regions and RegionServers sit idle, which is the last thing you want in a distributed system.
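A common way around this hotspot is to prefix the key with a bucket number derived from a hash, often called salting. The sketch below assumes four buckets and a made-up `salted_key` helper; it only shows how the leading part of the key changes, not how Regions are pre-split.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 4  # hypothetical number of salt buckets / pre-split Regions

def salted_key(key: str) -> str:
    """Prefix the key with a bucket number derived from a hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = digest[0] % NUM_BUCKETS
    return f"{bucket}-{key}"

# 1000 keys that arrive in increasing timestamp order.
timestamps = [str(1_700_000_000_000 + i) for i in range(1000)]

# Unsalted: every key shares the same leading digits, so they sort together.
plain_prefixes = Counter(key[:1] for key in timestamps)
# Salted: the leading bucket number spreads the keys out.
salted_prefixes = Counter(salted_key(key)[:1] for key in timestamps)

print(plain_prefixes)
print(sorted(salted_prefixes.items()))
```

The trade-off is on the read side: a scan over a time range now has to query every bucket and merge the results.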

3. Keep row keys and column family names as small as possible
In HBase, a specific value is located by its row key, its column (column family:column), and a timestamp. To speed up random access, HBase builds its indexes on "row key + column family:column + timestamp + value". If the row key and column family names are too large, even larger than the value itself, the index size grows accordingly. HBase tables often hold a great many records, so large, repeated row keys and column names not only inflate the indexes but also add to the load on the system.
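A rough calculation shows how much of each cell the key can take up. The `cell_bytes` helper and the example names are hypothetical, and the estimate deliberately ignores the fixed-size length fields that HBase also stores per cell.

```python
TIMESTAMP_BYTES = 8  # HBase timestamps are 64-bit longs

def cell_bytes(row_key: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate size of one stored cell, ignoring fixed length fields."""
    return len(row_key) + len(family) + len(qualifier) + TIMESTAMP_BYTES + len(value)

value = b"42"
long_names = cell_bytes(b"customer-0000000001", b"measurements", b"temperature", value)
short_names = cell_bytes(b"c0000000001", b"m", b"t", value)

print(long_names, short_names)
```

In both cases the value is two bytes; the rest is key material repeated for every cell, so shortening the names cuts the per-cell size by more than half in this example.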

4. Number of versions
The default is three in older releases (since HBase 0.96 the default is one). It can be set through HColumnDescriptor; setting it too high is not recommended.
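The effect of a version limit can be sketched as follows. `put_version` is a hypothetical helper that trims a cell's versions immediately on write; in HBase itself, excess versions are cleaned up later rather than at write time, so treat this as a model of the outcome only.

```python
def put_version(versions, timestamp, value, max_versions=3):
    """Add a (timestamp, value) pair and keep only the newest max_versions."""
    versions.append((timestamp, value))
    versions.sort(reverse=True)   # newest first
    del versions[max_versions:]   # older versions are dropped
    return versions

cell = []
for ts, val in [(1, "a"), (2, "b"), (3, "c"), (4, "d")]:
    put_version(cell, ts, val)

print(cell)
```

After four writes with a limit of three, the oldest version is gone. A higher limit keeps more history but multiplies the stored data for every cell that is updated often.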

Origin blog.csdn.net/I_Demo/article/details/104187549