Big Data Interview Questions (6): HBase Interview Questions



Table of Contents

1. What are the characteristics of HBase?
2. What is the difference between HBase and Hive?
3. Describe the design principles for HBase rowkeys.
4. Describe the functions of scan and get in HBase, and the similarities and differences in their implementation.
5. How does Apache HBase split a region?


1. What are the characteristics of HBase?
1) Large: a table can have billions of rows and millions of columns.
2) Schema-free: each row has a sortable primary key and any number of columns; columns can be added dynamically as needed, and different rows in the same table can have completely different columns.
3) Column-oriented: storage and access control are organized by column (family), and column families are retrieved independently.
4) Sparse: empty (null) columns take up no storage space, so tables can be designed to be very sparse.
5) Multi-versioned data: each cell can hold multiple versions of its data; by default the version number is assigned automatically, as the timestamp at which the cell was inserted.
6) Single data type: all data in HBase is stored as strings (bytes), with no typing.
2. What is the difference between HBase and Hive?

1) What are they?
       Apache Hive is a data warehouse built on top of the Hadoop infrastructure. With Hive you can query data stored on HDFS using HQL, a SQL-like language that is ultimately translated into Map/Reduce jobs. Although Hive provides SQL-style queries, it is not suited to interactive querying, because it executes batch jobs on Hadoop.
       Apache HBase is a key/value store that runs on top of HDFS. Unlike Hive, HBase can read and write its data in real time rather than running MapReduce jobs. HBase data is partitioned across tables, and tables are further divided into column families. Column families must be declared in the schema, and a column family groups together columns of a certain type (the columns themselves require no schema definition). For example, a "message" column family might contain the columns "to", "from", "date", "subject", and "body". Each key/value pair in HBase is defined as a cell, and each key consists of row-key, column family, column, and timestamp. In HBase, a row is a collection of key/value mappings uniquely identified by its row-key. HBase leverages the Hadoop infrastructure and scales horizontally on commodity hardware.
2) Characteristics of each
       Hive helps people familiar with SQL run MapReduce jobs. Because it is JDBC-compatible, it can also be integrated with existing SQL tools. Hive queries take a long time to run, because by default they traverse all of the data in a table. Despite this drawback, the amount of data traversed can be controlled through Hive's partitioning mechanism. Partitioning allows a filtering operation to be applied to the query's data set: the data is stored in different folders, and at query time only the data in the specified folders (partitions) is traversed. This mechanism can be used, for example, to process only the files within a certain time range, as long as the file names include a time format.
       HBase works by storing key/value pairs. It supports four primary operations: add or update a row, retrieve the cells within a range, fetch a specified row, and delete a specified row, column, or version. Version information can be used to retrieve historical data (the historical data of each row can be deleted, after which HBase can reclaim the space through compactions). Although HBase has tables, a schema is required only for tables and column families, not for columns. HBase tables also provide increment/counter functionality.
3) Limitations of each
       Hive does not currently support update operations. In addition, because Hive runs batch jobs on Hadoop, a query usually takes a long time, from a few minutes to a few hours, before results are returned. Hive requires a predefined schema to map files and directories to columns, and Hive is not ACID-compliant.
       HBase queries are written in a specific language that has to be learned anew. SQL-like functionality can be achieved through Apache Phoenix, but at the cost of having to provide a schema. In addition, HBase is not compatible with all ACID properties, although it does support certain ones. Last but not least: to run HBase, ZooKeeper is a must; ZooKeeper provides distributed coordination services, including configuration, maintenance, and namespace metadata services.
4) Use cases for each
       Hive is suitable for analytical queries over data collected over a period of time, for example computing trends or summarizing website logs.
       Hive should not be used for real-time queries, because it takes a long time to return results.
       HBase is very well suited to real-time queries over big data. Facebook uses HBase for its messages and for real-time analytics. It can also be used to count Facebook connections.
5) Summary
       Hive and HBase are two different Hadoop-based technologies: Hive is a SQL-like engine that runs MapReduce jobs, while HBase is a NoSQL key/value database on top of Hadoop. Of course, the two tools can be used together. Just as one searches with Google and socializes with Facebook, Hive can be used for statistical queries and HBase for real-time queries; data can be written from Hive to HBase, and then written back from HBase to Hive.

3. Describe the design principles for HBase rowkeys.

1) Rowkey length principle
       A rowkey is a binary stream. Many developers suggest designing rowkeys of 10 to 100 bytes, but the recommendation is to keep them as short as possible, no more than 16 bytes, for the following reasons:

(1) Data is persisted in HFiles as KeyValue pairs. If the rowkey is too long, say 100 bytes, then 10 million rows of data would consume nearly 1 GB for the rowkeys alone, severely reducing the storage efficiency of HFiles.
(2) The MemStore caches part of the data in memory. If the rowkey is too long, the effective utilization of memory drops, the system cannot cache as much data, and retrieval efficiency suffers.
(3) Current operating systems are 64-bit with 8-byte memory alignment, so keeping the rowkey at 16 bytes (a multiple of 8 bytes) makes the best use of the operating system's alignment characteristics.

2) Rowkey hashing principle
       If rowkeys increase monotonically by timestamp, do not put the time at the front of the binary key. Instead, use the high-order bytes of the rowkey as a hash (salt) field, generated cyclically by the program, and put the time field in the low-order bytes. This improves the odds that data is distributed evenly across the RegionServers, achieving load balancing. Without a hash field, a key whose first field is the time produces a hotspot: all new data piles up on a single RegionServer, and during retrieval the load concentrates on individual RegionServers, reducing query efficiency.
3) Rowkey uniqueness principle
       The design must guarantee that each rowkey is unique.
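The hashing principle above can be sketched in plain Java, with no HBase dependency. The key layout `<salt>_<userId>_<timestamp>` and the bucket count are illustrative assumptions, not from the original article; the point is only that a salt derived from the business key occupies the high-order position so that monotonically increasing timestamps do not hot-spot one RegionServer.

```java
public class SaltedRowKey {
    // Number of salt buckets; illustrative only. In practice this is often
    // chosen to match the number of pre-split regions.
    static final int BUCKETS = 16;

    // Build a rowkey of the form "<salt>_<userId>_<timestamp>".
    // The leading salt field spreads writes across regions; the timestamp
    // sits in the low-order part, per the hashing principle above.
    static String rowKey(String userId, long timestamp) {
        int salt = Math.abs(userId.hashCode()) % BUCKETS;
        return String.format("%02d_%s_%d", salt, userId, timestamp);
    }

    public static void main(String[] args) {
        long now = 1700000000000L;
        System.out.println(rowKey("user42", now));
        System.out.println(rowKey("user43", now));
    }
}
```

Note that the same user always maps to the same salt bucket, so all of one user's rows remain contiguous and scannable, while different users spread across buckets.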

4. Describe the functions of scan and get in HBase, and the similarities and differences in their implementation.

HBase's query implementation provides only two access paths:
1) Fetch a single record by an exact rowkey, using the get method
(org.apache.hadoop.hbase.client.Get).

       Get requests are processed in two ways: with ClosestRowBefore set, and without it, using a row lock. This is mainly used to guarantee row-level transactionality, i.e. each get is keyed to a single row. A row can contain many families and columns.
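As a hedged illustration (not from the original article), a single-row lookup with the standard client API might look like the sketch below. It assumes a running cluster, the hbase-client library on the classpath, and a table named "t1" with column family "cf" and qualifier "q" (all illustrative names).

```java
// Sketch only: requires a running HBase cluster and the hbase-client library.
// Table "t1", family "cf", and qualifier "q" are assumed for illustration.
Configuration conf = HBaseConfiguration.create();
try (Connection conn = ConnectionFactory.createConnection(conf);
     Table table = conn.getTable(TableName.valueOf("t1"))) {
    Get get = new Get(Bytes.toBytes("rowkey-001"));          // one row per Get
    get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"));  // optionally narrow to a column
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"));
}
```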
2) Fetch a batch of records matching specified conditions, using the scan method
(org.apache.hadoop.hbase.client.Scan).

Conditional queries are implemented with scans:
(1) A scan can be sped up with setCaching and setBatch (trading space for time).
(2) A scan can be limited to a range with setStartRow and setStopRow ([start, end): start is inclusive, stop is exclusive). The smaller the range, the better the performance.
(3) Filters can be added to a scan with setFilter, which is the basis for pagination and multi-condition queries.
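The three scan tuning points above can be sketched together as follows. This is a hedged example, not from the original article: it assumes a running cluster, the hbase-client library, and an already-opened Table handle named `table`; the row range and PageFilter size are illustrative.

```java
// Sketch only: requires a running HBase cluster and the hbase-client library.
// Demonstrates the three points above: caching/batching, a [start, stop)
// row range, and a filter (here PageFilter, a simple basis for pagination).
Scan scan = new Scan();
scan.setCaching(500);                          // rows fetched per RPC (space for time)
scan.setBatch(10);                             // max columns returned per Result
scan.setStartRow(Bytes.toBytes("row-000"));    // inclusive
scan.setStopRow(Bytes.toBytes("row-100"));     // exclusive
scan.setFilter(new PageFilter(20));            // return at most 20 rows
try (ResultScanner scanner = table.getScanner(scan)) {
    for (Result r : scanner) {
        // process each Result
    }
}
```

In contrast to get, which locks and returns a single row, a scan streams all rows in the range through the ResultScanner, which is why narrowing the range and adding filters matters for performance.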


Origin blog.csdn.net/silentwolfyh/article/details/103864901